Poster, ISMB 2010
Scaling Text Mining to One Million Documents
Christophe Roeder, Karin Verspoor
Chris.Roeder@ucdenver.edu
Applying text mining to a large document collection demands more resources than a lab PC can provide. Preparing for
such a task requires an understanding of the demands of the text mining software and of the capabilities of the supporting
hardware and software. We describe our efforts to scale a large text mining task.
Resource Requirements of Selected Analytics

Name                Time              Memory*
XSLT Converter      0.03 sec./doc.    < 256 MB
XML Parser          0.02 sec./doc.    < 256 MB
Sentence Detector   0.01 sec./doc.    < 256 MB
POS Tagger          2.6 sec./doc.     < 256 MB
Parser              1500 sec./doc.    > 1 GB
XMI Serialization   2.5 sec./doc.**   < 256 MB
Concept Mapper      ***               > 2 GB

Factors of ten in memory requirements and five orders of magnitude in run times suggest that a good
pipeline description is vital for specifying hardware.

* Memory usage includes UIMA and other analytics, 64-bit JVM
** Annotations from sentence detection, tokenization, and POS tagging; time includes file I/O
*** Data not available; memory use is from loading Swiss-Prot

Scaling Framework Options

UIMA CPE
• Basic UIMA pipeline engine
• Can run many threads
• Limited to one machine

UIMA AS (Asynchronous Scaleout)
• Uses message queues to link analytics on different machines
• Message queues allow flexibility regarding time of message delivery
• Useful for putting many instances of a “heavy” analytic on a separate machine
• Can be used to run many pipelines on many machines
• XMI serialization overhead is not trivial

GridEngine
• Cluster management software makes it easy to copy to many machines at once
• Scripts can be started on many machines with one command

Hadoop
• Map-reduce implementation
• Map distributes, reduce collates
• Related tools are very interesting: HDFS (Hadoop Distributed File System)
• Behemoth: UIMA and GATE adapted to Hadoop
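A rough back-of-the-envelope sketch of what the timings above imply at the scale of the title. The numbers come from the table; the 30-day wall-clock budget is an assumption chosen purely for illustration:

```java
public class CapacityEstimate {
    public static void main(String[] args) {
        double parserSecPerDoc = 1500.0;   // slowest analytic in the table
        long docs = 1_000_000L;            // corpus size from the title
        double totalSec = parserSecPerDoc * docs;

        double cpuYears = totalSec / (3600.0 * 24 * 365);
        double targetDays = 30.0;          // assumed wall-clock budget
        long workers = (long) Math.ceil(totalSec / (targetDays * 24 * 3600));

        System.out.printf("cpu-years=%.1f, workers for %.0f-day run=%d%n",
                cpuYears, targetDays, workers);
    }
}
```

With these inputs the parser alone costs roughly 47.6 CPU-years, i.e. on the order of 579 parallel workers to finish in a month, which is why the five-orders-of-magnitude spread in per-analytic run times matters so much for hardware planning.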
The Devil is in the Details
Corpus Management:
• Arrange access from publishers
• Download files
• Parse XML of various DTDs to plain text
• Parse PDF if XML is not available
• Find or maintain section zoning information
• Track source and citation information
• Keep up to date with periodic updates

Analytics / Analysis Engines:
• Identify, integrate into UIMA
• Check for possible concurrency issues
• Test for bugs and memory leaks
• Detailed error reporting
• Find memory and CPU requirements
• Track source, build, and modification information
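The concurrency check matters because a CPE runs many threads: an analytic that keeps mutable internal state cannot be shared between them. A minimal sketch of one way to isolate such state; the `Tagger` class and its buffer are hypothetical stand-ins, not part of any UIMA API:

```java
public class ConcurrencyCheck {
    // Hypothetical analytic with mutable internal state: unsafe to share across threads.
    static class Tagger {
        private final StringBuilder buffer = new StringBuilder();
        String tag(String token) {
            buffer.setLength(0);
            buffer.append(token).append("/NN");  // placeholder tagging logic
            return buffer.toString();
        }
    }

    // One Tagger instance per thread, so the mutable buffer is never shared.
    private static final ThreadLocal<Tagger> TAGGER = ThreadLocal.withInitial(Tagger::new);

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> System.out.println(TAGGER.get().tag("gene"));
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}
```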
Pipeline:
• Error reporting
• Identify and restart after memory leaks
• Identify parameters passed to analytics
• Progress tracking, restart from last processed
• Identify individual document errors and continue processing others

Output:
• Store all annotations
• RDB or serialized CAS
• Track provenance
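The progress-tracking and per-document error bullets above can be sketched as a driver loop. This is an illustration only, assuming a hypothetical `processDocument` call in place of a real analysis engine, and an in-memory `done` set where a real pipeline would persist progress to survive a restart:

```java
import java.util.*;

public class PipelineDriver {
    // Illustrative stand-in for a UIMA analysis call; fails on one document.
    static void processDocument(String docId) {
        if (docId.contains("bad")) throw new RuntimeException("parse failure");
    }

    public static void main(String[] args) {
        List<String> corpus = List.of("doc1", "doc2", "bad-doc", "doc3");
        Set<String> done = new HashSet<>();   // persisted in practice, so a restart skips them
        List<String> failed = new ArrayList<>();
        for (String id : corpus) {
            if (done.contains(id)) continue;  // restart from last processed
            try {
                processDocument(id);
                done.add(id);
            } catch (RuntimeException e) {
                failed.add(id);               // record the individual error, keep going
            }
        }
        System.out.println("done=" + done.size() + " failed=" + failed);
    }
}
```

One bad document is logged and skipped while the other three complete, rather than the whole run aborting.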
Scaling:
• UIMA CPE threads: simple and effective, but limited to one machine
• UIMA AS: put “heavy” engines on other machines
• Grid Engine: move files, run scripts across a cluster
• Hadoop (map/reduce): elegant Java interface

Integration:
• Store semantic information in a knowledge base for further processing
• Web application to manage and initiate job runs
• Allow for a change in one analytic and a re-run of a partial pipeline
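The “map distributes, reduce collates” idea can be shown without Hadoop itself. A minimal word-count sketch in plain Java streams, with invented example documents; a real Hadoop job would express the same two phases as Mapper and Reducer classes over HDFS splits:

```java
import java.util.*;
import java.util.stream.*;

public class MapReduceSketch {
    public static void main(String[] args) {
        List<String> docs = List.of("uima scales", "uima pipelines scale", "hadoop scales");
        // Map phase: each document independently emits its tokens, so the work distributes.
        // Reduce phase: tokens are grouped by key and their counts collated.
        Map<String, Long> counts = docs.stream()
                .flatMap(d -> Arrays.stream(d.split(" ")))                      // map
                .collect(Collectors.groupingBy(w -> w, Collectors.counting())); // reduce
        System.out.println(counts);
    }
}
```

Because the map phase touches each document in isolation, it parallelizes across machines; only the collating step needs to see values that share a key.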
Acknowledgements: NIH grant R01-LM010120-01 to Karin Verspoor, and the SciKnowMine project funded by NSF grant #0849977 and supported by U24 RR025736-01, NIGMS: RO1-GM083871, NLM: 2R01LM009254, NLM: 2R01LM008111, NLM: 1R01LM010120-01, NHGRI: 5P41HG000330.