SlideShare una empresa de Scribd logo
1 de 1
Descargar para leer sin conexión
Scaling Text Mining to One Million Documents
                                                                         Christophe Roeder, Karin Verspoor	
  
                                                                                 Chris.Roeder@ucdenver.edu	
  




           Applying text mining to a large document collection demands more resources than the lab PC can provide. Preparing for
           such a task requires an understanding of the demands of the text mining software and capabilities of supporting hardware
           and software. We describe efforts to scale a large text mining task.




    Resource Requirements of Selected                                                                 Scaling Framework Options
               Analytics                                                                   UIMA CPE
                                                                                           •  Basic UIMA pipeline engine
   Name                        Time                         Memory*                        •  Can run many threads
                                                                                           •  Limted to one machine
   XSLT Converter              0.03 sec./doc.               < 256MB

   XML Parser                  0.02 sec./doc.               < 256 MB
                                                                                           UIMA AS (Asynchronous Scaleout)
                                                                                           •  Uses message queues to link analytics on different machines
                                                                                           •  Message queues allow for flexibility regarding time of
   Sentence Detector           0.01 sec./doc.               < 256 MB
                                                                                             message delivery
                                                                                           •  Useful for putting many instances of a “heavy” analytic on a
   POS Tagger                  2.6 sec./doc.                < 256 MB
                                                                                             separate machine
                                                                                           •  Can be used to run many pipelines on many machines
   Parser                      1500 sec./doc.               > 1GB
                                                                                           •  XMI Serialization as overhead is not trivial
   XMI Serialization           2.5 sec./doc. **             < 256 MB
                                                                                           GridEngine
   Concept Mapper              ***                          > 2GB                          •  Cluster management software makes it easy to copy to many
                                                                                             machines at once
    Factors of ten in memory requirements and factors of 5                                 •  Scripts can be started on many machines with one command
    orders of magnitude in run times suggest a good pipeline
    description is vital for specifying hardware.                                          Hadoop
                                                                                           •  Map-reduce implementation
                                                                                           •  Map distributes, reduce collates
   * Memory usage includes UIMA and other analytics, 64 bit JVM
   ** annotations from sentence detection, tokenization, and POS tagging, time
                                                                                           •  Related tools very interesting: hdfs (hadoop file system)
      includes file i/o                                                                    •  Behemoth: UIMA and GATE adapted to Hadoop
   *** Data not available, memory use from loading Swiss-Prot




                                                    The Devil is in the Details
  Corpus Management:                                                                                  Analytics / Analysis Engines:
  •  Arrange access from publishers                                                                   •  Identify, Integrate into UIMA
  •  Download files                                                                                   •  Check for possible concurrency issues
  •  Parse XML of various DTDs to plain text                                                          •  Test for bugs, memory leaks
  •  Parse PDF if XML not available                                                                   •  Detailed error reporting
  •  Find or maintain section zoning information                                                      •  Find memory and cpu requirements
  •  Track source and citation information                                                            •  Track source, build and modifiation information
  •  Keep up to date with periodic updates



  Pipeline:
  •  Error reporting                                                                                       Output:
  •  Identify and restart after memory leaks                                                               •  Store all annotations
  •  Identify parameters passed to analytics                                                               •  RDB or serialized CAS
  •  Progress tracking, restart from last processed                                                        •  Track provenance
  •  Identify individual document errors and continue
    processing others.



 Scaling:                                                                                            Integration:
                                                                                                     •  Store semantic information in knowledge
 •  UIMA CPE Threads: simple, effective limited to one                                                  base for further processing.
   machine                                                                                           •  Web application to manage and initiate job
 •  UIMA AS: put “heavy” engines on other machines
                                                                                                        runs
 •  Grid Engine: move files, run scripts across a cluster
                                                                                                     •  Allow for change in one analytic and re-run
 •  Hadoop (map/reduce): elegant Java interface
                                                                                                        partial pipeline

Acknowledgements: NIH grant R01-LM010120-01to Karin Verspoor and the SciKnowMine project funded by NSF grant #0849977 and supported by U24 RR025736-01, NIGMS: RO1-
GM083871, NLM: 2R01LM009254, NLM:2R01LM008111, NLM:1R01LM010120-01, NHGRI:5P41HG000330

Más contenido relacionado

La actualidad más candente

HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHortonworks
 
Luxun a Persistent Messaging System Tailored for Big Data Collecting & Analytics
Luxun a Persistent Messaging System Tailored for Big Data Collecting & AnalyticsLuxun a Persistent Messaging System Tailored for Big Data Collecting & Analytics
Luxun a Persistent Messaging System Tailored for Big Data Collecting & AnalyticsWilliam Yang
 
Gluster Blog 11.15.2010
Gluster Blog 11.15.2010Gluster Blog 11.15.2010
Gluster Blog 11.15.2010GlusterFS
 
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters Ceph Community
 
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)Lars Marowsky-Brée
 
Caching principles-solutions
Caching principles-solutionsCaching principles-solutions
Caching principles-solutionspmanvi
 
PostgreSQL Scaling And Failover
PostgreSQL Scaling And FailoverPostgreSQL Scaling And Failover
PostgreSQL Scaling And FailoverJohn Paulett
 
Virtualization Primer for Java Developers
Virtualization Primer for Java DevelopersVirtualization Primer for Java Developers
Virtualization Primer for Java DevelopersRichard McDougall
 
Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack eurobsdcon
 
Introduction to multi core
Introduction to multi coreIntroduction to multi core
Introduction to multi coremukul bhardwaj
 
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. TanenbaumA Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaumeurobsdcon
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfsNAVER D2
 
Building SuperComputers @ Home
Building SuperComputers @ HomeBuilding SuperComputers @ Home
Building SuperComputers @ HomeAbhishek Parolkar
 
Future of cloud storage
Future of cloud storageFuture of cloud storage
Future of cloud storageGlusterFS
 
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsBest Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsJignesh Shah
 
Building high traffic http front-ends. theo schlossnagle. зал 1
Building high traffic http front-ends. theo schlossnagle. зал 1Building high traffic http front-ends. theo schlossnagle. зал 1
Building high traffic http front-ends. theo schlossnagle. зал 1rit2011
 
Multi core-architecture
Multi core-architectureMulti core-architecture
Multi core-architecturePiyush Mittal
 
Tuning DB2 in a Solaris Environment
Tuning DB2 in a Solaris EnvironmentTuning DB2 in a Solaris Environment
Tuning DB2 in a Solaris EnvironmentJignesh Shah
 

La actualidad más candente (20)

HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
 
Luxun a Persistent Messaging System Tailored for Big Data Collecting & Analytics
Luxun a Persistent Messaging System Tailored for Big Data Collecting & AnalyticsLuxun a Persistent Messaging System Tailored for Big Data Collecting & Analytics
Luxun a Persistent Messaging System Tailored for Big Data Collecting & Analytics
 
Gluster Blog 11.15.2010
Gluster Blog 11.15.2010Gluster Blog 11.15.2010
Gluster Blog 11.15.2010
 
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
Ceph Day Amsterdam 2015: Measuring and predicting performance of Ceph clusters
 
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015)
 
Caching principles-solutions
Caching principles-solutionsCaching principles-solutions
Caching principles-solutions
 
PostgreSQL Scaling And Failover
PostgreSQL Scaling And FailoverPostgreSQL Scaling And Failover
PostgreSQL Scaling And Failover
 
Virtualization Primer for Java Developers
Virtualization Primer for Java DevelopersVirtualization Primer for Java Developers
Virtualization Primer for Java Developers
 
Gluster 3.3 deep dive
Gluster 3.3 deep diveGluster 3.3 deep dive
Gluster 3.3 deep dive
 
Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack Running Applications on the NetBSD Rump Kernel by Justin Cormack
Running Applications on the NetBSD Rump Kernel by Justin Cormack
 
Introduction to multi core
Introduction to multi coreIntroduction to multi core
Introduction to multi core
 
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. TanenbaumA Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfs
 
Building SuperComputers @ Home
Building SuperComputers @ HomeBuilding SuperComputers @ Home
Building SuperComputers @ Home
 
Future of cloud storage
Future of cloud storageFuture of cloud storage
Future of cloud storage
 
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsBest Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
 
Methods of NoSQL database systems benchmarking
Methods of NoSQL database systems benchmarkingMethods of NoSQL database systems benchmarking
Methods of NoSQL database systems benchmarking
 
Building high traffic http front-ends. theo schlossnagle. зал 1
Building high traffic http front-ends. theo schlossnagle. зал 1Building high traffic http front-ends. theo schlossnagle. зал 1
Building high traffic http front-ends. theo schlossnagle. зал 1
 
Multi core-architecture
Multi core-architectureMulti core-architecture
Multi core-architecture
 
Tuning DB2 in a Solaris Environment
Tuning DB2 in a Solaris EnvironmentTuning DB2 in a Solaris Environment
Tuning DB2 in a Solaris Environment
 

Destacado

Roeder rocky 2011_46
Roeder rocky 2011_46Roeder rocky 2011_46
Roeder rocky 2011_46Chris Roeder
 
JavaScript Frameworks and Java EE – A Great Match
JavaScript Frameworks and Java EE – A Great MatchJavaScript Frameworks and Java EE – A Great Match
JavaScript Frameworks and Java EE – A Great MatchReza Rahman
 
7 Stages of Scaling Web Applications
7 Stages of Scaling Web Applications7 Stages of Scaling Web Applications
7 Stages of Scaling Web ApplicationsDavid Mitzenmacher
 
Functional Reactive Programming with RxJS
Functional Reactive Programming with RxJSFunctional Reactive Programming with RxJS
Functional Reactive Programming with RxJSstefanmayer13
 
Architecture of a Modern Web App
Architecture of a Modern Web AppArchitecture of a Modern Web App
Architecture of a Modern Web Appscothis
 

Destacado (6)

Roeder rocky 2011_46
Roeder rocky 2011_46Roeder rocky 2011_46
Roeder rocky 2011_46
 
Hibernate
HibernateHibernate
Hibernate
 
JavaScript Frameworks and Java EE – A Great Match
JavaScript Frameworks and Java EE – A Great MatchJavaScript Frameworks and Java EE – A Great Match
JavaScript Frameworks and Java EE – A Great Match
 
7 Stages of Scaling Web Applications
7 Stages of Scaling Web Applications7 Stages of Scaling Web Applications
7 Stages of Scaling Web Applications
 
Functional Reactive Programming with RxJS
Functional Reactive Programming with RxJSFunctional Reactive Programming with RxJS
Functional Reactive Programming with RxJS
 
Architecture of a Modern Web App
Architecture of a Modern Web AppArchitecture of a Modern Web App
Architecture of a Modern Web App
 

Similar a Roeder posterismb2010

Caching technology comparison
Caching technology comparisonCaching technology comparison
Caching technology comparisonRohit Kelapure
 
Hadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in CloudHadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in CloudCloudera, Inc.
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructureelliando dias
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservicesBigstep
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions Alfresco Software
 
Private cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicomPrivate cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicomMicrosoft Singapore
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyPeter Clapham
 
network ram parallel computing
network ram parallel computingnetwork ram parallel computing
network ram parallel computingNiranjana Ambadi
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache KuduAndriy Zabavskyy
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.pptvijayapraba1
 
Hadoop: Components and Key Ideas, -part1
Hadoop: Components and Key Ideas, -part1Hadoop: Components and Key Ideas, -part1
Hadoop: Components and Key Ideas, -part1Sandeep Kunkunuru
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryCloudera, Inc.
 
Gluster open stack dev summit 042011
Gluster open stack dev summit 042011Gluster open stack dev summit 042011
Gluster open stack dev summit 042011Open Stack
 
Red Hat Storage - Introduction to GlusterFS
Red Hat Storage - Introduction to GlusterFSRed Hat Storage - Introduction to GlusterFS
Red Hat Storage - Introduction to GlusterFSGlusterFS
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephantsOvidiu Dimulescu
 

Similar a Roeder posterismb2010 (20)

Caching technology comparison
Caching technology comparisonCaching technology comparison
Caching technology comparison
 
Hadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in CloudHadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in Cloud
 
Petabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructurePetabyte scale on commodity infrastructure
Petabyte scale on commodity infrastructure
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservices
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions
 
Private cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicomPrivate cloud virtual reality to reality a partner story daniel mar_technicom
Private cloud virtual reality to reality a partner story daniel mar_technicom
 
Hadoop on VMware
Hadoop on VMwareHadoop on VMware
Hadoop on VMware
 
CPU Caches
CPU CachesCPU Caches
CPU Caches
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
network ram parallel computing
network ram parallel computingnetwork ram parallel computing
network ram parallel computing
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
Hadoop: Components and Key Ideas, -part1
Hadoop: Components and Key Ideas, -part1Hadoop: Components and Key Ideas, -part1
Hadoop: Components and Key Ideas, -part1
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
CLFS 2010
CLFS 2010CLFS 2010
CLFS 2010
 
Gluster open stack dev summit 042011
Gluster open stack dev summit 042011Gluster open stack dev summit 042011
Gluster open stack dev summit 042011
 
Red Hat Storage - Introduction to GlusterFS
Red Hat Storage - Introduction to GlusterFSRed Hat Storage - Introduction to GlusterFS
Red Hat Storage - Introduction to GlusterFS
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
 
Hadoop, Taming Elephants
Hadoop, Taming ElephantsHadoop, Taming Elephants
Hadoop, Taming Elephants
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 

Último

Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum ComputingGDSC PJATK
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?SANGHEE SHIN
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.francesco barbera
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataCloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataSafe Software
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 

Último (20)

Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum Computing
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataCloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 

Roeder posterismb2010

  • 1. Scaling Text Mining to One Million Documents Christophe Roeder, Karin Verspoor   Chris.Roeder@ucdenver.edu   Applying text mining to a large document collection demands more resources than the lab PC can provide. Preparing for such a task requires an understanding of the demands of the text mining software and capabilities of supporting hardware and software. We describe efforts to scale a large text mining task. Resource Requirements of Selected Scaling Framework Options Analytics UIMA CPE •  Basic UIMA pipeline engine Name Time Memory* •  Can run many threads •  Limted to one machine XSLT Converter 0.03 sec./doc. < 256MB XML Parser 0.02 sec./doc. < 256 MB UIMA AS (Asynchronous Scaleout) •  Uses message queues to link analytics on different machines •  Message queues allow for flexibility regarding time of Sentence Detector 0.01 sec./doc. < 256 MB message delivery •  Useful for putting many instances of a “heavy” analytic on a POS Tagger 2.6 sec./doc. < 256 MB separate machine •  Can be used to run many pipelines on many machines Parser 1500 sec./doc. > 1GB •  XMI Serialization as overhead is not trivial XMI Serialization 2.5 sec./doc. ** < 256 MB GridEngine Concept Mapper *** > 2GB •  Cluster management software makes it easy to copy to many machines at once Factors of ten in memory requirements and factors of 5 •  Scripts can be started on many machines with one command orders of magnitude in run times suggest a good pipeline description is vital for specifying hardware. Hadoop •  Map-reduce implementation •  Map distributes, reduce collates * Memory usage includes UIMA and other analytics, 64 bit JVM ** annotations from sentence detection, tokenization, and POS tagging, time •  Related tools very interesting: hdfs (hadoop file system) includes file i/o •  Behemoth: UIMA and GATE adapted to Hadoop *** Data not available, memory use from loading Swiss-Prot The Devil is in the Details Corpus Management: Analytics / Analysis Engines: •  Arrange access from publishers •  Identify, Integrate into UIMA •  Download files •  Check for possible concurrency issues •  Parse XML of various DTDs to plain text •  Test for bugs, memory leaks •  Parse PDF if XML not available •  Detailed error reporting •  Find or maintain section zoning information •  Find memory and cpu requirements •  Track source and citation information •  Track source, build and modifiation information •  Keep up to date with periodic updates Pipeline: •  Error reporting Output: •  Identify and restart after memory leaks •  Store all annotations •  Identify parameters passed to analytics •  RDB or serialized CAS •  Progress tracking, restart from last processed •  Track provenance •  Identify individual document errors and continue processing others. Scaling: Integration: •  Store semantic information in knowledge •  UIMA CPE Threads: simple, effective limited to one base for further processing. machine •  Web application to manage and initiate job •  UIMA AS: put “heavy” engines on other machines runs •  Grid Engine: move files, run scripts across a cluster •  Allow for change in one analytic and re-run •  Hadoop (map/reduce): elegant Java interface partial pipeline Acknowledgements: NIH grant R01-LM010120-01to Karin Verspoor and the SciKnowMine project funded by NSF grant #0849977 and supported by U24 RR025736-01, NIGMS: RO1- GM083871, NLM: 2R01LM009254, NLM:2R01LM008111, NLM:1R01LM010120-01, NHGRI:5P41HG000330