Roeder rocky 2011_46
1. A Distributed Framework for
Computation on the Results of
Large Scale NLP
Christophe Roeder, William A. Baumgartner Jr., Kevin Livingston,
Lawrence E. Hunter
(University of Colorado Anschutz Medical Campus)
Chris.Roeder@ucdenver.edu
http://compbio.ucdenver.edu
2. Motivation
• A vast amount of information is available
in journal articles
• Journal articles are unstructured text
• Many applications require structured
knowledge
– Curated ontologies (Gene Ontology)
– Databases (UniProt, EntrezGene)
• Challenge: extract structured knowledge
from unstructured text and integrate with
existing knowledge…at massive scale
5. Summary
• NLP pipelines extract structured annotations
• Our framework provides massively parallel access
to these structured document annotations
• Structured representation is integrated with
knowledge base
• Affords parallelization when possible, and access
to knowledge base when necessary
• Provides integration of unstructured document text
with structured knowledge for enabling
applications such as:
– Visualization (BioJigsaw, Hanalyzer,…)
– Natural Language Understanding (OpenDMAP)
– Leveraging text data for validation and evaluation of
other methods
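One computation that parallelizes naturally over per-document annotations is counting how often pairs of normalized entity identifiers co-occur in the same document. The block below is a minimal sketch of that idea, not the framework's actual API: the input layout, the function names (`pairs_in_document`, `cooccurrence_counts`), and the entity IDs are all hypothetical, and a local thread pool stands in for a real cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Hypothetical input: each document maps to the normalized entity IDs
# that an upstream NLP pipeline found in it (these IDs are made up).
DOCS = {
    "doc1": ["P1", "P2", "P3"],
    "doc2": ["P1", "P2"],
    "doc3": ["P2", "P3", "P2"],
}


def pairs_in_document(entity_ids):
    """Map step: count each unordered pair of distinct IDs once per document."""
    unique = sorted(set(entity_ids))
    return Counter(combinations(unique, 2))


def cooccurrence_counts(docs, workers=4):
    """Reduce step: sum the per-document pair counts into corpus-wide totals."""
    totals = Counter()
    # Documents are independent, so the map step can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(pairs_in_document, docs.values()):
            totals.update(partial)
    return totals


counts = cooccurrence_counts(DOCS)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

The per-document step needs no shared state, which is what makes it easy to distribute; only the final summation, and any lookup of what an ID means, would need a shared resource such as the knowledge base.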
6. Thank You / Questions
• http://tinyurl.com/bio-trends
• Co-authors
– William A. Baumgartner Jr. for data generation
– Kevin Livingston for RDF and Clojure help
• Grants and PIs
– Lawrence E Hunter, UCDenver SOM
• NIH 2R01LM009254-04, NIH 2R01LM008111-04A1,
NIH 5R01GM083649-02
– Karin Verspoor, UCDenver SOM
• NIH R01 LM010120-01
– Gully Burns, ISI
• NSF 0849977
Editor's notes
Plug Kabob. Plug Open Access. Mention Elsevier collections, size.
Mention UIMA. Distinguish NER from normalization, and how that ID ties it into the KB. Putting high-precision entity recognition to work at large scale. Induction, abduction. Get around noise issues by using a LOT of data. Precision and recall require scale. Might learn something, if said often enough. Correlations between proteins, co-occurrence, PPI. Co-occurrence with other ontology terms or other extracted terms or biological processes.
No excuses, don’t trivialize, but emphasize its value as a demo. Built in about a week; computation over PMC OA in 2 hours on a very modest cluster (40 cores) (inefficiencies exist as well). Lot of data, runs quickly. Demonstrates that the framework can be used quickly and works. Same technology can be used
On that last point, think of correlations and stuff. Who knows what we’ll think of with the possibilities this opens up.