SlideShare una empresa de Scribd logo
1 de 151
Lars Juhl Jensen
Text mining and data integration
exponential growth
~45 seconds per paper
information retrieval
named entity recognition
information extraction
association networks
data integration
information retrieval
find the relevant papers
ad hoc retrieval
user-specified query
“yeast AND cell cycle”
PubMed
indexing
fast lookup
stemming
word endings
dynamic query expansion
MeSH terms
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1
and this modification served as a priming
step to promote subsequent Cdc5-
dependent Swe1 hyperphosphorylation
and degradation
no tool will find that
named entity recognition
computer
as smart as a dog
teach it specific tricks
identify the concepts
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1
and this modification served as a priming
step to promote subsequent Cdc5-
dependent Swe1 hyperphosphorylation
and degradation
comprehensive lexicon
CDC2
cyclin dependent kinase 1
orthographic variation
upper- and lower-case
CDC2
Cdc2
spaces and hyphens
cyclin dependent kinase 1
cyclin-dependent kinase 1
prefixes and postfixes
CDC2
hCDC2
“black list”
SDS
scalable implementation
>10 km
<10 hours
augmented browsing
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1
and this modification served as a priming
step to promote subsequent Cdc5-
dependent Swe1 hyperphosphorylation
and degradation
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1
and this modification served as a priming
step to promote subsequent Cdc5-
dependent Swe1 hyperphosphorylation
and degradation
Reflect
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
O’Donoghue et al., Journal of Web Semantics, 2010
information extraction
formalize the facts
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1
homolog) directly phosphorylated Swe1
and this modification served as a priming
step to promote subsequent Cdc5-
dependent Swe1 hyperphosphorylation
and degradation
two approaches
co-mentioning
counting
within documents
within paragraphs
within sentences
co-mentioning score
NLP
Natural Language Processing
grammatical analysis
part-of-speech tagging
multiword detection
semantic tagging
sentence parsing
Gene and protein names
Cue words for entity
recognition
Verbs for relation extraction
[nxexpr The expression of
[nxgene the cytochrome
genes
[nxpg CYC1 and CYC7]]]
is controlled by
[nxpg HAP1]
extract stated facts
high precision
poor recall
text corpus
most use abstracts
few use full-text articles
no access
PDF files
layout-aware extraction
my corpus
~22 million abstracts
~4 million articles
association networks
guilt by association
STRING
Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011
computational predictions
gene fusion
Korbel et al., Nature Biotechnology, 2004
gene neighborhood
Korbel et al., Nature Biotechnology, 2004
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
a real example
Cell
Cellulosomes
Cellulose
experimental data
gene coexpression
physical interactions
Jensen & Bork, Science, 2008
curated knowledge
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
many databases
different formats
different identifiers
variable quality
not comparable
hard work
quality scores
von Mering et al., Nucleic Acids Research, 2005
calibrate vs. gold standard
von Mering et al., Nucleic Acids Research, 2005
data integration
general approach
suite of web resources
STITCH
STRING + 300k chemicals
Kuhn et al., Nucleic Acids Research, 2012
COMPARTMENTS
subcellular localization
compartments.jensenlab.org
TISSUES
tissue expression
tissues.jensenlab.org
DISEASES
disease genes
unification
curated knowledge
text mining
experimental data
computational predictions
common identifiers
quality scores
visualization
dissemination
web interfaces
evidence viewers
web services
diseases.jensenlab.org
bulk download
thank you!

Más contenido relacionado

La actualidad más candente

The pragmatic text miner: From literature to electronic health records
The pragmatic text miner: From literature to electronic health recordsThe pragmatic text miner: From literature to electronic health records
The pragmatic text miner: From literature to electronic health records
Lars Juhl Jensen
 

La actualidad más candente (9)

Theory and practice of graphical population analysis
Theory and practice of graphical population analysisTheory and practice of graphical population analysis
Theory and practice of graphical population analysis
 
Explaining the assembly model
Explaining the assembly modelExplaining the assembly model
Explaining the assembly model
 
Ashg2015 schneider final
Ashg2015 schneider finalAshg2015 schneider final
Ashg2015 schneider final
 
The pragmatic text miner: From literature to electronic health records
The pragmatic text miner: From literature to electronic health recordsThe pragmatic text miner: From literature to electronic health records
The pragmatic text miner: From literature to electronic health records
 
171017 giab for giab grc workshop
171017 giab for giab grc workshop171017 giab for giab grc workshop
171017 giab for giab grc workshop
 
agbt 2016 workshop church
agbt 2016 workshop churchagbt 2016 workshop church
agbt 2016 workshop church
 
TAGC2016 schneider
TAGC2016 schneiderTAGC2016 schneider
TAGC2016 schneider
 
Ashg2014 grc workshop_schneider
Ashg2014 grc workshop_schneiderAshg2014 grc workshop_schneider
Ashg2014 grc workshop_schneider
 
Previewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRCPreviewing GRCm39: Assembly Updates from the GRC
Previewing GRCm39: Assembly Updates from the GRC
 

Destacado

Data mining query languages
Data mining query languagesData mining query languages
Data mining query languages
Marcy Morales
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
ankur bhalla
 

Destacado (12)

The Integration Issue
The Integration IssueThe Integration Issue
The Integration Issue
 
Data mining query languages
Data mining query languagesData mining query languages
Data mining query languages
 
IS Issue: IT Integration during M&A
IS Issue: IT Integration during M&AIS Issue: IT Integration during M&A
IS Issue: IT Integration during M&A
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretization
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reduction
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Types of Data Processing
Types of Data ProcessingTypes of Data Processing
Types of Data Processing
 
Data Processing-Presentation
Data Processing-PresentationData Processing-Presentation
Data Processing-Presentation
 
DataPreProcessing
DataPreProcessing DataPreProcessing
DataPreProcessing
 
Slideshare ppt
Slideshare pptSlideshare ppt
Slideshare ppt
 

Similar a Text mining and data integration

Mining literature and medical records
Mining literature and medical recordsMining literature and medical records
Mining literature and medical records
Lars Juhl Jensen
 
Systems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems levelSystems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems level
Lars Juhl Jensen
 
Systems biology - Bioinformatics on complete biological systems
Systems biology - Bioinformatics on complete biological systemsSystems biology - Bioinformatics on complete biological systems
Systems biology - Bioinformatics on complete biological systems
Lars Juhl Jensen
 

Similar a Text mining and data integration (20)

Mining literature and medical records
Mining literature and medical recordsMining literature and medical records
Mining literature and medical records
 
Biomedical text mining
Biomedical text miningBiomedical text mining
Biomedical text mining
 
Text mining
Text miningText mining
Text mining
 
Text mining
Text miningText mining
Text mining
 
Integration of biomedical literature and databases
Integration of biomedical literature and databasesIntegration of biomedical literature and databases
Integration of biomedical literature and databases
 
Literature mining and large-scale data integration
Literature mining and large-scale data integrationLiterature mining and large-scale data integration
Literature mining and large-scale data integration
 
Integration of biomedical literature and databases
Integration of biomedical literature and databasesIntegration of biomedical literature and databases
Integration of biomedical literature and databases
 
Literature Mining and Systems Biology
Literature Mining and Systems BiologyLiterature Mining and Systems Biology
Literature Mining and Systems Biology
 
Biomedical literature mining (and why we really need open access)
Biomedical literature mining (and why we really need open access)Biomedical literature mining (and why we really need open access)
Biomedical literature mining (and why we really need open access)
 
Biological literature mining - from information retrieval to biological disco...
Biological literature mining - from information retrieval to biological disco...Biological literature mining - from information retrieval to biological disco...
Biological literature mining - from information retrieval to biological disco...
 
Biomedical literature mining
Biomedical literature miningBiomedical literature mining
Biomedical literature mining
 
STRING: Large-scale data and text mining
STRING: Large-scale data and text miningSTRING: Large-scale data and text mining
STRING: Large-scale data and text mining
 
Text mining for protein and small molecule relations
Text mining for protein and small molecule relationsText mining for protein and small molecule relations
Text mining for protein and small molecule relations
 
Network biology: Large-scale data and text mining
Network biology: Large-scale data and text miningNetwork biology: Large-scale data and text mining
Network biology: Large-scale data and text mining
 
Systems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems levelSystems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems level
 
The STRING database and related tools
The STRING database and related toolsThe STRING database and related tools
The STRING database and related tools
 
Large-scale integration of data and text
Large-scale integration of data and textLarge-scale integration of data and text
Large-scale integration of data and text
 
Systems biology - Bioinformatics on complete biological systems
Systems biology - Bioinformatics on complete biological systemsSystems biology - Bioinformatics on complete biological systems
Systems biology - Bioinformatics on complete biological systems
 
Information integration
Information integrationInformation integration
Information integration
 
STRING: protein association networks
STRING: protein association networksSTRING: protein association networks
STRING: protein association networks
 

Más de Lars Juhl Jensen

Más de Lars Juhl Jensen (20)

One tagger, many uses: Illustrating the power of dictionary-based named entit...
One tagger, many uses: Illustrating the power of dictionary-based named entit...One tagger, many uses: Illustrating the power of dictionary-based named entit...
One tagger, many uses: Illustrating the power of dictionary-based named entit...
 
One tagger, many uses: Simple text-mining strategies for biomedicine
One tagger, many uses: Simple text-mining strategies for biomedicineOne tagger, many uses: Simple text-mining strategies for biomedicine
One tagger, many uses: Simple text-mining strategies for biomedicine
 
Extract 2.0: Text-mining-assisted interactive annotation
Extract 2.0: Text-mining-assisted interactive annotationExtract 2.0: Text-mining-assisted interactive annotation
Extract 2.0: Text-mining-assisted interactive annotation
 
Network visualization: A crash course on using Cytoscape
Network visualization: A crash course on using CytoscapeNetwork visualization: A crash course on using Cytoscape
Network visualization: A crash course on using Cytoscape
 
STRING & STITCH : Network integration of heterogeneous data
STRING & STITCH: Network integration of heterogeneous dataSTRING & STITCH: Network integration of heterogeneous data
STRING & STITCH : Network integration of heterogeneous data
 
Biomedical text mining: Automatic processing of unstructured text
Biomedical text mining: Automatic processing of unstructured textBiomedical text mining: Automatic processing of unstructured text
Biomedical text mining: Automatic processing of unstructured text
 
Medical network analysis: Linking diseases and genes through data and text mi...
Medical network analysis: Linking diseases and genes through data and text mi...Medical network analysis: Linking diseases and genes through data and text mi...
Medical network analysis: Linking diseases and genes through data and text mi...
 
Network Biology: A crash course on STRING and Cytoscape
Network Biology: A crash course on STRING and CytoscapeNetwork Biology: A crash course on STRING and Cytoscape
Network Biology: A crash course on STRING and Cytoscape
 
Cellular networks
Cellular networksCellular networks
Cellular networks
 
Cellular Network Biology: Large-scale integration of data and text
Cellular Network Biology: Large-scale integration of data and textCellular Network Biology: Large-scale integration of data and text
Cellular Network Biology: Large-scale integration of data and text
 
Statistics on big biomedical data: Methods and pitfalls when analyzing high-t...
Statistics on big biomedical data: Methods and pitfalls when analyzing high-t...Statistics on big biomedical data: Methods and pitfalls when analyzing high-t...
Statistics on big biomedical data: Methods and pitfalls when analyzing high-t...
 
STRING & related databases: Large-scale integration of heterogeneous data
STRING & related databases: Large-scale integration of heterogeneous dataSTRING & related databases: Large-scale integration of heterogeneous data
STRING & related databases: Large-scale integration of heterogeneous data
 
Tagger: Rapid dictionary-based named entity recognition
Tagger: Rapid dictionary-based named entity recognitionTagger: Rapid dictionary-based named entity recognition
Tagger: Rapid dictionary-based named entity recognition
 
Network Biology: Large-scale integration of data and text
Network Biology: Large-scale integration of data and textNetwork Biology: Large-scale integration of data and text
Network Biology: Large-scale integration of data and text
 
Medical text mining: Linking diseases, drugs, and adverse reactions
Medical text mining: Linking diseases, drugs, and adverse reactionsMedical text mining: Linking diseases, drugs, and adverse reactions
Medical text mining: Linking diseases, drugs, and adverse reactions
 
Network biology: Large-scale integration of data and text
Network biology: Large-scale integration of data and textNetwork biology: Large-scale integration of data and text
Network biology: Large-scale integration of data and text
 
Medical data and text mining: Linking diseases, drugs, and adverse reactions
Medical data and text mining: Linking diseases, drugs, and adverse reactionsMedical data and text mining: Linking diseases, drugs, and adverse reactions
Medical data and text mining: Linking diseases, drugs, and adverse reactions
 
Cellular Network Biology
Cellular Network BiologyCellular Network Biology
Cellular Network Biology
 
Network biology: Large-scale integration of data and text
Network biology: Large-scale integration of data and textNetwork biology: Large-scale integration of data and text
Network biology: Large-scale integration of data and text
 
Biomarker bioinformatics: Network-based candidate prioritization
Biomarker bioinformatics: Network-based candidate prioritizationBiomarker bioinformatics: Network-based candidate prioritization
Biomarker bioinformatics: Network-based candidate prioritization
 

Último

1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 

Último (20)

Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 

Text mining and data integration