Image Mining from Gel Diagrams in
Biomedical Publications
Tobias Kuhn and Michael Krauthammer
Krauthammer Lab, Department of Pathology
Yale University School of Medicine
5th International Symposium on
Semantic Mining in Biomedicine (SMBM)
3 September 2012
Zurich, Switzerland
Introduction
The inclusion of figure images is a recent trend in the area of
literature mining.
The growing number of open-access publications makes such
images available for automated analysis.
Image mining techniques can be used for image search interfaces,
for relation mining, and to complement text mining approaches.
T. Kuhn and M. Krauthammer, Yale University Image Mining from Gel Diagrams in Biomedical Publications 2 / 19
Yale Image Finder
http://krauthammerlab.med.yale.edu/imagefinder/
Gel Images
Our approach focuses on gel images:
• They are the result of gel electrophoresis (e.g. Southern,
Western and Northern blotting)
• They are often shown in biomedical publications as evidence for
the discussed findings (e.g. protein-protein interactions and
protein expression under different conditions)
• About 15% of all subfigures are gel images
• They are structured according to common regular patterns
Relations from Gel Images
Condition     Measurement   Result
MDA-MB-231    14-3-3σ       high expression
NHEM          14-3-3σ       no expression
C8161.9       14-3-3σ       high expression
LOX           14-3-3σ       low expression
MDA-MB-231    β-actin       high expression
NHEM          β-actin       high expression
C8161.9       β-actin       high expression
LOX           β-actin       high expression

Condition                      Measurement   Result
IL-1β (–) DEX (–) RU486 (–)    p-p38         low expression
IL-1β (+) DEX (–) RU486 (–)    p-p38         high expression
IL-1β (–) DEX (+) RU486 (–)    p-p38         no expression
IL-1β (+) DEX (+) RU486 (–)    p-p38         low expression
IL-1β (–) DEX (–) RU486 (+)    p-p38         no expression
IL-1β (+) DEX (–) RU486 (+)    p-p38         high expression
IL-1β (–) DEX (+) RU486 (+)    p-p38         low expression
IL-1β (+) DEX (+) RU486 (+)    p-p38         high expression
...                            ...           ...
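The first table can be read as one relation per (row label, lane label) cell of a gel panel. The following is a minimal sketch of that data structure, not the authors' implementation: the class name `GelRelation` and the helper `relations_from_grid` are hypothetical, and the intensity levels are typed in by hand from the table rather than measured from an image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GelRelation:
    condition: str    # lane label, e.g. a cell line or treatment combination
    measurement: str  # row label, e.g. the protein that was probed
    result: str       # qualitative band intensity


def relations_from_grid(conditions, measurements, intensities):
    """Build one relation per (row, lane) cell of a gel panel."""
    relations = []
    for measurement, row in zip(measurements, intensities):
        for condition, level in zip(conditions, row):
            relations.append(
                GelRelation(condition, measurement, f"{level} expression")
            )
    return relations


# Values copied by hand from the first table (illustrative input only).
conditions = ["MDA-MB-231", "NHEM", "C8161.9", "LOX"]
measurements = ["14-3-3σ", "β-actin"]
intensities = [
    ["high", "no", "high", "low"],     # 14-3-3σ row
    ["high", "high", "high", "high"],  # β-actin row
]
relations = relations_from_grid(conditions, measurements, intensities)
```

Keeping each cell as a flat triple makes the relations easy to aggregate across figures, which is what a collection-wide analysis would need.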
Image Mining Processes
In principle, image mining involves the same processes as classical
literature mining [1] (with some subtle but important differences):
• Document categorization (image categorization has to deal
with the two-dimensional space of pixels, instead of text)
• Named entity tagging (pinpointing the mention of an entity is
more difficult with images; OCR errors have to be considered)
• Fact extraction (analysis of graphical elements instead of
parsing complete sentences)
• Collection-wide analysis
[1] Berry De Bruijn and Joel Martin. 2002. Getting to the (c)ore of knowledge: mining biomedical literature.
International Journal of Medical Informatics, 67(1-3):7–18.
Procedure
[Pipeline diagram: articles → figures → segments → text → gels → gel panels → named entities → relations, connected by the seven numbered steps listed here]
1 Figure Extraction
2 Segmentation
3 Text Recognition
4 Gel Segment Detection
5 Gel Panel Detection
6 Named Entity Recognition
7 Relation Extraction
Figure Extraction
[Diagram: articles → figures (step 1)]
We use structured XML files of the open access subset of PubMed
Central.
(Figure extraction from PDF files or even bitmaps of scanned articles
would be more difficult, but definitely feasible.)
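As a rough illustration of figure extraction from structured XML, the sketch below walks an article and collects each figure's identifier, label, caption and image reference. It assumes JATS-style markup (`fig`, `label`, `caption`, and `graphic` with an `xlink:href` attribute); the function name `extract_figures` and the sample document are hypothetical, and the authors' actual extractor is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Fully qualified attribute name for xlink:href (assumed markup convention).
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def extract_figures(article_xml):
    """Return one dict per <fig> element found in the article XML string."""
    root = ET.fromstring(article_xml)
    figures = []
    for fig in root.iter("fig"):
        graphic = fig.find(".//graphic")
        caption = fig.find("caption")
        caption_text = (
            "".join(caption.itertext()).strip() if caption is not None else ""
        )
        figures.append({
            "id": fig.get("id"),
            "label": fig.findtext("label", default=""),
            "caption": caption_text,
            "image": graphic.get(XLINK_HREF) if graphic is not None else None,
        })
    return figures


# Made-up sample article, for illustration only.
sample = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="F1">
      <label>Figure 1</label>
      <caption><p>Western blot of protein X.</p></caption>
      <graphic xlink:href="example-1"/>
    </fig>
  </body>
</article>"""
figures = extract_figures(sample)
```

Because the XML names each image file explicitly, this step needs no image analysis, which is why the structured files are the easier starting point compared with PDFs or scans.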
Segmentation and Text Recognition
[Diagram: figures → segments (step 2) → text (step 3)]
For segmentation and text recognition we rely on our previous work [2].
This includes:
• Detection of layout elements
• Text region detection
• OCR (using the Microsoft Document Imaging package of MS
Office)
[2] Songhua Xu and Michael Krauthammer. 2010. A new pivoting and iterative text detection algorithm for
biomedical images. J. of Biomedical Informatics, 43(6):924–931, December.
Songhua Xu and Michael Krauthammer. 2011. Boosting text extraction from biomedical images using text region
detection. In Biomedical Sciences and Engineering Conference (BSEC), 2011, pages 1–4. IEEE.
Gel Segment Detection
[Diagram: gel segments (step 4)]
Random forest classifiers (based on 75 random trees) on the following
features of image segments:
• coordinates of the relative position within the image
• relative and absolute width and height
• 16 grayscale histogram features
• color features: red, green and blue
• 13 texture features
• number of recognized characters
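To make the feature list more concrete, here is a minimal sketch of how such a vector might be assembled for one image segment. It is an assumption-laden illustration: the function name `segment_features`, the averaging of RGB channels to obtain grayscale, and the equal-width histogram bins are all choices made for this example, and the 13 texture features are left out. Training the random forest itself (75 trees on vectors like these) is not shown.

```python
def segment_features(pixels, box, image_size, n_chars):
    """Build a feature vector for one segment.

    pixels     -- list of (r, g, b) tuples, each channel in 0..255
    box        -- (x, y, width, height) of the segment in image coordinates
    image_size -- (image_width, image_height)
    n_chars    -- number of characters recognized inside the segment
    """
    x, y, w, h = box
    img_w, img_h = image_size
    n = len(pixels)

    # 16-bin grayscale histogram, normalized to fractions of pixels.
    hist = [0] * 16
    for r, g, b in pixels:
        gray = (r + g + b) // 3  # simple channel average (assumption)
        hist[gray // 16] += 1
    hist = [count / n for count in hist]

    # Mean red, green and blue values.
    mean_rgb = [sum(p[c] for p in pixels) / n for c in range(3)]

    return (
        [(x + w / 2) / img_w, (y + h / 2) / img_h]  # relative position
        + [w / img_w, h / img_h, w, h]              # relative and absolute size
        + hist
        + mean_rgb
        + [n_chars]
    )


# Tiny made-up segment: three black pixels and one white pixel.
features = segment_features(
    pixels=[(0, 0, 0)] * 3 + [(255, 255, 255)],
    box=(10, 20, 2, 2),
    image_size=(100, 200),
    n_chars=0,
)
```

The resulting vector has 26 entries (2 position, 4 size, 16 histogram, 3 color, 1 character count); adding the texture features would extend it further.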
Gel Segment Detection Results
Manually annotated training and testing sets of 500 random figures
each.
Results for three different thresholds:
                 Threshold   Precision   Recall   F-score
high recall      0.15        0.439       0.909    0.592
                 0.30        0.765       0.739    0.752
high precision   0.60        0.926       0.301    0.455
Accuracy (area under ROC curve): 98.0%
Unbalanced set: 3% gel segments vs. 97% non-gel segments
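The F-scores in the table are consistent with the balanced F-measure, the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall values; the function name `f_score` is chosen for this example. Because the inputs are already rounded to three decimals, the recomputed values agree with the table only to within rounding.

```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Threshold -> (precision, recall), as reported in the table.
reported = {
    0.15: (0.439, 0.909),
    0.30: (0.765, 0.739),
    0.60: (0.926, 0.301),
}
for threshold, (p, r) in reported.items():
    print(f"threshold {threshold:.2f}: F = {f_score(p, r):.3f}")
```

The comparison shows the trade-off the two extreme thresholds are meant to exploit: a low threshold keeps most true gel segments, while a high one admits few false positives.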
Gel Panel Detection
[Diagram: gel panels (step 5)]
Algorithm:
• Start with a gel segment according to the high-precision classifier
• Repeatedly look for adjacent gel segments according to the
high-recall classifier, and merge them
• Collect labels in the form of text segments around the detected
gel region
Results on another set of 500 manually annotated figures:
Precision Recall F-score
0.951 0.379 0.542
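The seed-and-grow idea of the algorithm can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: segments are axis-aligned boxes `(x, y, width, height)` with a classifier score, the high-precision and high-recall classifiers are modeled as the 0.60 and 0.15 thresholds from the gel segment results, adjacency means lying within a hypothetical `gap` of 5 pixels, and label collection is omitted.

```python
def adjacent(a, b, gap=5):
    """True if boxes a and b overlap or lie within `gap` pixels of each other."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return (ax - gap <= bx + bw and bx - gap <= ax + aw and
            ay - gap <= by + bh and by - gap <= ay + ah)


def merge(a, b):
    """Smallest box that contains both a and b."""
    x0, y0 = min(a[0], b[0]), min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return (x0, y0, x1 - x0, y1 - y0)


def grow_panels(segments, seed_threshold=0.60, grow_threshold=0.15):
    """segments: list of (box, gel_score). Returns the merged panel boxes."""
    candidates = [box for box, score in segments if score >= grow_threshold]
    seeds = [box for box, score in segments if score >= seed_threshold]
    used = set()
    panels = []
    for seed in seeds:
        if seed in used:  # already absorbed into an earlier panel
            continue
        panel = seed
        used.add(seed)
        changed = True
        while changed:  # repeat until no adjacent candidate is left
            changed = False
            for box in candidates:
                if box not in used and adjacent(panel, box):
                    panel = merge(panel, box)
                    used.add(box)
                    changed = True
        panels.append(panel)
    return panels


# Made-up segments: three stacked gel strips, one distant weak candidate,
# and one clear non-gel segment.
segments = [
    ((0, 0, 100, 20), 0.80),
    ((0, 22, 100, 20), 0.20),
    ((0, 44, 100, 20), 0.70),
    ((0, 200, 100, 20), 0.30),
    ((300, 300, 50, 50), 0.05),
]
panels = grow_panels(segments)
```

Requiring a high-precision seed before growing with the permissive threshold is one way to explain the reported profile of high precision at modest recall: weak candidates that are not next to a confident segment never form a panel.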
Named Entity Recognition
[Diagram: named entities (step 6)]
Detection of gene and protein names in gel labels:
• Tokenization of gel label texts
• Lookup in Entrez Gene database
• Case-sensitive matching
• Exclude tokens:
  • Fewer than 3 characters
  • Arabic or Roman (Latin) numerals
  • Common short words (from a list of the 100 most frequent words
    in biomedical articles)
  • 22 general words frequently used in gel diagrams (e.g. min, hrs,
    line, type, protein, DNA)
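The filtering rules above can be sketched as a small dictionary lookup. Everything concrete in this example is a stand-in: `GENE_NAMES` is a tiny hypothetical set in place of the Entrez Gene database, `COMMON_WORDS` and `GEL_WORDS` contain only a few illustrative entries rather than the full lists, the tokenizer is a simple regular expression, and "Latin numbers" is interpreted as uppercase Roman numerals.

```python
import re

# Strict Roman numeral pattern (uppercase only, an assumption).
ROMAN = re.compile(
    r"M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)
COMMON_WORDS = {"the", "and", "with"}  # stand-in for the 100-word list
GEL_WORDS = {"min", "hrs", "line", "type", "protein", "dna"}  # subset of the 22
GENE_NAMES = {"STAT3", "p53", "AKT1"}  # hypothetical stand-in for Entrez Gene


def tokenize(label):
    """Split a gel label into word-like tokens, keeping inner hyphens."""
    return re.findall(r"\w[\w-]*", label)


def is_excluded(token):
    lower = token.lower()
    return (
        len(token) < 3
        or token.isdigit()                     # Arabic numbers
        or ROMAN.fullmatch(token) is not None  # Roman numerals
        or lower in COMMON_WORDS
        or lower in GEL_WORDS
    )


def gene_tokens(label):
    """Tokens that pass the filters and match a gene name case-sensitively."""
    return [
        t for t in tokenize(label)
        if not is_excluded(t) and t in GENE_NAMES
    ]


found = gene_tokens("STAT3 protein 30 min III stat3 p53")
```

Here `stat3` is rejected because the lookup is case-sensitive, which trades some recall for fewer accidental matches on ordinary words.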
Named Entity Recognition Results
Recognized gene/protein tokens in 2000 random figures:
                                                absolute   relative
Total                                           156        100.0%
Incorrect                                       54         34.6%
  – Not mentioned (OCR errors)                  28         17.9%
  – Not references to genes or proteins         26         16.7%
Correct                                         102        65.4%
  – Partially correct (could be more specific)  14         9.0%
  – Fully correct                               88         56.4%
Relation Extraction
[Diagram: relations (step 7)]
Relation extraction is future work and we do not have concrete
results at this point.
It would involve the following steps:
• Gene/protein name disambiguation
• Identify semantic roles (condition, measurement, ...)
• Quantify degree of expression
Combination with classical text mining techniques seems promising.
Overall Results on PubMed Central
We ran our pipeline on the whole open access subset of PubMed
Central:
Total articles                           410 950
Processed articles                       386 428
Total figures from processed articles  1 110 643
Processed figures                        884 152
Detected gel panels                       85 942
Detected gel panels per figure             0.097
Detected gel labels                      309 340
Detected gel labels per panel              3.599
Detected gene tokens                   1 854 609
Detected gene tokens in gel labels        75 610
Gene token ratio                           0.033
Gene token ratio in gel labels             0.068
Discussion: Standardized Biomedical Diagrams?
It seems feasible to extract relations from gel images at satisfactory
accuracy, but it is clear that this procedure is far from perfect.
Shouldn’t we standardize biomedical diagrams? A Unified
Modeling Language (UML) for biomedicine?
Conclusions and Future Work
Conclusions:
• Gel segments can be detected with high accuracy
• Detection of gel panels at high precision
• Gene/protein name recognition in gel labels at satisfactory
precision
→ Image mining from gel diagrams is feasible
Future Work:
• Relation extraction
• Combination with classical text mining techniques
• Other named entity types: cell lines, drugs, ...
• Standard for biomedical diagrams?
Thank you for your Attention!
Questions?
T. Kuhn and M. Krauthammer, Yale University Image Mining from Gel Diagrams in Biomedical Publications 19 / 19

Más contenido relacionado

La actualidad más candente

Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...
CSCJournals
 
Biological Significance of Gene Expression Data Using Similarity Based Biclus...
Biological Significance of Gene Expression Data Using Similarity Based Biclus...Biological Significance of Gene Expression Data Using Similarity Based Biclus...
Biological Significance of Gene Expression Data Using Similarity Based Biclus...
CSCJournals
 
Engineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization toolEngineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization tool
arx-deidentifier
 
IEEE AP-MTT OSU Seminar
IEEE AP-MTT OSU SeminarIEEE AP-MTT OSU Seminar
IEEE AP-MTT OSU Seminar
Ogan Gurel MD
 
Subgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructurSubgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructur
IAEME Publication
 
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
arx-deidentifier
 

La actualidad más candente (18)

(2011) Comparison of Face Image Quality Metrics
(2011) Comparison of Face Image Quality Metrics(2011) Comparison of Face Image Quality Metrics
(2011) Comparison of Face Image Quality Metrics
 
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
A Review of Various Methods Used in the Analysis of Functional Gene Expressio...
 
Decision Support System for Bat Identification using Random Forest and C5.0
Decision Support System for Bat Identification using Random Forest and C5.0Decision Support System for Bat Identification using Random Forest and C5.0
Decision Support System for Bat Identification using Random Forest and C5.0
 
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...
Multiple Features Based Two-stage Hybrid Classifier Ensembles for Subcellular...
 
Biological Significance of Gene Expression Data Using Similarity Based Biclus...
Biological Significance of Gene Expression Data Using Similarity Based Biclus...Biological Significance of Gene Expression Data Using Similarity Based Biclus...
Biological Significance of Gene Expression Data Using Similarity Based Biclus...
 
Cheminformatics in drug design
Cheminformatics in drug designCheminformatics in drug design
Cheminformatics in drug design
 
Classification of Microarray Gene Expression Data by Gene Combinations using ...
Classification of Microarray Gene Expression Data by Gene Combinations using ...Classification of Microarray Gene Expression Data by Gene Combinations using ...
Classification of Microarray Gene Expression Data by Gene Combinations using ...
 
An Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-IdentificationAn Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-Identification
 
Engineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization toolEngineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization tool
 
IEEE AP-MTT OSU Seminar
IEEE AP-MTT OSU SeminarIEEE AP-MTT OSU Seminar
IEEE AP-MTT OSU Seminar
 
IRJET- Plant Disease Identification System
IRJET- Plant Disease Identification SystemIRJET- Plant Disease Identification System
IRJET- Plant Disease Identification System
 
Comparison of Feature selection methods for diagnosis of cervical cancer usin...
Comparison of Feature selection methods for diagnosis of cervical cancer usin...Comparison of Feature selection methods for diagnosis of cervical cancer usin...
Comparison of Feature selection methods for diagnosis of cervical cancer usin...
 
Subgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructurSubgraph relative frequency approach for extracting interesting substructur
Subgraph relative frequency approach for extracting interesting substructur
 
diffraction techniques
 diffraction techniques diffraction techniques
diffraction techniques
 
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
 
Segmentation and removal of interphase cells from chromosome
Segmentation and removal of interphase cells from chromosomeSegmentation and removal of interphase cells from chromosome
Segmentation and removal of interphase cells from chromosome
 
CV
CVCV
CV
 
Advances in prokaryote classification from microscopic images
Advances in prokaryote classification from microscopic imagesAdvances in prokaryote classification from microscopic images
Advances in prokaryote classification from microscopic images
 

Similar a Image Mining from Gel Diagrams in Biomedical Publications

Chemoinformatics—an introduction for computer scientists
Chemoinformatics—an introduction for computer scientistsChemoinformatics—an introduction for computer scientists
Chemoinformatics—an introduction for computer scientists
unyil96
 
BolingerJustin - Honors Thesis
BolingerJustin - Honors ThesisBolingerJustin - Honors Thesis
BolingerJustin - Honors Thesis
Justin P. Bolinger
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)
Michael Atkins
 

Similar a Image Mining from Gel Diagrams in Biomedical Publications (20)

Challenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical researchChallenges and opportunities for machine learning in biomedical research
Challenges and opportunities for machine learning in biomedical research
 
Introduction to graph databases: Neo4j and Cypher
Introduction to graph databases: Neo4j and CypherIntroduction to graph databases: Neo4j and Cypher
Introduction to graph databases: Neo4j and Cypher
 
Algorithmic approach to computational biology using graphs
Algorithmic approach to computational biology using graphsAlgorithmic approach to computational biology using graphs
Algorithmic approach to computational biology using graphs
 
The Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related SciencesThe Impact of Information Technology on Chemistry and Related Sciences
The Impact of Information Technology on Chemistry and Related Sciences
 
Images as Occlusions of Textures: A Framework for Segmentation
Images as Occlusions of Textures: A Framework for SegmentationImages as Occlusions of Textures: A Framework for Segmentation
Images as Occlusions of Textures: A Framework for Segmentation
 
Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...
Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...
Texture-Based Computational Models of Tissue in Biomedical Images: Initial Ex...
 
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
Introduction to Machine Learning and Texture Analysis for Lesion Characteriza...
 
American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1American Statistical Association October 23 2009 Presentation Part 1
American Statistical Association October 23 2009 Presentation Part 1
 
NetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang SuNetBioSIG2013-Talk Gang Su
NetBioSIG2013-Talk Gang Su
 
Viva201393(1).pptxbaru
Viva201393(1).pptxbaruViva201393(1).pptxbaru
Viva201393(1).pptxbaru
 
Research summary
Research summaryResearch summary
Research summary
 
Chemoinformatics—an introduction for computer scientists
Chemoinformatics—an introduction for computer scientistsChemoinformatics—an introduction for computer scientists
Chemoinformatics—an introduction for computer scientists
 
Introduction to bioinformatics
Introduction to bioinformaticsIntroduction to bioinformatics
Introduction to bioinformatics
 
BolingerJustin - Honors Thesis
BolingerJustin - Honors ThesisBolingerJustin - Honors Thesis
BolingerJustin - Honors Thesis
 
Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...
Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...
Detection of Cancer in Pap smear Cytological Images Using Bag of Texture Feat...
 
A01110107
A01110107A01110107
A01110107
 
Bio ontology drtc-seminar_anwesha
Bio ontology drtc-seminar_anweshaBio ontology drtc-seminar_anwesha
Bio ontology drtc-seminar_anwesha
 
Basics of Data Analysis in Bioinformatics
Basics of Data Analysis in BioinformaticsBasics of Data Analysis in Bioinformatics
Basics of Data Analysis in Bioinformatics
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)
 
Bioinformatics-R program의 실례
Bioinformatics-R program의 실례Bioinformatics-R program의 실례
Bioinformatics-R program의 실례
 

Más de Tobias Kuhn

A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...
A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...
A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...
Tobias Kuhn
 
Science Bots: A Model for the Future of Scientific Computation?
Science Bots: A Model for the Future of Scientific Computation?Science Bots: A Model for the Future of Scientific Computation?
Science Bots: A Model for the Future of Scientific Computation?
Tobias Kuhn
 
Citation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureCitation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific Literature
Tobias Kuhn
 
Citation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureCitation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific Literature
Tobias Kuhn
 
Automatische Übersetzung in einem multilingualen, semantischen Wiki
Automatische Übersetzung in einem multilingualen, semantischen WikiAutomatische Übersetzung in einem multilingualen, semantischen Wiki
Automatische Übersetzung in einem multilingualen, semantischen Wiki
Tobias Kuhn
 

Más de Tobias Kuhn (20)

Nanopublications and Decentralized Publishing
Nanopublications and Decentralized PublishingNanopublications and Decentralized Publishing
Nanopublications and Decentralized Publishing
 
Linked Data Publishing with Nanopublications
Linked Data Publishing with NanopublicationsLinked Data Publishing with Nanopublications
Linked Data Publishing with Nanopublications
 
Genuine semantic publishing
Genuine semantic publishingGenuine semantic publishing
Genuine semantic publishing
 
A Decentralized Approach to Dissemination, Retrieval, and Archiving of Data
A Decentralized Approach to Dissemination, Retrieval, and Archiving of DataA Decentralized Approach to Dissemination, Retrieval, and Archiving of Data
A Decentralized Approach to Dissemination, Retrieval, and Archiving of Data
 
The Controlled Natural Language of Randall Munroe’s Thing Explainer
The Controlled Natural Language of Randall Munroe’s Thing Explainer The Controlled Natural Language of Randall Munroe’s Thing Explainer
The Controlled Natural Language of Randall Munroe’s Thing Explainer
 
Publishing without Publishers: a Decentralized Approach to Dissemination, Ret...
Publishing without Publishers: a Decentralized Approach to Dissemination, Ret...Publishing without Publishers: a Decentralized Approach to Dissemination, Ret...
Publishing without Publishers: a Decentralized Approach to Dissemination, Ret...
 
nanopub-java: A Java Library for Nanopublications
nanopub-java: A Java Library for Nanopublicationsnanopub-java: A Java Library for Nanopublications
nanopub-java: A Java Library for Nanopublications
 
Semantic Publishing and Nanopublications
Semantic Publishing and NanopublicationsSemantic Publishing and Nanopublications
Semantic Publishing and Nanopublications
 
Scientific Data Publishing
Scientific Data PublishingScientific Data Publishing
Scientific Data Publishing
 
A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...
A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...
A Decentralized Network for Publishing Linked Data — Nanopublications, Trusty...
 
Science Bots: A Model for the Future of Scientific Computation?
Science Bots: A Model for the Future of Scientific Computation?Science Bots: A Model for the Future of Scientific Computation?
Science Bots: A Model for the Future of Scientific Computation?
 
Data Publishing and Post-Publication Reviews
Data Publishing and Post-Publication ReviewsData Publishing and Post-Publication Reviews
Data Publishing and Post-Publication Reviews
 
Semantic Publishing with Nanopublications
Semantic Publishing with Nanopublications Semantic Publishing with Nanopublications
Semantic Publishing with Nanopublications
 
Nanopubs
NanopubsNanopubs
Nanopubs
 
Meme Extraction from Corpora of Scientific Literature using Citation Networks
Meme Extraction from Corpora of Scientific Literature using Citation NetworksMeme Extraction from Corpora of Scientific Literature using Citation Networks
Meme Extraction from Corpora of Scientific Literature using Citation Networks
 
A Multilingual Semantic Wiki Based on Controlled Natural Language
A Multilingual Semantic Wiki Based on Controlled Natural LanguageA Multilingual Semantic Wiki Based on Controlled Natural Language
A Multilingual Semantic Wiki Based on Controlled Natural Language
 
Citation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureCitation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific Literature
 
Citation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific LiteratureCitation Graph Analysis to Identify Memes in Scientific Literature
Citation Graph Analysis to Identify Memes in Scientific Literature
 
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...
 
Automatische Übersetzung in einem multilingualen, semantischen Wiki
Automatische Übersetzung in einem multilingualen, semantischen WikiAutomatische Übersetzung in einem multilingualen, semantischen Wiki
Automatische Übersetzung in einem multilingualen, semantischen Wiki
 

Último

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Image Mining from Gel Diagrams in Biomedical Publications

  • 1. Image Mining from Gel Diagrams in Biomedical Publications Tobias Kuhn and Michael Krauthammer Krauthammer Lab, Department of Pathology Yale University School of Medicine 5th International Symposium on Semantic Mining in Biomedicine (SMBM) 3 September 2012 Zurich, Switzerland
  • 2. Introduction The inclusion of figure images is a recent trend in the area of literature mining. The increasing amount of open access publications makes such images available for automated analysis. Image mining techniques can be used for image search interfaces, for relation mining, and to complement text mining approaches. T. Kuhn and M. Krauthammer, Yale University Image Mining from Gel Diagrams in Biomedical Publications 2 / 19
  • 3. Yale Image Finder
       http://krauthammerlab.med.yale.edu/imagefinder/
  • 4. Gel Images
       Our approach focuses on gel images:
       • They are the result of gel electrophoresis (e.g. Southern, Western and Northern blotting)
       • They are often shown in biomedical publications as evidence for the discussed findings (e.g. protein-protein interactions and protein expression under different conditions)
       • About 15% of all subfigures are gel images
       • They are structured according to common regular patterns
  • 5. Relations from Gel Images

       Condition     Measurement   Result
       MDA-MB-231    14-3-3σ       high expression
       NHEM          14-3-3σ       no expression
       C8161.9       14-3-3σ       high expression
       LOX           14-3-3σ       low expression
       MDA-MB-231    β-actin       high expression
       NHEM          β-actin       high expression
       C8161.9       β-actin       high expression
       LOX           β-actin       high expression

       Condition                      Measurement   Result
       IL-1β (–) DEX (–) RU486 (–)    p-p38         low expression
       IL-1β (+) DEX (–) RU486 (–)    p-p38         high expression
       IL-1β (–) DEX (+) RU486 (–)    p-p38         no expression
       IL-1β (+) DEX (+) RU486 (–)    p-p38         low expression
       IL-1β (–) DEX (–) RU486 (+)    p-p38         no expression
       IL-1β (+) DEX (–) RU486 (+)    p-p38         high expression
       IL-1β (–) DEX (+) RU486 (+)    p-p38         low expression
       IL-1β (+) DEX (+) RU486 (+)    p-p38         high expression
       ...                            ...           ...
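The condition/measurement/result rows on this slide can be held as simple triples. The following is only an illustrative sketch: the `Relation` class and the `conditions_with` helper are hypothetical names, not part of the authors' system, and the data is the first table from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One row of the slide's table (hypothetical representation)."""
    condition: str
    measurement: str
    result: str


# Rows copied from the first table on the slide.
RELATIONS = [
    Relation("MDA-MB-231", "14-3-3σ", "high expression"),
    Relation("NHEM", "14-3-3σ", "no expression"),
    Relation("C8161.9", "14-3-3σ", "high expression"),
    Relation("LOX", "14-3-3σ", "low expression"),
    Relation("MDA-MB-231", "β-actin", "high expression"),
    Relation("NHEM", "β-actin", "high expression"),
    Relation("C8161.9", "β-actin", "high expression"),
    Relation("LOX", "β-actin", "high expression"),
]


def conditions_with(relations, measurement, result):
    """Return the conditions under which `measurement` showed `result`."""
    return [r.condition for r in relations
            if r.measurement == measurement and r.result == result]


print(conditions_with(RELATIONS, "14-3-3σ", "high expression"))
```

Stored this way, a collection-wide query such as "which cell lines show high 14-3-3σ expression" becomes a simple filter.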
  • 6. Image Mining Processes
       In principle, image mining involves the same processes as classical literature mining¹ (with some subtle but important differences):
       • Document categorization (image categorization has to deal with the two-dimensional space of pixels instead of text)
       • Named entity tagging (pinpointing the mention of an entity is more difficult with images; OCR errors have to be considered)
       • Fact extraction (analysis of graphical elements instead of parsing complete sentences)
       • Collection-wide analysis
       ¹ Berry De Bruijn and Joel Martin. 2002. Getting to the (c)ore of knowledge: mining biomedical literature. International Journal of Medical Informatics, 67(1-3):7–18.
  • 7. Procedure
       [Pipeline diagram: articles, figures, segments, text, gels, gel panels, named entities, relations]
       1 Figure Extraction
       2 Segmentation
       3 Text Recognition
       4 Gel Segment Detection
       5 Gel Panel Detection
       6 Named Entity Recognition
       7 Relation Extraction
  • 8. Figure Extraction
       [Diagram: step 1, from articles to figures]
       We use structured XML files of the open access subset of PubMed Central. (Figure extraction from PDF files or even bitmaps of scanned articles would be more difficult, but definitely feasible.)
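As a rough illustration of figure extraction from structured article XML, the sketch below collects `<fig>` elements with Python's standard `xml.etree.ElementTree`. It assumes JATS-style markup (`<fig>`, `<label>`, `<caption>`, `<graphic xlink:href="...">`); the sample document and the `extract_figures` function are made up for this example and are not the authors' code.

```python
import xml.etree.ElementTree as ET

# Attribute name of xlink:href in ElementTree's {namespace}local notation.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def extract_figures(xml_text):
    """Return one dict per <fig> element: id, label, caption text, graphic reference."""
    root = ET.fromstring(xml_text)
    figures = []
    for fig in root.iter("fig"):
        graphic = fig.find(".//graphic")
        caption = fig.find("caption")
        caption_text = ""
        if caption is not None:
            # Join all nested text and normalize whitespace.
            caption_text = " ".join("".join(caption.itertext()).split())
        figures.append({
            "id": fig.get("id"),
            "label": (fig.findtext("label") or "").strip(),
            "caption": caption_text,
            "graphic": graphic.get(XLINK_HREF) if graphic is not None else None,
        })
    return figures


# Invented sample article, only to exercise the function.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="F1">
      <label>Figure 1</label>
      <caption><p>Western blot of <italic>p-p38</italic> expression.</p></caption>
      <graphic xlink:href="example-1-1"/>
    </fig>
  </body>
</article>"""

for figure in extract_figures(SAMPLE):
    print(figure)
```

The `graphic` value is only a reference; locating the actual image file for it depends on how the article package is laid out and is left out here.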
  • 9. Segmentation and Text Recognition
       [Diagram: steps 2 and 3, segments and text]
       For segmentation and text recognition we rely on our previous work.² This includes:
       • Detection of layout elements
       • Text region detection
       • OCR (using the Microsoft Document Imaging package of MS Office)
       ² Songhua Xu and Michael Krauthammer. 2010. A new pivoting and iterative text detection algorithm for biomedical images. Journal of Biomedical Informatics, 43(6):924–931.
         Songhua Xu and Michael Krauthammer. 2011. Boosting text extraction from biomedical images using text region detection. In Biomedical Sciences and Engineering Conference (BSEC), 2011, pages 1–4. IEEE.
  • 10. Gel Segment Detection
       [Diagram: step 4, gels]
       We use random forest classifiers (each based on 75 random trees) on the following features of image segments:
       • coordinates of the relative position within the image
       • relative and absolute width and height
       • 16 grayscale histogram features
       • color features: red, green and blue
       • 13 texture features
       • number of recognized characters
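To make the feature list more concrete, here is a minimal sketch of how part of such a feature vector could be assembled for one segment. It covers only position, size, the 16-bin grayscale histogram and the character count; the color and texture features are omitted. The function names, the use of the segment center as "relative position", and the pixel representation (rows of 8-bit gray values) are assumptions for illustration.

```python
def grayscale_histogram(pixels, bins=16):
    """Normalized histogram of 8-bit gray values (0..255) over equal-width bins."""
    counts = [0] * bins
    total = 0
    for row in pixels:
        for value in row:
            counts[min(value * bins // 256, bins - 1)] += 1
            total += 1
    return [c / total for c in counts] if total else [0.0] * bins


def segment_features(x, y, width, height, image_width, image_height, pixels, n_chars):
    """Build a partial feature vector for one image segment.

    (x, y) is the segment's top-left corner in image coordinates; `pixels` holds
    the segment's gray values; `n_chars` is the number of characters OCR found.
    """
    features = [
        (x + width / 2) / image_width,    # relative horizontal position (center)
        (y + height / 2) / image_height,  # relative vertical position (center)
        width / image_width,              # relative width
        height / image_height,            # relative height
        float(width),                     # absolute width
        float(height),                    # absolute height
    ]
    features.extend(grayscale_histogram(pixels))  # 16 histogram features
    features.append(float(n_chars))               # number of recognized characters
    return features


vector = segment_features(10, 20, 2, 2, 100, 200, [[0, 255], [128, 15]], n_chars=0)
print(len(vector), vector[:6])
```

Vectors like this could then be passed to any random forest implementation configured with 75 trees; the slides do not say which implementation was used.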
  • 11. Gel Segment Detection Results
       Manually annotated training and testing sets of 500 random figures each. Results for three different thresholds:

                        Threshold   Precision   Recall   F-score
       high recall      0.15        0.439       0.909    0.592
                        0.30        0.765       0.739    0.752
       high precision   0.60        0.926       0.301    0.455

       Accuracy (area under ROC curve): 98.0%
       Unbalanced set: 3% gel segments vs. 97% non-gel segments
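The F-scores in this table are consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The short check below recomputes them from the slide's precision and recall values; the `f_score` helper is written for this note only.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (threshold, precision, recall, F-score as reported on the slide)
REPORTED = [
    (0.15, 0.439, 0.909, 0.592),
    (0.30, 0.765, 0.739, 0.752),
    (0.60, 0.926, 0.301, 0.455),
]

for threshold, p, r, reported in REPORTED:
    print(f"threshold {threshold:.2f}: computed {f_score(p, r):.3f}, slide {reported:.3f}")
```

The recomputed values agree with the slide to within about 0.001; a difference in the last digit is plausible because the precision and recall shown are themselves rounded.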
  • 12. Gel Panel Detection
       [Diagram: step 5, gel panels]
       Algorithm:
       • Start with a gel segment according to the high-precision classifier
       • Repeatedly look for adjacent gel segments according to the high-recall classifier, and merge them
       • Collect labels in the form of text segments around the detected gel region
       Results on another set of 500 manually annotated figures:

       Precision   Recall   F-score
       0.951       0.379    0.542
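The seed-and-grow idea of the algorithm above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the thresholds 0.60 and 0.15 are taken from the gel segment detection results, while the `Segment` class, the rectangle-based `adjacent` test and its `gap` tolerance are assumptions. The label collection step is not shown.

```python
from dataclasses import dataclass

HIGH_PRECISION = 0.60  # seed threshold (from the gel segment results)
HIGH_RECALL = 0.15     # growth threshold (from the gel segment results)


@dataclass(frozen=True)
class Segment:
    x: int
    y: int
    w: int
    h: int
    gel_score: float  # classifier score for "is a gel segment"


def adjacent(a, b, gap=5):
    """True if the rectangles touch or lie within `gap` pixels (assumed criterion)."""
    return (a.x - gap <= b.x + b.w and b.x - gap <= a.x + a.w and
            a.y - gap <= b.y + b.h and b.y - gap <= a.y + a.h)


def detect_panels(segments):
    """Return panels as sorted lists of segment indices."""
    panels = []
    used = set()
    for i, seed in enumerate(segments):
        if i in used or seed.gel_score < HIGH_PRECISION:
            continue  # only confident, not yet merged segments start a panel
        panel = [i]
        used.add(i)
        grew = True
        while grew:  # keep merging until no adjacent candidate remains
            grew = False
            for j, cand in enumerate(segments):
                if j in used or cand.gel_score < HIGH_RECALL:
                    continue
                if any(adjacent(segments[k], cand) for k in panel):
                    panel.append(j)
                    used.add(j)
                    grew = True
        panels.append(sorted(panel))
    return panels


SEGMENTS = [
    Segment(0, 0, 100, 20, 0.90),    # confident seed
    Segment(0, 22, 100, 20, 0.20),   # weak, adjacent to the seed
    Segment(0, 44, 100, 20, 0.30),   # weak, adjacent to the previous one
    Segment(300, 300, 50, 50, 0.40), # weak and isolated: never becomes a panel
    Segment(0, 200, 100, 20, 0.05),  # below the high-recall threshold
]
print(detect_panels(SEGMENTS))
```

Seeding only from high-precision segments and growing with the permissive threshold matches the reported behavior of the step: high precision at the cost of recall.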
  • 13. Named Entity Recognition
       [Diagram: step 6, named entities]
       Detection of gene and protein names in gel labels:
       • Tokenization of gel label texts
       • Lookup in Entrez Gene database
       • Case-sensitive matching
       • Exclude tokens:
         • Less than 3 characters
         • Arabic or Latin (Roman) numerals
         • Common short words (from a list of the 100 most frequent words in biomedical articles)
         • 22 general words frequently used in gel diagrams (e.g. min, hrs, line, type, protein, DNA)
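The tokenize, filter and lookup steps listed above could look roughly like the sketch below. Everything concrete in it is a stand-in: `GENE_NAMES` is a tiny hypothetical set replacing the Entrez Gene lookup, `COMMON_WORDS` replaces the list of 100 frequent words, `GEL_WORDS` contains only the six examples named on the slide, and the tokenization rule and case-insensitive handling of the exclusion lists are assumptions.

```python
import re

# Hypothetical stand-in for a lookup in the Entrez Gene database.
GENE_NAMES = {"TP53", "BRCA1", "p38", "MAPK14"}
# Stand-in for the list of the 100 most frequent words in biomedical articles.
COMMON_WORDS = {"the", "and", "with", "for", "was"}
# Only the six examples given on the slide (the full list has 22 entries).
GEL_WORDS = {"min", "hrs", "line", "type", "protein", "dna"}

# Upper-case Roman numerals; the lookahead rejects the empty string.
ROMAN = re.compile(r"(?=[MDCLXVI])M*(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")


def tokenize(label):
    """Split a label into alphanumeric tokens, keeping internal hyphens."""
    return re.findall(r"[^\W_]+(?:-[^\W_]+)*", label)


def is_excluded(token):
    """Apply the exclusion rules from the slide."""
    return (len(token) < 3
            or token.isdigit()
            or ROMAN.fullmatch(token) is not None
            or token.lower() in COMMON_WORDS
            or token.lower() in GEL_WORDS)


def recognize(label):
    """Return tokens that survive filtering and match a gene name case-sensitively."""
    return [t for t in tokenize(label)
            if not is_excluded(t) and t in GENE_NAMES]


print(recognize("TP53 protein 30 min III tp53 p38"))
```

Because matching is case-sensitive, `tp53` is not accepted here even though `TP53` is; this trades some recall for fewer false matches against ordinary words.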
  • 14. Named Entity Recognition Results
       Recognized gene/protein tokens in 2000 random figures:

                                                      absolute   relative
       Total                                          156        100.0%
       Incorrect                                      54         34.6%
       – Not mentioned (OCR errors)                   28         17.9%
       – Not references to genes or proteins          26         16.7%
       Correct                                        102        65.3%
       – Partially correct (could be more specific)   14         9.0%
       – Fully correct                                88         56.4%
  • 15. Relation Extraction
       [Diagram: step 7, relations]
       Relation extraction is future work and we do not have concrete results at this point. It would involve the following steps:
       • Gene/protein name disambiguation
       • Identify semantic roles (condition, measurement, ...)
       • Quantify degree of expression
       Combination with classical text mining techniques seems promising.
  • 16. Overall Results on PubMed Central
       We ran our pipeline on the whole open access subset of PubMed Central:

       Total articles                         410 950
       Processed articles                     386 428
       Total figures from processed articles  1 110 643
       Processed figures                      884 152
       Detected gel panels                    85 942
       Detected gel panels per figure         0.097
       Detected gel labels                    309 340
       Detected gel labels per panel          3.599
       Detected gene tokens                   1 854 609
       Detected gene tokens in gel labels     75 610
       Gene token ratio                       0.033
       Gene token ratio in gel labels         0.068
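The two per-item ratios in this table follow directly from the counts, as the short check below shows. The two gene token ratios cannot be recomputed the same way, because the total token counts they are based on are not given on the slide. Variable names are chosen for this note only.

```python
# Counts copied from the slide.
processed_figures = 884_152
detected_gel_panels = 85_942
detected_gel_labels = 309_340

panels_per_figure = detected_gel_panels / processed_figures
labels_per_panel = detected_gel_labels / detected_gel_panels

print(f"gel panels per figure: {panels_per_figure:.3f}")
print(f"gel labels per panel:  {labels_per_panel:.3f}")
```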
  • 17. Discussion: Standardized Biomedical Diagrams?
       It seems feasible to extract relations from gel images at satisfactory accuracy, but it is clear that this procedure is far from perfect.
       Shouldn’t we standardize biomedical diagrams?
       A Unified Modeling Language (UML) for biomedicine?
  • 18. Conclusions and Future Work
       Conclusions:
       • Gel segments can be detected with high accuracy
       • Detection of gel panels at high precision
       • Gene/protein name recognition in gel labels at satisfactory precision
       → Image mining from gel diagrams is feasible
       Future Work:
       • Relation extraction
       • Combination with classical text mining techniques
       • Other named entity types: cell lines, drugs, ...
       • Standard for biomedical diagrams?
  • 19. Thank you for your attention! Questions?