3. Agenda
Introduction to Biomedical Text Mining
System Overview
Problem Description
Motivation
Challenges
System Framework
Application upon System Framework
Swanson’s Algorithm
Protein to Protein Interactions (PPI)
Gene Clustering based on Text Mining
Extended Work
Conclusion and Future Work.
5. Introduction to Biomedical Text Mining
Text Mining = Processing unstructured (textual)
information to extract meaningful data and make the
information contained in the text accessible to
various data mining (statistical and machine learning)
algorithms.
Biomedical Text Mining = Text mining applied to biomedical
documents.
7. System Overview
Problem Description
A huge amount of information is stored in millions of
documents
This information can be used effectively to solve many
problems
Knowledge retrieval with little effort
Discovering relationships between different entities
Assessing relationship strength between different entities
Grouping entities into different clusters
8. System Overview
Motivation:
Build a semantic structure of documents that
facilitates navigation through thousands of
documents.
Extract relationships between biomedical terms using
text mining techniques with the aid of biomedical
ontologies.
Use text mining to group genes into different clusters.
9. System Overview
Challenges:
Concept Recognition
Build semantic structure of annotated documents using
ontologies
Relationship Recognition
Similarity (distance) between different entities.
10. Overall System Components
Framework
Searching and Browsing
Swanson’s Algorithm
PPI
Gene Clustering
15. System Framework
Objective:
Use ontologies to mark up biomedical text documents.
Based on the established semantic links between documents
and ontology concepts, the goal is to build a semantic
representation of the information.
Provide services to other applications and users.
18. Framework Concept Issues
[Figure: framework pipeline. Box labels: User, Query Expansion, Expanded Query, Query Fetching, Search PubMed, Documents, Gene Documents, Ontology, Extract GO terms, Annotate PubMed documents, Annotated Documents, Structure Representation of documents]
19. System Framework
PubMed:
Largest document source in the biomedical field
Contains over 18 million documents
Maintained by the United States National Library
of Medicine (NLM)
Indexes all documents by MeSH terms to facilitate
searching and retrieval
20. System Framework
Gene Ontology:
The Gene Ontology project is a major
bioinformatics initiative with the aim of
standardizing the representation of gene and gene
product attributes across species and databases
Includes a controlled vocabulary of terms for
describing gene product characteristics.
Consists of three main categories
Cellular component
Biological process
Molecular function
21. System Framework
MeSH database:
Comprehensive controlled vocabulary for the purpose of indexing journal articles and
books in the life sciences; it can also serve as a thesaurus that facilitates searching
[Wikipedia]
MeSH main headings:
Anatomy
Organisms
Diseases
Chemicals and Drugs
Analytical, Diagnostic and Therapeutic Techniques and Equipment
Psychiatry and Psychology
Phenomena and Processes
Disciplines and Occupations
Anthropology, Education, Sociology and Social Phenomena
Technology, Industry, Agriculture
Humanities
Information Science
Named Groups
Health Care
Publication Characteristics
Geographical locations
22. System Framework
Query Expansion (QE): the process of reformulating
a seed query to improve retrieval performance in
information retrieval operations [Wikipedia]
How?
Example
23. Query Expansion Example
[Figure: GO terms shown in the expansion example: Ocellus pigmentation, Pigmentation, Pigment metabolic process, Pigment accumulation, Cellular pigmentation]
24. System Framework
Document Annotation
Annotate documents with Gene Ontology terms, genes,
and proteins.
Represent each document by a set of terms. (How?)
25. GO extractor
● GO’s vocabulary consists of 7,841 words. The majority of the GO words
occur only once in the whole ontology, and more than 90% occur no more
than 10 times. On the other hand, 51 of the GO words occur at least 100
times in the ontology.
● Words with a very high frequency do not give much information, as they are
part of many labels in the ontology. Extracting a word with a low
frequency, however, gives a much better hint about the mentioned concept (Zipf's law).
● By the nature of GO terms, the final words are very general,
e.g. "activity", "transport".
● In addition, many GO terms are substrings of their descendant GO terms.
● The algorithm is taken from GoPubMed (2008), “GoPubMed: Ontology-based
literature search for the life sciences”.
26. GO extractor algorithm
[Flowchart. Box labels: Get last word; Compare with root; Set main root as a root; Do BFS and take each one as a root; The same word occurred at any sibling? (Y/N); Reaches leaf? (Y/N); Get next word; Get next word and do BFS, considering each one as a root]
27. GO Extractor
Example:
Abstract: “... and it is affected by the Kinase activity.”
● Start from the last word of the paragraph, “activity”.
● Starting from the root of the GO tree, search for a GO term ending with
“activity”.
● When we reach it, fetch the next word (moving backwards: “Kinase”) and continue from the new root.
● Now we look in the subtree for a GO term ending with “Kinase activity”.
● If the search reaches a leaf, we have found a GO term. Restart
with the next word, from the root.
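The backward, word-by-word matching described in this example can be sketched roughly as follows. This is a simplified illustration and not the GoPubMed algorithm itself: it assumes exact word matches only (no gaps, no word weighting), uses a tiny made-up list of labels instead of the real Gene Ontology, and the helper names build_suffix_trie and extract_terms are hypothetical.

```python
def build_suffix_trie(labels):
    """Index term labels by their words in reverse order (last word first)."""
    trie = {}
    for label in labels:
        node = trie
        for word in reversed(label.lower().split()):
            node = node.setdefault(word, {})
        node["$label"] = label  # marks the end of a complete term
    return trie


def extract_terms(text, trie):
    """Scan the text from its last word backwards and collect matching terms."""
    words = [w.strip('.,;:()"').lower() for w in text.split()]
    words = [w for w in words if w]
    found = []
    i = len(words) - 1
    while i >= 0:
        node, j = trie, i
        best, best_start = None, None
        # Walk backwards through the text while the trie still matches.
        while j >= 0 and words[j] in node:
            node = node[words[j]]
            if "$label" in node:
                best, best_start = node["$label"], j  # longest match so far
            j -= 1
        if best is not None:
            found.append(best)
            i = best_start - 1  # restart from the root with the next word
        else:
            i -= 1
    return found


# Hypothetical mini-vocabulary, for illustration only.
labels = ["kinase activity", "catalytic activity", "transport"]
trie = build_suffix_trie(labels)
terms = extract_terms("... and it is affected by the Kinase activity.", trie)
print(terms)
```

Indexing labels by their last word first mirrors the slide's observation that the final words of GO terms are very general, so each earlier word narrows the search to a smaller subtree.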
29. Framework Design Issues
Top Level Architecture of the System can be divided into:-
Data Handling Components
Information Handling Components
Information Extraction
Information Representation
Information Retrieval
31. System Framework
Framework main components:
Document Sources
Extractor
Document Annotators
Ontology Manager
System Engine
Database Manager
Cache Manager
Document
32. System Framework
Document Sources
Fetches single documents or collections of documents from
remote stores.
Extractor
Implements information extraction algorithms to extract
ontology terms from the documents.
Document Annotators
Establish semantic links between documents and ontology
concepts.
For example, linking documents with their GO terms, MeSH
terms, etc.
33. System Framework
Ontology Manager
Provides an interface to the available ontologies
Composed of sub-managers for merging ontologies such as
the Gene Ontology
System Engine
Main component of the system.
Responsible for maintaining all the operations and
communications between various components of the
system
34. System Framework
Database Manager
Implemented as a pool object (connection pool)
Handles and maintains queries to the database, such as
inserting, updating, and deleting documents
Cache Manager
Implemented as the client side of Memcached (an open source
caching project).
Handles operations on the system cache
43. Comparison: Our system, Textpresso, XplorMed, Vivismo
Ontology used
  Our system: Full Gene Ontology
  Textpresso: Only 30 categories of the ontology
  XplorMed: Top hierarchy of MeSH
  Vivismo: Derives the ontology from the search result
Output
  Our system: Uses the deep ontology to navigate through a large result set in a non-sequential order
  Textpresso: Returns a list of relevant abstracts
  XplorMed: For each MeSH category, there is an associated list
  Vivismo: Returns a list of relevant abstracts
44. IBN-SINA vs. Others
Works on
  IBN-SINA: All PubMed abstracts
  Textpresso: Designed for full papers, which are not available most of the time
  XplorMed: All PubMed abstracts
  Vivismo: All PubMed abstracts
Term extraction
  IBN-SINA: Allows gaps within matches and considers the information content of the words, which leads to more refined term extraction
  Textpresso: Tries to find the category terms directly in the text, only allowing for some variations in lower/uppercase letters and plural forms
  XplorMed: Extracts terms based on term frequency in the collected documents
  Vivismo: Extracts terms based on term frequency in the collected documents
50. Swanson Algorithm (1986)
Swanson’s method is a way of finding indirect relations between
objects.
[Figure: objects A and B, each linked to its related terms (A1, A2 and B1, B2)]
1986: “Undiscovered public knowledge”
51. Cosine Similarity
Cosine similarity is a measure of similarity between two vectors of n
dimensions by finding the cosine of the angle between them, often used to
compare documents in text mining [Wikipedia].
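Terms related to the first term (A's related terms):
A B C D E F G H
Terms related to the second term (B's related terms):
A X Y B Z D E F
Binary vector of the first term over the combined vocabulary:
A B C D E F G H X Y Z
1 1 1 1 1 1 1 1 0 0 0
Binary vector of the second term over the combined vocabulary:
A B C D E F G H X Y Z
1 1 0 1 1 1 0 0 1 1 1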
52. Cosine Similarity (Cont.)
Finally, applying cosine similarity function :-
A B C D E F G H X Y Z
1 1 1 1 1 1 1 1 0 0 0
A B C D E F G H X Y Z
1 1 0 1 1 1 0 0 1 1 1
Similarity = (1+1+0+1+1+1+0+0+0+0+0)/ (√8*√8) = 5/8 = 0.625
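The calculation above can be reproduced with a few lines of Python. This is a minimal sketch using the letters from the slide as stand-ins for related terms; the function name cosine_similarity_of_term_sets is an assumption made for illustration.

```python
import math


def cosine_similarity_of_term_sets(terms_a, terms_b):
    """Cosine similarity of two term sets encoded as binary vectors."""
    vocabulary = sorted(set(terms_a) | set(terms_b))
    vec_a = [1 if term in terms_a else 0 for term in vocabulary]
    vec_b = [1 if term in terms_b else 0 for term in vocabulary]
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_sq_a = sum(a * a for a in vec_a)
    norm_sq_b = sum(b * b for b in vec_b)
    if norm_sq_a == 0 or norm_sq_b == 0:
        return 0.0  # an empty term set is treated as sharing nothing
    return dot / math.sqrt(norm_sq_a * norm_sq_b)


related_to_first = {"A", "B", "C", "D", "E", "F", "G", "H"}
related_to_second = {"A", "X", "Y", "B", "Z", "D", "E", "F"}
similarity = cosine_similarity_of_term_sets(related_to_first, related_to_second)
print(similarity)  # 5 shared terms / (sqrt(8) * sqrt(8)) = 0.625
```

For binary vectors the dot product is simply the number of shared terms, which is why the numerator on the slide reduces to 5.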
53. Swanson example
Relation between P53 and P51
1986: “Fish oil, Raynaud’s syndrome, and
undiscovered public knowledge”
56. PPI Agenda
Problem Description
Motivation
PPI System Overview
PPI System Main Components
Dependency Parse Tree
Similarity Metrics
K-Nearest Neighbor Classifier
Evaluation of PPI
Evaluation Metrics
Results and Comparison
58. Problem Description
Due to the ever-growing number of publications about
protein-protein interactions, information extraction from
text is increasingly recognized as one of the crucial
technologies in bioinformatics.
Reference:
Gunes Erkan, Arzucan Ozgur, Dragomir R. Radev. Semi-Supervised Classification
for Extracting Protein Interaction Sentences using Dependency Parsing.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning, pp. 228-237,
Prague, June 2007
60. Motivation
The interactions between proteins are important for
many, if not all, biological functions.
The function of a protein can be characterized more
precisely through knowledge of PPI.
Information about these interactions improves our
understanding of diseases and can provide the basis
for new therapeutic approaches.
Validate experimental results and test benches.
62. System Overview
We worked at the sentence level (Why?)
It increases the semantic information understood from the sentence.
Analyzing the structure of the sentence increases the knowledge
obtained from it.
A specific relation between proteins can be deduced from
it.
64. System Overview
Our approach depends on:
The shortest path between the entities in dependency
tree of a sentence usually captures the necessary
information to identify their relationship.
67. Dependency Parse Tree
• Unlike a syntactic parse, it captures the semantic
predicate-argument relationships among the words of a sentence.
• We used the Stanford Parser API for the natural language
processing task.
• The shortest path is found using Breadth First Search
(BFS): since every edge has equal weight, the shortest
path is the first one discovered.
68. Dependency Parse Tree (Example)
The dependency tree of the sentence “The results demonstrated
that KaiC interacts rhythmically with KaiA, KaiB, and SasA.”
69. Example (Cont.)
• Then, we select the shortest paths between the
protein pairs:
• KaiC - nsubj - interacts - prep_with - SasA
• KaiC - nsubj - interacts - prep_with - SasA - conj_and - KaiA
• KaiC - nsubj - interacts - prep_with - SasA - conj_and - KaiB
• SasA - conj_and - KaiA
• SasA - conj_and - KaiB
• KaiA - conj_and - SasA - conj_and - KaiB
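As a rough illustration of how such paths can be found, the sketch below runs BFS over a hand-written edge list that mirrors the paths listed on this slide. The edge list is an assumption reconstructed from those paths rather than actual Stanford Parser output, and the function name shortest_path is hypothetical.

```python
from collections import deque

# (head, relation, dependent) triples, reconstructed by hand from the slide.
EDGES = [
    ("interacts", "nsubj", "KaiC"),
    ("interacts", "prep_with", "SasA"),
    ("SasA", "conj_and", "KaiA"),
    ("SasA", "conj_and", "KaiB"),
]


def shortest_path(edges, start, goal):
    """BFS over the dependency tree, treating every edge as undirected."""
    graph = {}
    for head, relation, dependent in edges:
        graph.setdefault(head, []).append((relation, dependent))
        graph.setdefault(dependent, []).append((relation, head))
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()  # alternates word, relation, word, ...
        node = path[-1]
        if node == goal:
            return path
        for relation, neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [relation, neighbor])
    return None  # the two words are not connected


path = shortest_path(EDGES, "KaiC", "KaiA")
print(" - ".join(path))
```

Because all edges have the same weight, the first path that reaches the goal is guaranteed to be a shortest one, so no priority queue is needed.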
70. Example (Cont.)
• Then, we rename the proteins in the pair as PROTX1
and PROTX2, and all the other proteins in the sentence
as PROTX0:
• PROTX1 - nsubj - interacts - prep_with - PROTX2
• PROTX1 - nsubj - interacts - prep_with - PROTX0 - conj_and - PROTX2
• PROTX1 - nsubj - interacts - prep_with - PROTX0 - conj_and - PROTX2
• PROTX1 - conj_and - PROTX2
• PROTX1 - conj_and - PROTX2
• PROTX1 - conj_and - PROTX0 - conj_and - PROTX2
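The renaming step can be sketched as a simple token substitution. The function name anonymize_path and the representation of a path as a list of tokens are assumptions made for this illustration.

```python
def anonymize_path(path, pair, proteins):
    """Replace the pair's proteins with PROTX1/PROTX2 and all others with PROTX0."""
    first, second = pair
    renamed = []
    for token in path:
        if token == first:
            renamed.append("PROTX1")
        elif token == second:
            renamed.append("PROTX2")
        elif token in proteins:
            renamed.append("PROTX0")
        else:
            renamed.append(token)  # dependency relations and other words stay as-is
    return renamed


proteins = {"KaiA", "KaiB", "KaiC", "SasA"}
path = ["KaiC", "nsubj", "interacts", "prep_with", "SasA", "conj_and", "KaiA"]
renamed = anonymize_path(path, ("KaiC", "KaiA"), proteins)
print(" - ".join(renamed))
```

Renaming makes paths from different sentences comparable, since the similarity metrics then see the sentence structure rather than the specific protein names.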
73. Similarity Metrics
The main idea of using similarity metrics is to
find a function that maps input patterns into a
target space such that a simple distance in the
target space approximates the “semantic”
distance in the input space.
74. Similarity Metrics
We implemented Levenshtein distance (edit
distance):
the minimum number of insertions, deletions, and substitutions
needed to transform one string into another.
We also used an open source library called
“SimMetrics”, a Java library of 23 string similarity
metrics.
• Developed at the University of Sheffield (Chapman,
2004)
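For reference, the Levenshtein distance mentioned above can be computed with the standard dynamic programming recurrence. The sketch below is a generic textbook version, not the authors' implementation; it works on any two sequences, so it can compare strings character by character or dependency paths token by token (which of the two the system used is not stated in the slides).

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            substitution_cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,                      # delete item_a
                current[j - 1] + 1,                   # insert item_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))

direct = ["PROTX1", "nsubj", "interacts", "prep_with", "PROTX2"]
via_other = ["PROTX1", "nsubj", "interacts", "prep_with", "PROTX0", "conj_and", "PROTX2"]
print(levenshtein(direct, via_other))
```

Keeping only two rows of the table at a time uses memory proportional to the shorter dimension instead of the full grid.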
75. Similarity Metrics
• We used only 10 string similarities from SimMetrics.
• Cosine Similarity
• Block Distance
• Dice Similarity
• Euclidean Distance
• Jaccard Similarity
• Jaro Similarity
• Jaro Winkler Similarity
• Matching Coefficient
• Monge Elkan Similarity
78. K-Nearest Neighbor Classifier
• k-nearest neighbor: assign a label according to the
majority label of the k nearest-neighbor training
patterns.
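A minimal sketch of this majority-vote rule is shown below. The training points, the labels, and the choice of Euclidean distance are made up for illustration; in the PPI system the distance would instead come from one of the string similarity metrics applied to dependency paths.

```python
import math
from collections import Counter


def knn_classify(query, training, k, distance):
    """Label the query with the majority label among its k nearest training items."""
    neighbors = sorted(training, key=lambda item: distance(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy 2-D data: two nearby triangles, three squares slightly farther away.
training = [
    ((1.0, 0.0), "triangle"),
    ((0.0, 1.0), "triangle"),
    ((2.0, 0.0), "square"),
    ((0.0, 2.0), "square"),
    ((2.0, 2.0), "square"),
]
query = (0.0, 0.0)
label_k3 = knn_classify(query, training, k=3, distance=math.dist)
label_k5 = knn_classify(query, training, k=5, distance=math.dist)
print(label_k3, label_k5)
```

The same query receives different labels for k = 3 and k = 5, which is why k is treated as a tunable parameter.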
79. KNN Example
• If k = 3, it is classified as
a triangle
• k = 5, it is classified as a
square
80. KNN Strengths and Weaknesses
• Strengths:
• Simple to implement and use
• Comprehensible – easy to explain prediction
• Robust to noisy data by averaging k-nearest neighbors
81. KNN Strengths and Weaknesses
• Weaknesses:
• Need a lot of space to store all examples.
• Takes more time to classify a new example than with a
model (need to calculate and compare distance from new
example to all other examples).
84. Evaluation of PPI
• We used five different datasets:
• BioInfer dataset.
• AIMed dataset.
• LLL dataset.
• IEPA dataset.
• HPRD50 dataset.
• We used a KNN classifier, varying k and the similarity
metric as parameters.
93. Results and Comparison
Dataset  | Min. Result | Max. Result
BioInfer | 32          | 56.9
AIMed    | 5           | 48.9
LLL      | 48.8        | 73
IEPA     | 36.6        | 72
HPRD50   | 12.9        | 63.49
94. Our PPI System vs. Graph Kernel Approach
Dataset  | Our System (%) | Graph Kernel Approach (%)
BioInfer | 56.9           | 52.9
AIMed    | 48.9           | 56.4
LLL      | 73             | 76.8
IEPA     | 72             | 75.1
HPRD50   | 67             | 63.4
110. Swanson Algorithm
Search PubMed for gene A and extract set A (the
most related keywords: MeSH or GO terms).
Search PubMed for gene B and extract set B (the most
related keywords: MeSH or GO terms).
Based on the intersection between set A and set B, we
apply the cosine similarity.
111. Document Occurrences
Search PubMed for gene A and extract set A
(document IDs of gene A).
Search PubMed for gene B and extract set B
(document IDs of gene B).
Based on the intersection between set A and set B, we
apply the Jaccard Similarity Coefficient.
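The Jaccard similarity coefficient is the size of the intersection divided by the size of the union. A minimal sketch, assuming the document IDs for each gene have already been fetched (the IDs below are made-up placeholders, not real PubMed identifiers):

```python
def jaccard_similarity(set_a, set_b):
    """|A ∩ B| / |A ∪ B|, defined here as 0.0 when both sets are empty."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


docs_gene_a = {101, 102, 103, 104}  # hypothetical document IDs for gene A
docs_gene_b = {103, 104, 105}       # hypothetical document IDs for gene B
score = jaccard_similarity(docs_gene_a, docs_gene_b)
print(score)  # 2 shared documents out of 5 distinct documents
```

Two genes that are always mentioned in the same documents score 1.0, and genes that never co-occur score 0.0.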
113. Extended Work: PPI System with
SVM Classifier (1)
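Equation:
$u = \mathbf{w} \cdot \mathbf{x} - b$
Objective:
$\min_{\mathbf{w}, b} \ \frac{1}{2} \lVert \mathbf{w} \rVert^2$
subject to
$y_i (\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1, \quad \forall i$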
114. Extended Work: PPI System with
SVM Classifier (2)
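$\min_{\alpha} \Psi(\alpha) = \min_{\alpha} \ \frac{1}{2} \sum_i \sum_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j) \alpha_i \alpha_j - \sum_i \alpha_i$
The $\alpha_i$ are called Lagrange multipliers; once we have $\alpha$, we can recover $(\mathbf{w}, b)$:
$\mathbf{w} = \sum_i y_i \alpha_i \mathbf{x}_i, \quad b = \mathbf{w} \cdot \mathbf{x}_k - y_k$ for some $\alpha_k > 0$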
116. Conclusion
Problem 1: Algorithms for concept recognition in
document abstracts and titles
We introduced an algorithm to annotate Gene Ontology
terms in the documents.
Problem 2: Use the annotated documents to build a
structured representation of documents
We showed how the framework uses the Gene Ontology to build a
semantic representation of the obtained documents.
Problem 3: Design a system for ontology-based search
engines for biological researchers
We introduced the design of the framework and showed how it is flexible
for future modifications and scalable with respect to the number
of documents and the number of users.
117. Conclusion
Problem 4: Using Swanson’s algorithm to assess the similarity between
different biological terms
We showed how Swanson's algorithm can be used to estimate the
similarity between two instances (P53 and P21).
Problem 5: Supervised machine learning algorithms for prediction of
protein-protein interactions
We showed how we used supervised machine learning algorithms such
as KNN, together with a new technique for estimating the distance between sentences, in
order to predict the possible interactions between proteins mentioned in
the documents.
Problem 6: Unsupervised machine learning algorithms to identify
different clusters of genes
We showed how we used unsupervised machine learning algorithms
such as DBSCAN, with similarity based on Swanson's algorithm and
cosine similarity, in order to group genes mentioned in the documents into
different clusters.
118. Future work
There are hot research areas and open problems
in biological text mining
The Content Provider for Documents
Google Scholar
Using Semantic Web 3.0 (online journals)
Ontology Generation
Ability to edit the ontologies and add knowledge
Other Ontologies
Using Wikipedia as an ontology
119. Future work
There are some features that may be added to the
system
Biomedical Ontology-based Search Engine
Provide a document summary for each group of documents
Allow the user to save and print the results obtained by the system.
Protein-Protein Interaction (PPI)
Use more sophisticated classifiers and machine learning techniques
such as AdaBoost to enhance the classification process.
Use background knowledge of verbs, as many verbs have the
same meaning.
This will help the system produce more accurate results, as we can
introduce a fuzzy distance for the differences between the meanings
of verbs. It will also make it possible to discover the type of
relation between terms, giving more semantic relation
identification.
120. Future work
• There are some features that may be added to
the system
Gene Clustering
Use more sophisticated clustering algorithms that were originally
designed for gene clustering.
More Applications:
Based on the services provided by the ontology-based
engine, we can construct applications such as
extracting the relations between drugs and diseases,
grouping diseases into different clusters, which helps
to identify the characteristics of a newly discovered disease,
and other applications that rely on text mining of
biomedical documents.