Ibn Sina

Please turn off your mobile phones or put
them on silent mode

Biological Relation Extraction Tools Using Biomedical
Ontologies and Text Mining

Agenda
 Introduction to Biomedical Text Mining
 System Overview
 Problem Description
 Motivation
 Challenges
 System Framework
 Applications Built on the System Framework
 Swanson’s Algorithm
 Protein to Protein Interactions (PPI)
 Gene Clustering based on Text Mining
 Extended Work
 Conclusion and Future Work.

Introduction to Biomedical Text Mining
 Text Mining = processing unstructured (textual)
information to extract meaningful data and make the
information contained in the text accessible to
various data mining (statistical and machine learning)
algorithms.
 Biomedical Text Mining = text mining applied to biomedical
documents.

System Overview
 Problem Description
 Huge amounts of information are stored in millions of
documents
 This information can be used effectively to solve many
problems
 Knowledge retrieval with little effort
 Discovering relationships between different entities
 Assessing relationship strength between different entities
 Grouping entities into different clusters

System Overview
 Motivation:
 Build semantic structure of documents which
facilitates navigation through thousands of
documents.
 Extract relationships between biomedical terms using
text mining techniques with aid of biomedical
ontologies.
 Using text mining to group genes into different clusters.

System Overview
 Challenges:
 Concept Recognition
 Build semantic structure of annotated documents using
ontologies
 Relationship Recognition
 Similarity (distance) between different entities.

Overall System Components
 Framework
 Searching and Browsing
 Swanson’s Algorithm
 PPI
 Gene Clustering

Overall System Architecture

[Architecture diagram: Searching & Browsing, Gene Clustering, Swanson’s Algorithm and PPI, all sitting on top of the Framework]

System Framework Agenda
 Objective
 Framework Concept Issues
 Framework Design Issues
 Framework Sequence Diagram
 Framework Database
 Framework GUI
 Comparison
 Framework Demo

System Framework
 Objective:
 Use ontologies to markup biomedical text documents.
 Based on established semantic links between documents
and ontology concepts, the goal is to build a semantic
representation of information.
 Provide services to other applications and users.

System Framework
[Diagram: the Framework is discussed under two headings: Concept Issues and Design Issues]

Framework Concept Issues
[Flow diagram, boxes in reading order: User Query → Query Expansion → Expanded Query → Search PubMed → Fetching Documents → Documents → Extract GO terms (using the Gene Ontology) → Annotate PubMed documents → Annotated Documents → Structure Representation of documents]

System Framework
 PubMed:
 Largest document source in the biomedical field
 Contains over 18 million documents
 Maintained by the United States National Library
of Medicine (NLM)
 Indexes all documents by MeSH terms to facilitate
searching and retrieval

System Framework
 Gene Ontology:
 The Gene Ontology project is a major
bioinformatics initiative with the aim of
standardizing the representation of gene and gene
product attributes across species and databases
 Includes a controlled vocabulary of terms for
describing gene product characteristics.
 Consists of three main categories
 Cellular component
 Biological process
 Molecular function

System Framework
 MeSH database:
 Comprehensive controlled vocabulary for the purpose of indexing journal articles and
books in the life sciences; it can also serve as a thesaurus that facilitates searching
[Wikipedia]
 MeSH main headings:
 Anatomy
 Organisms
 Diseases
 Chemicals and Drugs
 Analytical, Diagnostic and Therapeutic Techniques and Equipment
 Psychiatry and Psychology
 Phenomena and Processes
 Disciplines and Occupations
 Anthropology, Education, Sociology and Social Phenomena
 Technology, Industry, Agriculture
 Humanities
 Information Science
 Named Groups
 Health Care
 Publication Characteristics
 Geographical locations

System Framework
 Query Expansion (QE): the process of reformulating
a seed query to improve retrieval performance in
information retrieval operations [Wikipedia]
 How ?
 Example

Query Expansion Example

[Diagram: the query term “Ocellus pigmentation” shown with related Gene Ontology terms: Pigmentation, Pigment metabolic process, Pigment accumulation, Cellular pigmentation]

System Framework
 Documents Annotating
 Annotate documents with Gene Ontology Terms, Genes
and proteins.
 Represent each document by a set of terms. (How?)

GO extractor
● GO’s vocabulary consists of 7,841 words. The majority of the GO words
occur only once in the whole ontology, and more than 90% occur no more
than 10 times. On the other hand, 51 of the GO words occur at least
100 times in the ontology.

● Words with a very high frequency do not give much information, as they are
part of many labels in the ontology. Extracting a word with a low
frequency, however, gives a much better hint about the mentioned concept (Zipf's law).

● By the nature of GO terms, the words at the end are very general,
e.g. “activity”, “transport”.
● Besides, many GO terms are substrings of their descendant GO terms.

● The algorithm is taken from GoPubMed (2008): “GoPubMed: Ontology-based
literature search for the life sciences”.
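
The point about word frequency can be illustrated with a small sketch. It scores each word of a set of term labels by -log2 of its relative frequency, so rare words score higher than words such as “activity”. The scoring formula and the four toy labels are assumptions made for illustration; the slides do not state how GoPubMed or this system weights words.

```python
import math
from collections import Counter


def word_information(terms):
    """Score each word by -log2(relative frequency) over all term labels."""
    counts = Counter(word for term in terms for word in term.lower().split())
    total = sum(counts.values())
    return {word: -math.log2(count / total) for word, count in counts.items()}


# Hypothetical toy labels, not the real Gene Ontology vocabulary.
TOY_TERMS = [
    "kinase activity",
    "transporter activity",
    "catalytic activity",
    "pigment accumulation",
]

INFO = word_information(TOY_TERMS)
for word, score in sorted(INFO.items(), key=lambda item: item[1]):
    print(f"{word:15s} {score:.3f}")
```

In this toy vocabulary “activity” appears in three of the four labels and gets the lowest score, while a word such as “kinase” appears once and gets the highest.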

GO extractor algorithm
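[Flowchart of the GO extractor algorithm. Recoverable node labels: “Get last word”; “Compare with root”; “Set main root as a root”; “The same word occurred at any sibling”; “Do BFS and take each one as a root”; “Reaches leaf”; “Get next word”; “Get next word & do BFS and consider each one as a root”. The Y/N branch targets could not be recovered.]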

Go Extractor
Example:
Abstract: “... and it's effected by the Kinase activity”.

● Start from the last word of the paragraph, “activity”.
● Starting from the root of the GO tree, search for a GO term ending with
“activity”.
● When we reach it, fetch the next word and start from the new root.
● Now we are looking in the subtree for a GO term ending with “Kinase activity”.
● If the search reaches a leaf, we have found a GO term. Restart from the
root with the next word.
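
The walk-through above can be approximated by a right-to-left match against a trie that stores each term's words in reverse order. This is a simplified sketch, not the GoPubMed algorithm: it does no gap handling or frequency weighting, and `GO_TERMS` is a hypothetical three-entry vocabulary rather than the real ontology.

```python
def build_reverse_trie(terms):
    """Index each term by its words in reverse order (last word first)."""
    root = {}
    for term in terms:
        node = root
        for word in reversed(term.lower().split()):
            node = node.setdefault(word, {})
        node["$term"] = term  # marks the end of a complete term
    return root


def extract_terms(text, trie):
    """Scan the text from its last word backwards, keeping the longest match."""
    words = [w.strip('.,;:"()').lower() for w in text.split()]
    found = []
    i = len(words) - 1
    while i >= 0:
        node, j, best, best_start = trie, i, None, i
        # Descend while the preceding words keep extending a known term.
        while j >= 0 and words[j] in node:
            node = node[words[j]]
            if "$term" in node:
                best, best_start = node["$term"], j
            j -= 1
        if best is not None:
            found.append(best)
            i = best_start - 1  # restart from the root with the next word
        else:
            i -= 1
    return found


# Hypothetical toy vocabulary, not the real Gene Ontology.
GO_TERMS = ["kinase activity", "protein kinase activity", "transport"]
TRIE = build_reverse_trie(GO_TERMS)
print(extract_terms("It is affected by the kinase activity.", TRIE))
```

Because the scan runs backwards, terms are reported in reverse reading order, and a longer term such as “protein kinase activity” wins over its suffix “kinase activity”.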

Framework Design Issues
 Top Level Architecture of the System can be divided into:-
 Data Handling Components
 Information Handling Components
 Information Extraction
 Information Representation
 Information Retrieval

System Framework
 Framework main components:
 Document Sources
 Extractor
 Document Annotators
 Ontology Manager
 System Engine
 Database Manager
 Cache Manager
 Document

System Framework
 Document Sources
 Fetch single documents or collections of documents from
remote stores.
 Extractor
 Implements Information Extraction algorithms to extract
ontology terms from the documents
 Document Annotators
 Establish semantic links between documents and ontology
concepts.
 For example, linking documents with their GO terms, MeSH
terms, etc.

System Framework
 Ontology Manager
 Provides an interface around the ontologies
 Composed of sub-managers to merge ontologies such as
the Gene Ontology
 System Engine
 Main component of the system.
 Responsible for maintaining all the operations and
communications between various components of the
system

System Framework
 Database Manager
 Implemented as a pool object (connection pool)
 Handles and maintains queries to the database, such
as inserting, updating and deleting documents
 Cache Manager
 Implemented as the client side of Memcached (an open source
caching project).
 Handles operations on the system cache

Framework GUI
 GUI Goals
 User friendly
 Consistency
 Model-View-Controller (MVC)
 Human-Computer Interaction concepts
 Usability
 Specific Application services satisfaction
 Standard Data Exchange
 Internationalization

Our system vs. Textpresso, XplorMed and Vivisimo

 Ontology used
 Our system: Full Gene Ontology
 Textpresso: Only 30 categories of the ontology
 XplorMed: Top hierarchy of MeSH
 Vivisimo: Derives the ontology from the search result
 Output
 Our system: Uses the deep ontology to navigate through a large result set in a non-sequential order
 Textpresso: Returns a list of relevant abstracts
 XplorMed: For each MeSH category, there is an associated list
 Vivisimo: Returns a list of relevant abstracts

IBN-SINA vs. Others
 Works on
 IBN-SINA: All PubMed abstracts
 Textpresso: Designed for full papers, which are not available most of the time
 XplorMed: All PubMed abstracts
 Vivisimo: All PubMed abstracts
 Term extraction
 IBN-SINA: Allows gaps within matches and considers the information content of the words, which leads to more refined term extraction
 Textpresso: Tries to find the category terms directly in the text, only allowing for some variations in lower/uppercase letters and plural forms
 XplorMed: Extracts terms based on term frequency in the collected documents
 Vivisimo: Extracts terms based on term frequency in the collected documents

Framework Demo

DEMO

Swanson Algorithm(1986)
Swanson’s method is a way of finding indirect relations between
objects.

[Diagram: term A with its related terms A1 and A2; term B with its related terms B1 and B2]

1986: “Undiscovered public knowledge”

Cosine Similarity
Cosine similarity is a measure of similarity between two vectors of n
dimensions by finding the cosine of the angle between them, often used to
compare documents in text mining [Wikipedia].

Terms related to the first term (A’s related terms):

A B C D E F G H

Terms related to the second term (B’s related terms):

A X Y B Z D E F

Binary vectors over the union of both term sets:

              A B C D E F G H X Y Z
First term:   1 1 1 1 1 1 1 1 0 0 0
Second term:  1 1 0 1 1 1 0 0 1 1 1

Cosine Similarity (Cont.)
Finally, applying cosine similarity function :-

1 1 1 1 1 1 1 1 0 0 0

1 1 0 1 1 1 0 0 1 1 1

Similarity = (1+1+0+1+1+1+0+0+0+0+0)/ (√8*√8) = 5/8 = 0.625
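
The calculation above can be checked with a few lines of Python. This is a minimal sketch that assumes each term set is turned into a 0/1 vector over the union of both sets, as in the example; the function name `cosine_similarity` is chosen here for illustration.

```python
import math


def cosine_similarity(terms_a, terms_b):
    """Cosine of the angle between the 0/1 vectors of two term sets."""
    vocabulary = sorted(set(terms_a) | set(terms_b))
    vec_a = [1 if term in terms_a else 0 for term in vocabulary]
    vec_b = [1 if term in terms_b else 0 for term in vocabulary]
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return dot / norm if norm else 0.0


# The two related-term sets from the example.
FIRST = {"A", "B", "C", "D", "E", "F", "G", "H"}
SECOND = {"A", "X", "Y", "B", "Z", "D", "E", "F"}

SIMILARITY = cosine_similarity(FIRST, SECOND)
print(round(SIMILARITY, 3))
```

The two sets share five terms and each has eight, so the result matches the 5/8 obtained by hand.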

Swanson example
Relation between P53 and P51

1986: “Fish oil, Raynaud’s syndrome, and
undiscovered public knowledge”

PPI Agenda
 Problem Description
 Motivation
 PPI System Overview
 PPI System Main Components
 Dependency Parse Tree
 Similarity Metrics
 K-Nearest Neighbor Classifier
 Evaluation of PPI
 Evaluation Metrics
 Results and Comparison

Problem Description
 Due to the ever-growing number of publications about
protein-protein interactions, information extraction from
text is increasingly recognized as one of the crucial
technologies in bioinformatics

Reference:
Gunes Erkan, Arzucan Ozgur, Dragomir R. Radev. Semi-Supervised Classification
for Extracting Protein Interaction Sentences using Dependency Parsing.
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning, pp. 228-237,
Prague, June 2007

Motivation
 The interactions between proteins are important for
many, if not all, biological functions.
 The function of a protein can be characterized more
precisely through knowledge of PPI.
 Information about these interactions improves our
understanding of diseases and can provide the basis
for new therapeutic approaches.
 Validate experimental results and test benches.

System Overview
 We worked at the sentence level (why?)
 It increases the semantic information obtained from the sentence.
 The syntactic structure of the sentence adds to the knowledge
obtained from it.
 Specific relations between proteins can be deduced from
it.

System Overview
 Our approach depends on:
The shortest path between the entities in dependency
tree of a sentence usually captures the necessary
information to identify their relationship.

Dependency Parse Tree
• Unlike a syntactic parse, it captures the semantic
predicate-argument relationships among the words of a sentence.
 The Stanford Parser API is used for the natural language
processing task.
 The shortest path is found using Breadth First Search
(BFS): since every edge has equal weight, the first path
discovered is the shortest one.

Dependency Parse Tree (Example)
 The dependency tree of the sentence “The results demonstrated
that KaiC interacts rhythmically with KaiA, KaiB, and SasA.”

Example (Cont.)
• Then, we select the shortest paths between the
protein pairs:
• KaiC - nsubj - interacts - prep_with - SasA
• KaiC - nsubj - interacts - prep_with - SasA - conj_and - KaiA
• KaiC - nsubj - interacts - prep_with - SasA - conj_and - KaiB
• SasA - conj_and - KaiA
• SasA - conj_and - KaiB
• KaiA - conj_and - SasA - conj_and - KaiB
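
The path selection can be sketched with a plain breadth-first search. The edge list below is typed in by hand from the example sentence and only covers the protein-related part of the tree; no parser is invoked, and the names `EDGES` and `shortest_path` are illustrative rather than taken from the system.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """BFS over an undirected view of the dependency tree.

    Returns the path as alternating words and dependency labels,
    or None if the two words are not connected.
    """
    graph = {}
    for head, dependent, label in edges:
        graph.setdefault(head, []).append((dependent, label))
        graph.setdefault(dependent, []).append((head, label))
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour, label in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [label, neighbour])
    return None


# Hand-written (head, dependent, label) edges for the example sentence.
EDGES = [
    ("interacts", "KaiC", "nsubj"),
    ("interacts", "SasA", "prep_with"),
    ("SasA", "KaiA", "conj_and"),
    ("SasA", "KaiB", "conj_and"),
]

print(" - ".join(shortest_path(EDGES, "KaiC", "KaiA")))
```

Since every edge counts as one step, the first time BFS reaches the goal it has found a path with the fewest edges.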

Example (Cont.)
• Then, we rename the proteins in the pair as PROTX1
and PROTX2, and all the other proteins in the sentence
as PROTX0:
• PROTX1 - nsubj - interacts - prep_with - PROTX2
• PROTX1 - nsubj - interacts - prep_with - PROTX0 - conj_and - PROTX2
• PROTX1 - nsubj - interacts - prep_with - PROTX0 - conj_and - PROTX2
• PROTX1 - conj_and - PROTX2
• PROTX1 - conj_and - PROTX2
• PROTX1 - conj_and - PROTX0 - conj_and - PROTX2
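
The renaming step is a simple token substitution. The sketch below assumes a path is a list of tokens (words and dependency labels) and that the set of protein names in the sentence is already known; `anonymize_path` is a name chosen for this illustration.

```python
def anonymize_path(path, protein_1, protein_2, all_proteins):
    """Replace the pair under test with PROTX1/PROTX2 and other proteins with PROTX0."""
    renamed = []
    for token in path:
        if token == protein_1:
            renamed.append("PROTX1")
        elif token == protein_2:
            renamed.append("PROTX2")
        elif token in all_proteins:
            renamed.append("PROTX0")
        else:
            renamed.append(token)
    return renamed


PROTEINS = {"KaiA", "KaiB", "KaiC", "SasA"}
PATH = ["KaiC", "nsubj", "interacts", "prep_with", "SasA", "conj_and", "KaiA"]
print(" - ".join(anonymize_path(PATH, "KaiC", "KaiA", PROTEINS)))
```

After renaming, paths that differ only in the protein names become identical, which is why two of the bullets above repeat.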

Similarity Metrics
 The main idea of using similarity metrics is to
find a function that maps input patterns into a
target space such that a simple distance in the
target space approximates the “semantic”
distance in the input space.

Similarity Metrics
 We implemented Levenshtein distance (Edit
Distance).
 The number of insertions, deletions and substitutions
needed to transform one string into another.
 We also used an open source library called
“SimMetrics” – Java library of 23 string similarity
metrics.
• Developed at the University of Sheffield (Chapman,
2004)
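
For reference, a minimal sketch of the standard Levenshtein dynamic program (insertions, deletions and substitutions, each with cost 1). It is not the system's own implementation. Because it only compares items for equality, it works on strings as well as on lists of path tokens; applying it to token lists is an assumption made here for illustration.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(target) + 1))
    for i, s_item in enumerate(source, start=1):
        current = [i]
        for j, t_item in enumerate(target, start=1):
            cost = 0 if s_item == t_item else 1
            current.append(min(
                previous[j] + 1,         # delete s_item
                current[j - 1] + 1,      # insert t_item
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))

# Hypothetical dependency paths compared token by token.
PATH_A = ["PROTX1", "nsubj", "interacts", "prep_with", "PROTX2"]
PATH_B = ["PROTX1", "nsubj", "binds", "prep_to", "PROTX2"]
print(levenshtein(PATH_A, PATH_B))
```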

Similarity Metrics
• We used only 10 string similarities from SimMetrics.
• Cosine Similarity
• Block Distance
• Dice Similarity
• Euclidean Distance
• Jaccard Similarity
• Jaro Similarity
• Jaro Winkler Similarity
• Matching Coefficient
• Monge Elkan Similarity

K-Nearest Neighbor Classifier
• k-nearest neighbor: assign a label according to the
majority label of the k nearest-neighbor training
patterns.

KNN Example
• If k = 3, it is classified as
a triangle

• k = 5, it is classified as a
square
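
The triangle/square example can be reproduced with a small majority-vote classifier. The coordinates below are made up so that the vote flips between k = 3 and k = 5; the original figure's points are not recoverable. The distance function is a parameter, so a string or path distance could be plugged in instead of the Euclidean distance assumed here.

```python
import math
from collections import Counter


def knn_classify(query, training, k, distance):
    """Label the query by majority vote among its k nearest training items."""
    neighbours = sorted(training, key=lambda item: distance(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D points: two triangles close by, three squares further out.
TRAINING = [
    ((1.0, 0.0), "triangle"),
    ((0.0, 1.2), "triangle"),
    ((1.1, 0.0), "square"),
    ((0.0, 2.0), "square"),
    ((2.0, 0.0), "square"),
]
QUERY = (0.0, 0.0)

print(knn_classify(QUERY, TRAINING, 3, math.dist))
print(knn_classify(QUERY, TRAINING, 5, math.dist))
```

With k = 3 the neighbours are two triangles and one square; with k = 5 all three squares are included and outvote the triangles.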

KNN Strengths and Weaknesses
• Strengths:
• Simple to implement and use
• Comprehensible – easy to explain prediction
• Robust to noisy data by averaging k-nearest neighbors

KNN Strengths and Weaknesses
• Weaknesses:
• Need a lot of space to store all examples.
• Takes more time to classify a new example than with a
model (need to calculate and compare distance from new
example to all other examples).

Evaluation of PPI
• We used five different datasets:
• BioInfer dataset.
• AIMed dataset.
• LLL dataset.
• IEPA dataset.
• HPRD50 dataset.
• We used a KNN classifier, varying k and the similarity
metric as parameters.

Evaluation Metrics
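• Precision: P = TP / (TP + FP)

• Recall: R = TP / (TP + FN)

• F-measure: F = 2 · P · R / (P + R)

(TP, FP and FN are the numbers of true positives, false positives and false negatives.)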

Results and Comparison
Dataset    Min. Result   Max. Result
BioInfer   32            56.9
AIMed      5             48.9
LLL        48.8          73
IEPA       36.6          72
HPRD50     12.9          63.49

Our PPI System vs. Graph Kernel Approach
Dataset    Our System (%)   Graph Kernel Approach (%)
BioInfer   56.9             52.9
AIMed      48.9             56.4
LLL        73               76.8
IEPA       72               75.1
HPRD50     67               63.4

Motivation
 Goal:
 Grouping genes according to some features.

 Challenges:
 Large number of genes.
 The complexity of biological networks.

Motivation
 The solution is :

 Gene Clustering

Gene Clustering Techniques
 Based on Gene Expression:
 Advantages:
 High accuracy.

 Disadvantages:
 High cost.
 Time consuming.
 Noise.

Gene Clustering Techniques
 Based on Text Mining:
 Advantages:
 Low cost.
 Less time consuming.

 Disadvantages:
 Low accuracy.

Gene Clustering Based on Text
Mining
 To perform Gene Clustering we need :
 Clustering Algorithms .
 Similarity Measurements .

Clustering Algorithms
 Hierarchical Algorithms .

 Partitioning Algorithms .

 Density-Based Algorithms .

Hierarchical Algorithms
 Single Linkage

Partitioning Algorithms
 K-Medoids

Density-Based Algorithms
 DBScan

Graph-Theoretic Algorithms
 Zahn Algorithm

Similarity Measurements
 Swanson Algorithm .

 Document Occurrences .

Swanson Algorithm
 Search PubMed for gene A and extract set A (the
most related keywords: MeSH or GO terms).
 Search PubMed for gene B and extract set B (the most
related keywords: MeSH or GO terms).
 Based on the intersection between set A and set B, we
compute the cosine similarity.

Document Occurrences
 Search PubMed for gene A and extract set A
(document IDs for gene A).
 Search PubMed for gene B and extract set B
(document IDs for gene B).
 Based on the intersection between set A and set B, we
compute the Jaccard similarity coefficient.
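
A minimal sketch of the Jaccard coefficient on two sets of document IDs: the size of the intersection divided by the size of the union. The IDs below are invented for illustration; in the system they would come from PubMed searches for the two genes.

```python
def jaccard(ids_a, ids_b):
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    set_a, set_b = set(ids_a), set(ids_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical document IDs returned for gene A and gene B.
DOCS_A = {101, 102, 103, 104}
DOCS_B = {103, 104, 105}

print(jaccard(DOCS_A, DOCS_B))
```

Here two documents mention both genes out of five documents overall, giving 2/5.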

Extended Work: PPI System with
SVM Classifier (1)
 Equation:
$u = w \cdot x - b$
 Objective:
$\min_{w,b} \frac{1}{2} \lVert w \rVert^2$
subject to
$y_i (w \cdot x_i - b) \ge 1, \quad \forall i$

Extended Work: PPI System with
SVM Classifier (2)
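 $\min_{\alpha} \Psi(\alpha) = \min_{\alpha} \frac{1}{2} \sum_i \sum_j y_i y_j (x_i \cdot x_j) \alpha_i \alpha_j - \sum_i \alpha_i$
 The $\alpha_i$ are the Lagrange multipliers; once $\alpha$ is known, $(w, b)$ can be computed.

 $w = \sum_i y_i \alpha_i x_i$, $b = w \cdot x_k - y_k$ for some $\alpha_k > 0$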
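
The last step, recovering (w, b) from the multipliers, can be illustrated on a two-point toy problem. The sketch assumes the multipliers have already been found by some solver (the values 0.5 and 0.5 below were worked out by hand for this toy data) and uses the same sign convention as above, u = w · x - b. Function names are illustrative.

```python
def recover_hyperplane(xs, ys, alphas):
    """Compute w = sum_i y_i * alpha_i * x_i and b = w . x_k - y_k for a support vector k."""
    dim = len(xs[0])
    w = [sum(y * a * x[d] for x, y, a in zip(xs, ys, alphas)) for d in range(dim)]
    k = next(i for i, a in enumerate(alphas) if a > 0)  # any point with alpha_k > 0
    b = sum(w_d * x_d for w_d, x_d in zip(w, xs[k])) - ys[k]
    return w, b


def decision(w, b, x):
    """u = w . x - b; the sign of u is the predicted class."""
    return sum(w_d * x_d for w_d, x_d in zip(w, x)) - b


# Toy data: one positive and one negative example on a line.
XS = [(2.0, 0.0), (0.0, 0.0)]
YS = [1, -1]
ALPHAS = [0.5, 0.5]  # assumed to be the solved multipliers for this toy problem

W, B = recover_hyperplane(XS, YS, ALPHAS)
print(W, B)
print([decision(W, B, x) for x in XS])
```

Both training points land exactly on the margin (u = +1 and u = -1), which is what the constraint in the objective requires for support vectors.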

Conclusion
 Problem 1: Algorithms for concept recognition in
document abstracts and titles
 We introduced an algorithm to annotate documents with
Gene Ontology terms.
 Problem 2: Use the annotated documents to build a
structured representation of documents
 We showed how the framework uses the Gene Ontology to build a
semantic representation of the retrieved documents.
 Problem 3: Design a system for ontology-based search
engines for biological researchers
 We presented the design of the framework, which is flexible
with respect to future modifications and scalable with respect to the number
of documents and the number of users.

Conclusion
 Problem 4: Using Swanson’s algorithm to assess the similarity between
different biological terms
 We showed how Swanson's algorithm can be used to estimate the
similarity between two instances (P53 and P21)
 Problem 5: Supervised machine learning algorithms for prediction of
protein-protein interactions
 We showed how we used supervised machine learning algorithms such
as KNN, together with a new technique for estimating the distance between
sentences, in order to predict the possible interactions between proteins
mentioned in the documents.
 Problem 6: Unsupervised machine learning algorithms to identify
different clusters of genes
 We showed how we used unsupervised machine learning algorithms
such as DBScan, with similarity based on Swanson's algorithm and
cosine similarity, in order to group genes mentioned in the documents into
different clusters.

Future work
 There are hot research areas and open problems
in biological text mining
 The content provider for documents
 Google Scholar
 Using the Semantic Web / Web 3.0 (online journals)
 Ontology generation
 Ability to edit the ontologies and add knowledge
 Other ontologies
 Using Wikipedia as an ontology

Future work
 There are some features that may be added to the
system
 Biomedical ontology-based search engine
 Provide a document summary for each group of documents
 Allow the user to save and print the results obtained by the system.
 Protein-Protein Interaction (PPI)
 Use more sophisticated classifiers and machine learning techniques
such as AdaBoost to enhance the classification process.
 Use background knowledge of verbs, as many verbs have the
same meaning.
 This will help the system produce more accurate results, as we can
introduce a fuzzy distance for the differences between the meanings
of verbs. It will also make it possible to discover the type of
relation between the terms, giving a more semantic identification
of relations.

Future work
• There are some features that may be added to
the system
 Gene Clustering
 Use more sophisticated clustering algorithms that were originally
designed for gene clustering.
 More Applications:
 Based on the services provided by the ontology-based
engine, we can construct applications such as
extracting the relations between drugs and diseases,
grouping diseases into different clusters, which helps
to identify the characteristics of a newly discovered disease,
and other applications that rely on text mining of
biomedical documents.
