1. Biomedical Word Sense Disambiguation with
Neural Word and Concept Embedding
Department of Computer Science
University of Kentucky
Oct 7, 2016
AKM Sabbir
Advisor: Dr. Ramakanth Kavuluru
10/27/2016 1
2. Outline
Introduction
• Application of Word Sense Disambiguation (WSD)
• Motivation
• Related Methods to Solve WSD
• Our Method
• Word Vectors
• Tools Used
• Our Method in Detail
• Experiment and Analysis
• Conclusion
3. Introduction
• WSD is the task of detecting or assigning the correct sense of an
ambiguous word in context
– the air in the center of the vortex of a cyclone is generally
very cold
– I could not come to office last week because I had a cold
• Retrieving information from machines is not an easy task
• A number of Natural Language Processing (NLP) tasks require
WSD
4. Outline
• Introduction
Application of Word Sense Disambiguation (WSD)
• Motivation
• Related Methods to Solve WSD
• Our Method
• Word Vectors
• Tools Used
• Our Method in Detail
• Experiment and Analysis
• Conclusion
5. Application
• Text to Speech Conversion
– Bass can be pronounced as in "base" (the instrument) or as in
"mass" (the fish)
• Machine Translation
– The French word grille can be translated as gate or bar
• Information Retrieval
• Named Entity Recognition
• Document Summary Generation
6. Outline
• Introduction
• Application of Word Sense Disambiguation (WSD)
Motivation
• Related Methods to Solve WSD
• Our Method
• Word Vectors
• Tools Used
• Our Method in Detail
• Experiment and Analysis
• Conclusion
7. Motivation
• Generalized WSD is a difficult problem
• It becomes more tractable when solved per domain
• The biomedical domain contains a large number of ambiguous
words
• Medical report summary generation
• Drug side effect prediction
8. Outline
• Introduction
• Application of Word Sense Disambiguation (WSD)
• Motivation
Related Methods to Solve WSD
• Our Method
• Word Vectors
• Tools Used
• Our Method in Detail
• Experiment and Analysis
• Conclusion
9. Related Methods
• Supervised Methods
– Support Vector Machines, Convolutional Neural Nets
• Unsupervised Methods
– Clustering, generative models
– If the vocabulary has four words w1, w2, w3, w4, such methods work
with statistics like a word co-occurrence matrix
• Knowledge Based Methods
– WordNet, UMLS (Unified Medical Language System)
[Figure: illustrative co-occurrence matrix over the vocabulary w1 … w4]
10. Outline
• Introduction
• Application of Word Sense Disambiguation (WSD)
• Motivation
• Related Methods to Solve WSD
Our Method
• Word Vectors
• Tools Used
• Our Method in Detail
• Experiment and Analysis
• Conclusion
11. Our Method
• We build a semi-supervised model
• The model uses concept/sense/CUI vectors just like
how people use word vectors (more later)
• MetaMap is a knowledge-based NER tool. We use its
decisions to generate concept vectors
• The model also uses P(w|c), where c is a concept
or sense, generated using other knowledge-based approaches
• Word vectors are generated from an unstructured data source
12. Outline
• Introduction
• Application of Word Sense Disambiguation (WSD)
• Motivation
• Related Methods to Solve WSD
• Our Method
Word Vectors
• Tools Used
• Our Method in Detail
• Experiment and Analysis
• Conclusion
13. What is a Word Vector?
• Distributed representation of words
• The representation of a word is spread across all dimensions of the vector
• The idea differs from representations whose length equals the
vocabulary size. Here we choose a small dimension, say d = 200,
and generate dense vectors
• Each element of the vector contributes to the definition of
many different words
(Illustrative dense vectors)
King:  [0.07, 0.05, 0.8, 0.002, 0.1, 0.3]
Queen: [0.7, 0.05, 0.67, 0.002, 0.2, 0.3]
14. What is a Word Vector?
• It is a numerical way of representing a word
• Each dimension captures some semantic and syntactic
information related to that word
• Using the same idea we can generate concept/sense/CUI vectors
15. Why Do Word Vectors Work?
• Learned word vectors capture the syntactic and semantic
information present in text
– vector(“king”) – vector(“man”) + vector(“woman”) ≈ vector(“queen”)
Fig 5: resultant queen vector and other vectors [5]
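The analogy arithmetic above can be sketched with toy vectors. The 2-dimensional "embeddings" below are invented illustrative values (real word2vec vectors have hundreds of learned dimensions); here one axis loosely encodes royalty and the other gender.

```python
import numpy as np

# Toy 2-d "embeddings" -- invented values for illustration only.
vocab = {
    "king":  np.array([0.9, 0.1]),
    "queen": np.array([0.9, 0.9]),
    "man":   np.array([0.1, 0.1]),
    "woman": np.array([0.1, 0.9]),
}

def nearest(target, exclude):
    """Return the vocabulary word most cosine-similar to `target`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vocab[w], target))

result = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))  # queen
```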
16. Outline
• Introduction
• Application of Word Sense Disambiguation (WSD)
• Motivation
• Related Methods to Solve WSD
• Our Method
• Word Vectors
Tools Used
• Our Method in Detail
• Experiment and Analysis
• Conclusion
18. Required Tools Contd.
• MetaMap
Step 1 Parsing: text is parsed into noun phrases using the Xerox POS
tagger to perform syntactic analysis [4].
Step 2 Variant Generation: variants for each input phrase are
generated using the knowledge of the SPECIALIST lexicon and a
supplementary database of synonyms
Step 3 Candidate Retrieval: the candidate sets retrieved from the
UMLS Metathesaurus contain at least one of the variants generated
in step 2
Step 4 Candidate Evaluation
Fig 2: variants for the word ocular
Fig 3: evaluated candidates for "ocular complication"
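Steps 2 and 3 above can be sketched in miniature. The variant table and concept table below are invented stand-ins for the SPECIALIST lexicon and the UMLS Metathesaurus that MetaMap actually consults; this is only the shape of the logic, not MetaMap's implementation.

```python
# Invented stand-in data: word -> spelling/synonym variants, and
# CUI-like id -> words appearing in the concept's strings.
VARIANTS = {
    "ocular": {"ocular", "eye", "eyes", "ophthalmic"},
}
CONCEPTS = {
    "C_eye":   {"eye", "structure"},
    "C_heart": {"heart", "cardiac"},
}

def candidates(phrase_word):
    """Step 2: generate variants; step 3: keep concepts containing one."""
    variants = VARIANTS.get(phrase_word, {phrase_word})
    return sorted(cui for cui, words in CONCEPTS.items() if words & variants)

print(candidates("ocular"))  # ['C_eye']
```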
19. Outline
• Introduction
• Application of Word Sense Disambiguation (WSD)
• Motivation
• Related Methods to Solve WSD
• Our Method
• Word Vectors
• Tools Used
Our Method in Detail
• Experiment and Analysis
• Conclusion
20. Our Method in Detail
• Text preprocessing
– Remove English stop words
– NLTK word tokenization
– Keep words with frequency greater than five
– Lowercase everything
• Word context is ten words long
• Generated word vectors have 300 dimensions
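The preprocessing steps above can be sketched as follows. The small stop-word set and whitespace tokenizer are simplified stand-ins for NLTK's English stop-word list and word tokenizer used in the actual pipeline.

```python
from collections import Counter

# Simplified stand-in for NLTK's English stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and"}

def preprocess(docs, min_freq=5):
    # lowercase, tokenize (whitespace stand-in), drop stop words
    tokenized = [[w for w in doc.lower().split() if w not in STOP_WORDS]
                 for doc in docs]
    # keep only words with corpus frequency greater than the threshold
    freq = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if freq[w] > min_freq] for doc in tokenized]

docs = ["the cold air", "a cold virus"] * 3
print(preprocess(docs, min_freq=5))  # only "cold" survives the cutoff
```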
21. Generating word and concept vectors
• Generate word and concept vectors
• 20 million citations from PubMed for training word vectors
• Randomly chose 5 million citations
• Retrieved 7.1 million sentences containing target ambiguous
words
• Each sentence is 16 to 17 words long
• The combined sentences are used to generate bigrams
• Each bigram is fed into MetaMap with the WSD option turned on
• Replace each bigram with the corresponding concept
• Then feed the data to a language model to generate concept vectors
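The replacement step above can be sketched in miniature: bigrams that MetaMap mapped to a concept are swapped for that concept's CUI before the text is handed to the language model. The bigram-to-CUI table here is a hypothetical stand-in for MetaMap's actual output.

```python
# Hypothetical mapping standing in for MetaMap's bigram-level decisions.
BIGRAM_TO_CUI = {("cold", "virus"): "C0009443"}

def replace_bigrams(tokens):
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in BIGRAM_TO_CUI:
            out.append(BIGRAM_TO_CUI[pair])  # emit the CUI instead
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(replace_bigrams(["caught", "a", "cold", "virus", "yesterday"]))
```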
22. Estimate P(D|c) [Yepes et al.]
• The Jimeno-Yepes and Berlanga [3] model uses a Markov
chain to calculate P(D|c)
• In order to get P(D|c), we need to calculate P(w|c):
– P(wj | ci) = count(wj, ci) / Σ_{wk ∈ lex(ci)} count(wk, ci)
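The maximum-likelihood estimate above can be computed directly from word-concept counts. The counts below are invented for illustration.

```python
from collections import Counter

# Invented (word, concept) co-occurrence counts for illustration.
counts = Counter({("pressure", "C_hypertension"): 6,
                  ("blood", "C_hypertension"): 3,
                  ("high", "C_hypertension"): 1})

def p_word_given_concept(word, concept):
    """P(w | c) = count(w, c) / sum of counts over the concept's words."""
    total = sum(n for (w, c), n in counts.items() if c == concept)
    return counts[(word, concept)] / total

print(p_word_given_concept("pressure", "C_hypertension"))  # 0.6
```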
23. Biomedical MSH WSD
• A dataset with 203 ambiguous words
• 424 unique concept identifiers (senses)
• 38,495 test context instances, with an average of 200 test
instances for each ambiguous word
• Goal: correctly identify the sense for each test instance
24. Model I: Cosine Similarity
f_c(T, w, C(w)) = argmax_{c ∈ C(w)} cos(T_avg, c)
• w is the ambiguous word
• T is the test instance context containing the ambiguous word w
• T_avg is the average of the word vectors in T
• C(w) is the set of concepts that w can assume
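Model I reduces to an argmax over cosine similarities. The candidate concept vectors and context vector below are toy values for illustration.

```python
import numpy as np

def model_1(t_avg, candidates):
    """Pick the candidate concept most cosine-similar to T_avg."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(candidates, key=lambda cui: cos(t_avg, candidates[cui]))

# Toy concept vectors for two senses of "cold", and an averaged context.
candidates = {"C_weather": np.array([1.0, 0.1]),
              "C_illness": np.array([0.1, 1.0])}
t_avg = np.array([0.9, 0.2])  # average of the context word vectors
print(model_1(t_avg, candidates))  # C_weather
```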
25. Model II: Projection Magnitude
f_p(T, w, C(w)) = argmax_{c ∈ C(w)} ρ(cos(T_avg, c)) · ‖Pr(T_avg, c)‖ / ‖c‖
• Take the projection Pr(T_avg, c) of T_avg along the concept vector c,
then consider its Euclidean norm relative to ‖c‖
26. Model III
f_{c,p}(T, w, C(w)) = argmax_{c ∈ C(w)} cos(T_avg, c) · ‖Pr(T_avg, c)‖ / ‖c‖
• Combines both the angular and the magnitude components
27. Model IV
f(T, w, C(w)) = argmax_{c ∈ C(w)} [ cos(T_avg, c) · ‖Pr(T_avg, c)‖ / ‖c‖ + P(T|c) ]
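The scoring terms of Models II-IV can be sketched together. This is a hedged reconstruction from the slide formulas: proj_ratio computes ‖Pr(T_avg, c)‖ / ‖c‖, Model III multiplies it by the cosine, and Model IV adds P(T|c), supplied here as a caller-provided function standing in for the knowledge-based estimate of [3]. Vectors are toy values.

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def proj_ratio(t, c):
    """||Pr(t, c)|| / ||c||, where Pr(t, c) is the projection of t onto c."""
    proj = (t @ c) / (c @ c) * c
    return np.linalg.norm(proj) / np.linalg.norm(c)

def model_4(t_avg, candidates, p_t_given_c):
    """Model IV: cosine * projection ratio + P(T|c)."""
    score = lambda cui: (cos(t_avg, candidates[cui])
                         * proj_ratio(t_avg, candidates[cui])
                         + p_t_given_c(cui))
    return max(candidates, key=score)

candidates = {"C_weather": np.array([1.0, 0.1]),
              "C_illness": np.array([0.1, 1.0])}
t_avg = np.array([0.9, 0.2])
print(model_4(t_avg, candidates, p_t_given_c=lambda cui: 0.0))  # C_weather
```

With a flat P(T|c) the decision is purely geometric; a strong knowledge-based prior for the other sense can flip it.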
28. Model V: KNN
• Now we have multiple ways to resolve the sense of ambiguous terms
• Built a distantly supervised dataset by collecting data from
biomedical citations
• For each ambiguous word there are on average 40,000 sentences
• Resolved the sense of each sentence using Model IV
32. Distant Supervision with CNN
• Used the refined assignment of CUIs to sentences as a training set
• Then used the MSH WSD data as the test set
• Trained 203 Convolutional Neural Nets, one per ambiguous word
• Each with one convolutional layer and one hidden layer
• Used 900 filters of 3 different sizes
• Used the test cases for testing purposes
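The architecture above can be sketched as a single numpy forward pass: one convolutional layer with several filter widths, max-over-time pooling, and one hidden layer producing sense probabilities. Filter counts and sizes are scaled down from the 900 filters of 3 sizes actually used, and all weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, n_filters = 8, 4
sentence = rng.normal(size=(10, emb_dim))        # 10 embedded tokens

pooled = []
for width in (2, 3, 4):                          # 3 filter sizes
    filters = rng.normal(size=(n_filters, width, emb_dim))
    # slide each filter over every window of `width` tokens
    conv = np.array([[np.sum(sentence[i:i + width] * f)
                      for i in range(len(sentence) - width + 1)]
                     for f in filters])
    pooled.append(conv.max(axis=1))              # max-over-time pooling
features = np.concatenate(pooled)                # 12-dim feature vector

W = rng.normal(size=(2, features.size))          # hidden layer -> 2 senses
logits = W @ np.maximum(features, 0.0)           # ReLU then linear
probs = np.exp(logits - logits.max())            # numerically stable softmax
probs /= probs.sum()
print(probs.shape)  # (2,)
```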
34. Ensembling of CNNs
• Five CNNs are trained and tested for each ambiguous word
• Average the outputs and take the best one
• Tends to improve the results at the cost of computation
35. Outline
• Introduction
• Application of Word Sense Disambiguation (WSD)
• Motivation
• Related Methods to Solve WSD
• Our Method
• Word Vectors
• Tools Used
• Our Method in Detail
Experiment and Analysis
• Conclusion
36. Results and Analysis
Method                                          Accuracy
Jimeno-Yepes and Berlanga [3]                   89.10%
Cosine similarity (Model I, f_c)                85.54%
Projection length proportion (Model II, f_p)    88.68%
Combining Models I and II (f_{c,p})             89.26%
Combining Models I, II and [3]                  92.24%
Convolutional Neural Net                        86.17%
Ensembling CNNs                                 87.78%
k-NN with k = 3500 (f_{k-NN})                   94.34%
37. Outline
• Introduction
• Application of Word Sense Disambiguation (WSD)
• Motivation
• Related Methods to Solve WSD
• Our Method
• Word Vectors
• Tools Used
• Our Method in Detail
• Experiment and Analysis
Conclusion
38. Conclusion
• The developed model is highly accurate, beating the previous best
• It is unsupervised: no hand-labeled information is required
• It is scalable; however, the accuracy level is uncertain
– By increasing the number of training sentences and the sentence
context, more information may be extractable
• Graph-based algorithms need to be explored
• Tools: HPC, Theano, NLTK, Gensim Word2Vec
40. References
1. Eneko Agirre and Philip Edmonds. Word sense disambiguation:
Algorithms and applications, volume 33. Springer Science &
Business Media, 2007.
2. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian
Janvin. A neural probabilistic language model. The Journal of
Machine Learning Research, 3:1137-1155, 2003.
3. Antonio Jimeno Yepes and Rafael Berlanga. Knowledge based
word-concept model estimation and refinement for biomedical text
mining. Journal of Biomedical Informatics, 53:300-307, 2015.
4. Alan R. Aronson. Effective mapping of biomedical text to the
UMLS Metathesaurus: the MetaMap program. Proceedings of the
AMIA Symposium. American Medical Informatics Association,
2001.
5. https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
41. References
6. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet
classification with deep convolutional neural networks. In Advances
in Neural Information Processing Systems, pages 1097-1105, 2012.