1. Biomedical Word Sense Disambiguation with
Neural Word and Concept Embedding
Department of Computer Science
University of Kentucky
Oct 7, 2016
AKM Sabbir
Advisor: Dr. Ramakanth Kavuluru
10/27/2016 1
2. Outline
Introduction
• Application of Word Sense Disambiguation (WSD)
• Motivation
• Related Methods to Solve WSD
• Our Method
• Word Vectors
• Tools Used
• Our Method in Detail
• Experiment and Analysis
• Conclusion
3. Introduction
• WSD is the task of detecting or assigning the correct sense of an
ambiguous word in context
– the air in the center of the vortex of a cyclone is generally
very cold
– I could not come to office last week because I had a cold
• Retrieving information from machines is not an easy task
• A number of Natural Language Processing (NLP) tasks require
WSD
4. Outline
• Introduction
Application of Word Sense Disambiguation (WSD)
• Motivation
• Related Methods to Solve WSD
• Our Method
• Word Vectors
• Tools Used
• Our Method in Detail
• Experiment and Analysis
• Conclusion
5. Application
• Text to Speech Conversion
– Bass can be pronounced as in "base" (the instrument) or as in
"mass" (the fish)
• Machine Translation
– The French word grille can be translated as gate or bar
• Information Retrieval
• Named Entity Recognition
• Document Summary Generation
6. Outline
• Introduction
• Application of Word Sense Disambiguation (WSD)
Motivation
• Related Methods to Solve WSD
• Our Method
• Word Vectors
• Tools Used
• Our Method in Detail
• Experiment and Analysis
• Conclusion
7. Motivation
• Generalized WSD is a difficult problem
• It becomes more tractable when solved per domain
• The biomedical domain contains a large number of ambiguous
words
• Medical report summary generation
• Drug side effect prediction
8. Outline
• Introduction
• Application of Word Sense Disambiguation (WSD)
• Motivation
Related Methods to Solve WSD
• Our Method
• Word Vectors
• Tools Used
• Our Method in Detail
• Experiment and Analysis
• Conclusion
9. Related Methods
• Supervised Methods
– Support Vector Machines, Convolutional Neural Nets
• Unsupervised Methods
– Clustering, generative models
– If the vocabulary has four words w1, w2, w3, w4, such methods work
with statistics like a word co-occurrence matrix
• Knowledge Based Methods
– WordNet, UMLS (Unified Medical Language System)
[Figure: illustrative co-occurrence matrix over the vocabulary w1 … w4]
10. Outline
• Introduction
• Application of Word Sense Disambiguation (WSD)
• Motivation
• Related Methods to Solve WSD
Our Method
• Word Vectors
• Tools Used
• Our Method in Detail
• Experiment and Analysis
• Conclusion
11. Our Method
• We build a semi-supervised model
• The model uses concept/sense/CUI vectors just like
how people use word vectors (more later)
• MetaMap is a knowledge-based NER tool. We use its
decisions to generate concept vectors
• The model also uses P(w|c), where c is a concept
or sense, generated using other knowledge-based approaches
• Word vectors are generated from an unstructured data source
12. Outline
• Introduction
• Application of Word Sense Disambiguation (WSD)
• Motivation
• Related Methods to Solve WSD
• Our Method
Word Vectors
• Tools Used
• Our Method in Detail
• Experiment and Analysis
• Conclusion
13. What is a Word Vector?
• Distributed representation of words
• The representation of a word is spread across all dimensions of the vector
• The idea differs from representations whose length equals the
vocabulary size. Here we choose a small dimension, say d = 200,
and generate dense vectors
• Each element of the vector contributes to the definition of
many different words
(Illustrative dense vectors)
King:  [0.07, 0.05, 0.8, 0.002, 0.1, 0.3]
Queen: [0.7, 0.05, 0.67, 0.002, 0.2, 0.3]
14. What is a Word Vector?
• It is a numerical way of representing a word
• Each dimension captures some semantic and syntactic
information related to that word
• Using the same idea we can generate concept/sense/CUI vectors
15. Why Do Word Vectors Work?
• Learned word vectors capture the syntactic and semantic
information present in text
– vector(“king”) – vector(“man”) + vector(“woman”) ≈ vector(“queen”)
Fig 5: resultant queen vector and other vectors [5]
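The analogy arithmetic above can be sketched with toy vectors. The 2-dimensional "embeddings" below are invented illustrative values (real word2vec vectors have hundreds of learned dimensions); here one axis loosely encodes royalty and the other gender.

```python
import numpy as np

# Toy 2-d "embeddings" -- invented values for illustration only.
vocab = {
    "king":  np.array([0.9, 0.1]),
    "queen": np.array([0.9, 0.9]),
    "man":   np.array([0.1, 0.1]),
    "woman": np.array([0.1, 0.9]),
}

def nearest(target, exclude):
    """Return the vocabulary word most cosine-similar to `target`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vocab[w], target))

result = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))  # queen
```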
16. Outline
• Introduction
• Application of Word Sense Disambiguation (WSD)
• Motivation
• Related Methods to Solve WSD
• Our Method
• Word Vectors
Tools Used
• Our Method in Detail
• Experiment and Analysis
• Conclusion
18. Required Tools Contd.
• MetaMap
Step 1 Parsing: text is parsed into noun phrases using the Xerox POS
tagger to perform syntactic analysis [4].
Step 2 Variant Generation: variants for each input phrase are
generated using the knowledge of the SPECIALIST lexicon and a
supplementary database of synonyms
Step 3 Candidate Retrieval: the candidate sets retrieved from the
UMLS Metathesaurus contain at least one of the variants generated
in step 2
Step 4 Candidate Evaluation
Fig 2: variants for the word ocular
Fig 3: evaluated candidates for "ocular complication"
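Steps 2 and 3 above can be sketched in miniature. The variant table and concept table below are invented stand-ins for the SPECIALIST lexicon and the UMLS Metathesaurus that MetaMap actually consults; this is only the shape of the logic, not MetaMap's implementation.

```python
# Invented stand-in data: word -> spelling/synonym variants, and
# CUI-like id -> words appearing in the concept's strings.
VARIANTS = {
    "ocular": {"ocular", "eye", "eyes", "ophthalmic"},
}
CONCEPTS = {
    "C_eye":   {"eye", "structure"},
    "C_heart": {"heart", "cardiac"},
}

def candidates(phrase_word):
    """Step 2: generate variants; step 3: keep concepts containing one."""
    variants = VARIANTS.get(phrase_word, {phrase_word})
    return sorted(cui for cui, words in CONCEPTS.items() if words & variants)

print(candidates("ocular"))  # ['C_eye']
```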
19. Outline
• Introduction
• Application of Word Sense Disambiguation (WSD)
• Motivation
• Related Methods to Solve WSD
• Our Method
• Word Vectors
• Tools Used
Our Method in Detail
• Experiment and Analysis
• Conclusion
20. Our Method in Detail
• Text preprocessing
– Remove English stop words
– NLTK word tokenization
– Keep words with frequency greater than five
– Lowercase everything
• Word context is ten words long
• Generated word vectors have 300 dimensions
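The preprocessing steps above can be sketched as follows. The small stop-word set and whitespace tokenizer are simplified stand-ins for NLTK's English stop-word list and word tokenizer used in the actual pipeline.

```python
from collections import Counter

# Simplified stand-in for NLTK's English stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and"}

def preprocess(docs, min_freq=5):
    # lowercase, tokenize (whitespace stand-in), drop stop words
    tokenized = [[w for w in doc.lower().split() if w not in STOP_WORDS]
                 for doc in docs]
    # keep only words with corpus frequency greater than the threshold
    freq = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if freq[w] > min_freq] for doc in tokenized]

docs = ["the cold air", "a cold virus"] * 3
print(preprocess(docs, min_freq=5))  # only "cold" survives the cutoff
```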
21. Generating word and concept vectors
• Generate word and concept vectors
• 20 million citations from PubMed for training word vectors
• Randomly chose 5 million citations
• Retrieved 7.1 million sentences containing target ambiguous
words
• Each sentence is 16 to 17 words long
• The combined sentences are used to generate bigrams
• Each bigram is fed into MetaMap with the WSD option turned on
• Replace each bigram with the corresponding concept
• Then feed the data to a language model to generate concept vectors
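The replacement step above can be sketched in miniature: bigrams that MetaMap mapped to a concept are swapped for that concept's CUI before the text is handed to the language model. The bigram-to-CUI table here is a hypothetical stand-in for MetaMap's actual output.

```python
# Hypothetical mapping standing in for MetaMap's bigram-level decisions.
BIGRAM_TO_CUI = {("cold", "virus"): "C0009443"}

def replace_bigrams(tokens):
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in BIGRAM_TO_CUI:
            out.append(BIGRAM_TO_CUI[pair])  # emit the CUI instead
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(replace_bigrams(["caught", "a", "cold", "virus", "yesterday"]))
```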
22. Estimate P(D|c) [Yepes et al.]
• The Jimeno-Yepes and Berlanga [3] model uses a Markov
chain to calculate P(D|c)
• In order to get P(D|c), we need to calculate P(w|c):
– P(wj | ci) = count(wj, ci) / Σ_{wk ∈ lex(ci)} count(wk, ci)
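The maximum-likelihood estimate above can be computed directly from word-concept counts. The counts below are invented for illustration.

```python
from collections import Counter

# Invented (word, concept) co-occurrence counts for illustration.
counts = Counter({("pressure", "C_hypertension"): 6,
                  ("blood", "C_hypertension"): 3,
                  ("high", "C_hypertension"): 1})

def p_word_given_concept(word, concept):
    """P(w | c) = count(w, c) / sum of counts over the concept's words."""
    total = sum(n for (w, c), n in counts.items() if c == concept)
    return counts[(word, concept)] / total

print(p_word_given_concept("pressure", "C_hypertension"))  # 0.6
```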
23. Biomedical MSH WSD
• A dataset with 203 ambiguous words
• 424 unique concept identifiers (senses)
• 38,495 test context instances, with an average of 200 test
instances for each ambiguous word
• Goal: correctly identify the sense for each test instance
24. Model I: Cosine Similarity
f_c(T, w, C(w)) = argmax_{c ∈ C(w)} cos(T_avg, c)
• w is the ambiguous word
• T is the test instance context containing the ambiguous word w
• T_avg is the average of the word vectors in T
• C(w) is the set of concepts that w can assume
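Model I reduces to an argmax over cosine similarities. The candidate concept vectors and context vector below are toy values for illustration.

```python
import numpy as np

def model_1(t_avg, candidates):
    """Pick the candidate concept most cosine-similar to T_avg."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(candidates, key=lambda cui: cos(t_avg, candidates[cui]))

# Toy concept vectors for two senses of "cold", and an averaged context.
candidates = {"C_weather": np.array([1.0, 0.1]),
              "C_illness": np.array([0.1, 1.0])}
t_avg = np.array([0.9, 0.2])  # average of the context word vectors
print(model_1(t_avg, candidates))  # C_weather
```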
25. Model II: Projection Magnitude
f_p(T, w, C(w)) = argmax_{c ∈ C(w)} ρ(cos(T_avg, c)) · ‖Pr(T_avg, c)‖ / ‖c‖
• Take the projection Pr(T_avg, c) of T_avg along the concept vector c,
then consider its Euclidean norm relative to ‖c‖
26. Model III
f_{c,p}(T, w, C(w)) = argmax_{c ∈ C(w)} cos(T_avg, c) · ‖Pr(T_avg, c)‖ / ‖c‖
• Combines both the angular and the magnitude components
27. Model IV
f(T, w, C(w)) = argmax_{c ∈ C(w)} [ cos(T_avg, c) · ‖Pr(T_avg, c)‖ / ‖c‖ + P(T|c) ]
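The scoring terms of Models II-IV can be sketched together. This is a hedged reconstruction from the slide formulas: proj_ratio computes ‖Pr(T_avg, c)‖ / ‖c‖, Model III multiplies it by the cosine, and Model IV adds P(T|c), supplied here as a caller-provided function standing in for the knowledge-based estimate of [3]. Vectors are toy values.

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def proj_ratio(t, c):
    """||Pr(t, c)|| / ||c||, where Pr(t, c) is the projection of t onto c."""
    proj = (t @ c) / (c @ c) * c
    return np.linalg.norm(proj) / np.linalg.norm(c)

def model_4(t_avg, candidates, p_t_given_c):
    """Model IV: cosine * projection ratio + P(T|c)."""
    score = lambda cui: (cos(t_avg, candidates[cui])
                         * proj_ratio(t_avg, candidates[cui])
                         + p_t_given_c(cui))
    return max(candidates, key=score)

candidates = {"C_weather": np.array([1.0, 0.1]),
              "C_illness": np.array([0.1, 1.0])}
t_avg = np.array([0.9, 0.2])
print(model_4(t_avg, candidates, p_t_given_c=lambda cui: 0.0))  # C_weather
```

With a flat P(T|c) the decision is purely geometric; a strong knowledge-based prior for the other sense can flip it.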
28. Model V: KNN
• Now we have multiple ways to resolve the sense of ambiguous terms
• Built a distantly supervised dataset by collecting data from
biomedical citations
• For each ambiguous word there are on average 40,000 sentences
• Resolved the sense of each sentence using Model IV
32. Distant Supervision with CNN
• Used the refined assignment of CUIs to sentences as a training set
• Then used the MSH WSD data as the test set
• Trained 203 Convolutional Neural Nets, one per ambiguous word
• Each with one convolutional layer and one hidden layer
• Used 900 filters of 3 different sizes
• Used the test cases for testing purposes
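The architecture above can be sketched as a single numpy forward pass: one convolutional layer with several filter widths, max-over-time pooling, and one hidden layer producing sense probabilities. Filter counts and sizes are scaled down from the 900 filters of 3 sizes actually used, and all weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, n_filters = 8, 4
sentence = rng.normal(size=(10, emb_dim))        # 10 embedded tokens

pooled = []
for width in (2, 3, 4):                          # 3 filter sizes
    filters = rng.normal(size=(n_filters, width, emb_dim))
    # slide each filter over every window of `width` tokens
    conv = np.array([[np.sum(sentence[i:i + width] * f)
                      for i in range(len(sentence) - width + 1)]
                     for f in filters])
    pooled.append(conv.max(axis=1))              # max-over-time pooling
features = np.concatenate(pooled)                # 12-dim feature vector

W = rng.normal(size=(2, features.size))          # hidden layer -> 2 senses
logits = W @ np.maximum(features, 0.0)           # ReLU then linear
probs = np.exp(logits - logits.max())            # numerically stable softmax
probs /= probs.sum()
print(probs.shape)  # (2,)
```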
34. Ensembling of CNNs
• Five CNNs are trained and tested for each ambiguous word
• Average the outputs and take the best one
• Tends to improve the results at the cost of computation
35. Outline
• Introduction
• Application of Word Sense Disambiguation (WSD)
• Motivation
• Related Methods to Solve WSD
• Our Method
• Word Vectors
• Tools Used
• Our Method in Detail
Experiment and Analysis
• Conclusion
36. Results and Analysis
Method                                          Accuracy
Jimeno-Yepes and Berlanga [3]                   89.10%
Cosine similarity (Model I, f_c)                85.54%
Projection length proportion (Model II, f_p)    88.68%
Combining Models I and II (f_{c,p})             89.26%
Combining Models I, II and [3]                  92.24%
Convolutional Neural Net                        86.17%
Ensembling CNNs                                 87.78%
k-NN with k = 3500 (f_{k-NN})                   94.34%
37. Outline
• Introduction
• Application of Word Sense Disambiguation (WSD)
• Motivation
• Related Methods to Solve WSD
• Our Method
• Word Vectors
• Tools Used
• Our Method in Detail
• Experiment and Analysis
Conclusion
38. Conclusion
• The developed model is highly accurate, beating the previous best
• It is unsupervised: no hand-labeled information is required
• It is scalable; however, the accuracy level is uncertain
– By increasing the number of training sentences and the sentence
context, more information may be extractable
• Graph-based algorithms need to be explored
• Tools: HPC, Theano, NLTK, Gensim Word2Vec
40. References
1. Eneko Agirre and Philip Edmonds. Word sense disambiguation:
Algorithms and applications, volume 33. Springer Science &
Business Media, 2007.
2. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian
Janvin. A neural probabilistic language model. The Journal of
Machine Learning Research, 3:1137-1155, 2003.
3. Antonio Jimeno Yepes and Rafael Berlanga. Knowledge based
word-concept model estimation and refinement for biomedical text
mining. Journal of Biomedical Informatics, 53:300-307, 2015.
4. Alan R. Aronson. Effective mapping of biomedical text to the
UMLS Metathesaurus: the MetaMap program. Proceedings of the
AMIA Symposium. American Medical Informatics Association,
2001.
5. https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
41. References
6. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet
classification with deep convolutional neural networks. In Advances
in Neural Information Processing Systems, pages 1097-1105, 2012.