Word Sense Detection
and
Word Sense Disambiguation
through Data-Mining
Andi Wu & Randall Tan
Asia Bible Society
Outline
Motivations for word sense identification
Problems of existing word sense data
The data-mining approach
Demo and Discussion
Motivations
Addressing the issue of polysemy
Bible translation
– Better understanding of every word
– Unification on the basis of senses rather than
words
Bible search
– More refined search results on the basis of
senses
Goals
Word sense detection
For each content word in the Bible, find out how
many senses it has.
Word sense disambiguation
For each instance of the word, find out which of
the senses it has.
Word sense detection
Identify the senses of each word:
רֵאשִׁית
Sense 1
– Sense 1-1: beginning
– Sense 1-2: first
Sense 2
– Sense 2-1: firstfruits
– Sense 2-2: firstborn
Sense 3
– Sense 3-1: best
– Sense 3-2: choicest
Sense 4
– Sense 4-1: foremost
Word sense disambiguation
Identify the sense of each instance:
בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ׃ (Gen 1:1)
– Sense 1: beginning
קָרְבַּן רֵאשִׁית תַּקְרִיבוּ אֹתָם לַיהוָה וְאֶל־הַמִּזְבֵּחַ לֹא־יַעֲלוּ לְרֵיחַ נִיחֹחַ׃ (Lev 2:12)
– Sense 2: firstfruits
וַיִּקַּח הָעָם מֵהַשָּׁלָל צֹאן וּבָקָר רֵאשִׁית הַחֵרֶם לִזְבֹּחַ לַיהוָה אֱלֹהֶיךָ בַּגִּלְגָּל׃ (1 Sam 15:21)
– Sense 3: best
הוּא רֵאשִׁית דַּרְכֵי־אֵל הָעֹשׂוֹ יַגֵּשׁ חַרְבּוֹ׃ (Job 40:19)
– Sense 4: foremost
Problems of Existing Data
No consensus on the number of senses
each word has
No complete data on instance-level sense
identification
Manual identification can be subjective,
inconsistent, and time-consuming
The Data-Mining Approach
Theoretical assumption
Data for mining
Machine learning procedures
Advantages and limitations of the
approach
Tool for sense exploration
Theoretical Assumption
Translators presumably use different target-language
words to translate different senses of a word.
In effect, translators have already done the job of
disambiguation subconsciously and defined
each sense with target-language words.
Data for Mining
Translations linked word-to-word to the original Hebrew/Greek texts
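As a rough sketch of what such word-linked data might look like in memory, each instance of a source word can carry the target word that every translation aligns to it. The version labels, instance identifiers, and rows below are invented placeholders, not the project's actual data or schema.

```python
from dataclasses import dataclass

# Placeholder version labels; the deck does not name the translations used.
VERSIONS = ("T1", "T2", "T3", "T4")


@dataclass(frozen=True)
class Instance:
    """One occurrence of a source-language word and its aligned renderings."""
    instance_id: str   # hypothetical identifier, e.g. a verse-and-position key
    renderings: tuple  # one aligned target word per entry in VERSIONS

    def rendering(self, version):
        """Return the target word that `version` aligns to this instance."""
        return self.renderings[VERSIONS.index(version)]


# Invented example rows, for illustration only.
instances = [
    Instance("i1", ("beginning", "beginning", "first", "beginning")),
    Instance("i2", ("best", "choicest", "best", "best")),
]
```

With this layout, `instances[0].rendering("T3")` looks up how the third placeholder version translated the first instance.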
Basic Task
Take all instances of a word and group the
instances into different senses
A Simple and Naive Approach
Look at the words used in a given translation and treat
instances with the same translation words as having the
same sense.
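A minimal sketch of this single-translation grouping, assuming the aligned data has already been reduced to a mapping from instance to translation word. The instance identifiers and renderings are made up for illustration.

```python
from collections import defaultdict

# Invented data: instance id -> the word one translation uses for it.
renderings = {
    "i1": "beginning",
    "i2": "first",
    "i3": "beginning",
    "i4": "best",
    "i5": "choicest",
}


def group_by_rendering(renderings):
    """Treat instances with the same translation word as one sense group."""
    groups = defaultdict(list)
    for instance_id, word in renderings.items():
        groups[word].append(instance_id)
    return dict(groups)


groups = group_by_rendering(renderings)
```

In this toy data the five instances fall into four groups, with "best" and "choicest" kept apart simply because the translator varied the wording.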
A Simple and Naive Approach
Problems:
The translations may not be consistent
– Translators may use different words to translate the
same sense, or the same word to translate different
senses
It can be subjective
– It only reflects the opinions of a particular translation
The senses are too fragmentary
The Voting Approach
Use multiple translations
Two instances of a word are considered to
have the same sense if most of the
translations use the same word to
translate them.
Checks and balances
How to define “most”?
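The vote can be sketched as a count of versions that agree on a pair of instances, with "most" left as an explicit threshold. The function names, the `min_votes` parameter, and the sample renderings are assumptions made for illustration.

```python
def agreements(a, b):
    """Count the versions that render both instances with the same word."""
    return sum(1 for x, y in zip(a, b) if x == y)


def same_sense(a, b, min_votes):
    """Two instances share a sense if at least `min_votes` versions agree."""
    return agreements(a, b) >= min_votes


# Invented renderings of two instances in four placeholder versions.
inst_a = ("beginning", "beginning", "first", "beginning")
inst_b = ("beginning", "start", "first", "beginning")

votes = agreements(inst_a, inst_b)
```

Here three of the four versions agree, so the pair counts as one sense under a threshold of 3 but not under a threshold of 4.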
The Voting Approach
How many votes to get?
Maximal agreement:
– Internal consistency within groups
– Too many senses
– Too many unassigned instances
Minimal agreement:
– Better grouping of senses
– Instances of different senses may be mixed together
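The trade-off can be seen on toy data by grouping the same instances under a strict and a loose threshold. The sketch below uses simple single-link grouping, which is our simplification rather than the procedure in the deck, and the renderings are invented.

```python
def agreements(a, b):
    """Count the versions that render both instances with the same word."""
    return sum(1 for x, y in zip(a, b) if x == y)


def link_groups(instances, min_votes):
    """Single-link grouping: an instance joins, and fuses, every existing
    group that has a member agreeing with it in >= min_votes versions."""
    groups = []
    for inst in instances:
        keep, merged = [], [inst]
        for group in groups:
            if any(agreements(inst, member) >= min_votes for member in group):
                merged.extend(group)
            else:
                keep.append(group)
        groups = keep + [merged]
    return groups


# Invented renderings of five instances in four placeholder versions.
data = [
    ("beginning", "beginning", "beginning", "beginning"),
    ("beginning", "beginning", "first", "beginning"),
    ("first", "first", "first", "beginning"),
    ("best", "best", "choicest", "best"),
    ("best", "choicest", "choicest", "best"),
]

strict = link_groups(data, min_votes=4)  # maximal agreement
loose = link_groups(data, min_votes=1)   # minimal agreement
```

On this data the strict setting leaves every instance in its own group, while the loose setting collapses them into two groups.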
Progressive Merging
Aim: keep the benefits of both maximal and minimal
agreement while avoiding their disadvantages
Start with maximal agreement to get initial sense groups
that are internally consistent
Gradually merge the initial groups, lowering the required
number of agreements N (0 < N < Maximal) at each step and
using a variable association rate R (0 < R < 1)
Group B is merged into Group A if A contains B
A contains B if each instance in B is linked to at least a fraction R
of the instances in A by at least N agreements (the same
translation word in at least N versions)
Merge pairwise until no further merge can be done
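The procedure can be sketched in a few lines of Python. This is one reading of the slide, not the authors' implementation: function names, the order in which pairs are tried, and the sample data are all assumptions.

```python
def agreements(a, b):
    """Count the versions that render both instances with the same word."""
    return sum(1 for x, y in zip(a, b) if x == y)


def contains(group_a, group_b, n, r):
    """A contains B if every instance in B agrees in >= n versions with
    at least a fraction r of the instances in A."""
    return all(
        sum(agreements(a, b) >= n for a in group_a) >= r * len(group_a)
        for b in group_b
    )


def progressive_merge(instances, r):
    """Start from maximal agreement, then merge with decreasing n."""
    max_n = len(instances[0])
    # Initial groups: instances rendered identically in every version.
    initial = {}
    for inst in instances:
        initial.setdefault(inst, []).append(inst)
    groups = list(initial.values())
    for n in range(max_n - 1, 0, -1):
        merged = True
        while merged:  # pairwise merge until nothing changes at this n
            merged = False
            for i, group_a in enumerate(groups):
                for j, group_b in enumerate(groups):
                    if i != j and contains(group_a, group_b, n, r):
                        group_a.extend(group_b)
                        del groups[j]
                        merged = True
                        break
                if merged:
                    break
    return groups


# Invented renderings of five instances in four placeholder versions.
data = [
    ("beginning", "beginning", "beginning", "beginning"),
    ("beginning", "beginning", "first", "beginning"),
    ("first", "first", "first", "beginning"),
    ("best", "best", "choicest", "best"),
    ("best", "choicest", "choicest", "best"),
]

result = progressive_merge(data, r=0.7)
```

On this toy data the first pass (n = 3) pairs up the near-identical instances, and the third instance only joins the "beginning" group at n = 1, leaving two sense groups.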
Progressive Merging
[Diagram: merging two groups, association rate = 0.5]
Progressive Merging
Example: Maximal N = 4, R = 70%
Merge 1: N-1 = 3
B merges into A if each instance in B is linked to at least 70% of the
instances in A by sharing the same translation in at least 3 versions
Merge 2: N-2 = 2
B merges into A if each instance in B is linked to at least 70% of the
instances in A by sharing the same translation in at least 2 versions
Merge 3: N-3 = 1
B merges into A if each instance in B is linked to at least 70% of the
instances in A by sharing the same translation in at least 1 version
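Written as a formula, the merge condition at step k of this example might read as follows. The notation is ours, not the deck's: agree(a, b) stands for the number of versions that translate instances a and b with the same word.

```latex
\[
B \text{ merges into } A
\iff
\forall\, b \in B:\quad
\bigl|\{\, a \in A : \mathrm{agree}(a, b) \ge N_k \,\}\bigr| \;\ge\; R\,|A|,
\]
\[
N_k = N_{\max} - k, \qquad N_{\max} = 4, \qquad R = 0.7, \qquad k = 1, 2, 3 .
\]
```

For a group A of 10 instances, for instance, each instance in B would need such a link to at least 7 of them.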
Tuning the Variables
We can get different results by tuning the
following variables:
The translations to use
The number of translations to use
The number of merges to perform
The association rate
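These knobs could be gathered into one small configuration object. The class name, field names, and version labels below are hypothetical; the bounds follow the constraints stated for N and R on the Progressive Merging slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiningConfig:
    """Hypothetical container for the tunable variables."""
    translations: tuple      # which versions to use (placeholder labels)
    num_merges: int          # how many merge passes to run
    association_rate: float  # R, strictly between 0 and 1

    def __post_init__(self):
        if not 0 < self.association_rate < 1:
            raise ValueError("association_rate must be between 0 and 1")
        # N runs from Maximal - 1 down to 1, so at most Maximal - 1 passes.
        if not 0 <= self.num_merges < len(self.translations):
            raise ValueError("num_merges must be below the number of translations")

    @property
    def num_translations(self):
        """The number of translations follows from the chosen versions."""
        return len(self.translations)


config = MiningConfig(
    translations=("T1", "T2", "T3", "T4"),
    num_merges=3,
    association_rate=0.7,
)
```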
The “Accents” of Senses
Senses based on English translations
Senses based on Chinese translations
Senses based on both English and Chinese
The triangulating effect of using different
translations
Factors Affecting the Results
The versions of translations that are used
The quality of each translation
The degree of consensus between different
translations
Quality of lemmatization in English
Surface forms vs. lemmatized forms
Other Features Considered
Syntactic contexts
– Instances that occur in similar syntactic contexts tend
to have the same sense
– Not used because of the sparse-data problem
Morphological information
– Verbs with different stems in Hebrew tend to have
different senses
– Not used because the stem distinctions do not always
correspond well with sense distinctions
Editing Options
The data shown here has not been manually
edited, but it can be edited using the tool:
Merge sense groups
Split a sense group
Move an instance from one sense group to
another
Use of manual information in automatic learning
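The three manual operations can be pictured as small functions over a mapping from sense label to a set of instances. This is only an illustration; the tool's real data model and interface are not described in the deck, and all names here are invented.

```python
def merge_groups(groups, keep, absorb):
    """Merge the sense group `absorb` into the sense group `keep`."""
    groups[keep] |= groups.pop(absorb)


def split_group(groups, source, new, members):
    """Move `members` out of `source` into a newly created group `new`."""
    members = set(members)
    if not members <= groups[source]:
        raise ValueError("members must all belong to the source group")
    groups[source] -= members
    groups[new] = members


def move_instance(groups, instance_id, source, target):
    """Move a single instance from one sense group to another."""
    groups[source].remove(instance_id)
    groups[target].add(instance_id)


# Invented sense groups keyed by placeholder labels.
groups = {"s1": {"i1", "i2"}, "s2": {"i3"}, "s3": {"i4", "i5"}}

merge_groups(groups, "s1", "s2")           # s1 now holds i1, i2, i3
split_group(groups, "s3", "s4", {"i5"})    # s3 keeps i4, s4 holds i5
move_instance(groups, "i3", "s1", "s3")    # i3 moves from s1 to s3
```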
Demo
Applications
Sense-based translation memory
Sense-based concordance
Sense-based consistency check
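As an illustration of the last item, a consistency check might flag sense groups that a given translation renders with more than one word. The function name and the sample data are assumptions, not part of the tool described here.

```python
from collections import Counter


def inconsistent_senses(sense_groups, renderings):
    """Return word counts for each sense group rendered with more than one word."""
    report = {}
    for sense, members in sense_groups.items():
        counts = Counter(renderings[m] for m in members)
        if len(counts) > 1:
            report[sense] = dict(counts)
    return report


# Invented sense groups and one translation's renderings.
sense_groups = {"sense1": ["i1", "i2", "i3"], "sense3": ["i4", "i5"]}
renderings = {
    "i1": "beginning",
    "i2": "beginning",
    "i3": "first",
    "i4": "best",
    "i5": "best",
}

report = inconsistent_senses(sense_groups, renderings)
```

Only the first group is reported here, since the translation uses both "beginning" and "first" for it.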
Advantages of the Current Approach
Efficiency: a sense dictionary that lists not only
the senses but also the specific instances of
each sense can be built in a matter of days.
Objectivity: the results are based on actual data
and no pre-conceived subjective categorization
is required.
Flexibility: the granularity of sense divisions can
be adjusted by the values of similarity metrics in
the clustering process.
Conclusion
A great tool for exploring and studying word
senses in biblical texts
BibleTech2011
