Handling Unknown Words in Arabic FST
             Morphology

        Khaled Shaalan and Mohammed Attia
               Faculty of Engineering and IT,
               The British University in Dubai


                     Presented by
                    Younes Samih
            Heinrich-Heine-Universität, Germany
Bird’s Eye view
Problem
  • Out of Vocabulary words (OOV) cause a problem to
    morphological analysers, parsers, MT, etc.
  • The manual extension of lexical databases is costly and
    time-consuming.
  • With the large amount of data, manual extension of lexicons
    becomes practically impossible.
Solution
  • Creating an automatic method for updating a lexical database
  • Integrating a Machine Learning method with a finite state
    guesser to lemmatize unknown words
  • Weighting new words by relevance and importance
Outline
•   Introduction
•   Morphological Guesser
•   Methodology
•   Testing and Evaluation
•   Conclusion
Introduction
• Why deal with unknown words?

• Complexity of lemmatization in Arabic

• Data used
Introduction
Why deal with unknown words?
• Language is always changing
     • New words appear
     • Old words disappear
     • Unknown words make up 29% of the Gigaword
       corpus
• Unknown words (OOV) always cause a problem to:
     • Morphological analysers
     • Parsers
     • Machine Translation & other applications
Introduction
Complexity of lemmatization in Arabic
• Lemmatization means reducing words to their base
  (canonical) forms
      • played -> play     studies -> study
      • went -> go         wives -> wife
• New words in English appear in their base form 86% of
  the time (Lindén, 2008)
• New words in Arabic appear in their base form 45% of
  the time
• Arabic morphology is complex and semi-algorithmic:
  root, patterns, inflections, clitics, etc.
Introduction
Complexity of lemmatization in Arabic
Possible Concatenations in Arabic Verbs
(slot order: Proclitics, Prefix, Lemma, Suffix, Enclitic)
• Proclitic (conjunction/question article): conjunctions و wa ‘and’ or
  ف fa ‘then’; question word أ ᾽a ‘is it true that’
• Proclitic (Comp): ل li ‘to’, س sa ‘will’, ل la ‘then’
• Prefix (tense/mood, number/gender): imperfective tense (5),
  perfective tense (1), imperative (2)
• Lemma: verb lemma
• Suffix (tense/mood, number/gender): imperfective tense (10),
  perfective tense (12), imperative (5)
• Enclitic (object pronoun): first person (2), second person (5),
  third person (5)
• Example: شكر šakara ‘to thank’ generates 2,552 valid forms
Introduction
Complexity of lemmatization in Arabic
Possible Concatenations in Arabic Nouns
(slot order: Proclitics, Lemma, Suffix, Enclitic)
• Proclitic (conjunction/question article): conjunctions و wa ‘and’ or
  ف fa ‘then’; question word أ ᾽a ‘is it true that’
• Proclitic (preposition): ب bi ‘with’, ك ka ‘as’ or ل li ‘to’
• Proclitic (definite article): ال al ‘the’
• Lemma: noun stem
• Suffix (gender/number): masculine dual (4), feminine dual (4),
  masculine regular plural (4), feminine regular plural (1),
  feminine mark (1)
• Enclitic (genitive pronoun): first person (2), second person (5),
  third person (5)
• Example: معلم mu῾allim ‘teacher’ generates 519 valid forms
Introduction
Data used
• A large-scale corpus of 1,089,111,204
  words
  • 85% from the Arabic Gigaword Fourth Edition
  • 15% from news articles crawled from the Al-Jazeera
    web site
Morphological Guesser
We develop a morphological guesser for
Arabic unknown words that handles all
possible
  • Clitics
  • Prefixes
  • Suffixes
  • And all relevant alteration operations that include
    insertion, assimilation, and deletion
Guesser
LEXC (1)
========
LEXICON Conjunctions
+‫وـ‬conj:‫وـ‬           Prepositions;
+‫فـ‬conj:‫فـ‬           Prepositions;
                     Prepositions;

LEXICON Prepositions
+‫لـ‬prep:‫لـ‬           Article;
+‫كـ‬prep:‫كـ‬           Article;
+‫بـ‬prep:‫بـ‬           Article;
                     Article;

LEXICON Article
+‫الـ‬defArt           Nouns;
+‫الـ‬defArt           Adjectives;
                     Nouns;
                     Adjectives;

LEXICON Nouns
+noun                GuessWords;
^ss^^‫خادم‬se^         FemMascduMascpl;
....

LEXICON Adjectives
+adj+fem             GuessWords;
+adj+masc            GuessWords;
^ss^^‫سعيد‬se^+adj+masc       FemMascduFemduMascplFempl;
....

LEXICON GuessWords
^ss^^GUESSNOUNSTEM^^se^     FemMascduFemduMascplFempl;
^ss^^GUESSNOUNSTEM^^se^     FemMascduFemduFempl;
^ss^^GUESSNOUNSTEM^^se^     FemMascduFemdu;
....

ALTERATION RULES (2)
====================
a -> b || L _ R

XFST (3)
========
read regex < arb-Alphabet.txt
define Alphabet
define PossNounStem [[Alphabet]^{2,24}] "+Guess":0;
substitute defined PossNounStem for "^GUESSNOUNSTEM^"
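The idea behind the guesser lexicons can be imitated in a few lines of Python. This is only a toy sketch, not the authors' transducer: it works on Buckwalter-style transliteration, knows just two suffix options (a hypothetical subset), and ignores the alteration rules entirely. The 2 to 24 character stem limit is taken from the PossNounStem definition; every name in the code is an assumption made for illustration.

```python
from itertools import product

# Optional proclitics, in the order the LEXC continuation classes chain them.
CONJ = ["", "w", "f"]        # wa 'and', fa 'then'
PREP = ["", "l", "k", "b"]   # li 'to', ka 'as', bi 'with'
ART = ["", "Al"]             # definite article

# Hypothetical, heavily reduced suffix inventory: suffix -> feature tags.
SUFFIXES = {"": "+sg", "wn": "+masc+pl+nom"}

MIN_STEM, MAX_STEM = 2, 24   # mirrors [[Alphabet]^{2,24}]


def guess(word):
    """Return every (conj, prep, art, pos, stem, tags) split the toy rules allow."""
    analyses = []
    for conj, prep, art in product(CONJ, PREP, ART):
        prefix = conj + prep + art
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix, tags in SUFFIXES.items():
            if suffix and not rest.endswith(suffix):
                continue
            stem = rest[:len(rest) - len(suffix)] if suffix else rest
            if MIN_STEM <= len(stem) <= MAX_STEM:
                # Any guessed stem may be a noun or an adjective.
                for pos in ("+noun", "+adj"):
                    analyses.append((conj, prep, art, pos, stem, tags))
    return analyses


if __name__ == "__main__":
    for analysis in guess("wAlmswqwn"):
        print(analysis)
```

Because every slot is optional, even this reduced version overgenerates, which is why the guesser output needs a disambiguation step.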
Methodology
We use a pipelined approach
• First: a machine learning (SVM), context-sensitive tool
  (MADA) is used to predict:
   • POS
   • Morpho-syntactic features of number, gender, person, tense, etc.
• Second: The finite-state morphological guesser is used
  to produce all the possible interpretations of words and
  suggested lemmas.
• Third: The two outputs are matched together and the
  agreed analysis is selected.
Methodology
Example
والمسوِّقون
wa-Al-musaw~iquwna “and-the-marketers”

MADA output:
form:wAlmswqwn    num:p      gen:m    per:na    case:n     asp:na    mod:na     vox:na
         pos:noun prc0:Al_det prc1:0   prc2:wa_conj   prc3:0     enc0:0    stt:d

Finite-state guesser output:
‫والمسوقون‬   +adj‫+والمسوق‬Guess+masc+pl+nom@
‫والمسوقون‬   +adj‫+والمسوقون‬Guess+sg@
‫والمسوقون‬   +noun‫+والمسوق‬Guess+masc+pl+nom@
‫والمسوقون‬   +noun‫+والمسوقون‬Guess+sg@
‫والمسوقون‬   ‫+و‬conj@‫+ال‬defArt@+adj‫+مسوق‬Guess+masc+pl+nom@
‫والمسوقون‬   ‫+و‬conj@‫+ال‬defArt@+adj‫+مسوقون‬Guess+sg@
‫والمسوقون‬   ‫+و‬conj@‫+ال‬defArt@+noun‫+مسوق‬Guess+masc+pl+nom@   <- Correct Analysis
‫والمسوقون‬   ‫+و‬conj@‫+ال‬defArt@+noun‫+مسوقون‬Guess+sg@
‫والمسوقون‬   ‫+و‬conj@+adj‫+المسوق‬Guess+masc+pl+nom@
‫والمسوقون‬   ‫+و‬conj@+adj‫+المسوقون‬Guess+sg@
‫والمسوقون‬   ‫+و‬conj@+noun‫+المسوق‬Guess+masc+pl+nom@
‫والمسوقون‬   ‫+و‬conj@+noun‫+المسوقون‬Guess+sg@
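The matching step for this example can be sketched as a simple filter. The candidate list below is a hand-written stand-in for the twelve guesser analyses (in transliteration), and the field names and the `strict_article` flag are assumptions for illustration; the flag reflects one reading of the "strict" versus "ignoring" definite-article matching reported in the evaluation.

```python
# MADA prediction for wAlmswqwn, reduced to the features used for matching.
mada = {"pos": "noun", "num": "p", "gen": "m",
        "prc0": "Al_det", "prc2": "wa_conj"}

# Hand-built stand-in for the 12 guesser analyses shown above.
candidates = []
for conj, art, base in [(False, False, "wAlmswq"),
                        (True, False, "Almswq"),
                        (True, True, "mswq")]:
    for pos in ("adj", "noun"):
        # masc+pl+nom reading: the -wn ending is treated as a suffix
        candidates.append({"conj": conj, "art": art, "pos": pos,
                           "lemma": base, "num": "p"})
        # sg reading: the -wn ending is treated as part of the stem
        candidates.append({"conj": conj, "art": art, "pos": pos,
                           "lemma": base + "wn", "num": "s"})


def agrees(candidate, mada, strict_article=True):
    """True if a guesser analysis is compatible with the MADA features."""
    if candidate["pos"] != mada["pos"] or candidate["num"] != mada["num"]:
        return False
    if candidate["conj"] != (mada["prc2"] == "wa_conj"):
        return False
    if strict_article and candidate["art"] != (mada["prc0"] == "Al_det"):
        return False
    return True


strict = [c for c in candidates if agrees(c, mada)]
relaxed = [c for c in candidates if agrees(c, mada, strict_article=False)]

if __name__ == "__main__":
    print("strict :", [c["lemma"] for c in strict])
    print("relaxed:", [c["lemma"] for c in relaxed])
```

With strict matching only the analysis with lemma `mswq` survives; relaxing the article check also lets `Almswq` through, so a further tie-breaking decision would be needed in that mode.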
Methodology
Results
• Corpus size is 1,089,111,204 tokens, 7,348,173
  types
• Unknown Types in the corpus: 2,116,180 (29%)
• After spell checking, correctly spelt types are
  208,188
• Types with frequency of 10 or more: 40,277
• After lemmatization: 18,399 types
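To see how sharply each filtering step narrows the list, the counts above can be turned into stage-to-stage ratios. This is a plain arithmetic sketch using only the figures on the slide; the stage labels and the function name are illustrative, and the last ratio compares a lemma count with a type count, so it is not a pass rate.

```python
corpus_types = 7_348_173

# (stage label, count) in the order the slide reports them
funnel = [
    ("unknown types", 2_116_180),
    ("correctly spelt", 208_188),
    ("frequency >= 10", 40_277),
    ("after lemmatization", 18_399),
]


def funnel_ratios(total, stages):
    """Pair each stage with its share of the preceding stage's count."""
    rows = []
    previous = total
    for name, count in stages:
        rows.append((name, count, count / previous))
        previous = count
    return rows


if __name__ == "__main__":
    for name, count, ratio in funnel_ratios(corpus_types, funnel):
        print(f"{name:22s} {count:>10,d}  {ratio:6.1%} of previous stage")
```

The first ratio reproduces the 29% share of unknown types reported above.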
Testing and Evaluation
We create a gold standard of 1,310 words
manually-annotated for:
• Gold lemma
• Gold POS
• Lexical relevance (include in a dictionary): yes or
  no
Gold POS    Type Count   Ratio
noun_prop   584          45%
noun        264          20%
adj         255          19%
verb        52           4%

Among unknown words,
- Proper nouns are the most common
- Verbs are the least common
Testing and Evaluation
Evaluating POS (accuracy)
• Baseline: The most frequent tag (proper name)
  for all unknown words: 45%
• MADA: 60%
• Voted POS Tagging: 69%. When a lemma gets a different POS tag
  with a higher frequency, we take the higher-frequency tag.

      POS tagging               Accuracy
  1   POS Tagging baseline      45%
  2   MADA POS tagging          60%
  3   Voted POS Tagging         69%
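Voted POS tagging can be sketched as a per-lemma majority vote. The slides do not spell out the mechanics, so the sketch assumes that each tagged occurrence of a word form contributes one vote for its lemma; the function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def voted_pos(tagged_lemmas):
    """Map each lemma to the POS tag it received most often.

    tagged_lemmas: iterable of (lemma, pos) pairs, one per tagged occurrence.
    """
    votes = defaultdict(Counter)
    for lemma, pos in tagged_lemmas:
        votes[lemma][pos] += 1
    return {lemma: counts.most_common(1)[0][0]
            for lemma, counts in votes.items()}


if __name__ == "__main__":
    # Hypothetical tagger output for occurrences of two lemmas.
    sample = [("mswq", "noun"), ("mswq", "adj"), ("mswq", "noun"),
              ("lndn", "noun_prop")]
    print(voted_pos(sample))
```

Pooling the evidence this way lets an occasional mis-tagged form be overruled by the more frequent tag of its sister forms.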
Testing and Evaluation
Evaluating Lemmatization (accuracy)
• Baseline: new words appear in their base form:
  45%
• Pipelined strict definite article ‘al’: 54%
• Pipelined ignoring definite article ‘al’: 63%
      Lemmatization                                      Accuracy
  1   Lemma first-order baseline                         45%
  2   Pipelined lemmatization (first-order decision)     54%
      with strict definite article matching
  3   Pipelined lemmatization (first-order decision)     63%
      ignoring definite article matching
Testing and Evaluation
Evaluating Lemma Weighting
• The weighting criteria aim to push lexicographically
  relevant words up the list and less interesting words down.
• We aim to make the number of important words high in the
  top 100 and low in the bottom 100

Word Weight = ((number of sister forms * 800) +
               frequencies of sister forms) / 2 + POS factor

Good words                                 In top 100   In bottom 100
relying on Frequency alone (baseline)      63           50
relying on number of sister forms * 800    87           28
relying on POS factor                      58           30
using combined criteria                    78           15
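The word weight formula is easy to try out in code. The sketch below implements it as written on the slide and ranks two made-up lemmas; the POS factor values, the sample counts, and the reading of "frequencies of sister forms" as a summed frequency are all assumptions, since the slides do not give them.

```python
def word_weight(sister_forms, sister_freq_total, pos_factor):
    """((number of sister forms * 800) + frequencies of sister forms) / 2 + POS factor."""
    return (sister_forms * 800 + sister_freq_total) / 2 + pos_factor


# Hypothetical POS factors: the slides do not list the actual values.
POS_FACTOR = {"noun": 100, "adj": 50, "noun_prop": 0}

# Hypothetical entries: (lemma, POS, number of sister forms, summed frequency)
lemmas = [
    ("lndn", "noun_prop", 1, 3000),
    ("mswq", "noun", 4, 900),
]

ranked = sorted(
    lemmas,
    key=lambda e: word_weight(e[2], e[3], POS_FACTOR[e[1]]),
    reverse=True,
)

if __name__ == "__main__":
    for lemma, pos, forms, freq in ranked:
        print(lemma, pos, word_weight(forms, freq, POS_FACTOR[pos]))
```

In this made-up case a frequency-only ranking would put `lndn` first, while the combined weight promotes `mswq` because it occurs in more inflected forms, which is the behaviour the criteria are meant to encourage.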
Conclusion
• We develop a methodology for automatically extracting
  and lemmatizing unknown words in Arabic
• We pipeline a finite-state guesser with a machine
  learning tool for lemmatization
• We develop a weighting mechanism for predicting the
  relevance and importance of lemmas
• Out of 2,116,180 unknown words, we create a lexicon of
  18,399 lemmatized, POS-tagged and weighted entries.
