SlideShare una empresa de Scribd logo
1 de 30
Descargar para leer sin conexión
Language Detection Library
  - 99% over precision for 49 languages -

                 12/3/2010
    Nakatani Shuyo @ Cybozu Labs, Inc.
What languages are these?

sprogregistrering

språkgjenkjenning
What languages are these?

sprogregistrering
                    Danish

språkgjenkjenning
                    Norwegian
‫?‪What languages are these‬‬

‫الكشف عن اللغة‬
  ‫تشخیص زبان‬
‫زبان کی شناخت‬
What languages are these?

‫الكشف عن اللغة‬          Arabic


  ‫تشخیص زبان‬            Persian

‫زبان کی شناخت‬           Urdu
What languages are these?

Language Detection

言語判別
What languages are these?

Language Detection
                        English

言語判別                    Japanese
What’s “Language Detection”?
   Detect language in which texts are written
       also character code detection (excluded)
       alias: Language identification / Language guessing

         Japanese         English         Chinese

          German          Spanish          Italian

           Arabic          Hindi          Korean
Why Language Detection?
   Purpose
       For language of search criteria
            Query “Java” => Hit Chinese texts...
       For SPAM filter/Extract content filter
            To use language-specific information(punctuations,
             keywords)
   Usage
       Web search engine
            Apache Nutch bundles a language detection module
       Bulletin board
            Post in English, Japanese and Chinese
Methods
   The more languages, the more difficult
       Among languages with the same script
       Requires knowledge of scripts and languages
   A simple method:
       Matching with the dictionary in each language
            Huge dictionary(inflections, compound words)

   Our method:
       Calcurates language probabilities from features of
        spelling
            Naive Bayse with character n-gram
Existing Language Detection
   There are a few libraries of language
    detection.
       Usage was limited?
            For only web search?
            But all services will become global from now on!
       Building corpus/model is a expensive work.
            Requires knowledge of scripts and languages

   Few languages supported & low precision
       Almost 10 languages. Not including Asian ones
“Practical” Language Detection
   99% over precision
       90% is not practical. (100 of 1000 mistakes)
   50 languages supported
       European, Asian and so on
   Fast Detection
       Many documents available
   Output each language’s probability
       For multiple candidates
Language Detection Library for Java
   We developed a language detection library for
    Java.
       Generates the language profiles from training
        corpus
            Profile : the probabilities of all spellings in each language
       Returns the candidates and their probabilities for
        given texts
       49 languages supported
   Open Source (Apache License 2.0)
       http://code.google.com/p/language-detection/
Experiments
   Training
       49 languages from Wikipedia
            That can provide a test corpus of its language

   Test
       200 news articles of 49 languages
            Google News (24 languages)
            News sites in each language
            Crawling by RSS
Results (1)
     languages    #          precisions                 items
af   Afrikaans        200   199 (99.50%)    en=1, af=199
ar   Arabic           200   200 (100.00%)   ar=200
bg   Bulgarian        200   200 (100.00%)   bg=200
bn   Bengali          200   200 (100.00%)   bn=200
cs   Czech            200   200 (100.00%)   cs=200
da   Danish           200   179 (89.50%)    da=179, no=14, en=7
de   German           200   200 (100.00%)   de=200
el   Greek            200   200 (100.00%)   el=200
en   English          200   200 (100.00%)   en=200
es   Spanish          200   200 (100.00%)   es=200
fa   Persian          200   200 (100.00%)   fa=200
fi   Finnish          200   200 (100.00%)   fi=200
fr   French           200   200 (100.00%)   fr=200
gu   Gujarati         200   200 (100.00%)   gu=200
he   Hebrew           200   200 (100.00%)   he=200
hi   Hindi            200   200 (100.00%)   hi=200
hr   Croatian         200   200 (100.00%)   hr=200
hu   Hungarian        200   200 (100.00%)   hu=200
id   Indonesian       200   200 (100.00%)   id=200
it   Italian          200   200 (100.00%)   it=200
ja   Japanese         200   200 (100.00%)   ja=200
kn   Kannada          200   200 (100.00%)   kn=200
ko   Korean           200   200 (100.00%)   ko=200
mk   Macedonian       200   200 (100.00%)   mk=200
ml   Malayalam        200   200 (100.00%)   ml=200
Results (2)
      languages             #        precisions                    items
  mr Marathi                 200    200 (100.00%)   mr=200
  ne  Nepali                 200    200 (100.00%)   ne=200
  nl  Dutch                  200    200 (100.00%)   nl=200
  no Norwegian               200    199 (99.50%)    da=1, no=199
  pa  Punjabi                200    200 (100.00%)   pa=200
  pl  Polish                 200    200 (100.00%)   pl=200
  pt  Portuguese             200    200 (100.00%)   pt=200
  ro  Romanian               200    200 (100.00%)   ro=200
  ru  Russian                200    200 (100.00%)   ru=200
  sk  Slovak                 200    200 (100.00%)   sk=200
  so Somali                  200    200 (100.00%)   so=200
  sq  Albanian               200    200 (100.00%)   sq=200
  sv  Swedish                200    200 (100.00%)   sv=200
  sw Swahili                 200    200 (100.00%)   sw=200
  ta  Tamil                  200    200 (100.00%)   ta=200
  te  Telugu                 200    200 (100.00%)   te=200
  th  Thai                   200    200 (100.00%)   th=200
   tl Tagalog                200    200 (100.00%)   tl=200
  tr  Turkish                200    200 (100.00%)   tr=200
  uk Ukrainian               200    200 (100.00%)   uk=200
  ur  Urdu                   200    200 (100.00%)   ur=200
  vi  Vietnamese             200    200 (100.00%)   vi=200
zh-cn Simplified Chinese     200    200 (100.00%)   zh-cn=200
zh-tw Traditional Chinese    200    200 (100.00%)   zh-tw=200
         sum                9800   9777 (99.77%)
Algorithms
Language Detection with Naive Bayes
   Classifies documents into “language”
    categories
       Categories: English, Japanese, Chinese, …
   Updates the posterior probabilities of
    categories by feature probabilities in each
    category
       ������ ������k ������    (m+1)    ∝ ������ ������k ������    m   ⋅ ������ ������������ ������������
            where ������k :category, ������:document, ������������ :feature of document
       Terminates detection process if the maximum
        probability(normalized) is over 0.99999
                 Early termination for perfomance
Features of Language Detection
   Character n-gram
       To be exact, “Unicode’s codepoint n-gram”
       Much less than the size of words
                                              Separator of
                                                 words



        □T h i s □
                T     h     i     s        ←1-gram

         □T    Th     hi   is    s□        ←2-gram

               □Th   Thi   his   is□       ←3-gram
How to detect the text’s language
   Each language has the peculiar characters and spelling rule.
        The accented “é” is used in Spanish, Italian and so on, and not
         used in English in principle.
        The word that starts with “Z” is often used in German and rarely
         used in English.
        The word that starts with “C” and contains spell “Th” are used in
         English and not used in German.
   Accumulates the probabilities assigned to these features in
    given text, so the guessed language is obtained as one that
    has the maximum probability.
                               □C     □L      □Z     Th

                     English   0.75   0.47   0.02   0.74
                     German    0.10   0.37   0.53   0.03
                     French    0.38   0.69   0.01   0.01
Improvement for Naive Detection
   The above naive algorithm can detect only
    90% precision.
       Not “practical”
       Very low precision for some languages
            Japanese, Traditional Chinese, Russian, Persian, ...

   Cause:
       Bias and noise of training and test corpus
   Improvement
       Noise filter
       Character normalization
(1) Bias of Characters
   Alphabet / Arabic / Devanagari
       About 30 characters
   Kanji (Chinese character)
       20000 characters over!
            1000 times as much as Alphabets

   Kanji has “zero frequency problem”
       Can’t detect language of “谢谢”(Simplified
        Chinese)
            This character isn’t used on Wikipedia.
       Name Kanji (uneven frequency)
Normalization with “Joyo Kanji”
   Classifies “similar frequency Kanji” and normalizes
    each cluster into a representative Kanji.
       (1) Clustering by K-means
       (2) Classification by “Joyo Kanji”
            Joyo Kanji (常用漢字: regularly used Kanji)
            Simplified Chinese: “现代汉语常用字表”(3500 characters)
            Traditional Chinese: the first standard of Big5 (5401
             characters). It includes “常用国字標準字体表” (4808
             characters)
            Japanese: Joyo Kanji(2136 characters) + the first standard of
             JIS X 0208 (2965 characters) = 2998 characters
       130 clusters
            Each language has about 50 classes.
(2) Noise of Corpus
   Removes the language-independent characters
        Numeric figures, symbols, URLs and mail addresses
   Latin character noise in non-Latin text
        Alphabets often occur in also non-Latin text.
             Java, WWW, US and so on
        Remove all Latin-characters if their rate is less than 20%.
   Latin character noise in Latin text
        Acronyms, person’s names and place names don’t represent
         feature of languages.
             UNESCO, “New York” in French text
             Person’s name has a various language feature (e.g. Mc- = Gaelic).
        Removes all-capital words
        Reduces the effect of local features by the feature-sampling
Normalization of Arabic Character
   All Persian texts were detected as Arabic!
       Persian and Arabic belong to different language families, so
        it ought to be easy to discriminate them.
   A high-frequency character “yeh” is assigned to
    different codes in training and test corpora respectively.
       In the training corpus (Wikipedia), it is assigned to “‫”ی‬
        (¥u06cc, Farsi yeh).
       In the test corpus (News), it is “‫¥( ”ي‬u064a, Arabic yeh).
            Cause: Arabic character-code CP-1256 don’t has the character
             mapped to ¥u06cc, so it is substituted to ¥u064a in a general way.

   Normalizes ¥u06cc(Farsi yeh) into ¥u064a(Arabic yeh)
       All Persian texts are detected correctly.
Conclusion
Conclusion
   We developed the language detection
    library for Java.
       49 languages can be detected in 99.8%
        precision.
       Our next product will use it (search by
        language).
   90% is easy. But 99% over is practical.
       Ideal: Answer from the novel beautiful theory
       Real: Unrefined steady way all along
Open Issues
   Short text (e.g. twitter)
   Arabic vowel signs
   Text in more than one language
   Source code in text
References
   [Habash 2009] Introduction to Arabic Natual Language Processing
        http://www.medar.info/conference_all/2009/Tutorial_1.pdf
   [Dunning 1994] Statistical Identification of Language
   [Sibun & Reynar 1996] Language identification: Examining the issues
   [Martins+ 2005] Language Identification in Web Pages
   千野栄一編「世界のことば100語辞典 ヨーロッパ編」
   町田和彦編「図説 世界の文字とことば」
   世界の文字研究会「世界の文字の図典」
   町田和彦「ニューエクスプレス ヒンディー語」
   中村公則「らくらくペルシャ語 文法から会話」
   道広勇司「アラビア系文字の基礎知識」
        http://moji.gr.jp/script/arabic/article01.html
   北研二,辻井潤一「確率的言語モデル」
Thank you for reading


   Language Detection Project Home
        http://code.google.com/p/language-detection/
   blog (in Japanese)
        http://d.hatena.ne.jp/n_shuyo/
   twitter
        http://twitter.com/shuyo

Más contenido relacionado

La actualidad más candente

bag-of-words models
bag-of-words models bag-of-words models
bag-of-words models Xiaotao Zou
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionSri Ambati
 
Language modelling and its use cases
Language modelling and its use casesLanguage modelling and its use cases
Language modelling and its use casesKhrystyna Skopyk
 
Flutter + tensor flow lite = awesome sauce
Flutter + tensor flow lite = awesome sauceFlutter + tensor flow lite = awesome sauce
Flutter + tensor flow lite = awesome sauceAmit Sharma
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer modelsDing Li
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsGabriel Moreira
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introductionRobert Lujo
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Jeong-Gwan Lee
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesankit_ppt
 
지적 대화를 위한 깊고 넓은 딥러닝 PyCon APAC 2016
지적 대화를 위한 깊고 넓은 딥러닝 PyCon APAC 2016지적 대화를 위한 깊고 넓은 딥러닝 PyCon APAC 2016
지적 대화를 위한 깊고 넓은 딥러닝 PyCon APAC 2016Taehoon Kim
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)Marina Santini
 
∞-gram を使った短文言語判定
∞-gram を使った短文言語判定∞-gram を使った短文言語判定
∞-gram を使った短文言語判定Shuyo Nakatani
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You NeedDaiki Tanaka
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with PythonBenjamin Bengfort
 
Fasttext 20170720 yjy
Fasttext 20170720 yjyFasttext 20170720 yjy
Fasttext 20170720 yjy재연 윤
 
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)Myungjin Lee
 

La actualidad más candente (20)

bag-of-words models
bag-of-words models bag-of-words models
bag-of-words models
 
Building, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for ProductionBuilding, Evaluating, and Optimizing your RAG App for Production
Building, Evaluating, and Optimizing your RAG App for Production
 
Language modelling and its use cases
Language modelling and its use casesLanguage modelling and its use cases
Language modelling and its use cases
 
Word embedding
Word embedding Word embedding
Word embedding
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
Flutter + tensor flow lite = awesome sauce
Flutter + tensor flow lite = awesome sauceFlutter + tensor flow lite = awesome sauce
Flutter + tensor flow lite = awesome sauce
 
Natural language processing and transformer models
Natural language processing and transformer modelsNatural language processing and transformer models
Natural language processing and transformer models
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
Natural language processing (NLP) introduction
Natural language processing (NLP) introductionNatural language processing (NLP) introduction
Natural language processing (NLP) introduction
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)
 
Nlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniquesNlp toolkits and_preprocessing_techniques
Nlp toolkits and_preprocessing_techniques
 
지적 대화를 위한 깊고 넓은 딥러닝 PyCon APAC 2016
지적 대화를 위한 깊고 넓은 딥러닝 PyCon APAC 2016지적 대화를 위한 깊고 넓은 딥러닝 PyCon APAC 2016
지적 대화를 위한 깊고 넓은 딥러닝 PyCon APAC 2016
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
∞-gram を使った短文言語判定
∞-gram を使った短文言語判定∞-gram を使った短文言語判定
∞-gram を使った短文言語判定
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
NLP
NLPNLP
NLP
 
Fasttext 20170720 yjy
Fasttext 20170720 yjyFasttext 20170720 yjy
Fasttext 20170720 yjy
 
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
지식그래프 개념과 활용방안 (Knowledge Graph - Introduction and Use Cases)
 

Similar a Language Detection Library for Java

Short Text Language Detection with Infinity-Gram
Short Text Language Detection with Infinity-GramShort Text Language Detection with Infinity-Gram
Short Text Language Detection with Infinity-GramShuyo Nakatani
 
Sltu12
Sltu12Sltu12
Sltu12tihtow
 
From Programming to Modeling And Back Again
From Programming to Modeling And Back AgainFrom Programming to Modeling And Back Again
From Programming to Modeling And Back AgainMarkus Voelter
 
Segmenting dna sequence into words
Segmenting dna sequence into wordsSegmenting dna sequence into words
Segmenting dna sequence into wordsLiang Wang
 
Theory of computing
Theory of computingTheory of computing
Theory of computingRanjan Kumar
 
Korean-optimized Word Representations for Out of Vocabulary Problems caused b...
Korean-optimized Word Representations for Out of Vocabulary Problems caused b...Korean-optimized Word Representations for Out of Vocabulary Problems caused b...
Korean-optimized Word Representations for Out of Vocabulary Problems caused b...Seonghyun Kim
 
NLTK: Natural Language Processing made easy
NLTK: Natural Language Processing made easyNLTK: Natural Language Processing made easy
NLTK: Natural Language Processing made easyoutsider2
 
Natural Language Processing made easy
Natural Language Processing made easyNatural Language Processing made easy
Natural Language Processing made easyGopi Krishnan Nambiar
 
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsForward Gradient
 
Building a 3-gram model for Language Identification
Building a 3-gram model for Language IdentificationBuilding a 3-gram model for Language Identification
Building a 3-gram model for Language IdentificationKepa J. Rodriguez
 

Similar a Language Detection Library for Java (11)

Short Text Language Detection with Infinity-Gram
Short Text Language Detection with Infinity-GramShort Text Language Detection with Infinity-Gram
Short Text Language Detection with Infinity-Gram
 
Sltu12
Sltu12Sltu12
Sltu12
 
From Programming to Modeling And Back Again
From Programming to Modeling And Back AgainFrom Programming to Modeling And Back Again
From Programming to Modeling And Back Again
 
Segmenting dna sequence into words
Segmenting dna sequence into wordsSegmenting dna sequence into words
Segmenting dna sequence into words
 
Theory of computing
Theory of computingTheory of computing
Theory of computing
 
Korean-optimized Word Representations for Out of Vocabulary Problems caused b...
Korean-optimized Word Representations for Out of Vocabulary Problems caused b...Korean-optimized Word Representations for Out of Vocabulary Problems caused b...
Korean-optimized Word Representations for Out of Vocabulary Problems caused b...
 
NLTK: Natural Language Processing made easy
NLTK: Natural Language Processing made easyNLTK: Natural Language Processing made easy
NLTK: Natural Language Processing made easy
 
Sslis
SslisSslis
Sslis
 
Natural Language Processing made easy
Natural Language Processing made easyNatural Language Processing made easy
Natural Language Processing made easy
 
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
 
Building a 3-gram model for Language Identification
Building a 3-gram model for Language IdentificationBuilding a 3-gram model for Language Identification
Building a 3-gram model for Language Identification
 

Más de Shuyo Nakatani

画像をテキストで検索したい!(OpenAI CLIP) - VRC-LT #15
画像をテキストで検索したい!(OpenAI CLIP) - VRC-LT #15画像をテキストで検索したい!(OpenAI CLIP) - VRC-LT #15
画像をテキストで検索したい!(OpenAI CLIP) - VRC-LT #15Shuyo Nakatani
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networksShuyo Nakatani
 
無限関係モデル (続・わかりやすいパターン認識 13章)
無限関係モデル (続・わかりやすいパターン認識 13章)無限関係モデル (続・わかりやすいパターン認識 13章)
無限関係モデル (続・わかりやすいパターン認識 13章)Shuyo Nakatani
 
Memory Networks (End-to-End Memory Networks の Chainer 実装)
Memory Networks (End-to-End Memory Networks の Chainer 実装)Memory Networks (End-to-End Memory Networks の Chainer 実装)
Memory Networks (End-to-End Memory Networks の Chainer 実装)Shuyo Nakatani
 
人工知能と機械学習の違いって?
人工知能と機械学習の違いって?人工知能と機械学習の違いって?
人工知能と機械学習の違いって?Shuyo Nakatani
 
RとStanでクラウドセットアップ時間を分析してみたら #TokyoR
RとStanでクラウドセットアップ時間を分析してみたら #TokyoRRとStanでクラウドセットアップ時間を分析してみたら #TokyoR
RとStanでクラウドセットアップ時間を分析してみたら #TokyoRShuyo Nakatani
 
ドラえもんでわかる統計的因果推論 #TokyoR
ドラえもんでわかる統計的因果推論 #TokyoRドラえもんでわかる統計的因果推論 #TokyoR
ドラえもんでわかる統計的因果推論 #TokyoRShuyo Nakatani
 
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...Shuyo Nakatani
 
星野「調査観察データの統計科学」第3章
星野「調査観察データの統計科学」第3章星野「調査観察データの統計科学」第3章
星野「調査観察データの統計科学」第3章Shuyo Nakatani
 
星野「調査観察データの統計科学」第1&2章
星野「調査観察データの統計科学」第1&2章星野「調査観察データの統計科学」第1&2章
星野「調査観察データの統計科学」第1&2章Shuyo Nakatani
 
言語処理するのに Python でいいの? #PyDataTokyo
言語処理するのに Python でいいの? #PyDataTokyo言語処理するのに Python でいいの? #PyDataTokyo
言語処理するのに Python でいいの? #PyDataTokyoShuyo Nakatani
 
Zipf? (ジップ則のひみつ?) #DSIRNLP
Zipf? (ジップ則のひみつ?) #DSIRNLPZipf? (ジップ則のひみつ?) #DSIRNLP
Zipf? (ジップ則のひみつ?) #DSIRNLPShuyo Nakatani
 
ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickh...
ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickh...ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickh...
ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickh...Shuyo Nakatani
 
ソーシャルメディアの多言語判定 #SoC2014
ソーシャルメディアの多言語判定 #SoC2014ソーシャルメディアの多言語判定 #SoC2014
ソーシャルメディアの多言語判定 #SoC2014Shuyo Nakatani
 
猫に教えてもらうルベーグ可測
猫に教えてもらうルベーグ可測猫に教えてもらうルベーグ可測
猫に教えてもらうルベーグ可測Shuyo Nakatani
 
アラビア語とペルシャ語の見分け方 #DSIRNLP 5
アラビア語とペルシャ語の見分け方 #DSIRNLP 5アラビア語とペルシャ語の見分け方 #DSIRNLP 5
アラビア語とペルシャ語の見分け方 #DSIRNLP 5Shuyo Nakatani
 
どの言語でつぶやかれたのか、機械が知る方法 #WebDBf2013
どの言語でつぶやかれたのか、機械が知る方法 #WebDBf2013どの言語でつぶやかれたのか、機械が知る方法 #WebDBf2013
どの言語でつぶやかれたのか、機械が知る方法 #WebDBf2013Shuyo Nakatani
 
Active Learning 入門
Active Learning 入門Active Learning 入門
Active Learning 入門Shuyo Nakatani
 
数式を綺麗にプログラミングするコツ #spro2013
数式を綺麗にプログラミングするコツ #spro2013数式を綺麗にプログラミングするコツ #spro2013
数式を綺麗にプログラミングするコツ #spro2013Shuyo Nakatani
 
ノンパラベイズ入門の入門
ノンパラベイズ入門の入門ノンパラベイズ入門の入門
ノンパラベイズ入門の入門Shuyo Nakatani
 

Más de Shuyo Nakatani (20)

画像をテキストで検索したい!(OpenAI CLIP) - VRC-LT #15
画像をテキストで検索したい!(OpenAI CLIP) - VRC-LT #15画像をテキストで検索したい!(OpenAI CLIP) - VRC-LT #15
画像をテキストで検索したい!(OpenAI CLIP) - VRC-LT #15
 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks
 
無限関係モデル (続・わかりやすいパターン認識 13章)
無限関係モデル (続・わかりやすいパターン認識 13章)無限関係モデル (続・わかりやすいパターン認識 13章)
無限関係モデル (続・わかりやすいパターン認識 13章)
 
Memory Networks (End-to-End Memory Networks の Chainer 実装)
Memory Networks (End-to-End Memory Networks の Chainer 実装)Memory Networks (End-to-End Memory Networks の Chainer 実装)
Memory Networks (End-to-End Memory Networks の Chainer 実装)
 
人工知能と機械学習の違いって?
人工知能と機械学習の違いって?人工知能と機械学習の違いって?
人工知能と機械学習の違いって?
 
RとStanでクラウドセットアップ時間を分析してみたら #TokyoR
RとStanでクラウドセットアップ時間を分析してみたら #TokyoRRとStanでクラウドセットアップ時間を分析してみたら #TokyoR
RとStanでクラウドセットアップ時間を分析してみたら #TokyoR
 
ドラえもんでわかる統計的因果推論 #TokyoR
ドラえもんでわかる統計的因果推論 #TokyoRドラえもんでわかる統計的因果推論 #TokyoR
ドラえもんでわかる統計的因果推論 #TokyoR
 
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
 
星野「調査観察データの統計科学」第3章
星野「調査観察データの統計科学」第3章星野「調査観察データの統計科学」第3章
星野「調査観察データの統計科学」第3章
 
星野「調査観察データの統計科学」第1&2章
星野「調査観察データの統計科学」第1&2章星野「調査観察データの統計科学」第1&2章
星野「調査観察データの統計科学」第1&2章
 
言語処理するのに Python でいいの? #PyDataTokyo
言語処理するのに Python でいいの? #PyDataTokyo言語処理するのに Python でいいの? #PyDataTokyo
言語処理するのに Python でいいの? #PyDataTokyo
 
Zipf? (ジップ則のひみつ?) #DSIRNLP
Zipf? (ジップ則のひみつ?) #DSIRNLPZipf? (ジップ則のひみつ?) #DSIRNLP
Zipf? (ジップ則のひみつ?) #DSIRNLP
 
ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickh...
ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickh...ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickh...
ACL2014 Reading: [Zhang+] "Kneser-Ney Smoothing on Expected Count" and [Pickh...
 
ソーシャルメディアの多言語判定 #SoC2014
ソーシャルメディアの多言語判定 #SoC2014ソーシャルメディアの多言語判定 #SoC2014
ソーシャルメディアの多言語判定 #SoC2014
 
猫に教えてもらうルベーグ可測
猫に教えてもらうルベーグ可測猫に教えてもらうルベーグ可測
猫に教えてもらうルベーグ可測
 
アラビア語とペルシャ語の見分け方 #DSIRNLP 5
アラビア語とペルシャ語の見分け方 #DSIRNLP 5アラビア語とペルシャ語の見分け方 #DSIRNLP 5
アラビア語とペルシャ語の見分け方 #DSIRNLP 5
 
どの言語でつぶやかれたのか、機械が知る方法 #WebDBf2013
どの言語でつぶやかれたのか、機械が知る方法 #WebDBf2013どの言語でつぶやかれたのか、機械が知る方法 #WebDBf2013
どの言語でつぶやかれたのか、機械が知る方法 #WebDBf2013
 
Active Learning 入門
Active Learning 入門Active Learning 入門
Active Learning 入門
 
数式を綺麗にプログラミングするコツ #spro2013
数式を綺麗にプログラミングするコツ #spro2013数式を綺麗にプログラミングするコツ #spro2013
数式を綺麗にプログラミングするコツ #spro2013
 
ノンパラベイズ入門の入門
ノンパラベイズ入門の入門ノンパラベイズ入門の入門
ノンパラベイズ入門の入門
 

Último

Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Último (20)

Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

Language Detection Library for Java

  • 1. Language Detection Library - 99% over precision for 49 languages - 12/3/2010 Nakatani Shuyo @ Cybozu Labs, Inc.
  • 2. What languages are these? sprogregistrering språkgjenkjenning
  • 3. What languages are these? sprogregistrering Danish språkgjenkjenning Norwegian
  • 4. ‫?‪What languages are these‬‬ ‫الكشف عن اللغة‬ ‫تشخیص زبان‬ ‫زبان کی شناخت‬
  • 5. What languages are these? ‫الكشف عن اللغة‬ Arabic ‫تشخیص زبان‬ Persian ‫زبان کی شناخت‬ Urdu
  • 6. What languages are these? Language Detection 言語判別
  • 7. What languages are these? Language Detection English 言語判別 Japanese
  • 8. What’s “Language Detection”?  Detect language in which texts are written  also character code detection (excluded)  alias: Language identification / Language guessing Japanese English Chinese German Spanish Italian Arabic Hindi Korean
  • 9. Why Language Detection?  Purpose  For language of search criteria  Query “Java” => Hit Chinese texts...  For SPAM filter/Extract content filter  To use language-specific information(punctuations, keywords)  Usage  Web search engine  Apache Nutch bundles a language detection module  Bulletin board  Post in English, Japanese and Chinese
  • 10. Methods  The more languages, the more difficult  Among languages with the same script  Requires knowledge of scripts and languages  A simple method:  Matching with the dictionary in each language  Huge dictionary(inflections, compound words)  Our method:  Calcurates language probabilities from features of spelling  Naive Bayse with character n-gram
  • 11. Existing Language Detection  There are a few libraries of language detection.  Usage was limited?  For only web search?  But all services will become global from now on!  Building corpus/model is a expensive work.  Requires knowledge of scripts and languages  Few languages supported & low precision  Almost 10 languages. Not including Asian ones
  • 12. “Practical” Language Detection  99% over precision  90% is not practical. (100 of 1000 mistakes)  50 languages supported  European, Asian and so on  Fast Detection  Many documents available  Output each language’s probability  For multiple candidates
  • 13. Language Detection Library for Java  We developed a language detection library for Java.  Generates the language profiles from training corpus  Profile : the probabilities of all spellings in each language  Returns the candidates and their probabilities for given texts  49 languages supported  Open Source (Apache License 2.0)  http://code.google.com/p/language-detection/
  • 14. Experiments  Training  49 languages from Wikipedia  That can provide a test corpus of its language  Test  200 news articles of 49 languages  Google News (24 languages)  News sites in each language  Crawling by RSS
  • 15. Results (1) languages # precisions items af Afrikaans 200 199 (99.50%) en=1, af=199 ar Arabic 200 200 (100.00%) ar=200 bg Bulgarian 200 200 (100.00%) bg=200 bn Bengali 200 200 (100.00%) bn=200 cs Czech 200 200 (100.00%) cs=200 da Danish 200 179 (89.50%) da=179, no=14, en=7 de German 200 200 (100.00%) de=200 el Greek 200 200 (100.00%) el=200 en English 200 200 (100.00%) en=200 es Spanish 200 200 (100.00%) es=200 fa Persian 200 200 (100.00%) fa=200 fi Finnish 200 200 (100.00%) fi=200 fr French 200 200 (100.00%) fr=200 gu Gujarati 200 200 (100.00%) gu=200 he Hebrew 200 200 (100.00%) he=200 hi Hindi 200 200 (100.00%) hi=200 hr Croatian 200 200 (100.00%) hr=200 hu Hungarian 200 200 (100.00%) hu=200 id Indonesian 200 200 (100.00%) id=200 it Italian 200 200 (100.00%) it=200 ja Japanese 200 200 (100.00%) ja=200 kn Kannada 200 200 (100.00%) kn=200 ko Korean 200 200 (100.00%) ko=200 mk Macedonian 200 200 (100.00%) mk=200 ml Malayalam 200 200 (100.00%) ml=200
  • 16. Results (2) languages # precisions items mr Marathi 200 200 (100.00%) mr=200 ne Nepali 200 200 (100.00%) ne=200 nl Dutch 200 200 (100.00%) nl=200 no Norwegian 200 199 (99.50%) da=1, no=199 pa Punjabi 200 200 (100.00%) pa=200 pl Polish 200 200 (100.00%) pl=200 pt Portuguese 200 200 (100.00%) pt=200 ro Romanian 200 200 (100.00%) ro=200 ru Russian 200 200 (100.00%) ru=200 sk Slovak 200 200 (100.00%) sk=200 so Somali 200 200 (100.00%) so=200 sq Albanian 200 200 (100.00%) sq=200 sv Swedish 200 200 (100.00%) sv=200 sw Swahili 200 200 (100.00%) sw=200 ta Tamil 200 200 (100.00%) ta=200 te Telugu 200 200 (100.00%) te=200 th Thai 200 200 (100.00%) th=200 tl Tagalog 200 200 (100.00%) tl=200 tr Turkish 200 200 (100.00%) tr=200 uk Ukrainian 200 200 (100.00%) uk=200 ur Urdu 200 200 (100.00%) ur=200 vi Vietnamese 200 200 (100.00%) vi=200 zh-cn Simplified Chinese 200 200 (100.00%) zh-cn=200 zh-tw Traditional Chinese 200 200 (100.00%) zh-tw=200 sum 9800 9777 (99.77%)
  • 18. Language Detection with Naive Bayes  Classifies documents into “language” categories  Categories: English, Japanese, Chinese, …  Updates the posterior probabilities of categories by feature probabilities in each category  ������ ������k ������ (m+1) ∝ ������ ������k ������ m ⋅ ������ ������������ ������������  where ������k :category, ������:document, ������������ :feature of document  Terminates detection process if the maximum probability(normalized) is over 0.99999  Early termination for perfomance
  • 19. Features of Language Detection  Character n-gram  To be exact, “Unicode’s codepoint n-gram”  Much less than the size of words Separator of words □T h i s □ T h i s ←1-gram □T Th hi is s□ ←2-gram □Th Thi his is□ ←3-gram
  • 20. How to detect the text’s language  Each language has the peculiar characters and spelling rule.  The accented “é” is used in Spanish, Italian and so on, and not used in English in principle.  The word that starts with “Z” is often used in German and rarely used in English.  The word that starts with “C” and contains spell “Th” are used in English and not used in German.  Accumulates the probabilities assigned to these features in given text, so the guessed language is obtained as one that has the maximum probability. □C □L □Z Th English 0.75 0.47 0.02 0.74 German 0.10 0.37 0.53 0.03 French 0.38 0.69 0.01 0.01
  • 21. Improvement for Naive Detection  The above naive algorithm can detect only 90% precision.  Not “practical”  Very low precision for some languages  Japanese, Traditional Chinese, Russian, Persian, ...  Cause:  Bias and noise of training and test corpus  Improvement  Noise filter  Character normalization
  • 22. (1) Bias of Characters  Alphabet / Arabic / Devanagari  About 30 characters  Kanji (Chinese character)  20000 characters over!  1000 times as much as Alphabets  Kanji has “zero frequency problem”  Can’t detect language of “谢谢”(Simplified Chinese)  This character isn’t used on Wikipedia.  Name Kanji (uneven frequency)
  • 23. Normalization with “Joyo Kanji”  Classifies “similar frequency Kanji” and normalizes each cluster into a representative Kanji.  (1) Clustering by K-means  (2) Classification by “Joyo Kanji”  Joyo Kanji (常用漢字: regularly used Kanji)  Simplified Chinese: “现代汉语常用字表”(3500 characters)  Traditional Chinese: the first standard of Big5 (5401 characters). It includes “常用国字標準字体表” (4808 characters)  Japanese: Joyo Kanji(2136 characters) + the first standard of JIS X 0208 (2965 characters) = 2998 characters  130 clusters  Each language has about 50 classes.
  • 24. (2) Noise of Corpus  Removes the language-independent characters  Numeric figures, symbols, URLs and mail addresses  Latin character noise in non-Latin text  Alphabets often occur in also non-Latin text.  Java, WWW, US and so on  Remove all Latin-characters if their rate is less than 20%.  Latin character noise in Latin text  Acronyms, person’s names and place names don’t represent feature of languages.  UNESCO, “New York” in French text  Person’s name has a various language feature (e.g. Mc- = Gaelic).  Removes all-capital words  Reduces the effect of local features by the feature-sampling
  • 25. Normalization of Arabic Character  All Persian texts were detected as Arabic!  Persian and Arabic belong to different language families, so it ought to be easy to discriminate them.  A high-frequency character “yeh” is assigned to different codes in training and test corpora respectively.  In the training corpus (Wikipedia), it is assigned to “‫”ی‬ (¥u06cc, Farsi yeh).  In the test corpus (News), it is “‫¥( ”ي‬u064a, Arabic yeh).  Cause: Arabic character-code CP-1256 don’t has the character mapped to ¥u06cc, so it is substituted to ¥u064a in a general way.  Normalizes ¥u06cc(Farsi yeh) into ¥u064a(Arabic yeh)  All Persian texts are detected correctly.
  • 27. Conclusion  We developed the language detection library for Java.  49 languages can be detected in 99.8% precision.  Our next product will use it (search by language).  90% is easy. But 99% over is practical.  Ideal: Answer from the novel beautiful theory  Real: Unrefined steady way all along
  • 28. Open Issues  Short text (e.g. twitter)  Arabic vowel signs  Text in more than one language  Source code in text
  • 29. References  [Habash 2009] Introduction to Arabic Natual Language Processing  http://www.medar.info/conference_all/2009/Tutorial_1.pdf  [Dunning 1994] Statistical Identification of Language  [Sibun & Reynar 1996] Language identification: Examining the issues  [Martins+ 2005] Language Identification in Web Pages  千野栄一編「世界のことば100語辞典 ヨーロッパ編」  町田和彦編「図説 世界の文字とことば」  世界の文字研究会「世界の文字の図典」  町田和彦「ニューエクスプレス ヒンディー語」  中村公則「らくらくペルシャ語 文法から会話」  道広勇司「アラビア系文字の基礎知識」  http://moji.gr.jp/script/arabic/article01.html  北研二,辻井潤一「確率的言語モデル」
  • 30. Thank you for reading  Language Detection Project Home  http://code.google.com/p/language-detection/  blog (in Japanese)  http://d.hatena.ne.jp/n_shuyo/  twitter  http://twitter.com/shuyo