NLP in the WILD
-or-
Building a System for
Text Language Identification
Vsevolod Dyomkin
12/2016
A Bit about Me
* Lisp programmer
* 5+ years of NLP work at Grammarly
* Occasional lecturer
https://vseloved.github.io
Roles
Langid Problem
* 150+ langs in Wikipedia
* >10 writing systems
(script/alphabet) in active use
* script-lang: 1:1, 1:2, 1:n, n:1 :)
* Latin >50 langs, Cyrillic >20
* Long texts – easy, short – hmm
* Internet texts (mixed langs)
* Small task => resource-constrained
Twitter Case Study
https://blog.twitter.com/2015/evaluating-language-identification-performance
Prior Art
* C++: https://github.com/CLD2Owners/cld2
* Python: https://github.com/saffsd/langid.py
* Java: https://github.com/shuyo/language-detection/
http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
http://lab.hypotheses.org/1083
http://labs.translated.net/language-identifier/
YALI WILD
* All of them use weak models
* Wanted to use Wiktionary —
150+ languages, always evolving
* Wanted to do in Lisp
Linguistics
(domain knowledge)
* Polyglots?
* ISO 639
* Internet lang bias
https://en.wikipedia.org/wiki/Languages_used_on_the_Internet
* Rule-based ideas:
- 1:1/1:2 scripts
- unique letters
* Per-script/per-lang segmentation
insight → data
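The rule-based ideas above (scripts that map to one language, letters unique to one language) can be sketched as follows. This is a minimal illustration in Python rather than the Lisp of the talk; the rule tables `SCRIPT_TO_LANG` and `UNIQUE_LETTERS` are hypothetical, deliberately tiny, and simplify the real script-to-language picture.

```python
import unicodedata
from collections import Counter

# Hypothetical, illustrative rule tables (assumptions, not from the talk).
SCRIPT_TO_LANG = {"GREEK": "el", "THAI": "th"}   # treated here as 1:1 scripts
UNIQUE_LETTERS = {"ї": "uk", "ß": "de"}          # treated here as telltale letters

def char_script(ch):
    """Coarse script label taken from the Unicode character name, None for non-letters."""
    if not ch.isalpha():
        return None
    # Names start with the script, e.g. "CYRILLIC SMALL LETTER A".
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None

def dominant_script(text):
    """Most frequent script among the letters of the text, or None."""
    counts = Counter(s for s in map(char_script, text) if s)
    return counts.most_common(1)[0][0] if counts else None

def rule_based_guess(text):
    """Return a language code when a rule fires, else None."""
    script = dominant_script(text)
    if script in SCRIPT_TO_LANG:
        return SCRIPT_TO_LANG[script]
    for ch in text.lower():
        if ch in UNIQUE_LETTERS:
            return UNIQUE_LETTERS[ch]
    return None  # no rule fired: hand over to a statistical model
```

The dominant script is also a natural key for per-script segmentation: only the languages written in that script need to be scored afterwards.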
Data
* evaluation data:
- smoke test
- in-/out-of-domain data
- precision-/recall-oriented
* training data
- where to get? Wikidata
- how to get? SAX parsing
Wiktionary
* good source for various
dictionaries and word lists (word
forms, definitions, synonyms,…)
* ~100 langs
Wikipedia
* >150 langs
* size? Wikipedia abstracts
* automation?
* filtering?
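SAX-style parsing streams a dump instead of loading it into memory, which matters for Wikipedia-sized files. A minimal sketch in Python (not the Lisp used in the talk), assuming a dump in which each article's text sits in an `<abstract>` element; the element name and the `min_chars` length filter are assumptions for illustration.

```python
import io
import xml.sax

class AbstractHandler(xml.sax.ContentHandler):
    """Collect the text of <abstract> elements, dropping very short ones."""

    def __init__(self, min_chars=20):
        super().__init__()
        self.min_chars = min_chars
        self.abstracts = []
        self._buf = None  # list of text chunks while inside <abstract>

    def startElement(self, name, attrs):
        if name == "abstract":
            self._buf = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "abstract" and self._buf is not None:
            text = "".join(self._buf).strip()
            if len(text) >= self.min_chars:  # crude filter for stubs
                self.abstracts.append(text)
            self._buf = None

def read_abstracts(stream, min_chars=20):
    """Parse a file name or binary stream and return the kept abstracts."""
    handler = AbstractHandler(min_chars)
    xml.sax.parse(stream, handler)
    return handler.abstracts

SAMPLE = b"""<feed>
<doc><title>Lisp</title><abstract>Lisp is a family of programming languages.</abstract></doc>
<doc><title>Stub</title><abstract>| stub</abstract></doc>
</feed>"""

print(read_abstracts(io.BytesIO(SAMPLE)))
```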
Alternatives
* API
(defun get-examples (word)
  ;; ^(... % ...), ? and fmt appear to be rutils shorthands
  ;; for lambda, key lookup and (format nil ...)
  (remove-if-not
   ^(upper-case-p (char % 0))
   (mapcar ^(substitute #\Space #\_ (? % "text"))
           (? (yason:parse
               (drakma:http-request
                (fmt "http://api.wordnik.com/v4/word.json/~A/examples"
                     (drakma:url-encode word :utf-8))
                :additional-headers *wordnik-auth-headers*))
              "examples"))))
* Web scraping
(defmethod scrape ((site (eql :linguaholic)) source)
  (match-html source
              '(>> article
                (aside (>> a ($ user))
                       (>> li (strong "Native Tongue:") ($ lang)))
                (div |...| (>> (div :data-role "commentContent")
                               ($ text) (span) |...|))
                !!!)))
Research
(quality)
* Simple task => simple models (NB)
* Challenges
- short texts
- mixed langs
- 90% of data - cryptic
ideas → experiments
Naive Bayes
* Features: 3-/4-char ngrams
* Improvement ideas:
- add words (word unigrams)
- factor in word lengths
- use Internet lang bias
Formula:
(argmax (* (? priors lang)
           (or (? word-probs word)
               (norm (reduce '* ^(? 3g-probs %)
                             (word-3gs word)))))
        langs)
http://www.paulgraham.com/spam.html
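The formula can be read as: for each word, use its word probability if the word is known, otherwise fall back to a normalized product of its character-trigram probabilities, then weight by the language prior. A minimal Python sketch of that reading (not the talk's Lisp implementation). Three choices here are assumptions: add-`alpha` smoothing, boundary padding of words, and treating `norm` as a geometric mean over the trigrams.

```python
import math
from collections import Counter, defaultdict

def word_3gs(word):
    """Character trigrams of a word, padded with boundary markers (an assumption)."""
    padded = f"^{word}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

class NgramNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # add-alpha smoothing (an assumption)
        self.word_counts = defaultdict(Counter)   # lang -> word -> count
        self.gram_counts = defaultdict(Counter)   # lang -> trigram -> count
        self.gram_vocab = set()

    def train(self, lang, text):
        for word in text.lower().split():
            self.word_counts[lang][word] += 1
            for gram in word_3gs(word):
                self.gram_counts[lang][gram] += 1
                self.gram_vocab.add(gram)

    def word_logprob(self, lang, word):
        words = self.word_counts[lang]
        if word in words:                         # known word: word unigram probability
            return math.log(words[word] / sum(words.values()))
        grams = self.gram_counts[lang]            # unknown word: trigram fallback
        denom = sum(grams.values()) + self.alpha * len(self.gram_vocab)
        word_grams = word_3gs(word)
        logp = sum(math.log((grams[g] + self.alpha) / denom) for g in word_grams)
        return logp / len(word_grams)             # "norm" read as a geometric mean

    def classify(self, text, priors):
        """argmax over languages of log prior + sum of per-word log scores."""
        def score(lang):
            return math.log(priors[lang]) + sum(
                self.word_logprob(lang, w) for w in text.lower().split())
        return max(priors, key=score)

model = NgramNB()
model.train("en", "the cat sat on the mat with the hat")
model.train("de", "die katze sitzt auf der matte mit dem hut")
priors = {"en": 0.5, "de": 0.5}   # an Internet language bias would go here
print(model.classify("katzen", priors))
```

Working with sums of logarithms instead of the product in the formula avoids underflow on longer texts.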
Experiments
* Usual ML setup (70:30) doesn't
work here
* “If you torment the data too
much...” (~c) Yaser Abu-Mostafa
* Comparison with existing systems
helps
Confusion Matrix
AB: 0.90 | FR:0.10
AF: 0.80 | EN:0.20
AK: 0.80 | NN:0.10 IT:0.10
AN: 0.90 | ES:0.10
AY: 0.90 | ES:0.10
BG: 0.60 | RU:0.40
BM: 0.80 | FR:0.10 LA:0.10
BS: 0.90 | EN:0.10
CO: 0.90 | IT:0.10
CR: 0.40 | FR:0.30 UND:0.20 MS:0.10
CS: 0.90 | IT:0.10
CU: 0.90 | VI:0.10
CV: 0.80 | RU:0.20
DA: 0.70 | FO:0.10 NO:0.10 NN:0.10
DV: 0.80 | UZ:0.10 EN:0.10
DZ: NIL | BO:0.80 IK:0.10 NE:0.10
EN: 0.90 | NL:0.10
ET: 0.80 | EN:0.20
FF: 0.50 | EN:0.20 FR:0.10 EO:0.10 SV:0.10
FI: 0.80 | FR:0.10 DA:0.10
FJ: 0.90 | OC:0.10
GL: 0.90 | ES:0.10
HA: 0.80 | YO:0.10 EN:0.10
HR: 0.70 | BS:0.10 DE:0.10 GL:0.10
ID: 0.80 | MS:0.20
IE: 0.90 | EN:0.10
IG: 0.60 | EN:0.40
IO: 0.86 | DA:0.14
KG: 0.90 | SW:0.10
KL: 0.90 | EN:0.10
KS: 0.30 | UR:0.60 UND:0.10
KU: 0.90 | EN:0.10
KW: 0.89 | UND:0.11
LA: 0.90 | FR:0.10
LB: 0.90 | EN:0.10
LG: 0.90 | IT:0.10
LI: 0.80 | NL:0.20
MI: 0.90 | ES:0.10
MK: 0.80 | IT:0.10 RU:0.10
MS: 0.80 | ID:0.10 EN:0.10
MT: 0.90 | DE:0.10
NO: 0.90 | DA:0.10
NY: 0.80 | AR:0.10 SW:0.10
OM: 0.90 | EN:0.10
OS: 0.90 | RU:0.10
QU: 0.70 | ES:0.20 EN:0.10
RM: 0.90 | EN:0.10
RN: 0.50 | RW:0.40 YO:0.10
SC: 0.90 | FR:0.10
SG: 0.90 | FR:0.10
SR: 0.80 | HR:0.10 BS:0.10
SS: 0.50 | EN:0.30 DA:0.10 ZU:0.10
ST: 0.90 | PT:0.10
SV: 0.90 | DA:0.10
TI: 0.40 | AM:0.40 LA:0.10 EN:0.10
TK: 0.80 | TR:0.20
TO: 0.50 | EN:0.50
TS: 0.80 | EN:0.10 UZ:0.10
TW: 0.40 | EN:0.40 AK:0.10 YO:0.10
TY: 0.90 | ES:0.10
UG: 0.60 | UZ:0.40
UK: 0.80 | UND:0.10 VI:0.10
VE: 0.90 | EN:0.10
WO: 0.80 | NL:0.10 FR:0.10
XH: 0.80 | UZ:0.10 EN:0.10
YO: 0.80 | EN:0.20
ZU: 0.60 | XH:0.30 PT:0.10
Total quality: 0.90
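Rows in this shape (share of correct answers, then the languages each one was confused with) can be produced from pairs of gold and predicted labels. A small Python sketch; the function names are hypothetical and the output format only imitates the listing above.

```python
from collections import Counter, defaultdict

def confusion_rows(gold, predicted):
    """Map each gold language to a Counter of the labels predicted for it."""
    rows = defaultdict(Counter)
    for g, p in zip(gold, predicted):
        rows[g][p] += 1
    return rows

def format_row(lang, counts):
    """Render one row as 'XX: share-correct | YY:share ZZ:share'."""
    total = sum(counts.values())
    correct = counts.get(lang, 0) / total
    errors = " ".join(f"{p.upper()}:{n / total:.2f}"
                      for p, n in counts.most_common() if p != lang)
    return f"{lang.upper()}: {correct:.2f} | {errors}"

def accuracy(gold, predicted):
    """Overall share of correct predictions."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

gold = ["bg"] * 10
predicted = ["bg"] * 6 + ["ru"] * 4
rows = confusion_rows(gold, predicted)
print(format_row("bg", rows["bg"]))
```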
The Ladder of NLP
Rule-based
Linear ML
Decision Trees & co.
Sequence models
Artificial Neural networks
Better Models
What can be improved?
* Account for word order
* Discriminative models per script
* DeepLearning™ model
Marginal gain is not huge…
Engineer
(efficiency)
* Just a small piece
of the pipeline:
- good-enough speed
- minimize space usage
- minimize external dependencies
* Proper floating-point calculations
* Proper processing of big texts?
* Pre-/post-processing
* Clean API
implementation → optimization
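One common reading of "proper floating-point calculations" for probabilistic models is to work in log space: a product of many small probabilities underflows to zero, while the sum of their logarithms stays usable. A minimal Python sketch under that assumption; `normalize_log_scores` is a hypothetical helper that uses the log-sum-exp trick.

```python
import math

probs = [1e-5] * 100

# Naive product: underflows to exactly 0.0 in double precision.
product = 1.0
for p in probs:
    product *= p

# Log-space sum: stays finite and comparable across languages.
log_sum = sum(math.log(p) for p in probs)

def normalize_log_scores(log_scores):
    """Turn per-language log scores into probabilities via log-sum-exp."""
    m = max(log_scores.values())
    z = m + math.log(sum(math.exp(s - m) for s in log_scores.values()))
    return {lang: math.exp(s - z) for lang, s in log_scores.items()}

print(product, log_sum)
print(normalize_log_scores({"en": -1000.0, "de": -1001.0}))
```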
Model Optimization
Initial model size: ~1G
Target: ~10M :)
How to do it?
- Lossy compression: pruning
- Lossless compression:
Huffman coding, efficient DS
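Both compression steps can be illustrated on toy data. The Python sketch below prunes rare entries from a count table (lossy) and computes Huffman code lengths for a set of symbol frequencies (lossless). The slides do not say which parts of the model were pruned or Huffman-coded, so the inputs and function names here are purely illustrative.

```python
import heapq
from collections import Counter
from itertools import count

def prune(counts, min_count):
    """Lossy step: drop entries seen fewer than min_count times."""
    return Counter({k: v for k, v in counts.items() if v >= min_count})

def huffman_code_lengths(freqs):
    """Lossless step: code length in bits per symbol for a Huffman code."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    tie = count()  # unique tie-breaker so dicts are never compared
    heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

freqs = {"a": 5, "b": 2, "c": 1, "d": 1}
lengths = huffman_code_lengths(freqs)
total_bits = sum(freqs[s] * lengths[s] for s in freqs)
print(lengths, total_bits)
```

Frequent symbols get short codes, so the coded size drops below what a fixed-width encoding of the same symbols would need.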
API
* Levels of detail:
- text-langs
- word-langs
- window?
* UI: library, REPL & Web APIs
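The three levels of detail could look roughly like this. The Python sketch is hypothetical: the names echo the slide, but the signatures, the return shapes, and the toy per-word classifier are assumptions rather than the library's actual API.

```python
from collections import Counter

def toy_classify(word):
    """Stand-in per-word classifier: any Cyrillic letter means 'ru', else 'en'."""
    return "ru" if any("\u0400" <= c <= "\u04ff" for c in word) else "en"

def word_langs(text, classify_word=toy_classify):
    """Word level: one label per word."""
    return [(w, classify_word(w)) for w in text.split()]

def text_langs(text, classify_word=toy_classify):
    """Text level: share of words per language, most frequent first."""
    labels = [classify_word(w) for w in text.split()]
    if not labels:
        return {}
    return {lang: n / len(labels) for lang, n in Counter(labels).most_common()}

def window_langs(text, size=3, classify_word=toy_classify):
    """Window level: majority label over each sliding window of `size` words."""
    labels = [classify_word(w) for w in text.split()]
    if not labels:
        return []
    return [Counter(labels[i:i + size]).most_common(1)[0][0]
            for i in range(max(1, len(labels) - size + 1))]

print(word_langs("hello world привет"))
print(text_langs("hello world привет"))
```

A window-level view sits between the two: it smooths per-word noise while still exposing language switches inside mixed texts.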
Recap
* Triple view of any
knowledge-related problem
* Ladder of approaches to solving
NLP problems
* Importance of productive env:
general- & special-purpose REPL – lang API – access to data – efficient testing
* Main stages of problem solving:
data → experiment → implementation → optimization
