SlideShare una empresa de Scribd logo
1 de 85
Descargar para leer sin conexión
NLP Project
Full Circle
Vsevolod Dyomkin, @vseloved
Kyiv Dataspring 18, 2018-05-19
A Bit about Me
* 10+ Lisp programmer
* 5+ years of NLP work (Grammarly & my own)
* co-author of Projector NLP course
https://vseloved.github.io
Roles
NLP Project Stages
1) Domain analysis
2) Data preparation
3) Iterating the solution
4) Productizing the result
5) Gathering feedback
and returning to stages 1/2/3
1. Domain Analysis
1) Problem formulation
2) Existing solutions
3) Possible datasets
4) Identifying Challenges
5) Metrics, baseline & SOTA
6) Selecting possible
approaches and devising
a plan of attack
Domain:
text language identification
Problem Formulation
* Not an unsolved problem
* Short texts
* Support “all” languages
* The issue of UND/MULT texts
* Dialects?
* Encodings?
Interesting Examples
https://blog.twitter.com/2015/evaluating-language-
identification-performance
Existing Solutions
* https://github.com/shuyo/language-detection/
(Java)
* https://github.com/saffsd/langid.py (Python)
* https://github.com/mzsanford/cld (C++)
Existing Solutions
* https://github.com/shuyo/language-detection/
(Java)
* https://github.com/saffsd/langid.py (Python)
* https://github.com/mzsanford/cld (C++)
* https://github.com/CLD2Owners/cld2
* https://github.com/google/cld3
Existing Solutions
* https://github.com/shuyo/language-detection/
(Java)
* https://github.com/saffsd/langid.py (Python)
* https://github.com/mzsanford/cld (C++)
* https://github.com/CLD2Owners/cld2
* https://github.com/google/cld3
Possible Datasets
* Debian i18n (~90 langs)
Possible Datasets
* Debian i18n (~90 langs)
* TED Multilingual (109 langs)
Possible Datasets
* Debian i18n (~90 langs)
* TED Multilingual (109 langs)
* Wiktionary (~150 langs)
Possible Datasets
* Debian i18n (~90 langs)
* TED Multilingual (109 langs)
* Wiktionary (~150 langs)
* Wikipedia (175 langs)
Test Data
* Twitter evaluation dataset
* Datasets with fewer langs
* Extract from Wikipedia
+ smoke test
Challenges
* Linguistic challenges
- know next to nothing about
90% of the languages
- languages and scripts
http://www.omniglot.com/writing/langalph.htm
- language distributions
https://en.wikipedia.org/wiki/Languages_used_on_the_Internet#Content_languages_for_websites
- word segmentation?
* Engineering challenges
Metrics
NLP Evaluation
Intrinsic: use a gold-
standard dataset and some
metric to measure the system’s
performance directly.
(in-domain & out-of-domain)
Extrinsic: use a system as
part of an upstream task(s)
and measure performance change
there.
Dev-Test Split
f1 et al.
1)
f1 et al.
Confusion Matrix
AF: 1.00 |
DE: 1.00 |
EN: 1.00 |
ES: 0.94 | IT:0.06
NL: 0.85 | IT:0.03 CA:0.03 AF:0.03 DE:0.03 FR:0.03
RU: 0.82 | UK:0.12 EN:0.03 DA:0.02 FR:0.02
UK: 0.93 | TG:0.07
Total quality: 0.91
Evaluating NLG
1) Word Error Rate (WER)
2) BLEU, ROUGE, METEOR
Other Metrics
1) Cross-entropy
2) Perplexity
3) Custom: MaxMatch, smatch,
parseval, UAS/LAS…
4) Your own?
Baseline
Quality that can be achieved
using some reasonable “basic”
approach
(on the same data).
2 ways to measure improvement:
* quality improvement
* error reduction
Recent SNLI Example
https://arxiv.org/pdf/1803.02324.pdf
- majority class baseline: 0.34
- SOTA models: 0.87
Recent SNLI Example
https://arxiv.org/pdf/1803.02324.pdf
- majority class baseline: 0.34
- SOTA models: 0.87
- basic fasttext classifier: 0.67
State-of-the-Art
(SOTA)
The highest publicly known
result on a well-established
dataset.
https://aclweb.org/aclwiki/Sta
te_of_the_art
2. Getting Data
The Unreasonable
Effectiveness of Data
“Data is ten times more powerful than algorithms.”
-- Peter Norvig, http://youtu.be/yvDCzhbjYWs
https://twitter.com/shivon/status/864889085697024000
Uses of Data in NLP
* understanding the problem
* statistical analysis
* connecting separate domains
* evaluation data set
* training data set
* real-time feedback
* marketing/PR
* external competitions
* what else?
Data Workflow
Types of Data
* Structured
* Semi-structured
* Unstructured
Getting Data
* Get existing
* Create your own
* Acquire from users
Corpus
Annotated collection of docs
in a certain format.
Prominent Corpora
* National: OANC/MASC,
British (non-free)
* LDC(non-free): Penn Treebank,
OntoNotes, Web Treebank
* Books: Gutenberg, GoogleBooks
* Corporate: Reuters, Enron
* Research: SNLI, SQuAD
* Multilang: UDeps, Europarl
Corpora Pitfalls
* Requires licensing (often)
* Tied to a domain
* Annotation quality
* Requires processing
of custom formats (often)
DBs & KBs
* Wikimedia
* RDF knowledge bases (Freebase)
* Wordnet, Conceptnet, Babelnet
* Custom DBs
Dictionaries
* Wordlists (example: COCA)
* Dictionaries, grammar dicts
- Wiktionary
* Thesauri
Unstructured Data
* Internet
* CommonCrawl (also, NewsCrawl)
* UMBC, ClueWeb
Raw Text Pros & Cons
+ Can collect stats
=> build LMs, word vectors…
+ Can have a variety of domains
- … but hard to control the
distribution of domains
- Web artifacts
- Web noise
- Huge processing effort
Creating Own Data
* Scraping
* Annotation
* Crowdsourcing
* Getting from users
* Generating
Data Scraping
* Full-page web-scraping
(see: readability)
* Custom web-scraping
(see: scrapy, crawlik)
* Extracting from non-HTML
formats (.pdf, .doc…)
* Getting from API
(see: twitter, webhose)
Scraping Pros & Cons
+ possible to get needed domain
+ may be the only “fast” way
+ it’s the “google” way
- licensing issues
- getting volume requires time
- need to deal with antirobot
protection
- other engineering challenges
Corpus Annotation
* Initial data gathering
* Annotation guidelines
* Annotator selection
* Annotation tools
* Processing of raw annotations
* Result analysis & feedback
Annotation Variants
* professional
* mturk
* volunteer
Crowdsourcing
- Like annotation but harder
requires a custom solution
With gamification
+ If successful, can get
volumes and diversity
User Feedback Data
+ Your product, your domain
+ Real-time, allows adaptation
+ Ties into customer support
- Chicken & egg problem
- Requires anonymization
Approach: “lean startup”
Generating Data
+ Potentially unlimited volume
+ Control of the parameters
- Artificial (if you can code
a generation function,
why not code a reverse one?)
Data Best Practices
* Proper ML dataset handling
* Domain adequacy, diversity
* Inter-annotator agreement,
reasonable baselines
* Error analysis
* Real-time tracking
Data Tools
* (grep & co)
* other Shell powertools
* statistical analysis tools
+ plotting
* annotation tools
* web-scraping tools
* metrics databases (Graphite)
* Hadoop, Spark
3. Possible Approaches
1) Symbolical/rule-based
“The (Weak) Empire of Reason”
2) Counting-based
“The Empiricist Invasion”
3) Deep Learning-based
“The Revenge of the Spherical Cows”
http://www.earningmyturns.org/2017/06/
a-computational-linguistic-farce-in.ht
ml
Rule-based Approach
English Tokenization
(small problem)
* Simplest regex:
[^s]+
* More advanced regex:
w+|[!"#$%&'*+,./:;<=>?@^`~…() {}[|] «»“”‘’-]⟨⟩ ‒–—―
* Even more advanced regex:
[+-]?[0-9](?:[0-9,.]*[0-9])?
|[w@](?:[w'’`@-][w']|[w'][w@'’`-])*[w']?
|["#$%&*+,/:;<=>@^`~…() {}[|] «»“”‘’']⟨⟩ ‒–—―
|[.!?]+
|-+
Add post-processing:
* concatenate abbreviations and decimals
* split contractions with regexes
- 2-character abbreviations regex:
i['‘’`]m|(?:s?he|it)['‘’`]s|(?:i|you|s?he|we|they)['‘’`]d$
- 3-character abbreviations regex:
(?:i|you|s?he|we|they)['‘’`](?:ll|[vr]e)|n['‘’`]t$
https://github.com/lang-uk/tokenize-uk/blob/master/tokenize_uk/tokeni
ze_uk.py
Error Correction
(inherently rule-based)
LanguageTool:<category name=" " id="STYLE" type="style">Стиль
<rulegroup id="BILSH_WITH_ADJ" name=" ">Більш з прикметниками
<rule>
<pattern>
<token> </token>більш
<token postag_regexp="yes" postag="ad[jv]:.*compc.*">
<exception> </exception>менш
</token>
</pattern>
<message> « » </message>Після більш не може стояти вища форма прикметника
<!-- TODO: when we can bind comparative forms togher again
<suggestion><match no="2"/></suggestion>
<suggestion><match no="1"/> <match no="2" postag_regexp="yes"
postag="(.*):compc(.*)" postag_replace="$1:compb$2"/></suggestion>
<example correction=" | "><marker>Світліший Більш світлий Більш
</marker>.</example>світліший
-->
<example correction=""><marker> </marker>.</example>Більш світліший
<example> </example>все закінчилось більш менш
</rule>
...
https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modu
les/uk/src/main/resources/org/languagetool/rules/uk/
Dialog Systems
(inherently scripted)
ELIZA:
(defparameter *eliza-rules*
'((((?* ?x) hello (?* ?y))
(How do you do. Please state your problem.))
(((?* ?x) I want (?* ?y))
(What would it mean if you got ?y)
(Why do you want ?y) (Suppose you got ?y soon))
(((?* ?x) if (?* ?y))
(Do you really think its likely that ?y) (Do you wish that ?y)
(What do you think about ?y) (Really-- if ?y))
(((?* ?x) no (?* ?y))
(Why not?) (You are being a bit negative)
(Are you saying "NO" just to be negative?))
(((?* ?x) I was (?* ?y))
(Were you really?) (Perhaps I already knew you were ?y)
(Why do you tell me you were ?y now?))
(((?* ?x) I feel (?* ?y))
(Do you often feel ?y ?))
(((?* ?x) I felt (?* ?y))
(What other feelings do you have?))))
https://norvig.com/paip/eliza1.lisp
Fighting Spam
(just a bad idea :)
SpamAssasin:
# bodyn
# example: postacie greet! Chcesz porozmawiac Co prawda tu rzadko bywam,
zwykle pisze tu - http://hanna.3xa.info
uri __LOCAL_LINK_INFO /http://w{3,8}.www.info/
header __LOCAL_FROM_O2 From =~ /@o2.pl/
meta LOCAL_LINK_INFO __LOCAL_LINK_INFO && __LOCAL_FROM_O2
describe LOCAL_LINK_INFO Link postaci http://cos.cos.info i z o2.pl
score LOCAL_LINK_INFO 5
https://wiki.apache.org/spamassassin/CustomRulesets
(…which continues being
reimplemented)
Facebook antispam (using HAXL):
fpSpammer :: Haxl Bool
fpSpammer =
talkingAboutFP .&&
numFriends .> 100 .&&
friendsLikeCPlusPlus
where
talkingAboutFP =
strContains “Functional Programming” <$> postContent
friendsLikeCPlusPlus = do
friends <- getFriends
cppFriends <- filterM likesCPlusPlus friends
return (length cppFriends >= length friends `div` 2)
http://multicore.doc.ic.ac.uk/iPr0gram/slides/2015-2016/Marlow-fighting-s
pam.pdf
https://petrimazepa.com/m_li_ili_kak_algoritm_facebook_blokiruet_polzovat
elei_na_osnovanii_stop_slov
Languages for Rules
* regexes
Languages for Rules
* regexes
* XML, JSON
Languages for Rules
* regexes
* XML, JSON
* external DSLs (example: JBOSS drules)
Languages for Rules
* regexes
* XML, JSON
* external DSLs
* internal DSLs
(match-html source
'(>> article
(aside (>> a ($ user))
(>> li (strong "Native Tongue:") ($ lang)))
(div |...| (>> (div :data-role "commentContent"
($ text) (span) |...|))
!!!))
Rules Pros & Cons
+ compact & fast
+ full control
+ arbitrary recall
+ iterative
+ best interpetability
+ perfectly accommodate domain experts
- precision ceiling
- non-optimal weights
- require a lot of human labor
- hard to interpret/calibrate score
Classic ML Approach
NLP Viewpoints
* Bag-of-words
* Sequence
* Tree
* Graph
NB Model for LangID
* The problem of using words
* Character ngrams to the rescue
* Combining them
Deep Learning Approach
BiLSTM+attention as SOTA
“Basically, if you want to do an
NLP task, no matter what it is,
what you should do is throw your
data into a Bi-directional
long-short term memory network,
and augment its information flow
with the attention mechanism.”
– Chris Manning
https://twitter.com/mayurbhangale
/status/988332845708886016
Hybrid Approach
“Rule-based” framework that
incorporates Machine Learning
models.
+ control
+ best of both worlds
- more complex
- not sexy
Bias
* Domain bias - systematic statistical
differences between “domains”, usually
in the form of vocab, class priors and
conditional probabilities
* Dataset bias - sampling bias in a
given dataset, giving rise to (often
unintentional) content bias
* Model bias - bias in the output of a
model wrt gold standard
* Social bias - model bias
towards/against particular socio-
demographic groups of users
Counteracting Bias
* Rule-based post-processing
* Mix multiple datasets
+ special feature-selection
* Train multiple models on
different datasets and
create an ensemble
* Use domain adaptation
techniques
4. Productization
Delivering your NLP
application to the users
Product variants
* end-user product
(web, mobile, desktop)
* internal feature
* API
* library
* scientific paper
Requirements
* speed (processing, startup)
* memory usage
* storage
* interoperability
* ease of use
* ease of update
Storage Optimization
WILD Initial model size ~ 1G
(compared to megabytes for CLD,
target: 10-20 MB)
Ways to reduce:
- partly rule-based
- model pruning
- clever algorithms: perfect hash
tables, Huffman coding
- model compression
https://arxiv.org/abs/1612.03651
Model Formats
* pickle & co.
Model Formats
* pickle & co.
* JSON, ProtoBuf, ...
Model Formats
* pickle & co.
* JSON, ProtoBuf, ...
* HDF5
Model Formats
* pickle & co.
* JSON, ProtoBuf, ...
* HDF5
* Custom gzipped text-based
Adaptation
(manual vs automatic)
* for RB systems: add more
rules :D
* for ML systems: online
learning algorithms
* for DL systems: recent
“universal” models
* Lambda architecture
The Pipeline
“Getting a data pipeline into production for the
first time entails dealing with many different
moving parts in these cases, spend more time—
thinking about the pipeline itself, and how it
will enable experimentation, and less about its
first algorithm.”
https://medium.com/@neal_lathia/five-lessons-from
-building-machine-learning-systems-d703162846ad
Read More
• https://www.slideshare.net/frandzi/native-langua
ge-identification-brief-review-to-the-state-of-t
he-art•
• http://people.cs.umass.edu/~brenocon/inlp2015/15
-eval.pdf
• https://ehudreiter.com/2017/05/03/metrics-nlg-ev
aluation/
• https://nlpers.blogspot.com/2014/11/the-myth-of-
strong-baseline.html
•
https://www.slideshare.net/grammarly/grammarly-a
inlp-club-1-domain-and-social-bias-in-nlp-case-s
tudy-in-language-identification-tim-baldwin-8025
2288

Más contenido relacionado

La actualidad más candente

Why I Love Python
Why I Love PythonWhy I Love Python
Why I Love Pythondidip
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to PythonNowell Strite
 
Experience protocol buffer on android
Experience protocol buffer on androidExperience protocol buffer on android
Experience protocol buffer on androidRichard Chang
 
Declarative Development Using Annotations In PHP
Declarative Development Using Annotations In PHPDeclarative Development Using Annotations In PHP
Declarative Development Using Annotations In PHPstubbles
 
Introduction to programming with python
Introduction to programming with pythonIntroduction to programming with python
Introduction to programming with pythonPorimol Chandro
 
Phython Programming Language
Phython Programming LanguagePhython Programming Language
Phython Programming LanguageR.h. Himel
 
Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Go for Object Oriented Programmers or Object Oriented Programming without Obj...Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Go for Object Oriented Programmers or Object Oriented Programming without Obj...Steven Francia
 
Document Classification using the Python Natural Language Toolkit
Document Classification using the Python Natural Language ToolkitDocument Classification using the Python Natural Language Toolkit
Document Classification using the Python Natural Language ToolkitBen Healey
 
Python final presentation kirti ppt1
Python final presentation kirti ppt1Python final presentation kirti ppt1
Python final presentation kirti ppt1Kirti Verma
 
Go lang introduction
Go lang introductionGo lang introduction
Go lang introductionyangwm
 
Introduction To Python | Edureka
Introduction To Python | EdurekaIntroduction To Python | Edureka
Introduction To Python | EdurekaEdureka!
 
Augmenting and structuring user queries to support efficient free-form code s...
Augmenting and structuring user queries to support efficient free-form code s...Augmenting and structuring user queries to support efficient free-form code s...
Augmenting and structuring user queries to support efficient free-form code s...Dongsun Kim
 
FaCoY – A Code-to-Code Search Engine
FaCoY – A Code-to-Code Search EngineFaCoY – A Code-to-Code Search Engine
FaCoY – A Code-to-Code Search EngineDongsun Kim
 
Executable specifications for xtext
Executable specifications for xtextExecutable specifications for xtext
Executable specifications for xtextmeysholdt
 
A Recovering Java Developer Learns to Go
A Recovering Java Developer Learns to GoA Recovering Java Developer Learns to Go
A Recovering Java Developer Learns to GoMatt Stine
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Fasihul Kabir
 

La actualidad más candente (20)

Training python (new Updated)
Training python (new Updated)Training python (new Updated)
Training python (new Updated)
 
Why I Love Python
Why I Love PythonWhy I Love Python
Why I Love Python
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
 
Basics of python
Basics of pythonBasics of python
Basics of python
 
Experience protocol buffer on android
Experience protocol buffer on androidExperience protocol buffer on android
Experience protocol buffer on android
 
Declarative Development Using Annotations In PHP
Declarative Development Using Annotations In PHPDeclarative Development Using Annotations In PHP
Declarative Development Using Annotations In PHP
 
Introduction to programming with python
Introduction to programming with pythonIntroduction to programming with python
Introduction to programming with python
 
Phython Programming Language
Phython Programming LanguagePhython Programming Language
Phython Programming Language
 
Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Go for Object Oriented Programmers or Object Oriented Programming without Obj...Go for Object Oriented Programmers or Object Oriented Programming without Obj...
Go for Object Oriented Programmers or Object Oriented Programming without Obj...
 
Python Tutorial for Beginner
Python Tutorial for BeginnerPython Tutorial for Beginner
Python Tutorial for Beginner
 
Document Classification using the Python Natural Language Toolkit
Document Classification using the Python Natural Language ToolkitDocument Classification using the Python Natural Language Toolkit
Document Classification using the Python Natural Language Toolkit
 
Python final presentation kirti ppt1
Python final presentation kirti ppt1Python final presentation kirti ppt1
Python final presentation kirti ppt1
 
Go lang introduction
Go lang introductionGo lang introduction
Go lang introduction
 
Introduction To Python | Edureka
Introduction To Python | EdurekaIntroduction To Python | Edureka
Introduction To Python | Edureka
 
Augmenting and structuring user queries to support efficient free-form code s...
Augmenting and structuring user queries to support efficient free-form code s...Augmenting and structuring user queries to support efficient free-form code s...
Augmenting and structuring user queries to support efficient free-form code s...
 
Python - Lesson 1
Python - Lesson 1Python - Lesson 1
Python - Lesson 1
 
FaCoY – A Code-to-Code Search Engine
FaCoY – A Code-to-Code Search EngineFaCoY – A Code-to-Code Search Engine
FaCoY – A Code-to-Code Search Engine
 
Executable specifications for xtext
Executable specifications for xtextExecutable specifications for xtext
Executable specifications for xtext
 
A Recovering Java Developer Learns to Go
A Recovering Java Developer Learns to GoA Recovering Java Developer Learns to Go
A Recovering Java Developer Learns to Go
 
Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014Nltk:a tool for_nlp - py_con-dhaka-2014
Nltk:a tool for_nlp - py_con-dhaka-2014
 

Similar a NLP Project Full Circle

Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Esteve Castells
 
An Introduction to Microservices
An Introduction to MicroservicesAn Introduction to Microservices
An Introduction to MicroservicesAd van der Veer
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupSease
 
Prairie Dev Con West - 2012-03-14 - Webmatrix, see what the matrix can do fo...
Prairie Dev Con West -  2012-03-14 - Webmatrix, see what the matrix can do fo...Prairie Dev Con West -  2012-03-14 - Webmatrix, see what the matrix can do fo...
Prairie Dev Con West - 2012-03-14 - Webmatrix, see what the matrix can do fo...Frédéric Harper
 
How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...Antonios Giannopoulos
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalystdwm042
 
Code Documentation. That ugly thing...
Code Documentation. That ugly thing...Code Documentation. That ugly thing...
Code Documentation. That ugly thing...Christos Manios
 
Googleappengineintro 110410190620-phpapp01
Googleappengineintro 110410190620-phpapp01Googleappengineintro 110410190620-phpapp01
Googleappengineintro 110410190620-phpapp01Tony Frame
 
Lessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at VintedLessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at VintedDainius Jocas
 
Document Object Model
Document Object ModelDocument Object Model
Document Object Modelchomas kandar
 
Document Object Model
Document Object ModelDocument Object Model
Document Object Modelchomas kandar
 
Ruby On Rails Intro
Ruby On Rails IntroRuby On Rails Intro
Ruby On Rails IntroSarah Allen
 
Data Con LA 2022 - Open Source Large Knowledge Graph Factory
Data Con LA 2022 - Open Source Large Knowledge Graph FactoryData Con LA 2022 - Open Source Large Knowledge Graph Factory
Data Con LA 2022 - Open Source Large Knowledge Graph FactoryData Con LA
 
MongoDB hearts Django? (Django NYC)
MongoDB hearts Django? (Django NYC)MongoDB hearts Django? (Django NYC)
MongoDB hearts Django? (Django NYC)Mike Dirolf
 
Productionalizing ML : Real Experience
Productionalizing ML : Real ExperienceProductionalizing ML : Real Experience
Productionalizing ML : Real ExperienceIhor Bobak
 
Cloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug NeedhamCloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug NeedhamDoug Needham
 

Similar a NLP Project Full Circle (20)

Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
Advanced Web Scraping or How To Make Internet Your Database #seoplus2018
 
An Introduction to Microservices
An Introduction to MicroservicesAn Introduction to Microservices
An Introduction to Microservices
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval Meetup
 
Mdb dn 2016_06_query_primer
Mdb dn 2016_06_query_primerMdb dn 2016_06_query_primer
Mdb dn 2016_06_query_primer
 
Prairie Dev Con West - 2012-03-14 - Webmatrix, see what the matrix can do fo...
Prairie Dev Con West -  2012-03-14 - Webmatrix, see what the matrix can do fo...Prairie Dev Con West -  2012-03-14 - Webmatrix, see what the matrix can do fo...
Prairie Dev Con West - 2012-03-14 - Webmatrix, see what the matrix can do fo...
 
Mini-Training: Redis
Mini-Training: RedisMini-Training: Redis
Mini-Training: Redis
 
How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...How sitecore depends on mongo db for scalability and performance, and what it...
How sitecore depends on mongo db for scalability and performance, and what it...
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
 
Code Documentation. That ugly thing...
Code Documentation. That ugly thing...Code Documentation. That ugly thing...
Code Documentation. That ugly thing...
 
10 Ways To Improve Your Code
10 Ways To Improve Your Code10 Ways To Improve Your Code
10 Ways To Improve Your Code
 
Googleappengineintro 110410190620-phpapp01
Googleappengineintro 110410190620-phpapp01Googleappengineintro 110410190620-phpapp01
Googleappengineintro 110410190620-phpapp01
 
Lessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at VintedLessons Learned While Scaling Elasticsearch at Vinted
Lessons Learned While Scaling Elasticsearch at Vinted
 
Document Object Model
Document Object ModelDocument Object Model
Document Object Model
 
Document Object Model
Document Object ModelDocument Object Model
Document Object Model
 
Ruby On Rails Intro
Ruby On Rails IntroRuby On Rails Intro
Ruby On Rails Intro
 
Html5 and css3
Html5 and css3Html5 and css3
Html5 and css3
 
Data Con LA 2022 - Open Source Large Knowledge Graph Factory
Data Con LA 2022 - Open Source Large Knowledge Graph FactoryData Con LA 2022 - Open Source Large Knowledge Graph Factory
Data Con LA 2022 - Open Source Large Knowledge Graph Factory
 
MongoDB hearts Django? (Django NYC)
MongoDB hearts Django? (Django NYC)MongoDB hearts Django? (Django NYC)
MongoDB hearts Django? (Django NYC)
 
Productionalizing ML : Real Experience
Productionalizing ML : Real ExperienceProductionalizing ML : Real Experience
Productionalizing ML : Real Experience
 
Cloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug NeedhamCloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug Needham
 

Más de Vsevolod Dyomkin

Loading Multiple Versions of an ASDF System in the Same Lisp Image
Loading Multiple Versions of an ASDF System in the Same Lisp ImageLoading Multiple Versions of an ASDF System in the Same Lisp Image
Loading Multiple Versions of an ASDF System in the Same Lisp ImageVsevolod Dyomkin
 
NLP in the WILD or Building a System for Text Language Identification
NLP in the WILD or Building a System for Text Language IdentificationNLP in the WILD or Building a System for Text Language Identification
NLP in the WILD or Building a System for Text Language IdentificationVsevolod Dyomkin
 
Sugaring Lisp for the 21st Century
Sugaring Lisp for the 21st CenturySugaring Lisp for the 21st Century
Sugaring Lisp for the 21st CenturyVsevolod Dyomkin
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Vsevolod Dyomkin
 
Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Vsevolod Dyomkin
 
Crash-course in Natural Language Processing
Crash-course in Natural Language ProcessingCrash-course in Natural Language Processing
Crash-course in Natural Language ProcessingVsevolod Dyomkin
 
Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in PracticeVsevolod Dyomkin
 
Lisp как универсальная обертка
Lisp как универсальная оберткаLisp как универсальная обертка
Lisp как универсальная оберткаVsevolod Dyomkin
 
Lisp for Python Programmers
Lisp for Python ProgrammersLisp for Python Programmers
Lisp for Python ProgrammersVsevolod Dyomkin
 
Tedxkyiv communication guidelines
Tedxkyiv communication guidelinesTedxkyiv communication guidelines
Tedxkyiv communication guidelinesVsevolod Dyomkin
 
Новые нереляционные системы хранения данных
Новые нереляционные системы хранения данныхНовые нереляционные системы хранения данных
Новые нереляционные системы хранения данныхVsevolod Dyomkin
 
Чему мы можем научиться у Lisp'а?
Чему мы можем научиться у Lisp'а?Чему мы можем научиться у Lisp'а?
Чему мы можем научиться у Lisp'а?Vsevolod Dyomkin
 
Экосистема Common Lisp
Экосистема Common LispЭкосистема Common Lisp
Экосистема Common LispVsevolod Dyomkin
 

Más de Vsevolod Dyomkin (19)

The State of #NLProc
The State of #NLProcThe State of #NLProc
The State of #NLProc
 
Loading Multiple Versions of an ASDF System in the Same Lisp Image
Loading Multiple Versions of an ASDF System in the Same Lisp ImageLoading Multiple Versions of an ASDF System in the Same Lisp Image
Loading Multiple Versions of an ASDF System in the Same Lisp Image
 
NLP in the WILD or Building a System for Text Language Identification
NLP in the WILD or Building a System for Text Language IdentificationNLP in the WILD or Building a System for Text Language Identification
NLP in the WILD or Building a System for Text Language Identification
 
NLP Project Full Cycle
NLP Project Full CycleNLP Project Full Cycle
NLP Project Full Cycle
 
Sugaring Lisp for the 21st Century
Sugaring Lisp for the 21st CenturySugaring Lisp for the 21st Century
Sugaring Lisp for the 21st Century
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)
 
Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?Can functional programming be liberated from static typing?
Can functional programming be liberated from static typing?
 
Crash-course in Natural Language Processing
Crash-course in Natural Language ProcessingCrash-course in Natural Language Processing
Crash-course in Natural Language Processing
 
Lisp Machine Prunciples
Lisp Machine PrunciplesLisp Machine Prunciples
Lisp Machine Prunciples
 
Natural Language Processing in Practice
Natural Language Processing in PracticeNatural Language Processing in Practice
Natural Language Processing in Practice
 
CL-NLP
CL-NLPCL-NLP
CL-NLP
 
Lisp как универсальная обертка
Lisp как универсальная оберткаLisp как универсальная обертка
Lisp как универсальная обертка
 
Lisp for Python Programmers
Lisp for Python ProgrammersLisp for Python Programmers
Lisp for Python Programmers
 
Aspects of NLP Practice
Aspects of NLP PracticeAspects of NLP Practice
Aspects of NLP Practice
 
Practical NLP with Lisp
Practical NLP with LispPractical NLP with Lisp
Practical NLP with Lisp
 
Tedxkyiv communication guidelines
Tedxkyiv communication guidelinesTedxkyiv communication guidelines
Tedxkyiv communication guidelines
 
Новые нереляционные системы хранения данных
Новые нереляционные системы хранения данныхНовые нереляционные системы хранения данных
Новые нереляционные системы хранения данных
 
Чему мы можем научиться у Lisp'а?
Чему мы можем научиться у Lisp'а?Чему мы можем научиться у Lisp'а?
Чему мы можем научиться у Lisp'а?
 
Экосистема Common Lisp
Экосистема Common LispЭкосистема Common Lisp
Экосистема Common Lisp
 

Último

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 

Último (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 

NLP Project Full Circle

  • 1. NLP Project Full Circle Vsevolod Dyomkin, @vseloved Kyiv Dataspring 18, 2018-05-19
  • 2. A Bit about Me * 10+ Lisp programmer * 5+ years of NLP work (Grammarly & my own) * co-author of Projector NLP course https://vseloved.github.io
  • 4. NLP Project Stages 1) Domain analysis 2) Data preparation 3) Iterating the solution 4) Productizing the result 5) Gathering feedback and returning to stages 1/2/3
  • 5. 1. Domain Analysis 1) Problem formulation 2) Existing solutions 3) Possible datasets 4) Identifying Challenges 5) Metrics, baseline & SOTA 6) Selecting possible approaches and devising a plan of attack
  • 7. Problem Formulation * Not an unsolved problem * Short texts * Support “all” languages * The issue of UND/MULT texts * Dialects? * Encodings?
  • 9. Existing Solutions * https://github.com/shuyo/language-detection/ (Java) * https://github.com/saffsd/langid.py (Python) * https://github.com/mzsanford/cld (C++)
  • 10. Existing Solutions * https://github.com/shuyo/language-detection/ (Java) * https://github.com/saffsd/langid.py (Python) * https://github.com/mzsanford/cld (C++) * https://github.com/CLD2Owners/cld2 * https://github.com/google/cld3
  • 11. Existing Solutions * https://github.com/shuyo/language-detection/ (Java) * https://github.com/saffsd/langid.py (Python) * https://github.com/mzsanford/cld (C++) * https://github.com/CLD2Owners/cld2 * https://github.com/google/cld3
  • 12. Possible Datasets * Debian i18n (~90 langs)
  • 13. Possible Datasets * Debian i18n (~90 langs) * TED Multilingual (109 langs)
  • 14. Possible Datasets * Debian i18n (~90 langs) * TED Multilingual (109 langs) * Wiktionary (~150 langs)
  • 15. Possible Datasets * Debian i18n (~90 langs) * TED Multilingual (109 langs) * Wiktionary (~150 langs) * Wikipedia (175 langs)
  • 16. Test Data * Twitter evaluation dataset * Datasets with fewer langs * Extract from Wikipedia + smoke test
  • 17. Challenges * Linguistic challenges - know next to nothing about 90% of the languages - languages and scripts http://www.omniglot.com/writing/langalph.htm - language distributions https://en.wikipedia.org/wiki/Languages_used_on_the_Internet#Content_languages_for_websites - word segmentation? * Engineering challenges
  • 19. NLP Evaluation Intrinsic: use a gold- standard dataset and some metric to measure the system’s performance directly. (in-domain & out-of-domain) Extrinsic: use a system as part of an upstream task(s) and measure performance change there.
  • 23. Confusion Matrix AF: 1.00 | DE: 1.00 | EN: 1.00 | ES: 0.94 | IT:0.06 NL: 0.85 | IT:0.03 CA:0.03 AF:0.03 DE:0.03 FR:0.03 RU: 0.82 | UK:0.12 EN:0.03 DA:0.02 FR:0.02 UK: 0.93 | TG:0.07 Total quality: 0.91
  • 24. Evaluating NLG 1) Word Error Rate (WER) 2) BLEU, ROUGE, METEOR
  • 25.
  • 26. Other Metrics 1) Cross-entropy 2) Perplexity 3) Custom: MaxMatch, smatch, parseval, UAS/LAS… 4) Your own?
  • 27. Baseline Quality that can be achieved using some reasonable “basic” approach (on the same data). 2 ways to measure improvement: * quality improvement * error reduction
  • 28. Recent SNLI Example https://arxiv.org/pdf/1803.02324.pdf - majority class baseline: 0.34 - SOTA models: 0.87
  • 29. Recent SNLI Example https://arxiv.org/pdf/1803.02324.pdf - majority class baseline: 0.34 - SOTA models: 0.87 - basic fasttext classifier: 0.67
  • 30. State-of-the-Art (SOTA) The highest publicly known result on a well-established dataset. https://aclweb.org/aclwiki/Sta te_of_the_art
  • 32. The Unreasonable Effectiveness of Data “Data is ten times more powerful than algorithms.” -- Peter Norvig, http://youtu.be/yvDCzhbjYWs https://twitter.com/shivon/status/864889085697024000
  • 33. Uses of Data in NLP * understanding the problem * statistical analysis * connecting separate domains * evaluation data set * training data set * real-time feedback * marketing/PR * external competitions * what else?
  • 35. Types of Data * Structured * Semi-structured * Unstructured
  • 36. Getting Data * Get existing * Create your own * Acquire from users
  • 37. Corpus Annotated collection of docs in a certain format.
  • 38. Prominent Corpora * National: OANC/MASC, British (non-free) * LDC(non-free): Penn Treebank, OntoNotes, Web Treebank * Books: Gutenberg, GoogleBooks * Corporate: Reuters, Enron * Research: SNLI, SQuAD * Multilang: UDeps, Europarl
  • 39. Corpora Pitfalls * Requires licensing (often) * Tied to a domain * Annotation quality * Requires processing of custom formats (often)
  • 40. DBs & KBs * Wikimedia * RDF knowledge bases (Freebase) * Wordnet, Conceptnet, Babelnet * Custom DBs
  • 41. Dictionaries * Wordlists (example: COCA) * Dictionaries, grammar dicts - Wiktionary * Thesauri
  • 42. Unstructured Data * Internet * CommonCrawl (also, NewsCrawl) * UMBC, ClueWeb
  • 43. Raw Text Pros & Cons + Can collect stats => build LMs, word vectors… + Can have a variety of domains - … but hard to control the distribution of domains - Web artifacts - Web noise - Huge processing effort
  • 44. Creating Own Data * Scraping * Annotation * Crowdsourcing * Getting from users * Generating
  • 45. Data Scraping * Full-page web-scraping (see: readability) * Custom web-scraping (see: scrapy, crawlik) * Extracting from non-HTML formats (.pdf, .doc…) * Getting from API (see: twitter, webhose)
  • 46. Scraping Pros & Cons + possible to get needed domain + may be the only “fast” way + it’s the “google” way - licensing issues - getting volume requires time - need to deal with antirobot protection - other engineering challenges
  • 47. Corpus Annotation * Initial data gathering * Annotation guidelines * Annotator selection * Annotation tools * Processing of raw annotations * Result analysis & feedback
  • 49. Crowdsourcing - Like annotation but harder requires a custom solution With gamification + If successful, can get volumes and diversity
  • 50. User Feedback Data + Your product, your domain + Real-time, allows adaptation + Ties into customer support - Chicken & egg problem - Requires anonymization Approach: “lean startup”
  • 51. Generating Data + Potentially unlimited volume + Control of the parameters - Artificial (if you can code a generation function, why not code a reverse one?)
  • 52. Data Best Practices * Proper ML dataset handling * Domain adequacy, diversity * Inter-annotator agreement, reasonable baselines * Error analysis * Real-time tracking
  • 53. Data Tools * (grep & co) * other Shell powertools * statistical analysis tools + plotting * annotation tools * web-scraping tools * metrics databases (Graphite) * Hadoop, Spark
  • 54. 3. Possible Approaches 1) Symbolical/rule-based “The (Weak) Empire of Reason” 2) Counting-based “The Empiricist Invasion” 3) Deep Learning-based “The Revenge of the Spherical Cows” http://www.earningmyturns.org/2017/06/ a-computational-linguistic-farce-in.ht ml
  • 56. English Tokenization (small problem) * Simplest regex: [^s]+ * More advanced regex: w+|[!"#$%&'*+,./:;<=>?@^`~…() {}[|] «»“”‘’-]⟨⟩ ‒–—― * Even more advanced regex: [+-]?[0-9](?:[0-9,.]*[0-9])? |[w@](?:[w'’`@-][w']|[w'][w@'’`-])*[w']? |["#$%&*+,/:;<=>@^`~…() {}[|] «»“”‘’']⟨⟩ ‒–—― |[.!?]+ |-+ Add post-processing: * concatenate abbreviations and decimals * split contractions with regexes - 2-character abbreviations regex: i['‘’`]m|(?:s?he|it)['‘’`]s|(?:i|you|s?he|we|they)['‘’`]d$ - 3-character abbreviations regex: (?:i|you|s?he|we|they)['‘’`](?:ll|[vr]e)|n['‘’`]t$ https://github.com/lang-uk/tokenize-uk/blob/master/tokenize_uk/tokeni ze_uk.py
  • 57. Error Correction (inherently rule-based) LanguageTool:<category name=" " id="STYLE" type="style">Стиль <rulegroup id="BILSH_WITH_ADJ" name=" ">Більш з прикметниками <rule> <pattern> <token> </token>більш <token postag_regexp="yes" postag="ad[jv]:.*compc.*"> <exception> </exception>менш </token> </pattern> <message> « » </message>Після більш не може стояти вища форма прикметника <!-- TODO: when we can bind comparative forms togher again <suggestion><match no="2"/></suggestion> <suggestion><match no="1"/> <match no="2" postag_regexp="yes" postag="(.*):compc(.*)" postag_replace="$1:compb$2"/></suggestion> <example correction=" | "><marker>Світліший Більш світлий Більш </marker>.</example>світліший --> <example correction=""><marker> </marker>.</example>Більш світліший <example> </example>все закінчилось більш менш </rule> ... https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modu les/uk/src/main/resources/org/languagetool/rules/uk/
  • 58. Dialog Systems (inherently scripted) ELIZA: (defparameter *eliza-rules* '((((?* ?x) hello (?* ?y)) (How do you do. Please state your problem.)) (((?* ?x) I want (?* ?y)) (What would it mean if you got ?y) (Why do you want ?y) (Suppose you got ?y soon)) (((?* ?x) if (?* ?y)) (Do you really think its likely that ?y) (Do you wish that ?y) (What do you think about ?y) (Really-- if ?y)) (((?* ?x) no (?* ?y)) (Why not?) (You are being a bit negative) (Are you saying "NO" just to be negative?)) (((?* ?x) I was (?* ?y)) (Were you really?) (Perhaps I already knew you were ?y) (Why do you tell me you were ?y now?)) (((?* ?x) I feel (?* ?y)) (Do you often feel ?y ?)) (((?* ?x) I felt (?* ?y)) (What other feelings do you have?)))) https://norvig.com/paip/eliza1.lisp
  • 59.
  • 60. Fighting Spam (just a bad idea :) SpamAssasin: # bodyn # example: postacie greet! Chcesz porozmawiac Co prawda tu rzadko bywam, zwykle pisze tu - http://hanna.3xa.info uri __LOCAL_LINK_INFO /http://w{3,8}.www.info/ header __LOCAL_FROM_O2 From =~ /@o2.pl/ meta LOCAL_LINK_INFO __LOCAL_LINK_INFO && __LOCAL_FROM_O2 describe LOCAL_LINK_INFO Link postaci http://cos.cos.info i z o2.pl score LOCAL_LINK_INFO 5 https://wiki.apache.org/spamassassin/CustomRulesets
  • 61. (…which continues being reimplemented) Facebook antispam (using HAXL): fpSpammer :: Haxl Bool fpSpammer = talkingAboutFP .&& numFriends .> 100 .&& friendsLikeCPlusPlus where talkingAboutFP = strContains “Functional Programming” <$> postContent friendsLikeCPlusPlus = do friends <- getFriends cppFriends <- filterM likesCPlusPlus friends return (length cppFriends >= length friends `div` 2) http://multicore.doc.ic.ac.uk/iPr0gram/slides/2015-2016/Marlow-fighting-s pam.pdf https://petrimazepa.com/m_li_ili_kak_algoritm_facebook_blokiruet_polzovat elei_na_osnovanii_stop_slov
  • 63. Languages for Rules * regexes * XML, JSON
  • 64. Languages for Rules * regexes * XML, JSON * external DSLs (example: JBOSS drules)
  • 65. Languages for Rules * regexes * XML, JSON * external DSLs * internal DSLs (match-html source '(>> article (aside (>> a ($ user)) (>> li (strong "Native Tongue:") ($ lang))) (div |...| (>> (div :data-role "commentContent" ($ text) (span) |...|)) !!!))
  • 66. Rules Pros & Cons + compact & fast + full control + arbitrary recall + iterative + best interpetability + perfectly accommodate domain experts - precision ceiling - non-optimal weights - require a lot of human labor - hard to interpret/calibrate score
  • 68. NLP Viewpoints * Bag-of-words * Sequence * Tree * Graph
  • 69. NB Model for LangID * The problem of using words * Character ngrams to the rescue * Combining them
  • 71. BiLSTM+attention as SOTA “Basically, if you want to do an NLP task, no matter what it is, what you should do is throw your data into a Bi-directional long-short term memory network, and augment its information flow with the attention mechanism.” – Chris Manning https://twitter.com/mayurbhangale /status/988332845708886016
  • 72. Hybrid Approach “Rule-based” framework that incorporates Machine Learning models. + control + best of both worlds - more complex - not sexy
  • 73. Bias * Domain bias - systematic statistical differences between “domains”, usually in the form of vocab, class priors and conditional probabilities * Dataset bias - sampling bias in a given dataset, giving rise to (often unintentional) content bias * Model bias - bias in the output of a model wrt gold standard * Social bias - model bias towards/against particular socio- demographic groups of users
  • 74. Counteracting Bias * Rule-based post-processing * Mix multiple datasets + special feature-selection * Train multiple models on different datasets and create an ensemble * Use domain adaptation techniques
  • 75. 4. Productization Delivering your NLP application to the users
  • 76. Product variants * end-user product (web, mobile, desktop) * internal feature * API * library * scientific paper
  • 77. Requirements * speed (processing, startup) * memory usage * storage * interoperability * ease of use * ease of update
  • 78. Storage Optimization WILD Initial model size ~ 1G (compared to megabytes for CLD, target: 10-20 MB) Ways to reduce: - partly rule-based - model pruning - clever algorithms: perfect hash tables, Huffman coding - model compression https://arxiv.org/abs/1612.03651
  • 80. Model Formats * pickle & co. * JSON, ProtoBuf, ...
  • 81. Model Formats * pickle & co. * JSON, ProtoBuf, ... * HDF5
  • 82. Model Formats * pickle & co. * JSON, ProtoBuf, ... * HDF5 * Custom gzipped text-based
  • 83. Adaptation (manual vs automatic) * for RB systems: add more rules :D * for ML systems: online learning algorithms * for DL systems: recent “universal” models * Lambda architecture
  • 84. The Pipeline “Getting a data pipeline into production for the first time entails dealing with many different moving parts in these cases, spend more time— thinking about the pipeline itself, and how it will enable experimentation, and less about its first algorithm.” https://medium.com/@neal_lathia/five-lessons-from -building-machine-learning-systems-d703162846ad
  • 85. Read More • https://www.slideshare.net/frandzi/native-langua ge-identification-brief-review-to-the-state-of-t he-art• • http://people.cs.umass.edu/~brenocon/inlp2015/15 -eval.pdf • https://ehudreiter.com/2017/05/03/metrics-nlg-ev aluation/ • https://nlpers.blogspot.com/2014/11/the-myth-of- strong-baseline.html • https://www.slideshare.net/grammarly/grammarly-a inlp-club-1-domain-and-social-bias-in-nlp-case-s tudy-in-language-identification-tim-baldwin-8025 2288