4. NLP Project Stages
1) Domain analysis
2) Data preparation
3) Iterating the solution
4) Productizing the result
5) Gathering feedback
and returning to stages 1/2/3
5. 1. Domain Analysis
1) Problem formulation
2) Existing solutions
3) Possible datasets
4) Identifying Challenges
5) Metrics, baseline & SOTA
6) Selecting possible
approaches and devising
a plan of attack
16. Test Data
* Twitter evaluation dataset
* Datasets with fewer langs
* Extract from Wikipedia
+ smoke test
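A smoke test is a handful of unambiguous inputs the system must never get wrong. A minimal sketch for a language-identification task, with a toy stopword-overlap detector standing in for the real model (the `detect` function and its stopword lists are illustrative, not part of any library):

```python
# A "smoke test" for language identification: a few unambiguous
# sentences that must always be classified correctly.

STOPWORDS = {
    "en": {"the", "is", "and", "of", "to"},
    "de": {"der", "die", "und", "ist", "nicht"},
    "fr": {"le", "la", "et", "est", "les"},
}

def detect(text):
    """Toy detector: guess the language by stopword hits."""
    words = set(text.lower().split())
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

SMOKE_TESTS = [
    ("the cat is on the mat", "en"),
    ("der Hund ist nicht hier", "de"),
    ("la vie est belle et la mer", "fr"),
]

for text, expected in SMOKE_TESTS:
    assert detect(text) == expected, (text, expected)
```

The same loop works unchanged once `detect` is swapped for the real system; a failing smoke test blocks the release.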
17. Challenges
* Linguistic challenges
- know next to nothing about
90% of the languages
- languages and scripts
http://www.omniglot.com/writing/langalph.htm
- language distributions
https://en.wikipedia.org/wiki/Languages_used_on_the_Internet#Content_languages_for_websites
- word segmentation?
* Engineering challenges
19. NLP Evaluation
Intrinsic: use a gold-
standard dataset and some
metric to measure the system’s
performance directly.
(in-domain & out-of-domain)
Extrinsic: use the system as
part of a downstream task
and measure the performance
change there.
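Intrinsic evaluation in its simplest form, sketched in Python: score predictions against a gold-standard set with one metric (accuracy here), reported separately for in-domain and out-of-domain data. The labels are made up for illustration:

```python
# Intrinsic evaluation: compare predictions against a gold standard.

def accuracy(gold, predicted):
    assert len(gold) == len(predicted)
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# In-domain gold data (e.g. same genre as the training data)
gold_in = ["en", "en", "de", "fr", "en"]
pred_in = ["en", "en", "de", "fr", "de"]

# Out-of-domain gold data (e.g. tweets instead of news)
gold_out = ["en", "de", "de", "fr", "fr"]
pred_out = ["en", "en", "de", "en", "fr"]

print(f"in-domain accuracy:     {accuracy(gold_in, pred_in):.2f}")   # 0.80
print(f"out-of-domain accuracy: {accuracy(gold_out, pred_out):.2f}") # 0.60
```

The typical pattern: out-of-domain accuracy lags in-domain accuracy, which is exactly why both are reported.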
27. Baseline
Quality that can be achieved
using some reasonable “basic”
approach
(on the same data).
2 ways to measure improvement:
* quality improvement
* error reduction
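The two ways of reporting the same gain are easy to conflate; a few lines of arithmetic with made-up accuracies make the difference concrete:

```python
# Suppose a baseline reaches 90% accuracy and the new system 95%
# on the same data.

baseline, system = 0.90, 0.95

# Absolute quality improvement: 5 percentage points
quality_gain = system - baseline

# Relative error reduction: errors shrink from 10% to 5%, i.e. by half
error_reduction = (system - baseline) / (1 - baseline)

print(f"quality improvement: {quality_gain:.2f}")    # 0.05
print(f"error reduction:     {error_reduction:.2f}") # 0.50
```

"5% better" and "50% fewer errors" describe the same system; error reduction is usually the more informative number near the top of the scale.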
32. The Unreasonable
Effectiveness of Data
“Data is ten times more powerful than algorithms.”
-- Peter Norvig, http://youtu.be/yvDCzhbjYWs
https://twitter.com/shivon/status/864889085697024000
33. Uses of Data in NLP
* understanding the problem
* statistical analysis
* connecting separate domains
* evaluation data set
* training data set
* real-time feedback
* marketing/PR
* external competitions
* what else?
43. Raw Text Pros & Cons
+ Can collect stats
=> build LMs, word vectors…
+ Can have a variety of domains
- … but hard to control the
distribution of domains
- Web artifacts
- Web noise
- Huge processing effort
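The first "pro" above, collecting statistics, can be sketched in a few lines: a unigram frequency table over raw text is the simplest building block behind language models and word vectors (the text is a toy placeholder):

```python
# Counting statistics over raw text: unigram counts and probabilities.
from collections import Counter

raw_text = """the quick brown fox jumps over the lazy dog
the dog barks and the fox runs"""

tokens = raw_text.lower().split()
counts = Counter(tokens)
total = sum(counts.values())

# Unigram probabilities P(w) = count(w) / N
unigram_lm = {w: c / total for w, c in counts.items()}

print(counts.most_common(2))  # [('the', 4), ('fox', 2)]
```

On web-scale raw text the same idea runs into the listed "cons": the counts inherit whatever domain mix and noise the crawl happened to contain.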
44. Creating Own Data
* Scraping
* Annotation
* Crowdsourcing
* Getting from users
* Generating
45. Data Scraping
* Full-page web-scraping
(see: readability)
* Custom web-scraping
(see: scrapy, crawlik)
* Extracting from non-HTML
formats (.pdf, .doc…)
* Getting from API
(see: twitter, webhose)
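Full-page scraping ultimately boils down to stripping markup and keeping the text. A minimal stdlib sketch of that step (real projects would reach for readability or scrapy, as listed above):

```python
# Extract visible text from an HTML page, skipping script/style blocks.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

page = "<html><body><h1>Title</h1><script>var x=1;</script><p>Body text.</p></body></html>"
parser = TextExtractor()
parser.feed(page)
print(" ".join(parser.parts))  # Title Body text.
```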
46. Scraping Pros & Cons
+ possible to get needed domain
+ may be the only “fast” way
+ it’s the “google” way
- licensing issues
- getting volume requires time
- need to deal with anti-bot
protection
- other engineering challenges
47. Corpus Annotation
* Initial data gathering
* Annotation guidelines
* Annotator selection
* Annotation tools
* Processing of raw annotations
* Result analysis & feedback
49. Crowdsourcing
- Like annotation, but harder:
requires a custom solution
with gamification
+ If successful, can get
volume and diversity
50. User Feedback Data
+ Your product, your domain
+ Real-time, allows adaptation
+ Ties into customer support
- Chicken & egg problem
- Requires anonymization
Approach: “lean startup”
51. Generating Data
+ Potentially unlimited volume
+ Control of the parameters
- Artificial (if you can code
a generation function,
why not code a reverse one?)
52. Data Best Practices
* Proper ML dataset handling
* Domain adequacy, diversity
* Inter-annotator agreement,
reasonable baselines
* Error analysis
* Real-time tracking
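Inter-annotator agreement is usually reported chance-corrected. A self-contained sketch of Cohen's kappa for two annotators (the labels are illustrative):

```python
# Cohen's kappa: chance-corrected agreement between two annotators.
from collections import Counter

def cohens_kappa(a, b):
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if both annotators labeled independently
    # with their observed per-label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
ann2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.333
```

Raw agreement here is 4/6, but kappa of 0.33 shows how much of that is expected by chance; a low kappa is a signal to tighten the annotation guidelines.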
54. 3. Possible Approaches
1) Symbolical/rule-based
“The (Weak) Empire of Reason”
2) Counting-based
“The Empiricist Invasion”
3) Deep Learning-based
“The Revenge of the Spherical Cows”
http://www.earningmyturns.org/2017/06/a-computational-linguistic-farce-in.html
56. English Tokenization
(small problem)
* Simplest regex:
[^\s]+
* More advanced regex:
\w+|[!"#$%&'*+,./:;<=>?@^`~…(){}\[\]|«»“”‘’⟨⟩‒–—―-]
* Even more advanced regex:
[+-]?[0-9](?:[0-9,.]*[0-9])?
|[\w@](?:[\w'’`@-][\w']|[\w'][\w@'’`-])*[\w']?
|["#$%&*+,/:;<=>@^`~…(){}\[\]|«»“”‘’'⟨⟩‒–—―]
|[.!?]+
|-+
Add post-processing:
* concatenate abbreviations and decimals
* split contractions with regexes
- 2-character abbreviations regex:
i['‘’`]m|(?:s?he|it)['‘’`]s|(?:i|you|s?he|we|they)['‘’`]d$
- 3-character abbreviations regex:
(?:i|you|s?he|we|they)['‘’`](?:ll|[vr]e)|n['‘’`]t$
https://github.com/lang-uk/tokenize-uk/blob/master/tokenize_uk/tokenize_uk.py
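The regex progression above, made runnable: each pattern trades simplicity for better handling of punctuation and numbers (the sample sentence is invented; the full punctuation classes from the slide are trimmed to their ASCII core here):

```python
# The tokenization regexes in action.
import re

text = "Mr. Smith paid $3.50 - can't he?"

# Simplest: maximal runs of non-whitespace
simple = re.findall(r"[^\s]+", text)

# More advanced: runs of word characters vs. single punctuation marks
better = re.findall(r"\w+|[^\w\s]", text)

print(simple)  # ['Mr.', 'Smith', 'paid', '$3.50', '-', "can't", 'he?']
print(better)
```

The second pattern splits "can't" and "$3.50" apart, which is why the slide adds post-processing to re-join abbreviations, decimals, and contractions.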
57. Error Correction
(inherently rule-based)
LanguageTool:
<category name="Стиль" id="STYLE" type="style">
<rulegroup id="BILSH_WITH_ADJ" name="Більш з прикметниками">
<rule>
<pattern>
<token>більш</token>
<token postag_regexp="yes" postag="ad[jv]:.*compc.*">
<exception>менш</exception>
</token>
</pattern>
<message>Після «більш» не може стояти вища форма прикметника</message>
<!-- TODO: when we can bind comparative forms together again
<suggestion><match no="2"/></suggestion>
<suggestion><match no="1"/> <match no="2" postag_regexp="yes"
postag="(.*):compc(.*)" postag_replace="$1:compb$2"/></suggestion>
<example correction="Світліший|Більш світлий"><marker>Більш світліший</marker>.</example>
-->
<example correction=""><marker>Більш світліший</marker>.</example>
<example>все закінчилось більш менш</example>
</rule>
...
...
https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/uk/src/main/resources/org/languagetool/rules/uk/
58. Dialog Systems
(inherently scripted)
ELIZA:
(defparameter *eliza-rules*
'((((?* ?x) hello (?* ?y))
(How do you do. Please state your problem.))
(((?* ?x) I want (?* ?y))
(What would it mean if you got ?y)
(Why do you want ?y) (Suppose you got ?y soon))
(((?* ?x) if (?* ?y))
(Do you really think its likely that ?y) (Do you wish that ?y)
(What do you think about ?y) (Really-- if ?y))
(((?* ?x) no (?* ?y))
(Why not?) (You are being a bit negative)
(Are you saying "NO" just to be negative?))
(((?* ?x) I was (?* ?y))
(Were you really?) (Perhaps I already knew you were ?y)
(Why do you tell me you were ?y now?))
(((?* ?x) I feel (?* ?y))
(Do you often feel ?y ?))
(((?* ?x) I felt (?* ?y))
(What other feelings do you have?))))
https://norvig.com/paip/eliza1.lisp
60. Fighting Spam
(just a bad idea :)
SpamAssassin:
# body
# example: "postacie greet! Chcesz porozmawiac Co prawda tu rzadko bywam,
# zwykle pisze tu - http://hanna.3xa.info"
uri __LOCAL_LINK_INFO /http:\/\/\w{3,8}\.\w{3,8}\.info/
header __LOCAL_FROM_O2 From =~ /\@o2\.pl/
meta LOCAL_LINK_INFO __LOCAL_LINK_INFO && __LOCAL_FROM_O2
describe LOCAL_LINK_INFO Link of the form http://cos.cos.info, sent from o2.pl
score LOCAL_LINK_INFO 5
https://wiki.apache.org/spamassassin/CustomRulesets
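The SpamAssassin rule above can be mimicked in miniature: cheap boolean sub-rules are combined into a "meta" rule, and only the meta rule carries a score (the message dicts and rule functions are invented for illustration):

```python
# SpamAssassin-style scoring: boolean sub-rules feed a scored meta rule.
import re

def link_info(msg):
    """Body contains a link of the form http://xxx.yyy.info"""
    return re.search(r"http://\w{3,8}\.\w{3,8}\.info", msg["body"]) is not None

def from_o2(msg):
    """Sender address is at o2.pl"""
    return msg["from"].endswith("@o2.pl")

# (predicate, score) pairs; only the conjunction is scored
RULES = [(lambda m: link_info(m) and from_o2(m), 5.0)]

def score(msg):
    return sum(s for rule, s in RULES if rule(msg))

spam = {"from": "x@o2.pl", "body": "see http://hanna.3xa.info now"}
ham = {"from": "a@example.com", "body": "meeting at noon"}
print(score(spam), score(ham))  # 5.0 0
```

This is the rule-based trade-off from the next slide in one file: full control and interpretability, but every weight is hand-picked.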
66. Rules Pros & Cons
+ compact & fast
+ full control
+ arbitrary recall
+ iterative
+ best interpretability
+ perfectly accommodate domain experts
- precision ceiling
- non-optimal weights
- require a lot of human labor
- hard to interpret/calibrate score
71. BiLSTM+attention as SOTA
“Basically, if you want to do an
NLP task, no matter what it is,
what you should do is throw your
data into a Bi-directional
long short-term memory network,
and augment its information flow
with the attention mechanism.”
– Chris Manning
https://twitter.com/mayurbhangale
/status/988332845708886016
73. Bias
* Domain bias - systematic statistical
differences between “domains”, usually
in the form of vocab, class priors and
conditional probabilities
* Dataset bias - sampling bias in a
given dataset, giving rise to (often
unintentional) content bias
* Model bias - bias in the output of a
model wrt gold standard
* Social bias - model bias
towards/against particular socio-
demographic groups of users
74. Counteracting Bias
* Rule-based post-processing
* Mix multiple datasets
+ special feature-selection
* Train multiple models on
different datasets and
create an ensemble
* Use domain adaptation
techniques
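The ensemble idea above, sketched with majority voting: each model is trained on a different (differently biased) dataset, and voting dilutes any single dataset's bias (the toy predictions are illustrative):

```python
# Majority-vote ensembling over models trained on different datasets.
from collections import Counter

def majority_vote(predictions_per_model):
    """predictions_per_model: one list of labels per model."""
    ensembled = []
    for labels in zip(*predictions_per_model):
        ensembled.append(Counter(labels).most_common(1)[0][0])
    return ensembled

# Three toy models, each trained on a different (biased) dataset
model_a = ["pos", "neg", "neg", "pos"]
model_b = ["pos", "neg", "pos", "pos"]
model_c = ["neg", "neg", "neg", "pos"]

print(majority_vote([model_a, model_b, model_c]))
# ['pos', 'neg', 'neg', 'pos']
```

Items 1 and 3 show the effect: one outlier model is outvoted by the other two.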
83. Adaptation
(manual vs automatic)
* for RB systems: add more
rules :D
* for ML systems: online
learning algorithms
* for DL systems: recent
“universal” models
* Lambda architecture
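Online learning for ML systems, reduced to its core: update the model one example at a time, so it can adapt as feedback streams in. A minimal perceptron sketch (features and labels are invented; +1 is "ham", -1 is "spam"):

```python
# One-example-at-a-time (online) learning with a perceptron update.

def perceptron_update(weights, features, label, lr=1.0):
    """Single online step; features is a dict, label is +1 or -1."""
    score = sum(weights.get(f, 0.0) * v for f, v in features.items())
    predicted = 1 if score >= 0 else -1
    if predicted != label:  # mistake-driven: update only on errors
        for f, v in features.items():
            weights[f] = weights.get(f, 0.0) + lr * label * v
    return predicted

weights = {}
stream = [
    ({"free": 1, "offer": 1}, -1),    # spam
    ({"meeting": 1, "notes": 1}, 1),  # ham
    ({"free": 1, "prize": 1}, -1),    # spam
]
for features, label in stream:
    perceptron_update(weights, features, label)

print(weights)  # "free" ends up with a negative (spammy) weight
```

Each incoming user correction is one more call to the update function; no retraining pass over the full dataset is needed.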
84. The Pipeline
“Getting a data pipeline into production for the
first time entails dealing with many different
moving parts. In these cases, spend more time
thinking about the pipeline itself, and how it
will enable experimentation, and less about its
first algorithm.”
https://medium.com/@neal_lathia/five-lessons-from
-building-machine-learning-systems-d703162846ad