Crash Course in Natural Language Processing (2016)

  1. 1. Crash Course in Natural Language Processing Vsevolod Dyomkin 04/2016
  2. 2. A Bit about Me * Lisp programmer * 5+ years of NLP work at Grammarly * Occasional lecturer
  3. 3. A Bit about Grammarly The best English language writing app Spellcheck - Grammar check - Style improvement - Synonyms and word choice - Plagiarism check
  4. 4. Plan * Overview of NLP * Where to get Data * Common NLP problems and approaches * How to develop an NLP system
  5. 5. What Is NLP? Transforming free-form text into structured data and back
  6. 6. What Is NLP? Transforming free-form text into structured data and back Intersection of: * Computational Linguistics * CompSci & AI * Stats & Information Theory
  7. 7. Linguistic Basis * Syntax (form) * Semantics (meaning) * Pragmatics (intent/logic)
  8. 8. Natural Language * ambiguous * noisy * evolving
  9. 9. Time flies like an arrow. Fruit flies like a banana. I read a story about evolution in ten minutes. I read a story about evolution in the last million years.
  10. 10. NLP & Data Types of text data: * structured * semi-structured * unstructured “Data is ten times more powerful than algorithms.” -- Peter Norvig, "The Unreasonable Effectiveness of Data"
  11. 11. Kinds of Data * Dictionaries * Databases/Ontologies * Corpora * User Data
  12. 12. Where to Get Data? * Linguistic Data Consortium * Common Crawl * Wikimedia * WordNet * APIs: Twitter, Wordnik, ... * University sites & the academic community: Stanford, Oxford, CMU, ...
  13. 13. Create Your Own! * Linguists * Crowdsourcing * By-product -- Jonathan Zittrain
  14. 14. Classic NLP Problems * Linguistically-motivated: segmentation, tagging, parsing * Analytical: classification, sentiment analysis * Transformation: translation, correction, generation * Conversation: question answering, dialog
  15. 15. Tokenization Example: This is a test that isn't so simple: 1.23. "This" "is" "a" "test" "that" "is" "n't" "so" "simple" ":" "1.23" "." Issues: * Finland’s capital - Finland Finlands Finland’s * what’re, I’m, isn’t - what ’re, I ’m, is n’t * Hewlett-Packard or Hewlett Packard * San Francisco - one token or two? * m.p.h., PhD.
  16. 16. Regular Expressions Simplest regex: [^\s]+ More advanced regex: \w+|[!"#$%&'*+,./:;<=>?@^`~…\(\){}\[\|\]⟨⟩‒–—―«»“”‘’-] Even more advanced regex: [+-]?[0-9](?:[0-9,.]*[0-9])? |[\w@](?:[\w'’`@-][\w']|[\w'][\w@'’`-])*[\w']? |["#$%&*+,/:;<=>@^`~…\(\){}\[\|\]⟨⟩‒–—―«»“”‘’'] |[.!?]+ |-+
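To make the trade-off between tokenizer regexes concrete, here is a minimal Python sketch. It uses the whitespace pattern from the slide and, as an assumption, a simplified word-or-punctuation pattern (`\w+|[^\w\s]`) standing in for the longer character classes. The names `tokenize`, `WHITESPACE_RE`, and `WORD_PUNCT_RE` are illustrative only.

```python
import re

# Simplest approach: every run of non-whitespace characters is a token.
WHITESPACE_RE = re.compile(r"[^\s]+")

# Simplified stand-in for the "more advanced" regex (an assumption):
# a run of word characters, or any single non-space, non-word symbol.
WORD_PUNCT_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text, pattern=WORD_PUNCT_RE):
    """Return all non-overlapping matches of the pattern, left to right."""
    return pattern.findall(text)


text = "This is a test that isn't so simple: 1.23."
whitespace_tokens = tokenize(text, WHITESPACE_RE)
word_punct_tokens = tokenize(text)
```

Whitespace splitting leaves punctuation glued to words ("simple:", "1.23."), while the word-or-punctuation pattern over-splits "isn't" and "1.23". Both failure modes are what the post-processing step is meant to repair.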
  17. 17. Post-processing * concatenate abbreviations and decimals * split contractions with regexes 2-character: i['‘’`]m|(?:s?he|it)['‘’`]s|(?:i|you|s?he|we|they)['‘’`]d$ 3-character: (?:i|you|s?he|we|they)['‘’`](?:ll|[vr]e)|n['‘’`]t$
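The contraction-splitting step could look roughly like the sketch below. It reuses the two regexes from the slide, with two assumptions: matching is done per token and case-insensitively, and the pronoun-based alternatives are anchored to the whole token while the `n't` alternative only has to match at the end. The function name `split_contraction` is hypothetical.

```python
import re

# 2-character suffixes: 'm, 's, 'd (only after the listed pronouns).
TWO_CHAR = re.compile(
    r"^(?:i['‘’`]m|(?:s?he|it)['‘’`]s|(?:i|you|s?he|we|they)['‘’`]d)$",
    re.IGNORECASE,
)

# 3-character suffixes: 'll, 've, 're after pronouns, and n't after any stem.
THREE_CHAR = re.compile(
    r"^(?:i|you|s?he|we|they)['‘’`](?:ll|[vr]e)$|n['‘’`]t$",
    re.IGNORECASE,
)


def split_contraction(token):
    """Split a contraction into [stem, suffix]; return [token] otherwise."""
    if TWO_CHAR.search(token):
        return [token[:-2], token[-2:]]
    if THREE_CHAR.search(token):
        return [token[:-3], token[-3:]]
    return [token]


examples = {t: split_contraction(t) for t in ["isn't", "I'm", "they'll", "Finland's"]}
```

Possessives such as "Finland's" are left intact here because the 2-character pattern only covers pronouns; whether to split them is one of the open issues listed on the tokenization slide.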
  18. 18. Rule-based Approach * easy to understand and reason about * can be arbitrarily precise * iterative, can be used to gather more data Limitations: * recall problems * poor adaptability
  19. 19. Rule-based NLP tools * SpamAssassin * LanguageTool * ELIZA * GATE
  20. 20. Statistical Approach “Probability theory is nothing but common sense reduced to calculation.” -- Pierre-Simon Laplace
  21. 21. Language Models Question: what is the probability of a sequence of words/sentence?
  22. 22. Language Models Question: what is the probability of a sequence of words/sentence? Answer: Apply the chain rule P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1) * P(w3|w0 w1 w2) * … where S = w0 w1 w2 …
  23. 23. Ngrams Apply the Markov assumption: each word depends only on the N previous words (in practice N=1..4, which gives bigrams to fivegrams, because the current word is counted too). If N=2 (trigrams): P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1) * P(w3|w1 w2) * … By the definition of conditional probability, which underlies the chain rule: P(w2|w0 w1) = P(w0 w1 w2) / P(w0 w1)
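As a rough illustration of the Markov assumption, the sketch below estimates n-gram probabilities by plain counting (maximum likelihood, no smoothing) and multiplies them to score a sentence. Padding with a `<s>` start symbol is an assumption made here so that the first words also have a context; the function names and the toy corpus are illustrative.

```python
from collections import Counter


def train_ngrams(sentences, n_prev=1):
    """Count each word together with its n_prev preceding words."""
    ngram_counts, context_counts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] * n_prev + sentence.split()
        for i in range(n_prev, len(words)):
            context = tuple(words[i - n_prev:i])
            ngram_counts[context + (words[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts


def sentence_prob(sentence, ngram_counts, context_counts, n_prev=1):
    """P(S) as a product of P(word | n_prev previous words)."""
    words = ["<s>"] * n_prev + sentence.split()
    prob = 1.0
    for i in range(n_prev, len(words)):
        context = tuple(words[i - n_prev:i])
        if context_counts[context] == 0:
            return 0.0  # unseen context: the unsmoothed estimate is zero
        prob *= ngram_counts[context + (words[i],)] / context_counts[context]
    return prob


corpus = ["time flies like an arrow", "fruit flies like a banana"]
ngram_counts, context_counts = train_ngrams(corpus, n_prev=1)
p_mixed = sentence_prob("time flies like a banana", ngram_counts, context_counts)
p_unseen = sentence_prob("fruit like a banana", ngram_counts, context_counts)
```

With only one previous word of context (bigrams), the model happily assigns non-zero probability to a sentence it never saw, while any unseen word pair drives the whole product to zero. That second effect is why real systems add smoothing.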
  24. 24. Spelling Correction Problem: given an out-of-dictionary word return a list of most probable in-dictionary corrections.
  25. 25. Edit Distance Minimum-edit (Levenshtein) distance: the minimum number of insertions/deletions/substitutions needed to transform string A into B. Other distance metrics: * the Damerau-Levenshtein distance adds another operation: transposition * the longest common subsequence (LCS) metric allows only insertion and deletion, not substitution * the Hamming distance allows only substitution, hence, it only applies to strings of the same length
  26. 26. Dynamic Programming
          Initialization:
            D(i,0) = i
            D(0,j) = j
          Recurrence relation:
            For each i = 1..M
              For each j = 1..N
                D(i,j) = D(i-1,j-1), if X(i) = Y(j)
                otherwise D(i,j) = min of:
                  D(i-1,j) + w_del(X(i))
                  D(i,j-1) + w_ins(Y(j))
                  D(i-1,j-1) + w_subst(X(i),Y(j))
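The recurrence can be sketched in Python as follows. This is a minimal version that defaults to unit costs (plain Levenshtein distance); the optional weight functions are an assumption about how `w_del`, `w_ins`, and `w_subst` might be passed in.

```python
def edit_distance(x, y,
                  w_del=lambda c: 1,
                  w_ins=lambda c: 1,
                  w_subst=lambda a, b: 1):
    """Minimum cost of turning string x into string y."""
    m, n = len(x), len(y)
    # d[i][j] = cost of transforming x[:i] into y[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # delete every character of x[:i]
        d[i][0] = d[i - 1][0] + w_del(x[i - 1])
    for j in range(1, n + 1):          # insert every character of y[:j]
        d[0][j] = d[0][j - 1] + w_ins(y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1]   # characters match, no cost
            else:
                d[i][j] = min(
                    d[i - 1][j] + w_del(x[i - 1]),
                    d[i][j - 1] + w_ins(y[j - 1]),
                    d[i - 1][j - 1] + w_subst(x[i - 1], y[j - 1]),
                )
    return d[m][n]


distance = edit_distance("kitten", "sitting")
```

Passing a substitution cost of 2 (`w_subst=lambda a, b: 2`) gives the variant in which a substitution counts as a deletion plus an insertion.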
  27. 27. Noisy Channel Model Given an alphabet A, let A* be the set of all finite strings over A. Let the dictionary D of valid words be some subset of A*. The noisy channel is the matrix G = P(s|w) where w in D is the intended word and s in A* is the scrambled word that was actually received. P(s|w) = sum(P(x(i)|y(i))) for x(i) in s* (s aligned with w) for y(i) in w* (w aligned with s)
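A heavily simplified sketch of noisy-channel spelling correction is shown below. It assumes a uniform channel: every dictionary word within one edit (insertion, deletion, substitution, or transposition) of the received string is treated as equally likely under P(s|w), so candidates are ranked only by how frequent the word is. The toy word counts and the names `edits1` and `correct` are illustrative, not part of the slides.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word, word_counts, top=3):
    """Return up to `top` in-dictionary corrections, most frequent first."""
    if word in word_counts:
        return [word]
    candidates = [w for w in edits1(word) if w in word_counts]
    # Sort by descending frequency, then alphabetically for stable output.
    return sorted(candidates, key=lambda w: (-word_counts[w], w))[:top]


word_counts = Counter({"the": 50, "then": 10, "they": 8, "tea": 2})
suggestions = correct("teh", word_counts)
```

A fuller model would replace the uniform channel with per-character error probabilities estimated from real typos.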
  28. 28. Machine Learning Approach
  29. 29. Spam Filtering A 2-class classification problem with a bias towards minimizing false positives (FPs). Default approach: rule-based (SpamAssassin) Problems: * scales poorly * hard to reach arbitrary precision * hard to rank the importance of complex features
  30. 30. Bag-of-words Models * each word is a feature * each word is independent of others * position of the word in a sentence is irrelevant Pros: * simple * fast * scalable Limitations: * independence assumption doesn't hold Initial results: recall: 92%, precision: 98.84% Improved results: recall: 99.5%, precision: 99.97%
  31. 31. Naive Bayes Classifier P(Y|X) = P(Y) * P(X|Y) / P(X) select Y = argmax P(Y|X) Naive step: P(Y|X) ∝ P(Y) * prod(P(x|Y)) for all x in X (P(X) is dropped because it's the same for all Y)
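The following is a small bag-of-words Naive Bayes sketch for the spam example. It makes two choices that the slides do not specify: add-one (Laplace) smoothing for unseen words, and sums of log-probabilities instead of products to avoid underflow. The class name, the toy training data, and the labels are illustrative.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.class_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # log P(Y) + sum of log P(x|Y), with add-one smoothing
            score = math.log(count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                numerator = self.word_counts[label][word] + 1
                score += math.log(numerator / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


classifier = NaiveBayes()
classifier.train([
    ("win cash prize now", "spam"),
    ("cheap pills win now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
])
```

Calling `classifier.predict("win a cash prize")` then scores each class and returns the one with the highest posterior.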
  32. 32. Dependency Parsing nsubj(ate-2, They-1) root(ROOT-0, ate-2) det(pizza-4, the-3) dobj(ate-2, pizza-4) prep(ate-2, with-5) pobj(with-5, anchovies-6) t-algorithm-for-natural-language-dependency-parsing/
  33. 33. Shift-reduce Parsing
  34. 34. Shift-reduce Parsing
  35. 35. ML-based Parsing The parser starts with an empty stack, and a buffer index at 0, with no dependencies recorded. It chooses one of the valid actions, and applies it to the state. It continues choosing actions and applying them until the stack is empty and the buffer index is at the end of the input.

          SHIFT = 0; RIGHT = 1; LEFT = 2
          MOVES = [SHIFT, RIGHT, LEFT]

          # init_deps, extract_features, score, get_valid_moves and
          # transition are helpers defined elsewhere.
          def parse(words, tags):
              n = len(words)
              deps = init_deps(n)  # no dependencies recorded yet
              idx = 1              # buffer index
              stack = [0]
              while stack or idx < n:
                  features = extract_features(words, tags, idx, n, stack, deps)
                  scores = score(features)
                  valid_moves = get_valid_moves(idx, n, len(stack))
                  # pick the highest-scoring move among the valid ones
                  next_move = max(valid_moves, key=lambda move: scores[move])
                  idx = transition(next_move, idx, stack, deps)
              return tags, deps
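To show what the three moves do to the parser state, here is a self-contained sketch that applies a hand-written move sequence instead of a learned scorer. Its conventions are assumptions and differ slightly from the slide: the stack starts empty, the buffer index starts at 0, a `ROOT` token is appended at the end of the sentence, and heads are stored in a plain dict. `parse_with_moves` is a hypothetical helper.

```python
SHIFT, RIGHT, LEFT = 0, 1, 2


def get_valid_moves(idx, n, stack_depth):
    """Moves allowed in the current state (ROOT, at n - 1, is never shifted)."""
    moves = []
    if idx < n - 1:
        moves.append(SHIFT)
    if stack_depth >= 2:
        moves.append(RIGHT)
    if stack_depth >= 1:
        moves.append(LEFT)
    return moves


def transition(move, idx, stack, heads):
    """Apply one move; return the new buffer index."""
    if move == SHIFT:
        stack.append(idx)          # move the next buffer word onto the stack
        return idx + 1
    child = stack.pop()
    # RIGHT: head is the word below on the stack; LEFT: head is the buffer word
    heads[child] = stack[-1] if move == RIGHT else idx
    return idx


def parse_with_moves(words, moves):
    n = len(words)
    idx, stack, heads = 0, [], {}
    for move in moves:
        assert move in get_valid_moves(idx, n, len(stack))
        idx = transition(move, idx, stack, heads)
    return heads, stack, idx


words = ["They", "ate", "the", "pizza", "ROOT"]
gold_moves = [SHIFT, LEFT, SHIFT, SHIFT, LEFT, SHIFT, RIGHT, LEFT]
heads, stack, idx = parse_with_moves(words, gold_moves)
```

The resulting `heads` maps each word index to the index of its head: "They" and "pizza" attach to "ate", "the" attaches to "pizza", and "ate" attaches to `ROOT`. In the ML-based parser, the hand-written move list is replaced by the classifier's highest-scoring valid move at each step.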
  36. 36. Averaged Perceptron

          import random

          def train(model, number_iter, examples):
              for i in range(number_iter):
                  for features, true_tag in examples:
                      guess = model.predict(features)
                      if guess != true_tag:
                          # reward the correct tag, penalize the wrong guess
                          for f in features:
                              model.weights[f][true_tag] += 1
                              model.weights[f][guess] -= 1
                  random.shuffle(examples)
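The training loop above relies on a `model` object and does not show the averaging step itself. One possible shape for such a model is sketched below, under stated assumptions: weights live in nested dictionaries, and averaging is done naively by summing every weight after every example and dividing at the end. Practical implementations use a timestamp trick to avoid that cost. The class, the `update` and `average` methods, and the toy features are all hypothetical.

```python
import random
from collections import defaultdict


class Perceptron:
    def __init__(self, classes):
        self.classes = sorted(classes)
        self.weights = defaultdict(lambda: defaultdict(float))
        self._totals = defaultdict(lambda: defaultdict(float))
        self._steps = 0

    def predict(self, features):
        scores = {c: 0.0 for c in self.classes}
        for f in features:
            if f in self.weights:
                for c, w in self.weights[f].items():
                    scores[c] += w
        return max(self.classes, key=lambda c: scores[c])

    def update(self, true_tag, guess, features):
        if guess != true_tag:
            for f in features:
                self.weights[f][true_tag] += 1.0
                self.weights[f][guess] -= 1.0
        # Naive averaging: accumulate the current weights after every example.
        self._steps += 1
        for f, per_class in self.weights.items():
            for c, w in per_class.items():
                self._totals[f][c] += w

    def average(self):
        """Replace each weight by its mean value over all training steps."""
        for f, per_class in self._totals.items():
            for c, total in per_class.items():
                self.weights[f][c] = total / self._steps


def train(model, number_iter, examples, seed=0):
    rng = random.Random(seed)
    examples = list(examples)
    for _ in range(number_iter):
        for features, true_tag in examples:
            model.update(true_tag, model.predict(features), features)
        rng.shuffle(examples)
    model.average()


examples = [(["suffix=ing"], "VERB"), (["prev=the"], "NOUN"), (["suffix=ous"], "ADJ")]
model = Perceptron({"ADJ", "NOUN", "VERB"})
train(model, 5, examples)
```

Averaging matters because the final weights of a plain perceptron depend heavily on the last few examples seen; the mean over all steps is a more stable estimate.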
  37. 37. Features * Word and tag unigrams, bigrams, trigrams * The first three words of the buffer * The top three words of the stack * The two leftmost children of the top of the stack * The two rightmost children of the top of the stack * The two leftmost children of the first word in the buffer * Distance between top of buffer and stack
  38. 38. Discriminative ML Models Linear: * (Averaged) Perceptron * Maximum Entropy / LogLinear / Logistic Regression; Conditional Random Field * SVM Non-linear: * Decision Trees, Random Forests * Other ensemble classifiers * Neural networks
  39. 39. Semantics Question: how to model relationships between words?
  40. 40. Semantics Question: how to model relationships between words? Answer: build a graph WordNet Freebase DBpedia
  41. 41. Word Similarity Next question: now, how do we measure those relations?
  42. 42. Word Similarity Next question: now, how do we measure those relations? * different WordNet similarity measures
  43. 43. Word Similarity Next question: now, how do we measure those relations? * different WordNet similarity measures * PMI(x,y) = log(p(x,y) / (p(x) * p(y)))
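Pointwise mutual information can be estimated from simple co-occurrence counts. The sketch below assumes that p(x) is the fraction of sentences containing x and p(x,y) is the fraction containing both words. Other definitions, such as sliding windows, are equally common. The function `make_pmi` and the toy sentences are illustrative.

```python
import math
from collections import Counter
from itertools import combinations


def make_pmi(sentences):
    """Build a PMI function from sentence-level co-occurrence counts."""
    word_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        words = set(sentence.lower().split())
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    n = len(sentences)

    def pmi(x, y):
        joint = pair_counts[frozenset((x, y))] / n
        if joint == 0:
            return -math.inf  # the words never co-occur
        return math.log(joint / ((word_counts[x] / n) * (word_counts[y] / n)))

    return pmi


pmi = make_pmi(["new york city", "new york times", "new car", "old city"])
pmi_new_york = pmi("new", "york")
pmi_old_city = pmi("old", "city")
```

A positive value means the two words occur together more often than independence would predict; rare pairs get inflated scores, which is a known weakness of raw PMI.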
  44. 44. Distributional Semantics Distributional hypothesis: "You shall know a word by the company it keeps" --John Rupert Firth Word representations: * Explicit representation Number of nonzero dimensions: max:474234, min:3, mean:1595, median:415 * Dense representation (word2vec, GloVe) * Hierarchical representation (Brown clustering)
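The explicit (sparse) representation can be illustrated with a few lines of Python: each word is represented by the counts of words that appear near it, and similarity is the cosine between those count vectors. The window size, the toy sentences, and the function names are assumptions made for this sketch.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a sparse vector of neighbour counts."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
        math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


vectors = cooccurrence_vectors([
    "the cat drinks milk",
    "the dog drinks water",
    "the car needs fuel",
])
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_car = cosine(vectors["cat"], vectors["car"])
```

Here "cat" and "dog" share more context words than "cat" and "car", so their vectors end up closer. Dense representations such as word2vec and GloVe compress the same kind of signal into a few hundred dimensions.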
  45. 45. Steps to Develop an NLP System * Translate real-world requirements into a measurable goal * Find a suitable level and representation * Find initial data for experiments * Find and utilize existing tools and frameworks where possible * Don't trust research results * Set up and perform a proper experiment (series of experiments)
  46. 46. Going into Prod * NLP tasks are usually CPU-intensive but stateless * General-purpose NLP frameworks are (mostly) not production-ready * Value pre- and post-processing * Gather user feedback
  47. 47. Final Words We have discussed: * linguistic basis of NLP (although some people manage to do NLP without it) * rule-based & statistical/ML approaches * different concrete tasks We haven't covered: * all the different tasks, such as MT, question answering, etc. (but they use the same techniques) * deep learning for NLP * natural language understanding (which remains an unsolved problem)