Intro to NLP

  1. Intro to Natural Language Processing (NLP)
  2. What is Natural Language Processing (NLP)? By “natural language” we mean a language that is used for everyday communication by humans. NLP is an intersection of several fields ▪ Computer Science ▪ Artificial Intelligence ▪ Linguistics It is basically teaching computers to process human language. Two main components: ▪ Natural Language Understanding (NLU) ▪ Natural Language Generation (NLG) NLP is AI-complete ▪ Requires all types of knowledge humans possess → It’s hard!
  3. Natural Language Understanding (NLU) • Deriving meaning from natural language • Imagine a concept (aka semantic or representation) space ▪ In it, any idea/word/concept has a unique computer representation ▪ Usually via a vector space ▪ NLU → Mapping language into this space
  4. Natural Language Generation (NLG) • Mapping from computer representation space to language space • Opposite direction of NLU ▪ Usually need NLU to perform NLG! • NLG is really hard!
  5. NLP: Speech vs Text • Natural language can refer to text or speech • The goal for both is the same: translate raw data (text or speech) into underlying concepts (NLU), then possibly into the other form (NLG) [Diagram: Text and Speech each map into the concept space via NLU and back out via NLG; chaining the two gives text-to-speech and speech-to-text]
  6. History of NLP • NLP has been through (at least) 3 major eras: ▪ 1950s-1980s: Linguistics Methods and Handwritten Rules ▪ 1980s-Now: Corpus/Statistical Methods ▪ Now-???: Deep Learning • Lucky you! You’re right near the start of a paradigm shift!
  7. 1950s - 1980s: Linguistics/Rule Systems • NLP systems focus on: ▪ Linguistics: Grammar rules, sentence structure parsing, etc ▪ Handwritten Rules: Huge sets of logical (if/else) statements ▪ Ontologies: Manually created (domain-specific!) knowledge bases to augment rules above • Problems: ▪ Too complex to maintain ▪ Can’t scale! ▪ Can’t generalize!
  8. 1980s - Now: Corpus/Statistical Methods • NLP starts using Machine Learning methods • Use statistical learning over huge datasets of unstructured text ▪ Corpus: Collection of text documents ▪ e.g. Supervised Learning: Machine Translation ▪ e.g. Unsupervised Learning: Deriving Word "Meanings" (vectors)
  9. Now - ???: Deep Learning • Deep Learning made its name with images first • 2012: deep learning has its major breakthrough ▪ Researchers use a neural network to win the Large Scale Visual Recognition Challenge (LSVRC), an image classification competition ▪ This state-of-the-art approach beat other ML approaches by a wide margin, cutting the error rate from 26% to 16% • The same techniques soon spread to NLP and are very useful for unified processing of language + images
  10. NLP Definitions • Phonemes: the smallest sound units in a language • Morphemes: the smallest units of meaning in a language • Syntax: how words and sentences are constructed from these two building blocks • Semantics: the meaning of those words and sentences • Discourse: semantics in context. Conversation, persuasive writing, etc.
  11. Levels of NLP [Diagram: levels from phonemes and morphemes (speech and text) up through syntax and semantics to discourse, showing how far early rules engines, corpus methods, and modern deep learning each reach]
  12. NLU Applications • ML on Text (Classification, Regression, Clustering) • Document Recommendation • Language Identification • Natural Language Search • Sentiment Analysis • Text Summarization • Extracting Word/Document Meaning (vectors) • Relationship Extraction • Topic Modeling • …and more!
  13. NLU Application: Document Classification • Classify “documents” - discrete collections of text - into categories ▪ Example: classify emails as spam vs. not spam ▪ Example: classify movie reviews as positive vs. negative ▪ Example: classify legal documents as relevant vs. not relevant to a topic
  14. NLU Application: Document Recommendation • Choosing the most relevant document based on some information: ▪ Example: show most relevant webpages based on query to search engine ▪ Example: recommend news articles based on past articles liked ▪ Example: recommend restaurants based on Yelp reviews
  15. NLU Application: Topic Modeling • Breaking a set of documents into topics at the word level ▪ Example: see how the prevalence of certain topics covered in a magazine changes over time ▪ Example: find documents belonging to a certain topic
  16. NLG Applications • Image Captioning • (Better) Text Summarization • Machine Translation • Question Answering/Chatbots • …so much more • Notice NLU is almost a prerequisite for NLG
  17. NLG Application: Image Captioning • Automatically generate captions for images Captions automatically generated. Source: https://cs.stanford.edu/people/karpathy/cvpr2015.pdf
  18. NLG Application: Machine Translation • Automatically translate text between languages Example from Google’s machine translation system (2016) Source: https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html
  19. NLG Application: Machine Translation Source: https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html
  20. NLG Application: Text Summarization • Automatically generate text summaries of documents ▪ Example: generate headlines of news articles Source: https://ai.googleblog.com/2016/08/text-summarization-with-tensorflow.html
  21. Where We’re Headed • Today: ▪ Regular Expression (regex) Basics • Upcoming: ▪ NLP Preprocessing Tasks (Week 2) ▪ Deriving features from text (Week 3) ▪ Machine Learning with Text (Week 4) ▪ Topic Modeling on Text (Weeks 5-7) ▪ Advanced Topics for NLG including Deep Learning (Week 8)
  22. Regular Expressions
  23. What are Regular Expressions (“RegEx”)? • A regex describes a pattern to look for in text • Examples: ▪ Does a string contain the word “cat”? ▪ Does a string contain 3 letters and then 2 numbers followed by a space? ▪ Does a string contain one or more letters (and only letters)? • Typical Usage: ▪ Find strings that contain the regex pattern ▪ Retrieve the part of a string matching the pattern
  24. Use Cases for Regex • Parsing documents with expected component structure ▪ Use regex to grab the pieces you want ▪ e.g. HTML: find all headers, i.e. text within <h1>–<h6> tags ▪ e.g. remove known boilerplate sections from emails • Validating user input ▪ e.g. does an email address match xxx@xxx.com? • NLP preprocessing ▪ What pattern(s) represent individual words in text? ▪ e.g. space + letters + space → grab the letters
  25. Regex in Python • Import • Compile pattern • Pattern methods
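The editor’s notes mention searching a `text` variable for the pattern ‘Sherlock Holmes’. A minimal sketch of the three steps above (import, compile, pattern methods); the sample sentence is an assumption:

```python
import re

text = "The detective Sherlock Holmes lives at 221B Baker Street."

# 1. import (above), 2. compile the pattern, 3. call pattern methods
pattern = re.compile("Sherlock Holmes")

match = pattern.search(text)   # first occurrence, or None
print(match.group())           # -> Sherlock Holmes
print(pattern.findall(text))   # -> ['Sherlock Holmes']
```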
  26. Metacharacters • . matches any single character • ^ beginning of string • $ end of string • * matches 0 or more repetitions of the preceding character • + matches 1 or more repetitions of the preceding character • ? makes the preceding character optional
  27. Metacharacters: Examples • Helper function:
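The helper function itself is not visible in this transcript; a plausible reconstruction, assuming the sample text “Sherlock Holmes” that the notes refer to:

```python
import re

text = "Sherlock Holmes"

def find_pattern(pattern, string=text):
    """Report the first substring matching `pattern`, or 'No match'."""
    match = re.search(pattern, string)
    return match.group() if match else "No match"

print(find_pattern("lock"))  # -> lock
print(find_pattern("z"))     # -> No match
```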
  28. Metacharacters: Examples • . matches any single character • ^ beginning of string
  29. Metacharacters: Examples • $ end of string
  30. Metacharacters: Examples • * matches 0 or more repetitions of the preceding character • + matches 1 or more repetitions of the preceding character
  31. Metacharacters: Examples • ? makes the preceding character optional
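The example slides above can be condensed into one runnable sketch, consistent with the notes’ “Sherlock Holmes” examples:

```python
import re

text = "Sherlock Holmes"

print(re.search(".", text).group())     # -> S    (any single character: the first one found)
print(re.search("^S", text).group())    # -> S    (string begins with S)
print(re.search("s$", text).group())    # -> s    (string ends with s)
print(re.search("z*", text).group())    # -> ''   (zero z's still matches the empty string)
print(re.search("z+", text))            # -> None (needs at least one z)
print(re.search("S3?h", text).group())  # -> Sh   (the 3 is optional)
```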
  32. More Metacharacters • {m,n} matches the preceding character between m and n times • [ ] list of characters to be matched • \ escape character • | or • ( ) capture group inside parentheses
  33. Metacharacters: Examples • {m,n} matches the preceding character between m and n times • [ ] list of characters to be matched
  34. Metacharacters: Examples • \ escape character • | or
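A sketch of these metacharacters, reusing the examples from the notes (the sample strings are assumptions):

```python
import re

print(re.search("2{1,3}", "221B Baker Street").group())  # -> 22   (two 2's, within 1-3 repeats)
print(re.search("2{3,4}", "221B Baker Street"))          # -> None (needs at least three 2's)
print(re.search("[ik]", "Sherlock").group())             # -> k    (match i or k)
print(re.search(r"\?", "Elementary?").group())           # -> ?    (escaped metacharacter)
print(re.search("cat|dog", "hotdog").group())            # -> dog  (alternation)
```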
  35. Character Classes • \s - matches any whitespace • \w - matches any word character. Equivalent to [A-Za-z0-9_] • \d - matches any numeric character. Equivalent to [0-9] You may negate these by capitalizing. For example, \D matches anything that is not a digit
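The character classes above in action; the sample string is an assumption:

```python
import re

text = "221B Baker Street"

print(re.search(r"\d+", text).group())  # -> 221            (digits)
print(re.search(r"\w+", text).group())  # -> 221B           (word characters)
print(re.search(r"\s", text).group())   # -> ' '            (whitespace)
print(re.search(r"\D+", text).group())  # -> B Baker Street (anything not a digit)
```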
  36. Handling Groups • ( ) Parentheses control which parts of the string are returned after a match
  37. Character Classes and Groups Example
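The editor’s notes walk through this pattern piece by piece; as runnable code (the sample sentence is an assumption):

```python
import re

text = "Sherlock Holmes lives at 221B Baker Street in London."

# Group 1: digits plus optional word characters; Group 2: two capitalized words
pattern = re.compile(r"(\d+\w*)\s+([A-Z]{1}\w+\s+[A-Z]{1}\w+)")

match = pattern.search(text)
print(match.group(1))  # -> 221B
print(match.group(2))  # -> Baker Street
```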
  38. Splitting by Regex • re.split() • Can be used to split a string on any regex match • Frequently done when you don’t know the type of whitespace, or to handle dashes and underscores
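A sketch of splitting on mixed whitespace, dashes, and underscores, as the slide describes (the sample string is an assumption):

```python
import re

# Split on any run of whitespace, dashes, or underscores
print(re.split(r"[\s\-_]+", "some text_with-mixed separators"))
# -> ['some', 'text', 'with', 'mixed', 'separators']
```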
  39. Lookaheads/Lookbehinds • Allow you to keep the cursor in place, but still check whether certain conditions are met before or after that character • Lookahead: (?=\d) checks whether the character is followed by a number • Lookbehind: (?<=[aeiou]) checks whether the character is preceded by a vowel
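A small sketch of both lookarounds; the sample string is an assumption:

```python
import re

# Lookahead: a word character immediately followed by a digit
print(re.search(r"\w(?=\d)", "item42").group())        # -> m
# Lookbehind: a word character immediately preceded by a vowel
print(re.search(r"(?<=[aeiou])\w", "item42").group())  # -> t
```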
  40. Search and Replace • re.sub() allows for quick and easy search and replace
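A minimal re.sub() sketch (the sample text is an assumption):

```python
import re

text = "Sherlock Holmes and Mycroft Holmes"
print(re.sub("Holmes", "Watson", text))
# -> Sherlock Watson and Mycroft Watson
```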
  41. Find All • Frequently we will want to find all occurrences of a certain bit of text in a corpus • pattern.findall(text) is very useful for doing this
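A sketch of counting all occurrences with a compiled pattern (the sample corpus is an assumption):

```python
import re

corpus = "The cat sat near another cat while a third cat slept."
pattern = re.compile("cat")
matches = pattern.findall(corpus)
print(matches)       # -> ['cat', 'cat', 'cat']
print(len(matches))  # -> 3
```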
  42. Regex Flags • re.I - ignore case • re.L - locale dependent • re.M - multiline • re.S - dot matches all • re.U - unicode dependent • re.X - verbose
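Three of the flags above in a quick sketch (the sample strings are assumptions):

```python
import re

print(re.search("holmes", "Sherlock Holmes", re.I).group())  # -> Holmes (case ignored)
print(re.findall(r"^\w+", "one\ntwo\nthree", re.M))          # -> ['one', 'two', 'three']
print(re.search("a.b", "a\nb", re.S).group())                # -> 'a\nb' (dot matches newline too)
```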
  43. Greedy vs Non-Greedy • The + and * operators are greedy: they try to match as much as they can. Given <div>this is some text</div>, the pattern <.+> will not match just <div>; it will match the entire line. To make it non-greedy (lazy), add a ? after the operator: <.+?> will match just <div>
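The slide’s example as runnable code:

```python
import re

html = "<div>this is some text</div>"

print(re.search("<.+>", html).group())   # greedy -> <div>this is some text</div>
print(re.search("<.+?>", html).group())  # lazy   -> <div>
```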

Editor’s notes

  • Welcome to Week 1! Today we will be introducing you to Natural Language Processing (NLP).

  • What is NLP? The process of using computers for extracting meaning from text. It can be any kind of computer manipulation of natural language. NLP could be as simple as calculating word frequencies; or more complex use-cases such as predictive text (autocomplete), handwriting recognition or grouping paragraphs into topics based on their content and context.
  • Natural language understanding is the process of converting words from our human language into computer language. Usually, this is done by representing words as vectors, which is a format that computers can read. When a computer can read individual words, it can combine them in some way to understand a body of text (document). This is a necessary first step for us to perform any kind of analysis on a text, using a computer.
  • Natural Language Generation on the other hand is the process of converting from the concept space back into human language. For this to occur, the computer must already have words in its concept space to have any sense of what a word ‘means’. A good example of natural language generation is a chatbot as seen in the photo. Getting chatbots that respond like humans is really tough!
  • In the phrase “natural language processing”, the “natural language” can refer to either text or speech data. The process of converting either of these into concept space/computer-readable format is NLU. The process of going from the concept space back to either speech or text is natural language generation.
    The entire process of switching from text to speech through concept space is text-to-speech while the entire process of switching from speech to text through concept space is speech-to-text.
  • none
  • Early attempts at trying to understand languages relied heavily on if/else statements to encode every edge case you could think of. For example, while most words can be converted to past tense by adding ‘ed’ (eg. kick -> kicked), some words don’t follow this pattern (eg. run -> ran). The problems with this are apparent: nobody has the time to hard-code every rule, so these systems can’t scale or generalize.
  • In the 80s, NLP approaches grew from using linguistic and handwritten rules to creating easily generalizable statistical models learned from text corpora. This era introduced supervised and unsupervised learning in Natural Language Processing.
  • Deep Learning is a subset of Machine Learning algorithms that use neural networks and a large (LARGE) amount of data to recognize patterns.

    Breakthrough source: https://beamandrew.github.io/deeplearning/2017/02/23/deep_learning_101_part1.html


  • none
  • Major levels of linguistic structure: https://courses.lumenlearning.com/boundless-psychology/chapter/introduction-to-language/
  • none
  • Here’s a cool post by Google Cloud on Document Classification: https://cloud.google.com/blog/big-data/2018/01/problem-solving-with-ml-automatic-document-classification
  • Cool implementation of document recommendation for movies: https://www.kernix.com/blog/recommender-system-based-on-natural-language-processing_p10
    Bookmark this link! You’ll find it a lot easier to understand after the course.
  • Every document we read can be thought of as consisting of many topics all stacked upon one another.
    We can unpack these topics using NLP techniques.

    We will be learning more about this topic later in the course!
  • none
  • As you can see, the algorithms generally perform pretty well and even sound human-like. However, the picture on the right seems to be mislabeled. The algorithm probably thought that the water in the background meant that the boy was wakeboarding.
  • PBMT: Phrase Based Machine Translation. It breaks up the inputs into words and phrases which are each translated individually.
    GNMT: Google Neural Machine Translation. State of the art method developed by Google that considers a sentence as a whole. Currently being used to translate Chinese to English on Google Translate.

    As you can see the GNMT definitely outperforms the PBMT in terms of readability and I might argue that it even beats the human translation.
  • PBMT: Phrase Based Machine Translation. It breaks up the inputs into words and phrases which are each translated individually.
    GNMT: Google Neural Machine Translation. State of the art method developed by Google that considers a sentence as a whole. Currently being used to translate Chinese to English on Google Translate.

    As you can see the GNMT definitely outperforms the PBMT in terms of readability and I might argue that it even beats the human translation.
  • Google developed this text summarization using TensorFlow and so far has had success in generating sensible headlines from the first few sentences of an article, but the method currently performs more poorly on longer texts.
  • none
  • none
  • none
  • This example would search the variable text for occurrences of the pattern ‘Sherlock Holmes’.
  • none
  • This is the function that we will be using in later slides to demonstrate the properties of these metacharacters.
  • Because the period matches any single character, it matched the first one it found, “S”.

    Because there is no string that starts with ‘e’, that is the only pattern that doesn’t have a match.
  • Because there is no string in the searched text that end in ‘t’, the search returns no match.
  • Because there is no ‘z’ in Sherlock Holmes, “z*” is able to match an empty string, but “z+” is not able to match anything because there are not 1 or more z’s.

  • Because we’ve added the ‘?’ character after the 3, it becomes an optional character. So either “Sh” or “S3h” would be detected by this pattern.
  • None
  • The first example specifies that the number two is to be matched between one and three times. There are two 2’s, so both are matched. Notice that if we ask for between three and four 2’s, no match is found.

    [ik] means match i or k in the text.
  • Remember, in regex ‘?’ is a metacharacter that has meaning in a pattern, so to search for a ‘?’ in the text, we must escape it.
  • These are special regex expressions matching commonly sought for characters. You can search for the negation or opposite of these character classes by capitalizing them.
  • none
  • (\d+\w*) -> one or more digits followed by zero or more word characters -> 221B
    \s+ -> one or more spaces
    ([A-Z]{1}\w+ -> one capital letter followed by one or more word characters -> Baker
    \s+ -> one or more spaces
    [A-Z]{1}\w+) -> one capital letter followed by one or more word characters -> Street

    Notice that the two groups are “221B” and “Baker Street” based on where the parentheses are
  • Useful if you have to break a file up by some special kind of character or sequence. For example, if you had a book and you wanted to split it into chapters, you could use re.split() on the pattern ‘Chapter’.
  • none
