Shailesh Patel, Head of AI for the HMRC Account. I lead the AI Centre of Excellence, and my team provides Enterprise AI solutions and support to our internal delivery groups as well as HMRC.
I’ve been working with Capgemini for 15 years and on the HMRC Account for 11 of those years.
Why Text Analytics? The processing of unstructured data is key for most organisations.
My team creates PoCs for Text Analytics-based solutions, e.g. email caches, free-format text, etc.
This text data holds a lot of valuable information, and there are techniques available that allow us to mine that insight to better understand the value.
This presentation offers an appreciation of text analytics approaches.
Do a poll and get a view of how many know what AI, Machine Learning, Deep Learning, Reinforcement Learning and Transfer Learning are.
Do people know the difference between linear regression and logistic regression?
Do people know what a neural network is and the concept of neurons and activation functions?
Do people understand Bayesian statistical modelling for classification approaches?
The term AI was coined by John McCarthy in 1955, when he was a mathematics professor at Dartmouth; he later founded the AI lab at Stanford University.
Umbrella term for learning-based algorithms that can be made to improve with better data. Change the data, not the AI code: in the old days we changed code, now we change data.
Data production will be 44 times greater in 2020 than it was in 2009
We generate more data in 2 days than was generated up to 2002
Big Data is the driver behind this revolution. And now that we have a confluence of Big Data, Compute and AI Algorithms we can actually gain insights from the data
Unstructured Data Represents 70% to 90% of the Data Captured
Most of the data we capture is unstructured: image, video, audio and free-format text. And most of those other formats result in textual representations, like captions, speech-to-text transcripts, or just descriptive text.
Pattern Based Reusable Approaches Now Available
We have lots of text analytics techniques we can use and reuse to analyse that text data and we use it to classify, cluster and network that information.
The approaches and techniques I describe today are reusable. And they all yield some value in terms of analytics or processing of text data. Based on statistical modelling we can now use techniques to classify, cluster and network data.
Which leads to…
Accelerated Approaches for Actionable Insights
This type of data can be difficult and time consuming to process manually but now with these approaches we can accelerate the activity.
Large volumes, unstructured: it's like looking for a needle in a haystack.
Information can be hidden in plain sight in forms that make it difficult to process, e.g. scanned PDF files, handwritten documents, or just prose describing something.
We can use these techniques with automation to help accelerate our path to insight.
Resulting in improved customer experience, increased sales, detection of fraud or non-compliance, and so on.
All these techniques are based on mathematical modelling i.e. statistics, linear algebra, calculus, and so on.
So today…
********
Corpus and Corpora.
This includes text, audio, image and video data. This session will concentrate on the unique insights gained from text data, including the pre-processing of the other formats which results in text.
Utilising techniques around Natural Language Processing, Semantic Analysis, Sentiment Analytics and other advanced modelling approaches can result in valuable competitive intelligence and insight.
A bit of fun!
Statistics, Calculus, Linear Algebra, etc
What we are going to do today is go through the derivation of some of these equations from first principles. Only joking!
What this shows is that all the Text Analytics processes are based on statistical modelling! They are statistical problem solvers. Also remember that you can prove or disprove anything with statistics, so you need some experts to validate your models.
********
Going clockwise
LDA (Latent Dirichlet Allocation, pronounced roughly "Diri-clay")
LSA (Latent Semantic Analysis)
Linear Regressions
ADAM Optimizer for Neural Networks (instead of Stochastic Gradient Descent)
Principal Component Analysis
Logistic Regression
Stochastic Gradient Descent
SVD (Singular Value Decomposition) NOT on picture. Like PCA, it is used for dimensionality reduction.
Relevance: With the deluge of information, we need automated mechanisms to process it so that we only get information relevant to our needs. This could be search results, recommender systems, etc.
Feedback: As feedback is captured from your customers, how will you analyse it and gain insights from that data? How will you ensure you are supplying the best service to your customers?
AI Assistance: As AI based services become more and more prevalent, we need better systems that communicate with us using natural language interfaces providing NLU and NLG. These need a more sophisticated ability to analyse text and ensure the service consumes the semantic meaning of our requests. Talk about Google Duplex demo Google I/O 2018 based on RNNs. Use of Linguistic Modifiers. One of the key research insights was to constrain Duplex to closed domains, which are narrow enough to explore extensively. Duplex can only carry out natural conversations after being deeply trained in such domains. It cannot carry out general conversations. The system also sounds more natural thanks to the incorporation of speech disfluencies (e.g. “hmm”s and “uh”s).
Insights: Given the volume of data captured, there are more insights to be gained from analysing the data. Most forms of data capture will result in some form of text, i.e. pictures with caption generation, speech to text, optical character recognition and literal text data. This can be consumed and analysed to provide predictions, analysis and more.
Experience: With an understanding of this text data we can now improve the customer's experience of our services. The use of chatbots allows us to provide an expedient, efficient, no-wait service. The analysis of buying habits means we can better predict what customers want and provide them with recommendations on what else they may like. The list is endless.
These are a very small number of the areas that benefit from text analysis.
********
Feedback Processing, Email Processing,
AI Assistants like Alexa, Google Assistant, Cortana and Siri, to convert speech to text
Search Engines to Make Search Results More Relevant
Filtering for reduction in Spam or increase topics of interest
Organise the Data into Topics of Interest i.e. News, Sport, etc
Text Summaries
ChatBot Processing
A corpus is a body of text for analysis. It can consist of many documents, each made up of many sentences.
Text Wrangling / Text Pre-processing / Feature Transformation – Convert Unstructured Text into a Multi-Dimensional Structured Representation
Text Extraction – Get the body text out of HTML, binary PDF/Word, XML, etc.
Text Normalization – Store the text in a consistent form. As simple as removing syntax modifiers i.e. removing non-alphanumeric characters and more complex domain specific normalisation using context.
TF-IDF – Term Frequency / Inverse Document Frequency – logarithm-based frequency weighting
Text Vectorization – Frequency Vectors (The simplest vector encoding model is to simply fill in the vector with the frequency of each word as it appears in the document), One Hot Encoding, Bag of Words, Bag of Sequences
Tokenisation – Requires Substantial Domain Knowledge. Identify words, places, names, etc i.e. turn characters into something meaningful i.e. identify words
Stop Word Removal – Remove ‘the’, ’a’, ’for’, etc which are noise in the data.
Stemming – Change plural to singular, etc. Games == Game, Frequencies == Frequency; Sinking, Sank, Sink == Sink. Lemmatization is the dictionary-based equivalent.
Dimensionality Reduction – Reduce the features used. Could be driven by frequency, i.e. keep only the most frequent words, or even fewer but more important domain-specific words.
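The pre-processing steps above can be sketched in a few lines of Python. A minimal illustration, not production code: the stop-word list and suffix stemmer here are toy stand-ins for real resources such as NLTK's stop-word lists and the Porter stemmer.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "for", "of", "and", "to", "in", "is", "was"}

def preprocess(text):
    """Tokenise, normalise, drop stop words, apply a naive suffix stemmer."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # keep alphanumerics only
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stemmed = []
    for t in tokens:
        # Crude stemming: strip a few common suffixes.
        for suffix in ("ies", "ing", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

def tf_idf(docs):
    """Score each term per document: term frequency x inverse document frequency."""
    tokenised = [preprocess(d) for d in docs]
    n = len(tokenised)
    df = Counter(term for doc in tokenised for term in set(doc))
    scores = []
    for doc in tokenised:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return scores

docs = ["The game of games", "Gardening for the win", "The game was a win"]
scores = tf_idf(docs)
```

Terms appearing in every document score near zero, while rarer domain terms score highly, which is exactly the dimensionality-reduction lever mentioned above.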
********
Feature Engineering methods use Statistical Modelling to create a high dimensional model of the document
Bag of Words – Most ML applications work with the bag-of-words representation in which words are treated as dimensions or features with values corresponding to word frequencies.
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). Also known as the vector space model. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
Usually stored in a sparse storage model
Set of Sequences – Natural Language Processing for context driven processing. Data driven approach to representing text capturing the sequential properties of text. The bag-of-words representation will not reveal the fact that a person's name is always followed by the verb "likes" in this text. As an alternative, the n-gram model can be used to store this spatial information within the text
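A bag of words is trivially built with a `Counter`; this small sketch also demonstrates the key limitation noted above: word order is discarded, so two reversed sentences produce identical bags.

```python
from collections import Counter

def bag_of_words(doc):
    """Multiset of tokens: word order is discarded, counts are kept."""
    return Counter(doc.lower().split())

bow = bag_of_words("the cat chased the mouse")
# Word order is lost: the reversed sentence yields the same bag.
same = bag_of_words("the mouse chased the cat") == bow
```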
********
Techniques – Topic Modelling (PLSA, LDA), Named Entity Recognition, Pattern-Based Entity Identification, Quantitative Text Analysis, NLP/NLU/NLG, Text Summarisation, Chatbots, Speech Recognition, RegEx Processing
Text Classification/Regression – Based on supervised learning. We have a labelled dataset for training, which gives us known priors that may be used for visualisation, searching or input into another model. Logistic regression allows us to classify datasets linearly once the original documents have been transformed into a vector space. E.g. looking for sports-related articles.
Topic Modelling: Discover topics in the data, i.e. find related terms and correlate them based on the proximity of words. E.g. Sports, Gardening, Entertainment, Football, etc., with overlapping keywords, i.e. sports and football.
Text Clustering: Using unsupervised learning to find groupings in the data. E.g. find explicit groups, i.e. sports, gardening, entertainment, etc.
Semantic Analysis: What do the words actually mean? You need to understand context. Jeopardy with IBM Watson.
Sentiment Analysis: Positive/Negative or Neutral E.g. product reviews; were people happy or disappointed by products they purchased.
Text Summarisation: Too Long; Didn’t Read
Text Correlation – Graph modelling of text data to infer relationships. Identify and Create relations between textual entities i.e. people and organisations.
And there are more:
Correlation using Graph Analysis
Quantitative Text Analysis using Quantitative Approaches
And various visualisation approaches such as word clouds, etc.
I’m only going to talk about some today.
********
Simplest Text Analytics capability where we know what we are looking for in order to carry out a prediction.
Two rules to implement: firstly, what do we want to measure, i.e. spam, political affiliation, a specific topic, etc.; secondly, observation by analysis of the text, i.e. the classification. This results in automatic classification.
How do we classify or predict based on known criteria, i.e. a labelled dataset creating a trained model? We wish to extract knowledge about something based on a known prior. Spam filtering is a great example, where we look for known keywords to classify emails as spam, i.e. "PPI" or "Hey!" in the email title, your name in the email title, etc.
This is the bread and butter of supervised learning: given an available training dataset, create a trained model that can then generalise.
Example Spam or Not Spam email.
Recommender Engines - Rather than manually creating recommendations we can analyse product descriptions and reviews to find recommendations.
********
Technique: Naïve Bayes classifier models are simple linear probabilistic mathematical models for classification. Using word frequency approaches allows for fast classification of documents. Naïve Bayes is an online model that can be updated in real-time.
Recurrent Neural Networks for time-series datasets are also available for non-linear modelling approaches to classification.
More recent approaches such as Support Vector Machines have become more fashionable and provide a more accurate classifier than standard probabilistic approaches.
For example, a recommendation system may have classifiers that identify a product’s target age (e.g., a youth versus an adult bicycle), gender (women’s versus men’s clothing), or category (e.g., electronics versus movies) by classifying the product’s description or other attributes. Product reviews may then be classified to detect quality or to determine similar products.
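As an illustration of the Naïve Bayes technique above, here is a minimal multinomial classifier with Laplace smoothing in pure Python, trained on a tiny invented spam/ham dataset. A real system would use a library such as scikit-learn and far more data; this sketch only shows the mechanics.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (+1) smoothing over word counts."""
    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        best, best_score = None, float("-inf")
        n = sum(self.class_counts.values())
        for label in self.class_counts:
            # log P(label) + sum of log P(word | label), Laplace-smoothed
            score = math.log(self.class_counts[label] / n)
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in doc.lower().split():
                score += math.log((self.word_counts[label][w] + 1) / total)
            if score > best_score:
                best, best_score = label, score
        return best

train = ["win free money now", "claim your free prize",
         "meeting agenda attached", "lunch tomorrow with the team"]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(train, labels)
```

Because the model is just word counts, it is also naturally an online model: updating `word_counts` and `class_counts` in real time, as mentioned above, is trivial.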
This technique can be used to support segmentation, categorisation, etc. It is an unsupervised learning approach to grouping data, where you don't know what you are looking for but use a mechanism to group. It can also result in a dimensionality reduction, allowing us to work on a smaller dataset.
There are a number of different measures that can be used to determine document similarity. Fundamentally, each relies on our ability to imagine documents as points in space, where the relative closeness of any two documents is a measure of their similarity. E.g. Sports terms like football, cricket, motor racing, etc.
You can use String Matching, Distance Measures, Relational Matching, and others like fuzzy matching, Boolean equality, domain specificity, etc
We are going to talk a little about distance measures based on text vectorisation.
Clustering Techniques:
Partitive Clustering – Grouping based on distance measurements. K-means is an example. A popular method for unsupervised learning tasks, the k-means clustering algorithm starts with an arbitrarily chosen number of clusters, k, and partitions the vectorised instances into clusters according to their proximity to the centroids, which are computed to minimise the within-cluster sum of squares. SVD (Singular Value Decomposition) and PCA (Principal Component Analysis) can be applied first to reduce dimensionality.
Hierarchical Clustering – Involves creating clusters that have a predetermined ordering from top to bottom. Either start with single instances and iteratively aggregate by similarity (agglomerative), or start with all instances in one group and divide until each stands alone (divisive). The result is a tree (dendrogram) that maps and groups words or documents.
********
Technique: Similarity-Based Algorithms:
Distance Measure – Using a distance metric on feature vectors (i.e. documents): vectors closer together are more similar. These distances can be computed using many mathematical models, e.g. Jaccard similarity, TF-IDF weighting, Cosine Similarity.
Once you have a distance measure, a clustering mechanism can be implemented, i.e. partitive clustering or hierarchical clustering.
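Two of the similarity measures mentioned, cosine and Jaccard, sketched over simple term-frequency vectors built straight from the text:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (1.0 = same direction)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def jaccard_similarity(a, b):
    """Vocabulary overlap: |intersection| / |union|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

s = cosine_similarity("football cricket racing", "football cricket rugby")
```

Either function can be plugged into a clustering loop as the proximity measure.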
Techniques for clustering include Deterministic and Probabilistic Matrix Factorization Methods, Probabilistic Mixture Models of Documents, Similarity-Based Algorithms, Graph Partitioning and Ensemble Methods. We will touch on one: Similarity-Based Algorithms.
Use Case: Based on unsupervised learning, it has no prior knowledge of topics and can be used as an exploratory approach to analysis. It could operate at document level or term level, and can be used as a mechanism for dimensionality reduction.
Topic modelling is for when you have lots of documents and want to group them together by potential subject, not just word frequencies.
Discover topics or categories of interest. i.e. I have News stories find topics such as Human Interest, Business News, Gardening and Sports. It doesn’t know those topics but can cluster to create those topics.
Again based on Unsupervised Learning to automatically cluster and categorise the words into topics.
Topics are clusters of similar words, but with words allowed to fall into more than one topic, whereas hard clustering puts a given word in only one cluster.
Picture shows document vs words and frequency is mapped by intensity of square.
We could just search the documents for word frequencies and manually record the words and documents based on the discoveries, but this is hard work and time consuming.
Topic Modelling allows us to create this grouping without having to specify the topics, using unsupervised learning approaches.
Move documents/words together based on their semantic similarity, i.e. if words appear close together in lots of documents then they may be related.
<Explain cluster and unsupervised learning vs supervised learning?>
********
Peter Gustav Lejeune Dirichlet (pronounced "dee-ree-clay"), 1805–59, German mathematician, noted for his work on number theory and calculus.
By using LDA/LSA we can use statistical modelling to start to group the documents and words into meaningful topics, which we can subsequently use for clustering documents based on the topics discovered by this unsupervised approach. Also used for word-similarity mapping.
The colour gradient indicates the frequency of words in the document: lighter = less frequent, darker = more frequent.
Using Euclidean Distance (Pythagoras Theorem) or Cosine Distances to map the distance and correlation between words.
Or K-means clustering…
Techniques include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Non-Negative Matrix Factorization (NNMF).
Every column corresponds to a document, every row to a word. A cell stores the frequency of a word in a document; dark cells indicate high word frequencies. Topic models group both documents which use similar words, and words which occur in a similar set of documents. The resulting patterns are called "Topics".
You'll notice that we have multiple documents containing words, and we want to cluster documents with related words.
So the animation starts to move together documents that contain similar words and words that appear in similar documents.
This results in a uniform diagonal cluster showing groups of documents related by topic: environment, immigration, space, and so on.
We could now use that grouping of words to inform a supervised model of the topics we are looking for in a wider dataset, and begin to identify groups of documents belonging to those topics.
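LDA itself needs a proper library (e.g. gensim or scikit-learn), but the underlying intuition, that words co-occurring across many documents probably belong to the same topic, can be sketched in pure Python. The threshold-based grouping below is a crude illustrative stand-in for a topic model, not LDA:

```python
from collections import defaultdict

def cooccurrence_groups(docs, threshold=2):
    """Group words that co-occur in at least `threshold` documents:
    a crude stand-in for the intuition behind topic models like LDA."""
    pair_counts = defaultdict(int)
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        for i, a in enumerate(words):
            for b in words[i + 1:]:
                pair_counts[(a, b)] += 1
    # Union related words into groups (connected components over frequent pairs).
    parent = {}
    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            w = parent[w]
        return w
    for (a, b), count in pair_counts.items():
        if count >= threshold:
            parent[find(a)] = find(b)
    groups = defaultdict(set)
    for w in parent:
        groups[find(w)].add(w)
    return [g for g in groups.values() if len(g) > 1]

docs = [
    "football cricket score",
    "football cricket match",
    "roses garden soil",
    "garden soil compost",
]
topics = cooccurrence_groups(docs)
```

With these toy documents the sketch discovers a "sport" group and a "gardening" group without being told either topic exists.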
Animation attribution:
https://en.wikipedia.org/wiki/File:Topic_model_scheme.webm#filelinks
Author: Christoph Carl Kling
Aka Sequential Language Modelling
How do we consider context in a corpus so that we can understand, for example, the meaning of a sentence?
There is lots of text out there where the context of the words defines the meaning of the data.
This is usually used to find relationships between documents, or to infer or predict based on sequences. E.g. predict the next word, infer the emotion in a sentence, or generate responses based on the data (NLG).
An example where semantics add value (both sentences would be identical in a bag-of-words / word-frequency solution, yet they are clearly different and opposing):
The cat chased the mouse
The mouse chased the cat
Bag of Words would imply that both these sentences are the same. So we need to introduce grammatical context to better interpret the sentences.
Feature Engineering methods use Statistical Modelling to create a high dimensional model of the document.
Grammar Based Approach - In linguistics, grammar is the set of structural rules governing the composition of clauses, phrases, and words in any given natural language.
Grammar-Based Feature Extraction – Allows us to extract grammatical features from the sentence i.e. Noun, Verb, Preposition, Adjective, etc
Syntax Parsing – Deconstruct the sentences into a parse tree so that we can better check the grammatical correctness of the sentence
Extract Key Phrases – The key terms or phrases provide insights into topics of potential interest
Extract Entities – Create a bag of entities i.e. person, organisation, address, etc
Works well if the sentences are grammatically correct to begin with, but fails if we cannot recognise the grammar of the sentence, i.e. its nouns, verbs, prepositions, etc. By recognising the verbs, nouns, etc. we may be able to infer the contextual meaning of a sentence.
n-gram Feature Extraction – A more generalised way of identifying sequences of tokens and language independent. The bag-of-words representation will not reveal the fact that a person's name is always followed by the verb "likes" in this text. As an alternative, the n-gram model can be used to store this spatial information within the text.
Word Embeddings example, to allow for inferences. Embeds words into a vector space, then measures the closeness of words to convey and predict sequence, i.e. one word follows another. Cosine distance is a technique that could be used to measure that closeness. Word2Vec offers pre-trained models for embeddings. Can we use the relationships between words to infer relationships between documents/sentences and hence derive meaning?
********
One very famous example of how word embeddings can represent such relationship is that you can do a vector computation like this:
“king is to queen as man is to woman”
king−man+woman≈queen
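The classic analogy can be demonstrated with hand-made toy vectors. The three dimensions and their values below are invented purely for illustration; real embeddings are learned by word2vec or GloVe and have hundreds of dimensions.

```python
import math

# Toy 3-dimensional "embeddings" (dimensions loosely: royalty, male, female).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# king - man + woman ≈ queen
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]
# Exclude the query word itself, as embedding toolkits conventionally do.
nearest = max((w for w in vectors if w != "king"), key=lambda w: cosine(target, vectors[w]))
```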
n-grams with a window of 4 for "After, there were several follow-up questions. The New York Times asked when the bill would be signed,":
('After', ',', 'there', 'were')
(',', 'there', 'were', 'several')
('there', 'were', 'several', 'follow')
('were', 'several', 'follow', 'up')
('several', 'follow', 'up', 'questions')
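The n-gram window above is just a sliding slice over the token sequence. A sketch (the input here is pre-tokenised with a simple whitespace split, matching the tokens shown; a real pipeline would use a proper tokeniser to handle "follow-up" and punctuation):

```python
def ngrams(tokens, n):
    """Slide a window of size n across the token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "After , there were several follow up questions".split()
grams = ngrams(tokens, 4)
```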
A lot of the techniques described in this pack are based on word frequency, i.e. bag of words for multi-document processing, but in this scenario we need to extract key phrases to better understand the sentences contained in the documents.
Here we are trying to derive contextual meaning so that we can respond appropriately.
A Statistical Model assigns a probability to a sequence of words.
Technique: Language Specific Methods use Grammar Rules to define the syntax of a language along with some statistical analysis. This approach can be rigid due to the inexact nature of human language.
Technique: Language Independent Methods use a number of modelling approaches such as unigram, bigram, trigram, n-gram models or neural networks to encode the grammatical structure of a language from examples. Much more accurate and allows the use of Feature Engineering methods.
Approach: Word Embeddings – “Embeds” words into a vector space model based on how often a word appears close to other words. With pre-trained models like word2vec and GloVe you capture the semantics of the words, so that similar words have similar vectors.
The spam classification example has recently been displaced by a new vogue: sentiment analysis. How do we capture the emotion of a corpus or document? Positive, Negative or Neutral.
Social media and feedback systems allow us to express our opinions about a product, movie or service. This provides valuable insight to the vendor/provider. Do people like or dislike my product or service?
“I loved the fact that this product didn’t work properly” – Need to allow for sarcasm.
Achieving 70% accuracy in classifying sentiment is roughly on a par with human agreement.
Two Approaches:
Knowledge Based – Classify text by the affect properties of the sentences, i.e. good, bad, like, hate, etc. This has limited uses but can yield quick results.
Statistical Based – Uses semantic-analysis-based approaches that analyse the grammatical structure of the sentences to yield more accurate results. LSA, LDA, etc. can consider the semantics of the sentences and paragraphs, allowing a better understanding of the emotion and of the entity that is the target of that emotion.
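A minimal sketch of the knowledge-based approach: a tiny hand-made affect lexicon with naive negation handling. The lexicon and its scores are invented for the demo; a real system would use a curated resource such as SentiWordNet, and sarcasm like the example above will still defeat it.

```python
# Tiny invented affect lexicon: positive scores for positive words, negative for negative.
LEXICON = {"love": 2, "good": 1, "like": 1, "bad": -1, "hate": -2, "broken": -2}
NEGATORS = {"not", "never", "don't", "didn't"}

def sentiment(text):
    """Sum lexicon scores, flipping the sign of the word after a negator."""
    score, flip = 0, 1
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word in NEGATORS:
            flip = -1
            continue
        score += flip * LEXICON.get(word, 0)
        flip = 1  # negation only affects the next word in this crude model
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```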
********
Sentiment analysis models attempt to predict positive ("I love writing Python code") or negative ("I hate it when people repeat themselves") sentiment based on content, and the field has gained significant popularity thanks to the expressiveness of social media. Because companies are involved in a more general dialogue where they do not control the information channel (such as reviews of their products and services), there is a belief that sentiment analysis can assist with targeted customer support or even model corporate performance. The complexities and nuances inherent in language make sentiment analysis less straightforward than spam detection.
The Web provides a forum for individuals to express their opinions and sentiments. For example, the product reviews on a website might contain text beyond the numerical ratings provided by the user. The textual content of these reviews provides useful information that is not available in the numerical ratings. From this point of view, opinion mining can be viewed as the text-centric analogue of the rating-centric techniques used in recommender systems; product reviews are often used by both types of methods. Whereas recommender systems analyse numerical ratings for prediction, opinion mining methods analyse the text of the opinions. It is noteworthy that opinions are often mined from settings like social media and blogs where ratings are not available.
“The movie is surprising with plenty of unsettling plot twists.” (Negative term used in a positive sense in certain domains).
Both Supervised and Unsupervised Learning approaches can be used. Unsupervised where labelled data is not available i.e. social media posts about new topics of interest.
Technique: Using Recursive Neural Networks or Recurrent Neural Networks with LSTMs allows us to utilise a 'bag-of-keyphrases' approach, ensuring we retain the nuances and the positivity or negativity associated with a word, i.e. "terribly helpful".
How do we take a corpus of text and create a summarised version of that text? There are two techniques available: Extractive and Abstractive.
Technique: Extractive. Uses existing sentences to create a summary.
Scoring methods are based on topic-word frequencies, Latent Semantic Analysis, or supervised machine learning. By matching high-frequency (or even low-frequency) words and similarity mapping, they find high-scoring sentences in the document and use those to create the summary.
Topic-word approaches work by removing low-frequency words and high-frequency stop words; the topic words left can be used to score the sentences that contain them.
Machine Learning uses trained models to select appropriate features of a document, i.e. frequency of topic words, presence of title words, location features (beginning or end of a paragraph).
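A minimal sketch of extractive summarisation by topic-word frequency scoring, along the lines described above. The stop-word list and regex sentence splitter are simplistic placeholders for real NLP tooling.

```python
import re
from collections import Counter

STOP = {"the", "a", "of", "and", "to", "in", "is", "it", "was"}

def summarise(text, n_sentences=1):
    """Extractive summary: score each sentence by the frequency of its topic words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]
    freq = Counter(words)
    def score(s):
        return sum(freq[w] for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP)
    ranked = sorted(sentences, key=score, reverse=True)
    return " ".join(ranked[:n_sentences])

text = ("The budget was approved. The budget debate lasted hours. "
        "Lunch was served at noon.")
summary = summarise(text)
```

The sentence richest in frequent topic words ("budget") wins; everything in the output is an existing sentence, which is exactly what distinguishes extractive from abstractive summarisation.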
Technique: Abstractive. Re-write the sentences from the document. It uses phrases and clauses from the document, but new text is generated. This is an area of AI research that requires coherence and fluency with semantic understanding to support summarisation. Sentence compression, information fusion and information ordering are problems that need to be solved. It is a largely unsolved problem, but an area of great interest given its potential applications in AI.
********
I call it art because it is still an intelligence-driven approach. You need to think about your use cases to ensure you get the right solution to support your business.
No Wrong or Right Answer / Many Approaches to Text Analytics
As you have seen there are many different techniques to analysing text and today you’ve seen a few.
They yield different insights and benefits and should be aligned to the problem you are trying to solve
They are also inter-related so you can use one to carry out many different activities i.e. Clustering or Semantic Analysis
Also remember you can build competing models to see which ones yield the best solution to your problem.
There's no wrong or right answer: see what works and measure the accuracy of the insights against prior understanding, if it's available.
Iterative Approach to See What Works
There isn’t a one size fits all so you need to tune and tweak the models
Look at your problem space.
Understand your data if you can; otherwise look at mechanisms for dimensionality reduction to narrow down your insights, if that's appropriate.
Mathematical Driven Statistical Modelling
It’s all still maths! Based on Probabilities. Consider those probabilities when contemplating correctness.
Statistical problem solver at heart. Easy to Implement Some Approaches But Need to be Validated By Your Experts
Enterprise AI Solutions. Lots of Tooling Available COTS and Open Source. SAS, IBM, Microsoft, Google, etc
There are many products out there.
Commoditise and Democratise AI
Accelerate the production of your solutions by using these tools. We have looked at IBM Watson, SAS, RapidMiner, etc.
They have allowed us to create initial models in hours and days because of the pre-defined nature of the modelling capability available.
Hopefully you can see that Text Analytics is an incredibly useful approach to supporting your business needs!
Thank you!