Shailesh Patel, Head of AI for the HMRC Account. I lead the AI Centre of Excellence, and my team provides Enterprise AI solutions and support to our internal delivery groups as well as HMRC.
I’ve been working with Capgemini for 15 years and on the HMRC Account for 11 of those years.
Why Text Analytics? The processing of unstructured data is key for most organisations.
My team creates PoCs for Text Analytics-based solutions, e.g. email caches, free-format text, etc.
This text data holds a lot of valuable information, and there are techniques available that allow us to mine that insight to better understand the value.
This presentation offers an appreciation of text analytics approaches.
Do a poll and get a view of how many know what AI, Machine Learning, Deep Learning, Reinforcement Learning and Transfer Learning are.
Do people know the difference between linear regression and logistic regression?
Do people know what a neural network is and the concept of neurons and activation functions?
Do people understand Bayesian statistical modelling for classification approaches?
The term AI was coined by John McCarthy in 1955, when he was a mathematics professor at Dartmouth; he later founded the AI lab at Stanford University.
Umbrella term for learning-based algorithms that can be made to improve with better data. Change the data, not the AI code: in the old days we changed code, now we change data.
Data production will be 44 times greater in 2020 than it was in 2009
We generate more data in 2 days than was generated up to 2002
Big Data is the driver behind this revolution. And now that we have a confluence of Big Data, Compute and AI Algorithms we can actually gain insights from the data
Unstructured Data Represents 70% to 90% of the Data Captured
Most of the data we capture is unstructured: image, video, audio and free-format text. And most of those other formats result in textual representations, like captions, speech-to-text transcripts, or just descriptive text.
Pattern Based Reusable Approaches Now Available
We have lots of text analytics techniques we can use and reuse to analyse that text data and we use it to classify, cluster and network that information.
The approaches and techniques I describe today are reusable. And they all yield some value in terms of analytics or processing of text data. Based on statistical modelling we can now use techniques to classify, cluster and network data.
Which leads to…
Accelerated Approaches for Actionable Insights
This type of data can be difficult and time consuming to process manually but now with these approaches we can accelerate the activity.
Large volumes, unstructured: it's like looking for a needle in a haystack.
Information can be hidden in plain sight in forms that make it difficult to process, e.g. scanned PDF files, handwritten documents, or just prose describing something.
We can use these techniques with automation to help accelerate our path to insight.
Resulting in improved customer experience, increased sales, detection of fraud or non-compliance, and so on.
All these techniques are based on mathematical modelling i.e. statistics, linear algebra, calculus, and so on.
So today…
********
Corpus and Corpora.
This includes text, audio, image and video data. This session will concentrate on the unique insights gained from text data, including the pre-processing of the other formats which results in text.
Utilising techniques around Natural Language Processing, Semantic Analysis, Sentiment Analytics and other advanced modelling approaches can result in valuable competitive intelligence and insight.
A bit of fun!
Statistics, Calculus, Linear Algebra, etc
What we are going to do today is go through the derivation of some of these equations from first principles. Only joking!
What this shows is that all the Text Analytics processes are based on statistical modelling! They are statistical problem solvers. Also remember that you can prove or disprove anything with statistics, so you need some experts to validate your models.
********
Going clockwise
LDA (Latent Dirichlet Allocation, pronounced roughly "Diri-clay")
LSA (Latent Semantic Analysis)
Linear Regressions
ADAM Optimizer for Neural Networks (instead of Stochastic Gradient Descent)
Principal Component Analysis
Logistic Regression
Stochastic Gradient Descent
SVD (Singular Value Decomposition) NOT on picture. Like PCA, it is used for dimensionality reduction.
Relevance: With the deluge of information, we need automated mechanisms to process it so that we only get information relevant to our needs. This could be search results, recommender systems, etc.
Feedback: As feedback is captured from your customers, how will you analyse it and gain insights from that data? How will you ensure you are supplying the best service to your customers?
AI Assistance: As AI based services become more and more prevalent, we need better systems that communicate with us using natural language interfaces providing NLU and NLG. These need a more sophisticated ability to analyse text and ensure the service consumes the semantic meaning of our requests. Talk about Google Duplex demo Google I/O 2018 based on RNNs. Use of Linguistic Modifiers. One of the key research insights was to constrain Duplex to closed domains, which are narrow enough to explore extensively. Duplex can only carry out natural conversations after being deeply trained in such domains. It cannot carry out general conversations. The system also sounds more natural thanks to the incorporation of speech disfluencies (e.g. “hmm”s and “uh”s).
Insights: Given the volume of data captured, there are more insights to be gained from analysing the data. Most forms of data capture will result in some form of text, i.e. pictures with caption generation, speech to text, optical character recognition and literal text data. This can be consumed and analysed to provide predictions, analysis and more.
Experience: With an understanding of this text data we can now improve the customer's experience of our services. The use of chatbots allows us to provide an expedient, efficient, no-wait service. The analysis of buying habits means we can better predict what customers want and provide them with recommendations on what else they may like. The list is endless.
These are a very small number of the areas that benefit from text analysis.
********
Feedback Processing, Email Processing,
AI Assistants like Alexa, Google Assistant, Cortana and Siri, to convert speech to text
Search Engines to Make Search Results More Relevant
Filtering for reduction in Spam or increase topics of interest
Organise the Data into Topics of Interest i.e. News, Sport, etc
Text Summaries
ChatBot Processing
A corpus is a body of text for analysis. It can consist of many documents, each made up of many sentences.
Text Wrangling / Text Pre-processing / Feature Transformation – Convert Unstructured Text into a Multi-Dimensional Structured Representation
Text Extraction – Get the body text out of HTML, binary PDF/Word, XML, etc.
Text Normalization – Store the text in a consistent form. As simple as removing syntax modifiers i.e. removing non-alphanumeric characters and more complex domain specific normalisation using context.
TF-IDF – Term Frequency / Inverse Document Frequency – logarithm-based frequency weighting
Text Vectorization – Frequency Vectors (The simplest vector encoding model is to simply fill in the vector with the frequency of each word as it appears in the document), One Hot Encoding, Bag of Words, Bag of Sequences
Tokenisation – Requires Substantial Domain Knowledge. Identify words, places, names, etc i.e. turn characters into something meaningful i.e. identify words
Stop Word Removal – Remove ‘the’, ’a’, ’for’, etc which are noise in the data.
Stemming – Change plural to singular, etc. Games == Game, Frequencies == Frequency; Sinking, Sank, Sink == Sink. Lemmatization is the dictionary-based equivalent.
Dimensionality Reduction – Reduce the features used. Could be driven by frequency, i.e. keep only the most frequent words, or even fewer but more important domain-specific words.
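The pre-processing steps above can be sketched in a few lines of Python. A minimal illustration, not production code: the stop-word list and suffix stemmer here are toy stand-ins for real resources such as NLTK's stop-word lists and the Porter stemmer.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "for", "of", "and", "to", "in", "is", "was"}

def preprocess(text):
    """Tokenise, normalise, drop stop words, apply a naive suffix stemmer."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # keep alphanumerics only
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stemmed = []
    for t in tokens:
        # Crude stemming: strip a few common suffixes.
        for suffix in ("ies", "ing", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

def tf_idf(docs):
    """Score each term per document: term frequency x inverse document frequency."""
    tokenised = [preprocess(d) for d in docs]
    n = len(tokenised)
    df = Counter(term for doc in tokenised for term in set(doc))
    scores = []
    for doc in tokenised:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return scores

docs = ["The game of games", "Gardening for the win", "The game was a win"]
scores = tf_idf(docs)
```

Terms appearing in every document score near zero, while rarer domain terms score highly, which is exactly the dimensionality-reduction lever mentioned above.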
********
Feature Engineering methods use Statistical Modelling to create a high dimensional model of the document
Bag of Words – Most ML applications work with the bag-of-words representation in which words are treated as dimensions or features with values corresponding to word frequencies.
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). Also known as the vector space model. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.
Usually stored in a sparse storage model
Set of Sequences – Natural Language Processing for context driven processing. Data driven approach to representing text capturing the sequential properties of text. The bag-of-words representation will not reveal the fact that a person's name is always followed by the verb "likes" in this text. As an alternative, the n-gram model can be used to store this spatial information within the text
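A bag of words is trivially built with a `Counter`; this small sketch also demonstrates the key limitation noted above: word order is discarded, so two reversed sentences produce identical bags.

```python
from collections import Counter

def bag_of_words(doc):
    """Multiset of tokens: word order is discarded, counts are kept."""
    return Counter(doc.lower().split())

bow = bag_of_words("the cat chased the mouse")
# Word order is lost: the reversed sentence yields the same bag.
same = bag_of_words("the mouse chased the cat") == bow
```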
********
Techniques – Topic Modelling (PLSA, LDA), Named Entity Recognition, Pattern-Based Entity Identification, Quantitative Text Analysis, NLP/NLU/NLG, Text Summarisation, Chatbots, Speech Recognition, RegEx Processing
Text Classification/Regression – Based on supervised learning. We have a labelled dataset for training, which gives us known priors that may be used for visualisation, searching or input into another model. Logistic regression allows us to classify datasets linearly once the original documents have been transformed into a vector space. E.g. looking for sports-related articles.
Topic Modelling: Discover topics in the data, i.e. find related terms and correlate them based on the proximity of words. E.g. Sports, Gardening, Entertainment, Football, etc., with overlapping keywords, i.e. sports and football.
Text Clustering: Using unsupervised learning to find groupings in the data. E.g. find explicit groups, i.e. sports, gardening, entertainment, etc.
Semantic Analysis: What do the words actually mean? You need to understand context. Jeopardy with IBM Watson.
Sentiment Analysis: Positive/Negative or Neutral E.g. product reviews; were people happy or disappointed by products they purchased.
Text Summarisation: Too Long; Didn’t Read
Text Correlation – Graph modelling of text data to infer relationships. Identify and Create relations between textual entities i.e. people and organisations.
And there are more:
Correlation using Graph Analysis
Quantitative Text Analysis using Quantitative Approaches
And various visualisation approaches such as word clouds, etc.
I’m only going to talk about some today.
********
Simplest Text Analytics capability where we know what we are looking for in order to carry out a prediction.
Two rules to implement: firstly, what do we want to measure, i.e. spam, political affiliation, a specific topic, etc.; secondly, observation by analysis of the text, i.e. the classification. This results in automatic classification.
How do we classify or predict based on known criteria, i.e. a labelled dataset creating a trained model? We wish to extract knowledge about something based on a known prior. Spam filtering is a great example, where we look for known keywords to classify emails as spam, i.e. "PPI" or "Hey!" in the email title, your name in the email title, etc.
This is the bread and butter of supervised learning: given an available training dataset, create a trained model that can then generalise.
Example Spam or Not Spam email.
Recommender Engines - Rather than manually creating recommendations we can analyse product descriptions and reviews to find recommendations.
********
Technique: Naïve Bayes classifier models are simple linear probabilistic mathematical models for classification. Using word frequency approaches allows for fast classification of documents. Naïve Bayes is an online model that can be updated in real-time.
Recurrent Neural Networks for time-series datasets are also available for non-linear modelling approaches to classification.
More recent approaches such as Support Vector Machines have become more fashionable and provide a more accurate classifier than standard probabilistic approaches.
For example, a recommendation system may have classifiers that identify a product’s target age (e.g., a youth versus an adult bicycle), gender (women’s versus men’s clothing), or category (e.g., electronics versus movies) by classifying the product’s description or other attributes. Product reviews may then be classified to detect quality or to determine similar products.
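As an illustration of the Naïve Bayes technique above, here is a minimal multinomial classifier with Laplace smoothing in pure Python, trained on a tiny invented spam/ham dataset. A real system would use a library such as scikit-learn and far more data; this sketch only shows the mechanics.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (+1) smoothing over word counts."""
    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        best, best_score = None, float("-inf")
        n = sum(self.class_counts.values())
        for label in self.class_counts:
            # log P(label) + sum of log P(word | label), Laplace-smoothed
            score = math.log(self.class_counts[label] / n)
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in doc.lower().split():
                score += math.log((self.word_counts[label][w] + 1) / total)
            if score > best_score:
                best, best_score = label, score
        return best

train = ["win free money now", "claim your free prize",
         "meeting agenda attached", "lunch tomorrow with the team"]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(train, labels)
```

Because the model is just word counts, it is also naturally an online model: updating `word_counts` and `class_counts` in real time, as mentioned above, is trivial.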
This technique can be used to support segmentation, categorisation, etc. It is an unsupervised learning approach to grouping data, where you don't know what you are looking for but use a mechanism to group. It can also result in a dimensionality reduction, allowing us to work on a smaller dataset.
There are a number of different measures that can be used to determine document similarity. Fundamentally, each relies on our ability to imagine documents as points in space, where the relative closeness of any two documents is a measure of their similarity. E.g. Sports terms like football, cricket, motor racing, etc.
You can use String Matching, Distance Measures, Relational Matching, and others like fuzzy matching, Boolean equality, domain specificity, etc
We are going to talk a little about distance measures based on text vectorisation.
Clustering Techniques:
Partitive Clustering – Grouping based on distance measurements. K-means is an example. A popular method for unsupervised learning tasks, the k-means clustering algorithm starts with an arbitrarily chosen number of clusters, k, and partitions the vectorised instances into clusters according to their proximity to the centroids, which are computed to minimise the within-cluster sum of squares. SVD (Singular Value Decomposition) and PCA (Principal Component Analysis) can be applied first to reduce dimensionality.
Hierarchical Clustering – Involves creating clusters that have a predetermined ordering from top to bottom. Either start with single instances and iteratively aggregate by similarity (agglomerative), or start with all instances in one group and divide until each stands alone (divisive). The result is a tree (dendrogram) that maps and groups words or documents.
********
Technique: Similarity-Based Algorithms:
Distance Measure – Using a distance metric on feature vectors (i.e. documents): vectors closer together are more similar. These distances can be computed using many mathematical models, e.g. Jaccard similarity, TF-IDF weighting, Cosine Similarity.
Once you have a distance measure, a clustering mechanism can be implemented, i.e. partitive clustering or hierarchical clustering.
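Two of the similarity measures mentioned, cosine and Jaccard, sketched over simple term-frequency vectors built straight from the text:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (1.0 = same direction)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def jaccard_similarity(a, b):
    """Vocabulary overlap: |intersection| / |union|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

s = cosine_similarity("football cricket racing", "football cricket rugby")
```

Either function can be plugged into a clustering loop as the proximity measure.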
Techniques for clustering include Deterministic and Probabilistic Matrix Factorization Methods, Probabilistic Mixture Models of Documents, Similarity-Based Algorithms, Graph Partitioning and Ensemble Methods. We will touch on one: Similarity-Based Algorithms.
Use Case: Based on unsupervised learning, it has no prior knowledge of topics and can be used as an exploratory approach to analysis. It could operate at document level or term level, and can be used as a mechanism for dimensionality reduction.
Topic modelling is for when you have lots of documents and want to group them together by potential subject, not just word frequencies.
Discover topics or categories of interest. i.e. I have News stories find topics such as Human Interest, Business News, Gardening and Sports. It doesn’t know those topics but can cluster to create those topics.
Again based on Unsupervised Learning to automatically cluster and categorise the words into topics.
Topics are clusters of similar words, but with words allowed to fall into more than one topic, whereas hard clustering puts a given word in only one cluster.
Picture shows document vs words and frequency is mapped by intensity of square.
We could just search the documents for word frequencies and manually record the words and documents based on the discoveries, but this is hard work and time consuming.
Topic Modelling allows us to create this grouping without having to specify the topics, using unsupervised learning approaches.
Move documents/words together based on their semantic similarity, i.e. if words appear close together in lots of documents then they may be related.
<Explain cluster and unsupervised learning vs supervised learning?>
********
Peter Gustav Lejeune Dirichlet (pronounced "dee-ree-clay"), 1805–59, German mathematician, noted for his work on number theory and calculus.
By using LDA/LSA we can use statistical modelling to start to group the documents and words into meaningful topics, which we can subsequently use for clustering documents based on the topics discovered by this unsupervised approach. Also used for word-similarity mapping.
The colour gradient indicates the frequency of words in the document: lighter = less frequent, darker = more frequent.
Using Euclidean Distance (Pythagoras Theorem) or Cosine Distances to map the distance and correlation between words.
Or K-means clustering…
Techniques include Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Non-Negative Matrix Factorization (NNMF).
Every column corresponds to a document, every row to a word. A cell stores the frequency of a word in a document; dark cells indicate high word frequencies. Topic models group both documents which use similar words, and words which occur in a similar set of documents. The resulting patterns are called "Topics".
You'll notice that we have multiple documents containing words, and we want to cluster documents with related words.
So the animation starts to move together documents that contain similar words and words that appear in similar documents.
This results in a uniform diagonal cluster showing groups of documents related by topic: environment, immigration, space, and so on.
We could now use that grouping of words to inform a supervised model of the topics we are looking for in a wider dataset, and begin to identify groups of documents belonging to those topics.
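LDA itself needs a proper library (e.g. gensim or scikit-learn), but the underlying intuition, that words co-occurring across many documents probably belong to the same topic, can be sketched in pure Python. The threshold-based grouping below is a crude illustrative stand-in for a topic model, not LDA:

```python
from collections import defaultdict

def cooccurrence_groups(docs, threshold=2):
    """Group words that co-occur in at least `threshold` documents:
    a crude stand-in for the intuition behind topic models like LDA."""
    pair_counts = defaultdict(int)
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        for i, a in enumerate(words):
            for b in words[i + 1:]:
                pair_counts[(a, b)] += 1
    # Union related words into groups (connected components over frequent pairs).
    parent = {}
    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            w = parent[w]
        return w
    for (a, b), count in pair_counts.items():
        if count >= threshold:
            parent[find(a)] = find(b)
    groups = defaultdict(set)
    for w in parent:
        groups[find(w)].add(w)
    return [g for g in groups.values() if len(g) > 1]

docs = [
    "football cricket score",
    "football cricket match",
    "roses garden soil",
    "garden soil compost",
]
topics = cooccurrence_groups(docs)
```

With these toy documents the sketch discovers a "sport" group and a "gardening" group without being told either topic exists.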
Animation attribution:
https://en.wikipedia.org/wiki/File:Topic_model_scheme.webm#filelinks
Author: Christoph Carl Kling
Aka Sequential Language Modelling
How do we consider context in a corpus so that we can understand, for example, the meaning of a sentence?
There is lots of text out there where the context of the words defines the meaning of the data.
This is usually used to find relationships between documents, or to infer or predict based on sequences. E.g. predict the next word, infer the emotion in a sentence, or generate responses based on the data (NLG).
An example where semantics add value (both sentences would be identical in a bag-of-words / word-frequency solution, yet they are clearly different and opposing):
The cat chased the mouse
The mouse chased the cat
Bag of Words would imply that both these sentences are the same. So we need to introduce grammatical context to better interpret the sentences.
Feature Engineering methods use Statistical Modelling to create a high dimensional model of the document.
Grammar Based Approach - In linguistics, grammar is the set of structural rules governing the composition of clauses, phrases, and words in any given natural language.
Grammar-Based Feature Extraction – Allows us to extract grammatical features from the sentence i.e. Noun, Verb, Preposition, Adjective, etc
Syntax Parsing – Deconstruct the sentences into a parse tree so that we can better check the grammatical correctness of the sentence
Extract Key Phrases – The key terms or phrases provide insights into topics of potential interest
Extract Entities – Create a bag of entities i.e. person, organisation, address, etc
Works well if the sentences are grammatically correct to begin with, but fails if we cannot recognise the grammar of the sentence, i.e. its nouns, verbs, prepositions, etc. By recognising the verbs, nouns, etc. we may be able to infer the contextual meaning of a sentence.
n-gram Feature Extraction – A more generalised way of identifying sequences of tokens and language independent. The bag-of-words representation will not reveal the fact that a person's name is always followed by the verb "likes" in this text. As an alternative, the n-gram model can be used to store this spatial information within the text.
Word Embeddings example, to allow for inferences. Embeds words into a vector space, then measures the closeness of words to convey and predict sequence, i.e. one word follows another. Cosine distance is a technique that could be used to measure that closeness. Word2Vec offers pre-trained models for embeddings. Can we use the relationships between words to infer relationships between documents/sentences and hence derive meaning?
********
One very famous example of how word embeddings can represent such relationship is that you can do a vector computation like this:
“king is to queen as man is to woman”
king−man+woman≈queen
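The classic analogy can be demonstrated with hand-made toy vectors. The three dimensions and their values below are invented purely for illustration; real embeddings are learned by word2vec or GloVe and have hundreds of dimensions.

```python
import math

# Toy 3-dimensional "embeddings" (dimensions loosely: royalty, male, female).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# king - man + woman ≈ queen
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]
# Exclude the query word itself, as embedding toolkits conventionally do.
nearest = max((w for w in vectors if w != "king"), key=lambda w: cosine(target, vectors[w]))
```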
n-grams with a window of 4 for "After, there were several follow-up questions. The New York Times asked when the bill would be signed,":
('After', ',', 'there', 'were')
(',', 'there', 'were', 'several')
('there', 'were', 'several', 'follow')
('were', 'several', 'follow', 'up')
('several', 'follow', 'up', 'questions')
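The n-gram window above is just a sliding slice over the token sequence. A sketch (the input here is pre-tokenised with a simple whitespace split, matching the tokens shown; a real pipeline would use a proper tokeniser to handle "follow-up" and punctuation):

```python
def ngrams(tokens, n):
    """Slide a window of size n across the token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "After , there were several follow up questions".split()
grams = ngrams(tokens, 4)
```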
A lot of the techniques described in this pack are based on word frequency, i.e. bag of words for multi-document processing, but in this scenario we need to extract key phrases to better understand the sentences contained in the documents.
Here we are trying to derive contextual meaning so that we can respond appropriately.
A Statistical Model assigns a probability to a sequence of words.
Technique: Language Specific Methods use Grammar Rules to define the syntax of a language along with some statistical analysis. This approach can be rigid due to the inexact nature of human language.
Technique: Language Independent Methods use a number of modelling approaches such as unigram, bigram, trigram, n-gram models or neural networks to encode the grammatical structure of a language from examples. Much more accurate and allows the use of Feature Engineering methods.
Approach: Word Embeddings – “Embeds” words into a vector space model based on how often a word appears close to other words. With pre-trained models like word2vec and GloVe you capture the semantics of the words, so that similar words have similar vectors.
The spam classification example has recently been displaced by a new vogue: sentiment analysis. How do we capture the emotion of a corpus or document? Positive, Negative or Neutral.
Social media and feedback systems allow us to express our opinions about a product, movie or service. This provides valuable insight to the vendor/provider. Do people like or dislike my product or service?
“I loved the fact that this product didn’t work properly” – Need to allow for sarcasm.
Achieving 70% accuracy in classifying sentiment is roughly on a par with human agreement.
Two Approaches:
Knowledge Based – Classify text by the affect properties of the sentences, i.e. good, bad, like, hate, etc. This has limited uses but can yield quick results.
Statistical Based – Uses semantic-analysis-based approaches that analyse the grammatical structure of the sentences to yield more accurate results. LSA, LDA, etc. can consider the semantics of the sentences and paragraphs, allowing a better understanding of the emotion and of the entity that is the target of that emotion.
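A minimal sketch of the knowledge-based approach: a tiny hand-made affect lexicon with naive negation handling. The lexicon and its scores are invented for the demo; a real system would use a curated resource such as SentiWordNet, and sarcasm like the example above will still defeat it.

```python
# Tiny invented affect lexicon: positive scores for positive words, negative for negative.
LEXICON = {"love": 2, "good": 1, "like": 1, "bad": -1, "hate": -2, "broken": -2}
NEGATORS = {"not", "never", "don't", "didn't"}

def sentiment(text):
    """Sum lexicon scores, flipping the sign of the word after a negator."""
    score, flip = 0, 1
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word in NEGATORS:
            flip = -1
            continue
        score += flip * LEXICON.get(word, 0)
        flip = 1  # negation only affects the next word in this crude model
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```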
********
Sentiment analysis models attempt to predict positive ("I love writing Python code") or negative ("I hate it when people repeat themselves") sentiment based on content, and the field has gained significant popularity thanks to the expressiveness of social media. Because companies are involved in a more general dialogue where they do not control the information channel (such as reviews of their products and services), there is a belief that sentiment analysis can assist with targeted customer support or even model corporate performance. The complexities and nuances inherent in language make sentiment analysis less straightforward than spam detection.
The Web provides a forum for individuals to express their opinions and sentiments. For example, the product reviews on a website might contain text beyond the numerical ratings provided by the user. The textual content of these reviews provides useful information that is not available in the numerical ratings. From this point of view, opinion mining can be viewed as the text-centric analogue of the rating-centric techniques used in recommender systems; product reviews are often used by both types of methods. Whereas recommender systems analyse numerical ratings for prediction, opinion mining methods analyse the text of the opinions. It is noteworthy that opinions are often mined from settings like social media and blogs where ratings are not available.
“The movie is surprising with plenty of unsettling plot twists.” (Negative term used in a positive sense in certain domains).
Both Supervised and Unsupervised Learning approaches can be used. Unsupervised where labelled data is not available i.e. social media posts about new topics of interest.
Technique: Using Recursive Neural Networks or Recurrent Neural Networks with LSTMs allows us to utilise a 'bag-of-keyphrases' approach, ensuring we retain the nuances and the positivity or negativity associated with a word, i.e. "terribly helpful".
How do we take a corpus of text and create a summarised version of that text? There are two techniques available: Extractive and Abstractive.
Technique: Extractive. Uses existing sentences to create a summary.
Scoring methods are based on topic-word frequencies, Latent Semantic Analysis, or supervised machine learning. By matching high-frequency (or even low-frequency) words and similarity mapping, they find high-scoring sentences in the document and use those to create the summary.
Topic-word approaches work by removing low-frequency words and high-frequency stop words; the topic words left can be used to score the sentences that contain them.
Machine Learning uses trained models to select appropriate features of a document, i.e. frequency of topic words, presence of title words, location features (beginning or end of a paragraph).
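A minimal sketch of extractive summarisation by topic-word frequency scoring, along the lines described above. The stop-word list and regex sentence splitter are simplistic placeholders for real NLP tooling.

```python
import re
from collections import Counter

STOP = {"the", "a", "of", "and", "to", "in", "is", "it", "was"}

def summarise(text, n_sentences=1):
    """Extractive summary: score each sentence by the frequency of its topic words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]
    freq = Counter(words)
    def score(s):
        return sum(freq[w] for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP)
    ranked = sorted(sentences, key=score, reverse=True)
    return " ".join(ranked[:n_sentences])

text = ("The budget was approved. The budget debate lasted hours. "
        "Lunch was served at noon.")
summary = summarise(text)
```

The sentence richest in frequent topic words ("budget") wins; everything in the output is an existing sentence, which is exactly what distinguishes extractive from abstractive summarisation.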
Technique: Abstractive. Re-write the sentences from the document. It uses phrases and clauses from the document, but new text is generated. This is an area of AI research that requires coherence and fluency with semantic understanding to support summarisation. Sentence compression, information fusion and information ordering are problems that need to be solved. It is a largely unsolved problem, but an area of great interest given its potential applications in AI.
********
I call it art because it is still an intelligence-driven approach. You need to think about your use cases to ensure you get the right solution to support your business.
No Wrong or Right Answer / Many Approaches to Text Analytics
As you have seen there are many different techniques to analysing text and today you’ve seen a few.
They yield different insights and benefits and should be aligned to the problem you are trying to solve
They are also inter-related so you can use one to carry out many different activities i.e. Clustering or Semantic Analysis
Also remember you can build competing models to see which ones yield the best solution to your problem.
There's no wrong or right answer: see what works and measure the accuracy of the insights against prior understanding, if it's available.
Iterative Approach to See What Works
There isn’t a one size fits all so you need to tune and tweak the models
Look at your problem space.
Understand your data if you can; otherwise look at mechanisms for dimensionality reduction to narrow down your insights, if that's appropriate.
Mathematical Driven Statistical Modelling
It’s all still maths! Based on Probabilities. Consider those probabilities when contemplating correctness.
Statistical problem solver at heart. Easy to Implement Some Approaches But Need to be Validated By Your Experts
Enterprise AI Solutions. Lots of Tooling Available COTS and Open Source. SAS, IBM, Microsoft, Google, etc
There are many products out there.
Commoditise and Democratise AI
Accelerate the production of your solutions by using these tools. We have looked at IBM Watson, SAS, RapidMiner, etc.
They have allowed us to create initial models in hours and days because of the pre-defined nature of the modelling capability available.
Hopefully you can see that Text Analytics is an incredibly useful approach to supporting your business needs!
Thank you!