Presentation as given to the Haystack Conference, which outlines research and techniques for automatic extraction of keywords, concepts, and vocabularies from text corpora.
2. Agenda/Intro
Agenda
This slide
Why I’m talking
What I’m talking about
How to do what I’m talking about
Overview of tools and techniques
Where new research is headed
Questions
$> whoami
Max Irwin
Working in Search since 2012
Leads Search Center of Excellence
Long time programmer
Recent interests are NLP and
Deep Learning
No need to take photos of slides
Video, deck, code, references, materials will be made available
3. Why I’m talking. (Problem statement)
Information Retrieval Problems:
Suggesting stuff to users
based on what?
Content clustering/relationships/similarities
but how?
Slots and intent for Queries and Bots
with what?
Entities and Named Entity Recognition
sourced from where?
Question Answering
how can it know?
Dimension reduction for unstructured text
down to what?
Product Problems:
Lots of products in different domains
Law, Tax, Health, Marketing, Etc.
Better search with less effort
Shortage of metadata experts
Domains differ, content proprietary
Lots of work, always from scratch
Terms of Art, Concepts, Vocabularies,
take years to curate manually
They are usually subjective
My goal is to introduce you to a suite of techniques to help solve the above problems
4. What I’m Talking About.
A survey of technologies for automatically extracting the following from text:
Keywords:
Terms associated with documents
Classify and associate documents
Techniques: LDA, RAKE, Maui
Concepts:
Associates terms with the same semantic meaning (synonyms)
Building blocks for vocabularies
Techniques: Topia, Skipchunk
Ontologies/Taxonomies:
Represent entire domains (or subsets)
Reduce dimensions for abstracting domain corpora
Techniques: Lexico-syntactic patterns, TAXI
5. How do these tools work?
General workflow:
Get candidates
Preprocess, arrange, and group tokens
Score candidates
Assign each entry a confidence weight
Relate candidates (only for taxos/ontos)
Link into hierarchies or triples
Score the relationships
Finish and generate list or vocab
Keep “best” scored candidates
Keep “best” scored relationships
Prune (optional step, sometimes human)
Remove noise and cleanup
Testing:
Precision/Recall/F1 to measure vs existing keywords/vocabs
Can also use relevance testing like nDCG if applying to Search
Use open sets if available (SemEval has good ones)
Otherwise, curate one manually
Varies between experts, so get consensus!
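Evaluation can be as simple as comparing an extracted term list against a hand-curated gold set. A minimal sketch of that Precision/Recall/F1 comparison (the term lists below are made-up placeholders, not from the corpus):

```python
# Minimal sketch: precision/recall/F1 of an extracted term list vs. a gold set.
def prf(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                        # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical lists for illustration only
auto = ["open source", "search engine", "trek holodeck"]
gold = ["open source", "search engine", "inverse document frequency"]
print(prf(auto, gold))  # roughly (0.667, 0.667, 0.667)
```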
6. Our Example Corpus
https://opensourceconnections.com/blog/
Quality content written by our hosts and community members
Articles are lacking keywords, and search doesn’t give term suggestions!
Highly contextual to the audience
8. Latent Dirichlet Allocation (LDA)
Unsupervised ML for topical classification of documents
“if observations are words collected into documents, [LDA] posits that each
document is a mixture of a small number of topics and that each word's
creation is attributable to one of the document's topics” - Wikipedia
How it works:
Give it a corpus (pre-processed into nice tokens)
Specify an exact number of topics and train
Uses a Dirichlet prior for the Bayesian probability of each term belonging to a topic
The topics are identified and assigned to the documents
Trained model is re-used to classify new documents
Language independent, well established statistical proofs
Downsides: Can be nondeterministic, intensive training, model maintenance
9. LDA – Example Corpus Topics
Using Gensim LdaModel
Steps:
Tokenize the content
Remove non-words and stopwords
Stem or lemmatize
Train the model (with 20 topics)
See the topics!
Save the model and use it later to
classify new documents with topics
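A minimal sketch of those steps with Gensim's LdaModel (the corpus loading and preprocessing choices here are assumptions; the actual run used the ~700 blog articles):

```python
# Hedged sketch of the steps above using Gensim.
from gensim import corpora
from gensim.models import LdaModel
from gensim.parsing.preprocessing import preprocess_string  # tokenize, strip stopwords, stem

docs = ["...article text...", "...more article text..."]    # placeholder for the blog corpus
tokenized = [preprocess_string(d) for d in docs]

dictionary = corpora.Dictionary(tokenized)
bow = [dictionary.doc2bow(toks) for toks in tokenized]

lda = LdaModel(bow, id2word=dictionary, num_topics=20, passes=10)
for topic_id, terms in lda.print_topics(num_topics=20, num_words=10):
    print(topic_id, terms)                                   # see the topics

lda.save("blog-topics.lda")                                  # reuse the model later
new_bow = dictionary.doc2bow(preprocess_string("text of a new article"))
print(lda.get_document_topics(new_bow))                      # classify a new document
```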
Resulting topics:
1) 0.109 search
2) 0.087 use
3) 0.079 can
4) 0.06 queri
5) 0.049 open
6) 0.046 sourc
7) 0.043 data
8) 0.041 solr
9) 0.04 like
10) 0.037 field
11) 0.031 document
12) 0.027 score
13) 0.026 result
14) 0.025 user
15) 0.024 will
16) 0.023 govern
17) 0.022 term
18) 0.021 match
19) 0.017 databas
20) 0.017 depend
10. Rapid Automatic Keyword Extraction (RAKE)
Novel language independent technique, very fast, and bag-of-words friendly
Also proposed a nice stopword selection algorithm as part of the paper
Candidates:
Tokenize
Split token groups by punctuation and stopwords
Identify co-occurrences of sequences of unfiltered words
Scores:
Each member token t is scored as k(t) = degree(t) / frequency(t), based on co-occurrence within candidate phrases
Keywords are re-adjoined as candidate phrases with score = sum of the member tokens' k(t)
Selection
Top third best scoring candidate phrases are kept
Downsides: Relies heavily on Frequency, Patented
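A from-scratch sketch of the idea (not the patented reference implementation; the tiny stopword list is an assumption, where the paper derives one from the corpus):

```python
# Toy RAKE: candidates split on punctuation/stopwords, words scored by degree/frequency,
# phrase score = sum of member word scores, keep the best-scoring third.
import re
from collections import defaultdict

STOPWORDS = {"for", "the", "a", "an", "and", "or", "of", "to", "is", "are", "in", "on", "&"}

def rake(text):
    phrases = []
    for chunk in re.split(r"[.,;:!?()]", text.lower()):       # punctuation delimits phrases
        current = []
        for word in chunk.split():
            if word in STOPWORDS:                              # stopwords split them further
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)                        # co-occurrence degree
    score = {w: degree[w] / freq[w] for w in freq}

    ranked = sorted(((sum(score[w] for w in p), " ".join(p)) for p in phrases), reverse=True)
    return ranked[: max(1, len(ranked) // 3)]                  # keep the top third

print(rake("For search managers, developers & data scientists finding ways to innovate"))
```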
11. RAKE algorithm in one slide
For search managers, developers & data scientists finding ways to innovate
Constructing criteria bounds = 1 + 1 + 2 = 4
Corresponding components = 2 + 1 = 3
Compatibility algorithms = 1.5 + 1 = 2.5
“For search managers, developers & data scientists finding ways to innovate”
12. Multi-purpose automatic topic indexing (“Maui”)
Upgrade on the “KEA” tool
Trains a Naïve Bayes Classifier with
the Weka ML framework
Can draw from existing vocabs
Multi-Purpose:
Assign terms with a controlled vocabulary
Index subject headings
Extract keywords and key phrases
Link entities
Extract terminologies
Generate automatic tagging
Downsides: Requires a training
set, model maintenance
14. Part of Speech tagging - 30 second overview
Sentence to Tree: PoS Tagging and Edge Labeling.
Based on training data from a Treebank
Treebanks are usually not domain specific
Lack of domain specificity can decrease accuracy
When it works, it is useful for many applications
The tax rate is 20.0%
https://demos.explosion.ai/displacy/?text=The%20tax%20rate%20is%2020%25
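A quick way to see the same tags yourself with spaCy (assumes the en_core_web_sm model is installed; exact tags can vary by model version):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The tax rate is 20.0%")
for token in doc:
    # token text, part-of-speech tag, dependency (edge) label, and head token
    print(token.text, token.pos_, token.dep_, token.head.text)
```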
15. Topia TermExtract
Python2 library: Topia.termextract
Algorithm:
Tags Part-of-Speech* for all terms in corpus
Find noun phrases using patterns of tags
State machine groups nouns and adjectives
~25 lines of python2
*Depends on NLTK, Part of Speech tagging
accuracy varies (75%-92%)
Score and Filter:
Term frequency
Term length
Can be changed with a plugin
Simple but effective
Downsides: favors single token terms
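Usage is roughly as follows; a sketch from the library's documentation as I recall it (topia.termextract is a Python 2 era package, and the example text is made up):

```python
# topia.termextract sketch: PoS-tag the text, find noun phrases, score by frequency/length.
from topia.termextract import extract

extractor = extract.TermExtractor()
# The default filter drops weak single-word terms; it can be swapped for a plugin,
# e.g. extractor.filter = extract.permissiveFilter  (keeps everything)
terms = extractor("The tax rate is the rate applied to taxable income. "
                  "Tax rates vary by taxable income bracket.")
print(sorted(terms, key=lambda t: t[1], reverse=True))  # tuples of (term, occurrences, word count)
```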
16. Skipchunk
I made this. The name is because it Skips noise to Chunk concepts and predicates.
Extracts flat SKOS concepts and predicates by finding similar label forms.
Algorithm:
Tags Part-of-Speech* for all terms in corpus
Lemmatize and switch to de-adjectival** nouns where appropriate
Take greedy noun/verb phrases, use the sorted nouns/verbs in the same phrase as a key identifier
Group sloppy noun phrases (concepts) and verb phrases (predicates) with the same key
Score is the total count of all label variations, prefLabel is the shortest variation
* Used NLTK at first but migrated to spaCy (90%+ PoS tagging accuracy)
**(beautiful → beauty), uses WordNet (needs accuracy improvement though)
Extra long chunks on purpose: they are likely to be terms of art with other forms
With Haystack we want to open up the invite to practitioners from
ADP PROPN PRON VERB PART VERB PART DET NOUN ADP NOUN ADP
around the world similarly struggling on hard meaty relevance problems.
ADP DET NOUN ADV VERB PART ADJ NOUN NOUN NOUN
Key identifier: invite practitioner
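Not the skipchunk package itself, but a sketch of that key-grouping idea with spaCy (the model name and example phrases are assumptions):

```python
# Group phrase variants by a key made of their sorted, lemmatized content words.
import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")

def concept_key(phrase):
    doc = nlp(phrase)
    content = [t.lemma_.lower() for t in doc if t.pos_ in ("NOUN", "PROPN", "VERB")]
    return " ".join(sorted(content))

groups = defaultdict(list)
for label in ["top search terms", "top 100 search terms",
              "document's term vectors", "term vectors from documents"]:
    groups[concept_key(label)].append(label)

for key, labels in groups.items():
    # score = number of label variations, prefLabel = the shortest variation
    print(key, "->", min(labels, key=len), len(labels))
```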
17. Skipchunk – example extractions
skos:prefLabel "twitter / facebook"@en ;
skos:altLabel "facebook and twitter"@en ;
skos:prefLabel "drupal search block"@en ;
skos:altLabel "search to any drupal block"@en ;
skos:prefLabel "top search terms"@en ;
skos:altLabel "top 100 search terms"@en ;
skos:prefLabel "document’s term vectors"@en ;
skos:altLabel "term vectors from documents"@en ;
skos:prefLabel "last longer"@en ;
skos:altLabel "longer lasting"@en ;
skos:prefLabel "was uploaded"@en ;
skos:altLabel "is that we can upload"@en ;
skos:prefLabel "woke up early"@en ;
skos:altLabel "woke us all up early"@en ;
skos:prefLabel "so you see"@en ;
skos:altLabel "so when you see"@en ;
skos:altLabel "so you can see"@en ;
Concepts (Noun Phrases) Predicates (Narrow Verb Phrases)
18. Showdown! Top 20 from the example corpus
RAKE:
trek holodeck
hồ chí minh
premium unsanded grout
prank bubble gum
weird art film
dog catcher law
latent semantic analysis
open source connections
tf*idf score
probabilistic information retrieval
open source solutions
open source search
inverse document frequency
open source software
open source community
google search appliance
test driven relevancy
social networking sites
semantic web technologies
open source projects
Topia:
search
solr
query
user
data
document
result
time
use
work
field
project
name
example
term
need
way
code
problem
thing
Skipchunk:
search engine
search results
opensource connections
otherness words
open source
search relevance
use case
search terms
frequencies for all four terms
blog post
solr or elasticsearch
visual studio
document frequency
otherness hand
dependencies downloading
query time
Eric Pugh
recommendation systems
title field
big data
MAUI:
solr
ve
machine learning
filtering that information
ranking
training set
training data
providing information
retrieval systems
machine learning techniques
query with rankings
cheat
installs git
extensive amounts
clean package
parent project
solr 4.X
mvn clean
custom relevancy
matches like
LDA:
search
use
can
queri
open
sourc
data
solr
like
field
document
score
result
user
will
govern
term
match
databas
depend
19. Ontology learning
Specifically – Terminological Ontologies
(SKOS, WordNet, Etc)
Taxonomies are hierarchical
Can narrow focus to Hypernym
Discovery (SemEval 2018 task 9)
More broadly, Taxonomy extraction,
Hyponym detection
SemEval challenges for state of the art
Don’t forget Meronymy (part-whole / membership relations)!
Image Source: Nuria Casellas, 2012
20. Types of Ontologies
Formal:
a conceptualization whose categories are distinguished by axioms and
definitions. Can be used to computationally and logically arrive at exact
proven conclusions.
Prototype-based:
distinguished by typical instances or prototypes rather than by axioms and
definitions in logic. Categories are formed by collecting instances
extensionally
Terminological:
partially specified by subtype-supertype relations and describe concepts by
concept labels or synonyms rather than prototypical instances, but lack an
axiomatic grounding. SKOS, WordNet, BabelNet are examples
Source: C. Biemann, 2005
22. Hearst Patterns (Lexico-Syntactic)
“Automatic Acquisition of Hyponyms from Large Text Corpora”
Marti Hearst, 1992. Cited by 3504 in Google Scholar
Hard and fast rules based on language syntax
Uses trigger words and punctuation
NP0 such as {NP1,NP2 …, (and | or)} NPn
for all NPi, 1<=i<=n, hyponym(NPi, NP0)
Therefore: hyponym(“Bing”, “search engine”)
such NP as {NP,}* {or|and} NP
NP {, NP}* {,} or other NP
…
“…traffic comes from an external search engine such as Google, Bing, or Yahoo”
Lexico-Syntactic patterns have improved with research and expanded to Meronyms
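A toy regex version of the first pattern; real systems match chunked noun phrases rather than raw word spans, so this is only a sketch:

```python
# Toy "NP0 such as NP1, NP2, ... or NPn" extractor: every NPi becomes a hyponym of NP0.
import re

SUCH_AS = re.compile(r"(\w[\w ]*?)\s+such as\s+([\w ,]+?(?:\s+(?:and|or)\s+[\w ]+)?)(?=[.;]|$)")

def hyponym_pairs(sentence):
    pairs = []
    for m in SUCH_AS.finditer(sentence):
        hypernym = " ".join(m.group(1).split()[-2:])          # crude NP0: last two words
        for np in re.split(r",|\band\b|\bor\b", m.group(2)):  # split the enumerated NPs
            if np.strip():
                pairs.append((np.strip(), hypernym))
    return pairs

print(hyponym_pairs("traffic comes from an external search engine such as Google, Bing, or Yahoo"))
# [('Google', 'search engine'), ('Bing', 'search engine'), ('Yahoo', 'search engine')]
```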
23. Lexico-Syntactic Pattern Success Rate
Some animals such as dogs → Success!
Countries around the world such as Armenia → Not so much success
Pattern Occurrences* Success Rate*
NP0 including NP1 601 409 (68.0%)
NP0 such as NP1 2389 2107 (88.2%)
NP0 like NP1 401 330 (82.0%)
NP0 e.g. NP1 170 134 (79%)
NP0 kinds|types|forms of NP1 48 31 (65%)
NP0 especially NP1 61 54 (89%)
NP0 notably NP1 22 13 (59%)
*Source: Klaussner and Zhekova, 2011
24. TAXI – A Taxonomy Induction System
State of the Art
First place in SemEval 2016 Task
13 (Taxonomy extraction
evaluation)
Innovations:
Hundreds of TB of general domain content
Focused Crawl of specific domain content
Substring Matching and Lexico-Syntactic
Patterns together, ported to four languages
Unsupervised and Supervised learning,
based on the language
Automated pruning of the graph
(Diagram: domain content on the web, the corpus & web overlap, and content original to the corpus)
25. TAXI Workflow
Gather lots of Content:
General: Wikipedia (11GB), 59G corpus (59GB), Common Crawl (168TB)
Specific: Focused Domain Crawl, language modelling approach, e.g. food, science, enviro
Thorough: takes 1 week per language per domain
Candidate Hypernyms:
Substring matches: “Biomedical science” → science, “Microbiology” → biology; calculate score σ(ti, tj)
Lexico-syntactic: PattaMaika (NLP chunks), PatternSim (Hearst, etc.), WebISA (regexp patterns); calculate score π(ti, tj)
Prune Candidates:
Unsupervised (French, Dutch, Italian): ti is a hypernym of tj if σ(ti, tj) > 0 OR π(ti, tj) ranks in the top 2
Supervised (English only): use a trained SVM classifier from an existing taxo; the model incorporates negative sampling; classifies all possible word pairs, positives get added
Construct Taxonomy steps:
Start with the noisy graph
Use graph pruning techniques
Remove cycles and bidirectionals
Makes a Directed Acyclic Graph
Attach top nodes to root
End result is a Taxonomy
28. What’s Next?
For the Field:
Trending to tasks being split:
Hypernym Detection
Hypernym Discovery
Taxonomy Construction
Taxonomy Evaluation
Word embeddings and Deep Learning are becoming more prevalent in the above tasks
For Skipchunk:
Improve accuracy
Generate RDF triples
Use common predicates and leverage substrings and lexico-syntactic patterns
Known issues that make things hard:
Co-reference resolution
Intransitivity
Passive vs Active voice
So, many of us here deal mostly in unstructured text. We do our best to help customers find things ensconced in corpora, trying to make their lives easier and more efficient. We often see patterns in the text ourselves and wish that, perhaps, this thing was metadata or that thing was normalized. So when given a bag of words, we take out our bag of tricks. We Lemmatize, we ASCII Fold, we catenateWords, we boost and tune. Doing our best to make things nice and tidy, coherent and findable. But almost always we do this inside the engine while processing content and queries.
But it is easy to lose sight of the overall problems when deep in our analyzers working through search bugs. Two main issues in search come down to context and intent. What is the customer really looking for? Can’t they be more expressive and less vague? An enormous gap exists because there isn’t any machine understanding of content, and an inverted index can’t connect with the customer in any meaningful way. A document or fragment is always about something, and it’s in a certain context. Being able to express that in an abstraction is what leads towards relating to the customer, their query intent, and goal for using your product in the first place.
We have difficulty representing a domain in plain terms, and reducing the dimensions of the content in that domain. We can do it but it takes time and specialist expertise.
So with that we look to automate. We automate not because we are lazy but because it saves time, removes bias, and broadens our ability beyond what we can achieve with our human minds and learned skills. We will automatically abstract across a corpus and I’m going to be talking about how to do that with tools and techniques. Some of these are flat, and some have deeper structure. Ultimately a domain is abstracted through an Ontology, which is a graph of core concepts and their relationships, sometimes hierarchical, but sometimes messy and bidirectional…but that’s fine because our mental representation of the world is never simple.
I only have 40 minutes, and these techniques are not exhaustive. Rather, they are a selection across a spectrum with varying use cases and degrees of accuracy. There is a whole world of research being done in this space that, at least in my normal day-to-day activities, rarely sees the light of day or gets applied to the hard problems we have. This world is hidden away in brilliant academic research and is sometimes available through proprietary and expensive black boxes. So over the past several years I’ve been casually researching. In preparation for this talk, I spent the past couple of months digging in deep, reading dozens of papers and trying all sorts of technology in my spare time. Many thanks to OpenSource Connections for giving me the opportunity to speak today. I hope you are able to take what I’ve distilled here and find some inspiration and new ways of thinking, since much of this research is about the most important thing we deal with: language.
We’re not yet at the point where things end up perfectly nice and clean. Many of these techniques will require a human touch to finish things off nicely, just to make sure our naïve and deterministic automatons are doing their job correctly. So unless you are web scale and can’t possibly take the time to have a person comb through the results, I recommend doing just that, and over time finding good ways to replicate what the human reviewers do.
Used wget to grab ~700 articles
If a thesaurus is provided, its concept labels are used to identify candidates.
Candidate Keywords are continuous token n-grams of declared length from 1 to 3. Candidates do not start or end with a stopword.
Candidate scoring features are TFxIDF, First Occurrence (beginning and end of documents are favored), number of tokens, and Node Degree (related to a thesaurus).
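Maui itself is Java/Weka, but the candidate rule above is easy to picture; a hypothetical sketch (the stopword list is an assumption):

```python
# n-grams of length 1 to 3 that neither start nor end with a stopword become candidates.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "or", "in", "for", "is"}

def candidates(tokens, max_len=3):
    out = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0].lower() not in STOPWORDS and gram[-1].lower() not in STOPWORDS:
                out.add(" ".join(gram))
    return out

print(candidates("the inverse document frequency of a term".split()))
# {'inverse', 'document', 'frequency', 'term', 'inverse document',
#  'document frequency', 'inverse document frequency'}
```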
Skipchunk is greedy and likes to include modifiers, since they are frequently included in terms of art for the domains we work with, such as “qualified buyer exemption” or “regulated investment company”. It also keeps stopwords inside phrases, like “tax for the year”, with another label for the same concept being “the year’s tax”. However, leading and trailing stopwords are removed, so the second label in the previous example becomes “year’s tax”.
Some things to notice: LDA and Topia are very similar and have lots of overlap, but this is for the entire corpus; further classification will produce quite different results for each individual document.
RAKE has lots of very odd terms seemingly out of domain context (“weird art film”, “trek holodeck”), attributable to their frequency in the posts; this can be tuned.
Maui has a nice mix of single and multiple token terms, but some noise (like ‘ve’ and ‘matches like’).
Skipchunk has some de-adjectival noun bugs (“in other words” → “otherness words”).
Marti Hearst proposed the original 6 patterns. The research was so early (pre-web!) that it was difficult to scale the analysis and in some steps she had to resort to manual work rather than computation. It is worth noting that though Noun Phrases are specified, these were discovered as part of the pattern, and not pre-computed for discovery.
Researchers Klaussner and Zhekova discovered and added new patterns, and did thorough analysis on their success rate.
Everything we’ve seen so far has been using the documents that are part of the corpus being analyzed, or models that are sourced from controlled content.
Your domain isn’t new, and while your content may be original, there is going to be significant overlap with existing and publicly available knowledge.
The focused crawl works well because it draws on the intelligence and scale of the web. When Hearst first published her Lexico-Syntactic patterns, the material was locked away in books or newly digitized content, therefore much of the extraction was original.
Nowadays, while the corpus may have original material unseen beforehand, the bulk of the domain is encapsulated by existing and publicly available knowledge on the web.
This uses existing information to get a statistically significant likelihood of hypernymy, and applies that likelihood to the new material and corpus structure.
The Taxonomy is still specific to the corpus, but it is improved by drawing from this likelihood. While in many ways this is a brute force approach, its success is a testament to the idea of learning from data at scale.
Note the maturity of the stack which draws on many techniques previously discussed. Extra steps were taken to ensure success with all four languages as specified in the SemEval task.
I wasn’t able to apply TAXI to our example corpus in time for this talk, so I used the SemEval ‘Science’ domain evaluation instead. Nevertheless, the concepts in TAXI are indeed powerful and warrant further investigation, whether used as-is or as a reference for development.