Presentation as given to the Haystack Conference, which outlines research and techniques for automatic extraction of keywords, concepts, and vocabularies from text corpora.
2. Agenda/Intro
Agenda
This slide
Why I’m talking
What I’m talking about
How to do what I’m talking about
Overview of tools and techniques
Where new research is headed
Questions
$> whoami
Max Irwin
Working in Search since 2012
Leads Search Center of Excellence
Long time programmer
Recent interests are NLP and
Deep Learning
No need to take photos of slides
Video, deck, code, references, materials will be made available
3. Why I’m talking. (Problem statement)
Information Retrieval Problems:
Suggesting stuff to users
based on what?
Content clustering/relationships/similarities
but how?
Slots and intent for Queries and Bots
with what?
Entities and Named Entity Recognition
sourced from where?
Question Answering
how can it know?
Dimension reduction for unstructured text
down to what?
Product Problems:
Lots of products in different domains
Law, Tax, Health, Marketing, Etc.
Better search with less effort
Shortage of metadata experts
Domains differ, content proprietary
Lots of work, always from scratch
Terms of Art, Concepts, Vocabularies,
take years to curate manually
They are usually subjective
My goal is to introduce you to a suite of techniques to help solve the above problems
4. What I’m Talking About.
A survey of technologies for automatically extracting the following from text:
Keywords:
Terms associated with documents
Classify and associate documents
Techniques: LDA, RAKE, Maui
Concepts:
Associates terms with the same semantic meaning (synonyms)
Building blocks for vocabularies
Techniques: Topia, Skipchunk
Ontologies/Taxonomies:
Represent entire domains (or subsets)
Reduce dimensions for abstracting domain corpora
Techniques: Lexico-syntactic patterns, TAXI
5. How do these tools work?
General workflow:
Get candidates
Preprocess, arrange, and group tokens
Score candidates
Assign each entry a confidence weight
Relate candidates (only for taxos/ontos)
Link into hierarchies or triples
Score the relationships
Finish and generate list or vocab
Keep “best” scored candidates
Keep “best” scored relationships
Prune (optional step, sometimes human)
Remove noise and cleanup
Testing:
Precision/Recall/F1 to measure vs existing keywords/vocabs
Can also use relevance testing like nDCG if applying to Search
Use open sets if available (SemEval has good ones)
Otherwise, curate one manually
Varies between experts, so get consensus!
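Evaluation can be as simple as comparing an extracted term list against a hand-curated gold set. A minimal sketch of that Precision/Recall/F1 comparison (the term lists below are made-up placeholders, not from the corpus):

```python
# Minimal sketch: precision/recall/F1 of an extracted term list vs. a gold set.
def prf(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                        # true positives
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical lists for illustration only
auto = ["open source", "search engine", "trek holodeck"]
gold = ["open source", "search engine", "inverse document frequency"]
print(prf(auto, gold))  # roughly (0.667, 0.667, 0.667)
```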
6. Our Example Corpus
https://opensourceconnections.com/blog/
Quality content written by our hosts and community members
Articles are lacking keywords, and search doesn’t give term suggestions!
Highly contextual to the audience
8. Latent Dirichlet Allocation (LDA)
Unsupervised ML for topical classification of documents
“if observations are words collected into documents, [LDA] posits that each
document is a mixture of a small number of topics and that each word's
creation is attributable to one of the document's topics” - Wikipedia
How it works:
Give it a corpus (pre-processed into nice tokens)
Specify an exact number of topics and train
Uses a Dirichlet prior for the Bayesian probability of each term belonging to a topic
The topics are identified and assigned to the documents
Trained model is re-used to classify new documents
Language independent, well established statistical proofs
Downsides: Can be nondeterministic, intensive training, model maintenance
9. LDA – Example Corpus Topics
Using Gensim LdaModel
Steps:
Tokenize the content
Remove non-words and stopwords
Stem or lemmatize
Train the model (with 20 topics)
See the topics!
Save the model and use it later to
classify new documents with topics
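A minimal sketch of those steps with Gensim's LdaModel (the corpus loading and preprocessing choices here are assumptions; the actual run used the ~700 blog articles):

```python
# Hedged sketch of the steps above using Gensim.
from gensim import corpora
from gensim.models import LdaModel
from gensim.parsing.preprocessing import preprocess_string  # tokenize, strip stopwords, stem

docs = ["...article text...", "...more article text..."]    # placeholder for the blog corpus
tokenized = [preprocess_string(d) for d in docs]

dictionary = corpora.Dictionary(tokenized)
bow = [dictionary.doc2bow(toks) for toks in tokenized]

lda = LdaModel(bow, id2word=dictionary, num_topics=20, passes=10)
for topic_id, terms in lda.print_topics(num_topics=20, num_words=10):
    print(topic_id, terms)                                   # see the topics

lda.save("blog-topics.lda")                                  # reuse the model later
new_bow = dictionary.doc2bow(preprocess_string("text of a new article"))
print(lda.get_document_topics(new_bow))                      # classify a new document
```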
Resulting topics:
1) 0.109 search
2) 0.087 use
3) 0.079 can
4) 0.06 queri
5) 0.049 open
6) 0.046 sourc
7) 0.043 data
8) 0.041 solr
9) 0.04 like
10) 0.037 field
11) 0.031 document
12) 0.027 score
13) 0.026 result
14) 0.025 user
15) 0.024 will
16) 0.023 govern
17) 0.022 term
18) 0.021 match
19) 0.017 databas
20) 0.017 depend
10. Rapid Automatic Keyword Extraction (RAKE)
Novel language independent technique, very fast, and bag-of-words friendly
Also proposed a nice stopword selection algorithm as part of the paper
Candidates:
Tokenize
Split token groups by punctuation and stopwords
Identify co-occurrences of sequences of unfiltered words
Scores:
Each member token t is scored as k(t) = degree(t) / frequency(t), based on co-occurrence within candidate phrases
Keywords are re-adjoined as candidate phrases with score = sum of the member tokens' k(t)
Selection
Top third best scoring candidate phrases are kept
Downsides: Relies heavily on Frequency, Patented
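A from-scratch sketch of the idea (not the patented reference implementation; the tiny stopword list is an assumption, where the paper derives one from the corpus):

```python
# Toy RAKE: candidates split on punctuation/stopwords, words scored by degree/frequency,
# phrase score = sum of member word scores, keep the best-scoring third.
import re
from collections import defaultdict

STOPWORDS = {"for", "the", "a", "an", "and", "or", "of", "to", "is", "are", "in", "on", "&"}

def rake(text):
    phrases = []
    for chunk in re.split(r"[.,;:!?()]", text.lower()):       # punctuation delimits phrases
        current = []
        for word in chunk.split():
            if word in STOPWORDS:                              # stopwords split them further
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)                        # co-occurrence degree
    score = {w: degree[w] / freq[w] for w in freq}

    ranked = sorted(((sum(score[w] for w in p), " ".join(p)) for p in phrases), reverse=True)
    return ranked[: max(1, len(ranked) // 3)]                  # keep the top third

print(rake("For search managers, developers & data scientists finding ways to innovate"))
```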
11. RAKE algorithm in one slide
For search managers, developers & data scientists finding ways to innovate
Constructing criteria bounds = 1 + 1 + 2 = 4
Corresponding components = 2 + 1 = 3
Compatibility algorithms = 1.5 + 1 = 2.5
“For search managers, developers & data scientists finding ways to innovate”
12. Multi-purpose automatic topic indexing (“Maui”)
Upgrade on the “KEA” tool
Trains a Naïve Bayes Classifier with
the Weka ML framework
Can draw from existing vocabs
Multi-Purpose:
Assign terms with a controlled vocabulary
Index subject headings
Extract keywords and key phrases
Link entities
Extract terminologies
Generate automatic tagging
Downsides: Requires a training
set, model maintenance
14. Part of Speech tagging - 30 second overview
Sentence to Tree: PoS Tagging and Edge Labeling.
Based on training data from a Treebank
Treebanks are usually not domain specific
Lack of domain specificity can decrease accuracy
When it works, it is useful for many applications
The tax rate is 20.0%
https://demos.explosion.ai/displacy/?text=The%20tax%20rate%20is%2020%25
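A quick way to see the same tags yourself with spaCy (assumes the en_core_web_sm model is installed; exact tags can vary by model version):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The tax rate is 20.0%")
for token in doc:
    # token text, part-of-speech tag, dependency (edge) label, and head token
    print(token.text, token.pos_, token.dep_, token.head.text)
```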
15. Topia TermExtract
Python2 library: Topia.termextract
Algorithm:
Tags Part-of-Speech* for all terms in corpus
Find noun phrases using patterns of tags
State machine groups nouns and adjectives
~25 lines of python2
*Depends on NLTK, Part of Speech tagging
accuracy varies (75%-92%)
Score and Filter:
Term frequency
Term length
Can be changed with a plugin
Simple but effective
Downsides: favors single token terms
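Usage is roughly as follows; a sketch from the library's documentation as I recall it (topia.termextract is a Python 2 era package, and the example text is made up):

```python
# topia.termextract sketch: PoS-tag the text, find noun phrases, score by frequency/length.
from topia.termextract import extract

extractor = extract.TermExtractor()
# The default filter drops weak single-word terms; it can be swapped for a plugin,
# e.g. extractor.filter = extract.permissiveFilter  (keeps everything)
terms = extractor("The tax rate is the rate applied to taxable income. "
                  "Tax rates vary by taxable income bracket.")
print(sorted(terms, key=lambda t: t[1], reverse=True))  # tuples of (term, occurrences, word count)
```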
16. Skipchunk
I made this. The name is because it Skips noise to Chunk concepts and predicates.
Extracts flat SKOS concepts and predicates by finding similar label forms.
Algorithm:
Tags Part-of-Speech* for all terms in corpus
Lemmatize and switch to de-adjectival** nouns where appropriate
Take greedy noun/verb phrases, use the sorted nouns/verbs in the same phrase as a key identifier
Group sloppy noun phrases (concepts) and verb phrases (predicates) with the same key
Score is the total count of all label variations, prefLabel is the shortest variation
* Used NLTK at first but migrated to spaCy (90%+ PoS tagging accuracy)
**(beautiful → beauty), uses WordNet (needs accuracy improvement though)
Extra long chunks on purpose: they are likely to be terms of art with other forms
With Haystack we want to open up the invite to practitioners from
ADP PROPN PRON VERB PART VERB PART DET NOUN ADP NOUN ADP
around the world similarly struggling on hard meaty relevance problems.
ADP DET NOUN ADV VERB PART ADJ NOUN NOUN NOUN
Key identifier: invite practitioner
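Not the skipchunk package itself, but a sketch of that key-grouping idea with spaCy (the model name and example phrases are assumptions):

```python
# Group phrase variants by a key made of their sorted, lemmatized content words.
import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")

def concept_key(phrase):
    doc = nlp(phrase)
    content = [t.lemma_.lower() for t in doc if t.pos_ in ("NOUN", "PROPN", "VERB")]
    return " ".join(sorted(content))

groups = defaultdict(list)
for label in ["top search terms", "top 100 search terms",
              "document's term vectors", "term vectors from documents"]:
    groups[concept_key(label)].append(label)

for key, labels in groups.items():
    # score = number of label variations, prefLabel = the shortest variation
    print(key, "->", min(labels, key=len), len(labels))
```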
17. Skipchunk – example extractions
skos:prefLabel "twitter / facebook"@en ;
skos:altLabel "facebook and twitter"@en ;
skos:prefLabel "drupal search block"@en ;
skos:altLabel "search to any drupal block"@en ;
skos:prefLabel "top search terms"@en ;
skos:altLabel "top 100 search terms"@en ;
skos:prefLabel "document’s term vectors"@en ;
skos:altLabel "term vectors from documents"@en ;
skos:prefLabel "last longer"@en ;
skos:altLabel "longer lasting"@en ;
skos:prefLabel "was uploaded"@en ;
skos:altLabel "is that we can upload"@en ;
skos:prefLabel "woke up early"@en ;
skos:altLabel "woke us all up early"@en ;
skos:prefLabel "so you see"@en ;
skos:altLabel "so when you see"@en ;
skos:altLabel "so you can see"@en ;
Concepts (Noun Phrases) Predicates (Narrow Verb Phrases)
18. Showdown! Top 20 from the example corpus
RAKE:
trek holodeck
hồ chí minh
premium unsanded grout
prank bubble gum
weird art film
dog catcher law
latent semantic analysis
open source connections
tf*idf score
probabilistic information retrieval
open source solutions
open source search
inverse document frequency
open source software
open source community
google search appliance
test driven relevancy
social networking sites
semantic web technologies
open source projects
Topia:
search
solr
query
user
data
document
result
time
use
work
field
project
name
example
term
need
way
code
problem
thing
Skipchunk:
search engine
search results
opensource connections
otherness words
open source
search relevance
use case
search terms
frequencies for all four terms
blog post
solr or elasticsearch
visual studio
document frequency
otherness hand
dependencies downloading
query time
Eric Pugh
recommendation systems
title field
big data
MAUI:
solr
ve
machine learning
filtering that information
ranking
training set
training data
providing information
retrieval systems
machine learning techniques
query with rankings
cheat
installs git
extensive amounts
clean package
parent project
solr 4.X
mvn clean
custom relevancy
matches like
LDA:
search
use
can
queri
open
sourc
data
solr
like
field
document
score
result
user
will
govern
term
match
databas
depend
19. Ontology learning
Specifically – Terminological Ontologies
(SKOS, WordNet, Etc)
Taxonomies are hierarchical
Can narrow focus to Hypernym
Discovery (SemEval 2018 task 9)
More broadly, Taxonomy extraction,
Hyponym detection
SemEval challenges for state of the art
Don’t forget Meronymy (part-whole / membership relations)!
Image Source: Nuria Casellas, 2012
20. Types of Ontologies
Formal:
a conceptualization whose categories are distinguished by axioms and
definitions. Can be used to computationally and logically arrive at exact
proven conclusions.
Prototype-based:
distinguished by typical instances or prototypes rather than by axioms and
definitions in logic. Categories are formed by collecting instances
extensionally
Terminological:
partially specified by subtype-supertype relations and describe concepts by
concept labels or synonyms rather than prototypical instances, but lack an
axiomatic grounding. SKOS, WordNet, BabelNet are examples
Source: C. Biemann, 2005
22. Hearst Patterns (Lexico-Syntactic)
“Automatic Acquisition of Hyponyms from Large Text Corpora”
Marti Hearst, 1992. Cited by 3504 in Google Scholar
Hard and fast rules based on language syntax
Uses trigger words and punctuation
NP0 such as {NP1,NP2 …, (and | or)} NPn
for all NPi, 1<=i<=n, hyponym(NPi, NP0)
Therefore: hyponym(“Bing”, “search engine”)
such NP as {NP,}* {or|and} NP
NP {, NP}* {,} or other NP
…
“…traffic comes from an external search engine such as Google, Bing, or Yahoo”
Lexico-Syntactic patterns have improved with research and expanded to Meronyms
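A toy regex version of the first pattern; real systems match chunked noun phrases rather than raw word spans, so this is only a sketch:

```python
# Toy "NP0 such as NP1, NP2, ... or NPn" extractor: every NPi becomes a hyponym of NP0.
import re

SUCH_AS = re.compile(r"(\w[\w ]*?)\s+such as\s+([\w ,]+?(?:\s+(?:and|or)\s+[\w ]+)?)(?=[.;]|$)")

def hyponym_pairs(sentence):
    pairs = []
    for m in SUCH_AS.finditer(sentence):
        hypernym = " ".join(m.group(1).split()[-2:])          # crude NP0: last two words
        for np in re.split(r",|\band\b|\bor\b", m.group(2)):  # split the enumerated NPs
            if np.strip():
                pairs.append((np.strip(), hypernym))
    return pairs

print(hyponym_pairs("traffic comes from an external search engine such as Google, Bing, or Yahoo"))
# [('Google', 'search engine'), ('Bing', 'search engine'), ('Yahoo', 'search engine')]
```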
23. Lexico-Syntactic Pattern Success Rate
Some animals such as dogs → Success!
Countries around the world such as Armenia → Not so much success
Pattern Occurrences* Success Rate*
NP0 including NP1 601 409 (68.0%)
NP0 such as NP1 2389 2107 (88.2%)
NP0 like NP1 401 330 (82.0%)
NP0 e.g. NP1 170 134 (79%)
NP0 kinds|types|forms of NP1 48 31 (65%)
NP0 especially NP1 61 54 (89%)
NP0 notably NP1 22 13 (59%)
*Source: Klaussner and Zhekova, 2011
24. TAXI – A Taxonomy Induction System
State of the Art
First place in SemEval 2016 Task
13 (Taxonomy extraction
evaluation)
Innovations:
Hundreds of TB of general domain content
Focused Crawl of specific domain content
Substring Matching and Lexico-Syntactic
Patterns together, ported to four languages
Unsupervised and Supervised learning,
based on the language
Automated pruning of the graph
(Diagram: domain content on the web, the corpus & web overlap, and content original to the corpus)
25. TAXI Workflow
Gather lots of Content:
General: Wikipedia (11GB), 59G corpus (59GB), Common Crawl (168TB)
Specific: Focused Domain Crawl, language modelling approach, e.g. food, science, enviro
Thorough: takes 1 week per language per domain
Candidate Hypernyms:
Substring matches: “Biomedical science” → science, “Microbiology” → biology; calculate score σ(ti, tj)
Lexico-syntactic: PattaMaika (NLP chunks), PatternSim (Hearst, etc.), WebISA (regexp patterns); calculate score π(ti, tj)
Prune Candidates:
Unsupervised (French, Dutch, Italian): ti is a hypernym of tj if σ(ti, tj) > 0 OR π(ti, tj) ranks in the top 2
Supervised (English only): use a trained SVM classifier from an existing taxo; the model incorporates negative sampling; classifies all possible word pairs, positives get added
Construct Taxonomy steps:
Start with the noisy graph
Use graph pruning techniques
Remove cycles and bidirectionals
Makes a Directed Acyclic Graph
Attach top nodes to root
End result is a Taxonomy
28. What’s Next?
For the Field:
Trending to tasks being split:
Hypernym Detection
Hypernym Discovery
Taxonomy Construction
Taxonomy Evaluation
Word embeddings and Deep Learning are becoming more prevalent in the above tasks
For Skipchunk:
Improve accuracy
Generate RDF triples
Use common predicates and leverage substrings and lexico-syntactic patterns
Known issues that make things hard:
Co-reference resolution
Intransitivity
Passive vs Active voice
So, many of us here deal mostly in unstructured text. We do our best to help customers find things ensconced in corpora, trying to make their lives easier and more efficient. We often see patterns in the text ourselves and wish that, perhaps, this thing was metadata or that thing was normalized. So when given a bag of words, we take out our bag of tricks. We Lemmatize, we ASCII Fold, we catenateWords, we boost and tune. Doing our best to make things nice and tidy, coherent and findable. But almost always we do this inside the engine while processing content and queries.
But it is easy to lose sight of the overall problems when deep in our analyzers working through search bugs. Two main issues in search come down to context and intent. What is the customer really looking for? Can’t they be more expressive and less vague? An enormous gap exists because there isn’t any machine understanding of content, and an inverted index can’t connect with the customer in any meaningful way. A document or fragment is always about something, and it’s in a certain context. Being able to express that in an abstraction is what leads towards relating to the customer, their query intent, and goal for using your product in the first place.
We have difficulty representing a domain in plain terms, and reducing the dimensions of the content in that domain. We can do it but it takes time and specialist expertise.
So with that we look to automate. We automate not because we are lazy but because it saves time, removes bias, and broadens our ability beyond what we can achieve with our human minds and learned skills. We will automatically abstract across a corpus and I’m going to be talking about how to do that with tools and techniques. Some of these are flat, and some have deeper structure. Ultimately a domain is abstracted through an Ontology, which is a graph of core concepts and their relationships, sometimes hierarchical, but sometimes messy and bidirectional…but that’s fine because our mental representation of the world is never simple.
I only have 40 minutes, and these techniques are not exhaustive. Rather, they are a selection across a spectrum with varying use cases and degrees of accuracy. There is a whole world of research being done in this space that, at least in my normal day-to-day activities, rarely sees the light of day or gets applied to the hard problems we have. This world is hidden away in brilliant academic research and is sometimes available through proprietary and expensive black boxes. So over the past several years I’ve been casually researching. In preparation for this talk, I spent the past couple of months digging in deep, reading dozens of papers and trying all sorts of technology in my spare time. Many thanks to OpenSource Connections for giving me the opportunity to speak today. I hope you are able to take what I’ve distilled here and find some inspiration and new ways of thinking, since much of this research is about the most important thing we deal with: language.
We’re not yet at the point where things end up perfectly nice and clean. Many of these techniques will require a human touch to finish things off nicely, just to make sure our naïve and deterministic automatons are doing their job correctly. So unless you are web scale and can’t possibly take the time to have a person comb through the results, I recommend doing just that, and over time finding good ways to replicate what the human reviewers do.
Used wget to grab ~700 articles
If a thesaurus is provided, its concept labels are used to identify candidates.
Candidate Keywords are continuous token n-grams of declared length from 1 to 3. Candidates do not start or end with a stopword.
Candidate scoring features are TFxIDF, First Occurrence (beginning and end of documents are favored), number of tokens, and Node Degree (related to a thesaurus).
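Maui itself is Java/Weka, but the candidate rule above is easy to picture; a hypothetical sketch (the stopword list is an assumption):

```python
# n-grams of length 1 to 3 that neither start nor end with a stopword become candidates.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "or", "in", "for", "is"}

def candidates(tokens, max_len=3):
    out = set()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0].lower() not in STOPWORDS and gram[-1].lower() not in STOPWORDS:
                out.add(" ".join(gram))
    return out

print(candidates("the inverse document frequency of a term".split()))
# {'inverse', 'document', 'frequency', 'term', 'inverse document',
#  'document frequency', 'inverse document frequency'}
```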
Skipchunk is greedy and likes to include modifiers, since they are frequently included in terms of art for the domains we work with, such as “qualified buyer exemption” or “regulated investment company”. It also keeps stopwords inside phrases, like “tax for the year”, with another label for the same concept being “the year’s tax”. However, leading and trailing stopwords are removed, so the second label in the previous example becomes “year’s tax”.
Some things to notice: LDA and Topia are very similar and have lots of overlap, but this is for the entire corpus; further classification will produce quite different results for each individual document.
RAKE has lots of very odd terms seemingly out of domain context (“weird art film”, “trek holodeck”), attributable to their frequency in the posts; this can be tuned.
Maui has a nice mix of single and multiple token terms, but some noise (like ‘ve’ and ‘matches like’).
Skipchunk has some de-adjectival noun bugs (“in other words” → “otherness words”).
Marti Hearst proposed the original 6 patterns. The research was so early (pre-web!) that it was difficult to scale the analysis and in some steps she had to resort to manual work rather than computation. It is worth noting that though Noun Phrases are specified, these were discovered as part of the pattern, and not pre-computed for discovery.
Researchers Klaussner and Zhekova discovered and added new patterns, and did thorough analysis on their success rate.
Everything we’ve seen so far has been using the documents that are part of the corpus being analyzed, or models that are sourced from controlled content.
Your domain isn’t new, and while your content may be original, there is going to be significant overlap with existing and publicly available knowledge.
The focused crawl works well because it draws on the intelligence and scale of the web. When Hearst first published her Lexico-Syntactic patterns, the material was locked away in books or newly digitized content, therefore much of the extraction was original.
Nowadays, while the corpus may have original material unseen beforehand, the bulk of the domain is encapsulated by existing and publicly available knowledge on the web.
This uses existing information to get a statistically significant likelihood of hypernymy, and applies that likelihood to the new material and corpus structure.
The Taxonomy is still specific to the corpus, but it is improved by drawing from this likelihood. While in many ways this is a brute force approach, its success is a testament to the idea of learning from data at scale.
Note the maturity of the stack which draws on many techniques previously discussed. Extra steps were taken to ensure success with all four languages as specified in the SemEval task.
I wasn’t able to apply TAXI to our example corpus in time for this talk, so I used the SemEval ‘Science’ domain evaluation instead. Nevertheless, the concepts in TAXI are indeed powerful and warrant further investigation, whether used as-is or as a reference for development.