The document describes dictionary-based named entity extraction from streaming text. It discusses named entity recognition approaches like regular expression-based, dictionary-based, and model-based. It then describes the SoDA v.2 architecture for scalable dictionary-based named entity extraction, including the Aho-Corasick algorithm, SolrTextTagger, and services provided. Finally, it outlines future work on improving the system.
2. Agenda
• Introduction
• The Entity Resolution Problem
• Named Entity Recognition/Extraction (NER)
• SoDA v.2 Architecture
• SoDA v.2 Services
• Future Work
• Conclusion
Dictionary based Named Entity Extraction from streaming text
3. Introduction
• About Me
• Work at Elsevier Labs
• Interested in Search, NLP and Machine Learning
• Email: sujit.pal@elsevier.com
• Twitter: @palsujit
• About Elsevier Labs
• Advanced Technology Group within Elsevier
• More info: https://labs.elsevier.com
• About Elsevier
• World’s largest publisher of STM books and journals
• Uses data to inform and enable consumers of STM Info
4. The Entity Resolution Problem
• Named Entity Recognition/Extraction – recognize mentions of named
entities in text.
• Named Entity Resolution – resolve each mention to the root (canonical)
entity it refers to.
Example: Hillary Clinton and Bill Clinton visited a diner during
Clinton’s 2016 presidential campaign.
[Figure: the example sentence annotated with the entity labels PERSON, LOCATION and EVENT.]
5. Approaches to NER
• Three major approaches
• Regular Expression (RegEx) Based
• Dictionary Based
• Model Based
• Hybrid approaches
• Combining Approaches
• Data Programming
• Active Learning
6. RegEx based NER
Pierre Vinken , 61 years old , will join the board as a
nonexecutive director Nov. 29 .
PERSON: ([A-Z][a-z]+){2,3}
AGE: (\d){1,3}\syears\sold
DATE: ([A-Z][a-z]{2}(\.)*)\s(\d{2})
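A minimal Python sketch of RegEx based NER on the example sentence may help here. The patterns are simplified variants written for this illustration, not the exact ones on the slide. In particular, the PERSON pattern is adapted to allow whitespace between name parts, and the function name `regex_ner` is hypothetical.

```python
import re

TEXT = ("Pierre Vinken , 61 years old , will join the board as a "
        "nonexecutive director Nov. 29 .")

# Illustrative patterns (assumptions, simplified from the slide).
PATTERNS = {
    "PERSON": re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+){1,2}"),
    "AGE": re.compile(r"\d{1,3}\syears\sold"),
    "DATE": re.compile(r"[A-Z][a-z]{2}\.?\s\d{1,2}"),
}

def regex_ner(text):
    """Return (start, end, label, matched_text) tuples sorted by start offset."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label, m.group()))
    return sorted(spans)

for start, end, label, surface in regex_ner(TEXT):
    print(label, start, end, surface)
```

Patterns like these are cheap to write but brittle: a name with an initial or a date written as "29 Nov." would need additional patterns.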
7. Dictionary Based NER
Pierre Vinken , 61 years old , will join the board as a
nonexecutive director Nov. 29 .
PERSON: names of famous people
DATE: month names and abbreviations
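As a rough sketch of the dictionary based idea, the snippet below looks up token n-grams in a small phrase dictionary, preferring the longest match at each position. The lexicon contents, the whitespace tokenization and the function name `dictionary_ner` are assumptions made for illustration only.

```python
TEXT = ("Pierre Vinken , 61 years old , will join the board as a "
        "nonexecutive director Nov. 29 .")

# Hypothetical toy dictionary mapping phrases to entity types.
LEXICON = {
    "Pierre Vinken": "PERSON",
    "Nov.": "DATE",
    "November": "DATE",
}

def dictionary_ner(tokens, lexicon, max_len=3):
    """Greedy longest-match lookup of token n-grams in a phrase dictionary.

    Returns (start_token, end_token, label, phrase) tuples.
    """
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                matches.append((i, i + n, lexicon[phrase], phrase))
                i += n          # skip past the matched phrase
                break
        else:
            i += 1              # no phrase starts at this token
    return matches

print(dictionary_ner(TEXT.split(), LEXICON))
```

This brute-force lookup is fine for a toy dictionary; with large dictionaries a state-machine approach scales much better.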
8. Dictionary based NER – 3rd Party S/W
• Open Source
• GATE (General Architecture for Text Engineering)
• pyahocorasick
• SoDA (SOlr Dictionary Annotator)
• Commercial / Open Source
• LingPipe
10. Model based NER – Sequence Models
• Typical model structure
• Input – a sentence s or a sequence of words {x0, x1, …, xn}.
• Output – a sequence Y {y0, y1, …, yn} of IOB tags.
• Hidden Markov Models – IOB tag depends on input variable and
previous label.
• Conditional Random Fields – IOB tag depends on features {f0, f1, …,
fm} with learned weights {λ0, λ1, …, λm} defined over current word xi,
current label yi, previous label yi-1, and the entire sentence s.
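To make the output side of these models concrete, here is a small sketch that converts token-level entity spans into IOB tags (B- marks the first token of an entity, I- a continuation, O everything else). The `PER` label name and the helper `spans_to_iob` are illustrative assumptions.

```python
def spans_to_iob(tokens, spans):
    """Convert (start_token, end_token, label) spans into one IOB tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = "Pierre Vinken , 61 years old".split()
tags = spans_to_iob(tokens, [(0, 2, "PER")])
for token, tag in zip(tokens, tags):
    print(token, tag)
```

A sequence model is trained to predict this tag sequence from the token sequence; the entity spans are then recovered from runs of B-/I- tags.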
11. Model based NER – Sequence Models (2)
• Family of Deep Learning Sequence Models – has been used for POS
tagging, phrase chunking, NER and even language translation.
• Feature vectors for words created using Word Embeddings (word2vec,
GloVe, fastText, etc.).
• Performance can be improved with Attention mechanisms.
• Represents state of the art for Named Entity Recognition.
• Needs lots of data to train.
[Figure: LSTM encoder and LSTM decoder connected through weights; the encoder reads x0, x1, …, xn, EOS and the decoder emits y0, y1, …, yn, EOS.]
12. Model based NER – 3rd party S/W
• Open Source
• GATE
• Apache OpenNLP
• Stanford NER (has NLTK plugin)
• spaCy NER
• NERDS
• Commercial
• Basis Technology Rosette Entity Extractor
• IBM Watson / Alchemy API
• Amazon Comprehend
• Azure Named Entity Recognition
13. Hybrid Approaches – combinations
• Create initial labeled dataset by harvesting entities from large text corpora
using one or more of the following:
• Weak Supervision – RegEx and other pattern matching (e.g. Hearst
Patterns for phrases).
• Distant Supervision – matching against dictionaries derived from
industry specific (public or private) ontologies.
• Unsupervised – legacy rule based models.
• Supervised – predictions from weaker models.
• Crowdsourcing – using human experts.
• Train powerful seq2seq model using labeled dataset.
• Refine using human-in-the-loop active learning or other techniques.
14. Data Programming - Snorkel
• Start with noisy labels L from various sources
• Train generative model capable of generating probabilities P for each of
the output classes based on feature vector of noisy labels.
• Train final noise-aware discriminative model with output of generative
model P and original data X to predict class label Q for data.
• The Snorkel project (https://hazyresearch.github.io/snorkel/) pioneered
this approach and provides tooling for all these steps.
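The first step, turning several noisy label sources into per-class probabilities, can be sketched in plain Python. Note the substitution: this sketch uses a simple vote fraction as a stand-in for the learned generative model, and it does not use the Snorkel API. The labeling functions, word lists and names below are all hypothetical.

```python
from collections import Counter

ABSTAIN = None
KNOWN_NAMES = {"pierre", "vinken"}         # hypothetical dictionary
STOP_WORDS = {"the", "a", "will", "as"}    # hypothetical stop list

# Each labeling function votes for a class or abstains.
def lf_capitalized(token):
    return "ENTITY" if token[:1].isupper() else ABSTAIN

def lf_dictionary(token):
    return "ENTITY" if token.lower() in KNOWN_NAMES else ABSTAIN

def lf_stop_word(token):
    return "O" if token.lower() in STOP_WORDS else ABSTAIN

LABELING_FUNCTIONS = [lf_capitalized, lf_dictionary, lf_stop_word]

def label_probabilities(token, lfs=LABELING_FUNCTIONS):
    """Fraction of non-abstaining votes per class (stand-in for a generative model)."""
    votes = [label for label in (lf(token) for lf in lfs) if label is not ABSTAIN]
    counts = Counter(votes)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}

for token in ["Vinken", "The", "board"]:
    print(token, label_probabilities(token))
```

A real generative model would additionally estimate how accurate and how correlated the sources are, rather than weighting every vote equally.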
Image Credit: Snorkel Project
15. SoDA v.2 Architecture
• Theoretical Foundations
• Aho-Corasick algorithm
• SolrTextTagger
• SoDA Architecture
• Scaling SoDA
16. Aho-Corasick Algorithm
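• Builds a data structure called a “trie” from all dictionary patterns,
with failure links so the text is scanned in a single pass.
• The trie acts as a state machine over characters.
• Dictionary based NERs implement a similar state machine over words in
phrases.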
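The following is a compact, self-contained sketch of an Aho-Corasick matcher, written for illustration rather than taken from any of the tools mentioned in this talk. Because it iterates over sequence elements, the same class works over characters (patterns as strings) or over words (patterns as tuples of tokens). The class and method names are assumptions.

```python
from collections import deque

class AhoCorasick:
    def __init__(self, patterns):
        self.goto = [{}]    # goto[state][symbol] -> next state (the trie)
        self.fail = [0]     # failure link per state
        self.out = [[]]     # patterns ending at each state
        for pattern in patterns:
            self._add(pattern)
        self._build_failure_links()

    def _add(self, pattern):
        state = 0
        for sym in pattern:
            nxt = self.goto[state].get(sym)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[state][sym] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            state = nxt
        self.out[state].append(pattern)

    def _build_failure_links(self):
        # Breadth-first, so shallower states are finished before deeper ones.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for sym, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and sym not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(sym, 0)
                # Inherit patterns that end at the failure state.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find(self, sequence):
        """Return (start_index, pattern) for every match, in one pass."""
        state, hits = 0, []
        for i, sym in enumerate(sequence):
            while state and sym not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(sym, 0)
            for pattern in self.out[state]:
                hits.append((i - len(pattern) + 1, pattern))
        return hits

# Over characters
print(AhoCorasick(["he", "she", "his", "hers"]).find("ushers"))

# Over words: patterns are tuples of tokens
words = AhoCorasick([("Hillary", "Clinton"), ("Clinton",)])
print(words.find("Hillary Clinton and Bill Clinton".split()))
```

The word-level usage mirrors how a dictionary based tagger treats phrases: each token is one transition, and overlapping matches such as "Hillary Clinton" and "Clinton" are both reported.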
Image Credit: ResearchGate
17. SolrTextTagger
• Lucene’s TokenStreams are finite state automata (FSA).
• SolrTextTagger (https://github.com/OpenSextant/SolrTextTagger)
dynamically builds FSAs from dictionary entries and stores them in a
Finite State Transducer (FST) data structure.
• Provides tag service to annotate incoming streaming text against FST.
• Input is text, output is matched dictionary entries and offsets into text.
• SolrTextTagger is OSS created by Lucene/Solr committer David Smiley.
Image Credit: Slides for Automata Invasion talk by Michael McCandless and Robert Muir
18. Architecture
• Co-located with a standalone Solr server.
• Scala based thin wrapper over SolrTextTagger.
• Provides the following services:
• unified JSON over HTTP request/response
• multiple matching styles
• multiple lexicons
• hides details of managing SolrTextTagger
• Streaming (text) and non-streaming (phrase) matching services.
• Programmatic APIs for Scala and Python.
19. Scaling
• Install and configure Solr, SolrTextTagger and SoDA, and create an AMI.
• Use CloudFormation (or Terraform) templates to instantiate a cluster of
Solr+SoDA instances behind an Elastic Load Balancer.
• Autoscaling cluster
• Monitored by CloudWatch
• New dictionaries are loaded by instantiating an EC2 instance from the AMI
via Lambda, and saved back into the AMI for the next cluster build.
[Diagram labels: client, loader]
20. Consuming Annotations at scale
• Synchronous (diagram components): Databricks Notebook, Documents on S3,
SoDA cluster, Parquet Annotations on S3
• Asynchronous (diagram components): Documents on S3, Producer,
Kafka/Kinesis Streams, Consumer, SoDA cluster, Parquet Annotations on S3
21. SoDA Services
• Bulk Loader (backend)
• Client facing (front end)
• Index (status check)
• Add New Record into Lexicon
• Delete Lexicon or Entry
• Annotate Text against Lexicon
• List Available Lexicons
• Find coverage of incoming text against Lexicons
• Lookup by ID
• Reverse Lookup by Phrase
22. SoDA Bulk Loader
• Multithreaded loader for bulk loading dictionaries into SoDA.
• Requires tab-separated file in following format:
• id {TAB} primary-name {PIPE} alt-name-1 {PIPE} ... {PIPE} alt-name-n
• One line per dictionary entry
• Script to run (on SoDA/Solr box).
• ./bulk_load.sh lexicon /path/to/input num_workers
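The input format above can be illustrated with a short parser. This is only a sketch of reading one line in the described layout, not the loader's own code; the example ID and names, and the function name `parse_entry`, are made up for illustration.

```python
def parse_entry(line):
    """Split 'id<TAB>primary|alt-1|...|alt-n' into (id, primary_name, alt_names)."""
    entry_id, names = line.rstrip("\n").split("\t", 1)
    primary, *alternates = names.split("|")
    return entry_id, primary, alternates

# Hypothetical dictionary line: one entry with two alternate names.
line = "C001\tHeart Attack|Myocardial Infarction|MI\n"
print(parse_entry(line))
```

An entry without alternate names simply has no pipe characters after the tab, in which case the list of alternates is empty.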
23. SoDA Health Check – index.json
• Returns a status message. Meant to be used for testing if the SoDA application is up.
• Python client code
• Scala client code
• Output
24. Annotate Text against Lexicon – annot.json
• Annotates text against a specific lexicon and match type.
• Match types can be one of the following:
• exact – matches text spans with dictionary entries.
• lower – same as exact, but matches are case-insensitive
• stop – same as lower, but stop words removed from both text and dictionary entries
• stem1 – same as stop, but stemmed with Solr minimal English stemmer
• stem2 – same as stop, but stemmed with Solr Kstem stemmer
• stem3 – same as stop, but stemmed with Solr Porter stemmer.
• Input (HTTP POST)
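To illustrate how the match types build on each other, the sketch below treats each one as a progressively stronger normalization applied to both the text span and the dictionary entry before comparing them. It runs locally and does not call SoDA. The stop word list is a tiny made-up sample, and the stem1 to stem3 types are left out because they depend on Solr's stemmers.

```python
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "on", "for"}  # illustrative sample

def normalize(phrase, match_type):
    """Normalize a phrase into a token list according to the match type."""
    tokens = phrase.split()
    if match_type == "exact":
        return tokens
    tokens = [t.lower() for t in tokens]
    if match_type == "lower":
        return tokens
    if match_type == "stop":
        return [t for t in tokens if t not in STOP_WORDS]
    raise ValueError("unsupported match type in this sketch: " + match_type)

def is_match(text_span, entry, match_type):
    return normalize(text_span, match_type) == normalize(entry, match_type)

for match_type in ("exact", "lower", "stop"):
    print(match_type,
          is_match("Heart Attack", "heart attack", match_type),
          is_match("cancer of the lung", "Cancer of Lung", match_type))
```

Looser match types raise recall at the cost of precision, so the choice depends on how noisy the incoming text is.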
25. Annotate Text against Lexicon (2)
• Python client code
• Scala client code
• Output
26. List Available Lexicons – dicts.json
• Returns a list of lexicons available to annotate against.
• Python client
• Scala client
• Output
27. Check Coverage – coverage.json
• This can be used to find which lexicons are appropriate for annotating your text.
The service lets you send a piece of text to all hosted lexicons and returns
the number of matches found in each.
• Input (HTTP POST)
• Python client
• Scala client
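The idea behind a coverage check can be sketched locally with in-memory lexicons: count how many phrases from each lexicon occur in the text, then compare the counts. This is not the service's implementation or its request format. The lexicon names and contents, the whitespace tokenization and the function names are assumptions.

```python
def count_matches(text, phrases):
    """Count case-insensitive occurrences of lexicon phrases on token boundaries."""
    tokens = text.lower().split()
    normalized = {" ".join(p.lower().split()) for p in phrases}
    max_len = max((len(p.split()) for p in normalized), default=0)
    count = 0
    for i in range(len(tokens)):
        # Only consider n-grams that fit inside the remaining tokens.
        for n in range(1, min(max_len, len(tokens) - i) + 1):
            if " ".join(tokens[i:i + n]) in normalized:
                count += 1
    return count

def coverage(text, lexicons):
    """Map each lexicon name to its number of matches in the text."""
    return {name: count_matches(text, phrases) for name, phrases in lexicons.items()}

# Hypothetical lexicons
LEXICONS = {
    "countries": {"France", "United States"},
    "diseases": {"lung cancer", "asthma"},
}
print(coverage("Asthma rates in France and the United States", LEXICONS))
```

A lexicon with a high count relative to the others is a reasonable candidate for annotating similar text.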
28. Check Coverage (2)
• Output
29. Lookup by ID – lookup.json
• Allows looking up a dictionary entry by lexicon and ID.
• Input (HTTP POST)
• Python client
• Scala client
30. Lookup by ID (2)
• Output
31. Reverse Lookup by Phrase
• Matches phrases against specific lexicon and match type.
• Match types can be one of the following:
• All match types supported by Annotation service (annot.json)
• lsort – case-insensitive matching against phrase with words sorted
alphabetically.
• s3sort – case-insensitive matching against phrase stemmed using
Porter Stemmer (stem3) and its words sorted alphabetically.
• Input
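The sorted-word matching described for lsort can be pictured as follows: both the dictionary entry and the incoming phrase are lowercased and their words sorted, so word order no longer matters. This is a local sketch only. The entries and IDs are hypothetical, and s3sort would additionally stem each word before sorting, which is omitted here.

```python
def lsort_key(phrase):
    """Lowercase the phrase and sort its words alphabetically."""
    return " ".join(sorted(phrase.lower().split()))

def build_index(entries):
    """Map each sorted-word key to the IDs of the entries that produce it."""
    index = {}
    for entry_id, name in entries:
        index.setdefault(lsort_key(name), []).append(entry_id)
    return index

def reverse_lookup(phrase, index):
    return index.get(lsort_key(phrase), [])

# Hypothetical dictionary entries
INDEX = build_index([("D001", "Lung Cancer"), ("D002", "Heart Attack")])
print(reverse_lookup("cancer lung", INDEX))
```

Sorting words makes "cancer lung" and "Lung Cancer" equivalent, which is useful for phrases but would be too permissive for annotating running text.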
32. Reverse Lookup by Phrase (2)
• Python client
• Scala client
• Output
33. Future Work
• Open items are listed on the SoDA issues page
(https://github.com/elsevierlabs-os/soda/issues) and updated continuously
as I find them.
• Please feel free to post issues and ideas for improvement.