Dictionary based
Named Entity
Extraction from
streaming text
Sujit Pal
SWIFT Technology Center, July 16, 2018
Agenda
• Introduction
• The Entity Resolution Problem
• Named Entity Recognition/Extraction (NER)
• SoDA v.2 Architecture
• SoDA v.2 Services
• Future Work
• Conclusion
Introduction
• About Me
• Work at Elsevier Labs
• Interested in Search, NLP and Machine Learning
• Email: sujit.pal@elsevier.com
• Twitter: @palsujit
• About Elsevier Labs
• Advanced Technology Group within Elsevier
• More info: https://labs.elsevier.com
• About Elsevier
• World’s largest publisher of STM books and journals
• Uses data to inform and enable consumers of STM Info
The Entity Resolution Problem
• Named Entity Recognition/Extraction – recognize mentions of named entities in text.
• Named Entity Resolution – resolve each recognized mention to its root entity.
Hillary Clinton and Bill Clinton visited a diner during
Clinton’s 2016 presidential campaign.
Entity types highlighted: PERSON, LOCATION, EVENT
Approaches to NER
• Three major approaches
• Regular Expression (RegEx) Based
• Dictionary Based
• Model Based
• Hybrid approaches
• Combining Approaches
• Data Programming
• Active Learning
RegEx based NER
Pierre Vinken , 61 years old , will join the board as a
nonexecutive director Nov. 29 .
PERSON: [A-Z][a-z]+(\s[A-Z][a-z]+){1,2}
AGE: (\d){1,3}\syears\sold
DATE: ([A-Z][a-z]{2}(\.)*)\s(\d{2})
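As a minimal sketch, the three rules on this slide can be run against the example sentence with Python's `re` module. The backslashes that appear to have been lost from the slide text are restored here, and the PERSON rule is adjusted to allow whitespace between name parts; both are assumptions, and the function name `regex_ner` is hypothetical.

```python
import re

# Patterns follow the slide's three rules. The escapes (\d, \s, \.) are
# restored by assumption, and PERSON allows a space between name parts.
PATTERNS = {
    "PERSON": re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+){1,2}"),
    "AGE": re.compile(r"(\d){1,3}\syears\sold"),
    "DATE": re.compile(r"([A-Z][a-z]{2}(\.)*)\s(\d{2})"),
}

def regex_ner(text):
    """Return (label, start, end, matched_text) tuples sorted by start offset."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.start(), match.end(), match.group(0)))
    return sorted(found, key=lambda item: item[1])

sentence = ("Pierre Vinken , 61 years old , will join the board as a "
            "nonexecutive director Nov. 29 .")
entities = regex_ner(sentence)
for label, start, end, matched in entities:
    print(label, start, end, matched)
```

Rules like these are quick to write but brittle: any capitalized two-word sequence would be labeled PERSON, which is one motivation for the dictionary and model based approaches that follow.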
Dictionary Based NER
Pierre Vinken , 61 years old , will join the board as a
nonexecutive director Nov. 29 .
PERSON: names of famous people
DATE: month names and abbreviations
Dictionary based NER – 3rd Party S/W
• Open Source
• GATE (General Architecture for Text Engineering)
• pyahocorasick
• SoDA (SOlr Dictionary Annotator)
• Commercial / Open Source
• LingPipe
Model Based NER
Pierre/B-PER Vinken/I-PER ,/O 61/B-AGE years/I-AGE old/O ,/O will/O join/O the/O
board/O as/O a/O non-executive/O director/O Nov./B-DATE 29/I-DATE ./O
(IOB tags predicted by a machine learning model)
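A model's per-token IOB tags are usually collapsed back into entity phrases before use. The sketch below shows one way to do that for the tags on this slide; the function name `iob_to_spans` is hypothetical, and stray `I-` tags that do not continue an open entity are simply dropped, which is a simplifying assumption.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, phrase) pairs."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # O tags (and mismatched I- tags) close any open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ("Pierre Vinken , 61 years old , will join the board as a "
          "non-executive director Nov. 29 .").split()
tags = (["B-PER", "I-PER", "O", "B-AGE", "I-AGE"] + ["O"] * 10
        + ["B-DATE", "I-DATE", "O"])
spans = iob_to_spans(tokens, tags)
print(spans)
```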
Model based NER – Sequence Models
• Typical model structure
• Input – a sentence s or a sequence of words {x0, x1, …, xn}.
• Output – a sequence Y {y0, y1, …, yn} of IOB tags.
• Hidden Markov Models – IOB tag depends on input variable and
previous label.
• Conditional Random Fields – IOB tag depends on features {f0, f1, …,
fm} with learned weights {λ0, λ1, …, λm} defined over current word xi,
current label yi, previous label yi-1, and the entire sentence s.
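For reference, the usual linear-chain CRF formulation that matches this description can be written as below. The normalizer Z(s) and the convention that y_{-1} is a fixed start symbol are not on the slide and are stated here as assumptions; each feature f_j reads the current word x_i through the sentence s and the position i.

```latex
p(y \mid s) = \frac{1}{Z(s)}
  \exp\left( \sum_{i=0}^{n} \sum_{j=0}^{m} \lambda_j \, f_j(y_{i-1}, y_i, s, i) \right),
\qquad
Z(s) = \sum_{y'} \exp\left( \sum_{i=0}^{n} \sum_{j=0}^{m} \lambda_j \, f_j(y'_{i-1}, y'_i, s, i) \right)
```

The predicted tag sequence is the y that maximizes p(y | s).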
Model based NER – Sequence Models (2)
• Family of Deep Learning Sequence Models – have been used for POS
tagging, phrase chunking, NER and even language translation.
• Feature vectors for words created using Word Embeddings (word2vec,
GloVe, fastText, etc.).
• Performance can be improved with Attention mechanisms.
• Represents state of the art for Named Entity Recognition.
• Needs lots of data to train.
[Figure: seq2seq model with an LSTM encoder over inputs x0, x1, …, xn, EOS and an LSTM decoder producing y0, y1, …, yn, EOS; diagram label: weights]
Model based NER – 3rd party S/W
• Open Source
• GATE
• Apache OpenNLP
• Stanford NER (has NLTK plugin)
• spaCy NER
• NERDS
• Commercial
• Basis Technology Rosette Entity Extractor
• IBM Watson / Alchemy API
• Amazon Comprehend
• Azure Named Entity Recognition
Hybrid Approaches – combinations
• Create initial labeled dataset by harvesting entities from large text corpora
using one or more of the following:
• Weak Supervision – RegEx and other pattern matching (e.g., Hearst
Patterns for phrases).
• Distant Supervision – matching against dictionaries derived from
industry specific (public or private) ontologies.
• Unsupervised – legacy rule based models.
• Supervised – predictions from weaker models.
• Crowdsourcing – using human experts.
• Train powerful seq2seq model using labeled dataset.
• Refine using human-in-the-loop active learning or other techniques.
Data Programming - Snorkel
• Start with noisy labels L from various sources
• Train generative model capable of generating probabilities P for each of
the output classes based on feature vector of noisy labels.
• Train final noise-aware discriminative model with output of generative
model P and original data X to predict class label Q for data.
• The Snorkel project (https://hazyresearch.github.io/snorkel/) pioneered
this approach and provides tooling for all these steps.
Image Credit: Snorkel Project
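The first step of data programming can be pictured with a few hand-written labeling functions whose votes are combined per token. The sketch below does not use the Snorkel API: the labeling functions are hypothetical, and a plain vote fraction stands in for the generative model that Snorkel would train to produce the probabilities P.

```python
from collections import Counter

ABSTAIN = None
MONTH_ABBREVIATIONS = {"Jan.", "Feb.", "Mar.", "Apr.", "Jun.", "Jul.",
                       "Aug.", "Sep.", "Oct.", "Nov.", "Dec."}

def lf_month_dictionary(token):
    """Distant supervision: vote DATE for a known month abbreviation."""
    return "DATE" if token in MONTH_ABBREVIATIONS else ABSTAIN

def lf_capitalized_abbreviation(token):
    """Weak supervision: vote DATE for a capitalized 3-letter token plus period."""
    is_abbrev = (len(token) == 4 and token[:3].isalpha()
                 and token[0].isupper() and token.endswith("."))
    return "DATE" if is_abbrev else ABSTAIN

def lf_no_digits(token):
    """Crude prior: vote O (outside) for any token without a digit."""
    return ABSTAIN if any(ch.isdigit() for ch in token) else "O"

LABELING_FUNCTIONS = [lf_month_dictionary, lf_capitalized_abbreviation, lf_no_digits]

def vote_distribution(token):
    """Turn the non-abstaining votes for a token into label fractions."""
    votes = [vote for vote in (lf(token) for lf in LABELING_FUNCTIONS)
             if vote is not ABSTAIN]
    if not votes:
        return {}
    counts = Counter(votes)
    return {label: count / len(votes) for label, count in counts.items()}

for token in ["Nov.", "board", "29"]:
    print(token, vote_distribution(token))
```

The resulting soft labels, together with the original data X, are what the final noise-aware discriminative model would be trained on; that step is not shown here.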
SoDA v.2 Architecture
• Theoretical Foundations
• Aho-Corasick algorithm
• SolrTextTagger
• SoDA Architecture
• Scaling SoDA
Aho-Corasick Algorithm
• Builds on a data structure called a “trie”, extended with failure links
• State machine over characters
• Dictionary based NERs implement similar state machine over words in
phrases.
Image Credit: ResearchGate
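To make the state machine concrete, here is a compact character-level Aho-Corasick sketch in plain Python: a trie of dictionary phrases, failure links computed breadth-first, and a single pass over the text. It is an illustration only, not the code used by SoDA or pyahocorasick, and the function names are hypothetical.

```python
from collections import deque

def build_automaton(phrases):
    """Build goto transitions, failure links and outputs for the phrases."""
    goto, fail, out = [{}], [0], [[]]
    for phrase in phrases:
        state = 0
        for ch in phrase:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(phrase)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit phrases that end at the failure state (proper suffixes).
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (start, end, phrase) for every dictionary phrase found in text."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for phrase in out[state]:
            matches.append((i - len(phrase) + 1, i + 1, phrase))
    return matches

automaton = build_automaton(["he", "she", "his", "hers"])
matches = find_all("ushers", automaton)
print(matches)
```

The text is scanned once regardless of dictionary size, which is what makes this family of algorithms attractive for streaming text. A word-level variant would use tokens instead of characters as the transition symbols.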
SolrTextTagger
• Lucene’s TokenStreams are finite state automatons (FSA).
• SolrTextTagger (https://github.com/OpenSextant/SolrTextTagger)
dynamically builds FSAs from dictionary entries and stores them in a
Finite State Transducer (FST) data structure.
• Provides tag service to annotate incoming streaming text against FST.
• Input is text, output is matched dictionary entries and offsets into text.
• SolrTextTagger is OSS created by Lucene/Solr committer David Smiley.
Image Credit: Slides for Automata Invasion talk by Michael McCandless and Robert Muir
Architecture
• Co-located with standalone Solr server.
• Scala based thin wrapper over SolrTextTagger.
• Provides the following services:
• unified JSON over HTTP request/response
• multiple matching styles
• multiple lexicons
• hides details of managing SolrTextTagger.
• Streaming (text) and non-streaming (phrase) matching services.
• Programmatic APIs for Scala and Python.
Scaling
• Install and configure Solr, SolrTextTagger and SoDA, and create an AMI.
• Use CloudFormation (or Terraform) templates to instantiate a cluster of Solr+SoDA instances behind an Elastic Load Balancer.
• Autoscaling cluster
• Monitored by CloudWatch
• New dictionaries are loaded by instantiating EC2 from the AMI via Lambda and saved back into the AMI for the next cluster build.
[Figure: cluster diagram; labels: client, loader]
Consuming Annotations at scale
• Synchronous
[Figure: components shown are Databricks Notebook, Documents on S3, SoDA cluster, Parquet Annotations on S3]
• Asynchronous
[Figure: components shown are Documents on S3, Producer, Kafka/Kinesis Streams, Consumer, SoDA cluster, Parquet Annotations on S3]
SoDA Services
• Bulk Loader (backend)
• Client facing (front end)
• Index (status check)
• Add New Record into Lexicon
• Delete Lexicon or Entry
• Annotate Text against Lexicon
• List Available Lexicons
• Find coverage of incoming text against Lexicons
• Lookup by ID
• Reverse Lookup by Phrase
SoDA Bulk Loader
• Multithreaded loader for bulk loading dictionaries into SoDA.
• Requires tab-separated file in following format:
• id {TAB} primary-name {PIPE} alt-name-1 {PIPE} ... {PIPE} alt-name-n
• One line per dictionary entry
• Script to run (on SoDA/Solr box).
• ./bulk_load.sh lexicon /path/to/input num_workers
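A small helper can produce a file in this layout. The sketch below assumes the spaces around {TAB} and {PIPE} on the slide are only for readability, so the delimiters are written without surrounding whitespace; the function names and sample entries are hypothetical.

```python
def format_entry(entry_id, primary_name, alt_names=()):
    """Format one dictionary entry as: id<TAB>primary|alt-1|...|alt-n."""
    names = [primary_name, *alt_names]
    for name in names:
        # Delimiter characters inside a name would corrupt the line.
        if any(ch in name for ch in ("\t", "|", "\n")):
            raise ValueError(f"name contains a delimiter character: {name!r}")
    return f"{entry_id}\t{'|'.join(names)}"

def write_lexicon_file(path, entries):
    """Write one line per (entry_id, primary_name, alt_names) tuple."""
    with open(path, "w", encoding="utf-8") as handle:
        for entry_id, primary_name, alt_names in entries:
            handle.write(format_entry(entry_id, primary_name, alt_names) + "\n")

# Hypothetical sample entries, for illustration only.
sample_entries = [
    ("C001", "United States", ["USA", "US"]),
    ("C002", "India", []),
]
lines = [format_entry(*entry) for entry in sample_entries]
print("\n".join(lines))
```

The file written by `write_lexicon_file` would then be passed as the `/path/to/input` argument of the script above.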
SoDA Health Check – index.json
• Returns a status message. Meant to be used for testing if the SoDA application is up.
• Python client code
• Scala client code
• Output
Annotate Text against Lexicon – annot.json
• Annotates text against a specific lexicon and match type.
• Match types can be one of the following:
• exact – matches text spans exactly (case-sensitive) against dictionary entries.
• lower – same as exact, but matches are case-insensitive
• stop – same as lower, but stop words removed from both text and dictionary entries
• stem1 – same as stop, but stemmed with Solr minimal English stemmer
• stem2 – same as stop, but stemmed with Solr Kstem stemmer
• stem3 – same as stop, but stemmed with Solr Porter stemmer.
• Input (HTTP POST)
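The request for this service can be sketched with the Python standard library. Everything specific here is an assumption rather than taken from the slide: the base URL, the JSON field names `lexicon`, `text` and `matching`, and the helper name `build_annot_request`. Check the SoDA repository for the actual request format. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

MATCH_TYPES = ("exact", "lower", "stop", "stem1", "stem2", "stem3")

def build_annot_request(base_url, lexicon, text, matching="exact"):
    """Build (but do not send) a JSON POST request for annot.json."""
    if matching not in MATCH_TYPES:
        raise ValueError(f"unknown match type: {matching}")
    # Field names below are assumed, not confirmed by the slide.
    payload = {"lexicon": lexicon, "text": text, "matching": matching}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/annot.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical host and lexicon name, for illustration only.
request = build_annot_request(
    "http://localhost:8080/soda", "countries",
    "Shanghai is a city in China.", matching="lower",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it to a running server; the response would be expected to contain the matched entries and their offsets into the text.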
Annotate Text against Lexicon (2)
• Python client code
• Scala client code
• Output
List Available Lexicons – dicts.json
• Returns a list of lexicons available to annotate against.
• Python client
• Scala client
• Output
Check Coverage – coverage.json
• Use this to find which lexicons are appropriate for annotating your text.
The service sends a piece of text to all hosted lexicons and returns
the number of matches found in each.
• Input (HTTP POST)
• Python client
• Scala client
Check Coverage (2)
• Output
Lookup by ID – lookup.json
• Allows looking up a dictionary entry by lexicon and ID.
• Input (HTTP POST)
• Python client
• Scala client
Lookup by ID (2)
• Output
Reverse Lookup by Phrase
• Matches phrases against specific lexicon and match type.
• Match types can be one of the following:
• All match types supported by Annotation service (annot.json)
• lsort – case-insensitive matching against phrase with words sorted
alphabetically.
• s3sort – case-insensitive matching against phrase stemmed using
Porter Stemmer (stem3) and its words sorted alphabetically.
• Input
Reverse Lookup by Phrase (2)
• Python client
• Scala client
• Output
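The lsort match type for reverse lookup can be pictured as comparing normalized keys rather than raw phrases. The sketch below is a local illustration, not SoDA code: it lowercases, splits on whitespace (an assumption; SoDA relies on Solr analysis) and sorts the words. The s3sort variant would additionally apply a Porter stemmer and is not shown. The sample lexicon is hypothetical.

```python
def lsort_key(phrase):
    """Lowercase the phrase and sort its whitespace-separated words."""
    return " ".join(sorted(phrase.lower().split()))

def reverse_lookup(phrase, lexicon):
    """Return ids of entries having a name whose lsort key equals the phrase's."""
    key = lsort_key(phrase)
    return [entry_id for entry_id, names in lexicon.items()
            if any(lsort_key(name) == key for name in names)]

# Hypothetical lexicon: entry id -> primary and alternate names.
lexicon = {
    "D001": ["heart attack", "myocardial infarction"],
    "D002": ["attack rate"],
}
result = reverse_lookup("Attack Heart", lexicon)
print(result)
```

Because word order and case are normalized away, "Attack Heart" resolves to the same entry as "heart attack", while an exact match would find nothing.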
Future Work
• Open items are listed on the SoDA issues page
(https://github.com/elsevierlabs-os/soda/issues) and updated continuously as I find them.
• Please feel free to post issues and ideas for improvement.
Thank you
Contact Information
Email: sujit.pal@elsevier.com
Twitter: @palsujit
SoDA: https://github.com/elsevierlabs-os/soda

Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 

Último (20)

办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 

SoDA v2 - Named Entity Recognition from streaming text

  • 1. Dictionary based Named Entity Extraction from streaming text Sujit Pal SWIFT Technology Center, July 16, 2018
  • 2. Agenda • Introduction • The Entity Resolution Problem • Named Entity Recognition/Extraction (NER) • SoDA v.2 Architecture • SoDA v.2 Services • Future Work • Conclusion
  • 3. Introduction • About Me • Work at Elsevier Labs • Interested in Search, NLP and Machine Learning • Email: sujit.pal@elsevier.com • Twitter: @palsujit • About Elsevier Labs • Advanced Technology Group within Elsevier • More info: https://labs.elsevier.com • About Elsevier • World’s largest publisher of STM books and journals • Uses data to inform and enable consumers of STM Info
  • 4. The Entity Resolution Problem • Named Entity Recognition/Extraction – recognize mentions of named entities. • Named Entity Resolution – resolve entity with root entity. • Example: Hillary Clinton and Bill Clinton visited a diner during Clinton’s 2016 presidential campaign. (Figure labels: PERSON, LOCATION, EVENT.)
  • 5. Approaches to NER • Three major approaches • Regular Expression (RegEx) Based • Dictionary Based • Model Based • Hybrid approaches • Combining Approaches • Data Programming • Active Learning
  • 6. RegEx based NER • Example: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 . • PERSON: ([A-Z][a-z]+){2,3} • AGE: (\d){1,3}\syears\sold • DATE: ([A-Z][a-z]{2}(.)*)\s(\d{2})
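The RegEx approach on slide 6 can be sketched in Python as follows. This is a minimal illustration, not the deck's code: the three patterns are adapted from the slide (the PERSON pattern here allows whitespace between capitalized words, and the DATE pattern makes the trailing period optional), and the names `PATTERNS` and `regex_ner` are hypothetical.

```python
import re

# Patterns adapted from the slide; exact forms are assumptions for illustration.
PATTERNS = {
    # Two or three capitalized words separated by whitespace.
    "PERSON": re.compile(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+){1,2}"),
    # One to three digits followed by "years old".
    "AGE": re.compile(r"\d{1,3}\syears\sold"),
    # Three-letter month abbreviation, optional period, then a day number.
    "DATE": re.compile(r"[A-Z][a-z]{2}\.?\s\d{1,2}"),
}


def regex_ner(text):
    """Return (start, end, label, matched_text) tuples sorted by offset."""
    spans = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label, m.group()))
    return sorted(spans)


sentence = ("Pierre Vinken , 61 years old , will join the board "
            "as a nonexecutive director Nov. 29 .")
entities = regex_ner(sentence)
for start, end, label, matched in entities:
    print(label, repr(matched), start, end)
```

Patterns like these are cheap to write but brittle: any capitalized two-word phrase would be tagged PERSON, which is one reason the deck moves on to dictionary and model based approaches.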
  • 7. Dictionary Based NER • Example: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 . • PERSON: Names of famous people • DATE: Month names and abbrs.
  • 8. Dictionary based NER – 3rd Party S/W • Open Source • GATE (General Architecture for Text Engineering) • pyahocorasick • SoDA (SOlr Dictionary Annotator) • Commercial / Open Source • LingPipe
  • 10. Model based NER – Sequence Models • Typical model structure • Input – a sentence s or a sequence of words {x0, x1, …, xn}. • Output – a sequence Y {y0, y1, …, yn} of IOB tags. • Hidden Markov Models – IOB tag depends on input variable and previous label. • Conditional Random Fields – IOB tag depends on features {f0, f1, …, fm} with learned weights {λ0, λ1, …, λm} defined over current word xi, current label yi, previous label y(i-1), and the entire sentence s.
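To make the IOB output format on slide 10 concrete, the sketch below converts token-level entity spans into an IOB tag sequence (B- for the first token of an entity, I- for the following tokens, O for everything else). The function name `spans_to_iob` and the span representation (start token, end token exclusive, label) are assumptions made for this illustration.

```python
def spans_to_iob(tokens, spans):
    """Convert (start, end_exclusive, label) token spans to IOB tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # continuation tokens
    return tags


tokens = "Pierre Vinken , 61 years old , will join the board".split()
spans = [(0, 2, "PERSON"), (3, 6, "AGE")]
tags = spans_to_iob(tokens, spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

A sequence model such as an HMM or CRF is trained to predict exactly this kind of tag list, one tag per input token.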
  • 11. Model based NER – Sequence Models (2) • Family of Deep Learning Sequence Models – have been used for POS tagging, phrase chunking, NER and even language translation. • Feature vectors for words created using Word Embeddings (word2vec, GloVe, fastText, etc). • Performance can be improved with Attention mechanisms. • Represents state of the art for Named Entity Recognition. • Needs lots of data to train. (Figure: LSTM encoder over inputs x0, x1, …, xn, EOS and LSTM decoder producing outputs y0, y1, …, yn, EOS, linked by weights.)
  • 12. Model based NER – 3rd party S/W • Open Source • GATE • Apache OpenNLP • Stanford NER (has NLTK plugin) • spaCy NER • NERDS • Commercial • Basis Technology Rosette Entity Extractor • IBM Watson / Alchemy API • Amazon Comprehend • Azure Named Entity Recognition
  • 13. Hybrid Approaches – combinations • Create initial labeled dataset by harvesting entities from large text corpora using one or more of the following: • Weak Supervision – RegEx and other pattern matching (e.g. Hearst Patterns for phrases). • Distant Supervision – matching against dictionaries derived from industry specific (public or private) ontologies. • Unsupervised – legacy rule based models. • Supervised – predictions from weaker models. • Crowdsourcing – using human experts. • Train powerful seq2seq model using labeled dataset. • Refine using human-in-the-loop active learning or other techniques.
  • 14. Data Programming - Snorkel • Start with noisy labels L from various sources • Train generative model capable of generating probabilities P for each of the output classes based on feature vector of noisy labels. • Train final noise-aware discriminative model with output of generative model P and original data X to predict class label Q for data. • The Snorkel project (https://hazyresearch.github.io/snorkel/) pioneered this approach and provides tooling for all these steps. Image Credit: Snorkel Project
  • 15. SoDA v.2 Architecture • Theoretical Foundations • Aho-Corasick algorithm • SolrTextTagger • SoDA Architecture • Scaling SoDA
  • 16. Aho-Corasick Algorithm • Implements a data structure called “trie” • State machine over characters • Dictionary based NERs implement similar state machine over words in phrases. Image Credit: ResearchGate
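The trie-plus-state-machine idea on slide 16 can be sketched as a small character-level Aho-Corasick matcher in pure Python. This is an illustrative textbook implementation, not SoDA or SolrTextTagger code; the names `build_automaton` and `find_all` and the sample patterns are assumptions.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output lists."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> fallback state on mismatch
    out = [[]]    # out[state] -> patterns ending at this state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_offset, pattern) for every dictionary match in text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
matches = find_all("ushers", automaton)
print(matches)
```

The text is scanned once regardless of dictionary size, which is what makes this family of algorithms attractive for annotating streaming text against large lexicons. A word-level variant would use tokens rather than characters as the transition symbols.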
  • 17. SolrTextTagger • Lucene’s TokenStreams are finite state automata (FSAs). • SolrTextTagger (https://github.com/OpenSextant/SolrTextTagger) dynamically creates FSAs from dictionary entries and stores them in a Finite State Transducer (FST) data structure. • Provides tag service to annotate incoming streaming text against FST. • Input is text, output is matched dictionary entries and offsets into text. • SolrTextTagger is OSS created by Lucene/Solr committer David Smiley. Image Credit: Slides for Automata Invasion talk by Michael McCandless and Robert Muir
  • 18. Architecture • Co-located with standalone Solr server. • Scala based thin wrapper over SolrTextTagger. • Provides the following services: • unified JSON over HTTP request/response • multiple matching styles • multiple lexicons • hides details of managing SolrTextTagger. • Streaming (text) and non-streaming (phrase) matching services. • Programmatic APIs for Scala and Python.
  • 19. Scaling • Install and configure Solr, SolrTextTagger and SoDA and create AMI • Use CloudFormation (or Terraform) templates to instantiate cluster of Solr+SoDA instances behind Elastic Load Balancer. • Autoscaling cluster • Monitored by CloudWatch • New dictionaries loaded by instantiating EC2 from AMI via Lambda and saved back into AMI for next cluster build. (Diagram labels: client, loader.)
  • 20. Consuming Annotations at scale • Synchronous (diagram labels: Databricks Notebook, Documents on S3, SoDA cluster, Parquet Annotations on S3) • Asynchronous (diagram labels: Documents on S3, Producer, Kafka/Kinesis Streams, Consumer, SoDA cluster, Parquet Annotations on S3)
  • 21. SoDA Services • Bulk Loader (backend) • Client facing (front end) • Index (status check) • Add New Record into Lexicon • Delete Lexicon or Entry • Annotate Text against Lexicon • List Available Lexicons • Find coverage of incoming text against Lexicons • Lookup by ID • Reverse Lookup by Phrase
  • 22. SoDA Bulk Loader • Multithreaded loader for bulk loading dictionaries into SoDA. • Requires tab-separated file in following format: • id {TAB} primary-name {PIPE} alt-name-1 {PIPE} ... {PIPE} alt-name-n • One line per dictionary entry • Script to run (on SoDA/Solr box). • ./bulk_load.sh lexicon /path/to/input num_workers
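The bulk loader input format on slide 22 (id, a tab, then pipe-separated names with the primary name first) can be produced and read back with a few lines of Python. This is a sketch of the file format only; the helper names `format_entry` and `parse_entry` and the sample entry are hypothetical and not part of SoDA.

```python
def format_entry(entry_id, names):
    """Build one dictionary line: id TAB primary|alt-1|...|alt-n."""
    return entry_id + "\t" + "|".join(names)


def parse_entry(line):
    """Split one dictionary line back into (id, [primary, alt-1, ...])."""
    entry_id, names = line.rstrip("\n").split("\t", 1)
    return entry_id, names.split("|")


# Hypothetical sample entry: an ID, a primary name and two alternate names.
line = format_entry("C001", ["heart attack", "myocardial infarction", "MI"])
print(line)
print(parse_entry(line))
```

Names that themselves contain a tab or pipe character would need escaping or cleaning before being written in this format; the slide does not say how SoDA treats them.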
  • 23. SoDA Health Check – index.json • Returns a status message. Meant to be used for testing if the SoDA application is up. • Python client code • Scala client code • Output
  • 24. Annotate Text against Lexicon – annot.json • Annotates text against a specific lexicon and match type. • Match types can be one of the following: • exact – matches text spans with dictionary entries. • lower – same as exact, but matches are case-insensitive • stop – same as lower, but stop words removed from both text and dictionary entries • stem1 – same as stop, but stemmed with Solr minimal English stemmer • stem2 – same as stop, but stemmed with Solr Kstem stemmer • stem3 – same as stop, but stemmed with Solr Porter stemmer. • Input (HTTP POST)
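The first three match types build on each other, which a small normalization sketch can make concrete. This is only an approximation of the idea, assuming that "lower" compares lowercased text and "stop" additionally drops stop words; the real service does this with Solr analyzers, and the stop word list and the function name `normalize` here are hypothetical. The stem1, stem2 and stem3 types are left out because they rely on Solr's stemmers.

```python
# Hypothetical, tiny stop word list for illustration only.
STOP_WORDS = {"a", "an", "and", "of", "the"}


def normalize(phrase, match_type):
    """Return the form of a phrase that would be compared for a match type."""
    if match_type == "exact":
        return phrase
    tokens = phrase.lower().split()       # also collapses repeated whitespace
    if match_type == "lower":
        return " ".join(tokens)
    if match_type == "stop":
        return " ".join(t for t in tokens if t not in STOP_WORDS)
    raise ValueError("unsupported match type: " + match_type)


for match_type in ("exact", "lower", "stop"):
    print(match_type, "->", normalize("Cancer of the Lung", match_type))
```

Applying the same normalization to both the dictionary entries and the incoming text is what lets looser match types trade precision for recall.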
  • 25. Annotate Text against Lexicon (2) • Python client code • Scala client code • Output
  • 26. List Available Lexicons – dicts.json • Returns a list of lexicons available to annotate against. • Python client • Scala client • Output
  • 27. Check Coverage – coverage.json • This can be used to find which lexicons are appropriate for annotating your text. The service allows you to send a piece of text to all hosted lexicons and returns with the number of matches found in each. • Input (HTTP POST) • Python client • Scala client
  • 28. Check Coverage (2) • Output
  • 29. Lookup by ID – lookup.json • Allows looking up a dictionary entry by lexicon and ID. • Input (HTTP POST) • Python client • Scala client
  • 30. Lookup by ID (2) • Output
  • 31. Reverse Lookup by Phrase • Matches phrases against specific lexicon and match type. • Match types can be one of the following: • All match types supported by Annotation service (annot.json) • lsort – case-insensitive matching against phrase with words sorted alphabetically. • s3sort – case-insensitive matching against phrase stemmed using Porter Stemmer (stem3) and its words sorted alphabetically. • Input
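The lsort match type can be pictured as comparing phrases by a key made of their lowercased words in alphabetical order, so that word order no longer matters. The sketch below assumes whitespace tokenization; the function name `lsort_key` is hypothetical and not part of SoDA. The s3sort variant would additionally stem each word with a Porter stemmer before sorting, which is omitted here.

```python
def lsort_key(phrase):
    """Lowercase the phrase, split on whitespace and sort the words."""
    return " ".join(sorted(phrase.lower().split()))


print(lsort_key("Lung Cancer"))
print(lsort_key("cancer lung") == lsort_key("Lung Cancer"))
```

Two phrases match under this scheme when their keys are equal, which is useful for phrase lookups where users may reorder words.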
  • 32. Reverse Lookup by Phrase (2) • Python client • Scala client • Output
  • 33. Future Work • A list of open items is kept on the SoDA issues page (https://github.com/elsevierlabs-os/soda/issues) and is updated as I find new ones. • Please feel free to post issues and ideas for improvement.
  • 34. Thank you Contact Information Email: sujit.pal@elsevier.com Twitter: @palsujit SoDA: https://github.com/elsevierlabs-os/soda