SlideShare a Scribd company logo
1 of 32
Download to read offline
Advanced Natural Language Processing
with Spark NLP
Alex Thomas, Principal Data Scientist at WiseCube
David Talby, CTO at John Snow Labs
Agenda
Introducing Spark NLP
Accuracy, scalability, and speed benchmarks
Out-of-the-box functionality
Getting Things Done
End-to-end NLP tasks in 3 lines of code
Key concepts and a backstage tour
Notebooks!
Using pre-trained pipelines & models
Named entity recognition
Document classification
INTRODUCING SPARK NLP
STATE OF THE ART NLP FOR PYTHON, JAVA & SCALA
1. ACCURACY
2. SCALABILITY
3. SPEED
SPARK NLP IN THE ENTERPRISE
O’REILLY AI ADOPTION IN THE ENTERPRISE SURVEY OF 1,300 PRACTITIONERS, FEB 2019
• ”State of the art” means the best peer-reviewed academic results
• Public benchmarks: Comparing production-grade NLP libraries
• Public benchmarks of pre-trained models: nlp.johnsnowlabs.com
“Spark NLP 2.4 sets new accuracy records for common tasks including NER, OCR & Matching”
New: Redesigned NER-DL and BERT-large
New: Spark OCR image filters & scalable pipelines
New: Hierarchical clinical entity resolution
“Spark NLP 2.5 delivers state-of-the-art accuracy for spell checking and sentiment analysis”
New: ALBERT & XLNet embeddings
New: Contextual spell checker
New: DL-based sentiment analysis
ACCURACY
SCALABILITY
• Zero code changes to scale a pipeline
to any Spark cluster
• Only natively distributed
open-source NLP library
• Spark provides execution planning,
caching, serialization, shuffling
• Caveats
– Speedup depends heavily on what
you actually do
– Not all algorithms scale well
– Spark configuration matters
SPEED: GET THE MOST FROM MODERN HARDWARE
• Optimized builds of Spark NLP
for both Intel and Nvidia
• Benchmark done on AWS:
Train a Named Entity Recognizer in
French
• Achieving F1-score of 89% requires
at least 80 Epochs with batch size of
512
• Intel outperformed Nvidia: Cascade
Lake was 19% faster & 46% cheaper
than Tesla P-100
Production Grade + Active Community
In production in multiple Fortune 500’s
26 new releases in 2018, 30 in 2019
Active Slack community
Permissive open source license: Apache 2.0
SPARK NLP
out-of-the-box
functionality
OFFICIALLY SUPPORTED RUNTIMES
Getting Things Done
SENTIMENT ANALYSIS
import sparknlp
sparknlp.start()
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('analyze-sentiment', 'en')
result = pipeline.annotate('Harry Potter is a great movie’)
print(result['sentiment’]) ## will print ['positive']
NAMED ENTITY RECOGNITION
pipeline = PretrainedPipeline('recognize_entities_bert', 'en')
result = pipeline.annotate('Harry and Ron met in Hogsmeade')
print(result['ner'])
# prints ['I-PER', 'O', 'I-PER', 'O', 'O', 'I-LOC')
SPELL CHECKING & CORRECTION
Now in Scala:
val pipeline = PretrainedPipeline("spell_check_ml", "en")
val result = pipeline.annotate("Harry Potter is a graet muvie")
println(result("spell"))
/* will print Seq[String](…, "is", "a", "great", "movie") */
UNDER THE HOOD
1.sparknlp.start() starts a new Spark session if there isn’t one, and returns it.
2.PretrainedPipeline() loads the English version of the explain_document_dl pipeline,
the pre-trained models, and the embeddings it depends on.
3. These are stored and cached locally.
4. TensorFlow is initialized, within the same JVM process that runs Spark.
The pre-trained embeddings and deep-learning models (like NER) are loaded. Models are
automatically distributed and shared if running on a cluster.
5. The annotate() call runs an NLP inference pipeline which activates each stage’s
algorithm (tokenization, POS, etc.).
6. The NER stage is run on TensorFlow – applying a neural network with bi-LSTM layers for
tokens and a CNN for characters.
7. Embeddings are used to convert contextual tokens into vectors during the NER inference
process.
8. The result object is a plain old local Python dictionary.
KEY CONCEPT #1: PIPELINE
A list of text processing steps.
Each step has input and output columns.
Document
Assembler
Sentence
Detector
Tokenizer
Sentiment
Analyzer
text document sentence token sentiment
KEY CONCEPT #2: ANNOTATOR
sentiment_detector = SentimentDetector() 
.setInputCols(["sentence”]) 
.setOutputCol("sentiment_score") 
.setDictionary(resource_path+"sent.txt")
An object encapsulating one text processing step.
KEY CONCEPT #3: RESOURCE
• Trained ML models
• Trained DL networks
• Dictionaries
• Embeddings
• Rules
• Pretrained pipelines
An external file that an annotator needs.
Resources can be shared, cached, and locally stored.
KEY CONCEPT #4: PRETRAINED PIPELINE
A pre-built pipeline, with all the annotators and resources it needs.
PUTTING IT ALL TOGETHER: TRAINING A NER WITH BERT
Initialization
Training data
Resources
Annotator
Pipeline
Run Training
Notebooks!
Cleaning, Splitting, and Finding Text
+ Understanding Grammar
Run “Spark NLP Basics” notebook
Oprn on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/
blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb
Using Pre-trained Pipelines
+ Named Entity Recognition
Run “Entity Recognizer with Deep Learning” notebook
Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/
blob/master/tutorials/colab/4-%20Entity%20Recognizer%20DL.ipynb
Training your own NER model
Run “NER BERT Training” notebook
Open on Google Colab: https://colab.research.google.com/drive/1A1ovV74nOG-MEpVQnmageeU-ksRLSmXZ
Walkthrough in blog post: https://www.johnsnowlabs.com/named-entity-recognition-ner-with-bert-in-spark-nlp/
Document Classification
+ Universal Sentence Embeddings
Run “Text Classification with ClassifierDL” notebook
Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/
blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
WHAT ELSE IS AVAILABLE?
• Spark NLP for Healthcare: 50+ models for clinical entity recognition, linking
to medical terminologies, assertion status detection, and de-identification
• Spark OCR: 20 annotators for image enhancement, layout, and smart editing
LEARN MORE: TECHNICAL CASE STUDIES
Improving Patient Flow Forecasting
Automated clinical coding & chart
reviews
Knowledge Extraction from
Pathology Reports
High-accuracy fact extraction
from long financial documents
Improving Mental Health for
HIV-Positive Adolescents
Accelerating Clinical Trial Recruiting
NEXT STEPS
1. READ THE DOCS & JOIN SLACK
HTTPS://NLP.JOHNSNOWLABS.COM
2. STAR & FORK THE REPO
GITHUB.COM/JOHNSNOWLABS/SPARK-NLP
3. QUESTIONS? GET IT TOUCH
Thank you!
alex@wisecube.com
david@johnsnowlabs.com
Advanced Natural Language Processing with Apache Spark NLP

More Related Content

What's hot

Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
confluent
 
Apache Kafka for Automotive Industry, Mobility Services & Smart City
Apache Kafka for Automotive Industry, Mobility Services & Smart CityApache Kafka for Automotive Industry, Mobility Services & Smart City
Apache Kafka for Automotive Industry, Mobility Services & Smart City
Kai Wähner
 

What's hot (20)

kafka
kafkakafka
kafka
 
Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0Deep Dive into the New Features of Apache Spark 3.0
Deep Dive into the New Features of Apache Spark 3.0
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Architecting Agile Data Applications for Scale
Architecting Agile Data Applications for ScaleArchitecting Agile Data Applications for Scale
Architecting Agile Data Applications for Scale
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
 
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
Ingesting and Processing IoT Data Using MQTT, Kafka Connect and Kafka Streams...
 
Introduction to natural language processing, history and origin
Introduction to natural language processing, history and originIntroduction to natural language processing, history and origin
Introduction to natural language processing, history and origin
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
 
The Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and ContainersThe Patterns of Distributed Logging and Containers
The Patterns of Distributed Logging and Containers
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Power of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data StructuresPower of the Log: LSM & Append Only Data Structures
Power of the Log: LSM & Append Only Data Structures
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache Kafka for Automotive Industry, Mobility Services & Smart City
Apache Kafka for Automotive Industry, Mobility Services & Smart CityApache Kafka for Automotive Industry, Mobility Services & Smart City
Apache Kafka for Automotive Industry, Mobility Services & Smart City
 

Similar to Advanced Natural Language Processing with Apache Spark NLP

Similar to Advanced Natural Language Processing with Apache Spark NLP (20)

Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
Apache Spark NLP: Extending Spark ML to Deliver Fast, Scalable & Unified Nat...
 Apache Spark NLP: Extending Spark ML to Deliver Fast, Scalable & Unified Nat... Apache Spark NLP: Extending Spark ML to Deliver Fast, Scalable & Unified Nat...
Apache Spark NLP: Extending Spark ML to Deliver Fast, Scalable & Unified Nat...
 
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced AnalyticsETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
ETL to ML: Use Apache Spark as an end to end tool for Advanced Analytics
 
Open Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learningOpen Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learning
 
PPT5: Neuron Introduction
PPT5: Neuron IntroductionPPT5: Neuron Introduction
PPT5: Neuron Introduction
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Simplify Governance of Streaming Data
Simplify Governance of Streaming Data Simplify Governance of Streaming Data
Simplify Governance of Streaming Data
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Build a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimizationBuild a deep learning pipeline on apache spark for ads optimization
Build a deep learning pipeline on apache spark for ads optimization
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 
PySaprk
PySaprkPySaprk
PySaprk
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
 
What's New in Spark 2?
What's New in Spark 2?What's New in Spark 2?
What's New in Spark 2?
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A...
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
SayantanBiswas37
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 

Recently uploaded (20)

Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 

Advanced Natural Language Processing with Apache Spark NLP

  • 1.
  • 2. Advanced Natural Language Processing with Spark NLP Alex Thomas, Principal Data Scientist at WiseCube David Talby, CTO at John Snow Labs
  • 3. Agenda Introducing Spark NLP Accuracy, scalability, and speed benchmarks Out-of-the-box functionality Getting Things Done End-to-end NLP tasks in 3 lines of code Key concepts and a backstage tour Notebooks! Using pre-trained pipelines & models Named entity recognition Document classification
  • 4. INTRODUCING SPARK NLP STATE OF THE ART NLP FOR PYTHON, JAVA & SCALA 1. ACCURACY 2. SCALABILITY 3. SPEED
  • 5. SPARK NLP IN THE ENTERPRISE O’REILLY AI ADOPTION IN THE ENTERPRISE SURVEY OF 1,300 PRACTITIONERS, FEB 2019
  • 6.
  • 7. • ”State of the art” means the best peer-reviewed academic results • Public benchmarks: Comparing production-grade NLP libraries • Public benchmarks of pre-trained models: nlp.johnsnowlabs.com “Spark NLP 2.4 sets new accuracy records for common tasks including NER, OCR & Matching” New: Redesigned NER-DL and BERT-large New: Spark OCR image filters & scalable pipelines New: Hierarchical clinical entity resolution “Spark NLP 2.5 delivers state-of-the-art accuracy for spell checking and sentiment analysis” New: ALBERT & XLNet embeddings New: Contextual spell checker New: DL-based sentiment analysis ACCURACY
  • 8. SCALABILITY • Zero code changes to scale a pipeline to any Spark cluster • Only natively distributed open-source NLP library • Spark provides execution planning, caching, serialization, shuffling • Caveats – Speedup depends heavily on what you actually do – Not all algorithms scale well – Spark configuration matters
  • 9. SPEED: GET THE MOST FROM MODERN HARDWARE • Optimized builds of Spark NLP for both Intel and Nvidia • Benchmark done on AWS: Train a Named Entity Recognizer in French • Achieving F1-score of 89% requires at least 80 Epochs with batch size of 512 • Intel outperformed Nvidia: Cascade Lake was 19% faster & 46% cheaper than Tesla P-100
  • 10. Production Grade + Active Community In production in multiple Fortune 500’s 26 new releases in 2018, 30 in 2019 Active Slack community Permissive open source license: Apache 2.0
  • 14. SENTIMENT ANALYSIS import sparknlp sparknlp.start() from sparknlp.pretrained import PretrainedPipeline pipeline = PretrainedPipeline('analyze-sentiment', 'en') result = pipeline.annotate('Harry Potter is a great movie’) print(result['sentiment’]) ## will print ['positive']
  • 15. NAMED ENTITY RECOGNITION pipeline = PretrainedPipeline('recognize_entities_bert', 'en') result = pipeline.annotate('Harry and Ron met in Hogsmeade') print(result['ner']) # prints ['I-PER', 'O', 'I-PER', 'O', 'O', 'I-LOC')
  • 16. SPELL CHECKING & CORRECTION Now in Scala: val pipeline = PretrainedPipeline("spell_check_ml", "en") val result = pipeline.annotate("Harry Potter is a graet muvie") println(result("spell")) /* will print Seq[String](…, "is", "a", "great", "movie") */
  • 17. UNDER THE HOOD 1.sparknlp.start() starts a new Spark session if there isn’t one, and returns it. 2.PretrainedPipeline() loads the English version of the explain_document_dl pipeline, the pre-trained models, and the embeddings it depends on. 3. These are stored and cached locally. 4. TensorFlow is initialized, within the same JVM process that runs Spark. The pre-trained embeddings and deep-learning models (like NER) are loaded. Models are automatically distributed and shared if running on a cluster. 5. The annotate() call runs an NLP inference pipeline which activates each stage’s algorithm (tokenization, POS, etc.). 6. The NER stage is run on TensorFlow – applying a neural network with bi-LSTM layers for tokens and a CNN for characters. 7. Embeddings are used to convert contextual tokens into vectors during the NER inference process. 8. The result object is a plain old local Python dictionary.
  • 18. KEY CONCEPT #1: PIPELINE A list of text processing steps. Each step has input and output columns. Document Assembler Sentence Detector Tokenizer Sentiment Analyzer text document sentence token sentiment
  • 19. KEY CONCEPT #2: ANNOTATOR sentiment_detector = SentimentDetector() .setInputCols(["sentence”]) .setOutputCol("sentiment_score") .setDictionary(resource_path+"sent.txt") An object encapsulating one text processing step.
  • 20. KEY CONCEPT #3: RESOURCE • Trained ML models • Trained DL networks • Dictionaries • Embeddings • Rules • Pretrained pipelines An external file that an annotator needs. Resources can be shared, cached, and locally stored.
  • 21. KEY CONCEPT #4: PRETRAINED PIPELINE A pre-built pipeline, with all the annotators and resources it needs.
  • 22. PUTTING IT ALL TOGETHER: TRAINING A NER WITH BERT Initialization Training data Resources Annotator Pipeline Run Training
  • 24. Cleaning, Splitting, and Finding Text + Understanding Grammar Run “Spark NLP Basics” notebook Oprn on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/ blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb
  • 25. Using Pre-trained Pipelines + Named Entity Recognition Run “Entity Recognizer with Deep Learning” notebook Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/ blob/master/tutorials/colab/4-%20Entity%20Recognizer%20DL.ipynb
  • 26. Training your own NER model Run “NER BERT Training” notebook Open on Google Colab: https://colab.research.google.com/drive/1A1ovV74nOG-MEpVQnmageeU-ksRLSmXZ Walkthrough in blog post: https://www.johnsnowlabs.com/named-entity-recognition-ner-with-bert-in-spark-nlp/
  • 27. Document Classification + Universal Sentence Embeddings Run “Text Classification with ClassifierDL” notebook Open on Google Colab: https://github.com/JohnSnowLabs/spark-nlp-workshop/ blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb
  • 28. WHAT ELSE IS AVAILABLE? • Spark NLP for Healthcare: 50+ models for clinical entity recognition, linking to medical terminologies, assertion status detection, and de-identification • Spark OCR: 20 annotators for image enhancement, layout, and smart editing
  • 29. LEARN MORE: TECHNICAL CASE STUDIES Improving Patient Flow Forecasting Automated clinical coding & chart reviews Knowledge Extraction from Pathology Reports High-accuracy fact extraction from long financial documents Improving Mental Health for HIV-Positive Adolescents Accelerating Clinical Trial Recruiting
  • 30. NEXT STEPS 1. READ THE DOCS & JOIN SLACK HTTPS://NLP.JOHNSNOWLABS.COM 2. STAR & FORK THE REPO GITHUB.COM/JOHNSNOWLABS/SPARK-NLP 3. QUESTIONS? GET IT TOUCH