SlideShare una empresa de Scribd logo
1 de 33
Improving Organizational Knowledge
with Natural Language Processing
Enriched Data Pipelines
Dataworks Summit, Washington D.C.
May 23, 2019
Introductions
Eric Wolok
● Over $500 million in commercial
real estate investment sales
● Specializing in Multi-Family
within the Chicago market
● ericw@partnersco.com
Jeff Zemerick
● Cloud/big-data consultant
● Apache OpenNLP PMC member
● ASF Member
● Morgantown, WV
● jeff.zemerick@mtnfog.com
2
Information is Everywhere
The answer to a question is often spread across
multiple locations.
In the real estate domain:
3
Others?
Answers are Distributed
Bringing these sources together brings the full answer together.
4
Real Estate
Includes structured and unstructured data:
● Property ownership history
● Individual buyer’s purchase history
● News articles mentioning properties
● Property listings
● And so on…
This data can be widely distributed across many
sources.
Let’s bring it together.
5
Key Technologies Used
● Apache Kafka
○ Message broker for data ingest.
● Apache NiFi
○ Orchestrate the flow of data in a pipeline.
● Apache MiNiFi
○ Facilitate ingest from edge locations.
● Apache OpenNLP
○ Process unstructured text.
● Apache Superset
○ Create dashboards from the extracted
data.
● Apache ZooKeeper
○ Cluster coordination for Kafka and Nifi.
● Amazon Web Services
○ Managed by CloudFormation for
infrastructure as code.
● Docker
○ Containerization of NLP microservices.
● Relational database
○ Storage of data extracted by the pipeline.
6
Apache Kafka
● Pub/sub message broker for
streaming data.
● Can handle massive amounts of
messages.
● Allows replay of data.
7
By Ch.ko123 - Own work, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=59871096
Apache NiFi and MiNiFi
Apache NiFi
● Provides “directed graphs for data routing,
transformation, and system mediation
logic.”
● Has a drag and drop interface where
“processors” are configured to form the
data flow.
Apache MiNiFi
● A sub-project of NiFi that extends NiFi’s
capabilities to IoT and edge devices.
● Facilitates data collection and ingestion
into a NiFi cluster or other system.
8
Apache OpenNLP
9
● A machine learning-based toolkit for processing natural language text.
● Supports:
○ Tokenization
○ Sentence segmentation
○ Part-of-speech tagging
○ Named entity extraction
○ Document classification
○ Chunking
○ Parsing
○ Language detection
The Unstructured Data Pipeline
Data
Ingestion
NLP
Enrichment /
Deduplication
Storage Reporting
10
A happy user!
Sentiment
Analysis
The Tools of the Unstructured Data Pipeline
Data
Ingestion
NLP
Enrichment /
Deduplication
Storage Reporting
11
Sentiment
Analysis
Kafka OpenNLP
DockerMiNiFi
OpenNLP Database Superset
NiFi
ZooKeeper
Amazon Web Services
The Architecture
Data Ingestion
● Plain text is published to Kafka topics.
● Each published message is a single
document.
● Originates from:
○ Bash scripts publishing to the Kafka topic.
○ MiNiFi publishing to the Kafka topic from
database tables (QueryDatabaseTable).
13
NLP Microservices
● Microservices wrap OpenNLP functionality:
○ Language detection microservice
○ Sentence extraction microservice
○ Sentence tokenization microservice
○ Named-entity extraction microservice
● Spring Boot applications as runnable jars.
○ No external dependencies other than JRE.
● Provides simple REST-based interfaces to NLP operations.
○ e.g. /api/detect, /api/sentences, /api/tokens, /api/extract
● Available on GitHub under Apache license.
14
NLP Microservices Pipelines
Language
Detection
Sentence
Extraction
Sentence
Tokenization
Entity
Extraction
Sentence
Extraction
Sentence
Tokenization
Entity
Extraction
Supports independent pipelines supporting multiple source languages.
eng
deu
15
others
NLP Microservices Deployment
● Deployed as containers
○ https://github.com/mtnfog/nlp-building-blocks
● Stateless
○ Each can scale independently.
○ Deployed behind a load balancer.
Load
Balancer
NLP
NLP
NLP
16
NLP Operations - An Example
George Washington was president. He was the first president.
1. Language = eng
2. Sentences = [“George Washington was president.”, “He was
the first president.”]
3. Tokens = {[“George”, “Washington”, “was”, “president”],
[“He”, “was”, “the”, “first”, “president”]}
4. Entities = [“George Washington”]
17
Managing the NLP Models
NLP models are stored externally and pulled by the containers from S3/HTTP
when first needed.
NLP Microservice
Containers
S3 Bucket
containing models
Models organized in storage as: s3://nlp-models/[language]/[model-type]/[model-filename]
18
Extracted Information
Entity types extracted from natural language text:
● Person’s names - John Smith (model-based)
● Street addresses - 123 Main St, Anytown, CA (model-based / regex)
● Currency - $250,000, $300000 (regex)
● Phone numbers - (123) 456-7890, 123-456-7890 (regex)
19
Entity Confidence Thresholds
● Each entity extracted by a model has a confidence value that indicates
model’s confidence it actually is an entity.
● Varies depending on the text and the model.
○ How well does the actual text represent the training text?
● Monitored thresholds to determine best minimum cutoff values.
○ Entities with confidence less than a threshold are filtered.
20
0.00 to 1.00
Storage - Persisting the Extracted Data
● JSON responses from the microservices are converted to SQL.
○ Via NiFi’s ConvertJSONToSQL processor.
● Extracted entities are persisted to a relational database.
○ Via NiFi’s PutSQL processor.
● Extracted entities are persisted to an Elasticsearch index.
○ Via NiFi’s PutElasticsearch processor.
21
Reporting
● Dashboards in Apache Superset show views on the data.
22
Entity Counts by Context
● Shows a high-level view of entity counts per context.
23
A context name is
arbitrary, e.g. a
book, a chapter, a
document, etc.
Entities by Type
● Shows entity counts by type of entity.
24
Stopped ingestion
at 1000 person
entities.
Entity Confidence Values
● Shows the distribution of entity confidence values (0.0 to 1.0).
25
Entities with
confidence = 1 are
phone numbers.
Putting It All Together - The NiFi Pipeline
26
The NiFi Flow
27
George Washington was president. He was the first president.
George Washington was president. He was the first president.
[“George”, “Washington”, “was”, “president”] [“He”, “was”, “the”, “first”, “president”]
[“George Washington”]
{entity: text: “George Washington”, confidence: 0.9}
INSERT INTO entities (TEXT, CONFIDENCE) VALUES (“George
Washington”, 0.9)
One entity sent to the database.
Attribute - eng
[]
{}
Nothing sent to database.
Design Challenges
● Repeatable
○ How to deploy it for testing, for different environments, for other clients, …?
● Scalable
○ How can we handle massive amounts of text efficiently?
● Extensible / Updatable
○ How can we ensure that the process can easily be customized or changed?
28
Repeatable
● 100% infrastructure as code.
○ AWS CloudFormation + Bash scripts = Source control in git
● Fully automated.
○ No part of the deployment/configuration is done manually.
○ Entire architecture is created by kicking off a command.
● Apache NiFi flow is version controlled in the NiFi Registry.
○ Registry is automatically connected to NiFi when NiFi is configured.
○ NiFi flow is also under source control.
29
Scalable
● Leveraged autoscaling from the cloud platform where appropriate.
○ NLP microservice containers and underlying hardware have scaling policies set.
● Number of Kafka, NiFi, ZooKeeper instances can be increased as needed.
30
Extensible
● Using Apache NiFi insulates us from a hard-coded solution.
○ No pipeline code to write and manage!
● We can modify the pipeline simply by adding new processors.
● Architecture is layered logically.
○ Layers can be swapped in and out if technology requirements change.
○ Layers consist of cloud-specific networking, ZooKeeper, Kafka, NiFi, etc.
● Data can be replayed through Kafka.
○ Allows for updating the data captured.
31
Summary
We presented a method for extracting information from distributed data.
● This pipeline…
○ Ingests data
○ Processes the data
○ Extracts the entities
○ Stores the entities
○ Visualizes the results
● … in a scalable, loosely-connected, and repeatable fashion.
● We can build systems, such as recommendation engines, around this data.
32
Questions?
Credits and thanks to:
● The open source projects and contributors who make this work possible.
33

Más contenido relacionado

La actualidad más candente

Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Spark Summit
 
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Flink Forward
 

La actualidad más candente (20)

Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at Scale
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
 
Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017Presto @ Uber Hadoop summit2017
Presto @ Uber Hadoop summit2017
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
 
Zero ETL analytics with LLAP in Azure HDInsight
Zero ETL analytics with LLAP in Azure HDInsightZero ETL analytics with LLAP in Azure HDInsight
Zero ETL analytics with LLAP in Azure HDInsight
 
Introducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom ConnectorsIntroducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom Connectors
 
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft..."Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
"Source Code Abstracts Classification Using CNN", Vadim Markovtsev, Lead Soft...
 
Apache Druid 101
Apache Druid 101Apache Druid 101
Apache Druid 101
 
In Flux Limiting for a multi-tenant logging service
In Flux Limiting for a multi-tenant logging serviceIn Flux Limiting for a multi-tenant logging service
In Flux Limiting for a multi-tenant logging service
 
Capital One: Using Cassandra In Building A Reporting Platform
Capital One: Using Cassandra In Building A Reporting PlatformCapital One: Using Cassandra In Building A Reporting Platform
Capital One: Using Cassandra In Building A Reporting Platform
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
 
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoT
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & Druid
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and ZeppelinJim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin
 

Similar a Improving Organizational Knowledge with Natural Language Processing Enriched Data Pipelines

Open Chemistry, JupyterLab and data: Reproducible quantum chemistry
Open Chemistry, JupyterLab and data: Reproducible quantum chemistryOpen Chemistry, JupyterLab and data: Reproducible quantum chemistry
Open Chemistry, JupyterLab and data: Reproducible quantum chemistry
Marcus Hanwell
 

Similar a Improving Organizational Knowledge with Natural Language Processing Enriched Data Pipelines (20)

Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUpStrimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
Strimzi - Where Apache Kafka meets OpenShift - OpenShift Spain MeetUp
 
Netflix Open Source Meetup Season 3 Episode 2
Netflix Open Source Meetup Season 3 Episode 2Netflix Open Source Meetup Season 3 Episode 2
Netflix Open Source Meetup Season 3 Episode 2
 
NetflixOSS Meetup season 3 episode 2
NetflixOSS Meetup season 3 episode 2NetflixOSS Meetup season 3 episode 2
NetflixOSS Meetup season 3 episode 2
 
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a MonthUSENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month
 
Big data @ Hootsuite analtyics
Big data @ Hootsuite analtyicsBig data @ Hootsuite analtyics
Big data @ Hootsuite analtyics
 
re:Invent re:Peat
re:Invent re:Peatre:Invent re:Peat
re:Invent re:Peat
 
Atmosphere 2016 - Pawel Mastalerz, Wojciech Inglot - New way of building inf...
Atmosphere 2016 -  Pawel Mastalerz, Wojciech Inglot - New way of building inf...Atmosphere 2016 -  Pawel Mastalerz, Wojciech Inglot - New way of building inf...
Atmosphere 2016 - Pawel Mastalerz, Wojciech Inglot - New way of building inf...
 
Wattpad - Spark Stories
Wattpad - Spark StoriesWattpad - Spark Stories
Wattpad - Spark Stories
 
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
Building a high-performance, scalable ML & NLP platform with Python, Sheer El...
 
Cloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps ApproachCloud Native Applications on Kubernetes: a DevOps Approach
Cloud Native Applications on Kubernetes: a DevOps Approach
 
OpenSearch.pdf
OpenSearch.pdfOpenSearch.pdf
OpenSearch.pdf
 
Summer 2017 undergraduate research powerpoint
Summer 2017 undergraduate research powerpointSummer 2017 undergraduate research powerpoint
Summer 2017 undergraduate research powerpoint
 
Open Chemistry, JupyterLab and data: Reproducible quantum chemistry
Open Chemistry, JupyterLab and data: Reproducible quantum chemistryOpen Chemistry, JupyterLab and data: Reproducible quantum chemistry
Open Chemistry, JupyterLab and data: Reproducible quantum chemistry
 
Build real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache KafkaBuild real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache Kafka
 
Go at uber
Go at uberGo at uber
Go at uber
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
Devoxx Belgium 2017 - easy microservices with JHipster
Devoxx Belgium 2017 - easy microservices with JHipsterDevoxx Belgium 2017 - easy microservices with JHipster
Devoxx Belgium 2017 - easy microservices with JHipster
 
Easy Microservices with JHipster - Devoxx BE 2017
Easy Microservices with JHipster - Devoxx BE 2017Easy Microservices with JHipster - Devoxx BE 2017
Easy Microservices with JHipster - Devoxx BE 2017
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in Retail
 

Más de DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

Más de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Último (20)

DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 

Improving Organizational Knowledge with Natural Language Processing Enriched Data Pipelines

  • 1. Improving Organizational Knowledge with Natural Language Processing Enriched Data Pipelines Dataworks Summit, Washington D.C. May 23, 2019
  • 2. Introductions Eric Wolok ● Over $500 million in commercial real estate investment sales ● Specializing in Multi-Family within the Chicago market ● ericw@partnersco.com Jeff Zemerick ● Cloud/big-data consultant ● Apache OpenNLP PMC member ● ASF Member ● Morgantown, WV ● jeff.zemerick@mtnfog.com 2
  • 3. Information is Everywhere The answer to a question is often spread across multiple locations. In the real estate domain: 3 Others?
  • 4. Answers are Distributed Bringing these sources together brings the full answer together. 4
  • 5. Real Estate Includes structured and unstructured data: ● Property ownership history ● Individual buyer’s purchase history ● News articles mentioning properties ● Property listings ● And so on… This data can be widely distributed across many sources. Let’s bring it together. 5
  • 6. Key Technologies Used ● Apache Kafka ○ Message broker for data ingest. ● Apache NiFi ○ Orchestrate the flow of data in a pipeline. ● Apache MiNiFi ○ Facilitate ingest from edge locations. ● Apache OpenNLP ○ Process unstructured text. ● Apache Superset ○ Create dashboards from the extracted data. ● Apache ZooKeeper ○ Cluster coordination for Kafka and Nifi. ● Amazon Web Services ○ Managed by CloudFormation for infrastructure as code. ● Docker ○ Containerization of NLP microservices. ● Relational database ○ Storage of data extracted by the pipeline. 6
  • 7. Apache Kafka ● Pub/sub message broker for streaming data. ● Can handle massive amounts of messages. ● Allows replay of data. 7 By Ch.ko123 - Own work, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=59871096
  • 8. Apache NiFi and MiNiFi Apache NiFi ● Provides “directed graphs for data routing, transformation, and system mediation logic.” ● Has a drag and drop interface where “processors” are configured to form the data flow. Apache MiNiFi ● A sub-project of NiFi that extends NiFi’s capabilities to IoT and edge devices. ● Facilitates data collection and ingestion into a NiFi cluster or other system. 8
  • 9. Apache OpenNLP 9 ● A machine learning-based toolkit for processing natural language text. ● Supports: ○ Tokenization ○ Sentence segmentation ○ Part-of-speech tagging ○ Named entity extraction ○ Document classification ○ Chunking ○ Parsing ○ Language detection
  • 10. The Unstructured Data Pipeline Data Ingestion NLP Enrichment / Deduplication Storage Reporting 10 A happy user! Sentiment Analysis
  • 11. The Tools of the Unstructured Data Pipeline Data Ingestion NLP Enrichment / Deduplication Storage Reporting 11 Sentiment Analysis Kafka OpenNLP DockerMiNiFi OpenNLP Database Superset NiFi ZooKeeper Amazon Web Services
  • 13. Data Ingestion ● Plain text is published to Kafka topics. ● Each published message is a single document. ● Originates from: ○ Bash scripts publishing to the Kafka topic. ○ MiNiFi publishing to the Kafka topic from database tables (QueryDatabaseTable). 13
  • 14. NLP Microservices ● Microservices wrap OpenNLP functionality: ○ Language detection microservice ○ Sentence extraction microservice ○ Sentence tokenization microservice ○ Named-entity extraction microservice ● Spring Boot applications as runnable jars. ○ No external dependencies other than JRE. ● Provides simple REST-based interfaces to NLP operations. ○ e.g. /api/detect, /api/sentences, /api/tokens, /api/extract ● Available on GitHub under Apache license. 14
  • 16. NLP Microservices Deployment ● Deployed as containers ○ https://github.com/mtnfog/nlp-building-blocks ● Stateless ○ Each can scale independently. ○ Deployed behind a load balancer. Load Balancer NLP NLP NLP 16
  • 17. NLP Operations - An Example George Washington was president. He was the first president. 1. Language = eng 2. Sentences = [“George Washington was president.”, “He was the first president.”] 3. Tokens = {[“George”, “Washington”, “was”, “president”], [“He”, “was”, “the”, “first”, “president”]} 4. Entities = [“George Washington”] 17
  • 18. Managing the NLP Models NLP models are stored externally and pulled by the containers from S3/HTTP when first needed. NLP Microservice Containers S3 Bucket containing models Models organized in storage as: s3://nlp-models/[language]/[model-type]/[model-filename] 18
  • 19. Extracted Information Entity types extracted from natural language text: ● Person’s names - John Smith (model-based) ● Street addresses - 123 Main St, Anytown, CA (model-based / regex) ● Currency - $250,000, $300000 (regex) ● Phone numbers - (123) 456-7890, 123-456-7890 (regex) 19
  • 20. Entity Confidence Thresholds ● Each entity extracted by a model has a confidence value that indicates model’s confidence it actually is an entity. ● Varies depending on the text and the model. ○ How well does the actual text represent the training text? ● Monitored thresholds to determine best minimum cutoff values. ○ Entities with confidence less than a threshold are filtered. 20 0.00 to 1.00
  • 21. Storage - Persisting the Extracted Data ● JSON responses from the microservices are converted to SQL. ○ Via NiFi’s ConvertJSONToSQL processor. ● Extracted entities are persisted to a relational database. ○ Via NiFi’s PutSQL processor. ● Extracted entities are persisted to an Elasticsearch index. ○ Via NiFi’s PutElasticsearch processor. 21
  • 22. Reporting ● Dashboards in Apache Superset show views on the data. 22
  • 23. Entity Counts by Context ● Shows a high-level view of entity counts per context. 23 A context name is arbitrary, e.g. a book, a chapter, a document, etc.
  • 24. Entities by Type ● Shows entity counts by type of entity. 24 Stopped ingestion at 1000 person entities.
  • 25. Entity Confidence Values ● Shows the distribution of entity confidence values (0.0 to 1.0). 25 Entities with confidence = 1 are phone numbers.
  • 26. Putting It All Together - The NiFi Pipeline 26
  • 27. The NiFi Flow 27 George Washington was president. He was the first president. George Washington was president. He was the first president. [“George”, “Washington”, “was”, “president”] [“He”, “was”, “the”, “first”, “president”] [“George Washington”] {entity: text: “George Washington”, confidence: 0.9} INSERT INTO entities (TEXT, CONFIDENCE) VALUES (“George Washington”, 0.9) One entity sent to the database. Attribute - eng [] {} Nothing sent to database.
  • 28. Design Challenges ● Repeatable ○ How to deploy it for testing, for different environments, for other clients, …? ● Scalable ○ How can we handle massive amounts of text efficiently? ● Extensible / Updatable ○ How can we ensure that the process can easily be customized or changed? 28
  • 29. Repeatable ● 100% infrastructure as code. ○ AWS CloudFormation + Bash scripts = Source control in git ● Fully automated. ○ No part of the deployment/configuration is done manually. ○ Entire architecture is created by kicking off a command. ● Apache NiFi flow is version controlled in the NiFi Registry. ○ Registry is automatically connected to NiFi when NiFi is configured. ○ NiFi flow is also under source control. 29
  • 30. Scalable ● Leveraged autoscaling from the cloud platform where appropriate. ○ NLP microservice containers and underlying hardware have scaling policies set. ● Number of Kafka, NiFi, ZooKeeper instances can be increased as needed. 30
  • 31. Extensible ● Using Apache NiFi insulates us from a hard-coded solution. ○ No pipeline code to write and manage! ● We can modify the pipeline simply by adding new processors. ● Architecture is layered logically. ○ Layers can be swapped in and out if technology requirements change. ○ Layers consist of cloud-specific networking, ZooKeeper, Kafka, NiFi, etc. ● Data can be replayed through Kafka. ○ Allows for updating the data captured. 31
  • 32. Summary We presented a method for extracting information from distributed data. ● This pipeline… ○ Ingests data ○ Processes the data ○ Extracts the entities ○ Stores the entities ○ Visualizes the results ● … in a scalable, loosely-connected, and repeatable fashion. ● We can build systems, such as recommendation engines, around this data. 32
  • 33. Questions? Credits and thanks to: ● The open source projects and contributors who make this work possible. 33

Notas del editor

  1. This data may be geographically distributed between different organizations or it may exist in a single organization.
  2. Can be distributed inside and outside an organization.
  3. This allows updating the models without having to redeploy code.