SlideShare a Scribd company logo
1 of 40
Download to read offline
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 1/40
Streaming Topic Model TrainingStreaming Topic Model Training
and Inferenceand Inference
Suneel MarthiSuneel Marthi
March 21, 2019March 21, 2019
DataWorks Summit 2019 - BarcelonaDataWorks Summit 2019 - Barcelona
1
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 2/40
$WhoAreWe$WhoAreWe
Suneel MarthiSuneel Marthi
 @suneelmarthi@suneelmarthi
Member of Apache Software Foundation
Committer and PMC on Apache Mahout, Apache OpenNLP, Apache
Streams
2
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 3/40
AgendaAgenda
Motivation for Topic Modeling
Existing Approaches for Topic Modeling
Topic Modeling on Streams
3
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 4/40
Motivation for Topic ModelingMotivation for Topic Modeling
4
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 5/40
Topic ModelsTopic Models
Automatically discovering main topics in collections
of documents
Overall
What are the main themes discussed on a mailing
list or a discussion forum ?
What are the most recurring topics discussed in
research papers ?
5
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 6/40
Topic Models (Contd)Topic Models (Contd)
Per Document
What are the main topics discussed in a certain
(single) newspaper article ?
What is the most distinctive topic that makes a
certain research article more interesting than the
others ?
6
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 7/40
Uses of Topic ModelsUses of Topic Models
Search Engines to organize text
Annotating Documents according to these topics
Uncovering hidden topical patterns across a
document collection
7
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 8/40
Common Approaches to TopicCommon Approaches to Topic
ModelingModeling
8
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 9/40
Latent Dirichlet Allocation (LDA)
Topics are composed by probability distributions
over words
Documents are composed by probability
distributions over Topics
Batch Oriented approach
9
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 10/40
N: vocab size
M/D: # of documents
alpha: prior on per-document topic dist.
beta: prior on topic-word dist.
theta: per-document topic dist.
phi: per-topic word dist.
z: topic for word w
IntuitionIntuition: Documents are a mixture of topics, topics are a mixture of words. So documents: Documents are a mixture of topics, topics are a mixture of words. So documents
can be generated by drawing a sample of topics, and then drawing a sample of words.can be generated by drawing a sample of topics, and then drawing a sample of words.
LDA ModelLDA Model
10
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 11/40
Latent Semantic Analysis (LSA)
SVDed TF-IDF Document-Term Matrix
Batch Oriented approach
11
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 12/40
Drawbacks of Traditional Topic Modeling Approach
Problem : labelling topics is often a manual task
E.g. LDA topic:30% basket,15% ball,10% drunk,…
can be tagged as basketball
12
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 13/40
Common ApproachesCommon Approaches DeepDeep
Learning (of course)Learning (of course)
Learn to perform LDA: LDA supervised DNN training
(inference speed up)
https://arxiv.org/abs/1508.01011
Deep Belief Networks for Topic Modelling
https://arxiv.org/abs/1501.04325
13
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 14/40
What are Embeddings ?What are Embeddings ?
Represent a document as a point in space, semantically similarRepresent a document as a point in space, semantically similar
docs are close together when plotteddocs are close together when plotted
Such a representation is learned via a (shallow) neural networkSuch a representation is learned via a (shallow) neural network
algorithmalgorithm
14
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 15/40
Embeddings for Topic ModelingEmbeddings for Topic Modeling
LDA2Vec: Mixing LDA and word2vec word embeddings
https://arxiv.org/abs/1605.02019
Navigating embeddings: geometric navigation in word
and doc embeddings space looking for topics
15
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 16/40
Static vs Dynamic CorporaStatic vs Dynamic Corpora
Learning Topic Models over a fixed set of Documents
Fetch the Corpus
Train and Fit a topic model
Extract topics
...other downstream tasks
16
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 17/40
Static vs Dynamic Corpora (Contd)Static vs Dynamic Corpora (Contd)
What to do with Corpus updates?
Document collections are not static in real world
There may be no document collections to begin
with
Memory limits with large corpora
Reprocess entire corpora when new documents are
added
17
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 18/40
Static vs Dynamic Corpora (Contd)Static vs Dynamic Corpora (Contd)
Streams, Streams, Streams
Twitter stream
Reddit posts
Papers on ResearchGate and Arxiv
....
18
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 19/40
Questions for AudienceQuestions for Audience
Who’s doing inference or scoring on streaming data?
Who’s doing online learning and optimization on streaming
data?
(2) is hard, sometimes because the algorithms aren’t(2) is hard, sometimes because the algorithms aren’t
there.there. But, also because people don’t know theBut, also because people don’t know the
algorithms are there.algorithms are there.
19
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 20/40
Topic Modeling on StreamsTopic Modeling on Streams
20
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 21/40
2 Streaming Approaches2 Streaming Approaches
Learning Topics from Jira Issues
Online LDA
21
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 22/40
Learning Topics on Jira IssuesLearning Topics on Jira Issues
22
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 23/40
WorkflowWorkflow
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 24/40 23
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 25/40
FLINK-9963: Add a single value table formatFLINK-9963: Add a single value table format
factoryfactory
24
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 26/40
Topics extracted for Flink-9963Topics extracted for Flink-9963
format factory
table schema
table source
sink factory
table sink
25
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 27/40
FLINK-8286: Fix Flink-Yarn-Kerberos integrationFLINK-8286: Fix Flink-Yarn-Kerberos integration
for FLIP-6for FLIP-6
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 28/40 26
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 29/40
Topics extracted for Flink-8286Topics extracted for Flink-8286
yarn kerberos integration
kerberos integrations
yarn kerberos
27
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 30/40
Bayesian InferenceBayesian Inference
GoalGoal: Calculate the posterior distribution of a: Calculate the posterior distribution of a
probabilistic model given dataprobabilistic model given data
Gibbs sampling/MCMC: Repeatedly sample from conditional
distribution(s) to approximate the posterior of the joint
distribution; typically a batch process
Variational Bayes: Optimize the parameters of a function that
approximates joint distribution
28
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 31/40
Variational BayesVariational Bayes
Expectation Maximization-like optimization
Faster and more accurate
Can often be computed online (on windows over streams)
29
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 32/40
Online LDA (1/2)Online LDA (1/2)
Variational Bayes for LDA
Batch or online
Iterate between optimizing per-word topics and per-document
topics (“E” step) and topic-word distributions (“M” step)
Default topic model optimization in Gensim, scikit-learn,
Apache Spark MLlib -- not because it’s amenable to streaming
per se, but more because it’s memory efficient and can be
applied to large corpora.
Hoffman, Blei & Bach. Online Learning for Latent Dirichlet Allocation. Proceedings of Neural Information Processing Systems. 2010.Hoffman, Blei & Bach. Online Learning for Latent Dirichlet Allocation. Proceedings of Neural Information Processing Systems. 2010.
30
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 33/40
job-constant
K: number of topics
alpha: topic-document dist. prior
eta: topic-word dist. prior
tau0: learning rate
kappa: decay rate
(mini)batch-local
gamma: theta prior
theta: param. for per-document topic dist
phi: param. for per-topic word dist.
rho: update weight
job-global
lambda: parameter for topic-word dist.
t: documents/batches seen (index)
GoalGoal: Optimize the topic-word dist. parameter: Optimize the topic-word dist. parameter
lambda; it tells us what words belong to the topics.lambda; it tells us what words belong to the topics.
[Image source: Hoffman, Blei & Bach, 2010.][Image source: Hoffman, Blei & Bach, 2010.]
Job-global parameters can be tracked in a stateJob-global parameters can be tracked in a state
store. Batch-local parameters can be discarded.store. Batch-local parameters can be discarded.
Online LDA (2/2)Online LDA (2/2)
31
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 34/40
job-constant
K: number of topics
alpha: topic-document dist. prior
eta: topic-word dist. prior
tau0: learning rate
kappa: decay rate
Set job parameters globally withSet job parameters globally with
setGlobalJobParameters() on executionsetGlobalJobParameters() on execution
environment, or pass as arguments toenvironment, or pass as arguments to
operatorsoperators
(mini)batch-local
gamma: theta prior
theta: param. for per-document topic dist
phi: param. for per-topic word dist.
rho: update weight
Operator/task-internal and don't need to beOperator/task-internal and don't need to be
accessible by other tasksaccessible by other tasks
job-global
lambda: parameter for topic-word dist.
t: documents/batches seen (index)
Several options:
External key-value store
Accumulators: lambda update as increment, accumulator +
broadcast variables
Broadcast state
And, (mini)batches selected by Window sizeAnd, (mini)batches selected by Window size
Online LDA on BeamOnline LDA on Beam
32
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 35/40
OnlineLDA PipelineOnlineLDA Pipeline
33
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 36/40
So what?So what?
34
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 37/40
For many tasks, the default is batch training/optimization
Often, even if good online training and optimization
algorithms exist (why?)
So if your algorithms only require 1-step of parameter state,
why not just do streaming?
You can get instant, always up-to-date results by putting the
effort into using a different class of optimization algorithms
an using a stream processing engine, so just do that
35
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 38/40
LinksLinks
http://papers.nips.cc/paper/3902-online-learning-
for-latentdirichlet-allocation
Learning from LDA using Deep Neural Networks -
https://arxiv.org/pdf/1508.01011.pdf
Deep Belief Nets for Topic Modeling -
https://arxiv.org/pdf/1501.04325.pdf
Topic Modeling on Jira issues:
https://github.com/tteofili/jtm
36
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 39/40
CreditsCredits
Tommaso Teofili, Simone Tripodi (Adobe - Rome)
Joern Kottmann, Bruno Kinoshita (Apache OpenNLP)
Fabian Hueske (Apache Flink, Data Artisans)
37
3/31/2019 Streaming Topic Model Training and Inference
https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 40/40
Questions ???Questions ???
38

More Related Content

Similar to Streaming Topic Model Training and Inference with Apache Flink

Tech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning productsTech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning products
Gianmario Spacagna
 
Conversion and visualization of business processes in dsm format s. nicastro
Conversion and visualization of business processes in dsm format   s. nicastroConversion and visualization of business processes in dsm format   s. nicastro
Conversion and visualization of business processes in dsm format s. nicastro
SaraNicastro1
 
Continuous Deployment of Machine Learning Pipelines
Continuous Deployment of Machine Learning PipelinesContinuous Deployment of Machine Learning Pipelines
Continuous Deployment of Machine Learning Pipelines
Behrouz Derakhshan
 

Similar to Streaming Topic Model Training and Inference with Apache Flink (20)

Data Skipping Technology Yosef Moatti
Data Skipping Technology Yosef MoattiData Skipping Technology Yosef Moatti
Data Skipping Technology Yosef Moatti
 
Tech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning productsTech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning products
 
Tabnet presentation
Tabnet presentationTabnet presentation
Tabnet presentation
 
SFScon19 - Davide Boschetto - Constraints of model deployment and production ...
SFScon19 - Davide Boschetto - Constraints of model deployment and production ...SFScon19 - Davide Boschetto - Constraints of model deployment and production ...
SFScon19 - Davide Boschetto - Constraints of model deployment and production ...
 
2019 11-30 SPSMUC19 - Integrate Power Platform with SharePoint
2019 11-30 SPSMUC19 - Integrate Power Platform with SharePoint2019 11-30 SPSMUC19 - Integrate Power Platform with SharePoint
2019 11-30 SPSMUC19 - Integrate Power Platform with SharePoint
 
Towards Automated Boundary Value Testing with Program Derivatives and Search
Towards Automated Boundary Value Testing with Program Derivatives and SearchTowards Automated Boundary Value Testing with Program Derivatives and Search
Towards Automated Boundary Value Testing with Program Derivatives and Search
 
Conversion and visualization of business processes in dsm format s. nicastro
Conversion and visualization of business processes in dsm format   s. nicastroConversion and visualization of business processes in dsm format   s. nicastro
Conversion and visualization of business processes in dsm format s. nicastro
 
Cis 512 week 10 term paper strayer new
Cis 512 week 10 term paper   strayer newCis 512 week 10 term paper   strayer new
Cis 512 week 10 term paper strayer new
 
Cis 512 week 10 term paper strayer new
Cis 512 week 10 term paper   strayer newCis 512 week 10 term paper   strayer new
Cis 512 week 10 term paper strayer new
 
Cis 512 week 10 term paper strayer new
Cis 512 week 10 term paper   strayer newCis 512 week 10 term paper   strayer new
Cis 512 week 10 term paper strayer new
 
Cis 512 week 10 term paper strayer new
Cis 512 week 10 term paper   strayer newCis 512 week 10 term paper   strayer new
Cis 512 week 10 term paper strayer new
 
Ltms 510 Class
Ltms 510   ClassLtms 510   Class
Ltms 510 Class
 
Continuous Deployment of Machine Learning Pipelines
Continuous Deployment of Machine Learning PipelinesContinuous Deployment of Machine Learning Pipelines
Continuous Deployment of Machine Learning Pipelines
 
Let fleat2019-0807-1
Let fleat2019-0807-1Let fleat2019-0807-1
Let fleat2019-0807-1
 
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
 
Let fleat2019-0807-2
Let fleat2019-0807-2Let fleat2019-0807-2
Let fleat2019-0807-2
 
Assisted code review with source{d} lookout
Assisted code review with source{d} lookoutAssisted code review with source{d} lookout
Assisted code review with source{d} lookout
 
Assignment 2 debayan bhattacharya submission
Assignment 2 debayan bhattacharya submissionAssignment 2 debayan bhattacharya submission
Assignment 2 debayan bhattacharya submission
 
Deep Learning State of the Art (2019) - MIT by Lex Fridman
Deep Learning State of the Art (2019) - MIT by Lex FridmanDeep Learning State of the Art (2019) - MIT by Lex Fridman
Deep Learning State of the Art (2019) - MIT by Lex Fridman
 
Deep learning state_of_the_art- Autonomous Driving
Deep learning state_of_the_art- Autonomous DrivingDeep learning state_of_the_art- Autonomous Driving
Deep learning state_of_the_art- Autonomous Driving
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Recently uploaded (20)

Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Streaming Topic Model Training and Inference with Apache Flink

  • 1. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 1/40 Streaming Topic Model TrainingStreaming Topic Model Training and Inferenceand Inference Suneel MarthiSuneel Marthi March 21, 2019March 21, 2019 DataWorks Summit 2019 - BarcelonaDataWorks Summit 2019 - Barcelona 1
  • 2. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 2/40 $WhoAreWe$WhoAreWe Suneel MarthiSuneel Marthi  @suneelmarthi@suneelmarthi Member of Apache Software Foundation Committer and PMC on Apache Mahout, Apache OpenNLP, Apache Streams 2
  • 3. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 3/40 AgendaAgenda Motivation for Topic Modeling Existing Approaches for Topic Modeling Topic Modeling on Streams 3
  • 4. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 4/40 Motivation for Topic ModelingMotivation for Topic Modeling 4
  • 5. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 5/40 Topic ModelsTopic Models Automatically discovering main topics in collections of documents Overall What are the main themes discussed on a mailing list or a discussion forum ? What are the most recurring topics discussed in research papers ? 5
  • 6. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 6/40 Topic Models (Contd)Topic Models (Contd) Per Document What are the main topics discussed in a certain (single) newspaper article ? What is the most distinctive topic that makes a certain research article more interesting than the others ? 6
  • 7. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 7/40 Uses of Topic ModelsUses of Topic Models Search Engines to organize text Annotating Documents according to these topics Uncovering hidden topical patterns across a document collection 7
  • 8. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 8/40 Common Approaches to TopicCommon Approaches to Topic ModelingModeling 8
  • 9. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 9/40 Latent Dirichlet Allocation (LDA) Topics are composed by probability distributions over words Documents are composed by probability distributions over Topics Batch Oriented approach 9
  • 10. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 10/40 N: vocab size M/D: # of documents alpha: prior on per-document topic dist. beta: prior on topic-word dist. theta: per-document topic dist. phi: per-topic word dist. z: topic for word w IntuitionIntuition: Documents are a mixture of topics, topics are a mixture of words. So documents: Documents are a mixture of topics, topics are a mixture of words. So documents can be generated by drawing a sample of topics, and then drawing a sample of words.can be generated by drawing a sample of topics, and then drawing a sample of words. LDA ModelLDA Model 10
  • 11. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 11/40 Latent Semantic Analysis (LSA) SVDed TF-IDF Document-Term Matrix Batch Oriented approach 11
  • 12. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 12/40 Drawbacks of Traditional Topic Modeling Approach Problem : labelling topics is often a manual task E.g. LDA topic:30% basket,15% ball,10% drunk,… can be tagged as basketball 12
  • 13. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 13/40 Common ApproachesCommon Approaches DeepDeep Learning (of course)Learning (of course) Learn to perform LDA: LDA supervised DNN training (inference speed up) https://arxiv.org/abs/1508.01011 Deep Belief Networks for Topic Modelling https://arxiv.org/abs/1501.04325 13
  • 14. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 14/40 What are Embeddings ?What are Embeddings ? Represent a document as a point in space, semantically similarRepresent a document as a point in space, semantically similar docs are close together when plotteddocs are close together when plotted Such a representation is learned via a (shallow) neural networkSuch a representation is learned via a (shallow) neural network algorithmalgorithm 14
  • 15. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 15/40 Embeddings for Topic ModelingEmbeddings for Topic Modeling LDA2Vec: Mixing LDA and word2vec word embeddings https://arxiv.org/abs/1605.02019 Navigating embeddings: geometric navigation in word and doc embeddings space looking for topics 15
  • 16. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 16/40 Static vs Dynamic CorporaStatic vs Dynamic Corpora Learning Topic Models over a fixed set of Documents Fetch the Corpus Train and Fit a topic model Extract topics ...other downstream tasks 16
  • 17. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 17/40 Static vs Dynamic Corpora (Contd)Static vs Dynamic Corpora (Contd) What to do with Corpus updates? Document collections are not static in real world There may be no document collections to begin with Memory limits with large corpora Reprocess entire corpora when new documents are added 17
  • 18. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 18/40 Static vs Dynamic Corpora (Contd)Static vs Dynamic Corpora (Contd) Streams, Streams, Streams Twitter stream Reddit posts Papers on ResearchGate and Arxiv .... 18
  • 19. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 19/40 Questions for AudienceQuestions for Audience Who’s doing inference or scoring on streaming data? Who’s doing online learning and optimization on streaming data? (2) is hard, sometimes because the algorithms aren’t(2) is hard, sometimes because the algorithms aren’t there.there. But, also because people don’t know theBut, also because people don’t know the algorithms are there.algorithms are there. 19
  • 20. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 20/40 Topic Modeling on StreamsTopic Modeling on Streams 20
  • 21. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 21/40 2 Streaming Approaches2 Streaming Approaches Learning Topics from Jira Issues Online LDA 21
  • 22. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 22/40 Learning Topics on Jira IssuesLearning Topics on Jira Issues 22
  • 23. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 23/40 WorkflowWorkflow
  • 24. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 24/40 23
  • 25. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 25/40 FLINK-9963: Add a single value table formatFLINK-9963: Add a single value table format factoryfactory 24
  • 26. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 26/40 Topics extracted for Flink-9963Topics extracted for Flink-9963 format factory table schema table source sink factory table sink 25
  • 27. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 27/40 FLINK-8286: Fix Flink-Yarn-Kerberos integrationFLINK-8286: Fix Flink-Yarn-Kerberos integration for FLIP-6for FLIP-6
  • 28. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 28/40 26
  • 29. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 29/40 Topics extracted for Flink-8286Topics extracted for Flink-8286 yarn kerberos integration kerberos integrations yarn kerberos 27
  • 30. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 30/40 Bayesian InferenceBayesian Inference GoalGoal: Calculate the posterior distribution of a: Calculate the posterior distribution of a probabilistic model given dataprobabilistic model given data Gibbs sampling/MCMC: Repeatedly sample from conditional distribution(s) to approximate the posterior of the joint distribution; typically a batch process Variational Bayes: Optimize the parameters of a function that approximates joint distribution 28
  • 31. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 31/40 Variational BayesVariational Bayes Expectation Maximization-like optimization Faster and more accurate Can often be computed online (on windows over streams) 29
  • 32. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 32/40 Online LDA (1/2)Online LDA (1/2) Variational Bayes for LDA Batch or online Iterate between optimizing per-word topics and per-document topics (“E” step) and topic-word distributions (“M” step) Default topic model optimization in Gensim, scikit-learn, Apache Spark MLlib -- not because it’s amenable to streaming per se, but more because it’s memory efficient and can be applied to large corpora. Hoffman, Blei & Bach. Online Learning for Latent Dirichlet Allocation. Proceedings of Neural Information Processing Systems. 2010.Hoffman, Blei & Bach. Online Learning for Latent Dirichlet Allocation. Proceedings of Neural Information Processing Systems. 2010. 30
  • 33. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 33/40 job-constant K: number of topics alpha: topic-document dist. prior eta: topic-word dist. prior tau0: learning rate kappa: decay rate (mini)batch-local gamma: theta prior theta: param. for per-document topic dist phi: param. for per-topic word dist. rho: update weight job-global lambda: parameter for topic-word dist. t: documents/batches seen (index) GoalGoal: Optimize the topic-word dist. parameter: Optimize the topic-word dist. parameter lambda; it tells us what words belong to the topics.lambda; it tells us what words belong to the topics. [Image source: Hoffman, Blei & Bach, 2010.][Image source: Hoffman, Blei & Bach, 2010.] Job-global parameters can be tracked in a stateJob-global parameters can be tracked in a state store. Batch-local parameters can be discarded.store. Batch-local parameters can be discarded. Online LDA (2/2)Online LDA (2/2) 31
  • 34. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 34/40 job-constant K: number of topics alpha: topic-document dist. prior eta: topic-word dist. prior tau0: learning rate kappa: decay rate Set job parameters globally withSet job parameters globally with setGlobalJobParameters() on executionsetGlobalJobParameters() on execution environment, or pass as arguments toenvironment, or pass as arguments to operatorsoperators (mini)batch-local gamma: theta prior theta: param. for per-document topic dist phi: param. for per-topic word dist. rho: update weight Operator/task-internal and don't need to beOperator/task-internal and don't need to be accessible by other tasksaccessible by other tasks job-global lambda: parameter for topic-word dist. t: documents/batches seen (index) Several options: External key-value store Accumulators: lambda update as increment, accumulator + broadcast variables Broadcast state And, (mini)batches selected by Window sizeAnd, (mini)batches selected by Window size Online LDA on BeamOnline LDA on Beam 32
  • 35. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 35/40 OnlineLDA PipelineOnlineLDA Pipeline 33
  • 36. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 36/40 So what?So what? 34
  • 37. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 37/40 For many tasks, the default is batch training/optimization Often, even if good online training and optimization algorithms exist (why?) So if your algorithms only require 1-step of parameter state, why not just do streaming? You can get instant, always up-to-date results by putting the effort into using a different class of optimization algorithms an using a stream processing engine, so just do that 35
  • 38. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 38/40 LinksLinks http://papers.nips.cc/paper/3902-online-learning- for-latentdirichlet-allocation Learning from LDA using Deep Neural Networks - https://arxiv.org/pdf/1508.01011.pdf Deep Belief Nets for Topic Modeling - https://arxiv.org/pdf/1501.04325.pdf Topic Modeling on Jira issues: https://github.com/tteofili/jtm 36
  • 39. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 39/40 CreditsCredits Tommaso Teofili, Simone Tripodi (Adobe - Rome) Joern Kottmann, Bruno Kinoshita (Apache OpenNLP) Fabian Hueske (Apache Flink, Data Artisans) 37
  • 40. 3/31/2019 Streaming Topic Model Training and Inference https://kottmann.github.io/dataworks-summit-2019/?print-pdf#/ 40/40 Questions ???Questions ??? 38