©2013 LinkedIn Corporation. All Rights Reserved.
Latent Dirichlet Allocation (LDA)
- for ML-IR Discussion Group
Prepared by Wayne Tai Lee, Satpreet Singh
Latent Dirichlet Allocation:
A Bayesian Unsupervised Learning Model
Roadmap
• Unsupervised learning
• Bayesian Statistics
• Mixture Models
• LDA – theory and intuition
• LDA – practice and applications
Unsupervised Learning
Learning patterns with no labels
• Clustering is a form of “Unsupervised learning”
• Classification is known as supervised learning
• Validation is difficult
How would you cluster?
Documents from Wikipedia
Now try these ones!
Bayesian Statistics
A framework to update your beliefs
• Probabilities as beliefs
• Updates your belief as data is observed
• Requires a model that describes the data generation
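The three bullets above can be sketched as a single application of Bayes' rule. This is a minimal illustration, not something taken from the slides: the two hypotheses ("high" and "low" quality), the prior, and the pass/fail likelihoods are all made-up numbers, and `update_belief` is a hypothetical helper name.

```python
def update_belief(prior, likelihood, observation):
    """Return the posterior P(hypothesis | observation) via Bayes' rule."""
    # Numerator of Bayes' rule: prior belief times probability of the data.
    unnormalized = {h: prior[h] * likelihood[h][observation] for h in prior}
    # Denominator: total probability of the observation under all hypotheses.
    evidence = sum(unnormalized.values())
    return {h: p / evidence for h, p in unnormalized.items()}


# Hypothetical numbers, chosen only for illustration.
prior = {"high": 0.5, "low": 0.5}          # belief before seeing any data
likelihood = {                             # model for how the data are generated
    "high": {"pass": 0.8, "fail": 0.2},
    "low": {"pass": 0.4, "fail": 0.6},
}

posterior = update_belief(prior, likelihood, "pass")
print(posterior)
```

The posterior from one observation can be fed back in as the prior for the next one, which is what "updating your belief as data is observed" amounts to in this toy setting.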
Example: Evaluating Candidates
Candidate potential
Schooling
Experience
Interview
Internship
How to update?!
Model for candidates
Model for data generation
Mixture Models
A popular statistical model
• An easy way to build hierarchical relationships
Mixture models visualized
Candidate Quality
High
Low
Marginal Distribution of Candidate Performance: ignore quality
Distribution of Candidate Performance:
Mixture Weights
? ? ? ?
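The mixture picture above can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not give the component distributions or the mixture weights, so the Gaussian components, the weights, and every number below are hypothetical.

```python
import math
import random

# Hypothetical mixture weights and component parameters (not from the slides).
WEIGHTS = {"high": 0.3, "low": 0.7}
COMPONENTS = {"high": (80.0, 5.0), "low": (60.0, 10.0)}  # (mean, std dev)


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def marginal_density(x):
    """p(x) = sum over qualities of weight * p(x | quality): quality is ignored."""
    return sum(WEIGHTS[q] * normal_pdf(x, *COMPONENTS[q]) for q in WEIGHTS)


def sample_performance(rng):
    """Draw the hidden quality first, then the performance given that quality."""
    quality = rng.choices(list(WEIGHTS), weights=list(WEIGHTS.values()))[0]
    mean, std = COMPONENTS[quality]
    return quality, rng.gauss(mean, std)


rng = random.Random(0)
draws = [sample_performance(rng) for _ in range(1000)]
share_high = sum(q == "high" for q, _ in draws) / len(draws)
print(share_high, marginal_density(70.0))
```

In practice only the performance values are observed; the quality labels, the mixture weights, and the component parameters are the unknowns to be inferred.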
How are words in a document generated?
One possibility:
Each word can come from a different topic (bag of words: ignore order)
Mixture weight for topic k
Multinomial distribution over ALL words based on topic k
Just a mixture model
Word
Topic 1, …, Topic K
Leadership
Big Data
Machine Learning
1) Pick a topic
2) Pick a word
The chosen topic: Z
So we really want to know
1) Z (cluster for the word)
2) The mixture weights over topics (document composition)
3) The distribution over words for each topic (key words)
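The two-step recipe above (pick a topic, then pick a word) can be sketched as follows. The vocabulary, the per-topic word probabilities, and the document's topic weights are hypothetical values invented for this illustration; only the topic names come from the slide.

```python
import random

# Hypothetical word distributions for each topic (each row sums to 1).
TOPIC_WORDS = {
    "Leadership": {"team": 0.5, "vision": 0.4, "data": 0.1},
    "Big Data": {"data": 0.6, "hadoop": 0.3, "team": 0.1},
    "Machine Learning": {"model": 0.5, "data": 0.3, "training": 0.2},
}
# Hypothetical mixture weights for one document (document composition).
DOC_TOPIC_WEIGHTS = {"Leadership": 0.2, "Big Data": 0.5, "Machine Learning": 0.3}


def draw(distribution, rng):
    """Sample one key from a dict mapping outcomes to probabilities."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights)[0]


def generate_document(n_words, rng):
    """Return a list of (Z, word) pairs; word order carries no meaning."""
    pairs = []
    for _ in range(n_words):
        z = draw(DOC_TOPIC_WEIGHTS, rng)     # 1) pick a topic
        word = draw(TOPIC_WORDS[z], rng)     # 2) pick a word from that topic
        pairs.append((z, word))
    return pairs


document = generate_document(8, random.Random(1))
print(document)
```

Here the topic Z of every word is printed alongside the word. With real text only the words are seen, so Z, the document's topic weights, and the per-topic word distributions all have to be inferred.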
Review!
Z → W
Zd,n
k=1…K
Wd,n
n=1,…,Nd
d=1,…,D
K: number of topics
Nd: number of words in document d
D: number of documents
Bayesian: But what about the distribution for the mixture weights and the topics' word distributions??
The two Dirichlet parameters control the “sparsity” of the weights for the multinomial.
Implications: a priori we assume
- Topics have few key words
- Documents only have a small subset of topics
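The effect of the sparsity parameter can be checked with a small simulation. This sketch assumes a symmetric Dirichlet (one shared parameter for every entry) and draws from it by normalizing independent Gamma variates; the function names, the dimension of 10, and the two parameter values are arbitrary choices for illustration.

```python
import random


def sample_dirichlet(concentration, dim, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


def average_max_weight(concentration, dim=10, n_samples=2000, seed=0):
    """Average size of the largest weight; large values mean mass sits on few entries."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += max(sample_dirichlet(concentration, dim, rng))
    return total / n_samples


sparse = average_max_weight(0.1)   # small parameter: mass piles onto a few entries
dense = average_max_weight(10.0)   # large parameter: weights stay close to uniform
print(sparse, dense)
```

A small parameter therefore encodes the prior assumptions listed above: a topic puts most of its probability on a few key words, and a document on a few topics.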
Dirichlet Distribution with Different Sparsity Parameters
Latent Dirichlet Allocation!!!
Zd,n
k=1…K
Wd,n
n=1,…,Nd
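One compact way to write the model on this slide is the generative specification below. The symbols \theta_d (topic weights of document d), \beta_k (word distribution of topic k), \alpha and \eta (Dirichlet parameters) follow common LDA notation and are assumptions here, since the slide's own symbols did not survive; Z_{d,n}, W_{d,n}, K, N_d and D are as defined earlier.

```latex
\begin{align*}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
Z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(1, \theta_d), & n &= 1,\dots,N_d \\
W_{d,n} \mid Z_{d,n}, \beta &\sim \mathrm{Multinomial}(1, \beta_{Z_{d,n}})
\end{align*}
```

Read top to bottom, this is the earlier mixture model with Dirichlet priors placed on the mixture weights and on each topic's word distribution.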
How do we fit this model?
Want the posterior:
Worst part of Bayesian analysis… personally speaking.
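The posterior in question can be written out as below. The notation is an assumption (the formula on the slide did not survive): \theta are the documents' topic weights, \beta the topics' word distributions, Z the topic assignments, W the observed words, and \alpha, \eta the Dirichlet parameters.

```latex
\begin{align*}
p(\theta, \beta, Z \mid W, \alpha, \eta)
  &= \frac{p(\theta, \beta, Z, W \mid \alpha, \eta)}{p(W \mid \alpha, \eta)}, \\
p(\theta, \beta, Z, W \mid \alpha, \eta)
  &= \prod_{k=1}^{K} p(\beta_k \mid \eta)
     \prod_{d=1}^{D} p(\theta_d \mid \alpha)
     \prod_{n=1}^{N_d} p(Z_{d,n} \mid \theta_d)\, p(W_{d,n} \mid \beta_{Z_{d,n}})
\end{align*}
```

The numerator is easy to evaluate; the difficulty is the denominator, which requires integrating over \theta and \beta and summing over every possible assignment Z, and has no closed form.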
Two main ways to get the posterior:
- Sampling methods
  - Asymptotically correct
  - Time consuming
  - Lots of black magic in sampling tricks
- Variational methods (practical solution!)
  - An approximation with no guarantees
  - Faster
  - Need math skills
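As a concrete instance of the sampling route, here is a minimal sketch of collapsed Gibbs sampling for LDA. The slides do not name a specific sampler, so this choice is an assumption, as are the function name `gibbs_lda`, the default values of `alpha` and `eta`, and the toy corpus. It is meant to show the mechanics, not to be a scalable implementation.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, eta=0.01, n_iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # words in doc d on topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # times word w is on topic k
    topic_total = [0] * n_topics                              # words assigned to topic k
    assignments = []

    # Start from a random topic for every word.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][n]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized probability of each topic given all other assignments.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + eta)
                    / (topic_total[j] + vocab_size * eta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed estimate of each document's topic weights from the final counts.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word


# Toy corpus over a 4-word vocabulary (hypothetical data).
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2], [0, 1, 2, 3, 0]]
theta, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=4)
print(theta)
```

Every sweep touches every word in the corpus, which is one reason sampling is time consuming; deciding how many sweeps to run and which draws to keep is part of the "black magic".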
Variational Bayes (specifically mean-field variational Bayes):
What’s crazy?
- Assumes all the latent variables are independent
What’s not crazy?
- Finds the “best” model within this crazy class.
- Best under KL divergence
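Written out for LDA, the two points above look roughly like this. The notation is an assumption rather than the slide's own: \theta_d are document topic weights, \beta_k are topic word distributions, Z_{d,n} are topic assignments, W are the observed words, and \mathcal{Q} is the class of fully factorized distributions.

```latex
\begin{align*}
q(\theta, \beta, Z)
  &= \prod_{d=1}^{D} q(\theta_d) \prod_{k=1}^{K} q(\beta_k)
     \prod_{d=1}^{D} \prod_{n=1}^{N_d} q(Z_{d,n}), \\
q^{*} &= \operatorname*{arg\,min}_{q \in \mathcal{Q}}
     \mathrm{KL}\bigl( q(\theta, \beta, Z) \,\|\, p(\theta, \beta, Z \mid W) \bigr)
\end{align*}
```

The first line is the independence assumption; the second line is the sense in which the fitted q is the "best" member of that class.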
Empirically, it has shown promising results!
For “sufficient” details:
“Explaining Variational Approximations” by Ormerod and Wand
LDA Take Home
- An intuitively appealing Bayesian unsupervised learning model
- Training is difficult
  - Lots of packages exist; the main issue is scalability
- Validation is difficult
  - Usually cast into a supervised learning framework
- Presentation is difficult
  - Visualization for the Bayesian model is hard

LDA Beginner's Tutorial

  • 1. ©2013 LinkedIn Corporation. All Rights Reserved. Latent Dirichlet Allocation (LDA) - for ML-IR Discussion Group 1 Prepared by Wayne Tai Lee, Satpreet Singh
  • 2. ©2013 LinkedIn Corporation. All Rights Reserved. Latent Dirichlet Allocation: A Bayesian Unsupervised Learning Model Roadmap 2 • Unsupervised learning • Bayesian Statistics • Mixture Models • LDA – theory and intuition • LDA – practice and applications
  • 3. ©2013 LinkedIn Corporation. All Rights Reserved. Unsupervised Learning Learning patterns with no labels 3 • Clustering is a form of “Unsupervised learning” • Classification is known as supervised learning • Validation is difficult
  • 4. ©2013 LinkedIn Corporation. All Rights Reserved. 4 How would you cluster?
  • 5. ©2013 LinkedIn Corporation. All Rights Reserved. 5 Documents of wikipedia Now try these ones!
  • 6. ©2013 LinkedIn Corporation. All Rights Reserved. Bayesian Statistics A framework to update your beliefs 6 • Probabilities as beliefs • Updates your belief as data is observed • Requires a model that describes the data generation
  • 7. Example: Evaluating Candidates. Candidate potential.
  • 8. Example: Evaluating Candidates. Candidate potential: schooling, experience, interview, internship.
  • 9. Example: Evaluating Candidates. Candidate potential: schooling, experience, interview, internship. How to update?!
  • 10. [No text on slide]
  • 11. Model for candidates; model for data generation.
  • 12. Mixture Models: a popular statistical model. An easy way to build hierarchical relationships.
  • 13. Mixture models visualized. Candidate quality: high, low.
  • 14. Marginal distribution of candidate performance (ignoring quality).
  • 15. Distribution of candidate performance.
  • 16. Distribution of candidate performance: mixture weights.
  • 17. Mixture weights; distribution of candidate performance.
  • 18. Distribution of candidate performance: ? ? ? ?
  • 19. How are words in a document generated?
  • 20. One possibility: each word may come from a different topic (bag of words: ignore order).
  • 21. How are words in a document generated? Each word may come from a different topic. Two ingredients: the mixture weight for topic k, and a multinomial distribution over ALL words based on topic k.
  • 22. Just a mixture model. Diagram labels: Word; Topic 1 through Topic K; Leadership, Big Data, Machine Learning.
  • 23. Just a mixture model. Diagram labels: Word; Topic 1 through Topic K; Leadership, Big Data, Machine Learning. Steps: 1) pick a topic, 2) pick a word.
  • 24. Just a mixture model. Diagram labels: Word; Topic 1 through Topic K; Leadership, Big Data, Machine Learning. The chosen topic: Z.
  • 25. Just a mixture model. Diagram labels: Word; Topic 1 through Topic K; Leadership, Big Data, Machine Learning. So we really want to know 1) Z 2) _ 3) _. The chosen topic: Z.
  • 26. Just a mixture model. Diagram labels: Word; Topic 1 through Topic K; Leadership, Big Data, Machine Learning. So we really want to know 1) Z (cluster for the word), 2) the topic mixture weights (document composition), 3) the per-topic word distributions (key words). The chosen topic: Z.
  • 27. Review! Z, W.
  • 28. Diagram labels: Z_{d,n}; W_{d,n}; k = 1, ..., K; n = 1, ..., N_d; d = 1, ..., D. K: number of topics; N_d: number of words in document d; D: number of documents.
  • 29. Diagram labels: Z_{d,n}; W_{d,n}; k = 1, ..., K; n = 1, ..., N_d; d = 1, ..., D. K: number of topics; N_d: number of words in document d; D: number of documents. Bayesian: but what about the distributions for the topic mixture weights and the per-topic word distributions?
  • 30. Diagram labels: Z_{d,n}; W_{d,n}; k = 1, ..., K; n = 1, ..., N_d; d = 1, ..., D. K: number of topics; N_d: number of words in document d; D: number of documents. Bayesian: but what about the distributions for the topic mixture weights and the per-topic word distributions?
  • 31. The Dirichlet prior parameters control the “sparsity” of the weights for the multinomial. Implications: a priori we assume that topics have few key words and that documents only have a small subset of topics.
  • 32. Dirichlet distribution with different sparsity parameters.
  • 33. Latent Dirichlet Allocation!!! Diagram labels: Z_{d,n}; W_{d,n}; k = 1, ..., K; n = 1, ..., N_d.
  • 34. How do we fit this model? We want the posterior. Worst part of Bayesian analysis... personally speaking~
  • 35. Two main ways to get the posterior. Sampling methods: asymptotically correct; time consuming; lots of black magic in sampling tricks. Variational methods (practical solution!): an approximation with no guarantees; faster; need math skills.
  • 36. Variational Bayes (specifically mean-field variational Bayes). What’s crazy? It assumes all the latent variables are independent. What’s not crazy? It finds the “best” model within this crazy class, where “best” is measured by KL divergence. It has shown promising results empirically! For “sufficient” details: “Explaining Variational Approximations” by Ormerod and Wand.
  • 37. LDA Take Home. An intuitively appealing Bayesian unsupervised learning model. Training is difficult: lots of packages exist; the main issue is scalability. Validation is difficult: it is usually cast into a supervised learning framework. Presentation is difficult: visualization for the Bayesian model is hard.
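
The generative story told across the slides (Dirichlet priors, then "pick a topic, pick a word" for every word of every document) can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the authors' code: the function names and the hyperparameter names `alpha` (prior on document topic weights) and `eta` (prior on topic word weights) are assumptions, since the symbols on the slides did not survive in the transcript. Values below 1 for either parameter tend to give the sparse weights described on the sparsity slide.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    # The tiny floor guards against a draw underflowing to exactly 0.0.
    draws = [max(rng.gammavariate(concentration, 1.0), 1e-300) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, num_topics, num_docs, doc_len, alpha, eta, seed=0):
    """Simulate documents from the LDA generative process (assumed names)."""
    rng = random.Random(seed)
    # Each topic is a distribution over ALL words in the vocabulary.
    topics = [sample_dirichlet(eta, len(vocab), rng) for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # Document composition: mixture weights over the topics.
        weights = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(num_topics), weights=weights)[0]  # 1) pick a topic
            w = rng.choices(vocab, weights=topics[z])[0]            # 2) pick a word
            words.append(w)
        docs.append(words)
    return topics, docs
```

Fitting LDA runs this story in reverse: given only the words, infer the topic assignments, the document compositions, and the topic word distributions.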

Editor's Notes

  1. Take home: validation is difficult... there is no true answer here.
  2. Clustering documents is difficult because many words are repeated across documents. Some documents may be similar to one another on different topics, so we might want to cluster in a way that allows membership in more than one cluster.
  3. Two-stage process
  4. Example: the word “professional” is probably used more often in a professional-network topic than in a social-network topic.
  5-9. Two-stage process
  10-18. Example: the word “professional” is probably used more often in a professional-network topic than in a social-network topic.
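
The two-stage process in these notes can be illustrated with the candidate example from the mixture-model slides: first draw the hidden quality level from the mixture weights, then draw the observed performance from that level's distribution. The sketch below is a hedged illustration only; the weights, means, and standard deviations are hypothetical numbers, and the Gaussian components are an assumption not stated in the deck.

```python
import random

# Hypothetical mixture: weight and Gaussian (mean, std) of performance per quality level.
MIXTURE = {
    "high": {"weight": 0.3, "mean": 80.0, "std": 5.0},
    "low": {"weight": 0.7, "mean": 55.0, "std": 10.0},
}


def sample_candidate(rng):
    """Two-stage draw: hidden quality first, then observed performance."""
    labels = list(MIXTURE)
    weights = [MIXTURE[label]["weight"] for label in labels]
    quality = rng.choices(labels, weights=weights)[0]             # stage 1
    component = MIXTURE[quality]
    performance = rng.gauss(component["mean"], component["std"])  # stage 2
    return quality, performance


def marginal_mean():
    """Mean performance when quality is ignored: the weighted average of component means."""
    return sum(c["weight"] * c["mean"] for c in MIXTURE.values())
```

Ignoring the quality label and looking only at the performance values gives the marginal distribution from the slides; the words-from-topics model has the same two-stage shape, with topics in place of quality levels.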