Topic Modeling for Information Retrieval
and Word Sense Disambiguation tasks
Università degli Studi di Milano - Bicocca
Di Donato Leonardo
Text Mining Course - Prof. Fabio Stella
Introduction
Di Donato Leonardo, Università degli Studi di Milano - Bicocca
a superabundant amount of unstructured digital information
it continues to grow at an astonishing rate (it doubles every two years)
people cannot manage it: information overload
problems: crawling, representing, storing, summarizing, clustering, searching ...
(general rule: every problem is an opportunity)
opportunity: automatically extract value from chaos
what value? how to do it?
Goals
the value that we want to extract is: clusters of semantically related
documents
our purpose is [1] the unsupervised clustering of a text dataset
[2] the implementation of information retrieval procedures that exploit the
representation of documents at the topic level
[3] the modeling of the ability to computationally identify the meaning of
words in context (word sense disambiguation)
Dataset
our document collection: a partition of the Associated Press dataset
~ 2300 English news articles (dating back to the '90s)
characteristic of any text document: it is often messy, with flaws and noise
we need to clean the data
we need a structured representation of the data
Pre-Processing
Google Refine
[1] replacement of abbreviations and common entities with normalized expressions
(e.g., {dlrs, dlr, $, ...} → {dollar}, {mln, mlns, ...} → {million})
[2] correction of flaws and [3] stripping of metadata entities through regular
expressions
MALLET
[1] conversion of all characters to lowercase
[2] tokenization [3] stop-word removal
[4] proportional vocabulary cut-off, with threshold 0.03
[5] term-frequency representation of each document
the corpus is a single file, in which every line is a document with this format:
results: |W| = 32349 token types, 241908 word tokens
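The pre-processing steps can be illustrated with a minimal Python sketch. It is not the Google Refine or MALLET pipeline used for the slides: the rule list, the stop-word list and the `preprocess` function are hypothetical and heavily abbreviated, and the proportional vocabulary cut-off is left out because its exact definition is not given here.

```python
import re
from collections import Counter

# Hypothetical, abbreviated normalization rules (illustration only).
NORMALIZATION_RULES = [
    (re.compile(r"\b(?:dlrs|dlr)\b|\$", re.IGNORECASE), " dollar "),
    (re.compile(r"\b(?:mlns|mln)\b", re.IGNORECASE), " million "),
]
# Hypothetical, very small stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "on", "for", "was"}
TOKEN_PATTERN = re.compile(r"[a-z]+")


def preprocess(text):
    """Normalize, lowercase, tokenize, drop stop words, count term frequencies."""
    for pattern, replacement in NORMALIZATION_RULES:
        text = pattern.sub(replacement, text)
    tokens = TOKEN_PATTERN.findall(text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


tf = preprocess("The fire caused $ 5 mln in damage, or 5 MLN dlrs.")
```

Here `tf` is the term-frequency representation of a single document; a corpus would be a list of such counters, one per line of the input file.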
Topic Models
probabilistic generative models for uncovering the underlying semantic
structure of a document collection based on a Bayesian analysis of the
original texts [ Blei, 2003 ]
goal: discover patterns of word-use and connect documents that exhibit
similar patterns
idea: documents are mixtures of topics (assignments) and each topic is a
multinomial probability distribution over words
which topics are most likely to have generated the given corpus of documents
(maximum likelihood)?
we have to infer 3 latent variables: [1] the word distribution of each topic [2]
the topic distribution of each document [3] the word-topic assignments
[1] $\Phi^{(j)} = P(W \mid Z = j)$ [2] $\Theta^{(d)} = P(Z \mid D = d)$ [3] $P(Z \mid W)$
Topic Models
the Latent Dirichlet Allocation (LDA) model places Dirichlet priors on [2] and [1],
with smoothing hyper-parameters α and β respectively
$\alpha_j$ can be interpreted as a prior count of the number of times topic j is
selected for a document, before observing any word of that document
($\alpha_1, ..., \alpha_T$ are the parameters of a Dirichlet prior)
β is the parameter of a Dirichlet prior and can be interpreted as a prior count of the
number of times a word is extracted from a topic (before observing any corpus document)
we need to estimate the distributions Φ and Θ: different methods can be used
(e.g., Gibbs sampling)
with Gibbs sampling, Φ and Θ can be computed directly from the matrices of counts
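As an illustration of how Φ and Θ follow from the count matrices, here is a minimal collapsed Gibbs sampler in Python with NumPy. It is a sketch under stated assumptions (symmetric scalar α and β, documents given as lists of integer word ids), not the MALLET implementation; the function name `gibbs_lda` and the toy corpus are hypothetical.

```python
import numpy as np


def gibbs_lda(docs, W, T, alpha, beta, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric alpha and beta (sketch)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    cwt = np.zeros((W, T))  # word-topic counts
    cdt = np.zeros((D, T))  # document-topic counts
    ct = np.zeros(T)        # total number of words assigned to each topic
    z = []                  # topic assignment of every word position
    for d, doc in enumerate(docs):
        zd = rng.integers(T, size=len(doc))  # random initial assignments
        z.append(zd)
        for w, j in zip(doc, zd):
            cwt[w, j] += 1
            cdt[d, j] += 1
            ct[j] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]
                # remove the current assignment from the counts
                cwt[w, j] -= 1
                cdt[d, j] -= 1
                ct[j] -= 1
                # conditional distribution of the topic for this word position
                p = (cwt[w] + beta) / (ct + W * beta) * (cdt[d] + alpha)
                j = rng.choice(T, p=p / p.sum())
                z[d][i] = j
                cwt[w, j] += 1
                cdt[d, j] += 1
                ct[j] += 1
    # point estimates computed directly from the count matrices
    phi = (cwt + beta) / (cwt.sum(axis=0) + W * beta)                      # phi[w, j] = P(w | z=j)
    theta = (cdt + alpha) / (cdt.sum(axis=1, keepdims=True) + T * alpha)  # theta[d, j] = P(z=j | d)
    return phi, theta


# Hypothetical toy corpus: 3 documents over a vocabulary of 4 word ids.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
phi, theta = gibbs_lda(toy_docs, W=4, T=2, alpha=0.5, beta=0.01, iters=20)
```

Each column of `phi` is the word distribution of one topic and each row of `theta` is the topic distribution of one document.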
Tuning
what are the best values for the hyper-parameters? usually α = 50/T and β =
0.01 are those that give the best results [ Steyvers and Griffiths, 2007 ]
what is the optimal number of topics T? and the number of iterations I?
it depends on the specific problem; it is an open question
we tried T = 35 and T = 40
there are topic evaluation techniques that try to address this problem ...
we used one of those techniques (i.e., the topic coherence metric, which
evaluates the semantic coherence of a topic) to compare two model
configurations: symmetric α versus asymmetric α
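The slides do not say which coherence formulation was used. As an assumption, the sketch below implements one common variant based on document co-occurrence counts (Mimno et al., 2011): for the top words of a topic, it sums log((D(v_m, v_l) + 1) / D(v_l)) over all ordered pairs, where D counts the documents containing the given words. The function name and the toy data are hypothetical.

```python
import math


def topic_coherence(top_words, documents):
    """Co-occurrence based coherence of one topic (higher is more coherent).

    top_words: the topic's most probable words, most probable first.
    documents: list of tokenized documents (lists of words).
    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # number of documents that contain all the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co = doc_freq(top_words[m], top_words[l])
            score += math.log((co + 1) / doc_freq(top_words[l]))
    return score


# Hypothetical toy corpus.
toy_corpus = [
    ["fire", "forest", "park"],
    ["fire", "forest"],
    ["stock", "market"],
    ["stock", "market", "fire"],
]
coherent = topic_coherence(["fire", "forest"], toy_corpus)
mixed = topic_coherence(["fire", "market"], toy_corpus)
```

Averaging this score over the topics of each model gives a single number per configuration, which is the kind of comparison described above.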
Symmetric α versus Asymmetric α
an asymmetric configuration (AS) of the α hyper-parameters allows a more
flexible calibration of the degree of topic sparseness
it has been shown empirically that optimizing the Dirichlet hyper-parameters
($\alpha_1, ..., \alpha_T$) of the document-topic distributions makes a huge
difference: topics are not dominated by very common words and they are
more stable as their number increases [ Wallach, 2009 ]
our experiments did not confirm this: the average topic coherence
of the AS configuration was worse than that of the symmetric (SS) configuration
why? perhaps in our corpus there is no topic that tends to occur
in every document (or the optimal T may be greater, or the answer is simply
more trivial ...)
Top topics for symmetric α and T = 35
Post-Processing - Information Retrieval
why should we use topic models to improve information retrieval tasks?
[1] we can cluster queries according to the extracted topics
[2] two documents that share no common words can still be measured as
similar
the query likelihood model is a basic approach to information retrieval
in this (generative) context we can evaluate how well a document
matches a query by specifying how the words of the query may have been
generated by a language model
we derive a language model for each document (a mixture of topics)
so, the relevant documents will have a topic distribution that is likely to have
generated the set of words contained in (or associated with) the query
→ documents similarity
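A minimal sketch of query likelihood under a topic model, assuming a uniform prior over documents and the matrices Φ (words × topics) and Θ (documents × topics) already estimated: each query word is scored as Σ_j P(w | z = j) P(z = j | d), and documents are ranked by the log of the product over the query words. The function names and the toy matrices are hypothetical.

```python
import numpy as np


def query_log_likelihood(query_word_ids, phi, theta_d):
    """log P(query | document), with the document modeled as a mixture of topics.

    phi:     array (W, T), phi[w, j] = P(w | z=j)
    theta_d: array (T,),   theta_d[j] = P(z=j | d)
    """
    word_probs = phi[query_word_ids] @ theta_d  # P(w | d) for each query word
    return float(np.sum(np.log(word_probs)))


def rank_documents(query_word_ids, phi, theta):
    """Return document indices sorted from most to least likely, plus the scores."""
    scores = np.array([query_log_likelihood(query_word_ids, phi, t) for t in theta])
    return np.argsort(-scores), scores


# Hypothetical toy model: 4 words, 2 topics, 2 documents.
toy_phi = np.array([[0.45, 0.05], [0.45, 0.05], [0.05, 0.45], [0.05, 0.45]])
toy_theta = np.array([[0.9, 0.1], [0.1, 0.9]])
order, scores = rank_documents([0, 1], toy_phi, toy_theta)
```

Because a smoothed Φ is strictly positive, a document can receive a non-zero score even when it contains none of the query words.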
Documents Similarity
two approaches to compute the similarity between documents
[1] probabilistic query approach
[2] comparison of the topic distributions of documents
how? through divergence measures (e.g., symmetrised Kullback-Leibler,
Jensen-Shannon)
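The two divergences named above can be sketched in a few lines of NumPy. The symmetrised Kullback-Leibler divergence is taken here as the average of the two directed divergences, which is one common convention and an assumption on our part; it requires both distributions to be strictly positive wherever the other one is.

```python
import numpy as np


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p = 0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def symmetrised_kl(p, q):
    """Average of KL(p || q) and KL(q || p)."""
    return 0.5 * (kl(p, q) + kl(q, p))


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: average KL of p and q from their midpoint m."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical topic distributions of three documents.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]
```

Lower values mean more similar documents; the Jensen-Shannon divergence is bounded by ln 2, which makes it easier to compare across document pairs.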
Similar documents for query "forest fire"
AP880727-0015 X Fire-spitting helicopters were dispatched to Yellowstone National Park on
Tuesday to help protect the Old Faithful geyser area from a 6,000-acre blaze ...
Post-Processing - Word Sense Disambiguation
the ability to identify the meaning of words in context in a computational
manner is usually referred to as Word Sense Disambiguation
four elements: [1] selection of word senses (i.e., the classes) [2] use of
external knowledge sources [3] representation of context [4] selection of an
automatic classification method
input: a user-specified context document $d_c$ that contains the word $w_x$ to be
disambiguated
[1] → given the s most similar words for $w_x$, for each of these we build a sense document
capturing synsets, glosses, example phrases, and other relevant relations from
WordNet
[2] → WordNet as the external knowledge source used to create the sense documents $d_s$
[3] → the topical and the semantic features
[4] → comparison of document $d_c$ with each of the s sense documents $d_s$ (with one of the two
approaches presented): the most similar one gives the sense of word $w_x$ in context $d_c$
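Step [4] can be sketched as follows, assuming that topic distributions have already been inferred for the context document and for every sense document, and that they are compared with the Jensen-Shannon divergence. The sense labels, the vectors and the function names are hypothetical; building the sense documents from WordNet is not shown.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence between two strictly positive distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return float(0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m)))


def disambiguate(theta_context, sense_thetas):
    """Return the sense whose document is closest to the context document.

    theta_context: topic distribution of the context document d_c
    sense_thetas:  dict mapping a sense label to the topic distribution of its d_s
    """
    return min(sense_thetas, key=lambda s: js_divergence(theta_context, sense_thetas[s]))


# Hypothetical example: the word "bank" in a context dominated by topic 0.
context = [0.8, 0.1, 0.1]
senses = {
    "bank_river": [0.1, 0.8, 0.1],
    "bank_finance": [0.7, 0.2, 0.1],
}
chosen = disambiguate(context, senses)
```

The same selection works with the query likelihood score instead of a divergence, in which case the sense with the highest score would be chosen.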
Words similarity
two possible approaches to compute the similarity between words:
[1] associative relation
[2] comparison of the topic distributions P(Z|W) of the words
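One way to realise approach [2], shown here as an assumption rather than as the method used for the slides, is to obtain P(Z|W) from Φ by Bayes' rule with a topic prior P(Z), and then rank words by the Jensen-Shannon divergence between their topic distributions. The function names, the uniform prior and the toy Φ are hypothetical.

```python
import numpy as np


def topics_given_words(phi, topic_prior):
    """P(z | w) for every word via Bayes' rule.

    phi:         array (W, T), phi[w, j] = P(w | z=j), strictly positive
    topic_prior: array (T,),   P(z=j)
    """
    joint = phi * topic_prior  # P(w | z) P(z)
    return joint / joint.sum(axis=1, keepdims=True)


def most_similar_words(word_id, p_z_given_w, k=5):
    """Ids of the k words whose P(z | w) is closest to that of word_id."""
    p = p_z_given_w[word_id]
    m = 0.5 * (p_z_given_w + p)
    js = 0.5 * np.sum(p * np.log(p / m), axis=1) + 0.5 * np.sum(
        p_z_given_w * np.log(p_z_given_w / m), axis=1
    )
    ranked = [int(i) for i in np.argsort(js) if i != word_id]
    return ranked[:k]


# Hypothetical toy model: 4 words, 2 topics, uniform topic prior.
toy_phi = np.array([[0.45, 0.05], [0.40, 0.10], [0.05, 0.45], [0.10, 0.40]])
p_zw = topics_given_words(toy_phi, np.array([0.5, 0.5]))
neighbours = most_similar_words(0, p_zw, k=1)
```

In practice the topic prior could be estimated from the topic assignment counts instead of being taken as uniform.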
Words similar to token "arab"
Future Work
topic modeling →
● train an LDA model with asymmetric α for increasing values of T and evaluate the
resulting quality of topics
● train an LDA model with asymmetric α on a vocabulary to which no proportional
cut-off has been applied
● investigate a possible implementation of a multiple-chain model to obtain more
stable topics
● use other topic evaluation metrics
information retrieval →
● assess and fine-tune the prior probability of a document in the query likelihood model
● use other high-frequency metrics (e.g., α-skew) in relation to the comparison of
distributions
word sense disambiguation →
● implement and evaluate other methods to compare the context document and the sense
documents (e.g., compute $P(d_c, d_s)$ under the assumption that they are conditionally
independent, given the topic variable)
● refine the mechanism of sense selection (e.g., choosing each of the s most probable words
within a probability interval, in order to minimize the risk that all the most similar words
refer to strictly correlated meanings)
Thank you for your attention.
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 

Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks

  • 1. Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks. Di Donato Leonardo, Università degli Studi di Milano - Bicocca. Text Mining Course, Prof. Fabio Stella.
  • 2. Introduction. There is a superabundant amount of unstructured digital information, and it continues to grow at an astonishing rate (it doubles every two years). People cannot manage it: information overload. Problems: crawling, representing, storing, summarizing, clustering, searching ... (General rule: every problem is an opportunity.) Opportunity: automatically extract value from chaos. What value? How to do it?
  • 3. Goals. The value we want to extract: clusters of semantically related documents. Our purpose is [1] the unsupervised clustering of a text dataset; [2] the implementation of information retrieval procedures that exploit the representation of documents at the topic level; [3] the modeling of the ability to computationally identify the meaning of words in context (word sense disambiguation). Dataset. Our document collection: a partition of the Associated Press dataset, ~2300 English news texts (dating back to the '90s). Characteristic of any text document: it is often messy and has flaws and noise. We need to clean the data. We need a structured representation of the data.
  • 4. Pre-Processing. Google Refine: [1] replacement of abbreviations and common entities with expressions that normalize them (e.g., {dlrs, dlr, $, ...} → {dollar}, {mln, mlns, ...} → {million}); [2] adjustment of flaws; [3] stripping of metadata entities through regular expressions. MALLET: [1] lowercasing of all characters; [2] tokenization; [3] stop-word removal; [4] proportional vocabulary cut-off, with threshold 0.03; [5] term-frequency representation of each document. The corpus is a single file in which every line is a document in a fixed format (not captured in this transcript). Results: |W| = 32349 token types, 241908 words.
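The pre-processing steps on slide 4 can be sketched in a few lines of Python. This is only an illustration, not the Google Refine / MALLET pipeline the author used: the `NORMALIZE` table, the tiny `STOP_WORDS` set, and the function names are hypothetical, and the proportional vocabulary cut-off is not modeled because the slide does not define it precisely.

```python
import re
from collections import Counter

# Hypothetical normalization table; the slide lists only a few of the variants.
NORMALIZE = {"dlrs": "dollar", "dlr": "dollar", "$": "dollar",
             "mln": "million", "mlns": "million"}

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "of", "in", "to", "and", "was", "at"}


def preprocess(text):
    """Lowercase, tokenize, normalize abbreviations, and drop stop words."""
    tokens = re.findall(r"\$|[a-z]+", text.lower())   # lowercase + tokenize
    tokens = [NORMALIZE.get(t, t) for t in tokens]    # normalize abbreviations
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal


def term_frequencies(text):
    """Term-frequency representation of a single document."""
    return Counter(preprocess(text))


tf = term_frequencies("The loss was 3 mln dlrs, up 2 mlns in the quarter.")
print(tf)
```

Digits are discarded by the tokenizer in this sketch; whether the original pipeline kept them is not stated on the slide.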
  • 5. Topic Models. Probabilistic generative models for uncovering the underlying semantic structure of a document collection, based on a Bayesian analysis of the original texts [ Blei, 2003 ]. Goal: discover patterns of word use and connect documents that exhibit similar patterns. Idea: documents are mixtures of topics (assignments), and each topic is a multinomial probability distribution over words. Which topics are most likely to have generated the given corpus of documents? We have to infer 3 latent variables: [1] the per-topic distributions over words, Φ(j) = P(W | Z = j); [2] the per-document distributions over topics, Θ(d) = P(Z | D = d); [3] the word-topic assignments, P(Z | W).
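As a compact restatement of the mixture idea on slide 5 (the standard form for this family of models, not copied from the slides), the probability of a word in a document sums over the T topics, combining the two distributions Φ and Θ:

```latex
P(W = w \mid D = d)
  = \sum_{j=1}^{T} P(W = w \mid Z = j)\, P(Z = j \mid D = d)
  = \sum_{j=1}^{T} \Phi^{(j)}_{w}\, \Theta^{(d)}_{j}
```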
  • 6. Topic Models. The Latent Dirichlet Allocation (LDA) model associates with [2] and [1] two smoothing hyper-parameters, α and β. αj can be read as a prior count of the number of times topic j has been selected for a document (α1, ..., αT are the parameters of a Dirichlet prior). β is the parameter of a Dirichlet prior and can be read as a prior count of the words drawn from a topic (before observing any document of the corpus). We need to estimate the distributions Φ and Θ. Different methods can be used to estimate them (e.g., Gibbs sampling); they can then be computed directly from the count matrices.
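The last point of slide 6, computing Φ and Θ from count matrices, can be sketched as follows. This assumes the smoothed-count estimates commonly used with collapsed Gibbs sampling and a symmetric α; the names `cwt` (word-topic counts) and `cdt` (document-topic counts) and the toy numbers are illustrative assumptions, not the author's data.

```python
def estimate_phi_theta(cwt, cdt, alpha, beta):
    """Estimate phi and theta from Gibbs-sampling count matrices.

    cwt[w][j]: number of times word w is assigned to topic j
    cdt[d][j]: number of times topic j is assigned to a token of document d
    Returns phi[j][w] ~ P(w | z=j) and theta[d][j] ~ P(z=j | d).
    """
    W, T = len(cwt), len(cwt[0])
    # Total number of tokens assigned to each topic.
    topic_totals = [sum(cwt[w][j] for w in range(W)) for j in range(T)]
    phi = [[(cwt[w][j] + beta) / (topic_totals[j] + W * beta)
            for w in range(W)] for j in range(T)]
    theta = [[(row[j] + alpha) / (sum(row) + T * alpha)
              for j in range(T)] for row in cdt]
    return phi, theta


# Toy counts: 3 word types, 2 topics, 2 documents (made up for illustration).
cwt = [[3, 0], [1, 0], [0, 4]]
cdt = [[4, 0], [0, 4]]
phi, theta = estimate_phi_theta(cwt, cdt, alpha=50 / 2, beta=0.01)
print(phi[0], theta[0])
```

With so few tokens, α = 50/T dominates the document-topic counts, so Θ stays close to uniform; on a real corpus the counts carry far more weight.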
  • 7. Tuning. Which are the best values for the hyper-parameters? Usually α = 50/T and β = 0.01 are reported to give good results [ Steyvers and Griffiths, 2007 ]. What is the optimal number of topics T, and the number of iterations I? It depends on the specific problem; it is an open problem. We set T = 35 and T = 40. There are topic evaluation techniques that try to address this problem ... We used one of them (the topic coherence metric, which evaluates the semantic coherence of a topic) to compare two model configurations: symmetric α versus asymmetric α.
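Slide 7 does not spell out the topic coherence metric. A minimal sketch follows, assuming it is the co-document-frequency coherence of Mimno et al. (2011), which sums log((D(v_m, v_l) + 1) / D(v_l)) over pairs of a topic's top words; that choice, the function name, and the toy documents are assumptions.

```python
import math


def topic_coherence(top_words, documents):
    """Co-document-frequency coherence of one topic.

    top_words: the topic's most probable words, most probable first.
    documents: list of tokenized documents.
    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co = doc_freq(top_words[m], top_words[l])
            score += math.log((co + 1) / doc_freq(top_words[l]))
    return score


docs = [["fire", "forest", "park"], ["fire", "forest"],
        ["bank", "money"], ["bank", "fire"]]
coherent = topic_coherence(["fire", "forest"], docs)
mixed = topic_coherence(["fire", "money"], docs)
print(coherent, mixed)
```

Scores are at most slightly above zero and grow more negative as the top words co-occur less often, so a higher (less negative) average indicates more coherent topics.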
  • 8. Symmetric α versus Asymmetric α. An asymmetric configuration (AS) for the α hyper-parameters makes it possible to calibrate the degree of topic sparseness with more flexibility. It has been empirically demonstrated that optimizing the Dirichlet hyper-parameters (α1, ..., αT) of the document-topic distribution makes a huge difference: topics are not dominated by very common words and they are more stable as their number increases [ Wallach, 2009 ]. Our experiments did not confirm this: the average topic coherence of the AS configuration was worse than that of the symmetric (SS) configuration. Why? In our corpus there isn't a topic that tends to occur in every document (or the optimal T may be greater, or the answer is simply more trivial ...).
  • 9. Top topics for symmetric α and T = 35
  • 10. Post-Processing - Information Retrieval. Why should we use topic models to improve information retrieval tasks? [1] We can cluster queries according to the extracted topics; [2] two documents that share no common words can still be measured as similar. The query likelihood model is a basic approach to information retrieval in this context (a generative model): we evaluate how well a document matches a query by specifying how the words of the query may have been generated by a language model. We derive a language model for each document (a mixture of topics), so the relevant documents will have a topic distribution that is likely to have generated the set of words contained in (or associated with) the query → documents similarity.
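The query likelihood idea on slide 10 can be sketched by scoring each document with the log-probability that its topic mixture generates the query words. The sketch assumes a uniform document prior and that every query word is in the vocabulary with non-zero probability; `phi`, `thetas`, and the document names are made-up illustrative values.

```python
import math


def query_log_likelihood(query, theta_d, phi):
    """log P(query | d) under the document's mixture of topics.

    theta_d[j] = P(z=j | d); phi[j] maps each word to P(w | z=j).
    """
    total = 0.0
    for w in query:
        # P(w | d) = sum over topics of P(w | z=j) * P(z=j | d)
        p_w = sum(theta_d[j] * phi[j].get(w, 0.0) for j in range(len(theta_d)))
        total += math.log(p_w)
    return total


# Hypothetical two-topic model and two documents.
phi = [{"forest": 0.40, "fire": 0.50, "bank": 0.10},
       {"forest": 0.05, "fire": 0.05, "bank": 0.90}]
thetas = {"doc_fire": [0.9, 0.1], "doc_finance": [0.1, 0.9]}

query = ["forest", "fire"]
ranking = sorted(thetas,
                 key=lambda d: query_log_likelihood(query, thetas[d], phi),
                 reverse=True)
print(ranking)
```

Working in log space avoids underflow when queries contain many words.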
  • 11. Documents Similarity. Two approaches to compute the similarity between documents: [1] probabilistic query approach; [2] comparison of the topic distributions of documents. How? Through divergence metrics (e.g., symmetrised Kullback-Leibler, Jensen-Shannon).
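The two divergences named on slide 11 can be sketched as below for comparing the topic distributions of two documents. The symmetrised KL is taken here as the average of the two directions, which is one common convention and an assumption about the one used in the slides; the example distributions are made up.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence D(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrised KL: average of both directions."""
    return 0.5 * (kl(p, q) + kl(q, p))


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl(p, m) + kl(q, m))


# Hypothetical topic distributions of three documents.
a = [0.7, 0.2, 0.1]
b = [0.6, 0.3, 0.1]
c = [0.1, 0.2, 0.7]
print(jensen_shannon(a, b), jensen_shannon(a, c))
print(symmetric_kl(a, b), symmetric_kl(a, c))
```

Lower values mean more similar documents. Jensen-Shannon stays finite even when one distribution has zeros, which plain KL does not.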
  • 12. Similar documents for query "forest fire": AP880727-0015 X Fire-spitting helicopters were dispatched to Yellowstone National Park on Tuesday to help protect the Old Faithful geyser area from a 6,000-acre blaze ...
  • 13. Post-Processing - Word Sense Disambiguation. The ability to computationally identify the meaning of words in context is usually referred to as Word Sense Disambiguation. Four elements: [1] selection of word senses (i.e., the classes); [2] use of external knowledge sources; [3] representation of context; [4] selection of an automatic classification method. Input: a user-specified context document dc that contains the word wx to be disambiguated. [1] → given the s words most similar to wx, for each of them we build a sense document capturing synsets, glosses, example phrases, and other relevant relations from WordNet. [2] → WordNet is the external knowledge source used to create the sense documents ds. [3] → the topical and the semantic features. [4] → comparison of document dc with each of the s sense documents ds (with one of the two approaches presented): the most similar one gives the sense of word wx in context dc.
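Step [4] of slide 13 amounts to a nearest-neighbour choice over topic distributions. A minimal sketch, assuming the context document and each sense document have already been mapped to topic distributions and that Jensen-Shannon divergence is the comparison used; the sense labels and all numbers are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * (kl(p, m) + kl(q, m))


def disambiguate(context_topics, sense_topics):
    """Return the label of the sense document closest to the context document."""
    return min(sense_topics,
               key=lambda s: js_divergence(context_topics, sense_topics[s]))


# Hypothetical topic distributions: context document dc and two sense documents ds.
context = [0.80, 0.15, 0.05]
senses = {
    "financial institution": [0.75, 0.20, 0.05],
    "river side": [0.05, 0.15, 0.80],
}
best = disambiguate(context, senses)
print(best)
```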
  • 14. Words similarity. Two possible approaches to compute the similarity between words: [1] associative relation; [2] comparison of the (topics-words) P(Z|W) distributions.
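Both approaches on slide 14 start from P(Z | W), which follows from Φ by Bayes' rule. The sketch below assumes the associative relation is the conditional probability P(w2 | w1) = Σ_j P(w2 | z=j) P(z=j | w1), a common formulation for topic models; the topic prior and the values in `phi` are made up for illustration.

```python
def topics_given_word(word, phi, topic_prior):
    """P(z | w) by Bayes' rule: proportional to P(w | z) * P(z)."""
    joint = [phi[j][word] * topic_prior[j] for j in range(len(phi))]
    total = sum(joint)
    return [x / total for x in joint]


def association(w2, w1, phi, topic_prior):
    """Associative relation P(w2 | w1), marginalizing over topics."""
    p_z_w1 = topics_given_word(w1, phi, topic_prior)
    return sum(phi[j][w2] * p_z_w1[j] for j in range(len(phi)))


# Hypothetical two-topic model over a three-word vocabulary.
phi = [{"arab": 0.50, "israel": 0.40, "stock": 0.10},
       {"arab": 0.05, "israel": 0.05, "stock": 0.90}]
prior = [0.5, 0.5]

print(topics_given_word("arab", phi, prior))
print(association("israel", "arab", phi, prior),
      association("stock", "arab", phi, prior))
```

For approach [2], the P(Z | W) vectors of two words can be compared with a divergence metric in the same way as document topic distributions.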
  • 15. Words similar to token "arab"
  • 16. Future Work
      topic modeling →
        ● train an LDA model with asymmetric α for increasing values of T and evaluate the resulting topic quality
        ● train an LDA model with asymmetric α on a vocabulary to which no proportional cut-off has been applied
        ● investigate a possible implementation of a multiple-chain model to obtain more stable topics
        ● use other topic evaluation metrics
      information retrieval →
        ● assess and fine-tune the prior probability of a document in the query likelihood model
        ● use other high-frequency metrics (e.g., α-skew) for the comparison of distributions
      word sense disambiguation →
        ● implement and evaluate other methods to compare the context document and the sense documents (e.g., compute P(dc, ds) under the assumption that they are conditionally independent given the topic variable)
        ● refine the sense selection mechanism (e.g., choose each of the s most probable words within a probability interval, to minimize the risk that all the most similar words refer to closely related meanings)
  • 17. Thank you for your attention.