ESWC 2014 Tutorial part 3
1. Social Web: Where are the Semantics?
ESWC 2014
Miriam Fernández, Victor Rodríguez,
Andrés García-Silva, Oscar Corcho
Ontology Engineering Group, UPM, Spain
Knowledge Media Institute, The Open University
2. Outline
• Part 1: Understanding Social Media
– Theory: background & applications described in this tutorial
– Hands on: data extraction from Twitter and Facebook
• Part 2: Using semantics to represent data from SNS
– Theory: Using SW to represent content, users and relations
– Hands on: applying and extending SIOC
• Part 3: Using semantics to understand social media conversations
– Theory: Using semantics to understand topics in social media
– Hands on: using LDA to extract topics from social media
• Part 4: Using semantics to understand user behaviour
3. Why do we need semantics to understand social media?
• Information overload
– We need mechanisms to support
• Better information search/recommendation
• Better information integration
• Automatic knowledge extraction
• User-generated content is generally unstructured
– Machines cannot understand this content!
ESWC 2014 Social Web: Where are the Semantics? 3
"The Semantic Web is an extension of the current Web in which
information is given well-defined meaning, better enabling computers
and people to work in cooperation."
Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web,
Scientific American, May 2001
4. Implicit vs. Explicit Semantics
• Implicit Semantics
– Implicit semantics, also called statistical semantics, focuses on
extracting word sense by studying the patterns of human word usage in
massive collections of text or other human-generated data.
– It does not rely on an explicit formalisation/conceptualisation of
knowledge
• Explicit Semantics
– Explicit semantics focuses on the analysis of content with the
support of explicit conceptualisations in the form of ontologies and
knowledge bases
5. Implicit semantics: Topic models
• Topic models: one possible way of extracting implicit semantics
6. bags of words
Document 1:          Document 2:
Word        count    Word        count
ESWC        3        ISWC        3
rank        1        rank        0
technology  2        venue       1
conference  1        semantic    5
venue       1        conference  0
semantic    5        Web         5
Web         7        knowledge   0
knowledge   5        ...
...
7. term-document matrix
• Term-document matrix
– A very large, sparse matrix
– A document can be seen as a vector
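As a sketch, such a term-document matrix can be built from raw counts; the two toy documents below are invented for illustration:

```python
from collections import Counter

# Two toy documents (invented for illustration)
docs = {
    "d1": "semantic web conference rank technology",
    "d2": "semantic web venue knowledge",
}

# Vocabulary: sorted union of all words seen in the corpus
vocab = sorted({w for text in docs.values() for w in text.split()})

# Rows = terms, columns = documents; each entry is a raw term frequency,
# so each column is the document's vector in term space
matrix = {term: [Counter(text.split())[term] for text in docs.values()]
          for term in vocab}

for term, counts in matrix.items():
    print(term, counts)
```

Most entries are zero in a real corpus, which is why sparse representations are used in practice.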
8. similarity measures
• Useful to answer… how similar are two documents?
– Distance measures between two documents
• Cosine similarity = A·B / (‖A‖‖B‖) = 0.72
• Jaccard index = |A ∩ B| / |A ∪ B| = 0.50
• However:
– Synonyms make related documents appear far apart when they are not
– Polysemous words make unrelated documents appear close when they are not
• What is a document talking about?
– "Explicit Semantic Analysis" (ESA)
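Both measures can be sketched in a few lines; the example vectors and word sets below are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def jaccard(s, t):
    """Jaccard index between two sets of words."""
    return len(s & t) / len(s | t)

# Two toy documents as word sets: they share 2 of 4 distinct words
doc_a = {"semantic", "web", "conference"}
doc_b = {"semantic", "web", "venue"}
print(jaccard(doc_a, doc_b))  # → 0.5
```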
9. info retrieval, text mining
• Classification. Documents may belong to different classes
• How relevant is a word for a document (or class of documents)?
TF-IDF(x, y) = tf(x, y) × log(D / Dx)
tf(x, y) = frequency of term x in document y
D = total number of documents
Dx = number of documents containing x
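A minimal sketch of the TF-IDF formula above, with an invented three-document corpus:

```python
import math
from collections import Counter

# Invented three-document corpus
corpus = [
    "semantic web conference",
    "semantic web technology",
    "knowledge base",
]

def tfidf(term, doc, corpus):
    """tfidf(x, y) = tf(x, y) * log(D / Dx), as on the slide."""
    tf = Counter(doc.split())[term]                   # frequency of x in y
    D = len(corpus)                                   # number of documents
    Dx = sum(1 for d in corpus if term in d.split())  # documents containing x
    return tf * math.log(D / Dx) if Dx else 0.0

# "semantic" appears in 2 of 3 documents -> low weight;
# "conference" appears in only 1 -> higher weight
print(tfidf("semantic", corpus[0], corpus))
print(tfidf("conference", corpus[0], corpus))
```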
10. latent semantic analysis
Figure taken from http://faculty.washington.edu/jwilker/559/SteyversGriffiths.pdf
• The term-document matrix is large
• Latent Semantic Analysis uses Singular Value Decomposition (SVD) to
reduce dimensionality
• The rank of D can be reduced
• Meaning of the factors:
– U = term-topic correlation
– D = topic importance
– V = document-topic correlation
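The SVD step can be sketched with NumPy; the matrix values below are illustrative, not taken from the figure:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents
X = np.array([[3., 3.],
              [1., 0.],
              [5., 5.],
              [7., 5.]])

# X = U D V^T: U is term-topic, D (here the vector s) is topic
# importance, V is document-topic
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 1  # keep only the strongest "topic" to reduce the rank
X_k = U[:, :k] * s[:k] @ Vt[:k, :]   # rank-k reconstruction of X

print(np.round(X_k, 2))
```

Documents (columns of V) can then be compared in the reduced k-dimensional space instead of the full term space.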
11. Semantics of a social media message
13. discriminative models / generative models
• Discriminative Models (1-step)
1. Directly infer posterior probabilities p(Ck|x)
• Generative Models (2-steps)
1. Infer class-conditional densities p(x|Ck) and priors p(Ck)
2. Use Bayes theorem to determine posterior probabilities
p(C1|x) = p(x|C1) p(C1) / [ p(x|C1) p(C1) + p(x|C2) p(C2) ]
We can generate x that are
likely to have been produced by
class C1
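A toy illustration of the two-step generative recipe, assuming Gaussian class-conditional densities and equal priors (all values invented):

```python
import math

def gauss(x, mu, sigma):
    """Gaussian density, used here as the class-conditional p(x|Ck)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

priors = {"C1": 0.5, "C2": 0.5}            # step 1: priors p(Ck)
params = {"C1": (0.0, 1.0), "C2": (3.0, 1.0)}  # (mean, std) per class

def posterior(c, x):
    """Step 2: Bayes' theorem, matching the formula on the slide."""
    num = gauss(x, *params[c]) * priors[c]
    den = sum(gauss(x, *params[k]) * priors[k] for k in priors)
    return num / den

print(round(posterior("C1", 0.5), 3))  # x near C1's mean -> high posterior
```

Because the class-conditionals are modelled explicitly, the same `gauss` densities could also be sampled to generate new x likely to come from C1.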
15. LDA: a probabilistic generative model
This is a Probabilistic
Generative Process: we can
generate documents according
to certain topics.
18. Topic models
Topics known a priori vs. latent topics
• We don’t know the topics in
advance
• We don’t know the importance
of each word in a topic
Latent topics are not pre-specified but found from the corpus
19. topic models vs LSA
Figure taken from http://faculty.washington.edu/jwilker/559/SteyversGriffiths.pdf
Singular Value Decomposition reduces dimensionality
• Latent Semantic Analysis vs Topic Model
20. topic models
How important is a word in a topic?
How important is a topic in a document?
21. topic models: LDA
D documents, using a
total of W words
K topics
LDA: each document d is a mixture of K topics, with each
topic being a multinomial distribution over a vocabulary of W words
θd : topic distribution for a document d (~Dirichlet(α))
ϕz : word probabilities for a topic z (~Dirichlet(β))
• Latent Dirichlet Allocation (LDA)
22. Topic models
LDA: each document d is a mixture of K topics, with each
topic being a multinomial distribution over a vocabulary of W words
Probability of a word: p(w|d) = Σz p(w|z) p(z|d) = Σz ϕz(w) θd(z)
θd : topic distribution for a document d (~Dirichlet(α))
ϕz : word probabilities for a topic z (~Dirichlet(β))
• Latent Dirichlet Allocation (LDA)
Reminder: the Gamma function, Γ(n) = (n−1)! for positive integers n, appears
in the normalisation constant of the Dirichlet distribution
23. Topic models
• Joint probability distribution over all the random variables of a
document, assuming α and β are known
• Given a set of documents D and the LDA model, we can use
inference to find θ, ϕ and the topic assignment for each word
• The problem is intractable, but it can be approximated with Gibbs
sampling, a Markov chain Monte Carlo (MCMC) method
The joint probability multiplies, for each word in the document: the
probability of the word in the document, knowing the distribution of words
in a topic (ϕ); the probability of a topic in the document, knowing the
topic distribution for the document (θ); and the Dirichlet priors declared
before.
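The inference step can be sketched as a collapsed Gibbs sampler; the corpus, hyperparameters, and iteration count below are toy values chosen for illustration, not the tutorial's hands-on code:

```python
import numpy as np

rng = np.random.default_rng(0)

docs = [[0, 1, 2, 0], [2, 3, 3, 2], [0, 1, 0, 3]]  # word ids per document
W, K = 4, 2               # vocabulary size, number of topics
alpha, beta = 0.1, 0.01   # Dirichlet hyperparameters

# Count matrices: document-topic, topic-word, topic totals
ndz = np.zeros((len(docs), K))
nzw = np.zeros((K, W))
nz = np.zeros(K)

# Random initial topic assignment for every word occurrence
z = [[rng.integers(K) for _ in d] for d in docs]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        ndz[d, t] += 1; nzw[t, w] += 1; nz[t] += 1

for _ in range(200):                      # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]                   # remove the current assignment
            ndz[d, t] -= 1; nzw[t, w] -= 1; nz[t] -= 1
            # full conditional: p(z=t|rest) ∝ (ndz+α)(nzw+β)/(nz+Wβ)
            p = (ndz[d] + alpha) * (nzw[:, w] + beta) / (nz + W * beta)
            t = rng.choice(K, p=p / p.sum())
            z[d][i] = t
            ndz[d, t] += 1; nzw[t, w] += 1; nz[t] += 1

# Point estimates of the distributions the slides call θ and ϕ
theta = (ndz + alpha) / (ndz + alpha).sum(axis=1, keepdims=True)  # p(topic|doc)
phi = (nzw + beta) / (nzw + beta).sum(axis=1, keepdims=True)      # p(word|topic)
print(np.round(theta, 2))
```

A real run needs far more documents and sweeps; libraries such as Mallet (mentioned in the summary) implement exactly this kind of sampler at scale.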
24. Topic models: summary
• Latent Dirichlet Allocation is a generative probabilistic model which
can discover latent topics in unlabelled data
• Labelled LDA: a supervised version
• Implementations:
– Mallet
– Stanford Topic Modelling Toolbox
• Applications:
– Similarity between two documents
– Classification of texts
– Indexing of documents
25. Some references
• Latent Dirichlet allocation. D. Blei, A. Ng, and M. Jordan. Journal
of Machine Learning Research, 3:993-1022, January 2003.
• Finding Scientific Topics. Griffiths, T., & Steyvers, M. (2004).
Proceedings of the National Academy of Sciences, 101 (suppl. 1),
5228-5235.
• Semantic Characterization of Tweets Using Topic Models: A Use
Case in the Entertainment Domain. A. García-Silva, V. Rodriguez-
Doncel, O. Corcho. Int. Journal on Semantic Web and Information
Systems (IJSWIS), 9(3), 1-13 (2013)
26. Social Web: Where are the Semantics?
ESWC 2014
Miriam Fernández, Victor Rodríguez, Andrés García-Silva, Oscar Corcho
Universidad Politécnica de Madrid, Spain
Knowledge Media Institute, The Open University
27. mixture model
We observe some data generated by a mixture of distributions and we want to learn
the parameters of these distributions. A mixture model is a probabilistic model
representing a linear combination of several probability density functions (PDFs).
We don't even see the colours!
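The generative story of a mixture model can be sketched as: pick a hidden component (the unseen "colour"), then draw from that component's distribution. All weights and parameters below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

weights = [0.3, 0.7]                  # mixing proportions (hidden "colours")
means, stds = [0.0, 5.0], [1.0, 1.0]  # parameters of the two Gaussian components

def sample(n):
    """Generate n points: choose a component, then draw from it."""
    comps = rng.choice(2, size=n, p=weights)  # latent component labels
    x = rng.normal(np.take(means, comps), np.take(stds, comps))
    return x, comps

x, comps = sample(1000)
# We observe only x; the labels in `comps` stay hidden, just as topics
# stay hidden in LDA, and must be inferred from the data.
print(round(float(x.mean()), 2))
```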