ESWC 2014 Tutorial part 3
1. Social Web: Where are the Semantics?
ESWC 2014
Miriam Fernández, Victor Rodríguez,
Andrés García-Silva, Oscar Corcho
Ontology Engineering Group, UPM, Spain
Knowledge Media Institute, The Open University
2. Outline
• Part 1: Understanding Social Media
– Theory: background & applications described in this tutorial
– Hands on: data extraction from Twitter and Facebook
• Part 2: Using semantics to represent data from SNS
– Theory: Using SW to represent content, users and relations
– Hands on: applying and extending SIOC
• Part 3: Using semantics to understand social media conversations
– Theory: Using semantics to understand topics in social media
– Hands on: using LDA to extract topics from social media
• Part 4: Using semantics to understand user behaviour
3. Why do we need semantics to understand social media?
• Information overload
– We need mechanisms to support
• Better information search/recommendation
• Better information integration
• Automatic knowledge extraction
• User-generated content is generally unstructured
– Machines cannot understand this content!
ESWC 2014 Social Web: Where are the Semantics? 3
"The Semantic Web is an extension of the current Web in which
information is given well-defined meaning, better enabling computers
and people to work in cooperation."
Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web,
Scientific American, May 2001
4. Implicit vs. Explicit Semantics
• Implicit Semantics
– Implicit semantics, also called statistical semantics, focuses on
extracting word sense by studying the patterns of human word usage in
massive collections of text or other human-generated data.
– It does not rely on an explicit formalisation/conceptualisation of
knowledge
• Explicit Semantics
– Explicit semantics focuses on the analysis of content with the
support of explicit conceptualisations in the form of ontologies and
knowledge bases
5. Implicit semantics: Topic models
• Topic models: one possible way of extracting implicit semantics
6. bags of words
Document 1:          Document 2:
Word        count    Word        count
ESWC        3        ISWC        3
rank        1        rank        0
technology  2        venue       1
conference  1        semantic    5
venue       1        conference  0
semantic    5        Web         5
Web         7        knowledge   0
knowledge   5        ...
...
7. term-document matrix
• Term-document matrix
– A very large, sparse matrix
– A document can be seen as a vector
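As a sketch, such a term-document matrix can be built from raw counts; the two toy documents below are invented for illustration:

```python
from collections import Counter

# Two toy documents (invented for illustration)
docs = {
    "d1": "semantic web conference rank technology",
    "d2": "semantic web venue knowledge",
}

# Vocabulary: sorted union of all words seen in the corpus
vocab = sorted({w for text in docs.values() for w in text.split()})

# Rows = terms, columns = documents; each entry is a raw term frequency,
# so each column is the document's vector in term space
matrix = {term: [Counter(text.split())[term] for text in docs.values()]
          for term in vocab}

for term, counts in matrix.items():
    print(term, counts)
```

Most entries are zero in a real corpus, which is why sparse representations are used in practice.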
8. similarity measures
• Useful to answer… how similar are two documents?
– Distance measures between two documents
• Cosine similarity = A·B / (‖A‖‖B‖) = 0.72
• Jaccard index = |A ∩ B| / |A ∪ B| = 0.50
• However:
– Synonyms make related documents appear far apart when they are not
– Polysemous words make unrelated documents appear close when they are not
• What is a document talking about?
– "Explicit Semantic Analysis" (ESA)
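Both measures can be sketched in a few lines; the example vectors and word sets below are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def jaccard(s, t):
    """Jaccard index between two sets of words."""
    return len(s & t) / len(s | t)

# Two toy documents as word sets: they share 2 of 4 distinct words
doc_a = {"semantic", "web", "conference"}
doc_b = {"semantic", "web", "venue"}
print(jaccard(doc_a, doc_b))  # → 0.5
```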
9. info retrieval, text mining
• Classification. Documents may belong to different classes
• How relevant is a word for a document (or class of documents)?
TF-IDF(x, y) = tf(x, y) × log(D / Dx)
tf(x, y) = frequency of term x in document y
D = total number of documents
Dx = number of documents containing x
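A minimal sketch of the TF-IDF formula above, with an invented three-document corpus:

```python
import math
from collections import Counter

# Invented three-document corpus
corpus = [
    "semantic web conference",
    "semantic web technology",
    "knowledge base",
]

def tfidf(term, doc, corpus):
    """tfidf(x, y) = tf(x, y) * log(D / Dx), as on the slide."""
    tf = Counter(doc.split())[term]                   # frequency of x in y
    D = len(corpus)                                   # number of documents
    Dx = sum(1 for d in corpus if term in d.split())  # documents containing x
    return tf * math.log(D / Dx) if Dx else 0.0

# "semantic" appears in 2 of 3 documents -> low weight;
# "conference" appears in only 1 -> higher weight
print(tfidf("semantic", corpus[0], corpus))
print(tfidf("conference", corpus[0], corpus))
```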
10. latent semantic analysis
Figure taken from http://faculty.washington.edu/jwilker/559/SteyversGriffiths.pdf
• The term-document matrix is large
• Latent Semantic Analysis uses Singular Value Decomposition (SVD) to
reduce dimensionality
• The rank of D can be reduced
• Meaning of the factors:
– U = term-topic correlation
– D = topic importance
– V = document-topic correlation
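The SVD step can be sketched with NumPy; the matrix values below are illustrative, not taken from the figure:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents
X = np.array([[3., 3.],
              [1., 0.],
              [5., 5.],
              [7., 5.]])

# X = U D V^T: U is term-topic, D (here the vector s) is topic
# importance, V is document-topic
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 1  # keep only the strongest "topic" to reduce the rank
X_k = U[:, :k] * s[:k] @ Vt[:k, :]   # rank-k reconstruction of X

print(np.round(X_k, 2))
```

Documents (columns of V) can then be compared in the reduced k-dimensional space instead of the full term space.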
11. Semantics of a social media message
13. discriminative models / generative models
• Discriminative Models (1-step)
1. Directly infer posterior probabilities p(Ck|x)
• Generative Models (2-steps)
1. Infer class-conditional densities p(x|Ck) and priors p(Ck)
2. Use Bayes theorem to determine posterior probabilities
p(C1|x) = p(x|C1) p(C1) / [ p(x|C1) p(C1) + p(x|C2) p(C2) ]
We can generate x that are
likely to have been produced by
class C1
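A toy illustration of the two-step generative recipe, assuming Gaussian class-conditional densities and equal priors (all values invented):

```python
import math

def gauss(x, mu, sigma):
    """Gaussian density, used here as the class-conditional p(x|Ck)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

priors = {"C1": 0.5, "C2": 0.5}            # step 1: priors p(Ck)
params = {"C1": (0.0, 1.0), "C2": (3.0, 1.0)}  # (mean, std) per class

def posterior(c, x):
    """Step 2: Bayes' theorem, matching the formula on the slide."""
    num = gauss(x, *params[c]) * priors[c]
    den = sum(gauss(x, *params[k]) * priors[k] for k in priors)
    return num / den

print(round(posterior("C1", 0.5), 3))  # x near C1's mean -> high posterior
```

Because the class-conditionals are modelled explicitly, the same `gauss` densities could also be sampled to generate new x likely to come from C1.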
15. LDA: a probabilistic generative model
This is a Probabilistic
Generative Process: we can
generate documents according
to certain topics.
18. Topic models
Topics known a priori vs. latent topics
• We don’t know the topics in
advance
• We don’t know the importance
of each word in a topic
Latent topics are not pre-specified but found from the corpus
19. topic models vs LSA
Figure taken from http://faculty.washington.edu/jwilker/559/SteyversGriffiths.pdf
Singular Value Decomposition reduces dimensionality
• Latent Semantic Analysis vs Topic Model
20. topic models
How important is a word in a topic?
How important is a topic in a document?
21. topic models: LDA
D documents, using a
total of W words
K topics
LDA: each document d is a mixture of K topics, with each
topic being a multinomial distribution over a vocabulary of W words
θd : topic distribution for a document d (~Dirichlet(α))
ϕz : word probabilities for a topic z (~Dirichlet(β))
• Latent Dirichlet Allocation (LDA)
22. Topic models
LDA: each document d is a mixture of K topics, with each
topic being a multinomial distribution over a vocabulary of W words
Probability of a word: p(w|d) = Σz p(w|z) p(z|d) = Σz ϕz(w) θd(z)
θd : topic distribution for a document d (~Dirichlet(α))
ϕz : word probabilities for a topic z (~Dirichlet(β))
• Latent Dirichlet Allocation (LDA)
Reminder: the Gamma function, Γ(n) = (n−1)! for positive integers n, appears
in the normalisation constant of the Dirichlet distribution
23. Topic models
• Joint probability distribution over all the random variables of a
document, assuming α and β are known
• Given a set of documents D and the LDA model, we can use
inference to find θ, ϕ and the topic assignment for each word
• The problem is intractable, but it can be approximated with Gibbs
sampling, a Markov chain Monte Carlo (MCMC) method
The joint probability multiplies, for each word in the document: the
probability of the word in the document, knowing the distribution of words
in a topic (ϕ); the probability of a topic in the document, knowing the
topic distribution for the document (θ); and the Dirichlet priors declared
before.
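The inference step can be sketched as a collapsed Gibbs sampler; the corpus, hyperparameters, and iteration count below are toy values chosen for illustration, not the tutorial's hands-on code:

```python
import numpy as np

rng = np.random.default_rng(0)

docs = [[0, 1, 2, 0], [2, 3, 3, 2], [0, 1, 0, 3]]  # word ids per document
W, K = 4, 2               # vocabulary size, number of topics
alpha, beta = 0.1, 0.01   # Dirichlet hyperparameters

# Count matrices: document-topic, topic-word, topic totals
ndz = np.zeros((len(docs), K))
nzw = np.zeros((K, W))
nz = np.zeros(K)

# Random initial topic assignment for every word occurrence
z = [[rng.integers(K) for _ in d] for d in docs]
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        ndz[d, t] += 1; nzw[t, w] += 1; nz[t] += 1

for _ in range(200):                      # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]                   # remove the current assignment
            ndz[d, t] -= 1; nzw[t, w] -= 1; nz[t] -= 1
            # full conditional: p(z=t|rest) ∝ (ndz+α)(nzw+β)/(nz+Wβ)
            p = (ndz[d] + alpha) * (nzw[:, w] + beta) / (nz + W * beta)
            t = rng.choice(K, p=p / p.sum())
            z[d][i] = t
            ndz[d, t] += 1; nzw[t, w] += 1; nz[t] += 1

# Point estimates of the distributions the slides call θ and ϕ
theta = (ndz + alpha) / (ndz + alpha).sum(axis=1, keepdims=True)  # p(topic|doc)
phi = (nzw + beta) / (nzw + beta).sum(axis=1, keepdims=True)      # p(word|topic)
print(np.round(theta, 2))
```

A real run needs far more documents and sweeps; libraries such as Mallet (mentioned in the summary) implement exactly this kind of sampler at scale.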
24. Topic models: summary
• Latent Dirichlet Allocation is a generative probabilistic model which
can discover latent topics in unlabelled data
• Labelled LDA: a supervised version
• Implementations:
– Mallet
– Stanford Topic Modelling Toolbox
• Applications:
– Similarity between two documents
– Classification of texts
– Indexing of documents
25. Some references
• Latent Dirichlet allocation. D. Blei, A. Ng, and M. Jordan. Journal
of Machine Learning Research, 3:993-1022, January 2003.
• Finding Scientific Topics. Griffiths, T., & Steyvers, M. (2004).
Proceedings of the National Academy of Sciences, 101 (suppl. 1),
5228-5235.
• Semantic Characterization of Tweets Using Topic Models: A Use
Case in the Entertainment Domain. A. García-Silva, V. Rodriguez-
Doncel, O. Corcho. Int. Journal on Semantic Web and Information
Systems (IJSWIS), 9(3), 1-13 (2013)
26. Social Web: Where are the Semantics?
ESWC 2014
Miriam Fernández, Victor Rodríguez, Andrés García-Silva, Oscar Corcho
Universidad Politécnica de Madrid, Spain
Knowledge Media Institute, The Open University
27. mixture model
We observe some data generated by a mixture of distributions and we want to learn
the parameters of these distributions. A mixture model is a probabilistic model
representing a linear combination of several probability density functions (PDFs).
We don't even see the colours!
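The generative story of a mixture model can be sketched as: pick a hidden component (the unseen "colour"), then draw from that component's distribution. All weights and parameters below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

weights = [0.3, 0.7]                  # mixing proportions (hidden "colours")
means, stds = [0.0, 5.0], [1.0, 1.0]  # parameters of the two Gaussian components

def sample(n):
    """Generate n points: choose a component, then draw from it."""
    comps = rng.choice(2, size=n, p=weights)  # latent component labels
    x = rng.normal(np.take(means, comps), np.take(stds, comps))
    return x, comps

x, comps = sample(1000)
# We observe only x; the labels in `comps` stay hidden, just as topics
# stay hidden in LDA, and must be inferred from the data.
print(round(float(x.mean()), 2))
```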