Session 19: Social Media II
Presenter: Mitsuo Yamamoto (Denso IT Laboratory)
[ICDE 2013 Study Group]
Figures in these slides are quoted from the papers.
Saturday, June 29, 2013
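Papers presented
- (1) A Unified Model for Stable and Temporal Topic Detection from Social Media Data
  - Detects topics in social media content while taking into account whether each topic is temporal or stable (permanent).
- (2) Crowdsourced Enumeration Queries (Best Paper)
  - Estimates the size of the answer set (the population) for crowdsourced search tasks.
  - Applies a species-richness estimation method from biostatistics (Chao92).
- (3) On Incentive-based Tagging
  - Improves the quality of tag data by giving incentives to workers.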
A Unified Model for Stable and Temporal Topic Detection from Social Media Data
- [Goal] Topic extraction that takes both stable topics and temporal topics into account
- Definitions of stable topic and temporal topic
  - Stable topic: a theme that someone is always talking about
  - Temporal topic: a theme whose number of mentions rises or falls sharply over time, usually driven by a real-life event
important and useful to distinguish temporal topics from stable
topics in social media. However, such a discrimination is very
challenging because the user-generated texts in social media are
very short in length and thus lack useful linguistic features for
precise analysis using traditional approaches.
In this paper, we propose a novel solution to detect both
stable and temporal topics simultaneously from social media data.
Specifically, a unified user-temporal mixture model is proposed
to distinguish temporal topics from stable topics. To improve this
model’s performance, we design a regularization framework that
exploits prior spatial information in a social network, as well
as a burst-weighted smoothing scheme that exploits temporal
prior information in the time dimension. We conduct extensive
experiments to evaluate our proposal on two real data sets
obtained from Del.icio.us and Twitter. The experimental results
verify that our mixture model is able to distinguish temporal
topics from stable topics in a single detection process. Our
mixture model enhanced with the spatial regularization and
the burst-weighted smoothing scheme significantly outperforms
competitor approaches, in terms of topic detection accuracy and
discrimination in stable and temporal topics.
I. INTRODUCTION
User-generated contents (UGC) in Web 2.0 are valuable
resources capturing people’s interests, thoughts and actions.
Such contents cover a wide variety of topics that present
online and offline lives. For example, the microblog services
gather many short but quickly-updated texts that contain both
temporal and stable topics. Such topics form a huge and rich
repository of various kinds of interesting information.
Stable topics are often on users’ regular interests and their
daily routine discussions, which usually evolve at a rather
slow speed. The extraction of such stable topics enables us to
personalize the results and to improve the result relevance in
many applications such as computational advertising, content
targeting, personal recommendation and web search.
In contrast, temporal topics are on popular real-life events
or hot spots. In many circumstances, temporal topics, e.g.,
breaking events in the real world, bring about popular discus-
sion and wide diffusion on the Internet, where social networks
further boost the discussion and diffusion. Take Twitter, the
most popular microblog service, as an example. Many social
events can be discovered in Twitter’s posts (tweets), such
illustrated in Figure 1. We can tell the difference between
them from the temporal distributions and the description
keywords. A temporal topic has its text related to a certain
event like “Independence Day celebration” in a certain period
of time, and its popularity goes through a sharp increase at the
occurring time of the event. A stable topic has its description
on user’s regular interest like “Pet Adoption” and its temporal
distribution exhibits no sharp, spike-like fluctuation.
Fig. 1. Stable and Temporal Topics in Twitter
It is important and useful to distinguish the temporal topics
from the stable topics since they convey different kinds of
information. However, temporal topics are discussed with less
urgent themes in the background, and therefore temporal topics
are deeply mixed with stable topics in social media. As a
result, it is a challenging problem to detect and differenti-
ate temporal and stable topics from large amounts of user-
generated social media data.
Research on traditional topic detection and tracking employs
on-line incremental clustering [1] or retrospective off-line clus-
tering [25] for documents and extracts representative features
for clusters as a summary of the events. These methods are
suitable for conventional web pages where most documents are
long, rich in keywords, and related to certain popular events.
[Approach]
niques for topic detection from social media data [12].
IV. A MIXTURE MODEL FOR DETECTING STABLE AND
TEMPORAL TOPICS
In this section, we propose a user-temporal mixture topic
model that integrates user and temporal features, followed by
an EM-based algorithm for inferring model parameters.
A. User-Temporal Model
TABLE I: NOTATIONS
Symbol        Description
u, t, w       user, time stamp, keyword
U, T, W       sets of users, time stamps, and keywords
M[u, t, w]    frequency of w used by u within time stamp t
λU, λT        parameters controlling the branch selection
θi            stable topic indexed by i
θj            temporal topic indexed by j
ΘU, ΘT        stable and temporal topic sets
In the user-temporal mixture model, we pre-categorize
topics into two types: stable topics and temporal topics.
Stable topics summarize theme reflected from regular postings
according to the stable interest of a user or a community. While
temporal topics capture the popular events or controversial
news igniting hot discussion in a certain period. In this model,
we aim to detect both temporal and stable topics in one
generating process. Table I lists the relevant notations we use.
The mixture model is represented in Equation 1. For a stable
topic θi we pay particular attention to its user u who generates
it. For a temporal topic θj we pay more attention to when,
indicated by time t, it is generated. Like PLSA [10], [11], our
user-temporal model consists of three layers and two branches
mixing user and temporal features, each branch deciding a
different topic type. Parameters λU and λT in Equation 1
are the probability coefficients controlling the branch choice,
which also denote the proportions of stable and temporal topics
in the data set.
\[ p(w \mid u, t) = \lambda_U \sum_{\theta_i \in \Theta_U} p(\theta_i \mid u)\, p(w \mid \theta_i) + \lambda_T \sum_{\theta_j \in \Theta_T} p(\theta_j \mid t)\, p(w \mid \theta_j) \tag{1} \]
For the user branch, a stable topic is chosen according to the
interest of a particular user. For the time branch, a temporal
topic is generated according to the time stamp of a post, which
means the post belongs to the topics that are popular for a
short period of time around that time stamp. Temporal topics
have their distribution on the time dimension, which indicates
its popularity probabilities. The time period during which a
temporal topic has its highest probability is its popularity
period. In our setting, the user interest is assumed to be stable
through time, and we ignore the possible slight evolution of
user interest.
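As a rough illustration of Equation 1, the sketch below evaluates p(w|u, t) for every (user, time stamp, keyword) triple with NumPy. The function and variable names, the matrix shapes, and the random inputs are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def random_rows(rng, rows, cols):
    """Return a (rows, cols) matrix whose rows are probability distributions."""
    x = rng.random((rows, cols)) + 1e-3
    return x / x.sum(axis=1, keepdims=True)

def mixture_word_prob(p_stable_given_user, p_word_given_stable,
                      p_temporal_given_time, p_word_given_temporal,
                      lambda_u, lambda_t):
    """Evaluate Equation 1 and return p(w|u,t) with shape (U, T, W)."""
    # User branch: sum_i p(theta_i|u) p(w|theta_i), shape (U, W).
    user_branch = p_stable_given_user @ p_word_given_stable
    # Time branch: sum_j p(theta_j|t) p(w|theta_j), shape (T, W).
    time_branch = p_temporal_given_time @ p_word_given_temporal
    # Broadcast the two branches over the missing axis and mix them.
    return (lambda_u * user_branch[:, None, :]
            + lambda_t * time_branch[None, :, :])

# Hypothetical toy sizes: 3 users, 4 time stamps, 5 keywords,
# 2 stable topics, and 3 temporal topics.
rng = np.random.default_rng(0)
p = mixture_word_prob(
    random_rows(rng, 3, 2), random_rows(rng, 2, 5),
    random_rows(rng, 4, 3), random_rows(rng, 3, 5),
    lambda_u=0.6, lambda_t=0.4)
print(p.shape)
print(np.allclose(p.sum(axis=2), 1.0))
```

When λU + λT = 1 and every input row is a distribution, each p(·|u, t) is itself a distribution over keywords, which is what the final check confirms.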
- p(w|u, t): the probability that user u mentions word w at time t
- In short, stable topics depend on the person, and temporal topics depend on time.
to that of their keywords.
The topics generated in the two branches are not estimated
individually. Both types of topics interact with each other
during the learning procedure. This two-branch assumption
can filter out the stable components from burst topics by stable
branch. It also helps refine the quality of stable topics without
disturbance from breaking events as time elapses.
B. Estimation of Model Parameters
Given an observation matrix M(U, T, W), the learning
procedure of our model is to estimate the maximum probability
of generating the observed samples. The log-likelihood of the
whole document collection C by our approach is in Equation
2, where p(w|u, t) is defined according to Equation 1.
\[ L(C) = \sum_{u \in U} \sum_{t \in T} \sum_{w \in W} M[u, t, w] \log p(w \mid u, t) \tag{2} \]
The goal of parameter estimation is to maximize Equation
2. As this equation cannot be solved directly by applying
Maximum Likelihood Estimation (MLE), we apply an EM
approach instead. In an expectation (E) step of the EM
- Log-likelihood of the user-time-associated document collection C
- Applying the EM algorithm yields the stable topics and the temporal topics.
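The excerpt does not show the paper's E-step and M-step formulas, so the following is only a minimal sketch of a standard PLSA-style EM loop for the two-branch mixture of Equation 1, maximizing the log-likelihood of Equation 2. It assumes λU and λT are held fixed, and it omits the spatial regularization and burst-weighted smoothing; every name in it is hypothetical.

```python
import numpy as np

def normalize(x, axis):
    """Scale x so that it sums to 1 along `axis` (guarding against zero sums)."""
    total = x.sum(axis=axis, keepdims=True)
    return x / np.maximum(total, 1e-12)

def word_prob(A, B, C, D, lam_u, lam_t):
    """Equation 1. A: p(theta_i|u), B: p(w|theta_i), C: p(theta_j|t), D: p(w|theta_j)."""
    return lam_u * (A @ B)[:, None, :] + lam_t * (C @ D)[None, :, :]

def log_likelihood(M, P):
    """Equation 2, skipping cells with zero observed frequency."""
    mask = M > 0
    return float(np.sum(M[mask] * np.log(P[mask])))

def em_fit(M, n_stable, n_temporal, lam_u=0.5, n_iter=30, seed=0):
    """Fit the mixture to the count tensor M[u, t, w] with lambda fixed (an assumption)."""
    rng = np.random.default_rng(seed)
    n_users, n_times, n_words = M.shape
    lam_t = 1.0 - lam_u
    A = normalize(rng.random((n_users, n_stable)) + 1e-3, 1)
    B = normalize(rng.random((n_stable, n_words)) + 1e-3, 1)
    C = normalize(rng.random((n_times, n_temporal)) + 1e-3, 1)
    D = normalize(rng.random((n_temporal, n_words)) + 1e-3, 1)
    history = []
    for _ in range(n_iter):
        P = word_prob(A, B, C, D, lam_u, lam_t)
        history.append(log_likelihood(M, P))
        # E-step folded into sufficient statistics: R = M / p(w|u,t).
        R = np.divide(M, P, out=np.zeros_like(P), where=M > 0)
        Ru = R.sum(axis=1)  # (U, W), marginalized over time
        Rt = R.sum(axis=0)  # (T, W), marginalized over users
        # M-step: expected counts per topic, renormalized to distributions.
        # The constant factors lam_u and lam_t cancel in the normalization.
        new_A = normalize(A * (Ru @ B.T), 1)
        new_B = normalize(B * (A.T @ Ru), 1)
        new_C = normalize(C * (Rt @ D.T), 1)
        new_D = normalize(D * (C.T @ Rt), 1)
        A, B, C, D = new_A, new_B, new_C, new_D
    return A, B, C, D, history

# Hypothetical toy data: 4 users, 5 time stamps, 6 keywords.
M = np.random.default_rng(1).poisson(2.0, size=(4, 5, 6)).astype(float)
A, B, C, D, history = em_fit(M, n_stable=2, n_temporal=2, n_iter=15)
print(history[0], history[-1])
```

Because each M-step maximizes the expected complete-data log-likelihood for the parameters being updated, the values recorded in `history` should not decrease from one iteration to the next.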
- Special smoothing
- Countermeasure for burst words
An example of two kinds of words is shown in Figure 2.
Three burst words “mj”, “moonwalk” and “michaeljackson”
have their distribution curves with sharp spikes. We can
see that although the trends of these words do not always
synchronize, they all go through a drastic increase and reach
peaks in July 2009. The bursts in their curve are ignited by
a real life event, i.e., Michael Jackson’s death. An effective
topic model should capture these words into one topic.
On the other hand, abstract words like “news” and “world”
maintains high occurrences throughout the year in Figure 2
but they convey little information. Although they are relevant
to the event in July, they also have relationships to many other
topics. For example, word “news” could be used to represent
various different news. However, such abstract words shadow
the spikes of more meaningful words. The high occurrences of
such abstract words during the burst period of the burst words
may overwhelm the latter and render them unnoticed.
Fig. 2. Normalized Word Frequency Distribution on “Michael Jackson’s
Death” in 2009
To boost interesting temporal topics, we propose a smooth-
ing technique that merges correlated words into one temporal
- Measure the bursty degree (Yao et al., ICDE 2010) and apply a correction.
- When friends are excited about the same topic, apply a correction.
- Results
Twitter data set by hiring 3 volunteers as annotators. For each
topic, we extracted keywords with the highest probabilities
to represent its content. Each topic was labeled by two
different annotators, and if they disagreed a third annotator was
introduced. Three exclusive labels were provided to indicate
the quality of temporal topic detection.
• Excellent: a nicely presented temporal topic
• Good: a topic containing bursty features
• Poor: a topic without obvious bursty features
TABLE III: COMPARISON ON TEMPORAL TOPIC QUALITY
Method                 Excellent  Good   Poor
EUTB                   42.5%      32.5%  25%
TOT                    10%        40%    50%
Individual Detection   20%        37.5%  42.5%
TimeUserLDA            29.5%      38%    32.5%
Twitter-LDA            13.5%      39%    47.5%
The labeling results are summarized in Table III. Up to 75%
of the temporal topics detected by EUTB were labeled as “Ex-
cellent” or “Good”, and 42.5% were regarded as “Excellent”.
Among all competitors, TimeUserLDA performs best. 67.5%
of the detected temporal topics were judged as “Excellent”
or “Good”, and 29.5% were regarded as “Excellent”. Other
competitors got merely or slightly more than 50% of their
detected topics labeled as “Excellent” or “Good”. In particular,
the competitors got significantly less “Excellent” labels. These
results demonstrate that our proposed user-temporal mixture
TABLE IV: TOPIC “MICHAEL JACKSON” DETECTED BY DIFFERENT APPROACHES
PLSA on slices   Individual Detection   TOT model        EUTB             TimeUserLDA
latest           michaeljackson         news             michaeljackson   news
headline         july                   world            jackson          jackson
news             breaking               breaking         mj               michael
investigative    news                   jackson          moonwalk         michaeljackson
michaeljackson   headline               michaeljackson   death            death
event            investigative          death            news             investigative
Topic  Period                Top keywords (probability)
T77    2009.1.12-2009.1.31   obama 0.144, inauguration 0.106
T78    2009.6.15-2009.6.27   moon 0.090, space 0.068
T87    2009.4.24-2009.5.6    flu 0.158, swineflu 0.124
T89    2009.5.27-2009.6.6    google 0.061, googlewave 0.059
T60    2009.1.24-2009.1.27   droid 0.125, go 0.113
T71    2009.1.1-2009.1.6     2008 0.099, webcomics 0.046
- In the user-interview evaluation, topic extraction by the proposed method (EUTB) was rated highly.
Crowdsourced Enumeration Queries
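- Search over a conventional database
  - The data stored in the database is everything there is (closed-world assumption).
- Search using crowdsourcing
  - The data lives on the web or in people's heads.
  - The size of the population is unknown.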
based on the CROWD annotations and optional fre
tations of columns and tables in the schema. Fig
an example HTML-based UI that would be pre
worker for the following crowd table definition:
CREATE CROWD TABLE ice_cream_flavor {
name VARCHAR PRIMARY KEY
}
Although CrowdDB supports alternate user inte
showing previously received answers), this pape
a pure form of the “getting it all” question.
alternative UIs is the subject of future work.
During query processing, the system automat
one or more HITs using the AMT web service A
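- Content
  - Estimates the size of the answer set for crowdsourced search tasks.
  - Applies a species-richness estimation method from biostatistics (Chao92).
- What is species-richness estimation?
  - Survey the individuals in a particular region and estimate the number of species and their density.
  - A well-known application of this approach is estimating the number of dinosaur species on Earth (figure).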
Estimating the diversity of dinosaurs
(Steve C. Wang and Peter Dodson )
- Many estimators exist, depending on whether they work at the species level, account for divergence between individuals, account for evenness, and so on.
  - Chao84 estimator
  - Chao92 estimator (the one used in this paper)
    - Uses the notion of sample coverage
Chao84 Estimator
Chao develops a simple estimator for species richness
based solely on the number of rare species found in a sample:
\[ \hat{N}_{chao84} = c + \frac{f_1^2}{2 f_2} \]
Chao found that it actually is a lower bound, but it performs
well on her test data sets. She also found that the
estimator works best when there are relatively rare species,
which is often the case in real species estimation scenarios.
Chao92 Estimator
Chao develops another estimator based on the notion of
sample coverage. The sample coverage C is the sum
of the probabilities p_i of the observed classes.
for the ice cream flavors. In this paper we clea
ified workers’ answers manually; other work h
techniques for crowd-based verification [24, 1, 1
Figure 5(a-c) shows the average cardinality es
time, i.e., for increasing numbers of HITs, for th
UN countries, and ice cream flavors using th
estimators. Error bars can be computed using v
mators provided in [8, 6], however we omit the
readability. The horizontal line indicates the t
ity if it is known. Below each graph, a tab
“f1-ratio” and the actual number of received u
over time. We define f1-ratio as f1/
P
i fi, th
the singletons as compared to the overall rec
items. Recall that the presence of singletons is
dicator that there are more undetected items;
are relatively few singletons, we have likely app
plateau of the SAC. The f1-ratio can be used
tion of whether or not the sample size is su cie
cardinality estimation. Since estimators use the
quencies of f1 compared to the other fi’s, a
will make it more di cult for the estimators
Also note that the ratio between the unique it
predicted cardinality is the completeness estim
c: the number of observed species
f1: the number of species observed exactly once
f2: the number of species observed exactly twice
C: sample coverage (the sum of the probabilities p_i of the observed species)
Since the underlying distribution p_1 ... p_N is unknown, this
estimate from the Good-Turing estimator [20] is used:
\[ \hat{C} = 1 - f_1 / n \]
The Chao92 estimator attempts to explicitly characterize
and incorporate the skew of the underlying distribution using
the coefficient of variance (CV), denoted γ, a metric
that can be used to describe the variance in a probability
distribution [8]; we can use the CV to compare the skew
of different class distributions. The CV is defined as the
standard deviation divided by the mean. Given the p_i's
(p_1 ... p_N) that describe the probability of the i-th class being
selected, with mean \bar{p} = \sum_i p_i / N = 1/N, the CV is
expressed as
\[ \gamma = \left[ \sum_i (p_i - \bar{p})^2 / N \right]^{1/2} / \bar{p} \]
[8]. A higher CV
indicates higher variance amongst the p_i's, while a CV of 0
indicates that each item is equally likely.
The true CV cannot be calculated without knowledge of
the p_i's, so Chao92 uses an estimate, \hat{\gamma}:
\[ \hat{\gamma}^2 = \max\left\{ \frac{c}{\hat{C}} \cdot \frac{\sum_i i(i-1) f_i}{n(n-1)} - 1,\; 0 \right\} \tag{2} \]
The estimator that uses the coefficient of variance is below;
note that if \hat{\gamma}^2 = 0 (i.e., indicating a uniform distribution),
the estimator reduces to c / \hat{C}:
\[ \hat{N}_{chao92} = \frac{c}{\hat{C}} + \frac{n(1 - \hat{C})}{\hat{C}} \hat{\gamma}^2 \]
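To make the two estimators concrete, here is a small sketch that computes them from a list of received answers. It takes n to be the total number of answers and f_i the number of distinct items seen exactly i times; the function names and the fallbacks for undefined cases (f2 = 0, or zero coverage) are assumptions of this example, not something the slides specify.

```python
from collections import Counter

def frequency_counts(answers):
    """Map i -> f_i, the number of distinct items observed exactly i times."""
    per_item = Counter(answers)
    return Counter(per_item.values())

def chao84(answers):
    """Chao84 estimate: c + f1^2 / (2 * f2)."""
    f = frequency_counts(answers)
    c = sum(f.values())              # number of distinct observed items
    f1, f2 = f.get(1, 0), f.get(2, 0)
    if f2 == 0:
        return float(c)              # assumed fallback: formula undefined
    return c + f1 * f1 / (2 * f2)

def chao92(answers):
    """Chao92 estimate using sample coverage and the estimated CV."""
    f = frequency_counts(answers)
    n = len(answers)                 # total number of answers
    c = sum(f.values())
    coverage = 1 - f.get(1, 0) / n   # Good-Turing estimate of C
    if n < 2 or coverage == 0:
        return float("inf")          # assumed fallback: every item is a singleton
    pair_sum = sum(i * (i - 1) * fi for i, fi in f.items())
    gamma2 = max((c / coverage) * pair_sum / (n * (n - 1)) - 1, 0.0)
    return c / coverage + n * (1 - coverage) / coverage * gamma2

answers = list("AABBBCD")            # A twice, B three times, C and D once
print(chao84(answers))
print(chao92(answers))
```

For this toy input c = 4, f1 = 2, and f2 = 1, so Chao84 gives 4 + 4/2 = 6, while Chao92 stays closer to the observed count because the estimated CV is small.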
Experimental Results
an over 25,000 HITs on AMT to compare the perfor-
Also note that the ratio betw
predicted cardinality is the c
3.3.1 US States
For the US states (Figure
fairly well; Chao92 remains
Chao84. The estimates are
the true value even earlier.
all fifty states are acquired (o
may be be surprising that th
as well as it does, as one mi
would be more commonly c
a few explanations for this
age coe cient of variance ˆ
0.53; in [8], Chao notes that
reasonable for  0.5. Fu
typically do not submit the
samples drawn without repla
tribution will result in a less
original. We discuss samplin
in Section 4. Individual work
di↵erent skewed distribution
states before those in the mi
3.3.2 UN Countries
- Others: various methods exist, such as the Abundance-based Coverage Estimator.
- We tried estimating the number of countries with Chao92.
- Cause of the poor estimate: worker behavior
Fig. 4. Estimated Cardinality (x-axis: # answers; y-axis: Chao92 estimate; curves: actual, expected)
Fig. 5. Sampling Process. (a) Database Sampling: a sampling process with replacement over items A B C D E F G H I J K..., e.g., (A, B, C, D, F, A, G, B, A, ...). (b) Crowd Based Sampling: a worker arrival process feeds per-worker processes that sample without replacement, e.g., (A, B, G, H, F, I, A, E, E, K, ...).
workers complete different amounts of work and arrive/depart
from the experiment at different points in time.
The next subsection formalizes a model of how answers
arrive from the crowd in response to a set enumeration query,
as well as a description of how crowd behaviors impact
the sample of answers received. We then use simulation to
demonstrate the principles of how these behaviors play off
one another and thereby influence an estimation algorithm.
B. A Model for Human Enumerations
Species estimation algorithms assume a with-replacement
sample from some unknown distribution describing item likeli-
hoods (visualized in Figure 5(a)). The order in which elements
1) Sampling Without Replacement: When a worker submits
multiple items for a set enumeration query, each answer is
different from his previous ones. In other words, individuals
are sampling without replacement from some underlying dis-
tribution that describes the likelihood of selecting each answer.
Of course, this behavior is beneficial with respect to the goal of
acquiring all the items in the set, as low-probability items be-
come more likely after the high-probability items have already
been provided by that worker (we do not pay for duplicated
work from a single worker). A negative side effect of workers
sampling without replacement is that the estimation algorithm
receives less information about the relative frequency of items,
and thus the skew, of the underlying data distribution; having
- It does not work well.
- In species estimation, samples are drawn from a distribution whose item likelihoods are unknown.
- In human enumeration, samples (answers) are drawn according to some underlying item distribution.
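- The existence of streakers
  - Some respondents submit a complete answer set (streakers).
  - They characteristically sample without replacement; as a result, the estimate ends up larger than the true value.
  - Verified by adding a streaker who answers all 200 items at the start.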
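The overestimation caused by a streaker can be reproduced with a tiny deterministic example. The counts below are made up for illustration (they are not the paper's simulation): a crowd that has reported 50 items ten times each, plus one hypothetical streaker who reports all 200 items exactly once. The Chao92 helper is a compact restatement of the formula so that the block runs on its own.

```python
from collections import Counter

def chao92_from_counts(item_counts):
    """Chao92 estimate from a mapping item -> number of times it was reported."""
    f = Counter(item_counts.values())    # f[i] = number of items seen i times
    n = sum(item_counts.values())        # total number of answers
    c = len(item_counts)                 # number of distinct items
    coverage = 1 - f.get(1, 0) / n       # assumes at least one non-singleton
    pair_sum = sum(i * (i - 1) * fi for i, fi in f.items())
    gamma2 = max((c / coverage) * pair_sum / (n * (n - 1)) - 1, 0.0)
    return c / coverage + n * (1 - coverage) / coverage * gamma2

TRUE_SIZE = 200                          # hypothetical size of the full set

# Crowd without a streaker: 50 popular items, each reported 10 times.
crowd = Counter({item: 10 for item in range(50)})
without_streaker = chao92_from_counts(crowd)

# One streaker adds every item exactly once (sampling without replacement).
with_streaker = chao92_from_counts(crowd + Counter(range(TRUE_SIZE)))

print(without_streaker)
print(with_streaker)
```

The streaker's 150 new singletons sharply lower the estimated coverage, which pushes the estimate well above the true size of 200 in this toy setting.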
Figure: [...]ation simulations illustrating the impact of worker behaviors. Panels: (b) forms of skew, (c) impact of streaker. x-axis: # answers; y-axis: Chao92 estimate; curves: ws=T/dd=T, ws=F/dd=T, ws=T/dd=F, ws=F/dd=F.
- Developed the Streaker-tolerant Estimator, an improvement on Chao92
  - Sets a tolerance range for streakers
  - Detects and removes outliers
Figure panels (x-axis: # answers; y-axis: Chao92 estimate):
(a) UN 1: Φorig = 0.14, Φnew = 0.087
(b) UN 2: Φorig = 0.11, Φnew = 0.099
(f) States 1: Φorig = 0.046, Φnew = 0.053
(g) States 2: Φorig = 0.028, Φnew = 0.024
On Incentive-based Tagging
- Of 30 million URLs, 39% have only a single tag (Fig. 1).
- Should we simply increase the number of posts? → Quality actually gets worse (Fig. 2).
- Under a limited budget, maximize the quality improvement by optimizing how incentives are allocated.
- The challenge is which budget to allocate to the workers.
- Figure labels: Selected; Quality Metric for Tag Data
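¨ The following five strategies are proposed
¨ Free Choice (FC)
¨ Round Robin (RR)
¨ Fewest Post First (FP)
¤ Prioritizes items that have not been tagged yet
¨ Most Unstable First (MU)
¤ Looks at the rfd (Relative Frequency Distribution) values and selects the most uncertain item
¨ Hybrid (FP-MU)
¨ These strategies are compared against DP (theoretically optimal solution)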
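Two of the strategies listed above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: one budget unit is assumed to buy exactly one additional post, and the function names, the `post_counts` dictionary (item to current number of posts) and the alphabetical tie-breaking are all hypothetical. MU and FP-MU are left out because the slide does not say how instability is measured from the rfd.

```python
import heapq


def fewest_post_first(post_counts, budget):
    """FP sketch: always give the next budget unit to the item with the fewest posts."""
    heap = [(count, item) for item, count in post_counts.items()]
    heapq.heapify(heap)
    allocation = {item: 0 for item in post_counts}
    for _ in range(budget):
        if not heap:
            break
        count, item = heapq.heappop(heap)
        allocation[item] += 1
        # The item now has one more post, so it re-enters the queue with count + 1.
        heapq.heappush(heap, (count + 1, item))
    return allocation


def round_robin(post_counts, budget):
    """RR sketch: hand out budget units to the items in a fixed cyclic order."""
    items = sorted(post_counts)
    allocation = {item: 0 for item in items}
    for k in range(budget):
        if not items:
            break
        allocation[items[k % len(items)]] += 1
    return allocation


# Hypothetical example: three items with 0, 3 and 1 existing posts, budget of 4.
counts = {"a": 0, "b": 3, "c": 1}
fp_allocation = fewest_post_first(counts, 4)   # concentrates on under-tagged items
rr_allocation = round_robin(counts, 4)         # spreads the budget evenly
```

In this toy input FP spends nothing on the already well-tagged item `b`, while RR gives it a share regardless of how many posts it has.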
On Incentive-based Tagging
¨ FP & FP-MU are close to optimal
¨ Budget = 1,000
¤ 0.7% more posts compared with the initial number
¤ 6.7% quality improvement
¨ Free Choice: 50% of posts are over-tagging and therefore wasted

ICDE2013勉強会 Session 19: Social Media II

  • 1. Session 19: Social Media II 担当: デンソーアイティーラボラトリ 山本 【ICDE2013勉強会】 資料中の図は論文を引用しております。 13年6月29日土曜日
  • 2. Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) 発表論文 } (1) A Unified Model for Stable and Temporal Topic Detection from Social Media Data } ソーシャルメディアにおけるコンテンツが一時的なトピックかそれ とも恒久的なトピックかを考慮した上で判定 } (2) Crowdsourced Enumeration Queries (Best Paper) } クラウドソーシングの検索タスクに対する回答集合数 (母集団)の推定. } 生物統計学における固有種数の推定手法を応用(CHAO92) } (3) On Incentive-based Tagging } tag情報の品質をインセンティブをワーカー与えることによって 向上させる。 2 13年6月29日土曜日
  • 3. } 【やりたいこと】 Stable TopicとTemporal Topic考慮した上でのトピック抽出 } Stable Topic及びTemporal Topicの定義 } Stable Topic :いつも誰かがそのテーマについて言及している } Temporal Topic: 時系列上でみて、急激にそのテーマについて言及 する回数が激増・激減するようなテーマ。通常は実生活のイベント が影響 Session 19 担当:山本光穂(デンソーアイティ Social Media II ーラボラトリ) A Unified Model for Stable and Temporal Topic Detection from Social Media Data 3 important and useful to distinguish temporal topics from stable topics in social media. However, such a discrimination is very challenging because the user-generated texts in social media are very short in length and thus lack useful linguistic features for precise analysis using traditional approaches. In this paper, we propose a novel solution to detect both stable and temporal topics simultaneously from social media data. Specifically, a unified user-temporal mixture model is proposed to distinguish temporal topics from stable topics. To improve this model’s performance, we design a regularization framework that exploits prior spatial information in a social network, as well as a burst-weighted smoothing scheme that exploits temporal prior information in the time dimension. We conduct extensive experiments to evaluate our proposal on two real data sets obtained from Del.icio.us and Twitter. The experimental results verify that our mixture model is able to distinguish temporal topics from stable topics in a single detection process. Our mixture model enhanced with the spatial regularization and the burst-weighted smoothing scheme significantly outperforms competitor approaches, in terms of topic detection accuracy and discrimination in stable and temporal topics. I. INTRODUCTION User-generated contents (UGC) in Web 2.0 are valuable resources capturing people’s interests, thoughts and actions. Such contents cover a wide variety of topics that present online and offline lives. For example, the microblog services gather many short but quickly-updated texts that contain both temporal and stable topics. 
Such topics form a huge and rich repository of various kinds of interesting information. Stable topics are often on users’ regular interests and their daily routine discussions, which usually evolve at a rather slow speed. The extraction of such stable topics enables us to personalize the results and to improve the result relevance in many applications such as computational advertising, content targeting, personal recommendation and web search. In contrast, temporal topics are on popular real-life events or hot spots. In many circumstances, temporal topics, e.g., breaking events in the real world, bring about popular discus- sion and wide diffusion on the Internet, where social networks further boost the discussion and diffusion. Take Twitter, the most popular microblog service, as an example. Many social events can be discovered in Twitter’s posts (tweets), such illustrated in Figure 1. We can tell the difference between them from the temporal distributions and the description keywords. A temporal topic has its text related to a certain event like “Independence Day celebration” in a certain period of time, and its popularity goes through a sharp increase at the occurring time of the event. A stable topic has its description on user’s regular interest like “Pet Adoption” and its temporal distribution exhibits no sharp, spike-like fluctuation. Fig. 1. Stable and Temporal Topics in Twitter It is important and useful to distinguish the temporal topics from the stable topics since they convey different kinds of information. However, temporal topics are discussed with less urgent themes in the background, and therefore temporal topics are deeply mixed with stable topics in social media. As a result, it is a challenging problem to detect and differenti- ate temporal and stable topics from large amounts of user- generated social media data. 
Research on traditional topic detection and tracking employs on-line incremental clustering [1] or retrospective off-line clus- tering [25] for documents and extracts representative features for clusters as a summary of the events. These methods are suitable for conventional web pages where most documents are long, rich in keywords, and related to certain popular events.13年6月29日土曜日
  • 4. [Approach] A Unified Model for Stable and Temporal Topic Detection from Social Media Data
    …techniques for topic detection from social media data [12].
    IV. A MIXTURE MODEL FOR DETECTING STABLE AND TEMPORAL TOPICS
    In this section, we propose a user-temporal mixture topic model that integrates user and temporal features, followed by an EM-based algorithm for inferring model parameters.
    A. User-Temporal Model
    TABLE I: NOTATIONS
      SYMBOL        DESCRIPTION
      u, t, w       user, time stamp, keyword
      U, T, W       set of users, time stamps and keywords
      M[u, t, w]    frequency of w used by u within time stamp t
      λ_U, λ_T      parameter controlling the branch selection
      θ_i           stable topic indexed by i
      θ_j           temporal topic indexed by j
      Θ_U, Θ_T      stable and temporal topic set
    In the user-temporal mixture model, we pre-categorize topics into two types: stable topics and temporal topics. Stable topics summarize themes reflected from regular postings according to the stable interest of a user or a community, while temporal topics capture the popular events or controversial news igniting hot discussion in a certain period. In this model, we aim to detect both temporal and stable topics in one generating process. Table I lists the relevant notations we use. The mixture model is represented in Equation 1. For a stable topic θ_i we pay particular attention to its user u who generates it. For a temporal topic θ_j we pay more attention to when, indicated by time t, it is generated. Like PLSA [10], [11], our user-temporal model consists of three layers and two branches mixing user and temporal features, each branch deciding a different topic type. Parameters λ_U and λ_T in Equation 1 are the probability coefficients controlling the branch choice, which also denote the proportions of stable and temporal topics in the data set.
    $$p(w \mid u, t) = \lambda_U \sum_{\theta_i \in \Theta_U} p(\theta_i \mid u)\, p(w \mid \theta_i) + \lambda_T \sum_{\theta_j \in \Theta_T} p(\theta_j \mid t)\, p(w \mid \theta_j) \qquad (1)$$
    ・Equation 1 is the probability that user u mentions word w at time t. In short, stable topics depend on the person and temporal topics depend on time.
    For the user branch, a stable topic is chosen according to the interest of a particular user. For the time branch, a temporal topic is generated according to the time stamp of a post, which means the post belongs to the topics that are popular for a short period of time around that time stamp. Temporal topics have their distribution on the time dimension, which indicates their popularity probabilities. The time period during which a temporal topic has its highest probability is its popularity period. In our setting, the user interest is assumed to be stable through time, and we ignore the possible slight evolution of user interest.
    Whether a keyword … a temporal topic is dec… contributions by the us… For instance, if many … certain period t, w wo… with higher probability … topics. Thus, keywords … clustered into tempora… to that of their keywords.
    The topics generated in the two branches are not estimated individually. Both types of topics interact with each other during the learning procedure. This two-branch assumption can filter out the stable components from burst topics by the stable branch. It also helps refine the quality of stable topics without disturbance from breaking events as time elapses.
    B. Estimation of Model Parameters
    Given an observation matrix M(U, T, W), the learning procedure of our model is to estimate the maximum probability of generating the observed samples. The log-likelihood of the whole document collection C by our approach is in Equation 2, where p(w|u, t) is defined according to Equation 1.
    $$L(C) = \sum_{u \in U} \sum_{t \in T} \sum_{w \in W} M[u, t, w] \log p(w \mid u, t) \qquad (2)$$
    ・Equation 2 is the log-likelihood of the user-time-associated document collection C.
    The goal of parameter estimation is to maximize Equation 2. As this equation cannot be solved directly by applying Maximum Likelihood Estimation (MLE), we apply an EM approach instead. In an expectation (E) step of the EM …
    ・Using the EM algorithm, both the stable topics and the temporal topics are obtained.
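To make Equation 1 and Equation 2 concrete, the following is a minimal sketch in plain Python that evaluates the two-branch mixture and the log-likelihood on toy data. The names `mixture_prob`, `log_likelihood`, `PARAMS`, and `M`, and all the numbers, are assumptions made for illustration; the slide does not show the paper's EM update formulas, so no parameter fitting is attempted here.

```python
import math

def mixture_prob(w, u, t, lam_u, lam_t,
                 p_topic_user, p_topic_time, p_word_stable, p_word_temporal):
    """Equation 1: p(w|u,t) as a user branch plus a time branch."""
    # User branch: stable topics theta_i chosen by the user's interest p(theta_i|u).
    stable = sum(p_topic_user[u][i] * p_word_stable[i][w]
                 for i in range(len(p_word_stable)))
    # Time branch: temporal topics theta_j chosen by the time stamp p(theta_j|t).
    temporal = sum(p_topic_time[t][j] * p_word_temporal[j][w]
                   for j in range(len(p_word_temporal)))
    return lam_u * stable + lam_t * temporal

def log_likelihood(counts, **params):
    """Equation 2: sum of M[u,t,w] * log p(w|u,t) over the observed triples."""
    return sum(m * math.log(mixture_prob(w, u, t, **params))
               for (u, t, w), m in counts.items())

# Toy parameters (hypothetical): 2 users, 2 time stamps, 3 keywords,
# 2 stable topics and 2 temporal topics.
PARAMS = dict(
    lam_u=0.6, lam_t=0.4,
    p_topic_user=[[0.9, 0.1], [0.2, 0.8]],             # p(theta_i | u)
    p_topic_time=[[1.0, 0.0], [0.3, 0.7]],             # p(theta_j | t)
    p_word_stable=[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],  # p(w | theta_i)
    p_word_temporal=[[0.1, 0.1, 0.8], [0.5, 0.4, 0.1]],  # p(w | theta_j)
)
# Sparse observation matrix M[u, t, w] (hypothetical counts).
M = {(0, 0, 0): 3, (0, 1, 2): 5, (1, 1, 1): 2}

print(mixture_prob(0, 0, 0, **PARAMS))
print(log_likelihood(M, **PARAMS))
```

Because λ_U + λ_T = 1 and every conditional distribution is normalized, p(w|u, t) sums to 1 over the keywords for each (u, t) pair, which is a quick sanity check for any parameter setting.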
  • 5. A Unified Model for Stable and Temporal Topic Detection from Social Media Data
    } Special smoothing
    } Countermeasure for burst words
    An example of two kinds of words is shown in Figure 2. Three burst words “mj”, “moonwalk” and “michaeljackson” have their distribution curves with sharp spikes. We can see that although the trends of these words do not always synchronize, they all go through a drastic increase and reach peaks in July 2009. The bursts in their curves are ignited by a real life event, i.e., Michael Jackson’s death. An effective topic model should capture these words into one topic. On the other hand, abstract words like “news” and “world” maintain high occurrences throughout the year in Figure 2 but they convey little information. Although they are relevant to the event in July, they also have relationships to many other topics. For example, the word “news” could be used to represent various different news. However, such abstract words shadow the spikes of more meaningful words. The high occurrences of such abstract words during the burst period of the burst words may overwhelm the latter and render them unnoticed.
    Fig. 2. Normalized Word Frequency Distribution on “Michael Jackson’s Death” in 2009
    To boost interesting temporal topics, we propose a smoothing technique that merges correlated words into one temporal …
    } Measure the bursty degree (Yao et al., ICDE 2010) and apply a correction.
    } Apply a correction when friends are actively discussing the same topic.
  • 6. A Unified Model for Stable and Temporal Topic Detection from Social Media Data
    } Results
    …Twitter data set by hiring 3 volunteers as annotators. For each topic, we extracted keywords with the highest probabilities to represent its content. Each topic was labeled by two different annotators, and if they disagreed a third annotator was introduced. Three exclusive labels were provided to indicate the quality of temporal topic detection.
      • Excellent: a nicely presented temporal topic
      • Good: a topic containing bursty features
      • Poor: a topic without obvious bursty features
    TABLE III: COMPARISON ON TEMPORAL TOPIC QUALITY
      Method                 Excellent   Good     Poor
      EUTB                   42.5%       32.5%    25%
      TOT                    10%         40%      50%
      Individual Detection   20%         37.5%    42.5%
      TimeUserLDA            29.5%       38%      32.5%
      Twitter-LDA            13.5%       39%      47.5%
    The labeling results are summarized in Table III. Up to 75% of the temporal topics detected by EUTB were labeled as “Excellent” or “Good”, and 42.5% were regarded as “Excellent”. Among all competitors, TimeUserLDA performs best: 67.5% of the detected temporal topics were judged as “Excellent” or “Good”, and 29.5% were regarded as “Excellent”. Other competitors got merely or slightly more than 50% of their detected topics labeled as “Excellent” or “Good”. In particular, the competitors got significantly fewer “Excellent” labels. These results demonstrate that our proposed user-temporal mixture …
    TABLE IV: TOPIC “MICHAEL JACKSON” DETECTED BY DIFFERENT APPROACHES
      PLSA on slices   Individual Detection   TOT model        EUTB             TimeUserLDA
      latest           michaeljackson         news             michaeljackson   news
      headline         july                   world            jackson          jackson
      news             breaking               breaking         mj               michael
      investigative    news                   jackson          moonwalk         michaeljackson
      michaeljackson   headline               michaeljackson   death            death
      event            investigative          death            news             investigative
    Temporal topics with their periods and top keywords (table cropped on the slide):
      Topic   Period                 Top keywords (probability)
      T77     2009.1.12-2009.1.31    obama 0.144, inauguration 0.106
      T78     2009.6.15-2009.6.27    moon 0.090, space 0.068
      T87     2009.4.24-2009.5.6     flu 0.158, swineflu 0.124
      T89     2009.5.27-2009.6.6     google 0.061, googlewave 0.059
      T60     2009.1.24-2009.1.27    droid 0.125, go 0.113
      T71     2009.1.1-2009.1.6      2008 0.099, webcomics 0.046
    } In a test based on user interviews, topic extraction with the proposed method (EUTB) was rated highly.
  • 7. Crowdsourced Enumeration Queries
    ・Search using an existing DB: the data stored in the database is all there is (closed-world assumption).
    ・Search using crowdsourcing: the data lives on the web or in people's heads, so the size of the population is unknown.
    … based on the CROWD annotations and optional fre… tations of columns and tables in the schema. Fig… an example HTML-based UI that would be pre… worker for the following crowd table definition:
      CREATE CROWD TABLE ice_cream_flavor { name VARCHAR PRIMARY KEY }
    Although CrowdDB supports alternate user inte… showing previously received answers), this pape… a pure form of the “getting it all” question. … alternative UIs is the subject of future work. During query processing, the system automat… one or more HITs using the AMT web service A…
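  • 8. Crowdsourced Enumeration Queries
    } Content
      } Estimates the size of the answer set (the population) for a crowdsourced search task.
      } Applies a method from biostatistics for estimating the number of distinct species (CHAO92).
    } What is species-number estimation?
      } Survey the individuals in a specific region and estimate the number of species and their density.
      } A well-known example of inference with this kind of method is the estimate of the number of dinosaur species on Earth (figure).
    Figure: Estimating the diversity of dinosaurs (Steve C. Wang and Peter Dodson)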
  • 9. Crowdsourced Enumeration Queries
    } Many estimators exist, differing in whether they take the species as the unit, whether they consider divergence between individuals, whether they take evenness into account, and so on.
    } CHAO84 estimator
    } CHAO92 estimator (the one used here)
      } Uses the concept of sample coverage
    } Various other methods exist, such as the Abundance-based Coverage Estimator.
    Notation: c = number of species observed; f_1 = number of species observed exactly once; f_2 = number of species observed exactly twice (in general, f_i = number of species observed exactly i times); n = total number of observations; C = sample coverage (sum of the probabilities p_i of the observed species).
    Chao84 Estimator
    …Chao develops a simple estimator for species richness based solely on the number of rare species found …:
    $$\hat{N}_{chao84} = c + \frac{f_1^2}{2 f_2}$$
    … that it actually is a lower bound, but it per… on her test data sets. She also found that the … works best when there are relatively rare species, …ten the case in real species estimation scenarios.
    Chao92 Estimator
    …Chao develops another estimator based on the no… sample coverage. The sample coverage C is the sum … probabilities p_i of the observed classes. However, since the underlying distribution p_1 … p_N is unknown, this estimate from the Good-Turing estimator [20] is used:
    $$\hat{C} = 1 - \frac{f_1}{n}$$
    The Chao92 estimator attempts to explicitly characterize and incorporate the skew of the underlying distribution using the coefficient of variance (CV), denoted γ, a metric that can be used to describe the variance in a probability distribution [8]; we can use the CV to compare the skew of different class distributions. The CV is defined as the standard deviation divided by the mean. Given the p_i's (p_1 … p_N) that describe the probability of the ith class being selected, with mean $\bar{p} = \sum_i p_i / N = 1/N$, the CV is expressed as $\gamma = \left[ \sum_i (p_i - \bar{p})^2 / N \right]^{1/2} / \bar{p}$ [8]. A higher CV indicates higher variance amongst the p_i's, while a CV of 0 indicates that each item is equally likely. The true CV cannot be calculated without knowledge of the p_i's, so Chao92 uses an estimate, $\hat{\gamma}$.
    $$\hat{\gamma}^2 = \max\left\{ \frac{c}{\hat{C}} \cdot \frac{\sum_i i(i-1) f_i}{n(n-1)} - 1,\; 0 \right\} \qquad (2)$$
    The estimator that uses the coefficient of variance is below; note that if $\hat{\gamma}^2 = 0$ (i.e., indicating a uniform distribution), the estimator reduces to $c / \hat{C}$.
    $$\hat{N}_{chao92} = \frac{c}{\hat{C}} + \frac{n(1 - \hat{C})}{\hat{C}}\, \hat{\gamma}^2$$
    Experimental results (the excerpt is cropped at the right edge on the slide; recoverable points):
      ・Workers' answers were checked manually in this paper; other work covers techniques for crowd-based verification [24, 1, …].
      ・Figure 5(a-c) shows the average cardinality estimates over time, i.e., for increasing numbers of HITs, for the US states, UN countries, and ice cream flavors. Error bars can be computed using the variance estimators provided in [8, 6] but are omitted for readability. The horizontal line indicates the true cardinality if it is known.
      ・The f1-ratio is defined as $f_1 / \sum_i f_i$, the singletons compared to the overall received unique items. The presence of singletons indicates that there are more undetected items; when there are relatively few singletons, the plateau of the SAC has likely been approached. The f1-ratio can be used as an indication of whether or not the sample size is sufficient for cardinality estimation.
      ・The ratio between the unique items and the predicted cardinality is the completeness estimate.
      ・3.3.1 US States: For the US states (Figure 5(a)), all … fairly well; Chao92 remains closer … Chao84. … average coefficient of variance $\hat{\gamma}$ … 0.53; in [8], Chao notes that … reasonable for γ … 0.5. … typically do not submit the … samples drawn without replacement … distribution will result in a less … original. We discuss sampling … in Section 4.
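The estimator formulas quoted on this slide can be checked with a few lines of plain Python. This is a sketch under stated assumptions, not the authors' code: the function names (`frequency_counts`, `chao84`, `chao92`, `f1_ratio`) and the toy answer lists are hypothetical, and n is taken to be the total number of answers received.

```python
from collections import Counter

def frequency_counts(answers):
    """Return (n, c, f): n answers, c distinct items,
    f[i] = number of distinct items observed exactly i times."""
    per_item = Counter(answers)
    return len(answers), len(per_item), Counter(per_item.values())

def chao84(answers):
    """N_chao84 = c + f1^2 / (2 * f2)."""
    _, c, f = frequency_counts(answers)
    if f[2] == 0:
        raise ValueError("Chao84 needs at least one item observed exactly twice")
    return c + f[1] ** 2 / (2 * f[2])

def chao92(answers):
    """N_chao92 = c/C + n(1 - C)/C * gamma^2, with C = 1 - f1/n (Good-Turing)."""
    n, c, f = frequency_counts(answers)
    if n < 2:
        raise ValueError("need at least two answers")
    coverage = 1 - f[1] / n
    if coverage == 0:
        raise ValueError("all items are singletons, so the coverage estimate is 0")
    pair_sum = sum(i * (i - 1) * fi for i, fi in f.items())
    gamma2 = max((c / coverage) * pair_sum / (n * (n - 1)) - 1, 0.0)
    return c / coverage + n * (1 - coverage) / coverage * gamma2

def f1_ratio(answers):
    """f1 / sum_i f_i; the denominator equals the number of distinct items c."""
    _, c, f = frequency_counts(answers)
    return f[1] / c

balanced = list("AABBBCDDE")   # n=9, c=5, f1=2, f2=2, f3=1
skewed = list("AAAAAABC")      # n=8, c=3, f1=2, f6=1
print(chao84(balanced), chao92(balanced), f1_ratio(balanced))
print(chao92(skewed))
```

For the `balanced` list the estimated γ² is clamped to 0, so Chao92 reduces to c/Ĉ = 5 / (7/9) = 45/7, matching the note above about the uniform case. For the `skewed` list γ² is positive and the estimate rises to 148/21.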
  • 10. Crowdsourced Enumeration Queries
    } Tried estimating the number of countries with Chao92.
      } It does not work well.
    } Cause: worker behavior
      ・In species estimation, samples are drawn from a distribution whose item likelihoods are unknown.
      ・In human enumeration, samples (answers) are drawn according to some underlying item distribution.
    [Fig. 4. Estimated Cardinality. Axes: # answers vs. chao92 estimate. Series: actual, expected.]
    [Fig. 5. Sampling Process. Panels: (a) Database Sampling, (b) Crowd Based Sampling. Legend: sampling process with replacement; sampling process without replacement. Labels: Worker Processes, Worker Arrival Process.]
    …workers complete different amounts of work and arrive/depart from the experiment at different points in time. The next subsection formalizes a model of how answers arrive from the crowd in response to a set enumeration query, as well as a description of how crowd behaviors impact the sample of answers received. We then use simulation to demonstrate the principles of how these behaviors play off one another and thereby influence an estimation algorithm.
    B. A Model for Human Enumerations
    Species estimation algorithms assume a with-replacement sample from some unknown distribution describing item likelihoods (visualized in Figure 5(a)). The order in which elements …
    1) Sampling Without Replacement: When a worker submits multiple items for a set enumeration query, each answer is different from his previous ones. In other words, individuals are sampling without replacement from some underlying distribution that describes the likelihood of selecting each answer. Of course, this behavior is beneficial with respect to the goal of acquiring all the items in the set, as low-probability items become more likely after the high-probability items have already been provided by that worker (we do not pay for duplicated work from a single worker). A negative side effect of workers sampling without replacement is that the estimation algorithm receives less information about the relative frequency of items, and thus the skew, of the underlying data distribution; having …
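The difference between the two sampling processes can be simulated in a few lines. The sketch below is an illustration only: the item set, the weights, the number of workers, and every function name are assumptions, not the simulation setup used in the paper. A database-style sample draws with replacement from one distribution, while each simulated worker draws without replacement and the answer stream concatenates the workers.

```python
import random
from collections import Counter

def sample_with_replacement(items, weights, n, rng):
    """Database-style sampling: n independent draws from one distribution."""
    return rng.choices(items, weights=weights, k=n)

def worker_without_replacement(items, weights, k, rng):
    """One worker: up to k distinct answers, each drawn by weight from the
    items that this worker has not yet given."""
    pool, w, answers = list(items), list(weights), []
    for _ in range(min(k, len(pool))):
        idx = rng.choices(range(len(pool)), weights=w, k=1)[0]
        answers.append(pool.pop(idx))
        w.pop(idx)
    return answers

def crowd_sample(items, weights, workers, answers_per_worker, rng):
    """Crowd-style sampling: concatenate the answers of several workers."""
    stream = []
    for _ in range(workers):
        stream.extend(worker_without_replacement(items, weights, answers_per_worker, rng))
    return stream

def singletons(answers):
    """f1: number of items that appear exactly once in the answer stream."""
    return sum(1 for count in Counter(answers).values() if count == 1)

ITEMS = list("ABCDEFGHIJ")
WEIGHTS = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]   # hypothetical skewed popularity

rng = random.Random(0)
db_stream = sample_with_replacement(ITEMS, WEIGHTS, 20, rng)
crowd_stream = crowd_sample(ITEMS, WEIGHTS, workers=5, answers_per_worker=4, rng=rng)
streaker = worker_without_replacement(ITEMS, WEIGHTS, len(ITEMS), rng)
print(singletons(db_stream), singletons(crowd_stream), singletons(streaker))
```

A single worker who enumerates the whole set contributes every item exactly once, so all of that worker's answers count as singletons. An estimator that reads singletons as evidence of undetected items is pushed toward a larger estimate by such a worker.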
  • 11. Crowdsourced Enumeration Queries
    } Presence of streakers
      } Some respondents supply the complete answer set (streakers).
      } Characteristically, they sample without replacement. As a result, the estimate overshoots the true value.
      } Verified by adding a streaker who answers all 200 items at the start.
    [Figure: …ation simulations illustrating the impact of worker behaviors. Panels: (b) forms of skew, (c) impact of streaker. Axes: # answers vs. chao92 estimate. Series: ws=T, dd=T; ws=F, dd=T; ws=T, dd=F; ws=F, dd=F.]
  • 12. Crowdsourced Enumeration Queries
    } Developed the Streaker-tolerant Estimator, an improvement on Chao92
      } Sets a tolerance range for streakers
      } Detects and removes outliers
    [Figure: chao92 estimate vs. # answers. (a) UN 1: Φorig = 0.14, Φnew = 0.087. (b) UN 2: Φorig = 0.11, Φnew = 0.099. (f) States 1: Φorig = 0.046, Φnew = 0.053. (g) States 2: Φorig = 0.028, Φnew = 0.024.]
  • 13. On Incentive-based Tagging
    } Of 30 million URLs, 39% have only a single tag (Fig. 1).
    } Should we simply increase the number of posts? → Quality actually gets worse (Fig. 2).
  • 18. On Incentive-based Tagging
    } Maximize the quality improvement under a limited budget by optimizing how incentives are allocated.
    } The challenge is how to allocate the budget to workers.
    [Figure labels: Selected; Quality Metric for Tag Data]
  • 23. On Incentive-based Tagging
    ¨ The following five strategies are proposed:
      ¨ Free Choice (FC)
      ¨ Round Robin (RR)
      ¨ Fewest Post First (FP)
        ¤ Prioritizes items that have not been tagged
      ¨ Most Unstable First (MU)
        ¤ Looks at the rfd (Relative Frequency Distribution) values and selects the most uncertain item
      ¨ Hybrid (FP-MU)
    ¨ These methods are compared against DP (theoretically optimal solution)
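Two of the strategies listed above, Round Robin and Fewest Post First, can be sketched directly from their names. This is an illustrative reading, not the paper's algorithm: it assumes one unit of budget buys one additional post, that FP always picks the resource with the fewest posts so far, and that ties are broken by resource name. The function names and the URL keys are hypothetical. MU and FP-MU are left out because the slide does not say how instability is computed from the rfd.

```python
import heapq

def round_robin(post_counts, budget):
    """RR: hand out one paid tagging task to each resource in turn."""
    counts = dict(post_counts)
    names = list(counts)
    for k in range(budget):
        counts[names[k % len(names)]] += 1
    return counts

def fewest_post_first(post_counts, budget):
    """FP: always give the next paid task to the resource with the fewest posts."""
    counts = dict(post_counts)
    heap = [(n, name) for name, n in counts.items()]
    heapq.heapify(heap)
    for _ in range(budget):
        n, name = heapq.heappop(heap)        # resource with the fewest posts
        counts[name] = n + 1
        heapq.heappush(heap, (n + 1, name))  # re-queue with its updated count
    return counts

posts = {"url_a": 0, "url_b": 5, "url_c": 1}   # hypothetical initial post counts
print(round_robin(posts, 4))        # {'url_a': 2, 'url_b': 6, 'url_c': 2}
print(fewest_post_first(posts, 4))  # {'url_a': 3, 'url_b': 5, 'url_c': 2}
```

With the same budget, RR spends one task on the already well-tagged `url_b`, while FP puts the whole budget into the two sparsely tagged resources.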
  • 24. On Incentive-based Tagging
    ¨ FP & FP-MU are close to optimal
    ¨ Budget = 1,000
      ¤ 0.7% more posts compared with the initial number
      ¤ 6.7% quality improvement
    ¨ Free Choice: 50% of posts are over-tagging and are wasted