Management and analysis of social media data

A case study based on Sina Weibo
Weining Qian
Center for Cloud Computing and Big Data
East China Normal University
wnqian@sei.ecnu.edu.cn
database.ecnu.edu.cn

Outline
Social media
Data
Data collecting
Modeling microblogs
Management
Schema
Queries
Data generator: On-going work
Benchmarking social media data analytical queries
Applications

What is social media?
A group of Internet-based applications that build on the ideological and
technological foundations of Web 2.0, and that allow the creation and
exchange of user-generated content.
Andreas M. Kaplan, Michael Haenlein. “Users of the world, unite! The
challenges and opportunities of Social Media”. Business Horizons 53(1). 2010

Why social media?
Sense the world!

Financial index and mood on social media

Why case study based on Sina Weibo?
• “Real-world” data (valuable for universities)
• Related to many real applications
• (Relatively) easy to get those data
• Big data?
◦ Unstructured data
◦ Time evolving data
◦ Fast arriving (if we crawl the data on-line)
◦ Low quality (abbreviations, smileys, typos, multiple languages, . . . )
• Intuition helps (everyone understands social media nowadays!)

Data collecting: Distributed crawler

Data: Gradually updating
Followship network
• Seed users: 11 lawyers and opinion leaders and 21 researchers
• 2nd level users from seeds: 120,000+ users
• 3rd level users from seeds: 1.7+ million users
• 4th level users from seeds: 18+ million users (incomplete)
• More than 1 billion following relationships
Tweets from 1.6+ million users
• From Aug. 2009 to Jun. 2012
• 480+ million tweets (about 51.11% of them are retweeted tweets, and
others are original tweets)

Data: Two dimensions
• Timeline
• Followship network

The challenge of modeling
What do we expect?

External events

Bursts/tipping points

Modeling
It’s difficult to model a long-term time series in social media
• Affected by external events
Is it possible to model the life cycle of a single tweet?
To predict its
• retweet path
• #retweet
• impression

Various measurements
[Figure, first panel: log-log plot of Frequency against #Retweet and #Hashtag.]
[Figure, second panel: percentage of #Tweet and percentage of #Tweet*#Retweet by #Retweet range.]
#Retweet range   % of #Tweet   % of #Tweet*#Retweet
[1,10)           93.8          7.42
[10,100)         5.85          29.42
[100,1000)       0.65          30.58
[1000,∞)         0.02          32.76

The life-cycle of a tweet

Sigmoid function: S-Curve
$$F(x) = \frac{N}{1 + a \cdot e^{-b(x - c)}}$$
[Figure: S-curves for x from 0 to 150 and y from 0 to 100, with (a, b) = (100, 0.2), (1000, 0.2), (100000, 0.2), (1000, 0.1), and (1000, 0.3).]
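As a rough illustration of how the parameters shape the curve, the S-curve can be evaluated directly. This is a minimal Python sketch, not the authors' code: the function name `s_curve` and the sample values N = 100 and c = 0 are assumptions chosen for illustration, while a = 1000 and b = 0.2 follow one of the legend entries.

```python
import math


def s_curve(x: float, n: float, a: float, b: float, c: float) -> float:
    """Evaluate F(x) = N / (1 + a * exp(-b * (x - c)))."""
    return n / (1.0 + a * math.exp(-b * (x - c)))


if __name__ == "__main__":
    # N = 100 and c = 0 are illustrative assumptions, not values from the slides.
    for x in (0, 25, 50, 100, 150):
        print(x, round(s_curve(x, n=100, a=1000, b=0.2, c=0), 2))
```

A larger a delays the rise of the curve, and a larger b makes the rise steeper; N is the level at which the curve saturates.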

Modeling tweets popularity with S-Curve

Bursts of a tweet (and its retweets)

Tipping points
1. Γ(t + ε) − Γ(t) > κ
2. Γ(t) − Γ(t − ε) < κ
3. Γ(t + ε) − Γ(t) > µ · (Γ(t) − Γ(t − ε))
4. Γ(t + ε) − Γ(t) > N / log(N)
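The four conditions can be checked mechanically once Γ is available as a function of time. The sketch below is hedged: the slides do not define Γ, so it assumes Γ(t) is the cumulative retweet count at time t, that N is the total retweet count, and that log is the natural logarithm. The slides also do not say how the conditions are combined, so the hypothetical helper `tipping_conditions` reports each one separately.

```python
import math
from typing import Callable, Dict


def tipping_conditions(
    gamma: Callable[[float], float],
    t: float,
    eps: float,
    kappa: float,
    mu: float,
    n: float,
) -> Dict[int, bool]:
    """Evaluate conditions 1 to 4 at time t; n must be greater than 1."""
    after = gamma(t + eps) - gamma(t)    # growth in the window after t
    before = gamma(t) - gamma(t - eps)   # growth in the window before t
    return {
        1: after > kappa,
        2: before < kappa,
        3: after > mu * before,
        4: after > n / math.log(n),      # natural log is an assumption
    }
```

For example, with a cumulative count that stays at zero until t = 10 and then grows by 50 per time unit, all four conditions hold at t = 10 for eps = 1, kappa = 5, mu = 3, and n = 100.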

Piece-wise Sigmoid function
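$$
F(x) =
\begin{cases}
\dfrac{N_1}{1 + a_0 \cdot e^{-b_0 (x - c_0)}}, & x \le x_1 \\[2ex]
N_{i-1} + \dfrac{N_i - N_{i-1}}{1 + a_i \cdot e^{-b_i (x - c_i)}}, & x_{i-1} < x \le x_i,\ 2 \le i \le \lambda
\end{cases}
\tag{1}
$$
where
$$
\sum_{i=1}^{\lambda} N_i = N. \tag{2}
$$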
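Equation (1) can be evaluated piece by piece. The following is a sketch under stated assumptions rather than the authors' implementation: the `Segment` layout and the name `piecewise_s_curve` are hypothetical, the first piece is handled by treating the level before it as 0, values of x beyond the last breakpoint reuse the last piece, and constraint (2) is not enforced.

```python
import math
from typing import List, Tuple

# Hypothetical layout: (x_upper, n_i, a_i, b_i, c_i) for each piece, in order.
Segment = Tuple[float, float, float, float, float]


def piecewise_s_curve(x: float, segments: List[Segment]) -> float:
    """Evaluate the piece-wise sigmoid of equation (1) at x."""
    prev_n = 0.0  # level reached before the current piece; 0 for the first piece
    for i, (x_upper, n_i, a, b, c) in enumerate(segments):
        is_last = i == len(segments) - 1
        if x <= x_upper or is_last:
            return prev_n + (n_i - prev_n) / (1.0 + a * math.exp(-b * (x - c)))
        prev_n = n_i
    raise ValueError("segments must not be empty")
```

Each piece starts from the level of the previous one, so a second burst adds a new S-shaped rise on top of the first plateau.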

Result of modeling
[Figure: scatter plot comparing R² of the single S-curve fit with R² of the multi-S-curve fit; both axes range from 0.5 to 1, with the reference line y = x.]
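The slides do not state how R² was computed. A minimal sketch, assuming the standard coefficient of determination (one minus the ratio of the residual sum of squares to the total sum of squares), is given below; the function name `r_squared` is illustrative.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) == 0 or len(observed) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(observed) / len(observed)
    ss_tot = sum((y - mean) ** 2 for y in observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    if ss_tot == 0:
        raise ValueError("observed values must not all be equal")
    return 1.0 - ss_res / ss_tot
```

Under this definition, a fit that reproduces the observations exactly scores 1, and a fit that only predicts the mean scores 0.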

What causes a burst in social media?

Intuitive illustration

Schema: Tweets
Table: The microblog table
Attribute   Data type   Description
MID         ID          Message identifier
UID         ID          Author’s user identifier
TIME        DATE/TIME   Time at which the tweet is posted
CONTENT     TEXT        Content of the tweet
Table: The retweet table
Attribute   Data type   Description
MID         ID          Message identifier of the retweet
REMID       ID          MID of the tweet that is retweeted

Schema: Content
Table: The mention table
Attribute   Data type   Description
UID         ID          A user identifier that is mentioned in the message
Table: The topic table
Attribute   Data type   Description
TAG         TEXT        The hashtag of a topic
The schema could be extended to links, images, video, etc.

Schema: Users
Table: The user table
Attribute   Data type   Description
UID         ID          User identifier
Email       TEXT        Email of the user
Name        TEXT        Name of the user
. . .       . . .       Profile attributes
Table: The friendlist table
Attribute   Data type   Description
UID         ID          User identifier
FRIENDID    ID          A user that is followed by UID

Queries
Q: Rank the tweets appearing in my followees’ timelines by their number of retweets.
-- 'A' is the querying user; the timestamp literals are placeholders.
SELECT x.remid
FROM microblog,
     (SELECT retweet.mid AS mid, retweet.remid AS remid
      FROM microblog, retweet
      WHERE microblog.mid = retweet.remid) AS x
WHERE microblog.mid = x.mid
  AND microblog.uid IN
      (SELECT friendID FROM friendList
       WHERE uid = 'A'
          OR uid IN (SELECT friendID FROM friendList
                     WHERE uid = 'A'))
  AND microblog.time BETWEEN 'YYYY-MM-DD HH:MM:SS'
                         AND DATE_ADD('YYYY-MM-DD HH:MM:SS', INTERVAL 1 HOUR)
GROUP BY x.remid
ORDER BY COUNT(*) DESC
LIMIT 10;

Difficulties
Joins of very large tables
• self-join of friendList
• join of microblog and retweet

Queries
Q: Find the people who share followees with the specified user, ranked by the number of shared followees.
SELECT f1.uid
FROM friendList AS f1,
     (SELECT friendID
      FROM friendList
      WHERE uid = 'A') AS f2
WHERE f1.uid <> 'A'
  AND f1.friendID = f2.friendID
  AND f1.uid <> f2.friendID
GROUP BY f1.uid
ORDER BY COUNT(f1.friendID) DESC
LIMIT 10;
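To make the shared-followee query concrete, it can be run against a toy table. This is a minimal sketch, assuming an in-memory SQLite database and invented sample rows. The function name `shared_followee_ranking` and the TEXT column types are illustrative assumptions, and the user is passed as a bound parameter instead of the literal "A".

```python
import sqlite3
from typing import Iterable, List, Tuple


def shared_followee_ranking(
    rows: Iterable[Tuple[str, str]], user: str, limit: int = 10
) -> List[str]:
    """Run the shared-followee query on (uid, friendID) rows for one user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE friendList (uid TEXT, friendID TEXT)")
    conn.executemany("INSERT INTO friendList VALUES (?, ?)", rows)
    query = """
        SELECT f1.uid
        FROM friendList AS f1,
             (SELECT friendID FROM friendList WHERE uid = ?) AS f2
        WHERE f1.uid <> ?
          AND f1.friendID = f2.friendID
          AND f1.uid <> f2.friendID
        GROUP BY f1.uid
        ORDER BY COUNT(f1.friendID) DESC
        LIMIT ?
    """
    result = [r[0] for r in conn.execute(query, (user, user, limit))]
    conn.close()
    return result


if __name__ == "__main__":
    # Invented sample data: A and B follow X and Y, C follows X, D follows Z.
    sample = [("A", "X"), ("A", "Y"), ("B", "X"), ("B", "Y"), ("C", "X"), ("D", "Z")]
    print(shared_followee_ranking(sample, "A"))  # prints ['B', 'C']
```

B shares two followees with A and C shares one, so B ranks first; D shares none and does not appear.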

Difficulties
Power-law distribution
• The size of the result of the inner subquery may vary a lot!
[Figure: two log-log plots of normalized frequency, one against #Followees and one against #Followers, each comparing Twitter and Sina Weibo.]

Why is a data generator needed?
• Useful in benchmarking
◦ For scalability issues
◦ For privacy issues
◦ For diversity issues
• Although social media data from different services tend to follow similar distributions, they are different.

Distribution of real data vs. generated data
[Figure: four panels comparing SIB and BSMA: frequency against number of comments per post, frequency against number of friends, number of posts per day (days 0 to 30), and frequency against number of posts.]

Measurements
• Throughput
• Latency
• Scalability
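The first two measurements can be illustrated with a small timing harness. The slides do not define how the benchmark computes them, so this is a sketch under assumptions: throughput is taken as completed operations per second, latency as wall-clock time per operation, and `measure` is a hypothetical helper. Scalability is not covered here.

```python
import time
from typing import Callable, Dict


def measure(operation: Callable[[], object], repetitions: int) -> Dict[str, float]:
    """Run operation repeatedly and report throughput and latency figures."""
    if repetitions <= 0:
        raise ValueError("repetitions must be positive")
    latencies = []
    start = time.perf_counter()
    for _ in range(repetitions):
        t0 = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - t0)
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-12)
    return {
        "throughput_ops_per_s": repetitions / elapsed,
        "avg_latency_ms": 1000.0 * sum(latencies) / repetitions,
        "max_latency_ms": 1000.0 * max(latencies),
    }
```

In practice `operation` would wrap one benchmark query against the system under test, for example a function that executes the query and fetches all rows.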

Workloads
• 19 queries in 3 categories
◦ Social network queries (joins of very large tables)
◦ Timeline queries (order-preserving)
◦ Hotspot queries (skewed data)

Preliminary results
[Figure: two panels showing average and highest throughput (ops) per query and average and highest latency (ms) per query, for Q1 to Q5, Q8 to Q17, and Q19.]

Preliminary results
[Figure: scalability per query on a logarithmic scale for Team1, Team2, Team3, and Team4, for Q1 to Q5, Q8 to Q17, and Q19.]

On-going work
BSMA: http://github.com/xiafan68/BSMA
• Data generator
• Queries related to content of tweets
• More queries
• Performance testing of more systems

Collective behavior analysis
What is collective behavior?
Three kinds of actions:
Conforming: actors follow prevailing norms
Deviant: actors violate those norms
Collective behavior: a third form of action, which takes place when norms are absent or unclear, or when they contradict each other

What is collective behavior?
Four forms of collective behavior
• The crowd
• The public
• The mass
• The social movement

Mood analysis
Essentially time series
Disasters have a strong effect on the “death” mood (up-down-up pattern)
The “death” mood is strongly correlated with the “anxiety” and “calm” moods

On-going work
A shared dataset of hotspots on Sina Weibo
• Events and descriptions
• Evolutions of hotspots
• Information propagation
• Spatial attributes
• Users’ involvement
By-products
• Spamming detection
• Fake IDs
• . . .

Spamming?
[Figure: Sina Weibo account names shown as labels (duplicates removed): 创意工坊, 冷笑话精选, 作业本, 团800网, 微博经典语录, 微博搞笑排行榜, 时尚经典语录, 电影工厂, 最音乐, 全球热门段子, 全球创意搜罗, 全球时尚最前线, 全球奇闻趣事, 星座爱情001, 全球热门排行榜, 胡椒蓓蓓网, 新浪数码, 新浪科技, 头条新闻, 新浪财经, 任志强, 微群小助手, 黄健翔, 环球音乐榜, 当时我震惊了, 薛蛮子, 徐小平, 邓飞, 老榕, 李开复, 李开复-2, 袁岳.]

Summary
• Data collection and pre-processing are dirty work
◦ Topic/semantic entity extraction
◦ Mood detection
◦ . . .
• Real-life data show interesting patterns
◦ even with simple exploratory analysis
• Modeling is difficult
◦ yet possible under certain circumstances
◦ Monitoring is possible
◦ Prediction remains an open problem
• Building a system for analyzing social media data is a challenge
• A benchmark is a basis for better understanding social media analytics

Contributing students
• MA Haixin
• XIA Fan
• WEI Jinxian
• YU Chengcheng
• ZHANG Qunyan