1. Management and analysis of social media data
A case study based on Sina Weibo
Weining Qian
Center for Cloud Computing and Big Data
East China Normal University
wnqian@sei.ecnu.edu.cn
database.ecnu.edu.cn
3. What is social media?
A group of Internet-based applications that build on the ideological and
technological foundations of Web 2.0, and that allow the creation and
exchange of user-generated content.
Andreas M. Kaplan, Michael Haenlein. “Users of the world, unite! The
challenges and opportunities of Social Media”. Business Horizons 53(1). 2010
3 of 53
7. Why a case study based on Sina Weibo?
• “Real-world” data (valuable for universities)
• Related to many real applications
• (Relatively) easy to get those data
• Big data?
◦ Unstructured data
◦ Time evolving data
◦ Fast arriving (if we crawl the data on-line)
◦ Low quality (abbreviations, smileys, typos, mixed languages, ...)
• Intuition helps (everyone understands social media nowadays!)
10. Data: Gradually updating
Followship network
• Seed users: 11 lawyers and opinion leaders and 21 researchers
• 2nd level users from seeds: 120,000+ users
• 3rd level users from seeds: 1.7+ million users
• 4th level users from seeds: 18+ million users (incomplete)
• More than 1 billion following relationships
Tweets from 1.6+ million users
• From Aug. 2009 to Jun. 2012
• 480+ million tweets (about 51.11% are retweets; the rest are original tweets)
16. Modeling
It’s difficult to model a long-term time series in social media
• Affected by external events
Is it possible to model the life cycle of a single tweet?
To predict its
• retweet path
• number of retweets
• impression
23. Piece-wise Sigmoid function
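\[
F(x) =
\begin{cases}
\dfrac{N_1}{1 + a_0 \, e^{-b_0 (x - c_0)}}, & x \le x_1 \\[2ex]
N_{i-1} + \dfrac{N_i - N_{i-1}}{1 + a_i \, e^{-b_i (x - c_i)}}, & x_{i-1} < x \le x_i,\ 2 \le i \le \lambda
\end{cases}
\tag{1}
\]
where
\[
\sum_{i=1}^{\lambda} N_i = N. \tag{2}
\]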
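As a rough illustration of Eq. (1), the piecewise sigmoid can be evaluated as sketched below. This is not the authors' code: the function name `piecewise_sigmoid`, the argument layout, and the choice to reuse the last piece beyond the final breakpoint are assumptions. Parameters are stored one entry per piece and indexed from 0, and `levels[k]` plays the role of the upper level of piece k in Eq. (1).

```python
import math
from typing import Sequence


def piecewise_sigmoid(
    x: float,
    breakpoints: Sequence[float],  # right boundaries x_1 < x_2 < ... < x_lambda
    levels: Sequence[float],       # upper level reached by each piece
    a: Sequence[float],            # per-piece sigmoid parameters (hypothetical layout)
    b: Sequence[float],
    c: Sequence[float],
) -> float:
    """Evaluate the piecewise sigmoid of Eq. (1) at a single point x."""
    # Pick the first piece whose right boundary covers x.
    # Assumption: points beyond the last breakpoint reuse the last piece.
    piece = len(breakpoints) - 1
    for i, right in enumerate(breakpoints):
        if x <= right:
            piece = i
            break
    # The first piece grows from 0; later pieces grow from the previous level.
    base = levels[piece - 1] if piece > 0 else 0.0
    step = levels[piece] - base
    return base + step / (1.0 + a[piece] * math.exp(-b[piece] * (x - c[piece])))


# Toy parameters, chosen only to exercise both branches.
first = piecewise_sigmoid(0.0, [5.0], [10.0], [1.0], [1.0], [0.0])
second = piecewise_sigmoid(7.0, [5.0, 10.0], [10.0, 30.0], [1.0, 1.0], [1.0, 1.0], [0.0, 7.0])
print(first, second)
```

Each piece is an ordinary logistic curve shifted to start at the level where the previous one saturates, which is what lets a single tweet's retweet count show several growth bursts.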
24. Result of modeling
[Figure: R² of the single S-curve model plotted against R² of the multi-S-curve model, both axes ranging from 0.5 to 1, with the reference line y = x.]
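Assuming the R² on the axes is the usual coefficient of determination, the goodness of fit of each model can be computed as sketched below. The function name `r_squared` and the toy numbers are illustrative only and do not come from the slides.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], fitted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


# Toy check: a perfect fit versus a fit that only predicts the mean.
perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
print(perfect, mean_only)
```

Computing this value once for the single-curve fit and once for the multi-curve fit of the same tweet gives one point in a scatter plot of this kind.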
28. Schema: Tweets
Table: The microblog table
Attribute  Data type  Description
MID        ID         Message identifier
UID        ID         Author’s user identifier
TIME       DATE/TIME  Time that the tweet is posted
CONTENT    TEXT       Content of the tweet
Table: The retweet table
Attribute  Data type  Description
MID        ID         Message identifier of the retweet
REMID      ID         MID of the tweet that is retweeted
29. Schema: Content
Table: The mention table
Attribute  Data type  Description
MID        ID         Message identifier
UID        ID         A user identifier that is mentioned in the message
Table: The topic table
Attribute  Data type  Description
MID        ID         Message identifier
TAG        TEXT       The hashtag of a topic
Could be extended for links, images, video, etc.
30. Schema: Users
Table: The user table
Attribute  Data type  Description
UID        ID         User identifier
Email      TEXT       Email of the user
Name       TEXT       Name of the user
...        ...        Profile attributes
Table: The friendlist table
Attribute  Data type  Description
UID        ID         User identifier
FRIENDID   ID         A user that is followed by UID
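The six tables can be written down as DDL, for instance as in the following sketch using Python's built-in `sqlite3` module. The concrete column types, the choice of primary keys, and storing `time` as ISO-8601 text are assumptions made for illustration; the slides only give the abstract types ID, DATE/TIME, and TEXT.

```python
import sqlite3

# Assumed mapping: ID -> TEXT, DATE/TIME -> ISO-8601 TEXT.
SCHEMA = """
CREATE TABLE microblog (
    mid     TEXT PRIMARY KEY,   -- message identifier
    uid     TEXT NOT NULL,      -- author's user identifier
    time    TEXT NOT NULL,      -- time that the tweet is posted
    content TEXT                -- content of the tweet
);
CREATE TABLE retweet (
    mid   TEXT PRIMARY KEY,     -- message identifier of the retweet
    remid TEXT NOT NULL         -- MID of the tweet that is retweeted
);
CREATE TABLE mention (
    mid TEXT NOT NULL,          -- message identifier
    uid TEXT NOT NULL,          -- a user mentioned in the message
    PRIMARY KEY (mid, uid)
);
CREATE TABLE topic (
    mid TEXT NOT NULL,          -- message identifier
    tag TEXT NOT NULL,          -- hashtag of a topic
    PRIMARY KEY (mid, tag)
);
CREATE TABLE "user" (
    uid   TEXT PRIMARY KEY,     -- user identifier
    email TEXT,
    name  TEXT                  -- further profile attributes omitted
);
CREATE TABLE friendlist (
    uid      TEXT NOT NULL,     -- user identifier
    friendid TEXT NOT NULL,     -- a user that is followed by uid
    PRIMARY KEY (uid, friendid)
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create all six tables in the given database."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```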
32. Queries
Q: Rank the tweets appearing in my followees’ timelines by number of retweets (top 10 within a one-hour window).
SELECT x.remid
FROM microblog,
     -- x: retweets whose original tweet is present in microblog
     (SELECT retweet.mid AS mid, retweet.remid AS remid
      FROM microblog, retweet
      WHERE microblog.mid = retweet.remid) AS x
WHERE microblog.mid = x.mid
  -- authors: followees of A, plus followees of those followees
  AND microblog.uid IN
      (SELECT friendID FROM friendList
       WHERE uid = 'A'
          OR uid IN (SELECT friendID FROM friendList
                     WHERE uid = 'A'))
  -- one-hour window starting at the given timestamp
  AND microblog.time BETWEEN 'YYYY-MM-DD HH:MM:SS'
                         AND DATE_ADD('YYYY-MM-DD HH:MM:SS', INTERVAL 1 HOUR)
GROUP BY x.remid
ORDER BY COUNT(*) DESC
LIMIT 10;
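To see what this query returns, it can be run on a few hand-made rows. The sketch below uses SQLite through Python's `sqlite3` module, so it is an adaptation and not the benchmark's own SQL: the joins are written explicitly, `datetime(..., '+1 hour')` stands in for `DATE_ADD`, and all users, message identifiers, and timestamps are invented toy values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE microblog (mid TEXT PRIMARY KEY, uid TEXT, time TEXT, content TEXT);
CREATE TABLE retweet (mid TEXT PRIMARY KEY, remid TEXT);
CREATE TABLE friendlist (uid TEXT, friendid TEXT);
""")
# Toy network: A follows B, and B follows C.
conn.executemany("INSERT INTO friendlist VALUES (?, ?)", [("A", "B"), ("B", "C")])
conn.executemany("INSERT INTO microblog VALUES (?, ?, ?, ?)", [
    ("m1", "X", "2012-06-01 09:00:00", "original 1"),
    ("m2", "Y", "2012-06-01 09:30:00", "original 2"),
    ("r1", "B", "2012-06-01 10:05:00", "retweet"),
    ("r2", "C", "2012-06-01 10:10:00", "retweet"),
    ("r3", "C", "2012-06-01 10:20:00", "retweet"),
    ("r4", "D", "2012-06-01 10:30:00", "retweet"),  # D is outside A's network
    ("r5", "B", "2012-06-01 12:30:00", "retweet"),  # outside the one-hour window
])
conn.executemany("INSERT INTO retweet VALUES (?, ?)", [
    ("r1", "m1"), ("r2", "m1"), ("r3", "m2"), ("r4", "m2"), ("r5", "m2"),
])

TOP_RETWEETED = """
SELECT r.remid, COUNT(*) AS n
FROM retweet AS r
JOIN microblog AS rt   ON rt.mid = r.mid      -- the retweet itself
JOIN microblog AS orig ON orig.mid = r.remid  -- the original tweet must exist
WHERE rt.uid IN (SELECT friendid FROM friendlist
                 WHERE uid = :me
                    OR uid IN (SELECT friendid FROM friendlist WHERE uid = :me))
  AND rt.time BETWEEN :start AND datetime(:start, '+1 hour')
GROUP BY r.remid
ORDER BY n DESC, r.remid
LIMIT 10
"""


def top_retweeted(db: sqlite3.Connection, me: str, start: str):
    """Return (original mid, retweet count) pairs for the given user and window start."""
    return db.execute(TOP_RETWEETED, {"me": me, "start": start}).fetchall()


ranking = top_retweeted(conn, "A", "2012-06-01 10:00:00")
print(ranking)
```

Even on this tiny example the query touches `friendlist` twice and `microblog` twice, which hints at why it becomes expensive at the scale of the real tables.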
33. Difficulties
Joins of very large tables
• self-join of friendList
• join of microblog and retweet
34. Queries
Q: Find the people who share followees with the specified user, ranked by the number of shared followees.
SELECT f1.uid
FROM friendList AS f1,
     (SELECT friendID
      FROM friendList
      WHERE uid = 'A') AS f2      -- followees of A
WHERE f1.uid <> 'A'
  AND f1.friendID = f2.friendID   -- f1.uid follows someone A also follows
  AND f1.uid <> f2.friendID
GROUP BY f1.uid
ORDER BY COUNT(f1.friendID) DESC  -- most shared followees first
LIMIT 10;
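A small runnable version of this query, again a sketch on invented toy rows using SQLite through Python's `sqlite3` module, shows the expected ranking. The helper name `co_followers` and the extra tie-break on `uid` are additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friendlist (uid TEXT, friendid TEXT)")
# Toy data: A follows X, Y, Z; B shares two of them, C and X share one each.
conn.executemany("INSERT INTO friendlist VALUES (?, ?)", [
    ("A", "X"), ("A", "Y"), ("A", "Z"),
    ("B", "X"), ("B", "Y"),
    ("C", "X"),
    ("D", "W"),
    ("X", "Y"),
])

SHARED_FOLLOWEES = """
SELECT f1.uid, COUNT(f1.friendid) AS shared
FROM friendlist AS f1
JOIN (SELECT friendid FROM friendlist WHERE uid = :me) AS f2
  ON f1.friendid = f2.friendid
WHERE f1.uid <> :me
  AND f1.uid <> f2.friendid
GROUP BY f1.uid
ORDER BY shared DESC, f1.uid
LIMIT 10
"""


def co_followers(db: sqlite3.Connection, me: str):
    """Return (uid, number of followees shared with `me`), most shared first."""
    return db.execute(SHARED_FOLLOWEES, {"me": me}).fetchall()


result = co_followers(conn, "A")
print(result)
```

User D follows nobody that A follows and therefore does not appear in the result at all.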
35. Difficulties
Power-law distribution
• The size of results from the inner-subquery may vary a lot!
[Figure: two log-log plots, normalized frequency vs. #Followees and normalized frequency vs. #Followers, each comparing Twitter and Sina Weibo.]
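The effect of such a skewed degree distribution on the inner subquery can be illustrated with synthetic numbers. The sketch below draws followee counts from a Pareto distribution; the distribution family, the shape parameter `alpha = 1.2`, and the helper name `sample_followee_counts` are assumptions chosen for illustration and are not fitted to Sina Weibo or Twitter.

```python
import random
from statistics import median
from typing import List


def sample_followee_counts(n_users: int, alpha: float = 1.2, seed: int = 42) -> List[int]:
    """Draw one heavy-tailed followee count per user (Pareto with shape alpha)."""
    rng = random.Random(seed)
    # paretovariate returns values >= 1; truncate to whole followee counts.
    return [int(rng.paretovariate(alpha)) for _ in range(n_users)]


counts = sample_followee_counts(100_000)
# Each count is the number of rows the inner subquery would return for that user.
print("median rows:", median(counts))
print("max rows:   ", max(counts))
```

For most users the subquery returns a handful of rows, while for a few it returns orders of magnitude more, so a plan tuned for the typical user can behave poorly for the heavy ones.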
37. Why is a data generator needed?
• Useful for benchmarking
◦ For scalability issues
◦ For privacy issues
◦ For diversity issues
• Although social media data from different services tend to follow similar distributions, the datasets still differ.
38. Distribution of real data vs. generated data
[Figure: four panels comparing SIB and BSMA: frequency vs. number of comments per post, frequency vs. number of friends, number of posts vs. day, and frequency vs. number of posts. The frequency panels use logarithmic axes.]
41. Workloads
• 19 queries in 3 categories
◦ Social network queries (joins of very large tables)
◦ Timeline queries (order-preserving)
◦ Hotspot queries (skewed data)
45. Collective behavior analysis
What is collective behavior?
Three kinds of actions:
Conforming: actors follow prevailing norms
Deviant: actors violate those norms
Collective behavior: a third form of action that takes place when norms are absent or unclear, or when they contradict each other
46. What is collective behavior?
Four forms of collective behavior
• The crowd
• The public
• The mass
• The social movement
48. Mood analysis
Essentially time series
Disasters have a strong effect on the “death” mood (up-down-up pattern)
The “death” mood is strongly correlated with the “anxiety” and “calm” moods
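One common way to quantify how strongly two mood time series move together is the Pearson correlation coefficient, sketched below. The slides do not say which correlation measure was used, so this choice is an assumption, and the daily values are invented toy numbers, not measurements from Sina Weibo.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical daily mood scores (made up for illustration).
death = [1.0, 2.0, 3.0, 4.0]
anxiety = [2.0, 4.0, 6.0, 8.0]  # rises together with "death"
calm = [8.0, 6.0, 4.0, 2.0]     # falls as "death" rises
print(pearson(death, anxiety), pearson(death, calm))
```

A value near +1 or -1 indicates a strong linear relationship; the sign tells whether the two moods rise together or move in opposite directions.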
49. On-going work
A shared dataset of hotspots on Sina Weibo
• Events and descriptions
• Evolutions of hotspots
• Information propagation
• Spatial attributes
• Users’ involvement
By-products
• Spam detection
• Fake IDs
• ...
51. Summary
• Data collection and pre-processing are dirty work
◦ Topic/semantic entity extraction
◦ Mood detection
◦ ...
• Real-life data reveal interesting patterns
◦ even with simple exploratory analysis
• Modeling is difficult
◦ yet possible under certain circumstances
◦ Monitoring is possible
◦ Prediction remains an open problem
• Building a system for analyzing social media data is a challenge
• A benchmark is a basis for better understanding social media analytics