Social Computing Research with Apache Spark
1. Social Computing Research with Apache Spark
Hadoop and Big Data Meetup Manchester, July 2015
Dr. Matthew Rowe
Lecturer in Social Computing | M.Sc. Data Science Director
http://www.lancaster.ac.uk/staff/rowem/ | m.rowe@lancaster.ac.uk | @mrowebot
2. Social Computing Research
The investigation of how and why social behaviour occurs in computational systems
https://prinsayn.files.wordpress.com/2013/01/tile.jpg
3. Social Computing Team
Small team comprised of myself + 3 Ph.D. students
Researching a range of social computing topics:
• Churn prediction
• User engagement in social systems
• Recommender systems
• Information diffusion
• Digital Accountability
Common theme: investigating and applying data mining techniques to large-scale data
4. 'Culture' Parallel Processing Cluster
Contains 12 rack-mounted Dell servers
• 1 x 4 core with 64 GB RAM
• 1 x 6 core with 64 GB RAM
• 10 x 2 core with 16 GB RAM
Cloudera + Apache Mesos installed on the cluster to provide access to:
• Apache Hadoop Stack (HDFS, HBase)
• Apache Spark
• RabbitMQ (for custom distributed processing apps)
• E.g. Parallelised parameter tuning for Recommender Systems
6. Diffusion of Language Innovation
Language innovation can take various forms:
• Neologisms (e.g. brah)
• Word blends (e.g. downvoting, cooldown)
• Shortening (e.g. ur)
Studying the adoption of such innovations is hard:
• Interviews with different communities
• Travelling between different locations
• Relies on understanding the agent, the social structure + their interplay
Social media allows language spread to be investigated at scale:
• To understand who influences whom
• To understand the language of brands' audiences
7. Computing Language Innovation Diffusion
1. Variation in Term Frequency
Probability of a term being used in a context: time (week) and community (e.g. subreddit)
2. Variation in Term Form
Probability of a term having a suffix or prefix added (as a word blend)
Goal: Look for significant increases & decreases in term frequency and form:
(i) globally in a system
(ii) locally in communities
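As a rough illustration of the two measures above, the sketch below estimates the probability of a term in a context as its count divided by all term counts in that context, and the form measure as the share of tokens containing the term that carry an extra prefix or suffix. The function names `term_probability` and `form_probability`, the `(week, community, text)` record layout, the sample posts, and the whitespace tokenisation are all assumptions made for illustration; the talk does not give the exact formulas.

```python
from collections import Counter, defaultdict

def term_probability(posts):
    """Estimate P(term | context), where a context is a (week, community) pair.

    `posts` is an iterable of (week, community, text) tuples (hypothetical layout).
    """
    counts = defaultdict(Counter)
    for week, community, text in posts:
        # Naive whitespace tokenisation; a real pipeline would normalise further.
        counts[(week, community)].update(text.lower().split())
    probs = {}
    for context, term_counts in counts.items():
        total = sum(term_counts.values())
        for term, n in term_counts.items():
            probs[(term, context)] = n / total
    return probs

def form_probability(tokens, term):
    """Share of tokens containing `term` that have a prefix or suffix added."""
    containing = [t for t in tokens if term in t]
    if not containing:
        return 0.0
    return sum(1 for t in containing if t != term) / len(containing)

posts = [
    (1, "r/gaming", "cooldown is too long"),
    (1, "r/gaming", "cooldown again"),
    (2, "r/gaming", "nice cooldown brah"),
]
probs = term_probability(posts)
print(probs[("cooldown", (1, "r/gaming"))])
print(form_probability(["vote", "downvote", "voting", "votes"], "vote"))
```

Comparing these values across weeks (globally) or across communities (locally) is what would then be tested for significant increases and decreases.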
8. The Role of Apache Spark
Collect datasets from Twitter and Reddit
Identify innovations
Compute frequency and form values
Write significant increases + decreases to HDFS
Point to TSV file in HDFS
Map: return <<term, context>, value> pairs as RDD
ReduceByKey: merge <<term, context>, value> pairs as RDD
Identify significant contextual increases + decreases
Note: the key here is a tuple
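The map and reduce-by-key steps with a tuple key can be sketched in plain Python, without a cluster. The TSV column order (term, context, value), the helper names `parse_line` and `reduce_by_key`, and the sample lines are assumptions for illustration. In PySpark the same shape would look roughly like `sc.textFile(path).map(parse_line).reduceByKey(lambda a, b: a + b)`, with `path` pointing at the TSV file in HDFS.

```python
def parse_line(line):
    """Map step: turn one TSV line into a ((term, context), value) pair."""
    term, context, value = line.rstrip("\n").split("\t")
    return (term, context), float(value)

def reduce_by_key(pairs, merge):
    """Merge all values that share a key, in the spirit of Spark's reduceByKey."""
    merged = {}
    for key, value in pairs:
        merged[key] = merge(merged[key], value) if key in merged else value
    return merged

# Hypothetical input: term, context, value separated by tabs.
lines = [
    "cooldown\tweek1\t2",
    "cooldown\tweek1\t3",
    "brah\tweek1\t1",
    "cooldown\tweek2\t4",
]

# The key is the (term, context) tuple, so values are merged per term and context.
totals = reduce_by_key(map(parse_line, lines), lambda a, b: a + b)
print(totals[("cooldown", "week1")])  # 5.0
```

Using the tuple as the key keeps counts for the same term in different contexts separate, which is what the later per-context comparison needs.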
14. Censorship Monitoring Project
Open Rights Group constructed a system of probes to check URLs across ISPs for blocking
Lancaster’s goal: build a system to gauge filters’ accuracy and categories of blocks
15. Computing per-ISP accuracy
Broadcasting DMOZ Category RDD <URL, topics>
Pseudo-classifiers in Apache Spark
https://github.com/openrightsgroup/cmp-analysis
Point to DMOZ JSON file in HDFS
Map: return <URL, topics> pairs as RDD
ReduceByKey: merge <URL, topics> as RDD
Point to Adult DMOZ JSON file in HDFS
Map: return <URL, topics> pairs as RDD
ReduceByKey: merge <URL, topics> as RDD
Take union of RDD objects
Broadcast RDD to cluster
Point to probe request file in HDFS
Retrieve DMOZ RDD from broadcast
Map: return <ISP, Result> RDD w/ pseudo-classifiers
ReduceByKey: merge <ISP, Result> as RDD
Collect map and compute per-ISP accuracy
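The per-ISP accuracy step might look something like the plain-Python sketch below. The dictionary `dmoz_topics` stands in for the broadcast <URL, topics> lookup, and the rule that a block is correct only when a URL carries an "Adult" topic is an assumption for illustration, as are the record layout `(isp, url, blocked)`, the function names, and the example URLs. In Spark, such a lookup would typically be collected into a dictionary (for example with `collectAsMap()`) and then shared via `SparkContext.broadcast`.

```python
from collections import defaultdict

# Stand-in for the broadcast <URL, topics> lookup (hypothetical entries).
dmoz_topics = {
    "http://example.org/news": {"News"},
    "http://example.org/xxx": {"Adult"},
    "http://example.org/shop": {"Shopping"},
}

def pseudo_classify(record, topics_by_url):
    """Map step: emit (isp, (correct, total)) for one probe result.

    Assumed rule: a URL should be blocked only if it has an 'Adult' topic.
    """
    isp, url, blocked = record
    should_block = "Adult" in topics_by_url.get(url, set())
    return isp, (int(blocked == should_block), 1)

def per_isp_accuracy(records, topics_by_url):
    """Reduce step: sum (correct, total) per ISP, then divide."""
    totals = defaultdict(lambda: [0, 0])
    for record in records:
        isp, (correct, n) = pseudo_classify(record, topics_by_url)
        totals[isp][0] += correct
        totals[isp][1] += n
    return {isp: correct / n for isp, (correct, n) in totals.items()}

probes = [
    ("ISP-A", "http://example.org/news", False),
    ("ISP-A", "http://example.org/xxx", True),
    ("ISP-A", "http://example.org/shop", True),   # blocked but not adult
    ("ISP-B", "http://example.org/xxx", False),   # adult but not blocked
    ("ISP-B", "http://example.org/news", False),
]
accuracy = per_isp_accuracy(probes, dmoz_topics)
print(accuracy)
```

Keeping the lookup as a read-only dictionary shared with every worker avoids a join between the probe results and the category data.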
16. What did we find?
For examples of overblocks & underblocks:
https://github.com/openrightsgroup/cmp-analysis/tree/master/data/output
30% to 82% of sites are underblocked
2% to 6% of sites are overblocked