Social Computing Research with Apache Spark

Hadoop and Big Data Meetup Manchester, July 2015

Dr. Matthew Rowe
Lecturer in Social Computing | M.Sc. Data Science Director
http://www.lancaster.ac.uk/staff/rowem/ | m.rowe@lancaster.ac.uk | @mrowebot

Social Computing Research
The investigation of how and why social behaviour occurs in computational systems
https://prinsayn.files.wordpress.com/2013/01/tile.jpg

Social Computing Team
Small team comprised of myself + 3 Ph.D. students
Researching a range of social computing topics:
•  Churn prediction
•  User engagement in social systems
•  Recommender systems
•  Information diffusion
•  Digital Accountability
Common theme: investigating and applying data mining techniques to large-scale data

'Culture' Parallel Processing Cluster
Contains 12 rack-mounted Dell servers:
•  1 x 4 core with 64 GB RAM
•  1 x 6 core with 64 GB RAM
•  10 x 2 core with 16 GB RAM
Cloudera + Apache Mesos installed on the cluster to provide access to:
•  Apache Hadoop Stack (HDFS, HBase)
•  Apache Spark
•  RabbitMQ (for custom distributed processing apps)
•  E.g. Parallelised parameter tuning for Recommender Systems

Language Innovation Diffusion on Social Media

Diffusion of Language Innovation
Language innovation can take various forms:
•  Neologisms (e.g. brah)
•  Word blends (e.g. downvoting, cooldown)
•  Shortening (e.g. ur)
Studying the adoption of such innovations is hard:
•  Interviews with different communities
•  Travelling between different locations
•  Relies on understanding the agent, the social structure + their interplay
Social media allows language spread to be investigated at scale:
•  To understand who influences whom
•  To understand the language of brands' audiences

Computing Language Innovation Diffusion
1.  Variation in Term Frequency
Probability of a term being used in a context: time (week) and community (e.g. subreddit)
2.  Variation in Term Form
Probability of a term having a suffix or prefix added (as a word blend)
Goal: Look for significant increases & decreases in term frequency and form:
(i) globally in a system
(ii) locally in communities
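The two measures above could be computed roughly as follows. This is a minimal sketch in plain Python, not the talk's implementation: the `(week, community, text)` record layout, the whitespace tokenisation, the function names, and the reading of "form" as "the term appears inside a longer token" are all assumptions made for illustration.

```python
from collections import Counter

def term_frequency_probability(posts, term):
    """Share of tokens equal to `term` in each (week, community) context."""
    term_counts = Counter()
    token_counts = Counter()
    for week, community, text in posts:
        tokens = text.lower().split()  # assumed tokenisation: whitespace split
        context = (week, community)
        token_counts[context] += len(tokens)
        term_counts[context] += tokens.count(term)
    return {ctx: term_counts[ctx] / n for ctx, n in token_counts.items() if n > 0}

def term_form_probability(posts, term):
    """Among tokens containing `term`, share with a prefix or suffix added."""
    containing = Counter()
    blended = Counter()
    for week, community, text in posts:
        context = (week, community)
        for token in text.lower().split():
            if term in token:
                containing[context] += 1
                if token != term:  # extra characters before or after the term
                    blended[context] += 1
    return {ctx: blended[ctx] / n for ctx, n in containing.items()}

# Hypothetical toy data: (week, community, text)
posts = [
    (1, "gaming", "the cooldown is long"),
    (1, "gaming", "cooldown again"),
    (2, "gaming", "no cooldowns on that"),
    (2, "music", "vote vote downvoted"),
]
freq = term_frequency_probability(posts, "cooldown")
form = term_form_probability(posts, "vote")
```

Comparing these per-context values across consecutive weeks, or against the system-wide value, is where significant increases and decreases would be looked for; the significance test itself is not specified on the slide and is left out here.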

The Role of Apache Spark
•  Collect datasets from Twitter and Reddit
•  Identify innovations
•  Compute frequency and form values
•  Write significant increases + decreases to HDFS

•  Point to TSV file in HDFS
•  Map: return <<term, context>, value> pairs as RDD
•  ReduceByKey: merge <<term, context>, value> pairs as RDD
•  Identify significant contextual increases + decreases
Note: the key here is a tuple
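The map and ReduceByKey steps with a tuple key can be sketched as below. To stay runnable without a cluster, this uses plain Python with a small local stand-in for `reduceByKey`. The three-column TSV layout (term, context, value), the helper names, and the choice of addition as the merge function are assumptions; the slide only says the values are merged.

```python
from operator import add

def parse_line(line):
    """Map step: one TSV line -> ((term, context), value), keyed by a tuple."""
    term, context, value = line.rstrip("\n").split("\t")
    return ((term, context), float(value))

def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey: merge values that share a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Hypothetical rows standing in for the TSV file in HDFS
tsv_lines = [
    "brah\tr/gaming\t2",
    "brah\tr/gaming\t3",
    "brah\tr/music\t1",
    "ur\tr/gaming\t4",
]
totals = reduce_by_key(map(parse_line, tsv_lines), add)
```

With PySpark available, the same shape would be `sc.textFile(path).map(parse_line).reduceByKey(add)`, where `path` is whatever HDFS location holds the TSV file. Because the key is the whole `(term, context)` tuple, the same term in two communities is merged separately.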

Investigating UK Web Filters
(A Data Science Approach)

UK Web Filtering: Default-on
Collateral Filtering
How accurate are the filters?
What is being overblocked and underblocked?

Censorship Monitoring Project
Open Rights Group constructed a system of probes to check URLs across ISPs for blocking
Lancaster’s goal: build a system to gauge filters’ accuracy and categories of blocks

Pseudo-classifiers in Apache Spark
https://github.com/openrightsgroup/cmp-analysis

Broadcasting DMOZ Category RDD <URL, topics>
•  Point to DMOZ JSON file in HDFS
•  Map: return <URL, topics> pairs as RDD
•  ReduceByKey: merge <URL, topics> as RDD
•  Point to Adult DMOZ JSON file in HDFS
•  Map: return <URL, topics> pairs as RDD
•  ReduceByKey: merge <URL, topics> as RDD
•  Take union of RDD objects
•  Broadcast RDD to cluster

Computing per-ISP accuracy
•  Point to probe request file in HDFS
•  Retrieve DMOZ RDD from broadcast
•  Map: return <ISP, Result> RDD w/ pseudo-classifiers
•  ReduceByKey: merge <ISP, Result> as RDD
•  Collect map and compute per-ISP accuracy
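The per-ISP accuracy flow can be sketched as below in plain Python. Everything specific here is an assumption for illustration: the `(isp, url, blocked)` probe record, the example URLs and ISP names, and the rule that a URL "should be blocked" when any of its DMOZ topics is "Adult". The project's actual pseudo-classifiers are in the repository linked above and may differ.

```python
# Hypothetical stand-in for the broadcast <URL, topics> lookup built from DMOZ.
dmoz_topics = {
    "http://a.example": {"Adult"},
    "http://b.example": {"Arts", "Music"},
    "http://c.example": {"Adult", "Shopping"},
}

def pseudo_classify(url, topics_by_url):
    """Assumed rule: a URL is expected to be blocked if it has an 'Adult' topic."""
    return "Adult" in topics_by_url.get(url, set())

def score_probe(record, topics_by_url):
    """Map step: probe record -> (isp, (correct, overblock, underblock, total))."""
    isp, url, blocked = record
    expected = pseudo_classify(url, topics_by_url)
    return (isp, (int(blocked == expected),
                  int(blocked and not expected),   # blocked but not adult
                  int(expected and not blocked),   # adult but not blocked
                  1))

def merge(a, b):
    """ReduceByKey step: add the count tuples element-wise."""
    return tuple(x + y for x, y in zip(a, b))

def per_isp_accuracy(records, topics_by_url):
    merged = {}
    for isp, counts in (score_probe(r, topics_by_url) for r in records):
        merged[isp] = merge(merged[isp], counts) if isp in merged else counts
    return {isp: c[0] / c[3] for isp, c in merged.items()}, merged

# Hypothetical probe results: (isp, url, blocked)
probes = [
    ("ISP-1", "http://a.example", True),
    ("ISP-1", "http://b.example", True),
    ("ISP-1", "http://c.example", False),
    ("ISP-2", "http://a.example", True),
    ("ISP-2", "http://b.example", False),
]
accuracy, counts = per_isp_accuracy(probes, dmoz_topics)
```

In Spark, the lookup would typically be collected to a local map and shared with `sc.broadcast(...)`, then read inside the map function through the broadcast variable's `.value`, so each executor receives one copy rather than one per task.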

What did we find?
For examples of overblocks & underblocks:
https://github.com/openrightsgroup/cmp-analysis/tree/master/data/output
30% to 82% of sites are underblocked
2% to 6% of sites are overblocked

Questions?
Web: http://www.lancaster.ac.uk/staff/rowem/ (For publications and current projects)
Code: https://github.com/mattroweshow/
Email: m.rowe@lancaster.ac.uk
Twitter: @mrowebot
