Social Computing Research with Apache Spark
Hadoop and Big Data Meetup Manchester, July 2015
Dr. Matthew Rowe
Lecturer in Social Computing | M.Sc. Data Science Director
http://www.lancaster.ac.uk/staff/rowem/ | m.rowe@lancaster.ac.uk | @mrowebot
Social Computing Research
The investigation of how and why social behaviour occurs in computational systems
https://prinsayn.files.wordpress.com/2013/01/tile.jpg
Social Computing Team
Small team comprised of myself + 3 Ph.D. students
Researching a range of social computing topics:
•  Churn prediction
•  User engagement in social systems
•  Recommender systems
•  Information diffusion
•  Digital accountability
Common theme: investigating and applying data mining techniques to large-scale data
'Culture' Parallel Processing Cluster
Contains 12 rack-mounted Dell servers
•  1 x 4 core with 64 GB RAM
•  1 x 6 core with 64 GB RAM
•  10 x 2 core with 16 GB RAM
Cloudera + Apache Mesos installed on the cluster to provide access to:
•  Apache Hadoop Stack (HDFS, HBase)
•  Apache Spark
•  RabbitMQ (for custom distributed processing apps)
   •  E.g. parallelised parameter tuning for recommender systems
Language Innovation Diffusion on Social Media
Diffusion of Language Innovation
Language innovation can take various forms:
•  Neologisms (e.g. brah)
•  Word blends (e.g. downvoting, cooldown)
•  Shortening (e.g. ur)
Studying the adoption of such innovations is hard:
•  Interviews with different communities
•  Travelling between different locations
•  Relies on understanding the agent, the social structure + their interplay
Social media allows language spread to be investigated at scale:
•  To understand who influences whom
•  To understand the language of brands' audiences
Computing Language Innovation Diffusion
1.  Variation in Term Frequency
Probability of a term being used in a context: time (week) and community (e.g. subreddit)
2.  Variation in Term Form
Probability of a term having a suffix or prefix added (as a word blend)
Goal: Look for significant increases & decreases in term frequency and form:
(i) globally in a system
(ii) locally in communities
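The two measures above can be sketched in plain Python. This is a minimal illustration, not the speaker's implementation: the function names, the toy token list, and the crude affix test (a token counts as a blended form if it starts or ends with the term but is longer than it) are all assumptions made for the example.

```python
from collections import Counter

def term_frequency_probability(tokens, term):
    """Share of the tokens in one context (e.g. one subreddit-week) equal to `term`."""
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)

def term_form_probability(tokens, term):
    """Among tokens that start or end with `term`, the share carrying an extra prefix or suffix.

    Assumption: a simple string test stands in for real morphological analysis.
    """
    related = [t for t in tokens if t.startswith(term) or t.endswith(term)]
    if not related:
        return 0.0
    blended = [t for t in related if t != term]
    return len(blended) / len(related)

# Hypothetical context: tokens posted in one community during one week.
week_tokens = ["downvote", "downvoting", "vote", "upvote", "vote", "cooldown"]
freq = term_frequency_probability(week_tokens, "vote")
form = term_form_probability(week_tokens, "vote")
```

Computing both values per (week, community) context gives the local series, and pooling all communities for a week gives the global one.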
The Role of Apache Spark
•  Collect datasets from Twitter and Reddit
•  Identify innovations
•  Compute frequency and form values
•  Write significant increases + decreases to HDFS

•  Point to TSV file in HDFS
•  Map: return <<term, context>, value> pairs as RDD
•  ReduceByKey: merge <<term, context>, value> pairs as RDD
•  Identify significant contextual increases + decreases
Note: the key here is a tuple
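The Map and ReduceByKey steps above can be imitated locally in plain Python to show the tuple key at work. This is a sketch under assumptions: the four-column TSV layout, the helper names, and the 50% week-on-week threshold standing in for a significance test are hypothetical, and `reduce_by_key` is only a local stand-in for Spark's `reduceByKey`.

```python
from collections import defaultdict

def parse_line(line):
    """Map step: one TSV row -> ((term, context), value), with context = (community, week)."""
    term, community, week, value = line.rstrip("\n").split("\t")
    return (term, (community, int(week))), float(value)

def reduce_by_key(pairs, merge):
    """Local stand-in for Spark's reduceByKey: merge the values that share a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = merge(merged[key], value) if key in merged else value
    return merged

def contextual_changes(merged, threshold=0.5):
    """Flag week-on-week relative changes larger than `threshold` (an assumed cut-off)."""
    series = defaultdict(dict)
    for (term, (community, week)), value in merged.items():
        series[(term, community)][week] = value
    flagged = []
    for (term, community), by_week in sorted(series.items()):
        weeks = sorted(by_week)
        for prev, cur in zip(weeks, weeks[1:]):
            before, after = by_week[prev], by_week[cur]
            if before > 0 and abs(after - before) / before > threshold:
                direction = "increase" if after > before else "decrease"
                flagged.append((term, community, cur, direction))
    return flagged

# Hypothetical rows: term, community, week, value.
rows = [
    "brah\tr/fitness\t1\t2",
    "brah\tr/fitness\t1\t2",
    "brah\tr/fitness\t2\t10",
    "ur\tr/gaming\t1\t8",
    "ur\tr/gaming\t2\t7",
]
merged = reduce_by_key(map(parse_line, rows), lambda a, b: a + b)
changes = contextual_changes(merged)
```

In PySpark the first two steps would typically read `sc.textFile(...).map(parse_line).reduceByKey(lambda a, b: a + b)`, with the path pointing at the TSV file in HDFS. Because the key is the tuple (term, context), values for the same term in different communities or weeks are never merged together.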
Increase in Frequency
Increase in Form
Investigating UK Web Filters (A Data Science Approach)
UK Web Filtering: Default-on
Collateral Filtering
How accurate are the filters?
What is being overblocked and underblocked?
Censorship Monitoring Project
Open Rights Group constructed a system of probes to check URLs across ISPs for blocking
Lancaster’s goal: build a system to gauge filters’ accuracy and categories of blocks
Pseudo-classifiers in Apache Spark
https://github.com/openrightsgroup/cmp-analysis
Broadcasting DMOZ Category RDD <URL, topics>
•  Point to DMOZ JSON file in HDFS
•  Map: return <URL, topics> pairs as RDD
•  ReduceByKey: merge <URL, topics> as RDD
•  Point to Adult DMOZ JSON file in HDFS
•  Map: return <URL, topics> pairs as RDD
•  ReduceByKey: merge <URL, topics> as RDD
•  Take union of RDD objects
•  Broadcast RDD to cluster
Computing per-ISP accuracy
•  Point to probe request file in HDFS
•  Retrieve DMOZ RDD from broadcast
•  Map: return <ISP, Result> RDD w/ pseudo-classifiers
•  ReduceByKey: merge <ISP, Result> as RDD
•  Collect map and compute per-ISP accuracy
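The per-ISP accuracy steps above can be sketched locally in plain Python. This is an illustration under assumptions, not the code in the cmp-analysis repository: the example URLs, the ISP names, the rule "a URL should be blocked if DMOZ lists it under Adult", and the four-counter result are all hypothetical. In Spark the small URL-to-topics table would normally be collected to the driver and shared through a broadcast variable; here it is just a dictionary.

```python
# URL -> topics lookup; stands in for the broadcast DMOZ table (hypothetical entries).
dmoz_topics = {
    "adult-example.test": {"Adult"},
    "news-example.test": {"News"},
    "shop-example.test": {"Shopping"},
}

def classify(probe):
    """Map step: (isp, url, blocked) -> (isp, [tp, fp, fn, tn]) using the DMOZ lookup."""
    isp, url, blocked = probe
    should_block = "Adult" in dmoz_topics.get(url, set())
    if blocked and should_block:
        counts = [1, 0, 0, 0]  # correctly blocked
    elif blocked:
        counts = [0, 1, 0, 0]  # overblock
    elif should_block:
        counts = [0, 0, 1, 0]  # underblock
    else:
        counts = [0, 0, 0, 1]  # correctly allowed
    return isp, counts

def merge_counts(a, b):
    """Reduce step: add two result vectors for the same ISP."""
    return [x + y for x, y in zip(a, b)]

def per_isp_accuracy(probes):
    """Merge results by ISP, then compute accuracy = (tp + tn) / all probes."""
    merged = {}
    for isp, counts in map(classify, probes):
        merged[isp] = merge_counts(merged[isp], counts) if isp in merged else counts
    return {isp: (tp + tn) / (tp + fp + fn + tn)
            for isp, (tp, fp, fn, tn) in merged.items()}

# Hypothetical probe results: (ISP, URL, was the request blocked?).
probes = [
    ("ISP-A", "adult-example.test", True),
    ("ISP-A", "news-example.test", False),
    ("ISP-A", "shop-example.test", True),
    ("ISP-B", "adult-example.test", False),
    ("ISP-B", "news-example.test", False),
]
accuracy = per_isp_accuracy(probes)
```

Keeping the four counters separate until the final step means overblock and underblock rates can be read off the same merged result.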
What did we find?
For examples of overblocks & underblocks:
https://github.com/openrightsgroup/cmp-analysis/tree/master/data/output
•  30% to 82% of sites are underblocked
•  2% to 6% of sites are overblocked
M.Sc. Data Science
Questions?
Web: http://www.lancaster.ac.uk/staff/rowem/ (For publications and current projects)
Code: https://github.com/mattroweshow/
Email: m.rowe@lancaster.ac.uk
Twitter: @mrowebot

Enjoy ➥8448380779▻ Call Girls In Noida Sector 93 Escorts Delhi NCR
 
Unlock Your Social Media Potential with IndianLikes - IndianLikes.com
Unlock Your Social Media Potential with IndianLikes - IndianLikes.comUnlock Your Social Media Potential with IndianLikes - IndianLikes.com
Unlock Your Social Media Potential with IndianLikes - IndianLikes.com
 
The--Fraud: Netflix Original Media Pitch
The--Fraud: Netflix Original Media PitchThe--Fraud: Netflix Original Media Pitch
The--Fraud: Netflix Original Media Pitch
 
fraud storyboards powerpoint media project
fraud storyboards powerpoint media projectfraud storyboards powerpoint media project
fraud storyboards powerpoint media project
 
young call girls in Greater Noida 🔝 9953056974 🔝 Delhi escort Service
young call girls in  Greater Noida 🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in  Greater Noida 🔝 9953056974 🔝 Delhi escort Service
young call girls in Greater Noida 🔝 9953056974 🔝 Delhi escort Service
 
YouScan Company Overview - Social Media Listening with Visual Insights.pdf
YouScan Company Overview - Social Media Listening with Visual Insights.pdfYouScan Company Overview - Social Media Listening with Visual Insights.pdf
YouScan Company Overview - Social Media Listening with Visual Insights.pdf
 
AI Virtual Influencers: The Future of Influencer Marketing
AI Virtual Influencers:  The Future of Influencer MarketingAI Virtual Influencers:  The Future of Influencer Marketing
AI Virtual Influencers: The Future of Influencer Marketing
 
Cosmic Conversations with Sociocosmos...
Cosmic Conversations with Sociocosmos...Cosmic Conversations with Sociocosmos...
Cosmic Conversations with Sociocosmos...
 
THE FRAUD NETFLIX ORIGINAL MEDIA PITCH PROJECT
THE FRAUD NETFLIX ORIGINAL MEDIA PITCH PROJECTTHE FRAUD NETFLIX ORIGINAL MEDIA PITCH PROJECT
THE FRAUD NETFLIX ORIGINAL MEDIA PITCH PROJECT
 
办理伯明翰大学毕业证书文凭学位证书
办理伯明翰大学毕业证书文凭学位证书办理伯明翰大学毕业证书文凭学位证书
办理伯明翰大学毕业证书文凭学位证书
 
Unveiling SOCIO COSMOS: Where Socializing Meets the Stars
Unveiling SOCIO COSMOS: Where Socializing Meets the StarsUnveiling SOCIO COSMOS: Where Socializing Meets the Stars
Unveiling SOCIO COSMOS: Where Socializing Meets the Stars
 

Social Computing Research with Apache Spark

  • 1. Social Computing Research with Apache Spark
    Hadoop and Big Data Meetup Manchester, July 2015
    Dr. Matthew Rowe
    Lecturer in Social Computing | M.Sc. Data Science Director
    http://www.lancaster.ac.uk/staff/rowem/ | m.rowe@lancaster.ac.uk | @mrowebot
  • 2. Social Computing Research
    The investigation of how and why social behaviour occurs in computational systems
    Image: https://prinsayn.files.wordpress.com/2013/01/tile.jpg
  • 3. Social Computing Team
    Small team comprised of myself + 3 Ph.D. students
    Researching a range of social computing topics:
      • Churn prediction
      • User engagement in social systems
      • Recommender systems
      • Information diffusion
      • Digital Accountability
    Common theme: investigating and applying data mining techniques to large-scale data
  • 4. 'Culture' Parallel Processing Cluster
    Contains 12 rack-mounted Dell servers:
      • 1 x 4 core with 64 GB RAM
      • 1 x 6 core with 64 GB RAM
      • 10 x 2 core with 16 GB RAM
    Cloudera + Apache Mesos installed on the cluster to provide access to:
      • Apache Hadoop Stack (HDFS, HBase)
      • Apache Spark
      • RabbitMQ (for custom distributed processing apps)
        • E.g. Parallelised parameter tuning for Recommender Systems
  • 6. Diffusion of Language Innovation
    Language innovation can take various forms:
      • Neologisms (e.g. brah)
      • Word blends (e.g. downvoting, cooldown)
      • Shortening (e.g. ur)
    Studying the adoption of such innovations is hard:
      • Interviews with different communities
      • Travelling between different locations
      • Relies on understanding the agent, the social structure + their interplay
    Social media allows language spread to be investigated at scale:
      • To understand who influences whom
      • To understand the language of brands' audiences
  • 7. Computing Language Innovation Diffusion
    1. Variation in Term Frequency
       Probability of a term being used in a context: time (week) and community (e.g. subreddit)
    2. Variation in Term Form
       Probability of a term having a suffix or prefix added (as a word blend)
    Goal: Look for significant increases & decreases in term frequency and form:
      (i) globally in a system
      (ii) locally in communities
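The two measures on this slide can be sketched in plain Python. This is only a minimal illustration, assuming simple count-based estimates (occurrences divided by totals); the slides do not state the exact estimator or the significance test used. The function names, the whitespace tokenisation, and the toy posts are all hypothetical.

```python
from collections import Counter

def term_frequency_prob(posts, term, week, community):
    """Estimate P(term | week, community) as the share of tokens
    in that context that are exactly `term`."""
    tokens = [
        tok
        for (w, c, text) in posts
        if w == week and c == community
        for tok in text.lower().split()
    ]
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)

def term_form_prob(posts, term):
    """Estimate how often `term` occurs with a prefix or suffix attached
    (i.e. inside a longer token), out of all its occurrences."""
    plain = 0
    blended = 0
    for (_week, _community, text) in posts:
        for tok in text.lower().split():
            if tok == term:
                plain += 1
            elif term in tok:
                blended += 1
    total = plain + blended
    return blended / total if total else 0.0

# Hypothetical toy data: (week, community, text)
posts = [
    (1, "gaming", "is ur cooldown long"),
    (1, "gaming", "sit down now"),
    (2, "gaming", "ur downvoting ur own post"),
    (2, "gaming", "ur down"),
]

p_week1 = term_frequency_prob(posts, "ur", 1, "gaming")
p_week2 = term_frequency_prob(posts, "ur", 2, "gaming")
change = p_week2 - p_week1          # positive means usage rose week on week
form = term_form_prob(posts, "down")  # share of "down" occurrences that are blends
```

Whether a change such as `change` counts as significant would depend on a statistical test that the slides do not describe, so it is left out of this sketch.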
  • 8. The Role of Apache Spark
      • Collect datasets from Twitter and Reddit
      • Identify innovations
      • Compute frequency and form values
      • Write significant increases + decreases to HDFS
      • Point to TSV file in HDFS
      • Map: return <<term, context>, value> pairs as RDD
      • ReduceByKey: merge <<term, context>, value> pairs as RDD
      • Identify significant contextual increases + decreases
    Note: the key here is a tuple
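The Map and ReduceByKey steps with a tuple key can be illustrated without a cluster. The sketch below is plain Python that mimics the pattern locally; it is not the talk's actual code. The TSV column layout (term, week, community, value), the helper names, and the choice of addition as the merge function are assumptions made for illustration.

```python
from operator import add

def map_record(line):
    """Map step: parse one TSV line into a ((term, context), value) pair.
    The key is a tuple, and the context is itself a (week, community) tuple."""
    term, week, community, value = line.rstrip("\n").split("\t")
    context = (int(week), community)
    return ((term, context), float(value))

def reduce_by_key(pairs, func):
    """Local stand-in for a reduce-by-key: merge values that share a key."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

# Hypothetical TSV lines: term, week, community, value
lines = [
    "ur\t1\tgaming\t2",
    "ur\t1\tgaming\t3",
    "ur\t2\tgaming\t4",
    "brah\t1\tfitness\t1",
]

pairs = [map_record(line) for line in lines]
totals = reduce_by_key(pairs, add)
```

Tuples work as keys because they are hashable, which is what lets a (term, context) pair be grouped as a single unit. In PySpark the same shape would typically be expressed with `RDD.map` and `RDD.reduceByKey` on an RDD read from HDFS.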
  • 11. Investigating UK Web Filters (A Data Science Approach)
  • 12. UK Web Filtering: Default-on
  • 13. Collateral Filtering
    How accurate are the filters?
    What is being overblocked and underblocked?
  • 14. Censorship Monitoring Project
    Open Rights Group constructed a system of probes to check URLs across ISPs for blocking
    Lancaster's goal: build a system to gauge filters' accuracy and categories of blocks
  • 15. Pseudo-classifiers in Apache Spark
    https://github.com/openrightsgroup/cmp-analysis
    Broadcasting DMOZ Category RDD <URL, topics>:
      • Point to DMOZ JSON file in HDFS
      • Map: return <URL, topics> pairs as RDD
      • ReduceByKey: merge <URL, topics> as RDD
      • Point to Adult DMOZ JSON file in HDFS
      • Map: return <URL, topics> pairs as RDD
      • ReduceByKey: merge <URL, topics> as RDD
      • Take union of RDD objects
      • Broadcast RDD to cluster
    Computing per-ISP accuracy:
      • Point to probe request file in HDFS
      • Retrieve DMOZ RDD from broadcast
      • Map: return <ISP, Result> RDD w/ pseudo-classifiers
      • ReduceByKey: merge <ISP, Result> as RDD
      • Collect map and compute per-ISP accuracy
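The per-ISP accuracy step can be sketched in plain Python as follows. The slides do not define the pseudo-classifiers, so this sketch assumes a deliberately simple rule: a URL should be blocked only if its DMOZ topics include "Adult". The lookup dictionary stands in for the broadcast <URL, topics> data, and all names, URLs, and probe records are hypothetical; the project's real logic lives in the cmp-analysis repository and may differ.

```python
from collections import Counter

# Hypothetical stand-in for the broadcast <URL, topics> lookup built from DMOZ.
url_topics = {
    "http://adult.example": {"Adult"},
    "http://news.example": {"News"},
    "http://shop.example": {"Shopping"},
}

def classify(probe, lookup):
    """Map step: turn one probe record into an (ISP, outcome) pair.
    Assumption: only URLs listed under the 'Adult' topic should be blocked."""
    isp, url, blocked = probe
    should_block = "Adult" in lookup.get(url, set())
    if blocked and not should_block:
        outcome = "overblock"
    elif should_block and not blocked:
        outcome = "underblock"
    else:
        outcome = "correct"
    return (isp, outcome)

def per_isp_accuracy(probes, lookup):
    """Merge outcomes per ISP, then compute the share of correct decisions."""
    counts = {}
    for probe in probes:
        isp, outcome = classify(probe, lookup)
        counts.setdefault(isp, Counter())[outcome] += 1
    return {isp: c["correct"] / sum(c.values()) for isp, c in counts.items()}

# Hypothetical probe records: (ISP, URL, was the URL blocked?)
probes = [
    ("isp_a", "http://adult.example", True),
    ("isp_a", "http://news.example", True),
    ("isp_a", "http://shop.example", False),
    ("isp_b", "http://adult.example", False),
    ("isp_b", "http://news.example", False),
]

accuracy = per_isp_accuracy(probes, url_topics)
```

Keeping the URL-to-topics table as a read-only lookup shared by every worker is the reason for broadcasting it: each probe record can then be classified independently in the map step.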
  • 16. What did we find?
    For examples of overblocks & underblocks:
    https://github.com/openrightsgroup/cmp-analysis/tree/master/data/output
      • 30% to 82% of sites are underblocked
      • 2% to 6% of sites are overblocked
  • 20. Questions?
    Web: http://www.lancaster.ac.uk/staff/rowem/ (For publications and current projects)
    Code: https://github.com/mattroweshow/
    Email: m.rowe@lancaster.ac.uk
    Twitter: @mrowebot