Hadoop in Social Network Analysis
- a review of tools and some best practices -
Hadoop User Group Munich | 2013-05-22
Mirko Kämpf | mirko@cloudera.com
Monday, 27 May 2013
OVERVIEW ...
About me ... (Java, Physics, Wikipedia, Hadoop & Cloudera)
Topic : Complex Systems Analysis, from Time Series to Networks
Data, data, and even more data ... but how to handle it?
Some results of our project ...
Lessons learned ... some recommendations.
Hadoop in Social Network Analysis
Abstract: A Hadoop cluster is the tool of choice for many large-scale analytics applications. A large variety of commercial tools is available for typical SQL-like or data warehouse applications, but how do we deal with networks and time series?
How should data for social media analysis be collected and stored, and what are good practices for working with libraries like Mahout and Giraph?
The sample use case deals with a data set from Wikipedia to illustrate how to combine multiple public data sources with personal data collections, e.g. from Twitter or even personal mailboxes. We discuss efficient approaches for data organisation, data preprocessing, and time-dependent graph analysis.
Social networks consist of nodes, which are the real-world objects, and edges, which are e.g. relations, interactions, or dependencies between nodes.
[Figure: keyword overview. Labels: Datatypes, I/O-Formats, Storage System, Random-Access, Bulk Processing, HBase, OpenTSDB, Hive, Pig, SequenceFile, VectorWritable, Partitions, Sampling Rate. Images from Google image search.]
INTRODUCTION
Social online systems are complex systems used, e.g., for spreading information.
We develop and apply tools from time series analysis and network analysis
to study the static and dynamic properties of social online systems and their relations.
Webpages (the nodes of the WWW) are linked in different but related ways:
• direct links pointing from one page to another (binary, directional)
• similar access activity (cross-correlated time series of download rates)
• similar edit activity (synchronized events of edits or changes)
We extract the time evolution of these three networks from real data.
Nodes are identical for all three studied networks, but links, network structure, and dynamics differ. We quantify how the interrelations and interdependencies between the three networks change in time and affect each other.
Example: Wikipedia → reconstruct co-evolving networks
1. Cross-link network between articles (pages, nodes)
2. Access behavior network (characterizing similarity in article reading behavior)
3. Edit activity network (characterizing similarities in edit activity for each article)
CHALLENGES ...
• Data points (time series) collected at independent locations or obtained from individual objects do not show dependencies directly.
• Calculating several types of correlations is a common task, but how are the results affected by special properties of the raw data?
• What do the different correlations mean, and how can we eliminate artifacts of the calculation method?
[Figure: complex system and its time evolution (???)]
• Element properties: measured data is “disconnected”
• System properties: derived from relations between elements and the structure of the network
TYPICAL TASK:
• Time Series Analysis
• If our data set is prepared and we have records with well-defined properties, then UDFs in Hive and Pig work well.
• How to organize the data in records?
• How to deal with sliding windows?
• How to handle intermediate data?
TIME SERIES: WIKIPEDIA ACTIVITY
[Figure] Examples of Wikipedia access time series for three articles with (a,b) stationary access rates ('Illuminati (book)'), (c,d) an endogenous burst of activity ('Heidelberg'), and (e,f) an exogenous burst of activity ('Amoklauf Erfurt'). The left parts show the complete hourly access-rate time series (from January 1, 2009, until October 21, 2009, i.e. 42 weeks = 294 days = 7056 hours). The right parts show edit-event data for the three representative articles.
Node = article (specific topic)
Available data:
1. Hourly access frequency (number of article downloads for each hour in ≈ 300 days)
2. Edit events (time stamps for all changes in the article pages)
Preprocessing with Map-Reduce / UDFs (calculations on single records):
• filtering, resampling
• feature extraction (peak detection)
• creation of (non-)overlapping episodes or (sliding) windows
• creation of time series pairs for cross-correlation or event synchronisation
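The windowing and pairing steps listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`sliding_windows`, `pearson`, `windowed_correlation`) and the toy access counts are hypothetical, the Pearson coefficient stands in for whatever correlation measure the project actually used, and in a real cluster the same per-record logic would live inside a Map-Reduce job or a UDF.

```python
from math import sqrt

def sliding_windows(series, width, step):
    """Yield (start_index, window) for windows of `width` samples, moved by `step`."""
    for start in range(0, len(series) - width + 1, step):
        yield start, series[start:start + width]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant window carries no correlation information
    return cov / sqrt(var_x * var_y)

def windowed_correlation(a, b, width, step):
    """Correlate two aligned series window by window."""
    windows_a = sliding_windows(a, width, step)
    windows_b = sliding_windows(b, width, step)
    return [(start, pearson(x, y)) for (start, x), (_, y) in zip(windows_a, windows_b)]

# Toy hourly access counts for two articles (made-up numbers).
article_a = [1, 2, 3, 4, 5, 6, 7, 8]
article_b = [2, 4, 6, 8, 7, 5, 3, 1]

# step == width gives non-overlapping episodes; step < width gives sliding windows.
result = windowed_correlation(article_a, article_b, width=4, step=4)
for start, r in result:
    print(start, round(r, 3))
```

With `step` equal to `width` the windows do not overlap; choosing a smaller `step` produces overlapping sliding windows at the cost of more intermediate data.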
• Graph Analysis
• If network data is prepared as an adjacency list or an adjacency matrix, tools like Giraph or HAMA work well.
But: only if one has the appropriate InputFormat reader.
TYPES OF NETWORKS
a) unipartite network, one type of nodes and links
b) bipartite network, one type of connections
c) hypergraph, one link relates more than two nodes
[Figure: multiplex networks; hierarchical network]
• Graph Analysis
• How to organize node/edge properties?
• How to deal with time-dependent properties?
• How to calculate link properties on the fly?
• Creating an adjacency matrix is not trivial.
• An adjacency matrix is an inefficient format.
• Files in HDFS cannot be changed.
• Store dynamic edge/node properties in HBase.
• Aggregate relevant data into network snapshots.
• Store intermediate results back in HBase and preprocess this data in the following step.
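A minimal sketch of the snapshot idea, assuming time-stamped edge records of the form `(time, source, target, weight)`. The record layout, the `threshold` filter, and the function name `snapshot_adjacency` are assumptions made for illustration; a plain Python list stands in for the store (HBase in the talk) that would hold the dynamic edge properties.

```python
from collections import defaultdict

def snapshot_adjacency(edge_records, window, threshold):
    """Aggregate (time, source, target, weight) records into one adjacency list per time window."""
    snapshots = defaultdict(lambda: defaultdict(dict))
    for t, src, dst, weight in edge_records:
        if weight < threshold:
            continue  # keep only sufficiently strong links
        bucket = t // window  # index of the time window this record falls into
        snapshots[bucket][src][dst] = weight  # a later record overwrites an earlier one
    # Convert nested defaultdicts into plain dicts for easier inspection.
    return {b: {node: dict(nbrs) for node, nbrs in adj.items()}
            for b, adj in snapshots.items()}

# Toy records: (time, source, target, link strength); the values are made up.
records = [
    (0, "A", "B", 0.9),
    (1, "A", "C", 0.2),
    (5, "B", "C", 0.7),
    (6, "A", "B", 0.8),
]

snapshots = snapshot_adjacency(records, window=5, threshold=0.5)
print(snapshots)
```

Each snapshot is a sparse adjacency list, which avoids materializing a full adjacency matrix for networks in which most node pairs are not linked.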
RECAP: WHAT IS HADOOP?
• A distributed platform for processing massive amounts of data in parallel
• Implements the Map-Reduce paradigm on top of the Hadoop Distributed File System (HDFS)
• Map-Reduce: Java API to implement a map and a reduce phase
• The map phase processes key/value pairs
• The reduce phase processes a key and the list of its values
• HDFS files consist of one or more blocks (distributed chunks of data)
• Chunks are distributed transparently (in the background) and processed in parallel
• Data locality is used when possible by assigning the map task to a node that contains the chunk locally
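The key/value flow can be illustrated with a small in-memory simulation. This is not the Hadoop Java API: it is a single-process Python sketch in which `map_phase`, `reduce_phase`, and `run_job` are hypothetical names, the input line format `article hour count` is assumed, and a dictionary plays the role of the shuffle that groups values by key.

```python
from collections import defaultdict

def map_phase(line):
    """Map: turn one input record into (key, value) pairs."""
    article, _hour, count = line.split()
    yield article, int(count)

def reduce_phase(key, values):
    """Reduce: receive one key together with the list of all its values."""
    return key, sum(values)

def run_job(lines):
    """Simulate map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            groups[key].append(value)
    return dict(reduce_phase(key, values) for key, values in sorted(groups.items()))

# Toy input: "article hour download_count" (made-up numbers).
lines = ["Heidelberg 0 12", "Erfurt 0 3", "Heidelberg 1 8"]
totals = run_job(lines)
print(totals)
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the grouping step moves data across the network between the two phases.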
TYPICAL APPLICATIONS
• Filter, group, and join operations on large data sets ...
• The data set (or a part of it, if partitioning is used) is streamed and processed in parallel, but not in real time
• Algorithms like k-Means clustering (in Apache Mahout) or Map-Reduce-based implementations of single-source shortest path (SSSP) work in multiple iterations
• Data is loaded from disk to CPU in each iteration
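To make the cost of iterative algorithms concrete, here is a hedged sketch of SSSP written in a Map-Reduce style. It runs in one Python process; the graph, the function names, and the stop criterion (no distance changed) are assumptions for illustration. The point it shows is that every iteration scans the complete graph again, which on Hadoop means re-reading it from disk.

```python
INF = float("inf")

def sssp_iteration(graph, dist):
    """One Map-Reduce-style pass over the whole graph."""
    candidates = {}
    for node, neighbours in graph.items():  # "map": full scan of the graph
        candidates.setdefault(node, []).append(dist[node])
        for target, weight in neighbours.items():
            candidates.setdefault(target, []).append(dist[node] + weight)
    # "reduce": keep the smallest candidate distance per node
    return {node: min(values) for node, values in candidates.items()}

def sssp(graph, source):
    """Repeat full passes until no distance changes; return distances and pass count."""
    dist = {node: INF for node in graph}
    dist[source] = 0
    iterations = 0
    while True:
        new_dist = sssp_iteration(graph, dist)
        iterations += 1
        if new_dist == dist:
            return dist, iterations
        dist = new_dist

# Toy weighted, directed graph; every node appears as a key.
graph = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2},
    "C": {"D": 1},
    "D": {},
}

distances, iterations = sssp(graph, "A")
print(distances, iterations)
```

The number of passes grows with the longest shortest path in the graph, so deep graphs need many full scans of the input.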
OVERVIEW - DATA FLOW
[Figure: data flow diagram. Labels: MR, UDF; Hive, Pig, Impala; Giraph, Mahout; Sqoop; MR; HBase (cache); HDFS (static group)]
Network Clustering != k-Means Clustering
RESULTS : CROSS-CORRELATION
[Figure] Distribution of cross-correlation coefficients for pairs of access-rate time series of Wikipedia pages (top) compared to surrogate data (bottom); 100 shuffled configurations are considered.
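A surrogate test of this kind can be sketched as follows. The slides do not say how the shuffling was done, so this example makes explicit assumptions: only one series of each pair is shuffled, the Pearson coefficient is used as the correlation measure, and the synthetic daily pattern stands in for real access-rate data. The names `pearson` and `surrogate_coefficients` are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def surrogate_coefficients(x, y, n_surrogates=100, seed=0):
    """Correlate x with shuffled copies of y; shuffling destroys the temporal order."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    coefficients = []
    for _ in range(n_surrogates):
        shuffled = list(y)
        rng.shuffle(shuffled)
        coefficients.append(pearson(x, shuffled))
    return coefficients

# Synthetic example: two series sharing the same 24-hour pattern over 10 days.
x = [float(i % 24) for i in range(240)]
y = [value + 1.0 for value in x]

original = pearson(x, y)
surrogates = surrogate_coefficients(x, y)
print(round(original, 3), round(max(abs(c) for c in surrogates), 3))
```

Comparing the original coefficient with the spread of the surrogate coefficients gives a rough reference for what correlation values arise by chance from the value distributions alone.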
EVOLUTION OF CORRELATION NETWORKS
[Figure: evolution of correlation networks; axis: time]
Obviously, the distribution of link strength values changes in time, but only a calculation of structural properties of the underlying network allows a detailed view of the dynamics of the system.
EXAMPLES OF RECONSTRUCTED NETWORKS
Spatially embedded correlation networks for the Wikipedia pages of all German cities.
[Figure panels: access network; edit network]
RECOMMENDATIONS (1.)
• Create algorithms based on reusable components!
• Select or create stable and standardized I/O-Formats!
• Use preprocessing to reorganize unstructured data if you have to process the data many times.
• Store intermediate data (e.g. time-dependent properties) in HBase, near the raw data.
• Collect time series data in HBase and create time series buckets for advanced procedures on a subset of the data.
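One possible way to form time series buckets is to group samples under a composite key made of the article name and the start of the bucket. The sketch below is an assumption, not the layout used in the talk: the `article#bucket_start` key format, the one-day bucket size, and the function names are all hypothetical, and a Python dictionary stands in for the HBase table.

```python
from collections import defaultdict

BUCKET_SECONDS = 24 * 3600  # one bucket per day (an arbitrary choice for this sketch)

def row_key(article, timestamp):
    """Hypothetical composite key: article name plus zero-padded bucket start time."""
    bucket_start = timestamp - timestamp % BUCKET_SECONDS
    return f"{article}#{bucket_start:010d}"

def build_buckets(samples):
    """Group (article, timestamp, value) samples into time-ordered buckets."""
    buckets = defaultdict(list)
    for article, timestamp, value in samples:
        buckets[row_key(article, timestamp)].append((timestamp, value))
    return {key: sorted(points) for key, points in buckets.items()}

# Toy samples: (article, seconds since start, access count); the values are made up.
samples = [
    ("Heidelberg", 3600, 12),
    ("Heidelberg", 0, 7),
    ("Heidelberg", 86400, 5),
    ("Erfurt", 7200, 3),
]

buckets = build_buckets(samples)
for key in sorted(buckets):
    print(key, buckets[key])
```

Zero-padding the bucket start keeps the keys of one article in chronological order when they are sorted as strings, so a range of buckets can be read as one contiguous scan.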
RECOMMENDATIONS (2.)
• Consider design patterns:
• Partitioning vs. binning
• Map-side vs. reduce-side joins
• Use Bulk Synchronous Parallel (BSP) processing for graph processing instead of Map-Reduce, or even a combination of both.
• As in classical programming, so also in Hadoop: find a good data representation and find good algorithms.
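The difference between the two join patterns can be sketched in plain Python. This is a single-process illustration with hypothetical function names and toy data, not Hadoop code: the reduce-side variant groups both inputs by key first (the role of the shuffle), while the map-side variant assumes that one input is small enough to be held in memory by every map task.

```python
from collections import defaultdict

def reduce_side_join(left, right):
    """Group both inputs by key (the shuffle), then combine matches in the reducer."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return sorted((key, lv, rv)
                  for key, (lefts, rights) in groups.items()
                  for lv in lefts for rv in rights)

def map_side_join(large, small):
    """Keep the small input in memory and join while streaming the large one."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    return sorted((key, value, match)
                  for key, value in large
                  for match in lookup.get(key, []))

# Toy inputs: access totals per article and a small lookup table (made-up values).
access = [("Heidelberg", 20), ("Erfurt", 3), ("Berlin", 50)]
states = [("Heidelberg", "BW"), ("Erfurt", "TH")]

joined_reduce = reduce_side_join(access, states)
joined_map = map_side_join(access, states)
print(joined_reduce)
```

Both functions compute the same inner join; the map-side variant avoids the shuffle entirely, which is why it is usually preferred when one side fits into memory.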
Meetup - Hadoop User Group - Munich : 2013-05-22

  • 1. Headline Goes Here Speaker Name or Subhead Goes Here Hadoop in Social Network Analysis - review on tools and some best practices - Hadoop User Group Munich | 2013-05-22 Mirko Kämpf | mirko@cloudera.com Montag, 27. Mai 13
  • 2. OVERVIEW ... About me ... (Java, Physics,Wikipedia, Hadoop & Cloudera) Topic : Complex Systems Analysis, from Time Series to Networks Data, data, and even more data ... but how to handle it? Some results of our project ... Lessons learned ... some recommendations. Montag, 27. Mai 13
  • 3. urlurl Hadoop in Social Network Analysis Abstract: A Hadoop cluster is the tool of choice for many large scale analytics applications. A large variety of commercial tools is available for typical SQL like or data warehouse applications, but how to deal with networks and time series? How to collect and store data for social media analysis and what are good practices for working with libraries like Mahout and Giraph? The sample use case deals with a data set from Wikipedia to illustrate how to combine multiple public data sources with personal data collections, e.g. from Twitter or even personal mailboxes. We discuss efficient approaches for data organisation, data preprocessing and for time dependent graph analysis. Social networks consist of nodes, which are the real world objects and edges, which are e.g. relations, interactions or dependencies between nodes. Montag, 27. Mai 13
  • 4. urlurl Hadoop in Social Network Analysis Abstract: A Hadoop cluster is the tool of choice for many large scale analytics applications. A large variety of commercial tools is available for typical SQL like or data warehouse applications, but how to deal with networks and time series? How to collect and store data for social media analysis and what are good practices for working with libraries like Mahout and Giraph? The sample use case deals with a data set from Wikipedia to illustrate how to combine multiple public data sources with personal data collections, e.g. from Twitter or even personal mailboxes. We discuss efficient approaches for data organisation, data preprocessing and for time dependent graph analysis. Social networks consist of nodes, which are the real world objects and edges, which are e.g. relations, interactions or dependencies between nodes. Datatypes I/O-Formats Storage System Random-Access Bulk Processing HBaseOpenTSDB Hive Pig SequenceFile VectorWriteable Partitions Sampling Rate images from Google image search ... Montag, 27. Mai 13
  • 5. INTRODUCTION Social online-systems are complex systems used for, e.g., information spread. We develop and apply tools from time series analysis and network analysis to study the static and dynamic properties of social on-line systems and their relations. 5 Montag, 27. Mai 13
  • 6. INTRODUCTION Webpages (the nodes of the WWW) are linked in different, but related ways: direct links pointing from one page to another (binary, directional) similar access activity (cross-correlated time series of download rates) similar edit activity (synchronized events of edits or changes) We extract the time-evolution of these three networks from real data. Nodes are identical for all three studied networks, but links and network structure as well as dynamics are different.We quantify how the inter- relations and inter-dependencies between the three networks change in time and affect each other. Social online-systems are complex systems used for, e.g., information spread. We develop and apply tools from time series analysis and network analysis to study the static and dynamic properties of social on-line systems and their relations. 5 Montag, 27. Mai 13
  • 7. INTRODUCTION Webpages (the nodes of the WWW) are linked in different, but related ways: direct links pointing from one page to another (binary, directional) similar access activity (cross-correlated time series of download rates) similar edit activity (synchronized events of edits or changes) We extract the time-evolution of these three networks from real data. Nodes are identical for all three studied networks, but links and network structure as well as dynamics are different.We quantify how the inter- relations and inter-dependencies between the three networks change in time and affect each other. Example: Wikipedia → reconstruct co-evolving networks 1. Cross-link network between articles (pages, nodes) 2. Access behavior network (characterizing similarity in article reading behavior) 3. Edit activity network (characterizing similarities in edit activity for each article) Social online-systems are complex systems used for, e.g., information spread. We develop and apply tools from time series analysis and network analysis to study the static and dynamic properties of social on-line systems and their relations. 5 Montag, 27. Mai 13
  • 8. CHALLENGES ... 6 • Data points (time series) collected at independent locations or obtained form individual objects do not show dependencies directly. • It is a common task, to calculate several types of correlations, but how are these results affected by special properties of the raw data? • What meaning do different correlations have and how can we eliminate artifacts of the calculation method? Complex System Time evolution ??? Montag, 27. Mai 13
  • 9. CHALLENGES ... 6 • Data points (time series) collected at independent locations or obtained form individual objects do not show dependencies directly. • It is a common task, to calculate several types of correlations, but how are these results affected by special properties of the raw data? • What meaning do different correlations have and how can we eliminate artifacts of the calculation method? Complex System Time evolution ??? Montag, 27. Mai 13
  • 10. CHALLENGES ... 7 • Data points (time series) collected at independent locations or obtained form individual objects do not show dependencies directly. • It is a common task, to calculate several types of correlations, but how are these results affected by special properties of the raw data? • What meaning do different correlations have and how can we eliminate artifacts of the calculation method? Complex System Element properties Measured data is “disconnected” Time evolution ??? Montag, 27. Mai 13
  • 11. CHALLENGES ... 7 • Data points (time series) collected at independent locations or obtained form individual objects do not show dependencies directly. • It is a common task, to calculate several types of correlations, but how are these results affected by special properties of the raw data? • What meaning do different correlations have and how can we eliminate artifacts of the calculation method? Complex System Element properties Measured data is “disconnected” Time evolution ??? Montag, 27. Mai 13
  • 12. CHALLENGES ... 8 • Data points (time series) collected at independent locations or obtained form individual objects do not show dependencies directly. • It is a common task, to calculate several types of correlations, but how are these results affected by special properties of the raw data? • What meaning do different correlations have and how can we eliminate artifacts of the calculation method? Complex System Element properties Measured data is “disconnected” System properties Derived from relations between elements and structure of the network Time evolution ??? Montag, 27. Mai 13
  • 13. CHALLENGES ... 8 • Data points (time series) collected at independent locations or obtained form individual objects do not show dependencies directly. • It is a common task, to calculate several types of correlations, but how are these results affected by special properties of the raw data? • What meaning do different correlations have and how can we eliminate artifacts of the calculation method? Complex System Element properties Measured data is “disconnected” System properties Derived from relations between elements and structure of the network Time evolution ??? Montag, 27. Mai 13
  • 14. TYPICAL TASK: • Time Series Analysis • If our data set is prepared and we have records with well defined properties*, then UDFs in Hive and Pig work well. • How to organize the data in records? • How to deal with sliding windows? • How to handle intermediate data?
  • 15. TIME SERIES: WIKIPEDIA ACTIVITY Examples of Wikipedia access time series for three articles with (a,b) stationary access rates ('Illuminati (book)'), (c,d) an endogenous burst of activity ('Heidelberg'), and (e,f) an exogenous burst of activity ('Amoklauf Erfurt'). The left parts show the complete hourly access rate time series (from January 1, 2009, till October 21, 2009; i.e. for 42 weeks = 294 days = 7056 hours). The right parts show edit-event data for the three representative articles. Node = article (specific topic) Available data: 1. Hourly access frequency (number of article downloads for each hour in ≈ 300 days) 2. Edit events (time stamps for all changes in the article pages)
  • 16. TIME SERIES: WIKIPEDIA ACTIVITY Node = article (specific topic) Available data: 1. Hourly access frequency (number of article downloads for each hour in ≈ 300 days) 2. Edit events (time stamps for all changes in the article pages) Map-Reduce / UDF preprocessing (calculation on single records): • filtering, resampling • feature extraction (peak detection) • creation of (non-)overlapping episodes or (sliding) windows • creation of time series pairs for cross-correlation or event synchronisation
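The window and pair creation steps listed above can be sketched in a few lines. This is a minimal in-memory illustration in plain Python, not the Map-Reduce or UDF code used in the project; the function names `sliding_windows` and `series_pairs` and the sample numbers are assumptions made for this example.

```python
from itertools import combinations


def sliding_windows(series, width, step):
    """Yield (start_index, window) tuples.

    step < width gives overlapping (sliding) windows,
    step == width gives non-overlapping episodes.
    """
    for start in range(0, len(series) - width + 1, step):
        yield start, series[start:start + width]


def series_pairs(records):
    """Yield every unordered pair of (article, series) records exactly once."""
    return combinations(sorted(records.items()), 2)


# Hypothetical hourly access counts, only for illustration.
hourly = list(range(10))
windows = list(sliding_windows(hourly, width=4, step=2))   # overlapping
episodes = list(sliding_windows(hourly, width=4, step=4))  # non-overlapping

records = {
    "Heidelberg": [3, 5, 40, 7],
    "Amoklauf Erfurt": [1, 90, 60, 2],
    "Illuminati (book)": [4, 4, 5, 4],
}
pair_names = [(a[0], b[0]) for a, b in series_pairs(records)]
```

In a Hadoop job each window or pair would typically become one record, so that the later calculation works on single records as noted on the slide.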
  • 17. • Graph Analysis • If network data is prepared as an adjacency list or an adjacency matrix, tools like Giraph or HAMA work well. But: only if one has the appropriate InputFormat reader.
  • 18. TYPES OF NETWORKS a) unipartite network, one type of nodes and links b) bipartite network, one type of connections c) hypergraph, one link relates more than two nodes
  • 20. • Graph Analysis • If network data is prepared as an adjacency list or an adjacency matrix, tools like Giraph or HAMA work well. • How to organize node/edge properties? • How to deal with time dependent properties? • How to calculate link properties on the fly? • creation of an adjacency matrix is not trivial • an adjacency matrix is an inefficient format • files in HDFS cannot be changed • store dynamic edge / node properties in HBase • aggregate relevant data to network snapshots • store intermediate results back to HBase ... and preprocess this data in the following step
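To make the idea of aggregating time dependent edge properties into network snapshots concrete, here is a minimal sketch. It keeps the edge events in a plain Python list instead of HBase, and the function name `snapshot_adjacency`, the event layout `(source, target, timestamp, weight)` and the sample values are assumptions made for this example.

```python
from collections import defaultdict


def snapshot_adjacency(edge_events, t_start, t_end):
    """Aggregate edge events with t_start <= timestamp < t_end
    into a weighted adjacency list {source: {target: summed weight}}."""
    adjacency = defaultdict(dict)
    for source, target, timestamp, weight in edge_events:
        if t_start <= timestamp < t_end:
            previous = adjacency[source].get(target, 0.0)
            adjacency[source][target] = previous + weight
    return {node: dict(neighbors) for node, neighbors in adjacency.items()}


# Hypothetical edge events, only for illustration.
events = [
    ("A", "B", 1, 1.0),
    ("A", "B", 2, 0.5),
    ("B", "C", 5, 2.0),
]
early = snapshot_adjacency(events, t_start=0, t_end=3)
late = snapshot_adjacency(events, t_start=3, t_end=6)
```

The adjacency list only stores links that exist in the selected window, which is the reason it is usually more compact than an adjacency matrix for sparse networks.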
  • 21. RECAP: WHAT IS HADOOP? • distributed platform for processing massive amounts of data in parallel • implements the Map-Reduce paradigm on top of the Hadoop Distributed File System (HDFS) • Map-Reduce: Java API to implement a map and a reduce phase • the map phase processes key/value pairs • the reduce phase processes a key together with the list of its values • HDFS files consist of one or more blocks (distributed chunks of data) • chunks are distributed transparently (in the background) and processed in parallel • data locality is used when possible by assigning the map task to a node that holds the chunk locally
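The map and reduce phases described above can be imitated in memory to show how the key/value pairs flow. The sketch below is plain Python and not the Hadoop Java API; the function names `map_phase`, `shuffle` and `reduce_phase`, the record layout `(article, hour, count)` and the sample counts are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one input record into (key, value) pairs."""
    article, _hour, count = record
    yield article, count


def shuffle(pairs):
    """Group all values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into one result."""
    return key, sum(values)


# Hypothetical hourly access records, only for illustration.
records = [
    ("Heidelberg", 0, 12),
    ("Heidelberg", 1, 30),
    ("Illuminati (book)", 0, 5),
]
mapped = [pair for record in records for pair in map_phase(record)]
totals = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

On a cluster the map calls run in parallel on the nodes that hold the input blocks, and the grouping step moves data over the network; here everything happens in one process.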
  • 22. TYPICAL APPLICATIONS • Filter, group, and join operations on large data sets ... • the data set (or a part of it)* is streamed and processed in parallel, but not in real time • Algorithms like k-Means Clustering (in Apache Mahout) or Map-Reduce based implementations of SSSP work in multiple iterations • data is loaded from disk to CPU in each iteration * if partitioning is used
  • 24. OVERVIEW - DATA FLOW [Figure: data flow. Labels: MR, UDF; Hive, Pig, Impala; Giraph, Mahout; Sqoop; MR; Hive, Pig, Impala; HBase (cache); HDFS (static group); HBase (cache)] Network Clustering != k-Means Clustering
  • 25. RESULTS: CROSS-CORRELATION Distribution of cross-correlation coefficients for pairs of access-rate time series of Wikipedia pages (top) compared to surrogate data (bottom). 100 shuffled configurations are considered.
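The comparison against shuffled surrogate data can be sketched as follows. This example assumes the zero-lag Pearson coefficient as the correlation measure and random permutation of one series as the shuffling scheme; the slides do not state which estimator or shuffling method was actually used. The function names and the sample series are hypothetical.

```python
import random
from statistics import mean


def pearson(x, y):
    """Zero-lag Pearson correlation coefficient of two equally long series.

    Raises ZeroDivisionError if one of the series is constant.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def surrogate_coefficients(x, y, n_shuffles=100, seed=0):
    """Correlate x with randomly shuffled copies of y.

    Shuffling destroys the temporal order of y, so the resulting
    coefficients estimate what the method yields without a real dependency.
    """
    rng = random.Random(seed)
    coefficients = []
    for _ in range(n_shuffles):
        shuffled = list(y)
        rng.shuffle(shuffled)
        coefficients.append(pearson(x, shuffled))
    return coefficients


# Hypothetical access-rate series, only for illustration.
x = [1, 3, 2, 5, 4, 7, 6, 9]
y = [2, 4, 3, 6, 5, 8, 7, 10]
observed = pearson(x, y)
surrogates = surrogate_coefficients(x, y)
```

An observed coefficient that lies far outside the range of the surrogate values is unlikely to be an artifact of the value distribution alone.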
  • 26. EVOLUTION OF CORRELATION NETWORKS [Figure; axis label: time]
  • 28. EVOLUTION OF CORRELATION NETWORKS [Figure; axis label: time] Obviously, the distribution of link strength values changes in time, but only a calculation of structural properties of the underlying network allows a detailed view of the dynamics of the system.
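One simple way to get from link strengths to structural properties is to threshold the coefficients of each time window and compute node degrees on the resulting network. The sketch below assumes a fixed threshold of 0.5 and uses the node degree as the structural property; both choices, the function names and the sample values are assumptions made for this example and are not taken from the project.

```python
def snapshot_edges(link_strengths, threshold):
    """Keep only the node pairs whose absolute link strength reaches the threshold."""
    return {pair for pair, strength in link_strengths.items()
            if abs(strength) >= threshold}


def degrees(edges):
    """Count the number of links per node in an undirected edge set."""
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return degree


# Hypothetical link strengths for two time windows, only for illustration.
snapshots = {
    "window 1": {("A", "B"): 0.9, ("A", "C"): 0.2, ("B", "C"): 0.1},
    "window 2": {("A", "B"): 0.3, ("A", "C"): 0.8, ("B", "C"): 0.7},
}
degree_by_window = {
    window: degrees(snapshot_edges(strengths, threshold=0.5))
    for window, strengths in snapshots.items()
}
```

In this toy data the hub of the network moves from the pair A and B to node C, a change that the distribution of link strengths alone would not reveal.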
  • 30. EXAMPLES OF RECONSTRUCTED NETWORKS Spatially embedded correlation networks for the Wikipedia pages of all German cities. [Figure panels: Access network, Edit network]
  • 31. RECOMMENDATIONS (1.) • Create algorithms based on reusable components! • Select or create stable and standardized I/O formats! • Use preprocessing to reorganize unstructured data if you have to process the data many times. • Store intermediate data (e.g. time dependent properties) in HBase, near the raw data. • Collect time-series data in HBase and create time-series buckets for advanced procedures on a subset of the data.
  • 32. RECOMMENDATIONS (2.) • Consider design patterns: • Partitioning vs. Binning • Map-Side vs. Reduce-Side Joins • Use Bulk Synchronous Parallel (BSP) processing for graph processing instead of Map-Reduce, or even a combination of both. • As in classical programming (and also in Hadoop!): find a good data representation and find good algorithms.
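To illustrate why the bulk synchronous model fits graph algorithms, here is a minimal vertex-centric sketch of single-source shortest paths (SSSP) organised in supersteps. It is an in-memory Python illustration of the idea, not Giraph or HAMA code; the function name `bsp_sssp`, the graph layout `{vertex: {neighbor: weight}}` and the sample graph are assumptions made for this example.

```python
import math


def bsp_sssp(graph, source):
    """Single-source shortest paths in supersteps.

    graph: {vertex: {neighbor: non-negative weight}}; every vertex,
    including pure sinks, must appear as a key.
    Returns the distances and the number of supersteps executed.
    """
    distance = {vertex: math.inf for vertex in graph}
    inbox = {source: [0.0]}  # messages delivered at the start of a superstep
    supersteps = 0
    while inbox:
        outbox = {}
        # Compute step: each vertex with messages works on its own state only.
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                distance[vertex] = best
                for neighbor, weight in graph[vertex].items():
                    outbox.setdefault(neighbor, []).append(best + weight)
        # Barrier: messages become visible only in the next superstep.
        inbox = outbox
        supersteps += 1
    return distance, supersteps


# Hypothetical weighted graph, only for illustration.
graph = {"A": {"B": 1, "C": 4}, "B": {"C": 2}, "C": {}}
distance, supersteps = bsp_sssp(graph, "A")
```

The vertex state stays in memory between supersteps and only messages are exchanged, whereas an iterative Map-Reduce implementation reloads the data from disk in each iteration.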