Scalable Distributed Real-Time
Clustering for Big Data Streams
European Masters in Distributed Computing (EMDC)

Student
Antonio Severien
severien@yahoo-inc.com
Supervisors
Albert Bifet (Yahoo! Research)
Gianmarco De Francisci Morales (Yahoo! Research)
Marta Arias (Universitat Politecnica de Catalunya)
27/06/13

Contributions
¤  SAMOA (Scalable Advanced Massive Online Analysis)
¤  Stream Processing Engine (SPE) abstraction framework
¤  Machine learning libraries adapter layer
¤  API for implementing data flow topologies

¤  SAMOA Clustering Algorithm
¤  Distributed stream clustering algorithm based on CluStream*
¤  Parallelizes the clustering task and scales with the available resources

(*) “A Framework for Clustering Evolving Data Streams”, Aggarwal et al. 2003

Motivation
¤  How BIG is BIG in BIG Data???
¤  2.5 quintillion bytes generated every day
¤  90% of today's data was generated in the last 2 years
¤  Sensors, social networks, e-business, mobile, internet logs, etc.

¤  Problems… 3 Vs
¤  Storage is unviable due to massive Volume
¤  Production rate keeps increasing: Velocity
¤  Different sources, different data, different types means Variety


Where is the Big Data?
¤  Where is the food?
¤  Databases?
¤  Data warehouses?
¤  Distributed databases?
¤  Distributed file systems?
¤  It’s flowing online! It’s Streaming!


Crunching Big Data
¤  Map and Reduce
¤  MapReduce/GFS
¤  Hadoop/HDFS

¤  Stream Processing Engines (SPE)
¤  Apache S4
¤  Twitter Storm


Distributed Systems
¤  Actors Model
¤  Independent concurrent processes
¤  Communicate asynchronously by message passing

¤  MapReduce Model
¤  Mappers: filtering and sorting
¤  Reducers: summary and aggregation
¤  Large volume of data distributed
¤  Iterative: map-reduce-map-reduce…


Streaming
¤  Streaming Model
¤  One-pass processing: discard item after use
¤  Low memory usage: store statistics and summaries
¤  Unbounded flow of data
¤  Evolving data sets
¤  Limited processing time
¤  Arrival order is not guaranteed
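
The one-pass, low-memory idea above can be sketched as a running summary that is updated per item and never stores the stream. This is a minimal illustration using Welford's update for mean and variance; the class name `RunningStats` and its methods are hypothetical and are not part of SAMOA or MOA.

```java
// Hypothetical illustration of one-pass processing; not part of SAMOA or MOA.
final class RunningStats {
    private long count;   // number of items seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    /** Consumes one item and updates the summary; the item itself is discarded. */
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    long count() { return count; }

    double mean() { return mean; }

    /** Population variance of everything seen so far (0 for an empty stream). */
    double variance() { return count == 0 ? 0.0 : m2 / count; }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        // Stand-in for an unbounded stream: each value is seen exactly once.
        for (double x : new double[] {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0}) {
            stats.add(x);
        }
        System.out.println(stats.count() + " items, mean=" + stats.mean()
                + ", variance=" + stats.variance());
    }
}
```

Memory use stays constant (three fields) no matter how many items arrive, which is the property the streaming model relies on.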


Making sense
¤  Machine Learning & Data Mining
¤  Make sense, extract patterns and react accordingly
¤  Train machines to “think”
¤  Perceive behavior
¤  Relations between similar information

¤  Unsupervised Learning
¤  Clustering algorithms


Machine Learning Tools
¤  Mahout
¤  Machine learning framework used on top of Hadoop/HDFS
¤  Batch processing with MapReduce model
¤  Open-source and good community support

¤  Massive Online Analysis (MOA)
¤  Stream machine learning tool
¤  Many algorithms implemented; based on WEKA
¤  Single machine constraint

¤  Jubatus
¤  Distributed streaming machine learning framework
¤  No clustering algorithms yet
¤  No stream platform abstraction

Scalable Advanced Massive Online
Analysis (SAMOA)
¤  Distributed data streaming machine learning framework
¤  Stream Processing Engine (SPE) abstraction
¤  Code once, run everywhere
¤  Focus on distributed algorithm design
¤  Fault-tolerance, communication, consistency and
availability are provided by the underlying distributed
processing platform

¤  Initial release provides integration with:
¤  Apache S4
¤  Twitter Storm

Scalable Advanced Massive Online
Analysis (SAMOA)

[Figure: SAMOA architecture diagram. Labels: SAMOA Algorithms & SAMOA-API; ML Adapter (MOA, other ML libraries); SAMOA; SPE Adapter (S4, Storm, other SPEs)]

( Apache S4 )
¤  Distributed, semi fault-tolerant, stream processing
platform
¤  Based on the Actors model and inspired by the
MapReduce model
¤  Flexible data flow: any topology and processing unit
can be built, beyond the mappers and reducers design
¤  Specialized in processing events from a stream and
emitting events into a stream


Scalable Advanced Massive Online
Analysis (SAMOA)
SAMOA Topology
[Figure: a SAMOA topology (Task, Stream Source, Entrance Processing Item (EPI), Processing Items (PI), Streams) mapped onto an S4 App (Stream Source, Processing Elements (PE), Streams)]

How to use?
¤  Adding an SPE using the API
¤  S4ProcessingItem: processing element wrapper
¤  S4Stream: wrapper for an S4 stream
¤  S4ComponentFactory: provides components specific to Apache
S4, such as processing elements and streams
¤  S4TopologyBuilder: creates the topology instances

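¤  Adding an algorithm and building a topology
class SimpleTask {
    ...
    // Builder that creates the topology components
    TopologyBuilder topologyBuilder = new TopologyBuilder();

    // Entrance processing item that wraps the stream source
    EntranceProcessingItem entranceProcessingItem =
        topologyBuilder.createEntrancePI(new SourceProcessor());

    // Stream leaving the entrance processing item
    Stream stream = topologyBuilder.createStream(entranceProcessingItem);

    // Processing item that consumes the stream, with key-based grouping
    ProcessingItem processingItem = topologyBuilder.createPI(new Processor());
    processingItem.connectInputKey(stream);
    ...
}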


Grouping the Best of All
¤  Flexible programming model
¤  Distributed stream processing engine abstraction
¤  Integrated machine learning and data mining algorithms
¤  Easy API to implement new algorithms and SPE adapters


SAMOA Clustering Algorithm
¤  Distributed stream clustering algorithm
¤  Validates the SAMOA implementation
¤  Integration with Apache S4 using the SAMOA-S4 adapter
¤  Deployment on Apache S4


Stream Clustering Algorithm
¤  CluStream Framework
¤  Based on k-means
¤  Online phase (micro-clustering)
¤  Offline phase (macro-clustering)

¤  k-means: partition a set of data into k distinct clusters
according to a similarity function
¤  Minimization of squared Euclidean distance objective
function:
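
A common way to write this objective is given below as a sketch. The symbols are assumptions chosen for illustration and are not taken from the slides: C_j is the j-th cluster, mu_j its centroid, and x_i a data point.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```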


K-means Clustering Algorithm
¤  Advantages
¤  Simple, fast and efficient

¤  Known issues with k-means
¤  Sensitive to initial seeding
¤  Minimization problem is NP-hard even for simple
configurations
¤  e.g. 2-dimensional points
¤  Global optimum not guaranteed
¤  Good for spherical clustering, not good for arbitrary shapes
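
For reference, batch k-means is usually run as alternating assignment and update steps (Lloyd's iterations). The sketch below is a hypothetical illustration, not SAMOA code: the class `KMeansSketch`, its method names, the fixed iteration count, and the rule of leaving an empty cluster's centroid unchanged are all assumptions. The seeds are passed in explicitly because the result depends on them.

```java
import java.util.Arrays;

// Hypothetical illustration of batch k-means (Lloyd's iterations); not SAMOA code.
final class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the centroid closest to the point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int j = 1; j < centroids.length; j++) {
            if (squaredDistance(point, centroids[j])
                    < squaredDistance(point, centroids[best])) {
                best = j;
            }
        }
        return best;
    }

    /** Runs a fixed number of assign/update rounds starting from the given seeds. */
    static double[][] fit(double[][] points, double[][] seeds, int iterations) {
        int k = seeds.length;
        int dims = seeds[0].length;
        double[][] centroids = new double[k][];
        for (int j = 0; j < k; j++) {
            centroids[j] = seeds[j].clone();
        }
        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (double[] p : points) {          // assignment step
                int j = nearest(p, centroids);
                counts[j]++;
                for (int d = 0; d < dims; d++) {
                    sums[j][d] += p[d];
                }
            }
            for (int j = 0; j < k; j++) {        // update step
                if (counts[j] == 0) {
                    continue;                    // assumption: keep an empty cluster's centroid
                }
                for (int d = 0; d < dims; d++) {
                    centroids[j][d] = sums[j][d] / counts[j];
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] seeds = {{0, 0}, {10, 10}};
        System.out.println(Arrays.deepToString(fit(points, seeds, 5)));
    }
}
```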


Distributed Stream Clustering
¤  Online micro-clustering
¤  Applied in a local clustering phase
¤  Cluster Feature Vectors with Timestamp (CFT)
¤  N: number of data objects
¤  LS: linear sum of data objects
¤  SS: sum of squares of data objects
¤  LST: sum of timestamps
¤  SST: sum of squares of timestamps

¤  Offline macro-clustering
¤  Use of micro-clusters as weighted pseudo-points
¤  Applied in a global clustering phase with a weighted k-means
¤  Uses probabilistic seeding depending on the weighted
micro-clusters
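
The cluster feature vector above can be sketched as a small summary object. Because every field is a sum, a new point is absorbed by addition and two summaries merge component-wise, and the centroid LS / N together with the weight N gives the weighted pseudo-point used by the offline phase. The class `MicroCluster`, its method names, and the derived `radius` and `timestampStdDev` helpers are hypothetical and are not taken from SAMOA or CluStream code.

```java
// Hypothetical illustration of a CluStream-style micro-cluster summary; not SAMOA code.
final class MicroCluster {
    private long n;             // N: number of data objects
    private final double[] ls;  // LS: per-dimension linear sum
    private final double[] ss;  // SS: per-dimension sum of squares
    private double lst;         // LST: sum of timestamps
    private double sst;         // SST: sum of squares of timestamps

    MicroCluster(int dims) {
        ls = new double[dims];
        ss = new double[dims];
    }

    /** Absorbs one data object; the point itself is not kept. */
    void insert(double[] point, long timestamp) {
        n++;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += point[d];
            ss[d] += point[d] * point[d];
        }
        lst += timestamp;
        sst += (double) timestamp * timestamp;
    }

    /** Adds another summary component-wise (all fields are sums). */
    void merge(MicroCluster other) {
        n += other.n;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += other.ls[d];
            ss[d] += other.ss[d];
        }
        lst += other.lst;
        sst += other.sst;
    }

    /** Weight of the pseudo-point handed to the macro-clustering phase. */
    long weight() { return n; }

    /** Centroid LS / N; only meaningful once at least one point was inserted. */
    double[] centroid() {
        double[] c = new double[ls.length];
        for (int d = 0; d < c.length; d++) {
            c[d] = ls[d] / n;
        }
        return c;
    }

    /** Root-mean-square deviation of the points from the centroid, from N, LS and SS. */
    double radius() {
        double sum = 0.0;
        for (int d = 0; d < ls.length; d++) {
            double meanD = ls[d] / n;
            sum += Math.max(0.0, ss[d] / n - meanD * meanD);
        }
        return Math.sqrt(sum);
    }

    /** Mean arrival time LST / N. */
    double meanTimestamp() { return lst / n; }

    /** Standard deviation of arrival times, from N, LST and SST. */
    double timestampStdDev() {
        double meanT = lst / n;
        return Math.sqrt(Math.max(0.0, sst / n - meanT * meanT));
    }

    public static void main(String[] args) {
        MicroCluster local = new MicroCluster(2);
        local.insert(new double[] {1, 2}, 10);
        local.insert(new double[] {3, 4}, 20);
        double[] c = local.centroid();
        System.out.println("weight=" + local.weight()
                + " centroid=(" + c[0] + ", " + c[1] + ")"
                + " radius=" + local.radius());
    }
}
```

Keeping only sums is what lets several local clustering processes summarize their share of the stream independently and hand compact summaries to a global phase.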

CluStream Snapshot
[Figure: CluStream snapshot showing micro-clusters, macro-clusters and ground truth]


Scalable Advanced Massive Online
Analysis (SAMOA)
SAMOA Clustering Task
[Figure: clustering task topology. Clustering part: Stream Source, Distribution PI, Local Clustering PI, Global Clustering PI, Output. Evaluation part: Sampling PI, Evaluator PI, Output]


Experiments, Evaluation & Results
¤  Experimental Setup
¤  Four 2.4 GHz Intel Xeon dual quad-core, 48 GB RAM
¤  Process parallelism level: 1, 8 & 16
¤  Instance dimensions: 3 & 15
¤  Source dataset: random events generator
¤  Noise: 0% & 10%
¤  Cluster movement speed: move 0.1 units every 500 & 12,000 instances

¤  Evaluations
¤  Scalability: measure throughput when adding concurrent
processes
¤  Clustering quality: measure whether the clustering algorithm is
accurate

Scalability

[Figure: Baseline Comparison. Throughput (instances/second) vs. evaluation step]

Scalability

[Figure: Average Throughput with Dimensions 3 and 15. Average throughput (instances/second) vs. process parallelism]

Scalability

[Figure: Parallelism Throughput with Dimension 3. Avg. cumulative throughput (instances/sec) vs. process parallelism]

Clustering Quality Metrics
¤  Internal & External evaluations
¤  Internal evaluation uses attributes available from the clustering
structure.
¤  External evaluation uses external validation structures.
¤  e.g. ground truth provided by the source generator.

¤  Metrics
¤  Cohesion coefficient (SSE): measures the within-cluster sum of
squared errors.
¤  Separation coefficient (BSS): measures the between-cluster sum of squares.
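
The two metrics can be sketched as follows, using one common definition: SSE sums the squared distance of each point to its cluster centroid, and BSS sums, per cluster, the cluster size times the squared distance from the cluster centroid to the overall mean. The class `ClusterQuality` and its methods are hypothetical and are not the evaluator used in the experiments; the code assumes every cluster is non-empty.

```java
// Hypothetical illustration of the two metrics; not the evaluator used in SAMOA.
final class ClusterQuality {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Component-wise mean of a non-empty set of points. */
    static double[] mean(double[][] points) {
        double[] m = new double[points[0].length];
        for (double[] p : points) {
            for (int d = 0; d < m.length; d++) {
                m[d] += p[d];
            }
        }
        for (int d = 0; d < m.length; d++) {
            m[d] /= points.length;
        }
        return m;
    }

    /** Cohesion: squared distances from each point to its own cluster centroid. */
    static double sse(double[][][] clusters) {
        double total = 0.0;
        for (double[][] cluster : clusters) {
            double[] centroid = mean(cluster);
            for (double[] p : cluster) {
                total += squaredDistance(p, centroid);
            }
        }
        return total;
    }

    /** Separation: size-weighted squared distances from centroids to the overall mean. */
    static double bss(double[][][] clusters) {
        int dims = clusters[0][0].length;
        double[] overall = new double[dims];
        int totalPoints = 0;
        for (double[][] cluster : clusters) {
            for (double[] p : cluster) {
                for (int d = 0; d < dims; d++) {
                    overall[d] += p[d];
                }
                totalPoints++;
            }
        }
        for (int d = 0; d < dims; d++) {
            overall[d] /= totalPoints;
        }
        double total = 0.0;
        for (double[][] cluster : clusters) {
            total += cluster.length * squaredDistance(mean(cluster), overall);
        }
        return total;
    }

    public static void main(String[] args) {
        double[][][] clusters = {
            {{0, 0}, {0, 2}},
            {{10, 10}, {10, 12}}
        };
        System.out.println("SSE=" + sse(clusters) + " BSS=" + bss(clusters));
    }
}
```

Lower SSE indicates tighter clusters and higher BSS indicates better separated ones.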


Clustering Quality 0% Noise
[Figure: clustering snapshots at 25,000 and 45,000 instances]


Clustering Quality 0% Noise
Ratio = BSS / GT


Clustering Quality 10% Noise
[Figure: clustering snapshots at 25,000 and 45,000 instances, annotated "Good clustering" and "Poor clustering"]


Clustering Quality 10% Noise


Conclusion
¤  There is important information in the massive amount of
data being produced and discarded
¤  There is a need for tools to deal with this efficiently
¤  Efforts have been made to crunch big data
¤  Interpreting and retrieving relevant information is where
machine learning and data mining operate
¤  Real-time analysis responds faster to evolving data
¤  SAMOA abstracts the platform and maintains the
algorithms; easy to implement, test and use.

Acknowledgements
¤  Thanks to Erasmus Mundus and all three universities
(UPC, KTH and IST) for providing this opportunity
¤  Thanks to all the EMDC students
¤  Thanks to Yahoo! Research for the great project


  • 1. Scalable Distributed Real-Time Clustering for Big Data Streams
      European Masters in Distributed Computing (EMDC)
      Student: Antonio Severien (severien@yahoo-inc.com)
      Supervisors: Albert Bifet (Yahoo! Research), Gianmarco De Francisci Morales (Yahoo! Research), Marta Arias (Universitat Politecnica de Catalunya)
  • 2. 27/06/13 Contributions
      ¤ SAMOA (Scalable Advanced Massive Online Analysis)
          ¤ Stream Processing Engine (SPE) abstraction framework
          ¤ Machine learning libraries adapter layer
          ¤ API for implementing data flow topologies
      ¤ SAMOA Clustering Algorithm
          ¤ Distributed stream clustering algorithm based on CluStream*
          ¤ Parallelize clustering task and scale-up on resource usage
      (*) “A Framework for Clustering Evolving Data Streams”, Aggarwal et al. 2003
  • 3. Motivation
      ¤ How BIG is BIG in BIG Data???
          ¤ 2.5 quintillion bytes generated every day
          ¤ 90% of today's data was generated in the last 2 years
          ¤ Sensors, social networks, e-business, mobile, internet logs, etc.
      ¤ Problems… 3 Vs
          ¤ Storage is unviable due to massive Volume
          ¤ Production rate is increasing in Velocity
          ¤ Different sources, different data, different types mean Variety
  • 4. Where is the Big Data?
      ¤ Where is the food?
          ¤ Databases?
          ¤ Data warehouses?
          ¤ Distributed databases?
          ¤ Distributed file systems?
      ¤ It’s flowing online! It’s Streaming!
  • 5. Crunching Big Data
      ¤ Map and Reduce
          ¤ MapReduce/GFS
          ¤ Hadoop/HDFS
      ¤ Stream Processing Engines (SPE)
          ¤ Apache S4
          ¤ Twitter Storm
  • 6. Distributed Systems
      ¤ Actors Model
          ¤ Independent concurrent processes
          ¤ Communicate asynchronously by message passing
      ¤ MapReduce Model
          ¤ Mappers: filtering and sorting
          ¤ Reducers: summary and aggregation
          ¤ Large volume of data distributed
          ¤ Iterative: map-reduce-map-reduce…
  • 7. Streaming
      ¤ Streaming Model
          ¤ One-pass processing: discard item after use
          ¤ Low memory usage: store statistics and summaries
          ¤ Unbounded flow of data
          ¤ Evolving data sets
          ¤ Limited processing time
          ¤ Arrival order is not guaranteed
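The one-pass, low-memory constraints on the Streaming slide can be illustrated with a running summary that never stores the items it sees. The sketch below is not taken from the slides: it uses Welford's online update for mean and variance, and the class name `RunningStats` is a hypothetical choice for illustration.

```java
// Sketch only: RunningStats is a hypothetical name, not part of SAMOA or MOA.
final class RunningStats {
    private long count;   // number of items seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    /** Consume one item; the item itself is discarded after the update. */
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    long count() { return count; }

    double mean() { return mean; }

    /** Population variance of everything seen so far (0 if nothing was seen). */
    double variance() { return count == 0 ? 0.0 : m2 / count; }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            stats.add(x);
        }
        System.out.println(stats.count() + " items, mean=" + stats.mean()
                + ", variance=" + stats.variance());
    }
}
```

Memory use stays constant (three fields) however long the stream runs, which is the property the "store statistics and summaries" bullet relies on.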
  • 8. Making sense
      ¤ Machine Learning & Data Mining
          ¤ Make sense, extract patterns and react accordingly
          ¤ Train machines to “think”
          ¤ Perceive behavior
          ¤ Relations between similar information
      ¤ Unsupervised Learning
          ¤ Clustering algorithms
  • 9. Machine Learning Tools
      ¤ Mahout
          ¤ Machine learning framework used on top of Hadoop/HDFS
          ¤ Batch processing with MapReduce model
          ¤ Open-source and good community support
      ¤ Massive Online Analysis (MOA)
          ¤ Stream machine learning tool
          ¤ Many algorithms implemented; based on WEKA
          ¤ Single machine constraint
      ¤ Jubatus
          ¤ Distributed streaming machine learning framework
          ¤ No clustering algorithms yet
          ¤ No stream platform abstraction
  • 10. Scalable Advanced Massive Online Analysis (SAMOA)
      ¤ Distributed data streaming machine learning framework
      ¤ Stream Platform Engine Abstraction
          ¤ Code once, run everywhere
          ¤ Focus on distributed algorithm design
          ¤ Fault-tolerance, communication, consistency and availability are provided by the underlying distributed processing platform
      ¤ Initial release provides integration with:
          ¤ Apache S4
          ¤ Twitter Storm
  • 11. Scalable Advanced Massive Online Analysis (SAMOA)
      [Architecture diagram. Labels: SAMOA Algorithms & SAMOA-API; ML Adapter (SAMOA, MOA, other ML libraries); SPE Adapter (S4, Storm, other SPE)]
  • 15. Apache S4
      ¤ Distributed, semi-fault-tolerant stream processing platform
      ¤ Based on the Actors model and inspired by the MapReduce model
      ¤ Flexible data flow: any topology and processor unit can be built, not only the mappers-and-reducers design
      ¤ Specialized in processing events from a stream and emitting events into a stream
  • 16. Scalable Advanced Massive Online Analysis (SAMOA)
      [Mapping diagram. SAMOA Topology side: Task, STREAM SOURCE, EPI, Stream, PI. S4 App side: STREAM SOURCE, Stream, PE]
  • 17. How to use?
      ¤ Adding SPE using API
          ¤ S4ProcessingItem: processing element wrapper
          ¤ S4Stream: wrapper for an S4 stream
          ¤ S4ComponentFactory: provides components specific to Apache S4, such as processing elements and streams
          ¤ S4TopologyBuilder: creates the topology instances
      ¤ Adding algorithm and building topology

          class SimpleTask {
            ...
            TopologyBuilder topologyBuilder = new TopologyBuilder();
            // Entrance PI: the point where the source feeds events into the topology
            EntranceProcessingItem entranceProcessingItem =
                topologyBuilder.createEntrancePI(new SourceProcessor());
            // Stream that carries the events emitted by the entrance PI
            Stream stream = topologyBuilder.createStream(entranceProcessingItem);
            // Downstream PI that consumes the stream
            ProcessingItem processingItem = topologyBuilder.createPI(new Processor());
            processingItem.connectInputKey(stream);
            ...
          }
  • 18. Grouping the Best of All
      ¤ Flexible programming model
      ¤ Distributed stream processing engine abstraction
      ¤ Integrated machine learning and data mining algorithms
      ¤ Easy API to implement new algorithms and SPE adapters
  • 19. SAMOA Clustering Algorithm
      ¤ Distributed stream clustering algorithm
      ¤ Validate the SAMOA implementation and its integration with Apache S4 using the SAMOA-S4 adapter
      ¤ Deploy on Apache S4
  • 20. Stream Clustering Algorithm
      ¤ CluStream Framework
          ¤ Based on k-means
          ¤ Online phase (micro-clustering)
          ¤ Offline phase (macro-clustering)
      ¤ k-means: partition a set of data into k distinct clusters according to a similarity function
          ¤ Minimizes a squared Euclidean distance objective function (the formula on the slide is not preserved in this transcript)
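The objective function on this slide did not survive extraction. For reference, the standard textbook form of the k-means objective is given below; the symbols ($C_j$ for the j-th cluster, $\mu_j$ for its centroid, $x_i$ for a data point) are assumptions for illustration and may differ from the notation used on the original slide.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

k-means alternates between assigning each point to its nearest centroid and recomputing each $\mu_j$; neither step can increase $J$.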
  • 21. K-means Clustering Algorithm
      ¤ Advantages
          ¤ Simple, fast and efficient
      ¤ Known issues with k-means
          ¤ Sensitive to initial seeding
          ¤ Minimization problem is NP-hard even for simple configurations
              ¤ 1-dimensional points
          ¤ Global optimum not guaranteed
          ¤ Good for spherical clustering, not good for arbitrary shapes
  • 22. Distributed Stream Clustering
      ¤ Online micro-clustering
          ¤ Apply on a local clustering phase
          ¤ Cluster Feature Vectors with Timestamp (CFT)
              ¤ N: number of data objects
              ¤ LS: linear sum of data objects
              ¤ SS: sum of squares of data objects
              ¤ LST: sum of timestamps
              ¤ SST: sum of squares of timestamps
      ¤ Offline macro-clustering
          ¤ Use of micro-clusters as weighted pseudo-points
          ¤ Apply on a global clustering phase with a weighted k-means
          ¤ Uses probabilistic seeding depending on the weighted micro-clusters
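The five CFT fields listed above are all plain sums, which is why micro-clusters can be updated per item and combined across local clustering processes. The following is a minimal sketch of that idea, not the SAMOA or MOA implementation: the class name `MicroCluster`, its methods, and the RMS-deviation formula derived from the sums are assumptions for illustration.

```java
// Sketch only: MicroCluster and its methods are hypothetical names,
// not the SAMOA or MOA API.
final class MicroCluster {
    long n;             // N: number of data objects
    final double[] ls;  // LS: per-dimension linear sum
    final double[] ss;  // SS: per-dimension sum of squares
    double lst;         // LST: sum of timestamps
    double sst;         // SST: sum of squares of timestamps

    MicroCluster(int dimensions) {
        ls = new double[dimensions];
        ss = new double[dimensions];
    }

    /** Absorb one point; the point itself is not stored. */
    void insert(double[] point, double timestamp) {
        n++;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += point[d];
            ss[d] += point[d] * point[d];
        }
        lst += timestamp;
        sst += timestamp * timestamp;
    }

    /** Fold another summary into this one by adding the sums field by field. */
    void merge(MicroCluster other) {
        n += other.n;
        for (int d = 0; d < ls.length; d++) {
            ls[d] += other.ls[d];
            ss[d] += other.ss[d];
        }
        lst += other.lst;
        sst += other.sst;
    }

    /** Centroid LS / N; assumes at least one point has been inserted. */
    double[] centroid() {
        double[] c = new double[ls.length];
        for (int d = 0; d < ls.length; d++) {
            c[d] = ls[d] / n;
        }
        return c;
    }

    /** RMS deviation from the centroid, computed only from N, LS and SS. */
    double rmsDeviation() {
        double sum = 0.0;
        for (int d = 0; d < ls.length; d++) {
            double mean = ls[d] / n;
            sum += Math.max(0.0, ss[d] / n - mean * mean); // clamp rounding noise
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        MicroCluster mc = new MicroCluster(2);
        mc.insert(new double[] {1, 2}, 1);
        mc.insert(new double[] {3, 4}, 2);
        System.out.println("centroid=" + java.util.Arrays.toString(mc.centroid())
                + ", weight=" + mc.n);
    }
}
```

In the offline phase described on the slide, each such summary would act as one pseudo-point located at `centroid()` with weight `n`.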
  • 24. Scalable Advanced Massive Online Analysis (SAMOA)
      [Diagram: SAMOA Clustering Task. Clustering part: STREAM SOURCE, Distribution PI, Local Clustering PI, Global Clustering PI, OUTPUT. Evaluation part: Sampling PI, Evaluator PI, OUTPUT]
  • 25. Experiments, Evaluation & Results
      ¤ Experimental Setup
          ¤ Four 2.4 GHz Intel Xeon dual quad-core, 48 GB RAM
          ¤ Process parallelism level: 1, 8 & 16
          ¤ Instance dimensions: 3 & 15
          ¤ Source dataset: random events generator
          ¤ Noise: 0% & 10%
          ¤ Cluster movement speed: move 0.1 unit every 500 & 12000 instances
      ¤ Evaluations
          ¤ Scalability: measure throughput when adding concurrent processes
          ¤ Clustering quality: measure whether the clustering algorithm is accurate
  • 27. Scalability
      [Plot: Average Throughput with Dimensions 3 and 15. x-axis: Process Parallelism; y-axis: Average Throughput (instances/second)]
  • 28. Scalability
      [Plot: Parallelism Throughput with Dimension 3. x-axis: Process Parallelism; y-axis: Avg. Cumulative Throughput (instances/sec)]
  • 29. Clustering Quality Metrics
      ¤ Internal & External evaluations
          ¤ Internal evaluation uses attributes available from the clustering structure.
          ¤ External evaluation uses external validation structures.
              ¤ e.g., ground truth provided by the source generator.
      ¤ Metrics
          ¤ Cohesion coefficient (SSE): measures the intra-cluster sum of squared errors
          ¤ Separation coefficient (BSS): measures the inter-cluster (between) sum of squares
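The slide names the two metrics without giving formulas. The sketch below uses the common textbook definitions, which are an assumption here and may differ from the evaluator used in the experiments: SSE sums squared distances from each point to its cluster centroid, and BSS sums, over clusters, the cluster size times the squared distance from the cluster centroid to the global mean. `ClusterQuality` is a hypothetical helper name, and every cluster is assumed to be non-empty.

```java
// Sketch only: ClusterQuality is a hypothetical helper, not the SAMOA evaluator.
// Input layout: clusters[j][i][d] = dimension d of point i in cluster j.
final class ClusterQuality {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Arithmetic mean of a non-empty set of points. */
    static double[] mean(double[][] points) {
        double[] m = new double[points[0].length];
        for (double[] p : points) {
            for (int d = 0; d < m.length; d++) {
                m[d] += p[d];
            }
        }
        for (int d = 0; d < m.length; d++) {
            m[d] /= points.length;
        }
        return m;
    }

    /** Cohesion: squared distance of every point to its own cluster centroid. */
    static double sse(double[][][] clusters) {
        double total = 0.0;
        for (double[][] cluster : clusters) {
            double[] centroid = mean(cluster);
            for (double[] p : cluster) {
                total += squaredDistance(p, centroid);
            }
        }
        return total;
    }

    /** Separation: size-weighted squared distance of centroids to the global mean. */
    static double bss(double[][][] clusters) {
        double[] global = new double[clusters[0][0].length];
        int count = 0;
        for (double[][] cluster : clusters) {
            for (double[] p : cluster) {
                for (int d = 0; d < global.length; d++) {
                    global[d] += p[d];
                }
                count++;
            }
        }
        for (int d = 0; d < global.length; d++) {
            global[d] /= count;
        }
        double total = 0.0;
        for (double[][] cluster : clusters) {
            total += cluster.length * squaredDistance(mean(cluster), global);
        }
        return total;
    }

    public static void main(String[] args) {
        double[][][] clusters = {
            {{0, 0}, {2, 0}},
            {{10, 0}, {12, 0}}
        };
        System.out.println("SSE=" + sse(clusters) + ", BSS=" + bss(clusters));
    }
}
```

Under these definitions a lower SSE means tighter clusters and a higher BSS means better-separated clusters; SSE plus BSS equals the total sum of squares around the global mean.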
  • 30. Clustering Quality, 0% Noise
      [Figures: snapshot at 25,000 instances; snapshot at 45,000 instances]
  • 31. Clustering Quality, 0% Noise
      [Plot: Ratio = BSS / GT]
  • 32. Clustering Quality, 10% Noise
      [Figures: snapshot at 25,000 instances; snapshot at 45,000 instances. Annotations: Good clustering, Poor clustering]
  • 34. Conclusion
      ¤ There is important information in the massive amount of data being produced and discarded
      ¤ Tools are needed to deal with this data efficiently
      ¤ Efforts have been made to crunch big data
      ¤ Interpreting and retrieving relevant information is where machine learning and data mining operate
      ¤ Real-time analysis responds faster to evolving data
      ¤ SAMOA abstracts the platform and maintains the algorithms; good to implement, test and use.
  • 35. Acknowledgements
      ¤ Thanks to Erasmus Mundus and all three universities (UPC, KTH and IST) for providing this opportunity
      ¤ Thanks to all the EMDC students
      ¤ Thanks to Yahoo! Research for the great project