Big Data Processing
January 2017
• Data Architect at The unbelievable Machine Company
• Software Engineering Background
• Jack of all Trades who also dives into Business topics,
Systems Engineering and Data Science.
• Big Data since 2011
• Cross-Industry: From Automotive to Transportation
• Other Activities
• Trainer: Hortonworks Apache Hadoop Certified Trainer
• Author: Articles and book projects
• Lecturer: Big Data at FH Technikum and FH Wiener Neustadt
Stefan Papp
Agenda
• Big Data Processing
• Evolution in Processing Big Data
• Data Processing Patterns
• Components of a Data Processing Engine
• Apache Spark
• Concept
• Ecosystem
• Apache Flink
• Concept
• Ecosystem
Big Data Processing Engines on a Hadoop 2.x Reference(!) Stack
HADOOP 2.x STACK
• Applications
• API: Script (Pig), SQL (Hive), Java (Cascading), NoSQL (HBase), Stream (Storm), Machine Learning (SparkML), Graph (Giraph)
• Engine: Batch (MapReduce), Batch & Interactive (Tez), Real-Time (Slider), Direct (Java), Search (Solr), RDD & PACT (Spark, Flink), Other Application
• Data Operating System: YARN (cluster resource management)
• File System: HDFS (redundant, reliable storage)
Evolution in Processing
Big Data Roots – Content Processing for Search Engines
IO Read Challenge: Read 500 GB Data (as a Reference)
• Assumption
• Shared nothing, plain read
• You can read 256 MB in 1.9 seconds
• Single Node
• Total Blocks in 500 GB = 1954 Blocks
• 1954 * 1.9 / 3600 = approx. 1 hour sequential read.
• A 40 node cluster with 8 HDs on each node
• 320 HDs -> 6 to 7 blocks on each disk
• 7 blocks * 1.9 = 13.3 seconds total read time
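The arithmetic on this slide can be reproduced with a short sketch. It assumes 1 GB = 1000 MB (which yields the 1954 blocks quoted above) and that all disks read in parallel without contention; the variable names are illustrative only.

```python
import math

BLOCK_MB = 256             # block size from the slide
SECONDS_PER_BLOCK = 1.9    # read time per block from the slide
DATA_MB = 500 * 1000       # 500 GB, assuming 1 GB = 1000 MB

# Single node: every block is read one after the other.
blocks = math.ceil(DATA_MB / BLOCK_MB)
single_node_hours = blocks * SECONDS_PER_BLOCK / 3600

# Cluster: 40 nodes with 8 disks each, blocks spread evenly over all disks.
disks = 40 * 8
blocks_per_disk = math.ceil(blocks / disks)
cluster_seconds = blocks_per_disk * SECONDS_PER_BLOCK

print(blocks, round(single_node_hours, 2), disks, blocks_per_disk, round(cluster_seconds, 1))
```

The busiest disk holds 7 blocks, so the parallel read is bounded by that disk rather than by the average of about 6.1 blocks per disk.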
Parallelism and Concurrency are Complex
a = 1
a = a + async_add(1)
a = a * async_mul(2)
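Why the three lines above are tricky: if the two asynchronous updates are applied in a different order, the result changes. The sketch below does not use real asynchronous calls; it only replays the two possible orders deterministically, and the helper names are hypothetical.

```python
def add_one(a):
    return a + 1

def times_two(a):
    return a * 2

def run(steps, start=1):
    """Apply the update steps to the start value in the given order."""
    value = start
    for step in steps:
        value = step(value)
    return value

add_then_mul = run([add_one, times_two])   # (1 + 1) * 2
mul_then_add = run([times_two, add_one])   # (1 * 2) + 1
print(add_then_mul, mul_then_add)
```

Both orders are valid schedules when nothing enforces an ordering, which is exactly the kind of concern a processing engine is meant to take off the developer.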
Data Flow Engine to Abstract Data Processing
• Provide a programming interface
• Express jobs as graphs of high-level operators
• System picks how to split each operator into tasks and where to run each task
• Solve topics such as
• Concurrency
• Fault recovery
MapReduce: Divide and Conquer
Input → Map, Map, Map → Reduce, Reduce → Output
Iterative jobs: Read → Iter. → Write, repeated for each of four iterations
Iteration: Map, Reduce
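The divide-and-conquer idea can be sketched in plain Python, without Hadoop. The phase names (map, shuffle, reduce) follow the usual MapReduce description; the functions and the input lines are made up for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit one (word, 1) pair per word of an input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values that belong to the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)

def map_reduce(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = map_reduce(["big data", "big cluster"])
print(counts)
```

In a real cluster each map call and each reduce call can run on a different node, and every iteration reads its input from storage and writes its output back, which is the cost the Read/Write diagram above points at.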
Evolution of Data Processing
2004
2007
2010 2010
Data Processing Patterns
Classification and Processing Patterns
Batch Processing
Source
Source
Storage Layer
Periodic ingestion
Batch Processor
Periodic analysis job
Consumer
Job scheduler
SQL
Import
File
Import
Stream processor / Kappa Architecture
Source
Source
Consumer
Forward events
immediately to
pub/sub bus
Stream
Processor
Process at event time &
update serving layer
Messaging
System
Hybrid / Lambda processing
Storage Layer
Batch job(s) for
analysis
Serving
layer
Job scheduler
Stream Processor
Messaging
System
Consumer
Source
Source
Source
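A minimal sketch of the hybrid pattern, assuming the serving layer simply adds a periodically recomputed batch view to a small real-time view. The event data and function names are hypothetical.

```python
from collections import Counter

def count_by_key(events):
    """Used by both the batch job and the stream processor in this sketch."""
    return Counter(events)

def serving_layer(batch_view, speed_view):
    """Answer queries from the batch result plus the not yet batched events."""
    return batch_view + speed_view

stored_events = ["AT", "DE", "AT"]   # already ingested into the storage layer
recent_events = ["DE", "DE"]         # arrived after the last batch run

batch_view = count_by_key(stored_events)
speed_view = count_by_key(recent_events)
merged = serving_layer(batch_view, speed_view)
print(merged)
```

Once the next batch run has absorbed the recent events, the speed view is reset; this sketch leaves that hand-over out.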
Technology Mapping
Messaging	
System
Storage	
Layer
Processing
Serving
layer
Job scheduler
Data Processing Engines
Essential Components
General Purpose Data Processing Engines
Processing Engine (with API) on top of an <interface to> Storage Layer
Apache Spark
• Abstraction engines: Pig
• SQL/query language engine: SparkSQL, SparkR
• Real-time processing: Spark Streaming
• Machine learning: MLlib, H2O
• Graph processing: GraphX
• Runs on: Cloud, Hadoop, local environment
Apache Flink
• Abstraction engines: Hadoop MR, Cascading
• SQL/query language engine: Table API
• Real-time processing: CEP
• Machine learning: FlinkML
• Graph processing: Gelly
• Runs on: Cloud, Hadoop, local environment
Features of Data Processing Engines
• Processing Mode: Batch, Streaming, Hybrid
• Category: DC/SEP/ESP/CEP
• Delivery guarantees: at least once/exactly once
• State management: distributed snapshots/checkpoints
• Out-of-order processing: y/n
• Windowing: time-based, count-based
• Latency: low or medium
Apache Spark
A Unified engine across data workloads and platforms
Differentiation from MapReduce
• MapReduce was designed to process shared-nothing data
• Processing with data sharing:
• complex, multi-pass analytics (e.g. ML, graph)
• interactive ad-hoc queries
• real-time stream processing
• Improvements for coding:
• Less boilerplate code, richer API
• Support of various programming languages
Two Key Innovations of Spark
Execution optimization
via DAGs
Distributed data containers
(RDDs) to avoid Serialization.
Query
Input
Query
Query
Start REPL locally, delegate execution via cluster manager
Execute on REPL
./bin/spark-shell --master local
./bin/pyspark --master yarn-client
Execute as application
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000
Execute within Application
Components of a spark application
Driver program
• SparkContext/SparkSession as Hook to Execution
Environment
• Java, Scala, Python or R Code (REPL or App)
• Creates a DAG of Jobs and
Cluster manager
• grants executors to a Spark application
• Included: Standalone, Yarn, Mesos or local
• Custom made: e.g. Cassandra
• Distributes Jobs to executors
Executors
• Worker processes that execute tasks and store
data
Web UI (default port 4040)
• Supervise execution
Spark in Scala and Python
// Scala:
val distFile = sc.textFile("README.md")
distFile.map(l => l.split(" ")).collect()      // one array of words per line
distFile.flatMap(l => l.split(" ")).collect()  // one flat collection of words
# Python:
distFile = sc.textFile("README.md")
distFile.map(lambda x: x.split(' ')).collect()
distFile.flatMap(lambda x: x.split(' ')).collect()
// Java 7 (Spark 1.x signature; since Spark 2.0, call() returns Iterator<String>):
JavaRDD<String> distFile = sc.textFile("README.md");
JavaRDD<String> words = distFile.flatMap(
    new FlatMapFunction<String, String>() {
        public Iterable<String> call(String line) {
            return Arrays.asList(line.split(" "));
        }});
// Java 8 (on Spark 2.x, append .iterator() to the lambda result):
JavaRDD<String> distFile = sc.textFile("README.md");
JavaRDD<String> words =
    distFile.flatMap(line -> Arrays.asList(line.split(" ")));
History of Spark API Containers
SparkSession / SparkContext – the standard way to create container
• SparkSession (starting from 2.0) as Hook to the Data,
• SparkContext still available (accessible via SparkSession.sparkContext)
• Use SparkSession to create Datasets
• Use SparkContext to create RDD
• A session object knows about the execution environment
• Can be used to load data into a container
Operations on Collections: Transformations and Actions
val lines = sc.textFile("hdfs:///data/shakespeare/input") // Transformation
val lineLengths = lines.map(s => s.length) // Transformation
val totalLength = lineLengths.reduce((a, b) => a + b) // Action
Transformation:
• Create a new distributed data set from a source or from another data set
• Transformations are stacked until execution (lazy evaluation)
Actions:
• Trigger an execution
• Create an optimized execution path for the stacked transformations
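Lazy evaluation can be illustrated without Spark. The class below is a hypothetical stand-in for an RDD: map only records the function, and reduce (the action) runs everything that was recorded. It is a sketch of the idea, not of Spark's implementation, and it expects a non-empty source.

```python
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: remember the function, do not touch the data yet.
        return LazyDataset(self._source, self._steps + (fn,))

    def reduce(self, fn):
        # Action: apply all recorded steps, then fold the results.
        values = []
        for item in self._source:
            for step in self._steps:
                item = step(item)
            values.append(item)
        result = values[0]
        for value in values[1:]:
            result = fn(result, value)
        return result

calls = []

def traced_len(s):
    calls.append(s)          # lets us observe when the function really runs
    return len(s)

lines = LazyDataset(["to be", "or not", "to be"])
line_lengths = lines.map(traced_len)       # nothing is executed here
calls_before_action = len(calls)
total_length = line_lengths.reduce(lambda a, b: a + b)
print(calls_before_action, len(calls), total_length)
```

Because the whole chain is known before anything runs, an engine has the chance to plan the execution as a whole instead of step by step.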
Common Transformations
val rdd = sc.parallelize(Array("This is a line of text", "And so is this"))
map - apply a function while preserving structure
rdd.map( line => line.split("\\s+") )
→ Array(Array(This, is, a, line, of, text), Array(And, so, is, this))
flatMap - apply a function while flattening structure
rdd.flatMap( line => line.split("\\s+") )
→ Array(This, is, a, line, of, text, And, so, is, this)
filter - discard elements from an RDD which don't match a condition
rdd.filter( line => line.contains("so") )
→ Array(And so is this)
reduceByKey - merge the values of each key with a function
// pair_rdd is assumed to hold (word, 1) pairs built from the lowercased words above
pair_rdd.reduceByKey( (v1, v2) => v1 + v2 )
→ Array((this,2), (is,2), (line,1), (so,1), (text,1), (a,1), (of,1), (and,1))
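For readers without a Spark shell at hand, the same four transformations can be mimicked on ordinary Python lists. This is only an analogy on a single machine; it assumes the pair data set is built from the lowercased words, as the reduceByKey result above suggests.

```python
import re
from collections import Counter

lines = ["This is a line of text", "And so is this"]

# map: one output element per input element (structure preserved)
mapped = [re.split(r"\s+", line) for line in lines]

# flatMap: the per-line results are flattened into one list
flat_mapped = [word for line in lines for word in re.split(r"\s+", line)]

# filter: keep only the elements that match the condition
filtered = [line for line in lines if "so" in line]

# reduceByKey: merge all values of the same key with a function (here: +)
pairs = [(word.lower(), 1) for word in flat_mapped]
reduced = Counter()
for key, value in pairs:
    reduced[key] += value

print(mapped, flat_mapped, filtered, dict(reduced), sep="\n")
```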
Common Spark Actions
collect - gather results from nodes and return
first - return the first element of the RDD
take(N) - return the first N elements of the RDD
saveAsTextFile - write the RDD as a text file
saveAsSequenceFile - write the RDD as a SequenceFile
count - count elements in the RDD
countByKey - count elements in the RDD by key
foreach - process each element of an RDD (e.g., rdd.collect.foreach(println))
WordCount in Scala
val text = sc.textFile(source_file)
val words = text.flatMap( line => line.split("\\W+") )
val kv = words.map( word => (word.toLowerCase(), 1) )
val totals = kv.reduceByKey( (v1, v2) => v1 + v2 )
totals.saveAsTextFile(output)
BDAS – Berkeley Data Analytics Stack
How to use SQL on Spark
• Spark SQL: Component directly in the Berkeley ecosystem
• Hive on Spark: Use Spark as execution engine for Hive
• BlinkDB: Approximate SQL Engine
Spark SQL
Spark SQL uses DataFrames (Typed Data Containers) for SQL
Hive:
from pyspark.sql import HiveContext
c = HiveContext(sc)
rows = c.sql("select * from titanic")
rows.filter(rows['age'] > 25).show()
JSON:
c.read.format('json').load('file:///root/tweets.json').registerTempTable("tweets")
c.sql("select text, user.name from tweets")
Spark SQL Example
BlinkDB
• An approximate query engine for running interactive SQL queries.
• Allows users to trade off query accuracy for response time,
• enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars.
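The trade-off between accuracy and response time can be sketched with plain uniform sampling. This is not how BlinkDB is implemented; it only shows the principle of answering from a sample and attaching an error bar. The function name, the data, and the normal-approximation error bar (1.96 standard errors) are assumptions of this sketch.

```python
import math
import random
import statistics

def approximate_mean(values, sample_size, seed=0):
    """Estimate the mean from a random sample and return it with an error bar."""
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, 1.96 * std_error   # estimate and roughly 95% error bar

population = list(range(1, 100_001))
estimate, error_bar = approximate_mean(population, sample_size=1_000)
exact = statistics.mean(population)
print(f"approx. {estimate:.0f} +/- {error_bar:.0f} (exact: {exact})")
```

Only 1% of the rows are touched here; a larger sample narrows the error bar at the price of a slower answer.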
Streaming
Spark Streaming and Structured Streaming
Spark Streaming
Streaming on RDDs
Structured Streaming
Streaming on DataFrames
Machine Learning and Graph Analytics
SparkML, Spark MLlib, GraphX, GraphFrames
Typical Use Cases
Classification and regression
• Linear support vector machine
• Logistic regression
• Linear least squares, Lasso, ridge regression
• Decision tree
• Naive Bayes
Collaborative filtering
• Alternating least squares
Clustering
• K-means
Dimensionality reduction
• Singular value decomposition
• Principal component analysis
Optimization
• Stochastic gradient descent
• Limited-memory BFGS
http://spark.apache.org/docs/latest/mllib-guide.html
MLlib and H2O
• Databricks ML libraries: inspired by the scikit-learn library.
• MLlib works with RDDs
• ML works with DataFrames
• H2O library: library built by the company H2O.
• H2O can be integrated with Spark with the 'Sparkling Water' connector.
Graph Analytics
Graph Engine that analyzes tabular data
• Nodes: People and things (nouns/keys)
• Edges: relationships between nodes
Algorithms
PageRank
Connected components
Label propagation
SVD++
Strongly connected components
Triangle count
One Framework per Container-API
• GraphX is designed for RDDs
• GraphFrames for DataFrames
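To make the first algorithm in the list concrete, here is a small PageRank sketch in plain Python on an edge list (nodes and edges as in the bullets above). It does not use GraphX or GraphFrames; the damping factor 0.85, the iteration count, and the example graph are assumptions chosen for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Iteratively distribute each node's rank over its outgoing edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            targets = out_links[src] or nodes   # dangling node: spread evenly
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

edges = [("alice", "bob"), ("carol", "bob"), ("bob", "alice")]
ranks = pagerank(edges)
print(ranks)
```

Nobody links to carol, so she keeps only the base rank; bob collects rank from two nodes and ends up on top.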
Survival of the “Fastest“
Apache Flink
[Chart: throughput in msgs/sec for Storm, Flink, and Flink (10 GigE end-to-end), axis from 0 to 16,000,000; labeled value: 15m msgs/sec]
Streaming: continuous processing on data that is continuously produced
Sources → Message Broker → Stream processor
collect → publish/subscribe → analyse → serve & store
Apache Flink Ecosystem
Gelly
Table
ML
SAMOA
DataSet (Java/Scala)
DataStream (Java/Scala)
Hadoop M/R
Local, Cluster, YARN
ApacheBeam
ApacheBeam
Table
Cascading
Streaming dataflow runtime
Storm API
Zeppelin
CEP
Expressive APIs
DataStream API (streaming):
case class Word (word: String, frequency: Int)
val lines: DataStream[String] = env.fromSocketStream(...)
lines.flatMap {line => line.split(" ")
    .map(word => Word(word,1))}
  .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
  .groupBy("word").sum("frequency")
  .print()
DataSet API (batch):
val lines: DataSet[String] = env.readTextFile(...)
lines.flatMap {line => line.split(" ")
    .map(word => Word(word,1))}
  .groupBy("word").sum("frequency")
  .print()
Flink Engine – Core Design Principles
1. Execute everything as streams
2. Allow some iterative (cyclic) dataflows
3. Allow some (mutable) state
4. Operate on managed memory
Memory Management
Further reading: https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
Streaming Source
Streaming Source
Streaming Source
Consumer
Forward events immediately to pub/sub bus
Stream Processor
Process at event time & update serving layer
Message Broker
Low latency
High throughput
Windowing / Out of order events
State handling
Fault tolerance and correctness
Windowing
Apache Flink
Low latency
High throughput
State handling
Windowing
Fault tolerance
and correctness
Building windows from a stream
“Number of visitors in the last 5 minutes per country”
source
Kafka topic
Stream processor
// create stream from Kafka source
DataStream<LogEvent> stream = env.addSource(new KafkaConsumer());
// group by country
KeyedStream<LogEvent, Tuple> keyedStream = stream.keyBy("country");
// window of size 5 minutes
keyedStream.timeWindow(Time.minutes(5))
// do operations per window
.apply(new CountPerWindowFunction());
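What the job above computes can be shown on a handful of events in plain Python. The sketch assumes events are (timestamp in seconds, country) pairs and that windows are non-overlapping five-minute buckets aligned to timestamp 0; the data and the function name are hypothetical, and no Flink API is involved.

```python
from collections import Counter

WINDOW_SECONDS = 5 * 60

def count_per_window(events, window_seconds=WINDOW_SECONDS):
    """Count events per (window start, country)."""
    counts = Counter()
    for timestamp, country in events:
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, country)] += 1
    return dict(counts)

events = [(10, "AT"), (120, "AT"), (299, "DE"), (300, "AT"), (610, "DE")]
result = count_per_window(events)
print(result)
```

Grouping by the pair of window start and country corresponds to keying the stream by country first and windowing it afterwards.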
Building windows: Execution
Kafka
Source
Window
Operator
S
S
S
W
W
W
group by
country
// window of size 5 minutes
keyedStream.timeWindow(Time.minutes(5));
Job plan; parallel execution on the cluster
Time
Streaming: Windows
Aggregates on streams are scoped by windows
• Time-driven: e.g. last X minutes
• Data-driven: e.g. last X records
Window types in Flink
Tumbling windows
Sliding windows
Custom windows with window assigners, triggers and evictors
Further reading: http://flink.apache.org/news/2015/12/04/Introducing-windows.html
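The difference between tumbling and sliding windows comes down to how many windows one event belongs to. The sketch below assigns a timestamp to its window or windows, each given as a half-open interval [start, end); it is a simplified illustration with hypothetical function names, not Flink's window assigner code.

```python
def tumbling_window(timestamp, size):
    """A tumbling window: every timestamp falls into exactly one window."""
    start = timestamp - (timestamp % size)
    return (start, start + size)

def sliding_windows(timestamp, size, slide):
    """Sliding windows overlap, so one timestamp can fall into several."""
    start = timestamp - (timestamp % slide)   # latest window containing it
    windows = []
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

print(tumbling_window(7, size=10))
print(sliding_windows(7, size=10, slide=5))
```

With a window size of 10 and a slide of 5, every event is counted in two windows; when the slide equals the size, sliding windows degenerate into tumbling windows.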
Event Time vs. Processing Time
Processing time (release year): 1977, 1980, 1983, 1999, 2002, 2005, 2015
Event time (episode): IV, V, VI, I, II, III, VII
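The Star Wars analogy translates directly into code: the release year plays the role of processing time (when an event reaches the system), the episode number the role of event time (when it happened in the story). The sketch uses the release data from the slide and orders it both ways.

```python
# (release year, episode number) pairs, in the order the films arrived
releases = [(1977, 4), (1980, 5), (1983, 6), (1999, 1), (2002, 2), (2005, 3), (2015, 7)]

# Processing-time order: the order in which the "events" reached the audience
by_processing_time = [episode for year, episode in sorted(releases)]

# Event-time order: the order in which the story actually happens
by_event_time = sorted(episode for _, episode in releases)

print(by_processing_time)
print(by_event_time)
```

A stream processor that supports out-of-order events has to produce results as if it had seen the second ordering, even though the data arrives in the first.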
State Handling
Apache Flink
Low latency
High throughput
State handling
Windowing
Fault tolerance and correctness
Batch vs. Continuous
Batch Jobs
• No state across batches
• Fault tolerance within a job
• Re-processing starts empty
Continuous Programs
• Continuous state across time
• Fault tolerance guards state
• Reprocessing starts stateful
Streaming: Savepoints
Savepoint A, Savepoint B
Globally consistent point-in-time snapshot of the streaming application
Re-processing data (continuous)
• Draw savepoints at times that you will want to start new jobs from (daily, hourly, …)
• Reprocess by starting a new job from a savepoint
• Defines start position in stream (for example Kafka offsets)
• Initializes pending state (like partial sessions)
Savepoint
Run new streaming program from savepoint
Stream processor: Flink
Managed state in Flink
• Flink automatically backs up and restores state
• State can be larger than the available memory
• State back ends: (embedded) RocksDB, heap memory
Operator with windows (large state)
State backend (local)
Distributed File System
Periodic backup / recovery
Source: Kafka
Fault Tolerance
Apache Flink
Low latency
High throughput
State handling
Windowing
Fault tolerance
and correctness
Fault tolerance in streaming
• How do we ensure the results are always correct?
• Failures should not lead to data loss or incorrect results
Source
Kafka
topic
Stream processor
Fault tolerance in streaming
• At least once: ensure all events are transmitted
• May lead to duplicates
• At most once: ensure that a known state of data is transmitted
• May lead to data loss
• Exactly once: ensure that operators do not perform duplicate updates to their state
• Flink achieves exactly once with Distributed Snapshots
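The difference between at-least-once and exactly-once state updates can be sketched with a single counting operator that crashes once and then replays its input from the last checkpoint. This is a strongly simplified, single-process illustration of the snapshot idea, not Flink's distributed snapshot algorithm; all names and parameters are hypothetical.

```python
def run_with_failure(events, checkpoint_every, fail_at, restore_state):
    """Count events; crash once at index fail_at, then replay from the last checkpoint."""
    count = 0
    checkpoint = (0, 0)        # (next offset to read, count at that offset)
    offset = 0
    failed = False
    while offset < len(events):
        if offset % checkpoint_every == 0:
            checkpoint = (offset, count)       # snapshot offset and state together
        if offset == fail_at and not failed:
            failed = True
            offset = checkpoint[0]             # replay from the checkpointed offset
            if restore_state:
                count = checkpoint[1]          # roll the state back as well
            continue
        count += 1
        offset += 1
    return count

events = list(range(10))
at_least_once = run_with_failure(events, checkpoint_every=4, fail_at=6, restore_state=False)
exactly_once = run_with_failure(events, checkpoint_every=4, fail_at=6, restore_state=True)
print(at_least_once, exactly_once)
```

Replaying the input alone counts the events between the checkpoint and the crash twice; restoring the state together with the read position removes the duplicates.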
Low Latency
Apache Flink
Low latency
High throughput
State handling
Windowing
Fault tolerance
and correctness
Yahoo! Benchmark
• Count ad impressions grouped by campaign
• Compute aggregates over a 10 second window
• Emit window aggregates to Redis every second for query
Full Yahoo! article: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
“Storm […] and Flink […] show sub-second latencies at relatively high
throughputs with Storm having the lowest 99th percentile latency.
Spark streaming 1.5.1 supports high throughputs, but at a relatively higher
latency.”
(Quote from the blog post’s executive summary)
Windowing with state in Redis
• Original use case did not use Flink’s windowing implementation.
• Data Artisans implemented the use case with Flink windowing.
KafkaConsumer
map()
filter()
group
Flink event
time windows
realtime queries
Results after rewrite
[Chart: throughput in msgs/sec, Storm vs. Flink, axis from 0 to 3,750,000; labeled value: 400k msgs/sec]
Can we even go further?
KafkaConsumer
map()
filter()
group
Flink event
time windows
Network link to Kafka cluster is bottleneck! (1 GigE)
Data Generator
map()
filter()
group
Flink event
time windows
Solution: Move data generator into job (10 GigE)
Results without network bottleneck
[Chart: throughput in msgs/sec for Storm, Flink, and Flink (10 GigE end-to-end), axis from 0 to 16,000,000; labeled values: 400k, 3m, and 15m msgs/sec]
Survival of the Fastest – Flink Performance
• throughput of 15 million messages/second on 10 machines
• 35x higher throughput compared to Storm (80x compared to Yahoo’s runs)
• exactly once guarantees
• Read the full report: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
The unbelievable Machine Company GmbH
Museumsplatz 1/10/13
1070 Wien
Contact:
Stefan Papp, Data Architect
stefan.papp@unbelievable-machine.com
Tel. +43 - 1 - 361 99 77 - 215
Mobile: +43 664 2614367

 
Openfabnet - A collaborative approach towards industry 4.0 based on open sour...
Openfabnet - A collaborative approach towards industry 4.0 based on open sour...Openfabnet - A collaborative approach towards industry 4.0 based on open sour...
Openfabnet - A collaborative approach towards industry 4.0 based on open sour...
 
Lange - Industrial Data Space – Digital Sovereignty over Data
Lange - Industrial Data Space – Digital Sovereignty over DataLange - Industrial Data Space – Digital Sovereignty over Data
Lange - Industrial Data Space – Digital Sovereignty over Data
 
Industry 4.0 by VDSG and Informance
Industry 4.0 by VDSG and InformanceIndustry 4.0 by VDSG and Informance
Industry 4.0 by VDSG and Informance
 
Donner - Deep Learning - Overview and practical aspects
Donner - Deep Learning - Overview and practical aspectsDonner - Deep Learning - Overview and practical aspects
Donner - Deep Learning - Overview and practical aspects
 
Langs - Machine Learning in Medical Imaging: Learning from Large-scale popula...
Langs - Machine Learning in Medical Imaging: Learning from Large-scale popula...Langs - Machine Learning in Medical Imaging: Learning from Large-scale popula...
Langs - Machine Learning in Medical Imaging: Learning from Large-scale popula...
 
Brunauer, Weidinger - Welcome from the Vienna Data Science Group
Brunauer, Weidinger - Welcome from the Vienna Data Science GroupBrunauer, Weidinger - Welcome from the Vienna Data Science Group
Brunauer, Weidinger - Welcome from the Vienna Data Science Group
 

Último

Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
MarinCaroMartnezBerg
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 

Último (20)

Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 

20170126 big data processing

  • 2. Stefan Papp • Data Architect at The unbelievable Machine Company • Software Engineering Background • Jack of all Trades who also dives into Business topics, Systems Engineering and Data Science • Big Data since 2011 • Cross-Industry: From Automotive to Transportation • Other Activities • Trainer: Hortonworks Apache Hadoop Certified Trainer • Author: Articles and book projects • Lecturer: Big Data at FH Technikum and FH Wiener Neustadt
  • 3. Agenda • Big Data Processing • Evolution in Processing Big Data • Data Processing Patterns • Components of a Data Processing Engine • Apache Spark • Concept • Ecosystem • Apache Flink • Concept • Ecosystem
  • 4. Big Data Processing Engines on a Hadoop 2.x Reference(!) Stack HADOOP 2.x STACK HDFS (redundant, reliable storage) YARN (cluster resource management) Batch MapReduce Direct Java Search Solr API Engine Data Operating System File System Batch & Interactive Tez Script Pig SQL Hive Cascading Java Real-Time Slider NoSQL HBase Stream Storm RDD & PACT Spark, Flink Machine Learning SparkML Other Application Graph Giraph Applications
  • 6. Big Data Roots – Content Processing for Search Engines
  • 7. IO Read Challenge: Read 500 GB Data (as a Reference) • Assumption • Shared nothing, plain read • You can read 256 MB in 1.9 seconds • Single Node • Total Blocks in 500 GB = 1954 Blocks • 1954 * 1.9 / 3600 = approx. 1 hour sequential read • A 40 node cluster with 8 HDs on each node • 320 HDs -> 6 to 7 blocks on each disk • 7 blocks * 1.9 = 13.3 seconds total read time
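The arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch: it assumes 500 GB is counted as 500,000 MB (the reading that yields the slide's 1954 blocks) and that all disks read in parallel without contention. The variable names are made up for illustration.

```python
import math

BLOCK_MB = 256            # block size from the slide
SECONDS_PER_BLOCK = 1.9   # read time per block from the slide
TOTAL_MB = 500 * 1000     # assumption: 500 GB counted as 500,000 MB

# Single node: all blocks are read one after another.
blocks = math.ceil(TOTAL_MB / BLOCK_MB)
single_node_seconds = blocks * SECONDS_PER_BLOCK

# Cluster: 40 nodes with 8 disks each, every disk reads its share in parallel.
disks = 40 * 8
blocks_per_disk = math.ceil(blocks / disks)
cluster_seconds = blocks_per_disk * SECONDS_PER_BLOCK

print(blocks, round(single_node_seconds / 3600, 2), "hours on a single node")
print(disks, blocks_per_disk, round(cluster_seconds, 1), "seconds on the cluster")
```

The slowest disk determines the total read time, which is why the sketch rounds the number of blocks per disk up to 7.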
  • 8. Parallelism and Concurrency are Complex a = 1 a = a + async_add(1) a = a * async_mul(2)
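The slide does not define `async_add` and `async_mul`. One possible reading is that two asynchronous updates compete for the same variable, so the final value depends on the order in which they are applied. The sketch below makes that order explicit instead of using real threads; the helper names `add_one`, `mul_two` and `run` are hypothetical.

```python
def add_one(a):
    return a + 1

def mul_two(a):
    return a * 2

def run(operations, a=1):
    # Apply the operations to the shared value in the given order.
    for operation in operations:
        a = operation(a)
    return a

add_then_mul = run([add_one, mul_two])   # (1 + 1) * 2
mul_then_add = run([mul_two, add_one])   # (1 * 2) + 1
print(add_then_mul, mul_then_add)
```

Both orders are valid schedules of the same program, yet they produce different results. A data flow engine takes this kind of coordination off the developer's hands.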
  • 9. Data Flow Engine to Abstract Data Processing • Provide a programming interface • Express jobs as graphs of high-level operators • System picks how to split each operator into tasks and where to run each task • Solve topics such as • Concurrency • Fault recovery
  • 10. MapReduce: Divide and Conquer (diagram: Input, parallel Map tasks, Reduce tasks, Output; each iteration of Map and Reduce reads its input and writes its output)
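The divide-and-conquer idea can be sketched in plain Python as three phases: map, shuffle and reduce. This is a single-process illustration, not Hadoop code; the function names are hypothetical, and a real engine would run the map and reduce calls on different nodes.

```python
from collections import defaultdict

def map_phase(lines):
    # Emit one (key, value) pair per word.
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Group all values that belong to the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine the values of each key into one result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["This is a line of text", "And so is this"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```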
  • 12. Evolution of Data Processing 2004 2007 2010 2010
  • 14. Batch Processing (diagram labels: Source; Source; SQL Import; File Import; Periodic ingestion; Storage Layer; Batch Processor; Periodic analysis job; Job scheduler; Consumer)
  • 15. Stream processor / Kappa Architecture (diagram labels: Source; Source; Forward events immediately to pub/sub bus; Messaging System; Stream Processor; Process at event time & update serving layer; Consumer)
  • 16. Hybrid / Lambda processing (diagram labels: Source; Source; Source; Messaging System; Stream Processor; Storage Layer; Batch job(s) for analysis; Job scheduler; Serving layer; Consumer)
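  • 19. General Purpose Data Processing Engines
      Processing engine (with API): Apache Spark • Abstraction engines: Pig • SQL/query language engines: SparkSQL, SparkR • Real-time processing: Spark Streaming • Machine learning: MLlib, H2O • Graph processing: GraphX • Interface to storage layer: Cloud, Hadoop, Local Env…
      Processing engine (with API): Apache Flink • Abstraction engines: HadoopMR, Cascading • SQL/query language engines: Table API • Real-time processing: CEP • Machine learning: FlinkML • Graph processing: Gelly • Interface to storage layer: Cloud, Hadoop, Local Env…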
  • 20. Features of Data Processing Engines • Processing mode: batch, streaming, hybrid • Category: DC/SEP/ESP/CEP • Delivery guarantees: at least once/exactly once • State management: distributed snapshots/checkpoints • Out-of-order processing: y/n • Windowing: time-based, count-based • Latency: low or medium
  • 22. A Unified engine across data workloads and platforms
  • 23. Differentiation from MapReduce • MapReduce was designed to process shared nothing data • Processing with data sharing: • complex, multi-pass analytics (e.g. ML, graph) • interactive ad-hoc queries • real-time stream processing • Improvements for coding: • Less boilerplate code, richer API • Support for various programming languages
  • 24. Two Key Innovations of Spark • Execution optimization via DAGs • Distributed data containers (RDDs) to avoid serialization (diagram labels: Input; Query; Query; Query)
  • 26. Start REPL locally, delegate execution via cluster manager
      Execute on REPL:
        ./bin/spark-shell --master local
        ./bin/pyspark --master yarn-client
      Execute as application:
        ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://207.184.161.138:7077 --executor-memory 20G --total-executor-cores 100 /path/to/examples.jar 1000
      Execute within Application
  • 27. Components of a Spark application
      Driver program • SparkContext/SparkSession as hook to the execution environment • Java, Scala, Python or R code (REPL or app) • Creates a DAG of jobs
      Cluster manager • Grants executors to a Spark application • Included: Standalone, YARN, Mesos or local • Custom made: e.g. Cassandra • Distributes jobs to executors
      Executors • Worker processes that execute tasks and store data
      Spark web UI (default port 4040) • Supervise execution
  • 29. Spark in Scala and Python
      // Scala:
      val distFile = sc.textFile("README.md")
      distFile.map(l => l.split(" ")).collect()
      distFile.flatMap(l => l.split(" ")).collect()

      # Python:
      distFile = sc.textFile("README.md")
      distFile.map(lambda x: x.split(' ')).collect()
      distFile.flatMap(lambda x: x.split(' ')).collect()

      // Java 7:
      JavaRDD<String> distFile = sc.textFile("README.md");
      JavaRDD<String> words = distFile.flatMap(
        new FlatMapFunction<String, String>() {
          public Iterable<String> call(String line) {
            return Arrays.asList(line.split(" "));
          }});

      // Java 8:
      JavaRDD<String> distFile = sc.textFile("README.md");
      JavaRDD<String> words = distFile.flatMap(line -> Arrays.asList(line.split(" ")));
  • 30. History of Spark API Containers
  • 31. SparkSession / SparkContext: the standard way to create containers • SparkSession (starting from 2.0) as hook to the data • SparkContext still available (can be created via SparkSession.sparkContext()) • Use SparkSession to create DataSets • Use SparkContext to create RDDs • A session object knows about the execution environment • Can be used to load data into a container
  • 32. Operations on Collections: Transformations and Actions
      val lines = sc.textFile("hdfs:///data/shakespeare/input")   // Transformation
      val lineLengths = lines.map(s => s.length)                  // Transformation
      val totalLength = lineLengths.reduce((a, b) => a + b)       // Action
      Transformations: • Create a new distributed data set from a source or from another data set • Transformations are stacked until execution (lazy evaluation)
      Actions: • Trigger an execution • Let the engine choose an optimized execution path
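The difference between stacking transformations and triggering them with an action can be imitated with Python generators. This is an analogy in plain Python, not Spark code: the generator expression plays the role of a transformation, `sum` plays the role of an action, and `read_lines` and `log` are hypothetical names used to make the execution order visible.

```python
log = []

def read_lines():
    # Stand-in for a data source; records when a line is actually read.
    for line in ["to be", "or not to be"]:
        log.append("read " + line)
        yield line

# "Transformation": only describes the computation, nothing is read yet.
line_lengths = (len(s) for s in read_lines())
reads_before_action = list(log)

# "Action": pulls the data through the whole pipeline.
total_length = sum(line_lengths)

print(reads_before_action, total_length, log)
```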
  • 33. Common Transformations
      val rdd = sc.parallelize(Array("This is a line of text", "And so is this"))
      map - apply a function while preserving structure
        rdd.map( line => line.split("\\s+") ) → Array(Array(This, is, a, line, of, text), Array(And, so, is, this))
      flatMap - apply a function while flattening structure
        rdd.flatMap( line => line.split("\\s+") ) → Array(This, is, a, line, of, text, And, so, is, this)
      filter - discard elements from an RDD which don't match a condition
        rdd.filter( line => line.contains("so") ) → Array(And so is this)
      reduceByKey - merge the values of each key with a function (pair_rdd is assumed to hold one (word, 1) pair per lower-cased word)
        pair_rdd.reduceByKey( (v1, v2) => v1 + v2 ) → Array((this,2), (is,2), (line,1), (so,1), (text,1), (a,1), (of,1), (and,1))
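The structural difference between map, flatMap and filter can be checked without a cluster. The following sketch mirrors the slide's example with ordinary Python lists; it only illustrates the semantics and does not use the Spark API.

```python
from itertools import chain

lines = ["This is a line of text", "And so is this"]

# map: one output element per input element, so the nesting is preserved.
mapped = [line.split() for line in lines]

# flatMap: the per-line results are concatenated into one flat list.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

# filter: keep only the elements that match the condition.
filtered = [line for line in lines if "so" in line]

print(mapped)
print(flat_mapped)
print(filtered)
```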
  • 34. Common Spark Actions • collect - gather results from nodes and return • first - return the first element of the RDD • take(N) - return the first N elements of the RDD • saveAsTextFile - write the RDD as a text file • saveAsSequenceFile - write the RDD as a SequenceFile • count - count elements in the RDD • countByKey - count elements in the RDD by key • foreach - process each element of an RDD (e.g., rdd.collect.foreach(println))
  • 35. WordCount in Scala
      val text = sc.textFile(source_file)
      val words = text.flatMap( line => line.split("\\W+") )
      val kv = words.map( word => (word.toLowerCase(), 1) )
      val totals = kv.reduceByKey( (v1, v2) => v1 + v2 )
      totals.saveAsTextFile(output)
  • 37. BDAS – Berkeley Data Analytics Stack
  • 38. How to use SQL on Spark • Spark SQL: component that is directly part of the Berkeley ecosystem • Hive on Spark: use Spark as execution engine for Hive • BlinkDB: approximate SQL engine
  • 39. Spark SQL: Spark SQL uses DataFrames (Typed Data Containers) for SQL
      Hive:
        c = HiveContext(sc)
        rows = c.sql("select * from titanic")
        rows.filter(rows['age'] > 25).show()
      JSON:
        c.read.format('json').load('file:///root/tweets.json').registerTempTable("tweets")
        c.sql("select text, user.name from tweets")
  • 41. BlinkDB • An approximate query engine for running interactive SQL queries • Lets users trade off query accuracy for response time • Enables interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars
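The idea of answering a query on a sample and attaching error bars can be sketched with the standard library. This is not BlinkDB's implementation, only the general statistical technique: an average computed on a random sample with an approximate 95% confidence interval. The data is synthetic and all names are hypothetical.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so that the sketch is repeatable

# Synthetic "table column" standing in for a large data set.
population = [random.uniform(0, 100) for _ in range(10_000)]

# Answer the query on a small sample instead of the full data.
sample = random.sample(population, 1_000)
estimate = statistics.mean(sample)

# Approximate 95% error bars from the standard error of the mean.
standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
lower = estimate - 1.96 * standard_error
upper = estimate + 1.96 * standard_error

exact = statistics.mean(population)
print(f"approximate: {estimate:.2f} in [{lower:.2f}, {upper:.2f}], exact: {exact:.2f}")
```

A larger sample narrows the interval at the cost of a longer response time, which is the trade-off the slide describes.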
  • 42. Streaming Spark Streaming and Structured Streaming
  • 43. Spark Streaming Streaming on RDDs Structured Streaming Streaming on DataFrames
  • 44. Machine Learning and Graph Analytics: SparkML, Spark MLlib, GraphX, GraphFrames
  • 45. Typical Use Cases
      Classification and regression • Linear support vector machine • Logistic regression • Linear least squares, Lasso, ridge regression • Decision tree • Naive Bayes
      Collaborative filtering • Alternating least squares
      Clustering • K-means
      Dimensionality reduction • Singular value decomposition • Principal component analysis
      Optimization • Stochastic gradient descent • Limited-memory BFGS
      http://spark.apache.org/docs/latest/mllib-guide.html
  • 46. MLlib and H2O • Databricks ML libraries: inspired by the scikit-learn library • MLlib works with RDDs • ML works with DataFrames • H2O library: library built by the company H2O • H2O can be integrated with Spark with the 'Sparkling Water' connector
  • 47. Graph Analytics: graph engine that analyzes tabular data • Nodes: people and things (nouns/keys) • Edges: relationships between nodes
      Algorithms • PageRank • Connected components • Label propagation • SVD++ • Strongly connected components • Triangle count
      One framework per container API • GraphX is designed for RDDs • GraphFrames for DataFrames
  • 48. Survival of the "Fastest": Apache Flink (bar chart of throughput in msgs/sec for Storm, Flink and Flink (10 GigE); label: 10 GigE end-to-end 15m msgs/sec)
  • 49. Streaming: continuous processing on data that is continuously produced (diagram: Sources: collect; Message Broker: publish/subscribe; Stream processor: analyse; serve & store)
  • 50. Apache Flink Ecosystem (diagram labels: Gelly; Table; ML; SAMOA; DataSet (Java/Scala); DataStream (Java/Scala); Hadoop M/R; Local; Cluster; YARN; Apache Beam; Apache Beam; Table; Cascading; Streaming dataflow runtime; Storm API; Zeppelin; CEP)
  • 51. Expressive APIs
      case class Word (word: String, frequency: Int)

      DataStream API (streaming):
      val lines: DataStream[String] = env.fromSocketStream(...)
      lines.flatMap {line => line.split(" ").map(word => Word(word,1))}
        .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
        .groupBy("word").sum("frequency")
        .print()

      DataSet API (batch):
      val lines: DataSet[String] = env.readTextFile(...)
      lines.flatMap {line => line.split(" ").map(word => Word(word,1))}
        .groupBy("word").sum("frequency")
        .print()
  • 52. Flink Engine – Core Design Principles 1. Execute everything as streams 2. Allow some iterative (cyclic) dataflows 3. Allow some (mutable) state 4. Operate on managed memory
  • 55. Windowing (Apache Flink feature overview: Low latency; High throughput; State handling; Windowing; Fault tolerance and correctness)
  • 56. Building windows from a stream: "Number of visitors in the last 5 minutes per country" (diagram labels: source; Kafka topic; Stream processor)
      // create stream from Kafka source
      DataStream<LogEvent> stream = env.addSource(new KafkaConsumer());
      // group by country
      DataStream<LogEvent> keyedStream = stream.keyBy("country");
      // window of size 5 minutes
      keyedStream.timeWindow(Time.minutes(5))
        // do operations per window
        .apply(new CountPerWindowFunction());
  • 57. Building windows: Execution (diagram: job plan with Kafka Source, group by country, Window Operator; parallel execution on the cluster with source instances S and window instances W over time)
      // window of size 5 minutes
      keyedStream.timeWindow(Time.minutes(5));
  • 58. Streaming: Windows • Aggregates on streams are scoped by windows • Time-driven: e.g. last X minutes • Data-driven: e.g. last X records
  • 59. Window types in Flink • Tumbling windows • Sliding windows • Custom windows with window assigners, triggers and evictors • Further reading: http://flink.apache.org/news/2015/12/04/Introducing-windows.html
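How an event is assigned to tumbling and sliding time windows can be sketched in a few lines of Python. This illustrates the concept only and is not Flink's window assigner code; timestamps are plain integers and the function names are hypothetical.

```python
def tumbling_windows(timestamp, size):
    # Every event belongs to exactly one window of the given size.
    start = timestamp - timestamp % size
    return [(start, start + size)]

def sliding_windows(timestamp, size, slide):
    # Windows overlap, so one event can belong to several windows.
    windows = []
    start = timestamp - timestamp % slide  # latest window start at or before the event
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= slide
    return windows

print(tumbling_windows(7, 5))      # one window: [5, 10)
print(sliding_windows(7, 10, 5))   # two overlapping windows: [5, 15) and [0, 10)
```

With a slide equal to the window size, the sliding assignment degenerates to the tumbling one.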
  • 60. Event Time vs. Processing Time • Processing time: 1977 Episode IV, 1980 Episode V, 1983 Episode VI, 1999 Episode I, 2002 Episode II, 2005 Episode III, 2015 Episode VII • Event time: Episode I, Episode II, Episode III, Episode IV, Episode V, Episode VI, Episode VII
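The Star Wars analogy translates directly into code: the release year is the time at which an episode was "processed", the episode number is the time at which its events happened. The sketch below sorts the same records by both notions of time; the tuple layout is an assumption made for illustration.

```python
# (release year, episode number), taken from the slide
releases = [(1977, 4), (1980, 5), (1983, 6), (1999, 1), (2002, 2), (2005, 3), (2015, 7)]

# Processing time: the order in which the records arrived.
by_processing_time = [episode for year, episode in sorted(releases)]

# Event time: the order in which the events happened in the story.
by_event_time = [episode for year, episode in sorted(releases, key=lambda r: r[1])]

print(by_processing_time)
print(by_event_time)
```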
  • 62. Batch vs. Continuous
      Batch jobs • No state across batches • Fault tolerance within a job • Reprocessing starts empty
      Continuous programs • Continuous state across time • Fault tolerance guards state • Reprocessing starts stateful
  • 63. Streaming: Savepoints • Globally consistent point-in-time snapshot of the streaming application (diagram labels: Savepoint A; Savepoint B)
  • 64. Re-processing data (continuous) • Draw savepoints at times that you will want to start new jobs from (daily, hourly, …) • Reprocess by starting a new job from a savepoint • Defines start position in stream (for example Kafka offsets) • Initializes pending state (like partial sessions) (diagram labels: Savepoint; Run new streaming program from savepoint)
  • 65. Stream processor: Flink. Managed state in Flink • Flink automatically backs up and restores state • State can be larger than the available memory • State back ends: (embedded) RocksDB, heap memory (diagram labels: Source Kafka; Operator with windows (large state); State backend (local); Distributed File System; Periodic backup / recovery)
  • 66. Fault Tolerance (Apache Flink feature overview: Low latency; High throughput; State handling; Windowing; Fault tolerance and correctness)
  • 67. Fault tolerance in streaming • How do we ensure the results are always correct? • Failures should not lead to data loss or incorrect results (diagram labels: Source; Kafka topic; Stream processor)
  • 68. Fault tolerance in streaming • At least once: ensure all events are transmitted • May lead to duplicates • At most once: ensure that a known state of data is transmitted • May lead to data loss • Exactly once: ensure that operators do not perform duplicate updates to their state • Flink achieves exactly once with Distributed Snapshots
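The difference between at-least-once and exactly-once state updates can be shown with a toy counter that fails once and replays its input from the last checkpoint. This is a strongly simplified, single-process sketch and not Flink's distributed snapshot algorithm; the function `count_with_replay` and its parameters are hypothetical.

```python
def count_with_replay(events, fail_at, restore_state):
    """Count events, failing once before the event at index fail_at."""
    count = 0
    offset = 0
    snapshot = (0, 0)  # (offset, count) at the last checkpoint
    failed = False
    while offset < len(events):
        if offset % 2 == 0:
            snapshot = (offset, count)  # periodic checkpoint
        if offset == fail_at and not failed:
            failed = True
            # Recovery: rewind the source to the checkpointed offset.
            restored_offset, restored_count = snapshot
            offset = restored_offset
            if restore_state:
                count = restored_count  # state and offset are restored together
            continue
        count += 1
        offset += 1
    return count

events = ["a", "b", "c", "d", "e"]
at_least_once = count_with_replay(events, fail_at=3, restore_state=False)
exactly_once = count_with_replay(events, fail_at=3, restore_state=True)
print(at_least_once, exactly_once)
```

Without restoring the state, the replayed event is counted twice. Restoring the state together with the input position keeps every event reflected exactly once in the result.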
  • 69. Low Latency (Apache Flink feature overview: Low latency; High throughput; State handling; Windowing; Fault tolerance and correctness)
  • 70. Yahoo! Benchmark • Count ad impressions grouped by campaign • Compute aggregates over a 10 second window • Emit window aggregates to Redis every second for query • Full Yahoo! article: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at • "Storm […] and Flink […] show sub-second latencies at relatively high throughputs with Storm having the lowest 99th percentile latency. Spark streaming 1.5.1 supports high throughputs, but at a relatively higher latency." (Quote from the blog post's executive summary)
  • 71. Windowing with state in Redis • Original use case did not use Flink's windowing implementation • Data Artisans implemented the use case with Flink windowing (diagram labels: KafkaConsumer; map(); filter(); group; Flink event time windows; realtime queries)
  • 72. Results after rewrite (bar chart of throughput in msgs/sec for Storm and Flink; label: 400k msgs/sec)
  • 73. Can we even go further? • Network link to Kafka cluster is bottleneck! (1 GigE) • Solution: move data generator into job (10 GigE) (diagram labels: KafkaConsumer or Data Generator; map(); filter(); group; Flink event time windows)
  • 74. Results without network bottleneck (bar chart of throughput in msgs/sec for Storm, Flink and Flink (10 GigE); labels: 400k msgs/sec; 3m msgs/sec; 10 GigE end-to-end 15m msgs/sec)
  • 75. Survival of the Fastest – Flink Performance • throughput of 15 million messages/second on 10 machines • 35x higher throughput compared to Storm (80x compared to Yahoo’s runs) • exactly once guarantees • Read the full report: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
  • 76. The unbelievable Machine Company GmbH Museumsplatz 1/10/13 1070 Wien Contact: Stefan Papp, Data Architect stefan.papp@unbelievable-machine.com Tel. +43 - 1 - 361 99 77 - 215 Mobile: +43 664 2614367