SlideShare una empresa de Scribd logo
1 de 67
Flink vs. Spark
Slim Baltagi @SlimBaltagi
Director of Big Data Engineering, Fellow
Capital One
2
Agenda
I. Motivation for this talk
II. Apache Flink vs. Apache Spark?
III. How Flink is used at Capital One?
IV. What are some key takeaways?
3
I. Motivation for this talk
1. Marketing fluff
2. Confusing statements
3. Burning questions & incorrect or
outdated answers
4. Helping others evaluating Flink vs.
Spark
4
1. Marketing fluff
5
1. Marketing fluff
6
2. Confusing statements
“Spark is already an excellent piece of software and is
advancing very quickly. No vendor — no new project —
is likely to catch up. Chasing Spark would be a waste of
time, and would delay availability of real-time analytic
and processing services for no good reason.” Source:
MapReduce and Spark, Mike Olson. Chief Strategy
Officer, Cloudera. December, 30th 2013
http://vision.cloudera.com/mapreduce-spark/
“Goal: one engine for all data sources, workloads and
environments.” Source: Slide 15 of ‘New Directions for
Apache Spark in 2015’, Matei Zaharia. CTO, Databricks.
February 20th , 2015. http://www.slideshare.net/databricks/new-directions-for-
apache-spark-in-2015
7
3. Burning questions & incorrect or
outdated answers
"Projects that depend on smart optimizers rarely work
well in real life.” Curt Monash, Monash Research.
January 16, 2015http://www.computerworld.com/article/2871760/big-data-
digest-how-many-hadoops-do-we-really-need.html
“Flink is basically a Spark alternative out of Germany,
which I’ve been dismissing as unneeded”. Curt
Monash, Monash Research, March 5, 2015.
http://www.dbms2.com/2015/03/05/cask-and-cdap/
“Of course, this is all a bullish argument for Spark (or
Flink, if I’m wrong to dismiss its chances as a Spark
competitor).” Curt Monash, Monash Research,
September 28, 2015. http://www.dbms2.com/2015/09/28/the-potential-
significance-of-cloudera-kudu/
8
3. Burning questions & incorrect or
outdated answers
“The benefit of Spark's micro-batch model is that you
get full fault-tolerance and "exactly-once" processing
for the entire computation, meaning it can recover all
state and results even if a node crashes. Flink and
Storm don't provide this…” Matei Zaharia. CTO,
Databricks. May 2015http://www.kdnuggets.com/2015/05/interview-matei-
zaharia-creator-apache-spark.html
“I understand Spark Streaming uses micro-batching.
Does this increase latency? While Spark does use a
micro-batch execution model, this does not have
much impact on applications…” http://spark.apache.org/faq.html
9
4. Help others evaluating Flink vs. Spark
Besides the marketing fluff, the confusing statements,
the incorrect or outdated answers to burning
questions, the little information on the subject of Flink
vs. Spark is available piecemeal!
While evaluating different stream processing tools at
Capital One, we built a framework listing categories
and over 100 criteria to assess these stream
processing tools.
In the next section, I’ll be sharing this framework and
use it to compare Spark and Flink on a few key
criteria.
We hope this will be beneficial to you as well when
selecting Flink and/or Spark for stream processing.
10
Agenda
I. Motivation for this talk
II. Apache Flink vs. Apache Spark?
III. How Flink is used at Capital One?
IV. What are some key takeaways?
11
II. Apache Flink vs. Apache Spark?
1. What is Apache Flink?
2. What is Apache Spark?
3. Framework to evaluate Flink and Spark
4. Flink vs. Spark on a few key criteria
5. Future work
12
1.What is Apache Flink?
Squirrel: Animal. In harmony with other animals in the
Hadoop ecosystem (Zoo): elephant, pig, python,
camel,...
Squirrel: reflects the meaning of the word ‘Flink’:
German for “nimble, swift, speedy” which are also
characteristics of the squirrel.
Red color. In harmony with red squirrels in Germany to
reflect its root at German universities
Tail: colors matching the ones of the feather
symbolizing the Apache Software Foundation.
Commitment to build Flink in the open source!?
13
1.What is Apache Flink?
“Apache Flink is an open source platform for
distributed stream and batch data processing.”
https://flink.apache.org/
See also the definition in
Wikipedia:https://en.wikipedia.org/wiki/Apache_Flink
14
1.What is Apache Flink?
Gelly
Table
HadoopM/R
SAMOA
DataSet (Java/Scala/Python)
Batch Processing
DataStream (Java/Scala)
Stream Processing
FlinkML
Local
Single JVM
Embedded
Docker
Cluster
Standalone
YARN, Tez,
Mesos (WIP)
Cloud
Google’s GCE
Amazon’s EC2
IBM Docker Cloud, …
GoogleDataflow
Dataflow(WiP)
MRQL
Table
Cascading(WiP)
Runtime - Distributed
Streaming Dataflow
Zeppelin
DEPLOYSYSTEMAPIs&LIBRARIESSTORAGE
Files
Local
HDFS
S3, Azure Storage
Tachyon
Databases
MongoDB
HBase
SQL
…
Streams
Flume
Kafka
RabbitMQ
…
Batch Optimizer Stream Builder
Storm
15
2. What is Apache Spark?
“Apache Spark™ is a fast and general engine for
large-scale data processing.”
http://spark.apache.org/
See also definition in Wikipedia:
https://en.wikipedia.org/wiki/Apache_Spark
Logo was picked to reflect Lightning-fast cluster
computing
16
2. What is Apache Spark?
17
3. Framework to evaluate Flink and Spark
1. Background
2. Fit-for-purpose Categories
3. Organizational-fit Categories
4. Miscellaneous/Other Categories
18
1. Background
1.1 Definition
1.2 Origin
1.3 Maturity
1.4 Version
1.5 Governance model
1.6 License model
19
2. Fit-for-purpose Categories
2.1 Security
2.2 Provisioning & Monitoring Capabilities
2.3 Latency & Processing Architecture
2.4 State Management
2.5 Processing Delivery Assurance
2.6 Database Integrations, Native vs. Third
party connector
2.7 High Availability & Resiliency
2.8 Ease of Development
2.9 Scalability
2.10 Unique Capabilities/Key Differentiators
20
2. Fit-for-purpose Categories
2.1 Security
2.1.1 Authentication, Authorization
2.1.2 Data at rest encryption (data persisted in the
framework)
2.1.3 Data in motion encryption (producer -
>framework -> consumer)
2.1.4 Data in motion encryption (inter-node
communication)
21
2. Fit-for-purpose Categories
2.2 Provisioning & Monitoring Capabilities
2.2.1 Robustness of Administration
2.2.2 Ease of maintenance: Does technology provide
configuration, deployment, scaling, monitoring,
performance tuning and auditing capabilities?
2.2.3 Monitoring & Alerting
2.2.4 Logging
2.2.5 Audit
2.2.6 Transparent Upgrade: Version upgrade with
minimum downtime
22
2. Fit-for-purpose Categories
2.3 Latency & Processing Architecture
2.3.1 Supports tuple at a time, micro-batch,
transactional updates and batch processing
2.3.2 Computational model
2.3.3 Ability to reprocess historical data from source
2.3.4 Ability to reprocess historical data from native
engine
2.3.5 Call external source (API/database calls)
2.3.6 Integration with Batch (static) source
2.3.7 Data Types (images, sound etc.)
2.3.8 Supports complex event processing and pattern
detection vs. continuous operator model (low latency,
flow control)
23
2. Fit-for-purpose Categories
2.3 Latency & Processing Architecture
2.3.9 Handles stream imperfections (delayed)
2.3.10 Handles stream imperfections (out-of-order)
2.3.11 Handles stream imperfections (duplicate)
2.3.12 Handles seconds, sub-second or millisecond
event processing (Latency)
2.3.13 Compression
2.3.14 Support for batch analytics
2.3.15 Support for iterative analytics (machine learning,
graph analytics)
2.3.16 Data lineage provenance (origin of the owner)
2.3.17 Data lineage (accelerate recovery time)
24
2. Fit-for-purpose Categories
2.4 State Management
2.4.1 Stateful vs. Stateless
2.4.2 Is stateful data Persisted locally vs.
external database vs. Ephemeral
2.4.3 Native rolling, tumbling and hopping
window support
2.4.4 Native support for integrated data store
25
2. Fit-for-purpose Categories
2.5 Processing Delivery Assurance
2.5.1 Guarantee (At least once)
2.5.2 Guarantee (At most once)
2.5.3 Guarantee (Exactly once)
2.5.4 Global Event order guaranteed
2.5.5 Guarantee predictable and repeatable
outcomes( deterministic or not)
26
2. Fit-for-purpose Categories
2.6 Database Integrations, Native vs. Third
party connector
2.6.1 NoSQL database integration
2.6.2 File Format (Avro, Parquet and other
format support)
2.6.3 RDBMS integration
2.6.4 In-memory database integration/ Caching
integration
27
2. Fit-for-purpose Categories
2.7 High Availability & Resiliency
2.7.1 Can the system avoid slowdown due to straggler node
2.7.2 Fault-Tolerance (does the tool handle
node/operator/messaging failures without catastrophically failing)
2.7.3 State recovery from in-memory
2.7.4 State recovery from reliable storage
2.7.5 Overhead of fault tolerance mechanism (Does failure
handling introduce additional latency or negatively impact
throughput?)
2.7.6 Multi-site support (multi-region)
2.7.7 Flow control: backpressure tolerance from slow operators or
consumers
2.7.8 Fast parallel recovery vs. replication or serial recovery on
one node at a time
28
2. Fit-for-purpose Categories
2.8 Ease of Development
2.8.1 SQL Interface
2.8.2 Real-Time debugging option
2.8.3 Built-in stream oriented abstraction (streams, windows, operators , iterators
- expressive APIs that enable programmers to quickly develop streaming data
applications)
2.8.4 Separation of application logic from fault tolerance
2.8.5 Testing tools and framework
2.8.6 Change management: multiple model deployment ( E.g. separate cluster or
can one create multiple independent redundant streams internally)
2.8.7 Dynamic model swapping (Support dynamic updating of
operators/topology/DAG without restart or service interruption)
2.8.8 Required knowledge of system internals to develop an application
2.8.9 Time to market for applications
2.8.10 Supports plug-in of external libraries
2.8.11 API High Level/Low Level
2.8.12 Easy to configuration
2.8.13 GUI based abstraction layer
29
2. Fit-for-purpose Categories
2.9 Scalability
2.9.1 Supports multi-thread across multiple
processors/cores
2.9.2 Distributed across multiple machines/servers
2.9.3 Partition Algorithm
2.9.4 Dynamic elasticity - Scaling with minimum impact/
performance penalty
2.9.5 Horizontal scaling with linear
performance/throughput
2.9.6 Vertical scaling (GPU)
2.9.7 Scaling without downtime
30
3. Organizational-fit Categories
3.1 Maturity & Community Support
3.2 Support Languages for Development
3.3 Cloud Portability
3.4 Compatibility with Native Hadoop
Architecture
3.5 Adoption of Community vs. Enterprise
Edition
3.6 Integration with Message Brokers
31
3. Organizational-fit Categories
3.1 Maturity & Community Support
3.1.1 Open Source Support
3.1.2 Maturity (years)
3.1.3 Stable
3.1.4 Centralized documentation with versioning
support
3.1.5 Documentation of programming API with good
code examples
3.1.6 Centralized visible roadmap
3.1.7 Community acceptance vs. Vendor driven
3.1.8 Contributors
32
3. Organizational-fit Categories
3.2 Support Languages for Development
3.2.1 Language technology was built on
3.2.2 Language supported to access
technology
33
3. Organizational-fit Categories
3.3 Cloud Portability
3.3.1 Ease of migration between cloud vendors
3.3.2 Ease of migration between on premise to cloud
3.3.3 Ease of migration from on premise to complete
cloud services
3.3.4 Cloud compatibility (AWS, Google, Azure)
34
3. Organizational-fit Categories
3.4 Compatibility with Native Hadoop
Architecture
3.4.1 Implement on top of Hadoop YARN vs.
Standalone
3.4.2 Mesos
3.4.3 Coordination with Apache Zookeeper
35
3. Organizational-fit Categories
3.5 Adoption of Community vs. Enterprise
Edition
3.5.1 Open Source
3.5.2 Enterprise Support
36
4. Miscellaneous/Other Categories
4.1 Best Suited for
4.2 Key use case scenarios
4.3 Companies using technology
37
4. Flink vs. Spark on a few key criteria
1. Streaming Engine
2. Iterative Processing
3. Memory Management
4. Optimization
5. Configuration
6. Tuning
7. Performance
38
4.1. Streaming Engine
 Many time-critical applications need to process large
streams of live data and provide results in real-time.
For example:
Financial Fraud detection
Financial Stock monitoring
Anomaly detection
Traffic management applications
Patient monitoring
Online recommenders
 Some claim that 95% of streaming use cases can
be handled with micro-batches!? Really!!!
39
4.1. Streaming Engine
Spark’s micro-batching isn’t good enough!
Ted Dunning, Chief Applications Architect at MapR,
talk at the Bay Area Apache Flink Meetup on August
27, 2015
http://www.meetup.com/Bay-Area-Apache-Flink-
Meetup/events/224189524/
Ted described several use cases where batch and micro
batch processing is not appropriate and described
why.
He also described what a true streaming solution needs
to provide for solving these problems.
These use cases were taken from real industrial
situations, but the descriptions drove down to technical
details as well.
40
4.1. Streaming Engine
 “I would consider stream data analysis to be a major
unique selling proposition for Flink. Due to its
pipelined architecture, Flink is a perfect match for big
data stream processing in the Apache stack.” – Volker
Markl
Ref.: On Apache Flink. Interview with Volker Markl, June 24th 2015
http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/
 Apache Flink uses streams for all workloads:
streaming, SQL, micro-batch and batch. Batch is just
treated as a finite set of streamed data. This makes
Flink the most sophisticated distributed open source
Big Data processing engine (not the most mature one
yet!).
41
4.2. Iterative Processing
Why Iterations? Many Machine Learning and Graph
processing algorithms need iterations! For example:
 Machine Learning Algorithms
Clustering (K-Means, Canopy, …)
Gradient descent (Logistic Regression, Matrix
Factorization)
 Graph Processing Algorithms
Page-Rank, Line-Rank
Path algorithms on graphs (shortest paths,
centralities, …)
Graph communities / dense sub-components
Inference (Belief propagation)
42
4.2. Iterative Processing
 Flink's API offers two dedicated iteration operations:
Iterate and Delta Iterate.
 Flink executes programs with iterations as cyclic
data flows: a data flow program (and all its operators)
is scheduled just once.
 In each iteration, the step function consumes the
entire input (the result of the previous iteration, or the
initial data set), and computes the next version of the
partial solution
43
4.2. Iterative Processing
 Delta iterations run only on parts of the data that is
changing and can significantly speed up many
machine learning and graph algorithms because the
work in each iteration decreases as the number of
iterations goes on.
 Documentation on iterations with Apache Flink
http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.html
44
4.2. Iterative Processing
Step
Step
Step Step Step
Client
for (int i = 0; i < maxIterations; i++) {
// Execute MapReduce job
}
Non-native iterations in Hadoop and Spark are
implemented as regular for-loops outside the system.
45
4.2. Iterative Processing
 Although Spark caches data across iterations, it still
needs to schedule and execute a new set of tasks for
each iteration.
 Spinning Fast Iterative Data Flows - Ewen et al. 2012 :
http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf The
Apache Flink model for incremental iterative dataflow
processing. Academic paper.
 Recap of the paper, June 18,
2015http://blog.acolyer.org/2015/06/18/spinning-fast-iterative-dataflows/
Documentation on iterations with Apache
Flinkhttp://ci.apache.org/projects/flink/flink-docs-
master/apis/iterations.html
46
4.3. Memory Management
Question: Spark vs. Flink low memory available?
Question answered on
stackoverflow.comhttp://stackoverflow.com/questions/31935299/
spark-vs-flink-low-memory-available
The same question still unanswered on the Apache
Spark Mailing List!! http://apache-flink-user-mailing-list-
archive.2336050.n4.nabble.com/spark-vs-flink-low-memory-available-
td2364.html
47
4.3. Memory Management
Features:
 C++ style memory management inside the JVM
 User data stored in serialized byte arrays in JVM
 Memory is allocated, de-allocated, and used strictly
using an internal buffer pool implementation.
Advantages:
1. Flink will not throw an OOM exception on you.
2. Reduction of Garbage Collection (GC)
3. Very efficient disk spilling and network transfers
4. No Need for runtime tuning
5. More reliable and stable performance
48
4.3. Memory Management
public class WC {
public String word;
public int count;
}
empty
page
Pool of Memory Pages
Sorting,
hashing,
caching
Shuffles/
broadcasts
User code
objects
ManagedUnmanagedFlink contains its own memory management stack.
To do that, Flink contains its own type extraction
and serialization components.
JVM Heap
Network
Buffers
49
4.3. Memory Management
Peeking into Apache Flink's Engine Room - by Fabian
Hüske, March 13, 2015 http://flink.apache.org/news/2015/03/13/peeking-
into-Apache-Flinks-Engine-Room.html
Juggling with Bits and Bytes - by Fabian Hüske, May
11,2015
https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
Memory Management (Batch API) by Stephan Ewen-
May 16,
2015https://cwiki.apache.org/confluence/pages/viewpage.action?pageId
=53741525
Flink added an Off-Heap option for its memory
management component in Flink 0.10:
https://issues.apache.org/jira/browse/FLINK-1320
50
4.3. Memory Management
Compared to Flink, Spark is still behind in custom
memory management but is catching up with its
project Tungsten for Memory Management and
Binary Processing: manage memory explicitly and
eliminate the overhead of JVM object model and
garbage collection. April 28,
2014https://databricks.com/blog/2015/04/28/project-tungsten-bringing-
spark-closer-to-bare-metal.html
It seems that Spark is adopting something similar to
Flink and the initial Tungsten announcement read
almost like Flink documentation!!
51
4.4 Optimization
 Apache Flink comes with an optimizer that is
independent of the actual programming interface.
 It chooses a fitting execution strategy depending on
the inputs and operations.
 Example: the "Join" operator will choose between
partitioning and broadcasting the data, as well as
between running a sort-merge-join or a hybrid hash
join algorithm.
 This helps you focus on your application logic
rather than parallel execution.
 Quick introduction to the Optimizer: section 6 of the
paper: ‘The Stratosphere platform for big data
analytics’http://stratosphere.eu/assets/papers/2014-
VLDBJ_Stratosphere_Overview.pdf
52
4.4 Optimization
Run locally on a data
sample
on the laptop
Run a month later
after the data evolved
Hash vs. Sort
Partition vs. Broadcast
Caching
Reusing partition/sort
Execution
Plan A
Execution
Plan B
Run on large files
on the cluster
Execution
Plan C
What is Automatic Optimization? The system's built-in
optimizer takes care of finding the best way to
execute the program in any environment.
53
4.4 Optimization
In contrast to Flink’s built-in automatic optimization,
Spark jobs have to be manually optimized and
adapted to specific datasets because you need to
manually control partitioning and caching if you
want to get it right.
Spark SQL uses the Catalyst optimizer that
supports both rule-based and cost-based
optimization. References:
Spark SQL: Relational Data Processing in
Sparkhttp://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
Deep Dive into Spark SQL’s Catalyst Optimizer
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-
optimizer.html
54
4.5. Configuration
 Flink requires no memory thresholds to
configure
 Flink manages its own memory
 Flink requires no complicated network
configurations
 Pipelining engine requires much less
memory for data exchange
 Flink requires no serializers to be configured
Flink handles its own type extraction and
data representation
55
4.6. Tuning
According to Mike Olsen, Chief Strategy Officer of
Cloudera Inc. “Spark is too knobby — it has too many
tuning parameters, and they need constant
adjustment as workloads, data volumes, user counts
change. Reference: http://vision.cloudera.com/one-
platform/
Tuning Spark Streaming for Throughput By Gerard
Maas from Virdata. December 22, 2014
http://www.virdata.com/tuning-spark/
Spark Tuning:
http://spark.apache.org/docs/latest/tuning.html
56
4.6. Tuning
Run locally on a data
sample
on the laptop
Run a month later
after the data evolved
Hash vs. Sort
Partition vs. Broadcast
Caching
Reusing partition/sort
Execution
Plan A
Execution
Plan B
Run on large files
on the cluster
Execution
Plan C
What is Automatic Optimization? The system's built-in
optimizer takes care of finding the best way to
execute the program in any environment.
57
7. Performance
Why Flink provides a better performance?
Custom memory manager
Native closed-loop iteration operators make graph
and machine learning applications run much faster.
Role of the built-in automatic optimizer. For
example: more efficient join processing.
Pipelining data to the next operator in Flink is more
efficient than in Spark.
See benchmarking results against Flink here:
http://www.slideshare.net/sbaltagi/why-apache-flink-is-the-4g-of-big-
data-analytics-frameworks/87
58
5. Future work
The framework from Capital One to evaluate stream
processing tools is being refined and will be
published at http://www.capitalone.io/
The assessment of the major open source streaming
tools will be published as well as a live document
continuously updated by Capital One.
I also have a work in progress on comparing Spark
and Flink as multi-purpose Big Data analytics
framework
Check my blog at http://www.SparkBigData.com
Check also my slide decks on the Flink and Spark on
http://slideshare.net/sbaltagi
59
Agenda
I. Motivation for this talk
II. Apache Flink vs. Apache Spark?
III. How Flink is used at Capital
One?
IV. What are some key takeaways?
60
III. How Flink is used at Capital One?
We started our journey with Apache Flink at Capital
One while researching and contrasting stream
processing tools in the Hadoop ecosystem with a
particular interest in the ones providing real-time
stream processing capabilities and not just micro-
batching as in Apache Spark.
 While learning more about Apache Flink, we
discovered some unique capabilities of Flink which
differentiate it from other Big Data analytics tools not
only for Real-Time streaming but also for Batch
processing.
We evaluated Apache Flink Real-Time stream
processing capabilities in a POC.
61
III. How Apache Flink is used at Capital One?
Where are we in our Flink journey?
Successful installation of Apache Flink 0.9 in our
Pre-Production cluster running on CDH 5.4 with
security and High Availability enabled.
Successful installation of Apache Flink 0.9 in a 10
nodes R&D cluster running HDP.
Successful completion of Flink POC for real-time
stream processing. The POC proved that propriety
system can be replaced by a combination of tools:
Apache Kafka, Apache Flink, Elasticsearch and
Kibana in addition to advanced real-time streaming
analytics.
62
III. How Apache Flink is used at Capital One?
What are the opportunities for using Apache
Flink at Capital One?
1. Real-Time streaming analytics
2. Cascading on Flink
3. Flink’s MapReduce Compatibility Layer
4. Flink’s Storm Compatibility Layer
5. Other Flink libraries (Machine Learning
and Graph processing) once they come
out of beta.
63
III. How Apache Flink is used at Capital One?
Cascading on Flink:
 First release of Cascading on Flink was announced
recently by Data Artisans and Concurrent. It will be
supported in upcoming Cascading 3.1.
 Capital One is the first company verifying this release
on real-world Cascading data flows with a simple
configuration switch and no code re-work needed!
 This is a good example of doing analytics on bounded
data sets (Cascading) using a stream processor (Flink)
 Expected advantages of performance boost and less
resource consumption.
 Future work is to support ‘Driven’ from Concurrent Inc.
to provide performance management for Cascading
data flows running on Flink.
64
III. How Apache Flink is used at Capital One?
Flink’s compatibility layer for Storm:
We can execute existing Storm topologies
using Flink as the underlying engine.
We can reuse our application code (bolts and
spouts) inside Flink programs.
 Flink’s libraries (FlinkML for Machine
Learning and Gelly for Large scale graph
processing) can be used along Flink’s
DataStream API and DataSet API for our end to
end big data analytics needs.
65
Agenda
I. Motivation for this talk
II. Apache Flink vs. Apache Spark?
III. How Flink is used at Capital One?
IV. What are some key takeaways?
66
III. What are some key takeaways?
Neither Flink nor Spark will be the single analytics
framework that will solve every Big Data problem!
By design, Spark is not for real-time stream processing
while Flink provides a true low latency streaming
engine and advanced DataStream API for real-time
streaming analytics.
Although Spark is ahead in popularity and adoption,
Flink is ahead in technology innovation and is growing
fast.
It is not always the most innovative tool that gets the
largest market share, the Flink community needs to
take into account the market dynamics!
Both Spark and Flink will have their sweet spots
despite their “Me too syndrome”.
67
Thanks!
• To all of you for attending!
• To Capital One for giving me the
opportunity to meet with the growing
Apache Flink family.
• To the Apache Flink community for the
great spirit of collaboration and help.
• 2016 will be the year of Apache Flink!
• See you at FlinkForward 2016!

Más contenido relacionado

La actualidad más candente

Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache FlinkAKASH SIHAG
 
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache FlinkSuneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache FlinkFlink Forward
 
Apache Fink 1.0: A New Era for Real-World Streaming Analytics
Apache Fink 1.0: A New Era  for Real-World Streaming AnalyticsApache Fink 1.0: A New Era  for Real-World Streaming Analytics
Apache Fink 1.0: A New Era for Real-World Streaming AnalyticsSlim Baltagi
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksSlim Baltagi
 
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopHopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopJim Dowling
 
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)Kanak Biscuitwala
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksDataWorks Summit/Hadoop Summit
 
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksJamie Grier
 
January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016Robert Metzger
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleFlink Forward
 
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Flink Forward
 
Kamal Hakimzadeh – Reproducible Distributed Experiments
Kamal Hakimzadeh – Reproducible Distributed ExperimentsKamal Hakimzadeh – Reproducible Distributed Experiments
Kamal Hakimzadeh – Reproducible Distributed ExperimentsFlink Forward
 
Strata Hadoop Hopsworks
Strata Hadoop HopsworksStrata Hadoop Hopsworks
Strata Hadoop HopsworksJim Dowling
 
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksOverview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksSlim Baltagi
 
Interactive Data Analysis with Apache Flink @ Flink Meetup in Berlin
Interactive Data Analysis with Apache Flink @ Flink Meetup in BerlinInteractive Data Analysis with Apache Flink @ Flink Meetup in Berlin
Interactive Data Analysis with Apache Flink @ Flink Meetup in BerlinTill Rohrmann
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkRobert Metzger
 

La actualidad más candente (20)

Cooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython NotebookCooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython Notebook
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
 
Flink Streaming
Flink StreamingFlink Streaming
Flink Streaming
 
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache FlinkSuneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
 
Apache Fink 1.0: A New Era for Real-World Streaming Analytics
Apache Fink 1.0: A New Era  for Real-World Streaming AnalyticsApache Fink 1.0: A New Era  for Real-World Streaming Analytics
Apache Fink 1.0: A New Era for Real-World Streaming Analytics
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopHopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
 
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
Fine-Grained Scheduling with Helix (ApacheCon NA 2014)
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
 
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
 
January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016January 2016 Flink Community Update & Roadmap 2016
January 2016 Flink Community Update & Roadmap 2016
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
LinkedIn
LinkedInLinkedIn
LinkedIn
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at Scale
 
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
 
Kamal Hakimzadeh – Reproducible Distributed Experiments
Kamal Hakimzadeh – Reproducible Distributed ExperimentsKamal Hakimzadeh – Reproducible Distributed Experiments
Kamal Hakimzadeh – Reproducible Distributed Experiments
 
Strata Hadoop Hopsworks
Strata Hadoop HopsworksStrata Hadoop Hopsworks
Strata Hadoop Hopsworks
 
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksOverview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
 
Interactive Data Analysis with Apache Flink @ Flink Meetup in Berlin
Interactive Data Analysis with Apache Flink @ Flink Meetup in BerlinInteractive Data Analysis with Apache Flink @ Flink Meetup in Berlin
Interactive Data Analysis with Apache Flink @ Flink Meetup in Berlin
 
QCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache FlinkQCon London - Stream Processing with Apache Flink
QCon London - Stream Processing with Apache Flink
 

Destacado

Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System OverviewFlink Forward
 
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkMaximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkFlink Forward
 
Marton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingMarton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingFlink Forward
 
Apache Flink Training: DataStream API Part 1 Basic
 Apache Flink Training: DataStream API Part 1 Basic Apache Flink Training: DataStream API Part 1 Basic
Apache Flink Training: DataStream API Part 1 BasicFlink Forward
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Flink Forward
 
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-ComposeSimon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-ComposeFlink Forward
 
Fabian Hueske – Juggling with Bits and Bytes
Fabian Hueske – Juggling with Bits and BytesFabian Hueske – Juggling with Bits and Bytes
Fabian Hueske – Juggling with Bits and BytesFlink Forward
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLFlink Forward
 
Matthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and StormsMatthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and StormsFlink Forward
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Stephan Ewen
 
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkTill Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkFlink Forward
 
Anwar Rizal – Streaming & Parallel Decision Tree in Flink
Anwar Rizal – Streaming & Parallel Decision Tree in FlinkAnwar Rizal – Streaming & Parallel Decision Tree in Flink
Anwar Rizal – Streaming & Parallel Decision Tree in FlinkFlink Forward
 
Apache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityApache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityFabian Hueske
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionFlink Forward
 
Apache Flink Training: DataStream API Part 2 Advanced
Apache Flink Training: DataStream API Part 2 Advanced Apache Flink Training: DataStream API Part 2 Advanced
Apache Flink Training: DataStream API Part 2 Advanced Flink Forward
 
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on FlinkTran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on FlinkFlink Forward
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteFlink Forward
 
Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?Flink Forward
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeFlink Forward
 
Michael Häusler – Everyday flink
Michael Häusler – Everyday flinkMichael Häusler – Everyday flink
Michael Häusler – Everyday flinkFlink Forward
 

Destacado (20)

Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkMaximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
 
Marton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingMarton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream Processing
 
Apache Flink Training: DataStream API Part 1 Basic
 Apache Flink Training: DataStream API Part 1 Basic Apache Flink Training: DataStream API Part 1 Basic
Apache Flink Training: DataStream API Part 1 Basic
 
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn?
 
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-ComposeSimon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose
 
Fabian Hueske – Juggling with Bits and Bytes
Fabian Hueske – Juggling with Bits and BytesFabian Hueske – Juggling with Bits and Bytes
Fabian Hueske – Juggling with Bits and Bytes
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
Matthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and StormsMatthias J. Sax – A Tale of Squirrels and Storms
Matthias J. Sax – A Tale of Squirrels and Storms
 
Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)Flink 0.10 @ Bay Area Meetup (October 2015)
Flink 0.10 @ Bay Area Meetup (October 2015)
 
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache FlinkTill Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink
 
Anwar Rizal – Streaming & Parallel Decision Tree in Flink
Anwar Rizal – Streaming & Parallel Decision Tree in FlinkAnwar Rizal – Streaming & Parallel Decision Tree in Flink
Anwar Rizal – Streaming & Parallel Decision Tree in Flink
 
Apache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce CompatibilityApache Flink - Hadoop MapReduce Compatibility
Apache Flink - Hadoop MapReduce Compatibility
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
 
Apache Flink Training: DataStream API Part 2 Advanced
Apache Flink Training: DataStream API Part 2 Advanced Apache Flink Training: DataStream API Part 2 Advanced
Apache Flink Training: DataStream API Part 2 Advanced
 
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on FlinkTran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
Tran Nam-Luc – Stale Synchronous Parallel Iterations on Flink
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward Keynote
 
Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?Alexander Kolb – Flink. Yet another Streaming Framework?
Alexander Kolb – Flink. Yet another Streaming Framework?
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
 
Michael Häusler – Everyday flink
Michael Häusler – Everyday flinkMichael Häusler – Everyday flink
Michael Häusler – Everyday flink
 

Similar a Slim Baltagi – Flink vs. Spark

Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksOverview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksSlim Baltagi
 
Stream Processing with CompletableFuture and Flow in Java 9
Stream Processing with CompletableFuture and Flow in Java 9Stream Processing with CompletableFuture and Flow in Java 9
Stream Processing with CompletableFuture and Flow in Java 9Trayan Iliev
 
Is 12 Factor App Right About Logging
Is 12 Factor App Right About LoggingIs 12 Factor App Right About Logging
Is 12 Factor App Right About LoggingPhil Wilkins
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkSlim Baltagi
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupPaco Nathan
 
101 ways to configure kafka - badly
101 ways to configure kafka - badly101 ways to configure kafka - badly
101 ways to configure kafka - badlyHenning Spjelkavik
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Spark Summit
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaData Con LA
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowKaxil Naik
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoTJim Haughwout
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...confluent
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkSlim Baltagi
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshConfluentInc1
 
Pivoting Spring XD to Spring Cloud Data Flow with Sabby Anandan
Pivoting Spring XD to Spring Cloud Data Flow with Sabby AnandanPivoting Spring XD to Spring Cloud Data Flow with Sabby Anandan
Pivoting Spring XD to Spring Cloud Data Flow with Sabby AnandanPivotalOpenSourceHub
 
Distributed Data Processing for Real-time Applications
Distributed Data Processing for Real-time ApplicationsDistributed Data Processing for Real-time Applications
Distributed Data Processing for Real-time ApplicationsScyllaDB
 
Best Practices for the Most Impactful Oracle Database 18c and 19c Features
Best Practices for the Most Impactful Oracle Database 18c and 19c FeaturesBest Practices for the Most Impactful Oracle Database 18c and 19c Features
Best Practices for the Most Impactful Oracle Database 18c and 19c FeaturesMarkus Michalewicz
 

Similar a Slim Baltagi – Flink vs. Spark (20)

Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksOverview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
 
Stream Processing with CompletableFuture and Flow in Java 9
Stream Processing with CompletableFuture and Flow in Java 9Stream Processing with CompletableFuture and Flow in Java 9
Stream Processing with CompletableFuture and Flow in Java 9
 
Is 12 Factor App Right About Logging
Is 12 Factor App Right About LoggingIs 12 Factor App Right About Logging
Is 12 Factor App Right About Logging
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
101 ways to configure kafka - badly
101 ways to configure kafka - badly101 ways to configure kafka - badly
101 ways to configure kafka - badly
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoT
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache Flink
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data Mesh
 
Pivoting Spring XD to Spring Cloud Data Flow with Sabby Anandan
Pivoting Spring XD to Spring Cloud Data Flow with Sabby AnandanPivoting Spring XD to Spring Cloud Data Flow with Sabby Anandan
Pivoting Spring XD to Spring Cloud Data Flow with Sabby Anandan
 
Distributed Data Processing for Real-time Applications
Distributed Data Processing for Real-time ApplicationsDistributed Data Processing for Real-time Applications
Distributed Data Processing for Real-time Applications
 
Best Practices for the Most Impactful Oracle Database 18c and 19c Features
Best Practices for the Most Impactful Oracle Database 18c and 19c FeaturesBest Practices for the Most Impactful Oracle Database 18c and 19c Features
Best Practices for the Most Impactful Oracle Database 18c and 19c Features
 

Más de Flink Forward

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkFlink Forward
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022Flink Forward
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 

Más de Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 

Último

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Último (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

Slim Baltagi – Flink vs. Spark

  • 1. Flink vs. Spark Slim Baltagi @SlimBaltagi Director of Big Data Engineering, Fellow Capital One
  • 2. 2 Agenda I. Motivation for this talk II. Apache Flink vs. Apache Spark? III. How Flink is used at Capital One? IV. What are some key takeaways?
  • 3. 3 I. Motivation for this talk 1. Marketing fluff 2. Confusing statements 3. Burning questions & incorrect or outdated answers 4. Helping others evaluating Flink vs. Spark
  • 6. 6 2. Confusing statements “Spark is already an excellent piece of software and is advancing very quickly. No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason.” Source: MapReduce and Spark, Mike Olson. Chief Strategy Officer, Cloudera. December, 30th 2013 http://vision.cloudera.com/mapreduce-spark/ “Goal: one engine for all data sources, workloads and environments.” Source: Slide 15 of ‘New Directions for Apache Spark in 2015’, Matei Zaharia. CTO, Databricks. February 20th , 2015. http://www.slideshare.net/databricks/new-directions-for- apache-spark-in-2015
  • 7. 7 3. Burning questions & incorrect or outdated answers "Projects that depend on smart optimizers rarely work well in real life.” Curt Monash, Monash Research. January 16, 2015http://www.computerworld.com/article/2871760/big-data- digest-how-many-hadoops-do-we-really-need.html “Flink is basically a Spark alternative out of Germany, which I’ve been dismissing as unneeded”. Curt Monash, Monash Research, March 5, 2015. http://www.dbms2.com/2015/03/05/cask-and-cdap/ “Of course, this is all a bullish argument for Spark (or Flink, if I’m wrong to dismiss its chances as a Spark competitor).” Curt Monash, Monash Research, September 28, 2015. http://www.dbms2.com/2015/09/28/the-potential- significance-of-cloudera-kudu/
  • 8. 8 3. Burning questions & incorrect or outdated answers “The benefit of Spark's micro-batch model is that you get full fault-tolerance and "exactly-once" processing for the entire computation, meaning it can recover all state and results even if a node crashes. Flink and Storm don't provide this…” Matei Zaharia. CTO, Databricks. May 2015http://www.kdnuggets.com/2015/05/interview-matei- zaharia-creator-apache-spark.html “I understand Spark Streaming uses micro-batching. Does this increase latency? While Spark does use a micro-batch execution model, this does not have much impact on applications…” http://spark.apache.org/faq.html
  • 9. 9 4. Help others evaluating Flink vs. Spark Besides the marketing fluff, the confusing statements, the incorrect or outdated answers to burning questions, the little information on the subject of Flink vs. Spark is available piecemeal! While evaluating different stream processing tools at Capital One, we built a framework listing categories and over 100 criteria to assess these stream processing tools. In the next section, I’ll be sharing this framework and use it to compare Spark and Flink on a few key criteria. We hope this will be beneficial to you as well when selecting Flink and/or Spark for stream processing.
  • 10. 10 Agenda I. Motivation for this talk II. Apache Flink vs. Apache Spark? III. How Flink is used at Capital One? IV. What are some key takeaways?
  • 11. 11 II. Apache Flink vs. Apache Spark? 1. What is Apache Flink? 2. What is Apache Spark? 3. Framework to evaluate Flink and Spark 4. Flink vs. Spark on a few key criteria 5. Future work
  • 12. 12 1.What is Apache Flink? Squirrel: Animal. In harmony with other animals in the Hadoop ecosystem (Zoo): elephant, pig, python, camel,... Squirrel: reflects the meaning of the word ‘Flink’: German for “nimble, swift, speedy” which are also characteristics of the squirrel. Red color. In harmony with red squirrels in Germany to reflect its root at German universities Tail: colors matching the ones of the feather symbolizing the Apache Software Foundation. Commitment to build Flink in the open source!?
  • 13. 13 1.What is Apache Flink? “Apache Flink is an open source platform for distributed stream and batch data processing.” https://flink.apache.org/ See also the definition in Wikipedia:https://en.wikipedia.org/wiki/Apache_Flink
  • 14. 14 1.What is Apache Flink? Gelly Table HadoopM/R SAMOA DataSet (Java/Scala/Python) Batch Processing DataStream (Java/Scala) Stream Processing FlinkML Local Single JVM Embedded Docker Cluster Standalone YARN, Tez, Mesos (WIP) Cloud Google’s GCE Amazon’s EC2 IBM Docker Cloud, … GoogleDataflow Dataflow(WiP) MRQL Table Cascading(WiP) Runtime - Distributed Streaming Dataflow Zeppelin DEPLOYSYSTEMAPIs&LIBRARIESSTORAGE Files Local HDFS S3, Azure Storage Tachyon Databases MongoDB HBase SQL … Streams Flume Kafka RabbitMQ … Batch Optimizer Stream Builder Storm
  • 15. 15 2. What is Apache Spark? “Apache Spark™ is a fast and general engine for large-scale data processing.” http://spark.apache.org/ See also definition in Wikipedia: https://en.wikipedia.org/wiki/Apache_Spark Logo was picked to reflect Lightning-fast cluster computing
  • 16. 16 2. What is Apache Spark?
  • 17. 17 3. Framework to evaluate Flink and Spark 1. Background 2. Fit-for-purpose Categories 3. Organizational-fit Categories 4. Miscellaneous/Other Categories
  • 18. 18 1. Background 1.1 Definition 1.2 Origin 1.3 Maturity 1.4 Version 1.5 Governance model 1.6 License model
  • 19. 19 2. Fit-for-purpose Categories 2.1 Security 2.2 Provisioning & Monitoring Capabilities 2.3 Latency & Processing Architecture 2.4 State Management 2.5 Processing Delivery Assurance 2.6 Database Integrations, Native vs. Third party connector 2.7 High Availability & Resiliency 2.8 Ease of Development 2.9 Scalability 2.10 Unique Capabilities/Key Differentiators
  • 20. 20 2. Fit-for-purpose Categories 2.1 Security 2.1.1 Authentication, Authorization 2.1.2 Data at rest encryption (data persisted in the framework) 2.1.3 Data in motion encryption (producer - >framework -> consumer) 2.1.4 Data in motion encryption (inter-node communication)
  • 21. 21 2. Fit-for-purpose Categories 2.2 Provisioning & Monitoring Capabilities 2.2.1 Robustness of Administration 2.2.2 Ease of maintenance: Does technology provide configuration, deployment, scaling, monitoring, performance tuning and auditing capabilities? 2.2.3 Monitoring & Alerting 2.2.4 Logging 2.2.5 Audit 2.2.6 Transparent Upgrade: Version upgrade with minimum downtime
  • 22. 22 2. Fit-for-purpose Categories 2.3 Latency & Processing Architecture 2.3.1 Supports tuple at a time, micro-batch, transactional updates and batch processing 2.3.2 Computational model 2.3.3 Ability to reprocess historical data from source 2.3.4 Ability to reprocess historical data from native engine 2.3.5 Call external source (API/database calls) 2.3.6 Integration with Batch (static) source 2.3.7 Data Types (images, sound etc.) 2.3.8 Supports complex event processing and pattern detection vs. continuous operator model (low latency, flow control)
  • 23. 23 2. Fit-for-purpose Categories 2.3 Latency & Processing Architecture 2.3.9 Handles stream imperfections (delayed) 2.3.10 Handles stream imperfections (out-of-order) 2.3.11 Handles stream imperfections (duplicate) 2.3.12 Handles seconds, sub-second or millisecond event processing (Latency) 2.3.13 Compression 2.3.14 Support for batch analytics 2.3.15 Support for iterative analytics (machine learning, graph analytics) 2.3.16 Data lineage provenance (origin of the owner) 2.3.17 Data lineage (accelerate recovery time)
  • 24. 24 2. Fit-for-purpose Categories 2.4 State Management 2.4.1 Stateful vs. Stateless 2.4.2 Is stateful data Persisted locally vs. external database vs. Ephemeral 2.4.3 Native rolling, tumbling and hopping window support 2.4.4 Native support for integrated data store
  • 25. 25 2. Fit-for-purpose Categories 2.5 Processing Delivery Assurance 2.5.1 Guarantee (At least once) 2.5.2 Guarantee (At most once) 2.5.3 Guarantee (Exactly once) 2.5.4 Global Event order guaranteed 2.5.5 Guarantee predictable and repeatable outcomes( deterministic or not)
  • 26. 26 2. Fit-for-purpose Categories 2.6 Database Integrations, Native vs. Third party connector 2.6.1 NoSQL database integration 2.6.2 File Format (Avro, Parquet and other format support) 2.6.3 RDBMS integration 2.6.4 In-memory database integration/ Caching integration
  • 27. 27 2. Fit-for-purpose Categories 2.7 High Availability & Resiliency 2.7.1 Can the system avoid slowdown due to straggler node 2.7.2 Fault-Tolerance (does the tool handle node/operator/messaging failures without catastrophically failing) 2.7.3 State recovery from in-memory 2.7.4 State recovery from reliable storage 2.7.5 Overhead of fault tolerance mechanism (Does failure handling introduce additional latency or negatively impact throughput?) 2.7.6 Multi-site support (multi-region) 2.7.7 Flow control: backpressure tolerance from slow operators or consumers 2.7.8 Fast parallel recovery vs. replication or serial recovery on one node at a time
  • 28. 28 2. Fit-for-purpose Categories 2.8 Ease of Development 2.8.1 SQL Interface 2.8.2 Real-Time debugging option 2.8.3 Built-in stream oriented abstraction (streams, windows, operators , iterators - expressive APIs that enable programmers to quickly develop streaming data applications) 2.8.4 Separation of application logic from fault tolerance 2.8.5 Testing tools and framework 2.8.6 Change management: multiple model deployment ( E.g. separate cluster or can one create multiple independent redundant streams internally) 2.8.7 Dynamic model swapping (Support dynamic updating of operators/topology/DAG without restart or service interruption) 2.8.8 Required knowledge of system internals to develop an application 2.8.9 Time to market for applications 2.8.10 Supports plug-in of external libraries 2.8.11 API High Level/Low Level 2.8.12 Easy to configuration 2.8.13 GUI based abstraction layer
  • 29. 29 2. Fit-for-purpose Categories 2.9 Scalability 2.9.1 Supports multi-thread across multiple processors/cores 2.9.2 Distributed across multiple machines/servers 2.9.3 Partition Algorithm 2.9.4 Dynamic elasticity - Scaling with minimum impact/ performance penalty 2.9.5 Horizontal scaling with linear performance/throughput 2.9.6 Vertical scaling (GPU) 2.9.7 Scaling without downtime
  • 30. 30 3. Organizational-fit Categories 3.1 Maturity & Community Support 3.2 Support Languages for Development 3.3 Cloud Portability 3.4 Compatibility with Native Hadoop Architecture 3.5 Adoption of Community vs. Enterprise Edition 3.6 Integration with Message Brokers
  • 31. 31 3. Organizational-fit Categories 3.1 Maturity & Community Support 3.1.1 Open Source Support 3.1.2 Maturity (years) 3.1.3 Stable 3.1.4 Centralized documentation with versioning support 3.1.5 Documentation of programming API with good code examples 3.1.6 Centralized visible roadmap 3.1.7 Community acceptance vs. Vendor driven 3.1.8 Contributors
  • 32. 32 3. Organizational-fit Categories 3.2 Support Languages for Development 3.2.1 Language technology was built on 3.2.2 Language supported to access technology
  • 33. 33 3. Organizational-fit Categories 3.3 Cloud Portability 3.3.1 Ease of migration between cloud vendors 3.3.2 Ease of migration between on premise to cloud 3.3.3 Ease of migration from on premise to complete cloud services 3.3.4 Cloud compatibility (AWS, Google, Azure)
  • 34. 34 3. Organizational-fit Categories 3.4 Compatibility with Native Hadoop Architecture 3.4.1 Implement on top of Hadoop YARN vs. Standalone 3.4.2 Mesos 3.4.3 Coordination with Apache Zookeeper
  • 35. 35 3. Organizational-fit Categories 3.5 Adoption of Community vs. Enterprise Edition 3.5.1 Open Source 3.5.2 Enterprise Support
  • 36. 36 4. Miscellaneous/Other Categories 4.1 Best Suited for 4.2 Key use case scenarios 4.3 Companies using technology
  • 37. 37 4. Flink vs. Spark on a few key criteria 1. Streaming Engine 2. Iterative Processing 3. Memory Management 4. Optimization 5. Configuration 6. Tuning 7. Performance
  • 38. 38 4.1. Streaming Engine  Many time-critical applications need to process large streams of live data and provide results in real-time. For example: Financial Fraud detection Financial Stock monitoring Anomaly detection Traffic management applications Patient monitoring Online recommenders  Some claim that 95% of streaming use cases can be handled with micro-batches!? Really!!!
  • 39. 39 4.1. Streaming Engine Spark’s micro-batching isn’t good enough! Ted Dunning, Chief Applications Architect at MapR, talk at the Bay Area Apache Flink Meetup on August 27, 2015 http://www.meetup.com/Bay-Area-Apache-Flink- Meetup/events/224189524/ Ted described several use cases where batch and micro batch processing is not appropriate and described why. He also described what a true streaming solution needs to provide for solving these problems. These use cases were taken from real industrial situations, but the descriptions drove down to technical details as well.
  • 40. 40 4.1. Streaming Engine  “I would consider stream data analysis to be a major unique selling proposition for Flink. Due to its pipelined architecture, Flink is a perfect match for big data stream processing in the Apache stack.” – Volker Markl Ref.: On Apache Flink. Interview with Volker Markl, June 24th 2015 http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/  Apache Flink uses streams for all workloads: streaming, SQL, micro-batch and batch. Batch is just treated as a finite set of streamed data. This makes Flink the most sophisticated distributed open source Big Data processing engine (not the most mature one yet!).
  • 41. 41 4.2. Iterative Processing Why Iterations? Many Machine Learning and Graph processing algorithms need iterations! For example:  Machine Learning Algorithms Clustering (K-Means, Canopy, …) Gradient descent (Logistic Regression, Matrix Factorization)  Graph Processing Algorithms Page-Rank, Line-Rank Path algorithms on graphs (shortest paths, centralities, …) Graph communities / dense sub-components Inference (Belief propagation)
  • 42. 42 4.2. Iterative Processing  Flink's API offers two dedicated iteration operations: Iterate and Delta Iterate.  Flink executes programs with iterations as cyclic data flows: a data flow program (and all its operators) is scheduled just once.  In each iteration, the step function consumes the entire input (the result of the previous iteration, or the initial data set), and computes the next version of the partial solution
  • 43. 43 4.2. Iterative Processing  Delta iterations run only on parts of the data that is changing and can significantly speed up many machine learning and graph algorithms because the work in each iteration decreases as the number of iterations goes on.  Documentation on iterations with Apache Flink http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.html
  • 44. 44 4.2. Iterative Processing Step Step Step Step Step Client for (int i = 0; i < maxIterations; i++) { // Execute MapReduce job } Non-native iterations in Hadoop and Spark are implemented as regular for-loops outside the system.
  • 45. 45 4.2. Iterative Processing  Although Spark caches data across iterations, it still needs to schedule and execute a new set of tasks for each iteration.  Spinning Fast Iterative Data Flows - Ewen et al. 2012 : http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf The Apache Flink model for incremental iterative dataflow processing. Academic paper.  Recap of the paper, June 18, 2015http://blog.acolyer.org/2015/06/18/spinning-fast-iterative-dataflows/ Documentation on iterations with Apache Flinkhttp://ci.apache.org/projects/flink/flink-docs- master/apis/iterations.html
  • 46. 46 4.3. Memory Management Question: Spark vs. Flink low memory available? Question answered on stackoverflow.comhttp://stackoverflow.com/questions/31935299/ spark-vs-flink-low-memory-available The same question still unanswered on the Apache Spark Mailing List!! http://apache-flink-user-mailing-list- archive.2336050.n4.nabble.com/spark-vs-flink-low-memory-available- td2364.html
  • 47. 47 4.3. Memory Management Features:  C++ style memory management inside the JVM  User data stored in serialized byte arrays in JVM  Memory is allocated, de-allocated, and used strictly using an internal buffer pool implementation. Advantages: 1. Flink will not throw an OOM exception on you. 2. Reduction of Garbage Collection (GC) 3. Very efficient disk spilling and network transfers 4. No Need for runtime tuning 5. More reliable and stable performance
  • 48. 48 4.3. Memory Management public class WC { public String word; public int count; } empty page Pool of Memory Pages Sorting, hashing, caching Shuffles/ broadcasts User code objects ManagedUnmanagedFlink contains its own memory management stack. To do that, Flink contains its own type extraction and serialization components. JVM Heap Network Buffers
  • 49. 49 4.3. Memory Management Peeking into Apache Flink's Engine Room - by Fabian Hüske, March 13, 2015 http://flink.apache.org/news/2015/03/13/peeking- into-Apache-Flinks-Engine-Room.html Juggling with Bits and Bytes - by Fabian Hüske, May 11,2015 https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html Memory Management (Batch API) by Stephan Ewen- May 16, 2015https://cwiki.apache.org/confluence/pages/viewpage.action?pageId =53741525 Flink added an Off-Heap option for its memory management component in Flink 0.10: https://issues.apache.org/jira/browse/FLINK-1320
  • 50. 50 4.3. Memory Management Compared to Flink, Spark is still behind in custom memory management but is catching up with its project Tungsten for Memory Management and Binary Processing: manage memory explicitly and eliminate the overhead of JVM object model and garbage collection. April 28, 2014https://databricks.com/blog/2015/04/28/project-tungsten-bringing- spark-closer-to-bare-metal.html It seems that Spark is adopting something similar to Flink and the initial Tungsten announcement read almost like Flink documentation!!
  • 51. 51 4.4 Optimization  Apache Flink comes with an optimizer that is independent of the actual programming interface.  It chooses a fitting execution strategy depending on the inputs and operations.  Example: the "Join" operator will choose between partitioning and broadcasting the data, as well as between running a sort-merge-join or a hybrid hash join algorithm.  This helps you focus on your application logic rather than parallel execution.  Quick introduction to the Optimizer: section 6 of the paper: ‘The Stratosphere platform for big data analytics’http://stratosphere.eu/assets/papers/2014- VLDBJ_Stratosphere_Overview.pdf
  • 52. 52 4.4 Optimization Run locally on a data sample on the laptop Run a month later after the data evolved Hash vs. Sort Partition vs. Broadcast Caching Reusing partition/sort Execution Plan A Execution Plan B Run on large files on the cluster Execution Plan C What is Automatic Optimization? The system's built-in optimizer takes care of finding the best way to execute the program in any environment.
  • 53. 53 4.4 Optimization In contrast to Flink’s built-in automatic optimization, Spark jobs have to be manually optimized and adapted to specific datasets because you need to manually control partitioning and caching if you want to get it right. Spark SQL uses the Catalyst optimizer that supports both rule-based and cost-based optimization. References: Spark SQL: Relational Data Processing in Sparkhttp://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf Deep Dive into Spark SQL’s Catalyst Optimizer https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst- optimizer.html
  • 54. 54 4.5. Configuration  Flink requires no memory thresholds to configure  Flink manages its own memory  Flink requires no complicated network configurations  Pipelining engine requires much less memory for data exchange  Flink requires no serializers to be configured Flink handles its own type extraction and data representation
  • 55. 55 4.6. Tuning According to Mike Olsen, Chief Strategy Officer of Cloudera Inc. “Spark is too knobby — it has too many tuning parameters, and they need constant adjustment as workloads, data volumes, user counts change. Reference: http://vision.cloudera.com/one- platform/ Tuning Spark Streaming for Throughput By Gerard Maas from Virdata. December 22, 2014 http://www.virdata.com/tuning-spark/ Spark Tuning: http://spark.apache.org/docs/latest/tuning.html
  • 56. 56 4.6. Tuning Run locally on a data sample on the laptop Run a month later after the data evolved Hash vs. Sort Partition vs. Broadcast Caching Reusing partition/sort Execution Plan A Execution Plan B Run on large files on the cluster Execution Plan C What is Automatic Optimization? The system's built-in optimizer takes care of finding the best way to execute the program in any environment.
  • 57. 57 7. Performance Why Flink provides a better performance? Custom memory manager Native closed-loop iteration operators make graph and machine learning applications run much faster. Role of the built-in automatic optimizer. For example: more efficient join processing. Pipelining data to the next operator in Flink is more efficient than in Spark. See benchmarking results against Flink here: http://www.slideshare.net/sbaltagi/why-apache-flink-is-the-4g-of-big- data-analytics-frameworks/87
  • 58. 58 5. Future work The framework from Capital One to evaluate stream processing tools is being refined and will be published at http://www.capitalone.io/ The assessment of the major open source streaming tools will be published as well as a live document continuously updated by Capital One. I also have a work in progress on comparing Spark and Flink as multi-purpose Big Data analytics framework Check my blog at http://www.SparkBigData.com Check also my slide decks on the Flink and Spark on http://slideshare.net/sbaltagi
  • 59. 59 Agenda I. Motivation for this talk II. Apache Flink vs. Apache Spark? III. How Flink is used at Capital One? IV. What are some key takeaways?
  • 60. 60 III. How Flink is used at Capital One? We started our journey with Apache Flink at Capital One while researching and contrasting stream processing tools in the Hadoop ecosystem with a particular interest in the ones providing real-time stream processing capabilities and not just micro- batching as in Apache Spark.  While learning more about Apache Flink, we discovered some unique capabilities of Flink which differentiate it from other Big Data analytics tools not only for Real-Time streaming but also for Batch processing. We evaluated Apache Flink Real-Time stream processing capabilities in a POC.
  • 61. 61 III. How Apache Flink is used at Capital One? Where are we in our Flink journey? Successful installation of Apache Flink 0.9 in our Pre-Production cluster running on CDH 5.4 with security and High Availability enabled. Successful installation of Apache Flink 0.9 in a 10 nodes R&D cluster running HDP. Successful completion of Flink POC for real-time stream processing. The POC proved that propriety system can be replaced by a combination of tools: Apache Kafka, Apache Flink, Elasticsearch and Kibana in addition to advanced real-time streaming analytics.
  • 62. 62 III. How Apache Flink is used at Capital One? What are the opportunities for using Apache Flink at Capital One? 1. Real-Time streaming analytics 2. Cascading on Flink 3. Flink’s MapReduce Compatibility Layer 4. Flink’s Storm Compatibility Layer 5. Other Flink libraries (Machine Learning and Graph processing) once they come out of beta.
  • 63. 63 III. How Apache Flink is used at Capital One? Cascading on Flink:  First release of Cascading on Flink was announced recently by Data Artisans and Concurrent. It will be supported in upcoming Cascading 3.1.  Capital One is the first company verifying this release on real-world Cascading data flows with a simple configuration switch and no code re-work needed!  This is a good example of doing analytics on bounded data sets (Cascading) using a stream processor (Flink)  Expected advantages of performance boost and less resource consumption.  Future work is to support ‘Driven’ from Concurrent Inc. to provide performance management for Cascading data flows running on Flink.
  • 64. 64 III. How Apache Flink is used at Capital One? Flink’s compatibility layer for Storm: We can execute existing Storm topologies using Flink as the underlying engine. We can reuse our application code (bolts and spouts) inside Flink programs.  Flink’s libraries (FlinkML for Machine Learning and Gelly for Large scale graph processing) can be used along Flink’s DataStream API and DataSet API for our end to end big data analytics needs.
  • 65. 65 Agenda I. Motivation for this talk II. Apache Flink vs. Apache Spark? III. How Flink is used at Capital One? IV. What are some key takeaways?
  • 66. 66 III. What are some key takeaways? Neither Flink nor Spark will be the single analytics framework that will solve every Big Data problem! By design, Spark is not for real-time stream processing while Flink provides a true low latency streaming engine and advanced DataStream API for real-time streaming analytics. Although Spark is ahead in popularity and adoption, Flink is ahead in technology innovation and is growing fast. It is not always the most innovative tool that gets the largest market share, the Flink community needs to take into account the market dynamics! Both Spark and Flink will have their sweet spots despite their “Me too syndrome”.
  • 67. 67 Thanks! • To all of you for attending! • To Capital One for giving me the opportunity to meet with the growing Apache Flink family. • To the Apache Flink community for the great spirit of collaboration and help. • 2016 will be the year of Apache Flink! • See you at FlinkForward 2016!