SlideShare una empresa de Scribd logo
1 de 33
Descargar para leer sin conexión
Apache Samza
Past, Present and Future
Kartik Paramasivam
Director of Engineering, Streams Infra@ LinkedIn
Agenda
1. Stream Processing
2. State of the Union
3. Apache Samza : Key Differentiators
4. Apache Samza Futures
Stream Processing: Processing events as soon as they happen..
● Stateless Processing
■ Transformation etc.
■ Lookup adjunct data (lookup databases/call services )
■ Producing results for every event
● Stateful Processing
■ Triggering/Producing results periodically (time-windows)
● Maintain intermediate state
■ E.g. Joining across multiple streams of events.
● Common Issues
■ Scale !! Scale !! Scale !!
■ Reliability !!
■ Everything else (upgrades, debugging, diagnostics, security, ……)
Stream Processing: State of the Union
MillwheelStorm
Heron
Spark Streaming
S4
Dempsey
Samza
Flink
Beam
Dataflow
Azure Stream Analytics
AWS Kinesis AnalyticsGearPump
Kafka Streams
Orleans
Not meant to be an
accurate timeline..
Yes It is CROWDED !!
Apache Samza
● Top level Apache project since Dec 2014
● 5 big Releases (0.7, 0.8, 0.9, 0.10, 0.11)
● 62 Contributors
● 14 Committers
● Companies using : LinkedIn, Uber, MetaMarkets, Netflix,
Intuit, TripAdvisor, MobileAware, Optimizely ….
https://cwiki.apache.org/confluence/display/SAMZA/Powered+By
● Applications at LinkedIn : from ~20 to ~200 in 2 years.
Key Differentiators for Apache Samza
● Performance !!
● Stability
● Support for a variety of input sources
● Stream processing as a service AND as an embedded library
Performance : Accessing Adjunct Data
Performance : Maintaining Temporary State
Performance : Let us talk numbers !
● 100x Difference between using Local State vs Remote No-Sql store
● Local State details:
○ 1.1 Million TPS on a single processing machine (SSD)
○ Used a 3 node Kafka cluster for storing the durable changelog
● Remote State details:
○ 8500 TPS when the Samza job was changed to accessing a remote
No-Sql store
○ No-Sql Store was also on a 3 node (ssd) cluster
Remote State : Asynchronous Event Processing
Event Loop
(Single thread)
ProcessAsync
Remote
DB
/Services
Asynchronous I/O calls,
using Java Nio, Netty...
Responses sent to
main thread via
callback
Event loop is woken up to process next message
Task.max.concurrency >1 to
enable pipelining
Available with Samza 0.11
Remote State: Synchronous Processing on Multiple
Threads
Event Loop
(Single thread)
Schedule
Process()
Remote
DB/
Services
Built-In
Thread pool
Blocking I/O
calls
Event loop is
woken up by
the worker
thread
job.container.thread.pool.size = N
Available with Samza 0.11
Incremental Checkpointing : MVP for stateful apps
Input stream(e.g. Kafka)
Key Differentiators for Apache Samza
● Performance !!
● Stability
● Support for a variety of input sources
● Stream processing as a service AND as an embedded library
Speed Thrills .. but can kill
● Local State Considerations:
○ State should NOT be reseeded under normal operations (e.g.
Upgrades, Application restarts)
○ Minimal State should be reseeded
- If a container dies/removed
- If a container is added
How Samza keeps Local state ‘stable’ ?
Samza Job
Input
Stream
Change-log
Enable Continuous Scheduling
● Kafka or durable intermediate
queues are leveraged to
avoid backpressure issues in
a pipeline.
● Allows each stage to be
independent of the next stage
Backpressure in a Pipeline
Key Differentiators for Apache Samza
● Performance !!
● Stability
● Support for a variety of input sources
● Stream processing as a service AND as an embedded library
Pluggable system consumers
… Azure EventHub,
Azure Document DB,
Google Pub-Sub etc.
Batch processing in Samza!! (NEW)
● HDFS system consumer for Samza
● Same Samza processor can be used for processing
events from Kafka and HDFS with no code changes
● Scenarios :
○ Experimentation and Testing
○ Re-processing of large datasets
○ Some datasets are readily available on HDFS
(company specific)
Samza - HDFS support
HDFS input
HDFS
output
HDFS output
HDFS input
New
Available since
Samza 0.10
The batch job
auto-terminates
when the input is
fully processed.
Brooklin
Brooklin
set offset=0
Backup
Databus
Database
Backup
(HDFS)
Samza batch pipelines
HDFS
output
HDFS
input
HDFS
output
HDFS
input
Samza- HDFS Early Performance Results !!
Benchmark : Count number
of records grouped by <Field>
DataSize (bytes): 250 GB
Number of files : 487
Samza
Map/Reduce
Spark
Number of Containers
T
i
m
e
-s
e
c
o
n
d
s
Key Differentiators for Apache Samza
● Performance !!
● Stability
● Support for a variety of input sources (batch and streaming)
● Stream processing as a service AND (coming soon) as an
embedded library
Stream Processing as a Service
● Based on YARN
○ Yarn-RM high availability
○ Work preserving RM
○ Support for Heterogenous hardware with Node Labels (NEW)
● Easy upgrade of Samza framework :
Use the Samza version deployed on the machine instead of packaging it with the application.
● Disk Quotas for local state (e.g. rocksDB state)
● Samza Management Service(SAMZA-REST)-> Next Slide
YARN
Resource
Managers
Nodes
in the
YARN
cluster
RM SRR
RM SRR
RM SRR
NM SRN
Samza Management Service (Samza REST) (NEW)
NM SRN
NM SRN
NM SRN
NM SRN
NM SRN
NM SRN
NM SRN
/v1/jobs /v1/jobs /v1/jobs
Samza Containers
1. Exposes /jobs
resource to
start, stop, get
status of jobs
etc.
2. Cleans up
stores from
dead jobs
Samza
REST
YARN
processes(RM/NM)
Agenda
1. Stream processing
2. State of the union
3. Apache Samza : Key differentiators
4. Apache Samza Futures
Coming Soon : Samza as a Library
Stream Processor
Code
Job
Coordinator
Stream Processor
Code
Job
Coordinator
Stream Processor
Code
Job
Coordinator
...
Leader
● No YARN dependency
● Will use ZK for leader
election
● Embed stream processing
into your bigger application
StreamProcessor processor = new StreamProcessor (config, “job-name”,
“job-id”);
processor.start();
processor.awaitStart();
…
processor.stop();
Coming Soon: High Level API and Event Time
(SAMZA-914/915)
Count the number of PageViews by Region, every 30 minutes.
@Override public void init(Collection<SystemMessageStream> sources) {
sources.forEach(source -> {
Function<PageView, String> keyExtractor = view -> view.getRegion();
source.map(msg -> new PageViewMessage(msg))
.window(Windows.<PageViewMessage, String>intoSessionCounter(keyExtractor,
WindowType.Tumbling, 30*60
))
});
}
Coming Soon: First class support for Pipelines
(Samza- 1041)
public class MyPipeline implements PipelineFactory {
public Pipeline create(Config config) {
Processor myShuffler = getShuffle(config);
Processor myJoiner = getJoin(config);
Stream inStream = getStream(config, “inStream1”);
// … omitted for brevity
PipelineBuilder builder = new PipelineBuilder();
return builder.addInputStreams(myShuffler, inStream1)
.addOutputStreams(myShuffler, intermediateOutStream)
.addInputStreams(myJoiner, intermediateOutStream, inStream2)
.addOutputStreams(myJoiner, finalOutStream)
.build();
}
}
Shuffle
Join
input output
Future: Miscellaneous
● Exactly once processing
● Making it easier to auto-scale even with Local State
(on-demand Standby containers)
● Turnkey Disaster Recovery for stateful applications
○ Easy Restore of changelog and checkpoints from
some other datacenter.
● Improved support for Batch jobs
● SQL over Streams
● A default Dashboard :)
Questions ?

Más contenido relacionado

La actualidad más candente

Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecPeter Bakas
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Monal Daxini
 
Cassandra - Tips And Techniques
Cassandra - Tips And TechniquesCassandra - Tips And Techniques
Cassandra - Tips And TechniquesKnoldus Inc.
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016Monal Daxini
 
Emr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedEmr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedOmid Vahdaty
 
Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...
Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...
Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...ScyllaDB
 
Unbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniUnbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniMonal Daxini
 
Apache Spark Streaming - www.know bigdata.com
Apache Spark Streaming - www.know bigdata.comApache Spark Streaming - www.know bigdata.com
Apache Spark Streaming - www.know bigdata.comknowbigdata
 
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon
 
Netflix at-disney-09-26-2014
Netflix at-disney-09-26-2014Netflix at-disney-09-26-2014
Netflix at-disney-09-26-2014Monal Daxini
 
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARNApache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARNblueboxtraveler
 
HBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBaseHBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBaseHBaseCon
 
Stateful stream processing with kafka and samza
Stateful stream processing with kafka and samzaStateful stream processing with kafka and samza
Stateful stream processing with kafka and samzaGeorge Li
 
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightHow Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightScyllaDB
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark StreamingGerard Maas
 
Samza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshopSamza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshopTao Feng
 
Real Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormReal Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormRan Silberman
 
Spark streaming
Spark streamingSpark streaming
Spark streamingWhiteklay
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...Codemotion Tel Aviv
 

La actualidad más candente (20)

Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
 
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015
 
Cassandra - Tips And Techniques
Cassandra - Tips And TechniquesCassandra - Tips And Techniques
Cassandra - Tips And Techniques
 
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016Netflix keystone   streaming data pipeline @scale in the cloud-dbtb-2016
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016
 
Emr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedEmr zeppelin & Livy demystified
Emr zeppelin & Livy demystified
 
Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...
Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...
Scylla Summit 2018: Introducing ValuStor, A Memcached Alternative Made to Run...
 
Unbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniUnbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxini
 
Apache Spark Streaming - www.know bigdata.com
Apache Spark Streaming - www.know bigdata.comApache Spark Streaming - www.know bigdata.com
Apache Spark Streaming - www.know bigdata.com
 
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
 
Netflix at-disney-09-26-2014
Netflix at-disney-09-26-2014Netflix at-disney-09-26-2014
Netflix at-disney-09-26-2014
 
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARNApache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
 
HBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBaseHBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBase
 
Stateful stream processing with kafka and samza
Stateful stream processing with kafka and samzaStateful stream processing with kafka and samza
Stateful stream processing with kafka and samza
 
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightHow Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
 
Dive into Spark Streaming
Dive into Spark StreamingDive into Spark Streaming
Dive into Spark Streaming
 
Samza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshopSamza memory capacity_2015_ieee_big_data_data_quality_workshop
Samza memory capacity_2015_ieee_big_data_data_quality_workshop
 
Real Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & StormReal Time Data Streaming using Kafka & Storm
Real Time Data Streaming using Kafka & Storm
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
Spark Streaming into context
Spark Streaming into contextSpark Streaming into context
Spark Streaming into context
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
 

Similar a Apache Samza: Past, Present and Future Stream Processing Framework

Samza portable runner for beam
Samza portable runner for beamSamza portable runner for beam
Samza portable runner for beamHai Lu
 
Stream processing in python with Apache Samza and Beam
Stream processing in python with Apache Samza and BeamStream processing in python with Apache Samza and Beam
Stream processing in python with Apache Samza and BeamHai Lu
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2aspyker
 
Beam me up, Samza!
Beam me up, Samza!Beam me up, Samza!
Beam me up, Samza!Xinyu Liu
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 
Real-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS LambdaReal-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS LambdaAmazon Web Services
 
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Monal Daxini
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationYi Pan
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaDataWorks Summit
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...Flink Forward
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Amazon Web Services
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineMonal Daxini
 
Scalable Stream Processing with Apache Samza
Scalable Stream Processing with Apache SamzaScalable Stream Processing with Apache Samza
Scalable Stream Processing with Apache SamzaPrateek Maheshwari
 
Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...
Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...
Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...Timothy Spann
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...Codemotion
 
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsAsis Mohanty
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkDemi Ben-Ari
 

Similar a Apache Samza: Past, Present and Future Stream Processing Framework (20)

Samza portable runner for beam
Samza portable runner for beamSamza portable runner for beam
Samza portable runner for beam
 
Stream processing in python with Apache Samza and Beam
Stream processing in python with Apache Samza and BeamStream processing in python with Apache Samza and Beam
Stream processing in python with Apache Samza and Beam
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Beam me up, Samza!
Beam me up, Samza!Beam me up, Samza!
Beam me up, Samza!
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Real-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS LambdaReal-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS Lambda
 
EVCache & Moneta (GoSF)
EVCache & Moneta (GoSF)EVCache & Moneta (GoSF)
EVCache & Moneta (GoSF)
 
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
 
SamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentationSamzaSQL QCon'16 presentation
SamzaSQL QCon'16 presentation
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipeline
 
Scalable Stream Processing with Apache Samza
Scalable Stream Processing with Apache SamzaScalable Stream Processing with Apache Samza
Scalable Stream Processing with Apache Samza
 
Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...
Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...
Machine Intelligence Guild_ Build ML Enhanced Event Streaming Applications wi...
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
 
Cloud Lambda Architecture Patterns
Cloud Lambda Architecture PatternsCloud Lambda Architecture Patterns
Cloud Lambda Architecture Patterns
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache spark
 

Último

Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesVictoriaMetrics
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecturerahul_net
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfkalichargn70th171
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLionel Briand
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingShane Coughlan
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingShane Coughlan
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...Bert Jan Schrijver
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Rob Geurden
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Anthony Dahanne
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...OnePlan Solutions
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonApplitools
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptxVinzoCenzo
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxRTS corp
 

Último (20)

Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
What’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 UpdatesWhat’s New in VictoriaMetrics: Q1 2024 Updates
What’s New in VictoriaMetrics: Q1 2024 Updates
 
Understanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM ArchitectureUnderstanding Flamingo - DeepMind's VLM Architecture
Understanding Flamingo - DeepMind's VLM Architecture
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
Large Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and RepairLarge Language Models for Test Case Evolution and Repair
Large Language Models for Test Case Evolution and Repair
 
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full RecordingOpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
OpenChain Education Work Group Monthly Meeting - 2024-04-10 - Full Recording
 
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full RecordingOpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
OpenChain AI Study Group - Europe and Asia Recap - 2024-04-11 - Full Recording
 
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
JavaLand 2024 - Going serverless with Quarkus GraalVM native images and AWS L...
 
Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...Simplifying Microservices & Apps - The art of effortless development - Meetup...
Simplifying Microservices & Apps - The art of effortless development - Meetup...
 
Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024Not a Kubernetes fan? The state of PaaS in 2024
Not a Kubernetes fan? The state of PaaS in 2024
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
Revolutionizing the Digital Transformation Office - Leveraging OnePlan’s AI a...
 
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + KobitonLeveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
Leveraging AI for Mobile App Testing on Real Devices | Applitools + Kobiton
 
Osi security architecture in network.pptx
Osi security architecture in network.pptxOsi security architecture in network.pptx
Osi security architecture in network.pptx
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptxThe Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
The Role of IoT and Sensor Technology in Cargo Cloud Solutions.pptx
 

Apache Samza: Past, Present and Future Stream Processing Framework

  • 1. Apache Samza Past, Present and Future Kartik Paramasivam Director of Engineering, Streams Infra@ LinkedIn
  • 2. Agenda 1. Stream Processing 2. State of the Union 3. Apache Samza : Key Differentiators 4. Apache Samza Futures
  • 3. Stream Processing: Processing events as soon as they happen.. ● Stateless Processing ■ Transformation etc. ■ Lookup adjunct data (lookup databases/call services ) ■ Producing results for every event ● Stateful Processing ■ Triggering/Producing results periodically (time-windows) ● Maintain intermediate state ■ E.g. Joining across multiple streams of events. ● Common Issues ■ Scale !! Scale !! Scale !! ■ Reliability !! ■ Everything else (upgrades, debugging, diagnostics, security, ……)
  • 4. Stream Processing: State of the Union MillwheelStorm Heron Spark Streaming S4 Dempsey Samza Flink Beam Dataflow Azure Stream Analytics AWS Kinesis AnalyticsGearPump Kafka Streams Orleans Not meant to be an accurate timeline.. Yes It is CROWDED !!
  • 5. Apache Samza ● Top level Apache project since Dec 2014 ● 5 big Releases (0.7, 0.8, 0.9, 0.10, 0.11) ● 62 Contributors ● 14 Committers ● Companies using : LinkedIn, Uber, MetaMarkets, Netflix, Intuit, TripAdvisor, MobileAware, Optimizely …. https://cwiki.apache.org/confluence/display/SAMZA/Powered+By ● Applications at LinkedIn : from ~20 to ~200 in 2 years.
  • 6. Key Differentiators for Apache Samza ● Performance !! ● Stability ● Support for a variety of input sources ● Stream processing as a service AND as an embedded library
  • 7. Performance : Accessing Adjunct Data
  • 8. Performance : Maintaining Temporary State
  • 9. Performance : Let us talk numbers ! ● 100x Difference between using Local State vs Remote No-Sql store ● Local State details: ○ 1.1 Million TPS on a single processing machine (SSD) ○ Used a 3 node Kafka cluster for storing the durable changelog ● Remote State details: ○ 8500 TPS when the Samza job was changed to accessing a remote No-Sql store ○ No-Sql Store was also on a 3 node (ssd) cluster
  • 10. Remote State : Asynchronous Event Processing Event Loop (Single thread) ProcessAsync Remote DB /Services Asynchronous I/O calls, using Java Nio, Netty... Responses sent to main thread via callback Event loop is woken up to process next message Task.max.concurrency >1 to enable pipelining Available with Samza 0.11
  • 11. Remote State: Synchronous Processing on Multiple Threads Event Loop (Single thread) Schedule Process() Remote DB/ Services Built-In Thread pool Blocking I/O calls Event loop is woken up by the worker thread job.container.thread.pool.size = N Available with Samza 0.11
  • 12. Incremental Checkpointing : MVP for stateful apps Input stream(e.g. Kafka)
  • 13. Key Differentiators for Apache Samza ● Performance !! ● Stability ● Support for a variety of input sources ● Stream processing as a service AND as an embedded library
  • 14. Speed Thrills .. but can kill ● Local State Considerations: ○ State should NOT be reseeded under normal operations (e.g. Upgrades, Application restarts) ○ Minimal State should be reseeded - If a container dies/removed - If a container is added
  • 15. How Samza keeps Local state ‘stable’ ? Samza Job Input Stream Change-log Enable Continuous Scheduling
  • 16. ● Kafka or durable intermediate queues are leveraged to avoid backpressure issues in a pipeline. ● Allows each stage to be independent of the next stage Backpressure in a Pipeline
  • 17. Key Differentiators for Apache Samza ● Performance !! ● Stability ● Support for a variety of input sources ● Stream processing as a service AND as an embedded library
  • 18. Pluggable system consumers … Azure EventHub, Azure Document DB, Google Pub-Sub etc.
  • 19. Batch processing in Samza!! (NEW) ● HDFS system consumer for Samza ● Same Samza processor can be used for processing events from Kafka and HDFS with no code changes ● Scenarios : ○ Experimentation and Testing ○ Re-processing of large datasets ○ Some datasets are readily available on HDFS (company specific)
  • 20. Samza - HDFS support HDFS input HDFS output HDFS output HDFS input New Available since Samza 0.10 The batch job auto-terminates when the input is fully processed.
  • 24. Samza- HDFS Early Performance Results !! Benchmark : Count number of records grouped by <Field> DataSize (bytes): 250 GB Number of files : 487 Samza Map/Reduce Spark Number of Containers T i m e -s e c o n d s
  • 25. Key Differentiators for Apache Samza ● Performance !! ● Stability ● Support for a variety of input sources (batch and streaming) ● Stream processing as a service AND (coming soon) as an embedded library
  • 26. Stream Processing as a Service ● Based on YARN ○ Yarn-RM high availability ○ Work preserving RM ○ Support for Heterogenous hardware with Node Labels (NEW) ● Easy upgrade of Samza framework : Use the Samza version deployed on the machine instead of packaging it with the application. ● Disk Quotas for local state (e.g. rocksDB state) ● Samza Management Service(SAMZA-REST)-> Next Slide
  • 27. YARN Resource Managers Nodes in the YARN cluster RM SRR RM SRR RM SRR NM SRN Samza Management Service (Samza REST) (NEW) NM SRN NM SRN NM SRN NM SRN NM SRN NM SRN NM SRN /v1/jobs /v1/jobs /v1/jobs Samza Containers 1. Exposes /jobs resource to start, stop, get status of jobs etc. 2. Cleans up stores from dead jobs Samza REST YARN processes(RM/NM)
  • 28. Agenda 1. Stream processing 2. State of the union 3. Apache Samza : Key differentiators 4. Apache Samza Futures
  • 29. Coming Soon : Samza as a Library Stream Processor Code Job Coordinator Stream Processor Code Job Coordinator Stream Processor Code Job Coordinator ... Leader ● No YARN dependency ● Will use ZK for leader election ● Embed stream processing into your bigger application StreamProcessor processor = new StreamProcessor (config, “job-name”, “job-id”); processor.start(); processor.awaitStart(); … processor.stop();
  • 30. Coming Soon: High Level API and Event Time (SAMZA-914/915) Count the number of PageViews by Region, every 30 minutes. @Override public void init(Collection<SystemMessageStream> sources) { sources.forEach(source -> { Function<PageView, String> keyExtractor = view -> view.getRegion(); source.map(msg -> new PageViewMessage(msg)) .window(Windows.<PageViewMessage, String>intoSessionCounter(keyExtractor, WindowType.Tumbling, 30*60 )) }); }
  • 31. Coming Soon: First class support for Pipelines (Samza- 1041) public class MyPipeline implements PipelineFactory { public Pipeline create(Config config) { Processor myShuffler = getShuffle(config); Processor myJoiner = getJoin(config); Stream inStream = getStream(config, “inStream1”); // … omitted for brevity PipelineBuilder builder = new PipelineBuilder(); return builder.addInputStreams(myShuffler, inStream1) .addOutputStreams(myShuffler, intermediateOutStream) .addInputStreams(myJoiner, intermediateOutStream, inStream2) .addOutputStreams(myJoiner, finalOutStream) .build(); } } Shuffle Join input output
  • 32. Future: Miscellaneous ● Exactly once processing ● Making it easier to auto-scale even with Local State (on-demand Standby containers) ● Turnkey Disaster Recovery for stateful applications ○ Easy Restore of changelog and checkpoints from some other datacenter. ● Improved support for Batch jobs ● SQL over Streams ● A default Dashboard :)