Apache Samza: Past, Present and Future Stream Processing Framework

Apache Samza
Past, Present and Future
Kartik Paramasivam
Director of Engineering, Streams Infra@ LinkedIn

Agenda
1. Stream Processing
2. State of the Union
3. Apache Samza : Key Differentiators
4. Apache Samza Futures

Stream Processing: Processing events as soon as they happen
● Stateless Processing
■ Transformation etc.
■ Look up adjunct data (query databases/call services)
■ Producing results for every event
● Stateful Processing
■ Triggering/producing results periodically (time windows)
■ Maintaining intermediate state, e.g. joining across multiple streams of events
● Common Issues
■ Scale !! Scale !! Scale !!
■ Reliability !!
■ Everything else (upgrades, debugging, diagnostics, security, ……)

Stream Processing: State of the Union
Millwheel
Storm
Heron
Spark Streaming
S4
Dempsey
Samza
Flink
Beam
Dataflow
Azure Stream Analytics
AWS Kinesis Analytics
GearPump
Kafka Streams
Orleans
Not meant to be an accurate timeline.
Yes, it is CROWDED!!

Apache Samza
● Top-level Apache project since Dec 2014
● 5 major releases (0.7, 0.8, 0.9, 0.10, 0.11)
● 62 contributors
● 14 committers
● Companies using Samza: LinkedIn, Uber, MetaMarkets, Netflix, Intuit, TripAdvisor, MobileAware, Optimizely …
https://cwiki.apache.org/confluence/display/SAMZA/Powered+By
● Applications at LinkedIn: from ~20 to ~200 in 2 years

Key Differentiators for Apache Samza
● Performance !!
● Stability
● Support for a variety of input sources
● Stream processing as a service AND as an embedded library

Performance : Accessing Adjunct Data

Performance : Maintaining Temporary State

Performance: Let us talk numbers!
● Roughly 100x difference between using local state and a remote NoSQL store
● Local state details:
○ 1.1 million TPS on a single processing machine (SSD)
○ Used a 3-node Kafka cluster for storing the durable changelog
● Remote state details:
○ 8,500 TPS when the Samza job was changed to access a remote NoSQL store
○ The NoSQL store was also on a 3-node (SSD) cluster
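As a quick check on the figures above, dividing the two quoted throughputs gives a ratio of about 129, which is in line with the rounded "100x" headline. A minimal sketch of that arithmetic (the class and method names are illustrative only):

```java
class ThroughputRatio {
    // Ratio of local-state throughput to remote-state throughput.
    static double ratio(double localTps, double remoteTps) {
        return localTps / remoteTps;
    }

    public static void main(String[] args) {
        // Figures quoted on the slide: 1.1 million TPS local, 8,500 TPS remote.
        double r = ratio(1_100_000, 8_500);
        System.out.println("local/remote throughput ratio: " + Math.round(r));
    }
}
```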

Remote State: Asynchronous Event Processing
● Event loop (single thread) calls processAsync
● Asynchronous I/O calls to the remote DB/services, using Java NIO, Netty...
● Responses are sent to the main thread via callback
● Event loop is woken up to process the next message
● task.max.concurrency > 1 to enable pipelining
● Available with Samza 0.11
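The pattern on this slide can be sketched in plain Java. This is not Samza code: `remoteLookup`, `processAll`, and the I/O pool are hypothetical stand-ins, and the semaphore only imitates the role that a concurrency limit greater than 1 plays in allowing several requests to be in flight at once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

class AsyncLoopSketch {
    // Hypothetical stand-in for a non-blocking call to a remote DB or service.
    static CompletableFuture<String> remoteLookup(String key, ExecutorService io) {
        return CompletableFuture.supplyAsync(() -> "value-for-" + key, io);
    }

    // The loop dispatches each message without waiting for its response.
    // At most maxConcurrency lookups are in flight at any time.
    static List<String> processAll(List<String> messages, int maxConcurrency) {
        ExecutorService io = Executors.newFixedThreadPool(4);
        Semaphore inFlight = new Semaphore(maxConcurrency);
        ConcurrentLinkedQueue<String> results = new ConcurrentLinkedQueue<>();
        try {
            for (String msg : messages) {
                inFlight.acquireUninterruptibly();          // wait for a free slot
                remoteLookup(msg, io).whenComplete((value, error) -> {
                    if (error == null) {
                        results.add(msg + "=" + value);     // the completion callback
                    }
                    inFlight.release();                     // lets the loop continue
                });
            }
            inFlight.acquireUninterruptibly(maxConcurrency); // wait for outstanding calls
        } finally {
            io.shutdown();
        }
        return new ArrayList<>(results);
    }

    public static void main(String[] args) {
        // Completion order is not guaranteed, so the output order may vary.
        System.out.println(processAll(Arrays.asList("a", "b", "c"), 2));
    }
}
```

With `maxConcurrency` set to 1 the loop degenerates to one request at a time; raising it lets later messages be dispatched while earlier responses are still pending, at the cost of results completing out of order.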

Remote State: Synchronous Processing on Multiple Threads
● Event loop (single thread) schedules process() on a built-in thread pool
● Worker threads make blocking I/O calls to the remote DB/services
● Event loop is woken up by the worker thread
● job.container.thread.pool.size = N
● Available with Samza 0.11
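A rough plain-Java sketch of the same idea: blocking calls are handed to a fixed-size pool so that several can overlap. The names `blockingLookup` and `processAll` are hypothetical, and the sketch shows only the hand-off to worker threads, not how Samza schedules or orders messages across tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ThreadPoolSketch {
    // Hypothetical blocking call standing in for a remote DB or service.
    static String blockingLookup(String key) {
        try {
            Thread.sleep(10); // simulate network latency
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "value-for-" + key;
    }

    // Hands each message to a worker thread, then collects results in input order.
    static List<String> processAll(List<String> messages, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String msg : messages) {
                pending.add(pool.submit(() -> msg + "=" + blockingLookup(msg)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get()); // blocks until that worker has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(processAll(Arrays.asList("a", "b", "c"), 2));
    }
}
```

Compared with the asynchronous variant, the application code stays synchronous; the trade-off is one blocked thread per outstanding call.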

Incremental Checkpointing: MVP for stateful apps
Diagram label: Input stream (e.g. Kafka)

Speed thrills... but can kill
● Local state considerations:
○ State should NOT be reseeded under normal operations (e.g. upgrades, application restarts)
○ Only minimal state should be reseeded:
- If a container dies or is removed
- If a container is added
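To make "reseeding" concrete, here is a minimal sketch of a local store backed by a changelog. It is an illustration under simplifying assumptions, not Samza's implementation: the changelog is an in-memory list rather than a Kafka topic, and `ChangelogStore` is a hypothetical name. Restoring replays every entry, which is why the cost grows with the amount of state and why reseeding is best avoided during routine restarts.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChangelogStore {
    final Map<String, Long> local = new HashMap<>();   // fast local state
    final List<Map.Entry<String, Long>> changelog;      // durable copy (in-memory here)

    ChangelogStore(List<Map.Entry<String, Long>> changelog) {
        this.changelog = changelog;
    }

    // Every write updates local state and is appended to the changelog.
    void put(String key, long value) {
        local.put(key, value);
        changelog.add(new AbstractMap.SimpleImmutableEntry<String, Long>(key, value));
    }

    // Reseeding: rebuild local state by replaying the changelog from the start.
    static ChangelogStore restore(List<Map.Entry<String, Long>> changelog) {
        ChangelogStore store = new ChangelogStore(changelog);
        for (Map.Entry<String, Long> entry : changelog) {
            store.local.put(entry.getKey(), entry.getValue()); // later entries win
        }
        return store;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> log = new ArrayList<>();
        ChangelogStore original = new ChangelogStore(log);
        original.put("a", 1L);
        original.put("a", 2L);
        original.put("b", 5L);
        ChangelogStore rebuilt = restore(log);
        System.out.println(rebuilt.local.equals(original.local));
    }
}
```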

How does Samza keep local state 'stable'?
Diagram labels: Samza job, input stream, changelog
Enable continuous scheduling

Backpressure in a Pipeline
● Kafka or durable intermediate queues are leveraged to avoid backpressure issues in a pipeline.
● Allows each stage to be independent of the next stage.

Pluggable system consumers
… Azure EventHub, Azure Document DB, Google Pub-Sub etc.

Batch processing in Samza!! (NEW)
● HDFS system consumer for Samza
● Same Samza processor can be used for processing events from Kafka and HDFS with no code changes
● Scenarios:
○ Experimentation and testing
○ Re-processing of large datasets
○ Some datasets are readily available on HDFS (company specific)

Samza - HDFS support
Diagram labels: HDFS input, HDFS output; "New"; "Available since Samza 0.10"
The batch job auto-terminates when the input is fully processed.

Diagram labels: Brooklin (set offset=0), Databus, Database, Backup (HDFS)

Samza batch pipelines
Diagram labels: HDFS input, HDFS output (repeated for each stage)

Samza-HDFS Early Performance Results!!
Benchmark: count the number of records grouped by <Field>
Data size: 250 GB
Number of files: 487
Chart: time (seconds) vs. number of containers, for Samza, Map/Reduce, and Spark

Key Differentiators for Apache Samza
● Performance !!
● Stability
● Support for a variety of input sources (batch and streaming)
● Stream processing as a service AND (coming soon) as an embedded library

Stream Processing as a Service
● Based on YARN
○ YARN RM high availability
○ Work-preserving RM
○ Support for heterogeneous hardware with Node Labels (NEW)
● Easy upgrade of the Samza framework: use the Samza version deployed on the machine instead of packaging it with the application.
● Disk quotas for local state (e.g. RocksDB state)
● Samza Management Service (SAMZA-REST) -> next slide

Samza Management Service (Samza REST) (NEW)
Diagram labels: YARN Resource Managers (RM, SRR); nodes in the YARN cluster (NM, SRN); /v1/jobs; Samza containers; Samza REST; YARN processes (RM/NM)
1. Exposes /jobs resource to start, stop, get status of jobs etc.
2. Cleans up stores from dead jobs

Agenda
1. Stream processing
2. State of the union
3. Apache Samza : Key differentiators
4. Apache Samza Futures

Coming Soon : Samza as a Library
Diagram: several instances, each containing Stream Processor, Code, and Job Coordinator; one instance is marked Leader
● No YARN dependency
● Will use ZK for leader election
● Embed stream processing into your bigger application
StreamProcessor processor = new StreamProcessor(config, "job-name", "job-id");
processor.start();
processor.awaitStart();
…
processor.stop();

Coming Soon: High Level API and Event Time
(SAMZA-914/915)
Count the number of PageViews by Region, every 30 minutes.
@Override
public void init(Collection<SystemMessageStream> sources) {
    sources.forEach(source -> {
        Function<PageView, String> keyExtractor = view -> view.getRegion();
        source.map(msg -> new PageViewMessage(msg))
              .window(Windows.<PageViewMessage, String>intoSessionCounter(
                  keyExtractor, WindowType.Tumbling, 30 * 60));
    });
}
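To show what the query above is meant to compute, here is a plain-Java sketch that counts page views per region in 30-minute tumbling windows. It does not use the proposed Samza API. The `PageView` type, its fields, and the choice to align windows to the epoch are assumptions made for illustration, and the sketch works on a finite list rather than an unbounded stream.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TumblingCountSketch {
    static final long WINDOW_MS = 30 * 60 * 1000L; // 30 minutes

    // Hypothetical event: a region plus an event timestamp in epoch milliseconds.
    static final class PageView {
        final String region;
        final long timestampMs;

        PageView(String region, long timestampMs) {
            this.region = region;
            this.timestampMs = timestampMs;
        }
    }

    // Returns windowStartMs -> (region -> count).
    static Map<Long, Map<String, Long>> countByRegion(List<PageView> views) {
        Map<Long, Map<String, Long>> counts = new TreeMap<>();
        for (PageView view : views) {
            // Each event belongs to exactly one window: the one containing its timestamp.
            long windowStart = view.timestampMs - (view.timestampMs % WINDOW_MS);
            counts.computeIfAbsent(windowStart, k -> new TreeMap<>())
                  .merge(view.region, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<PageView> views = Arrays.asList(
            new PageView("US", 0L),
            new PageView("US", 1_000L),
            new PageView("EU", WINDOW_MS));
        System.out.println(countByRegion(views));
    }
}
```

Because windows are keyed by event timestamp rather than arrival time, this also hints at the event-time semantics named in the slide title, although handling of late or out-of-order events is left out.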

Coming Soon: First class support for Pipelines
(SAMZA-1041)
public class MyPipeline implements PipelineFactory {
    public Pipeline create(Config config) {
        Processor myShuffler = getShuffle(config);
        Processor myJoiner = getJoin(config);
        Stream inStream1 = getStream(config, "inStream1");
        // … omitted for brevity
        PipelineBuilder builder = new PipelineBuilder();
        return builder.addInputStreams(myShuffler, inStream1)
                      .addOutputStreams(myShuffler, intermediateOutStream)
                      .addInputStreams(myJoiner, intermediateOutStream, inStream2)
                      .addOutputStreams(myJoiner, finalOutStream)
                      .build();
    }
}
Diagram: input -> Shuffle -> Join -> output

Future: Miscellaneous
● Exactly-once processing
● Making it easier to auto-scale even with local state (on-demand standby containers)
● Turnkey disaster recovery for stateful applications
○ Easy restore of changelog and checkpoints from some other datacenter
● Improved support for batch jobs
● SQL over streams
● A default dashboard :)
