The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day.
https://www.bigdataspain.org/2017/talk/apache-samza-jake-maes
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data Spain 2017
1.
2. 1
Unified Processing at Scale with Apache Samza
Jake Maes
Staff SW Engineer at LinkedIn
Apache Samza PMC
3. 2
About Me
● Apache Samza PMC member
● LinkedIn 3 years
● 8 years performance & infra development
● Passionate about scale
● Long walks on the peaks
4. 3
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Service
Use Case: Batch Streaming
Future
6. 5
About
● Production at LinkedIn since 2014
● Apache top level project since 2014
● 16 Committers
● 74 Contributors
● Known for
Scale
Pluggability
Kafka integration
7. 6
Traditional Stream Processing
● Low latency
● One message at a time
● Checkpointing, state, durability
● All I/O with high-performance message brokers
10. 9
Typical Flow - Two Stages Minimum
[Diagram: PageViewEvent -> PageViewRepartitionTask (re-partition) -> PageViewEventByMemberId -> PageViewByMemberIdCounterTask (window, map, sendTo) -> PageViewEventPerMemberStream]
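The two stages above can be sketched without any Samza classes. This is a minimal plain-Java simulation, not the Samza task API: `TwoStageFlowSketch`, `PageView`, `repartition`, and `countByMember` are hypothetical names, and choosing a partition by hashing the member ID is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation of a repartition stage followed by a counting stage. */
final class TwoStageFlowSketch {

    /** Hypothetical stand-in for a page view event; only the key matters here. */
    static final class PageView {
        final String memberId;

        PageView(String memberId) {
            this.memberId = memberId;
        }
    }

    /** Stage 1: route each event to a partition chosen from its member ID (assumed hashing). */
    static List<List<PageView>> repartition(List<PageView> input, int partitionCount) {
        List<List<PageView>> partitions = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
        for (PageView pv : input) {
            // floorMod keeps the index non-negative even for negative hash codes.
            int p = Math.floorMod(pv.memberId.hashCode(), partitionCount);
            partitions.get(p).add(pv);
        }
        return partitions;
    }

    /** Stage 2: count events per member inside a single partition. */
    static Map<String, Integer> countByMember(List<PageView> partition) {
        Map<String, Integer> counts = new HashMap<>();
        for (PageView pv : partition) {
            counts.merge(pv.memberId, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<PageView> input = Arrays.asList(new PageView("alice"), new PageView("bob"),
                new PageView("alice"), new PageView("carol"), new PageView("alice"));
        for (List<PageView> partition : repartition(input, 4)) {
            System.out.println(countByMember(partition));
        }
    }
}
```

Because every event for a member lands in the same partition, the second stage can count locally without coordinating with other partitions, which is why the flow needs at least two stages.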
11. 10
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Service
Use Case: Batch Streaming
Future
12. 11
Stream Processing Ecosystem – The Dream
[Diagram components: Applications and Services; Samza; Kafka; Brooklin; Storage; External Streams; Storage & Serving]
13. 12
Stream Processing Ecosystem - Reality
[Diagram components: Applications and Services; Samza; Kafka; Brooklin; Storage; External Streams; Storage & Serving]
14. 13
Expansion of Stream Processing at LinkedIn
● Influx of new applications
10 -> over 200
● New use cases
Batch Streaming
Remote I/O
Composable API
● Incoming applications have different expectations
● Let’s take a look at two
15. 14
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Service
Use Case: Batch Streaming
Future
16. 15
Online Service + Stream Processing
Requirements:
● Deployment model
Cluster environment not suitable
● Remote I/O
Dependencies on other services
I/O latency stalls a single-threaded processor
Container parallelism - too much overhead
17. 16
Embedded Samza
● ZooKeeper-based JobCoordinator
Uses ZooKeeper for leader election
Leader assigns work to the processors
[Diagram: three App Instances, each embedding a Stream Processor with a Samza Container and a Job Coordinator; one Job Coordinator is marked as the leader (*); the instances coordinate through ZooKeeper]
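The leader-assigns-work idea can be illustrated with a small sketch. It does not talk to ZooKeeper and is not Samza's JobCoordinator: `CoordinatorSketch`, `electLeader`, and `assign` are hypothetical names, the lowest-ID-wins election rule is an assumption loosely modeled on the common ZooKeeper recipe, and round-robin assignment is chosen only for simplicity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of electing a leader that assigns partitions to processors. */
final class CoordinatorSketch {

    /** Assumed rule: the processor with the smallest ID becomes the leader. */
    static String electLeader(List<String> processorIds) {
        if (processorIds.isEmpty()) {
            throw new IllegalArgumentException("at least one processor is required");
        }
        return Collections.min(processorIds);
    }

    /** The leader spreads partition numbers over all processors in round-robin order. */
    static Map<String, List<Integer>> assign(List<String> processorIds, int partitionCount) {
        if (processorIds.isEmpty()) {
            throw new IllegalArgumentException("at least one processor is required");
        }
        List<String> sorted = new ArrayList<>(processorIds);
        Collections.sort(sorted);
        Map<String, List<Integer>> assignment = new TreeMap<>();
        for (String id : sorted) {
            assignment.put(id, new ArrayList<>());
        }
        for (int p = 0; p < partitionCount; p++) {
            assignment.get(sorted.get(p % sorted.size())).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> processors = List.of("p-2", "p-0", "p-1");
        System.out.println("leader = " + electLeader(processors));
        System.out.println("assignment = " + assign(processors, 5));
    }
}
```

When an instance joins or leaves, rerunning `assign` with the new membership list yields a fresh assignment, which is the role the slide gives to the leader.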
19. 18
Checkpointing
● Sync – Barrier
● Async - Watermark
[Timeline: messages t1, t2, t3, t4 on a time axis with a checkpoint at tc; callbacks complete out of order: 3, 1, 2, 4]
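One way to read the async watermark idea: when callbacks finish out of order, only the highest offset whose predecessors have all completed is safe to checkpoint. The sketch below is an illustration under that reading, not Samza's checkpointing code; `WatermarkSketch` and its methods are hypothetical, and offsets are assumed to be contiguous integers.

```java
import java.util.TreeSet;

/** Tracks the highest offset for which every earlier offset has also completed. */
final class WatermarkSketch {
    private long watermark;                                    // safe-to-checkpoint offset
    private final TreeSet<Long> completedAhead = new TreeSet<>(); // completed, but with gaps before them

    WatermarkSketch(long lastCheckpointedOffset) {
        this.watermark = lastCheckpointedOffset;
    }

    /** Called when the async callback for the given offset completes. */
    synchronized void onCallbackComplete(long offset) {
        if (offset <= watermark) {
            return; // already covered by the watermark
        }
        completedAhead.add(offset);
        // Advance while the next expected offset has completed.
        while (!completedAhead.isEmpty() && completedAhead.first() == watermark + 1) {
            watermark = completedAhead.pollFirst();
        }
    }

    synchronized long safeCheckpointOffset() {
        return watermark;
    }

    public static void main(String[] args) {
        WatermarkSketch tracker = new WatermarkSketch(0);
        // Completion order taken from the slide's timeline: 3, 1, 2, 4.
        for (long offset : new long[] {3, 1, 2, 4}) {
            tracker.onCallbackComplete(offset);
            System.out.println("completed " + offset + ", safe offset = " + tracker.safeCheckpointOffset());
        }
    }
}
```

With the completion order 3, 1, 2, 4, the safe offset stays at 0 after callback 3, moves to 1 after callback 1, jumps to 3 after callback 2, and reaches 4 after callback 4. A sync barrier, by contrast, would wait for all in-flight callbacks before checkpointing.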
20. 19
Performance for Remote I/O
[Chart, three configurations: (1) Baseline; (2) Thread pool size = 10, Max concurrency = 1; (3) Thread pool size = 10, Max concurrency = 3. Group labels: Single thread; Sync I/O with Multithreading]
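The two knobs in the chart can be pictured as a worker pool plus a cap on in-flight messages. The following sketch is an assumption about how such a combination might behave, written with plain `java.util.concurrent`; it is not Samza's run loop, and `BoundedRemoteIoSketch`, `processAll`, `threadPoolSize`, and `maxConcurrency` are illustrative names that only mirror the slide's labels.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/** Runs blocking remote calls on a pool while capping how many are in flight at once. */
final class BoundedRemoteIoSketch {

    static <I, O> List<O> processAll(List<I> messages, Function<I, O> blockingCall,
                                     int threadPoolSize, int maxConcurrency) {
        ExecutorService pool = Executors.newFixedThreadPool(threadPoolSize);
        Semaphore inFlight = new Semaphore(maxConcurrency);
        List<CompletableFuture<O>> futures = new ArrayList<>();
        try {
            for (I message : messages) {
                // The dispatch loop waits here once maxConcurrency calls are outstanding.
                inFlight.acquireUninterruptibly();
                futures.add(CompletableFuture
                        .supplyAsync(() -> blockingCall.apply(message), pool)
                        .whenComplete((result, error) -> inFlight.release()));
            }
            List<O> results = new ArrayList<>();
            for (CompletableFuture<O> future : futures) {
                results.add(future.join()); // results keep the input order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> doubled = processAll(List.of(1, 2, 3, 4), x -> x * 2, 10, 3);
        System.out.println(doubled);
    }
}
```

With `maxConcurrency = 1` the loop degenerates to one call at a time even though the pool has ten threads; raising the cap lets slow remote calls overlap.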
21. 20
Case Study – Notification Scheduler
[Diagram: inputs (input1, input2, input3): User Chat Event, User Action Event, Connection Activity Event. These feed a Processor containing an Aggregation Engine, Channel Selection, and a State store. The processor performs ① Local Data Access, ② Remote Database Lookup (Member profile database), and ③ Remote Service Call (Restful Services), then emits output.]
22. 21
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Service
Use Case: Batch Streaming
Future
23. 22
Offline Jobs
Requirements:
● Performance and low latency
● Resource hungry
Finite jobs can hog resources
Infinite jobs need to be better citizens
● Composable API
● Same app in batch and streaming
Best of both worlds
● HDFS I/O
25. 24
High Level Logic
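import java.time.Duration;
import org.apache.samza.application.StreamApplication;
import org.apache.samza.config.Config;
import org.apache.samza.operators.MessageStream;
import org.apache.samza.operators.OutputStream;
import org.apache.samza.operators.StreamGraph;
import org.apache.samza.operators.windows.Windows;

public class RepartitionAndCounterExample implements StreamApplication {
  @Override
  public void init(StreamGraph graph, Config config) {
    // Input stream of page view events.
    MessageStream<PageViewEvent> pve =
        graph.getInputStream("pageViewEvent", (k, m) -> (PageViewEvent) m);
    // Output stream keyed by member ID.
    OutputStream<String, MyOutputType, MyOutputType> mpv = graph
        .getOutputStream("memberPageViews", m -> m.memberId, m -> m);
    // Re-partition by member, count per 5-minute tumbling window, and emit.
    pve
        .partitionBy(m -> m.memberId)
        .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5), () -> 0,
            (m, c) -> c + 1))
        .map(MyOutputType::new)
        .sendTo(mpv);
  }
}
(Slide callout: built-in transform functions)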
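To make the window step concrete, here is a plain-Java sketch of a keyed tumbling window with an initial value and a fold function, mirroring the `() -> 0` and `(m, c) -> c + 1` arguments on the slide. It is not Samza's window implementation: `KeyedTumblingWindowSketch`, `Event`, and `aggregate` are hypothetical, and assigning windows from a timestamp carried on each event is an assumption made so the example is deterministic.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.function.Supplier;

/** Folds events into fixed-size, non-overlapping windows, one accumulator per key. */
final class KeyedTumblingWindowSketch {

    /** Hypothetical event carrying a member ID and a timestamp in milliseconds. */
    static final class Event {
        final String memberId;
        final long timestampMillis;

        Event(String memberId, long timestampMillis) {
            this.memberId = memberId;
            this.timestampMillis = timestampMillis;
        }
    }

    /** Returns window start (ms) -> (key -> folded value). */
    static <K, V> Map<Long, Map<K, V>> aggregate(List<Event> events, Function<Event, K> keyFn,
            Duration size, Supplier<V> initialValue, BiFunction<Event, V, V> foldFn) {
        long sizeMillis = size.toMillis();
        Map<Long, Map<K, V>> windows = new TreeMap<>();
        for (Event event : events) {
            // Each timestamp belongs to exactly one window: [start, start + size).
            long windowStart = Math.floorDiv(event.timestampMillis, sizeMillis) * sizeMillis;
            Map<K, V> perKey = windows.computeIfAbsent(windowStart, w -> new LinkedHashMap<>());
            K key = keyFn.apply(event);
            V current = perKey.containsKey(key) ? perKey.get(key) : initialValue.get();
            perKey.put(key, foldFn.apply(event, current));
        }
        return windows;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(new Event("alice", 0L), new Event("alice", 60_000L),
                new Event("bob", 120_000L), new Event("alice", 300_000L));
        Map<Long, Map<String, Integer>> counts =
                aggregate(events, e -> e.memberId, Duration.ofMinutes(5), () -> 0, (e, c) -> c + 1);
        System.out.println(counts);
    }
}
```

The first three events fall in the window starting at 0 ms and the last one starts a new window at 300000 ms, so each member's count resets at the window boundary.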
26. 25
High Level API - Composable Operators
Stateless Functions
filter       select a subset of messages from the stream
map          map one input message to an output message
flatMap      map one input message to 0 or more output messages
merge        union all inputs into a single output stream
partitionBy  re-partition the input messages based on a specific field
I/O Functions
sendTo       send the result to an output stream
sink         send the result to an external system (e.g. external DB)
Stateful Functions
window       window aggregation on the input stream
join         join messages from two input streams
30. 29
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Service
Use Case: Batch Streaming
Future
31. 30
What’s Next?
● SQL
Prototyped 2015
Now getting full time attention
● High Level API extensions
Better config, I/O, windowing, and more
● Beam Runner
Samza performance with Beam API
● Table support