Scalable Stream Processing With Apache Samza
Prateek Maheshwari
Apache Samza PMC
Agenda
● Stream Processing at LinkedIn
○ Scale at LinkedIn
○ Scenarios at LinkedIn
● Apache Samza
○ Processing Model
○ Stateful Processing
○ Processing APIs
○ Deployment Model
Scale at LinkedIn
● Apache Kafka
○ 5 Trillion+ messages ingested per day
○ 1.5+ PB data per day
○ 100k+ topics, 5M+ partitions
● Brooklin
○ 2 Trillion+ messages moved per day
○ 10k+ topics mirrored
○ 2k+ change capture streams
● Apache Samza
○ 1.5 Trillion+ messages processed per day
○ 3k+ jobs in production
○ 500 TB+ local state
Scenarios at LinkedIn
● Security: DDoS prevention, bot detection, access monitoring
● Notifications: Email and Push notifications
● Classification: Topic tagging, NER in news articles, image classification
● Site Speed: Site speed and health monitoring
● Call Graphs: Monitoring inter-service dependencies and SLAs
Scenarios at LinkedIn
● Ad CTR Tracking: Tracking ad views and clicks
● Business Metrics: Pre-aggregated real-time counts by dimensions
● Profile Standardization: Standardizing titles, companies, education
● Index Updates: Updating search indices with new data
● Activity Tracking: Tracking member page views, dwell-time, sessions
Apache Samza
● Hardened at Scale
○ In production at LinkedIn, Uber, Slack, Intuit, TripAdvisor, VMWare, Redfin, etc.
○ Processing events from Kafka, Brooklin, Kinesis, EventHubs, HDFS, DynamoDB Streams, Databus, etc.
● Best In Class Stateful Processing
○ Incremental checkpoints for large local state and instant recovery.
○ Local state that works seamlessly across upgrades and failures.
○ APIs for simple and efficient remote I/O.
● Universal Processing APIs
○ Stream and batch processing without changing code.
○ Convenient High-level DSLs and a powerful Low-level API.
● Flexible Deployment Model
○ Write once, run anywhere.
○ Run on a multi-tenant cluster or as an embedded library.
Processing Model
[Diagram: Samza Application. Inputs: Kafka, Brooklin, Hadoop. Container-1 and Container-2 run Task-1, Task-2, and Task-3 and send a Heartbeat to the Job Coordinator. Outputs: Kafka, Hadoop, Serving Stores (e.g. Espresso, Venice, Pinot), Elasticsearch.]
Scaling a Samza Application
● Parallelism across tasks by increasing the number of containers.
○ Up to 1 container per task.
● Parallelism across partitions by increasing the number of tasks.
○ Up to 1 task per partition.
● Parallelism within a partition for out-of-order processing.
○ Any number of threads.
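The two limits above (at most one task per partition, at most one container per task) can be illustrated with a small sketch. This is not Samza's grouping code: `assign_tasks` and its round-robin policy are assumptions made only for illustration.

```python
from collections import defaultdict


def assign_tasks(num_partitions, num_containers):
    """Map partitions to tasks (1:1) and tasks to containers (round-robin).

    Hypothetical helper for illustration; not part of the Samza API.
    """
    if num_partitions < 1 or num_containers < 1:
        raise ValueError("need at least one partition and one container")
    # At most one task per partition, so the task count equals the partition count.
    tasks = [f"Task-{p}" for p in range(num_partitions)]
    # Containers beyond the number of tasks would sit idle, so cap them.
    usable = min(num_containers, len(tasks))
    assignment = defaultdict(list)
    for i, task in enumerate(tasks):
        assignment[f"Container-{i % usable}"].append(task)
    return dict(assignment)


if __name__ == "__main__":
    # Three partitions on two containers, then on ten (only three get work).
    print(assign_tasks(num_partitions=3, num_containers=2))
    print(assign_tasks(num_partitions=3, num_containers=10))
```

Asking for more containers than tasks does not add parallelism in this sketch, which mirrors the "up to 1 container per task" limit.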
Why State Matters
• State is used for performing lookups and joins, caching data, buffering/batching data, and writing computed results.
• State can be local (in-memory or on disk) or remote.
[Diagram: Samza with Local Store I/O and Samza with Remote DB I/O]
Why Local State Matters: Throughput
• On disk with caching is comparable with in-memory.
• Changelog adds minimal overhead.
• Remote state is 30-150x worse than local state.
Terminology
• Disk Type: SSD
• Max-Net: Max network bandwidth
• CLog: Kafka changelog
• ReadOnly: read-only workloads (lookups)
• ReadWrite: read-write workloads (counts)
Shadi A. Noghabi et al. Samza: stateful scalable stream processing at LinkedIn. Proc. VLDB Endow. 10, 12 (August 2017), 1634-1645.
Why Local State Matters: Latency
• On disk with caching is comparable with in-memory.
• Changelog adds minimal overhead.
• Remote state is > 2 orders of magnitude slower compared to local state.
Shadi A. Noghabi et al. Samza: stateful scalable stream processing at LinkedIn. Proc. VLDB Endow. 10, 12 (August 2017), 1634-1645.
Optimizations for Local State
1. Log state changes to a Kafka compacted topic for durability.
2. Catch up on only the delta from the change log topic on restart.
[Diagram: Samza Application Master holding a durable Container ID – host mapping; Container-1 runs Task-1 and Container-2 runs Task-2]
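The two steps on this slide (log every state change, then replay only the missing delta on restart) can be sketched as follows. This is a toy model under stated assumptions: a Python list stands in for the compacted Kafka topic, a dict stands in for the on-disk store, and `ChangelogBackedStore` is a hypothetical name, not a Samza class.

```python
class ChangelogBackedStore:
    """Toy key-value store that mirrors every write to an append-only changelog.

    Hypothetical illustration: `changelog` plays the role of the durable
    topic and `data` plays the role of the local store on one host.
    """

    def __init__(self, changelog):
        self.changelog = changelog   # shared, durable log of (key, value) writes
        self.data = {}               # local state kept on this host
        self.applied_offset = 0      # how many changelog entries `data` reflects

    def put(self, key, value):
        self.data[key] = value
        self.changelog.append((key, value))
        self.applied_offset = len(self.changelog)

    def restore(self):
        """Replay only entries after applied_offset; return how many were replayed."""
        delta = self.changelog[self.applied_offset:]
        for key, value in delta:
            self.data[key] = value
        self.applied_offset = len(self.changelog)
        return len(delta)


if __name__ == "__main__":
    log = []
    store = ChangelogBackedStore(log)
    store.put("member-1", 1)
    store.put("member-2", 1)
    # Simulate a write that reached the changelog but not this host's local copy.
    log.append(("member-1", 2))
    print("same host replayed:", store.restore())
    # A container started on a different host has no local copy to build on.
    fresh = ChangelogBackedStore(log)
    print("new host replayed:", fresh.restore())
```

A store restarted on the same host replays only what it is missing, while a store on a new host replays the whole log. That difference is why keeping a durable container-to-host mapping pays off.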
Optimizations for Local State
1. Host Affinity
2. Parallel Recovery
3. Bulk Load Mode
4. Standby Containers
5. Log Compaction
Why Remote I/O Matters
• Data is only available in the remote store (no change capture).
• Need strong consistency or transactions.
• Data cannot be partitioned but is too large to copy to every container.
• Writing processed results for online serving.
• Calling other services to handle complex business logic.
Optimizations for Remote I/O: Table API
• Async Requests
• Rate Limiting
• Batching
• Caching
• Retries
• Stream Table Joins
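Three of the items above (caching, retries, and stream-table joins) can be illustrated with a short sketch. It is not Samza's Table API: `CachedRetryingTable` and `stream_table_join` are hypothetical names, the remote store is just a Python callable, and async requests, rate limiting, and batching are left out.

```python
class CachedRetryingTable:
    """Wraps a remote lookup function with a read-through cache and retries.

    Hypothetical wrapper for illustration; it is not Samza's Table API.
    """

    def __init__(self, fetch, max_retries=2):
        if max_retries < 0:
            raise ValueError("max_retries must be >= 0")
        self.fetch = fetch            # callable: key -> value, may raise ConnectionError
        self.max_retries = max_retries
        self.cache = {}
        self.remote_calls = 0         # counts attempts that went to the remote store

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        last_error = None
        for _ in range(self.max_retries + 1):
            self.remote_calls += 1
            try:
                value = self.fetch(key)
            except ConnectionError as err:
                last_error = err      # transient failure: try again
                continue
            self.cache[key] = value
            return value
        raise last_error


def stream_table_join(events, table, key_fn):
    """Pair each stream event with the table row looked up by its key."""
    for event in events:
        yield event, table.get(key_fn(event))


if __name__ == "__main__":
    profiles = {"m1": "Engineer", "m2": "Designer"}
    table = CachedRetryingTable(profiles.__getitem__)
    views = [{"memberId": "m1"}, {"memberId": "m2"}, {"memberId": "m1"}]
    for view, title in stream_table_join(views, table, lambda v: v["memberId"]):
        print(view["memberId"], title)
    print("remote calls:", table.remote_calls)
```

The repeated lookup for the same member is served from the cache, so the remote store is contacted once per distinct key.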
Example Application
Count number of "Page Views" for each member in a 5 minute window
[Diagram: Page View → Repartition by member id (Intermediate Stream) → Window → Map → SendTo → Page View Per Member]
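Before looking at the Samza APIs, the shape of this pipeline can be sketched in plain Python. The sketch makes several assumptions: events are dicts with `memberId` and `time` (seconds) fields, partitioning uses a CRC32 hash, and windows are bucketed by the timestamp carried in each event so that the example is deterministic. None of these names come from Samza.

```python
import zlib
from collections import Counter, defaultdict

WINDOW_SECONDS = 5 * 60


def partition_by_member(events, num_partitions):
    """Repartition: route each event to a partition chosen by member id."""
    partitions = defaultdict(list)
    for event in events:
        key = str(event["memberId"]).encode("utf-8")
        partitions[zlib.crc32(key) % num_partitions].append(event)
    return partitions


def count_per_window(events):
    """Tumbling window: count events per (window start, member id)."""
    counts = Counter()
    for event in events:
        window_start = event["time"] - event["time"] % WINDOW_SECONDS
        counts[(window_start, event["memberId"])] += 1
    return counts


def page_view_counts(events, num_partitions=4):
    """Repartition -> Window -> Map, returning the records that would be sent."""
    output = []
    for partition_events in partition_by_member(events, num_partitions).values():
        for (window_start, member_id), count in count_per_window(partition_events).items():
            output.append({"memberId": member_id, "windowStart": window_start, "count": count})
    return sorted(output, key=lambda r: (r["windowStart"], r["memberId"]))


if __name__ == "__main__":
    views = [
        {"memberId": "a", "time": 10},
        {"memberId": "a", "time": 20},
        {"memberId": "b", "time": 30},
        {"memberId": "a", "time": 310},
    ]
    for record in page_view_counts(views):
        print(record)
```

Because all events for one member land in the same partition, each partition can be counted independently. That is the purpose of the repartition step.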
High Level API
● Complex Processing Pipelines
● Easy Repartitioning
● Stream-Stream and Stream-Table Joins
● Processing Time Windows and Joins
High Level API
public class PageViewCountApplication implements StreamApplication {
  @Override
  public void describe(StreamApplicationDescriptor appDescriptor) {
    // `serde` and `initialValue` are left undefined on the slide.
    KafkaSystemDescriptor ksd = new KafkaSystemDescriptor("tracking");
    KafkaInputDescriptor<PageViewEvent> pageViews = ksd.getInputDescriptor("PageView", serde);
    KafkaOutputDescriptor<PageViewCount> pageViewCounts = ksd.getOutputDescriptor("PageViewCount", serde);
    appDescriptor.getInputStream(pageViews)
        // Repartition by member id through an intermediate stream.
        .partitionBy(m -> m.memberId, serde)
        // Count page views per member in 5 minute tumbling windows.
        .window(Windows.keyedTumblingWindow(m -> m.memberId, Duration.ofMinutes(5),
            initialValue, (m, c) -> c + 1))
        .map(PageViewCount::new)
        .sendTo(appDescriptor.getOutputStream(pageViewCounts));
  }
}
Apache Beam
● Event Time Processing
● Multi-lingual APIs (Java, Python, Go*)
● Advanced Windows and Joins
Apache Beam
public class PageViewCount {
  public static void main(String[] args) {
    ...
    pipeline
        .apply(LiKafkaIO.<PageViewEvent>read()
            .withTopic("PageView")
            .withTimestampFn(kv -> new Instant(kv.getValue().header.time))
            .withWatermarkFn(kv -> new Instant(kv.getValue().header.time - 60000)))
        .apply(Values.create())
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
            .via((PageViewEvent pv) -> KV.of(String.valueOf(pv.header.memberId), 1)))
        // Fixed (tumbling) 5 minute windows.
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))))
        .apply(Count.perKey())
        .apply(MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(Counter.class)))
            .via(newCounter()))
        .apply(LiKafkaIO.<Counter>write().withTopic("PageViewCount"));
    pipeline.run();
  }
}
Apache Beam: Python
p = Pipeline(options=pipeline_options)
(p
 | 'read' >> ReadFromKafka(cluster="tracking",
                           topic="PageViewEvent", config=config)
 | 'extract' >> beam.Map(lambda record: (record.value['memberId'], 1))
 | 'windowing' >> beam.WindowInto(window.FixedWindows(60*5))
 | 'compute' >> beam.CombinePerKey(beam.combiners.CountCombineFn())
 | 'write' >> WriteToKafka(cluster="queuing",
                           topic="PageViewCount", config=config))
p.run().wait_until_finish()
Samza SQL
● Declarative streaming SQL API
● Managed service at LinkedIn
● Create and deploy applications in minutes using SQL Shell
Samza SQL
INSERT INTO kafka.tracking.PageViewCount
SELECT memberId, count(*) FROM kafka.tracking.PageView
GROUP BY memberId, TUMBLE(current_timestamp, INTERVAL '5' MINUTES)
Samza APIs
● Low Level
● High Level
● Samza SQL
● Apache Beam
○ Java
○ Python
Samza on a Multi-Tenant Cluster
• Uses a cluster manager (e.g. YARN) for resource management,
coordination, liveness monitoring, etc.
• Better resource utilization in a multi-tenant environment.
• Works well for a large number of applications.
Samza as an Embedded Library
• Embed Samza as a library in an application. No cluster manager dependency.
• Dynamically scale out applications by increasing or decreasing the number of
processors at run-time.
• Supports rolling upgrades and canaries.
Samza as a Library: ZooKeeper Based Coordination
● Uses ZooKeeper for leader election and liveness monitoring for processors.
● Leader JobCoordinator performs work assignments among processors.
● Leader redistributes partitions when processors join or leave the group.
[Diagram: ZooKeeper coordinating several StreamProcessors, each containing a Samza Container and a Job Coordinator; one Job Coordinator is marked Leader]
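The coordination behavior described above can be modeled in memory as a rough sketch. It makes two assumptions that the slides do not state: the leader is whichever live processor registered earliest, and partitions are spread round-robin. `Coordinator` is a hypothetical class and no ZooKeeper client is involved.

```python
class Coordinator:
    """In-memory stand-in for a ZooKeeper-coordinated processor group.

    Hypothetical illustration: leadership goes to the earliest-registered
    live processor, and the leader reassigns partitions on every change.
    """

    def __init__(self, partitions):
        self.partitions = list(partitions)
        self.members = []        # processor ids in registration order
        self.assignment = {}     # processor id -> list of partitions

    @property
    def leader(self):
        return self.members[0] if self.members else None

    def join(self, processor_id):
        if processor_id not in self.members:
            self.members.append(processor_id)
            self._rebalance()

    def leave(self, processor_id):
        if processor_id in self.members:
            self.members.remove(processor_id)
            self._rebalance()

    def _rebalance(self):
        # Work assignment the leader would perform after a membership change.
        self.assignment = {m: [] for m in self.members}
        if not self.members:
            return
        for i, partition in enumerate(self.partitions):
            owner = self.members[i % len(self.members)]
            self.assignment[owner].append(partition)


if __name__ == "__main__":
    group = Coordinator(partitions=range(4))
    group.join("processor-1")
    group.join("processor-2")
    print(group.leader, group.assignment)
    group.leave("processor-1")   # the leader goes away; the next processor takes over
    print(group.leader, group.assignment)
```

When the leader leaves, the next processor takes over and every partition is reassigned among the remaining processors, so no input is left unowned.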
Apache Samza
• Mature, versatile, and scalable processing framework
• Best-in-class support for local and remote state
• Powerful and flexible APIs
• Can be operated as a platform or used as an embedded library
Contact Us
http://samza.apache.org
dev@samza.apache.org

Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 

Scalable Stream Processing with Apache Samza

  • 2. Agenda ● Stream Processing at LinkedIn ○ Scale at LinkedIn ○ Scenarios at LinkedIn ● Apache Samza ○ Processing Model ○ Stateful Processing ○ Processing APIs ○ Deployment Model
  • 3. Scale at LinkedIn ● Apache Kafka: 5 Trillion+ messages ingested per day, 1.5+ PB data per day, 100k+ topics, 5M+ partitions ● Brooklin: 2 Trillion+ messages moved per day, 10k+ topics mirrored, 2k+ change capture streams ● Apache Samza: 1.5 Trillion+ messages processed per day, 3k+ jobs in production, 500 TB+ local state
  • 4. Scenarios at LinkedIn ● Security: DDoS prevention, bot detection, access monitoring ● Notifications: Email and Push notifications ● Classification: Topic tagging, NER in news articles, image classification ● Site Speed Monitoring: Site speed and health monitoring ● Call Graphs: Inter-service dependencies and SLAs
  • 5. Scenarios at LinkedIn ● Ad CTR Tracking: Tracking ad views and clicks ● Business Metrics: Pre-aggregated real-time counts by dimensions ● Profile Standardization: Standardizing titles, companies, education ● Index Updates: Updating search indices with new data ● Activity Tracking: Tracking member page views, dwell-time, sessions
  • 6. Apache Samza ● Hardened at Scale: In production at LinkedIn, Slack, Intuit, TripAdvisor, VMWare, Redfin, etc. Processing events from Kafka, Brooklin, Kinesis, EventHubs, HDFS, DynamoDB Streams, Databus, etc. ● Best In Class Stateful Processing: Incremental checkpoints for large local state and instant recovery. Local state that works seamlessly across upgrades and failures. APIs for simple and efficient remote I/O. ● Universal Processing APIs: Stream and batch processing without changing code. Convenient High-level DSLs and a powerful Low-level API. ● Flexible Deployment Model: Write once, run anywhere. Run on a multi-tenant cluster or as an embedded library.
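  • 7. Processing Model [Diagram: a Samza Application with a Job Coordinator, a Heartbeat link, and Container-1 and Container-2 running Task-1, Task-2, and Task-3. Labels on one side: Brooklin, Hadoop, Kafka. Labels on the other side: Kafka, Hadoop, Serving Stores (e.g. Espresso, Venice, Pinot), Elasticsearch.]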
  • 8. Scaling a Samza Application ● Parallelism across tasks by increasing the number of containers. ○ Up to 1 container per task. ● Parallelism across partitions by increasing the number of tasks. ○ Up to 1 task per partition. ● Parallelism within a partition for out of order processing. ○ Any number of threads.
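The two upper bounds on the scaling slide (at most one task per partition, at most one container per task) can be illustrated with a small sketch. This is not Samza code: `assign_tasks` is a hypothetical helper, and the round-robin grouping is an assumption chosen only to keep the example short.

```python
from collections import defaultdict


def assign_tasks(num_partitions, num_containers):
    """Map partitions -> tasks -> containers under the limits on the slide.

    Hypothetical helper for illustration; round-robin grouping is an assumption.
    """
    if num_partitions < 1 or num_containers < 1:
        raise ValueError("need at least one partition and one container")
    # One task per partition is the maximum useful number of tasks.
    tasks = [f"Task-{p + 1}" for p in range(num_partitions)]
    # Containers beyond the number of tasks would have nothing to run.
    effective_containers = min(num_containers, len(tasks))
    assignment = defaultdict(list)
    for i, task in enumerate(tasks):
        assignment[f"Container-{i % effective_containers + 1}"].append(task)
    return dict(assignment)


layout = assign_tasks(num_partitions=4, num_containers=2)
oversized = assign_tasks(num_partitions=3, num_containers=8)
print(layout)
print(len(oversized), "containers actually used")
```

Asking for eight containers with only three partitions still yields three busy containers, which is the "up to 1 container per task" limit in practice.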
  • 9. Apache Samza ● Hardened at Scale: In production at LinkedIn, Uber, Slack, Intuit, TripAdvisor, VMWare, Redfin, etc. Processing events from Kafka, Brooklin, Kinesis, EventHubs, HDFS, DynamoDB Streams, Databus, etc. ● Best In Class Stateful Processing: Incremental checkpoints for large local state and instant recovery. Local state that works seamlessly across upgrades and failures. APIs for simple and efficient remote I/O. ● Universal Processing APIs: Stream and batch processing without changing code. Convenient High-level DSLs and a powerful Low-level API. ● Flexible Deployment Model: Write once, run anywhere. Run on a multi-tenant cluster or as an embedded library.
  • 10. Why State Matters • State is used for performing lookups and joins, caching data, buffering/batching data, and writing computed results. • State can be local (in-memory or on disk) or remote. [Diagram: Samza with Local Store I/O, and Samza with Remote DB I/O]
  • 11. Why Local State Matters: Throughput • On disk w/ caching is comparable with in memory. • Changelog adds minimal overhead. • Remote state is 30-150x worse than local state. Terminology • Disk Type: SSD • Max-Net: Max network bandwidth • CLog: Kafka changelog • ReadOnly: read only workloads (lookups) • ReadWrite: read-write workloads (counts) Shadi A. Noghabi et al. Samza: stateful scalable stream processing at LinkedIn. Proc. VLDB Endow. 10, 12 (August 2017), 1634-1645.
  • 12. Why Local State Matters: Latency • On disk w/ caching is comparable with in memory. • Changelog adds minimal overhead. • Remote state is > 2 orders of magnitude slower compared to local state. Shadi A. Noghabi et al. Samza: stateful scalable stream processing at LinkedIn. Proc. VLDB Endow. 10, 12 (August 2017), 1634-1645.
  • 13. Optimizations for Local State 1. Log state changes to a Kafka compacted topic for durability. 2. Catch up on only the delta from the change log topic on restart. [Diagram: Samza Application Master with a durable Container ID – host mapping; Task-1 in Container-1, Task-2 in Container-2]
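The two steps on this slide (log every state change, then replay only the delta on restart) can be sketched in a few lines. The sketch below is a simplified model under stated assumptions, not Samza's implementation: `LocalStore` is a hypothetical class, a plain Python list stands in for the Kafka changelog topic, and log compaction is not modeled.

```python
class LocalStore:
    """Hypothetical key-value store backed by an append-only changelog."""

    def __init__(self, changelog, state=None, offset=0):
        self.changelog = changelog        # list standing in for the changelog topic
        self.state = dict(state or {})    # local state that survived on this host
        self.offset = offset              # next changelog offset not yet applied

    def put(self, key, value):
        # Step 1: log the change for durability, then apply it locally.
        self.changelog.append((key, value))
        self.state[key] = value
        self.offset = len(self.changelog)

    def restore(self):
        # Step 2: replay only the entries the local state has not seen.
        delta = self.changelog[self.offset:]
        for key, value in delta:
            self.state[key] = value
        self.offset = len(self.changelog)
        return len(delta)


changelog = []
store = LocalStore(changelog)
store.put("member-1", 1)
store.put("member-2", 1)
# Pretend the local disk was last flushed here.
flushed_state, flushed_offset = dict(store.state), store.offset
store.put("member-1", 2)

# Restart on the same host: local state survived, only the delta is replayed.
warm = LocalStore(changelog, flushed_state, flushed_offset)
warm_replayed = warm.restore()

# Restart on a new host: nothing local, the whole changelog is replayed.
cold = LocalStore(changelog)
cold_replayed = cold.restore()
print(warm_replayed, cold_replayed)
```

Both restarts end with the same state; the warm restart simply reads far less of the changelog to get there.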
  • 14. Optimizations for Local State 1. Host Affinity 2. Parallel Recovery 3. Bulk Load Mode 4. Standby Containers 5. Log Compaction [Diagram: Samza Application Master with a durable Container ID – host mapping; Task-1 in Container-1, Task-2 in Container-2]
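Host affinity relies on the durable container ID to host mapping shown in the diagram: a restarted container is placed on the host that still holds its local state whenever that host is available. A minimal sketch of that placement decision follows. `place_containers` is a hypothetical function, and the rule of one container per host is an assumption made to keep the example small.

```python
def place_containers(last_hosts, available_hosts):
    """Place each container on its previous host if possible.

    last_hosts: {container_id: host it last ran on} (the durable mapping)
    available_hosts: hosts currently offered by the cluster manager
    Hypothetical sketch; assumes one container per host.
    """
    free = list(available_hosts)
    if len(free) < len(last_hosts):
        raise ValueError("not enough hosts for all containers")
    placement = {}
    # First pass: honor host affinity where the preferred host is free.
    for container, preferred in sorted(last_hosts.items()):
        if preferred in free:
            placement[container] = preferred
            free.remove(preferred)
    # Second pass: remaining containers take any free host and must
    # rebuild their local state from the changelog.
    for container in sorted(last_hosts):
        if container not in placement:
            placement[container] = free.pop(0)
    return placement


last_hosts = {"Container-1": "host-a", "Container-2": "host-b"}
placement = place_containers(last_hosts, ["host-b", "host-c"])
print(placement)
```

Here Container-2 returns to host-b and can reuse its state, while Container-1 lands on a new host because host-a is gone.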
  • 15. Why Remote I/O Matters • Data is only available in the remote store (no change capture). • Need strong consistency or transactions. • Data cannot be partitioned but is too large to copy to every container. • Writing processed results for online serving. • Calling other services to handle complex business logic.
  • 16. Optimizations for Remote I/O: Table API • Async Requests • Rate Limiting • Batching • Caching • Retries • Stream Table Joins
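Two of the listed optimizations, caching and retries, can be shown with a thin wrapper around a remote lookup. This is a sketch under stated assumptions and does not use the Samza Table API: `CachedRetryingTable` and `make_flaky_remote` are hypothetical names, and the remote store is simulated by a function that fails a fixed number of times.

```python
class CachedRetryingTable:
    """Hypothetical read-through cache with bounded retries."""

    def __init__(self, remote_get, max_retries=2):
        self._remote_get = remote_get
        self._max_retries = max_retries
        self._cache = {}
        self.remote_calls = 0

    def get(self, key):
        if key in self._cache:
            return self._cache[key]       # served locally, no remote I/O
        last_error = None
        for _ in range(self._max_retries + 1):
            self.remote_calls += 1
            try:
                value = self._remote_get(key)
            except ConnectionError as error:
                last_error = error        # transient failure, try again
                continue
            self._cache[key] = value
            return value
        raise last_error


def make_flaky_remote(data, failures):
    """Simulated remote store that fails the first `failures` calls."""
    remaining = {"failures": failures}

    def remote_get(key):
        if remaining["failures"] > 0:
            remaining["failures"] -= 1
            raise ConnectionError("simulated transient failure")
        return data[key]

    return remote_get


table = CachedRetryingTable(make_flaky_remote({"member-1": "Engineer"}, failures=1))
first = table.get("member-1")    # one failed call, then a successful retry
second = table.get("member-1")   # answered from the cache
print(first, second, table.remote_calls)
```

A stream-table join then reduces to calling `table.get(event_key)` for each incoming event, with most lookups never leaving the process.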
  • 18. Example Application: Count the number of "Page Views" for each member in a 5 minute window. [Diagram: Page View → Repartition by member id → Intermediate Stream → Window → Map → SendTo → Page View Per Member]
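Before looking at the framework APIs, it may help to see what the example computes. The sketch below is plain Python with no streaming framework: `count_page_views` is a hypothetical function, and it buckets events by their own timestamps so the result is deterministic, which is an assumption of this sketch rather than a statement about how any of the APIs assign windows. The repartition step is omitted because everything runs in one process.

```python
from collections import Counter

WINDOW_MS = 5 * 60 * 1000  # 5 minute tumbling window


def count_page_views(events):
    """Count page views per member per tumbling window.

    events: iterable of (member_id, timestamp_ms)
    returns: {(window_start_ms, member_id): count}
    """
    counts = Counter()
    for member_id, timestamp_ms in events:
        # Tumbling windows do not overlap: each event falls in exactly one.
        window_start = timestamp_ms - (timestamp_ms % WINDOW_MS)
        counts[(window_start, member_id)] += 1
    return dict(counts)


events = [("m1", 0), ("m1", 299_999), ("m2", 10_000), ("m1", 300_000)]
result = count_page_views(events)
print(result)
```

In a distributed job the repartition by member id is what guarantees that all events for a member reach the same task, so a per-task counter like this one is sufficient.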
  • 19. High Level API ● Complex Processing Pipelines ● Easy Repartitioning ● Stream-Stream and Stream-Table Joins ● Processing Time Windows and Joins
  • 20. High Level API

        public class PageViewCountApplication implements StreamApplication {
          @Override
          public void describe(StreamApplicationDescriptor appDescriptor) {
            // Describe the Kafka system and its input and output streams.
            // serde and initialValue are not defined on the slide.
            KafkaSystemDescriptor ksd = new KafkaSystemDescriptor("tracking");
            KafkaInputDescriptor<PageViewEvent> pageViews =
                ksd.getInputDescriptor("PageView", serde);
            KafkaOutputDescriptor<PageViewCount> pageViewCounts =
                ksd.getOutputDescriptor("PageViewCount", serde);

            // Repartition by member id, count per 5 minute tumbling window, and emit.
            appDescriptor.getInputStream(pageViews)
                .partitionBy(m -> m.memberId, serde)
                .window(Windows.keyedTumblingWindow(m -> m.memberId,
                    Duration.ofMinutes(5), initialValue, (m, c) -> c + 1))
                .map(PageViewCount::new)
                .sendTo(appDescriptor.getOutputStream(pageViewCounts));
          }
        }
  • 21. Apache Beam ● Event Time Processing ● Multi-lingual APIs (Java, Python, Go*) ● Advanced Windows and Joins
  • 22. Apache Beam

        public class PageViewCount {
          public static void main(String[] args) {
            ...
            pipeline
                .apply(LiKafkaIO.<PageViewEvent>read()
                    .withTopic("PageView")
                    .withTimestampFn(kv -> new Instant(kv.getValue().header.time))
                    .withWatermarkFn(kv -> new Instant(kv.getValue().header.time - 60000)))
                .apply(Values.create())
                .apply(MapElements
                    .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.integers()))
                    .via((PageViewEvent pv) -> KV.of(String.valueOf(pv.header.memberId), 1)))
                .apply(Window.into(TumblingWindows.of(Duration.standardMinutes(5))))
                .apply(Count.perKey())
                .apply(MapElements
                    .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptor.of(Counter.class)))
                    .via(newCounter()))
                .apply(LiKafkaIO.<Counter>write().withTopic("PageViewCount"));
            pipeline.run();
          }
        }
  • 23. Apache Beam: Python

        p = Pipeline(options=pipeline_options)
        (p
         | 'read' >> ReadFromKafka(cluster="tracking", topic="PageViewEvent", config=config)
         | 'extract' >> beam.Map(lambda record: (record.value['memberId'], 1))
         | "windowing" >> beam.WindowInto(window.FixedWindows(60*5))
         | "compute" >> beam.CombinePerKey(beam.combiners.CountCombineFn())
         | 'write' >> WriteToKafka(cluster="queuing", topic="PageViewCount", config=config))
        p.run().wait_until_finish()
  • 24. Samza SQL ● Declarative streaming SQL API ● Managed service at LinkedIn ● Create and deploy applications in minutes using SQL Shell
  • 25. Samza SQL

        INSERT INTO kafka.tracking.PageViewCount
        SELECT memberId, count(*)
        FROM kafka.tracking.PageView
        GROUP BY memberId, TUMBLE(current_timestamp, INTERVAL '5' MINUTES)
  • 26. Samza APIs [Diagram labels: Low Level, High Level, Samza SQL, Apache Beam, Java, Python]
  • 28. Samza on a Multi-Tenant Cluster • Uses a cluster manager (e.g. YARN) for resource management, coordination, liveness monitoring, etc. • Better resource utilization in a multi-tenant environment. • Works well for large number of applications.
  • 29. Samza as an Embedded Library • Embed Samza as a library in an application. No cluster manager dependency. • Dynamically scale out applications by increasing or decreasing the number of processors at run-time. • Supports rolling upgrades and canaries.
  • 30. Samza as a Library: ZooKeeper Based Coordination ● Uses ZooKeeper for leader election and liveness monitoring for processors. ● Leader JobCoordinator performs work assignments among processors. ● Leader redistributes partitions when processors join or leave the group. [Diagram: ZooKeeper coordinating several StreamProcessors, each with a Samza Container and a Job Coordinator, one of which is the Leader]
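The leader's job of redistributing partitions as processors join or leave can be sketched without ZooKeeper. The code below is an illustration under stated assumptions, not Samza's coordination logic: `assign_partitions` is a hypothetical function, group membership is passed in as a plain list instead of being watched in ZooKeeper, and the round-robin strategy is an assumption.

```python
def assign_partitions(partitions, processors):
    """Recompute the partition assignment for the current group.

    Hypothetical stand-in for the work the leader does on each
    membership change; uses simple round-robin distribution.
    """
    if not processors:
        return {}
    ordered = sorted(processors)
    assignment = {processor: [] for processor in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


partitions = range(4)
before = assign_partitions(partitions, ["p1", "p2"])
after_join = assign_partitions(partitions, ["p1", "p2", "p3"])   # p3 joins
after_leave = assign_partitions(partitions, ["p2", "p3"])        # p1 leaves
print(before)
print(after_join)
print(after_leave)
```

Every partition stays assigned to exactly one processor after each change, which is what allows scaling out, rolling upgrades, and canaries without a cluster manager.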
  • 31. Apache Samza • Mature, versatile, and scalable processing framework • Best-in-class support for local and remote state • Powerful and flexible APIs • Can be operated as a platform or used as an embedded library