SlideShare una empresa de Scribd logo
1 de 26
Back-Pressure in
Action
Intro & Agenda Crawler Intro & Problem
Statements
Crawler Architecture
Infrastructure: Akka Streams,
Kafka, etc.
The Goodies
Crawl
Jobs
Job DB
Validate
URL
Cache
Downloa
d
Process
URLs
URLs
Timestamps
High-Level View
Requirements Ever-expanding # of URLs
Can’t crawl all URLs at once
Control over concurrent web GETs
Efficient resource usage
Resilient under high burst
Scales horizontally & vertically
Sizing the Crawl Job
Let:
i = Number of crawl URLs in a job
n = Average number of links per page
d = The crawl depth
(how many layers to follow links)
u = The max number of URLs to process
Then:
u = ind
1.00E+00
1.00E+01
1.00E+02
1.00E+03
1.00E+04
1.00E+05
1.00E+06
1.00E+07
0 2 4 6 8 10 12
totalURLs vs depth
depth (initialURLs = 1, outLinks = 5)
1.00E+00
1.00E+01
1.00E+02
1.00E+03
1.00E+04
1.00E+05
1.00E+06
1.00E+07
1.00E+08
1.00E+09
1.00E+10
1.00E+11
1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07
totalURLs vs initialURLs
initialURLs (depth = 5, outLinks = 5)
The Reactive Manifesto
Responsive
Message Driven
Elastic Resilient
Why Does it Matter?
Respond in a deterministic, timely manner
Stays responsive in the face of failure – even cascading failures
Stays responsive under workload spikes
Basic building block for responsive, resilient, and elastic systems
Responsive
Resilient
Elastic
Message Driven
The Right Ingredients
• Kafka
• Huge persistent buffer for the bursts
• Load distribution to very large number of
processing nodes
• Enable horizontal scalability
• Akka streams
• High performance, highly efficient processing
pipeline
• Resilient with end-to-end back-pressure
• Fully asynchronous – utilizes mapAsyncUnordered
with Async HTTP client
• Async HTTP client
• Non-blocking and consumes no threads in waiting
• Integrates with Akka Streams for a high
parallelism, low resource solution
Efficient
Resilient
Scale
Akka
Stream
Async
HTTP
Reactive
Kafka
Crawl
Jobs
Job DB
Validate
URL
Cache
Downloa
d
Process
URLs
URLs
Timestamps
Adding Kafka & Akka Streams
URLs
Akka Streams
Akka Streams,
what???
High performance, pure async,
stream processing
Conforms to reactive streams
Simple, yet powerful GraphDSL
allows clear stream topology
declaration
Central point to understand
processing pipeline
Crawl Stream
Actual Stream Declaration in Code
prioritizeSource ~> crawlerFlow ~> bCast0 ~> result ~> bCast ~> outLinksFlow ~> outLinksSink
bCast ~> dataSinkFlow ~> kafkaDataSink
bCast ~> hdfsDataSink
bCast ~> graphFlow ~> merge ~> graphSink
bCast0 ~> maxPage ~> merge
bCast0 ~> retry ~> bCastRetry ~> retryFailed ~> merge
bCastRetry ~> errorSink
Prioritized
Source
Crawl
Result
MaxPageReached
Retry
OutLinks
Data
Graph
CheckFail
CheckErr
OutLinks
Sink
Kafka Data
Sink
HDFS Data
Sink
Graph
Sink
Error
Sink
Resulting Characteristics
Efficient
• Low thread count, controlled by Akka and pure non-blocking async HTTP
• High latency URLs do not block low latency URLs using MapAsyncUnordered
• Well-controlled download concurrency using MapAsyncUnordered
• Thread per concurrent crawl job
Resilient
• Processes only what can be processed – no resource overload
• Kafka as short-term, persistent queue
Scale
• Kafka feeds next batch of URLs to available node cluster
• Pull model – only processes that have capacity will get the load
• Kafka distributes work to large number of processing nodes in cluster
Back-Pressure
0
20000
40000
60000
80000
100000
120000
0 100 200 300 400 500 600 700
Queue Size
Time (seconds)
0
200
400
URLs/sec
Time (seconds)
initialURLs : 100
parallelism : 1000
processTime : 1 – 5
s
outLinks : 0 - 10
depth : 5
totalCrawled :
312500
Challenges
Training
• Developers not used to E2E stream
definitions
• More familiar with deeply nested function
calls
Maturity of Infrastructure
• Kafka 0.9 use fetch as heartbeat
• Slow nodes cause timeout & rebalance
• Solved in 0.10
What it would
have been…
Bloated, ineffective concurrency
control
Lack of well-thought-out and visible
processing pipeline
Clumsy code, hard to manage &
understand
Low training cost, high project TCO
Dev / Support / Maintenance
Bottom Line
Standardized Reactive Platform
Efficiency & Resilience meets Standardization
• Monitoring
• Need to collect metrics, consistently
• Logging
• Correlation across services
• Uniformity in logs
• Security
• Need to apply standard security configuration
• Environment Resolution
• Staging, production, etc.
Consistency in the face of Heterogeneity
squbs is not… A framework by its own
A programming model – use Akka
Take all or none –
Components/patterns can mostly be
used independently
squbs
Akka for large
scale deployments
Bootstrap
Lifecycle management
Loosely-coupled module system
Integration hooks for logging,
monitoring, ops integration
squbs
Akka for large
scale deployments
JSON console
HttpClient with pluggable resolver and
monitoring/logging hooks
Test tools and interfaces
Goodies:
- Activators for Scala & Java
- Programming patterns and helpers for
Akka and Akka Stream Use cases…,
and growing
PerpetualStream
• Provides a convenience trait to help
write streams controlled by system
lifecycle
• Minimal/no message losses
• Register PerpetualStream to make
stream start/stop
• Provides customization hooks –
especially for how to stop the stream
• Provides killSwitch (from Akka) to be
embedded into stream
• Implementers - just provide your
stream!
A non-stop stream; starts and stops with the system
class MyStream extends PerpetualStream[Future[Int]] {
def generator = Iterator.iterate(0) { p =>
if (p == Int.MaxValue) 0 else p + 1
}
val source = Source.fromIterator(generator _)
val ignoreSink = Sink.ignore[Int]
override def streamGraph = RunnableGraph.fromGraph(
GraphDSL.create(ignoreSink) { implicit builder =>
sink =>
import GraphDSL.Implicits._
source ~> killSwitch.flow[Int] ~> sink
ClosedShape
})
}
PersistentBuffer/BroadcastBuffer
• Data & indexes in rotating memory-mapped files
• Off-heap rotating file buffer – very large buffers
• Restarts gracefully with no or minimal message loss
• Not as durable as a remote data store, but much faster
• Does not back-pressure upstream beyond data/index writes
• Similar usage to Buffer and Broadcast
• BroadcastBuffer – a FanOutShape decouples each output port making each downstream
independent
• Useful if downstream stage blocked or unavailable
• Kafka is unavailable/rebalancing but system cannot backpressure/deny incoming
traffic
• Optional commit stage for at-least-once delivery semantics
• Implementation based on Chronicle Queue
A buffer of virtually unlimited size
Summary
• Kafka + Akka Streams + Async I/O = Ideal Architecture for High Bursts
& High Efficiency
• Akka Streams
• Clear view of stream topology
• Back-pressure & Kafka allows buffering load bursts
• Standardization
• Walk like a duck, quack like a duck, and manage it like a duck
• squbs: Have the cake, and eat it too, with goodies like
• PerpetualStream
• PersistentBuffer
• BroadcastBuffer
Q&A – Feedback Appreciated
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka

Más contenido relacionado

La actualidad más candente

Akka Streams - From Zero to Kafka
Akka Streams - From Zero to KafkaAkka Streams - From Zero to Kafka
Akka Streams - From Zero to KafkaMark Harrison
 
Reactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsReactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsKonrad Malawski
 
Exploring Reactive Integrations With Akka Streams, Alpakka And Apache Kafka
Exploring Reactive Integrations With Akka Streams, Alpakka And Apache KafkaExploring Reactive Integrations With Akka Streams, Alpakka And Apache Kafka
Exploring Reactive Integrations With Akka Streams, Alpakka And Apache KafkaLightbend
 
Akka streams scala italy2015
Akka streams scala italy2015Akka streams scala italy2015
Akka streams scala italy2015mircodotta
 
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Lightbend
 
Real-time streaming and data pipelines with Apache Kafka
Real-time streaming and data pipelines with Apache KafkaReal-time streaming and data pipelines with Apache Kafka
Real-time streaming and data pipelines with Apache KafkaJoe Stein
 
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...Lightbend
 
Scaling Twitter with Cassandra
Scaling Twitter with CassandraScaling Twitter with Cassandra
Scaling Twitter with CassandraRyan King
 
Reactive Streams 1.0 and Akka Streams
Reactive Streams 1.0 and Akka StreamsReactive Streams 1.0 and Akka Streams
Reactive Streams 1.0 and Akka StreamsDean Wampler
 
Journey into Reactive Streams and Akka Streams
Journey into Reactive Streams and Akka StreamsJourney into Reactive Streams and Akka Streams
Journey into Reactive Streams and Akka StreamsKevin Webber
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
 
Akka Revealed: A JVM Architect's Journey From Resilient Actors To Scalable Cl...
Akka Revealed: A JVM Architect's Journey From Resilient Actors To Scalable Cl...Akka Revealed: A JVM Architect's Journey From Resilient Actors To Scalable Cl...
Akka Revealed: A JVM Architect's Journey From Resilient Actors To Scalable Cl...Lightbend
 
Make 2016 your year of SMACK talk
Make 2016 your year of SMACK talkMake 2016 your year of SMACK talk
Make 2016 your year of SMACK talkDataStax Academy
 
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in Scylla
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in ScyllaScylla Summit 2018: Worry-free ingestion - flow-control of writes in Scylla
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in ScyllaScyllaDB
 
Developing Secure Scala Applications With Fortify For Scala
Developing Secure Scala Applications With Fortify For ScalaDeveloping Secure Scala Applications With Fortify For Scala
Developing Secure Scala Applications With Fortify For ScalaLightbend
 
Do's and don'ts when deploying akka in production
Do's and don'ts when deploying akka in productionDo's and don'ts when deploying akka in production
Do's and don'ts when deploying akka in productionjglobal
 
Building Stateful Microservices With Akka
Building Stateful Microservices With AkkaBuilding Stateful Microservices With Akka
Building Stateful Microservices With AkkaYaroslav Tkachenko
 
Native container monitoring
Native container monitoringNative container monitoring
Native container monitoringRohit Jnagal
 
Practical Akka HTTP - introduction
Practical Akka HTTP - introductionPractical Akka HTTP - introduction
Practical Akka HTTP - introductionŁukasz Sowa
 

La actualidad más candente (20)

Akka Streams - From Zero to Kafka
Akka Streams - From Zero to KafkaAkka Streams - From Zero to Kafka
Akka Streams - From Zero to Kafka
 
Akka streams
Akka streamsAkka streams
Akka streams
 
Reactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsReactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka Streams
 
Exploring Reactive Integrations With Akka Streams, Alpakka And Apache Kafka
Exploring Reactive Integrations With Akka Streams, Alpakka And Apache KafkaExploring Reactive Integrations With Akka Streams, Alpakka And Apache Kafka
Exploring Reactive Integrations With Akka Streams, Alpakka And Apache Kafka
 
Akka streams scala italy2015
Akka streams scala italy2015Akka streams scala italy2015
Akka streams scala italy2015
 
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
 
Real-time streaming and data pipelines with Apache Kafka
Real-time streaming and data pipelines with Apache KafkaReal-time streaming and data pipelines with Apache Kafka
Real-time streaming and data pipelines with Apache Kafka
 
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
 
Scaling Twitter with Cassandra
Scaling Twitter with CassandraScaling Twitter with Cassandra
Scaling Twitter with Cassandra
 
Reactive Streams 1.0 and Akka Streams
Reactive Streams 1.0 and Akka StreamsReactive Streams 1.0 and Akka Streams
Reactive Streams 1.0 and Akka Streams
 
Journey into Reactive Streams and Akka Streams
Journey into Reactive Streams and Akka StreamsJourney into Reactive Streams and Akka Streams
Journey into Reactive Streams and Akka Streams
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
Akka Revealed: A JVM Architect's Journey From Resilient Actors To Scalable Cl...
Akka Revealed: A JVM Architect's Journey From Resilient Actors To Scalable Cl...Akka Revealed: A JVM Architect's Journey From Resilient Actors To Scalable Cl...
Akka Revealed: A JVM Architect's Journey From Resilient Actors To Scalable Cl...
 
Make 2016 your year of SMACK talk
Make 2016 your year of SMACK talkMake 2016 your year of SMACK talk
Make 2016 your year of SMACK talk
 
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in Scylla
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in ScyllaScylla Summit 2018: Worry-free ingestion - flow-control of writes in Scylla
Scylla Summit 2018: Worry-free ingestion - flow-control of writes in Scylla
 
Developing Secure Scala Applications With Fortify For Scala
Developing Secure Scala Applications With Fortify For ScalaDeveloping Secure Scala Applications With Fortify For Scala
Developing Secure Scala Applications With Fortify For Scala
 
Do's and don'ts when deploying akka in production
Do's and don'ts when deploying akka in productionDo's and don'ts when deploying akka in production
Do's and don'ts when deploying akka in production
 
Building Stateful Microservices With Akka
Building Stateful Microservices With AkkaBuilding Stateful Microservices With Akka
Building Stateful Microservices With Akka
 
Native container monitoring
Native container monitoringNative container monitoring
Native container monitoring
 
Practical Akka HTTP - introduction
Practical Akka HTTP - introductionPractical Akka HTTP - introduction
Practical Akka HTTP - introduction
 

Similar a Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka

Kafka Summit SF 2017 - Kafka Stream Processing for Everyone with KSQL
Kafka Summit SF 2017 - Kafka Stream Processing for Everyone with KSQLKafka Summit SF 2017 - Kafka Stream Processing for Everyone with KSQL
Kafka Summit SF 2017 - Kafka Stream Processing for Everyone with KSQLconfluent
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextPrateek Maheshwari
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database Systemconfluent
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Developing distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka ClusterDeveloping distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka ClusterKonstantin Tsykulenko
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
StackMate - CloudFormation for CloudStack
StackMate - CloudFormation for CloudStackStackMate - CloudFormation for CloudStack
StackMate - CloudFormation for CloudStackChiradeep Vittal
 
Kafka Streams: the easiest way to start with stream processing
Kafka Streams: the easiest way to start with stream processingKafka Streams: the easiest way to start with stream processing
Kafka Streams: the easiest way to start with stream processingYaroslav Tkachenko
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Dan Halperin
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterPaolo Castagna
 
Building Distributed Systems in Scala
Building Distributed Systems in ScalaBuilding Distributed Systems in Scala
Building Distributed Systems in ScalaAlex Payne
 
Reactive integrations with Akka Streams
Reactive integrations with Akka StreamsReactive integrations with Akka Streams
Reactive integrations with Akka StreamsKonrad Malawski
 
Asynchronous stream processing with Akka Streams
Asynchronous stream processing with Akka StreamsAsynchronous stream processing with Akka Streams
Asynchronous stream processing with Akka StreamsJohan Andrén
 
Using the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductUsing the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductEvans Ye
 
Akka streams - Umeå java usergroup
Akka streams - Umeå java usergroupAkka streams - Umeå java usergroup
Akka streams - Umeå java usergroupJohan Andrén
 
Stacktician - CloudStack Collab Conference 2014
Stacktician - CloudStack Collab Conference 2014Stacktician - CloudStack Collab Conference 2014
Stacktician - CloudStack Collab Conference 2014amoghvk
 
Writing Asynchronous Programs with Scala & Akka
Writing Asynchronous Programs with Scala & AkkaWriting Asynchronous Programs with Scala & Akka
Writing Asynchronous Programs with Scala & AkkaYardena Meymann
 
Building REST API using Akka HTTP with Scala
Building REST API using Akka HTTP with ScalaBuilding REST API using Akka HTTP with Scala
Building REST API using Akka HTTP with ScalaKnoldus Inc.
 

Similar a Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka (20)

Kafka Summit SF 2017 - Kafka Stream Processing for Everyone with KSQL
Kafka Summit SF 2017 - Kafka Stream Processing for Everyone with KSQLKafka Summit SF 2017 - Kafka Stream Processing for Everyone with KSQL
Kafka Summit SF 2017 - Kafka Stream Processing for Everyone with KSQL
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's Next
 
ksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database SystemksqlDB: A Stream-Relational Database System
ksqlDB: A Stream-Relational Database System
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
KSQL Intro
KSQL IntroKSQL Intro
KSQL Intro
 
Developing distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka ClusterDeveloping distributed applications with Akka and Akka Cluster
Developing distributed applications with Akka and Akka Cluster
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
StackMate - CloudFormation for CloudStack
StackMate - CloudFormation for CloudStackStackMate - CloudFormation for CloudStack
StackMate - CloudFormation for CloudStack
 
Kafka Streams: the easiest way to start with stream processing
Kafka Streams: the easiest way to start with stream processingKafka Streams: the easiest way to start with stream processing
Kafka Streams: the easiest way to start with stream processing
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel...
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
Building Distributed Systems in Scala
Building Distributed Systems in ScalaBuilding Distributed Systems in Scala
Building Distributed Systems in Scala
 
Reactive integrations with Akka Streams
Reactive integrations with Akka StreamsReactive integrations with Akka Streams
Reactive integrations with Akka Streams
 
Asynchronous stream processing with Akka Streams
Asynchronous stream processing with Akka StreamsAsynchronous stream processing with Akka Streams
Asynchronous stream processing with Akka Streams
 
Using the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data ProductUsing the SDACK Architecture to Build a Big Data Product
Using the SDACK Architecture to Build a Big Data Product
 
Akka streams - Umeå java usergroup
Akka streams - Umeå java usergroupAkka streams - Umeå java usergroup
Akka streams - Umeå java usergroup
 
Stacktician - CloudStack Collab Conference 2014
Stacktician - CloudStack Collab Conference 2014Stacktician - CloudStack Collab Conference 2014
Stacktician - CloudStack Collab Conference 2014
 
Writing Asynchronous Programs with Scala & Akka
Writing Asynchronous Programs with Scala & AkkaWriting Asynchronous Programs with Scala & Akka
Writing Asynchronous Programs with Scala & Akka
 
Building REST API using Akka HTTP with Scala
Building REST API using Akka HTTP with ScalaBuilding REST API using Akka HTTP with Scala
Building REST API using Akka HTTP with Scala
 

Último

Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)Intelisync
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyFrank van der Linden
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 

Último (20)

Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)Introduction to Decentralized Applications (dApps)
Introduction to Decentralized Applications (dApps)
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
Engage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The UglyEngage Usergroup 2024 - The Good The Bad_The Ugly
Engage Usergroup 2024 - The Good The Bad_The Ugly
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 

Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka

  • 2. Intro & Agenda Crawler Intro & Problem Statements Crawler Architecture Infrastructure: Akka Streams, Kafka, etc. The Goodies
  • 4. Requirements Ever-expanding # of URLs Can’t crawl all URLs at once Control over concurrent web GETs Efficient resource usage Resilient under high burst Scales horizontally & vertically
  • 5. Sizing the Crawl Job Let: i = Number of crawl URLs in a job n = Average number of links per page d = The crawl depth (how many layers to follow links) u = The max number of URLs to process Then: u = ind 1.00E+00 1.00E+01 1.00E+02 1.00E+03 1.00E+04 1.00E+05 1.00E+06 1.00E+07 0 2 4 6 8 10 12 totalURLs vs depth depth (initialURLs = 1, outLinks = 5) 1.00E+00 1.00E+01 1.00E+02 1.00E+03 1.00E+04 1.00E+05 1.00E+06 1.00E+07 1.00E+08 1.00E+09 1.00E+10 1.00E+11 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 totalURLs vs initialURLs initialURLs (depth = 5, outLinks = 5)
  • 6. The Reactive Manifesto Responsive Message Driven Elastic Resilient
  • 7. Why Does it Matter? Respond in a deterministic, timely manner Stays responsive in the face of failure – even cascading failures Stays responsive under workload spikes Basic building block for responsive, resilient, and elastic systems Responsive Resilient Elastic Message Driven
  • 8. The Right Ingredients • Kafka • Huge persistent buffer for the bursts • Load distribution to very large number of processing nodes • Enable horizontal scalability • Akka streams • High performance, highly efficient processing pipeline • Resilient with end-to-end back-pressure • Fully asynchronous – utilizes mapAsyncUnordered with Async HTTP client • Async HTTP client • Non-blocking and consumes no threads in waiting • Integrates with Akka Streams for a high parallelism, low resource solution Efficient Resilient Scale Akka Stream Async HTTP Reactive Kafka
  • 10. Akka Streams, what??? High performance, pure async, stream processing Conforms to reactive streams Simple, yet powerful GraphDSL allows clear stream topology declaration Central point to understand processing pipeline
  • 11. Crawl Stream Actual Stream Declaration in Code prioritizeSource ~> crawlerFlow ~> bCast0 ~> result ~> bCast ~> outLinksFlow ~> outLinksSink bCast ~> dataSinkFlow ~> kafkaDataSink bCast ~> hdfsDataSink bCast ~> graphFlow ~> merge ~> graphSink bCast0 ~> maxPage ~> merge bCast0 ~> retry ~> bCastRetry ~> retryFailed ~> merge bCastRetry ~> errorSink Prioritized Source Crawl Result MaxPageReached Retry OutLinks Data Graph CheckFail CheckErr OutLinks Sink Kafka Data Sink HDFS Data Sink Graph Sink Error Sink
  • 12. Resulting Characteristics Efficient • Low thread count, controlled by Akka and pure non-blocking async HTTP • High latency URLs do not block low latency URLs using MapAsyncUnordered • Well-controlled download concurrency using MapAsyncUnordered • Thread per concurrent crawl job Resilient • Processes only what can be processed – no resource overload • Kafka as short-term, persistent queue Scale • Kafka feeds next batch of URLs to available node cluster • Pull model – only processes that have capacity will get the load • Kafka distributes work to large number of processing nodes in cluster
  • 13. Back-Pressure 0 20000 40000 60000 80000 100000 120000 0 100 200 300 400 500 600 700 Queue Size Time (seconds) 0 200 400 URLs/sec Time (seconds) initialURLs : 100 parallelism : 1000 processTime : 1 – 5 s outLinks : 0 - 10 depth : 5 totalCrawled : 312500
  • 14. Challenges Training • Developers not used to E2E stream definitions • More familiar with deeply nested function calls Maturity of Infrastructure • Kafka 0.9 use fetch as heartbeat • Slow nodes cause timeout & rebalance • Solved in 0.10
  • 15. What it would have been… Bloated, ineffective concurrency control Lack of well-thought-out and visible processing pipeline Clumsy code, hard to manage & understand Low training cost, high project TCO Dev / Support / Maintenance
  • 18. Efficiency & Resilience meets Standardization • Monitoring • Need to collect metrics, consistently • Logging • Correlation across services • Uniformity in logs • Security • Need to apply standard security configuration • Environment Resolution • Staging, production, etc. Consistency in the face of Heterogeneity
  • 19. squbs is not… A framework by its own A programming model – use Akka Take all or none – Components/patterns can mostly be used independently
  • 20. squbs Akka for large scale deployments Bootstrap Lifecycle management Loosely-coupled module system Integration hooks for logging, monitoring, ops integration
  • 21. squbs Akka for large scale deployments JSON console HttpClient with pluggable resolver and monitoring/logging hooks Test tools and interfaces Goodies: - Activators for Scala & Java - Programming patterns and helpers for Akka and Akka Stream Use cases…, and growing
  • 22. PerpetualStream • Provides a convenience trait to help write streams controlled by system lifecycle • Minimal/no message losses • Register PerpetualStream to make stream start/stop • Provides customization hooks – especially for how to stop the stream • Provides killSwitch (from Akka) to be embedded into stream • Implementers - just provide your stream! A non-stop stream; starts and stops with the system class MyStream extends PerpetualStream[Future[Int]] { def generator = Iterator.iterate(0) { p => if (p == Int.MaxValue) 0 else p + 1 } val source = Source.fromIterator(generator _) val ignoreSink = Sink.ignore[Int] override def streamGraph = RunnableGraph.fromGraph( GraphDSL.create(ignoreSink) { implicit builder => sink => import GraphDSL.Implicits._ source ~> killSwitch.flow[Int] ~> sink ClosedShape }) }
  • 23. PersistentBuffer/BroadcastBuffer • Data & indexes in rotating memory-mapped files • Off-heap rotating file buffer – very large buffers • Restarts gracefully with no or minimal message loss • Not as durable as a remote data store, but much faster • Does not back-pressure upstream beyond data/index writes • Similar usage to Buffer and Broadcast • BroadcastBuffer – a FanOutShape decouples each output port making each downstream independent • Useful if downstream stage blocked or unavailable • Kafka is unavailable/rebalancing but system cannot backpressure/deny incoming traffic • Optional commit stage for at-least-once delivery semantics • Implementation based on Chronicle Queue A buffer of virtually unlimited size
  • 24. Summary • Kafka + Akka Streams + Async I/O = Ideal Architecture for High Bursts & High Efficiency • Akka Streams • Clear view of stream topology • Back-pressure & Kafka allows buffering load bursts • Standardization • Walk like a duck, quack like a duck, and manage it like a duck • squbs: Have the cake, and eat it too, with goodies like • PerpetualStream • PersistentBuffer • BroadcastBuffer
  • 25. Q&A – Feedback Appreciated