"Stateful app as an efficient way to build dispatching for riders and drivers", Oleksandr Chumak

Uklon in numbers
● 130+ engineers
● 12 product teams
● 16M Android/iOS downloads
● 1.5M+ riders DAU
● 200k+ drivers DAU
● 30+ microservices
● 3 countries
● 30 cities
Uklon
RiderApp DriverApp
What is the report about?
How to reduce CPU consumption tenfold through stateful processing while ensuring high reliability
What are the solutions employed by our competitors?
Agenda
● Workloads that make the stateless approach inefficient
● Basic concepts
● Scaling of stateful services
● Reliability of stateful services
Workloads that make the stateless approach inefficient
Several challenges within ride-hailing are:
1. Massive, frequent write operations are needed to track the objects' current locations. Because drivers can move as fast as 20 meters per second, it is important to update drivers' locations every second.
2. A k-nearest neighbour (kNN) query poses tremendous challenges, compared to a simple Get query, in a key-value data store such as Redis.
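To make the kNN challenge concrete, here is a minimal sketch of a nearest-drivers query over local in-memory state. The class name `DriverGrid`, the cell size, and the use of raw degrees as planar coordinates are assumptions for illustration only, not the implementation from the talk.

```python
import heapq
import math
from collections import defaultdict

CELL_DEG = 0.01  # assumed grid cell size, in degrees


class DriverGrid:
    """Hypothetical in-memory index: grid cell -> {driver_id: (lat, lon)}."""

    def __init__(self):
        self.cells = defaultdict(dict)
        self.cell_of = {}  # driver_id -> cell that currently holds the driver

    @staticmethod
    def _cell(lat, lon):
        return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))

    def update(self, driver_id, lat, lon):
        """Frequent write path: move the driver into its new cell."""
        old, new = self.cell_of.get(driver_id), self._cell(lat, lon)
        if old is not None and old != new:
            del self.cells[old][driver_id]
        self.cells[new][driver_id] = (lat, lon)
        self.cell_of[driver_id] = new

    def nearest(self, lat, lon, k, max_rings=50):
        """Return up to k driver ids, closest first, scanning rings of cells."""
        cx, cy = self._cell(lat, lon)
        found = []  # (distance in degrees, driver_id)
        for ring in range(max_rings + 1):
            for x in range(cx - ring, cx + ring + 1):
                for y in range(cy - ring, cy + ring + 1):
                    if max(abs(x - cx), abs(y - cy)) != ring:
                        continue  # visit only the border of this ring
                    for driver_id, (dlat, dlon) in self.cells.get((x, y), {}).items():
                        found.append((math.hypot(dlat - lat, dlon - lon), driver_id))
            best = heapq.nsmallest(k, found)
            # Any driver in a ring not scanned yet is at least ring * CELL_DEG away.
            if len(best) == k and best[-1][0] <= ring * CELL_DEG:
                break
        return [driver_id for _, driver_id in heapq.nsmallest(k, found)]


grid = DriverGrid()
grid.update(1, 50.4501, 30.5234)
grid.update(2, 50.4600, 30.5300)
grid.update(3, 50.3000, 30.4000)
print(grid.nearest(50.4500, 30.5230, k=2))
```

With a remote key-value store, the same query would need one round trip per cell or a server-side geo command; here it is a local scan over a few dictionaries.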
Feature #1
Orders Dispatching
Find the best driver for the order
Feature #2
Orders Broadcasting
Streaming your order to many drivers
DriverApp
Feature #3
Batch dispatching
The Process of Order Dispatching with Batch Windows
● Greedy algorithm: 2 min + 9 min, total wait time = 11 min
● Batching algorithm: 4 min + 4 min, total wait time = 8 min
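The difference between the two algorithms can be sketched with two riders and two drivers. The ETA matrix below is hypothetical; its values were chosen only so that the totals match the 11 min and 8 min on the slide. The brute-force search stands in for a real assignment solver.

```python
from itertools import permutations

# Hypothetical pickup ETAs in minutes: eta[rider][driver].
eta = {
    "rider_A": {"driver_1": 2, "driver_2": 4},
    "rider_B": {"driver_1": 4, "driver_2": 9},
}


def greedy(eta):
    """Serve riders in arrival order; each takes the closest free driver."""
    free = set(next(iter(eta.values())))
    plan = {}
    for rider, options in eta.items():
        driver = min(free, key=lambda d: options[d])
        plan[rider] = driver
        free.remove(driver)
    return plan


def batch(eta):
    """Collect riders in a batch window, then minimise the total wait."""
    riders = list(eta)
    drivers = list(next(iter(eta.values())))
    best = min(
        permutations(drivers, len(riders)),
        key=lambda p: sum(eta[r][d] for r, d in zip(riders, p)),
    )
    return dict(zip(riders, best))


def total_wait(eta, plan):
    return sum(eta[rider][driver] for rider, driver in plan.items())


print(total_wait(eta, greedy(eta)), total_wait(eta, batch(eta)))
```

Trying every permutation is only feasible for tiny batches; an assignment algorithm such as the Hungarian method is the usual choice at larger sizes.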
Feature #4
Driver ETA Tracker
Requirements:
1. Active orders: tens of thousands
2. Drivers send their location every 2-5 seconds
Other Workloads
1. Order offers. Find the best driver near you.
2. Order broadcasts. Fan out orders to multiple drivers.
3. Order chaining. Find the next order for the driver while they complete the current one.
4. Order batching (optimization). Reduce the total waiting time for all passengers.
5. Sector queue (airports, train stations).
6. Driver ETA tracking for an accepted order.
7. Matching the driver’s GPS location to a map graph node.
Simplified Overview of
the Architecture
Stateful
Stateful architectures: Open Problems
● Load balancing algorithms
● Scalability
○ Partitioning
○ Replication
● Fault tolerance and cold start
Key concept
1. Local state is stored in in-memory KV structures.
2. The local state is restored from the durable log. In some cases, local state changes may have been checkpointed to a remote KV store (or into a separate Kafka topic).
3. Local state updates occur within a single thread: no concurrency, monotonic writes.
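The three points above can be sketched as a worker that owns its state on one thread and rebuilds it by replaying a log. This is a simplified illustration: the class `LocalStateWorker` is hypothetical, a Python list stands in for the durable log (a Kafka partition in the talk), and treating a `None` value as a delete is an assumption.

```python
import queue
import threading


class LocalStateWorker:
    """Hypothetical worker: a single thread owns the in-memory KV state."""

    def __init__(self, durable_log):
        self.log = durable_log   # stand-in for a durable, ordered log
        self.state = {}          # local in-memory KV
        self.inbox = queue.Queue()
        for key, value in self.log:  # cold start: replay the log in order
            self._apply(key, value)

    def _apply(self, key, value):
        if value is None:        # assumed delete marker
            self.state.pop(key, None)
        else:
            self.state[key] = value

    def submit(self, key, value):
        """Callable from any thread; only enqueues the update."""
        self.inbox.put((key, value))

    def stop(self):
        self.inbox.put(None)

    def run(self):
        """The only code path that mutates state, so no locks are needed."""
        while True:
            item = self.inbox.get()
            if item is None:
                break
            self.log.append(item)  # append to the log first, then apply
            self._apply(*item)


log = [("d1", "online"), ("d2", "online"), ("d1", "busy")]
worker = LocalStateWorker(log)
thread = threading.Thread(target=worker.run)
thread.start()
worker.submit("d2", "offline")
worker.stop()
thread.join()
print(worker.state)
```

Because every update passes through one queue and one thread, writes to a given key are applied in a single order, which is what makes replaying the log reproduce the same state.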
NFR (Kyiv only)
Writes
1.1) 5000-10000 rps
1.2) 100-500 rps
Reads
2.1) 500 rps (each request handles 100-500 drivers)
2.2) fetch 50000-200000 rows/sec (100-400 MB/sec)
Driver entity size: 2 KB (50th percentile) / 13 KB (99th percentile)
Total size for 100K driver entities: 200 MB
Key differences
Stateless (remote KV)
● Provides a GET/PUT/DELETE API
● High CPU cost due to marshalling and serialization
● Additional network latency
● Frequently necessitates additional local caching
Stateful (in-memory/local KV)
● Domain-specific API, for example:
○ Find nearest drivers
○ Calculate ETA
● Data locality
● Shared-nothing
Access patterns for in-memory KV
1. Key lookup
2. Index seek (Offers, Broadcast)
3. Full scans / range scans
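A minimal sketch of the three access patterns over one in-memory structure. The class `DriverStore`, the choice of city as the secondary index, and the field names are hypothetical; the slide does not say which indexes serve Offers and Broadcast.

```python
from bisect import bisect_left, insort
from collections import defaultdict


class DriverStore:
    """Hypothetical local store supporting the three access patterns."""

    def __init__(self):
        self.by_id = {}                  # 1. key lookup
        self.by_city = defaultdict(set)  # 2. secondary index for index seek
        self.sorted_ids = []             # 3. ordered keys for range/full scans

    def put(self, driver_id, city, status):
        old = self.by_id.get(driver_id)
        if old is None:
            insort(self.sorted_ids, driver_id)
        else:
            self.by_city[old["city"]].discard(driver_id)  # keep the index in sync
        self.by_id[driver_id] = {"city": city, "status": status}
        self.by_city[city].add(driver_id)

    def get(self, driver_id):
        return self.by_id.get(driver_id)

    def seek_city(self, city):
        return sorted(self.by_city.get(city, ()))

    def range_scan(self, lo, hi):
        """Driver ids in [lo, hi)."""
        return self.sorted_ids[bisect_left(self.sorted_ids, lo):bisect_left(self.sorted_ids, hi)]

    def full_scan(self):
        return [(driver_id, self.by_id[driver_id]) for driver_id in self.sorted_ids]


store = DriverStore()
store.put(3, "kyiv", "free")
store.put(1, "lviv", "free")
store.put(2, "kyiv", "busy")
print(store.get(2), store.seek_city("kyiv"), store.range_scan(1, 3))
```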
Concept #1: Co-partitioning
Two topics are described as co-partitioned if:
1. Their keys have the same schema
2. They have the same number of partitions
3. Their producers use the same partitioner
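A small sketch of why all three conditions matter. The partitioner here is a stand-in (MD5 modulo the partition count), not Kafka's default partitioner, and the topic descriptions are hypothetical dictionaries.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Stand-in partitioner: stable hash of the key modulo partition count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def co_partitioned(topic_a: dict, topic_b: dict) -> bool:
    """Check the three conditions: key schema, partition count, partitioner."""
    return (
        topic_a["key_schema"] == topic_b["key_schema"]
        and topic_a["partitions"] == topic_b["partitions"]
        and topic_a["partitioner"] is topic_b["partitioner"]
    )


# Hypothetical topic descriptions.
locations = {"key_schema": "driver_id", "partitions": 8, "partitioner": partition_for}
state = {"key_schema": "driver_id", "partitions": 8, "partitioner": partition_for}
orders = {"key_schema": "driver_id", "partitions": 6, "partitioner": partition_for}

keys = [f"driver-{i}" for i in range(100)]
# Same key, same partition index in both topics: one consumer sees both streams.
same = all(partition_for(k, 8) == partition_for(k, 8) for k in keys)
# A different partition count sends at least some keys elsewhere.
moved = [k for k in keys if partition_for(k, 8) != partition_for(k, 6)]
print(co_partitioned(locations, state), co_partitioned(locations, orders), len(moved))
```

When the check holds, the consumer that owns partition N of one topic can also own partition N of the other and join the two streams locally.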
Concept #2: Re-keying partitions
● Related events are not co-partitioned
● Well-balanced partitions
● Partitions, and as a result consumers, can become unbalanced
● Achieving data locality for the consumer
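The trade-off can be shown with synthetic events. Everything below is an assumption for illustration: the hash-based partitioner, the 8 partitions, and an event mix in which half of the traffic comes from one city.

```python
import hashlib
from collections import Counter


def partition_for(key, num_partitions):
    """Stand-in partitioner: stable hash of the key modulo partition count."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


# Synthetic events: even driver ids are in Kyiv, the rest spread over 5 cities.
events = [
    {"driver": d, "city": "kyiv" if d % 2 == 0 else f"city-{d % 10}"}
    for d in range(1000)
]

# Original key (driver id): well balanced, but a city's events are scattered.
by_driver = Counter(partition_for(e["driver"], 8) for e in events)

# Re-keyed by city: each city lives in one partition, but load is uneven.
by_city = Counter(partition_for(e["city"], 8) for e in events)

kyiv_partitions = {partition_for(e["city"], 8) for e in events if e["city"] == "kyiv"}
print(sorted(by_driver.values()), sorted(by_city.values()), kyiv_partitions)
```

After re-keying, the consumer of the Kyiv partition holds every Kyiv event locally, at the cost of carrying at least half of the total load.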
Concept #3: Filtering + Enriching
DriverLocation {
  "driver": 12345,
  "latitude": 50.30846,
  "longitude": 30.53419
}
DriverETA {
  "driver": 12345,
  "latitude": 50.30846,
  "longitude": 30.53419,
  "order": 98765,
  "eta": "2 min"
}
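A sketch of the filter-and-enrich step that turns a DriverLocation into a DriverETA. The `active_orders` lookup and the `estimate_eta` callback are hypothetical; the ETA calculation itself is stubbed because the slides do not describe it.

```python
def enrich(location, active_orders, estimate_eta):
    """Return a DriverETA record, or None when the location is filtered out."""
    order = active_orders.get(location["driver"])
    if order is None:
        return None  # filter: the driver has no accepted order to track
    return {**location, "order": order["id"], "eta": estimate_eta(location, order)}


def track(locations, active_orders, estimate_eta):
    """Apply enrich() to a stream and drop the filtered records."""
    for location in locations:
        record = enrich(location, active_orders, estimate_eta)
        if record is not None:
            yield record


# Local state: driver id -> accepted order (assumed shape).
active_orders = {12345: {"id": 98765}}
stream = [
    {"driver": 12345, "latitude": 50.30846, "longitude": 30.53419},
    {"driver": 55555, "latitude": 50.40000, "longitude": 30.60000},
]
stub_eta = lambda location, order: "2 min"  # placeholder for a real ETA model
print(list(track(stream, active_orders, stub_eta)))
```

Because `active_orders` is local state, the join costs one dictionary lookup per location message instead of a remote read.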
How to scale?
[Diagram: four Driver Dispatching instances]
Scalability
Some sharding strategies
1. Geospatial indexing (geohash, S2, H3)
2. city_id (region)
Consider the following points when you design a data partitioning scheme:
1. Minimize cross-partition data access operations
2. Minimize cross-partition joins
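The two strategies above can be compared as shard-key functions. The geohash encoder follows the standard public algorithm; the helper names `shard_by_geohash` and `shard_by_city` and the prefix length are hypothetical choices for this sketch.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet


def geohash(lat: float, lon: float, precision: int = 5) -> str:
    """Encode a point by interleaving longitude and latitude bisection bits."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, value = [], 0, 0
    use_lon = True  # geohash starts with a longitude bit
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        value <<= 1
        if coord >= mid:
            value |= 1
            rng[0] = mid
        else:
            rng[1] = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def shard_by_geohash(lat, lon, prefix_len=4):
    """Shard key from a geohash prefix: fine-grained, but neighbours may split."""
    return geohash(lat, lon, prefix_len)


def shard_by_city(city_id):
    """Shard key from the region: coarse, keeps a whole city in one shard."""
    return f"city-{city_id}"


print(shard_by_geohash(57.64911, 10.40744), shard_by_city(1))
```

A geohash prefix spreads load more evenly, but a nearest-driver search close to a cell border must read the neighbouring cells, which is exactly the cross-partition access the two design points try to minimise. Sharding by `city_id` avoids that at the price of uneven shard sizes.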
Partitioning by Region
Possible challenges:
● Downtime during rebalance: scale-out, rolling update
● Unbalanced load: the load from Kyiv is equivalent to the load from all other cities of Ukraine combined
Try to fix:
Partitioning by Region + Replication
Replication:
● Standalone consumers
● No partition rebalancing
● No downtime
● Replication overhead is less than 0.1 CPU per pod
● Reduced requirements for cold recovery
Scalability
1. Scalability: adding Kafka partitions and deploying separate Shard-Instances for cities/countries
2. Elasticity: scale-out of consumers within a Shard
Reliability?
Replica synchronization
● State-based CRDT
● Last write wins (LWW)
● Optimistic replication (can become temporarily inconsistent)
● Strong Eventual Consistency (SEC)
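A minimal sketch of a state-based, last-write-wins map, one common way to get the properties listed above. The entry layout `(timestamp, replica_id, value)` and the tie-break on replica id are assumptions, not details from the talk.

```python
def merge(a: dict, b: dict) -> dict:
    """State-based merge: per key, keep the entry with the highest (ts, replica) stamp."""
    out = dict(a)
    for key, entry in b.items():
        if key not in out or entry[:2] > out[key][:2]:
            out[key] = entry
    return out


# Each replica holds key -> (timestamp, replica_id, value).
r1 = {"d1": (10, "r1", "free")}
r2 = {"d1": (12, "r2", "busy"), "d2": (5, "r2", "free")}
r3 = {"d2": (7, "r3", "offline")}

merged = merge(merge(r1, r2), r3)
print(merged)
```

Because `merge` is commutative, associative, and idempotent, replicas can exchange state in any order and any number of times and still converge, which is what strong eventual consistency requires. Until they exchange state they may disagree, which is the temporary inconsistency of optimistic replication.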
Problems with Replication Lag?
Depends on your domain:
● Reading Your Own Writes
● Monotonic Reads
● Consistent Prefix Reads
Fault tolerance with local state
1. Single infrastructure dependency: Kafka (a battle-tested streaming platform with high throughput, fault tolerance, and scalability).
2. When a task instance restarts, local state is repopulated by reading its own Kafka log.
3. Yes, reading and repopulating will take some time.
Controlling State Size
1. Key-Based Retention
a. Aggressive topic compaction
b. Tombstones
2. Time-Based Retention
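Both retention styles can be sketched over an in-memory list standing in for a topic. This is a simplification: real Kafka compaction runs in the background and keeps tombstones for a configurable period before dropping them, whereas this sketch drops them immediately. The function names are hypothetical.

```python
def compact(log):
    """Key-based retention: keep the latest record per key, drop tombstoned keys."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    survivors = sorted(latest.items(), key=lambda item: item[1][0])  # keep log order
    return [(key, value) for key, (_, value) in survivors if value is not None]


def expire(records, now, retention_sec):
    """Time-based retention: keep records newer than now - retention_sec."""
    return [r for r in records if r[0] >= now - retention_sec]


log = [("d1", "a"), ("d2", "b"), ("d1", "c"), ("d2", None), ("d3", "x")]
print(compact(log))

timed = [(0, "d1", "a"), (3000, "d2", "b"), (4000, "d1", "c")]  # (ts_sec, key, value)
print(expire(timed, now=4000, retention_sec=3600))
```

The smaller the retained log, the fewer messages a restarting instance has to replay.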
How long does it take to rebuild the state?
1. Driver state retention: 1 hour
2. Repopulate local state:
a. Read driver-state from the beginning of the topic: 400k msg (8 partitions)
b. Read driver-locations from 'now - 5 sec'
3. You need to implement your own "live processing started" event
"Live processing started "dispatching.driver-summary-events [0]"
after 00:00:01.7875633 sec (50142 msgs)"
An SLA of 99.998% uptime/availability allows the following downtime/unavailability:
■ Daily: 1.7 s
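The daily figure follows directly from the availability percentage. A short check, with a hypothetical helper name:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability_pct: float, period_seconds: int) -> float:
    """Downtime budget for a period at the given availability percentage."""
    return (1 - availability_pct / 100) * period_seconds


daily = allowed_downtime_seconds(99.998, SECONDS_PER_DAY)
print(f"{daily:.3f} s per day")
```

The budget is about 1.73 s per day, the same order as the roughly 1.79 s rebuild time in the log line above.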
Traffic Jams requirements
1. Reduce the cost of the Google Maps API
2. High rate of writes (20k online drivers)
3. Update traffic information every 5 min
Stateful processing
● Grouping messages by partition key
● Aggregating messages in hopping window
● MapReduce
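Grouping by key and aggregating in a hopping window can be sketched as follows. The 5-minute hop matches the update interval in the requirements; the 10-minute window size, the road-segment key, and the speed field are assumptions for illustration.

```python
from collections import defaultdict


def window_starts(ts, size, hop):
    """Start times of every hopping window [start, start + size) containing ts."""
    start = (ts // hop) * hop
    starts = []
    while start > ts - size and start >= 0:
        starts.append(start)
        start -= hop
    return starts[::-1]


def average_speed(events, size=600, hop=300):
    """Group by key (road segment), then average the speed per hopping window."""
    acc = defaultdict(lambda: [0.0, 0])  # (segment, window start) -> [sum, count]
    for ts, segment, speed in events:
        for start in window_starts(ts, size, hop):
            cell = acc[(segment, start)]
            cell[0] += speed
            cell[1] += 1
    return {key: total / count for key, (total, count) in acc.items()}


# (timestamp in seconds, road segment, speed in km/h)
events = [(100, "seg-1", 30.0), (420, "seg-1", 50.0), (700, "seg-1", 10.0)]
print(average_speed(events))
```

Because windows overlap, each event contributes to `size / hop` windows, so a fresh average is available at every hop without waiting for a full window to elapse.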
Driver ETA Tracker
Similar workload using Redis
https://aws.amazon.com/blogs/database/optimize-redis-client-performance-for-amazon-elasticache/?utm_source=pocket_saves
○ Client: c5.4xlarge (16 vCPU, 32 GiB)
○ Redis: 3 nodes, r6g.2xlarge (8 vCPUs, 64 GiB)
Resources Usage
Future work
Although the current design is simple, it allows flexibility to change key aspects:
○ Replication + Sharding
Lessons learned
1. Stateful is not always difficult
2. Simple and reliable solution
3. Easy to maintain
4. Much more efficient in terms of resources: 2 vCPUs for all dispatching instead of a Redis cluster with 16-24 vCPUs
5. What about MS Orleans?
The Twelve-Factor App
Misleading
Space-based architecture?
https://www.amazon.com/_/dp/1492043451?smid=ATVPDKIKX0DER&_encoding=UTF8&tag=oreilly20-20
Contacts
Solution Architect
Oleksandr Chumak
https://www.linkedin.com/in/oleksandr-chumak-45967588/
facebook.com/achumak.dev
1 de 46

Recomendados

Kubernetes @ Squarespace (SRE Portland Meetup October 2017) por
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kubernetes @ Squarespace (SRE Portland Meetup October 2017)
Kubernetes @ Squarespace (SRE Portland Meetup October 2017)Kevin Lynch
237 vistas51 diapositivas
Stephan Ewen - Experiences running Flink at Very Large Scale por
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
3.5K vistas76 diapositivas
BWC Supercomputing 2008 Presentation por
BWC Supercomputing 2008 PresentationBWC Supercomputing 2008 Presentation
BWC Supercomputing 2008 Presentationlilyco
343 vistas25 diapositivas
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large... por
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...
Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...Flink Forward
1.5K vistas44 diapositivas
QCON 2015: Gearpump, Realtime Streaming on Akka por
QCON 2015: Gearpump, Realtime Streaming on AkkaQCON 2015: Gearpump, Realtime Streaming on Akka
QCON 2015: Gearpump, Realtime Streaming on AkkaSean Zhong
634 vistas60 diapositivas
Our Multi-Year Journey to a 10x Faster Confluent Cloud por
Our Multi-Year Journey to a 10x Faster Confluent CloudOur Multi-Year Journey to a 10x Faster Confluent Cloud
Our Multi-Year Journey to a 10x Faster Confluent CloudHostedbyConfluent
32 vistas43 diapositivas

Más contenido relacionado

Similar a "Stateful app as an efficient way to build dispatching for riders and drivers", Oleksandr Chumak

Challenges in Cloud Computing – VM Migration por
Challenges in Cloud Computing – VM MigrationChallenges in Cloud Computing – VM Migration
Challenges in Cloud Computing – VM MigrationSarmad Makhdoom
7.1K vistas26 diapositivas
Velocity 2018 preetha appan final por
Velocity 2018   preetha appan finalVelocity 2018   preetha appan final
Velocity 2018 preetha appan finalpreethaappan
118 vistas70 diapositivas
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale por
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at ScaleSean Zhong
900 vistas56 diapositivas
Practice and challenges from building IaaS por
Practice and challenges from building IaaSPractice and challenges from building IaaS
Practice and challenges from building IaaSShawn Zhu
841 vistas26 diapositivas
MQTC V2.0.1.3 - WMQ & TCP Buffers – Size DOES Matter! (pps) por
MQTC V2.0.1.3 - WMQ & TCP Buffers – Size DOES Matter! (pps)MQTC V2.0.1.3 - WMQ & TCP Buffers – Size DOES Matter! (pps)
MQTC V2.0.1.3 - WMQ & TCP Buffers – Size DOES Matter! (pps)Art Schanz
85 vistas31 diapositivas
Unclouding Container Challenges por
 Unclouding  Container Challenges Unclouding  Container Challenges
Unclouding Container ChallengesRakuten Group, Inc.
407 vistas18 diapositivas

Similar a "Stateful app as an efficient way to build dispatching for riders and drivers", Oleksandr Chumak(20)

Challenges in Cloud Computing – VM Migration por Sarmad Makhdoom
Challenges in Cloud Computing – VM MigrationChallenges in Cloud Computing – VM Migration
Challenges in Cloud Computing – VM Migration
Sarmad Makhdoom7.1K vistas
Velocity 2018 preetha appan final por preethaappan
Velocity 2018   preetha appan finalVelocity 2018   preetha appan final
Velocity 2018 preetha appan final
preethaappan118 vistas
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale por Sean Zhong
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleStrata Singapore: GearpumpReal time DAG-Processing with Akka at Scale
Strata Singapore: Gearpump Real time DAG-Processing with Akka at Scale
Sean Zhong900 vistas
Practice and challenges from building IaaS por Shawn Zhu
Practice and challenges from building IaaSPractice and challenges from building IaaS
Practice and challenges from building IaaS
Shawn Zhu841 vistas
MQTC V2.0.1.3 - WMQ & TCP Buffers – Size DOES Matter! (pps) por Art Schanz
MQTC V2.0.1.3 - WMQ & TCP Buffers – Size DOES Matter! (pps)MQTC V2.0.1.3 - WMQ & TCP Buffers – Size DOES Matter! (pps)
MQTC V2.0.1.3 - WMQ & TCP Buffers – Size DOES Matter! (pps)
Art Schanz85 vistas
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala... por Martin Zapletal
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala...
Martin Zapletal1.3K vistas
Oow2007 performance por Ricky Zhu
Oow2007 performanceOow2007 performance
Oow2007 performance
Ricky Zhu494 vistas
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni... por MLconf
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni...
MLconf9K vistas
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w... por Data Con LA
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Data Con LA783 vistas
z/VM Performance Analysis por Rodrigo Campos
z/VM Performance Analysisz/VM Performance Analysis
z/VM Performance Analysis
Rodrigo Campos5.8K vistas
Ingestion and Dimensions Compute and Enrich using Apache Apex por Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache ApexIngestion and Dimensions Compute and Enrich using Apache Apex
Ingestion and Dimensions Compute and Enrich using Apache Apex
Apache Apex671 vistas
MapReduce:Simplified Data Processing on Large Cluster Presented by Areej Qas... por areej qasrawi
MapReduce:Simplified Data Processing on Large Cluster  Presented by Areej Qas...MapReduce:Simplified Data Processing on Large Cluster  Presented by Areej Qas...
MapReduce:Simplified Data Processing on Large Cluster Presented by Areej Qas...
areej qasrawi64 vistas
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware por Lucidworks
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAwareLeveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Leveraging the Power of Solr with Spark: Presented by Johannes Weigend, QAware
Lucidworks1.1K vistas
Leveraging the Power of Solr with Spark por QAware GmbH
Leveraging the Power of Solr with SparkLeveraging the Power of Solr with Spark
Leveraging the Power of Solr with Spark
QAware GmbH959 vistas
Mobile web performance - MoDev East por Patrick Meenan
Mobile web performance - MoDev EastMobile web performance - MoDev East
Mobile web performance - MoDev East
Patrick Meenan3.4K vistas

Más de Fwdays

"Drizzle: What Is It All About?", Alex Blokh, Dan Kochetov por
"Drizzle: What Is It All About?", Alex Blokh, Dan Kochetov"Drizzle: What Is It All About?", Alex Blokh, Dan Kochetov
"Drizzle: What Is It All About?", Alex Blokh, Dan KochetovFwdays
12 vistas33 diapositivas
"Package management in monorepos", Zoltan Kochan por
"Package management in monorepos", Zoltan Kochan"Package management in monorepos", Zoltan Kochan
"Package management in monorepos", Zoltan KochanFwdays
26 vistas18 diapositivas
"Node.js vs workers — A comparison of two JavaScript runtimes", James M Snell por
"Node.js vs workers — A comparison of two JavaScript runtimes", James M Snell"Node.js vs workers — A comparison of two JavaScript runtimes", James M Snell
"Node.js vs workers — A comparison of two JavaScript runtimes", James M SnellFwdays
14 vistas30 diapositivas
"AI and how to integrate ChatGPT as a customer support agent", Sergey Dyachok por
"AI and how to integrate ChatGPT as a customer support agent",  Sergey Dyachok"AI and how to integrate ChatGPT as a customer support agent",  Sergey Dyachok
"AI and how to integrate ChatGPT as a customer support agent", Sergey DyachokFwdays
30 vistas17 diapositivas
"Node.js Development in 2024: trends and tools", Nikita Galkin por
"Node.js Development in 2024: trends and tools", Nikita Galkin "Node.js Development in 2024: trends and tools", Nikita Galkin
"Node.js Development in 2024: trends and tools", Nikita Galkin Fwdays
22 vistas38 diapositivas
"Running students' code in isolation. The hard way", Yurii Holiuk por
"Running students' code in isolation. The hard way", Yurii Holiuk "Running students' code in isolation. The hard way", Yurii Holiuk
"Running students' code in isolation. The hard way", Yurii Holiuk Fwdays
30 vistas34 diapositivas

Más de Fwdays(20)

"Drizzle: What Is It All About?", Alex Blokh, Dan Kochetov por Fwdays
"Drizzle: What Is It All About?", Alex Blokh, Dan Kochetov"Drizzle: What Is It All About?", Alex Blokh, Dan Kochetov
"Drizzle: What Is It All About?", Alex Blokh, Dan Kochetov
Fwdays12 vistas
"Package management in monorepos", Zoltan Kochan por Fwdays
"Package management in monorepos", Zoltan Kochan"Package management in monorepos", Zoltan Kochan
"Package management in monorepos", Zoltan Kochan
Fwdays26 vistas
"Node.js vs workers — A comparison of two JavaScript runtimes", James M Snell por Fwdays
"Node.js vs workers — A comparison of two JavaScript runtimes", James M Snell"Node.js vs workers — A comparison of two JavaScript runtimes", James M Snell
"Node.js vs workers — A comparison of two JavaScript runtimes", James M Snell
Fwdays14 vistas
"AI and how to integrate ChatGPT as a customer support agent", Sergey Dyachok por Fwdays
"AI and how to integrate ChatGPT as a customer support agent",  Sergey Dyachok"AI and how to integrate ChatGPT as a customer support agent",  Sergey Dyachok
"AI and how to integrate ChatGPT as a customer support agent", Sergey Dyachok
Fwdays30 vistas
"Node.js Development in 2024: trends and tools", Nikita Galkin por Fwdays
"Node.js Development in 2024: trends and tools", Nikita Galkin "Node.js Development in 2024: trends and tools", Nikita Galkin
"Node.js Development in 2024: trends and tools", Nikita Galkin
Fwdays22 vistas
"Running students' code in isolation. The hard way", Yurii Holiuk por Fwdays
"Running students' code in isolation. The hard way", Yurii Holiuk "Running students' code in isolation. The hard way", Yurii Holiuk
"Running students' code in isolation. The hard way", Yurii Holiuk
Fwdays30 vistas
"Surviving highload with Node.js", Andrii Shumada por Fwdays
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada
Fwdays40 vistas
"The role of CTO in a classical early-stage startup", Eugene Gusarov por Fwdays
"The role of CTO in a classical early-stage startup", Eugene Gusarov"The role of CTO in a classical early-stage startup", Eugene Gusarov
"The role of CTO in a classical early-stage startup", Eugene Gusarov
Fwdays33 vistas
"Cross-functional teams: what to do when a new hire doesn’t solve the busines... por Fwdays
"Cross-functional teams: what to do when a new hire doesn’t solve the busines..."Cross-functional teams: what to do when a new hire doesn’t solve the busines...
"Cross-functional teams: what to do when a new hire doesn’t solve the busines...
Fwdays43 vistas
"Ukrainian Mobile Banking Scaling in Practice. From 0 to 100 and beyond", Vad... por Fwdays
"Ukrainian Mobile Banking Scaling in Practice. From 0 to 100 and beyond", Vad..."Ukrainian Mobile Banking Scaling in Practice. From 0 to 100 and beyond", Vad...
"Ukrainian Mobile Banking Scaling in Practice. From 0 to 100 and beyond", Vad...
Fwdays47 vistas
"Thriving Culture in a Product Company — Practical Story", Volodymyr Tsukur por Fwdays
"Thriving Culture in a Product Company — Practical Story", Volodymyr Tsukur"Thriving Culture in a Product Company — Practical Story", Volodymyr Tsukur
"Thriving Culture in a Product Company — Practical Story", Volodymyr Tsukur
Fwdays49 vistas
"Fast Start to Building on AWS", Igor Ivaniuk por Fwdays
"Fast Start to Building on AWS", Igor Ivaniuk"Fast Start to Building on AWS", Igor Ivaniuk
"Fast Start to Building on AWS", Igor Ivaniuk
Fwdays51 vistas
"Quality Assurance: Achieving Excellence in startup without a Dedicated QA", ... por Fwdays
"Quality Assurance: Achieving Excellence in startup without a Dedicated QA", ..."Quality Assurance: Achieving Excellence in startup without a Dedicated QA", ...
"Quality Assurance: Achieving Excellence in startup without a Dedicated QA", ...
Fwdays43 vistas
"AI Startup Growth from Idea to 1M ARR", Oleksandr Uspenskyi por Fwdays
"AI Startup Growth from Idea to 1M ARR", Oleksandr Uspenskyi"AI Startup Growth from Idea to 1M ARR", Oleksandr Uspenskyi
"AI Startup Growth from Idea to 1M ARR", Oleksandr Uspenskyi
Fwdays32 vistas
"How we switched to Kanban and how it integrates with product planning", Vady... por Fwdays
"How we switched to Kanban and how it integrates with product planning", Vady..."How we switched to Kanban and how it integrates with product planning", Vady...
"How we switched to Kanban and how it integrates with product planning", Vady...
Fwdays75 vistas
"Bringing Flutter to Tide: a case study of a leading fintech platform in the ... por Fwdays
"Bringing Flutter to Tide: a case study of a leading fintech platform in the ..."Bringing Flutter to Tide: a case study of a leading fintech platform in the ...
"Bringing Flutter to Tide: a case study of a leading fintech platform in the ...
Fwdays25 vistas
"Shape Up: How to Develop Quickly and Avoid Burnout", Dmytro Popov por Fwdays
"Shape Up: How to Develop Quickly and Avoid Burnout", Dmytro Popov"Shape Up: How to Develop Quickly and Avoid Burnout", Dmytro Popov
"Shape Up: How to Develop Quickly and Avoid Burnout", Dmytro Popov
Fwdays65 vistas
"Role of a CTO in software outsourcing company", Yuriy Nakonechnyy por Fwdays
"Role of a CTO in software outsourcing company", Yuriy Nakonechnyy"Role of a CTO in software outsourcing company", Yuriy Nakonechnyy
"Role of a CTO in software outsourcing company", Yuriy Nakonechnyy
Fwdays49 vistas
From “T” to “E”, Dmytro Gryn por Fwdays
From “T” to “E”, Dmytro GrynFrom “T” to “E”, Dmytro Gryn
From “T” to “E”, Dmytro Gryn
Fwdays37 vistas
"Why I left React in my TypeScript projects and where ", Illya Klymov por Fwdays
"Why I left React in my TypeScript projects and where ",  Illya Klymov"Why I left React in my TypeScript projects and where ",  Illya Klymov
"Why I left React in my TypeScript projects and where ", Illya Klymov
Fwdays254 vistas

Último

What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue por
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueWhat’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueShapeBlue
131 vistas23 diapositivas
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT por
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBITUpdates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBITShapeBlue
91 vistas8 diapositivas
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue por
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlueMigrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlueShapeBlue
96 vistas20 diapositivas
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates por
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesKeynote Talk: Open Source is Not Dead - Charles Schulz - Vates
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesShapeBlue
119 vistas15 diapositivas
HTTP headers that make your website go faster - devs.gent November 2023 por
HTTP headers that make your website go faster - devs.gent November 2023HTTP headers that make your website go faster - devs.gent November 2023
HTTP headers that make your website go faster - devs.gent November 2023Thijs Feryn
28 vistas151 diapositivas
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... por
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...ShapeBlue
82 vistas62 diapositivas

Último(20)

What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue por ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueWhat’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
ShapeBlue131 vistas
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT por ShapeBlue
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBITUpdates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
ShapeBlue91 vistas
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue por ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlueMigrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
Migrating VMware Infra to KVM Using CloudStack - Nicolas Vazquez - ShapeBlue
ShapeBlue96 vistas
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates por ShapeBlue
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesKeynote Talk: Open Source is Not Dead - Charles Schulz - Vates
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates
ShapeBlue119 vistas
HTTP headers that make your website go faster - devs.gent November 2023 por Thijs Feryn
HTTP headers that make your website go faster - devs.gent November 2023HTTP headers that make your website go faster - devs.gent November 2023
HTTP headers that make your website go faster - devs.gent November 2023
Thijs Feryn28 vistas
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... por ShapeBlue
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
ShapeBlue82 vistas
NTGapps NTG LowCode Platform por Mustafa Kuğu
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform
Mustafa Kuğu141 vistas
Why and How CloudStack at weSystems - Stephan Bienek - weSystems por ShapeBlue
Why and How CloudStack at weSystems - Stephan Bienek - weSystemsWhy and How CloudStack at weSystems - Stephan Bienek - weSystems
Why and How CloudStack at weSystems - Stephan Bienek - weSystems
ShapeBlue111 vistas
DRBD Deep Dive - Philipp Reisner - LINBIT por ShapeBlue
DRBD Deep Dive - Philipp Reisner - LINBITDRBD Deep Dive - Philipp Reisner - LINBIT
DRBD Deep Dive - Philipp Reisner - LINBIT
ShapeBlue62 vistas
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive por Network Automation Forum
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLiveAutomating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Automating a World-Class Technology Conference; Behind the Scenes of CiscoLive
Business Analyst Series 2023 - Week 3 Session 5 por DianaGray10
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5
DianaGray10369 vistas
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue por ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlueElevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
Elevating Privacy and Security in CloudStack - Boris Stoyanov - ShapeBlue
ShapeBlue96 vistas
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue por ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue46 vistas
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue por ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueCloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
ShapeBlue46 vistas
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院 por IttrainingIttraining
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
【USB韌體設計課程】精選講義節錄-USB的列舉過程_艾鍗學院
Igniting Next Level Productivity with AI-Infused Data Integration Workflows por Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software344 vistas

"Stateful app as an efficient way to build dispatching for riders and drivers", Oleksandr Chumak

  • 2. Uklon in numbers 12 130+ Engineers Product Teams 16 M Android/iOS downloads 1.5M+ Riders DAU 30+ microservices 200k+ Drivers DAU 3 Countries 30 Cities
  • 5. How to reduce CPU consumption by 10 times due to stateful-processing and ensure high reliability What is the report about?
  • 6. 3 What are the solutions employed by our competitors?
  • 7. 1 Scaling of stateful services Reliability of stateful services Workloads that make the stateless approach inefficient Basic concepts Agenda
  • 8. Workloads that make the stateless approach inefficient
  • 9. 1. massive frequent write operations are needed to track the objects' current locations. As drivers can move as fast as 20 meters per second, it is therefore important to update drivers' locations at a second. Several challenges within the ride-hailing are… 2. a K-nearest neighbour (kNN) query poses tremendous challenges, compared to a simple Get query, in a key-value data store such as Redis.
  • 10. Feature #1 Orders Dispatching Find the best driver for the order
  • 11. Feature #2 Orders Broadcasting Streaming your order to many drivers DriverApp
  • 12. Feature #3 Batch dispatching Greedy algorithm Batching algorithm The Process of Order Dispatching with Batch Windows 2 min 9 min 4 min 4 min Total wait time = 11 min Total wait time = 8 min
  • 13. image Feature #4 Driver ETA Tracker Requirements: 1. Active Orders = tens of thousands 2. Drivers send their location every 2-5 seconds
  • 14. 1. Order offers. Find the best driver near you. 2. Order broadcasts. Fan-out orders to multiple drivers. 3. Order chaining. Find the next order for the driver, while completing the current one. 4. Order batching (optimization). Reduce the total waiting time for all passengers. 5. Sector queue (airports, train stations). 6. Driver ETA tracking for accepted order. 7. Matching driver’s GPS location to map graph node. Other Workloads
  • 15. Simplified Overview of the Architecture Stateful
  • 16. ● Load balancing algorithms ● Scalability ○ Partitioning ○ Replication ● Fault tolerance and Cold start 4 Stateful architectures Open Problems
  • 17. 1 Key concept 1. Local state is stored in memory KV structures 2. The local state restored from the durable log. In same cases, local state change may have been checkpointed to remote KV store (or into a separate kafka topic) 3. Local state updates occur within a single-threaded. No concurrency, Monotonic Writes
  • 18. NFR (Kyiv only) Writes 1.1) 5000-10000 rps 1.2) 100-500 rps Reads 2.1) 500 rps (handle 100-500 drivers per request) 2.2) fetch 50000-200000 rows/sec (100-400MB/sec) driver entity: 2 KB (50 perc)/ 13 KB (99 perc) total size for 100K = 200 MB
  • 19. Key differences Stateless (remote KV) ● Provide GET/PUT/DELETE API ● A high CPU cost due to marshalling and serialization ● Additional network latency ● Frequently necessitates additional local caching Stateful (in-memory/local KV) ● Domain specific API. Ex: ○ Find nearest drivers ○ Calculate ETA ● Data locality ● Shared-nothing
  • 20. Access patterns for In-memory KV 1. Key lookup 2. Index seek (Offers, Broadcast) 3. All scans / Range scans
  • 22. Concept #1: Co-partitioning. Two topics are described as co-partitioned if: 1. Their keys have the same schemas 2. They are materialized by topics with the same number of partitions 3. Their producers use the same partitioner
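The three conditions can be illustrated with a toy partitioner. This is a sketch under stated assumptions: `partition_for` uses CRC32 only because it ships with the standard library (Kafka's Java client hashes keys with murmur2 by default), and both function names are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Hypothetical partitioner: CRC32 of the key modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def co_partitioned(keys, partitions_a: int, partitions_b: int) -> bool:
    """True if every key lands on the same partition number in both topics."""
    return all(
        partition_for(k, partitions_a) == partition_for(k, partitions_b)
        for k in keys
    )


keys = [f"driver-{i}" for i in range(1000)]
print(co_partitioned(keys, 8, 8))   # True: same partitioner, same count
print(co_partitioned(keys, 8, 12))  # False: partition counts differ
```

When the check holds, a consumer that owns partition N of both topics sees all events for a given key locally, which is what makes a local join possible.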
  • 23. Concept #2: Re-keying partitions ● Related events are not co-partitioned ● Well-balanced partitions ● Partitions can become unbalanced and, as a result, so can consumers ● Achieving data locality for the consumer
  • 24. Concept #3: Filtering + Enriching DriverLocation { "driver": 12345, "latitude": 50.30846, "longitude": 30.53419 } DriverETA { "driver": 12345, "latitude": 50.30846, "longitude": 30.53419, "order": 98765, "eta": "2 min" }
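A rough sketch of the filter-and-enrich step, using the field names from the slide. Everything else is an assumption: the `active_orders` lookup, the pickup coordinates, the straight-line (haversine) distance, and the fixed average speed stand in for whatever routing the real ETA calculation uses.

```python
import math

# Assumed average city speed used to turn distance into minutes (hypothetical)
AVG_SPEED_M_PER_S = 8.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    r = 6_371_000  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def enrich(location, active_orders):
    """Return a DriverETA event, or None if the driver has no active order."""
    order = active_orders.get(location["driver"])
    if order is None:
        return None  # filtering step: drop drivers without an accepted order
    dist = haversine_m(location["latitude"], location["longitude"],
                       order["pickup_lat"], order["pickup_lon"])
    minutes = max(1, math.ceil(dist / AVG_SPEED_M_PER_S / 60))
    return {**location, "order": order["order"], "eta": f"{minutes} min"}


# Local state: accepted orders keyed by driver id (hypothetical pickup point)
active_orders = {12345: {"order": 98765, "pickup_lat": 50.31646, "pickup_lon": 30.53419}}
location = {"driver": 12345, "latitude": 50.30846, "longitude": 30.53419}
print(enrich(location, active_orders))
```

Because the order lookup is a local dictionary read, no remote call is needed for each location update.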
  • 25. How to scale? [Diagram: four Driver Dispatching instances]
  • 27. Some sharding strategies: 1. Geospatial indexing (geohash, S2, H3) 2. city_id (region). Consider the following points when you design a data partitioning scheme: 1. Minimize cross-partition data access operations 2. Minimize cross-partition joins
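To make the first strategy concrete, here is a small sketch of geohash encoding, one of the three indexing schemes named on the slide. The encoder follows the standard geohash bit interleaving; `shard_for` and its coarse precision of 4 are hypothetical choices for illustration.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits are interleaved, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def shard_for(lat: float, lon: float, precision: int = 4) -> str:
    """Hypothetical shard key: the geohash cell at a coarse precision."""
    return geohash(lat, lon, precision)


print(geohash(57.64911, 10.40744, 5))  # u4pru
print(shard_for(50.30846, 30.53419))
```

A shorter geohash is always a prefix of a longer one for the same point, so nearby drivers usually share a shard key. Points on either side of a cell border do not, which is one source of the cross-partition access the slide warns about.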
  • 28. Partitioning by Region. Possible challenges: ● downtime during rebalance: scale-out, rolling update ● unbalanced load (the load from Kyiv is equivalent to the load from all other cities of Ukraine combined)
  • 29. Try to fix: Partitioning by Region + Replication. Replication: ● Standalone consumers ● No partition rebalance ● No downtime ● Replication overhead is less than 0.1 CPU per pod ● Reduced requirements for cold recovery
  • 30. Scalability 1. Scalability: adding Kafka partitions and deploying separate Shard-Instances for cities/countries 2. Elasticity: scale-out of consumers within a Shard
  • 32. Replica synchronization ● State-based CRDT ● Last write wins (LWW) ● Optimistic replication (can become temporarily inconsistent) ● Strong Eventual Consistency (SEC)
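One way to read the first two bullets together is a last-write-wins map merged as a state-based CRDT. The sketch below assumes each entry is a `(timestamp, replica_id, value)` tuple and that the pair `(timestamp, replica_id)` is unique per write. This layout is an illustration, not the representation used in the talk.

```python
def merge_lww(local: dict, remote: dict) -> dict:
    """Merge two replicas of an LWW map: key -> (timestamp, replica_id, value)."""
    merged = dict(local)
    for key, entry in remote.items():
        current = merged.get(key)
        # Compare (timestamp, replica_id) so that ties resolve deterministically
        if current is None or entry[:2] > current[:2]:
            merged[key] = entry
    return merged


replica_a = {"driver-1": (10, "a", "free"), "driver-2": (5, "a", "busy")}
replica_b = {"driver-1": (12, "b", "busy"), "driver-3": (7, "b", "free")}

# The merge is commutative and idempotent, so replicas that exchange state
# in any order end up with the same result.
print(merge_lww(replica_a, replica_b) == merge_lww(replica_b, replica_a))  # True
```

Until the replicas have exchanged state they can disagree, which is the temporary inconsistency the slide mentions under optimistic replication.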
  • 33. Problems with Replication Lag? Depends on your domain: ● Reading Your Own Writes ● Monotonic Reads ● Consistent Prefix Reads
  • 34. Fault tolerance with local state 1. Single infrastructure dependency: Kafka (battle-tested streaming platform with high throughput, fault tolerance, and scalability). 2. When a task instance restarts, local state is repopulated by reading its own Kafka log 3. Yes, reading and repopulating will take some time
  • 35. Controlling State Size. How long does it take to rebuild the state? 1. Key-Based Retention a. Aggressive topic compaction b. Tombstones 2. Time-Based Retention
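Key-based retention can be pictured with a toy compaction pass over a list of `(key, value)` records, where a `None` value plays the role of a tombstone. This is a simplified model of what log compaction achieves, not Kafka's implementation. In particular, real Kafka keeps tombstones around for a configurable period (`delete.retention.ms`) before dropping them.

```python
def compact(log):
    """Keep the latest record per key; drop keys whose latest value is a tombstone.

    log: list of (key, value) records in offset order; value None = tombstone.
    Returns the surviving records as (offset, key, value), in offset order.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    survivors = [
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    ]
    return sorted(survivors)


log = [
    ("driver-1", "free"),
    ("driver-2", "busy"),
    ("driver-1", "busy"),
    ("driver-2", None),   # tombstone: driver-2 went offline
    ("driver-3", "free"),
]
print(compact(log))  # [(2, 'driver-1', 'busy'), (4, 'driver-3', 'free')]
```

The fewer records survive compaction, the less a restarting instance has to replay, which is why retention settings bound the rebuild time.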
  • 36. How long does it take to rebuild the state? 1. Driver state retention: 1 hour 2. Repopulate local state: a. Read driver-state from the beginning of the topic: 400k msg (8 partitions) b. Read driver-locations from 'now - 5 sec' 3. You need to implement your own "live processing started" event. Example log: "Live processing started "dispatching.driver-summary-events [0]" after 00:00:01.7875633 sec (50142 msgs)". An SLA of 99.998% uptime/availability allows the following downtime/unavailability: ■ Daily: 1.7 s
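A "live processing started" event can be derived by remembering where the partition ended at startup and replaying until that point is reached. The sketch below is one hypothetical way to do it over an in-memory iterator. The function name, the `(offset, message)` record shape, and the meaning of `end_offset` (the last offset that existed at startup) are all assumptions.

```python
import time


def restore_until_live(records, end_offset, apply, clock=time.monotonic):
    """Replay records until the offset captured at startup is reached.

    records: iterator of (offset, message) pairs for one partition.
    end_offset: last offset present in the partition when the consumer
        started; a negative value means the partition was empty.
    Returns (messages_replayed, seconds_elapsed) once the consumer is live.
    """
    if end_offset < 0:
        return 0, 0.0
    started = clock()
    replayed = 0
    for offset, message in records:
        apply(message)
        replayed += 1
        if offset >= end_offset:
            break  # caught up: everything after this point is live traffic
    return replayed, clock() - started


state = {}
records = iter((i, (f"driver-{i % 3}", i)) for i in range(10))
count, seconds = restore_until_live(
    records, 6, lambda msg: state.__setitem__(msg[0], msg[1])
)
print(f"Live processing started after {seconds:.7f} sec ({count} msgs)")
```

Records left in the iterator after the break are handled by the normal processing loop. The elapsed time is the figure to compare against the downtime budget.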
  • 37. Traffic Jams requirements 1. Reduce the cost of Google Maps API 2. High rate of writes (20k online drivers) 3. Update traffic information every 5 min
  • 38. Stateful processing ● Grouping messages by partition key ● Aggregating messages in a hopping window ● MapReduce
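Grouping by key and aggregating in a hopping window might look like the sketch below. The event shape `(timestamp_sec, segment_id, speed_kmh)`, the 300-second window, and the 60-second hop are assumptions chosen to resemble the traffic-jam use case. The talk does not specify them.

```python
from collections import defaultdict


def hopping_windows(ts: int, size: int, hop: int):
    """Yield the start of every window [start, start + size) that contains ts."""
    start = ts - ts % hop  # latest window start at or before ts
    while start > ts - size and start >= 0:
        yield start
        start -= hop


def aggregate(events, size=300, hop=60):
    """Average speed per (road segment, window start).

    events: iterable of (timestamp_sec, segment_id, speed_kmh) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (segment, start) -> [total, count]
    for ts, segment, speed in events:
        for start in hopping_windows(ts, size, hop):
            acc = sums[(segment, start)]
            acc[0] += speed
            acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


events = [(130, "seg-1", 30.0), (310, "seg-1", 50.0)]
result = aggregate(events)
print(result[("seg-1", 0)])    # 30.0: only the first event falls in [0, 300)
print(result[("seg-1", 120)])  # 40.0: both events fall in [120, 420)
```

Because windows overlap, each event contributes to `size / hop` windows. All events for one segment arrive on the same partition, so the accumulators can live in local memory.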
  • 40. Similar workload using Redis https://aws.amazon.com/blogs/database/optimize-redis-client-performance-for-amazon-elasticache/?utm_source=pocket_saves ○ Client: c5.4xlarge (16 vCPUs, 32 GiB) ○ Redis: 3 nodes r6g.2xlarge (8 vCPUs, 64 GiB)
  • 42. Future work: Although the current design is simple, it leaves room to change key aspects: ○ Replication + Sharding
  • 43. Lessons learned: 1. Stateful is not always difficult 2. Simple and reliable solution 3. Easy to maintain 4. Much more efficient in terms of resources: 2 vCPUs for all dispatching instead of a Redis cluster with 16-24 vCPUs 5. What about MS Orleans?