1
Streaming millions of
Contact Center
interactions in (near)
real-time with Pulsar
Frank Kelly
Principal Engineer, Cogito Corp
Slack: https://apache-pulsar.slack.com/
A panoply of parameters
2
● Cogito & What we do
● Architecture & Use-Cases
● Challenges
● Initial lessons learned
● Kubernetes lessons learned
● Performance & Scaling settings
● Results
● Q&A
Intended Audience
Those who understand the main APIs and components but who may not be familiar with all the configuration
settings or how to optimize the system for high write throughput and/or millions of topics.
Overview
3
Formed in 2007 out of MIT - based out of Boston - now with a Global Engineering Footprint
Vision: Elevating the human connection in real time . . . .
Product: Call center AI solution that analyzes the human voice and provides real-time guidance to
enhance emotional intelligence and customer service.
Cogito: Who we are and what we do
4
Architecture
5
● Streaming: Real-time audio and analytic
results from our AI/ML models
● We break each customer call into
separate logical units called “intervals”
● Each interval is backed by two Pulsar
topics
○ Real-time Audio Topic
○ Real-time Analytics Topic
● Slicing binary formats into discrete
messages → Deduplication is VERY
important!
● With 15,000 concurrent users - we
estimate 1.5m to 2m topics per day
● Each topic has moderate throughput ~ 32
Kb/s
● Also Messaging: Work-Queue events
Use-Cases for Pulsar
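The topic estimate above can be sanity-checked with a little arithmetic. The sketch below assumes only what the slide states (two topics per interval, 15,000 concurrent users, 1.5m to 2m topics per day); the class and method names are illustrative, not part of any Cogito or Pulsar API.

```java
// Back-of-the-envelope check of the topic estimate on this slide.
final class TopicEstimate {
    static final int TOPICS_PER_INTERVAL = 2; // audio topic + analytics topic

    // Intervals per day implied by a daily topic count.
    static long intervalsPerDay(long topicsPerDay) {
        return topicsPerDay / TOPICS_PER_INTERVAL;
    }

    // Average number of intervals each concurrent user generates per day.
    static double intervalsPerUserPerDay(long topicsPerDay, int users) {
        return (double) intervalsPerDay(topicsPerDay) / users;
    }

    public static void main(String[] args) {
        int users = 15_000;
        for (long topics : new long[] {1_500_000L, 2_000_000L}) {
            System.out.printf("%d topics/day -> %d intervals/day, %.1f intervals/user/day%n",
                    topics, intervalsPerDay(topics), intervalsPerUserPerDay(topics, users));
        }
    }
}
```

Under these assumptions the estimate corresponds to roughly 50 to 67 intervals per user per day.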
6
● Streaming Use-Case
○ Lots of throughput ~ 10 Gbps
○ Message-ordering & deduplication are critical
○ Near real-time requirements (< 250ms)
■ Think about timeouts/retries/failover
● Challenges
○ ZooKeeper stores all the topics for a namespace
under one ZNode
○ Brokers require more memory
● Alternatives considered
○ Using a Key_Shared subscription would require us to
disable batching in the producer (not a huge deal)
○ Risk: Message dispatch will stop if there is a
subscription / consumer that has built up a backlog
of messages in its hash range
○ Filtering on the client-side
The Challenges
7
● Processing real-time binary streams
○ Consumer: SubscriptionInitialPosition.Earliest
○ Broker Configuration: brokerDeduplicationEnabled: "true"
● Client Performance
○ Producer: sendAsync() ~10x improvement
○ Producer: blockIfQueueFull(true)
○ Batching: Enabled but the throughput per Producer is so low it rarely becomes helpful
● Default Timeouts
○ For our real-time system the default connection / operation timeout of 30s is too high
● Persistent vs. Non-Persistent
○ We support both use-cases (some customers wish for zero persistence)
Initial Lessons on the basics
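Why broker deduplication matters for retried sends can be shown with a simplified model. The sketch below is not Pulsar's implementation; it only illustrates the general idea of tracking the highest sequence ID seen per producer and dropping anything at or below it. The class name and the producer name "audio-producer" are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of sequence-ID based deduplication (not Pulsar's actual code).
final class SequenceDeduplicator {
    // Highest sequence ID accepted so far, keyed by producer name.
    private final Map<String, Long> lastSequenceId = new HashMap<>();

    // Returns true if the message is new and should be persisted,
    // false if it is a duplicate (for example, a retry after a timeout).
    boolean accept(String producerName, long sequenceId) {
        Long last = lastSequenceId.get(producerName);
        if (last != null && sequenceId <= last) {
            return false;
        }
        lastSequenceId.put(producerName, sequenceId);
        return true;
    }

    public static void main(String[] args) {
        SequenceDeduplicator dedup = new SequenceDeduplicator();
        System.out.println(dedup.accept("audio-producer", 0)); // first chunk
        System.out.println(dedup.accept("audio-producer", 1)); // next chunk
        System.out.println(dedup.accept("audio-producer", 1)); // retried chunk
    }
}
```

With a binary audio stream split into messages, a retried chunk that was persisted twice would corrupt the reassembled stream, which is why the retried send must be rejected.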
8
● 15k Users ⇒ ingress of 5 Gbps Audio Data ⇒ 20 TB in a 12 hour window
● Open Subscriptions keep the topic data from being deleted
○ Code: pulsarAdmin.namespaces().setSubscriptionExpirationTime(namespace, expirationTimeInMinutes);
○ Broker Deduplication has its own subscription
■ brokerDeduplicationEntriesInterval: "50" (default: 1000)
■ brokerDeduplicationProducerInactivityTimeoutMinutes: "15" (default: 360)
● Bookie Compaction Thresholds (Delete more and do it more frequently)
○ majorCompactionInterval / majorCompactionThreshold
○ minorCompactionInterval / minorCompactionThreshold
○ compactionRate
● Tiered Storage
○ Although we use some Tiered storage there will be too many topics in ZK over time
○ Created our own Stream Offload that stores S3 location in RDS DB
Disk Space Challenges
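The disk-volume figure on this slide can be checked with a short calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a constant bit rate; the class and method names are illustrative. A sustained 5 Gbps over 12 hours works out to about 27 TB, so the 20 TB on the slide would correspond to an average of roughly 3.7 Gbps, which may mean the 5 Gbps figure is a peak rather than an average (an inference, not something the slide states).

```java
// Rough volume arithmetic for the ingress numbers on this slide.
final class IngressVolume {
    // Terabytes (10^12 bytes) written when a bit rate is sustained for the given hours.
    static double terabytes(double gigabitsPerSecond, double hours) {
        double bytesPerSecond = gigabitsPerSecond * 1e9 / 8.0;
        return bytesPerSecond * hours * 3600.0 / 1e12;
    }

    // Average bit rate in Gbps that would produce the given volume in the given hours.
    static double averageGbps(double terabytes, double hours) {
        return terabytes * 1e12 * 8.0 / (hours * 3600.0) / 1e9;
    }

    public static void main(String[] args) {
        System.out.printf("5 Gbps sustained for 12 h: %.1f TB%n", terabytes(5.0, 12.0));
        System.out.printf("20 TB in 12 h implies an average of %.2f Gbps%n", averageGbps(20.0, 12.0));
    }
}
```

Neither number includes replication; with a write quorum of 2, the bytes stored across bookies would be about double.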
9
● Which Helm chart?
○ Apache Pulsar (“Official”) vs StreamNative (Also “Official”) vs Kafkaesque
● GC Settings
○ Java Ergonomics: -XX:+PrintFlagsFinal
○ GC Settings tied to Pod Memory: -Xms2g -Xmx2g -XX:MaxDirectMemorySize=6g
○ resources.requests.memory = Heap + Direct Memory + Some Buffer
○ Looking forward to seeing modern JVM settings e.g. -XX:MaxRAMPercentage=75.0
● Most helm charts set requests but not limits. We set requests == limits
○ JVM Memory is not elastic
○ CPU is; however, we experienced a lot of throttling from the K8s scheduler
● Istio Service Mesh
○ Integration with Istio for mTLS and service-level authorization took a chunk of time
Kubernetes Lessons
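The sizing rule "resources.requests.memory = Heap + Direct Memory + Some Buffer" can be written out as a small calculation. The heap and direct-memory values below come from the JVM flags on this slide; the 1 GiB buffer is a hypothetical figure chosen only for illustration, and the class name is made up.

```java
// Pod memory request derived from the JVM flags on this slide.
final class PodMemory {
    // Memory request in MiB: heap + direct memory + headroom for metaspace, thread stacks, etc.
    static long requestMiB(long heapMiB, long directMiB, long bufferMiB) {
        return heapMiB + directMiB + bufferMiB;
    }

    // Formats a MiB value as a Kubernetes quantity string, e.g. "9216Mi".
    static String asQuantity(long mib) {
        return mib + "Mi";
    }

    public static void main(String[] args) {
        long heapMiB = 2 * 1024;   // -Xms2g -Xmx2g
        long directMiB = 6 * 1024; // -XX:MaxDirectMemorySize=6g
        long bufferMiB = 1024;     // hypothetical headroom, tune per workload
        String quantity = asQuantity(requestMiB(heapMiB, directMiB, bufferMiB));
        // Requests and limits are set to the same value, as on the slide.
        System.out.println("requests.memory = limits.memory = " + quantity);
    }
}
```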
10
● Config: exposeTopicLevelMetricsInPrometheus: "false"
Passive Monitoring with Prometheus / Grafana
11
Active Monitoring with Prometheus Alerts
Integration with Prometheus Alerting to Slack / PagerDuty
12
● Namespace Bundles
○ For 15 Brokers: defaultNumberOfNamespaceBundles: "128" (Default: 4)
● Pulsar Load Balancer
○ # Disable Bundle split due to https://github.com/apache/pulsar/issues/5510
○ loadBalancerAutoBundleSplitEnabled: "false"
● Balancing throughput, durability and reliability across Bookies
○ managedLedgerDefaultEnsembleSize: "N"
○ managedLedgerDefaultWriteQuorum: "2"
○ managedLedgerDefaultAckQuorum: "1"
○ Striping is great for write-throughput but adds cost for read throughput
Real-Time / Scaling Journey Lessons
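The striping trade-off can be illustrated with a simplified model of round-robin entry placement: when the ensemble is larger than the write quorum, consecutive entries land on different subsets of bookies. The sketch below is an illustration only, not BookKeeper code; the ensemble size of 3 stands in for the "N" on the slide and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of round-robin striping across an ensemble of bookies.
final class Striping {
    // Indexes (within the ensemble) of the bookies that store a given entry.
    static List<Integer> writeSet(long entryId, int ensembleSize, int writeQuorum) {
        List<Integer> bookies = new ArrayList<>();
        for (int i = 0; i < writeQuorum; i++) {
            bookies.add((int) ((entryId + i) % ensembleSize));
        }
        return bookies;
    }

    public static void main(String[] args) {
        int ensembleSize = 3; // stands in for "N"; illustrative value
        int writeQuorum = 2;  // managedLedgerDefaultWriteQuorum from the slide
        for (long entry = 0; entry < 4; entry++) {
            System.out.println("entry " + entry + " -> bookies "
                    + writeSet(entry, ensembleSize, writeQuorum));
        }
    }
}
```

Writes are spread over all bookies, which helps write throughput, but a sequential reader has to touch a different bookie for neighbouring entries, which is one way to read the "adds cost for read throughput" remark.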
13
● Error
○ PerChannelBookieClient - Add for failed on bookie bookkeeper-2:3181 code EIO
Bookie EIO Error
Root Cause: At peak load the write cache was not big enough to hold
accumulated data while waiting on the second cache flush
14
● Key Prometheus Metrics
○ Bookie
■ bookie_throttled_write_requests
■ bookie_rejected_write_requests
○ Broker
■ pulsar_ml_cache_hits_rate
■ pulsar_ml_cache_misses_rate
Bookie EIO Error
[Figure: two metric charts, labelled BAD and GOOD]
Key Lesson
The more we read from the Broker cache, the less we use
the Bookie ledger disk (enabling faster flush of write cache
→ ledger)
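One way to watch this lesson in practice is to turn the two broker metrics into a single hit ratio. The sketch below assumes the hit and miss rates have already been read from Prometheus; the sample values are hypothetical and the class name is illustrative.

```java
// Broker cache hit ratio from pulsar_ml_cache_hits_rate and pulsar_ml_cache_misses_rate.
final class CacheHitRatio {
    // Fraction of reads served from the broker cache; NaN when there were no reads.
    static double hitRatio(double hitsRate, double missesRate) {
        double total = hitsRate + missesRate;
        return total == 0.0 ? Double.NaN : hitsRate / total;
    }

    public static void main(String[] args) {
        double hitsRate = 950.0;  // hypothetical sample value
        double missesRate = 50.0; // hypothetical sample value
        System.out.printf("broker cache hit ratio: %.2f%n", hitRatio(hitsRate, missesRate));
    }
}
```

Every miss becomes a read against the bookies, so a falling ratio is an early signal that ledger disks are about to compete with write-cache flushes.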
15
● EBS drives for Journal & Ledger
○ GP3 with max settings 16000 IOPS, 1000 MB/s
● Broker Cache
○ managedLedgerCacheEvictionTimeThresholdMillis: "5000" (Default: 1000)
○ managedLedgerCacheSizeMB: "512" (Default: 20% of total direct Memory)
● Bookie
○ dbStorage_writeCacheMaxSizeMb: "3072" (Default: 25% of total direct memory)
○ dbStorage_rocksDB_blockCacheSize: "1073741824" (Default: 10% of total direct memory)
○ journalMaxGroupWaitMSec: "10" (Default: 1ms)
● Scaling approach
○ Scale-out Bookies
○ Scale-up and Scale-out Brokers
Key Scaling Settings . . .
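Because the bookie caches are allocated from direct memory, it can help to check that the explicit sizes fit the JVM's direct-memory limit. The sketch below uses the two bookie cache values from this slide; the 6 GiB limit is an assumption borrowed from the JVM flags on the Kubernetes slide (the deck does not say which component those flags apply to), and the class name is illustrative.

```java
// Checks that explicitly sized bookie caches fit inside the direct-memory limit.
final class BookieMemoryBudget {
    static long bytesToMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Combined size of the write cache (given in MiB) and the RocksDB block cache (given in bytes).
    static long cachesMiB(long writeCacheMiB, long blockCacheBytes) {
        return writeCacheMiB + bytesToMiB(blockCacheBytes);
    }

    public static void main(String[] args) {
        long writeCacheMiB = 3072L;          // dbStorage_writeCacheMaxSizeMb
        long blockCacheBytes = 1073741824L;  // dbStorage_rocksDB_blockCacheSize
        long directMemoryMiB = 6L * 1024L;   // assumption: -XX:MaxDirectMemorySize=6g
        long caches = cachesMiB(writeCacheMiB, blockCacheBytes);
        // Other direct-memory users (for example network buffers) are not counted here.
        System.out.println("caches: " + caches + " MiB, remaining direct memory: "
                + (directMemoryMiB - caches) + " MiB");
    }
}
```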
16
We’re not at millions yet but we’re seeing a trend . . . .
1) Simulated 300 users for about 18 hours with artificially short 1-minute calls
2) 500k topics created (250k Audio / 250k Signal Analytics)
Latest Results
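The reported figures can be turned into rates to get a feel for the metadata churn in this test. The sketch below treats "about 18 hours" as exactly 18 hours, which is an approximation, and relies on the two-topics-per-interval rule from the use-case slide; the class and method names are illustrative.

```java
// Rates implied by the simulated run: 500k topics, 300 users, about 18 hours.
final class TestRunRates {
    // Average topic creation rate over the run.
    static double topicsPerSecond(long topics, double hours) {
        return topics / (hours * 3600.0);
    }

    // Average intervals per simulated user per hour (two topics per interval).
    static double intervalsPerUserHour(long topics, int users, double hours) {
        return (topics / 2.0) / (users * hours);
    }

    public static void main(String[] args) {
        long topics = 500_000L;
        int users = 300;
        double hours = 18.0; // "about 18 hours", taken as exact here
        System.out.printf("%.1f topics created per second%n", topicsPerSecond(topics, hours));
        System.out.printf("%.1f intervals per user per hour%n", intervalsPerUserHour(topics, users, hours));
    }
}
```

That is on the order of 8 new topics per second sustained, and every one of them adds metadata in ZooKeeper and on the brokers.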
17
Observations: ZooKeeper
ZK JVM Heap demands increasing . . .
18
Observations: ZooKeeper
ZK Disk Usage Increasing . . .
Suppressed: java.io.IOException: No space left on device
at org.apache.zookeeper.server.SyncRequestProcessor$1.run(SyncRequestProcessor.java:135) [org.apache.pulsar-pulsar-zookeeper-2.6.1.jar:2.6.1]
at org.apache.zookeeper.server.ZooKeeperServer.takeSnapshot(ZooKeeperServer.java:312) [org.apache.pulsar-pulsar-zookeeper-2.6.1.jar:2.6.1]
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.save(FileTxnSnapLog.java:406) ~[org.apache.pulsar-pulsar-zookeeper-2.6.1.jar:2.6.1]
.
.
[Snapshot Thread] ERROR org.apache.zookeeper.server.ZooKeeperServer - Severe unrecoverable error, exiting
19
Observations: ZooKeeper
ZK 99%ile response times increasing. . .
20
Observations: Broker
Broker Heap Increasing . . .
Topic Metadata here as well as in ZK
21
Implications
1) ZooKeeper
a) More Heap
b) More CPU for GC (and to avoid throttling during GC)
c) Watch ZooKeeper disk space /pulsar/data
2) Broker
a) More Heap
b) Maybe more CPU for GC (and to avoid throttling during GC)
c) Watch for Broker → ZK latency issues
i) zooKeeperSessionTimeoutMillis: "60000" (default: 30000)
ii) zooKeeperOperationTimeoutSeconds: "60" (default: 30)
22
Recap: Key Metrics for our Streaming Use-Case
23
Thanks
Cogito
Bruce, Hamid, Andy, Jimmy, George, Gibby, Kyle, Matt, Amanda, John,
Ian, Mihai, Luis, Anthony, Karl and many more
Pulsar Community
Addison, Sijie, Matteo, Joshua etc.
24
Thank you!
25
● Benchmarking Pulsar and Kafka - A More Accurate Perspective on Pulsar’s Performance
○ https://streamnative.io/en/blog/tech/2020-11-09-benchmark-pulsar-kafka-performance#maximum-throughput-test
● Taking a Deep-Dive into Apache Pulsar Architecture for Performance Tuning
○ https://streamnative.io/en/blog/tech/2021-01-14-pulsar-architecture-performance-tuning
● Understanding How Apache Pulsar Works
○ https://jack-vanlightly.com/blog/2018/10/2/understanding-how-apache-pulsar-works
References

Streaming Millions of Contact Center Interactions in (Near) Real-Time with Pulsar - Pulsar Summit NA 2021
