SlideShare una empresa de Scribd logo
1 de 31
Descargar para leer sin conexión
Cassandra Meetup
Monitoring C* Health at Scale
Jason Cacciatore @jasoncac
How do we assess health ?
● Node Level
- dmesg errors
- gossip status
- thresholds (heap, disk usage, …)
● Cluster Level – ring aggregate
Scope
● Production + Test
● Hundreds of clusters
● Over 9,000 nodes
Current Architecture
So… what’s the problem ?
● Cron system for monitoring is
problematic
○ No state, just snapshot
● Stream processor is a better fit
Mantis
● A reactive stream processing system that runs on Apache Mesos
○ Cloud native
○ Provides a flexible functional programming model
○ Supports job autoscaling
○ Deep integration into Netflix ecosystem
● Current Scale
○ ~350 jobs
○ 8 Million Messages/sec processed
○ 250 Gb/sec data processed
● QConNY 2015 talk on Netflix Mantis
Modeled as Mantis Jobs
Resources
Real-time Dashboard
How can I try this ?
● Priam - JMXNodeTool
● Stream Processor - Spark
THANK YOU
C* Gossip: the good, the bad
and the ugly
Minh Do @timiblossom
What is Gossip Protocol or Gossip?
● A peer-to-peer communication protocol in a
distributed system
● Inspired by the form of gossip seen in
human social networks
● Nodes spread out information to
whichever peers they can contact
● Used in C* primarily as a membership
protocol and information sharing
Gossip flow in C*
● At start, Gossiper loads seed
addresses from configuration
file into the gossip list
● doShadowRound on seeds
● Every 1s, gossip up to 3 nodes
from the peers: a random peer,
a seed peer, and unreachable
peer
C* Gossip round in 3 stages
How Gossip Helps C*?
● Discover cluster topology (DC, Rack)
● Discover token owners
● Figure out peer statuses:
○ moving
○ leaving/left
○ normal
○ down
○ bootstrapping
● Exchange Schema version
● Share Load (used disk space)/Severity (CPU)
● Share Release version/Net version
What Gossip does not do for C*?
● Detect crashes in Thrift or Native servers
● Manage cluster (need Priam or OpsCenter)
● Collect performance metrics: latencies, RPS,
JVM stats, network stats, etc.
● Give C* admins a good sleep.
Gossip race condition
Most of Gossip issues and bugs are caused by incorrect code
logics in handling race conditions
● The larger the cluster the higher chance of having race
condition
● There are several C* components running in different threads
that can affect gossip status:
○ Gossiper
○ FailureDetector
○ Snitches
○ StorageService
○ InboundTcpConnection
○ OutboundTcpConnection
Pain - Gossip Can Inflict on C*
● CASSANDRA-6125 An example to show race condition
● CASSANDRA-10298 Gossiper does not clean out
metadata on a dead peer properly to cause a dead peer
to stay in ring forever
● CASSANDRA-10371 Dead nodes remain in gossip to
prevent a replacement due to FailureDetector unable to
evict a down node
● CASSANDRA-8072 Unable to gossip to
any seeds
● CASSANDRA-8336 Shut down issue
that peers resurrect a down node
Pain - Gossip Can Inflict on C*, cont.
● CASSANDRA-8072 and CASSANDRA-7292
Problems on reusing IP of a dead node on a new node
● CASSANDRA-10969 Long running cluster (over 1yr) has
restarting issue
● CASSANDRA-8768 Upgrading issue to newer version
● CASSANDRA-10321 Gossip to dead nodes caused CPU
usage to be 100%
● A lemon node or AWS network issue to cause one node
not to see the other to display a confusing gossip view
What can we do?
● Rolling restart C* cluster once in awhile
● On AWS, when there is an gossip issue,
try reboot. If still have bad gossip view,,
replace with a new instance
● Node assassination (unsafe and
need a repair/clean-up)
● Monitor network activities to take
pre-emptive actions
● Search community for the issues
reported in the system logs
● Fix it yourself
● Pray
THANK YOU
References
● https://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf
● https://wiki.apache.org/cassandra/ArchitectureGossip
Cassandra Tickler
@chriskalan
When does repair fall down?
● Running LCS on an old version of C*
● Space issues
● Repair gets stuck
Solution - Cassandra Tickler
Solution - Cassandra Tickler
Solution - Cassandra Tickler
https://github.com/ckalantzis/cassTickler
THANK YOU

Más contenido relacionado

La actualidad más candente

Cassandra summit 2013 how not to use cassandra
Cassandra summit 2013  how not to use cassandraCassandra summit 2013  how not to use cassandra
Cassandra summit 2013 how not to use cassandra
Axel Liljencrantz
 

La actualidad más candente (19)

Beginning Operations: 7 Deadly Sins for Apache Cassandra Ops
Beginning Operations: 7 Deadly Sins for Apache Cassandra OpsBeginning Operations: 7 Deadly Sins for Apache Cassandra Ops
Beginning Operations: 7 Deadly Sins for Apache Cassandra Ops
 
Scylla Summit 2018: Consensus in Eventually Consistent Databases
Scylla Summit 2018: Consensus in Eventually Consistent DatabasesScylla Summit 2018: Consensus in Eventually Consistent Databases
Scylla Summit 2018: Consensus in Eventually Consistent Databases
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
 
Mantis qcon nyc_2015
Mantis qcon nyc_2015Mantis qcon nyc_2015
Mantis qcon nyc_2015
 
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
The Missing Manual for Leveled Compaction Strategy (Wei Deng & Ryan Svihla, D...
 
Cassandra Summit 2014: Diagnosing Problems in Production
Cassandra Summit 2014: Diagnosing Problems in ProductionCassandra Summit 2014: Diagnosing Problems in Production
Cassandra Summit 2014: Diagnosing Problems in Production
 
Cassandra Core Concepts - Cassandra Day Toronto
Cassandra Core Concepts - Cassandra Day TorontoCassandra Core Concepts - Cassandra Day Toronto
Cassandra Core Concepts - Cassandra Day Toronto
 
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionCassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
 
Cassandra summit 2013 how not to use cassandra
Cassandra summit 2013  how not to use cassandraCassandra summit 2013  how not to use cassandra
Cassandra summit 2013 how not to use cassandra
 
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
Scylla Summit 2018: Make Scylla Fast Again! Find out how using Tools, Talent,...
 
Mesosphere and Contentteam: A New Way to Run Cassandra
Mesosphere and Contentteam: A New Way to Run CassandraMesosphere and Contentteam: A New Way to Run Cassandra
Mesosphere and Contentteam: A New Way to Run Cassandra
 
How to Monitor and Size Workloads on AWS i3 instances
How to Monitor and Size Workloads on AWS i3 instancesHow to Monitor and Size Workloads on AWS i3 instances
How to Monitor and Size Workloads on AWS i3 instances
 
Webinar: Getting Started with Apache Cassandra
Webinar: Getting Started with Apache CassandraWebinar: Getting Started with Apache Cassandra
Webinar: Getting Started with Apache Cassandra
 
DynamoDB at HasOffers
DynamoDB at HasOffers DynamoDB at HasOffers
DynamoDB at HasOffers
 
Seattle Cassandra Meetup - HasOffers
Seattle Cassandra Meetup - HasOffersSeattle Cassandra Meetup - HasOffers
Seattle Cassandra Meetup - HasOffers
 
Apache Cassandra in the Real World
Apache Cassandra in the Real WorldApache Cassandra in the Real World
Apache Cassandra in the Real World
 
Scylla Summit 2018: Scylla Feature Talks - Scylla Streaming and Repair Updates
Scylla Summit 2018: Scylla Feature Talks - Scylla Streaming and Repair UpdatesScylla Summit 2018: Scylla Feature Talks - Scylla Streaming and Repair Updates
Scylla Summit 2018: Scylla Feature Talks - Scylla Streaming and Repair Updates
 
Scylla Summit 2018: Keeping Your Latency SLAs No Matter What!
Scylla Summit 2018: Keeping Your Latency SLAs No Matter What!Scylla Summit 2018: Keeping Your Latency SLAs No Matter What!
Scylla Summit 2018: Keeping Your Latency SLAs No Matter What!
 

Destacado

Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
DataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
DataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
DataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
DataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
DataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
DataStax Academy
 

Destacado (20)

Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Apache Cassandra and Drivers
Apache Cassandra and DriversApache Cassandra and Drivers
Apache Cassandra and Drivers
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Netflix Operational Simplicity with Apache Cassandra
Netflix Operational Simplicity with Apache CassandraNetflix Operational Simplicity with Apache Cassandra
Netflix Operational Simplicity with Apache Cassandra
 
Getting Started with Graph Databases
Getting Started with Graph DatabasesGetting Started with Graph Databases
Getting Started with Graph Databases
 
Analytics with Spark and Cassandra
Analytics with Spark and CassandraAnalytics with Spark and Cassandra
Analytics with Spark and Cassandra
 

Similar a Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python

Scaling Cassandra for Big Data
Scaling Cassandra for Big DataScaling Cassandra for Big Data
Scaling Cassandra for Big Data
DataStax Academy
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High load
Krivoy Rog IT Community
 
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
Edge AI and Vision Alliance
 

Similar a Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python (20)

Scaling Cassandra for Big Data
Scaling Cassandra for Big DataScaling Cassandra for Big Data
Scaling Cassandra for Big Data
 
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache KafkaStrata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
Strata+Hadoop 2017 San Jose: Lessons from a year of supporting Apache Kafka
 
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightHow Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 
Looking towards an official cassandra sidecar netflix
Looking towards an official cassandra sidecar   netflixLooking towards an official cassandra sidecar   netflix
Looking towards an official cassandra sidecar netflix
 
20230511 - PGConf Nepal - Clustering in PostgreSQL_ Because one database serv...
20230511 - PGConf Nepal - Clustering in PostgreSQL_ Because one database serv...20230511 - PGConf Nepal - Clustering in PostgreSQL_ Because one database serv...
20230511 - PGConf Nepal - Clustering in PostgreSQL_ Because one database serv...
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Scalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBERScalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBER
 
Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...
Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...
Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...
 
Server fleet management using Camunda by Akhil Ahuja
Server fleet management using Camunda by Akhil AhujaServer fleet management using Camunda by Akhil Ahuja
Server fleet management using Camunda by Akhil Ahuja
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High load
 
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
 
An Introduction to Apache Cassandra
An Introduction to Apache CassandraAn Introduction to Apache Cassandra
An Introduction to Apache Cassandra
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Netty training
Netty trainingNetty training
Netty training
 
Five Lessons in Distributed Databases
Five Lessons  in Distributed DatabasesFive Lessons  in Distributed Databases
Five Lessons in Distributed Databases
 
Kraken mesoscon 2018
Kraken mesoscon 2018Kraken mesoscon 2018
Kraken mesoscon 2018
 
Debugging data pipelines @OLA by Karan Kumar
Debugging data pipelines @OLA by Karan KumarDebugging data pipelines @OLA by Karan Kumar
Debugging data pipelines @OLA by Karan Kumar
 
Taskerman - a distributed cluster task manager
Taskerman - a distributed cluster task managerTaskerman - a distributed cluster task manager
Taskerman - a distributed cluster task manager
 
Netty training
Netty trainingNetty training
Netty training
 

Más de DataStax Academy

Más de DataStax Academy (9)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Cassandra Data Maintenance with Spark
Cassandra Data Maintenance with SparkCassandra Data Maintenance with Spark
Cassandra Data Maintenance with Spark
 
Make 2016 your year of SMACK talk
Make 2016 your year of SMACK talkMake 2016 your year of SMACK talk
Make 2016 your year of SMACK talk
 
Client Drivers and Cassandra, the Right Way
Client Drivers and Cassandra, the Right WayClient Drivers and Cassandra, the Right Way
Client Drivers and Cassandra, the Right Way
 
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat...
 
Traveler's Guide to Cassandra
Traveler's Guide to CassandraTraveler's Guide to Cassandra
Traveler's Guide to Cassandra
 
Cassandra: One (is the loneliest number)
Cassandra: One (is the loneliest number)Cassandra: One (is the loneliest number)
Cassandra: One (is the loneliest number)
 
Spark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and FurureSpark Cassandra Connector: Past, Present and Furure
Spark Cassandra Connector: Past, Present and Furure
 
New features in 3.0
New features in 3.0New features in 3.0
New features in 3.0
 

Último

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Último (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python

  • 2. Monitoring C* Health at Scale Jason Cacciatore @jasoncac
  • 3. How do we assess health ? ● Node Level - dmesg errors - gossip status - thresholds (heap, disk usage, …) ● Cluster Level – ring aggregate
  • 4. Scope ● Production + Test ● Hundreds of clusters ● Over 9,000 nodes
  • 6. So… what’s the problem ? ● Cron system for monitoring is problematic ○ No state, just snapshot ● Stream processor is a better fit
  • 7. Mantis ● A reactive stream processing system that runs on Apache Mesos ○ Cloud native ○ Provides a flexible functional programming model ○ Supports job autoscaling ○ Deep integration into Netflix ecosystem ● Current Scale ○ ~350 jobs ○ 8 Million Messages/sec processed ○ 250 Gb/sec data processed ● QConNY 2015 talk on Netflix Mantis
  • 11.
  • 12. How can I try this ? ● Priam - JMXNodeTool ● Stream Processor - Spark
  • 14. C* Gossip: the good, the bad and the ugly Minh Do @timiblossom
  • 15. What is Gossip Protocol or Gossip? ● A peer-to-peer communication protocol in a distributed system ● Inspired by the form of gossip seen in human social networks ● Nodes spread out information to whichever peers they can contact ● Used in C* primarily as a membership protocol and information sharing
  • 16. Gossip flow in C* ● At start, Gossiper loads seed addresses from configuration file into the gossip list ● doShadowRound on seeds ● Every 1s, gossip up to 3 nodes from the peers: a random peer, a seed peer, and unreachable peer
  • 17. C* Gossip round in 3 stages
  • 18. How Gossip Helps C*? ● Discover cluster topology (DC, Rack) ● Discover token owners ● Figure out peer statuses: ○ moving ○ leaving/left ○ normal ○ down ○ bootstrapping ● Exchange Schema version ● Share Load (used disk space)/Severity (CPU) ● Share Release version/Net version
  • 19. What Gossip does not do for C*? ● Detect crashes in Thrift or Native servers ● Manage cluster (need Priam or OpsCenter) ● Collect performance metrics: latencies, RPS, JVM stats, network stats, etc. ● Give C* admins a good sleep.
  • 20. Gossip race condition Most of Gossip issues and bugs are caused by incorrect code logics in handling race conditions ● The larger the cluster the higher chance of having race condition ● There are several C* components running in different threads that can affect gossip status: ○ Gossiper ○ FailureDetector ○ Snitches ○ StorageService ○ InboundTcpConnection ○ OutboundTcpConnection
  • 21. Pain - Gossip Can Inflict on C* ● CASSANDRA-6125 An example to show race condition ● CASSANDRA-10298 Gossiper does not clean out metadata on a dead peer properly to cause a dead peer to stay in ring forever ● CASSANDRA-10371 Dead nodes remain in gossip to prevent a replacement due to FailureDetector unable to evict a down node ● CASSANDRA-8072 Unable to gossip to any seeds ● CASSANDRA-8336 Shut down issue that peers resurrect a down node
  • 22. Pain - Gossip Can Inflict on C*, cont. ● CASSANDRA-8072 and CASSANDRA-7292 Problems on reusing IP of a dead node on a new node ● CASSANDRA-10969 Long running cluster (over 1yr) has restarting issue ● CASSANDRA-8768 Upgrading issue to newer version ● CASSANDRA-10321 Gossip to dead nodes caused CPU usage to be 100% ● A lemon node or AWS network issue to cause one node not to see the other to display a confusing gossip view
  • 23. What can we do? ● Rolling restart C* cluster once in awhile ● On AWS, when there is an gossip issue, try reboot. If still have bad gossip view,, replace with a new instance ● Node assassination (unsafe and need a repair/clean-up) ● Monitor network activities to take pre-emptive actions ● Search community for the issues reported in the system logs ● Fix it yourself ● Pray
  • 27. When does repair fall down? ● Running LCS on an old version of C* ● Space issues ● Repair gets stuck
  • 30. Solution - Cassandra Tickler https://github.com/ckalantzis/cassTickler