…And Metrics For All
Paul O’Connor
github.com/pauloconnor
2015-05-19
About Yelp
Founded: 2004
Monthly Active Users: ~142 Million
Non-US Monthly Users: ~31 Million
Reviews: ~77 Million
Local Businesses: 2.1 Million
Territories: Available in 31 countries
What are metrics?
Name Value Timestamp
server1.load.1m 28.826667 1431950640
server1.load.1m 29.188333 1431950700
server1.load.1m 29.231667 1431950760
server1.load.1m 29.083333 1431950820
server1.load.1m 29.710000 1431950880
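Each row above maps onto one line of Carbon's plaintext protocol: name, value and Unix timestamp separated by spaces and terminated by a newline. A minimal sketch in Python, assuming a Carbon cache reachable on its plaintext port (2003 by default); the host name and the helpers `format_metric` and `send_metric` are illustrative and not part of Graphite:

```python
import socket
import time


def format_metric(name, value, timestamp=None):
    """Build one plaintext-protocol line: name, value, timestamp, newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (name, value, timestamp)


def send_metric(line, host="graphite.example.com", port=2003):
    """Send one line over TCP. The host is a placeholder (assumption)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


line = format_metric("server1.load.1m", 28.826667, 1431950640)
print(line, end="")
```

Only `format_metric` runs here; `send_metric` needs a reachable Carbon listener before it can be called.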
Graphite Components
• Carbon:
• relay
• cache
• aggregator
• Whisper
• Web app
Carbon Relay
• Deals with 2 things
• Replication
• Sharding
Relay Methods
• Rules
• [replicate]
• pattern = ^services\.ads\..+
• servers = 10.1.2.3, 10.2.2.3
• continue = true
• Consistent Hashing
• Defines a sharding strategy across multiple backends
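The sharding idea behind consistent hashing can be sketched as a hash ring: each backend owns many points on a ring, and a metric goes to the first backend point at or after its own hash. This is a generic illustration, not carbon's own implementation; the class name `HashRing`, the backend names and the replica count are assumptions.

```python
import bisect
import hashlib


class HashRing:
    """A generic consistent-hash ring (illustrative; not carbon's own code)."""

    def __init__(self, nodes, replicas=100):
        self._ring = []  # (position, node) pairs, sorted by position
        for node in nodes:
            for i in range(replicas):
                self._ring.append((self._position("%s:%d" % (node, i)), node))
        self._ring.sort()
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _position(key):
        # Map a string to an integer position on the ring.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def get_node(self, metric_name):
        """Return the node owning the first point at or after the metric."""
        index = bisect.bisect_left(self._positions, self._position(metric_name))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
for name in ("server1.load.1m", "server2.load.1m", "services.ads.requests"):
    print(name, "->", ring.get_node(name))
```

The useful property is that adding a backend only moves the metrics that land on the new backend; everything else stays where it was.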
Carbon Cache
• Receives metrics and persists them to disk
• Writes based on storage schemas
Storage Schemas
• Details retention rates for storing metrics
[databases_10sec_1year]
pattern = ^servers.db.*$
retentions = 10s:7d,1m:30d,5m:90d,30m:365d
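To see what the retentions line above costs, each `precision:duration` pair can be turned into a number of points. The sketch below does that in Python; the size estimate assumes a 16-byte file header, 12 bytes per archive header and 12 bytes per point, which is my reading of Whisper's on-disk layout and should be checked against your version. The helper names are hypothetical.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def parse_duration(text):
    """Convert a string such as '10s' or '7d' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", text)
    if match is None:
        raise ValueError("unsupported duration: %r" % text)
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def parse_retentions(spec):
    """Return a list of (seconds_per_point, number_of_points) archives."""
    archives = []
    for part in spec.split(","):
        precision, duration = part.strip().split(":")
        step = parse_duration(precision)
        archives.append((step, parse_duration(duration) // step))
    return archives


archives = parse_retentions("10s:7d,1m:30d,5m:90d,30m:365d")
total_points = sum(points for _, points in archives)
# Assumption: 16-byte header, 12 bytes per archive header, 12 bytes per point.
estimated_bytes = 16 + 12 * len(archives) + 12 * total_points
print(archives)       # [(10, 60480), (60, 43200), (300, 25920), (1800, 17520)]
print(total_points)   # 147120
print(estimated_bytes)
```

Under those assumptions every metric matching this schema takes roughly 1.7 MB on disk, allocated up front.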
Storage Aggregation
• Rules for aggregating data to lower-precision retentions
[all_min]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min
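The two settings above work together when a window of high-precision points is rolled up into one lower-precision point: xFilesFactor is the minimum fraction of points that must be present, and aggregationMethod decides how the present points are combined. A minimal sketch of that logic, with `rollup` as a hypothetical helper rather than Whisper's own code:

```python
def rollup(points, method="min", x_files_factor=0.1):
    """Aggregate one window of higher-precision points into a single value.

    `points` may contain None for missing samples. Returns None when the
    fraction of known points is below x_files_factor.
    """
    known = [p for p in points if p is not None]
    if not known or len(known) / len(points) < x_files_factor:
        return None
    methods = {
        "min": min,
        "max": max,
        "sum": sum,
        "average": lambda values: sum(values) / len(values),
        "last": lambda values: values[-1],
    }
    return methods[method](known)


# Six 10-second points rolled up into one 1-minute point.
window = [29.1, None, None, None, None, None]
print(rollup(window, "min", 0.1))   # one point in six is enough at 0.1
print(rollup(window, "min", 0.5))   # not enough known points at 0.5
```

A low xFilesFactor keeps sparse metrics visible in the coarser archives, at the price of rollups built from very few samples.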
Carbon Aggregator
• Buffers metrics before forwarding to carbon cache
• Roll up metrics based on rules
Aggregation Rules
• Not to be confused with storage aggregation
• Tells the carbon aggregator what to aggregate and how
output_template (frequency) = method input_pattern
<env>.applications.<app>.all.requests (60) = sum <env>.applications.<app>.*.requests
Input metrics:
prod.applications.apache.www01.requests
prod.applications.apache.www02.requests
prod.applications.apache.www03.requests
prod.applications.apache.www04.requests
prod.applications.apache.www05.requests
Output metric:
prod.applications.apache.all.requests
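The rule above can be read as: capture `<env>` and `<app>` from each incoming name, ignore the host segment, and sum everything that maps to the same output name. A rough Python sketch of that matching and summing, assuming `<field>` and `*` each match exactly one dot-separated segment; the function names are hypothetical and the 60-second buffering is left out:

```python
import re
from collections import defaultdict


def compile_input_pattern(pattern):
    """Turn an aggregator input pattern into a regex with named groups."""
    parts = []
    for segment in pattern.split("."):
        field = re.fullmatch(r"<(\w+)>", segment)
        if field:
            parts.append("(?P<%s>[^.]+)" % field.group(1))
        elif segment == "*":
            parts.append("[^.]+")
        else:
            parts.append(re.escape(segment))
    return re.compile(r"\.".join(parts) + "$")


def render_output(template, fields):
    """Fill the <field> placeholders of the output template."""
    return re.sub(r"<(\w+)>", lambda m: fields[m.group(1)], template)


def aggregate_sum(samples, input_pattern, output_template):
    """Sum (name, value) samples per rendered output name."""
    regex = compile_input_pattern(input_pattern)
    totals = defaultdict(float)
    for name, value in samples:
        match = regex.match(name)
        if match:
            totals[render_output(output_template, match.groupdict())] += value
    return dict(totals)


samples = [("prod.applications.apache.www0%d.requests" % i, float(i)) for i in range(1, 6)]
print(aggregate_sum(samples,
                    "<env>.applications.<app>.*.requests",
                    "<env>.applications.<app>.all.requests"))
```

Five per-host series go in and a single `all` series comes out, which is what keeps the number of files on the caches down.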
Whisper
• Fixed size database
• Allows for roll ups
• Allows for backfilling data
Web App
• Django based app for rendering graphs
Putting it all together
• Carbon cache listening on port 2003
• Writes metrics to disk
• Web app reads them back for graphing
Getting more complicated
• Carbon relay using consistent hashing to multiple caches
• Individual caches responsible for specific metrics
More Relays
• Use HAProxy to load balance between relays
• Run more relay processes to make use of more CPU cores
Even more relays
• Useful for sending metrics to other locations
Replicate the metrics
• Duplicate your metrics for backup and redundancy
More caches instead
• Consistent hash across multiple nodes
Where does the aggregator fit?
• The aggregator uses a lot of CPU. Put it on its own node
Scaling further
• Use nodes for particular functions:
• Use forwarding relay nodes solely to forward
• Have consistent hashing nodes
• Have aggregation nodes
Getting your data back out
• Graphite Dashboard
• Third Party Dashboard
• We use Grafana http://grafana.org/
• Graphite-api https://github.com/brutasse/graphite-api
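Whichever front end is used, the data comes back through Graphite's render API, which takes `target`, `from` and `format` query parameters. A small sketch of building such a URL in Python; the base URL is a placeholder and `render_url` is a hypothetical helper:

```python
from urllib.parse import urlencode


def render_url(base_url, target, since="-1h", fmt="json"):
    """Build a Graphite render API URL for one target."""
    query = urlencode([("target", target), ("from", since), ("format", fmt)])
    return "%s/render?%s" % (base_url.rstrip("/"), query)


# graphite.example.com is a placeholder host (assumption).
url = render_url("http://graphite.example.com",
                 "prod.applications.apache.all.requests")
print(url)
```

Fetching that URL with any HTTP client returns the series as JSON for the last hour.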
Tips
• Aggregate before ingestion
• Control the metrics that can be sent
• Metrics are a gas - they expand to fill all available room
• Use a C implementation of carbon
• Use the latest web app
Optimize your dashboard queries
• services.biz_app.*.*.timers.pyramid_uwsgi_metrics_tweens_*.p99
• 2154 results
• 35 seconds to just find these files on disk
• Running functions against these results
• Timeout after a minute
• Dashboard automatically refreshing every 10 seconds
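One way to catch a query like this before it reaches a dashboard is to count how many series a wildcard pattern fans out to. The sketch below checks a pattern against a list of known metric names, matching segment by segment so that `*` never crosses a dot. The metric names are made up for illustration and `fan_out` is a hypothetical helper, not a Graphite API:

```python
from fnmatch import fnmatchcase


def matches(pattern, metric_name):
    """Match a dotted metric name segment by segment."""
    pattern_parts = pattern.split(".")
    name_parts = metric_name.split(".")
    if len(pattern_parts) != len(name_parts):
        return False
    return all(fnmatchcase(n, p) for p, n in zip(pattern_parts, name_parts))


def fan_out(pattern, metric_names):
    """Return the metric names a wildcard pattern would expand to."""
    return [name for name in metric_names if matches(pattern, name)]


# Hypothetical metric names, for illustration only.
metric_names = [
    "services.biz_app.host1.worker1.timers.pyramid_uwsgi_metrics_tweens_a.p99",
    "services.biz_app.host1.worker2.timers.pyramid_uwsgi_metrics_tweens_b.p99",
    "services.biz_app.host2.worker1.timers.pyramid_uwsgi_metrics_tweens_a.p50",
    "services.other_app.host1.worker1.timers.pyramid_uwsgi_metrics_tweens_a.p99",
]
pattern = "services.biz_app.*.*.timers.pyramid_uwsgi_metrics_tweens_*.p99"
print(len(fan_out(pattern, metric_names)))
```

Every extra wildcard multiplies the result count, so a pattern with three of them can reach thousands of files.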
What’s the Future?
• InfluxDB
• Cassandra
• Third party
We’re hiring!
http://www.yelp.com/careers
Hiring SREs in Dublin, London, New York, San Francisco

 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 

Scaling Graphite At Yelp

  • 1. …And Metrics For All Paul O’Connor github.com/pauloconnor 2015-05-19
  • 2. About Yelp Founded: 2004 Monthly Active Users: ~142 Million Non-US Monthly Users: ~31 Million Review: ~77 Million Local Businesses: 2.1 Million Territories: Available in 31 countries
  • 4. What are metrics? Name Value Timestamp
  • 5. What are metrics? Name Value Timestamp server1.load.1m 28.826667 1431950640
  • 6. What are metrics? Name Value Timestamp server1.load.1m 28.826667 1431950640 server1.load.1m 29.188333 1431950700 server1.load.1m 29.231667 1431950760 server1.load.1m 29.083333 1431950820 server1.load.1m 29.710000 1431950880
  • 7. What are metrics? Name Value Timestamp server1.load.1m 28.826667 1431950640 server1.load.1m 29.188333 1431950700 server1.load.1m 29.231667 1431950760 server1.load.1m 29.083333 1431950820 server1.load.1m 29.710000 1431950880
  • 8. Graphite Components • Carbon: • relay • cache • aggregator • Whisper • Web app
  • 9. Carbon Relay • Deals with 2 things • Replication • Sharding
  • 10. Relay Methods • Rules • [replicate] • pattern = ^services.ads..+ • servers = 10.1.2.3, 10.2.2.3 • continue = true • Consistent Hashing • Defines a sharding strategy across multiple backends
  • 11. Carbon Cache • Receives metrics and persists them to disk • Writes based on storage schemas
  • 12. Storage Schemas • Details retention rates for storing metrics [databases_10sec_1year] pattern = ^servers.db.*$ retentions = 10s:7d,1m:30d,5m:90d,30m:365d
  • 13. Storage Aggregation • Rules for aggregating data to lower-precision retentions [all_min] pattern = .min$ xFilesFactor = 0.1 aggregationMethod = min
  • 14. Carbon Aggregator • Buffers metrics before forwarding to carbon cache • Roll up metrics based on rules
  • 15. Aggregation Rules • Not to be confused with storage aggregation • Tells the carbon aggregator what to aggregate and how output_template (frequency) = method input_pattern <env>.applications.<app>.all.requests (60) = sum <env>.applications.<app>.*.requests prod.applications.apache.www01.requests prod.applications.apache.www02.requests prod.applications.apache.www03.requests prod.applications.apache.www04.requests prod.applications.apache.www05.requests prod.applications.apache.all.requests
  • 16. Whisper • Fixed size database • Allows for roll ups • Allows for backfilling data
  • 17. Web App • Django based app for rendering graphs
  • 18. Putting it all together • Carbon cache listening on port 2003 • Write to disk • Listen with web
  • 19. Getting more complicated • Carbon relay using consistent hashing to multiple caches • Individual caches responsible for specific metrics
  • 20. More Relays • Use HAProxy to load balance between relays • Use more relays to use CPU
  • 21. Even more relays • Useful for sending metrics to other locations
  • 22. Replicate the metrics • Duplicate your metrics for backup, and redundancy
  • 23. More caches instead • Consistent hash across multiple nodes
  • 24. Where does the aggregator fit? • Aggregator uses a lot of CPU. Put it on its own node
  • 25. Scaling further • Use nodes for particular functions: • Use forwarding relay nodes solely to forward • Have consistent hashing nodes • Have aggregation nodes
  • 26.
  • 27.
  • 28. Getting your data back out • Graphite Dashboard • Third Party Dashboard • We use Grafana http://grafana.org/ • Graphite-api https://github.com/brutasse/graphite-api
  • 29.
  • 30. Tips • Aggregate before ingestion • Control the metrics that can be sent • Metrics are a gas - they expand to fill all available room • Use C implementation of carbon • Use the latest webapp.
  • 31. Optimize your dashboard queries • services.biz_app.*.*.timers.pyramid_uwsgi_metrics_tweens_*.p99 • 2154 results • 35 seconds to just find these files on disk • Running functions against these results • Timeout after a minute • Dashboard automatically refreshing every 10 seconds
  • 32.
  • 33. What’s the Future? • InfluxDB • Cassandra • Third party
  • 34. We’re hiring! http://www.yelp.com/careers Hiring SREs in Dublin, London, New York, San Francisco

Editor's notes

  1. Hi, I’m Paul. I’m an SRE in Yelp’s Dublin office, where I’ve been for about a year. Today, I’m going to talk a bit about metrics in Yelp - in particular how we’ve scaled Graphite to handle over 12,000,000 metrics a minute.
  2. For those of you who don’t know, Yelp is a company that produces huge amounts of logs, and huge amounts of metrics, and also has a side business for finding and reviewing local businesses. Founded in 2004, about 142 million MAU of which 31 million are outside the US.
  3. So, let’s get started with the basics. What is a metric? Simply, it’s a name and a value. The problem with that is that it is only correct for the moment that the value is recorded but we don’t know when that was. Simple answer…
  4. Let’s add a timestamp to the value. Now we know what the metric’s value was, and when. This is getting useful now. Let’s look at an example
  5. I’ve colour coded these just to make it easier to follow along. We can see that we’re looking at a metric called server1.load.1m, which has a value of 28.8ish (Yes, I know this is high. This is the actual load average from one of the graphite nodes we use. More on this later). Finally, we have an Epoch timestamp. Now, a single data point on its own isn’t terribly useful, especially if you want to look for trends.
  6. Now we have five data points, spanning 5 minutes. We have some accurate historical data, so we can now see what this server’s load was. Unfortunately, numbers aren’t terribly wonderful at showing changes in data quickly.
  7. Let’s throw it into a graph, and immediately we can see what’s happening with our data. Now that we know what we’re storing, and how we want to present the data, let’s have a look at our solution - Graphite
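The name, value and timestamp layout from the slide can be sketched in a few lines of Python. This is only an illustration: `format_metric` is a hypothetical helper, and the six-decimal formatting is chosen to match the values shown on the slide.

```python
import time
from typing import Optional


def format_metric(name: str, value: float, timestamp: Optional[int] = None) -> str:
    """Render one data point as a 'name value timestamp' line."""
    if timestamp is None:
        # Default to the current time as an Epoch timestamp (whole seconds).
        timestamp = int(time.time())
    return f"{name} {value:f} {timestamp}\n"


# The first data point from the slide.
line = format_metric("server1.load.1m", 28.826667, 1431950640)
print(line, end="")
```

Lines of this shape are what a carbon daemon listening on its plaintext port would expect to receive, one data point per line.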
  8. Graphite is made up of three main components - The Carbon daemon, Whisper and the web app. The carbon daemon has three components in it, which we’ll go into separately.
  9. So, the relay is pretty simple. It does two simple things. It will forward received metrics to somewhere else, based on a set of rules, or it will forward them based on sharding, using a consistent hashing algorithm. This simply means that no matter which relay receives a given metric, it will always be forwarded to the same destination.
  10. Relay rules are fairly simple. A rule consists of 4 parts - a distinct name for the rule, a regex pattern for matching the metric, a comma separated list of destinations, and an optional flag telling carbon whether or not to continue processing rules once it matches on a metric. This is useful for splitting metrics between multiple nodes. A rule tells the relay daemon “If you see a metric that matches this regex, forward it to these destinations”. This is very useful for replicating data, or splitting data between multiple storage backends. With consistent hashing, carbon relay will shard the metrics across a list of backends. This is a nice way of scaling out the storage layer. We’ll cover this in detail shortly.
  11. The carbon cache is the daemon responsible for writing to disk. The cache will hold metrics in memory until it can write to disk in as efficient a manner as possible. When it writes the metrics to disk, it follows a storage schema which is configurable per metric name or type.
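To make the consistent hashing idea concrete, here is a minimal hash ring in Python. It is a sketch of the general technique, not carbon's own implementation: the `HashRing` class, the use of MD5, the replica count and the backend names are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """A minimal consistent hash ring (illustrative, not carbon's code)."""

    def __init__(self, nodes, replicas=100):
        # Place several virtual points per node on the ring so that
        # metrics spread more evenly across the backends.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [position for position, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def get_node(self, metric):
        # Walk clockwise to the first ring position after the metric's
        # hash, wrapping around at the end of the ring.
        index = bisect.bisect(self._keys, self._hash(metric)) % len(self._keys)
        return self._ring[index][1]


# Hypothetical backend names.
ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.get_node("server1.load.1m"))
```

Because the destination depends only on the metric name and the list of backends, any relay built with the same list picks the same backend for a given metric.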
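The rule matching described here can be sketched in Python as follows. The `RelayRule` class and `route` function are hypothetical, the catch-all rule and its address are invented for the example, and the dots in the slide's pattern are escaped on the assumption that they are meant literally.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class RelayRule:
    name: str
    pattern: str
    servers: List[str]
    continue_matching: bool = False


def route(metric: str, rules: List[RelayRule]) -> List[str]:
    """Collect the destinations of every rule the metric matches."""
    destinations: List[str] = []
    for rule in rules:
        if re.search(rule.pattern, metric):
            destinations.extend(rule.servers)
            if not rule.continue_matching:
                break  # stop at the first match unless told to continue
    return destinations


rules = [
    # The rule from the slide, with "continue = true".
    RelayRule("replicate", r"^services\.ads\..+", ["10.1.2.3", "10.2.2.3"], True),
    # Hypothetical catch-all rule, not from the talk.
    RelayRule("everything", r".*", ["10.9.9.9"]),
]

print(route("services.ads.clicks", rules))
print(route("servers.db1.load", rules))
```

With the continue flag set, an ads metric goes to both replication servers and still falls through to the later rule; other metrics only reach the catch-all destination.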
  12. It would be lovely to store all data points for all metrics for all time, but there’s a problem. Each data point takes 12 bytes, so if we received a metric every 10 seconds, that would be about 37MB per metric per year. Given a system with a million unique metrics, that would be 37TB of storage that’s fast enough to handle that many metrics. That’s expensive, and quite wasteful. Instead, Whisper and carbon cache can use different retention policies. A retention policy has three parts - a name for the policy, a regex pattern to match on, and the retention policy itself. This retention policy says that for any database server, we will store the metrics at a resolution of 10 seconds for 7 days, 1 minute for 30 days, 5 minutes for 90 days, and 30 minutes for 365 days. What does this actually mean though? When the carbon cache receives a metric within a 10 second window, it will store that metric as is. For seven days, there will be over 60,000 datapoints. As metrics slide outside the 7 day window, they will be taken in groups of 6 (6 groups of 10 seconds in a minute), and then processed so that the 6 datapoints become 1. Let’s talk about how these metrics are processed.
  13. As with everything else we’ve seen so far, Carbon lets you do whatever you want with your metrics. In this case, we can decide how we want to aggregate our metrics as we step from our 10 second resolution to 1 minute. Again, these rules have 4 parts - a name, a regex for matching the metrics, an xFilesFactor and an aggregation method. The xFilesFactor is an important option here. It defines what fraction of the points we are aggregating should be non-null in order to create a non-null metric. The aggregationMethod defines how the points should be aggregated. Options for this are sum, min, max, last, and average, with the default being average.
  14. And so, the last of our three carbon daemons is the aggregator. The aggregator runs alongside the relay, will accept metrics, and as the name suggests, will aggregate them based on a set of rules which we’ll talk about in a moment. This is really handy if you want to create totals across a number of nodes - for example, you could create a list of metrics for a cluster so you can easily see performance, egress and combined disk space before the metrics are written to disk. I’ll cover why this is a really useful feature shortly.
  15. Aggregation rules are quite simple, but they are very powerful. Don’t confuse them with storage aggregation rules though, which only deal with on disk aggregation. A rule is basically asking what it should write, how often, how to aggregate and from what. In this example above, we’re using two variables in the names - env and app. These variables will map to the input metric name based on the location within the name, so in position 0 we have env becoming prod, and in position 2, the app is apache. The new metric that we generate will therefore be called prod.applications.apache.all.requests. Because we have an asterisk in position 3 on the input pattern, this will aggregate all nodes that match. The final metric will be a sum of all matching metrics, forwarded to the carbon cache every 60 seconds. As I said, this is very powerful, but requires a lot of CPU to run.
  16. The storage mechanism for Graphite is the file format Whisper. It’s pretty close to a rewrite of RRD. Some downsides to Whisper - each datapoint is stored with its timestamp, rather than assuming position in file is the time it was created, and the file is fixed size so a metric that sends 1 datapoint once will take up the same disk space as a full metric
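The arithmetic in this note can be checked with a short Python sketch. It assumes the 12 bytes per data point quoted above, ignores any per-file header overhead, and uses hypothetical helper names (`to_seconds`, `parse_retentions`).

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
BYTES_PER_POINT = 12  # figure quoted in the talk


def to_seconds(spec):
    """Convert a spec such as '10s' or '7d' into seconds."""
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]


def parse_retentions(retentions):
    """Return a (step_seconds, number_of_points) pair per archive."""
    archives = []
    for part in retentions.split(","):
        precision, duration = part.strip().split(":")
        step = to_seconds(precision)
        archives.append((step, to_seconds(duration) // step))
    return archives


# Keeping every 10 second point for a full year.
full_year_points = to_seconds("365d") // to_seconds("10s")
full_year_bytes = full_year_points * BYTES_PER_POINT

# The retention policy from the slide.
archives = parse_retentions("10s:7d,1m:30d,5m:90d,30m:365d")
policy_bytes = sum(points for _, points in archives) * BYTES_PER_POINT

print(full_year_bytes)  # 37843200 bytes, about 37MB
print(archives[0][1])   # 60480 points in the 7 day archive
print(policy_bytes)     # 1765440 bytes, under 2MB
```

Under these assumptions the tiered policy needs roughly a twentieth of the space of a full year at 10 second resolution, which is the saving the retention policy is buying.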
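How the xFilesFactor and the aggregation method interact can be sketched in Python. The `aggregate` function is a hypothetical illustration of the behaviour described in this note, not Whisper's own code, and the default value used for `x_files_factor` is chosen only for the example.

```python
def aggregate(points, method="average", x_files_factor=0.5):
    """Roll a group of points up into one, or None if too few are known."""
    known = [p for p in points if p is not None]
    # Too few non-null points: the rolled-up value stays null.
    if not known or len(known) / len(points) < x_files_factor:
        return None
    if method == "sum":
        return sum(known)
    if method == "min":
        return min(known)
    if method == "max":
        return max(known)
    if method == "last":
        return known[-1]
    return sum(known) / len(known)  # average


# Six 10 second points rolling up into one 1 minute point.
window = [4.0, None, None, None, None, 2.5]
print(aggregate(window, "min", 0.1))  # 2.5: two of six known is enough at 0.1
print(aggregate(window, "min", 0.5))  # None: fewer than half are known
```

A low xFilesFactor such as the 0.1 on the slide keeps sparse metrics alive through roll-ups; a higher one trades that for more trustworthy aggregated points.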
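The rule from the slide can be mimicked in Python for a single 60 second window. This is a sketch of the matching and summing only: `template_to_regex` and `aggregate_sum` are hypothetical helpers, the request counts are made up, and the buffering and periodic flushing done by the real aggregator are left out.

```python
import re
from collections import defaultdict


def template_to_regex(input_pattern):
    """Turn '<env>.applications.<app>.*.requests' into a compiled regex."""
    parts = []
    for field in input_pattern.split("."):
        variable = re.fullmatch(r"<(\w+)>", field)
        if variable:
            parts.append(f"(?P<{variable.group(1)}>[^.]+)")
        elif field == "*":
            parts.append("[^.]+")
        else:
            parts.append(re.escape(field))
    return re.compile(r"\.".join(parts) + "$")


def aggregate_sum(points, input_pattern, output_template):
    """Sum (name, value) points from one window into output metric names."""
    regex = template_to_regex(input_pattern)
    totals = defaultdict(float)
    for name, value in points:
        match = regex.match(name)
        if match:
            # Fill <env> and <app> in the output name from the input name.
            output = re.sub(r"<(\w+)>", lambda f: match.group(f.group(1)), output_template)
            totals[output] += value
    return dict(totals)


# Made-up request counts for the five web nodes on the slide.
points = [(f"prod.applications.apache.www0{i}.requests", 10.0 * i) for i in range(1, 6)]
result = aggregate_sum(
    points,
    "<env>.applications.<app>.*.requests",
    "<env>.applications.<app>.all.requests",
)
print(result)  # {'prod.applications.apache.all.requests': 150.0}
```

Every incoming metric has to be tested against every rule's pattern, which gives some intuition for why the aggregator is so CPU hungry.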
  17. So, the last piece of the Graphite stack is the web app. It’s a Django web app, that reads from both the carbon caches, and the whisper files on disk.
  18. The very simplest graphite setup you can have is simply having carbon cache listening on TCP port 2003 (which is the standard graphite port), and writing all metrics directly to disk. This works fine for low amounts of metrics for testing. In this situation, you will be bound by disk io, unless you’re backed by SSD.
  19. Now, we’re bringing in the carbon relay to use consistent hashing between caches. Why would we do this? Queues and back pressure. When a carbon cache is waiting for the optimal time to write to disk, it may start dropping metrics. This is a decent way of doing things if you have plenty of CPU, RAM and disk IO to use. In this situation, you can scale out the carbon caches until you run out of CPU cores. Don’t forget, carbon is single threaded, so you’ll lose a cpu core to each process.
  20. OK, so we’re getting a bit more complex here now. Each of our carbon daemons has a queue, and these queues can fill quite quickly. Since CPU and RAM are cheaper than super speedy storage, we can offload a lot of work to those. The consistent hashing algorithm is quite computationally intensive, so splitting load across multiple relay nodes helps ensure performance stays good.
  21. So, you can see here now that we have another HAProxy layer, and another carbon relay layer. Let’s walk through the layers again, top to bottom. The top HAProxy layer receives metrics on TCP port 2003 (and 2004 for pickle) and forwards them round robin to the first layer of carbon relay daemons. This layer will forward metrics to destinations, based on rules. One of the destinations will be the next HAProxy layer, which will then round robin to the next carbon relay layer. That layer is responsible for consistent hashing, and forwards to the appropriate carbon cache, which writes to Whisper on disk, which is read by the webapp. With me so far? Excellent!
  22. So, since we have one server working, let’s spin up a second identical one, and start duplicating data. As you can see (hopefully), the first carbon relay layer on each box is writing to the second carbon relay on the second box. This means that no matter what server the metric comes into, it will be persisted onto both boxes. Think of this as RAID 1 - mirrored copies of the data. Obviously, this may not be the ideal solution for everyone. If you don’t particularly care about duplication, you can just use the second carbon relay layer, and consistent hash across both servers.
  23. If you don’t particularly care about duplication, you can just use the second carbon relay layer, and consistent hash across both servers. Think of this as RAID 0. You will get more storage, and more performance from your nodes, but if one of your nodes goes down, you lose the data stored on that node.
  24. Because the aggregator is so CPU intensive, I find it’s easier to move it onto its own node. This might sound expensive, but it will give you more metrics which will be useful. The flow in the above diagram is the first layer of relays in server1 forwards to the HAProxy layer on aggregation1. From there, the metric is round-robined to a relay which uses consistent hashing to write to a particular carbon aggregator. From there, the carbon aggregator will flush to the second HAProxy layer on server1 which will forward to the second layer of HAProxy, which in turn forwards onto the second layer of carbon relay which consistent hashes onto the cache. I like to have a carbon aggregator attached to every storage node I have.
  25. Scaling graphite is hard, and it’s expensive. Most people start with a single node, with a single cache and a single webapp. There are better ways of scaling. I had an issue where the forwarding relays were so overloaded, that they started dropping metrics, so people started seeing gaps in their metrics. Spinning up new nodes that existed just to forward metrics to destinations reduced the load on the storage nodes, and allowed for more carbon cache daemons so we could use the full performance of our storage card. Consistent hashing is an expensive operation but it’s stateless, so you can shard this function across many load balanced nodes. The storage node ideally will have 3 things running on it - carbon cache for writing to disk, the webapp to get the metrics back out, and memcached for storing generated metrics.
  26. This is a reasonably up to date diagram of the graphite infrastructure in Yelp. We have more relay nodes, and we don’t have the aggregators shown, but this is the bulk of the system. The two lower nodes are the power houses of the system. Both have dual 10 core 2.8GHz CPUs, 256GB RAM, and 3.2TB Fusion IO cards which are basically SSDs which sit on the PCIe bus. This allows us to record about 12,000,000 updates a second, across about 1,000,000 metrics.
  27. This was sent to me the day we started getting traffic in. Figured I needed a meme somewhere!
  28. So, we have all of our metrics stored safely on disk, we’re not dropping anything on the floor, and we’re not overloading the nodes. Excellent. How do we get the data back out? The default webapp dashboard is fine. It does a lot of work, it’s embeddable, and is powerful. Unfortunately, it’s not the prettiest thing in the world. We’ve settled on using Grafana. It’s an open source project, originally based on the Kibana code base for those of you who use the ELK stack. It’s a node based application which stores data in Elasticsearch. It’s very simple to get running, especially in Docker, and does a lot of very cool stuff. The last option is graphite-api for those of you who want quick lightweight access to the data. There are a number of drawbacks to graphite-api, which are listed on the GitHub page, but it can be very useful for servers where you don’t want to run Apache.
  29. So, we have a scaled system which works well for now. We’re growing, rather quickly. We’re getting more and more metrics daily, and we’ll need to revisit our metrics system. There are some very interesting tools coming down the line that move away from the python carbon daemons, and the whisper files. InfluxDB is a time series database designed for the sole purpose of storing metrics. There is already an ecosystem of tools built around it, including Grafana, and it is designed to be run on multiple nodes which helps with horizontal scaling. Cassandra is a well known database that can shard and scale easily. It’s reasonably mature, and is used by many metrics companies including Librato, and SignalFx. Again, there is a large ecosystem of tooling built around it, and it can plug into Graphite and the carbon daemons easily. The last option? Just pay someone else to do it. Sometimes, it’s just easier to offload the work onto a company who has a dedicated team, and knowledge, than to spend money on nodes, and an engineer to maintain it internally. Of course, this may not be an issue for everyone, but sometimes, outsourcing can be very beneficial.
  30. And of course, we’re hiring. We’re looking for people to join our site reliability team - we’ve got openings in Dublin, London, New York City and San Francisco