Graphite
Graphs for the modern age
Graphite basics
● Graphite generates graphs from timeseries
data
– Think MRTG or Cacti
– More flexible than those
● Written in Python
– This does impact performance
● Web based and easy to use
– For once, not a marketing buzzword
The church of Graphs
● Pattern Recognition
● Correlation
● Analytics
● Anomaly detection
Helpful Graphite features
● Out of order data insertion
● Ability to compare corresponding time periods
(time travel)
● Custom retention periods
Moving parts
● Relays
– Send data to correct backend store
● Pattern matching on metric names
● Consistent hashing
● Storage
– Flat, fixed size files
● These are created when the metric is first recorded
● Changing later is hard
● Webapp
– Django-based application offering a web API and a
JavaScript-based frontend
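
Why the storage files are fixed size can be sketched with a little arithmetic. A Whisper file holds a 16-byte header, a 12-byte descriptor per archive, and 12 bytes per point slot, so its size follows from the retention definition alone. The retention string below is a hypothetical example, not a setting taken from the talk.

```python
# Sizes from the Whisper on-disk layout: 16-byte header, 12 bytes per
# archive descriptor, 12 bytes per point (4-byte timestamp + 8-byte value).
METADATA_BYTES = 16
ARCHIVE_INFO_BYTES = 12
POINT_BYTES = 12

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 31536000}


def to_seconds(spec: str) -> int:
    """Convert a token such as '10s' or '7d' to seconds."""
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]


def whisper_file_size(retentions: str) -> int:
    """Return the file size in bytes for a 'precision:duration,...' string."""
    archives = [part.split(":") for part in retentions.split(",")]
    points = sum(
        to_seconds(duration) // to_seconds(precision)
        for precision, duration in archives
    )
    return METADATA_BYTES + ARCHIVE_INFO_BYTES * len(archives) + POINT_BYTES * points


if __name__ == "__main__":
    # Hypothetical retention: 10 s for 6 h, 1 min for 7 d, 10 min for 1 y.
    print(whisper_file_size("10s:6h,1m:7d,10m:1y"))
```

Every slot is allocated when the file is created, whether or not data ever arrives for it, which is also why changing the retention later means rewriting the file.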
Data output
● Web API
– Everything is an HTTP GET
– A number of functions for data manipulation
● Graphite offers outputs in multiple formats
– Graphical (PNG, SVG)
– Structured (JSON, CSV)
– Raw data
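
Since everything is an HTTP GET, a request is just a URL. The sketch below builds one for the render endpoint, asking for a series and the same series shifted by a week with the `timeShift` function. The host name and metric name are hypothetical placeholders.

```python
from urllib.parse import urlencode


def render_url(base: str, targets: list[str], start: str = "-1h",
               fmt: str = "json") -> str:
    """Build a GET URL for the Graphite render endpoint."""
    # A list of pairs lets the same parameter name (target) repeat.
    params = [("target", t) for t in targets]
    params += [("from", start), ("format", fmt)]
    return f"{base.rstrip('/')}/render?{urlencode(params)}"


if __name__ == "__main__":
    metric = "sys.web01.cpu.user"  # hypothetical metric name
    # Current values next to the same series from one week earlier.
    print(render_url("http://graphite.example.com",
                     [metric, f'timeShift({metric}, "1w")']))
```

Changing `fmt` to `png`, `svg`, `csv` or `raw` selects one of the other output formats listed above.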
Using Graphite
● Custom pages pulling in PNG images
– Just <img src="some url here">
● Using the default frontend
– For single, one off graphs
– Debugging problems
● Using builtin dashboards
– Users create their own dashboards
– Third party dashboard tools
● Using third party libraries
– JSON is nice for this
– Cubism, D3.js, rickshaw, etc
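
As a rough illustration of why JSON is convenient here: the render API returns a list of series, each with a `target` name and `datapoints` as `[value, timestamp]` pairs, where missing values are `null`. The sample payload and metric name below are made up for the example.

```python
import json

# Made-up payload in the shape returned by the render API with format=json.
SAMPLE = """
[{"target": "sys.web01.load.shortterm",
  "datapoints": [[0.42, 1400000000], [0.57, 1400000060], [null, 1400000120]]}]
"""


def latest_values(payload: str) -> dict:
    """Map each target to its most recent non-null value (or None)."""
    result = {}
    for series in json.loads(payload):
        values = [v for v, _ts in series["datapoints"] if v is not None]
        result[series["target"]] = values[-1] if values else None
    return result


if __name__ == "__main__":
    print(latest_values(SAMPLE))
```

Skipping nulls matters in practice because the newest slot is often still empty when the request is made.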
Using Graphite
● API
– Monitoring
– Runtime performance tuning
● Postmortem analytics
● Performance debugging
Making Graphite scale
● Original setup
– Small cluster
● Two frontend boxes, two backend
– RAID 1+0 with 4 spinning disks
● This works well, with about 200 machines
– All those individual files force a lot of seeks
Scaling out - try 1
● Add more backend boxes
– Manual rules to split traffic
– Pattern matching based on metric names
● Balancing traffic is hard
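
The manual approach can be sketched as an ordered list of regular expressions, where the first match decides the backend. The patterns and backend names below are hypothetical and are not the rules used in the talk.

```python
import re

# Hypothetical rules: the first matching pattern wins.
RULES = [
    (re.compile(r"^sys\.web\d+\."), "backend-a:2004"),
    (re.compile(r"^sys\.db\d+\."), "backend-b:2004"),
    (re.compile(r"^user\."), "backend-c:2004"),
]
DEFAULT = "backend-a:2004"  # catch-all for names no rule covers


def route(metric: str) -> str:
    """Return the backend that should store the given metric name."""
    for pattern, destination in RULES:
        if pattern.search(metric):
            return destination
    return DEFAULT


if __name__ == "__main__":
    for name in ("sys.web01.cpu.user", "sys.db12.disk.io", "shop.orders"):
        print(name, "->", route(name))
```

How much load each backend receives depends entirely on how many metrics happen to match each rule, so every new naming pattern can shift the balance.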
Scaling up
● Replace spinning disks with SSDs
● Massive performance improvement due to
more IOPS
– Still not as much as we needed
● Losing an SSD meant we had a box die
– This has been fixed
● SSDs are not as reliable as spinning rust
– SSDs last between 12 and 14 months
Sharding – take II
● At about 10 storage servers, manually
maintaining regular expressions became
painful
● Keeping disk usage balanced was even
harder
– Anyone is allowed to create graphs
Sharding - take II
● Replace regular expressions with consistent
hashing
● Switch to RAID 0
– We have switched back to RAID 1
● Store data on two nodes in each ring
● Mirror rings in datacenters
● Shuffle metrics to avoid losing data and disk
space.
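
A minimal sketch of consistent hashing with two copies per metric, assuming a generic ring with virtual nodes. This is an illustration only, not the exact algorithm used by carbon or carbon-c-relay, and the node names are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring many times to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]
        self._node_count = len(set(nodes))

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)

    def get_nodes(self, metric: str, replicas: int = 2):
        """Return `replicas` distinct nodes, walking clockwise from the key."""
        wanted = min(replicas, self._node_count)
        start = bisect.bisect(self._keys, self._hash(metric))
        chosen = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == wanted:
                    break
        return chosen


if __name__ == "__main__":
    ring = HashRing(["store1", "store2", "store3"])  # hypothetical node names
    print(ring.get_nodes("sys.web01.cpu.user"))
```

No rule list has to be maintained by hand, and the two returned nodes are always different hosts, so losing one box does not lose the metric.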
Disk usage
● Graphite uses a lot of disk I/O
– Background graph is in thousands on the Y axis.
– Individual files increase seek times
● There are a lot of stat(2) calls
– This hasn't been investigated yet
Naming conventions
● Graphite has no rules for names
● We adopted:
– sys.* is for system metrics
– user.* is for testing/other stuff
– Anything else which makes sense is acceptable
Collecting metrics
● We have all sorts of homegrown scripts
– Shell
– Perl
– Python
– Powershell
● Originally used collectd for system metrics
– The version of collectd we were using had memory
usage issues
● These were fixed later
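
Homegrown collectors are easy to write because Carbon's plaintext protocol is one line per sample: metric path, value and Unix timestamp, separated by spaces. A minimal sketch follows; the host and metric name are hypothetical, and 2003 is Carbon's default plaintext port.

```python
import socket
import time


def format_metric(path: str, value: float, timestamp=None) -> bytes:
    """Encode one sample as a Carbon plaintext line: path, value, timestamp."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n".encode("ascii")


def send_metric(host: str, path: str, value: float, port: int = 2003) -> None:
    """Open a TCP connection to a Carbon listener and send a single sample."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(format_metric(path, value))


if __name__ == "__main__":
    # Prints the line only; call send_metric() against a real relay to submit it.
    print(format_metric("user.example.queue_depth", 17, 1400000000))
```

The same line can be produced from shell, Perl or PowerShell, which is why scripts in many languages can feed the same cluster.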
Collecting metrics
● System metrics are now collected by diamond
● Diamond is a Python application
– Base framework + metric collection scripts
– Added custom patches for internal metrics
– Added patches to send monitoring data directly to
Nagios for passive checks
Relay issues
● The Python relaying implementation eats CPU
● Started with relays directly on the cluster
– Still need more CPU
● Added relays in each datacenter
– Still need more CPU
● Ran multiple instances on each relay host
– Still need more CPU
● Finally rewrote in C and added more relay hosts
– This works for us (and we have breathing room)
Data visibility
● We send data to multiple places
– Metrics get dropped
● Small application in Go which gets data from
multiple locations and gives us a single
merged resultset
– Prototyped in Python, which was too slow
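
The merge idea can be sketched as follows, assuming each location returns the same series as `[value, timestamp]` pairs with `None` where a point was dropped. The first non-null value per timestamp wins; this rule is an assumption for illustration, not the behaviour of the Go application described above.

```python
def merge_series(*copies):
    """Merge [value, timestamp] lists fetched from several stores.

    For each timestamp, the first non-null value seen is kept.
    """
    merged = {}
    for datapoints in copies:
        for value, ts in datapoints:
            # Fill the slot if it is new or still holds a null.
            if merged.get(ts) is None:
                merged[ts] = value
    return [[merged[ts], ts] for ts in sorted(merged)]


if __name__ == "__main__":
    store_a = [[1.0, 10], [None, 20], [3.0, 30]]
    store_b = [[None, 10], [2.0, 20], [None, 30], [4.0, 40]]
    print(merge_series(store_a, store_b))
```

Gaps left by a dropped metric in one location are filled from the other, so the reader sees a single complete series.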
statsd
● We had statsd running, but unused for a long
time
– statsd use is still relatively small
– Only a few internal applications use it
– We already have an analytics framework for this
● The PCI vulnerability scanner reliably crashed
it
– This was patched and pushed upstream
Business metrics
● Turns out, developers like Graphite
– They don't reliably understand whisper semantics
● Querying Graphite like SQL doesn't work
– They create a large number of named metrics
● foo.bar.YYYY-MM-DD
● Disk space use is a sudden concern
– Especially when you don't try and restrict this (feature, not bug)
Scaling out clusters
● Different groups have different requirements
– Multiple backend rings, same frontend
● Unix systems
● Windows
● Networking
● Business metrics
● User testing
Current problems
● Hardware
– Need more CPU
● Especially on the frontends where we do a lot of maths
– Better disk reliability on SSDs
● Replacing disks is expensive
– More disk I/O
● SSDs are now maxed out under stat(2) calls
● Testing Fusion IO cards
– 10% faster, but we don't know about reliability yet
Current problems
● People
– If you need a graph, put the data in Graphite
● Even if the data isn't time series data
● Frontend scalability
– The default frontend doesn't work well with a few
thousand hosts
● Software upgrades
– Our last Whisper upgrade caused data recording to
stop
Current problems
● Manageability
– Getting rid of older, non-required metrics is a lot of
effort
– Adding hosts into a ring requires manual
rebalancing effort
Future possibilities
● Testing Cassandra as a backend (cyanite)
● Anomaly detection
– Tested Skyline, didn't scale
● More business metrics
● Sparse metrics
– Metrics with a lot of nulls, but potentially a lot of
named metrics involved
Peopleware
● Hiring people to work on interesting
challenges
– Sysadmins, developers
– http://www.booking.com/jobs
● Booking.com will be sponsoring a Graphite
dev summit in June (tentatively just before the
devopsdays Amsterdam event)
Reference URLs
● Graphite
– https://github.com/graphite-project
● Graphite API
– http://graphite.readthedocs.org/en/latest/functions.html
● C Carbon relay
– https://github.com/grobian/carbon-c-relay
● Zipper
– https://github.com/grobian/carbonserver
● Cyanite
– https://github.com/pyr/cyanite
– https://github.com/brutasse/graphite-cyanite
?

OSDC 2014: Devdas Bhagat - Graphite: Graphs for the modern age

Presto Summit 2018 - 07 - Lyft
Presto Summit 2018 - 07 - LyftPresto Summit 2018 - 07 - Lyft
Presto Summit 2018 - 07 - Lyftkbajda
 
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...Dave Stokes
 
Ducksboard - A real-time data oriented webservice architecture
Ducksboard - A real-time data oriented webservice architectureDucksboard - A real-time data oriented webservice architecture
Ducksboard - A real-time data oriented webservice architectureDucksboard
 
Proper Care and Feeding of a MySQL Database for Busy Linux Administrators
Proper Care and Feeding of a MySQL Database for Busy Linux AdministratorsProper Care and Feeding of a MySQL Database for Busy Linux Administrators
Proper Care and Feeding of a MySQL Database for Busy Linux AdministratorsDave Stokes
 
Rubyslava + PyVo #48
Rubyslava + PyVo #48Rubyslava + PyVo #48
Rubyslava + PyVo #48Jozef Képesi
 
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsBeyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsDataWorks Summit
 
Monitoring with Clickhouse
Monitoring with ClickhouseMonitoring with Clickhouse
Monitoring with Clickhouseunicast
 
Scaling graphite to handle a zerg rush
Scaling graphite to handle a zerg rushScaling graphite to handle a zerg rush
Scaling graphite to handle a zerg rushDaniel Ben-Zvi
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Omid Vahdaty
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadKrivoy Rog IT Community
 
GraphQL Munich Meetup #1 - How We Use GraphQL At Commercetools
GraphQL Munich Meetup #1 - How We Use GraphQL At CommercetoolsGraphQL Munich Meetup #1 - How We Use GraphQL At Commercetools
GraphQL Munich Meetup #1 - How We Use GraphQL At CommercetoolsNicola Molinari
 
Webinar slides: How to Automate & Manage PostgreSQL with ClusterControl
Webinar slides: How to Automate & Manage PostgreSQL with ClusterControlWebinar slides: How to Automate & Manage PostgreSQL with ClusterControl
Webinar slides: How to Automate & Manage PostgreSQL with ClusterControlSeveralnines
 
Optimizing Your Frontend Performance
Optimizing Your Frontend PerformanceOptimizing Your Frontend Performance
Optimizing Your Frontend PerformanceThomas Weinert
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKKriangkrai Chaonithi
 
Random tips that will save your project's life
Random tips that will save your project's lifeRandom tips that will save your project's life
Random tips that will save your project's lifeMariano Iglesias
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Demi Ben-Ari
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned Omid Vahdaty
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3 Omid Vahdaty
 

Similar a OSDC 2014: Devdas Bhagat - Graphite: Graphs for the modern age (20)

Presto Summit 2018 - 07 - Lyft
Presto Summit 2018 - 07 - LyftPresto Summit 2018 - 07 - Lyft
Presto Summit 2018 - 07 - Lyft
 
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...
The Proper Care and Feeding of a MySQL Database for Busy Linux Admins -- SCaL...
 
Ducksboard - A real-time data oriented webservice architecture
Ducksboard - A real-time data oriented webservice architectureDucksboard - A real-time data oriented webservice architecture
Ducksboard - A real-time data oriented webservice architecture
 
Proper Care and Feeding of a MySQL Database for Busy Linux Administrators
Proper Care and Feeding of a MySQL Database for Busy Linux AdministratorsProper Care and Feeding of a MySQL Database for Busy Linux Administrators
Proper Care and Feeding of a MySQL Database for Busy Linux Administrators
 
Rubyslava + PyVo #48
Rubyslava + PyVo #48Rubyslava + PyVo #48
Rubyslava + PyVo #48
 
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflowsBeyond unit tests: Deployment and testing for Hadoop/Spark workflows
Beyond unit tests: Deployment and testing for Hadoop/Spark workflows
 
Monitoring with Clickhouse
Monitoring with ClickhouseMonitoring with Clickhouse
Monitoring with Clickhouse
 
Scaling graphite to handle a zerg rush
Scaling graphite to handle a zerg rushScaling graphite to handle a zerg rush
Scaling graphite to handle a zerg rush
 
Big data nyu
Big data nyuBig data nyu
Big data nyu
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High load
 
GraphQL Munich Meetup #1 - How We Use GraphQL At Commercetools
GraphQL Munich Meetup #1 - How We Use GraphQL At CommercetoolsGraphQL Munich Meetup #1 - How We Use GraphQL At Commercetools
GraphQL Munich Meetup #1 - How We Use GraphQL At Commercetools
 
Webinar slides: How to Automate & Manage PostgreSQL with ClusterControl
Webinar slides: How to Automate & Manage PostgreSQL with ClusterControlWebinar slides: How to Automate & Manage PostgreSQL with ClusterControl
Webinar slides: How to Automate & Manage PostgreSQL with ClusterControl
 
Optimizing Your Frontend Performance
Optimizing Your Frontend PerformanceOptimizing Your Frontend Performance
Optimizing Your Frontend Performance
 
Introduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OKIntroduction to Data Engineer and Data Pipeline at Credit OK
Introduction to Data Engineer and Data Pipeline at Credit OK
 
Random tips that will save your project's life
Random tips that will save your project's lifeRandom tips that will save your project's life
Random tips that will save your project's life
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 


OSDC 2014: Devdas Bhagat - Graphite: Graphs for the modern age

  • 4. Graphite basics
      ● Graphite generates graphs from timeseries data
          – Think MRTG or Cacti
          – More flexible than those
      ● Written in Python
          – This does impact performance
      ● Web based and easy to use
          – For once, not a marketing buzzword
  • 8. The church of Graphs
      ● Pattern Recognition
      ● Correlation
      ● Analytics
      ● Anomaly detection
  • 11. Helpful Graphite features
      ● Out of order data insertion
      ● Ability to compare corresponding time periods (time travel)
      ● Custom retention periods
  • 15. Moving parts
      ● Relays
          – Send data to correct backend store
              ● Pattern matching on metric names
              ● Consistent hashing
      ● Storage
          – Flat, fixed size files
              ● These are created when the metric is first recorded
              ● Changing later is hard
      ● Webapp
          – Django based application offering a web API and JavaScript based frontend application
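Because each metric gets a flat, fixed size file when it is first recorded, the disk cost of a metric is decided by its retention definition. The sketch below estimates that size. It assumes the Whisper on-disk layout of a 16-byte file header, a 12-byte header per archive, and 12 bytes per datapoint, and it assumes retentions written as `precision:duration` pairs; the helper names are hypothetical.

```python
import re

# Seconds per unit suffix used in retention strings such as "10s:6h".
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

# Assumed Whisper layout sizes, in bytes.
FILE_HEADER = 16
ARCHIVE_HEADER = 12
POINT_SIZE = 12


def to_seconds(spec):
    """Convert a string like '10s' or '7d' to a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", spec)
    if match is None:
        raise ValueError(f"unrecognised time spec: {spec!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def whisper_file_size(retentions):
    """Estimate the size in bytes of one metric file for the given retentions."""
    archives = retentions.split(",")
    points = 0
    for archive in archives:
        precision, duration = archive.split(":")
        points += to_seconds(duration) // to_seconds(precision)
    return FILE_HEADER + ARCHIVE_HEADER * len(archives) + POINT_SIZE * points


# 10-second points for 6 hours, then 1-minute points for 7 days.
print(whisper_file_size("10s:6h,1m:7d"))
```

The size depends only on the retention string, not on how many values have been written, which is why changing retention later means rewriting the file.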
  • 19. Data output
      ● Web API
          – Everything is an HTTP GET
          – A number of functions for data manipulation
      ● Graphite offers outputs in multiple formats
          – Graphical (PNG, SVG)
          – Structured (JSON, CSV)
          – Raw data
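Since every request is an HTTP GET, a graph or data export is just a URL. The sketch below builds a render URL with two targets, the second wrapped in Graphite's `timeShift` function to overlay the same metric from a week earlier (the "time travel" comparison). The host name and metric name are made up for illustration; `target`, `from` and `format` are render API parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def render_url(base, targets, output_format="json", start="-1h"):
    """Build a Graphite render URL; one 'target' parameter per series."""
    params = [("target", target) for target in targets]
    params += [("from", start), ("format", output_format)]
    return base.rstrip("/") + "/render?" + urlencode(params)


# Hypothetical host and metric names.
metric = "sys.web01.load.shortterm"
url = render_url(
    "http://graphite.example.com",
    [metric, f'timeShift({metric},"7d")'],
)
print(url)

# Decoding the query string shows what the server would receive.
print(parse_qs(urlsplit(url).query))
```

Swapping `output_format` for `png`, `svg`, `csv` or `raw` selects one of the other output formats listed above.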
  • 24. Using Graphite
      ● Custom pages pulling in PNG images
          – Just <img src="some url here">
      ● Using the default frontend
          – For single, one off graphs
          – Debugging problems
      ● Using builtin dashboards
          – Users create their own dashboards
          – Third party dashboard tools
      ● Using third party libraries
          – JSON is nice for this
          – Cubism, D3.js, rickshaw, etc
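A short sketch of why JSON is convenient for third party tools: the render API's JSON output is a list of series, each with a `target` name and a list of `[value, timestamp]` datapoints, where missing values arrive as `null`. The sample payload and the helper name below are invented for illustration.

```python
import json

# Invented sample in the shape of Graphite's JSON render output.
SAMPLE = """
[{"target": "sys.web01.load.shortterm",
  "datapoints": [[0.42, 1400000000], [null, 1400000060], [0.57, 1400000120]]}]
"""


def latest_values(payload):
    """Return {target: (timestamp, value)} for the newest non-null point."""
    result = {}
    for series in json.loads(payload):
        # Graphite orders each datapoint as [value, timestamp].
        known = [(ts, value) for value, ts in series["datapoints"] if value is not None]
        result[series["target"]] = known[-1] if known else None
    return result


print(latest_values(SAMPLE))
```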
  • 27. Using Graphite
      ● API
          – Monitoring
          – Runtime performance tuning
      ● Postmortem analytics
      ● Performance debugging
  • 30. Making Graphite scale
      ● Original setup
          – Small cluster
              ● Two frontend boxes, two backend
          – RAID 1+0 with 4 spinning disks
              ● This works well, with about 200 machines
          – All those individual files force a lot of seeks
  • 34. Scaling out - try 1
      ● Add more backend boxes
          – Manual rules to split traffic
          – Pattern matching based on metric names
      ● Balancing traffic is hard
  • 38. Scaling up
      ● Replace spinning disks with SSDs
      ● Massive performance improvement due to more IOPS
          – Still not as much as we needed
      ● Losing an SSD meant we had a box die
          – This has been fixed
      ● SSDs are not as reliable as spinning rust
          – SSDs last between 12 and 14 months
  • 40. Sharding – take II
      ● At about 10 storage servers, manually maintaining regular expressions became painful
      ● Keeping disk usage balanced was even harder
          – Anyone is allowed to create graphs
  • 41. Sharding – take II
      ● Replace regular expressions with consistent hashing
      ● Switch to RAID 0
          – We have switched back to RAID 1
      ● Store data on two nodes in each ring
      ● Mirror rings in datacenters
      ● Shuffle metrics to avoid losing data and disk space.
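The idea of replacing hand-maintained regular expressions with consistent hashing, and keeping each metric on two nodes of a ring, can be sketched as follows. This is a generic illustration, not the algorithm used by carbon or by the C relay: the MD5-based hash, the number of virtual nodes, and the host names are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring many times to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [position for position, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def nodes_for(self, metric, replicas=2):
        """Walk clockwise from the metric's position, collecting distinct nodes."""
        start = bisect.bisect(self._keys, self._hash(metric))
        chosen = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == replicas:
                    break
        return chosen


# Hypothetical storage hosts.
ring = HashRing(["store01", "store02", "store03"])
print(ring.nodes_for("sys.web01.cpu.user"))
```

No routing rules need editing when a new metric name appears; the metric name alone decides which two stores receive it.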
  • 42. Disk usage
      ● Graphite uses a lot of disk IO
          – Background graph is in thousands on the Y axis.
          – Individual files increase seek times
      ● There are a lot of stat(2) calls
          – This hasn't been investigated yet
  • 44. Naming conventions
      ● Graphite has no rules for names
      ● We adopted:
          – sys.* is for system metrics
          – user.* is for testing/other stuff
          – Anything else which makes sense is acceptable
  • 46. Collecting metrics
      ● We have all sorts of homegrown scripts
          – Shell
          – Perl
          – Python
          – Powershell
      ● Originally used collectd for system metrics
          – The version of collectd we were using had memory usage issues
              ● These were fixed later
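Homegrown collection scripts can stay small because carbon accepts a plaintext line per datapoint: metric path, value and Unix timestamp, separated by spaces and terminated by a newline. The sketch below formats such lines and sends them over TCP. Port 2003 is carbon's default plaintext port; the host and the metric name are placeholders.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Return one plaintext protocol line: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="localhost", port=2003, timeout=5.0):
    """Send already formatted lines to a carbon plaintext listener."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Formatting only; call send_metrics([line]) against a real relay or cache.
line = format_metric("sys.web01.load.shortterm", 0.42, 1400000000)
print(line, end="")
```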
  • 48. Collecting metrics
      ● System metrics are now collected by diamond
      ● Diamond is a Python application
          – Base framework + metric collection scripts
          – Added custom patches for internal metrics
          – Added patches to send monitoring data directly to Nagios for passive checks
  • 53. Relay issues
      ● The Python relaying implementation eats CPU
      ● Started with relays directly on the cluster
          – Still need more CPU
      ● Added relays in each datacenter
          – Still need more CPU
      ● Ran multiple instances on each relay host
          – Still need more CPU
      ● Finally rewrote in C and added more relay hosts
          – This works for us (and we have breathing room)
  • 55. Data visibility
      ● We send data to multiple places
          – Metrics get dropped
      ● Small application in Go which gets data from multiple locations and gives us a single merged resultset
          – Prototyped in Python, which was too slow
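The merging step can be illustrated with a small sketch: given the same series fetched from several stores, take the first non-null value seen for each timestamp, so a point dropped on one store is filled from another. This is only an illustration of the idea, not the logic of the Go application; the function name and the sample data are invented.

```python
def merge_series(*series_list):
    """Merge [value, timestamp] lists, preferring the first non-null value."""
    merged = {}
    for series in series_list:
        for value, timestamp in series:
            # Fill the slot if it is missing or still null.
            if merged.get(timestamp) is None:
                merged[timestamp] = value
    return [[merged[ts], ts] for ts in sorted(merged)]


# Two copies of one metric, each missing a different point.
store_a = [[1.0, 10], [None, 20], [3.0, 30]]
store_b = [[1.0, 10], [2.0, 20], [None, 30], [4.0, 40]]
print(merge_series(store_a, store_b))
```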
  • 57. statsd
      ● We had statsd running, but unused for a long time
          – statsd use is still relatively small
          – Only a few internal applications use it
          – We already have an analytics framework for this
      ● The PCI vulnerability scanner reliably crashed it
          – This was patched and pushed upstream
  • 58. Business metrics
      ● Turns out, developers like Graphite
          – They don't reliably understand whisper semantics
              ● Querying Graphite like SQL doesn't work
          – They create a large number of named metrics
              ● foo.bar.YYYY-MM-DD
      ● Disk space use is a sudden concern
          – Especially when you don't try and restrict this (feature, not bug)
  • 59. Scaling out clusters
      ● Different groups have different requirements
          – Multiple backend rings, same frontend
              ● Unix systems
              ● Windows
              ● Networking
              ● Business metrics
              ● User testing
  • 60. Current problems
      ● Hardware
          – Need more CPU
              ● Especially on the frontends where we do a lot of maths
          – Better disk reliability on SSDs
              ● Replacing disks is expensive
          – More disk IO
              ● SSDs are now maxed out under stat(2) calls
              ● Testing Fusion IO cards
                  – 10% faster, but we don't know about reliability yet
  • 61. Current problems
      ● People
          – If you need a graph, put the data in Graphite
              ● Even if the data isn't time series data
      ● Frontend scalability
          – The default frontend doesn't work well with a few thousand hosts
      ● Software upgrades
          – Our last Whisper upgrade caused data recording to stop
  • 62. Current problems
      ● Manageability
          – Getting rid of older, non-required metrics is a lot of effort
          – Adding hosts into a ring requires manual rebalancing effort
  • 63. Future possibilities
      ● Testing Cassandra as a backend (cyanite)
      ● Anomaly detection
          – Tested Skyline, didn't scale
      ● More business metrics
      ● Sparse metrics
          – Metrics with a lot of nulls, but potentially a lot of named metrics involved
  • 64. Peopleware
      ● Hiring people to work on interesting challenges
          – Sysadmins, developers
          – http://www.booking.com/jobs
      ● Booking.com will be sponsoring a Graphite dev summit in June (tentatively just before the devopsdays Amsterdam event)
  • 65. Reference URLs
      ● Graphite
          – https://github.com/graphite-project
      ● Graphite API
          – http://graphite.readthedocs.org/en/latest/functions.html
      ● C Carbon relay
          – https://github.com/grobian/carbon-c-relay
      ● Zipper
          – https://github.com/grobian/carbonserver
      ● Cyanite
          – https://github.com/pyr/cyanite
          – https://github.com/brutasse/graphite-cyanite
  • 66. ?