OSDC 2014: Devdas Bhagat - Graphite: Graphs for the modern age

Graphite
Graphs for the modern age

Graphite basics
● Graphite generates graphs from timeseries
data
– Think MRTG or Cacti
– More flexible than those

Graphite basics
data
● Written in Python
– This does impact performance

Graphite basics
data
● Written in Python
– This does impact performance
● Web based and easy to use
– For once, not a marketing buzzword

The church of Graphs
● Pattern Recognition

● Correlation

● Correlation
● Analytics

● Correlation
● Analytics
● Anomaly detection

Helpful Graphite features
● Out of order data insertion

● Ability to compare corresponding time periods
(time travel)

● Ability to compare corresponding time periods
(time travel)
● Custom retention periods

Moving parts
● Relays
– Send data to correct backend store

Moving parts
● Relays
● Pattern matching on metric names
● Consistent hashing

Moving parts
● Relays
● Storage
– Flat, fixed size files
● These are created when the metric is first recorded
● Changing later is hard

Moving parts
● Relays
● Storage
– Flat, fixed size files
● These are created when the metric is first recorded
● Changing later is hard
● Webapp
– Django based application offering a web api and Javascript
based frontend application

Data output
● Web API
– Everything is a HTTP GET
– A number of functions for data manipulation

Data output
● Web API
● Graphite offers outputs in multiple formats

Data output
● Web API
● Graphite offers outputs in multiple formats
– Graphical (PNG, SVG)
– Structured(JSON, CSV)
– Raw data

Using Graphite
● Custom pages pulling in PNG images
– Just <img src=”some url here”>

Using Graphite
● Using the default frontend
– For single, one off graphs
– Debugging problems

Using Graphite
● Using builtin dashboards
– Users create their own dashboards
– Third part dashboard tools

Using Graphite
● Using builtin dashboards
– Users create their own dashboards
– Third part dashboard tools
● Using third party libraries
– JSON is nice for this
– Cubism, D3.js, rickshaw, etc

Using Graphite
● API
– Monitoring
– Runtime performance tuning

Using Graphite
● API
– Monitoring
● Postmortem analytics

Using Graphite
● API
– Monitoring
● Postmortem analytics
● Performance debugging

Making Graphite scale
● Original setup
– Small cluster
● Two frontend boxes, two backend

● Original setup
– Small cluster
– RAID 1+0 with 4 spinning disks
● This works well, with about 200 machines

● Original setup
– Small cluster
– RAID 1+0 with 4 spinning disks
● This works well, with about 200 machines
– All those individual files force a lot of seeks

Scaling out - try 1
● Add more backend boxes

Scaling out - try 1
– Manual rules to split traffic
– Pattern matching based on metric names

Scaling out - try 1
– Manual rules to split traffic
– Pattern matching based on metric names
● Balancing traffic is hard

Scaling up
● Replace spinning disks with SSDs

Scaling up
● Massive performance improvement due to
more IOPS
– Still not as much as we needed

Scaling up
more IOPS
● Losing a SSD meant we had a box die
– This has been fixed

Scaling up
more IOPS
● Losing a SSD meant we had a box die
– This has been fixed
● SSDs are not as reliable as spinning rust
– SSDs last for between 12 to 14 months

Sharding – take II
● At about 10 storage servers, manually
maintaining regular expressions became
painful

Sharding – take II
● At about 10 storage servers, manually
maintaining regular expressions became
painful
● Keeping disk usage balanced was even
harder
– Anyone is allowed to create graphs

Sharding - take II
● Replace regular expressions with consistent
hashing
● Switch to RAID 0
– We have switched back to RAID 1
● Store data on two nodes in each ring
● Mirror rings in datacenters
● Shuffle metrics to avoid losing data and disk
space.

Disk usage
● Graphite uses a lot of disk io
– Background graph is in thousands on the Y axis.
– Individual files increase seek times
● There are a lot of stat(2) calls
– This hasn't been investigated yet

Naming conventions
● Graphite has no rules for names

Naming conventions
● Graphite has no rules for names
● We adopted:
– sys.* is for system metrics
– user.* is for testing/other stuff
– Anything else which makes sense is acceptable

Collecting metrics
● We have all sorts of homegrown scripts
– Shell
– Perl
– Python
– Powershell

Collecting metrics
● We have all sorts of homegrown scripts
– Shell
– Perl
– Python
– Powershell
● Originally used collectd for system metrics
– The version of collected we were using had memory
usage issues
● These have been fixed later

Collecting metrics
● System metrics are now collected by diamond

Collecting metrics
● System metrics are now collected by diamond
● Diamond is a Python application
– Base framework + metric collection scripts
– Added custom patches for internal metrics
– Added patches to send monitoring data directly to
Nagios for passive checks

Relay issues
● The Python relaying implementation eats CPU

Relay issues
● Started with relays directly on the cluster
– Still need more CPU

Relay issues
● Added relays in each datacenter

Relay issues
● Ran multiple instances on each relay host

Relay issues
● Ran multiple instances on each relay host
● Finally rewrote in C and added more relay hosts
– This works for us (and we have breathing room)

Data visibility
● We send data to multiple places
– Metrics get dropped

Data visibility
● We send data to multiple places
– Metrics get dropped
● Small application in Go which gets data from
multiple locations and gives us a single
merged resultset
– Prototyped in Python, which was too slow

statsd
● We had statsd running, but unused for a long
time
– statsd use is still relatively small
– Only a few internal applications use it
– We already have an analytics framework for this

statsd
● We had statsd running, but unused for a long
time
– statsd use is still relatively small
– Only a few internal applications use it
– We already have an analytics framework for this
● The PCI vulnerability scanner reliably crashed
it
– This was patched and pushed upstream

Business metrics
● Turns out, developers like Graphite
– They don't reliably understand whisper semantics
● Querying Graphite like SQL doesn't work
– They create a large number of named metrics
● foo.bar.YYYY-MM-DD
● Disk space use is a sudden concern
– Especially when you don't try and restrict this (feature, not bug)

Scaling out clusters
● Different groups have different requirements
– Multiple backend rings, same frontend
● Unix systems
● Windows
● Networking
● Business metrics
● User testing

Current problems
● Hardware
– Need more CPU
● Especially on the frontends where we do a lot of maths
– Better disk reliability on SSDs
● Replacing disks is expensive
– More disk IO
● SSDs are now maxed out under stat(2) calls
● Testing Fusion IO cards
– 10% faster, but we don't know babout reliability yet

Current problems
● People
– If you need a graph, put the data in Graphite
● Even if the data isn't time series data
● Frontend scalability
– The default frontend doesn't work well with a few
thousand hosts
● Software upgrades
– Our last Whisper upgrade caused data recording to
stop

Current problems
● Managability
– Getting rid of older, non-required metrics is a lot of
effort
– Adding hosts into a ring requires manual
rebalancing effort

Future possiilities
● Testing Cassandra as a backend (cyanite)
● Anomaly detection
– Tested Skyline, didn't scale
● More business metrics
● Sparse metrics
– Metrics with a lot of nulls, but potentially a lot of
named metrics involved

Peopleware
● Hiring people to work on interesting
challenges
– Sysadmins, developers
– http://www.booking.com/jobs
● Booking.com will be sponsoring a Graphite
dev summit in June (tentatively just before the
devopsdays Amsterdam event)

Reference URLS
● Graphite
– https://github.com/graphite-project
● Graphite API
– http://graphite.readthedocs.org/en/latest/functions.html
● C Carbon relay
– https://github.com/grobian/carbon-c-relay
● Zipper
– https://github.com/grobian/carbonserver
● Cyanite
– https://github.com/pyr/cyanite
– https://github.com/brutasse/graphite-cyanite

OSDC 2014: Devdas Bhagat - Graphite: Graphs for the modern age

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (7)

Similar a OSDC 2014: Devdas Bhagat - Graphite: Graphs for the modern age

Similar a OSDC 2014: Devdas Bhagat - Graphite: Graphs for the modern age (20)

Último

Último (20)

OSDC 2014: Devdas Bhagat - Graphite: Graphs for the modern age