2. 100+
Data centers globally
2.5B
Monthly unique visitors
>10%
Internet requests every day
≤3M
DNS queries/second
6M+
websites, apps & APIs
in 150 countries
5M+
HTTP requests/second
3. Anatomy of a DNS query
$ dig www.cloudflare.com
; <<>> DiG 9.8.3-P1 <<>> www.cloudflare.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36582
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;www.cloudflare.com. IN A
;; ANSWER SECTION:
www.cloudflare.com. 5 IN A 198.41.215.162
www.cloudflare.com. 5 IN A 198.41.214.162
;; Query time: 34 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Sat Sep 2 10:48:30 2017
;; MSG SIZE rcvd: 68
30+
Fields
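The header line of the dig output above summarises the fixed 12-byte DNS header. A minimal sketch (not Cloudflare's code) of packing and unpacking that header with the standard library, using the values from the answer shown (id 36582, flags qr rd ra, 1 question, 2 answers):

```python
import struct

def parse_dns_header(data):
    # !HHHHHH = six big-endian 16-bit fields: id, flags, and the
    # question/answer/authority/additional section counts.
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack('!HHHHHH', data[:12])
    return {
        'id': ident,
        'qr': bool(flags & 0x8000),      # response bit
        'opcode': (flags >> 11) & 0xF,   # 0 = QUERY
        'rd': bool(flags & 0x0100),      # recursion desired
        'ra': bool(flags & 0x0080),      # recursion available
        'rcode': flags & 0xF,            # 0 = NOERROR
        'qdcount': qdcount, 'ancount': ancount,
        'nscount': nscount, 'arcount': arcount,
    }

# Header matching the dig answer above: 0x8180 = qr | rd | ra.
header = struct.pack('!HHHHHH', 36582, 0x8180, 1, 2, 0, 0)
print(parse_dns_header(header))
```

The question and answer sections that follow the header use variable-length name encoding, which is where most of the 30+ fields live.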
4. Cloudflare
DNS Server
Log
Forwarder
HTTP & Other
Edge Services
Anycast
DNS
Logs from all edge services and all PoPs are
shipped over TLS to be processed
Logs are received and
de-multiplexed
Logs are written into
various Kafka topics
5. Cloudflare
DNS Server
Log
Forwarder
HTTP & Other
Edge Services
Anycast
DNS
Log messages are
serialized with Cap'n Proto
Logs from all edge services and all PoPs are
shipped over TLS to be processed
Logs are received and
de-multiplexed
Logs are written into
various Kafka topics
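The de-multiplexing step can be sketched as routing each incoming message to a per-service Kafka topic. This is a toy illustration only: the topic names and dict payloads are hypothetical, and in production the payloads are Cap'n Proto messages on a real Kafka cluster.

```python
from collections import defaultdict

def demultiplex(messages):
    """Route each log message to a topic keyed by its originating service."""
    topics = defaultdict(list)
    for msg in messages:
        # Hypothetical topic naming scheme: one topic per edge service.
        topics['logs.%s' % msg['service']].append(msg)
    return topics

batch = [
    {'service': 'dns', 'qname': 'www.cloudflare.com'},
    {'service': 'http', 'status': 200},
    {'service': 'dns', 'qname': 'api.cloudflare.com'},
]
topics = demultiplex(batch)
```

Keying topics by service is what lets each downstream consumer (DNS analytics, HTTP analytics, ...) read only its own firehose.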
6. What did we want?
- Multidimensional query analytics
- Complex ad-hoc queries
- Capable of current and expected future scale
- Gracefully handle late arriving log data
- Roll-ups/aggregations for long term storage
- Highly available and replicated architecture
≤3M
Queries
Per Second
100+
Edge Points
of Presence
20+
Query
Dimensions
5+
Years of
stored
aggregation
7. Logs are written into
various Kafka topics
Logs are received and
de-multiplexed
Kafka, Apache Spark and Parquet
- Scanning firehose is slow and
adding filters is time consuming
- Offline analysis is difficult with
large amounts of data
- Not a fast or friendly user
experience
- Doesn’t work for customers
Converted into Parquet
and written to HDFS
Download and filter data from
Kafka using Apache Spark
8. Let’s aggregate everything... with streams
Timestamp            QName               QType  RCODE
2017/01/01 01:00:00  www.cloudflare.com  A      NODATA
2017/01/01 01:00:01  api.cloudflare.com  AAAA   NOERROR

Time Bucket       QName               QType  RCODE    Count  p50 Response Time
2017/01/01 01:00  www.cloudflare.com  A      NODATA   5      0.4876ms
2017/01/01 01:00  api.cloudflare.com  AAAA   NOERROR  10     0.5231ms
9. Let’s aggregate everything... with streams
- Counters
- Total number of queries
- Query types
- Response codes
- Top-n query names
- Top-n query sources
- Response time/size quantiles
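The roll-up in the tables above can be sketched in a few lines: bucket each event to the minute, count per (bucket, QName, QType, RCODE) key, track a response-time p50, and keep top-n query names. The field names follow the example tables; the logic is illustrative, not Cloudflare's production code.

```python
from collections import Counter, defaultdict
from statistics import median

def aggregate(events):
    counts = Counter()
    latencies = defaultdict(list)
    qname_topn = Counter()
    for e in events:
        bucket = e['ts'][:16]  # '2017/01/01 01:00:00' -> '2017/01/01 01:00'
        key = (bucket, e['qname'], e['qtype'], e['rcode'])
        counts[key] += 1
        latencies[key].append(e['rt_ms'])
        qname_topn[e['qname']] += 1
    # One output row per key: (bucket, qname, qtype, rcode, count, p50)
    rows = [key + (n, median(latencies[key])) for key, n in counts.items()]
    return rows, qname_topn.most_common(2)

events = [
    {'ts': '2017/01/01 01:00:00', 'qname': 'www.cloudflare.com', 'qtype': 'A',
     'rcode': 'NODATA', 'rt_ms': 0.4},
    {'ts': '2017/01/01 01:00:30', 'qname': 'www.cloudflare.com', 'qtype': 'A',
     'rcode': 'NODATA', 'rt_ms': 0.6},
    {'ts': '2017/01/01 01:00:01', 'qname': 'api.cloudflare.com', 'qtype': 'AAAA',
     'rcode': 'NOERROR', 'rt_ms': 0.5},
]
rows, top = aggregate(events)
```

Note that this toy version keeps every latency sample in memory per key; at 3M queries/second that is exactly the cardinality problem discussed in the following slides.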
17. Logs are written into
various Kafka topics
Logs are received and
de-multiplexed
- Spark experience in-house, though
in Java/Scala
- Batch-oriented and need a DB to
serve online queries
- Difficult to support ad-hoc analysis
- Low resolution aggregates
- Scanning raw data is slow
- Late arriving data
Aggregating with Spark Streaming
Produce low cardinality
aggregates with Spark Streaming
19. Logs are written into
various Kafka topics
Logs are received and
de-multiplexed
- Distributed time-series DB
- Existing deployments of CitusDB
- High cardinality aggregations are
tricky due to insert performance
- Late arriving data
- SQL API
Spark Streaming + CitusDB
Produce low cardinality
aggregates with Spark Streaming
Insert aggregate rows into
CitusDB cluster for reads
20. Logs are written into
various Kafka topics
Logs are received and
de-multiplexed
Apache Flink + (CitusDB?)
- Dataflow API and support for
stream watermarks
- Checkpoint performance issues
- High cardinality aggregations are
tricky due to insert performance
- SQL API
Produce low cardinality
aggregates with Flink
Insert aggregate rows into
CitusDB cluster for reads
21. Logs are written into
various Kafka topics
Logs are received and
de-multiplexed
Druid
- Insertion rate couldn’t keep up in
our initial tests
- Estimated costs of a suitable cluster
were prohibitively high
- Seemed performant for random
reads but not the best we’d seen
- Operational complexity seemed high
Insert into a cluster of
Druid nodes
22. Let’s aggregate everything... with streams
Timestamp            QName               QType
2017/01/01 01:00:00  www.cloudflare.com  A
2017/01/01 01:00:01  api.cloudflare.com  AAAA

Time Bucket       QName               QType
2017/01/01 01:00  www.cloudflare.com  A
2017/01/01 01:00  api.cloudflare.com  AAAA
- Raw data isn’t easily queried ad-hoc
- Backfilling new aggregates is impossible or can
be very difficult without custom tools
- A stream can’t serve actual queries
- Can be costly for high cardinality dimensions
*https://clickhouse.yandex/docs/en/introduction/what_is_clickhouse.html
23. ClickHouse
- Tabular, column-oriented data store
- Single binary, clustered architecture
- Familiar SQL query interface
Lots of very useful built-in aggregation functions
- Raw log data stored for 3 months
~7 trillion rows
- Aggregated data stored indefinitely
1m, 1h aggregations across 3 dimensions
24. Cloudflare
DNS Server
Log
Forwarder
HTTP & Other
Edge Services
Anycast
DNS
Log messages are
serialized with Cap'n Proto
Logs from all edge services and all PoPs are
shipped over TLS to be processed
Logs are received and
de-multiplexed
Logs are written into
various Kafka topics
Go Inserters write the
data in parallel
Multi-tenant ClickHouse
cluster stores data
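The production inserters are written in Go; this Python sketch shows only the batching idea behind them: buffer rows and flush in large batches, since ClickHouse strongly favours few large inserts over many row-at-a-time writes. The class and parameter names are hypothetical.

```python
class BatchInserter:
    """Buffer rows and flush them in batches via a caller-supplied sink."""

    def __init__(self, flush_fn, batch_size=3):
        self.flush_fn = flush_fn      # e.g. a function doing one INSERT per batch
        self.batch_size = batch_size  # production batches would be far larger
        self.buf = []

    def insert(self, row):
        self.buf.append(row)
        if len(self.buf) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buf:
            self.flush_fn(self.buf)
            self.buf = []

flushed = []
ins = BatchInserter(flushed.append, batch_size=3)
for i in range(7):
    ins.insert({'row': i})
ins.flush()  # don't lose the trailing partial batch
```

Running several such inserters in parallel, one per Kafka partition, is what the "write the data in parallel" box in the diagram amounts to.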
26. ClickHouse Cluster
r{0,2}.dnslogs
- Raw logs are inserted into one replicated, sharded table
- Multiple r{0,2} databases to better pack the cluster with shards and
replicas
First attempt in prod.
ReplicatedMergeTree
27. Speeding up typical queries
- SUM() and COUNT() over a few low-cardinality dimensions
- Global overview (trends and monitoring)
- Storing intermediate state for non-additive functions
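Why intermediate state matters: SUM() and COUNT() of two sub-aggregates combine by simple addition, but a non-additive function like a median does not. A deliberately simplistic illustration, where the "state" is just the raw samples (ClickHouse's aggregating tables store a compact per-function state instead):

```python
from statistics import median

# Per-shard response-time samples (made-up numbers).
shard_a = [0.1, 0.2, 0.9]
shard_b = [0.8, 0.9]

# Combining the already-reduced results is wrong in general...
median_of_medians = median([median(shard_a), median(shard_b)])

# ...whereas merging the intermediate states first gives the true answer.
true_median = median(shard_a + shard_b)
```

Here the median of the two shard medians is 0.525, while the true p50 over all samples is 0.8, which is why the aggregate tables store mergeable intermediate state rather than final values.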
28. ClickHouse Cluster
r{0,2}.dnslogs
- Raw logs are inserted into one
replicated, sharded table
- Multiple r{0,2} databases to better pack
the cluster with shards and replicas
- Aggregate tables for long-term storage
Today...
ReplicatedMergeTree
ReplicatedAggregatingMergeTree
dnslogs_rollup_X
29. October 2016
Began evaluating technologies and
architecture, 1 instance in Docker
November 2016
Prototype ClickHouse cluster with 3
nodes, inserting a sample of data
December 2016
Finalized schema, deployed a production
ClickHouse cluster of 6 nodes
ClickHouse visualisations with
Superset and Grafana
Spring 2017
TopN, IP prefix matching, Go native
driver, Analytics library, pkey in
monotonic functions
Growing interest among other
Cloudflare engineering teams,
worked on standard tooling
August 2017
Migrated to a new cluster with
multi-tenancy
30. Multi-tenant ClickHouse cluster
8M+
Row insertions/s
2PB+
RAID-0 spinning disks
4GB+
Insertion throughput/s
33
Nodes
31. ClickHouse Today… 12 Trillion Rows
SELECT
table,
sum(rows) AS total
FROM system.cluster_parts
WHERE database = 'r0'
GROUP BY table
ORDER BY total DESC
┌─table──────────────────────────────┬─────────────total─┐
│ ███████████████ │ 9,051,633,001,267 │
│ ████████████████████ │ 2,088,851,716,078 │
│ ███████████████████ │ 847,768,860,981 │
│ ██████████████████████ │ 259,486,159,236 │
│ … │ … │
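The SELECT above is a plain group-by-sum over part metadata; the same shape in Python, with made-up tables and row counts standing in for the redacted production values:

```python
from collections import Counter

# Hypothetical part rows, analogous to rows from a system parts table:
# each table has many on-disk parts, each with a row count.
parts = [
    {'table': 'dnslogs', 'rows': 4000}, {'table': 'dnslogs', 'rows': 5000},
    {'table': 'httplogs', 'rows': 2000}, {'table': 'rollups', 'rows': 800},
]

totals = Counter()
for p in parts:
    totals[p['table']] += p['rows']   # GROUP BY table, sum(rows)

ranking = totals.most_common()        # ORDER BY total DESC
```

This is how the 12-trillion-row headline number is computed: summing part row counts per table, without scanning any data.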
32. Contributions to ClickHouse
- TopK(n) Aggregates
https://github.com/yandex/ClickHouse/pull/754
- TrieDictionaries (IP Prefix)
https://github.com/yandex/ClickHouse/pull/785
- SpaceSaving: internal storage for StringRef{}
https://github.com/yandex/ClickHouse/pull/925
- Bug fixes to the Go native driver
https://github.com/kshvakov/clickhouse
- sumMap(key, value)
https://github.com/yandex/ClickHouse/pull/1250
33. Other Contributions
- Grafana Plugin
https://github.com/vavrusa/grafana-sqldb-datasource
(see also https://github.com/Vertamedia/clickhouse-grafana)
- SQLAlchemy (Superset)
https://github.com/cloudflare/sqlalchemy-clickhouse
34. Python w/ Jupyter Notebooks
from timeit import default_timer as timer

import requests
import pandas as pd

def ch(q, host='127.0.0.1', port=9001):
    start = timer()
    r = requests.get(
        'https://%s:%d/' % (host, port),
        params={'user': 'xxx', 'query': q + '\nFORMAT TabSeparatedWithNames'},
        stream=True)
    end = timer()
    if not r.ok:
        raise RuntimeError(r.text)
    print('Query finished in %.02fs' % (end - start))
    return pd.read_csv(r.raw, sep='\t')