How Cloudflare analyzes >1m DNS queries per second
Tom Arnfeld (and Marek Vavrusa)
100+ data centers globally
2.5B monthly unique visitors
>10% of Internet requests every day
≦3M DNS queries/second
6M+ websites, apps & APIs in 150 countries
5M+ HTTP requests/second
Anatomy of a DNS query
$ dig www.cloudflare.com
; <<>> DiG 9.8.3-P1 <<>> www.cloudflare.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36582
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;www.cloudflare.com. IN A
;; ANSWER SECTION:
www.cloudflare.com. 5 IN A 198.41.215.162
www.cloudflare.com. 5 IN A 198.41.214.162
;; Query time: 34 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Sat Sep 2 10:48:30 2017
;; MSG SIZE rcvd: 68
Fields: 30+
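As a complement to the `dig` output above, here is a minimal sketch of what the query side looks like on the wire: a 12-byte header followed by one question. The function name `build_query` and its defaults are illustrative assumptions, not part of the talk; the query ID simply reuses the one shown in the `dig` output.

```python
import struct


def build_query(qname, qtype=1, query_id=36582, recursion_desired=True):
    """Build a minimal DNS query packet (header + one question).

    Hypothetical helper for illustration only; qtype=1 means an A record.
    """
    # Header: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (all 16-bit, big-endian).
    flags = 0x0100 if recursion_desired else 0x0000  # RD bit
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)

    # QNAME: each label is prefixed by its length; a zero byte terminates the name.
    labels = qname.rstrip(".").split(".")
    question = b"".join(bytes([len(label)]) + label.encode("ascii") for label in labels)
    question += b"\x00"

    # QTYPE followed by QCLASS (1 = IN).
    question += struct.pack("!HH", qtype, 1)
    return header + question


packet = build_query("www.cloudflare.com")
```

The response carries many more fields than the query (answer records, TTLs, sizes, timings), which is where most of the logged fields come from.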
Log pipeline (diagram)
Diagram components: Anycast DNS, Cloudflare DNS Server, HTTP & Other Edge Services, Log Forwarder
- Log messages are serialized with Cap’n Proto
- Logs from all edge services and all PoPs are shipped over TLS to be processed
- Logs are received and de-multiplexed
- Logs are written into various Kafka topics
What did we want?
- Multidimensional query analytics
- Complex ad-hoc queries
- Capable of current and expected future scale
- Gracefully handle late arriving log data
- Roll-ups/aggregations for long term storage
- Highly available and replicated architecture
Queries per second: ≦3M
Edge points of presence: 100+
Query dimensions: 20+
Years of stored aggregation: 5+
Kafka, Apache Spark and Parquet
Download and filter data from Kafka using Apache Spark
Converted into Parquet and written to HDFS
- Scanning firehose is slow and adding filters is time consuming
- Offline analysis is difficult with large amounts of data
- Not a fast or friendly user experience
- Doesn’t work for customers
Let’s aggregate everything... with streams
Timestamp            QName               QType  RCODE
2017/01/01 01:00:00  www.cloudflare.com  A      NODATA
2017/01/01 01:00:01  api.cloudflare.com  AAAA   NOERROR

Time Bucket       QName               QType  RCODE    Count  p50 Response Time
2017/01/01 01:00  www.cloudflare.com  A      NODATA   5      0.4876ms
2017/01/01 01:00  api.cloudflare.com  AAAA   NOERROR  10     0.5231ms
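The roll-up from raw rows to per-minute rows can be sketched in a few lines. This is a toy, in-memory illustration of the idea and not the streaming job from the talk; the sample rows, the response-time values and the `rollup` helper are all made up for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical raw rows: (timestamp, qname, qtype, rcode, response_time_ms).
RAW = [
    ("2017/01/01 01:00:00", "www.cloudflare.com", "A", "NODATA", 0.40),
    ("2017/01/01 01:00:01", "api.cloudflare.com", "AAAA", "NOERROR", 0.52),
    ("2017/01/01 01:00:30", "www.cloudflare.com", "A", "NODATA", 0.60),
    ("2017/01/01 01:01:05", "www.cloudflare.com", "A", "NODATA", 0.50),
]


def rollup(rows):
    """Group rows into one-minute buckets per (qname, qtype, rcode)."""
    groups = defaultdict(list)
    for ts, qname, qtype, rcode, rt_ms in rows:
        # Truncate the timestamp to the start of its minute.
        bucket = datetime.strptime(ts, "%Y/%m/%d %H:%M:%S").replace(second=0)
        groups[(bucket, qname, qtype, rcode)].append(rt_ms)
    # Emit one aggregate row per group: a count and the median response time.
    return {
        key: {"count": len(times), "p50_ms": median(times)}
        for key, times in groups.items()
    }


aggregates = rollup(RAW)
```

Each extra dimension in the grouping key multiplies the number of aggregate rows, which is why high-cardinality dimensions such as query names become expensive.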
Let’s aggregate everything... with streams
- Counters
- Total number of queries
- Query types
- Response codes
- Top-n query names
- Top-n query sources
- Response time/size quantiles
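Exact top-n over a stream needs a counter per distinct key, which is costly for query names. One well-known approximation is the Space-Saving algorithm, sketched below; the slides do not say which method the streaming prototypes used, so treat this as an illustration of the technique only. The function name and sample stream are hypothetical.

```python
def space_saving(stream, k):
    """Approximate top-k counts using at most k counters (Space-Saving).

    Reported counts never underestimate the true count of a monitored item.
    """
    counts = {}  # item -> estimated count
    errors = {}  # item -> maximum possible overestimation
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < k:
            counts[item] = 1
            errors[item] = 0
        else:
            # Evict the item with the smallest count; the newcomer inherits it.
            victim = min(counts, key=counts.get)
            floor = counts.pop(victim)
            errors.pop(victim)
            counts[item] = floor + 1
            errors[item] = floor
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


stream = ["www"] * 5 + ["api"] * 3 + ["x", "y"]
top = space_saving(stream, k=3)
```

With three counters, the frequent names keep exact counts while the rare tail item `y` is reported with an overestimate of one.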
Aggregating with Spark Streaming
Produce low cardinality aggregates with Spark Streaming
- Spark experience in-house, though Java/Scala
- Batch-oriented and need a DB to serve online queries
- Difficult to support ad-hoc analysis
- Low resolution aggregates
- Scanning raw data is slow
- Late arriving data
Spark Streaming + CitusDB
Produce low cardinality aggregates with Spark Streaming
Insert aggregate rows into CitusDB cluster for reads
- Distributed time-series DB
- Existing deployments of CitusDB
- High cardinality aggregations are tricky due to insert performance
- Late arriving data
- SQL API
Apache Flink + (CitusDB?)
Produce low cardinality aggregates with Flink
Insert aggregate rows into CitusDB cluster for reads
- Dataflow API and support for stream watermarks
- Checkpoint performance issues
- High cardinality aggregations are tricky due to insert performance
- SQL API
Druid
Insert into a cluster of Druid nodes
- Insertion rate couldn’t keep up in our initial tests
- Estimated costs of a suitable cluster were too high
- Seemed performant for random reads but not the best we’d seen
- Operational complexity seemed high
Let’s aggregate everything... with streams
- Raw data isn’t easily queried ad-hoc
- Backfilling new aggregates is impossible or can be very difficult without custom tools
- A stream can’t serve actual queries
- Can be costly for high cardinality dimensions
ClickHouse
- Tabular, column-oriented data store
- Single binary, clustered architecture
- Familiar SQL query interface
  Lots of very useful built-in aggregation functions
- Raw log data stored for 3 months
  ~7 trillion rows
- Aggregated data for ∞
  1m, 1h aggregations across 3 dimensions
*https://clickhouse.yandex/docs/en/introduction/what_is_clickhouse.html
Log pipeline, extended (diagram)
- Logs are written into various Kafka topics
- Go inserters write the data in parallel
- Multi-tenant ClickHouse cluster stores data
Initial table design
ClickHouse Cluster
TinyLog: dnslogs_2016_01_01_14_30_pN
ReplicatedMergeTree: dnslogs_2016_01_01
ReplicatedMergeTree: dnslogs_2016_01
ReplicatedMergeTree: dnslogs_2016
- Raw logs are inserted into sharded tables
- Sidecar processes aggregate data into day/month/year tables
First attempt in prod.
ClickHouse Cluster
ReplicatedMergeTree: r{0,2}.dnslogs
- Raw logs are inserted into one replicated, sharded table
- Multiple r{0,2} databases to better pack the cluster with shards and replicas
Speeding up typical queries
- SUM() and COUNT() over a few low-cardinality dimensions
- Global overview (trends and monitoring)
- Storing intermediate state for non-additive functions
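Sums and counts can be re-aggregated freely, but functions such as averages or quantiles cannot be combined from their final values. The sketch below shows the general idea of keeping a mergeable intermediate state, using an average as the simplest case. The `AvgState` class is a hypothetical illustration, not ClickHouse's internal representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgState:
    """Mergeable intermediate state for an average: keep the sum and the count."""

    total: float = 0.0
    count: int = 0

    def add(self, value):
        return AvgState(self.total + value, self.count + 1)

    def merge(self, other):
        # Merging two states is exact because sums and counts are additive.
        return AvgState(self.total + other.total, self.count + other.count)

    def finalize(self):
        return self.total / self.count if self.count else None


# Two one-minute buckets holding different numbers of samples.
minute_1 = AvgState().add(1.0).add(2.0).add(3.0)
minute_2 = AvgState().add(10.0)

naive = (minute_1.finalize() + minute_2.finalize()) / 2  # average of averages: wrong
merged = minute_1.merge(minute_2).finalize()             # merged state: correct
```

Averaging the two per-minute averages gives 6.0, while merging the states gives the true average of 4.0 over all four samples.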
Today...
ClickHouse Cluster
ReplicatedMergeTree: r{0,2}.dnslogs
ReplicatedAggregatingMergeTree: dnslogs_rollup_X
- Raw logs are inserted into one replicated, sharded table
- Multiple r{0,2} databases to better pack the cluster with shards and replicas
- Aggregate tables for long-term storage
October 2016
Began evaluating technologies and
architecture, 1 instance in Docker
Finalized schema, deployed a production
ClickHouse cluster of 6 nodes
November 2016
Prototype ClickHouse cluster with 3
nodes, inserting a sample of data
August 2017
Migrated to a new cluster with
multi-tenancy
Growing interest among other
Cloudflare engineering teams,
worked on standard tooling
December 2016
ClickHouse visualisations with
Superset and Grafana
Spring 2017
TopN, IP prefix matching, Go native
driver, Analytics library, pkey in
monotonic functions
Multi-tenant ClickHouse cluster
Row insertion/s: 8M+
RAID-0 spinning disks: 2PB+
Insertion throughput/s: 4GB+
Nodes: 33
ClickHouse Today… 12 Trillion Rows
SELECT
table,
sum(rows) AS total
FROM system.cluster_parts
WHERE database = 'r0'
GROUP BY table
ORDER BY total DESC
┌─table──────────────────────────────┬─────────────total─┐
│ ███████████████ │ 9,051,633,001,267 │
│ ████████████████████ │ 2,088,851,716,078 │
│ ███████████████████ │ 847,768,860,981 │
│ ██████████████████████ │ 259,486,159,236 │
│ … │ … │
Contributions to ClickHouse
- TopK(n) Aggregates
  https://github.com/yandex/ClickHouse/pull/754
- TrieDictionaries (IP Prefix)
  https://github.com/yandex/ClickHouse/pull/785
- SpaceSaving: internal storage for StringRef{}
  https://github.com/yandex/ClickHouse/pull/925
- Bug fixes to the Go native driver
  https://github.com/kshvakov/clickhouse
- sumMap(key, value)
  https://github.com/yandex/ClickHouse/pull/1250
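To make `sumMap(key, value)` concrete, here is a small Python emulation of its semantics as I understand them: each row carries an array of keys and an array of values, values are summed per key across rows, and the result is the sorted keys with their totals. The function `sum_map` and the sample rows are hypothetical and written for illustration; this is not ClickHouse code.

```python
def sum_map(rows):
    """Emulate sumMap over rows of (keys, values) array pairs."""
    totals = {}
    for keys, values in rows:
        if len(keys) != len(values):
            raise ValueError("keys and values must have the same length")
        for key, value in zip(keys, values):
            totals[key] = totals.get(key, 0) + value
    # Return keys in sorted order along with the summed values.
    sorted_keys = sorted(totals)
    return sorted_keys, [totals[key] for key in sorted_keys]


# Hypothetical per-minute rows: response codes and how often each was seen.
rows = [
    (["NOERROR", "NXDOMAIN"], [10, 2]),
    (["NOERROR", "SERVFAIL"], [7, 1]),
]
keys, sums = sum_map(rows)
```

Storing a per-row map like this keeps a breakdown (for example by response code) inside one aggregate row instead of one row per code.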
Other Contributions
- Grafana Plugin
https://github.com/vavrusa/grafana-sqldb-datasource
(see also https://github.com/Vertamedia/clickhouse-grafana)
- SQLAlchemy (Superset)
https://github.com/cloudflare/sqlalchemy-clickhouse
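Python w/ Jupyter Notebooks
import requests
import pandas as pd
from timeit import default_timer as timer

def ch(q, host='127.0.0.1', port=9001):
    """Run a query over the ClickHouse HTTP interface and return a DataFrame."""
    start = timer()
    # Append the output format so the response is tab-separated with a header row.
    r = requests.get(
        'https://%s:%d/' % (host, port),
        params={'user': 'xxx', 'query': q + '\nFORMAT TabSeparatedWithNames'},
        stream=True)
    end = timer()
    if not r.ok:
        raise RuntimeError(r.text)
    print('Query finished in %.02fs' % (end - start))
    # Parse the streamed tab-separated body directly into pandas.
    return pd.read_csv(r.raw, sep='\t')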
blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-queries-per-second
Check it
Thanks!
@tarnfeld @vavrusam
https://cloudflare.com/careers/departments/engineering

FWD Group - Insurer Innovation Award 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

How Cloudflare analyzes >1m DNS queries per second @ Percona E17

  • 1. How Cloudflare analyzes >1m DNS queries per second. Tom Arnfeld (and Marek Vavrusa)
  • 2. 100+ data centers globally; 2.5B monthly unique visitors; >10% of Internet requests every day; ≦3M DNS queries/second; 6M+ websites, apps & APIs in 150 countries; 5M+ HTTP requests/second
  • 3. Anatomy of a DNS query (30+ fields)
      $ dig www.cloudflare.com
      ; <<>> DiG 9.8.3-P1 <<>> www.cloudflare.com
      ;; global options: +cmd
      ;; Got answer:
      ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36582
      ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
      ;; QUESTION SECTION:
      ;www.cloudflare.com. IN A
      ;; ANSWER SECTION:
      www.cloudflare.com. 5 IN A 198.41.215.162
      www.cloudflare.com. 5 IN A 198.41.214.162
      ;; Query time: 34 msec
      ;; SERVER: 192.168.1.1#53(192.168.1.1)
      ;; WHEN: Sat Sep 2 10:48:30 2017
      ;; MSG SIZE rcvd: 68
  • 4. Components: Anycast DNS, Cloudflare DNS Server, HTTP & Other Edge Services, Log Forwarder. Logs from all edge services and all PoPs are shipped over TLS to be processed. Logs are received and de-multiplexed. Logs are written into various Kafka topics.
  • 5. Components: Anycast DNS, Cloudflare DNS Server, HTTP & Other Edge Services, Log Forwarder. Log messages are serialized with Cap'n Proto. Logs from all edge services and all PoPs are shipped over TLS to be processed. Logs are received and de-multiplexed. Logs are written into various Kafka topics.
  • 6. What did we want?
      - Multidimensional query analytics
      - Complex ad-hoc queries
      - Capable of current and expected future scale
      - Gracefully handle late-arriving log data
      - Roll-ups/aggregations for long-term storage
      - Highly available and replicated architecture
      Scale: ≦3M queries per second; 100+ edge points of presence; 20+ query dimensions; 5+ years of stored aggregation
  • 7. Kafka, Apache Spark and Parquet
      Pipeline: logs are received and de-multiplexed; logs are written into various Kafka topics; converted into Parquet and written to HDFS; download and filter data from Kafka using Apache Spark.
      - Scanning the firehose is slow and adding filters is time-consuming
      - Offline analysis is difficult with large amounts of data
      - Not a fast or friendly user experience
      - Doesn’t work for customers
  • 8. Let’s aggregate everything... with streams
      Raw log rows:
        Timestamp            QName               QType  RCODE
        2017/01/01 01:00:00  www.cloudflare.com  A      NODATA
        2017/01/01 01:00:01  api.cloudflare.com  AAAA   NOERROR
      Aggregated rows:
        Time Bucket       QName               QType  RCODE    Count  p50 Response Time
        2017/01/01 01:00  www.cloudflare.com  A      NODATA   5      0.4876ms
        2017/01/01 01:00  api.cloudflare.com  AAAA   NOERROR  10     0.5231ms
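The roll-up on slide 8 can be pictured with a small batch sketch in plain Python. It is only an illustration of the idea, not the pipeline from the talk: the sample rows, the response times, and the names `RAW_ROWS` and `aggregate_by_minute` are all hypothetical, and it computes an exact median in memory where a real stream would need an incremental quantile estimate.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical raw log rows: (timestamp, qname, qtype, rcode, response_ms).
RAW_ROWS = [
    ("2017/01/01 01:00:00", "www.cloudflare.com", "A", "NOERROR", 0.40),
    ("2017/01/01 01:00:01", "api.cloudflare.com", "AAAA", "NOERROR", 0.50),
    ("2017/01/01 01:00:30", "www.cloudflare.com", "A", "NOERROR", 0.60),
    ("2017/01/01 01:01:05", "www.cloudflare.com", "A", "NOERROR", 0.70),
]


def aggregate_by_minute(rows):
    """Group rows by (minute bucket, qname, qtype, rcode); return count and p50."""
    groups = defaultdict(list)
    for ts, qname, qtype, rcode, response_ms in rows:
        parsed = datetime.strptime(ts, "%Y/%m/%d %H:%M:%S")
        # Truncate the timestamp to the start of its minute.
        bucket = parsed.replace(second=0).strftime("%Y/%m/%d %H:%M")
        groups[(bucket, qname, qtype, rcode)].append(response_ms)
    return {
        key: {"count": len(times), "p50_ms": median(times)}
        for key, times in groups.items()
    }


AGGREGATES = aggregate_by_minute(RAW_ROWS)
```

Each key of `AGGREGATES` corresponds to one row of the aggregated table; note that every extra dimension in the key multiplies the number of rows to store.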
  • 9. Let’s aggregate everything... with streams
      - Counters
      - Total number of queries
      - Query types
      - Response codes
      - Top-n query names
      - Top-n query sources
      - Response time/size quantiles
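As a rough illustration of the counters and top-n lists named on slide 9, the sketch below computes them exactly with `collections.Counter`. The records and the `summarize` helper are hypothetical, and exact counting like this keeps every distinct key in memory, which is the cost that approximate top-n structures are meant to avoid.

```python
from collections import Counter

# Hypothetical log records: (qname, qtype, rcode, source_ip).
RECORDS = [
    ("www.cloudflare.com", "A", "NOERROR", "192.0.2.1"),
    ("www.cloudflare.com", "AAAA", "NOERROR", "192.0.2.1"),
    ("api.cloudflare.com", "A", "NOERROR", "192.0.2.2"),
    ("www.cloudflare.com", "A", "NXDOMAIN", "192.0.2.1"),
]


def summarize(records, n=1):
    """Return total, per-type and per-rcode counters, and exact top-n lists."""
    return {
        "total": len(records),
        "qtypes": Counter(r[1] for r in records),
        "rcodes": Counter(r[2] for r in records),
        "top_names": Counter(r[0] for r in records).most_common(n),
        "top_sources": Counter(r[3] for r in records).most_common(n),
    }


SUMMARY = summarize(RECORDS)
```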
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17. Aggregating with Spark Streaming
      Pipeline: logs are received and de-multiplexed; logs are written into various Kafka topics; produce low cardinality aggregates with Spark Streaming.
      - Spark experience in-house, though Java/Scala
      - Batch-oriented and need a DB to serve online queries
      - Difficult to support ad-hoc analysis
      - Low resolution aggregates
      - Scanning raw data is slow
      - Late arriving data
  • 19. Spark Streaming + CitusDB
      Pipeline: logs are received and de-multiplexed; logs are written into various Kafka topics; produce low cardinality aggregates with Spark Streaming; insert aggregate rows into CitusDB cluster for reads.
      - Distributed time-series DB
      - Existing deployments of CitusDB
      - High cardinality aggregations are tricky due to insert performance
      - Late arriving data
      - SQL API
  • 20. Apache Flink + (CitusDB?)
      Pipeline: logs are received and de-multiplexed; logs are written into various Kafka topics; produce low cardinality aggregates with Flink; insert aggregate rows into CitusDB cluster for reads.
      - Dataflow API and support for stream watermarks
      - Checkpoint performance issues
      - High cardinality aggregations are tricky due to insert performance
      - SQL API
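The stream watermarks mentioned on slide 20 are one way to deal with late arriving data. The sketch below shows the general idea in plain Python and is not Flink's API: the class `TumblingWindowCounter`, the 60-second window, and the 30-second allowed lateness are assumptions chosen for illustration. The watermark trails the newest event time seen; a window is emitted once the watermark passes its end, and anything arriving for an already emitted window is counted as dropped.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Count events per fixed event-time window, closing windows by watermark."""

    def __init__(self, window_s=60, allowed_lateness_s=30):
        self.window_s = window_s
        self.allowed_lateness_s = allowed_lateness_s
        self.open_windows = defaultdict(int)  # window start -> running count
        self.closed = {}                      # window start -> final count
        self.watermark = float("-inf")
        self.dropped_late = 0

    def add(self, event_time_s):
        start = event_time_s - (event_time_s % self.window_s)
        if start + self.window_s <= self.watermark:
            # The window was already emitted, so this event is too late.
            self.dropped_late += 1
        else:
            self.open_windows[start] += 1
        # The watermark is the newest event time seen minus the allowed lateness.
        self.watermark = max(self.watermark,
                             event_time_s - self.allowed_lateness_s)
        # Emit every open window that ends at or before the watermark.
        for window_start in sorted(self.open_windows):
            if window_start + self.window_s <= self.watermark:
                self.closed[window_start] = self.open_windows.pop(window_start)


counter = TumblingWindowCounter()
for t in [10, 50, 70, 55, 95, 20]:  # event times in seconds, out of order
    counter.add(t)
```

In this run the event at 55 s arrives out of order but before its window closes, so it is still counted, while the event at 20 s arrives after the first window has been emitted and is dropped.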
  • 21. Druid
      Pipeline: logs are received and de-multiplexed; logs are written into various Kafka topics; insert into a cluster of Druid nodes.
      - Insertion rate couldn’t keep up in our initial tests
      - Estimated costs of a suitable cluster were very high
      - Seemed performant for random reads but not the best we’d seen
      - Operational complexity seemed high
  • 22. Let’s aggregate everything... with streams
      Tables (clipped on the slide):
        Timestamp            QName               QTy
        2017/01/01 01:00:00  www.cloudflare.com  A
        2017/01/01 01:00:01  api.cloudflare.com  AAA
        Time Bucket       QName               QTy
        2017/01/01 01:00  www.cloudflare.com  A
        2017/01/01 01:00  api.cloudflare.com  AAA
      - Raw data isn’t easily queried ad-hoc
      - Backfilling new aggregates is impossible or can be very difficult without custom tools
      - A stream can’t serve actual queries
      - Can be costly for high cardinality dimensions
      *https://clickhouse.yandex/docs/en/introduction/what_is_clickhouse.html
  • 23. ClickHouse
      - Tabular, column-oriented data store
      - Single binary, clustered architecture
      - Familiar SQL query interface (lots of very useful built-in aggregation functions)
      - Raw log data stored for 3 months (~7 trillion rows)
      - Aggregated data for ∞ (1m, 1h aggregations across 3 dimensions)
  • 24. Components: Anycast DNS, Cloudflare DNS Server, HTTP & Other Edge Services, Log Forwarder. Log messages are serialized with Cap'n Proto. Logs from all edge services and all PoPs are shipped over TLS to be processed. Logs are received and de-multiplexed. Logs are written into various Kafka topics. Go inserters write the data in parallel. Multi-tenant ClickHouse cluster stores data.
  • 25. Initial table design
      ClickHouse cluster tables:
        TinyLog              dnslogs_2016_01_01_14_30_pN
        ReplicatedMergeTree  dnslogs_2016_01_01
        ReplicatedMergeTree  dnslogs_2016_01
        ReplicatedMergeTree  dnslogs_2016
      - Raw logs are inserted into sharded tables
      - Sidecar processes aggregate data into day/month/year tables
  • 26. First attempt in prod.
      ClickHouse cluster tables:
        ReplicatedMergeTree  r{0,2}.dnslogs
      - Raw logs are inserted into one replicated, sharded table
      - Multiple r{0,2} databases to better pack the cluster with shards and replicas
  • 27. Speeding up typical queries
      - SUM() and COUNT() over a few low-cardinality dimensions
      - Global overview (trends and monitoring)
      - Storing intermediate state for non-additive functions
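The last point on slide 27, storing intermediate state for non-additive functions, can be shown with an average. Sums and counts from two roll-up rows can simply be added, but two averages cannot be averaged. Keeping a mergeable state and finalizing it only at query time avoids the error. The `AvgState` class below is a hypothetical plain-Python sketch of that idea and does not reproduce ClickHouse's internal state format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgState:
    """Mergeable intermediate state for an average: running sum and count."""
    total: float = 0.0
    count: int = 0

    def add(self, value):
        return AvgState(self.total + value, self.count + 1)

    def merge(self, other):
        # Merging states is additive even though the average itself is not.
        return AvgState(self.total + other.total, self.count + other.count)

    def finalize(self):
        return self.total / self.count


def state_of(values):
    state = AvgState()
    for value in values:
        state = state.add(value)
    return state


minute_1 = state_of([1.0, 2.0, 3.0])  # average 2.0 over 3 queries
minute_2 = state_of([10.0])           # average 10.0 over 1 query

correct = minute_1.merge(minute_2).finalize()           # 16.0 / 4
naive = (minute_1.finalize() + minute_2.finalize()) / 2  # average of averages
```

Here `correct` is 4.0 while `naive` is 6.0, because the naive version ignores how many queries each minute contained.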
  • 28. Today...
      ClickHouse cluster tables:
        ReplicatedMergeTree             r{0,2}.dnslogs
        ReplicatedAggregatingMergeTree  dnslogs_rollup_X
      - Raw logs are inserted into one replicated, sharded table
      - Multiple r{0,2} databases to better pack the cluster with shards and replicas
      - Aggregate tables for long-term storage
  • 29. October 2016 Began evaluating technologies and architecture, 1 instance in Docker Finalized schema, deployed a production ClickHouse cluster of 6 nodes November 2016 Prototype ClickHouse cluster with 3 nodes, inserting a sample of data August 2017 Migrated to a new cluster with multi-tenancy Growing interest among other Cloudflare engineering teams, worked on standard tooling December 2016 ClickHouse visualisations with Superset and Grafana Spring 2017 TopN, IP prefix matching, Go native driver, Analytics library, pkey in monotonic functions
  • 30. Multi-tenant ClickHouse cluster: 8M+ row insertions/s; 4GB+ insertion throughput/s; 2PB+ RAID-0 spinning disks; 33 nodes
  • 31. ClickHouse Today… 12 Trillion Rows
      SELECT
          table,
          sum(rows) AS total
      FROM system.cluster_parts
      WHERE database = 'r0'
      GROUP BY table
      ORDER BY total DESC
      ┌─table──────────────────┬─────────────total─┐
      │ ███████████████        │ 9,051,633,001,267 │
      │ ████████████████████   │ 2,088,851,716,078 │
      │ ███████████████████    │   847,768,860,981 │
      │ ██████████████████████ │   259,486,159,236 │
      │ …                      │                 … │
  • 32. Contributions to ClickHouse
      - TopK(n) Aggregates: https://github.com/yandex/ClickHouse/pull/754
      - TrieDictionaries (IP Prefix): https://github.com/yandex/ClickHouse/pull/785
      - SpaceSaving: internal storage for StringRef{}: https://github.com/yandex/ClickHouse/pull/925
      - Bug fixes to the Go native driver: https://github.com/kshvakov/clickhouse
      - sumMap(key, value): https://github.com/yandex/ClickHouse/pull/1250
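To give a feel for what IP prefix matching means in the list above, here is a minimal longest-prefix-match sketch using Python's standard `ipaddress` module. It is a linear scan for clarity, not a trie and not the ClickHouse dictionary implementation; the function name, the prefixes, and the labels are hypothetical.

```python
import ipaddress

# Hypothetical prefix table mapping CIDR blocks to labels.
PREFIXES = {
    "198.51.100.0/24": "customer-a",
    "198.51.100.128/25": "customer-b",
    "2001:db8::/32": "customer-c",
}


def longest_prefix_match(address, prefixes):
    """Return the label of the most specific prefix containing address, or None."""
    ip = ipaddress.ip_address(address)
    best_net, best_label = None, None
    for cidr, label in prefixes.items():
        net = ipaddress.ip_network(cidr)
        # Only compare addresses and networks of the same IP version.
        if ip.version != net.version or ip not in net:
            continue
        if best_net is None or net.prefixlen > best_net.prefixlen:
            best_net, best_label = net, label
    return best_label


match_specific = longest_prefix_match("198.51.100.200", PREFIXES)
match_general = longest_prefix_match("198.51.100.5", PREFIXES)
match_v6 = longest_prefix_match("2001:db8::1", PREFIXES)
match_none = longest_prefix_match("203.0.113.1", PREFIXES)
```

A trie makes the same lookup proportional to the address length rather than to the number of prefixes, which matters when the table is large.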
  • 33. Other Contributions
      - Grafana Plugin: https://github.com/vavrusa/grafana-sqldb-datasource (see also https://github.com/Vertamedia/clickhouse-grafana)
      - SQLAlchemy (Superset): https://github.com/cloudflare/sqlalchemy-clickhouse
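  • 34. Python w/ Jupyter Notebooks
      import requests
      import pandas as pd
      from timeit import default_timer as timer  # assumed source of timer(); not shown on the slide

      def ch(q, host='127.0.0.1', port=9001):
          """Run query q over the ClickHouse HTTP interface and return a DataFrame."""
          start = timer()
          r = requests.get(
              'https://%s:%d/' % (host, port),
              params={'user': 'xxx',
                      # Ask ClickHouse for tab-separated output with a header row.
                      'query': q + '\nFORMAT TabSeparatedWithNames'},
              stream=True)
          end = timer()
          if not r.ok:
              raise RuntimeError(r.text)
          print('Query finished in %.02fs' % (end - start))
          # Parse the streamed tab-separated response body.
          return pd.read_csv(r.raw, sep='\t')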
  • 36. Python w/ Jupyter Notebooks
  • 37. Python w/ Jupyter Notebooks