SlideShare una empresa de Scribd logo
1 de 21
Become dangerous on ClickHouse
in 30 minutes
Robert Hodges
SF ClickHouse Meetup - 19 February 2019
Presenter Background
● CEO of Altinity, a ClickHouse commercial provider
● Experience in databases:
○ M204...
○ Followed by Sybase, Cloudscape, Oracle, MySQL, …, ClickHouse
● Qualifications for working in data management:
○ A sick interest in math
○ Insatiable curiosity
So...What is this ClickHouse?
Understands SQL
Portable from bare metal to cloud
Stores data in columns
Parallelizes query execution
Scales to many petabytes
Is open source (Apache 2.0)
Is WAY fast!
Id a b c d
Id a b c d
Id a b c d
Id a b c d
What does “WAY fast” mean?
SELECT Dest d, count(*) c, avg(DepDelayMinutes) dd
FROM ontime GROUP BY d HAVING c > 100000
ORDER BY c DESC limit 5
┌─d───┬───────c─┬─────────────────dd─┐
│ ATL │ 8281604 │ 10.328921909330608 │
│ ORD │ 8040484 │ 11.893720825761235 │
│ DFW │ 6928489 │ 8.954579418398442 │
│ LAX │ 5164096 │ 10.109661594207388 │
│ DEN │ 4641382 │ 9.664600112638865 │
└─────┴─────────┴────────────────────┘
5 rows in set. Elapsed: 0.938 sec. Processed
153.53 million rows, 2.46 GB (163.67 million
rows/s., 2.62 GB/s.)
TEST BED
40€/month
Intel Core
i7-6700
32GB
Seagate 4TB
7200RPM
128MB cache
Getting started is easy
$ docker run -d --name ch-s yandex/clickhouse-server
$ docker exec -it ch-s clickhouse client
...
11e99303c78e :) select version()
SELECT version()
┌─version()─┐
│ 19.3.3 │
└───────────┘
1 rows in set. Elapsed: 0.001 sec.
Let’s do some technical due diligence
CREATE TABLE sdata (
DevId Int32,
Type String,
MDate Date,
MDatetime DateTime,
Value Float64
) ENGINE = MergeTree() PARTITION BY toYYYYMM(MDate)
ORDER BY (DevId, MDatetime)
INSERT INTO sdata VALUES
(15, 'TEMP', '2018-01-01', '2018-01-01 23:29:55', 18.0),
(15, 'TEMP', '2018-01-01', '2018-01-01 23:30:56', 18.7)
INSERT INTO sdata VALUES
(15, 'TEMP', '2018-01-01', '2018-01-01 23:31:53', 18.1),
(2, 'TEMP', '2018-01-01', '2018-01-01 23:31:55', 7.9)
Select results can be surprising!
SELECT *
FROM sdata ┌─DevId─┬─Type─┬──────MDate─┬───────────MDatetime─┬─Value─┐
│ 15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:29:55 │ 18 │
│ 15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:30:56 │ 18.7 │
└───────┴──────┴────────────┴─────────────────────┴───────┘
┌─DevId─┬─Type─┬──────MDate─┬───────────MDatetime─┬─Value─┐
│ 2 │ TEMP │ 2018-01-01 │ 2018-01-01 23:31:55 │ 7.9 │
│ 15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:31:53 │ 18.1 │
└───────┴──────┴────────────┴─────────────────────┴───────┘
┌─DevId─┬─Type─┬──────MDate─┬───────────MDatetime─┬─Value─┐
│ 2 │ TEMP │ 2018-01-01 │ 2018-01-01 23:31:55 │ 7.9 │
│ 15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:29:55 │ 18 │
│ 15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:30:56 │ 18.7 │
│ 15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:31:53 │ 18.1 │
└───────┴──────┴────────────┴─────────────────────┴───────┘
Result right after INSERT:
Result somewhat later:
Time for some research into table engines
CREATE TABLE sdata (
DevId Int32,
Type String,
MDate Date,
MDatetime DateTime,
Value Float64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(MDate)
ORDER BY (DevId, MDatetime)
How to manage data
and handle queries
How to break table
into parts
How to index and
sort data in each part
MergeTree writes parts quickly and merges them offline
/var/lib/clickhouse/data/default/sdata
201801_1_1_0/
201801_2_2_0/
Multiple parts after initial
insertion ( => very fast writes)
201801_1_2_1/
Single part after merge
( => very fast reads)
Rows are indexed and sorted inside each part
/var/lib/clickhouse/data/default/sdata
... ...
956 2018-01-01 15:22:37
575 2018-01-01 23:31:53
1300 2018-01-02 05:14:47
... ...
primary.idx
||||
.mrk .bin
||||
.mrk .bin
||||
.mrk .bin
||||
.mrk .bin
201802_1_1_0/
(DevId, MDateTime) DevId Type MDate MDatetime...
primary.idx .mrk .bin .mrk .bin .mrk .bin .mrk .bin
201801_1_2_1/
(DevId, MDateTime) DevId Type MDate MDatetime...
ClickHouse
Now we can follow how query works on a single server
SELECT DevId, Type, avg(Value)
FROM sdata
WHERE MDate = '2018-01-01'
GROUP BY DevId, Type
Identify parts to search
Query in parallel
Aggregate results
Result Set
What if you have a lot of data?
CREATE TABLE
sense.sdata_dist AS sense.sdata
ENGINE = Distributed(scluster, sense, sdata, DevId)
Engine for distributed
tables
Cluster, database,
and table names
Distribute data
on this value
Copy this
schema
What’s a Cluster? (Hint: it’s a file.)
<yandex>
<remote_servers>
<scluster>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>ch-d7dc980i1-0.d7dc980i1s</host>
<port>9000</port>
</replica>
</shard>
<shard>
<internal_replication>true</internal_replication>
<replica>
<host>ch-d7dc980i2-0.d7dc980i2s</host>
<port>9000</port>
</replica>
</shard>
...
</scluster>
</remote_servers>
</yandex>
Distributed engine spreads queries across shards
SELECT ...
FROM
sdata_dist
ClickHouse
sdata_dist
(Distributed)
sdata
(MergeTable)
ClickHouse
sdata_dist sdata
ClickHouse
sdata_dist sdata
Result Set
What if you have a lot of data and you don’t want to lose it?
CREATE TABLE sense.sdata (
DevId Int32, Type String,
MDate Date,
MDatetime DateTime,
Value Float64
) engine=ReplicatedMergeTree(
'/clickhouse/{installation}/{scluster}/tables/{scluster-
shard}/sense/sdata',
'{replica}', MDate, (DevId, MDatetime), 8192);
Engine for replicated
tables
MergeTree parameters
Cluster identification
Replica #
Now we distribute across shards and replicas
ClickHouse
sdata_dist
sdata
ReplicatedMergeTree
Engine
ClickHouse
sdata_dist
sdata
ClickHouse
sdata_dist
sdata
ClickHouse
sdata_dist
sdata
ClickHouse
sdata_dist
sdata
ClickHouse
sdata_dist
sdata
SELECT ...
FROM
sdata_dist
Result Set
Zookeeper
Zookeeper
Zookeeper
SELECT Dest, count(*) c, avg(DepDelayMinutes)
FROM ontime
GROUP BY Dest HAVING c > 100000
ORDER BY c DESC limit 5
SELECT Dest, count(*) c, avg(DepDelayMinutes)
FROM ontime
WHERE toYear(FlightDate) =
toYear(toDate('2016-01-01'))
GROUP BY Dest HAVING c > 100000
ORDER BY c DESC limit 5
With basic engine knowledge you can now tune queries
Scans 355 table parts
in parallel; does not
use index
Scans 12 parts (3%
of data) because
FlightDate is
partition key
Hint: clickhouse-server.log has the query plan
Faster
SELECT
Dest d, Name n, count(*) c, avg(ArrDelayMinutes)
FROM ontime
JOIN airports ON (airports.IATA = ontime.Dest)
GROUP BY d, n HAVING c > 100000 ORDER BY ad DESC
SELECT dest, Name n, c AS flights, ad FROM (
SELECT Dest dest, count(*) c, avg(ArrDelayMinutes) ad
FROM ontime
GROUP BY dest HAVING c > 100000
ORDER BY ad DESC
) LEFT JOIN airports ON airports.IATA = dest
You can also optimize joins
Subquery
minimizes data
scanned in
parallel; joins on
GROUP BY results
Joins on data
before GROUP BY,
increased amount
to scan
Faster
ClickHouse has a wealth of features to help queries go fast
Dictionaries
Materialized Views
Arrays
Specialized functions and SQL
extensions
Lots more table engines
Parting Question: What’s ClickHouse good for?
● Fast, scalable data warehouse for online services (SaaS
and in-house apps
● Built-in data warehouse for installed analytic applications
● Exploration -- throw in a bunch of data and go crazy!
Thank you!
Comments to:
rhodges at altinity.com
Visit us at:
https://www.altinity.com

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
 
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UIData Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI
Data Warehouses in Kubernetes Visualized: the ClickHouse Kubernetes Operator UI
 
Your first ClickHouse data warehouse
Your first ClickHouse data warehouseYour first ClickHouse data warehouse
Your first ClickHouse data warehouse
 
All about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdfAll about Zookeeper and ClickHouse Keeper.pdf
All about Zookeeper and ClickHouse Keeper.pdf
 
A day in the life of a click house query
A day in the life of a click house queryA day in the life of a click house query
A day in the life of a click house query
 
ClickHouse and the Magic of Materialized Views, By Robert Hodges and Altinity...
ClickHouse and the Magic of Materialized Views, By Robert Hodges and Altinity...ClickHouse and the Magic of Materialized Views, By Robert Hodges and Altinity...
ClickHouse and the Magic of Materialized Views, By Robert Hodges and Altinity...
 
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
Webinar slides: MORE secrets of ClickHouse Query Performance. By Robert Hodge...
 
Adventures with the ClickHouse ReplacingMergeTree Engine
Adventures with the ClickHouse ReplacingMergeTree EngineAdventures with the ClickHouse ReplacingMergeTree Engine
Adventures with the ClickHouse ReplacingMergeTree Engine
 
Webinar: Strength in Numbers: Introduction to ClickHouse Cluster Performance
Webinar: Strength in Numbers: Introduction to ClickHouse Cluster PerformanceWebinar: Strength in Numbers: Introduction to ClickHouse Cluster Performance
Webinar: Strength in Numbers: Introduction to ClickHouse Cluster Performance
 
ClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei MilovidovClickHouse Features for Advanced Users, by Aleksei Milovidov
ClickHouse Features for Advanced Users, by Aleksei Milovidov
 
Fun with click house window functions webinar slides 2021-08-19
Fun with click house window functions webinar slides  2021-08-19Fun with click house window functions webinar slides  2021-08-19
Fun with click house window functions webinar slides 2021-08-19
 
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesA Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
 
Building ClickHouse and Making Your First Contribution: A Tutorial_06.10.2021
Building ClickHouse and Making Your First Contribution: A Tutorial_06.10.2021Building ClickHouse and Making Your First Contribution: A Tutorial_06.10.2021
Building ClickHouse and Making Your First Contribution: A Tutorial_06.10.2021
 
ClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and howClickHouse Monitoring 101: What to monitor and how
ClickHouse Monitoring 101: What to monitor and how
 
10 Good Reasons to Use ClickHouse
10 Good Reasons to Use ClickHouse10 Good Reasons to Use ClickHouse
10 Good Reasons to Use ClickHouse
 
Altinity Quickstart for ClickHouse-2202-09-15.pdf
Altinity Quickstart for ClickHouse-2202-09-15.pdfAltinity Quickstart for ClickHouse-2202-09-15.pdf
Altinity Quickstart for ClickHouse-2202-09-15.pdf
 
ClickHouse Keeper
ClickHouse KeeperClickHouse Keeper
ClickHouse Keeper
 
[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse[Meetup] a successful migration from elastic search to clickhouse
[Meetup] a successful migration from elastic search to clickhouse
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
Altinity Quickstart for ClickHouse
Altinity Quickstart for ClickHouseAltinity Quickstart for ClickHouse
Altinity Quickstart for ClickHouse
 

Similar a Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO

Sv big datascience_cliffclick_5_2_2013
Sv big datascience_cliffclick_5_2_2013Sv big datascience_cliffclick_5_2_2013
Sv big datascience_cliffclick_5_2_2013
Sri Ambati
 
Data Democratization at Nubank
 Data Democratization at Nubank Data Democratization at Nubank
Data Democratization at Nubank
Databricks
 
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
SegFaultConf
 
Tales from the Field
Tales from the FieldTales from the Field
Tales from the Field
MongoDB
 
Handling 20 billion requests a month
Handling 20 billion requests a monthHandling 20 billion requests a month
Handling 20 billion requests a month
Dmitriy Dumanskiy
 

Similar a Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO (20)

Performance tuning ColumnStore
Performance tuning ColumnStorePerformance tuning ColumnStore
Performance tuning ColumnStore
 
Big Data in Real-Time: How ClickHouse powers Admiral's visitor relationships ...
Big Data in Real-Time: How ClickHouse powers Admiral's visitor relationships ...Big Data in Real-Time: How ClickHouse powers Admiral's visitor relationships ...
Big Data in Real-Time: How ClickHouse powers Admiral's visitor relationships ...
 
Bogdan Kecman INIT Presentation
Bogdan Kecman INIT PresentationBogdan Kecman INIT Presentation
Bogdan Kecman INIT Presentation
 
Nsd, il tuo compagno di viaggio quando Domino va in crash
Nsd, il tuo compagno di viaggio quando Domino va in crashNsd, il tuo compagno di viaggio quando Domino va in crash
Nsd, il tuo compagno di viaggio quando Domino va in crash
 
Osol Pgsql
Osol PgsqlOsol Pgsql
Osol Pgsql
 
Bogdan Kecman Advanced Databasing
Bogdan Kecman Advanced DatabasingBogdan Kecman Advanced Databasing
Bogdan Kecman Advanced Databasing
 
Sv big datascience_cliffclick_5_2_2013
Sv big datascience_cliffclick_5_2_2013Sv big datascience_cliffclick_5_2_2013
Sv big datascience_cliffclick_5_2_2013
 
implementation of a big data architecture for real-time analytics with data s...
implementation of a big data architecture for real-time analytics with data s...implementation of a big data architecture for real-time analytics with data s...
implementation of a big data architecture for real-time analytics with data s...
 
Data Democratization at Nubank
 Data Democratization at Nubank Data Democratization at Nubank
Data Democratization at Nubank
 
Altitude San Francisco 2018: Logging at the Edge
Altitude San Francisco 2018: Logging at the Edge Altitude San Francisco 2018: Logging at the Edge
Altitude San Francisco 2018: Logging at the Edge
 
A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)
 
Tweaking perfomance on high-load projects_Думанский Дмитрий
Tweaking perfomance on high-load projects_Думанский ДмитрийTweaking perfomance on high-load projects_Думанский Дмитрий
Tweaking perfomance on high-load projects_Думанский Дмитрий
 
Beyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the codeBeyond PHP - It's not (just) about the code
Beyond PHP - It's not (just) about the code
 
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
Robert Pankowecki - Czy sprzedawcy SQLowych baz nas oszukali?
 
Big Data and Beautiful Video: How ClickHouse enables Mux to Deliver Content a...
Big Data and Beautiful Video: How ClickHouse enables Mux to Deliver Content a...Big Data and Beautiful Video: How ClickHouse enables Mux to Deliver Content a...
Big Data and Beautiful Video: How ClickHouse enables Mux to Deliver Content a...
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC Oslo
 
Tales from the Field
Tales from the FieldTales from the Field
Tales from the Field
 
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
InfluxDB 101 – Concepts and Architecture by Michael DeSa, Software Engineer |...
 
Handling 20 billion requests a month
Handling 20 billion requests a monthHandling 20 billion requests a month
Handling 20 billion requests a month
 
Macy's: Changing Engines in Mid-Flight
Macy's: Changing Engines in Mid-FlightMacy's: Changing Engines in Mid-Flight
Macy's: Changing Engines in Mid-Flight
 

Más de Altinity Ltd

Más de Altinity Ltd (20)

Building an Analytic Extension to MySQL with ClickHouse and Open Source.pptx
Building an Analytic Extension to MySQL with ClickHouse and Open Source.pptxBuilding an Analytic Extension to MySQL with ClickHouse and Open Source.pptx
Building an Analytic Extension to MySQL with ClickHouse and Open Source.pptx
 
Cloud Native ClickHouse at Scale--Using the Altinity Kubernetes Operator-2022...
Cloud Native ClickHouse at Scale--Using the Altinity Kubernetes Operator-2022...Cloud Native ClickHouse at Scale--Using the Altinity Kubernetes Operator-2022...
Cloud Native ClickHouse at Scale--Using the Altinity Kubernetes Operator-2022...
 
Building an Analytic Extension to MySQL with ClickHouse and Open Source
Building an Analytic Extension to MySQL with ClickHouse and Open SourceBuilding an Analytic Extension to MySQL with ClickHouse and Open Source
Building an Analytic Extension to MySQL with ClickHouse and Open Source
 
Fun with ClickHouse Window Functions-2021-08-19.pdf
Fun with ClickHouse Window Functions-2021-08-19.pdfFun with ClickHouse Window Functions-2021-08-19.pdf
Fun with ClickHouse Window Functions-2021-08-19.pdf
 
Cloud Native Data Warehouses - Intro to ClickHouse on Kubernetes-2021-07.pdf
Cloud Native Data Warehouses - Intro to ClickHouse on Kubernetes-2021-07.pdfCloud Native Data Warehouses - Intro to ClickHouse on Kubernetes-2021-07.pdf
Cloud Native Data Warehouses - Intro to ClickHouse on Kubernetes-2021-07.pdf
 
Building High Performance Apps with Altinity Stable Builds for ClickHouse | A...
Building High Performance Apps with Altinity Stable Builds for ClickHouse | A...Building High Performance Apps with Altinity Stable Builds for ClickHouse | A...
Building High Performance Apps with Altinity Stable Builds for ClickHouse | A...
 
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...
 
Own your ClickHouse data with Altinity.Cloud Anywhere-2023-01-17.pdf
Own your ClickHouse data with Altinity.Cloud Anywhere-2023-01-17.pdfOwn your ClickHouse data with Altinity.Cloud Anywhere-2023-01-17.pdf
Own your ClickHouse data with Altinity.Cloud Anywhere-2023-01-17.pdf
 
ClickHouse ReplacingMergeTree in Telecom Apps
ClickHouse ReplacingMergeTree in Telecom AppsClickHouse ReplacingMergeTree in Telecom Apps
ClickHouse ReplacingMergeTree in Telecom Apps
 
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with  Apache Pulsar and Apache PinotBuilding a Real-Time Analytics Application with  Apache Pulsar and Apache Pinot
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot
 
Altinity Webinar: Introduction to Altinity.Cloud-Platform for Real-Time Data.pdf
Altinity Webinar: Introduction to Altinity.Cloud-Platform for Real-Time Data.pdfAltinity Webinar: Introduction to Altinity.Cloud-Platform for Real-Time Data.pdf
Altinity Webinar: Introduction to Altinity.Cloud-Platform for Real-Time Data.pdf
 
OSA Con 2022 - What Data Engineering Can Learn from Frontend Engineering - Pe...
OSA Con 2022 - What Data Engineering Can Learn from Frontend Engineering - Pe...OSA Con 2022 - What Data Engineering Can Learn from Frontend Engineering - Pe...
OSA Con 2022 - What Data Engineering Can Learn from Frontend Engineering - Pe...
 
OSA Con 2022 - Welcome to OSA CON Version 2022 - Robert Hodges - Altinity.pdf
OSA Con 2022 - Welcome to OSA CON Version 2022 - Robert Hodges - Altinity.pdfOSA Con 2022 - Welcome to OSA CON Version 2022 - Robert Hodges - Altinity.pdf
OSA Con 2022 - Welcome to OSA CON Version 2022 - Robert Hodges - Altinity.pdf
 
OSA Con 2022 - Using ClickHouse Database to Power Analytics and Customer Enga...
OSA Con 2022 - Using ClickHouse Database to Power Analytics and Customer Enga...OSA Con 2022 - Using ClickHouse Database to Power Analytics and Customer Enga...
OSA Con 2022 - Using ClickHouse Database to Power Analytics and Customer Enga...
 
OSA Con 2022 - Tips and Tricks to Keep Your Queries under 100ms with ClickHou...
OSA Con 2022 - Tips and Tricks to Keep Your Queries under 100ms with ClickHou...OSA Con 2022 - Tips and Tricks to Keep Your Queries under 100ms with ClickHou...
OSA Con 2022 - Tips and Tricks to Keep Your Queries under 100ms with ClickHou...
 
OSA Con 2022 - The Open Source Analytic Universe, Version 2022 - Robert Hodge...
OSA Con 2022 - The Open Source Analytic Universe, Version 2022 - Robert Hodge...OSA Con 2022 - The Open Source Analytic Universe, Version 2022 - Robert Hodge...
OSA Con 2022 - The Open Source Analytic Universe, Version 2022 - Robert Hodge...
 
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
OSA Con 2022 - Switching Jaeger Distributed Tracing to ClickHouse to Enable A...
 
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
OSA Con 2022 - Streaming Data Made Easy - Tim Spann & David Kjerrumgaard - St...
 
OSA Con 2022 - State of Open Source Databases - Peter Zaitsev - Percona.pdf
OSA Con 2022 - State of Open Source Databases - Peter Zaitsev - Percona.pdfOSA Con 2022 - State of Open Source Databases - Peter Zaitsev - Percona.pdf
OSA Con 2022 - State of Open Source Databases - Peter Zaitsev - Percona.pdf
 
OSA Con 2022 - Specifics of data analysis in Time Series Databases - Roman Kh...
OSA Con 2022 - Specifics of data analysis in Time Series Databases - Roman Kh...OSA Con 2022 - Specifics of data analysis in Time Series Databases - Roman Kh...
OSA Con 2022 - Specifics of data analysis in Time Series Databases - Roman Kh...
 

Último

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Último (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 

Dangerous on ClickHouse in 30 minutes, by Robert Hodges, Altinity CEO

  • 1. Become dangerous on ClickHouse in 30 minutes Robert Hodges SF ClickHouse Meetup - 19 February 2019
  • 2. Presenter Background ● CEO of Altinity, a ClickHouse commercial provider ● Experience in databases: ○ M204... ○ Followed by Sybase, Cloudscape, Oracle, MySQL, …, ClickHouse ● Qualifications for working in data management: ○ A sick interest in math ○ Insatiable curiosity
  • 3. So...What is this ClickHouse? Understands SQL Portable from bare metal to cloud Stores data in columns Parallelizes query execution Scales to many petabytes Is open source (Apache 2.0) Is WAY fast! Id a b c d Id a b c d Id a b c d Id a b c d
  • 4. What does “WAY fast” mean? SELECT Dest d, count(*) c, avg(DepDelayMinutes) dd FROM ontime GROUP BY d HAVING c > 100000 ORDER BY c DESC limit 5 ┌─d───┬───────c─┬─────────────────dd─┐ │ ATL │ 8281604 │ 10.328921909330608 │ │ ORD │ 8040484 │ 11.893720825761235 │ │ DFW │ 6928489 │ 8.954579418398442 │ │ LAX │ 5164096 │ 10.109661594207388 │ │ DEN │ 4641382 │ 9.664600112638865 │ └─────┴─────────┴────────────────────┘ 5 rows in set. Elapsed: 0.938 sec. Processed 153.53 million rows, 2.46 GB (163.67 million rows/s., 2.62 GB/s.) TEST BED 40€/month Intel Core i7-6700 32GB Seagate 4TB 7200RPM 128MB cache
  • 5. Getting started is easy $ docker run -d --name ch-s yandex/clickhouse-server $ docker exec -it ch-s clickhouse client ... 11e99303c78e :) select version() SELECT version() ┌─version()─┐ │ 19.3.3 │ └───────────┘ 1 rows in set. Elapsed: 0.001 sec.
  • 6. Let’s do some technical due diligence CREATE TABLE sdata ( DevId Int32, Type String, MDate Date, MDatetime DateTime, Value Float64 ) ENGINE = MergeTree() PARTITION BY toYYYYMM(MDate) ORDER BY (DevId, MDatetime) INSERT INTO sdata VALUES (15, 'TEMP', '2018-01-01', '2018-01-01 23:29:55', 18.0), (15, 'TEMP', '2018-01-01', '2018-01-01 23:30:56', 18.7) INSERT INTO sdata VALUES (15, 'TEMP', '2018-01-01', '2018-01-01 23:31:53', 18.1), (2, 'TEMP', '2018-01-01', '2018-01-01 23:31:55', 7.9)
  • 7. Select results can be surprising! SELECT * FROM sdata ┌─DevId─┬─Type─┬──────MDate─┬───────────MDatetime─┬─Value─┐ │ 15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:29:55 │ 18 │ │ 15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:30:56 │ 18.7 │ └───────┴──────┴────────────┴─────────────────────┴───────┘ ┌─DevId─┬─Type─┬──────MDate─┬───────────MDatetime─┬─Value─┐ │ 2 │ TEMP │ 2018-01-01 │ 2018-01-01 23:31:55 │ 7.9 │ │ 15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:31:53 │ 18.1 │ └───────┴──────┴────────────┴─────────────────────┴───────┘ ┌─DevId─┬─Type─┬──────MDate─┬───────────MDatetime─┬─Value─┐ │ 2 │ TEMP │ 2018-01-01 │ 2018-01-01 23:31:55 │ 7.9 │ │ 15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:29:55 │ 18 │ │ 15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:30:56 │ 18.7 │ │ 15 │ TEMP │ 2018-01-01 │ 2018-01-01 23:31:53 │ 18.1 │ └───────┴──────┴────────────┴─────────────────────┴───────┘ Result right after INSERT: Result somewhat later:
  • 8. Time for some research into table engines CREATE TABLE sdata ( DevId Int32, Type String, MDate Date, MDatetime DateTime, Value Float64 ) ENGINE = MergeTree() PARTITION BY toYYYYMM(MDate) ORDER BY (DevId, MDatetime) How to manage data and handle queries How to break table into parts How to index and sort data in each part
  • 9. MergeTree writes parts quickly and merges them offline /var/lib/clickhouse/data/default/sdata 201801_1_1_0/ 201801_2_2_0/ Multiple parts after initial insertion ( => very fast writes) 201801_1_2_1/ Single part after merge ( => very fast reads)
  • 10. Rows are indexed and sorted inside each part /var/lib/clickhouse/data/default/sdata ... ... 956 2018-01-01 15:22:37 575 2018-01-01 23:31:53 1300 2018-01-02 05:14:47 ... ... primary.idx |||| .mrk .bin |||| .mrk .bin |||| .mrk .bin |||| .mrk .bin 201802_1_1_0/ (DevId, MDateTime) DevId Type MDate MDatetime... primary.idx .mrk .bin .mrk .bin .mrk .bin .mrk .bin 201801_1_2_1/ (DevId, MDateTime) DevId Type MDate MDatetime...
  • 11. ClickHouse Now we can follow how query works on a single server SELECT DevId, Type, avg(Value) FROM sdata WHERE MDate = '2018-01-01' GROUP BY DevId, Type Identify parts to search Query in parallel Aggregate results Result Set
  • 12. What if you have a lot of data? CREATE TABLE sense.sdata_dist AS sense.sdata ENGINE = Distributed(scluster, sense, sdata, DevId) Engine for distributed tables Cluster, database, and table names Distribute data on this value Copy this schema
  • 13. What’s a Cluster? (Hint: it’s a file.) <yandex> <remote_servers> <scluster> <shard> <internal_replication>true</internal_replication> <replica> <host>ch-d7dc980i1-0.d7dc980i1s</host> <port>9000</port> </replica> </shard> <shard> <internal_replication>true</internal_replication> <replica> <host>ch-d7dc980i2-0.d7dc980i2s</host> <port>9000</port> </replica> </shard> ... </scluster> </remote_servers> </yandex>
  • 14. Distributed engine spreads queries across shards SELECT ... FROM sdata_dist ClickHouse sdata_dist (Distributed) sdata (MergeTable) ClickHouse sdata_dist sdata ClickHouse sdata_dist sdata Result Set
  • 15. What if you have a lot of data and you don’t want to lose it? CREATE TABLE sense.sdata ( DevId Int32, Type String, MDate Date, MDatetime DateTime, Value Float64 ) engine=ReplicatedMergeTree( '/clickhouse/{installation}/{scluster}/tables/{scluster- shard}/sense/sdata', '{replica}', MDate, (DevId, MDatetime), 8192); Engine for replicated tables MergeTree parameters Cluster identification Replica #
  • 16. Now we distribute across shards and replicas ClickHouse sdata_dist sdata ReplicatedMergeTree Engine ClickHouse sdata_dist sdata ClickHouse sdata_dist sdata ClickHouse sdata_dist sdata ClickHouse sdata_dist sdata ClickHouse sdata_dist sdata SELECT ... FROM sdata_dist Result Set Zookeeper Zookeeper Zookeeper
  • 17. SELECT Dest, count(*) c, avg(DepDelayMinutes) FROM ontime GROUP BY Dest HAVING c > 100000 ORDER BY c DESC limit 5 SELECT Dest, count(*) c, avg(DepDelayMinutes) FROM ontime WHERE toYear(FlightDate) = toYear(toDate('2016-01-01')) GROUP BY Dest HAVING c > 100000 ORDER BY c DESC limit 5 With basic engine knowledge you can now tune queries Scans 355 table parts in parallel; does not use index Scans 12 parts (3% of data) because FlightDate is partition key Hint: clickhouse-server.log has the query plan Faster
  • 18. SELECT Dest d, Name n, count(*) c, avg(ArrDelayMinutes) FROM ontime JOIN airports ON (airports.IATA = ontime.Dest) GROUP BY d, n HAVING c > 100000 ORDER BY ad DESC SELECT dest, Name n, c AS flights, ad FROM ( SELECT Dest dest, count(*) c, avg(ArrDelayMinutes) ad FROM ontime GROUP BY dest HAVING c > 100000 ORDER BY ad DESC ) LEFT JOIN airports ON airports.IATA = dest You can also optimize joins Subquery minimizes data scanned in parallel; joins on GROUP BY results Joins on data before GROUP BY, increased amount to scan Faster
  • 19. ClickHouse has a wealth of features to help queries go fast Dictionaries Materialized Views Arrays Specialized functions and SQL extensions Lots more table engines
  • 20. Parting Question: What’s ClickHouse good for? ● Fast, scalable data warehouse for online services (SaaS and in-house apps ● Built-in data warehouse for installed analytic applications ● Exploration -- throw in a bunch of data and go crazy!
  • 21. Thank you! Comments to: rhodges at altinity.com Visit us at: https://www.altinity.com