Your first ClickHouse data warehouse

SF Bay Area ClickHouse Meetup talk introducing ClickHouse to database developers

  1. Your first ClickHouse data warehouse. Robert Hodges, 2 December 2020, SF Bay Area ClickHouse Meetup
  2. Presenter and Company Bio
      ● Robert Hodges, Altinity CEO: 30+ years on DBMS plus virtualization and security. Using Kubernetes since 2018.
      ● Altinity: enterprise provider for ClickHouse, a popular open source data warehouse. Community sponsor and major committers to the ClickHouse project.
  3. Introducing ClickHouse
  4. ClickHouse is an open source data warehouse
      ● Single binary
      ● Understands SQL
      ● Runs on bare metal to cloud
      ● Stores data in columns
      ● Parallel and vectorized execution
      ● Scales to many petabytes
      ● Is open source (Apache 2.0)
      And it’s really fast!
      [Diagram: four ClickHouse Server nodes, each holding columns a, b, c, d]
  5. Installing ClickHouse goodness on Linux
      Packaging options: Debian packages, RPMs, tarballs.
          # UBUNTU/DEBIAN INSTALL
          sudo apt-get install apt-transport-https ca-certificates dirmngr
          sudo apt-key adv --keyserver hkp:// --recv E0C56BD4
          echo "deb main/" | sudo tee /etc/apt/sources.list.d/clickhouse.list
          sudo apt-get update
          sudo apt-get install -y clickhouse-server clickhouse-client
          sudo systemctl start clickhouse-server
  6. ClickHouse goodness delivered by Docker
          mkdir $HOME/clickhouse-data
          docker run -d --name clickhouse-server \
            --ulimit nofile=262144:262144 \
            --volume=$HOME/clickhouse-data:/var/lib/clickhouse \
            -p 8123:8123 -p 9000:9000 \
            yandex/clickhouse-server
      ● Make ClickHouse happy: --ulimit nofile=262144:262144
      ● Persist data: --volume=$HOME/clickhouse-data:/var/lib/clickhouse
      ● Make ports visible: -p 8123:8123 -p 9000:9000
  7. Is there ClickHouse cloud goodness? YES!
      ● Yandex Managed Service for ClickHouse: runs in Yandex.Cloud
      ● Altinity.Cloud: runs in Amazon Public Cloud
  8. Where is the documentation?
  9. Getting started with app development
  10. First step: The ClickHouse Tutorial
  11. Second step: Design table(s) and load data
          CREATE TABLE meetup.readings (
              sensor_id Int32,
              time DateTime,
              date Date,
              temperature Decimal(5,2)
          ) Engine = MergeTree
          PARTITION BY toYYYYMM(time)
          ORDER BY (sensor_id, time);
      ● Don’t stress about data types
      ● Use MergeTree table types
      ● Partition by month or day
      ● Sort by “keys” to find data
      ● LZ4 compression by default
  12. Your friend: the MergeTree table type
      [Diagram: a table is divided into parts; each part holds a sparse index plus columns stored as compressed blocks]
      ● Rows in a part match the PARTITION BY expression
      ● Columns are sorted on the ORDER BY columns
      ● Each part has its own sparse index
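The sparse index on this slide can be pictured with a small Python model. This is only a sketch of the idea, not ClickHouse's implementation: it assumes a single-column sort key and a tiny, hypothetical granule size of 4 rows (ClickHouse's default `index_granularity` is 8192). The helper names `build_sparse_index` and `granules_for` are invented for illustration.

```python
from bisect import bisect_left, bisect_right

GRANULE = 4  # hypothetical granule size, chosen small so the example is readable


def build_sparse_index(sorted_keys, granule=GRANULE):
    """Keep only the first key of every granule (one 'mark' per granule)."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule)]


def granules_for(marks, key):
    """Return the granule numbers that may contain rows equal to `key`.

    Rows equal to `key` can start in the granule just before the first
    mark that is >= key, so that granule is included as well.
    """
    first = max(bisect_left(marks, key) - 1, 0)
    last = bisect_right(marks, key)  # exclusive upper bound
    return range(first, last)


# 16 rows already sorted by sensor_id, as ORDER BY would store them.
sensor_ids = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3]
marks = build_sparse_index(sensor_ids)

for g in granules_for(marks, 1):
    # Only this slice has to be read and scanned for sensor_id = 1.
    print(g, sensor_ids[g * GRANULE:(g + 1) * GRANULE])
```

The index stays small because it holds one entry per granule rather than one per row; the price is that a lookup reads whole granules and filters them afterwards.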
  13. Popular formats for loading data
      CSVWithNames
          "sensor_id","time","date","temperature"
          0,"2019-01-01 00:00:00","2019-01-01",43.31
          0,"2019-01-01 00:01:00","2019-01-01",43.35
      JSONEachRow
          {"sensor_id":0,"time":"2019-01-01 00:00:00","date":"2019-01-01",...}
          {"sensor_id":0,"time":"2019-01-01 00:01:00","date":"2019-01-01",...}
          {"sensor_id":0,"time":"2019-01-01 00:02:00","date":"2019-01-01",...}
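To try the two formats without real sensor data, a short Python script can generate matching sample files. This is a minimal sketch using only the standard library; the function names are invented for illustration, and the temperature values are the two shown on the slide.

```python
import csv
import io
import json
from datetime import datetime, timedelta

COLUMNS = ["sensor_id", "time", "date", "temperature"]


def make_readings(sensor_id, start, temperatures):
    """Build one dict per reading, spaced one minute apart."""
    rows = []
    for i, temp in enumerate(temperatures):
        ts = start + timedelta(minutes=i)
        rows.append({
            "sensor_id": sensor_id,
            "time": ts.strftime("%Y-%m-%d %H:%M:%S"),
            "date": ts.strftime("%Y-%m-%d"),
            "temperature": temp,
        })
    return rows


def to_csv_with_names(rows):
    """Header line with column names, then one quoted CSV line per row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS,
                            quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json_each_row(rows):
    """One compact JSON object per line, no enclosing array."""
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in rows)


rows = make_readings(0, datetime(2019, 1, 1), [43.31, 43.35])
print(to_csv_with_names(rows), end="")
print(to_json_each_row(rows), end="")
```

Writing the two strings to `readings.csv` and `readings.json` gives files in the shape the slide shows.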
  14. Loading through clickhouse-client
          # Load CSV
          cat readings.csv | clickhouse-client --query "INSERT INTO meetup.readings FORMAT CSVWithNames"
          # Load JSON
          cat readings.json | clickhouse-client --query "INSERT INTO meetup.readings FORMAT JSONEachRow"
  15. Loading through table functions
          -- Load from a file function.
          sudo mkdir -p /var/lib/clickhouse/user_files
          sudo chmod 777 /var/lib/clickhouse/user_files
          sudo cp readings.json /var/lib/clickhouse/user_files
          clickhouse-client
          pika :) INSERT INTO meetup.readings
                  SELECT * FROM file('readings.json', 'JSONEachRow',
                      'sensor_id Int32, time DateTime, date Date, temperature Decimal(5,2)')
  16. NEW: loading data from S3 (20.8+)
          -- Insert from S3
          INSERT INTO meetup.readings
          SELECT * FROM s3('', 'CSVWithNames',
              'sensor_id Int32, time DateTime, date Date, temperature Decimal(5,2)')
  17. Third Step: Go crazy with your own queries
  18. But what about client libraries??
      Languages with popular drivers: C++, Golang, Java, ODBC, Python, PHP, and Javascript.
      Use a library listed on *or* roll your own using the ClickHouse HTTP interface
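For the "roll your own" route, the HTTP interface accepts a plain POST. The sketch below uses only Python's standard library and assumes a server reachable at `http://localhost:8123/`, the HTTP port published in the Docker slide; `build_request` and `execute` are hypothetical helper names, and error handling and authentication are left out.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8123/"  # assumption: local server on the default HTTP port


def build_request(sql, body=None, base_url=BASE_URL):
    """Prepare a POST for the ClickHouse HTTP interface.

    Without `body`, the whole statement travels in the POST body.
    With `body`, the statement goes in the `query` URL parameter and
    the body carries the data (for example JSONEachRow lines).
    """
    if body is None:
        return urllib.request.Request(base_url, data=sql.encode("utf-8"), method="POST")
    params = urllib.parse.urlencode({"query": sql}, quote_via=urllib.parse.quote)
    return urllib.request.Request(base_url + "?" + params,
                                  data=body.encode("utf-8"), method="POST")


def execute(sql, body=None):
    """Send the request and return the response text (needs a running server)."""
    with urllib.request.urlopen(build_request(sql, body)) as resp:
        return resp.read().decode("utf-8")


select = build_request("SELECT count() FROM meetup.readings")
insert = build_request(
    "INSERT INTO meetup.readings FORMAT JSONEachRow",
    '{"sensor_id":0,"time":"2019-01-01 00:00:00","date":"2019-01-01","temperature":43.31}\n',
)
print(select.get_method(), select.full_url)
print(insert.get_method(), insert.full_url)
```

Calling `execute("SELECT 1")` against a running server should return the result as text; a dedicated driver adds connection pooling, typed results, and the native protocol on top of this.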
  19. ClickHouse Database self-defense
  20. Database Choices: Row Store, Column Store, “Data Warehouse”
  21. MySQL: Row Store Access. Read row data serially.
      [Diagram: six rows, each laid out as a b c d e f g h i j k l m n o...]
  22. Column Store Access. Read compressed columns in parallel.
      [Diagram: six rows, each laid out as a b c d e f g h i j k l m n o p q r s t u v...]
  23. There is no penalty for wide tables: “pay” only for the columns you read
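A back-of-the-envelope calculation shows why width is nearly free in a column store. The Python sketch below is a deliberately crude model: it ignores compression, indexes, and block sizes, and the table shape (100 columns of 4 bytes, 1 million rows) is hypothetical.

```python
def bytes_scanned(row_count, column_widths, columns_read, layout):
    """Estimate uncompressed bytes a full scan has to read.

    layout="row":    every column of every row is read.
    layout="column": only the columns named in the query are read.
    """
    if layout == "row":
        per_row = sum(column_widths.values())
    elif layout == "column":
        per_row = sum(column_widths[name] for name in columns_read)
    else:
        raise ValueError(f"unknown layout: {layout}")
    return row_count * per_row


# Hypothetical wide table: 100 columns of 4 bytes each, 1 million rows.
widths = {f"c{i}": 4 for i in range(100)}
query_columns = ["c0", "c1"]

print(bytes_scanned(1_000_000, widths, query_columns, "row"))     # 400000000
print(bytes_scanned(1_000_000, widths, query_columns, "column"))  # 8000000
```

In this model, adding more columns to the table changes the row-store number but leaves the column-store number untouched unless the query reads them.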
  24. Compression makes data even smaller
      Data Type                Codec          Compression
      LowCardinality(String)   (none)         LZ4
      UInt32                   DoubleDelta    ZSTD(1)
  25. Optimize compression to reduce I/O!
          CREATE TABLE billy.readings (
              sensor_id Int32 Codec(DoubleDelta, ZSTD(1)),
              time DateTime Codec(DoubleDelta, ZSTD(1)),
              date ALIAS toDate(time),
              temperature Decimal(5,2) Codec(T64, ZSTD(1))
          ) Engine = MergeTree
          PARTITION BY toYYYYMM(time)
          ORDER BY (sensor_id, time);
      ● Codec: DoubleDelta, T64
      ● Compression: ZSTD(1)
      ● Computed value: date ALIAS toDate(time)
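Why a codec in front of the compressor helps can be shown with a toy experiment in Python. This is an illustration of the double-delta idea only: it is not ClickHouse's bit-level format, and `zlib` stands in for ZSTD simply because it ships with the standard library. The function names are invented for the sketch.

```python
import struct
import zlib


def double_delta(values):
    """Encode as: first value, first delta, then differences between deltas."""
    if len(values) < 2:
        return list(values)
    deltas = [b - a for a, b in zip(values, values[1:])]
    delta_of_deltas = [b - a for a, b in zip(deltas, deltas[1:])]
    return [values[0], deltas[0]] + delta_of_deltas


def undo_double_delta(encoded):
    """Invert double_delta by accumulating deltas, then values."""
    if len(encoded) < 2:
        return list(encoded)
    values = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dd in encoded[2:]:
        delta += dd
        values.append(values[-1] + delta)
    return values


def compressed_size(ints):
    """Pack as signed 64-bit integers and return the zlib-compressed size."""
    raw = struct.pack(f"<{len(ints)}q", *ints)
    return len(zlib.compress(raw))


# One reading per minute starting at 2019-01-01 00:00:00 UTC.
timestamps = [1546300800 + 60 * i for i in range(10_000)]
encoded = double_delta(timestamps)

print("plain:       ", compressed_size(timestamps))
print("double delta:", compressed_size(encoded))
```

Evenly spaced timestamps turn into a long run of zeros after the transform, which any general-purpose compressor handles very well; irregular data benefits less.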
  26. Query system.columns to see compression
      [Chart values: 3.22%, 0.13%, 3.34%, 0.14%, 43.8%, 29.3%]
  27. Materialized views restructure/reduce data
      ● readings (table): ingests all sensor readings. Size: 544GB, Rows: 500B
      ● readings_daily_mv (materialized view): acts as a trigger on ingest
      ● readings_daily (AggregatingMergeTree): daily max/min by sensor. Size: 1.7GB, Rows: 347M
          CREATE MATERIALIZED VIEW billy.readings_daily_mv
          TO billy.readings_daily
          AS SELECT sensor_id, date,
              minState(temperature) as temp_min,
              maxState(temperature) as temp_max
          FROM billy.readings
          GROUP BY sensor_id, date;
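The reason the daily table stays small is that partial aggregates can be merged without going back to the raw rows. The Python sketch below models that idea with plain dictionaries: `rollup` plays the role of the materialized view running on one inserted batch, and `merge` plays the role of combining stored states later. Both names and the sample readings are hypothetical, and this is a conceptual model rather than how AggregatingMergeTree stores its states.

```python
def rollup(readings):
    """Reduce (sensor_id, date, temperature) rows to one (min, max) state per key."""
    state = {}
    for sensor_id, date, temp in readings:
        key = (sensor_id, date)
        lo, hi = state.get(key, (temp, temp))
        state[key] = (min(lo, temp), max(hi, temp))
    return state


def merge(a, b):
    """Combine two sets of partial states into one, key by key."""
    merged = dict(a)
    for key, (lo, hi) in b.items():
        if key in merged:
            mlo, mhi = merged[key]
            merged[key] = (min(mlo, lo), max(mhi, hi))
        else:
            merged[key] = (lo, hi)
    return merged


# Two insert batches with made-up readings.
batch1 = [(55, "2019-01-01", 43.31), (55, "2019-01-01", 43.35), (56, "2019-01-01", 50.10)]
batch2 = [(55, "2019-01-01", 41.02), (55, "2019-01-02", 44.00)]

daily = merge(rollup(batch1), rollup(batch2))
for key in sorted(daily):
    print(key, daily[key])
```

Merging the per-batch states gives the same answer as aggregating all rows at once, which is what lets the rollup be maintained incrementally as data arrives.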
  28. Materialized views function like indexes!
          SELECT max(temp_max) FROM billy.readings_daily WHERE sensor_id = 55
          ┌─max(temp_max)─┐
          │         75.91 │
          └───────────────┘
          1 rows in set. Elapsed: 0.011 sec. Processed 180.22 thousand rows, 1.44 MB (15.86 million rows/s., 126.84 MB/s.)
  29. ClickHouse performance tuning is different...
      The bad news…
      ● No query optimizer
      ● No EXPLAIN PLAN
      ● May need to move [a lot of] data for performance
      The good news…
      ● No query optimizer!
      ● System log is great
      ● System tables are too
      ● Performance drivers are simple: I/O and CPU
      ● Constantly improving
  30. Your friend: the ClickHouse query log
      Return messages to clickhouse-client:
          clickhouse-client --send_logs_level=trace
      View all log messages on server:
          sudo less /var/log/clickhouse-server/clickhouse-server.log
  31. Strengths and weaknesses of ClickHouse
      OLTP (“Online Transaction Processing”)
      (-) Lots of “small” lookups
      (-) Lots of updates
      (-) High concurrency
      (-) Consistency critical
      OLAP (“Online Analytical Processing”)
      (+) Very long tables
      (+) Very wide tables
      (+) Open ended questions
      (+) Lots of aggregates
      ClickHouse >> MySQL for analytic queries
