Se ha denunciado esta presentación.
Se está descargando tu SlideShare. ×

High Performance, High Reliability Data Loading on ClickHouse

Cargando en…3

Eche un vistazo a continuación

1 de 37 Anuncio

Más Contenido Relacionado

Presentaciones para usted (20)

Similares a High Performance, High Reliability Data Loading on ClickHouse (20)


Más de Altinity Ltd (20)

Más reciente (20)


High Performance, High Reliability Data Loading on ClickHouse

  1. 1. High Reliability Data Loading on ClickHouse Altinity Engineering Webinar 1
  2. 2. Presenter Bio and Altinity Introduction The #1 enterprise ClickHouse provider. Now offering Altinity.Cloud Major committer and community sponsor for ClickHouse in US/EU Robert Hodges - Altinity CEO 30+ years on DBMS plus virtualization and security. ClickHouse is DBMS #20 Alexander Zaitsev - Altinity CTO Altinity founder with decades of expertise on petabyte-scale analytic systems
  3. 3. Ingestion Pipeline ClickHouseINSERT Event Stream
  4. 4. Ingestion Pipeline ClickHouse HDD INSERT OS Page Cache Event Stream
  5. 5. Ingestion Pipeline ClickHouse HDD INSERT OS Page Cache Event Stream Table MV1 MV2 MV3
  6. 6. Ingestion Pipeline Shard1INSERT Event Stream Table MV1 MV2 MV3 Shard2 Shard3
  7. 7. Ingestion Pipeline Shard1INSERT Event Stream Table MV1 MV2 MV3 Shard2 Shard3 Replica Replica Replica
  8. 8. Topics to discuss ● Performance ● Reliability ● Deduplication
  9. 9. ClickHouseINSERT Event Stream Under the hood: • Data is parsed by rows and converted to in-memory columns • Columns are split to partitions and parts (could be multiple) • Columns are sorted, PK is calculated • Columns are compressed and written to the disk into the temporary dir(s) • Single column may require 2-4 files in a part • Once the part is ready – it is renamed to the real one
  10. 10. General Insert Performance Considerations Single INSERT has a lot of overhead, so: ● User bigger blocks ● Do not insert too often ● Do not use too aggressive compression ● Pick partitioning wisely ● And: ○ INSERT close to ZooKeeper for replicated tables ○ Asynchronous is always faster (but less reliable) Altinity Ltd.
  11. 11. Extra techniques to reduce overhead ● Buffer tables – collect data in memory, and flush once ready ● Polymorphic MergeTree parts – store small inserts more efficiently
  12. 12. Buffer tables Engine=Buffer Engine=MergeTree • Memory buffer • Flush on size/time treshold • SELECT FROM buffer_table • Lost on hard restart Buffer(database, table, num_layers, min_time, max_time, min_rows, max_rows, min_bytes, max_bytes)
  13. 13. Compact parts for MergeTree (20.3+) Part .idx .bin/.mrk2 Part .idx .bin/.mrk2 Part .idx data.bin/mrk3 Part .idx data.bin/mrk3 “wide” parts (default) “compact” parts (new)
  14. 14. Compact parts compared to wide parts “wide” parts (default) “compact” parts (new) /var/lib/clickhouse/data/datasets/ontime_refc/2020_ 347_347_0/ . .. checksums.txt columns.txt count.txt data.bin data.mrk3 minmax_Year.idx partition.dat primary.idx 8 files for 109 columns! /var/lib/clickhouse/data/datasets/ontime_ref/2020 _547_552_1/ . .. ActualElapsedTime.bin ActualElapsedTime.mrk2 AirlineID.bin AirlineID.mrk2 AirTime.bin AirTime.mrk2 ArrDel15.bin ArrDel15.mrk2 ArrDelay.bin 224 files for 109 columns!
  15. 15. Overview of compact part design ● Single file but columnar inside ● Reduces file system overhead ● Useful for small frequent inserts ● Threshold is controlled by merge_tree_settings: ○ min_bytes_for_wide_part ○ min_rows_for_wide_part ● min_bytes_for_wide_part = 10485760 by default since 20.8
  16. 16. ●In-memory parts with write-ahead-log in 20.6 In-memory parts with write-ahead-log 20.6 .idx .bin/.mrk2 merge merge INSERTS ○ min_bytes_for_wide_part ○ min_rows_for_wide_part ○ min_bytes_for_compact_part ○ min_rows_for_compact_part ○ in_memory_parts_enable_wal memory parts compact parts wide part .idx data.bin/mrk3 .idx data.bin/mrk3 .idx data.bin/mrk3
  17. 17. INSERT atomicity User expectations: ● INSERT inserts all the data completely or aborts ● INSERT inserts into all dependent objects or aborts ● INSERT inserts into all distributed and replicated object or aborts No transactions in ClickHouse
  18. 18. How to Make INSERT atomic ClickHouseINSERT How it works: • Data is parsed and written in blocks (parts) • Blocks are written when ready • Partial insert possible in case of failurePart Part Part Need to ensure there is a single part on insert!
  19. 19. How to Make INSERT atomic Important settings: • max_insert_block_size = 1M rows – split insert into chunks • max_block_size = 65K rows – split SELECT into chunks • min_insert_block_size_rows =1M rows – merge input into bigger chunks • min_insert_block_size_bytes = 256MB – merge input into bigger chunks • input_format_parallel_parsing = 1 – splits text input into chunks • max_insert_threads = 1 – parallel INSERT/SELECT Temp Table INSERT INSERT SELECT Table INSERT max_insert_block_size, default 1M rows
  20. 20. Durability Settings (20.10) When to fsync? — min_rows_to_fsync_after_merge — min_compressed_bytes_to_fsync_after_merge — min_compressed_bytes_to_fsync_after_fetch — fsync_after_insert — fsync_part_directory — write_ahead_log_bytes_to_fsync — write_ahead_log_interval_ms_to_fsync — in_memory_parts_insert_sync ClickHouse HDD OS Page Cache
  21. 21. Further reading Files Are Hard (2015): PostgreSQL "Fsyncgate" (2018):
  22. 22. Materialized Views Table MV1 MV2 MV3 How it works: • MVs are executed sequentially in alphabetical order • If any MV fails, source table and unfinished MVs are aborted INSERT Workarounds (partial): • parallel_view_processing = 1 • Do not use cascades MV transactions are coming in 2021!
  23. 23. Distributed Table INSERT Shard1INSERT Shard2 Shard3 How it works: • Split to block per shard and store locally in 'distribution queue' • Ack once data is in the queue • Asynchronously send to shards • Possible loss on hard reset Workarounds: • Insert locally • insert_distributed_sync • insert_distributed_timeout
  24. 24. Replicated table INSERT Shard1INSERT Replica How it works: • Part is written locally and registered in ZooKeeper • Ack client • Data is fetched asynchronously by replicas • Possible loss on hard reset Workarounds: • insert_quorum
  25. 25. Kafka ingest – even more ways to fail Kafka Engine Merge Tree MVTOPIC MV1 MV2 Replica <yandex> <kafka> <!-- enable EOS semantics --> <isolation_level> read_committed </isolation_level> </kafka> </yandex> To make things more complicated: • multiple topics • multiple partitions per topic, partition re-balance • multiple consumers in ClickHouse (num_consumers)
  26. 26. Summary of Best Practices ● Do not use buffer tables (use compact/memory parts instead) ● Make sure single INSERT generates single part if possible ● Local inserts or insert_distributed_sync ● insert_quorum ● Do not use cascading MVs ● parallel_view_processing ● durability settings (if you understand them)
  27. 27. Deduplication Why duplicates are possible? ● Retry failed INSERTs ● Collisions in message bus (e.g. Kafka re-balances) ● User errors No unique keys and constraints in ClickHouse
  28. 28. Block Level Deduplication Scenario – retry INSERT after failure ● ClickHouse keeps history of block hashes per table (crc64 or similar) ● If hash matches INSERT is ignored Details: ● Only Replicated tables (non-replicated in Q1/2021) ● replicated_deduplication_window (100), replicated_deduplication_window_seconds (604800) ● deduplicate_blocks_in_dependent_materialized_views -- fire MVs if source table is deduped
  29. 29. ReplacingMergeTree Eventually removes duplicates: ● Replaces values with equal PRIMARY KEY value ● Replace during merge ● OPTIMIZE FINAL ● SELECT FINAL ○ Slow for aggregation (performance has been improved in 20.5 and 20.11) ○ Good for key_column in (… ) queries
  30. 30. Logical Deduplication Scenario: ● There is a natural unique id in the table ● There is a unique hash Temporary Table INSERT INSERT INTO Table SELECT * FROM … WHERE id NOT IN (SELECT id FROM WHERE <dedup_window> ) Engine=Null Table INSERT MaterializedView INSERT SELECT
  31. 31. Bullet-proof de-duplication in Kafka
  32. 32. How to Find Duplicates in Big Table SELECT min(ts), max(ts), count(*) FROM ( SELECT ts FROM Table WHERE ts BETWEEN time_start and time_end GROUP BY ts, hash HAVING count(*) > 1 ) AS Z Reliable but may be slow lots of RAM Scenario: ● Table.hash – should be unique for a table
  33. 33. How to Find Duplicates in Big Table SELECT ts, hash, neighbor(hash, -1) AS p_hash FROM Table WHERE BETWEEN time_start and time_end AND hash = p_hash ORDER BY ts ASC, hash ASC Scenario: ● Table.hash – should be unique for a table Fast, but may be inaccurate • neighbor – works inside blocks • max_block_size • group_by_two_level_threshold=0
  34. 34. OPTIMIZE DEDUPLICATE ● Full re-sort, may take a lot of time ● Deduplicates identical rows (all columns considered) ● Deduplicate on a subset of columns – coming in 20.13: ○ OPTIMIZE TABLE table DEDUPLICATE BY col1,col2,col3; ○ OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT (colX, colY) ○ OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by- regex') EXCEPT (colX, colY);
  35. 35. Final words ● ClickHouse is very fast and reliable ● Proper schema design is important for performance and reliability ● Default settings are tuned for performance, but not for reliability ● Atomicity requires careful attention ● Important features in 2021 roadmap: ○ Block de-duplication for non-replicated table ○ 'Transactional' materialized views updates ○ 'Transactional' multi-inserts
  36. 36. ● ○ Everything Clickhouse ● ○ Piles of community videos ● ○ Lots of articles about ClickHouse usage ● ○ Webinars on all aspects of ClickHouse ● ○ Check out tests for examples of detailed usage More information and references 37
  37. 37. Thank you! Contacts: Visit us at: