This document summarizes a presentation about Druid, an open-source distributed data store designed to handle real-time queries on large datasets. It discusses what Druid is, its architecture, and how it compares to other technologies. Specifically, it covers how Druid's TopN query works and is much faster than a GROUP BY query, though results can be unstable with high replication. It also provides examples of queries and performance comparisons between Druid, Elasticsearch, and Kudu+Impala.
1. Druid at naver.com - Part 1
허정수 (jason.heo.sde@gmail.com)
2018-10-16
Druid 8th meetup
2. About this talk
• Strata Data Conference 2018 London, 2018-05-22
• Slides: https://bit.ly/2Qz6mVJ
• Video: https://bit.ly/2OE0r4B
3. • Part 1 – 8th meetup
• What is & Why Druid
• The Architecture of our service
• TopN Query
• Part 2 – 9th meetup
• Spark Druid Connector
• How to fix TopN’s unstable results
• Ingestion performance improvements
• etc
Agenda
4. Platforms we've tested so far
• Parquet, ORC, Carbon Data, Elasticsearch, ClickHouse, Kudu, Druid
• SparkSQL, Hive, Impala, Drill, Presto, Kylin, Phoenix
(Diagram axes: Query Engine, Storage Format)
5. • Introduction to the Naver content statistics service and lessons from implementing it (deview, 2016)
• https://www.slideshare.net/deview/215-67608986
• Elasticsearch at naver.com
• Part 1: https://bit.ly/2OBcj7t
• Part 2: https://bit.ly/2PjWhfk
• Developing a big-data multidimensional analysis system using Kudu
• https://d2.naver.com/helloworld/9099561
• A big-data multidimensional analysis system, Kylin
• https://d2.naver.com/helloworld/9099561
• Analyzing big data quickly and easily with Druid (deview, 2018)
• https://www.slideshare.net/deview/215-druid-119186559
6. • What is Druid?
• Our Requirements
• Why Druid?
• Experimental Results
What is & Why Druid
7. • Column-oriented distributed datastore
• Real-time streaming ingestion
• Scalable to petabytes of data
• Approximate algorithms (HyperLogLog, theta sketch)
From Hortonworks: https://www.slideshare.net/HadoopSummit/scalable-realtime-analytics-using-druid
What is Druid?
8. From my point of view
• Druid is a cumbersome version of Elasticsearch (w/o search feature)
• Similar points
• Secondary Index
• DSLs for query
• Flow of Query Processing
• Terms Aggregation ↔ TopN Query, Coordinator ↔ Broker, Data Node ↔ Historical
• Different points
• more complicated to operate
• better with much more data
• better for Ultra High Cardinality
• less GC overhead
• better for Spark Connectivity (for Full Scan)
What is Druid?
10. Diagram: Real-time Node, Historical, Broker
{
"queryType": "groupBy",
"dataSource": "sample_data",
"dimensions": ["country", "device"],
"filter": {...},
"aggregations": [...],
"limitSpec": {...}
}
{
"queryType": "topN",
"dataSource": "sample_data",
"dimension": "sample_dim",
"metric": "...",
"filter": {...},
"aggregations": [...],
"threshold": 5
}
SELECT ... FROM dataSource
What is Druid? - Queries
• SQL queries can be converted to Druid DSL
• No JOIN
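As a rough illustration of the two query shapes above, the sketch below builds a topN and a groupBy payload as plain Python dictionaries and serializes them to JSON. The helper names, the `total_duration` metric, the `duration` column, and the interval are hypothetical values chosen for the example; only the top-level field names follow Druid's native query format.

```python
import json


def build_topn_query(data_source, dimension, metric, threshold, intervals):
    """Build a Druid native topN query as a plain dict (illustrative only)."""
    return {
        "queryType": "topN",
        "dataSource": data_source,
        "intervals": intervals,
        "granularity": "all",
        "dimension": dimension,   # topN ranks a single dimension
        "metric": metric,         # aggregated value used for ranking
        "threshold": threshold,   # the "N" in topN
        "aggregations": [
            # "duration" is a hypothetical source column
            {"type": "longSum", "name": metric, "fieldName": "duration"},
        ],
    }


def build_groupby_query(data_source, dimensions, metric, limit, intervals):
    """Build a Druid native groupBy query with an ordering limitSpec."""
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "intervals": intervals,
        "granularity": "all",
        "dimensions": dimensions,  # groupBy accepts several dimensions
        "aggregations": [
            {"type": "longSum", "name": metric, "fieldName": "duration"},
        ],
        "limitSpec": {
            "type": "default",
            "limit": limit,
            "columns": [{"dimension": metric, "direction": "descending"}],
        },
    }


INTERVALS = ["2018-10-01/2018-10-16"]  # hypothetical time range
topn = build_topn_query("sample_data", "country", "total_duration", 5, INTERVALS)
groupby = build_groupby_query(
    "sample_data", ["country", "device"], "total_duration", 5, INTERVALS
)
topn_payload = json.dumps(topn, indent=2)
groupby_payload = json.dumps(groupby, indent=2)
```

A payload like `topn_payload` would typically be sent as the body of an HTTP POST to the Broker's `/druid/v2` endpoint.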
11. Why Druid? – Our requirements
1. Random Access (OLTP)
SELECT COUNT(*)
FROM logs
WHERE url = ?;
2. Most Viewed
SELECT url, COUNT(*)
FROM logs
GROUP BY url
ORDER BY COUNT(*) DESC
LIMIT 10;
3. Full Aggregation
SELECT visitor, COUNT(*)
FROM logs
GROUP BY visitor;
4. JOIN
SELECT ...
FROM logs INNER JOIN users
GROUP BY ...
HAVING ...
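To make the four query categories concrete, here is a small sketch that runs one query of each kind against an in-memory SQLite database. The table layout and the sample rows are invented for the example, and SQLite only stands in for a SQL engine here; it says nothing about how Druid executes these queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE logs (url TEXT, visitor TEXT);
    CREATE TABLE users (visitor TEXT, country TEXT);
    INSERT INTO logs VALUES
        ('/a', 'u1'), ('/a', 'u2'), ('/a', 'u1'),
        ('/b', 'u1'), ('/b', 'u2'),
        ('/c', 'u3');
    INSERT INTO users VALUES ('u1', 'Korea'), ('u2', 'UK'), ('u3', 'Korea');
    """
)

# 1. Random Access: count the rows for a single key
random_access = conn.execute(
    "SELECT COUNT(*) FROM logs WHERE url = ?", ("/a",)
).fetchone()[0]

# 2. Most Viewed: rank groups and keep only the top few
most_viewed = conn.execute(
    "SELECT url, COUNT(*) AS views FROM logs "
    "GROUP BY url ORDER BY views DESC LIMIT 2"
).fetchall()

# 3. Full Aggregation: one output row per group, no limit
full_aggregation = conn.execute(
    "SELECT visitor, COUNT(*) FROM logs GROUP BY visitor ORDER BY visitor"
).fetchall()

# 4. JOIN: enrich logs with user attributes, then aggregate and filter groups
joined = conn.execute(
    "SELECT u.country, COUNT(*) AS views "
    "FROM logs l INNER JOIN users u ON l.visitor = u.visitor "
    "GROUP BY u.country HAVING COUNT(*) >= 3"
).fetchall()
```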
12. Why Druid?
Perfect solution for OLTP and OLAP
For OLTP
• Supports Bitmap Index
• Fast Random Access
For OLAP
• Supports TopN Query
• 100x faster than GroupBy query
• Supports Complex Queries
• JOIN, HAVING, etc
• with our Spark Druid Connector
1. Random Access ★★★★☆
2. Most Viewed ★★★★★
3. Full Aggregation ★★★★☆
4. JOIN ★★★★☆
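The idea behind a bitmap index, which the slide credits for fast random access, can be sketched in a few lines. This is a toy version that uses Python integers as uncompressed bitsets; the column names and rows are made up, and Druid itself stores compressed bitmaps, so treat this as an illustration of the principle rather than of Druid's implementation.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose set bits mark matching row ids."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return index


def count_rows(bitmap):
    """Count the rows selected by a bitmap (number of set bits)."""
    return bin(bitmap).count("1")


# Hypothetical columns of a five-row table
urls = ["/a", "/b", "/a", "/c", "/a"]
devices = ["pc", "mobile", "mobile", "pc", "pc"]

url_index = build_bitmap_index(urls)
device_index = build_bitmap_index(devices)

# WHERE url = '/a'                   -> look up one bitmap
rows_for_a = count_rows(url_index["/a"])

# WHERE url = '/a' AND device = 'pc' -> bitwise AND of two bitmaps
rows_for_a_on_pc = count_rows(url_index["/a"] & device_index["pc"])
```

Because a filter reduces to bitmap lookups and bitwise operations, a count for one value does not require scanning the column.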
13. Comparison – Elasticsearch
Pros
• Fast Random Access
• Terms Aggregation
• TopN Query
• Easy to manage
Cons
• Slow full scan with es-hadoop
• Low performance for multi-field terms aggregation (esp. high cardinality)
• GC Overhead
1. Random Access ★★★★★
2. Most Viewed ★★★☆☆
3. Full Aggregation ☆☆☆☆☆
4. JOIN ☆☆☆☆☆
14. Comparison – Kudu + Impala
Pros
• Fast Random Access via Primary Key
• Fast OLAP with Impala
Cons
• No Secondary Index
• No TopN Query
1. Random Access ★★★★★ (PK), ★☆☆☆☆ (non-PK)
2. Most Viewed ☆☆☆☆☆
3. Full Aggregation ★★★★★
4. JOIN ★★★★★
16. Experimental Results – Notes
Random Access
• ES: Lucene Index
• Kudu+Impala: Primary Key
• Druid: Bitmap Index
Most Viewed
• ES: Terms Aggregation
• Kudu+Impala: Group By
• Druid: TopN
• Split-Apply-Combine for Multi Fields
Data Sets
• 210 million rows
• same parallelism
• same number of shards/partitions/segments
17. The Architecture of our service
(Architecture diagram; labels on the slide:)
• Log pipeline: Logs, Kafka, transform logs, remove duplicated logs, Parquet
• Real-time Ingestion: Kafka, Kafka Indexing Service
• Batch Ingestion: Parquet, run daily batch job
• Druid: Coordinator, Overlord, Middle Manager, Peon, Historical, Broker, Segments File
• Spark: SparkSQL, Spark Thrift Server, Spark Executor
• Query side: Zeppelin, Plywood, Druid DSL, API Server
19. TopN Query – We heavily use TopN query
TopN Query flow (diagram: User, Broker, and three Historical nodes, each with a Segment Cache)
• Each Historical node returns its local top N results
• The Broker merges the results and produces the final records
• The client gets the top N results
20. TopN Query - Example
Top 3 country ORDER BY SUM(duration)
Top 3 of Historical a
country SUM(duration)
Korea 114
UK 47
USA 21
Top 3 of Historical b
country SUM(duration)
UK 67
Korea 24
USA 3
Top 3 of Historical c
country SUM(duration)
Korea 87
UK 57
China 33
Broker (merged)
country SUM(duration)
Korea 225
UK 171
China 33
USA 24
Top 3 Result
country SUM(duration)
Korea 225
UK 171
China 33
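The merge step in this example can be reproduced with a short sketch. It uses the numbers from the slide; the function name and the use of `collections.Counter` are choices made for this illustration and are not taken from Druid's Broker code.

```python
from collections import Counter


def merge_top_n(partial_results, n):
    """Sum per-country values across nodes, then keep the n largest totals."""
    merged = Counter()
    for partial in partial_results:
        merged.update(partial)  # Counter.update adds values for existing keys
    return merged.most_common(n)


# Local top 3 lists as shown on the slide
historical_a = {"Korea": 114, "UK": 47, "USA": 21}
historical_b = {"UK": 67, "Korea": 24, "USA": 3}
historical_c = {"Korea": 87, "UK": 57, "China": 33}

top3 = merge_top_n([historical_a, historical_b, historical_c], 3)
```

Here `top3` contains Korea 225, UK 171, and China 33, matching the "Top 3 Result" table; USA (24) is merged but falls outside the top 3.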
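21. TopN – is an approximate approach
Historical a (all local rows)
country SUM(duration)
Korea 114
UK 47
USA 21
China 17
Historical b (all local rows)
country SUM(duration)
UK 67
Korea 24
USA 3
China 1
Historical c (all local rows)
country SUM(duration)
Korea 87
UK 57
China 33
USA 22
Top 3 Result
country SUM(duration)
Korea 225
UK 171
China 33
Missing! China's 17 and 1 rank fourth on their nodes, so they are never sent to the Broker; the exact total for China is 51, not 33.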
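The gap between the approximate and the exact answer can be checked with a small simulation over the four-row tables from this slide. The function names and the `local_k` parameter are hypothetical. The slide truncates each node to its top 3 for illustration; as far as I know, Druid's documentation describes a larger default per-node cutoff, max(1000, threshold), so this kind of error mainly shows up on dimensions with very many distinct values.

```python
def local_top_k(rows, k):
    """Keep only the k largest entries of one node's local aggregate."""
    ranked = sorted(rows.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def merge_and_rank(partials, n):
    """Sum values per key across partial results and return the top n."""
    totals = {}
    for partial in partials:
        for key, value in partial.items():
            totals[key] = totals.get(key, 0) + value
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Full local aggregates from the slide (four rows per node)
nodes = [
    {"Korea": 114, "UK": 47, "USA": 21, "China": 17},
    {"UK": 67, "Korea": 24, "USA": 3, "China": 1},
    {"Korea": 87, "UK": 57, "China": 33, "USA": 22},
]

# Approximate: each node sends only its local top 3 before the merge
approximate = merge_and_rank([local_top_k(rows, 3) for rows in nodes], 3)

# Exact: every row from every node is merged
exact = merge_and_rank(nodes, 3)

# How much of China's total the approximate result leaves out
china_error = dict(exact)["China"] - dict(approximate)["China"]
```

In this data set the ranking is the same either way, but China's value is under-reported by 18 (33 instead of 51).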