Druid at naver.com - part 1

허정수 (jason.heo.sde@gmail.com)
2018-10-16
Druid 8th meetup

About this talk
• Strata Data Conference 2018 London, 2018-05-22
• Slides: https://bit.ly/2Qz6mVJ
• Video: https://bit.ly/2OE0r4B

Agenda
• Part 1 – 8th meetup
  • What is & Why Druid
  • The Architecture of our service
  • TopN Query
• Part 2 – 9th meetup
  • Spark Druid Connector
  • How to fix TopN’s unstable results
  • Ingestion performance improvements
  • etc

Platforms we've tested so far
• Storage Format: Parquet, ORC, Carbon Data, Elasticsearch, ClickHouse, Kudu, Druid
• Query Engine: SparkSQL, Hive, Impala, Drill, Presto, Kylin, Phoenix

• Introduction to the Naver content statistics service and lessons from implementing it (deview, 2016)
  • https://www.slideshare.net/deview/215-67608986
• Elasticsearch at naver.com
  • Part 1: https://bit.ly/2OBcj7t
  • Part 2: https://bit.ly/2PjWhfk
• Developing a big-data multidimensional analysis system with Kudu
  • https://d2.naver.com/helloworld/9099561
• Kylin, a big-data multidimensional analysis system
  • https://d2.naver.com/helloworld/9099561
• Easy and fast big-data analysis with Druid (deview, 2018)
  • https://www.slideshare.net/deview/215-druid-119186559

What is & Why Druid
• What is Druid?
• Our Requirements
• Why Druid?
• Experimental Results

What is Druid?
• Column-oriented distributed datastore
• Real-time streaming ingestion
• Scalable to petabytes of data
• Approximate algorithms (HyperLogLog, theta sketch)
Source: Hortonworks, https://www.slideshare.net/HadoopSummit/scalable-realtime-analytics-using-druid

What is Druid? - From my point of view
• Druid is a cumbersome version of Elasticsearch (without the search feature)
• Similar points
  • Secondary Index
  • DSLs for query
  • Flow of Query Processing
  • Elasticsearch ↔ Druid counterparts: Terms Aggregation ↔ TopN Query, Coordinator ↔ Broker, Data Node ↔ Historical
• Different points
  • more complicated to operate
  • better with much larger data volumes
  • better for Ultra High Cardinality
  • less GC overhead
  • better for Spark Connectivity (for Full Scan)

What is Druid? - Architecture
[Figure: Druid architecture diagram]
• Query Service: Clients send Druid DSL to the Broker; Historical and Real-time Nodes serve segments for queries
• Index Service: Overlord, Middle Manager (ingestion from Kafka)
• Segment management: Coordinator
• MySQL: metadata
• Zookeeper: cluster mgmt.
• Deep Storage (HDFS, S3): stores Druid segments for durability; Historical nodes download segments from it

What is Druid? - Queries
• SQL can be converted to Druid DSL
• No JOIN
Query path: Broker → Historical, Real-time Node
groupBy query
{
  "queryType": "groupBy",
  "dataSource": "sample_data",
  "dimensions": ["country", "device"],
  "filter": {},
  "aggregations": [...],
  "limitSpec": {...}
}
topN query
{
  "queryType": "topN",
  "dataSource": "sample_data",
  "dimension": "sample_dim",
  "filter": {...},
  "aggregations": [...],
  "threshold": 5
}
SQL
SELECT ... FROM dataSource
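
The topN query on this slide is abbreviated. As a rough sketch, a fuller native topN body can be assembled in Python as below. The field names (`queryType`, `dimension`, `metric`, `threshold`, `granularity`, `intervals`, `aggregations`) are standard Druid query fields; the data source, dimension, metric, source field, and interval values are placeholders chosen for illustration, not values from the talk.

```python
import json


def build_topn_query(data_source, dimension, metric, field_name, threshold, interval):
    """Assemble a native Druid topN query body as a plain dict."""
    return {
        "queryType": "topN",
        "dataSource": data_source,
        "dimension": dimension,
        "metric": metric,            # the aggregated value used for ranking
        "threshold": threshold,      # the N in topN
        "granularity": "all",        # one result bucket for the whole interval
        "intervals": [interval],
        "aggregations": [
            {"type": "longSum", "name": metric, "fieldName": field_name},
        ],
    }


# All argument values below are hypothetical placeholders.
query = build_topn_query(
    data_source="sample_data",
    dimension="country",
    metric="total_duration",
    field_name="duration",
    threshold=5,
    interval="2018-10-01/2018-10-16",
)
payload = json.dumps(query, indent=2)
print(payload)
```

A body like this is normally sent as JSON in an HTTP POST to the Broker's `/druid/v2` endpoint; the host and port depend on the cluster.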

Why Druid? – Our requirements
1. Random Access (OLTP)
SELECT COUNT(*)
FROM logs
WHERE url = ?;
2. Most Viewed
SELECT url, COUNT(*)
FROM logs
GROUP BY url
ORDER BY COUNT(*) DESC
LIMIT 10;
3. Full Aggregation
SELECT visitor, COUNT(*)
FROM logs
GROUP BY visitor;
4. JOIN
SELECT ...
FROM logs INNER JOIN users
GROUP BY ...
HAVING ...

Why Druid?
Perfect solution for OLTP and OLAP
For OLTP
• Supports Bitmap Index
• Fast Random Access
For OLAP
• Supports TopN Query
  • 100x faster than GroupBy query
• Supports Complex Queries
  • JOIN, HAVING, etc
  • with our Spark Druid Connector
Ratings
1. Random Access ★★★★☆
2. Most Viewed ★★★★★
3. Full Aggregation ★★★★☆
4. JOIN ★★★★☆
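
To illustrate why a bitmap index makes random access fast, here is a minimal sketch. Druid stores compressed bitmaps per dimension value; this toy version uses plain Python integers as bitsets instead, and the sample rows and column names are made up for the example.

```python
from collections import defaultdict


def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int used as a bitset of row ids."""
    index = defaultdict(int)
    for row_id, row in enumerate(rows):
        index[row[column]] |= 1 << row_id
    return index


def row_ids(bitmap):
    """Expand a bitset into the sorted list of row ids it contains."""
    ids = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            ids.append(row_id)
        bitmap >>= 1
        row_id += 1
    return ids


# Hypothetical log rows.
rows = [
    {"url": "/home", "device": "mobile"},
    {"url": "/news", "device": "pc"},
    {"url": "/home", "device": "pc"},
    {"url": "/home", "device": "mobile"},
]
url_index = build_bitmap_index(rows, "url")
device_index = build_bitmap_index(rows, "device")

# WHERE url = '/home' AND device = 'mobile' becomes a bitwise AND of two bitmaps.
matches = url_index["/home"] & device_index["mobile"]
match_count = bin(matches).count("1")  # COUNT(*) without touching the rows
print(row_ids(matches), match_count)
```

The filter is answered from the index alone, which is the property the Random Access requirement relies on.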

Comparison – Elasticsearch
Pros
• Fast Random Access
• Terms Aggregation
• TopN Query
• Easy to manage
Cons
• Slow full scan with es-hadoop
• Low Performance for multi-field terms aggregation (esp. High Cardinality)
• GC Overhead
Ratings
1. Random Access ★★★★★
2. Most Viewed ★★★☆☆
3. Full Aggregation ☆☆☆☆☆
4. JOIN ☆☆☆☆☆

Comparison – Kudu + Impala
Pros
• Fast Random Access via Primary Key
• Fast OLAP with Impala
Cons
• No Secondary Index
• No TopN Query
Ratings
1. Random Access ★★★★★ (PK), ★☆☆☆☆ (non-PK)
2. Most Viewed ☆☆☆☆☆
3. Full Aggregation ★★★★★
4. JOIN ★★★★★

Experimental Results – Response Time
Random Access (sec)
Elasticsearch   0.003
Kudu+Impala     0.14
Druid           0.03
Most Viewed (sec)
                1 Field   2 Fields
Elasticsearch   0.25      2.7
Kudu+Impala     0.35      2.9
Druid           0.08      0.78

Experimental Results – Notes
Random Access
• ES: Lucene Index
• Kudu+Impala: Primary Key
• Druid: Bitmap Index
Most Viewed
• ES: Terms Aggregation
• Kudu+Impala: Group By
• Druid: TopN
• Split-Apply-Combine for Multi Fields
Data Sets
• 210 million rows
• same parallelism
• same number of shards/partitions/segments
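
The slides do not spell out how Split-Apply-Combine was applied to the two-field case. One common reading is a nested top N: split on the first field, apply a second top N inside each group, then combine the results. The sketch below shows that reading only; the function names, rows, and field names are assumptions made for illustration.

```python
from collections import Counter


def top_n(rows, dimension, n, where=None):
    """Count rows per value of `dimension`, optionally filtered, and keep the top n."""
    counts = Counter(
        row[dimension]
        for row in rows
        if where is None or all(row[k] == v for k, v in where.items())
    )
    return counts.most_common(n)


def nested_top_n(rows, outer, inner, n_outer, n_inner):
    """Top n_outer values of `outer`, each with its own top n_inner values of `inner`."""
    result = []
    for outer_value, outer_count in top_n(rows, outer, n_outer):              # split
        inner_top = top_n(rows, inner, n_inner, where={outer: outer_value})   # apply
        result.append((outer_value, outer_count, inner_top))                  # combine
    return result


# Hypothetical sample data: (country, device) pairs.
pairs = (
    [("Korea", "mobile")] * 3
    + [("Korea", "pc")]
    + [("UK", "pc")] * 2
    + [("UK", "mobile")]
    + [("USA", "pc")]
)
rows = [{"country": c, "device": d} for c, d in pairs]

nested = nested_top_n(rows, "country", "device", n_outer=2, n_inner=1)
print(nested)
```

Each inner step is a single-dimension query with a filter, which is the shape a one-dimension TopN can serve.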

The Architecture of our service
[Figure: service architecture diagram]
• Log pipeline: Logs, Kafka, transform logs, Kafka, Parquet
• Real-time Ingestion: Kafka Indexing Service
• Batch Ingestion: a daily batch job over Parquet that removes duplicated logs
• Druid: Coordinator, Overlord, Middle Manager, Peon, Broker, Historical
• Query paths: Zeppelin, Plywood, and API Server (Druid DSL); Spark Thrift Server and Spark Executor (SparkSQL over Segments File)

TopN Query
1. How TopN Query works
2. Performance
3. Limitation

TopN Query – We heavily use TopN query
TopN Query flow: User → Broker → Historical nodes (each with a Segment Cache)
1. Each Historical node returns its local top N results
2. The Broker merges the results and makes the final records
3. The client gets the top N results

TopN Query - Example
Top 3 country ORDER BY SUM(duration)
Top 3 of Historical a
country  SUM(duration)
Korea    114
UK       47
USA      21
Top 3 of Historical b
country  SUM(duration)
UK       67
Korea    24
USA      3
Top 3 of Historical c
country  SUM(duration)
Korea    87
UK       57
China    33
Broker (merged)
country  SUM(duration)
Korea    225
UK       171
China    33
USA      24
Top 3 Result
country  SUM(duration)
Korea    225
UK       171
China    33
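
The merge in this example can be sketched in a few lines of Python. This is an illustration of the flow using the numbers from the slide, not Druid's actual implementation; the function names are made up.

```python
from collections import Counter


def local_top_n(node_rows, n):
    """Return the n largest (country, value) pairs held by one Historical node."""
    return sorted(node_rows.items(), key=lambda kv: kv[1], reverse=True)[:n]


def broker_merge(partials, n):
    """Sum the partial results per country and keep the overall top n."""
    merged = Counter()
    for partial in partials:
        for country, value in partial:
            merged[country] += value
    return merged.most_common(n)


# SUM(duration) per country on each Historical node, as on the slide.
historical_a = {"Korea": 114, "UK": 47, "USA": 21}
historical_b = {"UK": 67, "Korea": 24, "USA": 3}
historical_c = {"Korea": 87, "UK": 57, "China": 33}

partials = [local_top_n(h, 3) for h in (historical_a, historical_b, historical_c)]
top3 = broker_merge(partials, 3)
print(top3)
```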

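TopN – is an approximate approach
Historical a
Korea    114
UK       47
USA      21
China    17
Historical b
UK       67
Korea    24
USA      3
China    1
Historical c
Korea    87
UK       57
China    33
USA      22
Top 3 Result
Korea    225
UK       171
China    33
Missing! Rows ranked below each node's local top 3 (China 17, China 1, USA 22) never reach the Broker, so China is reported as 33 instead of the true 51.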
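
The size of the error can be checked by comparing the approximate merge with an exact aggregation over the same numbers. In the sketch below, `local_k` is a hypothetical parameter for how many rows each node returns; Druid has a comparable setting (`minTopNThreshold`), but the code is only an illustration and not Druid's implementation.

```python
from collections import Counter

# Full per-node SUM(duration) values from the slide.
NODES = [
    {"Korea": 114, "UK": 47, "USA": 21, "China": 17},
    {"UK": 67, "Korea": 24, "USA": 3, "China": 1},
    {"Korea": 87, "UK": 57, "China": 33, "USA": 22},
]


def approximate_top_n(nodes, n, local_k):
    """Merge only the local top `local_k` rows of each node, then keep the top n."""
    merged = Counter()
    for node in nodes:
        for country, value in Counter(node).most_common(local_k):
            merged[country] += value
    return merged.most_common(n)


def exact_top_n(nodes, n):
    """Merge every row of every node, then keep the top n."""
    merged = Counter()
    for node in nodes:
        merged.update(node)
    return merged.most_common(n)


approx = approximate_top_n(NODES, n=3, local_k=3)
exact = exact_top_n(NODES, n=3)
wider = approximate_top_n(NODES, n=3, local_k=4)
print(approx)
print(exact)
```

Returning more rows per node than the requested N narrows the gap: with `local_k=4` the approximate result matches the exact one for this data.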

TopN – 100x faster than GroupBy
rank  GroupBy (3 minutes)  TopN (1536 ms)
1     1,948,297            1,948,297
2     1,404,167            1,404,167
3     1,383,538            1,383,538
4     1,141,977            1,141,977
5     1,099,028            1,090,277
6     1,090,277            1,079,242
7     1,051,448            1,051,448
8     996,961              996,961
9     941,284              941,284
10    937,078              937,078
100x Faster! Two differences in the TopN result:
1. rank changed: rank 5 → rank 6
2. value changed: 1,099,028 → 1,079,242

TopN – Limitations
1. TopN only supports one dimension
2. Unstable results when the replication factor is larger than 2
cf. Elasticsearch's Terms Aggregation
