Interactive Analytics at Scale with Druid

DRUID
INTERACTIVE EXPLORATORY ANALYTICS AT SCALE
GIAN MERLINO · DRUID COMMITTER · COFOUNDER @ IMPLY

OVERVIEW
MOTIVATION · WHY DRUID?
DEMO · AN EXAMPLE APPLICATION
ARCHITECTURE · HIGH LEVEL OVERVIEW
COMMUNITY · CONTRIBUTE TO DRUID

2013
HISTORY & MOTIVATION
‣ Druid was started in 2011
‣ Power interactive data applications
‣ Multi-tenancy: lots of concurrent users
‣ Scalability: trillions of events/day, sub-second queries
‣ Real-time analysis

HISTORY & MOTIVATION
‣ Questions lead to more questions
‣ Dig into the dataset using filters, aggregates, and comparisons
‣ Not all interesting queries can be determined upfront

DEMO
IN CASE THE INTERNET DIDN’T WORK
PRETEND YOU SAW SOMETHING COOL

2015
A GENERAL SOLUTION?
‣ Load all your data into Hadoop. Query it. Done!
‣ Good job guys, let’s go home

FINDING A SOLUTION
[Diagram: Event Streams -> Hadoop -> Insight]

FINDING A SOLUTION
[Diagram: Event Streams -> Hadoop (pre-processing and storage) -> Query Layer -> Insight]

MAKE QUERIES FASTER
‣ Optimizing business intelligence (OLAP) queries
• Aggregate measures over time, broken down by dimensions
• Revenue over time broken down by product type
• Top selling products by volume in San Francisco
• Number of unique visitors broken down by age
• Not dumping the entire dataset
• Not examining individual events
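
The query shape on this slide (aggregate a measure over time, broken down by a dimension) can be sketched in a few lines of Python. The event rows and field names here are made up for illustration, and this is plain Python rather than Druid code.

```python
from collections import defaultdict

# Hypothetical sales events: (day, product_type, revenue).
events = [
    ("2015-01-01", "books", 10.0),
    ("2015-01-01", "music", 5.0),
    ("2015-01-01", "books", 2.5),
    ("2015-01-02", "music", 7.0),
]


def revenue_by_day_and_type(rows):
    """Sum revenue per (day, product_type); individual events are never returned."""
    totals = defaultdict(float)
    for day, product_type, revenue in rows:
        totals[(day, product_type)] += revenue
    return dict(totals)


result = revenue_by_day_and_type(events)
```

The output has one row per (day, product type) pair, so its size depends on the number of distinct groups and not on the number of raw events.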

FINDING A SOLUTION
[Diagram: Event Streams -> Hadoop (pre-processing and storage) -> Sharded RDBMS? -> Insight]

GENERAL PURPOSE RDBMS
‣ The idea
• Row store
• Star schema
• Aggregate tables
• Query cache
‣ But!
• Scanning raw data is slow and expensive

FINDING A SOLUTION
[Diagram: Event Streams -> Hadoop -> NoSQL K/V Stores? -> Insight]

KEY/VALUE STORES
‣ Pre-computation
• Pre-compute every possible query
• Pre-compute a subset of queries
• Exponential scaling costs
‣ Range scans
• Primary key: dimensions/attributes
• Value: measures/metrics (things to aggregate)
• Still too slow!
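
The "exponential scaling costs" bullet can be made concrete by counting how many distinct groupings a pre-computation scheme would have to cover. This is a minimal sketch that counts only which dimensions a query groups by; filters on specific values would multiply the number further. The dimension names are borrowed from the example data later in the deck.

```python
from itertools import combinations


def groupings(dimensions):
    """Return every subset of dimensions that a query could group by."""
    return [
        combo
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


dims = ["page", "language", "city", "country"]
all_groupings = groupings(dims)
# 4 dimensions already give 2**4 = 16 groupings; each extra dimension doubles it.
```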

FINDING A SOLUTION
[Diagram: Event Streams -> Hadoop -> Column Stores -> Insight]

COLUMN STORES
‣ Load/scan exactly what you need for a query
‣ Different compression algorithms for different columns
‣ Encoding for string columns
‣ Compression for measure columns
‣ Different indexes for different columns

DRUID · KEY FEATURES
LOW LATENCY INGESTION
FAST AGGREGATIONS
ARBITRARY SLICE-N-DICE CAPABILITIES
HIGHLY AVAILABLE
APPROXIMATE & EXACT CALCULATIONS

DATA!
timestamp             page           language  city     country  ...  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA           10     65
2011-01-01T01:01:11Z  Ke$ha          en        Calgary  CA            17     87
...

PRE-AGGREGATION/ROLL-UP
...
...
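
The slide body did not survive extraction, so here is a hedged sketch of what roll-up generally means: truncate each timestamp to a coarser granularity, then sum the metrics of rows whose dimensions match. The first and third rows come from the DATA! slide; the second row and the hourly granularity are assumptions added for illustration.

```python
from collections import defaultdict
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

raw_events = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", "en", "SF", "USA", 10, 65),
    # Hypothetical extra event in the same hour with the same dimensions.
    ("2011-01-01T00:03:45Z", "Justin Bieber", "en", "SF", "USA", 15, 25),
    ("2011-01-01T01:01:11Z", "Ke$ha", "en", "Calgary", "CA", 17, 87),
]


def truncate_to_hour(ts):
    """Drop minutes and seconds from an ISO-8601 UTC timestamp."""
    parsed = datetime.strptime(ts, TS_FORMAT)
    return parsed.replace(minute=0, second=0).strftime(TS_FORMAT)


def roll_up(events):
    """Return {(hour, dimensions...): (row_count, added, deleted)}."""
    totals = defaultdict(lambda: [0, 0, 0])
    for ts, page, language, city, country, added, deleted in events:
        key = (truncate_to_hour(ts), page, language, city, country)
        totals[key][0] += 1
        totals[key][1] += added
        totals[key][2] += deleted
    return {key: tuple(values) for key, values in totals.items()}


rolled = roll_up(raw_events)
```

Three raw events collapse into two stored rows. The trade-off is that individual events can no longer be recovered, which fits the earlier point that these queries do not examine individual events.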

PARTITION DATA
‣ Shard data by time
‣ Immutable blocks of data called “segments”
Segment 2011-01-01T02/2011-01-01T03
Segment 2011-01-01T01/2011-01-01T02
Segment 2011-01-01T00/2011-01-01T01
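
A minimal sketch of time-based sharding, assuming hourly segments as in the three intervals listed above. The function names are hypothetical; the interval string follows the `start/end` form shown on the slide.

```python
from datetime import datetime, timedelta


def segment_interval(ts):
    """Return the hourly interval 'start/end' that contains the timestamp."""
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    start = parsed.replace(minute=0, second=0)
    end = start + timedelta(hours=1)
    fmt = "%Y-%m-%dT%H"
    return f"{start.strftime(fmt)}/{end.strftime(fmt)}"


def partition(timestamps):
    """Group event timestamps by the segment interval they fall into."""
    segments = {}
    for ts in timestamps:
        segments.setdefault(segment_interval(ts), []).append(ts)
    return segments


shards = partition(["2011-01-01T00:01:35Z", "2011-01-01T01:01:11Z"])
```

Because every query carries a time range, a layout like this lets the system skip whole segments whose interval does not overlap the query.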

IMMUTABLE SEGMENTS
‣ Fundamental storage unit in Druid
‣ No contention between reads and writes
‣ One thread scans one segment
‣ Multiple threads can access same underlying data

COLUMNAR STORAGE
‣ Scan/load only what you need
‣ Compression!
‣ Indexes!
...

COLUMN COMPRESSION · DICTIONARIES
‣ Create ids
• Justin Bieber -> 0, Ke$ha -> 1
‣ Store
• page -> [0 0 0 1 1 1]
• language -> [0 0 0 0 0 0]
...
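
The two steps above (create ids, then store ids instead of strings) can be sketched as follows. Ids are assigned in order of first appearance here, which is a simplification chosen so the output matches the slide; a real implementation may order its dictionary differently.

```python
def dictionary_encode(values):
    """Map each distinct value to an integer id and encode the column with those ids."""
    ids = {}
    encoded = []
    for value in values:
        if value not in ids:
            ids[value] = len(ids)
        encoded.append(ids[value])
    return ids, encoded


page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
language = ["en"] * 6

page_ids, page_encoded = dictionary_encode(page)
language_ids, language_encoded = dictionary_encode(language)
```

Small integers compress far better than repeated strings, and a column with a single distinct value, like `language` here, becomes a run of zeros.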

BITMAP INDICES
‣ Justin Bieber -> [0, 1, 2] -> [111000]
‣ Ke$ha -> [3, 4, 5] -> [000111]
...
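
A minimal sketch of building the per-value bitmaps shown above: one bit per row, set when that row holds the value. Bitmaps are kept as plain strings for readability; this is an illustration only, and production systems typically use compressed bitmap formats.

```python
def build_bitmaps(column):
    """Return {value: bitmap string}, one character per row, leftmost = row 0."""
    row_count = len(column)
    bitmaps = {}
    for row, value in enumerate(column):
        bits = bitmaps.setdefault(value, ["0"] * row_count)
        bits[row] = "1"
    return {value: "".join(bits) for value, bits in bitmaps.items()}


page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 3
page_bitmaps = build_bitmaps(page)
```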

FAST AND FLEXIBLE QUERIES
row  page
0    Justin Bieber
1    Justin Bieber
2    Ke$ha
3    Ke$ha
JUSTIN BIEBER -> [1, 1, 0, 0]
KE$HA -> [0, 0, 1, 1]
JUSTIN BIEBER OR KE$HA -> [1, 1, 1, 1]
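
Filters then become bitwise operations over these bitmaps. The sketch below reproduces the OR from this slide with plain Python lists; the helper names are hypothetical.

```python
def bitmap_or(left, right):
    """Rows matching either filter."""
    return [a | b for a, b in zip(left, right)]


def bitmap_and(left, right):
    """Rows matching both filters."""
    return [a & b for a, b in zip(left, right)]


def matching_rows(bitmap):
    """Row numbers whose bit is set; only these rows need to be scanned."""
    return [row for row, bit in enumerate(bitmap) if bit]


justin_bieber = [1, 1, 0, 0]
kesha = [0, 0, 1, 1]

either = bitmap_or(justin_bieber, kesha)
both = bitmap_and(justin_bieber, kesha)
```

Any boolean combination of dimension filters reduces to a single bitmap, which is what makes arbitrary slice-and-dice cheap before any metric column is read.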

ARCHITECTURE (BATCH ONLY)
[Diagram: Data -> Hadoop -> Segments -> Historical Nodes]

HISTORICAL NODES
‣ Main workhorses of a Druid cluster
‣ Respond to queries on segments
‣ Shared-nothing architecture

ARCHITECTURE (BATCH ONLY)
[Diagram: Data -> Hadoop -> Segments -> Historical Nodes; Queries -> Broker Nodes -> Historical Nodes]

BROKER NODES
‣ Knows which nodes hold what data
‣ Query scatter/gather (send requests to nodes and merge results)
‣ Caching
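
A toy sketch of the three broker responsibilities listed here: a segment-location table, scatter/gather, and a result cache. The `HistoricalNode` and `Broker` classes are hypothetical stand-ins written for this example, not Druid's classes, and the whole-query cache is a deliberate simplification.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class HistoricalNode:
    def __init__(self, segments):
        # segments: {interval: [(page, added), ...]}
        self.segments = segments

    def query(self, intervals):
        """Sum 'added' per page over the requested segments held by this node."""
        partial = Counter()
        for interval in intervals:
            for page, added in self.segments.get(interval, []):
                partial[page] += added
        return partial


class Broker:
    def __init__(self, nodes):
        # Knows which node holds which segment (first holder wins in this sketch).
        self.locations = {}
        for node in nodes:
            for interval in node.segments:
                self.locations.setdefault(interval, node)
        self.cache = {}

    def query(self, intervals):
        key = tuple(sorted(intervals))
        if key in self.cache:
            return self.cache[key]
        # Scatter: group the requested intervals by the node that holds them.
        plan = {}
        for interval in intervals:
            node = self.locations.get(interval)
            if node is not None:
                plan.setdefault(id(node), (node, []))[1].append(interval)
        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(lambda item: item[0].query(item[1]), plan.values()))
        # Gather: merge the partial aggregates into one result.
        merged = Counter()
        for partial in partials:
            merged.update(partial)
        self.cache[key] = merged
        return merged


node_a = HistoricalNode({"2011-01-01T00/2011-01-01T01": [("Justin Bieber", 10)]})
node_b = HistoricalNode(
    {"2011-01-01T01/2011-01-01T02": [("Ke$ha", 17), ("Justin Bieber", 3)]}
)
broker = Broker([node_a, node_b])
totals = broker.query(["2011-01-01T00/2011-01-01T01", "2011-01-01T01/2011-01-01T02"])
```

Merging works because sums are associative: each node can aggregate its own segments independently, which is what the shared-nothing design of the historical nodes relies on.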

EVOLVING A SOLUTION
[Diagram: Event Streams -> Hadoop (pre-processing and storage) -> Druid -> Insight]

MORE PROBLEMS
‣ We’ve solved the query problem
• Druid gave us arbitrary data exploration & fast queries
‣ But what about data freshness?
• Batch loading is slow!
• We want “real-time”
• Alerts, operational monitoring, etc.

FAST LOADING WITH DRUID
‣ We have an indexing system
‣ We have a serving system that runs queries on data
‣ We can serve queries while building indexes!
‣ Real-time indexing workers do this

REAL-TIME NODES
‣ Write-optimized data structure: hash map in heap
‣ Convert write-optimized -> read-optimized
‣ Read-optimized data structure: Druid segments
‣ Query data immediately
[Diagram: Events -> Memory -> Convert -> Segment; Queries served from the node]
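
The write-optimized to read-optimized hand-off can be sketched with a dict as the in-heap hash map and a sorted, immutable tuple standing in for a segment. The class name, the `max_rows_in_memory` threshold, and the single `added` metric are assumptions for illustration.

```python
class RealtimeNode:
    def __init__(self, max_rows_in_memory):
        self.max_rows = max_rows_in_memory
        self.buffer = {}        # write-optimized: hash map of page -> added
        self.buffered_rows = 0
        self.segments = []      # read-optimized: immutable, sorted tuples

    def ingest(self, page, added):
        """Aggregate the event into the in-memory map; persist when it fills up."""
        self.buffer[page] = self.buffer.get(page, 0) + added
        self.buffered_rows += 1
        if self.buffered_rows >= self.max_rows:
            self.persist()

    def persist(self):
        """Convert the in-memory map into an immutable, sorted segment."""
        if self.buffer:
            self.segments.append(tuple(sorted(self.buffer.items())))
        self.buffer = {}
        self.buffered_rows = 0

    def total_added(self, page):
        """Answer a query from both the live buffer and the persisted segments."""
        total = self.buffer.get(page, 0)
        for segment in self.segments:
            total += dict(segment).get(page, 0)
        return total


node = RealtimeNode(max_rows_in_memory=2)
node.ingest("Justin Bieber", 10)
node.ingest("Ke$ha", 17)          # second row triggers a persist
node.ingest("Justin Bieber", 5)   # stays in the in-memory buffer
```

Queries see an event as soon as `ingest` returns, because the query path reads the buffer as well as the persisted segments.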

ARCHITECTURE (STREAMING-ONLY)
[Diagram: Streaming Data -> Real-time Nodes -> Segments -> Historical Nodes; Queries -> Broker Nodes -> Real-time and Historical Nodes]

ARCHITECTURE (LAMBDA)
[Diagram: Batch Data -> Hadoop -> Segments -> Historical Nodes; Streaming Data -> Real-time Nodes -> Segments -> Historical Nodes; Queries -> Broker Nodes -> Real-time and Historical Nodes]

APPROXIMATE ANSWERS
‣ Drastically reduce storage space and compute time
• Cardinality estimation
• Histograms
• Quantiles
• Add your own proprietary modules
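
To show why approximate cardinality saves space, here is a sketch of a k-minimum-values estimator: it keeps only the k smallest hash values seen and infers the distinct count from how tightly they are packed. This technique was chosen for brevity and is not the algorithm Druid ships (Druid's hyperUnique aggregator is based on HyperLogLog). The function names and the default k are assumptions.

```python
import hashlib
import heapq


def hash01(value):
    """Hash a string to a float in [0, 1)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_cardinality(values, k=1024):
    """Estimate the number of distinct values from the k smallest hashes."""
    kept = set()
    heap = []  # max-heap via negation; holds the k smallest distinct hashes
    for value in values:
        h = hash01(value)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: the count is exact
    kth_smallest = -heap[0]
    return int((k - 1) / kth_smallest)


# 200,000 hypothetical events from 50,000 distinct users.
visits = (f"user-{i % 50000}" for i in range(200000))
estimate = estimate_cardinality(visits)
```

The state is at most k hashes regardless of how many events are seen, and the expected relative error shrinks roughly as 1/sqrt(k).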

QUERY INTERFACE
‣ Query libraries:
• JSON over HTTP
• SQL
• R
• Python
• Ruby
• Perl
‣ UIs
• Pivot
• Grafana
• Panoramix
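
For the JSON-over-HTTP option, a query is a JSON document POSTed to a broker. The sketch below builds a native topN query ("top pages by lines added") without sending it. The broker address `localhost:8082`, the datasource name `wikipedia`, and the interval are assumptions; the field names follow the example data earlier in the deck.

```python
import json
import urllib.request

# Assumed broker address; adjust for a real cluster.
BROKER_URL = "http://localhost:8082/druid/v2/"

query = {
    "queryType": "topN",
    "dataSource": "wikipedia",  # hypothetical datasource name
    "intervals": ["2011-01-01T00:00:00Z/2011-01-02T00:00:00Z"],
    "granularity": "all",
    "dimension": "page",
    "metric": "added",
    "threshold": 5,
    "filter": {"type": "selector", "dimension": "language", "value": "en"},
    "aggregations": [
        {"type": "longSum", "name": "added", "fieldName": "added"}
    ],
}


def build_request(url, native_query):
    """Wrap a native query dict in an HTTP POST request with a JSON body."""
    body = json.dumps(native_query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(BROKER_URL, query)

# To run against a live cluster:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```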

THE COMMUNITY
‣ Growing Community
• 130+ contributors from many different companies
• In production at many different companies; we’re hoping for more!
• Ad-tech, network traffic, operations, activity streams, etc.
• We love contributions!

PRODUCTION READY
‣ High availability through replication
‣ Rolling restarts
‣ 4 years of no downtime for software updates and restarts
‣ Battle tested
‣ Used by hundreds of companies in production

2014
DRUID IN PRODUCTION
REALTIME INGESTION
>3M EVENTS / SECOND SUSTAINED (200B+ EVENTS/DAY)
10 – 100K EVENTS / SECOND / CORE

DRUID IN PRODUCTION
CLUSTER SIZE
>500TB OF SEGMENTS (>50 TRILLION RAW EVENTS)
>5000 CORES (>400 NODES, >100TB RAM)
IT’S CHEAP
MOST COST EFFECTIVE AT THIS SCALE

DRUID IN PRODUCTION
QUERY LATENCY (500MS AVERAGE)
90% < 1S · 95% < 2S · 99% < 10S
[Chart: Query latency percentiles (90%ile, 95%ile, 99%ile) per datasource (a to h); x-axis: time, Feb 03 to Feb 24; y-axis: query time (seconds)]

DRUID IN PRODUCTION
QUERY VOLUME
SEVERAL HUNDRED QUERIES / SECOND
VARIETY OF GROUP BY & TOP-K QUERIES

TAKE-AWAYS
‣ When Druid?
• You want to power user-facing data applications
• You want to do your analysis on data as it’s happening (realtime)
• Arbitrary data exploration with sub-second ad-hoc queries
• OLAP, BI, Pivot (anything involving aggregates)
• You need availability, extensibility and flexibility

DRUID IS OPEN SOURCE
WWW.DRUID.IO
twitter @druidio
irc.freenode.net #druid-dev

MY INFORMATION
GIAN@IMPLY.IO
twitter @gianmerlino
LinkedIn gianmerlino
