© Hortonworks Inc. 2011–2017
Hortonworks
Premier Inside-out
Apache Druid
Slim Bouguerra, Staff Software Engineer
Will Xu, Senior Product Manager
May 2018
Disclaimer
This document may contain product features and technology directions that are under development, may be under development in the future, or may ultimately not be developed.
Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback, and the overarching Apache Software Foundation community development process can all affect timing and final delivery.
This document's description of these features and technology directions does not represent a contractual commitment, promise, or obligation from Hortonworks to deliver these features in any generally available product.
Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.
History
⬢ Development started at Metamarkets in 2011
⬢ Initial use case
– powering an ad-tech analytics product
⬢ Open sourced in late 2012
– GPL licensed initially
– Switched to Apache V2 in early 2015
⬢ 150+ committers today
Why use Druid?
⬢ Sub-second OLAP Queries
⬢ Real-time Streaming Ingestion
⬢ Multi-tenant with support for 1000+ users
⬢ Cost Effective
⬢ Highly Available
⬢ Scalable
How’s	Druid	used	in	prod	today?
Ad-hoc	analytics.
High	concurrency	user-facing	real-time	slice-and-dice.
Real-time	loads	of	10s	of	billions	of	events	per	day.
Powers	infrastructure	anomaly	detection	dashboards.
Ingest	rates	>	2TB	per	hour.
Exploratory	analytics	on	clickstream	sessions.
Real-time	user	behavior	analytics.
Druid is designed for time-series use cases
⬢ Time-series data workload != normal database workload
– Queries always contain date columns as group-by keys
– Queries filter by time
– Queries touch few columns (< 10)
– Queries have very selective filters (< thousands of rows out of billions)
SELECT `user`, sum(`c_added`) AS s, EXTRACT(year FROM `__time`)
FROM druid_table
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`, EXTRACT(year FROM `__time`)
ORDER BY s DESC
LIMIT 10;
Where does time-series data come from?
⬢ Mostly an insert/append workload, very few updates
– Event stream data
– Application/server logs
– Sensor logs
⬢ Schema-less
– New event types
– New server/application/OS types or upgrades
– New sensors
Druid @ HDP
Hive + Druid = Insight When You Need It
⬢ Easily ingest event (streaming) data into OLAP cubes, with pre-aggregation
⬢ Keep data up-to-date with Hive ACID MERGE
⬢ Build OLAP cubes from Hive SQL tables
⬢ Archive data to Hive for history
⬢ Run real-time OLAP queries, or deep analytics over all history, through a unified SQL layer
Why Use Druid From Hortonworks?
With HDP  Druid Alone
Interactive Analytics ✓ ✓
Analyze Data Streams ✓ ✓
Spatial Analytics ✓ ✓
Horizontally Scalable ✓ ✓
SQL:2011 Interface ✓ ✖
Join Historical and Real-time Data ✓ ✖
Management and Monitoring with Ambari ✓ ✖
Managed Rolling Upgrades ✓ ✖
Visualization with Superset ✓ ✖
Easy App Development with Hortonworks SAM ✓ ✖
Druid-Hive integration makes things easy
Superset UI for Fast, Interactive Dashboards and Exploration
Agenda
● Druid overview
● Time series data
● Druid architecture
● Demo
● Questions
Druid Internals: Segments (indexes)
Druid: Segment Data Structures
⬢ Within a segment:
– Timestamp column group
– Dimensions column group
– Metrics column group
– Indexes that facilitate fast lookup and aggregation
Data is partitioned by time!
timestamp publisher advertiser gender country ... click price
2011-01-01T00:01:35Z bieberfever.com google.com Male USA 0 0.65
2011-01-01T00:03:53Z bieberfever.com google.com Male USA 0 0.62
2011-01-01T00:04:51Z bieberfever.com google.com Male USA 1 0.45
2011-01-01T01:00:00Z ultratrimfast.com google.com Female UK 0 0.87
2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 0 0.99
2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 1 1.53
...
⬢ Time-partitioned immutable partitions.
⬢ Typically ~5 million rows per segment.
⬢ Druid segments are stored in a column orientation (only what is needed is actually loaded and scanned).
COLUMN COMPRESSION - DICTIONARIES
• Create ids
• bieberfever.com -> 0, ultratrimfast.com -> 1
• Store
• publisher -> [0, 0, 0, 1, 1, 1]
• advertiser -> [0, 0, 0, 0, 0, 0]
timestamp publisher advertiser gender country ... click price
2011-01-01T00:01:35Z bieberfever.com google.com Male USA 0 0.65
2011-01-01T00:03:53Z bieberfever.com google.com Male USA 0 0.62
2011-01-01T00:04:51Z bieberfever.com google.com Male USA 1 0.45
2011-01-01T01:00:00Z ultratrimfast.com google.com Female UK 0 0.87
2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 0 0.99
2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 1 1.53
...
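The dictionary encoding above can be sketched in a few lines of Python. This is an illustrative toy, assuming ids are assigned in first-seen order; Druid itself sorts the dictionary and further compresses the id arrays.

```python
def dict_encode(values):
    # Map each distinct string to a small integer id, in first-seen order,
    # and rewrite the column as a list of ids.
    dictionary = {}
    encoded = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        encoded.append(dictionary[v])
    return dictionary, encoded

publisher = ["bieberfever.com"] * 3 + ["ultratrimfast.com"] * 3
dictionary, ids = dict_encode(publisher)
# dictionary: {'bieberfever.com': 0, 'ultratrimfast.com': 1}
# ids:        [0, 0, 0, 1, 1, 1]
```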
BITMAP INDEXES
timestamp publisher advertiser gender country ... click price
2011-01-01T00:01:35Z bieberfever.com google.com Male USA 0 0.65
2011-01-01T00:03:53Z bieberfever.com google.com Male USA 0 0.62
2011-01-01T00:04:51Z bieberfever.com google.com Male USA 1 0.45
2011-01-01T01:00:00Z ultratrimfast.com google.com Female UK 0 0.87
2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 0 0.99
2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 1 1.53
...
• bieberfever.com -> [0, 1, 2] -> [111000]
• ultratrimfast.com -> [3, 4, 5] -> [000111]
• Compress using Concise or Roaring (take advantage of dimension sparsity)
FAST AND FLEXIBLE QUERIES
JUSTIN BIEBER -> [1, 1, 0, 0]
KE$HA -> [0, 0, 1, 1]
JUSTIN BIEBER OR KE$HA -> [1, 1, 1, 1]
Queries that solely aggregate metrics based on filters do not need to touch the list of dimension values!
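A hypothetical sketch of this bitmap trick, using plain Python integers as bitsets; real Druid stores these as compressed Concise or Roaring bitmaps.

```python
def build_bitmaps(column):
    # One bitmap per distinct value; bit r is set when row r holds that value.
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

artist = ["JUSTIN BIEBER", "JUSTIN BIEBER", "KE$HA", "KE$HA"]
bitmaps = build_bitmaps(artist)

# "JUSTIN BIEBER OR KE$HA" is a single bitwise OR; the raw values are never re-read.
matches = bitmaps["JUSTIN BIEBER"] | bitmaps["KE$HA"]
matching_rows = [r for r in range(len(artist)) if matches & (1 << r)]
# matching_rows: [0, 1, 2, 3]
```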
Druid Architecture
Broker Nodes
⬢ Keeps track of segment announcements in the cluster
– (This information is kept in ZooKeeper, much as Storm or HBase do.)
⬢ Scatters queries across historical and realtime nodes
– (Clients issue queries to this node, but queries are processed elsewhere.)
⬢ Merges results from different query nodes
⬢ (Distributed) caching layer
Historical Nodes
⬢ Shared-nothing architecture
⬢ Main workhorses of a Druid cluster
⬢ Load immutable, read-optimized segments
⬢ Respond to queries
⬢ Use memory-mapped files to load segments
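The memory-mapping idea can be illustrated with Python's stdlib mmap; a toy file stands in for a segment here. The point is that the OS page cache, not the process heap, decides which segment pages stay in memory.

```python
import mmap
import tempfile

# Write a stand-in "segment" file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"druid-segment-bytes")
    path = f.name

# Map it read-only: pages are faulted in on demand and evicted under memory
# pressure, so a historical node can "load" far more segment data than it has RAM.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = mm[:5]  # touches only the first page
    mm.close()
```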
Early Druid Architecture
[Diagram: batch data flows from Hadoop into historical nodes; a broker node receives queries and fans them out across the historical nodes]
Druid: Batch Indexing
⬢ Indexing is performed by related components:
– Overlord (N = 1)
– Middle Managers (N >= 1)
⬢ Indexing is done via MapReduce jobs that build segment files.
⬢ Batch indexing is done on data that already exists in deep storage (e.g., HDFS).
⬢ The index definition is specified via a JSON file and submitted to the Overlord.
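As a rough sketch, such a batch index task spec has the following shape. The field names follow the Druid Hadoop ingestion-spec layout, but the data source, interval, and input path here are placeholders for illustration, not values from this talk.

```python
import json

task = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "dataSource": "wikiticker",
            "granularitySpec": {
                "segmentGranularity": "DAY",
                "queryGranularity": "NONE",
                "intervals": ["2015-09-12/2015-09-13"],
            },
        },
        "ioConfig": {
            "type": "hadoop",
            # Data must already sit in deep storage (e.g. HDFS).
            "inputSpec": {"type": "static", "paths": "/data/wikiticker.json"},
        },
    },
}

# This JSON document is what gets POSTed to the Overlord's task endpoint.
payload = json.dumps(task, indent=2)
```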
Current Druid Architecture
[Diagram: batch data flows from Hadoop into historical nodes; streaming data passes through ETL (Samza, Kafka, Storm, Spark, etc.) into realtime nodes, which hand segments off to historical nodes; a broker node fans queries out across both]
Druid: Realtime Indexing, Push Mode
[Diagram: producers such as Spark, Flink, Storm, or Python clients push events via Tranquility to indexing-service peons spawned by the MiddleManager under the Overlord; completed segments are pushed to deep storage and loaded into the historicals' segment caches; the Coordinator, Broker, ZooKeeper, and Kafka round out the cluster]
Druid: Realtime Indexing, Pull Mode
[Diagram: same components, but the indexing-service peons pull events directly from Kafka; completed segments are pushed to deep storage and loaded into the historicals' segment caches]
Realtime Nodes
⬢ Ability to ingest streams of data
⬢ Both push- and pull-based ingestion
⬢ Store data in a write-optimized structure
⬢ Periodically convert the write-optimized structure to read-optimized segments
⬢ Events are queryable as soon as they are ingested
Coordinator Nodes
⬢ Assign segments to historical nodes
⬢ Interval-based cost function to distribute segments
⬢ Make sure query load is uniform across historical nodes
⬢ Handle replication of data
⬢ Configurable rules to load/drop data
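Those load/drop rules are JSON documents. A plausible retention pair is sketched below; the rule types loadByPeriod and dropForever exist in Druid, but the period and replicant counts here are made up for illustration.

```python
import json

retention_rules = [
    # Keep the most recent 30 days on historicals, two replicas in the default tier.
    {"type": "loadByPeriod", "period": "P30D",
     "tieredReplicants": {"_default_tier": 2}},
    # Drop every older segment from the historicals (it remains in deep storage).
    {"type": "dropForever"},
]

rules_json = json.dumps(retention_rules, indent=2)
```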
A Typical Druid Deployment
[Diagram: many nodes, sized by data volume and query count; deep storage on HDFS or S3; Superset, dashboards, or BI tools on top]
Registering and creating Druid data sources
Druid data sources in Hive
⬢ The user needs to provide Druid data source information to Hive
⬢ Two different options depending on requirements
– Register Druid data sources in Hive
• Data is already stored in Druid
– Create Druid data sources from Hive
• Data is stored in Hive
• The user may want to pre-process the data before storing it in Druid
Registering Druid data sources in Hive
⬢ Simple CREATE EXTERNAL TABLE statement

CREATE EXTERNAL TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker");

(druid_table_1 is the Hive table name, the STORED BY clause gives the Hive storage handler classname, and "wikiticker" is the Druid data source name.)
⇢ Broker node endpoint specified as a Hive configuration parameter
⇢ Automatic Druid data schema discovery: segment metadata query
Creating Druid data sources from Hive
⬢ Use a Create Table As Select (CTAS) statement

CREATE TABLE druid_table_1
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikiticker", "druid.segment.granularity" = "DAY")
AS
SELECT __time, page, user, c_added, c_removed
FROM src;

(druid_table_1 is the Hive table name, the STORED BY clause gives the Hive storage handler classname, "wikiticker" is the Druid data source name, and "DAY" is the Druid segment granularity.)
Querying Druid data sources
Querying Druid data sources
⬢ Automatic rewriting when a query is expressed over a Druid table
– Powered by Apache Calcite
– Main challenge: identify patterns in the logical plan corresponding to the different kinds of Druid queries (Timeseries, TopN, GroupBy, Select)
⬢ Translate the (sub)plan of operators into a valid Druid JSON query
– The Druid query is encapsulated within the Hive TableScan operator
⬢ Hive TableScan uses the Druid input format
– Submits the query to Druid and generates records out of the query results
⬢ It might not be possible to push all computation to Druid
– Our contract is that the query should always be executed
Druid query recognition (powered by Apache Calcite)
SELECT `user`, sum(`c_added`) AS s
FROM druid_table_1
WHERE EXTRACT(year FROM `__time`)
BETWEEN 2010 AND 2011
GROUP BY `user`
ORDER BY s DESC
LIMIT 10;
⬢ Top 10 users that have added the most characters from the beginning of 2010 until the end of 2011 (Apache Hive SQL query)
Query logical plan: Druid Scan → Filter → Project → Aggregate → Sort Limit → Sink
{
  "queryType": "groupBy",
  "dataSource": "users_index",
  "granularity": "all",
  "dimension": "user",
  "aggregations": [ { "type": "longSum", "name": "s", "fieldName": "c_added" } ],
  "limitSpec": {
    "limit": 10,
    "columns": [ { "dimension": "s", "direction": "descending" } ]
  },
  "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
}
Physical plan transformation
[Diagram: the Apache Hive logical plan (Druid Scan → Filter → Project → Aggregate → Sort Limit → Sink) becomes a physical plan in which a Table Scan operator, using the Druid input format, submits the groupBy Druid JSON query, followed by Select and File Sink operators]
Road ahead
⬢ Tighten the integration between Druid and Apache Hive/Apache Calcite
– Recognize more functions → push more computation to Druid
– Support complex column types
– Close the gap between the semantics of the different systems
• Time zone handling, null values
⬢ Broader perspective
– Materialized views support in Apache Hive
• Data stored in Apache Hive
• Create materialized views in Druid
– Denormalized star schema for a certain time period
• Automatic input query rewriting over the materialized view (Apache Calcite)
SSB Benchmarks
Schema and Data Model: Base Tables
Star schema benchmark:
⬢ Lineorder: 6 billion rows
⬢ Customer: 30,000,000 rows
⬢ Supplier: 2,000,000 rows
⬢ Parts: 1,400,000 rows
Picture credit: http://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-create-test-data.html
Benchmark Setup
Platform:
⬢ 11 nodes, 2x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz with 16 CPU threads each
⬢ 256 GB of RAM per node
⬢ 6x WDC WD4000FYYZ-0 1K02 4TB SCSI disks per node
⬢ Not 100% dedicated (part of a YARN-managed cluster)
Druid:
⬢ 10 / 5 / 3 historical nodes (the workhorses)
– -Xmx12g -Xms12g -XX:NewSize=6g -XX:MaxNewSize=6g -XX:MaxDirectMemorySize=128g
⬢ 1 broker node
– -Xmx25g -Xms25g -XX:NewSize=6g -XX:MaxNewSize=6g -XX:MaxDirectMemorySize=64g
⬢ No segment replication
– Query-to-server mapping is one to one
⬢ No caching
⬢ No tuning, out-of-the-box configuration
Schema and Data Model: Denormalized Table in Druid
⬢ 6 billion rows (no rollups)
⬢ 174 GB over 7 years
⬢ Segmented by month
– One partition holds about 1/9 of a month of data
– 9–10 million rows per partition
– Segment size ~262 MB
– 707 partitions in total
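A quick back-of-the-envelope check of those numbers (plain arithmetic on the figures above, nothing measured here):

```python
total_rows = 6_000_000_000   # 6 billion rows
total_gb = 174               # 174 GB over 7 years
partitions = 707             # partitions in total

rows_per_partition = total_rows / partitions     # ~8.5 million rows
mb_per_segment = total_gb * 1024 / partitions    # ~252 MB, close to the quoted ~262 MB
```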
SSB Queries
⬢ Q1 (quick metrics): the amount of revenue increase that would have resulted from eliminating certain company-wide discounts
– No group by
– Years to weeks of data
⬢ Q2 (product sales insights): compares revenue for some product classes, for suppliers in a certain region, grouped by more restrictive product classes over all years of orders
– A couple of group-by keys
– All data down to one month
– Less selective to very selective filters
⬢ Q3 (customer insights): revenue volume by customer nation, supplier nation, and year within a given region, in a certain time period
– More than 3 group-by keys
– Years of data
⬢ Q4: "what-if" sequences, of the OLAP type
– Same shape as Q3
Benchmark execution setup
⬢ Run using Apache JMeter
⬢ Every query run 15 times, capturing min/max/avg/std
⬢ Query time includes JDBC overhead
⬢ A warmup query is launched first to initialize the metadata
⬢ Concurrency mode launches multiple threads via different JDBC connections
⬢ Druid 0.9.2 (HDP 2.6)
⬢ Hive 2.1 (HDP 2.6)
[Chart: Q1 query response times in ms (average/min/max), axis 0–3000 ms]
Q1.1 ({*}, {yearLevel = 1993, 1 < discountLevel < 3, quantityLevel < 25}, {sum_revenue})
Q1.2 ({*}, {yearmonthnumLevel = 199401, 4 < discountLevel < 6, 26 < quantityLevel < 35}, {sum_revenue})
Q1.3 ({weeknuminyearYearLevel}, {5 < discountLevel < 7, 26 < quantityLevel < 35}, {sum_revenue})
[Chart: Q2 query response times in ms (average/min/max), axis 0–3000 ms]
(4) ({yearLevel, brand1Level}, {categoryLevel = MFGR#12, s_regionLevel = AMERICA}, {lo_revenue})
(5)({yearLevel, brand1Level, s_regionLevel}, {brand1Level = MFGR#2221 OR ... OR brand1level = MFGR#2228, s_regionLevel = ASIA}, {lo_revenue})
(6)({yearLevel, brand1Level, s_regionLevel}, {brand1Level = MFGR#2239, s_regionLevel = EUROPE}, {lo_revenue})
[Chart: Query 2.2 response time in seconds, log scale, comparing Concise vs Roaring bitmaps (union of bitmaps) on 3, 5, and 10 historical nodes; plotted values: 116.22, 100, and 60 seconds for one series and 5.51, 3.667, and 2.677 seconds for the other]
Q2.2 (5)({yearLevel, brand1Level, s_regionLevel}, {brand1Level = MFGR#2221 OR ... OR brand1level = MFGR#2228, s_regionLevel = ASIA}, {lo_revenue})
[Chart: Q3 query response times in ms (average/min/max), axis 0–3000 ms]
1 (7)({yearLevel, c_nationLevel, s_nationLevel}, {1992 ≤ yearLevel ≤ 1997, c_nationLevel = ASIA, s_nationLevel = ASIA}, {lo_revenue})
2 (8)({yearLevel, c_cityLevel, s_cityLevel}, {5Y, c_nationLevel = UNITED STATES, s_nationLevel = UNITED STATES}, {lo_revenue})
3 (9)(same as 3.2, {5Y, c_cityLevel = UNITED KI1 OR c_cityLevel = UNITED KI5, s_cityLevel = UNITED KI1 OR s_cityLevel = UNITED KI5}, {lo_revenue})
4 (10)({yearmonthLevel, c_cityLevel, s_cityLevel}, {yearmonthLevel = Dec1997, c_cityLevel = UNITED KI1 or c_cityLevel = UNITED KI5, s_cityLevel = UNITED KI1 or s_cityLevel = UNITED KI5}, {lo_revenue})
[Chart: Q4 query response times in ms (average/min/max), axis 0–3000 ms]
Q4.1 (11)({yearLevel, c_nationLevel, mfgrLevel, s_regionLevel}, {c_regionLevel = AMERICA, mfgrLevel = MFGR#1 OR mfgrLevel = MFGR#2, s_region = AMERICA}, {sum_profit})
Q4.2 (12)({yearLevel, c_regionLevel, categoryLevel, s_nationLevel}, {yearLevel = 1997 OR yearLevel = 1998, c_regionLevel = AMERICA, mfgrLevel = MFGR#1 OR mfgrLevel = MFGR#2, s_regionLevel = AMERICA}, {sum_profit})
Q4.3 (13)({yearLevel, regionLevel, categoryLevel, cityLevel}, {yearLevel = 1997 OR yearLevel = 1998, c_regionLevel = AMERICA, categoryLevel = MFGR#14, s_region = UNITED STATES}, {sum_profit})
[Chart: average query response time in ms (execution + compile time) on a 10-node cluster, Hive vs Druid, axis 0–2500 ms]
[Chart: concurrency test, average execution time in ms on a 10-node cluster for count_star and Q1.1 through Q4.3, with 1, 5, 10, and 15 concurrent users; axis 0–5000 ms]
o Every user connects via a different JDBC connection.
 
Making Data Science accessible to a wider audience
Making Data Science accessible to a wider audienceMaking Data Science accessible to a wider audience
Making Data Science accessible to a wider audience
 
Embracing data science for smarter analytics apps
Embracing data science for smarter analytics appsEmbracing data science for smarter analytics apps
Embracing data science for smarter analytics apps
 
SEO 2017 Strategy & Presentation
SEO 2017 Strategy & PresentationSEO 2017 Strategy & Presentation
SEO 2017 Strategy & Presentation
 
Your SEO Experience Is Holding You Back - DeepCrawl Webinar - May 2020
Your SEO Experience Is Holding You Back - DeepCrawl Webinar - May 2020Your SEO Experience Is Holding You Back - DeepCrawl Webinar - May 2020
Your SEO Experience Is Holding You Back - DeepCrawl Webinar - May 2020
 
Seo company in chennai
Seo company in chennaiSeo company in chennai
Seo company in chennai
 
Tech Ed Africa 10 Successful SharePoint Deployment Oleson
Tech Ed Africa 10 Successful SharePoint Deployment OlesonTech Ed Africa 10 Successful SharePoint Deployment Oleson
Tech Ed Africa 10 Successful SharePoint Deployment Oleson
 

Más de Hortonworks

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyHortonworks
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakHortonworks
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsHortonworks
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysHortonworks
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's NewHortonworks
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerHortonworks
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsHortonworks
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeHortonworks
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleHortonworks
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATAHortonworks
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Hortonworks
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseHortonworks
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseHortonworks
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationHortonworks
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementHortonworks
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHortonworks
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCHortonworks
 
4 Essential Steps for Managing Sensitive Data
4 Essential Steps for Managing Sensitive Data4 Essential Steps for Managing Sensitive Data
4 Essential Steps for Managing Sensitive DataHortonworks
 

Más de Hortonworks (20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 
4 Essential Steps for Managing Sensitive Data
4 Essential Steps for Managing Sensitive Data4 Essential Steps for Managing Sensitive Data
4 Essential Steps for Managing Sensitive Data
 

Último

Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - AvrilIvanti
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxAna-Maria Mihalceanu
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 

Último (20)

Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Français Patch Tuesday - Avril
Français Patch Tuesday - AvrilFrançais Patch Tuesday - Avril
Français Patch Tuesday - Avril
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
A Glance At The Java Performance Toolbox
A Glance At The Java Performance ToolboxA Glance At The Java Performance Toolbox
A Glance At The Java Performance Toolbox
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 

Apache Druid Architecture Overview

  • 1. © Hortonworks Inc. 2011 –2017 Please wait! We’ll get started about 3 minutes past the hour! Hortonworks Premier Inside-out Apache Druid Slim Bouguerra, Staff Software Engineer Will Xu, Senior Product Manager May, 2018
  • 2. Disclaimer This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately not be developed. Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache, however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery. This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product. Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.
  • 3. © Hortonworks Inc. 2011- 2017. All rights reserved | 3 History ⬢ Development started at Metamarkets in 2011 ⬢ Initial use case – power ad-tech analytics product ⬢ Open sourced in late 2012 – GPL licensed initially – Switched to Apache V2 in early 2015 ⬢ 150+ committers today
  • 4.
  • 5. © Hortonworks Inc. 2011- 2017. All rights reserved | 5 Why use Druid? ⬢ Sub-second OLAP Queries ⬢ Real-time Streaming Ingestion ⬢ Multi-tenant with support for 1000+ users ⬢ Cost Effective ⬢ Highly Available ⬢ Scalable
  • 7. Druid is designed for time-series use cases ⬢ Time-series data workload != normal database workload – Queries always contain date columns as group-by keys – Queries filter by time – Queries touch few columns (< 10) – Queries have very selective filters (< thousands of rows out of billions) SELECT `user`, sum(`c_added`) AS s, EXTRACT(year FROM `__time`) FROM druid_table WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 GROUP BY `user`, EXTRACT(year FROM `__time`) ORDER BY s DESC LIMIT 10;
  • 8. Where does time-series data come from? ⬢ Mostly an insert/append workload, very few updates – Event stream data – Application/server logs – Sensor logs ⬢ Schema-less – New event types – New server/application/OS type or upgrade – New sensors
  • 9. © Hortonworks Inc. 2011- 2017. All rights reserved | 9 Druid @ HDP
  • 10. Hive + Druid = Insight When You Need It OLAP Cubes SQL Tables Streaming Data Historical Data Unified SQL Layer Pre- Aggregate ACID MERGE Easily ingest event data into OLAP cubes Keep data up-to-date with Hive MERGE Build OLAP Cubes from Hive Archive data to Hive for history Run OLAP queries in real-time or Deep Analytics over all history Deep Analytics Real-Time Query
  • 11. Why Use Druid From Hortonworks? With HDP Druid Alone Interactive Analytics ✓ ✓ Analyze Data Streams ✓ ✓ Spatial Analytics ✓ ✓ Horizontally Scalable ✓ ✓ SQL:2011 Interface ✓ ✖ Join Historical and Real-time Data ✓ ✖ Management and Monitoring with Ambari ✓ ✖ Managed Rolling Upgrades ✓ ✖ Visualization with Superset ✓ ✖ Easy App Development with Hortonworks SAM ✓ ✖
  • 12. © Hortonworks Inc. 2011- 2017. All rights reserved | 12 Druid-Hive integration makes things easy
  • 14. © Hortonworks Inc. 2011- 2017. All rights reserved | 14 Agenda ● Druid overview ● Time series data ● Druid architecture ● Demo ● Questions
  • 15. © Hortonworks Inc. 2011- 2017. All rights reserved | 15 Druid Internals: Segments (indexes)
  • 16. Druid is designed for time-series use cases ⬢ Time-series data workload != normal database workload ⬢ Queries always contain date columns as group-by keys. ⬢ Queries filter by time ⬢ Queries touch few columns (< 10) ⬢ Queries have very selective filters (< thousands of rows out of billions) SELECT `user`, sum(`c_added`) AS s, EXTRACT(year FROM `__time`) FROM druid_table WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011 GROUP BY `user`, EXTRACT(year FROM `__time`) ORDER BY s DESC LIMIT 10;
  • 17. © Hortonworks Inc. 2011- 2017. All rights reserved | 17 Druid: Segment Data Structures ⬢ Within a Segment: – Timestamp Column Group. – Dimensions Column Group. – Metrics Column Group. – Indexes that facilitate fast lookup and aggregation.
  • 18. © Hortonworks Inc. 2011- 2017. All rights reserved | 18 Data is partitioned by time! timestamp publisher advertiser gender country ... click price 2011-01-01T00:01:35Z bieberfever.com google.com Male USA 0 0.65 2011-01-01T00:03:53Z bieberfever.com google.com Male USA 0 0.62 2011-01-01T00:04:51Z bieberfever.com google.com Male USA 1 0.45 2011-01-01T01:00:00Z ultratrimfast.com google.com Female UK 0 0.87 2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 0 0.99 2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 1 1.53 ... ⬢ Time-partitioned immutable partitions. ⬢ Typically 5 million rows per segment. ⬢ Druid segments are stored in a column orientation (only what is needed is actually loaded and scanned).
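The time partitioning on this slide can be sketched in a few lines; this is a toy illustration only, assuming HOUR segment granularity — real Druid uses configurable `segmentGranularity` and richer segment identifiers.

```python
from datetime import datetime

def segment_start(ts_iso):
    """Assign an event to its HOUR-granularity time chunk (segment interval)."""
    ts = datetime.fromisoformat(ts_iso.replace("Z", "+00:00"))
    return ts.replace(minute=0, second=0, microsecond=0)

# Timestamps from the slide's sample table
events = ["2011-01-01T00:01:35Z", "2011-01-01T00:04:51Z", "2011-01-01T01:00:00Z"]
chunks = {}
for e in events:
    chunks.setdefault(segment_start(e), []).append(e)

print(len(chunks))  # 2 time chunks: the 00:00 hour and the 01:00 hour
```

A query filtered to a time range only needs to open the chunks whose intervals overlap that range, which is where the immutable time partitioning pays off.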
  • 19. © Hortonworks Inc. 2011- 2017. All rights reserved | 19 COLUMN COMPRESSION - DICTIONARIES • Create ids • bieberfever.com -> 0, ultratrimfast.com -> 1 • Store • publisher -> [0, 0, 0, 1, 1, 1] • advertiser -> [0, 0, 0, 0, 0, 0] timestamp publisher advertiser gender country ... click price 2011-01-01T00:01:35Z bieberfever.com google.com Male USA 0 0.65 2011-01-01T00:03:53Z bieberfever.com google.com Male USA 0 0.62 2011-01-01T00:04:51Z bieberfever.com google.com Male USA 1 0.45 2011-01-01T01:00:00Z ultratrimfast.com google.com Female UK 0 0.87 2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 0 0.99 2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 1 1.53 ...
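The dictionary encoding on this slide is easy to sketch — a minimal version, assuming first-seen id assignment (Druid actually stores the dictionary sorted):

```python
def dictionary_encode(column):
    """Replace each string value with a small integer id, one dictionary per column."""
    ids, encoded = {}, []
    for value in column:
        if value not in ids:
            ids[value] = len(ids)  # first-seen order; real Druid sorts the dictionary
        encoded.append(ids[value])
    return ids, encoded

publisher = ["bieberfever.com"] * 3 + ["ultratrimfast.com"] * 3
ids, encoded = dictionary_encode(publisher)
print(ids)      # {'bieberfever.com': 0, 'ultratrimfast.com': 1}
print(encoded)  # [0, 0, 0, 1, 1, 1]
```

The encoded column stores repeated long strings as small integers, which compresses well and makes equality comparisons cheap.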
  • 20. © Hortonworks Inc. 2011- 2017. All rights reserved | 20 BITMAP INDEXES timestamp publisher advertiser gender country ... click price 2011-01-01T00:01:35Z bieberfever.com google.com Male USA 0 0.65 2011-01-01T00:03:53Z bieberfever.com google.com Male USA 0 0.62 2011-01-01T00:04:51Z bieberfever.com google.com Male USA 1 0.45 2011-01-01T01:00:00Z ultratrimfast.com google.com Female UK 0 0.87 2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 0 0.99 2011-01-01T02:00:00Z ultratrimfast.com google.com Female UK 1 1.53 ... • bieberfever.com -> [0, 1, 2] -> [111000] • ultratrimfast.com -> [3, 4, 5] -> [000111] • Compress using Concise or Roaring (take advantage of dimension sparsity)
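Building the per-value bitmaps from the dictionary-encoded column can be sketched with plain Python integers as bit sets (a toy stand-in for Concise/Roaring compressed bitmaps; here bit i is row i, so the printed order is reversed relative to the slide's left-to-right row notation):

```python
def build_bitmaps(encoded_column, cardinality):
    """One bitmap per dictionary id: bit i is set when row i holds that value."""
    bitmaps = [0] * cardinality
    for row, value_id in enumerate(encoded_column):
        bitmaps[value_id] |= 1 << row
    return bitmaps

encoded = [0, 0, 0, 1, 1, 1]      # the publisher column from the previous slide
bitmaps = build_bitmaps(encoded, 2)
print(format(bitmaps[0], "06b"))  # bieberfever.com   -> rows 0-2 set
print(format(bitmaps[1], "06b"))  # ultratrimfast.com -> rows 3-5 set
```

In a real segment these bitmaps are run-length/container compressed, which is where sparse dimensions shrink dramatically.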
  • 21. © Hortonworks Inc. 2011- 2017. All rights reserved | 21 FAST AND FLEXIBLE QUERIES JUSTIN BIEBER [1, 1, 0, 0] KE$HA [0, 0, 1, 1] JUSTIN BIEBER OR KE$HA [1, 1, 1, 1] Queries that solely aggregate metrics based on filters do not need to touch the list of dimension values!
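The OR-filter on this slide reduces to a bitwise OR over bitmaps followed by a popcount — a minimal sketch, again using plain integers as bit sets:

```python
def filter_count(bitmaps, value_ids):
    """Count rows matching ANY of the given values: OR the bitmaps, count set bits."""
    mask = 0
    for vid in value_ids:
        mask |= bitmaps[vid]
    return bin(mask).count("1")

# JUSTIN BIEBER -> id 0, rows 0-1; KE$HA -> id 1, rows 2-3 (as on the slide)
bitmaps = [0b0011, 0b1100]
print(filter_count(bitmaps, [0]))     # 2 matching rows
print(filter_count(bitmaps, [0, 1]))  # 4 matching rows
```

Note that the filter is evaluated entirely on the bitmaps; the dimension's string values are never read, which is exactly the slide's point.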
  • 22. © Hortonworks Inc. 2011- 2017. All rights reserved | 22 Druid Architecture
  • 23. © Hortonworks Inc. 2011- 2017. All rights reserved | 23 Broker Nodes ⬢ Keeps track of segment announcements in cluster – (This information is kept in Zookeeper, much like Storm or HBase do.) ⬢ Scatters query across historical and realtime nodes – (Clients issue queries to this node, but queries are processed elsewhere.) ⬢ Merges results from different query nodes ⬢ (Distributed) caching layer
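The broker's merge step can be sketched as combining partial aggregates from each queried node and re-ranking — a toy illustration of scatter/gather for a topN-style query, with made-up per-node numbers:

```python
def merge_partial_topn(partials, limit):
    """Broker-style merge: sum per-node partial aggregates by key, then re-rank."""
    totals = {}
    for partial in partials:
        for key, value in partial.items():
            totals[key] = totals.get(key, 0) + value
    return sorted(totals.items(), key=lambda kv: -kv[1])[:limit]

# Hypothetical partial sums returned by two historical nodes
node_a = {"bieberfever.com": 120, "ultratrimfast.com": 30}
node_b = {"bieberfever.com": 15, "ultratrimfast.com": 90}
print(merge_partial_topn([node_a, node_b], limit=1))  # [('bieberfever.com', 135)]
```

Because each node only sees its own segments, the final ordering is only known after the broker merges, which is why clients talk to the broker rather than to data nodes directly.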
  • 24. © Hortonworks Inc. 2011- 2017. All rights reserved | 24 Historical Nodes ⬢ Shared nothing architecture ⬢ Main workhorses of druid cluster ⬢ Load immutable read optimized segments ⬢ Respond to queries ⬢ Use memory mapped files to load segments
  • 25. © Hortonworks Inc. 2011- 2017. All rights reserved | 25 Early Druid Architecture Hadoop Historical Node Historical Node Historical Node Batch Data Broker Node Queries
  • 26. Druid: Batch Indexing
  ⬢ Indexing is performed by two related components:
    – Overlord (N = 1)
    – MiddleManagers (N >= 1)
  ⬢ Indexing is done via MapReduce jobs that build segment files
  ⬢ Batch indexing is done on data that already exists in deep storage (e.g. HDFS)
  ⬢ The index definition is specified in a JSON file and submitted to the Overlord
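As a rough illustration (not taken from the slides), a batch index spec is a JSON document POSTed to the Overlord's task endpoint. The spec below is heavily abbreviated, the host name is a placeholder, and the exact field layout varies by Druid version; see the Druid ingestion docs for the full schema.

```python
# Sketch of building and serializing a batch (Hadoop) ingestion spec.
import json

spec = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "dataSource": "wikiticker",
            "granularitySpec": {"segmentGranularity": "DAY"},
        },
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {"type": "static", "paths": "hdfs:///data/wikiticker"},
        },
    },
}

body = json.dumps(spec)
# To actually submit (requires the third-party `requests` package and a live Overlord):
# requests.post("http://overlord:8090/druid/indexer/v1/task",
#               data=body, headers={"Content-Type": "application/json"})
```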
  • 27. Current Druid Architecture
  [Diagram: as before, batch data flows through Hadoop to the Historical Nodes and queries arrive at the Broker Node; in addition, streaming data is pre-processed by ETL (Samza, Kafka, Storm, Spark, etc.) and ingested by Realtime Nodes, which hand segments off to the historicals]
  • 28. Druid: Realtime Indexing (Push Mode)
  [Diagram: stream processors (Spark, Flink, Storm, Python clients) push events through Tranquility into indexing tasks (Peons) run by MiddleManagers under the Overlord's indexing service; Kafka serves as an event source; finished segments are pushed to Deep Storage and loaded into the Historicals' segment caches; Coordinator, Broker, and ZooKeeper coordinate the cluster]
  • 29. Druid: Realtime Indexing (Pull Mode)
  [Diagram: same components as push mode, but without Tranquility; the indexing tasks (Peons) pull events directly from Kafka, and finished segments flow to Deep Storage and the Historicals' segment caches]
  • 30. Realtime Nodes
  ⬢ Ingest streams of data
  ⬢ Support both push- and pull-based ingestion
  ⬢ Store data in a write-optimized structure
  ⬢ Periodically convert the write-optimized structure into read-optimized segments
  ⬢ Events are queryable as soon as they are ingested
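The write-optimized to read-optimized conversion can be caricatured as follows; `RealtimeBuffer` and its tiny flush threshold are purely illustrative, not Druid's actual classes.

```python
# Toy realtime-node lifecycle: events land in a write-optimized in-memory
# row buffer (queryable immediately) and are periodically persisted as an
# immutable, column-oriented "segment".

class RealtimeBuffer:
    def __init__(self, max_rows=3):
        self.rows = []       # write-optimized: append-only row list
        self.segments = []   # read-optimized: immutable columnar dicts
        self.max_rows = max_rows

    def ingest(self, event):
        self.rows.append(event)  # queryable as soon as it is ingested
        if len(self.rows) >= self.max_rows:
            self._persist()

    def _persist(self):
        # Convert buffered rows to a columnar layout (stand-in for a segment).
        columns = {k: [r[k] for r in self.rows] for k in self.rows[0]}
        self.segments.append(columns)
        self.rows = []

node = RealtimeBuffer()
for i in range(4):
    node.ingest({"ts": i, "clicks": 1})
# 3 rows were flushed into one columnar segment; 1 row is still in-memory.
```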
  • 31. Coordinator Nodes
  ⬢ Assign segments to historical nodes
  ⬢ Use an interval-based cost function to distribute segments
  ⬢ Make sure query load is uniform across historical nodes
  ⬢ Handle replication of data
  ⬢ Apply configurable rules to load/drop data
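A much-simplified sketch of coordinator-style placement: each segment (with replicas) goes to the least-loaded historical that does not already hold a copy. Druid's real cost function also weighs interval proximity between segments, which is omitted here; all names are ours.

```python
# Greedy load-balanced segment assignment with replication.

def assign(segments, nodes, replicas=2):
    load = {n: 0 for n in nodes}      # segments currently held per node
    placement = {}                    # segment -> list of holder nodes
    for seg in segments:
        holders = []
        for _ in range(min(replicas, len(nodes))):
            # Candidates: nodes without a replica of this segment yet.
            candidates = [n for n in nodes if n not in holders]
            target = min(candidates, key=lambda n: load[n])
            load[target] += 1
            holders.append(target)
        placement[seg] = holders
    return placement, load

placement, load = assign(["seg-%d" % i for i in range(6)], ["h1", "h2", "h3"])
# 6 segments x 2 replicas = 12 assignments, balanced 4/4/4 across 3 nodes.
```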
  • 33. Registering and creating Druid data sources
  • 34. Druid data sources in Hive
  ⬢ The user needs to provide Druid data source information to Hive
  ⬢ Two different options depending on requirements
    – Register Druid data sources in Hive
      • Data is already stored in Druid
    – Create Druid data sources from Hive
      • Data is stored in Hive
      • The user may want to pre-process the data before storing it in Druid
  • 35. Registering Druid data sources
  ⬢ Simple CREATE EXTERNAL TABLE statement

  CREATE EXTERNAL TABLE druid_table_1                             -- Hive table name
  STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'    -- Hive storage handler class name
  TBLPROPERTIES ("druid.datasource" = "wikiticker");              -- Druid data source name

  ⇢ Broker node endpoint specified as a Hive configuration parameter
  ⇢ Automatic Druid data schema discovery: segment metadata query
  • 36. Creating Druid data sources
  ⬢ Use a Create Table As Select (CTAS) statement

  CREATE TABLE druid_table_1                                      -- Hive table name
  STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'    -- Hive storage handler class name
  TBLPROPERTIES (
    "druid.datasource" = "wikiticker",                            -- Druid data source name
    "druid.segment.granularity" = "DAY")                          -- Druid segment granularity
  AS
  SELECT __time, page, user, c_added, c_removed
  FROM src;
  • 37. Querying Druid data sources
  • 38. Querying Druid data sources
  ⬢ Automatic rewriting when a query is expressed over a Druid table
    – Powered by Apache Calcite
    – Main challenge: identify patterns in the logical plan corresponding to the different kinds of Druid queries (Timeseries, TopN, GroupBy, Select)
  ⬢ Translate the (sub)plan of operators into a valid Druid JSON query
    – The Druid query is encapsulated within a Hive TableScan operator
  ⬢ The Hive TableScan uses the Druid input format
    – Submits the query to Druid and generates records out of the query results
  ⬢ It might not be possible to push all computation to Druid
    – Our contract is that the query should always be executed
  • 39. Druid query recognition (powered by Apache Calcite)
  ⬢ Top 10 users that added the most characters from the beginning of 2010 until the end of 2011

  Apache Hive SQL query:
  SELECT `user`, sum(`c_added`) AS s
  FROM druid_table_1
  WHERE EXTRACT(year FROM `__time`) BETWEEN 2010 AND 2011
  GROUP BY `user`
  ORDER BY s DESC
  LIMIT 10;

  Query logical plan: Druid Scan → Filter → Project → Aggregate → Sort Limit → Sink
  • 40. Physical plan transformation
  Query logical plan: Druid Scan → Filter → Project → Aggregate → Sort Limit → Sink
  Query physical plan: Table Scan → Select → File Sink (the Table Scan uses the Druid input format and carries the Druid JSON query)

  Apache Hive Druid groupBy query:
  {
    "queryType": "groupBy",
    "dataSource": "users_index",
    "granularity": "all",
    "dimensions": ["user"],
    "aggregations": [
      { "type": "longSum", "name": "s", "fieldName": "c_added" }
    ],
    "limitSpec": {
      "type": "default",
      "limit": 10,
      "columns": [
        { "dimension": "s", "direction": "descending" }
      ]
    },
    "intervals": [ "2010-01-01T00:00:00.000/2012-01-01T00:00:00.000" ]
  }
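The same groupBy query can be built programmatically. This sketch only constructs and serializes the JSON; the commented-out POST to a broker assumes a live cluster and the third-party `requests` package, and the host name is a placeholder.

```python
# Build the Druid groupBy query shown on the slide as a Python dict.
import json

query = {
    "queryType": "groupBy",
    "dataSource": "users_index",
    "granularity": "all",
    "dimensions": ["user"],
    "aggregations": [
        {"type": "longSum", "name": "s", "fieldName": "c_added"}
    ],
    "limitSpec": {
        "type": "default",
        "limit": 10,
        "columns": [{"dimension": "s", "direction": "descending"}],
    },
    "intervals": ["2010-01-01T00:00:00.000/2012-01-01T00:00:00.000"],
}

body = json.dumps(query)
# requests.post("http://broker:8082/druid/v2", data=body,
#               headers={"Content-Type": "application/json"})
```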
  • 41. Road ahead
  ⬢ Tighten the integration between Druid and Apache Hive/Apache Calcite
    – Recognize more functions → push more computation to Druid
    – Support complex column types
    – Close the gap between the semantics of the different systems
      • Time zone handling, null values
  ⬢ Broader perspective
    – Materialized views support in Apache Hive
      • Data stored in Apache Hive
      • Create a materialized view in Druid
        – A denormalized star schema for a certain time period
      • Automatic input query rewriting over the materialized view (Apache Calcite)
  • 42. SSB Benchmarks
  • 43. Schema and Data Model: Base Tables
  Star schema benchmark
  ⬢ Lineorder: 6 billion rows
  ⬢ Customer: 30,000,000 rows
  ⬢ Supplier: 2,000,000 rows
  ⬢ Part: 1,400,000 rows
  Picture credit: http://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-create-test-data.html
  • 44. Benchmark Setup
  Platform:
  ⬢ 11 nodes, 2x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz with 16 CPU threads each
  ⬢ 256 GB of RAM per node
  ⬢ 6x WDC WD4000FYYZ-0 1K02 4TB SCSI disks per node
  ⬢ Not 100% dedicated (part of a YARN-managed cluster)
  Druid:
  ⬢ 10 / 5 / 3 historical nodes (workhorses)
    – -Xmx12g -Xms12g -XX:NewSize=6g -XX:MaxNewSize=6g -XX:MaxDirectMemorySize=128g
  ⬢ 1 broker node
    – -Xmx25g -Xms25g -XX:NewSize=6g -XX:MaxNewSize=6g -XX:MaxDirectMemorySize=64g
  ⬢ No segment replication
    – Query-to-server mapping is one to one
  ⬢ No caching
  ⬢ No tuning; out-of-the-box configuration
  • 45. Schema and Data Model: Denormalized Table in Druid
  ⬢ 6 billion rows (no rollups)
  ⬢ 174 GB over 7 years
  ⬢ Segmented by month
    – One partition holds ~1/9 month of data
    – ~9 million rows per partition
    – Segment size ~262 MB
    – 707 partitions in total
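A quick back-of-the-envelope check of the partition numbers above; the figures come from the slide, the arithmetic is ours.

```python
# Sanity-check rows and size per partition from the slide's totals.
total_rows = 6_000_000_000
partitions = 707

rows_per_partition = total_rows / partitions  # ~8.5 million, close to the quoted ~9 million
size_mb = 174 * 1024 / partitions             # ~252 MB, near the quoted ~262 MB
```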
  • 46. SSB Queries
  ⬢ Q1 (quick metrics): the amount of revenue increase that would have resulted from eliminating certain company-wide discounts
    – No GROUP BY
    – Years to weeks of data
  ⬢ Q2 (product sales insights): compares revenue for some product classes, for suppliers in a certain region, grouped by more restrictive product classes and all years of orders
    – A couple of GROUP BY keys
    – All data down to one month
    – Less selective to very selective filters
  ⬢ Q3 (customer insights): revenue volume by customer nation, supplier nation, and year within a given region, in a certain time period
    – More than 3 GROUP BY keys
    – Years of data
  ⬢ Q4: a "What-If" sequence of the OLAP type
    – Same shape as Q3
  • 47. Benchmark execution setup
  ⬢ Run using Apache JMeter
  ⬢ Every query is run 15 times; min/max/avg/std are captured
  ⬢ Query time includes JDBC overhead
  ⬢ A warmup query is launched first to initialize the metadata
  ⬢ Concurrency mode launches multiple threads via different JDBC connections
  ⬢ Druid 0.9.2 (HDP 2.6)
  ⬢ Hive 2.1 (HDP 2.6)
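The measurement loop described above can be sketched like this; the `run_query` stand-in merely sleeps, whereas the real harness issued queries over JDBC via JMeter.

```python
# Run a query repeatedly (after a warmup) and report min/max/avg/std in ms.
import statistics
import time

def benchmark(run_query, repetitions=15):
    run_query()  # warmup to initialize metadata / caches
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        timings.append((time.perf_counter() - start) * 1000.0)  # milliseconds
    return {
        "min": min(timings),
        "max": max(timings),
        "avg": statistics.mean(timings),
        "std": statistics.stdev(timings),
    }

stats = benchmark(lambda: time.sleep(0.001))
```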
  • 48. [Chart: Q1 query times in ms (average/min/max)]
  Q1.1 ({*}, {yearLevel = 1993, 1 < discountLevel < 3, quantityLevel < 25}, {sum_revenue})
  Q1.2 ({*}, {yearmonthnumLevel = 199401, 4 < discountLevel < 6, 26 < quantityLevel < 35}, {sum_revenue})
  Q1.3 ({weeknuminyearYearLevel}, {5 < discountLevel < 7, 26 < quantityLevel < 35}, {sum_revenue})
  • 49. [Chart: Q2 query times in ms (average/min/max)]
  Q2.1 (4)({yearLevel, brand1Level}, {categoryLevel = MFGR#12, s_regionLevel = AMERICA}, {lo_revenue})
  Q2.2 (5)({yearLevel, brand1Level, s_regionLevel}, {brand1Level = MFGR#2221 OR ... OR brand1Level = MFGR#2228, s_regionLevel = ASIA}, {lo_revenue})
  Q2.3 (6)({yearLevel, brand1Level, s_regionLevel}, {brand1Level = MFGR#2239, s_regionLevel = EUROPE}, {lo_revenue})
  • 50. 116.22 100 60 5.51 3.667 2.677 1 10 100 Q2.2 3 nodes Q2.2 5 nodes Q2.2 10 nodes QUERY2.2 RESPONSE TIME IN SECONDS LOG SCALE Concices Roaring Q2.2 (5)({yearLevel, brand1Level, s_regionLevel}, {brand1Level = MFGR#2221 OR ... OR brand1level = MFGR#2228, s_regionLevel = ASIA}, {lo_revenue}) Union of bitmaps
  • 51. [Chart: Q3 query times in ms (average/min/max)]
  Q3.1 (7)({yearLevel, c_nationLevel, s_nationLevel}, {1992 ≤ yearLevel ≤ 1997, c_nationLevel = ASIA, s_nationLevel = ASIA}, {lo_revenue})
  Q3.2 (8)({yearLevel, c_cityLevel, s_cityLevel}, {5Y, c_nationLevel = UNITED STATES, s_nationLevel = UNITED STATES}, {lo_revenue})
  Q3.3 (9)(same as 3.2, {5Y, c_cityLevel = UNITED KI1 OR c_cityLevel = UNITED KI5, s_cityLevel = UNITED KI1 OR s_cityLevel = UNITED KI5}, {lo_revenue})
  Q3.4 (10)({yearmonthLevel, c_cityLevel, s_cityLevel}, {yearmonthLevel = Dec1997, c_cityLevel = UNITED KI1 OR c_cityLevel = UNITED KI5, s_cityLevel = UNITED KI1 OR s_cityLevel = UNITED KI5}, {lo_revenue})
  • 52. [Chart: Q4 query times in ms (average/min/max)]
  Q4.1 (11)({yearLevel, c_nationLevel, mfgrLevel, s_regionLevel}, {c_regionLevel = AMERICA, mfgrLevel = MFGR#1 OR mfgrLevel = MFGR#2, s_regionLevel = AMERICA}, {sum_profit})
  Q4.2 (12)({yearLevel, c_regionLevel, categoryLevel, s_nationLevel}, {yearLevel = 1997 OR yearLevel = 1998, c_regionLevel = AMERICA, mfgrLevel = MFGR#1 OR mfgrLevel = MFGR#2, s_regionLevel = AMERICA}, {sum_profit})
  Q4.3 (13)({yearLevel, regionLevel, categoryLevel, cityLevel}, {yearLevel = 1997 OR yearLevel = 1998, c_regionLevel = AMERICA, categoryLevel = MFGR#14, s_regionLevel = UNITED STATES}, {sum_profit})
  • 53. [Chart: average query response time in ms on the 10-node cluster, Hive vs. Druid, split into compile time and execution time]
  • 54. [Chart: concurrency test: average execution time in ms on the 10-node cluster for count_star and Q1.1 to Q4.3, with 1 / 5 / 10 / 15 concurrent users]
  ⬢ Every user connects via a different JDBC connection.