SlideShare una empresa de Scribd logo
1 de 58
Descargar para leer sin conexión
1©	Cloudera,	Inc.	All	rights	reserved.
Introduction	to	Kudu:	
Hadoop	Storage	for	Fast	Analytics	
on	Fast	Data
Jianwei Li
Jarred@cloudera.com
2©	Cloudera,	Inc.	All	rights	reserved.
Agenda
• What	is	Kudu?	(Motivations	&	Design	Goals)
• Use	Cases
• Overview	of	Design	&	Internals	
• Simple	Benchmark
• Getting	Started
3©	Cloudera,	Inc.	All	rights	reserved.
Connected	Cars
• Analytical
• R&D	wants	to	know	optimal	
charge	performance.
• How	has	software	update	
affected	vehicle	performance	
over	time?
• Real-time
• Consumer	wants	to	know	if	teen	
is	driving	car	right	now.	How	fast	
are	they	accelerating?	What	is	
their	speed?	Where	are	they?
• Vehicle	Service	– e.g.	grab	an	up-
to-date	“diagnosis	bundle”	
before	or	during	service.
4©	Cloudera,	Inc.	All	rights	reserved.
Connected	Cars
• Analytical
• R&D	wants	to	know	optimal	
charge	performance.
• How	has	software	update	
affected	vehicle	performance	
over	time?
• Real-time
• Consumer	wants	to	know	if	teen	
is	driving	car	right	now.	How	fast	
are	they	accelerating?	What	is	
their	speed?	Where	are	they?
• Vehicle	Service	– e.g.	grab	an	up-
to-date	“diagnosis	bundle”	
before	or	during	service.	
fast,	efficient	scans
=	HDFS
fast	inserts/lookups
=	HBase
5©	Cloudera,	Inc.	All	rights	reserved.
How	would	we	build	the	Real-time	System	Today?
Hadoop
12:01:05	AM:	 Where	
is	my	car?	How	fast?
12:01:05	AM:	
Here	I	am!
12:01:05	AM:	Your	car	
is	located	at….
• Cars	are	constantly	producing	
data,	being	ingested	in	real-time	
or	micro-batches
• A	consumer	wants	to	know	what	
is	the	location	and	speed	of	their	
car	in	real-time.	
• Requires:
– potentially	millions	of	low	latency	
writes	/	second
– low	latency,	random	access
Consumer
6©	Cloudera,	Inc.	All	rights	reserved.
How	would	we	build	the	Real-time	System	Today?
• Hbase	Provides:	
• Fast,	Random	Read	&	Write	Access
• “Mini-scans”
• Scale-out	architecture	capable	of	serving	Big	Data	to	many	users
Connected	
Cars
Kafka	/	
Pub-sub
Events
HBase
Consumer
7©	Cloudera,	Inc.	All	rights	reserved.
How	would	we	build	the	Analytics	System	Today?
Hadoop
For	the	last	3	
days,	give	me		
engine	temp	vs.	
radiator	data.	
12:01:05	PM:	
Here	I	am!
Here	you	go….
• Cars	are	constantly	producing	data,	
being	ingested	in	real-time	or	
micro-batches
• GM	just	made	a	software	update	
to	temp	control	module.	How’s	it	
working?	
• Requires	fast,	efficient	scan	
performance
Analyst
8©	Cloudera,	Inc.	All	rights	reserved.
How	would	we	build	the	Analytics	System	Today?
• HDFS	Excels	at:	
• Full	table	scans
• Ad-hoc	analytics
Connected	
Cars
Kafka	/	
Pub-sub
Events
Today’s	Partition
Yesterday’s	Partition
Historic	Data
Analyst
Hbase/Flu
me
1.	Have	we	
accumulated	
enough	data?
2.	Flush	into	
HDFS
9©	Cloudera,	Inc.	All	rights	reserved.
Hybrid	big	data	analytics	pipeline
Before	Kudu
Connected	
Cars
Kafka	/	
Pub-sub
Events
HBase
Consumer
HDFS	(Storage)
Random	Reads
Analyst
Analytics
Snapshot
&	Convert	to	
Parquet
Compact	late	
arriving	data
10©	Cloudera,	Inc.	All	rights	reserved.
Hybrid	big	data	analytics	pipeline
After	Kudu
Connected	
Cars
Kafka	/	
Pub-sub
Events
Kudu
Consume
r
Random	Reads
Analyst
Analytics
Kudu	supports	simultaneous	combination	of
sequential	and	random	reads	and	writes
11©	Cloudera,	Inc.	All	rights	reserved.
What	is	Kudu?
12©	Cloudera,	Inc.	All	rights	reserved.
Current	Storage	Landscape	in	Hadoop
HDFS	excels	at:
• Efficiently	scanning	large	amounts	
of	data
• Accumulating	data	with	high	
throughput
HBase	excels	at:
• Efficiently	finding	and	writing	
individual	rows
• Making	data	mutable
Gaps	exist	when	these	properties	
are	needed	simultaneously
The	Hadoop	
Storage	“Gap”
13©	Cloudera,	Inc.	All	rights	reserved.
• High	throughput	for	big	scans
• Low-latency	for	random	accesses
• High	CPU	performance	to	better	take	advantage	
of	RAM	and	Flash
• Single-column	scan	rate	10-100x	faster	than	HBase
• High	IO	efficiency
• True	column	store	with	type-specific	encodings
• Efficient	analytics	when	only	certain	columns	are	
accessed
• Expressive and	evolvable data	model
• Architecture	that	supports	multi-data	center	
operation
Kudu	Design	Goals
14©	Cloudera,	Inc.	All	rights	reserved.
Changing	Hardware	Landscape
• Spinning	disk	->	solid	state	storage
• NAND	flash:	Up	to	450k	read	250k	write	iops,	about	2GB/sec	read	and	1.5GB/sec	
write	throughput, at	a	price	of	less	than	$3/GB	and	dropping
• 3D	XPoint memory (1000x	faster	than	NAND,	cheaper	than	RAM)
• RAM	is	cheaper	and	more	abundant
• 64->128->256GB	over	last	few	years
Takeaway	1:	
The next	bottleneck	is	CPU,	and	current	storage	systems	weren’t	designed	with	CPU	
efficiency	in	mind.
.
15©	Cloudera,	Inc.	All	rights	reserved.
The	Kudu	Elevator	Pitch
Storage	for	Fast	Analytics	on	Fast	Data
• New	updating	column	store	for	
Hadoop
• Simplifies	the	architecture	for	building	
analytic	applications	on	changing	data
• Designed	for	fast	analytic	performance
• Natively	integrated	with	Hadoop
• Apache-licensed	open	source	(with	
pending	ASF	Incubator	proposal)
• Beta	now	available
FILESYSTEM
HDFS
NoSQL
HBASE
INGEST	– SQOOP,	FLUME,	KAFKA
DATA	INTEGRATION	&	STORAGE
SECURITY	– SENTRY
RESOURCE	MANAGEMENT	– YARN
UNIFIED	DATA	SERVICES
BATCH STREAM SQL SEARCH MODEL ONLINE
DATA	ENGINEERING DATA	DISCOVERY	&	ANALYTICS DATA	APPS
SPARK,	
HIVE,	PIG
SPARK IMPALA SOLR SPARK HBASE
RELATIONAL
KUDU
16©	Cloudera,	Inc.	All	rights	reserved.
Using	Kudu
• Table	has	a	SQL-like	schema
• Finite	number	of	columns	(unlike	HBase/Cassandra)
• Types:	BOOL,	INT8,	INT16,	INT32,	INT64,	FLOAT,	DOUBLE,	STRING,	BINARY,	
TIMESTAMP
• Some	subset	of	columns	makes	up	a	possibly-composite	primary	key
• Fast	ALTER	TABLE
• Java	and	C++	“NoSQL”	style	APIs
• Insert(),	Update(),	Delete(),	Scan()
• Integrations	with	MapReduce,	Spark,	and	Impala
• more	to	come!
16
17©	Cloudera,	Inc.	All	rights	reserved.
What	Kudu	is	*NOT*
• Not a	SQL	interface	itself	
• It’s	just	the	storage	layer	– “Bring	Your	Own	SQL”	(eg Impala	or	Spark)
• Not an	application	that	runs	on	HDFS
• It’s	an	alternative,	native	Hadoop	storage	engine
• Colocation	with	HDFS	is	expected
• Not a	replacement	for	HDFS	or	HBase
• Select	the	right	storage	for	the	right	use	case
• Cloudera	will	support	and	invest	in	all	three
18©	Cloudera,	Inc.	All	rights	reserved.
Use	Cases	for	Kudu
19©	Cloudera,	Inc.	All	rights	reserved.
Kudu	Use	Cases
Kudu	is	best	for	use	cases	requiring	a	simultaneous	combination	of
sequential	and	random	reads	and	writes,	e.g.:
● Time	Series
○ Examples:	Stream	market	data;	fraud	detection	&	prevention;	risk	monitoring
○ Workload:	Insert,	updates,	scans,	lookups
● Machine	Data	Analytics
○ Examples:	Network	threat	detection
○ Workload:	Inserts,	scans,	lookups
● Online	Reporting
○ Examples:	ODS
○ Workload:	Inserts,	updates,	scans,	lookups
20©	Cloudera,	Inc.	All	rights	reserved.
Real-Time	Analytics	in	Hadoop	Today
Fraud	Detection	in	the	Real	World	=	Storage	Complexity
Considerations:
● How	do	I	handle	failure	
during	this	process?
● How	often	do	I	reorganize	
data	streaming	in	into	a	
format	appropriate	for	
reporting?
● When	reporting,	how	do	I	see	
data	that	has	not	yet	been	
reorganized?
● How	do	I	ensure	that	
important	jobs	aren’t	
interrupted	by	maintenance?
New	Partition
Most	Recent	Partition
Historic	Data
HBase
Parquet	
File
Have	we	
accumulated	
enough	data?
Reorganize	
HBase	file	
into	Parquet
• Wait	for	running	operations	to	complete	
• Define	new	Impala	partition	referencing	
the	newly	written	Parquet	file
Incoming	Data	
(Messaging	
System)
Reporting	
Request
Impala	on	HDFS
21©	Cloudera,	Inc.	All	rights	reserved.
Real-Time	Analytics	in	Hadoop	with	Kudu
Simpler	Architecture,	Superior	Performance	over	Hybrid	Approaches
Impala on	Kudu
Incoming	Data	
(Messaging	
System)
Reporting	
Request
22©	Cloudera,	Inc.	All	rights	reserved.
Design	&	Internals
23©	Cloudera,	Inc.	All	rights	reserved.
Kudu	Basic	Design
• Typed	storage
• Basic	Construct:	Tables	
• Tables	broken	down	into	Tablets (roughly	equivalent	to	partitions)
• Maintains	consistency	through	a	Paxos-like	quorum	model	(Raft)
• Architecture	supports	geographically	disparate,	active/active	
systems
24©	Cloudera,	Inc.	All	rights	reserved.
Tables	and	Tablets
• Table	is	horizontally	partitioned	into	tablets
• Range or	hash partitioning
• PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY
HASH(timestamp) INTO 100 BUCKETS
• Each	tablet	has	N	replicas	(3	or	5),	with	Raft consensus
• Allow	read	from	any	replica,	plus	leader-driven	writes	with	low	MTTR
• Tablet	servers	host	tablets
• Store	data	on	local	disks	(no	HDFS)
24
25©	Cloudera,	Inc.	All	rights	reserved.
Client
Meta	Cache
26©	Cloudera,	Inc.	All	rights	reserved.
Client
Hey	Master!	Where	is	the	row	for	
‘todd@cloudera.com’	in	table	“T”?Meta	Cache
27©	Cloudera,	Inc.	All	rights	reserved.
Client
Hey	Master!	Where	is	the	row	for	
‘todd@cloudera.com’	in	table	“T”?
It’s	part	of	tablet	2,	which	is	on	servers	{Z,Y,X}.	
BTW,	here’s	info	on	other	tablets	you	might	
care	about:	T1,	T2,	T3,	…
Meta	Cache
28©	Cloudera,	Inc.	All	rights	reserved.
Client
Hey	Master!	Where	is	the	row	for	
‘todd@cloudera.com’	in	table	“T”?
It’s	part	of	tablet	2,	which	is	on	servers	{Z,Y,X}.	
BTW,	here’s	info	on	other	tablets	you	might	
care	about:	T1,	T2,	T3,	…
Meta	Cache
T1:	…
T2:	…
T3:	…
29©	Cloudera,	Inc.	All	rights	reserved.
Client
Hey	Master!	Where	is	the	row	for	
‘todd@cloudera.com’	in	table	“T”?
It’s	part	of	tablet	2,	which	is	on	servers	{Z,Y,X}.	
BTW,	here’s	info	on	other	tablets	you	might	
care	about:	T1,	T2,	T3,	…
UPDATE	
todd@cloudera.com	
SET	…
Meta	Cache
T1:	…
T2:	…
T3:	…
30©	Cloudera,	Inc.	All	rights	reserved.
LSM	vs Kudu
• LSM	– Log	Structured	Merge	(Cassandra,	HBase,	etc)
• Inserts	and	updates	all	go	to	an	in-memory	map	(MemStore)	and	later	flush	to	
on-disk	files	(HFile/SSTable)
• Reads	perform	an	on-the-fly	merge	of	all	on-disk	HFiles
• Kudu
• Shares	some	traits	(memstores,	compactions)
• More	complex.
• Slower	writes in	exchange	for	faster	reads	(especially	scans)
30
31©	Cloudera,	Inc.	All	rights	reserved.
Kudu	storage	– Inserts	and	Flushes
MemRowSet
INSERT(“todd”,	“$1000”,”engineer”)
name pay role
DiskRowSet 1
flush
32©	Cloudera,	Inc.	All	rights	reserved.
Kudu	storage	– Inserts	and	Flushes
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
INSERT(“doug”,	“$1B”,	“Hadoop	man”)
flush
33©	Cloudera,	Inc.	All	rights	reserved.
Kudu	storage	- Updates
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
Delta	MS
Delta	MS
Each	DiskRowSet has	its	own	
DeltaMemStore to	
accumulate	updates
base	data
base	data
34©	Cloudera,	Inc.	All	rights	reserved.
Kudu	storage	- Updates
MemRowSet
name pay role
DiskRowSet 1
name pay role
DiskRowSet 2
Delta	MS
Delta	MS
UPDATE	set	pay=“$1M”	
WHERE	name=“todd”
Is	the	row	in	DiskRowSet 2?
(check	bloom	filters)
Is	the	row	in	DiskRowSet 1?
(check	bloom	filters)
Bloom	says:	no!
Bloom	says:	maybe!
Search	key	column	to	find	
offset:	rowid =	150
150:	pay=$1M
@	time	T
base	data
35©	Cloudera,	Inc.	All	rights	reserved.
Metadata
• Replicated	master*
• Acts	as	a	tablet	directory	(“META”	table)
• Acts	as	a	catalog	(table	schemas,	etc)
• Acts	as	a	load	balancer	(tracks	TS	liveness,	re-replicates	under-replicated	
tablets)
• Caches	all	metadata	in	RAM	for	high	performance
• 80-node	load	test,	GetTableLocations RPC	perf:
• 99th percentile:	68us,		99.99th percentile:	657us	
• <2%	peak	CPU	usage
• Client	configured	with	master	addresses
• Asks	master	for	tablet	locations	as	needed	and	caches	them
35
36©	Cloudera,	Inc.	All	rights	reserved.
Performance	characteristics
Very	CPU	efficient
• Written	in	modern	C++,	uses	specialized	CPU	instructions,	JIT	compilation	with	
LLVM
Latency	dependent	on	storage	hardware	capabilities
• Expect	sub-millisecond	response	on	SSDs	and	upcoming	technologies
No	garbage	collection	allows	very	large	memory	footprint	with	no	pauses
Bloom	filters	reduce	the	need	for	many	disk	accesses
37©	Cloudera,	Inc.	All	rights	reserved.
Kudu	Trade-Offs
• Random	updates	will	be	slower
• HBase	model	allows	random	updates	without	incurring	a	disk	seek
• Kudu	requires	a	key	lookup	before	update,	Bloom	lookup	before	insert
• Single-row	reads	may	be	slower
• Columnar	design	is	optimized	for	scans
• Future:	may	introduce	“column	groups”	for	applications	where	single-row	
access	is	more	important
38©	Cloudera,	Inc.	All	rights	reserved.
APIs
39©	Cloudera,	Inc.	All	rights	reserved.
Native	clients
• First	class	clients	implemented	in	Java	and	C++
• Experimental	Cython	client	that	wraps	the	C++
• Supported	operations
• DDL	(create,	alter,	delete	tables)
• DML	(insert,	update,	delete)
• Scan	(with	predicate	pushdown,	column	projection)
40©	Cloudera,	Inc.	All	rights	reserved.
Java/C++	native	clients	- Create	table
KuduClient client = new KuduClient.KuduClientBuilder("master1,master2,master3").build();
ColumnSchema idColumn = new ColumnSchema.ColumnSchemaBuilder("col_id", Type.INT64)
.key(true)
.build();
ColumnSchema nameColumn = new ColumnSchema.ColumnSchemaBuilder("col_name", Type.STRING)
.build();
ColumnSchema emailColumn = new ColumnSchema.ColumnSchemaBuilder("col_email", Type.STRING)
.nullable(true)
.build();
Schema schema = new Schema(Lists.newArrayList(idColumn, nameColumn, emailColumn));
client.createTable("clouderans", schema);
41©	Cloudera,	Inc.	All	rights	reserved.
Java/C++	native	clients	- Insert
KuduTable table = client.openTable("clouderans");
KuduSession session = client.newSession();
Insert newRow = table.newInsert();
newRow.addLong(idColumn.getName(), 1435425662);
newRow.addString(nameColumn.getName(), "JD");
newRow.addString(emailColumn.getName(), “jdcryans@cloudera.com");
session.apply(newRow);
session.flush();
42©	Cloudera,	Inc.	All	rights	reserved.
Java/C++	native	clients	- Update
Update newUpdate = table.newUpdate();
newUpdate.addLong(idColumn.getName(), 1435425662);
newUpdate.addString(nameColumn.getName(), “Jean-Daniel");
session.apply(newUpdate);
session.flush();
43©	Cloudera,	Inc.	All	rights	reserved.
Java/C++	native	clients	- Scan
ColumnRangePredicate predicate = new ColumnRangePredicate(idColumn);
predicate.setLowerBound(1435425662);
KuduScanner scanner = client.newScannerBuilder(table)
.setProjectedColumnNames(Lists.newArrayList("col_name", "col_email"))
.addColumnRangePredicate(predicate)
.build();
while (scanner.hasMoreRows()) {
RowResultIterator rowResultIterator = scanner.nextRows();
while (rowResultIterator.hasNext()) {
RowResult rowResult = rowResultIterator.next();
System.out.println("Found a clouderan " + rowResult.getString(0));
}
}
44©	Cloudera,	Inc.	All	rights	reserved.
MapReduce
• Very	similar	to	HBase’s
• Input	format
• Generates	one	Map	per	Tablet
• Configurable	like	a	KuduScanner
• A	RowResult	is	given	per	map()
• Output	format
• Configured	for	high	throughput
• Ingests	Insert,	Update,	and	Delete
45©	Cloudera,	Inc.	All	rights	reserved.
Impala + Kudu
CREATE	TABLE	my_first_table (
id	BIGINT,
name	STRING
)
DISTRIBUTE	BY	HASH	(id)	INTO	16	BUCKETS
TBLPROPERTIES(
'storage_handler'	=	'com.cloudera.kudu.hive.KuduStorageHandler',
'kudu.table_name'	=	'my_first_table',
'kudu.master_addresses'	=	'kudu-master.example.com:7051',
'kudu.key_columns'	=	'id'
);
46©	Cloudera,	Inc.	All	rights	reserved.
Impala	+	Kudu	– Pre Splitting
CREATE	TABLE	cust_behavior (
_id	BIGINT,
salary	STRING,
edu_level INT,
usergender STRING,
rating	INT)
DISTRIBUTE	BY	RANGE(_id)
SPLIT	ROWS((1439560049342),	(1439566253755),	(1439572458168),	(1439578662581),	(1439584866994),	
(1439591071407))
TBLPROPERTIES(
'storage_handler'	=	'com.cloudera.kudu.hive.KuduStorageHandler',
'kudu.table_name'	=	'my_first_table',
'kudu.master_addresses'	=	'kudu-master.example.com:7051',
'kudu.key_columns'	=	'id'
);
47©	Cloudera,	Inc.	All	rights	reserved.
Impala	+	Kudu	– Update	&	Delete
UPDATE	my_first_table SET	name="bob"	where	id	=	3;
DELETE	FROM	my_first_table WHERE	id	<	3;
48©	Cloudera,	Inc.	All	rights	reserved.
Benchmarks
49©	Cloudera,	Inc.	All	rights	reserved.
TPC-H	(Analytics	benchmark)
• 75TS	+	1	master	cluster
• 12	(spinning)	disk	each,	enough	RAM	to	fit	dataset
• Using	Kudu	0.5.0,	Impala	2.2	with	Kudu	support,	CDH	5.4
• TPC-H	Scale	Factor	100	(100GB)
• Example	query:
• SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue FROM customer,
orders, lineitem, supplier, nation, region WHERE c_custkey = o_custkey AND
l_orderkey = o_orderkey AND l_suppkey = s_suppkey AND c_nationkey = s_nationkey
AND s_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA'
AND o_orderdate >= date '1994-01-01' AND o_orderdate < '1995-01-01’ GROUP BY
n_name ORDER BY revenue desc;
49
50©	Cloudera,	Inc.	All	rights	reserved.
- Kudu	outperforms	Parquet	by	31%	(geometric	mean)	for	RAM-resident	data
- Parquet	likely	to	outperform	Kudu	for	HDD-resident	(larger	IO	requests)
51©	Cloudera,	Inc.	All	rights	reserved.
What	about	Apache	Phoenix?
• 10	node	cluster	(9	worker,	1	master)
• HBase	1.0,	Phoenix	4.3
• TPC-H	LINEITEM	table	only	(6B	rows)
51
2152
219
76
131
0.04
1918
13.2
1.7
0.7
0.15
155
9.3
1.4 1.5 1.37
0.01
0.1
1
10
100
1000
10000
Load TPCH	Q1 COUNT(*)
COUNT(*)
WHERE…
single-row
lookup
Time	(sec)
Phoenix
Kudu
Parquet
52©	Cloudera,	Inc.	All	rights	reserved.
Xiaomi Use	Case
• Gather	important	RPC	tracing	events	from	mobile	app	and	backend	service
• Service	monitoring	&	troubleshooting	tool
• High	write	throughput
• >5	Billion	records/day	and	growing
• Query	latest	data	and	quick	response
• Identify	and	resolve	issues	quickly
• Can	search	for	individual	records
• Easy	for	troubleshooting
53©	Cloudera,	Inc.	All	rights	reserved.
Big	Data	Analytics	Pipeline
Before	Kudu
• Long	pipeline
high	latency(1	hour	~	1	day),	data	conversion	pains
• No	ordering
Log	arrival(storage)	order	not	exactly	logical	order
e.g.	read	2-3	days	of	log	for	data	in	1	day
54©	Cloudera,	Inc.	All	rights	reserved.
Big	Data	Analysis	Pipeline
Simplified	With	Kudu
• ETL	Pipeline(0~10s	latency)
Apps	that	need	to	prevent	backpressure	or	require	ETL	
• Direct	Pipeline(no	latency)
Apps	that	don’t	require	ETL	and	no	backpressure	issues
OLAP	scan
Side	table	lookup
Result	store
55©	Cloudera,	Inc.	All	rights	reserved.
Use	Case	1:	Benchmark
Environment
• 71	Node	cluster
• Hardware
• CPU:	E5-2620	2.1GHz	*	24	core		Memory:	64GB	
• Network:	1Gb		Disk:	12	HDD
• Software
• Hadoop2.6/Impala	2.1/Kudu
Data
• 1	day	of	server	side	tracing	data
• ~2.6	Billion	rows
• ~270	bytes/row
• 17	columns,	5	key	columns
56©	Cloudera,	Inc.	All	rights	reserved.
Use	Case	1:	Benchmark	Results
1.4	 2.0	 2.3	
3.1	
1.3	 0.9	1.3	
2.8	
4.0	
5.7	
7.5	
16.7	
Q1 Q2 Q3 Q4 Q5 Q6
kudu
parquet
Total	Time(s) Throughput(Total) Throughput(per node)
Kudu 961.1 2.8M	record/s 39.5k	record/s
Parquet 114.6 23.5M	record/s 331k records/s
Bulk	load	using	impala	(INSERT	INTO):	
Query	latency:
*	HDFS	parquet	file	replication	=	3	,	kudu	table	replication	=	3
*	Each	query	run	5	times	then	take	average
57©	Cloudera,	Inc.	All	rights	reserved.
Getting	Started
Users:
Install	the	Beta	or	try	a	VM:
getkudu.io
Get	help:
kudu-user@googlegroups.com
Read	the	white	paper:
getkudu.io/kudu.pdf
Developers:
Contribute:
github.com/cloudera/kudu	(commits)
gerrit.cloudera.org (reviews)
issues.cloudera.org (JIRAs	going	back	to	2013)
Join	the	Dev list:
kudu-dev@googlegroups.com
Contributions/participation	are	welcome	and	
encouraged!
58©	Cloudera,	Inc.	All	rights	reserved.
Questions?

Más contenido relacionado

La actualidad más candente

Making Self-Service BI a Reality in the Enterprise
Making Self-Service BI a Reality in the EnterpriseMaking Self-Service BI a Reality in the Enterprise
Making Self-Service BI a Reality in the Enterprise
Cloudera, Inc.
 

La actualidad más candente (20)

Big Data Fundamentals
Big Data FundamentalsBig Data Fundamentals
Big Data Fundamentals
 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduPart 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache Kudu
 
Making Self-Service BI a Reality in the Enterprise
Making Self-Service BI a Reality in the EnterpriseMaking Self-Service BI a Reality in the Enterprise
Making Self-Service BI a Reality in the Enterprise
 
Extreme Sports & Beyond: Exploring a new frontier in data with GoPro
Extreme Sports & Beyond: Exploring a new frontier in data with GoProExtreme Sports & Beyond: Exploring a new frontier in data with GoPro
Extreme Sports & Beyond: Exploring a new frontier in data with GoPro
 
Cloudera Altus: Big Data in the Cloud Made Easy
Cloudera Altus: Big Data in the Cloud Made EasyCloudera Altus: Big Data in the Cloud Made Easy
Cloudera Altus: Big Data in the Cloud Made Easy
 
Kudu Forrester Webinar
Kudu Forrester WebinarKudu Forrester Webinar
Kudu Forrester Webinar
 
How Data Drives Business at Choice Hotels
How Data Drives Business at Choice HotelsHow Data Drives Business at Choice Hotels
How Data Drives Business at Choice Hotels
 
Multi-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BTMulti-Tenant Operations with Cloudera 5.7 & BT
Multi-Tenant Operations with Cloudera 5.7 & BT
 
Data Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseData Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the Enterprise
 
Self-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft AzureSelf-service Big Data Analytics on Microsoft Azure
Self-service Big Data Analytics on Microsoft Azure
 
Apache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance UpdateApache Impala (incubating) 2.5 Performance Update
Apache Impala (incubating) 2.5 Performance Update
 
End to End Streaming Architectures
End to End Streaming ArchitecturesEnd to End Streaming Architectures
End to End Streaming Architectures
 
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud WorldPart 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
 
The Big Picture: Learned Behaviors in Churn
The Big Picture: Learned Behaviors in ChurnThe Big Picture: Learned Behaviors in Churn
The Big Picture: Learned Behaviors in Churn
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data Success
 
Cloudera training: secure your Cloudera cluster
Cloudera training: secure your Cloudera clusterCloudera training: secure your Cloudera cluster
Cloudera training: secure your Cloudera cluster
 
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ... Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 
Supercharge Splunk with Cloudera

Supercharge Splunk with Cloudera
Supercharge Splunk with Cloudera

Supercharge Splunk with Cloudera

 
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
New Performance Benchmarks: Apache Impala (incubating) Leads Traditional Anal...
 
Customer Best Practices: Optimizing Cloudera on AWS
Customer Best Practices: Optimizing Cloudera on AWSCustomer Best Practices: Optimizing Cloudera on AWS
Customer Best Practices: Optimizing Cloudera on AWS
 

Similar a 快速数据快速分析引擎-Kudu

School of Computer & Information SciencesITS-532 Cloud Com
School of Computer & Information SciencesITS-532 Cloud ComSchool of Computer & Information SciencesITS-532 Cloud Com
School of Computer & Information SciencesITS-532 Cloud Com
TaunyaCoffman887
 
Webinar effective mobile performance testing using real devices
Webinar effective mobile performance testing using real devicesWebinar effective mobile performance testing using real devices
Webinar effective mobile performance testing using real devices
Perfecto Mobile
 
SoftWatch Overview_short (1)
SoftWatch Overview_short (1)SoftWatch Overview_short (1)
SoftWatch Overview_short (1)
Dror Leshem
 
SoftWatch Overview_short (1)
SoftWatch Overview_short (1)SoftWatch Overview_short (1)
SoftWatch Overview_short (1)
Moshe Kozlovski
 

Similar a 快速数据快速分析引擎-Kudu (20)

Enabling the Connected Car Revolution

Enabling the Connected Car Revolution
Enabling the Connected Car Revolution

Enabling the Connected Car Revolution

 
The Automotive Journey Into the Cloud
The Automotive Journey Into the CloudThe Automotive Journey Into the Cloud
The Automotive Journey Into the Cloud
 
The Automotive Journey Into the Cloud
The Automotive Journey Into the CloudThe Automotive Journey Into the Cloud
The Automotive Journey Into the Cloud
 
Google App engine
Google App engineGoogle App engine
Google App engine
 
School of Computer & Information SciencesITS-532 Cloud Com
School of Computer & Information SciencesITS-532 Cloud ComSchool of Computer & Information SciencesITS-532 Cloud Com
School of Computer & Information SciencesITS-532 Cloud Com
 
How to Solve the Data Challenge in Connected Transport
How to Solve the Data Challenge in Connected TransportHow to Solve the Data Challenge in Connected Transport
How to Solve the Data Challenge in Connected Transport
 
Best Practices for Managing IaaS, PaaS, and Container-Based Deployments - App...
Best Practices for Managing IaaS, PaaS, and Container-Based Deployments - App...Best Practices for Managing IaaS, PaaS, and Container-Based Deployments - App...
Best Practices for Managing IaaS, PaaS, and Container-Based Deployments - App...
 
Introduction to Google App Engine
Introduction to Google App EngineIntroduction to Google App Engine
Introduction to Google App Engine
 
Breaking Up the Monolith While Migrating to AWS (GPSTEC320) - AWS re:Invent 2018
Breaking Up the Monolith While Migrating to AWS (GPSTEC320) - AWS re:Invent 2018Breaking Up the Monolith While Migrating to AWS (GPSTEC320) - AWS re:Invent 2018
Breaking Up the Monolith While Migrating to AWS (GPSTEC320) - AWS re:Invent 2018
 
Get ready for_an_autonomous_data_driven_future_ext
Get ready for_an_autonomous_data_driven_future_extGet ready for_an_autonomous_data_driven_future_ext
Get ready for_an_autonomous_data_driven_future_ext
 
Fast Data Overview
Fast Data OverviewFast Data Overview
Fast Data Overview
 
SOUG Day - autonomous what is next
SOUG Day - autonomous what is nextSOUG Day - autonomous what is next
SOUG Day - autonomous what is next
 
What is cloud computing
What is cloud computingWhat is cloud computing
What is cloud computing
 
Webinar effective mobile performance testing using real devices
Webinar effective mobile performance testing using real devicesWebinar effective mobile performance testing using real devices
Webinar effective mobile performance testing using real devices
 
Australian Cloud and Data Centre Strategy Summit 2016 gus sabatino
Australian Cloud and Data Centre Strategy Summit 2016   gus sabatinoAustralian Cloud and Data Centre Strategy Summit 2016   gus sabatino
Australian Cloud and Data Centre Strategy Summit 2016 gus sabatino
 
Fast Data Overview for Data Science Maryland Meetup
Fast Data Overview for Data Science Maryland MeetupFast Data Overview for Data Science Maryland Meetup
Fast Data Overview for Data Science Maryland Meetup
 
SoftWatch Overview_short (1)
SoftWatch Overview_short (1)SoftWatch Overview_short (1)
SoftWatch Overview_short (1)
 
SoftWatch Overview_short (1)
SoftWatch Overview_short (1)SoftWatch Overview_short (1)
SoftWatch Overview_short (1)
 
The 3 Pillars of Remote Application Development
The 3 Pillars of Remote Application DevelopmentThe 3 Pillars of Remote Application Development
The 3 Pillars of Remote Application Development
 
Applying systems thinking to AWS enterprise application migration
Applying systems thinking to AWS enterprise application migrationApplying systems thinking to AWS enterprise application migration
Applying systems thinking to AWS enterprise application migration
 

Último

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Último (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

快速数据快速分析引擎-Kudu