Kudu: A Fast Analytics Engine for Fast Data
- 16. Using Kudu
• Table has a SQL-like schema
• Finite number of columns (unlike HBase/Cassandra)
• Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
• Some subset of columns makes up a possibly-composite primary key
• Fast ALTER TABLE
• Java and C++ “NoSQL” style APIs
• Insert(), Update(), Delete(), Scan() (a short write-path sketch follows this slide)
• Integrations with MapReduce, Spark, and Impala
• more to come!
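The later slides show table creation and scans in detail; as a complement, here is a minimal sketch of the write-path calls listed above (Insert/Update/Delete) using the Java client. It is an illustration only: it assumes the "clouderans" table from the create-table slide, the row values are placeholders, and the method names (getRow(), KuduSession.apply(), etc.) follow the Apache Kudu Java client, whose exact names have shifted slightly across releases.

import org.apache.kudu.client.*;

// Connect and open an existing table (created as on the later slides).
KuduClient client = new KuduClient.KuduClientBuilder("master1,master2,master3").build();
KuduTable table = client.openTable("clouderans");
KuduSession session = client.newSession();

// Insert a new row (placeholder values).
Insert insert = table.newInsert();
insert.getRow().addLong("col_id", 1L);
insert.getRow().addString("col_name", "alice");
insert.getRow().addString("col_email", "alice@example.com");
session.apply(insert);

// Update a column, addressed by primary key.
Update update = table.newUpdate();
update.getRow().addLong("col_id", 1L);
update.getRow().addString("col_email", "alice@new-example.com");
session.apply(update);

// Delete the row by primary key.
Delete delete = table.newDelete();
delete.getRow().addLong("col_id", 1L);
session.apply(delete);

session.close();
client.shutdown();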
- 19. Kudu Use Cases
Kudu is best for use cases requiring a simultaneous combination of
sequential and random reads and writes, e.g.:
● Time Series
○ Examples: Streaming market data; fraud detection & prevention; risk monitoring
○ Workload: Inserts, updates, scans, lookups
● Machine Data Analytics
○ Examples: Network threat detection
○ Workload: Inserts, scans, lookups
● Online Reporting
○ Examples: Operational data store (ODS)
○ Workload: Inserts, updates, scans, lookups
- 30. LSM vs Kudu
• LSM – Log-Structured Merge (Cassandra, HBase, etc.)
• Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable)
• Reads perform an on-the-fly merge of all on-disk HFiles (see the merge sketch after this slide)
• Kudu
• Shares some traits (memstores, compactions)
• More complex.
• Slower writes in exchange for faster reads (especially scans)
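For intuition on the read-path point above, here is a purely illustrative Java sketch (invented for this write-up, not HBase/Cassandra/Kudu source) of the k-way merge an LSM read performs across the MemStore and every flushed file: the more sorted runs on disk, the more work each read does until compaction reduces them.

import java.util.*;

public class LsmMergeSketch {
    // Merge sorted runs (one in-memory MemStore snapshot plus N flushed files)
    // into a single sorted stream, as an LSM read does on the fly.
    static List<Long> mergeSortedRuns(List<List<Long>> runs) {
        // Min-heap keyed by each run's current head element.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.head));
        for (List<Long> run : runs) {
            if (!run.isEmpty()) heap.add(new Cursor(run.iterator()));
        }
        List<Long> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            merged.add(c.head);            // emit the smallest current key
            if (c.advance()) heap.add(c);  // re-queue the run if it has more rows
        }
        return merged;
    }

    // Iterator wrapper exposing the run's current head key.
    static final class Cursor {
        final Iterator<Long> iter;
        Long head;
        Cursor(Iterator<Long> iter) { this.iter = iter; this.head = iter.next(); }
        boolean advance() {
            if (!iter.hasNext()) return false;
            head = iter.next();
            return true;
        }
    }

    public static void main(String[] args) {
        // One MemStore plus two flushed files: three sorted runs per read.
        System.out.println(mergeSortedRuns(Arrays.asList(
            Arrays.asList(3L, 7L, 9L),      // in-memory writes
            Arrays.asList(1L, 4L, 8L),      // older flushed file
            Arrays.asList(2L, 5L, 6L))));   // newer flushed file -> prints 1..9 in order
    }
}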
- 40. Java/C++ native clients - Create table
KuduClient client = new KuduClient.KuduClientBuilder("master1,master2,master3").build();

ColumnSchema idColumn = new ColumnSchema.ColumnSchemaBuilder("col_id", Type.INT64)
    .key(true)
    .build();
ColumnSchema nameColumn = new ColumnSchema.ColumnSchemaBuilder("col_name", Type.STRING)
    .build();
ColumnSchema emailColumn = new ColumnSchema.ColumnSchemaBuilder("col_email", Type.STRING)
    .nullable(true)
    .build();

Schema schema = new Schema(Lists.newArrayList(idColumn, nameColumn, emailColumn));
client.createTable("clouderans", schema);
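One caveat on the snippet above: the two-argument createTable() matches the early client, while later Apache Kudu releases also require a CreateTableOptions argument describing partitioning and replication. A rough sketch of the equivalent call under the newer API, where range-partitioning on col_id and 3 replicas are illustrative choices rather than anything from this deck:

CreateTableOptions options = new CreateTableOptions()
    .setRangePartitionColumns(Lists.newArrayList("col_id"))  // range-partition on the key (assumed)
    .setNumReplicas(3);                                       // replication factor (assumed)
client.createTable("clouderans", schema, options);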
- 43. Java/C++ native clients - Scan
ColumnRangePredicate predicate = new ColumnRangePredicate(idColumn);
predicate.setLowerBound(1435425662);

KuduScanner scanner = client.newScannerBuilder(table)
    .setProjectedColumnNames(Lists.newArrayList("col_name", "col_email"))
    .addColumnRangePredicate(predicate)
    .build();

while (scanner.hasMoreRows()) {
    RowResultIterator rowResultIterator = scanner.nextRows();
    while (rowResultIterator.hasNext()) {
        RowResult rowResult = rowResultIterator.next();
        System.out.println("Found a clouderan " + rowResult.getString(0));
    }
}
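A note on API drift: ColumnRangePredicate is the early client's predicate type; later Apache Kudu releases replaced it with KuduPredicate. A hedged sketch of the same inclusive lower-bound filter in the newer style, reusing the idColumn and table from the slides above:

KuduPredicate predicate = KuduPredicate.newComparisonPredicate(
    idColumn, KuduPredicate.ComparisonOp.GREATER_EQUAL, 1435425662L);
KuduScanner scanner = client.newScannerBuilder(table)
    .setProjectedColumnNames(Lists.newArrayList("col_name", "col_email"))
    .addPredicate(predicate)
    .build();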
- 46. Impala + Kudu – Pre-Splitting
CREATE TABLE cust_behavior (
  _id BIGINT,
  salary STRING,
  edu_level INT,
  usergender STRING,
  rating INT)
DISTRIBUTE BY RANGE(_id)
SPLIT ROWS((1439560049342), (1439566253755), (1439572458168), (1439578662581), (1439584866994), (1439591071407))
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'my_first_table',
  'kudu.master_addresses' = 'kudu-master.example.com:7051',
  'kudu.key_columns' = '_id'
);
- 49. TPC-H (Analytics benchmark)
• 75 tablet servers + 1 master cluster
• 12 (spinning) disks each, enough RAM to fit the dataset
• Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
• TPC-H Scale Factor 100 (100GB)
• Example query:
• SELECT n_name, sum(l_extendedprice * (1 - l_discount)) AS revenue
  FROM customer, orders, lineitem, supplier, nation, region
  WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_suppkey = s_suppkey
    AND c_nationkey = s_nationkey AND s_nationkey = n_nationkey
    AND n_regionkey = r_regionkey AND r_name = 'ASIA'
    AND o_orderdate >= date '1994-01-01' AND o_orderdate < date '1995-01-01'
  GROUP BY n_name ORDER BY revenue DESC;
- 56. Use Case 1: Benchmark Results
Query latency*:
[Bar chart: per-query runtime in seconds for TPC-H Q1–Q6, Kudu vs. Parquet]

Bulk load using Impala (INSERT INTO):
          Total time (s)   Throughput (total)    Throughput (per node)
Kudu      961.1            2.8M records/s        39.5k records/s
Parquet   114.6            23.5M records/s       331k records/s

* HDFS Parquet file replication = 3; Kudu table replication = 3
* Each query was run 5 times and the average taken