Apache CarbonData:
An indexed columnar file format for
interactive query with Spark SQL
Jihong MA, Jacky LI
HUAWEI
Data workloads: Report & Dashboard, OLAP & Ad-hoc, Batch Processing, Real-Time Analytics, Machine Learning
Challenge
• Wide spectrum of query analysis:
  • OLAP vs. detailed query
  • Full scan vs. small scan
  • Point queries
Big scan queries, small scan queries, and multi-dimensional OLAP queries all hit the same data.
How do we choose a storage engine to facilitate query execution?
Available Options
1. NoSQL Database
• Key-value store: low latency (<5 ms)
• No standard SQL support
2. MPP Relational Database
• Shared-nothing architecture enables fast query execution
• Poor scalability: cluster sizes under ~100 nodes, no fault tolerance
3. Search Engine
• Advanced indexing techniques for fast search
• 3-4X data expansion in size, no SQL support
4. SQL on Hadoop
• Modern distributed architecture and high scalability
• Slow on point queries
Motivation for A New File Format
CarbonData: Unified File Format
Big scan queries, small scan queries, and multi-dimensional OLAP queries:
a single copy of data, balanced to fit all data access patterns.
Apache
• Apache Incubator project since June 2016
• Apache releases:
  • 4 stable releases
  • Latest: 1.0.0 (Jan 28, 2017)
• Contributors and production users: [logos]
Introducing CarbonData
1. CarbonData File: what is the CarbonData file format?
2. CarbonData Table: what forms a CarbonData table on disk?
3. CarbonData Integration with Spark: what does it take to deeply integrate with a distributed processing engine like Spark?
CarbonData:
An Indexed Columnar File Format
CarbonData File Structure
§ Built-in Columnar & Index
§ Multi-dimensional Index (B+ Tree)
§ Min/Max index
§ Inverted index
§ Encoding:
§ RLE, Delta, Global Dictionary
§ Snappy for compression
§ Adaptive Data Type Compression
§ Data Type:
§ Primitive type and nested type
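The global dictionary and RLE encodings listed above can be illustrated with a small sketch. This is illustrative Python, not CarbonData internals; the function names are made up for this example.

```python
# Hypothetical sketch of two encodings CarbonData combines: a global
# dictionary (values -> small integer codes) followed by run-length
# encoding (RLE) of the code stream.

def dictionary_encode(values):
    """Map each distinct value to an integer code (the 'global dictionary')."""
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return dictionary, codes

def rle_encode(codes):
    """Collapse runs of equal codes into [code, run_length] pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

column = ["boston", "boston", "boston", "paris", "paris", "tokyo"]
dictionary, codes = dictionary_encode(column)
runs = rle_encode(codes)
# codes -> [0, 0, 0, 1, 1, 2]; runs -> [[0, 3], [1, 2], [2, 1]]
```

Sorted columns (as produced by MDK sorting) form long runs, which is why RLE pairs well with the sort order.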
CarbonData File Layout
• Blocklet: a set of rows in columnar format
  • Data is sorted along the MDK (multi-dimensional keys)
  • Clustered data enables efficient filtering and scan
• Column chunk: data for one column in a blocklet
• Footer: metadata information
  • File-level metadata & statistics
  • Schema
  • Blocklet index
Carbon Data File layout:
• Blocklet 1 … Blocklet N, each containing Column 1 Chunk … Column n Chunk
• File Footer, containing the file metadata, schema, and blocklet index
File Level Blocklet Index
Blocklet 1
1 1 1 1 1 1 12000
1 1 1 2 1 2 5000
1 1 2 1 1 1 12000
1 1 2 2 1 2 5000
1 1 3 1 1 1 12000
1 1 3 2 1 2 5000
Blocklet 2
1 2 1 3 2 3 11000
1 2 2 3 2 3 11000
1 2 3 3 2 3 11000
1 3 1 4 3 4 2000
1 3 1 5 3 4 1000
1 3 2 4 3 4 2000
Blocklet 3
1 3 2 5 3 4 1000
1 3 3 4 3 4 2000
1 3 3 5 3 4 1000
1 4 1 4 1 1 20000
1 4 2 4 1 1 20000
1 4 3 4 1 1 20000
Blocklet 4
2 1 1 1 1 1 12000
2 1 1 2 1 2 5000
2 1 2 1 1 1 12000
2 1 2 2 1 2 5000
2 1 3 1 1 1 12000
2 1 3 2 1 2 5000
Blocklet Index (stored in the File Footer)
For each blocklet, the index records its start key, end key, and per-column min/max statistics:
• Blocklet 1: Start Key1, End Key1, C1(Min,Max) … C7(Min,Max)
• Blocklet 2: Start Key2, End Key2, C1(Min,Max) … C7(Min,Max)
• Blocklet 3: Start Key3, End Key3, C1(Min,Max) … C7(Min,Max)
• Blocklet 4: Start Key4, End Key4, C1(Min,Max) … C7(Min,Max)
The index tree's root covers (Start Key1, End Key4); its intermediate nodes cover (Start Key1, End Key2) and (Start Key3, End Key4).
• Build an in-memory, file-level MDK index tree for filtering
• Major optimization for efficient scan
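A minimal sketch of how per-blocklet min/max statistics allow a scan to skip blocklets that cannot match a filter, using the example rows above. Illustrative Python only, with 0-based column indices; not CarbonData code.

```python
# Each blocklet's rows are sorted along the MDK; the footer keeps
# per-column min/max per blocklet, so a filter can prune blocklets
# without reading their data.

blocklets = {
    1: [(1, 1, 1, 1, 1, 1, 12000), (1, 1, 3, 2, 1, 2, 5000)],
    2: [(1, 2, 1, 3, 2, 3, 11000), (1, 3, 2, 4, 3, 4, 2000)],
    3: [(1, 3, 2, 5, 3, 4, 1000),  (1, 4, 3, 4, 1, 1, 20000)],
    4: [(2, 1, 1, 1, 1, 1, 12000), (2, 1, 3, 2, 1, 2, 5000)],
}

def column_stats(rows, col):
    """The (min, max) statistic the footer would store for one column."""
    vals = [r[col] for r in rows]
    return min(vals), max(vals)

def prune(blocklets, col, value):
    """Keep only blocklets whose [min, max] range for `col` can contain `value`."""
    survivors = []
    for bid, rows in blocklets.items():
        lo, hi = column_stats(rows, col)
        if lo <= value <= hi:
            survivors.append(bid)
    return survivors

# Filter c1 = 2: only blocklet 4 can contain matches.
print(prune(blocklets, 0, 2))  # [4]
```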
Rich Multi-Level Index Support
[Figure: the Spark driver holds a table-level index built from Carbon file footers; each executor holds a file-level index & scanner over Carbon files (data + footer); Catalyst drives query planning; column chunks carry inverted indexes.]
• Using the index info in the footer, a two-level B+ tree index can be built:
  • File-level index: a local B+ tree for efficient blocklet-level filtering
  • Table-level index: a global B+ tree for efficient file-level filtering
  • Column-chunk inverted index: efficient column chunk scan
CarbonData Table:
Data Segment + Metadata
CarbonData Table
• Metadata: appendable dictionary, schema
• Index: separate index per segment
• Data: immutable data files
Data is loaded in batches from the data source into segments; queries read the table across all segments.
Each batch load creates a new segment (e.g. Segment 2 next to Segment 1):
• Dictionary: appended
• Index: new tree
• Data files: new folder
CarbonData Table Layout

On HDFS:
• /table_name/fact/segment_id: Carbon files (data + footer)
• /table_name/meta: dictionary file (dictionary map), index file (all footers), schema file (latest schema)
The Spark driver caches the table-level index.
Mutation: Delete
Deleting rows adds a delete delta to the segment's base file: a bitmap file that marks the deleted rows.
• No change in index
• No change in dictionary
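A sketch of the read path with a delete bitmap (assumed semantics, not CarbonData internals): the reader masks deleted rows at scan time, which is why the base file, its index, and the dictionary can stay unchanged.

```python
# The delete delta is modeled here as a set of deleted row ids; a real
# implementation would use a compressed bitmap.

base_rows = [("phone", 70), ("car", 100), ("phone", 30)]

delete_bitmap = {1}  # row id 1 ("car", 100) marked deleted

def scan(rows, deleted):
    """Yield only rows whose id is not set in the delete bitmap."""
    return [r for i, r in enumerate(rows) if i not in deleted]

print(scan(base_rows, delete_bitmap))  # [('phone', 70), ('phone', 30)]
```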
Mutation: Update
Updating rows adds both a delete delta and an insert delta to the segment:
• A bitmap file for the delete delta, and a regular data file for the insert delta
• A new index is added for the inserted delta
• The dictionary is appended
Data Compaction
Segments (e.g. Segment 1 and Segment 2) are merged into a compacted segment (Segment 1.1) with its own dictionary, index, and data files; queries then read the compacted segment.
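Since each segment's data files are sorted along the MDK, compaction can be sketched as a k-way merge of sorted runs. This is an illustrative sketch of the idea under that assumption, not CarbonData's implementation.

```python
# Merge the sorted rows of two segments into one new sorted segment,
# so subsequent reads touch fewer, larger files.
import heapq

segment_1 = [(1, 1, 12000), (1, 3, 5000)]   # rows sorted by key
segment_2 = [(1, 2, 11000), (2, 1, 12000)]

def compact(*segments):
    """k-way merge of already-sorted segments into one sorted segment."""
    return list(heapq.merge(*segments))

segment_1_1 = compact(segment_1, segment_2)
print(segment_1_1)
# [(1, 1, 12000), (1, 2, 11000), (1, 3, 5000), (2, 1, 12000)]
```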
Segment Management
A ZooKeeper-based Segment Manager coordinates segment state between batch loads (write) and queries (read) across the table's data segments.
SparkSQL + CarbonData:
Enables fast interactive data analysis
Carbon-Spark Integration
• Built-in Spark integration
  • Spark 1.5, 1.6, 2.1
• Interface:
  • SQL
  • DataFrame API
• Operations:
  • Load, Query (with optimization)
  • Update, Delete, Compaction, etc.
The integration spans three layers over the Carbon files: Reader/Writer, Data Management, and Query Optimization.
Integration through File Format
Carbon files live in HDFS, organized into segments (Segment 1, 2, 3) and accessed through InputFormat/OutputFormat.
Deep Integration with SparkSQL
CarbonData integrates with the SparkSQL optimizer, adding Read/Write/Update/Delete on CarbonData tables stored as segments in HDFS.
CarbonData as a SparkSQL Data Source

SQL or DataFrame queries flow through the SparkSQL Catalyst framework (Parser/Analyzer, Optimizer, Physical Planning, Execution on Spark Core, across the Spark driver and executors), with Carbon-specific hooks at each step:
• Parser: new SQL syntax for DML-related statements
• ResolveRelation: resolves to the Carbon data source
• Rule-based optimization: Carbon-specific rule, Lazy Decode, leveraging the global dictionary
• Cost-based physical planning: Carbon data source strategy
• Execution (RDDs): CarbonScanRDD, which leverages the multi-level index for efficient filtering and scan, plus DML-related RDDs
Efficient Filtering via Index

SELECT c3, c4 FROM t1 WHERE c2='boston'

1. File pruning, using the table-level index over file footers
2. Blocklet pruning, using the blocklet index in each surviving footer
3. Read and decompress the filter column
4. Binary search using the inverted index; skip to the next blocklet if there is no match
5. Decompress the projection columns
6. Return decoded data to Spark

The surviving blocklets are grouped into scan tasks.
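Step 4 of the scan can be sketched as follows (illustrative Python, not CarbonData code): a column chunk's inverted index keeps the chunk's values sorted together with their original row ids, so an equality filter becomes a binary search.

```python
import bisect

def build_inverted_index(column):
    """Sort (value, row_id) pairs: the 'inverted index' of the chunk."""
    return sorted((v, i) for i, v in enumerate(column))

def lookup(index, value):
    """Binary-search the sorted index; return matching row ids (may be empty)."""
    values = [v for v, _ in index]
    lo = bisect.bisect_left(values, value)
    hi = bisect.bisect_right(values, value)
    return [row for _, row in index[lo:hi]]

c2 = ["tokyo", "boston", "paris", "boston"]
index = build_inverted_index(c2)
print(lookup(index, "boston"))   # [1, 3]
print(lookup(index, "berlin"))   # [] -> skip to the next blocklet
```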
Lazy Decoding by Leveraging Global Dictionary

SELECT c3, sum(c2) FROM t1 WHERE c1>10 GROUP BY c3

Before applying Lazy Decode:
• Stage 1: scan with filter (decode c1, c2, c3), then partial aggregation
• Stage 2: final aggregation
After applying Lazy Decode:
• Stage 1: scan with filter (decode c1, c2 only), then partial aggregation on the encoded value of c3
• Stage 2: final aggregation, then dictionary decode (decode c3 to string type)
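Lazy decoding can be sketched in a few lines (illustrative Python with made-up data, not Spark/CarbonData code): grouping happens on the integer code of c3, and the string value is looked up only once for each group in the final, small result.

```python
dictionary = {0: "phone", 1: "car"}   # global dictionary: code -> value for c3

# (c3_code, c2) rows that survived the filter c1 > 10
rows = [(0, 70), (1, 100), (0, 30)]

def aggregate_encoded(rows):
    """Group-by on the encoded c3 value; no string decoding here."""
    sums = {}
    for code, c2 in rows:
        sums[code] = sums.get(code, 0) + c2
    return sums

def dictionary_decode(sums, dictionary):
    """Decode c3 codes to strings only for the final, small result."""
    return {dictionary[code]: total for code, total in sums.items()}

encoded = aggregate_encoded(rows)               # {0: 100, 1: 100}
print(dictionary_decode(encoded, dictionary))   # {'phone': 100, 'car': 100}
```

Comparing integers is also cheaper than comparing strings, which is why both the partial and final aggregation stages speed up.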
Usage: Write
• Using SQL

CREATE TABLE tablename (name String, PhoneNumber String) STORED BY 'carbondata'

LOAD DATA [LOCAL] INPATH 'folder path' [OVERWRITE] INTO TABLE tablename
OPTIONS(property_name=property_value, ...)

INSERT INTO TABLE tablename select_statement1 FROM table1;

• Using DataFrame

df.write
  .format("carbondata")
  .option("tableName", "t1")
  .mode(SaveMode.Overwrite)
  .save()
Usage: Read
• Using SQL

SELECT project_list FROM t1
WHERE cond_list
GROUP BY columns
ORDER BY columns

• Using DataFrame

df = sparkSession.read
  .format("carbondata")
  .option("tableName", "t1")
  .load("path_to_carbon_file")
df.select(...).show
Usage: Update and Delete

Modify two columns in table1 with values from table2:

UPDATE table1 A
SET (A.PRODUCT, A.REVENUE) =
(
  SELECT PRODUCT, REVENUE
  FROM table2 B
  WHERE B.CITY = A.CITY AND B.BROKER = A.BROKER
)
WHERE A.DATE BETWEEN '2017-01-01' AND '2017-01-31'

Modify one column in table1:

UPDATE table1 A
SET A.REVENUE = A.REVENUE - 10
WHERE A.PRODUCT = 'phone'

Delete records in table1:

DELETE FROM table1 A
WHERE A.CUSTOMERID = '123'
Performance Result
Tests:
1. TPC-H benchmark (500 GB)
2. Test on a production data set (billions of rows)
3. Test on a large data set for scalability (1000 billion rows, 103 TB)
• Storage
  – Parquet: partitioned by time column (c1)
  – CarbonData: multi-dimensional index on c1~c10
• Compute
  – Spark 2.1
TPC-H: Query
For big scan queries, Parquet and CarbonData show similar performance (±20%).
[Chart: response time in seconds per query, Parquet vs. CarbonData.]
TPC-H: Query

Response time (sec):

Query     Parquet  CarbonData
Query_1   138      73
Query_3   283      207
Query_12  173      41
Query_16  1910     82

For queries that include a filter on the fact table, CarbonData achieves 1.5-20X better performance by leveraging the index.
TPC-H: Loading and Compression
• Loading throughput (MB/sec/node): Parquet 83.27, CarbonData 44.30
• Compression ratio: Parquet 2.96, CarbonData 2.74
Test on Production Data Set

Filter queries (filters on combinations of c1~c10):

Query  Response Time (sec)      Number of Tasks
       Parquet   CarbonData     Parquet   CarbonData
Q1     6.4       1.3            55        5
Q2     65        1.3            804       5
Q3     71        5.2            804       9
Q5     64        4.7            804       9
Q4     67        2.7            804       161
Q6     62        3.7            804       161
Q7     63        21.85          804       588
Q8     69        11.2           804       645

Observation: fewer scan tasks (and thus fewer resources) are needed because of the more efficient filtering enabled by the multi-level index.
Test on Production Data Set
Aggregation query: no filter, group by c5 (a dictionary-encoded column).
[Charts: total response time and per-stage (stage 1, stage 2) execution time, Parquet vs. CarbonData.]
Observation: both partial aggregation and final aggregation are faster, because aggregation operates on dictionary-encoded values.
Test on Large Data Set
Data: 200 to 1000 billion rows (half a year of telecom data from one province of China)
Cluster: 70 nodes, 1120 cores
Queries:
• Q1: filter (c1~c4), select *
• Q2: filter (c10), select *
• Q3: full-scan aggregate
Observation: as data grows, the index effectively reduces response time for the IO-bound queries (Q1, Q2), and Spark scales linearly for the CPU-bound query (Q3).
[Chart: response time as a percentage of the 200B-row baseline, from 200B to 1000B rows.]
What’s Coming Next
• Enhancements to data loading & compression
• Streaming ingest:
  • Introduce a row-based format for fast ingestion
  • Gradually compact row-based data into column-based for analytic workloads
  • Optimization for time-series data
• Broader integration across the big data ecosystem: Beam, Flink, Kafka, Kylin
Apache CarbonData
• Feedback, trial runs, and any kind of contribution are welcome!
• Code: https://github.com/apache/incubator-carbondata
• JIRA: https://issues.apache.org/jira/browse/CARBONDATA
• dev mailing list: dev@carbondata.incubator.apache.org
• Website: http://carbondata.incubator.apache.org
• Current contributors: 64
• Monthly resolved JIRA issues: 100+
Thank You.
Jihong.Ma@huawei.com
Jacky.likun@huawei.com
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Más de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Último

Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 

Último (20)

Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 

Apache Carbondata: An Indexed Columnar File Format for Interactive Query with Spark SQL: Spark Summit East talk by Jacky Li and Jihong Ma

  • 1. Apache CarbonData: An indexed columnar file format for interactive query with Spark SQL
Jihong MA, Jacky LI
HUAWEI
  • 2. data: Report & Dashboard, OLAP & Ad-hoc, Batch processing, Machine learning, Real Time Analytics
  • 3. Challenge
• Wide spectrum of query analysis
  • OLAP vs. detailed query
  • Full scan vs. small scan
  • Point queries
(Big Scan Query, Small Scan Query, Multi-dimensional OLAP Query)
  • 4. How to choose a storage engine to facilitate query execution?
  • 5. Available Options
1. NoSQL Database
  • Key-Value store: low latency, <5ms
  • No standard SQL support
2. MPP Relational Database
  • Shared-nothing architecture enables fast query execution
  • Poor scalability: <100-node cluster size, no fault tolerance
3. Search Engine
  • Advanced indexing techniques for fast search
  • 3~4X data expansion in size, no SQL support
4. SQL on Hadoop
  • Modern distributed architecture and high scalability
  • Slow on point queries
  • 6. Motivation for A New File Format
CarbonData: Unified File Format
A single copy of data, balanced to fit all data access: Big Scan Query, Small Scan Query, Multi-dimensional OLAP Query
  • 7. Apache
• Apache Incubator project since June 2016
• Apache releases
  • 4 stable releases
  • Latest 1.0.0, Jan 28, 2017
• Contributors:
• In Production:
(Compute / Storage)
  • 8. Introducing CarbonData
1 CarbonData File — What is the CarbonData file format?
2 CarbonData Table — What forms a CarbonData table on disk?
3 CarbonData Integration with Spark — What does it take to deeply integrate with a distributed processing engine like Spark?
  • 10. CarbonData File Structure
§ Built-in Columnar & Index
  § Multi-dimensional Index (B+ Tree)
  § Min/Max index
  § Inverted index
§ Encoding:
  § RLE, Delta, Global Dictionary
  § Snappy for compression
  § Adaptive Data Type Compression
§ Data Types:
  § Primitive and nested types
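The two most characteristic encodings above — global dictionary encoding and RLE — can be illustrated with a small sketch. This is not CarbonData's actual implementation; `dictionary_encode` and `rle_encode` are hypothetical helpers showing how repeated strings shrink to small integer codes, and how runs of identical codes then collapse to (code, run-length) pairs:

```python
def dictionary_encode(values):
    """Map each distinct string to a small integer code (a global dictionary)."""
    dictionary = {}
    encoded = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        encoded.append(dictionary[v])
    return dictionary, encoded

def rle_encode(codes):
    """Collapse runs of identical codes into [code, run_length] pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1
        else:
            runs.append([c, 1])
    return runs

cities = ["boston", "boston", "boston", "nyc", "nyc"]
dictionary, codes = dictionary_encode(cities)
print(dictionary)         # {'boston': 0, 'nyc': 1}
print(rle_encode(codes))  # [[0, 3], [1, 2]]
```

Because the codes are small dense integers, later operators (filters, group-bys) can work on them directly — which is what makes the Lazy Decode optimization described later possible.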
  • 11. CarbonData File Layout
• Blocklet: a set of rows in columnar format
  • Data is sorted along the MDK (multi-dimensional keys)
  • Clustered data enables efficient filtering and scan
• Column chunk: data for one column in a Blocklet
• Footer: metadata information
  • File-level metadata & statistics
  • Schema
  • Blocklet Index
(Carbon Data File: Blocklet 1 [Column 1 Chunk, Column 2 Chunk, …, Column n Chunk], …, Blocklet N, File Footer [File Metadata, Schema, Blocklet Index])
  • 12. File Level Blocklet Index
(Figure: sample rows grouped into Blocklets 1–4; the Blocklet Index in the File Footer holds Start Key/End Key plus per-column Min/Max statistics, e.g. C1(Min, Max) … C7(Min, Max), for each blocklet)
• Build an in-memory file-level MDK index tree for filtering
• Major optimization for efficient scan
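The min/max statistics make blocklet pruning cheap: a blocklet can be skipped without reading its data whenever the filter value falls outside its [min, max] range. A minimal sketch, assuming a simplified in-memory list of blocklet stats rather than CarbonData's actual B+ tree:

```python
def prune_blocklets(blocklets, column, value):
    """Return only the blocklets whose min/max range may contain `value`."""
    survivors = []
    for b in blocklets:
        lo, hi = b["stats"][column]
        if lo <= value <= hi:
            survivors.append(b)
    return survivors

blocklets = [
    {"id": 1, "stats": {"c2": (1, 3)}},
    {"id": 2, "stats": {"c2": (2, 4)}},
    {"id": 3, "stats": {"c2": (5, 9)}},
]
# Only blocklets 1 and 2 can contain c2 = 3; blocklet 3 is skipped entirely.
print([b["id"] for b in prune_blocklets(blocklets, "c2", 3)])  # [1, 2]
```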
  • 13. Rich Multi-Level Index Support
• Using the index info in the footer, a two-level B+ tree index can be built:
  • File-level index: local B+ tree, efficient blocklet-level filtering
  • Table-level index: global B+ tree, efficient file-level filtering
  • Column chunk inverted index: efficient column chunk scan
(Diagram: the Spark driver holds the Table Level Index alongside Catalyst; each executor holds a File Level Index & Scanner over Carbon files, whose footers carry the index info)
  • 15. CarbonData Table
• Metadata: appendable Dictionary and Schema
• Index: separate Index per Segment
• Data: immutable Data Files
• Each batch load from the data source creates a Segment; queries read across segments
  • 16. CarbonData Table
• Each additional batch load creates a new Segment (Segment 1, Segment 2, …):
  • Dictionary (append)
  • Index (new tree)
  • Data File (new folder)
• Reads see all segments
  • 17. CarbonData Table Layout
• HDFS layout:
  • /table_name/fact/segment_id — Carbon files (data + footer)
  • /table_name/meta — Dictionary File (Dictionary Map), Index File (all footers), Schema File (latest schema)
• The Spark driver holds the Table Level Index
  • 18. Mutation: Delete
• Delete produces a Delete Delta in the Segment: a bitmap file that marks deleted rows in the Base File
• No change in index
• No change in dictionary
  • 19. Mutation: Update
• Update produces a Delete Delta and an Insert Delta in the Segment: a bitmap file for the delete delta, and a regular data file for the inserted delta
• New index added for the inserted delta
• Dictionary append
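The delta mechanism keeps the base file immutable: deletes and updates are resolved at read time. An illustrative sketch (hypothetical helper, not CarbonData's reader) of how a delete-delta bitmap and an insert delta combine during a scan:

```python
def read_segment(base_rows, deleted_row_ids, insert_delta=()):
    """Scan a segment: skip base rows marked in the delete bitmap,
    then append rows from the insert delta file."""
    deleted = set(deleted_row_ids)
    live = [row for rowid, row in enumerate(base_rows) if rowid not in deleted]
    return live + list(insert_delta)

base = ["r0", "r1", "r2", "r3"]
# A pure delete marks rows 1 and 3:
print(read_segment(base, [1, 3]))            # ['r0', 'r2']
# An update of row 1 = delete row 1 + insert its new version:
print(read_segment(base, [1], ["r1_new"]))   # ['r0', 'r2', 'r3', 'r1_new']
```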
  • 21. Segment Management
• Leveraging ZooKeeper to manage the segment state
• A ZK-based Segment Manager coordinates loads (writes) and queries (reads) across data segments
  • 22. SparkSQL + CarbonData: Enables fast interactive data analysis
  • 23. Carbon-Spark Integration
• Built-in Spark integration
  • Spark 1.5, 1.6, 2.1
• Interfaces
  • SQL
  • DataFrame API
• Operations
  • Load, Query (with optimization)
  • Update, Delete, Compaction, etc.
(Components: Reader/Writer, Data Management, Query Optimization over Carbon files)
  • 24. Integration through File Format
• InputFormat/OutputFormat over Carbon files in HDFS (Segments 1, 2, 3)
  • 25. Deep Integration with SparkSQL
• Integrates with the SparkSQL Optimizer
• Read/Write/Update/Delete on the CarbonData table (Segments 1–3 on HDFS)
  • 26. CarbonData as a SparkSQL Data Source
• Plugs into the SparkSQL Catalyst framework: SQL or DataFrame → Parser/Analyzer → Optimizer → Physical Planning → Execution on Spark Core
• Parser: new SQL syntax for DML-related statements
• Rule-based optimization: Carbon-specific rule — Lazy Decode leveraging the global dictionary
• Cost-based physical planning strategy
• RDDs: CarbonScanRDD leverages the multi-level index for efficient filtering and scan; plus DML-related RDDs
  • 27. Efficient Filtering via Index
SELECT c3, c4 FROM t1 WHERE c2='boston'
1. File pruning (Spark driver, using the table-level index)
2. Blocklet pruning (Spark executor, using the file footer)
3. Read and decompress the filter column
4. Binary search using the inverted index; skip to the next blocklet if there is no match
5. Decompress the projection columns
6. Return decoded data to Spark
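Step 4 works because within a blocklet the filter column's encoded values are kept in sorted order by the inverted index, so matches are located by binary search instead of a row-by-row scan. A sketch of that step alone, under that simplifying assumption:

```python
import bisect

def find_matches(sorted_codes, target):
    """Return the [start, end) row range equal to `target` in a sorted
    column chunk; an empty range means the blocklet can be skipped."""
    start = bisect.bisect_left(sorted_codes, target)
    end = bisect.bisect_right(sorted_codes, target)
    return start, end

codes = [1, 1, 2, 2, 2, 5, 7]
print(find_matches(codes, 2))  # (2, 5)  -> three matching rows
print(find_matches(codes, 3))  # (5, 5)  -> no match, skip this blocklet
```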
  • 28. Lazy Decoding by Leveraging Global Dictionary
SELECT c3, sum(c2) FROM t1 WHERE c1>10 GROUP BY c3
• Before Lazy Decode: Scan with Filter (decode c1, c2, c3) → Partial Aggregation (stage 1) → Final Aggregation (stage 2)
• After Lazy Decode: Scan with Filter (decode c1, c2 only) → Partial Aggregation → Final Aggregation → Dictionary Decode
• Aggregate on the encoded value of c3; decode c3 to string type only at the end
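The effect of lazy decoding can be shown in miniature: the group-by runs on the small integer dictionary codes, and strings are materialized once per group at the very end rather than once per row. An illustrative sketch (hypothetical helper, not the actual Catalyst rule):

```python
def aggregate_lazy(codes_c3, values_c2, dictionary_c3):
    """GROUP BY on dictionary codes; decode group keys only at the end."""
    sums = {}
    for code, v in zip(codes_c3, values_c2):
        sums[code] = sums.get(code, 0) + v
    # Dictionary Decode step: one string lookup per group, not per row
    reverse = {s: code for code, s in ((c, s) for s, c in dictionary_c3.items())}
    decoded = {name: sums[code] for name, code in dictionary_c3.items() if code in sums}
    return decoded

dictionary = {"phone": 0, "car": 1}
codes = [0, 0, 1, 0]      # encoded c3 column
values = [10, 20, 5, 30]  # c2 column
print(aggregate_lazy(codes, values, dictionary))  # {'phone': 60, 'car': 5}
```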
  • 29. Usage: Write
• Using SQL

  CREATE TABLE tablename (name String, PhoneNumber String)
  STORED BY 'carbondata'

  LOAD DATA [LOCAL] INPATH 'folder path' [OVERWRITE] INTO TABLE tablename
  OPTIONS(property_name=property_value, ...)

  INSERT INTO TABLE tablename select_statement1 FROM table1;

• Using Dataframe

  df.write
    .format("carbondata")
    .option("tableName", "t1")
    .mode(SaveMode.Overwrite)
    .save()
  • 30. Usage: Read
• Using SQL

  SELECT project_list FROM t1
  WHERE cond_list
  GROUP BY columns
  ORDER BY columns

• Using Dataframe

  df = sparkSession.read
    .format("carbondata")
    .option("tableName", "t1")
    .load("path_to_carbon_file")
  df.select(…).show
  • 31. Usage: Update and Delete
• Modify one column in table1:

  UPDATE table1 A
  SET A.REVENUE = A.REVENUE - 10
  WHERE A.PRODUCT = 'phone'

• Modify two columns in table1 with values from table2:

  UPDATE table1 A
  SET (A.PRODUCT, A.REVENUE) = (
    SELECT PRODUCT, REVENUE FROM table2 B
    WHERE B.CITY = A.CITY AND B.BROKER = A.BROKER
  )
  WHERE A.DATE BETWEEN '2017-01-01' AND '2017-01-31'

• Delete records in table1:

  DELETE FROM table1 A WHERE A.CUSTOMERID = '123'
  • 33. Test
1. TPC-H benchmark (500GB)
2. Test on production data set (billions of rows)
3. Test on large data set for scalability (1000B rows, 103TB)
• Storage
  – Parquet: partitioned by time column (c1)
  – CarbonData: multi-dimensional index by c1~c10
• Compute
  – Spark 2.1
  • 34. TPC-H: Query
For big scan queries, similar performance, ±20%
(Chart: Response Time (sec), Parquet vs. CarbonData)
  • 35. TPC-H: Query
For queries that include a filter on the fact table, CarbonData gets 1.5–20X performance by leveraging the index

  Response Time (sec)   Query_1   Query_3   Query_12   Query_16
  Parquet                   138       283        173       1910
  CarbonData                 73       207         41         82
  • 36. TPC-H: Loading and Compression
• Loading throughput (MB/sec/node): Parquet 83.27, CarbonData 44.30
• Compression ratio: Parquet 2.96, CarbonData 2.74
  • 37. Test on Production Data Set
Filter queries (each filtering on a different subset of columns c1~c10):

  Query   Response Time (sec)          Number of Tasks
          Parquet    CarbonData        Parquet    CarbonData
  Q1          6.4           1.3             55             5
  Q2           65           1.3            804             5
  Q3           71           5.2            804             9
  Q5           64           4.7            804             9
  Q4           67           2.7            804           161
  Q6           62           3.7            804           161
  Q7           63         21.85            804           588
  Q8           69          11.2            804           645

Observation: fewer scan tasks (less resource) are needed because of more efficient filtering by leveraging the multi-level index
  • 38. Test on Production Data Set
Aggregation query: no filter, group by c5 (dictionary-encoded column)
(Charts: Response Time (sec) and per-stage execution time, Parquet vs. CarbonData)
Observation: both partial aggregation and final aggregation are faster, because aggregation operates on dictionary-encoded values
  • 39. Test on Large Data Set
Data: 200 to 1000 billion rows (half a year of telecom data in one China province)
Cluster: 70 nodes, 1120 cores
Queries:
  Q1: filter (c1~c4), select *
  Q2: filter (c10), select *
  Q3: full scan aggregate
Observation — as data increases:
• The index is efficient at reducing response time for IO-bound queries: Q1, Q2
• Spark can scale linearly for the CPU-bound query: Q3
(Chart: Response Time as % of the 200B-row baseline)
  • 41. What’s coming next
• Enhancements to data loading & compression
• Streaming ingest:
  • Introduce a row-based format for fast ingestion
  • Gradually compact row-based to column-based for analytic workloads
  • Optimization for time series data
• Broader integration across the big data ecosystem: Beam, Flink, Kafka, Kylin
  • 42. Apache
• We love feedback — try it out; any kind of contribution is welcome!
• Code: https://github.com/apache/incubator-carbondata
• JIRA: https://issues.apache.org/jira/browse/CARBONDATA
• dev mailing list: dev@carbondata.incubator.apache.org
• Website: http://carbondata.incubator.apache.org
• Current contributors: 64
• Monthly resolved JIRA issues: 100+