ORC File and Vectorization - Hadoop Summit 2013

Eric Hanson and I gave this presentation at Hadoop Summit 2013. Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. Hive 0.11 added a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that a query reading only one column reads only the bytes for that column. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values of each column for each set of 10,000 rows and for the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that cannot match the query. Columnar storage formats like ORC reduce I/O and storage use, but it’s just as important to reduce CPU usage. A technique called vectorized query execution works nicely with column store formats to do this. Vectorized query execution has proven to give dramatic performance speedups, on the order of 10X to 100X, for structured data processing. We describe how we’re adding vectorized query execution to Hive and coupling it with ORC through a vectorized iterator.
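The index-based skipping described above can be sketched in a few lines. This is a minimal illustration, not the ORC reader: the class `RowGroupSkipping`, its `MinMax` holder, and both method names are hypothetical, and the predicate is fixed to `value >= threshold` for simplicity. Only the 10,000-row granularity comes from the abstract.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of min/max indexes; not the ORC reader API. */
final class RowGroupSkipping {
    // Index granularity mentioned in the abstract.
    static final int ROWS_PER_GROUP = 10_000;

    /** Minimum and maximum of one column over one group of rows. */
    static final class MinMax {
        final long min;
        final long max;

        MinMax(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /** Builds one MinMax entry per group of ROWS_PER_GROUP values. */
    static List<MinMax> buildIndex(long[] column) {
        List<MinMax> index = new ArrayList<>();
        for (int start = 0; start < column.length; start += ROWS_PER_GROUP) {
            int end = Math.min(start + ROWS_PER_GROUP, column.length);
            long min = column[start];
            long max = column[start];
            for (int i = start + 1; i < end; i++) {
                min = Math.min(min, column[i]);
                max = Math.max(max, column[i]);
            }
            index.add(new MinMax(min, max));
        }
        return index;
    }

    /** Returns the groups that may hold a value >= threshold; the rest can be skipped unread. */
    static List<Integer> groupsToRead(List<MinMax> index, long threshold) {
        List<Integer> keep = new ArrayList<>();
        for (int g = 0; g < index.size(); g++) {
            if (index.get(g).max >= threshold) {
                keep.add(g);
            }
        }
        return keep;
    }
}
```

A group whose maximum is below the threshold cannot contain a matching row, so the reader never has to decompress or decode it. The index can only rule groups out; a kept group may still contain non-matching rows that the query filters later.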

Copyright 2013 by Hortonworks and Microsoft
ORC File & Vectorization
Improving Hive Data Storage and Query Performance
June 2013
Owen O’Malley, owen@hortonworks.com, @owen_omalley
Jitendra Pandey, jitendra@hortonworks.com
Eric Hanson, ehans@microsoft.com

File Layout
[Figure: ORC file layout. The file is a sequence of 256MB stripes, each made up of Index Data, Row Data, and a Stripe Footer, followed by the File Footer and the Postscript. Index Data and Row Data are organized by column (Column 1 through Column 8), and a column's data is stored as streams (Stream 2.1 through Stream 2.4).]
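To make the layout concrete, here is a rough sketch of how a reader could turn a column projection into byte ranges within one stripe. Everything in it is an assumption made for illustration: the `StripeProjection`, `Stream`, and `Range` classes are hypothetical, and the sketch assumes that streams are stored back to back in the order they are listed. It is not the ORC reader API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical model of a stripe's stream directory; not the ORC reader API. */
final class StripeProjection {

    /** One stream as a stripe footer might describe it. */
    static final class Stream {
        final int column;   // id of the column the stream belongs to
        final String name;  // label such as "2.1" (column 2, first stream)
        final long length;  // number of bytes the stream occupies

        Stream(int column, String name, long length) {
            this.column = column;
            this.name = name;
            this.length = length;
        }
    }

    /** A byte range inside the stripe. */
    static final class Range {
        final long offset;
        final long length;

        Range(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Returns the byte ranges of the streams that belong to the wanted columns. */
    static List<Range> rangesFor(List<Stream> streams, Set<Integer> wantedColumns) {
        List<Range> ranges = new ArrayList<>();
        long offset = 0; // assumption: streams are laid out consecutively in listed order
        for (Stream s : streams) {
            if (wantedColumns.contains(s.column)) {
                ranges.add(new Range(offset, s.length));
            }
            offset += s.length;
        }
        return ranges;
    }
}
```

Because the directory gives every stream's length, the reader can compute the offsets of the wanted columns without touching the bytes of the others, which is what lets a single-column query read only that column's data.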

Hive Compound Types
[Figure: column tree for a compound type; each node is labeled with its column id and type.]
0 Struct
1 Int
2 Map
3 String
4 Struct
5 String
6 Double
7 Time
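The ids on this slide are consistent with numbering the type tree in pre-order (a parent first, then each child subtree in turn). The sketch below reproduces the numbers under one assumed nesting: a top-level struct holding an int, a map from string to a struct of (string, double), and a time column. The extracted text preserves only the id and type pairs, so that shape, along with the `TypeTree` class and `assignIds` method, is a hypothetical reconstruction.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical type-tree node used to illustrate pre-order column ids. */
final class TypeTree {
    final String kind;
    final List<TypeTree> children = new ArrayList<>();

    TypeTree(String kind, TypeTree... kids) {
        this.kind = kind;
        this.children.addAll(List.of(kids));
    }

    /** Returns "id kind" labels, with ids assigned in pre-order starting at 0. */
    static List<String> assignIds(TypeTree root) {
        List<String> out = new ArrayList<>();
        visit(root, out);
        return out;
    }

    private static void visit(TypeTree node, List<String> out) {
        out.add(out.size() + " " + node.kind); // next free id is the count so far
        for (TypeTree child : node.children) {
            visit(child, out);
        }
    }

    /** Assumed shape of the slide's example type. */
    static TypeTree slideExample() {
        return new TypeTree("Struct",
            new TypeTree("Int"),
            new TypeTree("Map",
                new TypeTree("String"),
                new TypeTree("Struct",
                    new TypeTree("String"),
                    new TypeTree("Double"))),
            new TypeTree("Time"));
    }
}
```

With this numbering every node of a nested type, including the map's key and value, gets its own column id, so each can be stored and read as a separate column.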

Comparison
                              RC File  Trevni  Parquet    ORC
Hive Integration              Y        N       N          Y
Active Development            N        N       Y          Y
Hive Type Model               N        N       N          Y
Shred complex columns         N        Y       Y          Y
Splits found quickly          N        Y       Y          Y
Files per bucket              1        many    1 or many  1
Versioned metadata            N        Y       Y          Y
Run length data encoding      N        N       Y          Y
Store strings in dictionary   N        N       Y          Y
Store min, max, sum, count    N        N       N          Y
Store internal indexes        N        N       N          Y
No overhead for non-null      N        N       N          Y (≥ 0.12)
Predicate Pushdown            N        N       N          Y (≥ 0.12)
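As an illustration of the "Store strings in dictionary" row, here is a generic dictionary encoder. It is a sketch of the technique only: the `DictionaryEncoding` and `Encoded` names are hypothetical, the dictionary is kept in first-seen order for simplicity, and nothing here reflects the actual on-disk layout of ORC or Parquet.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Generic dictionary encoding sketch; not the on-disk layout of any format. */
final class DictionaryEncoding {

    /** The distinct values plus one small integer id per row. */
    static final class Encoded {
        final List<String> dictionary;
        final int[] ids;

        Encoded(List<String> dictionary, int[] ids) {
            this.dictionary = dictionary;
            this.ids = ids;
        }
    }

    /** Replaces each string with the index of its first occurrence in the dictionary. */
    static Encoded encode(String[] values) {
        Map<String, Integer> seen = new LinkedHashMap<>(); // keeps first-seen order
        int[] ids = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer id = seen.get(values[i]);
            if (id == null) {
                id = seen.size();
                seen.put(values[i], id);
            }
            ids[i] = id;
        }
        return new Encoded(new ArrayList<>(seen.keySet()), ids);
    }

    /** Rebuilds the original column from the dictionary and the ids. */
    static String[] decode(Encoded encoded) {
        String[] values = new String[encoded.ids.length];
        for (int i = 0; i < values.length; i++) {
            values[i] = encoded.dictionary.get(encoded.ids[i]);
        }
        return values;
    }
}
```

For a column with few distinct values, each repeated string is stored once and every row shrinks to a small integer, which in turn suits integer techniques such as run-length encoding and bit packing.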

Why row-at-a-time execution is slow
• Hive uses Object Inspectors to work on a row
• Enables level of abstraction
• Costs major performance
• Exacerbated by using lazy serdes
• Inner loop has many method, new(), and if-then-else calls
• Lots of CPU instructions
• Pipeline stalls
• Poor instructions/cycle
• Poor cache locality

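How the code works (simplified)

    import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

    // Adds a constant to every value of a long column, one batch at a time.
    class LongColumnAddLongScalarExpression {
        int inputColumn;   // index of the column to read
        int outputColumn;  // index of the column that receives the result
        long scalar;       // constant added to each input value

        void evaluate(VectorizedRowBatch batch) {
            long[] inVector =
                ((LongColumnVector) batch.columns[inputColumn]).vector;
            long[] outVector =
                ((LongColumnVector) batch.columns[outputColumn]).vector;
            if (batch.selectedInUse) {
                // Only the rows listed in batch.selected are still live.
                for (int j = 0; j < batch.size; j++) {
                    int i = batch.selected[j];
                    outVector[i] = inVector[i] + scalar;
                }
            } else {
                // Every row in the batch is live: a tight loop over primitive arrays.
                for (int i = 0; i < batch.size; i++) {
                    outVector[i] = inVector[i] + scalar;
                }
            }
        }
    }

• No method calls
• Low instruction count
• Cache locality to 1024 values
• No pipeline stalls
• SIMD in Java 8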
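The slide's expression depends on Hive classes, so the following self-contained sketch uses simplified stand-ins to show how the `selected` array connects a filter to a later expression. `MiniBatch` and `VectorDemo` are hypothetical names invented here, not Hive APIs, and the filter logic is an assumption about how such an operator could work rather than Hive's implementation. The batch capacity of 1024 comes from the slide.

```java
/** Simplified stand-in for a vectorized row batch; not the Hive class. */
final class MiniBatch {
    static final int DEFAULT_SIZE = 1024;  // batch capacity noted on the slide

    final long[][] columns;                // one primitive array per column
    final int[] selected = new int[DEFAULT_SIZE];
    boolean selectedInUse = false;         // true once a filter has dropped rows
    int size;                              // number of live rows

    MiniBatch(int numColumns) {
        columns = new long[numColumns][DEFAULT_SIZE];
    }
}

final class VectorDemo {

    /** Keeps rows where columns[col] >= bound by recording their indexes in batch.selected. */
    static void filterGreaterEqual(MiniBatch batch, int col, long bound) {
        long[] v = batch.columns[col];
        int kept = 0;
        if (batch.selectedInUse) {
            // Narrow an existing selection in place; kept never overtakes j.
            for (int j = 0; j < batch.size; j++) {
                int i = batch.selected[j];
                if (v[i] >= bound) {
                    batch.selected[kept++] = i;
                }
            }
        } else {
            for (int i = 0; i < batch.size; i++) {
                if (v[i] >= bound) {
                    batch.selected[kept++] = i;
                }
            }
            batch.selectedInUse = true;
        }
        batch.size = kept;
    }

    /** Same loop structure as the slide: out = in + scalar for the live rows only. */
    static void addScalar(MiniBatch batch, int in, int out, long scalar) {
        long[] inVector = batch.columns[in];
        long[] outVector = batch.columns[out];
        if (batch.selectedInUse) {
            for (int j = 0; j < batch.size; j++) {
                int i = batch.selected[j];
                outVector[i] = inVector[i] + scalar;
            }
        } else {
            for (int i = 0; i < batch.size; i++) {
                outVector[i] = inVector[i] + scalar;
            }
        }
    }
}
```

The filter never copies or compacts column data; it only records which row indexes survive. Downstream expressions then do work for those rows alone, while the values themselves stay in place in the column arrays.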

Preliminary performance results
• NOT a benchmark
• 218 million row fact table of real data, 25 columns
• 18GB raw data
• 6 core, 12 thread workstation, 1 disk, 16GB RAM
• select a, b, count(*) from t where c >= const group by a, b -- 53 row result

Warm start times   RC non-vectorized           ORC non-vectorized      ORC vectorized
                   (default, not compressed)   (default, compressed)   (default, compressed)
Runtime (sec)      261                         58                      43
Total CPU (sec)    381                         159                     42
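The ratios implied by these figures can be worked out directly. Moving from RC to ORC cuts runtime by roughly 4.5x, and adding vectorization brings the total to roughly 6.1x. The CPU numbers show the vectorization effect more clearly: about 3.8x less CPU than non-vectorized ORC and about 9.1x less than RC. The small helper below (the `Speedup` class is hypothetical, written only for this calculation) reproduces the arithmetic from the table.

```java
import java.util.Locale;

/** Computes the speedup ratios implied by the table above. */
final class Speedup {

    /** How many times faster or cheaper the improved figure is than the baseline. */
    static double ratio(double baseline, double improved) {
        return baseline / improved;
    }

    public static void main(String[] args) {
        // Runtime in seconds, taken from the slide.
        double rcRuntime = 261, orcRuntime = 58, orcVecRuntime = 43;
        // Total CPU in seconds, taken from the slide.
        double rcCpu = 381, orcCpu = 159, orcVecCpu = 42;

        print("Runtime, RC vs ORC", ratio(rcRuntime, orcRuntime));
        print("Runtime, ORC vs ORC vectorized", ratio(orcRuntime, orcVecRuntime));
        print("Runtime, RC vs ORC vectorized", ratio(rcRuntime, orcVecRuntime));
        print("CPU, RC vs ORC", ratio(rcCpu, orcCpu));
        print("CPU, ORC vs ORC vectorized", ratio(orcCpu, orcVecCpu));
        print("CPU, RC vs ORC vectorized", ratio(rcCpu, orcVecCpu));
    }

    private static void print(String label, double value) {
        System.out.println(String.format(Locale.ROOT, "%s: %.1fx", label, value));
    }
}
```

As the slide stresses, this is a single query on one workstation and not a benchmark, so the ratios describe this run only.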

Thanks to contributors!
• Microsoft Big Data: Eric Hanson, Remus Rusanu, Sarvesh Sakalanaga, Tony Murphy, Ashit Gosalia
• Hortonworks: Jitendra Pandey, Owen O’Malley, Gopal V
• Others: Teddy Choi, Tim Chen
Jitendra/Eric are joint leads

ORC File and Vectorization - Hadoop Summit 2013

1. ORC File & Vectorization: Improving Hive Data Storage and Query Performance
2. ORC – Optimized RC File
3. History
4. Remaining Challenges
5. Requirements
6. File Structure
7. Stripe Structure
8. File Layout
9. Compression
10. Integer Column Serialization
11. String Column Serialization
12. Hive Compound Types
13. Compound Type Serialization
14. Generic Compression
15. Column Projection
16. How Do You Use ORC
17. Managing Memory
18. TPC-DS File Sizes
19. ORC Predicate Pushdown
20. Additional Details
21. Current work for Hive 0.12
22. Future Work
23. Comparison
24. Vectorization
25. Vectorization
26. Why row-at-a-time execution is slow
27. How the code works (simplified)
28. Vectorization project
29. Preliminary performance results
30. Thanks to contributors!
