More Related Content Similar to Format Wars: from VHS and Beta to Avro and Parquet (20) More from DataWorks Summit (20) Format Wars: from VHS and Beta to Avro and Parquet2. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.@SVDataScience
3. 3 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
• Prioritize for highest
business value when using
emerging technology
• Design with outcomes in
mind
• Be agile: deliver initial
results quickly, then adapt
and iterate
• Collaborate constantly with
our customers and partners
OUR PHILOSOPHY
4. 4 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.@SVDataScience
AGENDA
Introduction
Data formats
How to choose
Schema evolution
Summary
Questions
7. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.@SVDataScience
• Storage formats
• What they do
DATA FORMATS
8. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Data format
• Storage Format
• Text
• Sequence File
• Avro
• Parquet
• Optimized Row Columnar (ORC)
9. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Text
• More specifically text = csv, tsv, json records…
• Convenient format to use to exchange with other
applications or scripts that produce or read
delimited files
• Human readable and parsable
• Data stores is bulky and not as efficient to query
• Do not support block compression
10. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Sequence File
• Provides a persistent data structure for binary key-
value pairs
• Row based
• Commonly used to transfer data between Map
Reduce jobs
• Can be used as an archive to pack small files in
Hadoop
• Support splitting even when the data is
compressed
11. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Avro
• Widely used as a serialization platform
• Row-based, offers a compact and fast binary
format
• Schema is encoded on the file so the data can be
untagged
• Files support block compression and are splittable
• Supports schema evolution
12. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Parquet
• Column-oriented binary file format
• Uses the record shredding and assembly algorithm
described in the Dremel paper
• Each data file contains the values for a set of rows
• Efficient in terms of disk I/O when specific columns
need to be queried
13. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Optimized Row Columnar
• Considered the evolution of the RCFile
• Stores collections of rows and within the collection
the row data is stored in columnar format
• Introduces a lightweight indexing that enables
skipping of irrelevant blocks of rows
• Splittable: allows parallel processing of row
collections
• It comes with basic statistics on columns (min ,max,
sum, and count)
15. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.@SVDataScience
• ..for write
• ..for read
HOW TO CHOOSE
16. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For write
• Functional Requirements:
• What type of data do you have?
• Is the data format compatible with your
processing and querying tools?
• What are your file sizes?
• Do you have schemas that evolve over time?
17. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For write
• Speed Concerns
• Parquet and ORC usually needs some
additional parsing to format the data which
increases the overall read time
• Avro as a data serialization format: works well from
system to system, handles schema evolution (more
on that later)
• Text is bulky and inefficient but easily readable and
parsable
18. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For write
0
20
40
60
80
100
120
140
160TimeinSeconds
Narrow – Hortonworks (Hive 0.14 )
0
500
1000
1500
2000
2500
TimeinSeconds
Wide – Hortonworks (Hive 0.14)
Narrow: 10 million rows, 10 columns
Wide: 4 million rows, 1000 columns
19. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For write
0
10
20
30
40
50
60
70
TimeinSeconds
Narrow - hive-1.1.0+cdh5.4.2
0
100
200
300
400
500
600
700
TimeinSeconds
Wide - hive-1.1.0+cdh5.4.2
Narrow: 10 million rows, 10 columns
Wide: 4 million rows, 1000 columns
20. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For write
Narrow: 10 million rows, 10 columns
Wide: 4 million rows, 1000 columns
0
10
20
30
40
50
60
70
Text Avro Parquet
TimeinSeconds
Narrow - Spark 1.3
0
200
400
600
800
1000
1200
Text Avro Parquet
TimeinSeconds
Wide - Spark 1.3
21. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For write
0
200
400
600
800
1000
1200
1400
Megabytes
File sizes for narrow dataset
0
2000
4000
6000
8000
10000
12000
Megabytes
File sizes for wide dataset
22. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For write
• Use case
• Avro – Event data that can change over time
• Sequence File – Datasets shared between MR
jobs
• Text – Adding large amounts of data to HDFS
quickly
23. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
• Types of queries:
• Column specific queries, or few groups of
columns -> Use columnar format like Parquet or
ORC
• Compression of the file regardless the format
increases query speed times
• Text is really slow to read
• Parquet and ORC optimize read performance at
the expense of write performance
24. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
• Set up:
• Narrow dataset:
• 10 million rows, 10 columns
• Wide dataset:
• 4 million rows, 1000 columns
• Compression:
• Snappy, except for Avro which is deflate
25. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
0
10
20
30
40
50
60
70
Query 2 (5 conditions) Query 3 (10 conditions)
TimeinSeconds
Narrow Dataset - Hortonworks Hive 0.14.0.2.2.4.2
Text
Avro
Parquet
Sequence
ORC
26. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
0
100
200
300
400
500
600
700
800
Query 2 (5 conditions) Query 3 (10 conditions) Query 4 (20 conditions)
TimeinSeconds
Wide Dataset - Hortonworks Hive 0.14.0.2.2.4.2
Text
Avro
Parquet
Sequence
ORC
27. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
0
10
20
30
40
50
60
70
Query 1 (0 conditions) Query 2 (5 conditions) Query 3 (10 conditions)
TimeinSeconds
Narrow Dataset - CDH hive-1.1.0+cdh5.4.2
Text
Avro
Parquet
Sequence
ORC
28. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
0
50
100
150
200
250
Query 1 (no
conditions)
Query 2 (5
conditions)
Query 3 (10
conditions)
Query 4 (20
conditions)
TimeinSeconds
Wide Dataset - CDH hive-1.1.0+cdh5.4.2
Text
Avro
Parquet
Sequence
ORC
29. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
0
1
2
3
4
5
6
7
8
Query 1 (0 conditions) Query 2 (5 conditions) Query 3 (10 conditions)
TimeinSeconds
Narrow Dataset - CDH Impala
Text
Avro
Parquet
Sequence
30. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
0
5
10
15
20
25
30
Query 1 (0 filters) Query 2 (5 filters) Query 3 (10 filters) Query 4 (20 filters)
TimeinSeconds
Wide Dataset - CDH Impala
Text
Avro
Parquet
Sequence
31. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
Ran 4 queries (using Impala)
over 4 Million rows (70GB raw),
and 1000 columns (wide table)
0.00
10.00
20.00
30.00
40.00
50.00
60.00
70.00
80.00
Query 1 (0
filters)
Query 2 (5
filters)
Query 3 (10
filters)
Query 4 (20
filters)
Seconds
Query times for different data formats
Avro uncompress
Avro Snappy
Avro Deflate
Parquet
Seq uncompressed
Seq Snappy
Text Snappy
Text uncompressed
32. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
How to choose: For read
• Use case
• Avro – Query datasets that have changed over
time
• Parquet – Query a few columns on a wide
table
34. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.@SVDataScience
• What is schema
evolution?
• Data formats that evolve
• Examples
• Use cases
SCHEMA
EVOLUTION
35. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Schema evolution
• What is schema evolution?
• Adding columns
• Renaming columns
• Removing columns
• Why do we need it?
36. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Schema evolution
• Data formats that can evolve
• Avro
• Parquet
• Can only add columns at the end
37. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Schema evolution
• Avro Example
• The data – Dr Who episodes
• Original Dr Who & new Dr Who
• http://www.theguardian.com/news/datablog/2010/aug/20/doctor-who-time-travel-
information-is-beautiful
• Avro schema for the original Dr Who
{"namespace": "drwho.avro",
"type": "record",
"name": "drwho",
"fields": [
{"name": "doctor_who_season", "type": "string"}, {"name": "doctor_actor", "type": "string"},
{"name": "episode_no", "type": "string"}, {"name": "episode_title", "type": "string"},
{"name": "date_from", "type": "string"}, {"name": "date_to", "type": "string"},
{"name": "estimated", "type": "string"}, {"name": "planet", "type": "string"},
{"name": "sub_location", "type": "string"}, {"name": "main_location", "type": "string"}
]}
38. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Schema evolution
• Avro Example
• Original Dr Who data
doctor_who_
season doctor_actor episode_no episode_title date_from date_to estimated planet sub_location main_location
3 Pertwee 51 Spearhead from Space 1970 1990 y Earth England London and other
3 Pertwee 55 Terror of the Autons 1971 1971 y Earth England Luigi Rossini's Circus
3 Pertwee 58 Colony in Space 1971 2472 planet Uxarieus
3 Pertwee 59 The Daemons 1971 1971 y Earth England Devil's End; Wiltshire
3 Pertwee 60 Day of the Daleks 1972 2100 Earth England
Auderly House and
environs
3 Pertwee 63 The Mutants 1972 2900 Solos
3 Pertwee 64 The Time Monster -2000 1972 Earth/ Atlantis
3 Pertwee 64 The Time Monster 1972 -2000 Earth/ Atlantis
3 Pertwee 66 Carnival of Monsters 1972 1928 n
Indian Ocean;
Planet Inter Minor Ocean; alien planet
3 Pertwee 67 Frontier in Space 1972 2540 n
Planet Draconia;
Orgon Planet alien planets
3 Pertwee 68 Planet of the Daleks 1972 2540 y Planet Spiridon Alien Planet
3 Pertwee 69 The Green Death 2540 1973 y Earth UK Llanfairfach; Wales
3 Pertwee 70 The Time Warrior 1973 1200 n Earth UK Wessex Castle
3 Pertwee 71
Invasion of The
Dinosaurs 1200 1974 y Earth UK London
39. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Schema evolution
• Avro Example
• Lets add, rename, and delete some columns
• Avro schema for the new Dr Who
{"namespace": "drwho.avro",
"type": "record",
"name": "drwho",
"fields": [
{"name": "drwho_season", "type": ["null","string"], "aliases": ["doctor_who_season"]},
{"name": "drwho_actor", "type": ["null","string"], "aliases": ["doctor_actor"]},
{"name": "episode_no", "type": ["null","string"]}, {"name": "episode_title", "type": ["null","string"]},
{"name": "date_from", "type": ["null","string"]}, {"name": "date_to", "type": ["null","string"]},
{"name": "estimated", "type": "string"}, {"name": "planet", "type": ["null","string"]},
{"name": "sub_location", "type": ["null","string"]}, {"name": "main_location", "type": ["null","string"]},
{"name": "hd", "type": "string", "default": "no"}
]}
40. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Schema evolution
• Avro Example
• Original & New Dr Who data
drwho_season drwho_actor episode_no episode_title date_from date_to planet sub_location main_location hd
10 Tennant 201 New Earth 2006 5000000023 New Earth New … New York yes
10 Tennant 202 Tooth and claw 2006 1879 Earth Scotland
Torchwood house;
Near Balmoral yes
10 Tennant 203 school Reunion 2007 2007 Earth England Deffry Vale yes
10 Tennant 204
the Girl in the
Fireplace 1727 1744 Earth France Paris yes
3 Pertwee 51
Spearhead from
Space 1970 1990 Earth England London and other no
3 Pertwee 55
Terror of the
Autons 1971 1971 Earth England Luigi Rossini's Circus no
3 Pertwee 58 Colony in Space 1971 2472
planet
Uxarieus no
3 Pertwee 59 The Daemons 1971 1971 Earth England Devil's End; Wiltshire no
3 Pertwee 60 Day of the Daleks 1972 2100 Earth England
Auderly House and
environs no
41. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
@SVDataScience
Schema evolution
• Use cases
• New data added to an event stream
• Need to see historic data with new data (and
the schema has changed a lot)
• Business has changed the field/column name
43. © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.@SVDataScience
QUESTIONS?
Editor's Notes Description
You have your Hadoop cluster, and you are ready to fill it up with data, but wait: Which format should you use to store your data? Should you store it in Plain Text, Sequence File, Avro, or Parquet? (And should you compress it?) This talk will take a closer look at some of the trade-offs, and will cover the How, Why, and When of choosing one format over another.
Do not support block compression
Once they are compressed they are not splittable anymore increasing read performance cost
Each data file contains the values for a set of rows
Within a data file, the values from each column are organized so that they are adjacent, enabling good compression values
No results query 1 (which is count no conditions). This is because stinger is has meta data about the amount of data in the table (only when it’s an internal table).