© 2013 Acxiom Corporation. All Rights Reserved.
Large-Scale ETL Processing - Hadoop File Formats
Jakub Wszolek (jwszol@acxiom.com)
Future3 - 2016
ETL with Hadoop
• Querying and reporting on data swiftly requires a sophisticated storage format
• Systems flexible enough to load any delimited source
• Systems that are able to detect layout changes
• Systems that can catch potential data issues
• BI platform
• Automation of processes and services
Hadoop analytics architecture
Hadoop file formats
• A key factor in Big Data processing, query performance, and optimization
• Schema evolution
• Compression and splittability
• Optimized storage space utilization
• Data processing:
-Write performance
-Partial read
-Full read
Available File Formats
• Text/CSV (STORED AS TEXTFILE)
• JSON
• Sequence file (STORED AS SEQUENCEFILE)
-Binary key/value pair format
• RC File
-Record columnar format
• Avro/Trevni
• Parquet
• ORC
-Optimized Row Columnar format
All of the formats support general-purpose compression:
• ZLIB (GZip) – tight compression (slower)
• Snappy – lighter compression (faster)
Row/Column oriented
Sequence file
• Flat file consisting of binary key/value pairs
• Extensively used in MapReduce as an input/output format
• Each record is a <key, value> pair
• In this use case, key and value are instances of org.apache.hadoop.io.Text
• KEY = record name/filename + unique ID
• VALUE = content as a UTF-8 encoded string
• Hive has to read a full row and decompress it even if only one column is requested.
RCFile
• RCFile (Record Columnar File) is a data placement structure designed for MapReduce-based data warehouse systems
• RCFile stores table data in a flat file consisting of binary key/value pairs.
• RCFile stores the metadata of a row split as the key part of a record, and all the data of the row split as the value part.
• hive --rcfilecat [--start=start_offset] [--length=len] [--verbose] fileName
RCFile
ORCFile
• Column-oriented
• Lightweight indexes stored within the file
-Ability to skip row groups that don’t pass predicate filtering
• Includes basic statistics (min, max, sum, count) on columns
• A larger default block size of 250 MB optimizes for large sequential reads on HDFS, giving more throughput and fewer files, which reduces load on the NameNode.
ORCFile
ORCFile Usage
CREATE EXTERNAL TABLE testing_campaign (
  advertiser_id STRING,
  order_id STRING,
  `order` STRING,  -- ORDER is a reserved word, so quote it
  start_date STRING,
  end_date STRING,
  creative_library_enabled STRING,
  billing_invoice_code STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
-- STORED AS ORC is the equivalent shorthand for the SerDe and formats above
LOCATION '/user/test/…'
TBLPROPERTIES ('orc.compress' = 'ZLIB');
key                              | default | comments
orc.compress                     | ZLIB    | NONE, ZLIB, SNAPPY
orc.compress.size                | 256 KB  | Number of bytes in each compression chunk
orc.stripe.size                  | 64 MB   | Each ORC stripe is processed in one map task
hive.exec.orc.default.block.size | 250 MB  | Default file system block size for ORC files
Parquet
• Column-oriented
• Parquet came out of a collaboration between Twitter and Cloudera in 2013
• Well suited to querying “wide” tables with many columns and to aggregations such as AVG() over the values of a single column
• Schema evolution
-Columns can be added at the end
File size compression
What is AVRO?
• Data serialization framework
• Language-neutral data serialization system (SerDe example)
- Data serialization is a mechanism for translating data in a computer environment (memory buffers, data structures, object state) into a binary or textual form that can be transported over a network or stored on persistent storage media.
• Developed by Doug Cutting, the father of Hadoop (2009)
• Avro is a preferred tool for serializing data in Hadoop
• Avro uses the JSON format to declare data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.
Features of Avro
• Avro creates a binary, structured format that is both compressible and splittable
- used as the input to Hadoop MapReduce jobs
• Avro provides rich data structures
- e.g., a record that contains an array, an enumerated type, and a sub-record
• Avro creates a self-describing file called an Avro data file
- stores data along with its schema in the metadata section
SerDe (JSON/Parquet/AVRO…) – serialization example
CREATE TABLE testing_campaign
PARTITIONED BY (ingestion_dt STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/...'
TBLPROPERTIES ('avro.schema.url'='hdfs://cloudera.loc/user/.../schema.avsc');
Schema in AVRO
{
"type" : "record",
"name" : "TestingData",
"doc" : "Schema generated by jwszol",
"fields" : [ {
"name" : "name",
"type" : [ "null", "string" ],
"doc" : "Type inferred from 'jwszol'"
}, {
"name" : "value",
"type" : ["null", "string" ],
"doc" : "Type inferred from '100'"
}, {
"name" : "id",
"type" : [ "null", "string" ],
"doc" : "Type inferred from '1'"
}, {
"name" : "size",
"type" : [ "null", "string" ],
"doc" : "Type inferred from '123'"
} ]
}
AVRO file
Tested use cases
• Simple conversion: delimited text to Avro – PASSED
• Field missing from the data file – PASSED (see the sketch after this list)
• Fields rearranged in the data file – PASSED
• Field name changed in the data file (still mapped to the old field name) – PASSED
• Optional and required fields defined – PASSED
• Basic transformation of the data (timestamp reformatting) – PASSED
• Whole column missing from the text file – PASSED
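The deck does not include code for these tests; the following is a minimal sketch, assuming the Apache Avro Python package (avro 1.7.x, as referenced on the API slide), of how the "field missing from the data file" case can pass: records written with an older schema are read back through a newer reader schema whose extra field carries a default. File names, field names, and values are illustrative, not the original test code.

import json
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Old schema: the data files were written without the "size" field.
old_schema = avro.schema.parse(json.dumps({
    "type": "record", "name": "TestingData",
    "fields": [
        {"name": "name",  "type": ["null", "string"], "default": None},
        {"name": "value", "type": ["null", "string"], "default": None},
    ]}))

# New reader schema: adds "size" with a default so old records still resolve.
new_schema = avro.schema.parse(json.dumps({
    "type": "record", "name": "TestingData",
    "fields": [
        {"name": "name",  "type": ["null", "string"], "default": None},
        {"name": "value", "type": ["null", "string"], "default": None},
        {"name": "size",  "type": ["null", "string"], "default": None},
    ]}))

# Write a file with the old schema (no "size" field).
writer = DataFileWriter(open("old_data.avro", "wb"), DatumWriter(), old_schema)
writer.append({"name": "jwszol", "value": "100"})
writer.close()

# Read it back with the new schema; readers_schema keyword per the avro 1.7.x Python API.
reader = DataFileReader(open("old_data.avro", "rb"), DatumReader(readers_schema=new_schema))
for record in reader:
    print(record)  # "size" resolves to its default, e.g. {'name': 'jwszol', 'value': '100', 'size': None}
reader.close()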
Python code
Simple schema example:
Serialization:
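The schema and serialization screenshots from the original slide are not reproduced in this export. Below is a minimal sketch of the same steps, assuming the Apache Avro Python package (avro 1.7.x): declare a simple schema, write a record to an Avro data file, and read it back. The file name and record values are illustrative.

import json
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# Simple schema example: the TestingData record from the "Schema in AVRO" slide.
schema_json = json.dumps({
    "type": "record",
    "name": "TestingData",
    "fields": [
        {"name": "name",  "type": ["null", "string"]},
        {"name": "value", "type": ["null", "string"]},
        {"name": "id",    "type": ["null", "string"]},
        {"name": "size",  "type": ["null", "string"]},
    ],
})
schema = avro.schema.parse(schema_json)  # avro 1.7.x API; some releases expose this as avro.schema.Parse

# Serialization: append records to a self-describing Avro data file.
writer = DataFileWriter(open("testing_data.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "jwszol", "value": "100", "id": "1", "size": "123"})
writer.close()

# Deserialization: the schema is read from the file metadata, not supplied by the caller.
reader = DataFileReader(open("testing_data.avro", "rb"), DatumReader())
for record in reader:
    print(record)
reader.close()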
API
• Java supported: https://avro.apache.org/docs/1.7.7/gettingstartedjava.html
• Python supported: https://avro.apache.org/docs/1.7.6/gettingstartedpython.html
• AVRO tools (CDH5 embedded): https://mvnrepository.com/artifact/org.apache.avro/avro-tools/1.7.4
What’s the best choice?
• For writing
-Is the data format compatible with your processing and querying tools?
-Does your schema evolve over time?
-Preserving data types
-Speed concerns (Avro/Parquet/ORC need additional parsing to format the data, which increases overall write time)
Writing results
[Charts: write times in seconds for Text, Avro, Parquet, Sequence, and ORC on Hive 1.1, Cloudera CDH 5.4. Dataset 1: 10 million rows, 10 columns (roughly a 0–70 s scale). Dataset 2: 4 million rows, 1,000 columns (roughly a 0–700 s scale).]
What’s the best choice?
• For reading
-Compression, regardless of format, increases query speed
-Column-specific queries over a group of columns (Parquet/ORC)
-Parquet and ORC optimize read performance at the expense of write performance
Reading results
[Charts: query times in seconds for Text, Avro, Parquet, Sequence, and ORC on Hive 1.1, Cloudera CDH 5.4. Dataset 1 (10 million rows, 10 columns): Query 1 (0 conditions), Query 2 (5 conditions), Query 3 (10 conditions). Dataset 2 (4 million rows, 1,000 columns): Queries 1–4 (0, 5, 10, and 20 conditions).]
Reading results (Impala)
[Charts: query times in seconds for Text, Avro, Parquet, and Sequence on Impala, CDH 5.4, using the same queries and datasets as the Hive results above.]
ACXIOM - Marketing Analytics Environment
• Bring together online and offline data to get a complete, 360-degree view of your customer
• Actionable recommendations on how to adjust your digital marketing to reach your goals
• Measure and analyze your marketing spend across all channels
• Data loads (Aug – Dec 2015): 268,980,402,366 records
http://www.acxiom.com/marketing-analytics-environment/
References
• http://www.acxiom.com/marketing-analytics-environment/
• https://www.datanami.com/2014/09/01/five-steps-to-running-etl-on-hadoop-for-web-companies/
• http://slideplayer.com/slide/6666437/
• https://cwiki.apache.org/confluence/display/Hive/RCFile
• http://axbigdata.blogspot.com/2014/05/hive.html
• http://www.slideshare.net/Hadoop_Summit/innovations-in-apache-hadoop-mapreduce-pig-hive-for-improving-query-performance
• http://www.semantikoz.com/blog/wp-content/uploads/2014/02/Orc-File-Layout.png
• http://www.slideshare.net/StampedeCon/choosing-an-hdfs-data-storage-format-avro-vs-parquet-and-more-stampedecon-2015
• http://www.slideshare.net/Hadoop_Summit/w-1205p230-aradhakrishnan-v3
Questions?
Thank you!
twitter.com/jwszol