Page1
Transformation Processing Smackdown
Spark vs Hive vs Pig
Lester Martin DevNexus 2017
Page2
Connection before Content
Lester Martin – Hadoop/Spark Trainer & Consultant
lester.martin@gmail.com
http://lester.website (links to blog, twitter,
github, LI, FB, etc)
Page3
Agenda
• Present Frameworks
• File Formats
• Source to Target Mappings
• Data Quality
• Data Profiling
• Core Processing Functionality
• Custom Business Logic
• Mutable Data Concerns
• Performance
Lots of Code!
Page4
Standard Disclaimers Apply
Wide Topic – Multiple Frameworks – Limited Time, so…
• Simple use cases
• Glad to enhance https://github.com/lestermartin/oss-transform-processing-comparison
• ALWAYS 2+ ways to skin a cat
• Especially with Spark ;-)
• CLI, not GUI, tools
• Others in that space such as Talend, Informatica & Syncsort
• ALL code compiles in PPT ;-)
• Won’t explain all examples!!!
Page5
Apache Pig – http://pig.apache.org
• A high-level data-flow scripting language (Pig Latin)
• Run as standalone scripts or use the interactive shell
• Executes on Hadoop
• Uses lazy execution
Grunt shell
Page6
Simple and Novel Commands
Pig Command Description
LOAD Read data from file system
STORE Write data to file system
FOREACH Apply expression to each record and output 1+ records
FILTER Apply predicate and remove records that do not return true
GROUP/COGROUP Collect records with the same key from one or more inputs
JOIN Join 2+ inputs based on a key; various join algorithms exist
ORDER Sort records based on a key
DISTINCT Remove duplicate records
UNION Merge two data sets
SPLIT Split data into 2+ sets based on filter conditions
STREAM Send all records through a user provided executable
SAMPLE Read a random sample of the data
LIMIT Limit the number of records
Page7
Executing Scripts in Ambari Pig View
Page8
Apache Hive – http://hive.apache.org
• Data warehouse system for Hadoop
• Create schema/table definitions that point to data in HDFS
• Treat your data in Hadoop as tables
• SQL 92
• Interactive queries at scale
Page9
Hive’s Alignment with SQL
SQL Datatypes: INT, TINYINT/SMALLINT/BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP, ARRAY, MAP, STRUCT, UNION, DECIMAL, CHAR, VARCHAR, DATE
SQL Semantics: SELECT, LOAD, INSERT from query; expressions in WHERE and HAVING; GROUP BY, ORDER BY, SORT BY; CLUSTER BY, DISTRIBUTE BY; sub-queries in FROM clause; ROLLUP and CUBE; UNION; LEFT, RIGHT and FULL INNER/OUTER JOIN; CROSS JOIN, LEFT SEMI JOIN; windowing functions (OVER, RANK, etc.); sub-queries for IN/NOT IN, HAVING; EXISTS / NOT EXISTS; INTERSECT, EXCEPT
Page10
Hive Query Process
1. User issues SQL query via the Web UI, JDBC/ODBC, or the CLI to HiveServer2
2. Hive (compiler, optimizer, executor) parses and plans the query, consulting the Hive MetaStore (MySQL, PostgreSQL, Oracle)
3. Query is converted to a MapReduce, Tez, or Spark job and executed on Hadoop with data-local processing
Page11
Submitting Hive Queries – CLI and GUI Tools
Page12
Submitting Hive Queries – Ambari Hive View
Page13
Apache Spark – http://spark.apache.org
• A data access engine for fast, large-scale data processing
• Designed for iterative in-memory computations and interactive data mining
• Provides expressive multi-language APIs for Scala, Java, R and Python
• Data workers can use built-in libraries to rapidly iterate over data for:
– ETL
– Machine learning
– SQL workloads
– Stream processing
– Graph computations
Page14
Spark Executors & Cluster Deployment Options
• Responsible for all application workload processing
– The "workers" of a Spark application
– Includes the SparkContext serving as the "master"
• Schedules tasks
• Pre-created in shells & notebooks
• Exist for the life of the application
• Standalone mode and cluster options
– YARN
– Mesos
(diagram: multiple executors running across an HDP cluster)
Page15
Spark SQL Overview
 A module built on top of Spark Core
 Provides a programming abstraction for distributed processing of
large-scale structured data in Spark
 Data is described as a DataFrame with rows, columns and a schema
 Data manipulation and access is available with two mechanisms
– SQL Queries
– DataFrames API
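A minimal sketch of the two mechanisms side by side (assuming a hiveContext and the flight Hive table used in later slides):

// DataFrames API: aggregate flights by origin
val byOrigin = hiveContext.table("flight").groupBy("origin").count()

// The same result via a SQL query
val byOriginSql = hiveContext.sql(
  "SELECT origin, COUNT(*) AS cnt FROM flight GROUP BY origin")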
Page16
The DataFrame Visually
(diagram: data files and pre-existing Hive data on HDFS are input by Spark SQL, converted to RDD partitions, and represented logically as a table of columns Col(1) … Col(n))
Page17
Apache Zeppelin – http://zeppelin.apache.org
Page18
Still Based on MapReduce Principles
sc.textFile("/some-hdfs-data") \
  .flatMap(lambda line: line.split(" ")) \
  .map(lambda word: (word, 1)) \
  .reduceByKey(lambda a, b: a + b, numPartitions=3) \
  .collect()
textFile → RDD[String]
flatMap → RDD[String] (individual words)
map → RDD[(String, Int)]
reduceByKey → RDD[(String, Int)]
collect → Array[(String, Int)]
Page19
ETL Requirements
• Error handling
• Alerts / notifications
• Logging
• Lineage & job statistics
• Administration
• Reusability
• Performance & scalability
• Source code management
• Read/write multiple file formats
and persistent stores
• Source-to-target mappings
• Data profiling
• Data quality
• Common processing
functionality
• Custom business rules
injection
• Merging changed records
Page20
ETL vs ELT
Source: http://www.softwareadvice.com/resources/etl-vs-elt-for-your-data-warehouse/
Page21
File Formats
The ability to read & write many different file formats is critical
• Delimited values (comma, tab, etc)
• XML
• JSON
• Avro
• Parquet
• ORC
• Esoteric formats such as EBCDIC and compact roll-your-own (RYO) solutions
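Parquet and ORC are handled directly by Spark's DataFrame reader/writer (ORC needs a HiveContext; Avro needs the separate spark-avro package). A minimal sketch, assuming a catalogDF DataFrame like the one built a few slides ahead:

// write the same data out in two columnar formats
catalogDF.write.format("parquet").save("/otpc/ff/parquet/catalog")
catalogDF.write.format("orc").save("/otpc/ff/orc/catalog")

// read Parquet back; the schema travels with the data
val parquetDF = sqlContext.read.parquet("/otpc/ff/parquet/catalog")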
Page22
File Formats: Delimited Values
Delimited datasets are very commonplace in Big Data clusters
Simple example file: catalog.del
Programming Pig|Alan Gates|23.17|2016
Apache Hive Essentials|Dayong Du|39.99|2015
Spark in Action|Petar Zecevic|41.24|2016
Page23
Pig Code for Delimited File
book_catalog = LOAD
'/otpc/ff/del/data/catalog.del'
USING PigStorage('|')
AS (title:chararray, author:chararray,
price:float, year:int);
DESCRIBE book_catalog;
DUMP book_catalog;
Page24
Pig Output for Delimited File
book_catalog: {title: chararray, author: chararray, price: float, year: int}
(Programming Pig,Alan Gates,23.17,2016)
(Apache Hive Essentials,Dayong Du,39.99,2015)
(Spark in Action,Petar Zecevic,41.24,2016)
Page25
Hive Code for Delimited File
CREATE EXTERNAL TABLE book_catalog_pipe(
title string, author string,
price float, year int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/otpc/ff/del/data';
Page26
Hive Output for Delimited File Schema
desc book_catalog_pipe;
| col_name | data_type | comment |
+-----------+------------+----------+--+
| title | string | |
| author | string | |
| price | float | |
| year | int | |
Page27
Hive Output for Delimited File Contents
SELECT * FROM book_catalog_pipe;
| title | author | price | year |
+-----------+---------+-------+------+--+
| Progra... | Alan... | 23.17 | 2016 |
| Apache... | Dayo... | 39.99 | 2015 |
| Spark ... | Peta... | 41.24 | 2016 |
Page28
Spark Code for Delimited File
val catalogRDD = sc.textFile(
"hdfs:///otpc/ff/del/data/catalog.del")
case class Book(title: String, author:
String, price: Float, year: Int)
val catalogDF = catalogRDD
.map(b => b.split('|'))
.map(b => Book(b(0), b(1), b(2).toFloat,
b(3).toInt))
.toDF()
Page29
Spark Output for Delimited File Schema
catalogDF.printSchema()
root
|-- title: string (nullable = true)
|-- author: string (nullable = true)
|-- price: float (nullable = false)
|-- year: integer (nullable = false)
Page30
Spark Output for Delimited File Contents
catalogDF.show()
| title| author|price|year|
+--------------+-----------+-----+----+
|Programming...| Alan Gates|23.17|2016|
|Apache Hive...| Dayong Du|39.99|2015|
|Spark in Ac...|Petar Ze...|41.24|2016|
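The hand-rolled split above can also be replaced by the databricks spark-csv package (assumed to be on the classpath, mirroring the spark-xml usage shown later):

val catalogDF = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("delimiter", "|")
  .option("inferSchema", "true")
  .load("/otpc/ff/del/data/catalog.del")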
Page31
File Formats: XML
Simple example file: catalog.xml
<CATALOG>
<BOOK>
<TITLE>Programming Pig</TITLE>
<AUTHOR>Alan Gates</AUTHOR>
<PRICE>23.17</PRICE>
<YEAR>2016</YEAR>
</BOOK>
<!-- other 2 BOOKs not shown -->
</CATALOG>
Page32
Pig Code for XML File
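-- assumes piggybank.jar has been REGISTERed; it provides XMLLoader and XPath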
raw = LOAD '/otpc/ff/xml/catalog.xml'
USING XMLLoader('BOOK') AS (x:chararray);
formatted = FOREACH raw GENERATE
XPath(x, 'BOOK/TITLE') AS title,
XPath(x, 'BOOK/AUTHOR') AS author,
(float) XPath(x, 'BOOK/PRICE') AS price,
(int) XPath(x, 'BOOK/YEAR') AS year;
Page33
Hive Code for XML File
CREATE EXTERNAL TABLE book_catalog_xml(str string)
LOCATION '/otpc/ff/xml/flat';
CREATE TABLE book_catalog STORED AS ORC AS
SELECT xpath_string(str,'BOOK/TITLE') AS title,
xpath_string(str,'BOOK/AUTHOR') AS author,
xpath_float( str,'BOOK/PRICE') AS price,
xpath_int( str,'BOOK/YEAR') AS year
FROM book_catalog_xml;
Page34
Spark Code for XML File
val df = sqlContext
.read
.format("com.databricks.spark.xml")
.option("rowTag", "BOOK")
.load("/otpc/ff/xml/catalog.xml")
Page35
File Formats: JSON
Simple example file: catalog.json
{"title":"Programming Pig", "author":"Alan Gates",
"price":23.17, "year":2016}
{"title":"Apache Hive Essentials", "author":"Dayong Du",
"price":39.99, "year":2015}
{"title":"Spring in Action", "author":"Petar Zecevic",
"price":41.24, "year":2016}
Page36
Pig Code for JSON File
book_catalog =
LOAD '/otpc/ff/json/data/catalog.json'
USING JsonLoader('title:chararray,
author:chararray,
price:float,
year:int');
Page37
Hive Code for JSON File
CREATE EXTERNAL TABLE book_catalog_json(
title string, author string,
price float, year int)
ROW FORMAT SERDE 'o.a.h.h.d.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/otpc/ff/json/data';
Page38
Spark Code for JSON File
val df = sqlContext
.read
.format("json")
.load("/otpc/ff/json/catalog.json")
Page39
WINNER: File Formats
Page40
Data Set for Examples
Reliance on Hive Metastore for Pig and Spark SQL
Page41
Source to Target Mappings
A classic ETL need is to map one dataset to another; this includes the following scenarios
Column Presence → Action
• In Source and Target → move data from the source column to the target column (could be renamed, cleaned, transformed, etc)
• In Source, not in Target → ignore this column
• In Target, not in Source → implies a hard-coded or calculated value will be inserted or updated
Page42
Source to Target Mappings Use Case
Create new dataset from airport_raw
• Change column names
• airport_code to airport_cd
• airport to name
• Carry over as named
• city, state, country
• Exclude
• latitude and longitude
• Hard-code new field
• gov_agency as 'FAA'
Page43
Pig Code for Data Mapping
src_airport = LOAD 'airport_raw'
USING o.a.h.h.p.HCatLoader();
tgt_airport = FOREACH src_airport GENERATE
airport_code AS airport_cd,
airport AS name, city, state, country,
'FAA' AS gov_agency:chararray;
DESCRIBE tgt_airport;
DUMP tgt_airport;
Page44
Pig Output for Data Mapping
tgt_airport: {airport_cd: chararray, name: chararray, city: chararray, state: chararray, country: chararray, gov_agency: chararray}
(00M,Thigpen,BaySprings,MS,USA,FAA)
(00R,LivingstonMunicipal,Livingston,TX,USA,FAA)
(00V,MeadowLake,ColoradoSprings,CO,USA,FAA)
Page45
Hive Code for Data Mapping
CREATE TABLE tgt_airport STORED AS ORC AS
SELECT airport_code AS airport_cd,
airport AS name,
city, state, country,
'FAA' AS gov_agency
FROM airport_raw;
Page46
Hive Output for Mapped Schema
| col_name | data_type | comment |
+------------+------------+----------+--+
| airport_cd | string | |
| name | string | |
| city | string | |
| state | string | |
| country | string | |
| gov_agency | string | |
Page47
Hive Output for Mapped Contents
SELECT * FROM tgt_airport;
| _cd | name | city | st | cny | g_a |
+-----+---------+---------+----+-----+-----+
| 00M | Thig... | BayS... | MS | USA | FAA |
| 00R | Livi... | Livi... | TX | USA | FAA |
| 00V | Mead... | Colo... | CO | USA | FAA |
Page48
Spark Code for Data Mapping
val airport_target = hiveContext
.table("airport_raw")
.drop("lat").drop("long")
.withColumnRenamed("airport_code",
"airport_cd")
.withColumnRenamed("airport", "name")
.withColumn("gov_agency", lit("FAA"))
Page49
Spark Output for Mapped Schema
root
|-- airport_cd: string (nullable = true)
|-- name: string (nullable = true)
|-- city: string (nullable = true)
|-- state: string (nullable = true)
|-- country: string (nullable = true)
|-- gov_agency: string (nullable = false)
Page50
Spark Output for Mapped Contents
airport_target.show()
|_cd| name| city|st|cny|g_a|
+---+-------------+-------------+--+---+---+
|00M| Thigpen| BaySprings|MS|USA|FAA|
|00R|Livingston...|Livingston...|TX|USA|FAA|
|00V| MeadowLake|ColoradoSp...|CO|USA|FAA|
Page51
WINNER: Source to Target Mapping
Page52
Data Quality
DQ is focused on detecting/correcting/enhancing input data
• Data type conversions / casting
• Numeric ranges
• Currency validation
• String validation
• Leading / trailing spaces
• Length
• Formatting (ex: SSN and phone #s)
• Address validation / standardization
• Enrichment
Page53
Numeric Validation Use Case
Validate latitude and longitude values from airport_raw
• Convert them from string to float
• Verify these values are within normal ranges
Attribute Min Max
latitude -90 +90
longitude -180 +180
Page54
Pig Code for Numeric Validation
src_airport = LOAD 'airport_raw' USING HCatLoader();
aprt_cnvrtd = FOREACH src_airport GENERATE
airport_code, (float) latitude, (float) longitude;
ll_not_null = FILTER aprt_cnvrtd BY
( NOT ( (latitude IS NULL) OR
(longitude IS NULL) ) );
valid_airports = FILTER ll_not_null BY
(latitude <= 70) AND (longitude >= -170);
Page55
Hive Code for Numeric Validation
CREATE TABLE airport_stage STORED AS ORC AS
SELECT airport_code,
CAST(latitude AS float),
CAST(longitude AS float)
FROM airport_raw;
CREATE TABLE airport_final STORED AS ORC AS
SELECT * FROM airport_stage
WHERE latitude BETWEEN -80 AND 70
AND longitude BETWEEN -170 AND 180;
Page56
Spark Code for Numeric Validation
val airport_validated = hiveContext
.table("airport_raw")
.selectExpr("airport_code",
"cast(latitude as float) latitude",
"cast(longitude as float) longitude")
.filter("latitude is not null")
.filter("longitude is not null")
.filter("latitude <= 70")
.filter("longitude >= -170")
Page57
String Validation Use Case
Validate city values from airport_raw
• Trim any leading / trailing spaces
• Truncate any characters beyond the first 30
Page58
Pig Code for String Validation
src_airport = LOAD 'airport_raw'
USING HCatLoader();
valid_airports = FOREACH src_airport
GENERATE
airport_code, airport,
SUBSTRING(TRIM(city),0,30) AS city,
state, country;
Page59
Hive Code for String Validation
CREATE TABLE airport_final STORED AS ORC AS
SELECT airport_code, airport,
SUBSTR(TRIM(city),1,30) AS city,
state, country
FROM airport_raw;
Page60
Spark Code for String Validation
val airport_validated = hiveContext
.table("airport_raw")
.withColumnRenamed("city", "city_orig")
.withColumn("city", substring(
trim($"city_orig"),1,30))
.drop("city_orig")
Page61
WINNER: Data Quality
Page62
Data Profiling
Technique used to examine data for different purposes such as
determining accuracy and completeness – drives DQ improvements
• Numbers of records – including null counts
• Avg / max lengths
• Distinct values
• Min / max values
• Mean
• Variance
• Standard deviation
Page63
Data Profiling with Pig
Coupled with Apache DataFu, Pig generates statistics such as the following
Column Name: sales_price
Row Count: 163794
Null Count: 0
Distinct Count: 1446
Highest Value: 70589
Lowest Value: 1
Total Value: 21781793
Mean Value: 132.98285040966093
Variance: 183789.18332067598
Standard Deviation: 428.7064069041609
Page64
Data Profiling with Hive
Column-level statistics for all data types
|col_name|min|max|nulls|dist_ct|
+--------+---+---+-----+-------+
|air_time| 0|757| 0| 316|
|col_name|dist_ct|avgColLn|mxColLn|
+--------+-------+--------+-------+
| city | 2535| 8.407| 32|
Page65
Data Profiling with Spark
Inherent statistics for numeric data types
|summary| air_time|
+-------+-----------------+
| count| 2056494|
| mean|103.9721783773743|
| stddev|67.42112792270458|
| min| 0|
| max| 757|
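This summary is what the DataFrame describe() method returns; for example:

hiveContext.table("flight")
  .describe("air_time")
  .show()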
Page66
WINNER: Data Profiling
Page67
Core Processing Functionality
Expected features to enable data transformation, cleansing & enrichment
• Filtering / splitting
• Sorting
• Lookups / joining
• Union / distinct
• Aggregations / pivoting
• SQL support
• Analytical functions
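Analytical (window) functions are available in all three frameworks; a minimal Spark sketch that ranks flights by departure delay within each origin (per-topic examples follow on the next slides):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

// rank flights by departure delay within each origin airport
val w = Window.partitionBy("origin").orderBy($"dep_delay".desc)
val ranked = hiveContext.table("flight")
  .withColumn("delay_rank", rank().over(w))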
Page68
Filtering Examples; Pig, Hive & Spark
tx_arprt = FILTER arprt BY state == 'TX';
SELECT * FROM arprt WHERE state = 'TX';
val txArprt = hiveContext
.table("arprt")
.filter("state = 'TX'")
Page69
Sorting Examples; Pig, Hive & Spark
srt_flight = ORDER flight BY dep_delay DESC,
unique_carrier, flight_num;
SELECT * FROM flight ORDER BY dep_delay DESC,
unique_carrier, flight_num;
val longestDepartureDelays = hiveContext
.table("flight").sort($"dep_delay".desc,
$"unique_carrier", $"flight_num")
Page70
Joining with Pig
jnRslt = JOIN flights BY tail_num,
planes BY tail_number;
prettier = FOREACH jnRslt GENERATE
flights::flight_date AS flight_date,
flights::tail_num AS tail_num,
-- plus other 17 "flights" attribs
planes::year AS plane_built;
-- ignore other 8 "planes" attribs
Page71
Joining with Hive
SELECT F.*,
P.year AS plane_built
FROM flight F,
plane P
WHERE F.tail_num = P.tail_number;
Page72
Joining with Spark
val flights = hiveContext.table("flight")
val planes = hiveContext.table("plane")
.select("tail_number", "year")
.withColumnRenamed("year", "plane_built")
val augmented_flights = flights
.join(planes)
.where($"tail_num" === $"tail_number")
.drop("tail_number")
Page73
Pig Code for Distinct
planes = LOAD 'plane' USING HCatLoader();
rotos = FILTER planes BY
aircraft_type == 'Rotorcraft';
makers = FOREACH rotos GENERATE
manufacturer;
distinct_makers = DISTINCT makers;
DUMP distinct_makers;
Page74
Pig Output for Distinct
(BELL)
(SIKORSKY)
(AGUSTA SPA)
(AEROSPATIALE)
(COBB INTL/DBA ROTORWAY INTL IN)
Page75
Hive Code for Distinct
SELECT DISTINCT(manufacturer)
FROM plane
WHERE aircraft_type = 'Rotorcraft';
Page76
Hive Output for Distinct
| manufacturer |
+---------------------------------+
| AEROSPATIALE |
| AGUSTA SPA |
| BELL |
| COBB INTL/DBA ROTORWAY INTL IN |
| SIKORSKY |
Page77
Spark Code for Distinct
val rotor_makers = hiveContext
.table("plane")
.filter("aircraft_type = 'Rotorcraft'")
.select("manufacturer")
.distinct()
Page78
Spark Output for Distinct
| manufacturer|
+--------------------+
| BELL|
| SIKORSKY|
| AGUSTA SPA|
| AEROSPATIALE|
|COBB INTL/DBA ROT...|
Page79
Pig Code for Aggregation
flights = LOAD 'flight' USING HCatLoader();
reqd_cols = FOREACH flights GENERATE
origin, dep_delay;
by_orig = GROUP reqd_cols BY origin;
avg_delay = FOREACH by_orig GENERATE
group AS origin,
AVG(reqd_cols.dep_delay) AS avg_dep_delay;
srtd_delay = ORDER avg_delay BY avg_dep_delay DESC;
top5_delay = LIMIT srtd_delay 5;
DUMP top5_delay;
Page80
Pig Output for Aggregation
(PIR,49.5)
(ACY,35.916666666666664)
(ACK,25.558333333333334)
(CEC,23.40764331210191)
(LMT,23.40268456375839)
Page81
Hive Code for Aggregation
SELECT origin,
AVG(dep_delay) AS avg_dep_delay
FROM flight
GROUP BY origin
ORDER BY avg_dep_delay DESC
LIMIT 5;
Page82
Hive Output for Aggregation
| origin | avg_dep_delay |
+---------+---------------------+
| PIR | 49.5 |
| ACY | 35.916666666666664 |
| ACK | 25.558333333333334 |
| CEC | 23.40764331210191 |
| LMT | 23.40268456375839 |
Page83
Spark Code for Aggregation
val sorted_orig_timings = hiveContext
.table("flight")
.select("origin", "dep_delay")
.groupBy("origin").avg()
.withColumnRenamed("avg(dep_delay)",
"avg_dep_delay")
.sort($"avg_dep_delay".desc)
sorted_orig_timings.show(5)
Page84
Spark Output for Aggregation
|origin| avg_dep_delay|
+------+------------------+
| PIR| 49.5|
| ACY|35.916666666666664|
| ACK|25.558333333333334|
| CEC| 23.40764331210191|
| LMT| 23.40268456375839|
Page85
WINNER: Core Processing Functionality
Page86
Custom Business Logic
Implemented via User Defined Functions (UDF)
• Pig and Hive
• Write Java and compile to a JAR
• Register JAR
• Hive can administratively pre-register UDFs at the database level
• Spark can wrap functions at runtime
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
get_year = udf(lambda x: int(x[:4]), IntegerType())
df1.select(get_year(df1["date"]).alias("year"),
           df1["product"]).collect()
Input (df1):
+----------+-------+
|      date|product|
+----------+-------+
|2015-03-12|toaster|
|2015-04-12|   iron|
|2014-12-31| fridge|
|2015-02-03|    cup|
+----------+-------+
Output:
+----+-------+
|year|product|
+----+-------+
|2015|toaster|
|2015|   iron|
|2014| fridge|
|2015|    cup|
+----+-------+
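For comparison, a minimal Scala sketch of the same runtime wrapping (assuming the df1 DataFrame above):

import org.apache.spark.sql.functions.udf

// wrap an ordinary Scala function as a UDF at runtime
val getYear = udf((date: String) => date.substring(0, 4).toInt)

df1.select(getYear(df1("date")).alias("year"), df1("product")).show()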
Page87
WINNER: Custom Business Logic
Page88
Mutable Data – Merge & Replace
See my preso and video on this topic
• http://www.slideshare.net/lestermartin/mutable-data-in-hives-immutable-world
• https://www.youtube.com/watch?v=EUz6Pu1lBHQ
Ingest – bring over the incremental data
Reconcile – perform the merge
Compact – replace the existing data with the newly merged content
Purge – cleanup & prepare to repeat
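As one illustration only (not necessarily the approach in the preso above), the Reconcile step can be sketched in Spark by keeping the newest version of each record, assuming hypothetical base and incremental DataFrames that share an id key and a last_updated timestamp:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// newest record per key wins
val w = Window.partitionBy("id").orderBy($"last_updated".desc)

val reconciled = base.unionAll(incremental)  // hypothetical DataFrames
  .withColumn("rn", row_number().over(w))    // rank versions per key
  .filter("rn = 1")                          // keep only the newest version
  .drop("rn")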
Page89
WINNER: Mutable Data
Page90
Performance
Scalability is based on the size of the cluster
• Tez and Spark have MR optimizations
• Can link multiple maps and reduces together without having to write intermediate data to HDFS
• Every reducer does not require a map phase
• Hive and Spark SQL have query optimizers
• Spark has the edge
• Caching data to memory can avoid extra reads from disk
• Resources dedicated for entire life of the application
• Scheduling of tasks drops from 15-20s to 15-20ms
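The caching point, as a minimal sketch; the first action populates the cache and later actions reuse it:

val flights = hiveContext.table("flight").cache()

flights.count()                            // first action reads from disk and caches
flights.groupBy("origin").count().show()   // subsequent work hits memory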
Page91
WINNER: Performance
Page92
Recommendations
Review ALL THREE frameworks back at “your desk”
Decision Criteria…
• Existing investments
• Forward-looking beliefs
• Adaptability & current skills
• It’s a “matter of style”
• Polyglot programming is NOT a bad thing!!
Share your findings via blogs and local user groups
Page93
Questions?
Lester Martin – Hadoop/Spark Trainer & Consultant
lester.martin@gmail.com
http://lester.website (links to blog, twitter, github, LI, FB, etc)
THANKS FOR YOUR TIME!!

Más contenido relacionado

La actualidad más candente

Simplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaSimplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaDatabricks
 
Building social network with Neo4j and Python
Building social network with Neo4j and PythonBuilding social network with Neo4j and Python
Building social network with Neo4j and PythonAndrii Soldatenko
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best PracticesCloudera, Inc.
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in SparkDatabricks
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingDatabricks
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframeJaemun Jung
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingCloudera, Inc.
 
Gain 3 Benefits with Delta Sharing
Gain 3 Benefits with Delta SharingGain 3 Benefits with Delta Sharing
Gain 3 Benefits with Delta SharingDatabricks
 
Data platform data pipeline(Airflow, Kubernetes)
Data platform data pipeline(Airflow, Kubernetes)Data platform data pipeline(Airflow, Kubernetes)
Data platform data pipeline(Airflow, Kubernetes)창언 정
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
 
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)Hadoop / Spark Conference Japan
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkReynold Xin
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 

La actualidad más candente (20)

Spark
SparkSpark
Spark
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Simplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaSimplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks Delta
 
Building social network with Neo4j and Python
Building social network with Neo4j and PythonBuilding social network with Neo4j and Python
Building social network with Neo4j and Python
 
Hadoop vs Apache Spark
Hadoop vs Apache SparkHadoop vs Apache Spark
Hadoop vs Apache Spark
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
Gain 3 Benefits with Delta Sharing
Gain 3 Benefits with Delta SharingGain 3 Benefits with Delta Sharing
Gain 3 Benefits with Delta Sharing
 
Data platform data pipeline(Airflow, Kubernetes)
Data platform data pipeline(Airflow, Kubernetes)Data platform data pipeline(Airflow, Kubernetes)
Data platform data pipeline(Airflow, Kubernetes)
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
A Deeper Understanding of Spark Internals (Hadoop Conference Japan 2014)
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache Spark
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 

Similar a File Formats Showdown: Pig vs Hive vs Spark

Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Databricks
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldDatabricks
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and RDatabricks
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksAnyscale
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosEuangelos Linardos
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferretAndrii Gakhov
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 

Similar a File Formats Showdown: Pig vs Hive vs Spark (20)

20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0Large-Scale Data Science in Apache Spark 2.0
Large-Scale Data Science in Apache Spark 2.0
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study Notes
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Enabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and REnabling exploratory data science with Spark and R
Enabling exploratory data science with Spark and R
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code Workshop
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
 
Apache Spark Overview @ ferret
Apache Spark Overview @ ferretApache Spark Overview @ ferret
Apache Spark Overview @ ferret
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 

Último

Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 

Último (20)

Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 

File Formats Showdown: Pig vs Hive vs Spark

  • 1. Page1 Transformation Processing Smackdown Spark vs Hive vs Pig Lester Martin DevNexus 2017
  • 2. Page2 Connection before Content Lester Martin – Hadoop/Spark Trainer & Consultant lester.martin@gmail.com http://lester.website (links to blog, twitter, github, LI, FB, etc)
  • 3. Page3 Agenda • Present Frameworks • File Formats • Source to Target Mappings • Data Quality • Data Profiling • Core Processing Functionality • Custom Business Logic • Mutable Data Concerns • Performance Lots of Code !
  • 4. Page4 Standard Disclaimers Apply Wide Topic – Multiple Frameworks – Limited Time, so… • Simple use cases • Glad to enhance https://github.com/lestermartin/oss-transform-processing-comparison • ALWAYS 2+ ways to skin a cat • Especially with Spark ;-) • CLI, not GUI, tools • Others in that space such as Talend, Informatica & Syncsort • ALL code compiles in PPT ;-) • Won’t explain all examples!!!
  • 5. Page5 Apache Pig – http://pig.apache.org • A high-level data-flow scripting language (Pig Latin) • Run as standalone scripts or use the interactive shell • Executes on Hadoop • Uses lazy execution Grunt shell
  • 6. Page6 Simple and Novel Commands Pig Command Description LOAD Read data from file system STORE Write data to file system FOREACH Apply expression to each record and output 1+ records FILTER Apply predicate and remove records that do not return true GROUP/COGROUP Collect records with the same key from one or more inputs JOIN Joint 2+ inputs based on a key; various join algorithms exist ORDER Sort records based on a key DISTINCT Remove duplicate records UNION Merge two data sets SPLIT Split data into 2+ more sets based on filter conditions STREAM Send all records through a user provided executable SAMPLE Read a random sample of the data LIMIT Limit the number of records
  • 7. Page7 Executing Scripts in Ambari Pig View
  • 8. Page8 Apache Hive – http://hive.apache.org • Data warehouse system for Hadoop • Create schema/table definitions that point to data in HDFS • Treat your data in Hadoop as tables • SQL 92 • Interactive queries at scale
  • 9. Page9 Hive’s Alignment with SQL SQL Datatypes SQL Semantics INT SELECT, LOAD, INSERT from query TINYINT/SMALLINT/BIGINT Expressions in WHERE and HAVING BOOLEAN GROUP BY, ORDER BY, SORT BY FLOAT CLUSTER BY, DISTRIBUTE BY DOUBLE Sub-queries in FROM clause STRING GROUP BY, ORDER BY BINARY ROLLUP and CUBE TIMESTAMP UNION ARRAY, MAP, STRUCT, UNION LEFT, RIGHT and FULL INNER/OUTER JOIN DECIMAL CROSS JOIN, LEFT SEMI JOIN CHAR Windowing functions (OVER, RANK, etc.) VARCHAR Sub-queries for IN/NOT IN, HAVING DATE EXISTS / NOT EXISTS INTERSECT, EXCEPT
  • 10. Page10 Hive Query Process User issues SQL query Hive parses and plans query Query converted to YARN job and executed on Hadoop 2 3 Web UI JDBC / ODBC CLI Hive SQL 1 1 HiveServer2 Hive MR/Tez/Spark Compiler Optimizer Executor 2 Hive MetaStore (MySQL, Postgresql, Oracle) MapReduce, Tez or Spark Job Data DataData Hadoop 3 Data-local processing
  • 11. Page11 Submitting Hive Queries – CLI and GUI Tools
  • 12. Page12 Submitting Hive Queries – Ambari Hive View
  • 13. Page13 Apache Spark – http://spark.apache.org  A data access engine for fast, large-scale data processing  Designed for iterative in-memory computations and interactive data mining  Provides expressive multi-language APIs for Scala, Java, R and Python  Data workers can use built-in libraries to rapidly iterate over data for: – ETL – Machine learning – SQL workloads – Stream processing – Graph computations
  • 14. Page14 Spark Executors & Cluster Deployment Options  Responsible for all application workload processing – The "workers" of a Spark application – Includes the SparkContext serving as the "master” • Schedules tasks • Pre-created in shells & notebooks  Exist for the life of the application  Standalone mode and cluster options – YARN – Mesos HDP Cluster Executor Executor Executor Executor Executor Executor
  • 15. Page15 Spark SQL Overview  A module built on top of Spark Core  Provides a programming abstraction for distributed processing of large-scale structured data in Spark  Data is described as a DataFrame with rows, columns and a schema  Data manipulation and access is available with two mechanisms – SQL Queries – DataFrames API
  • 16. Page16 The DataFrame Visually Col(1) Col(2) Col(3) … Col(n) RDD 1.1x RDD 1.3x RDD 1.2x Represented logically as…. Pre-existing Hive data on HDFS data file 1 data file 2 data file 3 Spark SQL Converted to… Data input by Spark SQL
  • 17. Page17 Apache Zeppelin – http://zeppelin.apache.org
  • 18. Page18 Still Based on MapReduce Principles sc.textFile("/some-hdfs-data") mapflatMap reduceByKey collecttextFile .flatMap(lambda line: line.split(" ")) .map(lambda line: (word, 1))) .reduceByKey(lambda a,b : a+b, numPartition=3) .collect() RDD[String] RDD[List[String]] RDD[(String, Int)] Array[(String, Int)] RDD[(String, Int)]
  • 19. Page19 ETL Requirements • Error handling • Alerts / notifications • Logging • Lineage & job statistics • Administration • Reusability • Performance & scalability • Source code management • Read/write multiple file formats and persistent stores • Source-to-target mappings • Data profiling • Data quality • Common processing functionality • Custom business rules injection • Merging changed records
  • 20. Page20 ETL vs ELT Source: http://www.softwareadvice.com/resources/etl-vs-elt-for-your-data-warehouse/
  • 21. Page21 File Formats The ability to read & write many different file formats is critical • Delimited values (comma, tab, etc) • XML • JSON • Avro • Parquet • ORC • Esoteric formats such as EBCDIC and compact RYO solutions
  • 22. Page22 File Formats: Delimited Values Delimited datasets are very common place in Big Data clusters Simple example file: catalog.del Programming Pig|Alan Gates|23.17|2016 Apache Hive Essentials|Dayong Du|39.99|2015 Spark in Action|Petar Zecevic|41.24|2016
  • 23. Page23 Pig Code for Delimited File book_catalog = LOAD '/otpc/ff/del/data/catalog.del’ USING PigStorage('|') AS (title:chararray, author:chararray, price:float, year:int); DESCRIBE book_catalog; DUMP book_catalog;
  • 24. Page24 Pig Output for Delimited File book_catalog: {title: chararray,author: chararray,price: float,year: int} (Programming Pig,Alan Gates,23.17,2016) (Apache Hive Essentials,Dayong Du,39.99,2015) (Spark in Action,Petar Zecevic,41.24,2016)
  • 25. Page25 Hive Code for Delimited File CREATE EXTERNAL TABLE book_catalog_pipe( title string, author string, price float, year int) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/otpc/ff/del/data';
  • 26. Page26 Hive Output for Delimited File Schema desc book_catalog_pipe; | col_name | data_type | comment | +-----------+------------+----------+--+ | title | string | | | author | string | | | price | float | | | year | int | |
  • 27. Page27 Hive Output for Delimited File Contents SELECT * FROM book_catalog_pipe; | title | author | price | year | +-----------+---------+-------+------+--+ | Progra... | Alan... | 23.17 | 2016 | | Apache... | Dayo... | 39.99 | 2015 | | Spark ... | Peta... | 41.24 | 2016 |
  • 28. Page28 Spark Code for Delimited File val catalogRDD = sc.textFile( "hdfs:///otpc/ff/del/data/catalog.del") case class Book(title: String, author: String, price: Float, year: Int) val catalogDF = catalogRDD .map(b => b.split('|')) .map(b => Book(b(0), b(1), b(2).toFloat, b(3).toInt)) .toDF()
  • 29. Page29 Spark Output for Delimited File Schema catalogDF.printSchema() root |-- title: string (nullable = true) |-- author: string (nullable = true) |-- price: float (nullable = false) |-- year: integer (nullable = false)
  • 30. Page30 Spark Output for Delimited File Contents catalogDF.show() | title| author|price|year| +--------------+-----------+-----+----+ |Programming...| Alan Gates|23.17|2016| |Apache Hive...| Dayong Du|39.99|2015| |Spark in Ac...|Petar Ze...|41.24|2016|
  • 31. Page31 File Formats: XML Simple example file: catalog.xml <CATALOG> <BOOK> <TITLE>Programming Pig</TITLE> <AUTHOR>Alan Gates</AUTHOR> <PRICE>23.17</PRICE> <YEAR>2016</YEAR> </BOOK> <!-- other 2 BOOKs not shown --> </CATALOG>
  • 32. Page32 Pig Code for XML File raw = LOAD '/otpc/ff/xml/catalog.xml' USING XMLLoader('BOOK') AS (x:chararray); formatted = FOREACH raw GENERATE XPath(x, 'BOOK/TITLE') AS title, XPath(x, 'BOOK/AUTHOR') AS author, (float) XPath(x, 'BOOK/PRICE') AS price, (int) XPath(x, 'BOOK/YEAR') AS year;
  • 33. Page33 Hive Code for XML File CREATE EXTERNAL TABLE book_catalog_xml(str string) LOCATION '/otpc/ff/xml/flat'; CREATE TABLE book_catalog STORED AS ORC AS SELECT xpath_string(str,'BOOK/TITLE') AS title, xpath_string(str,'BOOK/AUTHOR') AS author, xpath_float( str,'BOOK/PRICE') AS price, xpath_int( str,'BOOK/YEAR') AS year FROM book_catalog_xml;
  • 34. Page34 Spark Code for XML File val df = sqlContext .read .format("com.databricks.spark.xml") .option("rowTag", "BOOK") .load("/otpc/ff/xml/catalog.xml")
  • 35. Page35 File Formats: JSON Simple example file: catalog.json {"title":"Programming Pig", "author":"Alan Gates", "price":23.17, "year":2016} {"title":"Apache Hive Essentials", "author":"Dayong Du", "price":39.99, "year":2015} {"title":"Spring in Action", "author":"Petar Zecevic", "price":41.24, "year":2016}
  • 36. Page36 Pig Code for JSON File book_catalog = LOAD '/otpc/ff/json/data/catalog.json’ USING JsonLoader('title:chararray, author:chararray, price:float, year:int');
  • 37. Page37 Hive Code for JSON File CREATE EXTERNAL TABLE book_catalog_json( title string, author string, price float, year int) ROW FORMAT SERDE 'o.a.h.h.d.JsonSerDe’ STORED AS TEXTFILE LOCATION '/otpc/ff/json/data';
  • 38. Page38 Spark Code for JSON File val df = sqlContext .read .format("json") .load("/otpc/ff/json/catalog.json")
  • 40. Page40 Data Set for Examples Reliance on Hive Metastore for Pig and Spark SQL
  • 41. Page41 Source to Target Mappings Classic ETL need to map one dataset to another; includes these scenarios Column Presence Action Source and Target Move data from source column to target column (could be renamed, cleaned, transformed, etc) Source, not in Target Ignore this column Target, not in Source Implies a hard-coded or calculated value will be inserted or updated
  • 42. Page42 Source to Target Mappings Use Case Create new dataset from airport_raw • Change column names • airport_code to airport_cd • airport to name • Carry over as named • city, state, country • Exclude • latitude and longitude • Hard-code new field • gov_agency as ‘FAA’
  • 43. Page43 Pig Code for Data Mapping src_airport = LOAD 'airport_raw’ USING o.a.h.h.p.HCatLoader(); tgt_airport = FOREACH src_airport GENERATE airport_code AS airport_cd, airport AS name, city, state, country, 'FAA' AS gov_agency:chararray; DESCRIBE tgt_airport; DUMP tgt_airport;
  • 44. Page44 Pig Output for Data Mapping target_airport: {airport_cd: chararray, name: chararray, city: chararray, state: chararray, country: chararray, gov_agency: chararray} (00M,Thigpen,BaySprings,MS,USA,FAA) (00R,LivingstonMunicipal,Livingston,TX,USA,F AA) (00V,MeadowLake,ColoradoSprings,CO,USA,FAA)
  • 45. Page45 Hive Code for Data Mapping CREATE TABLE tgt_airport STORED AS ORC AS SELECT airport_code AS airport_cd, airport AS name, city, state, country, 'FAA' AS gov_agency FROM airport_raw;
  • 46. Page46 Hive Output for Mapped Schema | col_name | data_type | comment | +------------+------------+----------+--+ | airport_cd | string | | | name | string | | | city | string | | | state | string | | | country | string | | | gov_agency | string | |
• 47. Page47 Hive Output for Mapped Contents
SELECT * FROM tgt_airport;
| _cd | name    | city    | st | cny | g_a |
+-----+---------+---------+----+-----+-----+
| 00M | Thig... | BayS... | MS | USA | FAA |
| 00R | Livi... | Livi... | TX | USA | FAA |
| 00V | Mead... | Colo... | CO | USA | FAA |
• 48. Page48 Spark Code for Data Mapping
val airport_target = hiveContext
  .table("airport_raw")
  .drop("latitude").drop("longitude")
  .withColumnRenamed("airport_code", "airport_cd")
  .withColumnRenamed("airport", "name")
  .withColumn("gov_agency", lit("FAA"))
• 49. Page49 Spark Output for Mapped Schema
root
 |-- airport_cd: string (nullable = true)
 |-- name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- country: string (nullable = true)
 |-- gov_agency: string (nullable = false)
• 50. Page50 Spark Output for Mapped Contents
airport_target.show()
|_cd|         name|         city|st|cny|g_a|
+---+-------------+-------------+--+---+---+
|00M|      Thigpen|   BaySprings|MS|USA|FAA|
|00R|Livingston...|Livingston...|TX|USA|FAA|
|00V|   MeadowLake|ColoradoSp...|CO|USA|FAA|
  • 51. Page51 WINNER: Source to Target Mapping
• 52. Page52 Data Quality
DQ is focused on detecting/correcting/enhancing input data
• Data type conversions / casting
• Numeric ranges
• Currency validation
• String validation
  • Leading / trailing spaces
  • Length
• Formatting (ex: SSN and phone #s) (see the sketch after this list)
• Address validation / standardization
• Enrichment
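To make the Formatting bullet concrete, here is a minimal Spark sketch in the deck's Scala style (assuming the spark-shell's hiveContext and implicits, as in the other examples); the contact table and phone column are hypothetical stand-ins, not part of the sample data set:

// Hypothetical table/column: keep rows whose phone matches NNN-NNN-NNNN
// and route the rest to a reject set (a common DQ pattern)
val phonePattern = "^[0-9]{3}-[0-9]{3}-[0-9]{4}$"
val contacts   = hiveContext.table("contact")
val wellFormed = contacts.filter($"phone".rlike(phonePattern))
val rejects    = contacts.filter(!$"phone".rlike(phonePattern))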
• 53. Page53 Numeric Validation Use Case
Validate latitude and longitude values from airport_raw
• Convert them from string to float
• Verify these values are within normal ranges
Attribute | Min  | Max
latitude  | -90  | +90
longitude | -180 | +180
• 54. Page54 Pig Code for Numeric Validation
src_airport = LOAD 'airport_raw'
      USING HCatLoader();
aprt_cnvrtd = FOREACH src_airport GENERATE
      airport_code, (float) latitude, (float) longitude;
ll_not_null = FILTER aprt_cnvrtd BY
      ( NOT ( (latitude IS NULL) OR (longitude IS NULL) ) );
valid_airports = FILTER ll_not_null BY
      (latitude <= 70) AND (longitude >= -170);
• 55. Page55 Hive Code for Numeric Validation
CREATE TABLE airport_stage STORED AS ORC AS
  SELECT airport_code,
         CAST(latitude  AS float),
         CAST(longitude AS float)
    FROM airport_raw;
CREATE TABLE airport_final STORED AS ORC AS
  SELECT * FROM airport_stage
   WHERE latitude  BETWEEN -80  AND 70
     AND longitude BETWEEN -170 AND 180;
• 56. Page56 Spark Code for Numeric Validation
val airport_validated = hiveContext
  .table("airport_raw")
  .selectExpr("airport_code",
              "cast(latitude as float) latitude",
              "cast(longitude as float) longitude")
  .filter("latitude is not null")
  .filter("longitude is not null")
  .filter("latitude <= 70")
  .filter("longitude >= -170")
• 57. Page57 String Validation Use Case
Validate city values from airport_raw
• Trim any leading / trailing spaces
• Truncate any characters beyond the first 30
• 58. Page58 Pig Code for String Validation
src_airport = LOAD 'airport_raw'
      USING HCatLoader();
valid_airports = FOREACH src_airport GENERATE
      airport_code, airport,
      SUBSTRING(TRIM(city),0,30) AS city,
      state, country;
• 59. Page59 Hive Code for String Validation
CREATE TABLE airport_final STORED AS ORC AS
  SELECT airport_code, airport,
         SUBSTR(TRIM(city),1,30) AS city,
         state, country
    FROM airport_raw;
• 60. Page60 Spark Code for String Validation
val airport_validated = hiveContext
  .table("airport_raw")
  .withColumnRenamed("city", "city_orig")
  .withColumn("city",
      substring(trim($"city_orig"), 1, 30))
  .drop("city_orig")
• 62. Page62 Data Profiling
Technique used to examine data for different purposes such as determining accuracy and completeness – drives DQ improvements
• Number of records – including null counts
• Avg / max lengths
• Distinct values
• Min / max values
• Mean
• Variance
• Standard deviation
• 63. Page63 Data Profiling with Pig
Coupled with Apache DataFu, Pig generates statistics such as the following
Column Name:        sales_price
Row Count:          163794
Null Count:         0
Distinct Count:     1446
Highest Value:      70589
Lowest Value:       1
Total Value:        21781793
Mean Value:         132.98285040966093
Variance:           183789.18332067598
Standard Deviation: 428.7064069041609
• 64. Page64 Data Profiling with Hive
Column-level statistics for all data types
|col_name|min|max|nulls|dist_ct|
+--------+---+---+-----+-------+
|air_time|  0|757|    0|    316|

|col_name|dist_ct|avgColLn|mxColLn|
+--------+-------+--------+-------+
|  city  |   2535|   8.407|     32|
• 65. Page65 Data Profiling with Spark
Inherent statistics for numeric data types
|summary|         air_time|
+-------+-----------------+
|  count|          2056494|
|   mean|103.9721783773743|
| stddev|67.42112792270458|
|    min|                0|
|    max|              757|
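The output above comes from the DataFrame describe() helper; a minimal sketch of the call, assuming the same flight table registered in the Hive Metastore:

// describe() computes count/mean/stddev/min/max for numeric columns
val flightStats = hiveContext
  .table("flight")
  .describe("air_time")
flightStats.show()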
• 67. Page67 Core Processing Functionality
Expected features to enable data transformation, cleansing & enrichment
• Filtering / splitting
• Sorting
• Lookups / joining
• Union / distinct (a union sketch follows the distinct examples below)
• Aggregations / pivoting
• SQL support
• Analytical functions (a windowing sketch follows the aggregation examples below)
• 68. Page68 Filtering Examples; Pig, Hive & Spark
tx_arprt = FILTER arprt BY state == 'TX';

SELECT * FROM arprt WHERE state = 'TX';

val txArprt = hiveContext
  .table("arprt")
  .filter("state = 'TX'")
• 69. Page69 Sorting Examples; Pig, Hive & Spark
srt_flight = ORDER flight BY
      dep_delay DESC, unique_carrier, flight_num;

SELECT * FROM flight
 ORDER BY dep_delay DESC, unique_carrier, flight_num;

val longestDepartureDelays = hiveContext
  .table("flight")
  .sort($"dep_delay".desc, $"unique_carrier", $"flight_num")
• 70. Page70 Joining with Pig
jnRslt = JOIN flights BY tail_num,
              planes  BY tail_number;
prettier = FOREACH jnRslt GENERATE
      flights::flight_date AS flight_date,
      flights::tail_num    AS tail_num,
      -- plus other 17 "flights" attribs
      planes::year AS plane_built;
      -- ignore other 8 "planes" attribs
• 71. Page71 Joining with Hive
SELECT F.*, P.year AS plane_built
  FROM flight F, plane P
 WHERE F.tail_num = P.tail_number;
• 72. Page72 Joining with Spark
val flights = hiveContext.table("flight")
val planes = hiveContext.table("plane")
  .select("tail_number", "year")
  .withColumnRenamed("year", "plane_built")
val augmented_flights = flights
  .join(planes)
  .where($"tail_num" === $"tail_number")
  .drop("tail_number")
• 73. Page73 Pig Code for Distinct
planes = LOAD 'plane' USING HCatLoader();
rotos = FILTER planes BY aircraft_type == 'Rotorcraft';
makers = FOREACH rotos GENERATE manufacturer;
distinct_makers = DISTINCT makers;
DUMP distinct_makers;
• 74. Page74 Pig Output for Distinct
(BELL)
(SIKORSKY)
(AGUSTA SPA)
(AEROSPATIALE)
(COBB INTL/DBA ROTORWAY INTL IN)
• 75. Page75 Hive Code for Distinct
SELECT DISTINCT(manufacturer)
  FROM plane
 WHERE aircraft_type = 'Rotorcraft';
• 76. Page76 Hive Output for Distinct
| manufacturer                   |
+--------------------------------+
| AEROSPATIALE                   |
| AGUSTA SPA                     |
| BELL                           |
| COBB INTL/DBA ROTORWAY INTL IN |
| SIKORSKY                       |
• 77. Page77 Spark Code for Distinct
val rotor_makers = hiveContext
  .table("plane")
  .filter("aircraft_type = 'Rotorcraft'")
  .select("manufacturer")
  .distinct()
• 78. Page78 Spark Output for Distinct
|        manufacturer|
+--------------------+
|                BELL|
|            SIKORSKY|
|          AGUSTA SPA|
|        AEROSPATIALE|
|COBB INTL/DBA ROT...|
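The "Union / distinct" bullet on Page67 is only half-covered by the distinct examples above; here is a minimal Spark sketch of the union half, assuming two hypothetical staging tables (plane_2015 and plane_2016) with identical schemas:

// Hypothetical tables; unionAll is the Spark 1.x DataFrame method
// (renamed union in 2.x) and distinct() drops any overlapping rows
val all_planes = hiveContext.table("plane_2015")
  .unionAll(hiveContext.table("plane_2016"))
  .distinct()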
• 79. Page79 Pig Code for Aggregation
flights = LOAD 'flight' USING HCatLoader();
reqd_cols = FOREACH flights GENERATE origin, dep_delay;
by_orig = GROUP reqd_cols BY origin;
avg_delay = FOREACH by_orig GENERATE
      group AS origin,
      AVG(reqd_cols.dep_delay) AS avg_dep_delay;
srtd_delay = ORDER avg_delay BY avg_dep_delay DESC;
top5_delay = LIMIT srtd_delay 5;
DUMP top5_delay;
• 80. Page80 Pig Output for Aggregation
(PIR,49.5)
(ACY,35.916666666666664)
(ACK,25.558333333333334)
(CEC,23.40764331210191)
(LMT,23.40268456375839)
• 81. Page81 Hive Code for Aggregation
SELECT origin, AVG(dep_delay) AS avg_dep_delay
  FROM flight
 GROUP BY origin
 ORDER BY avg_dep_delay DESC
 LIMIT 5;
• 82. Page82 Hive Output for Aggregation
| origin | avg_dep_delay       |
+--------+---------------------+
| PIR    | 49.5                |
| ACY    | 35.916666666666664  |
| ACK    | 25.558333333333334  |
| CEC    | 23.40764331210191   |
| LMT    | 23.40268456375839   |
• 83. Page83 Spark Code for Aggregation
val sorted_orig_timings = hiveContext
  .table("flight")
  .select("origin", "dep_delay")
  .groupBy("origin").avg()
  .withColumnRenamed("avg(dep_delay)", "avg_dep_delay")
  .sort($"avg_dep_delay".desc)
sorted_orig_timings.show(5)
• 84. Page84 Spark Output for Aggregation
|origin|     avg_dep_delay|
+------+------------------+
|   PIR|              49.5|
|   ACY|35.916666666666664|
|   ACK|25.558333333333334|
|   CEC| 23.40764331210191|
|   LMT| 23.40268456375839|
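The "Analytical functions" bullet on Page67 has no example in the deck; here is a hedged Spark windowing sketch against the same flight table, ranking each carrier's flights by departure delay (window functions in Spark 1.x require the hiveContext already used throughout):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

// Rank flights within each carrier, worst departure delay first
val byCarrier = Window
  .partitionBy("unique_carrier")
  .orderBy($"dep_delay".desc)
val rankedDelays = hiveContext
  .table("flight")
  .withColumn("delay_rank", rank().over(byCarrier))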
• 86. Page86 Custom Business Logic
Implemented via User Defined Functions (UDF)
• Pig and Hive
  • Write Java and compile to a JAR
  • Register JAR
  • Hive can administratively pre-register UDFs at the database level
• Spark can wrap functions at runtime (Python example below; a Scala sketch follows this slide)

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
get_year = udf(lambda x: int(x[:4]), IntegerType())
df1.select(get_year(df1["date"]).alias("year"),
           df1["product"]).show()

Input df1:                  Output:
+----------+-------+        +----+-------+
|      date|product|        |year|product|
+----------+-------+        +----+-------+
|2015-03-12|toaster|        |2015|toaster|
|2015-04-12|   iron|        |2015|   iron|
|2014-12-31| fridge|        |2014| fridge|
|2015-02-03|    cup|        |2015|    cup|
+----------+-------+        +----+-------+
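For symmetry with the deck's other Spark examples, the same wrap-a-function-at-runtime idea in Scala (a sketch, assuming the same illustrative df1 with date and product columns):

import org.apache.spark.sql.functions.udf

// Same logic as the Python snippet: pull the year off a yyyy-MM-dd string
val getYear = udf((date: String) => date.substring(0, 4).toInt)
df1.select(getYear($"date").as("year"), $"product").show()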
• 88. Page88 Mutable Data – Merge & Replace
See my preso and video on this topic
• http://www.slideshare.net/lestermartin/mutable-data-in-hives-immutable-world
• https://www.youtube.com/watch?v=EUz6Pu1lBHQ
The four-phase strategy (a Spark sketch of Reconcile follows this slide):
• Ingest – bring over the incremental data
• Reconcile – perform the merge
• Compact – replace the existing data with the newly merged content
• Purge – cleanup & prepare to repeat
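As a hedged sketch of the Reconcile phase in Spark (the linked preso covers the Hive flavor), one common pattern is to union the base and incremental data and keep only the newest version of each row by key; the table names, the id key, and the mod_dt timestamp are illustrative assumptions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Newest record per id wins; row_number() is in functions as of Spark 1.6
val merged = hiveContext.table("base_data")
  .unionAll(hiveContext.table("incremental_data"))
val newestFirst = Window.partitionBy("id").orderBy($"mod_dt".desc)
val reconciled = merged
  .withColumn("rn", row_number().over(newestFirst))
  .filter($"rn" === 1)
  .drop("rn")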
• 90. Page90 Performance
Scalability is based on size of cluster
• Tez and Spark have MapReduce optimizations
  • Can link multiple maps and reduces together without having to write intermediate data to HDFS
  • Every reducer does not require a map phase
• Hive and Spark SQL have query optimizers
• Spark has the edge
  • Caching data to memory can avoid extra reads from disk (see the sketch below)
  • Resources are dedicated for the entire life of the application
  • Task scheduling overhead drops from 15-20s to 15-20ms
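A minimal sketch of the caching point above, assuming the same flight table; cache() is lazy, so the first action materializes the cache and later actions reuse it:

// Mark the DataFrame for in-memory reuse across multiple actions
val flights = hiveContext.table("flight").cache()
flights.count()                           // first action populates the cache
flights.filter("dep_delay > 60").count()  // served from memory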
• 92. Page92 Recommendations
Review ALL THREE frameworks back at "your desk"
Decision Criteria…
• Existing investments
• Forward-looking beliefs
• Adaptability & current skills
• It's a "matter of style"
• Polyglot programming is NOT a bad thing!!
Share your findings via blogs and local user groups
• 93. Page93 Questions?
Lester Martin – Hadoop/Spark Trainer & Consultant
lester.martin@gmail.com
http://lester.website (links to blog, twitter, github, LI, FB, etc)
THANKS FOR YOUR TIME!!

Editor's notes

1. Despite the “Data Science and Machine Learning” track, this is NOT a talk on DS or ML. Sorry!! ;-) Target audience is someone who is new to at least one of these processing frameworks and wants some initial info to compare/contrast with.
2. Originally created by Yahoo!
Pigs eat anything – Pig can process any data, structured or unstructured
Pigs live anywhere – Pig can run on any parallel data processing framework, so Pig scripts do not have to run just on Hadoop
Pigs are domestic animals – Pig is designed to be easily controlled and modified by its users
Pigs fly – Pig is designed to process data quickly
3. Callouts are that connections are maintained by HS2, but all real processing is happening on the worker nodes in the grid
4. Use familiar command-line and SQL GUI tools just as with “normal” RDBMS technologies.
Beeline: HiveServer2 (introduced in Hive 0.11) has its own CLI called Beeline. HiveCLI is now deprecated in favor of Beeline, as HiveCLI lacks the multi-user, security, and other capabilities of HiveServer2. Beeline is started with the JDBC URL of the HiveServer2, which depends on the address and port where HiveServer2 was started. By default, it will be (localhost:10000), so the address will look like jdbc:hive2://localhost:10000.
GUI Tools: Open source SQL tools used to query Hive include:
Ambari Hive View (http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_ambari_views_guide/content/)
Zeppelin (https://zeppelin.incubator.apache.org/)
DBVisualizer (https://www.dbvis.com/download/)
Dbeaver (http://dbeaver.jkiss.org/)
SquirrelSQL (http://squirrel-sql.sourceforge.net/)
SQLWorkBench (http://www.sql-workbench.net/)
5. This is Hortonworks' preferred tool over Hue. Ambari includes a built-in set of Views that are pre-deployed for you to use with your cluster. The Hive View is designed to help you author, execute, understand, and debug Hive queries. You can:
Browse databases
Write and execute queries
Manage query execution jobs and history
  6. Spark allows you to do data processing, ETL, machine learning, stream processing, SQL querying from one framework
7. The Spark executor is the component that performs the map and reduce tasks of a Spark application, and is sometimes referred to as a Spark “worker.” Once created, executors exist for the life of the application. Note: In the context of Spark, the SparkContext is the "master" and executors are the "workers." However, in the context of HDP in general, you also have "master" nodes and "worker" nodes. Both uses of the term worker are correct - in terms of HDP, the worker (node) can run one or more Spark workers (executors). When in doubt, make sure to verify whether the worker being described is an HDP node or a Spark executor running on an HDP node. Spark executors function as interchangeable work spaces for Spark application processing. If an executor is lost while an application is running, all tasks assigned to it will be reassigned to another executor. In addition, any data lost will be recomputed on another executor. Executor behavior can be controlled programmatically. Configuring the number of executors and their resources available can greatly increase performance of an application when done correctly.
  8. Spark SQL is a module that is built on top of Spark Core. Spark SQL provides another level of abstraction for declarative programming on top of Spark. In Spark SQL, data is described as a DataFrame with rows, columns and a schema. Data manipulation in Spark SQL is available via SQL queries, and DataFrames API. Spark SQL is being used more and more at the enterprise level. Spark SQL allows the developer to focus even more on the business logic and even less on the plumbing, and even some of the optimizations.
  9. The image above shows what a data frame looks like visually. Much like Hive, a DataFrame is a set of metadata that sits on top of an RDD. The RDD can be created from many file types. A DataFrame is conceptually equivalent to a table in traditional data warehousing.
  10. Zeppelin has four major functions: data ingestion, discovery, analytics, and visualization. It comes with built-in examples that demonstrate these capabilities. These examples can be reused and modified for real-world scenarios.
  11. Pointing out that even the Spark RDD API have ”map” and “reduce” method names. Pig and Hive execute as MapReduce (even if on Tez (or Spark)). Also points out that the examples in this preso use DF instead of RDD APIs as they focus on “what you want to do” instead of on “how exactly things should be done”.
12. Covering the list on the left, but mostly NOT covering the one on the right (will discuss perf/scale). Here are some thoughts on these additional requirements:
Error Handling: generally done via creating datasets with issues (ex: reject file), and restartability is handled by the frameworks – have to RYO for something that doesn't rerun from the beginning.
Alerts, Logging, Job Stats: system-oriented solutions, but have to RYO for a custom soln.
Lineage: rely on other tools such as CDH's Navigator or HDP's Atlas.
Admin: common workflow & scheduling options for all jobs and frameworks.
13. Land the raw data first – Bake it as needed (aka Schema on Read). Tools such as sqoop, flume, nifi, storm, spark streaming and custom integrations take care of the Extract and Load – this preso focuses on Transformation processing.
  14. Just showing examples of del, xml and json in the slides
  15. Yep… a bit gnarly…
16. NOT showing output slides as the output is (basically) the SAME as the delimited output
  17. Have to FLATTEN the XML first and then do a CTAS against it to get rid of XPATH stuff.
18. NOT showing output slides as the output is (basically) the SAME as the delimited output
  19. TIE! Spark shines in the file formats that have included schema (Pig & Hive have to regurgitate the schema def), but it doesn’t work all that well with simple delimited files. All in all, they all can read & write a variety of file formats.
  20. No clear winner: all address this req in a straightforward manner.
  21. Just showing examples of numeric and string validations in the slides
22. See github project notes – had to fudge the numbers since all were already valid
23. See github project notes – had to fudge the numbers since all were already valid
24. See github project notes – had to fudge the numbers since all were already valid
25. IMHO, Hive really is not the tool for a series of data testing and conforming logic due to its need to continually build tables for the output of each step along the way. Pig and Spark tackle this more appropriately (again, my opinion). Both offer crisp and elegant solutions with the difference really being a matter of style. See orc-ddl.hql for example of a full, yet very simple, DQ implementation used for rest of examples.
  26. CALL OUT THE orc-ddl.hql SCRIPT FOR THE CLEANSED DATA MODEL
  27. DataFu came from LinkedIn
28. With DataFu and a bit of coding, Pig can satisfy baseline statistical functions. Hive and Spark each present different statistics natively, and coupling their results can satisfy baseline statistics.
  29. This gets a bit "chatty" in Pig Latin
  30. Determine the top 5 longest average dep_delay values by aggregating the origin airport for all flight records.
  31. PIR = Pierre,SD ACY = Atlantic City, NJ ACK = Nantucket, MA CEC = Crescent City, CA LMT = Klamath Falls, OR
  32. Determine the top 5 longest average dep_delay values by aggregating the origin airport for all flight records.
  33. PIR = Pierre,SD ACY = Atlantic City, NJ ACK = Nantucket, MA CEC = Crescent City, CA LMT = Klamath Falls, OR
  34. Determine the top 5 longest average dep_delay values by aggregating the origin airport for all flight records.
  35. PIR = Pierre,SD ACY = Atlantic City, NJ ACK = Nantucket, MA CEC = Crescent City, CA LMT = Klamath Falls, OR
36. Hive is the slight winner as everyone knows the "language of SQL" and these basic operations are very well known. Pig and Spark are both equally capable in these spaces, but I fear the masses will think they are a bit "chatty" to get the job done. As is often the case, it is a matter of style.
  37. For grins… this code snippet is with Python instead of Scala
38. Spark is 1st in how easy it is to surface a UDF. Hive is 2nd due to being able to publish UDFs at the database level.
39. "Mutable Data in an Immutable World" is hard for ALL, but Hive edges out with its growing "transactions" features; https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
40. Spark is 1st in how easy it is to surface a UDF. Hive is 2nd due to being able to publish UDFs at the database level.