A Bird’s-Eye View of Pig and Scalding with hRaven
a tale by @gario and @joep
Hadoop Summit 2013
v1.2
@Twitter#HadoopSummit2013 2
Apache HBase PMC member and Committer
Software Engineer @ Twitter
Core Storage Team - Hadoop/HBase
About the authors
Software Engineer @ Twitter
Engineering Manager Hadoop/HBase team @ Twitter
Chapter 1: The Problem
Chapter 2: Why hRaven?
Chapter 3: How Does it Work?
3a: Loading
3b: Table structure / querying
Chapter 4: Current Uses
Appendix: Future Work
Table of Contents
Chapter 1: The Problem
Illustration by Sirxlem (CC BY-NC-ND 3.0)
Most users run Pig and Scalding scripts, not straight map reduce
JobTracker UI shows jobs, not DAGs of jobs generated by Pig and Scalding
Chapter 1: Mismatched Abstractions
Chapter 1: A Problem of Scale
How many Pig versus Scalding jobs do we run?
What cluster capacity do jobs in my pool take?
How many jobs do we run each day?
What % of jobs have > 30k tasks?
Why do I need to hand-tune these (hundreds of) jobs? Can’t the cluster learn?
Chapter 1: Questions
#Nevermore
Chapter 2: Why hRaven?
Photo by DAVID ILIFF. License: CC-BY-SA 3.0
Stores stats, configuration and timing for every map reduce job on every cluster
Structured around the full DAG of jobs from a Pig or Scalding application
Easily queryable for historical trending
Allows for Pig reducer optimization based on historical run stats
Keep data online forever (12.6M jobs, 4.5B tasks + attempts)
Chapter 2: Why hRaven?
cluster - each cluster has a unique name mapping to the Job Tracker
user - map reduce jobs are run as a given user
application - a Pig or Scalding script (or plain map reduce job)
flow - the combined DAG of jobs executed from a single run of an application
version - changes impacting the DAG are recorded as a new version of the same application
Chapter 2: Key Concepts
Chapter 2: Application Flows
Edgar
All jobs in a flow are ordered together
Chapter 2: Flow Storage
Most recent flow is ordered first
Chapter 2: Flow Storage
All jobs in a flow are ordered together
Per-job metrics stored:
  Total map and reduce tasks
  HDFS bytes read / written
  File bytes read / written
  Total map and reduce slot milliseconds
Easy to aggregate stats for an entire flow
Easy to scan the timeseries of each application’s flows
Chapter 2: Key Features
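Because all jobs in a flow sit next to each other in row-key order, flow-level totals can be computed in one sequential pass. The following is only a minimal sketch of that idea: the row data, the flow identifiers, and the metric names ("maps", "reduces", "hdfs_bytes_read") are hypothetical stand-ins, not hRaven's actual schema or API.

```python
from itertools import groupby

# Hypothetical rows in row-key order: (flow identifier, per-job metrics).
# The flow identifier stands in for the cluster!user!application!timestamp prefix.
rows = [
    ("wordcount@run2", {"maps": 40, "reduces": 4, "hdfs_bytes_read": 900}),
    ("wordcount@run2", {"maps": 10, "reduces": 1, "hdfs_bytes_read": 100}),
    ("wordcount@run1", {"maps": 30, "reduces": 3, "hdfs_bytes_read": 700}),
]

def aggregate_flows(ordered_rows):
    """Sum each metric across the consecutive jobs that share a flow."""
    totals = []
    # groupby only merges adjacent rows, which is exactly what the
    # "jobs in a flow are ordered together" property provides.
    for flow, jobs in groupby(ordered_rows, key=lambda row: row[0]):
        summed = {}
        for _, metrics in jobs:
            for name, value in metrics.items():
                summed[name] = summed.get(name, 0) + value
        totals.append((flow, summed))
    return totals

flow_totals = aggregate_flows(rows)
```

No sorting or shuffling is needed here; the aggregation relies entirely on the storage order.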
Chapter 3: How Does it Work?
Chapter 3: ETL - Step 1: JobFilePreprocessor
Chapter 3: ETL - Step 2: JobFileRawLoader
Chapter 3: ETL - Step 3: JobFileProcessor
Jobs finish out of order with respect to job_id
job_history_raw
job_history
job_history_task
job_history_app_version
Chapter 3: Tables
Row key: cluster!jobID
Columns:
  jobconf - stores the serialized raw job_*_conf.xml file
  jobhistory - stores the serialized raw job history log file
  job_processed_success - indicates whether the job has been processed
Chapter 3: job_history_raw
Row key: cluster!user!application!timestamp!jobID
cluster - unique cluster name (e.g. “cluster1@dc1”)
user - user running the application (e.g. “edgar”)
application - application ID derived from job configuration:
  uses “batch.desc” property if set
  otherwise parses a consistent ID from “mapred.job.name”
timestamp - submission time, stored inverted (Long.MAX_VALUE - value)
jobID - stored as Job Tracker start time (long), concatenated with job sequence number:
  job_201306271100_0001 -> [1372352073732L][1L]
Chapter 3: job_history
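The inverted timestamp is what makes the most recent flow sort first in a byte-ordered store. The sketch below illustrates that effect under stated assumptions: the helper names are hypothetical, the "!" separator is taken from the row key layout above without any escaping, and the two submission times are illustrative values rather than real job data. It is not hRaven's own key-building code.

```python
import struct

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE

def inverted_timestamp(submit_time_ms: int) -> bytes:
    """Encode (Long.MAX_VALUE - value) as 8 big-endian bytes."""
    return struct.pack(">q", LONG_MAX - submit_time_ms)

def flow_key_prefix(cluster: str, user: str, app: str, submit_time_ms: int) -> bytes:
    """Hypothetical helper: cluster!user!application!timestamp as bytes."""
    head = "!".join([cluster, user, app]).encode("utf-8")
    return head + b"!" + inverted_timestamp(submit_time_ms)

# Two runs of the same application, submitted at different times (illustrative).
older = flow_key_prefix("cluster1@dc1", "edgar", "wordcount", 1369585634000)
newer = flow_key_prefix("cluster1@dc1", "edgar", "wordcount", 1372263813000)

# Plain byte-wise sorting, as a row-key-ordered store would apply it.
in_key_order = sorted([older, newer])
```

A later submission time yields a smaller inverted value, so `newer` sorts ahead of `older` and a scan starting at the application prefix returns the latest run first.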
Row key: cluster!user!application!timestamp!jobID!taskID
same components as the job_history key (same ordering)
taskID - (e.g. “m_00001”) uniquely identifies an individual task/attempt in the job
Two row types:
  Task - “meta” row
    cluster1@dc1!edgar!wordcount!9654...!...[00001]!m_00001
  Task Attempt - individual execution on a Task Tracker
    cluster1@dc1!edgar!wordcount!9654...!...[00001]!m_00001_1
Chapter 3: job_history_task
Row key: cluster!user!application
Example: cluster1@dc1!edgar!wordcount
Columns:
v1=1369585634000
v2=1372263813000
Chapter 3: job_history_app_version
Using Pig’s HBaseStorage (or direct HBase APIs)
Through Client API
Through REST API
Chapter 3: Querying hRaven
Chapter 4: Current Uses
Pig reducer optimizations
Cluster utilization / capacity planning
Application performance trending over time
Identifying common job anti-patterns
Ad-hoc analysis for troubleshooting cluster problems
Chapter 4: Current Uses
Chapter 4: Cluster reads-writes
Chapter 4: Pool / Application reads/writes
Pool view
  Spike in File size read
  Indicates jobs spilling
Application view
  Spike in HDFS size read
  Indicates spiking input
Chapter 4: Pool usage: Used vs. Allocated
Chapter 4: Compute cost
Appendix: Future Work
Real-time data loading from Job Tracker / Application Master
Full flow-centric UI (Job Tracker UI replacement)
Hadoop 2.0 compatibility (in-progress)
Ambrose integration
Appendix: Future Work
hRaven on Github
https://github.com/twitter/hraven
hRaven Mailing Lists
hraven-user@googlegroups.com
hraven-dev@googlegroups.com
Additional Resources
Afterword
Now will thou drop your job data on the floor?
Quoth the hRaven, 'Nevermore.'
#TheEnd
@gario and @joep
Come visit us at booth #26 to continue the story
Desired order
job_201306271100_9999
job_201306271100_10000
...
job_201306271100_99999
job_201306271100_100000
...
job_201306271100_999999
job_201306271100_1000000
Sort order: Variable length job_id
Lexical order
job_201306271100_10000
job_201306271100_100000
job_201306271100_1000000
job_201306271100_9999
job_201306271100_99999
job_201306271100_999999
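The two orderings above can be reproduced in a few lines. This is a sketch under assumptions: `encode_job_id` is a hypothetical helper that packs the two numeric parts of the job ID string as fixed-width big-endian longs, and it uses the numeric value of the Job Tracker identifier directly rather than an epoch start time. It is meant only to show why a fixed-width binary encoding sorts correctly where the variable-length string form does not.

```python
import struct

def encode_job_id(job_id: str) -> bytes:
    """Hypothetical encoding: two 8-byte big-endian longs (tracker id, sequence)."""
    # "job_201306271100_9999" -> ("job", "201306271100", "9999")
    _, tracker_id, sequence = job_id.split("_")
    return struct.pack(">qq", int(tracker_id), int(sequence))

# Job IDs listed in the desired (numeric) order.
job_ids = [
    "job_201306271100_9999",
    "job_201306271100_10000",
    "job_201306271100_99999",
    "job_201306271100_100000",
    "job_201306271100_999999",
    "job_201306271100_1000000",
]

# Sorting the raw strings compares character by character.
lexical = sorted(job_ids)

# Sorting by the fixed-width encoding compares the numbers byte by byte.
by_encoded = sorted(job_ids, key=encode_job_id)
```

Here `lexical` places `job_201306271100_10000` before `job_201306271100_9999`, while `by_encoded` keeps the sequence numbers in numeric order.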