SlideShare una empresa de Scribd logo
1 de 14
Alan F. Gates Yahoo! Pig, Making Hadoop Easy
Who Am I? Pig committer and PMC Member An architect in Yahoo! grid team Photo credit:  Steven Guarnaccia, The Three Little Pigs
Motivation By Example You have web server logs of purchases on your site.  You want to find the 10 users who bought the most and the cities they live in.  You also want to know what percentage of purchases they account for in those cities. Load Logs Find top 10 users Sum purchases by city Join by city Store top 10 users Calculate percentage Store results
In Pig Latin raw     = load 'logs' as (name, city, purchase); -- Find top 10 users usrgrp  = group raw by (name, city); byusr   = foreach usrgrp generate group as k1,                SUM(raw.purchase) as utotal; srtusr  = order byusr by usrtotal desc; topusrs = limit srtusr 10; store topusrs into 'top_users'; -- Count purchases per city citygrp = group raw by city; bycity  = foreach citygrp generate group as k2,                SUM(raw.purchase) as ctotal; -- Join top users back to city jnd     = join topusrs by k1.city, bycity by k2; pct     = foreach jnd generate k1.name, k1.city, utotal/ctotal; store pct into 'top_users_pct_of_city';
Translates to Four MapReduce Jobs
Performance
Where Do Pigs Live? Data Factory Pig Pipelines Iterative Processing Research Data Warehouse BI Tools Analysis Data Collection
Pig Highlights Language designed to enable efficient description of data flow Standard relational operators built in User defined functions (UDFs) can be written for column transformation (TOUPPER), or aggregation (SUM) UDFs can be written to take advantage of the combiner Four join implementations built in:  hash, fragment-replicate, merge, skewed Multi-query:  Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned Order by provides total ordering across reducers in a balanced way Writing load and store functions is easy once an InputFormat and OutputFormat exist Piggybank, a collection of user contributed UDFs
Multi-store script A = load ‘users’ as (name, age, gender,       city, state); B = filter A by name is not null; C1 = group B by age, gender; D1 = foreach C1 generate group, COUNT(B); store D into ‘bydemo’; C2= group B by state; D2 = foreach C2 generate group, COUNT(B); store D2 into ‘bystate’; group by age, gender store into ‘bydemo’ apply UDFs load users filter nulls group by state store into ‘bystate’ apply UDFs
Multi-Store Map-Reduce Plan map filter split local rearrange local rearrange reduce multiplex package package foreach foreach
Hash Join Users = load‘users’as (name, age);Pages = load ‘pages’ as (user, url);Jnd = join Users by name, Pages by user; Map 1 Reducer 1 (1, user) Pages block n (1, fred) (2, fred) (2, fred) Pages Users Map 2 Reducer 2 Users block m (1, jane) (2, jane) (2, jane) (2, name)
Fragment Replicate Join Users = load‘users’as (name, age);Pages = load ‘pages’ as (user, url);Jnd = join Pages by user, Users by name using “replicated”; Map 1 Users Pages block 1 Pages Users Map 2 Users Pages block 2
Skew Join Users = load‘users’as (name, age);Pages = load ‘pages’ as (user, url);Jnd = join Pages by user, Users by name using “skewed”; Map 1 Reducer 1 SP (1, user) Pages block n (1, fred, p1) (1, fred, p2) (2, fred) Pages Users SP Map 2 Reducer 2 Users block m (1, fred, p3) (1, fred, p4) (2, fred) (2, name)
Merge Join Users = load‘users’as (name, age);Pages = load ‘pages’ as (user, url);Jnd = join Pages by user, Users by name using “merge”; Map 1 Users Pages Pages Users aaron… amr aaron … aaron     .     .     .     .     .     .     .     . zach aaron     .     .     .     .     .     .     .     . zach Map 2 Users Pages amy… barb amy …

Más contenido relacionado

La actualidad más candente

Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingMitsuharu Hamba
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache HiveAvkash Chauhan
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latinknowbigdata
 
Apache Pig: A big data processor
Apache Pig: A big data processorApache Pig: A big data processor
Apache Pig: A big data processorTushar B Kute
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010Thejas Nair
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And HdfsCloudera, Inc.
 
Hive data migration (export/import)
Hive data migration (export/import)Hive data migration (export/import)
Hive data migration (export/import)Bopyo Hong
 
Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5Rohit Agrawal
 
Hive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamHive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamZheng Shao
 
An Introduction to Apache Pig
An Introduction to Apache PigAn Introduction to Apache Pig
An Introduction to Apache PigSachin Vakkund
 
Practical Pig and PigUnit (Michael Noll, Verisign)
Practical Pig and PigUnit (Michael Noll, Verisign)Practical Pig and PigUnit (Michael Noll, Verisign)
Practical Pig and PigUnit (Michael Noll, Verisign)Swiss Big Data User Group
 
Nov HUG 2009: Hadoop Record Reader In Python
Nov HUG 2009: Hadoop Record Reader In PythonNov HUG 2009: Hadoop Record Reader In Python
Nov HUG 2009: Hadoop Record Reader In PythonYahoo Developer Network
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat SheetHortonworks
 
Hive integration: HBase and Rcfile__HadoopSummit2010
Hive integration: HBase and Rcfile__HadoopSummit2010Hive integration: HBase and Rcfile__HadoopSummit2010
Hive integration: HBase and Rcfile__HadoopSummit2010Yahoo Developer Network
 

La actualidad más candente (20)

Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latin
 
Apache Pig: A big data processor
Apache Pig: A big data processorApache Pig: A big data processor
Apache Pig: A big data processor
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
Hw09 Hadoop Development At Facebook Hive And Hdfs
Hw09   Hadoop Development At Facebook  Hive And HdfsHw09   Hadoop Development At Facebook  Hive And Hdfs
Hw09 Hadoop Development At Facebook Hive And Hdfs
 
Hive data migration (export/import)
Hive data migration (export/import)Hive data migration (export/import)
Hive data migration (export/import)
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5Pig and Pig Latin - Module 5
Pig and Pig Latin - Module 5
 
Unit 4 lecture-3
Unit 4 lecture-3Unit 4 lecture-3
Unit 4 lecture-3
 
Hive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamHive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive Team
 
Apache pig
Apache pigApache pig
Apache pig
 
An Introduction to Apache Pig
An Introduction to Apache PigAn Introduction to Apache Pig
An Introduction to Apache Pig
 
Practical Pig and PigUnit (Michael Noll, Verisign)
Practical Pig and PigUnit (Michael Noll, Verisign)Practical Pig and PigUnit (Michael Noll, Verisign)
Practical Pig and PigUnit (Michael Noll, Verisign)
 
Nov HUG 2009: Hadoop Record Reader In Python
Nov HUG 2009: Hadoop Record Reader In PythonNov HUG 2009: Hadoop Record Reader In Python
Nov HUG 2009: Hadoop Record Reader In Python
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 
Hive integration: HBase and Rcfile__HadoopSummit2010
Hive integration: HBase and Rcfile__HadoopSummit2010Hive integration: HBase and Rcfile__HadoopSummit2010
Hive integration: HBase and Rcfile__HadoopSummit2010
 

Similar a Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate

Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Trainingstratapps
 
Scaling python webapps from 0 to 50 million users - A top-down approach
Scaling python webapps from 0 to 50 million users - A top-down approachScaling python webapps from 0 to 50 million users - A top-down approach
Scaling python webapps from 0 to 50 million users - A top-down approachJinal Jhaveri
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightPaco Nathan
 
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptxDowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptxLex Avstreikh
 
Building a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkBuilding a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkDatabricks
 
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...DrPDShebaKeziaMalarc
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2Wes Floyd
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Julian Hyde
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zingzingopen
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleSean Chittenden
 
ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...
ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...
ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...Codemotion
 
Parse cloud code
Parse cloud codeParse cloud code
Parse cloud code維佋 唐
 
Intuitive CLIs for gRPC APIs
Intuitive CLIs for gRPC APIsIntuitive CLIs for gRPC APIs
Intuitive CLIs for gRPC APIsNordic APIs
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleJim Dowling
 

Similar a Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate (20)

January 2011 HUG: Pig Presentation
January 2011 HUG: Pig PresentationJanuary 2011 HUG: Pig Presentation
January 2011 HUG: Pig Presentation
 
Pig
PigPig
Pig
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Training
 
Scaling python webapps from 0 to 50 million users - A top-down approach
Scaling python webapps from 0 to 50 million users - A top-down approachScaling python webapps from 0 to 50 million users - A top-down approach
Scaling python webapps from 0 to 50 million users - A top-down approach
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptxDowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
Dowling buso-feature-store-logical-clocks-spark-ai-summit-2020.pptx
 
Building a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkBuilding a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache Spark
 
Apache Pig
Apache PigApache Pig
Apache Pig
 
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
Pig - A Data Flow Language and Execution Environment for Exploring Very Large...
 
pig.ppt
pig.pptpig.ppt
pig.ppt
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
 
Polyalgebra
PolyalgebraPolyalgebra
Polyalgebra
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zing
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
 
ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...
ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...
ClojureScript - Making Front-End development Fun again - John Stevenson - Cod...
 
Parse cloud code
Parse cloud codeParse cloud code
Parse cloud code
 
Intuitive CLIs for gRPC APIs
Intuitive CLIs for gRPC APIsIntuitive CLIs for gRPC APIs
Intuitive CLIs for gRPC APIs
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData Seattle
 

Más de Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsYahoo Developer Network
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 

Más de Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate

  • 1. Alan F. Gates Yahoo! Pig, Making Hadoop Easy
  • 2. Who Am I? Pig committer and PMC Member An architect in Yahoo! grid team Photo credit: Steven Guarnaccia, The Three Little Pigs
  • 3. Motivation By Example You have web server logs of purchases on your site. You want to find the 10 users who bought the most and the cities they live in. You also want to know what percentage of purchases they account for in those cities. Load Logs Find top 10 users Sum purchases by city Join by city Store top 10 users Calculate percentage Store results
  • 4. In Pig Latin raw = load 'logs' as (name, city, purchase); -- Find top 10 users usrgrp = group raw by (name, city); byusr = foreach usrgrp generate group as k1, SUM(raw.purchase) as utotal; srtusr = order byusr by usrtotal desc; topusrs = limit srtusr 10; store topusrs into 'top_users'; -- Count purchases per city citygrp = group raw by city; bycity = foreach citygrp generate group as k2, SUM(raw.purchase) as ctotal; -- Join top users back to city jnd = join topusrs by k1.city, bycity by k2; pct = foreach jnd generate k1.name, k1.city, utotal/ctotal; store pct into 'top_users_pct_of_city';
  • 5. Translates to Four MapReduce Jobs
  • 7. Where Do Pigs Live? Data Factory Pig Pipelines Iterative Processing Research Data Warehouse BI Tools Analysis Data Collection
  • 8. Pig Highlights Language designed to enable efficient description of data flow Standard relational operators built in User defined functions (UDFs) can be written for column transformation (TOUPPER), or aggregation (SUM) UDFs can be written to take advantage of the combiner Four join implementations built in: hash, fragment-replicate, merge, skewed Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned Order by provides total ordering across reducers in a balanced way Writing load and store functions is easy once an InputFormat and OutputFormat exist Piggybank, a collection of user contributed UDFs
  • 9. Multi-store script A = load ‘users’ as (name, age, gender, city, state); B = filter A by name is not null; C1 = group B by age, gender; D1 = foreach C1 generate group, COUNT(B); store D into ‘bydemo’; C2= group B by state; D2 = foreach C2 generate group, COUNT(B); store D2 into ‘bystate’; group by age, gender store into ‘bydemo’ apply UDFs load users filter nulls group by state store into ‘bystate’ apply UDFs
  • 10. Multi-Store Map-Reduce Plan map filter split local rearrange local rearrange reduce multiplex package package foreach foreach
  • 11. Hash Join Users = load‘users’as (name, age);Pages = load ‘pages’ as (user, url);Jnd = join Users by name, Pages by user; Map 1 Reducer 1 (1, user) Pages block n (1, fred) (2, fred) (2, fred) Pages Users Map 2 Reducer 2 Users block m (1, jane) (2, jane) (2, jane) (2, name)
  • 12. Fragment Replicate Join Users = load‘users’as (name, age);Pages = load ‘pages’ as (user, url);Jnd = join Pages by user, Users by name using “replicated”; Map 1 Users Pages block 1 Pages Users Map 2 Users Pages block 2
  • 13. Skew Join Users = load‘users’as (name, age);Pages = load ‘pages’ as (user, url);Jnd = join Pages by user, Users by name using “skewed”; Map 1 Reducer 1 SP (1, user) Pages block n (1, fred, p1) (1, fred, p2) (2, fred) Pages Users SP Map 2 Reducer 2 Users block m (1, fred, p3) (1, fred, p4) (2, fred) (2, name)
  • 14. Merge Join Users = load‘users’as (name, age);Pages = load ‘pages’ as (user, url);Jnd = join Pages by user, Users by name using “merge”; Map 1 Users Pages Pages Users aaron… amr aaron … aaron . . . . . . . . zach aaron . . . . . . . . zach Map 2 Users Pages amy… barb amy …
  • 15. Who uses Pig for What? 70% of production grid jobs at Yahoo (10ks per day) Also used by Twitter, LinkedIn, Ebay, AOL, … Used to Process web logs Build user behavior models Process images Build maps of the web Do research on raw data sets
  • 16.
  • 17. Grunt, the pig shell
  • 18. PigServer Java class, a JDBC like interfaceJob executes on cluster Hadoop Cluster Pig resides on user machine User machine No need to install anything extra on your Hadoop cluster.
  • 19.
  • 23. submits jar to Hadoop
  • 24. monitors job progressExecution Plan Map: Filter Count Combine/Reduce: Sum
  • 25. New in 0.8 UDFs can be Jython Improved and expanded statistics Performance Improvements Automatic merging of small files Compression of intermediate results PigUnit for unit testing your Pig Latin scripts Access to static Java functions as UDFs Improved HBase integration Custom PartitionersB = group A by $0 partition by YourPartitioner parallel2; Greatly expanded string and math built in UDFs
  • 26. What’s Next? Preview of Pig 0.9 Integrate Pig with scripting languages for control flow Add macros to Pig Latin Revive ILLUSTRATE Fix runtime type errors Rewrite parser to give more useful error messages Programming Pig from O’Reilly Press
  • 27. Learn More Online documentation: http://pig.apache.org/ Hadoop, The Definitive Guide 2nd edition has an up to date chapter on Pig, search at your favorite bookstore Join the mailing lists: user@pig.apache.org for user questions dev@pig.apache.com for developer issues Follow me on Twitter, @alanfgates
  • 28. UDFs in Scripting Languages Evaluation functions can now be written in scripting languages that compile down to the JVM Reference implementation provided in Jython Jruby, others, could be added with minimal code JavaScript implementation in progress Jython sold separately
  • 29. Example Python UDF test.py: @outputSchema(”sqr:long”) def square(num): return ((num)*(num)) test.pig: register 'test.py' using jython as myfuncs; A = load ‘input’ as (i:int); B = foreach A generate myfuncs.square(i); dump B;
  • 30. Better statistics Statistics printed out at end of job run Pig information stored in Hadoop’s job history files so you can mine the information and analyze your Pig usage Loader for reading job history files included in Piggybank New PigRunner interface that allows users to invoke Pig and get back a statistics object that contains stats information Can also pass listener to track Pig jobs as they run Done for Oozie so it can show users Pig statistics
  • 31. Sample stats info Job Stats (time in seconds): JobId Maps Reduces MxMTMnMT AMT MxRTMnRT ART Alias job_0 2 1 15 3 9 27 27 27 a,b,c,d,e job_1 1 1 3 3 3 12 12 12 g,h job_2 1 1 3 3 3 12 12 12 i job_3 1 1 3 3 3 12 12 12 i Input(s): Successfully read 10000 records from: “studenttab10k" Successfully read 10000 records from: “votertab10k" Output(s): Successfully stored 6 records (150 bytes) in: ”outfile" Counters: Total records written : 6 Total bytes written : 150
  • 32. Invoke Static Java Functions as UDFs Often UDF you need already exists as Java function, e.g. Java’s URLDecoder.decode() for decoding URLs define UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');A = load 'encoded.txt' as (e:chararray);B = foreach A generate UrlDecode(e, 'UTF-8'); Currently only works with simple types and static functions
  • 33. Improved HBase Integration Can now read records as bytes instead of auto converting to strings Filters can be pushed down Can store data in HBase as well as load from it Works with HBase 0.20 but not 0.89 or 0.90. Patch in PIG-1680 addresses this but has not been committed yet.

Notas del editor

  1. A very common use case we see at Yahoo is users want to read one data set and group it several different ways. Since scan time often dominates for these large data sets, sharing one scan across several group instances can result in nearly linear speed up of queries.
  2. In this case multiple pipelines are needed in Map and Reduce phasesDue to our pull based model in execution, we have split and multiplex embed the pipelines within themselvesRecords are tagged with the pipeline number in the map stageGrouping is done by Hadoop using a union of the keysMultiplex operator on the reducer places incoming records in the correct pipeline
  3. As your website grows, the number of unique users grows beyond what you can keep in memory.A given map only gets input from a given input source. It can therefore annotate tuples from that source with information on which source it came from. The join key is then used to partition the data, but the join key plus the input source id is used to sort it. This allows pig to buffer one side of the join keys in memory and then use that as a probe table as keys from the other input stream by.
  4. Running example: You start a webiste. You want to know how users are using your website. So you collect a couple of streams of information from your logs: page views and users.When you start you have a fair number of page views, but not many users.In this algorithm the smaller table is copied to every map in its entirety (doesn’t yet use Distributed Cache, it should). Larger file is partitioned as per normal MR.
  5. As your website grows even more, some pages become significantly more popular than others. This means that some pages are visited by almost every user, while others are visited only by a few users.First, a sampling pass is done to determine which keys are large enough to need special attention. These are keys that have enough values that we estimate we cannot hold the entire value in memory. It’s about holding the values in memory, not the key.Then at partitioning time, those keys are handled specially. All other keys are treated as in the regular join. These selected keys from input1 are split across multiple reducers. For input2, they are replicated to each of these reducers that had the split. In this way we guarantee that every instance of key k from input1 comes into contact with every instance of k from input2.
  6. Now lets say that for some reason you start keeping both your page view data and user data sorted by user.Note that one way to do this is make sure that pages and users are partitioned the same way. But this leads to a big problem. In order to make sure you can join all your data sets you end up using the same hash function to join them all. But rarely does one bucketing scheme make sense for all your data. Whatever is big enough for one data set will be too small for others, and vice versa. So Pig’s implementation doesn’t depend on how the data is split.Pig does this by sampling one of the inputs and then building an index from that sample that indicates the key for the first record in every split. The other input is used as the standard input file for Hadoop and is split to the maps as per normal. When the map begins processing this file, when it encounters the first key in that file it uses the index to determine where it should open the second, sampled file. It then opens the file at the appropriate point, seeks forward until it finds the key it is looking for, and then begins doing a join on the two data sources.
  7. Can’t yet inline the Python functions in Pig Latin script. In 0.9 we’ll add the ability to put them in the same file.