Pig, Making Hadoop Easy

Alan F. Gates Yahoo! Pig, Making Hadoop Easy

Who Am I? Pig committer Hadoop PMC Member An architect in Yahoo!grid team Or, as one coworker put it, “the lipstick on the Pig”

Motivation By Example Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18 - 25. Load Users Load Pages Filter by age Join on name Group on url Count clicks Order by clicks Take top 5

In Pig Latin Users = load‘users’as (name, age);Fltrd = filter Users by age >= 18 and age <= 25; Pages = load ‘pages’ as (user, url);Jnd = joinFltrdby name, Pages by user;Grpd = groupJndbyurl;Smmd = foreachGrpdgenerate group,COUNT(Jnd) as clicks;Srtd = orderSmmdby clicks desc;Top5 = limitSrtd 5;store Top5 into‘top5sites’;

Performance 0.1 0.4, 0.5 0.2 0.3 0.6, 0.7

Why not SQL? Data Factory Pig Pipelines Iterative Processing Research Data Warehouse Hive BI Tools Analysis Data Collection

Pig Highlights User defined functions (UDFs) can be written for column transformation (TOUPPER), or aggregation (SUM) UDFs can be written to take advantage of the combiner Four join implementations built in: hash, fragment-replicate, merge, skewed Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned Order by provides total ordering across reducers in a balanced way Writing load and store functions is easy once an InputFormat and OutputFormat exist Piggybank, a collection of user contributed UDFs

Who uses Pig for What? 70% of production jobs at Yahoo (10ks per day) Also used by Twitter, LinkedIn, Ebay, AOL, … Used to Process web logs Build user behavior models Process images Build maps of the web Do research on raw data sets

Accessing Pig Submit a script directly Grunt, the pig shell PigServer Java class, a JDBC like interface

Components Job executes on cluster Hadoop Cluster Pig resides on user machine User machine No need to install anything extra on your Hadoop cluster.

How It Works Pig Latin A = LOAD ‘myfile’ AS (x, y, z); B = FILTER A by x > 0; C = GROUP B BY x; D = FOREACH A GENERATE x, COUNT(B); STORE D INTO ‘output’; pig.jar: ,[object Object]

Pig, Making Hadoop Easy

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (19)

Destacado

Destacado (10)

Similar a Pig, Making Hadoop Easy

Similar a Pig, Making Hadoop Easy (20)

Más de Nick Dimiduk

Más de Nick Dimiduk (13)

Pig, Making Hadoop Easy

Notas del editor