Working with Structured Data in Hadoop


Jeff Hammerbacher
Manager, Data
May 28 - 29, 2008
Structured Data Management in Hadoop
State of the World
▪ HBase is a Hadoop subproject
    ▪ Powerset and Rapleaf are the main contributors
▪ Hypertable is an implementation of Bigtable in C++
    ▪ Zvents is the main contributor
▪ Pig is an Apache Incubator project
    ▪ Yahoo! is the main contributor
▪ JAQL has been released as open source
    ▪ IBM is the main contributor
▪ Hive is not yet publicly available; hopefully under contrib/ soon
    ▪ Facebook is the main contributor
Pig
Philosophy
▪ Pigs Eat Anything
    ▪ Operate on data with or without metadata
    ▪ Operate on relational, nested, or unstructured data
▪ Pigs Live Anywhere
    ▪ The language is independent of execution environment
▪ Pigs are Domestic Animals
    ▪ Integrate user code wherever possible
    ▪ Allow control over code reorganization when optimizing
▪ Pigs Fly
Pig
Components
▪ Pig Latin
    ▪ Dataflow programming language; procedural, not declarative
    ▪ Algebraic: each step specifies only a single data transformation
    ▪ Parse, verify, and build a logical plan
▪ Evaluation Mechanisms
    ▪ Local evaluation in single JVM
    ▪ Compilation to Hadoop MapReduce
▪ Grunt: interactive shell
▪ Pig Pen: debugging environment
Pig
Data Model
▪ Pig has four types of data items:
    ▪ Atom: string or number
    ▪ Tuple: “data record” consisting of an ordered sequence of “fields”
        ▪ Denoted with < > bracketing
    ▪ Bag: an unordered collection of tuples with possible duplicates and possibly inconsistent schemas
        ▪ Denoted with { } bracketing
    ▪ Map: an unordered collection of data items where each data item has an associated key; the key must be a string
        ▪ Denoted with [ ] bracketing
Pig
Data Model, continued
▪ Fields in a tuple may be named for easier access
▪ A “relation” is a Bag that has been assigned a name (“alias”)

▪ Example:
    ▪ Let t = < 1, { <2, 3, 4>, <4, 6, 8>, <5, 7, 11> }, ['apache': 'search'] >
    ▪ Give the fields of t the names “f1”, “f2”, and “f3”
    ▪ Give the fields of the tuples of the bag the names “g1”, “g2”, and “g3”
    ▪ We’ll look at Pig’s data access syntax on the next page
Pig
   Data Access
▪ t = < 1, { <2, 3, 4>, <4, 6, 8>, <5, 7, 11> }, ['apache': 'search'] >

Method of Data Access      Example                  Value for t                   Applies to which Data Item
Constant                   '1.0' or 'apache.org'    Constant                      Atom
Positional Reference       $0                       '1'                           Tuple
Named Reference            f1                       '1'                           Tuple
Projection                 f2.$0                    { <2>, <4>, <5> }             Bag
Multiple Projection        f2.(g1, g3)              { <2, 4>, <4, 8>, <5, 11> }   Bag
Map Lookup                 f3#'apache'              'search'                      Map
Multiple Map Lookup (?)    ?                        ?                             Map
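
The access methods in the table can be mimicked in plain Python to check the listed values. This is only a sketch under assumed stand-ins (a Python tuple for a Pig tuple, a list of tuples for a bag, a dict for a map); the `project` helper and the field-name dictionaries are hypothetical and are not part of Pig.

```python
# Assumed Python stand-ins for Pig data items (not Pig's API):
# tuple -> Python tuple, bag -> list of tuples, map -> dict with string keys.
t = (1, [(2, 3, 4), (4, 6, 8), (5, 7, 11)], {"apache": "search"})

# Field names from the slide, mapped to field positions.
T_FIELDS = {"f1": 0, "f2": 1, "f3": 2}
BAG_FIELDS = {"g1": 0, "g2": 1, "g3": 2}


def project(bag, positions):
    """Keep only the given field positions from every tuple in a bag."""
    return [tuple(row[p] for p in positions) for row in bag]


positional = t[0]                                    # $0
named = t[T_FIELDS["f1"]]                            # f1
projection = project(t[T_FIELDS["f2"]], [0])         # f2.$0
multi_projection = project(
    t[T_FIELDS["f2"]], [BAG_FIELDS["g1"], BAG_FIELDS["g3"]]
)                                                    # f2.(g1, g3)
map_lookup = t[T_FIELDS["f3"]]["apache"]             # f3#'apache'

print(positional, named, projection, multi_projection, map_lookup)
```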
Pig
Questions
▪ How does a tuple with named fields differ from a map?
▪ How does a tuple of tuples differ from a bag?
▪ When would you ever use a map?

▪ For further information, see Pig’s documentation and mailing lists:
    ▪ Web site: incubator.apache.org/pig
    ▪ Wiki: http://wiki.apache.org/pig
    ▪ Paper: http://www.cs.cmu.edu/~olston/publications/sigmod08.pdf
    ▪ Language reference: http://wiki.apache.org/pig/PigLatin
Pig
Statements
▪ A Pig Latin statement is a command that produces a relation
▪ Pig commands can take zero, one, or more relations as input
▪ Pig commands can span multiple lines and must end with “;”
▪ To play with Pig syntax, you can use the Grunt shell or the StandAloneParser
Pig
Example Data
▪ Let 'a.txt' be a tab-delimited file with three columns per row:
    ▪ 1    2    3
    ▪ 4    2    1
    ▪ 8    3    4
    ▪ 4    3    3
    ▪ 7    2    5
    ▪ 8    4    3
Pig
Example Data
▪ Let 'b.txt' be a tab-delimited file with two columns per row:
    ▪ 2    4
    ▪ 8    9
    ▪ 1    3
    ▪ 2    7
    ▪ 2    9
    ▪ 4    6
    ▪ 4    9
Pig
Statements: LOAD and STORE
▪ LOAD <filename> [USING <function>] [AS <schema>]
▪ Example:
    ▪ grunt> a = LOAD 'a.txt' USING PigStorage('\t') AS (f1, f2, f3);
    ▪ Now a is a relation with six tuples which share a common schema:
        ▪ a = { <1, 2, 3>, <4, 2, 1>, <8, 3, 4>, <4, 3, 3>, <7, 2, 5>, <8, 4, 3> }
        ▪ All the tuples have field names “f1”, “f2”, and “f3”
    ▪ PigStorage() can be replaced by any deserialization function
▪ STORE <relation> INTO <filename> [USING <function>] does the reverse
    ▪ PigStorage() can’t handle nested relations; use BinStorage() instead
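
What LOAD and STORE do with a delimited text file can be approximated in a few lines of Python. This is a rough sketch, not Pig's implementation: the `load` and `store` functions and the in-memory `A_TXT` string are assumptions made for illustration, and fields are left as strings because no types are declared in the slide's schema.

```python
import io


def load(stream, delimiter="\t", schema=None):
    """Read delimited text into a list of tuples (a 'bag') plus optional field names."""
    rows = [
        tuple(line.rstrip("\n").split(delimiter))
        for line in stream
        if line.strip()
    ]
    return rows, schema


def store(rows, delimiter="\t"):
    """Serialize flat tuples back to delimited text (the reverse of load)."""
    return "".join(delimiter.join(row) + "\n" for row in rows)


# Contents of the slide's a.txt, assumed to be three tab-separated columns per row.
A_TXT = "1\t2\t3\n4\t2\t1\n8\t3\t4\n4\t3\t3\n7\t2\t5\n8\t4\t3\n"

a, a_schema = load(io.StringIO(A_TXT), schema=("f1", "f2", "f3"))
print(len(a), a[0], a_schema)
```

Like the flat-text storage described on the slide, this `store` only works for flat tuples of atoms; nested bags or maps would need a different serialization.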
Pig
Statements: FILTER
▪ FILTER <relation> BY <condition>
▪ Example:
    ▪ grunt> x = FILTER a BY f1 == '8' OR f3 > 4;
    ▪ The relation x has three tuples which again share the schema (f1, f2, f3):
        ▪ x = { <8, 3, 4>, <8, 4, 3>, <7, 2, 5> }
    ▪ In addition to standard numerical comparisons, you can also do string comparisons and regular expression matching
    ▪ You can also use your own comparison function
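
The FILTER example can be checked with an equivalent Python list comprehension. This is a sketch that assumes the relation a is held as a list of integer tuples; the position constants F1, F2, F3 are illustrative names, not Pig syntax.

```python
# Relation a from the LOAD slide, modeled as a list of integer tuples.
a = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

# Field positions for the schema (f1, f2, f3).
F1, F2, F3 = 0, 1, 2

# x = FILTER a BY f1 == '8' OR f3 > 4;
x = [row for row in a if row[F1] == 8 or row[F3] > 4]

# A bag is unordered, so sort before comparing with the slide's listing.
print(sorted(x))
```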
Pig
Statements: GROUP
▪ GROUP <relation> BY [<fields> | ALL | ANY]
▪ Only makes sense if the tuples in the relation share at least part of their schema
▪ Example:
    ▪ grunt> y = GROUP x BY f1;
    ▪ The relation y has two tuples which share the schema (group, x):
        ▪ y = { < 7, { < 7, 2, 5 > } >, < 8, { < 8, 3, 4 >, < 8, 4, 3 > } > }
    ▪ Using ALL returns a single tuple holding all input tuples in a single bag
    ▪ Note that GROUP is just syntactic sugar for COGROUP on a single relation
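
The grouping in this example can be reproduced in Python as a sanity check. This is a minimal sketch, assuming the relation x is a list of tuples; `group_by` is a hypothetical helper, and the output is sorted by key only to make the result deterministic.

```python
from collections import defaultdict


def group_by(bag, key_pos):
    """Collect tuples into one bag per distinct value of the key field."""
    groups = defaultdict(list)
    for row in bag:
        groups[row[key_pos]].append(row)
    # Each output tuple has the schema (group, <bag of matching tuples>).
    return sorted(groups.items())


x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]

# y = GROUP x BY f1;
y = group_by(x, 0)
print(y)
```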
Pig
Statements: COGROUP
▪ COGROUP <relation> BY <fields> [INNER][, <relation> BY <fields> [INNER]];
▪ Example:
    ▪ grunt> z = COGROUP x BY f3 INNER, b BY $0 INNER;
    ▪ The relation z has one tuple with three fields, following the schema (group, x, b):
        ▪ z = { < 4, { < 8, 3, 4 > }, { < 4, 6 >, < 4, 9 > } > }
    ▪ Note that we could have used multiple fields with BY
    ▪ The INNER keyword on a relation discards the group records in which that relation's bag is empty
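
The effect of INNER is easier to see when the same COGROUP is run with and without it. The following Python sketch assumes both relations are lists of tuples; the `cogroup` function and its `inner_left` / `inner_right` parameters are hypothetical names for illustration, not Pig's API.

```python
from collections import defaultdict


def cogroup(left, left_key, right, right_key, inner_left=False, inner_right=False):
    """Group two bags on a key, producing (group, left_bag, right_bag) tuples."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[left_key]][0].append(row)
    for row in right:
        groups[row[right_key]][1].append(row)

    result = []
    for key in sorted(groups):
        left_bag, right_bag = groups[key]
        # INNER on a relation drops groups where that relation's bag is empty.
        if inner_left and not left_bag:
            continue
        if inner_right and not right_bag:
            continue
        result.append((key, left_bag, right_bag))
    return result


x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]
b = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

# z = COGROUP x BY f3 INNER, b BY $0 INNER;
z = cogroup(x, 2, b, 0, inner_left=True, inner_right=True)
# The same statement without INNER keeps groups that have an empty bag.
z_outer = cogroup(x, 2, b, 0)
print(z)
print(len(z_outer))
```

With both INNER keywords only the key 4 survives, because it is the only value that appears both in x.f3 and in the first field of b.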
Pig
Statements: FOREACH ... GENERATE
▪ FOREACH <relation> GENERATE <data item>, <data item>, ...;
▪ Example:
    ▪ w = FOREACH x GENERATE f1, f3;
    ▪ Equivalent to the projection x.(f1, f3)
    ▪ The relation w has three tuples which share the schema (f1, f3):
        ▪ w = { <8, 4>, <8, 3>, <7, 5> }
    ▪ Can also have “nested projections”:
        ▪ u = FOREACH y GENERATE group, SUM(x.f3) AS thirdcolsum;
        ▪ u = { <7, 5>, <8, 7> }, where tuples have the schema (group, thirdcolsum)
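
Both FOREACH examples map directly onto Python comprehensions. This is a sketch under the assumption that x is a list of tuples and y is the grouped relation from the GROUP slide, written out by hand here so the block is self-contained.

```python
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]

# w = FOREACH x GENERATE f1, f3;
w = [(f1, f3) for (f1, _f2, f3) in x]

# y = GROUP x BY f1; each tuple has the schema (group, x).
y = [(7, [(7, 2, 5)]), (8, [(8, 3, 4), (8, 4, 3)])]

# u = FOREACH y GENERATE group, SUM(x.f3) AS thirdcolsum;
# x.f3 projects the third field out of each tuple in the nested bag.
u = [(group, sum(row[2] for row in bag)) for group, bag in y]

print(w)
print(u)
```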
Pig
More Keywords and Statements
▪ FLATTEN
▪ JOIN
▪ ORDER
▪ DISTINCT
▪ CROSS
▪ UNION
▪ SPLIT
▪ Write your own functions: http://wiki.apache.org/pig/PigFunctions
Pig
Physical Execution via Hadoop MapReduce
▪ How is a logical Pig plan executed via Hadoop?
    ▪ Details in SIGMOD paper
    ▪ Essentially each (CO)GROUP results in a new map and reduce function
    ▪ Similar to Teradata, intermediate data is materialized in the DFS
    ▪ For Pig commands that take multiple relations as input, an additional field is inserted into each tuple to indicate which relation it came from
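
The idea of tagging each tuple with its source relation can be illustrated with a toy map and reduce pair in Python. This is a simplified sketch, not Pig's actual compiled plan: `map_phase`, `reduce_phase`, and the integer tags are hypothetical, and an in-memory sort stands in for Hadoop's shuffle.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(relations, key_positions):
    """Emit (key, (tag, tuple)) pairs; the tag records the source relation."""
    for tag, (rows, key_pos) in enumerate(zip(relations, key_positions)):
        for row in rows:
            yield row[key_pos], (tag, row)


def reduce_phase(pairs, n_relations):
    """Sort by key (a stand-in for the shuffle), then split each group by tag."""
    shuffled = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(shuffled, key=itemgetter(0)):
        bags = [[] for _ in range(n_relations)]
        for _, (tag, row) in group:
            bags[tag].append(row)
        yield (key, *bags)


x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]
b = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

# COGROUP x BY f3 INNER, b BY $0 INNER: keep groups where every bag is non-empty.
cogrouped = list(reduce_phase(map_phase([x, b], [2, 0]), 2))
z = [rec for rec in cogrouped if all(rec[1:])]
print(z)
```

Without the tag, the reducer would receive a mixed stream of tuples for each key and could not rebuild one bag per input relation.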
Pig
Grunt Shell
▪ Allows you to maintain a working session
▪ You can interact with the DFS as well as your Pig logical objects
▪ The DUMP command lets you see the objects you are working with
▪ The ILLUSTRATE command provides for simple debugging
▪ For more, check out http://wiki.apache.org/pig/Grunt
Pig
Pig Pen
▪ Runs a sequence of Pig commands over a representative sample of data
▪ It is difficult to generate a representative sample when using highly selective FILTER or COGROUP statements
▪ The algorithm runs multiple sampling passes over the data and generates representative data if necessary
▪ Allows for incremental construction of complex Pig commands
Pig Pen
Pig
What’s Missing?
▪ Metadata repository
    ▪ Browse schemas for persistent data
    ▪ Library of serialization and deserialization functions
▪ Optimized logical and physical organization of data
▪ SQL interface
▪ UDF in any language
▪ Execution dataflows other than MapReduce
    ▪ Hash joins, aggregate operators that don’t require a sort, etc.
▪ Query optimization
(c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0
