Working with Structured Data in Hadoop


Jeff Hammerbacher
Manager, Data
May 28 - 29, 2008
Structured Data Management in Hadoop
State of the World
▪ HBase is a Hadoop subproject
    ▪ Powerset and Rapleaf are the main contributors
▪ Hypertable is a Bigtable implementation in C++
    ▪ Zvents is the main contributor
▪ Pig is an Apache Incubator project
    ▪ Yahoo! is the main contributor
▪ JAQL has been released as open source
    ▪ IBM is the main contributor
▪ Hive is not yet publicly available; hopefully under contrib/ soon
    ▪ Facebook is the main contributor
Pig
Philosophy
▪ Pigs Eat Anything
    ▪ Operate on data with or without metadata
    ▪ Operate on relational, nested, or unstructured data
▪ Pigs Live Anywhere
    ▪ The language is independent of the execution environment
▪ Pigs are Domestic Animals
    ▪ Integrate user code wherever possible
    ▪ Allow control over code reorganization when optimizing
▪ Pigs Fly
Pig
Components
▪ Pig Latin
    ▪ Dataflow programming language; procedural, not declarative
    ▪ Algebraic: each step specifies only a single data transformation
    ▪ Parse, verify, and build a logical plan
▪ Evaluation Mechanisms
    ▪ Local evaluation in a single JVM
    ▪ Compilation to Hadoop MapReduce
▪ Grunt: interactive shell
▪ Pig Pen: debugging environment
Pig
Data Model
▪ Pig has four types of data items:
    ▪ Atom: string or number
    ▪ Tuple: “data record” consisting of an ordered sequence of “fields”
        ▪ Denoted with < > bracketing
    ▪ Bag: an unordered collection of tuples with possible duplicates and possibly inconsistent schemas
        ▪ Denoted with { } bracketing
    ▪ Map: an unordered collection of data items where each data item has an associated key; the key must be a string
        ▪ Denoted with [ ] bracketing
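
The four data items can be mimicked with ordinary Python values. This is only an analogy for reading the slides, not how Pig represents data internally; the names `atom`, `tup`, `bag`, `mapping`, and `same_bag` are made up for illustration.

```python
from collections import Counter

# Python stand-ins for Pig's four data items (an analogy, not Pig's implementation).
atom = 1                               # Atom: a string or a number
tup = (1, "apache", 3.0)               # Tuple: an ordered sequence of fields
bag = [(2, 3, 4), (2, 3, 4), ("x",)]   # Bag: duplicates and mixed schemas are allowed
mapping = {"apache": "search"}         # Map: string keys associated with data items


def same_bag(left, right):
    """Compare two bags as multisets: order is ignored, duplicates count."""
    return Counter(left) == Counter(right)
```

A list is used for the bag only because Python has no built-in multiset literal; `same_bag` restores the "unordered, duplicates allowed" behaviour when comparing.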
Pig
Data Model, continued
▪ Fields in a tuple may be named for easier access
▪ A “relation” is a Bag that has been assigned a name (“alias”)
▪ Example:
    ▪ Let t = < 1, { <2, 3, 4>, <4, 6, 8>, <5, 7, 11> }, ['apache': 'search'] >
    ▪ Give the fields of t the names “f1”, “f2”, and “f3”
    ▪ Give the fields of the tuples of the bag the names “g1”, “g2”, and “g3”
    ▪ We’ll look at Pig’s data access syntax on the next slide
Pig
Data Access
▪ t = < 1, { <2, 3, 4>, <4, 6, 8>, <5, 7, 11> }, ['apache': 'search'] >

Method of Data Access     Example                  Value for t                   Applies to Data Item
Constant                  '1.0' or 'apache.org'    Constant                      Atom
Positional Reference      $0                       '1'                           Tuple
Named Reference           f1                       '1'                           Tuple
Projection                f2.$0                    { <2>, <4>, <5> }             Bag
Multiple Projection       f2.(g1, g3)              { <2, 4>, <4, 8>, <5, 11> }   Bag
Map Lookup                f3#'apache'              'search'                      Map
Multiple Map Lookup (?)   ?                        ?                             Map
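
Each access method in the table has a rough Python counterpart. The sketch below rebuilds t with `namedtuple` (the type names `T` and `G` are hypothetical) and evaluates the same expressions. Atoms are written as Python integers here, whereas the slide shows them as quoted values.

```python
from collections import namedtuple

# Hypothetical record types carrying the field names used on the slide.
G = namedtuple("G", ["g1", "g2", "g3"])
T = namedtuple("T", ["f1", "f2", "f3"])

t = T(
    f1=1,
    f2=[G(2, 3, 4), G(4, 6, 8), G(5, 7, 11)],   # the bag
    f3={"apache": "search"},                    # the map
)

positional = t[0]                                   # $0
named = t.f1                                        # f1
projection = [(g[0],) for g in t.f2]                # f2.$0
multi_projection = [(g.g1, g.g3) for g in t.f2]     # f2.(g1, g3)
map_lookup = t.f3["apache"]                         # f3#'apache'
```

Note that projecting a bag yields another bag of (shorter) tuples, which is why `projection` holds one-element tuples rather than bare numbers.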
Pig
Questions
▪ How does a tuple with named fields differ from a map?
▪ How does a tuple of tuples differ from a bag?
▪ When do you ever use a map?

▪ For further information, see Pig’s documentation and mailing lists:
    ▪ Web site: incubator.apache.org/pig
    ▪ Wiki: http://wiki.apache.org/pig
    ▪ Paper: http://www.cs.cmu.edu/~olston/publications/sigmod08.pdf
    ▪ Language reference: http://wiki.apache.org/pig/PigLatin
Pig
Statements
▪ A Pig Latin statement is a command that produces a relation
▪ Pig commands can take zero, one, or more relations as input
▪ Pig commands can span multiple lines and must end with “;”
▪ To play with Pig syntax, you can use the grunt shell or the StandAloneParser
Pig
Example Data
▪ Let 'a.txt' be a tab-delimited file with values:
    ▪ 1    2    3
    ▪ 4    2    1
    ▪ 8    3    4
    ▪ 4    3    3
    ▪ 7    2    5
    ▪ 8    4    3
Pig
Example Data
▪ Let 'b.txt' be a tab-delimited file with values:
    ▪ 2    4
    ▪ 8    9
    ▪ 1    3
    ▪ 2    7
    ▪ 2    9
    ▪ 4    6
    ▪ 4    9
Pig
Statements: LOAD and STORE
▪ LOAD <filename> [USING <function>] [AS <schema>]
▪ Example:
    ▪ grunt> a = LOAD 'a.txt' USING PigStorage('\t') AS (f1, f2, f3);
    ▪ Now a is a relation with six tuples which share a common schema:
        ▪ a = { <1, 2, 3>, <4, 2, 1>, <8, 3, 4>, <4, 3, 3>, <7, 2, 5>, <8, 4, 3> }
        ▪ All the tuples have field names “f1”, “f2”, and “f3”
    ▪ PigStorage() can be replaced by any deserialization function
▪ STORE <relation> INTO <filename> [USING <function>] does the reverse
    ▪ PigStorage() can’t handle nested relations; use BinStorage() instead
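
What a delimiter-based load and store amount to can be sketched in a few lines of Python. This models the idea only; `load`, `store`, and `A_TXT` are illustrative names, the file is replaced by an in-memory string, and every field is assumed to be an integer.

```python
import csv
import io

# In-memory stand-in for the tab-delimited a.txt from the example data.
A_TXT = "1\t2\t3\n4\t2\t1\n8\t3\t4\n4\t3\t3\n7\t2\t5\n8\t4\t3\n"


def load(text, delimiter="\t"):
    """Split delimited text into a relation: a list of flat tuples of ints."""
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [tuple(int(value) for value in row) for row in rows]


def store(relation, delimiter="\t"):
    """Reverse of load: one line per tuple. Only flat tuples are supported."""
    return "".join(delimiter.join(str(v) for v in t) + "\n" for t in relation)


a = load(A_TXT)
```

Like the delimited storage function described above, this `store` has no way to write a nested bag or map, which is the reason a binary format is suggested for nested relations.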
Pig
Statements: FILTER
▪ FILTER <relation> BY <condition>
▪ Example:
    ▪ grunt> x = FILTER a BY f1 == '8' OR f3 > 4;
    ▪ The relation x has three tuples which again share the schema (f1, f2, f3):
        ▪ x = { <8, 3, 4>, <8, 4, 3>, <7, 2, 5> }
    ▪ In addition to standard numerical comparisons, you can also do string comparisons and regular expression matching
    ▪ You can also use your own comparison function
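
The FILTER example can be checked by hand with a Python model of the same predicate. The helper name `filter_by` is hypothetical, and fields are addressed by position (f1 is index 0, f3 is index 2).

```python
# The relation a from the LOAD example, as a list of (f1, f2, f3) tuples.
a = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]


def filter_by(relation, condition):
    """Keep the tuples for which the condition holds; the schema is unchanged."""
    return [t for t in relation if condition(t)]


# Model of: x = FILTER a BY f1 == '8' OR f3 > 4;
x = filter_by(a, lambda t: t[0] == 8 or t[2] > 4)

# Any Python callable can play the role of a user-supplied comparison function.
odd_f2 = filter_by(a, lambda t: t[1] % 2 == 1)
```

Because a bag is unordered, only the set of surviving tuples matters, not the order in which they are listed.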
Pig
Statements: GROUP
▪ GROUP <relation> BY [<fields> | ALL | ANY]
▪ Only makes sense if tuples in the relation have partially shared schemas
▪ Example:
    ▪ grunt> y = GROUP x BY f1;
    ▪ The relation y has two tuples which share the schema (group, x):
        ▪ y = { < 7, { < 7, 2, 5 > } >, < 8, { < 8, 3, 4 >, < 8, 4, 3 > } > }
    ▪ Using ALL will return a single tuple with all tuples collected into a single bag
    ▪ Note that GROUP is just syntactic sugar for COGROUP on a single relation
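
A Python model of grouping makes the nested result easier to see: each output tuple pairs a group key with the bag of input tuples that share it. The function name `group_by` is an assumption for illustration, and the output is sorted by key only so that it is deterministic.

```python
from collections import defaultdict

# The relation x from the FILTER example, as (f1, f2, f3) tuples.
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]


def group_by(relation, key):
    """Return (group, bag) tuples, one per distinct key value."""
    groups = defaultdict(list)
    for t in relation:
        groups[key(t)].append(t)
    return sorted(groups.items())


# Model of: y = GROUP x BY f1;
y = group_by(x, key=lambda t: t[0])

# Grouping on a constant key collects every tuple into a single bag.
everything = group_by(x, key=lambda t: "all")
```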
Pig
Statements: COGROUP
▪ COGROUP <relation> BY <fields> [INNER][, <relation> BY <fields> [INNER]];
▪ Example:
    ▪ grunt> z = COGROUP x BY f3 INNER, b BY $0 INNER;
    ▪ The relation z has one tuple with the schema (group, x, b):
        ▪ z = { < 4, { < 8, 3, 4 > }, { < 4, 6 >, < 4, 9 > } > }
    ▪ Note that we could have used multiple fields with BY
    ▪ The INNER keyword on a relation tosses out the group records in which that relation’s bag is empty
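
The effect of INNER is easiest to see by running the example with and without it. The following Python sketch models a two-relation COGROUP; `cogroup` and its `inner` parameter are hypothetical names, and b is assumed to hold the two-column rows of b.txt.

```python
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]                          # (f1, f2, f3)
b = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]   # rows of b.txt


def cogroup(left, left_key, right, right_key, inner=(False, False)):
    """Return (group, left_bag, right_bag) tuples, one per distinct key.

    inner[i] set to True drops groups where relation i contributes an empty bag.
    """
    keys = {left_key(t) for t in left} | {right_key(t) for t in right}
    result = []
    for k in sorted(keys):
        left_bag = [t for t in left if left_key(t) == k]
        right_bag = [t for t in right if right_key(t) == k]
        if (inner[0] and not left_bag) or (inner[1] and not right_bag):
            continue
        result.append((k, left_bag, right_bag))
    return result


# Model of: z = COGROUP x BY f3 INNER, b BY $0 INNER;
z = cogroup(x, lambda t: t[2], b, lambda t: t[0], inner=(True, True))

# Without INNER, every key seen in either relation yields a group.
z_outer = cogroup(x, lambda t: t[2], b, lambda t: t[0])
```

Only the key 4 occurs both as an f3 value in x and as a first field in b, so the INNER version keeps a single group.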
Pig
Statements: FOREACH ... GENERATE
▪ FOREACH <relation> GENERATE <data item>, <data item>, ...;
▪ Example:
    ▪ w = FOREACH x GENERATE f1, f3;
    ▪ Equivalent to the projection x.(f1, f3)
    ▪ The relation w has three tuples which share the schema (f1, f3):
        ▪ w = { <8, 4>, <8, 3>, <7, 5> }
    ▪ Can also have “nested projections”:
        ▪ u = FOREACH y GENERATE group, SUM(x.f3) AS thirdcolsum;
        ▪ u = { <7, 5>, <8, 7> }, where tuples have the schema (group, thirdcolsum)
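
Both FOREACH examples reduce to a per-tuple transformation, which a Python comprehension expresses directly. This is a model of the semantics under the assumption that x and y hold the values shown on the earlier slides.

```python
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]                        # (f1, f2, f3)
y = [(7, [(7, 2, 5)]), (8, [(8, 3, 4), (8, 4, 3)])]          # (group, x)

# Model of: w = FOREACH x GENERATE f1, f3;
w = [(f1, f3) for (f1, _f2, f3) in x]

# Model of: u = FOREACH y GENERATE group, SUM(x.f3) AS thirdcolsum;
# x.f3 projects the third field out of each group's bag before summing.
u = [(group, sum(t[2] for t in bag)) for (group, bag) in y]
```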
Pig
More Keywords and Statements
▪ FLATTEN
▪ JOIN
▪ ORDER
▪ DISTINCT
▪ CROSS
▪ UNION
▪ SPLIT
▪ Write your own functions: http://wiki.apache.org/pig/PigFunctions
Pig
Physical Execution via Hadoop MapReduce
▪ How is a logical Pig plan executed via Hadoop?
    ▪ Details in SIGMOD paper
    ▪ Essentially each (CO)GROUP results in a new map and reduce function
    ▪ Similar to Teradata, intermediate data is materialized in the DFS
    ▪ For Pig commands that take multiple relations as input, an additional field is inserted into each tuple to indicate which relation it came from
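
The tagging idea can be illustrated with a toy, single-process map and reduce in Python. This is a sketch of the general technique, not Pig's actual compiler output; `map_phase`, `reduce_phase`, and the integer tag are assumptions, and an in-memory sort stands in for Hadoop's shuffle.

```python
from itertools import groupby

x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]
b = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]


def map_phase(relations, keys):
    """Emit (group key, relation tag, tuple) for every input tuple."""
    for tag, (relation, key) in enumerate(zip(relations, keys)):
        for t in relation:
            yield key(t), tag, t


def reduce_phase(records, relation_count):
    """Collect tuples sharing a key into one bag per source relation."""
    shuffled = sorted(records, key=lambda record: record[0])  # stand-in for shuffle
    for k, group in groupby(shuffled, key=lambda record: record[0]):
        bags = [[] for _ in range(relation_count)]
        for _key, tag, t in group:
            bags[tag].append(t)   # the tag says which relation the tuple came from
        yield (k, *bags)


records = map_phase([x, b], [lambda t: t[2], lambda t: t[0]])
cogrouped = {row[0]: row[1:] for row in reduce_phase(records, 2)}
```

Without the tag, the reducer would receive a mixed stream of tuples for each key and could not rebuild the separate bags for x and b.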
Pig
Grunt Shell
▪ Allows you to maintain a working session
▪ You can interact with the DFS as well as your Pig logical objects
▪ The DUMP command lets you see the objects you are working with
▪ The ILLUSTRATE command provides for simple debugging
▪ For more, check out http://wiki.apache.org/pig/Grunt
Pig
Pig Pen
▪ Run a sequence of Pig commands over a representative sample of data
▪ Difficult to generate a representative sample when using highly selective FILTER or COGROUP statements
▪ Algorithm runs multiple sampling passes over the data and generates representative data if necessary
▪ Allows for incremental construction of complex Pig commands
Pig Pen
Pig
What’s Missing?
▪ Metadata repository
    ▪ Browse schemas for persistent data
    ▪ Library of serialization and deserialization functions
▪ Optimized logical and physical organization of data
▪ SQL interface
▪ UDF in any language
▪ Execution dataflows other than MapReduce
    ▪ Hash joins, aggregate operators that don’t require a sort, etc.
▪ Query optimization
(c) 2008 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved. 1.0

20080529dublinpt2

  • 1.
  • 2. Working with Structured Data in Hadoop Jeff Hammerbacher Manager, Data May 28 - 29, 2008
  • 3. Structured Data Management in Hadoop State of the World HBase is a Hadoop subproject ▪ Powerset and Rapleaf are the main contributors ▪ Hypertable is Bigtable in C++ ▪ Zvents are the main contributors ▪ Pig is an Apache Incubator project ▪ Yahoo! is the main contributor ▪ JAQL has been released as open source ▪ IBM is the main contributor ▪ Hive not available publicly, hopefully under contrib/ soon ▪ Facebook is the main contributor ▪
  • 4. Pig Philosophy Pigs Eat Anything ▪ Operate on data with or without metadata ▪ Operate on relational, nested, or unstructured data ▪ Pigs Live Anywhere ▪ The language is independent of execution environment ▪ Pigs are Domestic Animals ▪ Integrate user code wherever possible ▪ Allow control over code reorganization when optimizing ▪ Pigs Fly ▪
  • 5. Pig Components Pig Latin ▪ Dataflow programming language; procedural, not declarative ▪ Algebraic: each step specifies only a single data transformation ▪ Parse, verify, and build a logical plan ▪ Evaluation Mechanisms ▪ Local evaluation in single JVM ▪ Compilation to Hadoop MapReduce ▪ Grunt: interactive shell ▪ Pig Pen: debugging environment ▪
  • 6. Pig: Data Model
      ▪ Pig has four types of data items:
          ▪ Atom: string or number
          ▪ Tuple: “data record” consisting of an ordered sequence of “fields”
              ▪ Denoted with < > bracketing
          ▪ Bag: an unordered collection of tuples with possible duplicates and possibly inconsistent schemas
              ▪ Denoted with { } bracketing
          ▪ Map: an unordered collection of data items where each data item has an associated key; the key must be a string
              ▪ Denoted with [ ] bracketing
  • 7. Pig: Data Model, continued
      ▪ Fields in a tuple may be named for easier access
      ▪ A “relation” is a Bag that has been assigned a name (“alias”)
      ▪ Example:
      ▪ Let t = < 1, { <2, 3, 4>, <4, 6, 8>, <5, 7, 11> }, ['apache': 'search'] >
      ▪ Give the fields of t the names “f1”, “f2”, and “f3”
      ▪ Give the fields of the tuples of the bag the names “g1”, “g2”, and “g3”
      ▪ We’ll look at Pig’s data access syntax on the next page
  • 8. Pig: Data Access
      ▪ t = < 1, { <2, 3, 4>, <4, 6, 8>, <5, 7, 11> }, ['apache': 'search'] >
      Method of Data Access | Example | Value for t | Applies to which Data Item
      Constant | '1.0' or 'apache.org' | Constant | Atom
      Positional Reference | $0 | '1' | Tuple
      Named Reference | f1 | '1' | Tuple
      Projection | f2.$0 | { <2>, <4>, <5> } | Bag
      Multiple Projection | f2.(g1, g3) | { <2, 4>, <4, 8>, <5, 11> } | Bag
      Map Lookup | f3#'apache' | 'search' | Map
      Multiple Map Lookup (?) | ? | ? | Map
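The access methods for t can be mimicked with ordinary Python containers. This is only a sketch of the semantics, not Pig itself: encoding tuples as Python tuples, the bag as a list, and the map as a dict is an assumption, as are the helper name `project` and the two name tables. Atoms are kept as integers here, whereas the slide prints them as quoted strings.

```python
# Model of t = < 1, { <2,3,4>, <4,6,8>, <5,7,11> }, ['apache': 'search'] >
# Assumed encodings: tuple -> Python tuple, bag -> list, map -> dict.
t = (1, [(2, 3, 4), (4, 6, 8), (5, 7, 11)], {"apache": "search"})

# Field names assigned on the slide, mapped to positions (illustrative tables).
T_FIELDS = {"f1": 0, "f2": 1, "f3": 2}
G_FIELDS = {"g1": 0, "g2": 1, "g3": 2}


def project(bag, positions):
    """Keep only the given field positions of every tuple in a bag."""
    return [tuple(row[p] for p in positions) for row in bag]


positional = t[0]                                   # $0
named = t[T_FIELDS["f1"]]                           # f1
single = project(t[T_FIELDS["f2"]], [0])            # f2.$0
multi = project(t[T_FIELDS["f2"]],
                [G_FIELDS["g1"], G_FIELDS["g3"]])   # f2.(g1, g3)
lookup = t[T_FIELDS["f3"]]["apache"]                # f3#'apache'
```

A projection applied to a bag yields another bag of narrower tuples, which is why `single` holds one-element tuples rather than bare atoms.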
  • 9. Pig: Questions
      ▪ How does a tuple with named fields differ from a map?
      ▪ How does a tuple of tuples differ from a bag?
      ▪ When do you ever use a map?
      ▪ For further information, see Pig’s documentation and mailing lists:
      ▪ Web site: incubator.apache.org/pig
      ▪ Wiki: http://wiki.apache.org/pig
      ▪ Paper: http://www.cs.cmu.edu/~olston/publications/sigmod08.pdf
      ▪ Language reference: http://wiki.apache.org/pig/PigLatin
  • 10. Pig: Statements
      ▪ A Pig Latin statement is a command that produces a relation
      ▪ Pig commands can take zero, one, or more relations as input
      ▪ Pig commands can span multiple lines and must include “;” at the end
      ▪ To play with Pig syntax, you can use the grunt shell or the StandAloneParser
  • 11. Pig: Example Data
      ▪ Let 'a.txt' be a tab-delimited file with three fields per line and values:
      ▪ 1  2  3
      ▪ 4  2  1
      ▪ 8  3  4
      ▪ 4  3  3
      ▪ 7  2  5
      ▪ 8  4  3
  • 12. Pig: Example Data
      ▪ Let 'b.txt' be a tab-delimited file with two fields per line and values:
      ▪ 2  4
      ▪ 8  9
      ▪ 1  3
      ▪ 2  7
      ▪ 2  9
      ▪ 4  6
      ▪ 4  9
  • 13. Pig Statements: LOAD and STORE
      ▪ LOAD <filename> [USING <function>] [AS <schema>]
      ▪ Example:
      ▪ grunt> a = LOAD 'a.txt' USING PigStorage('\t') AS (f1, f2, f3);
      ▪ Now a is a relation with six tuples which share a common schema:
      ▪ a = { <1, 2, 3>, <4, 2, 1>, <8, 3, 4>, <4, 3, 3>, <7, 2, 5>, <8, 4, 3> }
      ▪ All the tuples have the field names “f1”, “f2”, and “f3”
      ▪ PigStorage() can be replaced by any deserialization function
      ▪ STORE <relation> INTO <filename> [USING <function>] does the reverse
      ▪ PigStorage() can’t handle nested relations; use BinStorage() instead
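What LOAD with a delimiter-based deserializer and STORE do to the example file can be sketched in Python. The functions `load` and `store` are hypothetical stand-ins, not Pig APIs. The file contents are inlined as a string with the tabs written out, and the fields are cast to integers for convenience, which is an assumption about typing rather than something the slides state.

```python
import csv
import io

# Contents of the example a.txt, with the tab delimiters written explicitly.
A_TXT = "1\t2\t3\n4\t2\t1\n8\t3\t4\n4\t3\t3\n7\t2\t5\n8\t4\t3\n"


def load(text, delimiter="\t", schema=None):
    """Hypothetical stand-in for LOAD ... USING PigStorage(delim) AS (schema)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Each input line becomes one flat tuple; values are cast to int (assumption).
    rows = [tuple(int(value) for value in record) for record in reader]
    return {"schema": schema, "tuples": rows}


def store(relation, delimiter="\t"):
    """Hypothetical stand-in for STORE: serialize flat tuples back to text."""
    return "".join(
        delimiter.join(str(value) for value in row) + "\n"
        for row in relation["tuples"]
    )


a = load(A_TXT, schema=("f1", "f2", "f3"))
```

A flat, delimiter-based format like this has no way to write a bag inside a field, which is one way to see why a different storage function is needed for nested relations.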
  • 14. Pig Statements: FILTER
      ▪ FILTER <relation> BY <condition>
      ▪ Example:
      ▪ grunt> x = FILTER a BY f1 == '8' OR f3 > 4;
      ▪ The relation x has three tuples which again share the schema (f1, f2, f3):
      ▪ x = { <8, 3, 4>, <8, 4, 3>, <7, 2, 5> }
      ▪ In addition to standard numerical comparisons, you can also do string comparisons and even do regular expression matching
      ▪ You can also use your own comparison function
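The FILTER example can be checked by hand with a small Python model. The function name `pig_filter` and the list-of-tuples encoding are illustrative assumptions. Because a bag is unordered, the order of the result is not significant.

```python
# Relation a from the LOAD example, as a list of (f1, f2, f3) tuples.
a = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]


def pig_filter(relation, condition):
    """Keep the tuples for which the condition holds; the schema is unchanged."""
    return [row for row in relation if condition(row)]


# FILTER a BY f1 == '8' OR f3 > 4
x = pig_filter(a, lambda row: row[0] == 8 or row[2] > 4)
```

Passing the condition as a callable mirrors the slide's remark that a user-supplied comparison function can be used in place of the built-in operators.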
  • 15. Pig Statements: GROUP
      ▪ GROUP <relation> BY [<fields> | ALL | ANY]
      ▪ Only makes sense if tuples in relation have partially shared schemas
      ▪ Example:
      ▪ grunt> y = GROUP x BY f1;
      ▪ The relation y has two tuples which share the schema (group, x):
      ▪ y = { < 7, { < 7, 2, 5 > } >, < 8, { < 8, 3, 4 >, < 8, 4, 3 > } > }
      ▪ Using ALL will return a single tuple with all of the relation’s tuples collected into a single bag
      ▪ Note that GROUP is just syntactic sugar for COGROUP for a single relation
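The grouping step can be sketched in Python as follows. The helper `group_by` is hypothetical; each output tuple pairs the group key with a bag (here a list) of the original tuples, matching the (group, x) schema. Using a constant key shows the case where the whole relation collapses into one group.

```python
from collections import defaultdict

# Relation x from the FILTER example, as (f1, f2, f3) tuples.
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]


def group_by(relation, key):
    """Return one (group, bag) tuple per distinct key value, sorted by key."""
    groups = defaultdict(list)
    for row in relation:
        groups[key(row)].append(row)
    return sorted(groups.items())


# GROUP x BY f1
y = group_by(x, key=lambda row: row[0])

# A constant key puts every tuple into a single bag.
everything = group_by(x, key=lambda row: "all")
```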
  • 16. Pig Statements: COGROUP
      ▪ COGROUP <relation> BY <fields> [INNER][, <relation> BY <fields> [INNER]];
      ▪ Example:
      ▪ grunt> z = COGROUP x BY f3 INNER, b BY $0 INNER;
      ▪ The relation z has one tuple with the schema (group, x, b):
      ▪ z = { < 4, { < 8, 3, 4 > }, { < 4, 6 >, < 4, 9 > } > }
      ▪ Note that we could have used multiple fields with BY
      ▪ The INNER keyword on either relation will toss out the group records for which that relation’s bag is empty
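A Python sketch of COGROUP over two relations makes the effect of INNER concrete. The function `cogroup` and its keyword arguments are illustrative assumptions, not Pig's implementation. With the example data, 4 is the only key present in both x (on f3) and b (on $0), so requiring INNER on both sides leaves a single group.

```python
# x comes from the FILTER example; b holds the rows of b.txt.
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]
b = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]


def cogroup(left, left_key, right, right_key,
            inner_left=False, inner_right=False):
    """Return (group, left_bag, right_bag) tuples, one per distinct key."""
    keys = sorted({left_key(r) for r in left} | {right_key(r) for r in right})
    out = []
    for k in keys:
        left_bag = [r for r in left if left_key(r) == k]
        right_bag = [r for r in right if right_key(r) == k]
        # INNER on a relation drops groups where that relation's bag is empty.
        if (inner_left and not left_bag) or (inner_right and not right_bag):
            continue
        out.append((k, left_bag, right_bag))
    return out


# COGROUP x BY f3 INNER, b BY $0 INNER
z = cogroup(x, lambda r: r[2], b, lambda r: r[0],
            inner_left=True, inner_right=True)

# Without INNER, every key seen in either relation produces a group.
outer = cogroup(x, lambda r: r[2], b, lambda r: r[0])
```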
  • 17. Pig Statements: FOREACH ... GENERATE
      ▪ FOREACH <relation> GENERATE <data item>, <data item>, ...;
      ▪ Example:
      ▪ w = FOREACH x GENERATE f1, f3;
      ▪ Equivalent to the projection x.(f1, f3)
      ▪ The relation w has three tuples which share the schema (f1, f3):
      ▪ w = { <8, 4>, <8, 3>, <7, 5> }
      ▪ Can also have “nested projections”:
      ▪ u = FOREACH y GENERATE group, SUM(x.f3) AS thirdcolsum;
      ▪ u = { <7, 5>, <8, 7> }, where tuples have the schema (group, thirdcolsum)
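Both FOREACH examples amount to applying a per-tuple function across a relation, which can be sketched in Python. The helper name `foreach` is hypothetical. The nested projection x.f3 becomes a generator over the inner bag, and SUM becomes Python's built-in `sum`.

```python
# x from the FILTER example and y from the GROUP example.
x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]
y = [(7, [(7, 2, 5)]), (8, [(8, 3, 4), (8, 4, 3)])]


def foreach(relation, generate):
    """Apply a per-tuple generate function to every tuple of a relation."""
    return [generate(row) for row in relation]


# w = FOREACH x GENERATE f1, f3;
w = foreach(x, lambda row: (row[0], row[2]))

# u = FOREACH y GENERATE group, SUM(x.f3) AS thirdcolsum;
u = foreach(y, lambda row: (row[0], sum(inner[2] for inner in row[1])))
```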
  • 18. Pig: More Keywords and Statements
      ▪ FLATTEN
      ▪ JOIN
      ▪ ORDER
      ▪ DISTINCT
      ▪ CROSS
      ▪ UNION
      ▪ SPLIT
      ▪ Write your own functions: http://wiki.apache.org/pig/PigFunctions
  • 19. Pig: Physical Execution via Hadoop MapReduce
      ▪ How is a logical Pig plan executed via Hadoop?
      ▪ Details in SIGMOD paper
      ▪ Essentially each (CO)GROUP results in a new map and reduce function
      ▪ Similar to Teradata, intermediate data is materialized in the DFS
      ▪ For Pig commands that take multiple relations as input, an additional field is inserted into each tuple to indicate which relation it came from
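The tagging idea in the last point can be illustrated with a toy map, shuffle, and reduce pipeline in Python. This is a sketch of the general technique under simplifying assumptions (everything in memory, one process), not Pig's actual compiler output. The function names and the choice of an integer index as the tag are hypothetical.

```python
from collections import defaultdict


def map_phase(relations, key_fns):
    """Emit (key, (tag, tuple)); the tag records the source relation's index."""
    for tag, (relation, key_fn) in enumerate(zip(relations, key_fns)):
        for row in relation:
            yield key_fn(row), (tag, row)


def shuffle(pairs):
    """Collect all values sharing a key, as the MapReduce framework would."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[key].append(value)
    return sorted(buckets.items())


def reduce_phase(key, tagged_rows, n_relations):
    """Use the tag to route each tuple back into its relation's bag."""
    bags = [[] for _ in range(n_relations)]
    for tag, row in tagged_rows:
        bags[tag].append(row)
    return (key, *bags)


x = [(8, 3, 4), (8, 4, 3), (7, 2, 5)]
b = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

# Cogroup x on its third field with b on its first field, with no INNER filtering.
pairs = map_phase([x, b], [lambda r: r[2], lambda r: r[0]])
result = [reduce_phase(key, values, 2) for key, values in shuffle(pairs)]
```

Without the tag, the reducer would receive a mixed list of tuples per key and could not rebuild one bag per input relation.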
  • 20. Pig: Grunt Shell
      ▪ Allows you to maintain a working session
      ▪ You can interact with the DFS as well as your Pig logical objects
      ▪ DUMP command will let you see the objects you are working with
      ▪ ILLUSTRATE command provides for simple debugging
      ▪ For more, check out http://wiki.apache.org/pig/Grunt
  • 21. Pig: Pig Pen
      ▪ Run sequence of Pig commands over a representative sample of data
      ▪ Difficult to generate a representative sample when using highly selective FILTER or COGROUP statements
      ▪ Algorithm runs multiple sampling passes over the data and generates representative data if necessary
      ▪ Allows for incremental construction of complex Pig commands
  • 23. Pig: What’s Missing?
      ▪ Metadata repository
      ▪ Browse schemas for persistent data
      ▪ Library of serialization and deserialization functions
      ▪ Optimized logical and physical organization of data
      ▪ SQL interface
      ▪ UDF in any language
      ▪ Execution dataflows other than MapReduce
      ▪ Hash joins, aggregate operators that don’t require a sort, etc.
      ▪ Query optimization
  • 24. (c) 2008 Facebook, Inc. or its licensors. “Facebook” is a registered trademark of Facebook, Inc. All rights reserved. 1.0