Apache Pig UDFs
Extending Pig to solve complex tasks

   UDF = User Defined Functions
Your speaker today:
          Christoph Bauer

          Java developer for 10+ years

          one of the founders

          Helping our clients to use and
          understand their (big) data

          working in "BigData" since 2010
Why use PIG
● ad-hoc way for creating and executing
  map/reduce jobs
● simple, high-level language
● more natural for analysts than map/reduce
Done.




        http://leesfishandphotos.blogspot.de
Oh, wait...
UDFs to the rescue
Writing user defined functions (UDF)
+ easy to use
+ easy to code
+ keep the power of PIG
+ you can write them in java, python, ...
Do whatever you want
● image feature extraction
● geo computations
● data cleaning
● retrieve web pages
● natural language processing
  ...
● much more...
User Defined Functions
● EvalFunc<T>
  public T exec(Tuple input)
● FilterFunc
  public Boolean exec(Tuple input)
● Aggregate Functions
  public interface Algebraic{
      public String getInitial();
      public String getIntermed();
      public String getFinal();
  }
● Load/Store Functions
  public Tuple getNext()
  public void putNext(Tuple input);
What? Why?
[Diagram: a company record with one companyName, several versions of companyAddress, and many versions of Net Worth]




2010 | companyName | current Address | historical Net Worth


2011 | companyName | current Address | historical Net Worth


2012 | companyName | current Address | historical Net Worth
Example
r1, { q1:[(t1, "v1") , (t4, "v2")],
      q2:[(t2, "v3"),(t7, "v4")] }
...apply UDF
r1, t1, q1:"v1", q2:"v4"
r1, t3, q1:"v1", q2:"v4"
r1, t5, q1:"v2", q2:"v4"


SNAPSHOTS(q1, t1 <= t < t6, 2), LATEST (q2)
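
The transformation on this slide can be modelled without Pig at all. The following is a minimal plain-Java sketch, assuming the timestamps t1..t7 are the integers 1..7 and that a snapshot at time t takes the most recent value written at or before t. The class and method names (VersionedCells, valueAt, latest, snapshots) are made up for illustration and are not the UDFs from the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class VersionedCells {
    /** Value visible at time ts: the most recent write at or before ts, or null. */
    public static String valueAt(TreeMap<Long, String> versions, long ts) {
        Map.Entry<Long, String> e = versions.floorEntry(ts);
        return e == null ? null : e.getValue();
    }

    /** Value of the newest version, or null for an empty column. */
    public static String latest(TreeMap<Long, String> versions) {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }

    /** One output row per snapshot time in [start, end), stepping by step. */
    public static List<String> snapshots(String rowKey,
                                         TreeMap<Long, String> q1,
                                         TreeMap<Long, String> q2,
                                         long start, long end, long step) {
        List<String> rows = new ArrayList<>();
        for (long ts = start; ts < end; ts += step) {
            rows.add(rowKey + ", t" + ts
                    + ", q1:" + valueAt(q1, ts)
                    + ", q2:" + latest(q2));
        }
        return rows;
    }

    /** Builds the example row r1 from the slide and returns its snapshots. */
    public static List<String> example() {
        TreeMap<Long, String> q1 = new TreeMap<>();
        q1.put(1L, "v1");
        q1.put(4L, "v2");
        TreeMap<Long, String> q2 = new TreeMap<>();
        q2.put(2L, "v3");
        q2.put(7L, "v4");
        // SNAPSHOTS(q1, t1 <= t < t6, 2), LATEST(q2)
        return snapshots("r1", q1, q2, 1, 6, 2);
    }

    public static void main(String[] args) {
        for (String row : example()) {
            System.out.println(row);
        }
    }
}
```

Running main prints one line per snapshot time (t1, t3, t5), each carrying the q1 value valid at that time and the newest q2 value.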
LATEST
public class LATEST extends EvalFunc<Tuple> {

    public Tuple exec(Tuple input) throws IOException {

    }
}
LATEST (contd.)
public Tuple exec(Tuple input) throws IOException {
    int numTuples = input.size();
    Tuple result = tupleFactory.newTuple(numTuples);
    for (int i = 0; i < numTuples; i++) {
        switch (input.getType(i)) {
        case DataType.BAG:
            // each bag holds the timestamped versions of one column
            DataBag bag = (DataBag) input.get(i);
            Object val = extractLatestValueFromBag(bag);
            if (val != null) {
                result.set(i, val);
            }
            break;
        case DataType.MAP:
            // ... MAPs need different handling
            break;
        default:
            // warn ...
        }
    }
    return result;
}
SNAPSHOT
public class SNAPSHOTS extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        List<Tuple> listOfTuples = new ArrayList<Tuple>();

        DateTime dtCur = new DateTime(start);
        // plus(1L) adds one millisecond so that the end instant itself is included
        DateTime dtEnd = new DateTime(end).plus(1L);
        while (dtCur.isBefore(dtEnd)) {
            // snapshot() takes the timestamp as epoch milliseconds
            listOfTuples.add(snapshot(input, dtCur.getMillis()));

            dtCur = dtCur.plus(increment);
        }
        DataBag bag = factory.newDefaultBag(listOfTuples);
        return bag;
    }
SNAPSHOT (contd.)
protected Tuple snapshot(Tuple input, long ts) throws... {
    int numTuples = input.size();
    // one extra field at position 0 for the snapshot timestamp
    Tuple result = tupleFactory.newTuple(numTuples + 1);
    result.set(0, ts);

    for (int i = 0; i < numTuples; i++) {
        switch (input.getType(i)) {
        case DataType.BAG:
            DataBag bag = (DataBag) input.get(i);
            Object val = extractTSValueFromBag(bag, ts);
            result.set(i + 1, val);
            break;
        case DataType.MAP:
            // handle MAPs
            break;
        default:
        }
    }
    return result;
}
PigLatin
                   r1, { q1:[(t1, "v1") , (t4, "v2")],
                         q2:[(t2, "v3"),(t7, "v4")] }


REGISTER 'my-udf.jar';
DEFINE LATEST myudf.Latest();
DEFINE SNAPSHOT myudf.Snapshot
              ('2000-01-01', '2013-01-01', '1y');
A = LOAD 'inputTable' AS (id, q1, q2);
B = FOREACH A GENERATE id,
    SNAPSHOT(q1) AS SN, LATEST(q2) AS CUR;
C = FOREACH B GENERATE id,
    FLATTEN(SN), FLATTEN(CUR);
STORE C INTO 'output.csv';
Passing parameters to UDFs
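DEFINE SNAPSHOT cool.udf.Snapshot
                 ('2000-01-01', '2013-01-01', '1y');
...
public SNAPSHOTS(String start, String end, String increment) {
    super();
    // Each quoted DEFINE argument arrives as one String.
    // Long.parseLong only accepts numeric strings; date-style arguments
    // such as '2000-01-01' or '1y' need their own parsing.
    this.start = Long.parseLong(start);
    this.end = Long.parseLong(end);
    this.increment = Long.parseLong(increment);
}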
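
Long.parseLong only accepts numeric strings, so date-style arguments like the ones in the DEFINE statement need their own parsing. Below is a minimal sketch of one way to do that with java.time, assuming three separate arguments, ISO dates interpreted as UTC midnight, and an increment written as a number plus a unit letter (1y, 6m, 7d). The class SnapshotParams is hypothetical, and the UDF in the slides uses Joda-Time rather than java.time.

```java
import java.time.LocalDate;
import java.time.Period;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SnapshotParams {
    final LocalDate start;
    final LocalDate end;
    final Period increment;

    /** Pig hands every DEFINE argument to the constructor as a String. */
    public SnapshotParams(String start, String end, String increment) {
        this.start = LocalDate.parse(start);   // e.g. "2000-01-01"
        this.end = LocalDate.parse(end);       // e.g. "2013-01-01"
        // "1y" becomes the ISO-8601 period "P1Y"
        this.increment = Period.parse("P" + increment.toUpperCase(Locale.ROOT));
        if (this.increment.isZero() || this.increment.isNegative()) {
            throw new IllegalArgumentException(
                    "increment must be positive: " + increment);
        }
    }

    /** Snapshot instants as epoch milliseconds, from start to end inclusive. */
    public List<Long> snapshotMillis() {
        List<Long> result = new ArrayList<>();
        for (LocalDate d = start; !d.isAfter(end); d = d.plus(increment)) {
            result.add(d.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli());
        }
        return result;
    }

    public static void main(String[] args) {
        SnapshotParams p = new SnapshotParams("2000-01-01", "2013-01-01", "1y");
        System.out.println(p.snapshotMillis().size() + " snapshots");
    }
}
```

Failing fast in the constructor is a deliberate choice here: a malformed argument then surfaces when the UDF is instantiated instead of on the first record.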
I didn't talk about
● UDFs run as a single instance in every
  mapper, reducer, ... use instance variables
  for locally shared objects
● Watch your heap when using Lucene
  Indexes, or implementing the Algebraic
  interface
● Do implement
  public Schema outputSchema(Schema input)
● Report progress when doing time-consuming
  work
● Performance?
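
The first bullet can be illustrated without Pig: build an expensive helper once, keep it in an instance field, and let every later call reuse it. This is a minimal sketch in which CachingFunction and its small lookup table are invented stand-ins for a real UDF and its shared resource.

```java
import java.util.HashMap;
import java.util.Map;

public class CachingFunction {
    // Built on first use, then reused by every call on this instance.
    private Map<String, String> lookup;
    private int loads = 0;

    /** Stand-in for an expensive setup step such as opening an index. */
    private Map<String, String> loadLookup() {
        loads++;
        Map<String, String> m = new HashMap<>();
        m.put("DE", "Germany");
        m.put("FR", "France");
        return m;
    }

    /** Stand-in for exec(Tuple): called once per input record. */
    public String exec(String countryCode) {
        if (lookup == null) {
            lookup = loadLookup();
        }
        return lookup.getOrDefault(countryCode, "unknown");
    }

    /** How many times the expensive setup actually ran. */
    public int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        CachingFunction f = new CachingFunction();
        System.out.println(f.exec("DE"));
        System.out.println(f.exec("FR"));
        System.out.println("setup ran " + f.loadCount() + " time(s)");
    }
}
```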
SNAPSHOT (contd.)
@Override
public Schema outputSchema(Schema input) {
    List<FieldSchema> out = new ArrayList<FieldSchema>();
    // first field of every snapshot tuple is its timestamp
    out.add(new FieldSchema("snapshot", DataType.LONG));

    for (FieldSchema fieldSchema : input.getFields()) {
        String alias = fieldSchema.alias;
        byte type = fieldSchema.type;
        out.add(new FieldSchema(alias, type));
    }
    Schema bagSchema = new Schema(out);
    try {
        return new Schema(new FieldSchema(
            getSchemaName("snapshots", input), bagSchema, DataType.BAG));
    } catch (FrontendException e) {
        // schema could not be built; fall through and return null
    }
    return null;
}
Reality check
● These UDFs are in production
● Producing reports with up to 60 GB
● Data is stored in HBase
Wrapping it up
We at Oberbaum Concept developed a bunch
of PIG Functions handling versioned data in
HBase.
● Rewrote HBaseStorage
● UDFs for Snapshots, Latest

Right now we are trying to push our changes
back into PIG.
Questions?
Thank you!
                Christoph Bauer




christoph.bauer@oberbaum-concept.com
https://www.xing.com/profile/Christoph_Bauer62
