2. Agenda
Issues with MapReduce pipelines
Solving with Apache Crunch
Data Model & Operations
System Workflow
Examples
Questions & Answers
3. Issues with MapReduce Pipelines
Unit testing a pipeline?
You must be joking!
Can someone tell me where the business logic is?
Chain performance?
Learn Pig Latin first!
4. Apache Crunch
Is a Java library
Contains collections that can execute parallel operations
Collections are evaluated lazily at runtime
Operations are merged at runtime into efficient chains
Available @ http://incubator.apache.org/crunch/
Based on Google FlumeJava paper
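The two ideas above, lazy evaluation and merging of operations, can be illustrated with a small plain-Java toy. This is only a sketch of the concept and is not the Crunch API: `LazyCollection`, `map`, `materialize`, and `passesOverSource` are hypothetical names invented for this illustration, and the data lives in an in-memory list rather than in Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Toy model of a lazily evaluated collection. NOT the Crunch API:
// every name in this class is hypothetical and exists only for illustration.
final class LazyCollection<S, T> {
    private final List<S> source;        // untouched input data
    private final Function<S, T> fused;  // all recorded operations, composed into one
    private final int[] passes;          // shared counter: passes made over the source

    private LazyCollection(List<S> source, Function<S, T> fused, int[] passes) {
        this.source = source;
        this.fused = fused;
        this.passes = passes;
    }

    static <S> LazyCollection<S, S> of(List<S> source) {
        return new LazyCollection<S, S>(source, Function.<S>identity(), new int[1]);
    }

    // Records an operation. No data is read here; the function is only
    // composed with the operations recorded so far (the "merge" step).
    <R> LazyCollection<S, R> map(Function<? super T, ? extends R> fn) {
        Function<S, R> next = fused.andThen(fn);
        return new LazyCollection<S, R>(source, next, passes);
    }

    // Runs the fused chain in a single pass over the source.
    List<T> materialize() {
        passes[0]++;
        List<T> out = new ArrayList<>(source.size());
        for (S item : source) {
            out.add(fused.apply(item));
        }
        return out;
    }

    int passesOverSource() {
        return passes[0];
    }
}

class LazySketchDemo {
    public static void main(String[] args) {
        LazyCollection<String, Integer> lengths =
            LazyCollection.of(Arrays.asList(" pig ", "crunch"))
                .map(String::trim)     // recorded, not executed
                .map(String::length);  // recorded, not executed

        System.out.println(lengths.passesOverSource()); // prints 0
        System.out.println(lengths.materialize());      // prints [3, 6]
        System.out.println(lengths.passesOverSource()); // prints 1
    }
}
```

Two `map` calls cost only one pass over the data, and nothing runs until `materialize()` is called. A real pipeline planner works on distributed data and multi-stage jobs, so this toy shows the principle only, not how Crunch plans MapReduce jobs.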
5. Apache Crunch
Supports Hadoop 1 and Hadoop 2 (alpha)
Supports HBase, JDBC, etc.
Works with Writables, Avro, Thrift, and Protocol Buffers
A Scala variant also exists
Integration with R and Clojure is in progress
A Maven archetype exists for creating a sample project