André Kelpe's presentation at Hadoop User Group France - 25.11.2014.
Abstract: Cascading is a widely deployed, production-ready open source data application framework geared towards Java developers. Cascading enables developers to write complex data applications without needing to become distributed systems experts. Cascading apps are portable between different computation frameworks, so that a given application can be moved from Hadoop onto new processing platforms like Apache Tez or Apache Spark without rewriting any of the application code.
The Cascading (big) data application framework - André Kelpe, Sr. Engineer, Concurrent
1. The Cascading
(big) data
application framework
André Kelpe | HUG France | Paris | 25. November 2014
2. Who am I?
André Kelpe
Senior Software Engineer at Concurrent
company behind Cascading, Lingual and
Driven
http://concurrentinc.com / @concurrent
andre@concurrentinc.com / @fs111
3. http://cascading.org
Apache-licensed Java framework for writing data-oriented
applications
production-ready, stable and battle-proven
(SoundCloud, Twitter, Etsy, Climate Corp + many
more)
4. Cascading goals
developer productivity
focus on business problems, not distributed
systems knowledge
useful abstractions over underlying „fabrics“
5. Cascading goals
Testability & robustness
production quality applications rather than a
collection of scripts
(hooks into the core for experts)
7. Cascading terminology
Taps are sources and sinks for data
Schemes represent the format of the data
Pipes connect Taps
8. Cascading terminology
● Tuples flow through Pipes
● Fields describe the Tuples
● Operations are executed on Tuples in
TupleStreams
● FlowConnector uses QueryPlanner to
translate FlowDef into Flow to run on
computational fabric
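The terminology on the two slides above can be illustrated with a toy model in plain Java. This is only a sketch and deliberately does not use the Cascading classes: a list of names stands in for Fields, a list of values for a Tuple, and a method over a list of tuples for an Operation on a TupleStream. The class name, the helper methods and the "doc_id" field are hypothetical and chosen for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

final class TupleStreamSketch {
    // Stand-in for Fields: an ordered list of column names (not cascading.tuple.Fields).
    static final List<String> FIELDS = Arrays.asList("doc_id", "text");

    // Fields describe the tuple: a field's position in FIELDS is the value's position in the tuple.
    static Object get(List<Object> tuple, String field) {
        int pos = FIELDS.indexOf(field);
        if (pos < 0) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return tuple.get(pos);
    }

    // Stand-in for an operation: applied to every tuple in the stream,
    // it upper-cases the "text" value and keeps "doc_id" unchanged.
    static List<List<Object>> upperCaseText(List<List<Object>> stream) {
        return stream.stream()
            .map(t -> Arrays.<Object>asList(
                get(t, "doc_id"),
                get(t, "text").toString().toUpperCase(Locale.ROOT)))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<Object>> stream = Arrays.asList(
            Arrays.<Object>asList("doc01", "a rain shadow"),
            Arrays.<Object>asList("doc02", "a dry area"));
        System.out.println(upperCaseText(stream));
    }
}
```

In Cascading itself these roles are played by the framework's own Tuple, Fields and operation types, and the planner decides where the operation runs; the toy model only shows how field names give meaning to positional values.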
9. Compiler / QueryPlanner
[Diagram: a compiler takes User Code through Translation, Optimization and Assembly to a CPU Architecture; the QueryPlanner takes a FlowDef to a fabric (Hadoop, Tez, Spark)]
10. User-APIs
● Fluid - A Fluent API for Cascading
– Targeted at application writers
– https://github.com/Cascading/fluid
● „Raw“ Cascading API
– Targeted at library writers, code generators,
integration layers
– https://github.com/Cascading/cascading
11. Counting words
// configuration
String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
FlowConnector flowConnector = new Hadoop2MR1FlowConnector( properties );
// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );
...
12. Counting words (cont.)
// specify a regex operation to split the "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter =
new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
...
13. Counting words (cont.)
// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
.setName( "wc" )
.addSource( docPipe, docTap )
.addTailSink( wcPipe, wcTap );
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.complete(); // ← runs the code
}
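To make the result of the word count flow easier to picture, here is a minimal single-machine sketch in plain Java of what the split, group and count steps compute. It is not Cascading code and says nothing about how the planner executes the flow. The class and method names are hypothetical, the sample lines are made up, and the delimiter class is the one from the slides with brackets and parentheses escaped.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

final class LocalWordCount {
    // Delimiters: space, square brackets, parentheses, comma and period.
    private static final Pattern DELIMITERS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // Splits every line into tokens and counts how often each token occurs.
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>(); // sorted by token for stable output
        for (String line : lines) {
            for (String token : DELIMITERS.split(line)) {
                // Adjacent delimiters yield empty tokens. Skipping them is a choice
                // of this sketch; the flow on the slides has no such filter.
                if (token.isEmpty()) {
                    continue;
                }
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("rain shadow (rain)", "a rain, a shadow.");
        System.out.println(countWords(lines)); // prints {a=2, rain=3, shadow=2}
    }
}
```

The nested loop plays the role of the Each with the splitter, and the map with merge plays the role of GroupBy followed by Every with Count. The difference is that the Cascading version describes the same logic declaratively, so the planner can map it onto a cluster.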