1. Cascading
www.cascading.org
info@cascading.org
Wednesday, May 14, 2008
2. Design Goals
Make large processing jobs more transparent
Reusable processing components independent of resources
Incremental “data” builds
Simplify testing of processes
Scriptable from higher level languages (Groovy, JRuby, Jython, etc)
4. Tuple Streams
Tuple
A set of ordered data [“John”, “Doe”, 39]
Value Stream
Just tuples
Group Stream
Tuples grouped by a key
[Diagram: a Value Stream is a sequence of tuples [V1,V2,...,Vn]; a Group Stream is a sequence of keys [K1,K2,...,Kn], each followed by its tuples [V1,V2,...,Vn]]
5. Tuple Streams
Scalar functions and filters
Apply to value and group streams
Aggregate functions
Apply to group stream
Functions can be chained
[Diagram: Source -> func -> Group -> aggr -> Sink]
6. Stream Processing
Pipe Assemblies
A chain of scalar functions, groupings, aggregate functions
Reusable, independent of data source/sink
Flows
Assemblies plus sources and sinks
Cascades
A collection of Flows
[Diagram: a Flow wraps a Pipe Assembly between sources and sinks (S F F G A A S); a Cascade links several Flows (S F S F S)]
7. Processing Patterns
Chain
Splits
Joins
Cross
[Diagram: each pattern drawn with Source, Group, and Sink nodes]
8. MapReduce Planner
Flows are logical ‘units of work’
Flows ‘compiled’ into MR Jobs
Intermediate files are created (and destroyed) to join Jobs
[Diagram: a Flow (S F G A S ...) split into Jobs made of Map and Reduce phases]
9. Topological Scheduler
Flows walk MapReduce Jobs in dependency order
Cascades walk Flows in dependency order
Independent Jobs and Flows are scheduled to run concurrently
Listeners can react to element events (notify completion or failures)
Only stale data-sets are rebuilt (configurable)
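Walking jobs in dependency order can be sketched with Kahn's algorithm. The slides do not say which algorithm Cascading uses, so this is only an illustration under that assumption; the class `TopoSketch`, the method `order`, and the job names are hypothetical and are not part of the Cascading API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of dependency-ordered scheduling; not the Cascading API.
class TopoSketch {

    // deps maps each job name to the jobs it depends on.
    // Every job must appear as a key, even if its dependency list is empty.
    static List<String> order(Map<String, List<String>> deps) {
        Map<String, Integer> remaining = new LinkedHashMap<>();
        Map<String, List<String>> dependents = new LinkedHashMap<>();
        for (String job : deps.keySet()) {
            remaining.put(job, deps.get(job).size());
            dependents.put(job, new ArrayList<>());
        }
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            for (String upstream : e.getValue()) {
                dependents.get(upstream).add(e.getKey());
            }
        }
        // Jobs with no unmet dependencies are ready; independent jobs sit in
        // the queue together and could be started concurrently.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> result = new ArrayList<>();
        while (!ready.isEmpty()) {
            String job = ready.remove();
            result.add(job);
            for (String next : dependents.get(job)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (result.size() != deps.size()) {
            throw new IllegalArgumentException("dependency cycle detected");
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("import", Collections.emptyList());
        deps.put("clean", Arrays.asList("import"));
        deps.put("index", Arrays.asList("import"));
        deps.put("report", Arrays.asList("clean", "index"));
        System.out.println(order(deps));
    }
}
```

In this example "clean" and "index" both become ready as soon as "import" finishes, which is the point at which a scheduler could run them concurrently.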
10. Scripting - Groovy
Flow flow = builder.flow("wordcount") {
    source(input, scheme: text()) // input is the filename of a raw text document
    tokenize(/[.,]*\s+/) // output a new tuple for each split; the result replaces the stream by default
    group() // group on stream
    count() // count values in group; creates a 'count' field by default
    group(["count"], reverse: true) // group/sort on 'count', reverse the sort order
    sink(output)
}
flow.complete() // execute, block until completed
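The steps of the word-count script (tokenize, group, count, sort by count in reverse) can be traced with plain Java collections. This is only a sketch of what the script computes, not how Cascading executes it; the class `WordCountSketch` and its methods are hypothetical, and the split pattern `[.,]*\s+` is an assumption about the intended regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the script's steps; not the Cascading API.
class WordCountSketch {

    // tokenize + group + count: returns word -> count, keyed in sorted order.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.split("[.,]*\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Second grouping step: order the rows by count, largest first.
    static List<Map.Entry<String, Integer>> byCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return rows;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> row : byCountDescending(count("a b, a c. a b"))) {
            System.out.println(row.getKey() + "\t" + row.getValue());
        }
    }
}
```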
11. System Integration
FileSystems (unique to Cascading)
Raw file S3 reading/writing (MD5)
Raw file HTTP reading (MD5)
Zip files
Can bypass native Hadoop ‘collectors’
Event notification via listeners (XMPP/SQS/Zookeeper notifications)
Groovy scripting for easier local shell/file operations (wget, scp, etc)
13. Core Concepts
Taps and Schemes
Tuples and Fields
Pipes and PipeAssemblies
Each and Every Operators
Groups
Flows, FlowSteps, and FlowConnectors
Cascades and CascadeConnectors (optional)
14. Taps and Schemes
Taps abstract out where and how a data resource is accessed
hdfs, http, local, S3, etc
Taps are used as Tuple (data) stream sinks, sources, or both
Schemes define what a resource is made of
text lines, SequenceFile, CSV, etc
15. Tuples and Fields
Tuples are the ‘records’, read from Tap sources, written to Tap sinks
Fields are the ‘column names’, sourced from Schemes
Tuple class, an ordered collection of Comparable values
(“a string”, 1.0, new SomeComparableWritable())
Fields class, a list of field names, absolute or relative positions
(“total”, 3, -1) // fields ‘total’, 4th position, last position
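The three kinds of selector in the example above (a name, an absolute position, a relative position counted from the end) can be resolved against a list of field names as in the sketch below. It assumes 0-based positions, which is consistent with position 3 being the 4th field; `FieldSelectorSketch`, `resolve`, and the sample field names are hypothetical and are not the Cascading `Fields` class.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of field selection; not the Cascading Fields class.
class FieldSelectorSketch {

    // Resolve one selector against the declared field names.
    // A String is looked up by name, a non-negative Integer is an absolute
    // 0-based position, and a negative Integer counts back from the end.
    static int resolve(List<String> names, Object selector) {
        if (selector instanceof String) {
            int pos = names.indexOf(selector);
            if (pos < 0) {
                throw new IllegalArgumentException("unknown field: " + selector);
            }
            return pos;
        }
        int pos = (Integer) selector;
        int resolved = pos < 0 ? names.size() + pos : pos;
        if (resolved < 0 || resolved >= names.size()) {
            throw new IndexOutOfBoundsException("position out of range: " + pos);
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("first", "last", "age", "city", "total");
        System.out.println(resolve(names, "total")); // prints 4
        System.out.println(resolve(names, 3));       // prints 3 (the 4th field)
        System.out.println(resolve(names, -1));      // prints 4 (the last field)
    }
}
```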
16. Pipes and PipeAssemblies
Tuple streams pass through Pipes to be processed
Pipes apply functions, filters, and aggregators to the Tuple stream
Pipe instances are chained together into assemblies
Reusable assemblies are subclasses of class PipeAssembly
17. Group Class and Subclasses
Group, subclass of Pipe, groups the Tuple stream on given fields
GroupBy and CoGroup subclass Group
GroupBy groups and sorts
CoGroup performs joins
18. Each and Every Classes
Each, subclass of Pipe, applies Functions and Filters to each Tuple instance
(a,b,c) -> Each( func() ) -> (a,b,c,d)
Every, subclass of Pipe, applies Aggregators to every Tuple group
(a: b,c) -> Every( agg()) -> (a,d: b,c)
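The difference between the two operators can be illustrated with plain Java collections: one function runs per tuple and appends a value, the other runs once per group. This is a simplified sketch, not the Cascading `Each` and `Every` classes; `EachEverySketch` and its method signatures are hypothetical, and the per-group result is returned as a key-to-aggregate map rather than as a new tuple stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the two operators; not the Cascading Each/Every classes.
class EachEverySketch {

    // Each: apply func to every tuple and append its result,
    // so (a,b,c) becomes (a,b,c,d).
    static List<List<Object>> each(List<List<Object>> tuples,
                                   Function<List<Object>, Object> func) {
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> tuple : tuples) {
            List<Object> extended = new ArrayList<>(tuple);
            extended.add(func.apply(tuple));
            out.add(extended);
        }
        return out;
    }

    // Every: apply agg once per group and pair the group key with the result.
    static Map<String, Object> every(Map<String, List<List<Object>>> groups,
                                     Function<List<List<Object>>, Object> agg) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<List<Object>>> group : groups.entrySet()) {
            out.put(group.getKey(), agg.apply(group.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> tuples = Arrays.asList(Arrays.<Object>asList(1, 2, 3));
        // Append the tuple's size as a fourth value.
        System.out.println(each(tuples, t -> t.size()));

        Map<String, List<List<Object>>> groups = new LinkedHashMap<>();
        groups.put("x", Arrays.asList(Arrays.<Object>asList(1), Arrays.<Object>asList(2)));
        // Count the tuples in each group.
        System.out.println(every(groups, g -> g.size()));
    }
}
```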
19. Flows and FlowConnectors
Flows encapsulate assemblies and sink and source Taps
FlowConnectors connect assemblies and Taps into Flows
20. FlowSteps and FlowConnectors
Internally, FlowConnectors ‘compile’ assemblies into FlowSteps
FlowSteps are MapReduce jobs, which are executed in topological order
Temporary files are created to link FlowSteps
[Diagram: a Flow containing two FlowSteps, each with a Map stack and a Reduce stack]
21. Cascades and CascadeConnectors
Cascades and CascadeConnectors are optional
Cascades bind Flows together via shared Taps
CascadeConnectors connect Flows
Flows are executed in topological order