A quick overview of Stratosphere, including our Scala programming interface.
See also bigdataclass.org for two self-paced Stratosphere Big Data exercises.
More information about Stratosphere: stratosphere.eu
2. What is this?
● Distributed data processing system
● DAG (directed acyclic graph) of sources, sinks, and operators: a “data flow”
● Handles distribution, fault tolerance, and network transfer
[Diagram: source → map: “split words” → reduce: “count words” → sink]
3. Why would I use this?
Automatic parallelization / Because you are told to
[Diagram: three parallel instances of the data flow, source → map: “split words” → reduce: “count words” → sink, running side by side]
4. So how do I use this?
(from Java)
● How is data represented in the system?
● How do I create data flows?
● Which types of operators are there?
● How do I write operators?
● How do I run the whole shebang?
5. How do I move my data?
● Data is stored in fields of a PactRecord
● Basic data types: PactString, PactInteger, PactDouble, PactFloat, PactBoolean, …
● New data types must implement the Value interface
6. PactRecord
PactRecord rec = ...;
PactInteger foo = rec.getField(0, PactInteger.class);
int i = foo.getValue();

PactInteger foo2 = new PactInteger(3);
rec.setField(1, foo2);
7. Creating Data Flows
● Create one or several sources
● Create operators:
  – Input is/are the preceding operator(s)
  – Specify a class/object with the operator implementation
● Create one or several sinks:
  – Input is some operator
8. WordCount Example Data Flow
FileDataSource source = new FileDataSource(TextInputFormat.class, dataInput, "Input Lines");
MapContract mapper = MapContract.builder(TokenizeLine.class)
.input(source)
.name("Tokenize Lines")
.build();
ReduceContract reducer = ReduceContract.builder(CountWords.class, PactString.class, 0)
.input(mapper)
.name("Count Words")
.build();
FileDataSink out = new FileDataSink(RecordOutputFormat.class, output, reducer, "Word Counts");
RecordOutputFormat.configureRecordFormat(out)
.recordDelimiter('\n')
.fieldDelimiter(' ')
.field(PactString.class, 0)
.field(PactInteger.class, 1);
Plan plan = new Plan(out, "WordCount Example");
9. Operator Types
● We call them second-order functions (SOFs)
● The code inside the operator is the first-order function, or user-defined function (UDF)
● Currently five SOFs: map, reduce, match, cogroup, cross
● The SOF describes how PactRecords are handed to the UDF
10. Map Operator
● User code receives one record at a time (per call to the user code function)
● Not really a functional map, since all operators can output an arbitrary number of records
11. Map Operator Example
public static class TokenizeLine extends MapStub {
  private final AsciiUtils.WhitespaceTokenizer tokenizer =
    new AsciiUtils.WhitespaceTokenizer();
  private final PactRecord outputRecord = new PactRecord();
  private final PactString word = new PactString();
  private final PactInteger one = new PactInteger(1);

  @Override
  public void map(PactRecord record, Collector<PactRecord> collector) {
    PactString line = record.getField(0, PactString.class);
    this.tokenizer.setStringToTokenize(line);
    while (tokenizer.next(word)) {
      outputRecord.setField(0, word);
      outputRecord.setField(1, one);
      collector.collect(outputRecord);
    }
  }
}
12. Reduce Operator
● User code receives a group of records with the same key
● Must specify which fields of a record are the key
13. Reduce Operator Example
public static class CountWords extends ReduceStub {
  private final PactInteger cnt = new PactInteger();

  @Override
  public void reduce(Iterator<PactRecord> records, Collector<PactRecord> out)
      throws Exception {
    PactRecord element = null;
    int sum = 0;
    while (records.hasNext()) {
      element = records.next();
      PactInteger i = element.getField(1, PactInteger.class);
      sum += i.getValue();
    }
    cnt.setValue(sum);
    element.setField(1, cnt);
    out.collect(element);
  }
}
15. Cross Operator
● Two-input operator
● Cartesian product: every record from the left combined with every record from the right
● One record from the left and one record from the right per user code call
● Implement CrossStub
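The deck shows no code for Cross, so here is its Cartesian-product semantics sketched with plain Scala collections — an analogy only, not the Stratosphere CrossStub API:

```scala
// Cartesian-product semantics of the Cross operator, illustrated with
// plain Scala collections (not the Stratosphere API).
val left  = Seq(("a", 1), ("b", 2))
val right = Seq(("x", 10), ("y", 20), ("z", 30))

// Every record from the left is combined with every record from the right,
// one (left, right) pair per user code call.
val crossed = for (l <- left; r <- right) yield (l, r)
```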
16. Match Operator
● Two-input operator with keys
● Join: each record from the left combined with every record from the right that has the same key
● Implement MatchStub
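The join semantics of Match can likewise be sketched with plain Scala collections — an analogy, not the MatchStub API:

```scala
// Join semantics of the Match operator: a record from the left is
// combined with every right record sharing its key (field 0 here).
val left  = Seq(("a", 1), ("b", 2))
val right = Seq(("a", 10), ("a", 20), ("c", 30))

val joined = for {
  (lKey, lVal) <- left
  (rKey, rVal) <- right
  if lKey == rKey          // only pairs with matching keys survive
} yield (lKey, lVal, rVal)
```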
17. CoGroup Operator
● Two-input operator with keys
● Records from the left combined with all records from the right with the same key
● User code gets an iterator for the left and the right records
● Implement CoGroupStub
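The CoGroup semantics — user code sees all left records and all right records for each key — can be sketched the same way (an analogy, not the CoGroupStub API):

```scala
// CoGroup semantics: for every key, collect all left records and all
// right records with that key; either side may be empty.
val left  = Seq(("a", 1), ("b", 2))
val right = Seq(("a", 10), ("c", 30))

val leftGroups  = left.groupBy(_._1)
val rightGroups = right.groupBy(_._1)
val keys = leftGroups.keySet ++ rightGroups.keySet

val coGrouped = keys.map { k =>
  k -> ((leftGroups.getOrElse(k, Seq.empty), rightGroups.getOrElse(k, Seq.empty)))
}.toMap
```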
18. How to execute a data flow plan
● Either use the LocalExecutor:
  LocalExecutor.execute(plan)
● Or implement PlanAssembler.getPlan(String... args) and run on a local or a proper cluster
● See: http://stratosphere.eu/quickstart/ and http://stratosphere.eu/docs/gettingstarted.html
20. And Now for Something Completely Different
val input = TextFile(textInput)
val words = input
.flatMap { _.split(" ") map { (_, 1) } }
val counts = words
.groupBy { case (word, _) => word }
.reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
val output = counts
.write(wordsOutput, CsvOutputFormat())
val plan = new ScalaPlan(Seq(output))
22. Anatomy of a Scala Class
package foo.bar

import something.other

class Job(arg1: Int) {
  def map(in: Int): String = {
    val i: Int = in + 2
    var a = "Hello"
    i.toString
  }
}
24. Collections
val a = Seq(1, 2, 4)
List("Hallo", 2)
Array(2, 3)
Map(1 -> "1", 2 -> "2")

val b = a map { x => x + 2 }
val c = a map { _ + 2 }
val d = a.map({ _ + 2 })  // same as c
25. Generics and Tuples
val a: Seq[Int] = Seq(1, 2, 4)
val tup = (3, "a")
val tup2: (Int, String) = (3, "a")
27. Skeleton of a Stratosphere Program
● Input: a text file/JDBC source/CSV, etc.
  – loaded into an internal representation: the DataSet
● Transformations on the DataSet
  – map, reduce, join, etc.
● Output: program results in a DataSink
  – text file, JDBC, CSV, etc.
28. The Almighty DataSet
● Operations are methods on DataSet[A]
● Working with DataSet[A] feels like working with Scala collections
● DataSet[A] is not an actual collection but represents computation on a collection
● Stringing together operations creates a data flow graph that can be executed
29. An Important Difference
Immediately executed:

val input: List[String] = ...
val mapped = input.map { s => (s, 1) }

Executed only when the data flow is executed:

val input: DataSet[String] = ...
val mapped = input.map { s => (s, 1) }
val result = mapped.write("file", ...)
val plan = new Plan(result)
execute(plan)
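The deferred-execution behavior can be imitated with a Scala view — an analogy only, not the Stratosphere API: building the pipeline runs no user code until it is forced.

```scala
// A view records the map step without running it, much like a DataSet
// records computation on a collection.
var evaluated = 0
val pipeline = (1 to 5).view.map { x => evaluated += 1; x * 2 }

val before = evaluated      // nothing has executed yet
val result = pipeline.toList // forcing the view (like executing the plan)
val after  = evaluated      // now every element was processed
```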
30. Usable Data Types
● Primitive types
● Tuples
● Case classes
● Custom data types that implement the Value interface
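For example, a case class can serve as a record type much like a two-field tuple — a plain-Scala sketch, not specific to Stratosphere:

```scala
// A case class used as a record type, analogous to a (String, Int) tuple.
case class WordCount(word: String, count: Int)

val wc = WordCount("stratosphere", 3)
val bumped = wc.copy(count = wc.count + 1) // records are immutable; copy to update
```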
31. Creating Data Sources
val input = TextFile("file://")

val input: DataSet[(Int, String)] =
  DataSource("hdfs://", CsvInputFormat[(Int, String)]())

def parseInput(line: String): (Int, Int) = {…}
val input = DataSource("hdfs://",
  DelimitedInputFormat(parseInput))
32. Interlude: Anonymous Functions
var fun: ((Int, String)) => String = ...
fun = { t => t._2 }
fun = { _._2 }
fun = { case (i, w) => w }
33. Map
val input: DataSet[(Int, String)] = ...

val mapper = input
  .map { case (a, b) => (a + 2, b) }

val mapper2 = input
  .flatMap { _._2.split(" ") }

val filtered = input
  .filter { case (a, b) => a > 3 }
34. Reduce
val input: DataSet[(String, Int)] = ...

val reducer = input
  .groupBy { case (w, _) => w }
  .groupReduce { _.minBy {...} }

val reducer2 = input
  .groupBy { case (w, _) => w }
  .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
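The groupBy-then-reduce semantics can be checked on plain Scala collections — an analogy, not the Stratosphere API:

```scala
// Group records by their key (field 0), then reduce each group pairwise,
// mirroring the DataSet reducer2 above.
val input = Seq(("be", 1), ("or", 1), ("be", 1))

val reduced = input
  .groupBy { case (w, _) => w }
  .values
  .map { _.reduce { (w1, w2) => (w1._1, w1._2 + w2._2) } }
  .toSeq
```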
35. Cross
val left: DataSet[(String, Int)] = ...
val right: DataSet[(String, Int)] = ...

val crossed = (left cross right)
  .map { (l, r) => ... }

val crossed2 = (left cross right)
  .flatMap { (l, r) => ... }
38. Creating Data Sinks
val counts: DataSet[(String, Int)] = ...

val sink = counts.write("<>", CsvOutputFormat())

def formatOutput(a: (String, Int)): String = {
  "Word " + a._1 + " count " + a._2
}
val sink2 = counts.write("<>",
  DelimitedOutputFormat(formatOutput))
39. Word Count example
val input = TextFile(textInput)
val words = input
.flatMap { _.split(" ") map { (_, 1) } }
val counts = words
.groupBy { case (word, _) => word }
.reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
val output = counts
.write(wordsOutput, CsvOutputFormat())
val plan = new ScalaPlan(Seq(output))
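The same word-count pipeline can be run on plain Scala collections to see the semantics without a cluster — an analogy, not the Stratosphere API:

```scala
// Word count on a local Seq: split lines into (word, 1) pairs,
// group by word, and sum the counts per group.
val lines = Seq("to be or", "not to be")

val wordCounts = lines
  .flatMap(line => line.split(" ").map(word => (word, 1)))
  .groupBy(_._1)
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
```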
40. Things not mentioned
● There is support for iterations (both in Java and Scala)
● Many more data source/sink formats
● Look at the examples in the Stratosphere source
● Don't be afraid to write on the mailing list and on GitHub:
  – http://stratosphere.eu/quickstart/scala.html
● Or come directly to us