Introduction to Stratosphere
Aljoscha Krettek
DIMA / TU Berlin
What is this?
● Distributed data processing system
● DAG (directed acyclic graph) of sources, sinks, and operators: "data flow"
● Handles distribution, fault tolerance, network transfer
[Diagram: source -> map: "split words" -> reduce: "count words" -> sink]

Why would I use this?
Automatic parallelization / Because you are told to
[Diagram: three parallel instances of the pipeline, each source -> map: "split words" -> reduce: "count words" -> sink]

So how do I use this? (from Java)
● How is data represented in the system?
● How do I create data flows?
● Which types of operators are there?
● How do I write operators?
● How do I run the whole shebang?

How do I move my data?
● Data is stored in fields in PactRecord
● Basic data types: PactString, PactInteger, PactDouble, PactFloat, PactBoolean, …
● New data types must implement the Value interface

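PactRecord
PactRecord rec = ...
// Read field 0 as a PactInteger and unwrap the Java int
PactInteger foo = rec.getField(0, PactInteger.class);
int i = foo.getValue();
// Wrap a Java int and store it in field 1
PactInteger foo2 = new PactInteger(3);
rec.setField(1, foo2);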

Creating Data Flows
● Create one or several sources
● Create operators:
  – Input is/are preceding operator(s)
  – Specify a class/object with the operator implementation
● Create one or several sinks:
  – Input is some operator

WordCount Example Data Flow
// Source: read the input file line by line
FileDataSource source = new FileDataSource(TextInputFormat.class, dataInput, "Input Lines");
// Map: split each line into words
MapContract mapper = MapContract.builder(TokenizeLine.class)
    .input(source)
    .name("Tokenize Lines")
    .build();
// Reduce: group by the word in field 0 and count
ReduceContract reducer = ReduceContract.builder(CountWords.class, PactString.class, 0)
    .input(mapper)
    .name("Count Words")
    .build();
// Sink: write one "word count" pair per line
FileDataSink out = new FileDataSink(RecordOutputFormat.class, output, reducer, "Word Counts");
RecordOutputFormat.configureRecordFormat(out)
    .recordDelimiter('\n')
    .fieldDelimiter(' ')
    .field(PactString.class, 0)
    .field(PactInteger.class, 1);
Plan plan = new Plan(out, "WordCount Example");

Operator Types
● We call them second-order functions (SOF)
● Code inside the operator is the first-order function or user-defined function (UDF)
● Currently five SOFs: map, reduce, match, cogroup, cross
● The SOF describes how PactRecords are handed to the UDF
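
To make the SOF/UDF split concrete, here is a minimal plain-Java sketch. It does not use the Stratosphere API: the class SecondOrderSketch and its map, reduce, and wordCount methods are hypothetical names invented for this illustration, and records are plain strings rather than PactRecords. The two static methods play the role of second-order functions (they decide how records reach the user code); the lambdas play the role of the UDFs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Function;

/** Plain-Java illustration only; none of these names come from Stratosphere. */
class SecondOrderSketch {

    /** "map" SOF: hands records to the UDF one at a time; the UDF may emit any number of outputs. */
    static <I, O> List<O> map(List<I> input, BiConsumer<I, List<O>> udf) {
        List<O> out = new ArrayList<>();
        for (I record : input) {
            udf.accept(record, out);
        }
        return out;
    }

    /** "reduce" SOF: groups records by key and hands each whole group to the UDF in one call. */
    static <I, K, O> List<O> reduce(List<I> input, Function<I, K> key,
                                    BiConsumer<List<I>, List<O>> udf) {
        Map<K, List<I>> groups = new LinkedHashMap<>();
        for (I record : input) {
            groups.computeIfAbsent(key.apply(record), k -> new ArrayList<>()).add(record);
        }
        List<O> out = new ArrayList<>();
        for (List<I> group : groups.values()) {
            udf.accept(group, out);
        }
        return out;
    }

    /** Word count expressed with the two SOFs above. */
    static List<String> wordCount(List<String> lines) {
        // UDF 1: split a line into words, emitting zero or more outputs per input
        List<String> words = map(lines, (String line, List<String> out) -> {
            for (String w : line.split("\\s+")) {
                if (!w.isEmpty()) {
                    out.add(w);
                }
            }
        });
        // UDF 2: receives all occurrences of one word and emits "word count"
        return reduce(words, (String w) -> w,
                (List<String> group, List<String> out) ->
                        out.add(group.get(0) + " " + group.size()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be or", "not to be")));
    }
}
```

Running main prints [to 2, be 2, or 1, not 1]. The groups appear in first-seen order only because this sketch uses a LinkedHashMap; a distributed system makes no such ordering promise.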

Map Operator
● User code receives one record at a time (per call to the user code function)
● Not really a functional map, since all operators can output an arbitrary number of records

Map Operator Example
public static class TokenizeLine extends MapStub {
    // Created once and reused across calls instead of allocating per record
    private final AsciiUtils.WhitespaceTokenizer tokenizer =
        new AsciiUtils.WhitespaceTokenizer();
    private final PactRecord outputRecord = new PactRecord();
    private final PactString word = new PactString();
    private final PactInteger one = new PactInteger(1);

    @Override
    public void map(PactRecord record, Collector<PactRecord> collector) {
        PactString line = record.getField(0, PactString.class);
        this.tokenizer.setStringToTokenize(line);
        // Emit one (word, 1) record per token
        while (tokenizer.next(word)) {
            outputRecord.setField(0, word);
            outputRecord.setField(1, one);
            collector.collect(outputRecord);
        }
    }
}

Reduce Operator
● User code receives a group of records with the same key
● Must specify which fields of a record are the key

Reduce Operator Example
public static class CountWords extends ReduceStub {
    private final PactInteger cnt = new PactInteger();

    @Override
    public void reduce(Iterator<PactRecord> records, Collector<PactRecord> out)
            throws Exception {
        PactRecord element = null;
        int sum = 0;
        // Sum the counts in field 1 over all records of this key group
        while (records.hasNext()) {
            element = records.next();
            PactInteger i = element.getField(1, PactInteger.class);
            sum += i.getValue();
        }
        // Reuse the last record: field 0 still holds the word
        cnt.setValue(sum);
        element.setField(1, cnt);
        out.collect(element);
    }
}

Specifying the Key Fields
ReduceContract reducer =
    ReduceContract.builder(
            Foo.class,
            PactString.class, 0)           // first key field: PactString at position 0
        .input(mapper)
        .keyField(PactInteger.class, 1)    // additional key field: PactInteger at position 1
        .name("Count Words")
        .build();

Cross Operator
● Two-input operator
● Cartesian product: every record from the left is combined with every record from the right
● One record from the left and one record from the right per user code call
● Implement CrossStub
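
The cross semantics can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not the CrossStub API: CrossSketch and its cross method are hypothetical names, and the inputs are ordinary in-memory lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

/** Plain-Java illustration of cross semantics; not the Stratosphere CrossStub API. */
class CrossSketch {

    /** Calls the user function once per (left, right) pair of the Cartesian product. */
    static <L, R, O> List<O> cross(List<L> left, List<R> right, BiFunction<L, R, O> udf) {
        List<O> out = new ArrayList<>();
        for (L l : left) {
            for (R r : right) {
                out.add(udf.apply(l, r));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 2 left records x 3 right records = 6 calls to the user function
        List<String> pairs = cross(Arrays.asList("a", "b"), Arrays.asList(1, 2, 3),
                (String l, Integer r) -> l + r);
        System.out.println(pairs);
    }
}
```

Running main prints [a1, a2, a3, b1, b2, b3]. The output size is the product of the input sizes, which is why cross gets expensive quickly on large inputs.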

Match Operator
● Two-input operator with keys
● Join: each record from the left is combined with every record from the right that has the same key
● Implement MatchStub
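
A minimal plain-Java sketch of the match (equi-join) semantics follows. It is not the MatchStub API: MatchSketch, match, and the key-extractor parameters are hypothetical names, and the hash-table strategy shown here is just one simple way to compute the result in memory, not a statement about how Stratosphere executes a match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Plain-Java illustration of match semantics; not the Stratosphere MatchStub API. */
class MatchSketch {

    /** Calls the user function once per (left, right) pair whose keys are equal. */
    static <L, R, K, O> List<O> match(List<L> left, Function<L, K> leftKey,
                                      List<R> right, Function<R, K> rightKey,
                                      BiFunction<L, R, O> udf) {
        // Index the right input by key
        Map<K, List<R>> index = new HashMap<>();
        for (R r : right) {
            index.computeIfAbsent(rightKey.apply(r), k -> new ArrayList<>()).add(r);
        }
        // Probe the index with every left record; unmatched records produce nothing
        List<O> out = new ArrayList<>();
        for (L l : left) {
            for (R r : index.getOrDefault(leftKey.apply(l), Collections.emptyList())) {
                out.add(udf.apply(l, r));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The key is the first character of each string
        List<String> joined = match(
                Arrays.asList("a1", "b2", "a3"), (String s) -> s.substring(0, 1),
                Arrays.asList("aX", "cY", "aZ"), (String s) -> s.substring(0, 1),
                (String l, String r) -> l + "+" + r);
        System.out.println(joined);
    }
}
```

Running main prints [a1+aX, a1+aZ, a3+aX, a3+aZ]: "b2" and "cY" have no partner with the same key, so the user function is never called for them.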

CoGroup Operator
● Two-input operator with keys
● Records from the left are combined with all records from the right with the same key
● User code gets an iterator for the left records and one for the right records
● Implement CoGroupStub
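
The difference from match is that the user code sees whole groups rather than single pairs. The plain-Java sketch below illustrates this; CoGroupSketch, coGroup, and groupBy are hypothetical names, not the CoGroupStub API. It also makes one assumption that the slide does not state: a key that occurs on only one side still triggers a call, with an empty group for the other side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Plain-Java illustration of cogroup semantics; not the Stratosphere CoGroupStub API. */
class CoGroupSketch {

    /** Calls the user function once per key with all left and all right records of that key. */
    static <L, R, K, O> List<O> coGroup(List<L> left, Function<L, K> leftKey,
                                        List<R> right, Function<R, K> rightKey,
                                        BiFunction<List<L>, List<R>, O> udf) {
        Map<K, List<L>> leftGroups = groupBy(left, leftKey);
        Map<K, List<R>> rightGroups = groupBy(right, rightKey);
        // Assumption of this sketch: visit every key that occurs on either side
        Set<K> keys = new LinkedHashSet<>(leftGroups.keySet());
        keys.addAll(rightGroups.keySet());
        List<O> out = new ArrayList<>();
        for (K key : keys) {
            List<L> ls = leftGroups.getOrDefault(key, Collections.emptyList());
            List<R> rs = rightGroups.getOrDefault(key, Collections.emptyList());
            out.add(udf.apply(ls, rs));
        }
        return out;
    }

    private static <T, K> Map<K, List<T>> groupBy(List<T> records, Function<T, K> key) {
        Map<K, List<T>> groups = new LinkedHashMap<>();
        for (T record : records) {
            groups.computeIfAbsent(key.apply(record), k -> new ArrayList<>()).add(record);
        }
        return groups;
    }

    public static void main(String[] args) {
        // The key is the first character; the user function reports both group sizes
        List<String> sizes = coGroup(
                Arrays.asList("a1", "b2", "a3"), (String s) -> s.substring(0, 1),
                Arrays.asList("aX", "cY"), (String s) -> s.substring(0, 1),
                (List<String> ls, List<String> rs) -> ls.size() + ":" + rs.size());
        System.out.println(sizes);
    }
}
```

Running main prints [2:1, 1:0, 0:1] for the keys a, b, and c: one call per key, each seeing the complete left and right groups.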

How to execute a data flow plan
● Either use LocalExecutor: LocalExecutor.execute(plan)
● Or implement PlanAssembler.getPlan(String... args) and run on a local cluster or a proper cluster
● See: http://stratosphere.eu/quickstart/ and http://stratosphere.eu/docs/gettingstarted.html

Getting Started

https://github.com/stratosphere/stratosphere
https://github.com/stratosphere/stratosphere-quickstart

And Now for Something Completely Different
val input = TextFile(textInput)
val words = input
  .flatMap { _.split(" ") map { (_, 1) } }
val counts = words
  .groupBy { case (word, _) => word }
  .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
val output = counts
  .write(wordsOutput, CsvOutputFormat())
val plan = new ScalaPlan(Seq(output))

(Very) Short Introduction to Scala

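Anatomy of a Scala Class
package foo.bar
import something.else
class Job(arg1: Int) {
  def map(in: Int): String = {
    val i: Int = in + 2   // val: cannot be reassigned
    var a = "Hello"       // var: can be reassigned; type is inferred
    i.toString            // the last expression is the return value
  }
}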

Singletons
● Similar to Java singletons and/or static methods
object Job {
  def main(args: Array[String]): Unit = {
    println("Hello World")
  }
}

Collections
val a = Seq(1, 2, 4)
List("Hallo", 2)
Array(2, 3)
Map(1 -> "1", 2 -> "2")
// Three equivalent ways to write the same map call
val b = a map { x => x + 2 }
val c = a map { _ + 2 }
val d = a.map({ _ + 2 })

Generics and Tuples
val a: Seq[Int] = Seq(1, 2, 4)
val tup = (3, "a")
val tup2: (Int, String) = (3, "a")

Stratosphere Scala Front End

Skeleton of a Stratosphere Program
● Input: a text file/JDBC source/CSV, etc.
  – loaded in internal representation: the DataSet
● Transformations on the DataSet
  – map, reduce, join, etc.
● Output: program results in a DataSink
  – Text file, JDBC, CSV, etc.

The Almighty DataSet
● Operations are methods on DataSet[A]
● Working with DataSet[A] feels like working with Scala collections
● DataSet[A] is not an actual collection but represents a computation on a collection
● Stringing together operations creates a data flow graph that can be executed

An Important Difference
Immediately executed (Scala collection):
  val input: List[String] = ...
  val mapped = input.map { s => (s, 1) }
Executed when the data flow is executed (DataSet):
  val input: DataSet[String] = ...
  val mapped = input.map { s => (s, 1) }
  val result = mapped.write("file", ...)
  val plan = new Plan(result)
  execute(plan)
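
The deferred behaviour can be imitated in plain Java. The sketch below is not the Stratosphere DataSet: LazyDataSet, from, map, and execute are hypothetical names chosen for this illustration. Calling map only records what should happen; the user function first runs when execute is called.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical stand-in for a deferred data set; not the Stratosphere DataSet class. */
class LazyDataSet<A> {
    private final Supplier<List<A>> computation;

    private LazyDataSet(Supplier<List<A>> computation) {
        this.computation = computation;
    }

    static <A> LazyDataSet<A> from(List<A> source) {
        return new LazyDataSet<>(() -> source);
    }

    /** Records the transformation; nothing is computed here. */
    <B> LazyDataSet<B> map(Function<A, B> f) {
        return new LazyDataSet<>(() -> {
            List<B> out = new ArrayList<>();
            for (A a : computation.get()) {
                out.add(f.apply(a));
            }
            return out;
        });
    }

    /** Runs the whole recorded chain of transformations. */
    List<A> execute() {
        return computation.get();
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        LazyDataSet<Integer> mapped = LazyDataSet.from(Arrays.asList(1, 2, 3))
                .map(x -> { calls.incrementAndGet(); return x + 2; });
        System.out.println("calls before execute: " + calls.get());
        System.out.println(mapped.execute());
        System.out.println("calls after execute: " + calls.get());
    }
}
```

Running main reports 0 calls before execute, then prints [3, 4, 5], then reports 3 calls. With an ordinary list, the three calls would already have happened at the map line.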

Usable Data Types
● Primitive types
● Tuples
● Case classes
● Custom data types that implement the Value interface

Creating Data Sources
// Text file, one String per line
val input = TextFile("file://")
// CSV file parsed into tuples
val input: DataSet[(Int, String)] =
  DataSource("hdfs://", CsvInputFormat[(Int, String)]())
// Delimited file with a custom parse function
def parseInput(line: String): (Int, Int) = {…}
val input = DataSource("hdfs://",
  DelimitedInputFormat(parseInput))

Interlude: Anonymous Functions
var fun: ((Int, String)) => String = ...
fun = { t => t._2 }
fun = { _._2 }
fun = { case (i, w) => w }

Map
val input: DataSet[(Int, String)] = ...
val mapper = input
  .map { case (a, b) => (a + 2, b) }
val mapper2 = input
  .flatMap { _._2.split(" ") }
val filtered = input
  .filter { case (a, b) => a > 3 }

Reduce
val input: DataSet[(String, Int)] = ...
val reducer = input
  .groupBy { case (w, _) => w }
  .groupReduce { _.minBy {...} }
val reducer2 = input
  .groupBy { case (w, _) => w }
  .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }

Cross
val left: DataSet[(String, Int)] = ...
val right: DataSet[(String, Int)] = ...
val cross = left cross right
  .map { (l, r) => ... }
val cross2 = left cross right
  .flatMap { (l, r) => ... }

Join (Match)
val counts: DataSet[(String, Int)] = ...
val names: DataSet[(Int, String)] = ...
val join = counts
  .join(names)
  .where { case (_, c) => c }.isEqualTo { case (n, _) => n }
  .map { (l, r) => (l._1, r._2) }
val join2 = counts
  .join(names)
  .where { case (_, c) => c }.isEqualTo { case (n, _) => n }
  .flatMap { (l, r) => ... }

CoGroup
val counts: DataSet[(String, Int)] = ...
val names: DataSet[(Int, String)] = ...
val cogroup = counts
  .cogroup(names)
  .where { case (_, c) => c }.isEqualTo { case (n, _) => n }
  .map { (l, r) => (l.minBy {...}, r.minBy {...}) }
val cogroup2 = counts
  .cogroup(names)
  .where { case (_, c) => c }.isEqualTo { case (n, _) => n }
  .flatMap { (l, r) => ... }

Creating Data Sinks
val counts: DataSet[(String, Int)] = ...
val sink = counts.write("<>", CsvOutputFormat())
// Custom formatting: one String per record
def formatOutput(a: (String, Int)): String = {
  "Word " + a._1 + " count " + a._2
}
val sink2 = counts.write("<>",
  DelimitedOutputFormat(formatOutput))

Word Count example
val input = TextFile(textInput)
val words = input
  .flatMap { _.split(" ") map { (_, 1) } }
val counts = words
  .groupBy { case (word, _) => word }
  .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
val output = counts
  .write(wordsOutput, CsvOutputFormat())
val plan = new ScalaPlan(Seq(output))

Things not mentioned
● There is support for iterations (both in Java and Scala)
● Many more data source/sink formats
● Look at the examples in the Stratosphere source
● Don't be afraid to write on the mailing list and on GitHub:
  – http://stratosphere.eu/quickstart/scala.html
● Or come directly to us

End.

Chicago Flink Meetup: Flink's streaming architecture
 
Flink September 2015 Community Update
Flink September 2015 Community UpdateFlink September 2015 Community Update
Flink September 2015 Community Update
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer Checkpointing
 
August Flink Community Update
August Flink Community UpdateAugust Flink Community Update
August Flink Community Update
 
Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)Flink Cummunity Update July (Berlin Meetup)
Flink Cummunity Update July (Berlin Meetup)
 
Apache Flink First Half of 2015 Community Update
Apache Flink First Half of 2015 Community UpdateApache Flink First Half of 2015 Community Update
Apache Flink First Half of 2015 Community Update
 
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CAApache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
Apache Flink Deep-Dive @ Hadoop Summit 2015 in San Jose, CA
 
Apache Flink Hands On
Apache Flink Hands OnApache Flink Hands On
Apache Flink Hands On
 

Último

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 

Último (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Stratosphere Intro (Java and Scala Interface)

  • 1. Introduction to Stratosphere Aljoscha Krettek DIMA / TU Berlin
  • 2. What is this?
      ● Distributed data processing system
      ● DAG (directed acyclic graph) of sources, sinks, and operators: the “data flow”
      ● Handles distribution, fault tolerance, and network transfer
      Diagram: source → map: “split words” → reduce: “count words” → sink
  • 3. Why would I use this?
      Automatic parallelization / Because you are told to
      Diagram: the same data flow with three parallel instances of each node (source, map: “split words”, reduce: “count words”, sink)
  • 4. So how do I use this? (from Java)
      ● How is data represented in the system?
      ● How do I create data flows?
      ● Which types of operators are there?
      ● How do I write operators?
      ● How do I run the whole shebang?
  • 5. How do I move my data?
      ● Data is stored in fields in PactRecord
      ● Basic data types: PactString, PactInteger, PactDouble, PactFloat, PactBoolean, …
      ● New data types must implement the Value interface
  • 6. PactRecord
      PactRecord rec = ...
      PactInteger foo = rec.getField(0, PactInteger.class);
      int i = foo.getValue();
      PactInteger foo2 = new PactInteger(3);
      rec.setField(1, foo2);
  • 7. Creating Data Flows
      ● Create one or several sources
      ● Create operators:
          – Input is/are preceding operator(s)
          – Specify a class/object with the operator implementation
      ● Create one or several sinks:
          – Input is some operator
  • 8. WordCount Example Data Flow
      FileDataSource source = new FileDataSource(TextInputFormat.class, dataInput, "Input Lines");
      MapContract mapper = MapContract.builder(TokenizeLine.class)
          .input(source)
          .name("Tokenize Lines")
          .build();
      ReduceContract reducer = ReduceContract.builder(CountWords.class, PactString.class, 0)
          .input(mapper)
          .name("Count Words")
          .build();
      FileDataSink out = new FileDataSink(RecordOutputFormat.class, output, reducer, "Word Counts");
      RecordOutputFormat.configureRecordFormat(out)
          .recordDelimiter('\n')
          .fieldDelimiter(' ')
          .field(PactString.class, 0)
          .field(PactInteger.class, 1);
      Plan plan = new Plan(out, "WordCount Example");
  • 9. Operator Types
      ● We call them second-order functions (SOFs)
      ● The code inside the operator is the first-order function, or user-defined function (UDF)
      ● Currently five SOFs: map, reduce, match, cogroup, cross
      ● The SOF describes how PactRecords are handed to the UDF
  • 10. Map Operator
      ● User code receives one record at a time (per call to the user code function)
      ● Not really a functional map, since all operators can output an arbitrary number of records
  • 11. Map Operator Example
      public static class TokenizeLine extends MapStub {
          private final AsciiUtils.WhitespaceTokenizer tokenizer = new AsciiUtils.WhitespaceTokenizer();
          private final PactRecord outputRecord = new PactRecord();
          private final PactString word = new PactString();
          private final PactInteger one = new PactInteger(1);
          @Override
          public void map(PactRecord record, Collector<PactRecord> collector) {
              PactString line = record.getField(0, PactString.class);
              this.tokenizer.setStringToTokenize(line);
              while (tokenizer.next(word)) {
                  outputRecord.setField(0, word);
                  outputRecord.setField(1, one);
                  collector.collect(outputRecord);
              }
          }
      }
  • 12. Reduce Operator
      ● User code receives a group of records with the same key
      ● Must specify which fields of a record are the key
  • 13. Reduce Operator Example
      public static class CountWords extends ReduceStub {
          private final PactInteger cnt = new PactInteger();
          @Override
          public void reduce(Iterator<PactRecord> records, Collector<PactRecord> out) throws Exception {
              PactRecord element = null;
              int sum = 0;
              while (records.hasNext()) {
                  element = records.next();
                  PactInteger i = element.getField(1, PactInteger.class);
                  sum += i.getValue();
              }
              cnt.setValue(sum);
              element.setField(1, cnt);
              out.collect(element);
          }
      }
  • 14. Specifying the Key Fields
      ReduceContract reducer = ReduceContract.builder(Foo.class, PactString.class, 0)
          .input(mapper)
          .keyField(PactInteger.class, 1)
          .name("Count Words")
          .build();
  • 15. Cross Operator
      ● Two-input operator
      ● Cartesian product: every record from the left is combined with every record from the right
      ● One record from the left and one record from the right per user code call
      ● Implement CrossStub
  • 16. Match Operator
      ● Two-input operator with keys
      ● Join: a record from the left is combined with every record from the right that has the same key
      ● Implement MatchStub
  • 17. CoGroup Operator
      ● Two-input operator with keys
      ● Records from the left are combined with all records from the right that have the same key
      ● User code gets an iterator for the left records and one for the right records
      ● Implement CoGroupStub
  • 18. How to execute a data flow plan
      ● Either use LocalExecutor: LocalExecutor.execute(plan)
      ● Or implement PlanAssembler.getPlan(String... args) and run on a local cluster or a proper cluster
      ● See: http://stratosphere.eu/quickstart/ and http://stratosphere.eu/docs/gettingstarted.html
  • 20. And Now for Something Completely Different
      val input = TextFile(textInput)
      val words = input
        .flatMap { _.split(" ") map { (_, 1) } }
      val counts = words
        .groupBy { case (word, _) => word }
        .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
      val output = counts
        .write(wordsOutput, CsvOutputFormat())
      val plan = new ScalaPlan(Seq(output))
  • 22. Anatomy of a Scala Class
      package foo.bar
      import something.else
      class Job(arg1: Int) {
        def map(in: Int): String = {
          val i: Int = in + 2
          var a = "Hello"
          i.toString
        }
      }
  • 23. Singletons
      ● Similar to Java singletons and/or static methods
      object Job {
        def main(args: String*) {
          println("Hello World")
        }
      }
  • 24. Collections
      val a = Seq(1, 2, 4)
      List("Hallo", 2)
      Array(2, 3)
      Map(1 -> "1", 2 -> "2")
      val b = a map { x => x + 2 }
      val c = a map { _ + 2 }
      val c = a.map({ _ + 2 })
  • 25. Generics and Tuples
      val a: Seq[Int] = Seq(1, 2, 4)
      val tup = (3, "a")
      val tup2: (Int, String) = (3, "a")
  • 27. Skeleton of a Stratosphere Program
      ● Input: a text file, JDBC source, CSV, etc.
          – Loaded into the internal representation: the DataSet
      ● Transformations on the DataSet
          – map, reduce, join, etc.
      ● Output: program results in a DataSink
          – Text file, JDBC, CSV, etc.
  • 28. The Almighty DataSet
      ● Operations are methods on DataSet[A]
      ● Working with DataSet[A] feels like working with Scala collections
      ● DataSet[A] is not an actual collection but represents a computation on a collection
      ● Stringing together operations creates a data flow graph that can be executed
  • 29. An Important Difference
      Immediately executed:
        val input: List[String] = ...
        val mapped = input.map { s => (s, 1) }
      Executed when the data flow is executed:
        val input: DataSet[String] = ...
        val mapped = input.map { s => (s, 1) }
        val result = mapped.write("file", ...)
        val plan = new Plan(result)
        execute(plan)
  • 30. Usable Data Types
      ● Primitive types
      ● Tuples
      ● Case classes
      ● Custom data types that implement the Value interface
  • 31. Creating Data Sources
      val input = TextFile("file://")
      val input: DataSet[(Int, String)] =
        DataSource("hdfs://", CsvInputFormat[(Int, String)]())
      def parseInput(line: String): (Int, Int) = {…}
      val input = DataSource("hdfs://", DelimitedInputFormat(parseInput))
  • 32. Interlude: Anonymous Functions
      var fun: ((Int, String)) => String = ...
      fun = { t => t._2 }
      fun = { _._2 }
      fun = { case (i, w) => w }
  • 33. Map
      val input: DataSet[(Int, String)] = ...
      val mapper = input
        .map { case (a, b) => (a + 2, b) }
      val mapper2 = input
        .flatMap { _._2.split(" ") }
      val filtered = input
        .filter { case (a, b) => a > 3 }
  • 34. Reduce
      val input: DataSet[(String, Int)] = ...
      val reducer = input
        .groupBy { case (w, _) => w }
        .groupReduce { _.minBy {...} }
      val reducer2 = input
        .groupBy { case (w, _) => w }
        .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
  • 35. Cross
      val left: DataSet[(String, Int)] = ...
      val right: DataSet[(String, Int)] = ...
      val cross = left cross right
        .map { (l, r) => ... }
      val cross = left cross right
        .flatMap { (l, r) => ... }
  • 36. Join (Match)
      val counts: DataSet[(String, Int)] = ...
      val names: DataSet[(Int, String)] = ...
      val join = counts
        .join(names)
        .where { case (_, c) => c }.isEqualTo { case (n, _) => n }
        .map { (l, r) => (l._1, r._2) }
      val join = counts
        .join(names)
        .where { case (_, c) => c }.isEqualTo { case (n, _) => n }
        .flatMap { (l, r) => ... }
  • 37. CoGroup
      val counts: DataSet[(String, Int)] = ...
      val names: DataSet[(Int, String)] = ...
      val cogroup = counts
        .cogroup(names)
        .where { case (_, c) => c }.isEqualTo { case (n, _) => n }
        .map { (l, r) => (l.minBy {...}, r.minBy {...}) }
      val cogroup = counts
        .cogroup(names)
        .where { case (_, c) => c }.isEqualTo { case (n, _) => n }
        .flatMap { (l, r) => ... }
  • 38. Creating Data Sinks
      val counts: DataSet[(String, Int)] = ...
      val sink = counts.write("<>", CsvOutputFormat())
      def formatOutput(a: (String, Int)): String = {
        "Word " + a._1 + " count " + a._2
      }
      val sink = counts.write("<>", DelimitedOutputFormat(formatOutput))
  • 39. Word Count example
      val input = TextFile(textInput)
      val words = input
        .flatMap { _.split(" ") map { (_, 1) } }
      val counts = words
        .groupBy { case (word, _) => word }
        .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
      val output = counts
        .write(wordsOutput, CsvOutputFormat())
      val plan = new ScalaPlan(Seq(output))
  • 40. Things not mentioned
      ● There is support for iterations (both in Java and Scala)
      ● Many more data source/sink formats
      ● Look at the examples in the Stratosphere source
      ● Don't be afraid to write on the mailing list and on GitHub:
          – http://stratosphere.eu/quickstart/scala.html
      ● Or come directly to us
  • 41. End.
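
The map and reduce operators from slides 9 to 13 can be tried out without a Stratosphere installation. The following is a minimal sketch in plain Java, not the Stratosphere API: the class WordCountSketch, the record type Rec, and the driver method run are hypothetical names made up for illustration, and the driver runs sequentially in one JVM instead of distributing work. It only shows the contract between the second-order functions and the user code: the map function is called once per input record and may emit any number of records, and the reduce function is called once per key with the whole group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // Hypothetical stand-in for a two-field record (field 0: word, field 1: count).
    // This is NOT PactRecord; it only mirrors the layout used in the slides.
    static final class Rec {
        final String word;
        final int count;
        Rec(String word, int count) { this.word = word; this.count = count; }
    }

    // Map UDF: called once per input line; may emit any number of records.
    static void tokenizeLine(String line, List<Rec> out) {
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new Rec(token, 1));
            }
        }
    }

    // Reduce UDF: called once per key with all records that share that key.
    static Rec countWords(String key, List<Rec> group) {
        int sum = 0;
        for (Rec r : group) {
            sum += r.count;
        }
        return new Rec(key, sum);
    }

    // Sequential driver (an assumption of this sketch): map every line,
    // group the emitted records by field 0, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        List<Rec> mapped = new ArrayList<>();
        for (String line : lines) {
            tokenizeLine(line, mapped);
        }
        Map<String, List<Rec>> groups = new TreeMap<>();
        for (Rec r : mapped) {
            groups.computeIfAbsent(r.word, k -> new ArrayList<>()).add(r);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Rec>> e : groups.entrySet()) {
            Rec reduced = countWords(e.getKey(), e.getValue());
            counts.put(reduced.word, reduced.count);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("to be or not to be", "be quick")));
        // prints {be=3, not=1, or=1, quick=1, to=2}
    }
}
```

In the real system the grouping step is what the reduce second-order function provides, and the key is declared on the contract (slide 14) rather than hard-coded in a driver as it is here.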

Editor's notes

  1. Google: search results, spam filter. Amazon: recommendations. SoundCloud: recommendations. Spotify: recommendations. YouTube: recommendations, adverts. Netflix: recommendations, compare to Maxdome :D. Twitter: just everything … :D. Facebook: adverts, Graph Search, friend suggestions, filtering (for annoying friends). Instagram: they have lots of data, there's gotta be something … Bioinformatics: DNA, 1 TB per genome, 1000 genomes.