27. One Day…
Boss
I've heard about Google Cloud Dataflow!
It may unify Batch & Streaming Distributed Processing.
Wow, that sounds awesome.
I'd like to integrate it with our service.
Eh!? I have to investigate the details...
I'll leave it to you.
28. Two Missions
• Port the SDK to another language (Ruby, etc.)
• Implement a custom stream input (AMQP)
32. Open Source
• Apache License Version 2.0
• You can read it
• You can modify it
• You can run it
  • locally (PubsubIO is not supported)
  • on the Cloud Dataflow service (beta)
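A sketch of both run modes via pipeline options (the project ID is a placeholder; class names follow the Java SDK):

import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner;
import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;

DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
options.setRunner(DirectPipelineRunner.class);              // run locally
// options.setRunner(BlockingDataflowPipelineRunner.class); // run on the service (beta)
// options.setProject("my-gcp-project");                    // placeholder project ID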
39. Pipeline as Code
Pipeline p = Pipeline.create(options);
p.apply( TextIO.Read.named("Read").from(input) )
 .apply( new MyTransform() )
 .apply( TextIO.Write.named("Write").to(output) );

• Each apply() takes a PTransform and yields a PCollection (or other POutput).
• Pipeline.apply() / PCollection.apply() signature:
public <Output extends POutput>
Output apply(PTransform<? super PCollection<T>, Output> t)
40. PCollection
• Container of data in a Dataflow pipeline
• Bounded (fixed size) or
  Unbounded (variable size ≒ streaming)
• A handle to the real data (elements),
  cf. a file descriptor, a pipe, etc.
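A minimal sketch of creating a bounded PCollection in-process (element values are made up):

import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.values.PCollection;

// Bounded: a fixed set of in-memory elements (TextIO.Read is also bounded).
PCollection<String> words = p.apply(Create.of("apple", "banana", "cherry"));
// Unbounded: e.g. PubsubIO.Read produces a streaming PCollection.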
54. Example of DoFn
static class ExtractWordsFn extends DoFn<String, String> {
public void processElement(ProcessContext c) {
String[] words = c.element().split(“[^a-zA-Z']+");
for (String word : words) {
if (!word.isEmpty()) {
c.output(word);
}
}
}
}
static class FormatCountsFn extends DoFn<KV<String, Long>, String> {
public void processElement(ProcessContext c) {
c.output(c.element().getKey() + ": " + c.element().getValue());
}
}
from WordCount.java
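Hooking these DoFns into a pipeline with ParDo (a sketch; `input` is the text file path from the earlier pipeline slide):

import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

PCollection<String> lines = p.apply(TextIO.Read.named("Read").from(input));
PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));
PCollection<KV<String, Long>> counts = words.apply(Count.<String>perElement());
PCollection<String> formatted = counts.apply(ParDo.of(new FormatCountsFn()));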
55. Staging
• How does user-defined code get loaded into
  the Dataflow managed service?
• DoFn<I, O> implements Serializable
• The .jar files on $CLASSPATH are
  uploaded to a GCS `staging` bucket
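A sketch of configuring that staging bucket (the bucket name is a placeholder):

// Continuing from the options object shown earlier:
options.setStagingLocation("gs://my-bucket/staging");
// equivalently on the command line: --stagingLocation=gs://my-bucket/staging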
56. Two Missions
• Port the SDK to another language (Ruby, etc.)
  → The Dataflow service depends on the JVM runtime.
    (A Python SDK is planned for a future release.)
• Implement a custom stream input (AMQP)
58. PubsubIO impl. in SDK
• PubsubIO.Read.Bound<T> extends
  PTransform<PInput, PCollection<T>>
• Bound doesn't have any runtime implementation.
• runners.worker.ReaderFactory translates
  these objects into Source/Sink types and
  parameters, and transports them to the
  Dataflow service workers
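How PubsubIO.Read looks at the pipeline level (a sketch; the topic path is a placeholder):

import com.google.cloud.dataflow.sdk.io.PubsubIO;

// No local runtime impl.: on the service, ReaderFactory maps this transform
// to the real streaming source.
PCollection<String> messages =
    p.apply(PubsubIO.Read.named("ReadPubsub").topic("/topics/my-project/my-topic"));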
59. Two Missions
• Port the SDK to another language (Ruby, etc.)
• Implement a custom stream input (AMQP)
  → Custom stream input development is not supported by Dataflow yet.
    (Is there no future plan?)
60. OK.
But stay tuned for the activities in Dataflow.
I've found that there's no way to accomplish
these missions right now...
Roger.
66. Windowing
Example: Group by Key, then Combine (count per key):
  {k1: 1, k1: 2, k1: 3, k2: 2}
    → Group by Key → {k1: [1, 2, 3], k2: [2]}
    → Combine → {k1: 3, k2: 1}
• These transforms require all elements of the input.
• But in streaming mode, inputs are unbounded.
67. Windowing
Group elements into windows by timestamp:
• Fixed Time Windows
• Sliding Time Windows
• Session Windows
• Single Global Window
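A sketch of applying one of these strategies (the window size is arbitrary):

import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

// Assign each element to a 1-minute fixed window based on its timestamp.
PCollection<String> windowed =
    messages.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));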
68. Trigger
• Streaming data could be arrived with
some delay
• Dataflow should wait for while
after end of window in wall time.
• Time-Based Triggers
• Data-Driven Triggers
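A sketch combining both trigger kinds (builder names follow the later 1.x Java SDK and may differ in the beta SDK; the lateness bound is arbitrary):

import com.google.cloud.dataflow.sdk.transforms.windowing.AfterPane;
import com.google.cloud.dataflow.sdk.transforms.windowing.AfterWatermark;

windowed = messages.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        // time-based: fire when the watermark passes the end of the window
        .triggering(AfterWatermark.pastEndOfWindow()
            // data-driven: re-fire for each batch of late elements
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(5))
        .accumulatingFiredPanes());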