Presenter - Siyuan Hua, Apache Apex PMC Member & DataTorrent Engineer
Apache Apex provides a DAG construction API that gives developers full control over the logical plan. Some use cases don't require all of that flexibility, at least not initially. A large part of the audience may also be more familiar with APIs that have a functional programming flavor, such as the Java 8 Stream interfaces and the Apache Flink and Spark Streaming APIs. To help Apex beginners get a simple first application running with a familiar API, we now provide the Stream API on top of the existing DAG API. The Stream API is designed to be easy to use, yet flexible to extend and compatible with the native Apex API. This means developers can construct their applications in a way similar to Flink or Spark, but still have the power to fine-tune the DAG at will. Per our roadmap, the Stream API will closely follow the Apache Beam (aka Google Dataflow) model. In the future, you should be able to either easily run Beam applications on the Apex engine or express an existing application in a more declarative style.
3. Apex Overview
• YARN is the resource manager
• HDFS is used for storing any persistent state
4. Current Development Model
Directed Acyclic Graph (DAG)
[Diagram: a DAG of operators connected by output streams, with tuples flowing along each stream from one operator to the next]
● A Stream is a sequence of data tuples
● A typical Operator takes one or more input streams, performs computations, and emits one or more output streams
● Each operator is your custom business logic in Java, or a built-in operator from our open source library
● An operator can have many instances that run in parallel, and each instance is single-threaded
● A Directed Acyclic Graph (DAG) is made up of operators and streams
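The operator/tuple relationship above can be illustrated with a plain-Java sketch. This is a hypothetical stand-in with no Apex dependencies: the `WordSplitter` class and its `process` method are invented names for illustration, not part of the Apex API (a real Apex operator would extend `BaseOperator` and emit through output ports).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an operator: consumes one input tuple
// (a line of text) and emits one output tuple per word. In Apex the
// emission would go through an output port; here we simply return
// the emitted tuples to keep the sketch self-contained.
public class WordSplitter {
    public List<String> process(String line) {
        List<String> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(word.toLowerCase());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        WordSplitter splitter = new WordSplitter();
        System.out.println(splitter.process("Apache Apex streams tuples"));
    }
}
```

Because each operator instance is single-threaded, logic like this needs no synchronization; parallelism comes from running many instances, each on a partition of the stream.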
5. Current Application Example
@ApplicationAnnotation(name = "WordCountDemo")
public class Application implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    WordCountInputOperator input = dag.addOperator("wordinput", new WordCountInputOperator());
    UniqueCounter<String> wordCount = dag.addOperator("count", new UniqueCounter<String>());
    ConsoleOutputOperator consoleOperator = dag.addOperator("console", new ConsoleOutputOperator());
    dag.addStream("wordinput-count", input.outputPort, wordCount.data);
    dag.addStream("count-console", wordCount.count, consoleOperator.input);
  }
}
6. Why we need a high-level API
o Easier for beginners to start with
o Fluent API
o Smaller learning curve
o Transform methods in one place vs. an operator library
o The operator API provides flexibility, while the high-level API provides ease of use
8. Stream API (Application Example)
@ApplicationAnnotation(name = "WordCountStreamingApiDemo")
public class ApplicationWithStreamAPI implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration configuration)
  {
    String localFolder = "./src/test/resources/data";
    ApexStream<String> stream = StreamFactory
        .fromFolder(localFolder)
        .flatMap(new Split())
        .window(new WindowOption.GlobalWindow(),
            new TriggerOption().withEarlyFiringsAtEvery(Duration.millis(1000)).accumulatingFiredPanes())
        .countByKey(new ConvertToKeyVal())
        .print();
    stream.populateDag(dag);
  }
}
9. How it works
o ApexStream<T> represents a bounded/unbounded data set of type T
o ApexStream<T> also holds a graph data structure of all operators and the connections between them, from the input up to the current point
o Each transform method attaches one or more operators to the current graph data structure and returns a new ApexStream object
o The graph data structure is not translated to an Apex DAG until the populateDag or run method is called
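The "each transform returns a new stream holding the graph built so far" idea can be sketched in plain Java. The `SketchStream` class below is a hypothetical illustration, not the actual ApexStream implementation: each transform copies the accumulated node list, appends one entry, and returns a fresh object, so translation to a real DAG can be deferred.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of the ApexStream pattern: each transform
// returns a NEW stream object whose internal graph is the old graph
// plus one more node. Nothing is executed until the application asks
// for the graph (populateDag in the real API).
public final class SketchStream {
    private final List<String> nodes;

    private SketchStream(List<String> nodes) {
        this.nodes = nodes;
    }

    public static SketchStream fromSource(String source) {
        return new SketchStream(Collections.singletonList(source));
    }

    public SketchStream transform(String name) {
        List<String> next = new ArrayList<>(nodes);
        next.add(name);
        return new SketchStream(next); // the original stream is untouched
    }

    public List<String> graph() {
        return Collections.unmodifiableList(nodes);
    }

    public static void main(String[] args) {
        SketchStream s = SketchStream.fromSource("folderInput")
            .transform("flatMap:Split")
            .transform("countByKey");
        System.out.println(s.graph());
    }
}
```

Keeping each stream object immutable is what makes the fluent method chain safe: holding a reference to an intermediate stream and branching from it later cannot corrupt the graph of a stream built earlier.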
11. Current Status
○ Method chaining for readability
○ Stateless transforms (map, flatMap, filter)
○ Some inputs and outputs are available (file, console, Kafka)
○ Some interoperability (addOperator, getDag, set properties/attributes, etc.)
○ Local mode and distributed mode
○ Anonymous function class support
○ Extensible
12. Current Status (Cont'd)
○ WindowedStream is in a pull request, along with the operators that support it
○ A few windowed transforms (count, reduce, etc.)
○ 3 window types (fixed window, sliding window, session window)
○ 3 trigger types (early trigger, late trigger, at watermark)
○ 3 accumulation modes (accumulating, discarding, accumulating with retraction)
○ In-memory window state (checkpointed)
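The three accumulation modes can be sketched with a plain-Java per-window counter. This is a hypothetical illustration of the Beam-style semantics, not the Apex implementation; `AccumulationSketch` and `fire()` are invented names. Accumulating keeps state across trigger firings, discarding resets it after each firing, and accumulating-with-retraction additionally emits a correction for the previously fired pane.

```java
// Hypothetical sketch of the three accumulation modes for a per-window
// count. Each call to fire() represents one trigger firing.
public class AccumulationSketch {
    public enum Mode { ACCUMULATING, DISCARDING, ACCUMULATING_AND_RETRACTING }

    private final Mode mode;
    private long count = 0;      // accumulated state
    private long lastFired = 0;  // last emitted value (used for retraction)

    public AccumulationSketch(Mode mode) { this.mode = mode; }

    public void add(long n) { count += n; }

    // Returns the pane emitted at this firing. In the retracting mode the
    // retraction of the previous pane is emitted as a negative element
    // ahead of the new value.
    public long[] fire() {
        switch (mode) {
            case DISCARDING:
                long[] pane = new long[] { count };
                count = 0; // state is reset after each firing
                return pane;
            case ACCUMULATING_AND_RETRACTING:
                long[] panes = new long[] { -lastFired, count };
                lastFired = count;
                return panes;
            default: // ACCUMULATING: state carries over across firings
                return new long[] { count };
        }
    }

    public static void main(String[] args) {
        AccumulationSketch acc = new AccumulationSketch(Mode.ACCUMULATING);
        acc.add(3);
        System.out.println(acc.fire()[0]); // 3
        acc.add(2);
        System.out.println(acc.fire()[0]); // 5: earlier tuples still counted
    }
}
```

With a discarding counter the second firing above would emit 2 instead of 5, which is why the choice of accumulation mode matters whenever early firings feed a downstream aggregation.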
13. Roadmap
○ Persistent window state for windowed operators (large state)
○ Fully follow the Beam model (window, trigger, watermark)
○ Rich selection of windowed transforms (group, combine, join)
○ Support custom window assigners
○ Support custom triggers
○ More inputs/outputs (HBase, Cassandra, JDBC, etc.)
○ Better schema support
○ More language support (Java 8, Scala, etc.)
○ What the community asks for