Presenter:
Priyanka Gugale, Committer for Apache Apex and Software Engineer at DataTorrent.
In this session we will cover an introduction to YARN, the YARN architecture, and the YARN application lifecycle. We will also learn how Apache Apex runs as one of the YARN applications in Hadoop.
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
1. Introduction to YARN and Apex as YARN Application
Priyanka Gugale (priyag@apache.org)
September 30th 2016
2. Apache Apex - Stream Processing
Easily Operable - Exposes an easy API for developing Operators (part of an application) and Applications
Highly Scalable - Scales statically as well as dynamically
Highly Performant - Can reach single digit millisecond end-to-end latency
Fault Tolerant - Automatically recovers from failures without manual intervention
Stateful - Guarantees that no state will be lost
Apex Malhar library
4. An Apex Application is a DAG (Directed Acyclic Graph)
A DAG is composed of vertices (Operators) and edges (Streams).
A Stream is a sequence of data tuples which connects operators at end-points called Ports
An Operator takes one or more input streams, performs computations & emits one or more output streams
● Each operator is either the user's business logic or a built-in operator from our open source library
● An operator may have multiple instances that run in parallel
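
The operator/stream/tuple relationship above can be illustrated with a small sketch. This is a toy model in plain Python, not the Apex API: the `Operator` and `Dag` classes, the `add_operator`/`add_stream`/`run` methods, and the example operator names are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


class Operator:
    """Hypothetical operator: applies a function to each tuple it receives."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn  # business logic; returns a list of output tuples


class Dag:
    """Toy DAG: operators are vertices, streams are directed edges."""

    def __init__(self):
        self.operators = {}
        self.streams = defaultdict(list)  # upstream name -> downstream names

    def add_operator(self, name, fn):
        self.operators[name] = Operator(name, fn)

    def add_stream(self, source, sink):
        self.streams[source].append(sink)

    def run(self, source, tuples):
        """Push tuples through the graph from `source`; collect sink outputs."""
        results = defaultdict(list)
        pending = [(source, t) for t in tuples]
        while pending:
            name, tup = pending.pop(0)
            emitted = self.operators[name].fn(tup)
            downstream = self.streams.get(name, [])
            if not downstream:
                # No outgoing stream: this operator is a sink.
                results[name].extend(emitted)
            for sink in downstream:
                pending.extend((sink, out) for out in emitted)
        return dict(results)


dag = Dag()
dag.add_operator("parse", lambda line: line.split())
dag.add_operator("upper", lambda word: [word.upper()])
dag.add_stream("parse", "upper")
output = dag.run("parse", ["hello apex", "yarn"])
```

Here each string is a tuple, `parse` and `upper` are operators, and the single `add_stream` call is the edge between them. A real Apex application would express the same shape through the platform's own operator and DAG classes.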
5. DAG Components
• Tuple
● Atomic data that flows over a stream
• Operator
● Basic compute unit per tuple
• Stream
● Connector abstraction between operators
● Tuples flow over this
[Figure: Operator 1 and Operator 2 connected by a Stream carrying tuple 1, tuple 2, and tuple 3]
13. Application Components of Apex - StrAMClient
• Part of apex client interface
• Invoked by “launch” command of apex
• Tasks:
● Copy the required application package files into HDFS
● Validate Logical Plan
● Serialize Logical plan to HDFS
● Launch Application Master i.e. StrAM
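
The slides do not say which checks "Validate Logical Plan" performs. Since an Apex application is defined as a directed acyclic graph, one plausible check is that the streams form no cycle. The sketch below shows that single check using Kahn's algorithm; the function name `is_acyclic` and the operator names are assumptions for illustration, not Apex code.

```python
from collections import deque


def is_acyclic(operators, streams):
    """Return True if the (source, sink) stream list forms no cycle."""
    indegree = {op: 0 for op in operators}
    downstream = {op: [] for op in operators}
    for source, sink in streams:
        downstream[source].append(sink)
        indegree[sink] += 1

    # Start from operators with no incoming stream.
    ready = deque(op for op, deg in indegree.items() if deg == 0)
    visited = 0
    while ready:
        op = ready.popleft()
        visited += 1
        for sink in downstream[op]:
            indegree[sink] -= 1
            if indegree[sink] == 0:
                ready.append(sink)
    # Every operator is visited only when no cycle blocks the ordering.
    return visited == len(indegree)


valid_plan = is_acyclic(["input", "parse", "output"],
                        [("input", "parse"), ("parse", "output")])
cyclic_plan = is_acyclic(["a", "b"], [("a", "b"), ("b", "a")])
```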
Apache Apex Meetup
14. Application Components of Apex - StrAM
• Streaming Application Master
• Started by StrAMClient on a YarnContainer
• Tasks:
● Convert logical plan to physical plan
● Serialize operators to HDFS
● Request resources from the ResourceManager
● Start StrAMChild in YarnContainer(s)
● Monitor StrAMChild using ContainerManager protocol
● Generate Application statistics
● Host results on WebService (dtManage)
● Checkpointing/Committing Application States
● Fault Tolerance
● Support Security
● Shutdown Application
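
As a rough picture of the "convert logical plan to physical plan" task, the sketch below expands each logical operator into one physical instance per partition and gives each instance its own container. This is a simplification under stated assumptions: the function `to_physical_plan`, the dictionary layout, and the one-instance-per-container placement are hypothetical, and the real platform's placement logic is not described in these slides.

```python
def to_physical_plan(logical_plan):
    """Expand {operator: partition_count} into one entry per operator instance."""
    physical = []
    for operator, partitions in logical_plan.items():
        for index in range(partitions):
            physical.append({
                # Assumption: each instance gets a dedicated container.
                "container": f"container-{len(physical) + 1}",
                "operator": operator,
                "partition": index,
            })
    return physical


plan = to_physical_plan({"reader": 1, "parser": 2})
```

In this toy run the logical operator `parser` becomes two physical instances, which matches the earlier point that an operator may have multiple instances running in parallel.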
15. Application Components of Apex - StrAMChild
• Deployed on YarnContainer
• Started by NodeManager as instructed by StrAM
• Instance of StreamingContainer
• Contains Operators (compute-related)
• Contains BufferServer (stream-related)
• Tasks:
● Regularly send heartbeat to StrAM
● Execute commands from StrAM
● Shut down or kill itself if instructed
● Manage lifecycle of an Operator
● Network communication using BufferServer
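
The heartbeat relationship between StrAMChild and StrAM can be sketched as a simple timeout check on the master side. This is a minimal illustration, not the actual protocol: the `HeartbeatMonitor` class, its method names, the container IDs, and the 30-second timeout are all assumptions.

```python
class HeartbeatMonitor:
    """Toy monitor: tracks the last heartbeat time reported by each container."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}

    def record(self, container_id, now):
        """Called when a container reports in."""
        self.last_seen[container_id] = now

    def expired(self, now):
        """Return containers whose last heartbeat is older than the timeout."""
        return sorted(
            cid for cid, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30)
monitor.record("container-1", now=0)
monitor.record("container-2", now=0)
monitor.record("container-1", now=25)  # container-2 stays silent
late = monitor.expired(now=40)
```

A container that shows up in `expired` is a candidate for recovery, which is where the master's fault tolerance task would take over.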
17. Summary – Apex platform
• Enables YARN to be used for Streaming Applications
• Takes care of YARN specific work
• User can focus on business logic defined in Operators