Application Timeline Server
- Past, Present & Future
NAGANARASIMHA G R & VARUN SAXENA
Agenda
 Who we are?
 Why do we need a History Server?
 Application History Server
 Timeline Server V1
 Timeline Server V2
Who we are?
Naganarasimha G R
 Senior Technical Lead @ Huawei
 Active Apache Hadoop Contributor.
 Currently working in Hadoop Platform Dev team
 Earlier worked in Reporting Domain
Varun Saxena
 Technical Lead @ Huawei
 Active Apache Hadoop Contributor.
 Currently working in Hadoop Platform Dev team
 Earlier worked in Telecom Data Network Domain
Both of us are currently participating in ATS V2 development
Agenda
 Who we are?
 Why do we need a History Server?
 Application History Server
 Timeline Server V1
 Timeline Server V2
Need for a new History Server
 The Job History Server is only for MR apps; YARN supports
many types of applications.
 YARN-level events and metrics are not captured.
 Storage is HDFS only, which is not good for ad hoc analysis.
 JHS is only for historical, i.e. completed, jobs.
 On Application Master failure, data for the currently
running application is lost.
 Storage is very MR-specific
- Counters
- Mappers and Reducers
Agenda
 Who we are?
 Why do we need a History Server?
 Application History Server
 Timeline Server V1 & V1.5
 Timeline Server V2
Application History Server
 Separate process
 Resource Manager directly writes to storage (HDFS)
 Aggregated logs
 Separate UI, CLI and REST endpoint
 Data stored :
- Application-level data (queue, user etc…)
- List of ApplicationAttempts
- Information about each ApplicationAttempt
- List of containers for each ApplicationAttempt
- Generic information about each container
 CLI and REST query interfaces were supported
Drawbacks :
 Storing application-specific custom data is not
supported
 If the RM crashes, HDFS files are not readable
 Hard limit on the number of files
 Upgrades / updates are difficult
 Supports only completed jobs
Agenda
 Who we are?
 Why do we need a History Server?
 Application History Server
 Timeline Server V1
 Timeline Server V2
Application Timeline Service
Motivation :
 YARN takes care of it
- Relieves applications from running their own monitoring service
 Application diversity
- Framework-specific metadata/metrics
ATS V1 : Data Model
 Timeline Domain
- Namespace for the Timeline Server, which supports
isolation of users and applications
- Timeline Server security is defined at this level
 Timeline Entity
- An abstract concept of anything
- Defines the relationship between entities
- Can be an application, an application attempt, a
container or any user-defined object
- Contains primary filters, which are used to index the
entities in the Timeline Store
- Uniquely identified by an EntityId and EntityType
 Timeline Event
- An event that is related to a specific Timeline Entity of an application
- Users are free to define what an event means, such as starting an application or getting a container allocated
ATS V1 : Architecture
 Separate Process
 Pluggable store – defaults to LevelDB
 REST Interfaces
ATS V1 : LevelDB
 Key-value store
 Lightweight
 Open source, with a compatible license
 Used to store
- TimelineStore : domains, entities, events and metrics
- TimelineStateStore : security tokens
 Supports data retention
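As a rough illustration of the key-value storage the v1 store builds on, below is a minimal LevelDB sketch using the org.fusesource.leveldbjni bindings that Hadoop ships. The key layout shown is hypothetical, not the actual TimelineStore schema.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;
import static org.fusesource.leveldbjni.JniDBFactory.factory;

public class LevelDbSketch {
  public static void main(String[] args) throws Exception {
    Options options = new Options();
    options.createIfMissing(true);
    // Open (or create) a local LevelDB database, as the TimelineStore does.
    DB db = factory.open(new File("/tmp/timeline-leveldb"), options);
    try {
      // Hypothetical key layout: entity type + id -> serialized entity bytes.
      byte[] key = "entity!MY_APP!app_run_42".getBytes(StandardCharsets.UTF_8);
      byte[] value = "{\"startTime\": 1453200000000}".getBytes(StandardCharsets.UTF_8);
      db.put(key, value);            // write
      byte[] readBack = db.get(key); // point lookup by key
      System.out.println(new String(readBack, StandardCharsets.UTF_8));
    } finally {
      db.close();
    }
  }
}
```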
ATS V1 : Client & API
 Timeline client
- Wrapper over the REST POST method
- POJO objects
 TimelineEntity
 TimelineEvent
- Used in the Client/AM/Container
 REST APIs, with JSON as the media type
- Get timeline entities
http://localhost:8188/ws/v1/timeline/{entityType}
- Get timeline entity
http://localhost:8188/ws/v1/timeline/{entityType}/{entityId}
- Get timeline events
http://localhost:8188/ws/v1/timeline/{entityType}/events
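A minimal sketch of publishing custom data through the v1 TimelineClient from a client, AM or container, assuming a Hadoop 2.x classpath; the entity type, id, filter and event names below are invented for illustration.

```java
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEvent;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimelinePublishSketch {
  public static void main(String[] args) throws Exception {
    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      // Build the POJO entity; id/type/filter values here are hypothetical.
      TimelineEntity entity = new TimelineEntity();
      entity.setEntityType("MY_FRAMEWORK_JOB");
      entity.setEntityId("job_run_42");
      entity.setStartTime(System.currentTimeMillis());
      entity.addPrimaryFilter("user", "joe"); // indexed in the Timeline Store

      TimelineEvent event = new TimelineEvent();
      event.setEventType("JOB_STARTED");
      event.setTimestamp(System.currentTimeMillis());
      entity.addEvent(event);

      // Internally a REST POST to the Timeline Server.
      client.putEntities(entity);
    } finally {
      client.stop();
    }
  }
}
```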
ATS V1 : Security
HTTP SPNEGO
Kerberos authentication
Delegation tokens
- For performance
- AM/Containers have no Kerberos credentials
Access control
- Admin/owner
- Timeline entity level
ATS V1 : Use cases
Agenda
 Who we are?
 Why do we need a History Server?
 Application History Server
 Timeline Server V1
 Timeline Server V2
Why ATSv2?
 Scalability
• Single global instance of the writer/reader
• ATSv1 uses local-disk-based LevelDB storage
 Usability
• Handle flows as first-class concepts and model aggregation.
• Elevate configurations and metrics to first-class members.
• Better support for queries.
 Reliability
• Data is stored only on a local disk.
• Single daemon, so a single point of failure.
 Existing external tooling: hRaven, Finch, Dr. Elephant, etc. As new Hadoop versions are rolled out,
maintenance of these tools becomes an issue.
Key Design Points
 Distributed writers (per app and per node)
• Per-app writer/collector launched as part of the RM.
• Per-node collector/writer launched as an auxiliary service in the NM.
• In future, standalone writers will be supported.
 Scalable and reliable backend storage (HBase)
 A new object model API with flows built into it.
 Separate reader instance(s). Currently there is a single reader instance.
 Aggregation, i.e. rolling up metric values to the parent.
• Online aggregation for apps and flow runs.
• Offline aggregation for users, flows and queues.
ATSv2 Components
[Architecture diagram: the Application Master and Node Managers send app and container events/metrics to per-node Timeline Writers; the Resource Manager has its own Timeline Writer; all writers persist to the storage backend, and a Timeline Reader pool serves user queries against it.]
Distributed Writers / Collectors
[Sequence diagram spanning the Resource Manager (RMApp), Node Manager 1 (aux service, NM collector service, app collectors), other Node Managers, and HBase as the storage backend:]
1. User submits an app.
2. RMApp launches a companion app collector on new app submission.
3. The App Master is launched on Node 1.
4. The aux service is notified to bind the new collector.
5. The new collector is bound.
6. The new collector registers with the NM collector service.
7. The new collector info (IP + port) is reported to the RM.
The RM tracks the list of app collectors (app_1_collector_info, app_2_collector_info, …) and advertises it to Node Managers in heartbeats. The AM reports app events to the app collector it was notified of in the heartbeat by the RM, and NMs report container events to the same collector; collectors write to HBase.
[Flow concept diagram: a flow is a script or program (e.g. a Hive query or Pig script) run by a user (Joe); each execution (e.g. the run at 7:30 pm and the run at 9:00 pm) is a separate flow run consisting of a set of YARN applications (App 1 … App 4).]
Data Model
 Entity: ID + Type, Configurations, Metadata (Info), Parent-Child Relationships, Metrics, Events
 Entities that are first-class citizens:
- Cluster: Type, Cluster Attributes
- Flow: Type, User, Flow Runs, Flow Attributes
- Flow Run: Type, User, Running Apps, Flow Run Attributes
- Application: Type, User, Flow + Run, Queue, Attempts
- Attempt: Type, Application, Queue, Containers
- Container: Type, Attempt, Attributes
 Aggregation entities:
- User: Username (ID), Aggregated Metrics
- Queue: Queue (ID), Sub-queues, Aggregated Metrics
 Event: ID, Metadata, Timestamp
 Metric: ID, Metadata, Single Value or Time Series (with timestamps)
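For illustration, a rough sketch of populating this object model with the ATSv2 entity records; the package and method names are best-effort assumptions based on how the API landed in org.apache.hadoop.yarn.api.records.timelineservice, and the ids are made up.

```java
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEntity;
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineEvent;
import org.apache.hadoop.yarn.api.records.timelineservice.TimelineMetric;

public class EntityModelSketch {
  public static TimelineEntity buildContainerEntity() {
    long now = System.currentTimeMillis();

    TimelineEntity entity = new TimelineEntity();
    entity.setType("YARN_CONTAINER");                       // entity type
    entity.setId("container_1453200000000_0001_01_000002"); // hypothetical id
    entity.addInfo("node", "host1:45454");                  // free-form metadata

    // A single-value metric; metrics can also be time series.
    TimelineMetric memory = new TimelineMetric(TimelineMetric.Type.SINGLE_VALUE);
    memory.setId("MEMORY");
    memory.addValue(now, 2048L);
    entity.addMetric(memory);

    // An event attached to the entity.
    TimelineEvent launched = new TimelineEvent();
    launched.setId("CONTAINER_LAUNCHED");
    launched.setTimestamp(now);
    entity.addEvent(launched);
    return entity;
  }
}
```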
HBase vs Phoenix evaluation
Based on the evaluation of both HBase and Phoenix, it was decided that HBase will be used on the write path. With
HBase, much higher throughput, lower I/O wait and far lower CPU load were observed.

| Test description | Map tasks | Entities per mapper | Total entities written | Phoenix transaction rate (per mapper) ops/sec | HBase transaction rate (per mapper) ops/sec | Phoenix write time (job counter TIMELINE_SERVICE_WRITE_TIME) | HBase write time (job counter TIMELINE_SERVICE_WRITE_TIME) |
| Synthetic Data | 170 | 1k | 170k | 112.83 | 2285.13 | 1506704 | 74394 |
| Synthetic Data | 170 | 10k | 1.7M | 53.029 | 636.41 | 32057957 | 2671241 |
| Synthetic Data | 1 | 50k | 50k | 196.67 | 19770.66 | 254225 | 2529 |
| 9 History Files | 33 | - | 85k | 319.19 (write errors) | 962.32 | 265460 | 88049 |
| 555 History Files | 33 | - | 810k | 206.25 (write errors) | 927.62 | 4102364 | 874151 |
Aggregation
 Aggregation basically means rolling up metrics from child entities to parent entities. Different operations such as
SUM, AVG, etc. can be performed while rolling them up, and the results are stored in the parent (a minimal roll-up sketch follows this slide).
 App-level aggregation will be done by the app collector as and when it receives metrics.
 Online, or real-time, aggregation for apps would be a simple SUM of the metrics of child entities. Additional metrics will also be stored
which indicate AVG, MAX, AREA (time integral) etc. More on this in the next slide.
 App-to-flow-run aggregation will be done via an HBase coprocessor on the read path; cell tags are used to achieve this.
 For users/flows, aggregation happens periodically (offline, not in real time). For this, Phoenix tables will be used. To achieve offline
aggregation, an MR job is run which reads the application table and writes to the user and flow aggregation tables.
[Roll-up example: Container A1 (CPUCoresMillis = 400) and Container A2 (CPUCoresMillis = 300) roll up to App A (CPUCoresMillis = 700); Container B1 (CPUCoresMillis = 200) rolls up to App B (CPUCoresMillis = 200); App A and App B roll up to the Flow (CPUCoresMillis = 900).]
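A minimal, self-contained roll-up sketch in plain Java (no Hadoop APIs; the metric name and values are taken from the example above):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RollupSketch {
  /** Sums each metric across child entities into a parent metric map. */
  static Map<String, Long> rollUp(List<Map<String, Long>> childMetrics) {
    Map<String, Long> parent = new HashMap<>();
    for (Map<String, Long> child : childMetrics) {
      for (Map.Entry<String, Long> m : child.entrySet()) {
        parent.merge(m.getKey(), m.getValue(), Long::sum); // SUM aggregation
      }
    }
    return parent;
  }

  public static void main(String[] args) {
    // Containers A1 and A2 from the example above roll up to App A.
    Map<String, Long> a1 = Map.of("CPUCoresMillis", 400L);
    Map<String, Long> a2 = Map.of("CPUCoresMillis", 300L);
    System.out.println(rollUp(List.of(a1, a2))); // {CPUCoresMillis=700}
  }
}
```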
Accumulation
 While aggregating, we also accumulate metric values along the time dimension. This is especially useful for gauges. Consider the
table below, which displays the CPU utilization (in cores) of the containers belonging to an app. Here t1…t16 represent times
10 ms apart. The table shows how per-container values are summed into the application row, and how the area and the running
average are accumulated step by step. The trapezoidal rule is used to calculate the area under the curve between consecutive
samples: Area = ((value_t1 + value_t2) / 2) * Δt. A code sketch of this accumulation follows the slide.
[Slide animation: a table of Containers 1-5, each contributing per-slot utilizations between 0 and 1 core, is filled in step by step. Final state:]

| | t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 | t10 | t11 | t12 | t13 | t14 | t15 | t16 |
| Application | 1 | 2.5 | 4 | 4 | 4 | 3.5 | 3 | 3 | 3 | 3 | 2 | 1.5 | 1 | 1 | 1 | 0 |
| Area (CoreMillis) | - | 15 | 42 | 82 | 122 | 160 | 192 | 222 | 252 | 282 | 307 | 325 | 335 | 345 | 355 | 360 |
| Average | - | 1.5 | 2.1 | 2.7 | 3.1 | 3.2 | 3.2 | 3.1 | 3.1 | 3.1 | 3.1 | 3 | 2.8 | 2.6 | 2.5 | 2.4 |

[Chart: CPU cores for the app and the running average over t1…t16.]
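A minimal sketch of the trapezoidal accumulation in plain Java. It integrates the already-summed application row, whereas the slide's area/average rows were computed from the per-container series, so the printed numbers may differ slightly from the table above.

```java
public class AccumulationSketch {
  public static void main(String[] args) {
    // Per-slot app-level CPU usage (cores), summed from containers, t1..t16.
    double[] appUsage = {1, 2.5, 4, 4, 4, 3.5, 3, 3, 3, 3, 2, 1.5, 1, 1, 1, 0};
    double dtMillis = 10; // samples are 10 ms apart

    double area = 0; // accumulated core-millis (time integral of the gauge)
    for (int i = 1; i < appUsage.length; i++) {
      // Trapezoidal rule between consecutive samples.
      area += (appUsage[i - 1] + appUsage[i]) / 2 * dtMillis;
      double elapsed = i * dtMillis;
      double average = area / elapsed; // running average utilization
      System.out.printf("t%-3d area=%6.1f avg=%4.2f%n", i + 1, area, average);
    }
  }
}
```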
HBase Table Schema
 Entity Table – Used for storing Timeline Entity objects. Contains configs, metrics and other info (events,
parent-child relationships, etc.).
Row Key : clusterId!user!flowId!flowRunId!appId!entityType!entityId
 Application Table – Used for storing YARN application entities. Contains configs, metrics and other info.
Same as the entity table, but added for better performance.
Row Key : clusterId!user!flowId!flowRunId!appId
 App To Flow Table – Used for getting flowId and flowRunId information based on cluster and app. This is
helpful for querying the entity table on the basis of just the cluster and app information.
Row Key : clusterId!appId
 Flow Run Table – Stores flow run information aggregated across apps.
Row Key : clusterId!user!flowId!flowRunId
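A toy sketch of how these '!'-separated row keys compose (plain Java). The real implementation also encodes and escapes the individual components, and may invert numeric fields for ordering (as the flow activity table does with its timestamp); both are omitted here.

```java
public class RowKeySketch {
  /** Joins row key components with the '!' separator used in the schema. */
  static String entityRowKey(String cluster, String user, String flowId,
      long flowRunId, String appId, String entityType, String entityId) {
    return String.join("!", cluster, user, flowId,
        Long.toString(flowRunId), appId, entityType, entityId);
  }

  public static void main(String[] args) {
    // Hypothetical values; keys sort by cluster, user, flow, run, app, entity.
    System.out.println(entityRowKey("cluster1", "joe", "daily_etl",
        1453200000000L, "application_1334432321_0002",
        "YARN_CONTAINER", "container_1334432321_0002_01_000001"));
  }
}
```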
HBase Table Schema (Contd.)
 Flow Activity Table – Used for storing daily activity records for a flow, for quick lookup of flow-level info.
Row Key : clusterId!inverted top-of-the-day timestamp!user!flowId
Phoenix Tables for Offline Aggregation :
 Flow Aggregation Table – Stores aggregated metrics at the flow level. Metrics are aggregated from the
application table.
Primary Key : user, cluster, flowId
 User Aggregation Table – Stores aggregated metrics at the user level. Metrics are aggregated from the
application table.
Primary Key : user, cluster
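Since the offline aggregation tables are Phoenix tables, they could be read over Phoenix's standard JDBC driver. A hedged sketch: the table and column names below are invented for illustration, and only the jdbc:phoenix URL scheme is Phoenix's real convention.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixReadSketch {
  public static void main(String[] args) throws Exception {
    // Phoenix JDBC URL points at the HBase cluster's ZooKeeper quorum.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
         Statement stmt = conn.createStatement();
         // Hypothetical flow aggregation table and metric column.
         ResultSet rs = stmt.executeQuery(
             "SELECT user, cluster, flow_id, cpu_core_millis "
                 + "FROM flow_aggregation WHERE user = 'joe'")) {
      while (rs.next()) {
        System.out.println(rs.getString("flow_id") + " -> "
            + rs.getLong("cpu_core_millis"));
      }
    }
  }
}
```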
Querying ATSv2
 ATSv2 offers a major enhancement over ATSv1 in terms of the queries supported. Efficient queries around
flows, flow runs, apps, etc. are possible. Moreover, ATSv2 can support complex queries to filter
results.
 ATSv1 offered only primary and secondary filters for filtering entities. ATSv2 offers the ability to
filter entities based on config values, metric values, entity parent-child relationships and events. It
also supports returning only certain configurations and metrics in the result.
 ATSv1 queries supported only "equal to" matches for primary and secondary filters. For metrics this
does not quite make sense: a user filtering on the basis of metric values would more likely
use relational operators such as >=, <=, != etc. All these relational operators are supported in ATSv2 for
metrics. In addition, different predicates in filters can be combined using "AND" and "OR"
operators.
All in all, this gives ATSv2 a very powerful query interface.
Querying ATSv2 (Contd.)
 ATSv2, like ATSv1, supports a REST API with JSON as the media type. Some examples are given below.
 Get Entities – Returns a set of TimelineEntity objects based on cluster, app and entity type. The query also
supports multiple optional query parameters, such as a limit on the number of entities to be returned, the
configurations and metrics to be returned, filters on the basis of created and modified time windows, config
filters, metric filters and event filters.
http://localhost:8188/entities/{clusterId}/{appId}/{entityType}
Example :
http://localhost:8188/entities/cluster1/application_1334432321_0002/YARN_CONTAINER?limit=5&metrics=memory,cpu
 Get Entity – Returns a TimelineEntity object based on cluster, app, entity type and entityId.
http://localhost:8188/entity/{clusterId}/{appId}/{entityType}/{entityId}
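A minimal sketch of issuing the Get Entities call above from Java using plain HttpURLConnection; the host, port and path come from the slide's example and assume a locally running timeline reader.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ReaderQuerySketch {
  public static void main(String[] args) throws Exception {
    // Get Entities call from the slide: up to 5 YARN_CONTAINER entities with
    // only the memory and cpu metrics included in the response.
    URL url = new URL("http://localhost:8188/entities/cluster1/"
        + "application_1334432321_0002/YARN_CONTAINER?limit=5&metrics=memory,cpu");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Accept", "application/json"); // JSON media type
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
        conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // raw JSON array of TimelineEntity objects
      }
    } finally {
      conn.disconnect();
    }
  }
}
```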
Possible use cases
 Cluster utilization and inputs for capacity planning. The cluster can learn from a flow's/application's historical
data.
 Mapper / reducer optimizations.
 Application performance over time.
 Identifying job bottlenecks.
 Ad-hoc troubleshooting and identification of problems in the cluster.
 Complex queries possible at flow, user and queue level; for instance, the % of applications which
ran more than 10000 containers.
 The full DAG from flow to flow run to application to container level can be seen.
Team Members
 Sangjin Lee, Vrushali C and Joep Rottinghuis (Twitter)
 Junping Du, Li Lu and Vinod Kumar Vavillapalli (Hortonworks)
 Zhijie Shen (formerly Hortonworks)
 Varun Saxena and Naganarasimha G R (Huawei)
 Robert Kanter and Karthik Kambatla (Cloudera)
 Inputs from LinkedIn, Yahoo! and Altiscale.
Feature Status
 Distributed per-app and per-node writers (as an aux service)
 RM companion writer
 NM, RM and AM writing events and metrics to ATS
 File-based readers and writers for testing
 HBase and Phoenix writer implementations
 Performance evaluation of these writers
 HBase-based reader implementation
 Support for flows
 App and flow-run level online aggregation
 Offline aggregation
 Query interface
Feature Status (Contd.)
 Standalone timeline writer
 Distributed timeline readers and a reader pool
 ATSv2 UI
 Security
 Support for migration
Thank You!