eBay uses Spark heavily as one of its most significant data engines. In the data warehouse domain, millions of batch queries run every day against 6,000+ key DW tables that hold over 22PB of data (compressed), a volume that keeps growing every year. In the machine learning domain, Spark plays an increasingly significant role. We presented our migration from an MPP database to Apache Spark at last year's Europe Summit. Even so, from the viewpoint of the entire infrastructure, managing workload and efficiency for all Spark jobs across our data centers remains a big challenge. Our team owns the whole big data platform infrastructure and the management tools on top of it, helping our customers -- not only DW engineers and data scientists, but also AI engineers -- work from the same page. In this session, we will show how all of them benefit from a self-service workload management portal/system. First, we will share the basic architecture of this system to illustrate how it collects metrics from multiple data centers and how it detects abnormal workload in real time. We developed a component called Profiler that enhances Spark core to support customized metric collection. Next, we will walk through real user stories at eBay to show how the self-service system reduces effort on both the customer side and the infra-team side; this is the highlight on Spark job analysis and diagnosis. Finally, we will introduce some upcoming advanced features that form an automatic optimizing workflow rather than just alerting.
Speaker: Lantao Jin
Managing Apache Spark Workload and Automatic Optimizing
1. Managing Apache Spark Workload and Automatic Optimizing
Lantao Jin,
Software Engineer, Data Platform Engineering (eBay)
2. Who We Are
● Data Platform Engineering team at eBay
● We build an automated data platform and self-serve site with minimal touch points
● Focus on Spark optimization and self-serve platform building
3. What We Do
● Build a one-stop experience platform for Spark/Hadoop
● Manage the entire Spark/Hadoop workload
● Open APIs and self-serve tools for users
● Performance tuning for Spark engine and jobs
4. Why Manage Spark Workload
● Root-cause analysis for complex job failures
● Extreme performance tuning and optimization
● Maximizing resource utilization
● Compute showback and capacity planning in a global view
6. Challenges
● Over 20 production clusters
● Over 500PB of data
● Over 5PB (compressed) of incremental data per day
● Over 80,000 jobs per day
● Metadata of jobs/data is unclear
● Many kinds of jobs: Pig, Hive, Cascading, Spark, MapReduce
● Jobs are not developed to a common standard
● Over 20 teams to communicate with, and hundreds of batch users
● Job onboarding is out of control
8. Gaps
● Development Experience
○ No distributed logging service for failure diagnostics
○ Job/task-level metrics are hard for developers to understand
○ No visibility into application health
○ Tedious communication needed to resolve any workload issue
● Resource Efficiency
○ Huge manual effort to analyze high cluster/queue load
○ Blind to “bad” jobs
9. Objectives
For Developers
❏ Application-specific diagnostics and performance recommendations
❏ Highlight applications that need attention
❏ Identify bottlenecks and resource usage
For Operators
❏ Reduce performance incidents in production
❏ Easy communication back to developers for detailed performance insights
❏ Shorten time to production
For Managers
❏ Resource usage insight and guidance
❏ Increase cluster ROI
12. Profile listener
● Collect/dump extra metrics for compatibility purposes:
○ Real memory usage
○ RPC count
○ Input/output
* With this version of the Spark profiler, we also modified Spark Core to expose memory-related metrics.
[Architecture diagram: the Spark Driver's DAGScheduler posts events to the ListenerBus, where CatalogEventListener, ExecutionPlanListener, and ExecutorMetricsListener consume them; together with the HDFS REST API, these feed the JPM profiler.]
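As a rough sketch of how such a listener plugs in (the class name matches the diagram; the aggregation logic and registration package are assumptions, and the real-memory metric additionally requires the Spark Core modification noted above):

    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Sketch of a custom metrics listener: the driver's ListenerBus
    // delivers task events, and we aggregate per-task I/O counters.
    class ExecutorMetricsListener extends SparkListener {
      private val bytesRead    = new AtomicLong(0L)
      private val bytesWritten = new AtomicLong(0L)

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null) {                    // metrics can be absent for failed tasks
          bytesRead.addAndGet(m.inputMetrics.bytesRead)
          bytesWritten.addAndGet(m.outputMetrics.bytesWritten)
        }
      }
    }
    // Registered without touching job code, e.g.:
    //   spark.extraListeners=com.ebay.jpm.ExecutorMetricsListener  (package assumed)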
23. Success Cases
❖ Reduce High RPC Jobs
❖ Reduce Account Usage
❖ Repeatedly Failed Jobs
❖ Optimize Job Paths with Data Lineage
❖ History-Based Optimization
❖ Running-Job Issue Detection
24. Reduce High RPC Jobs
● Background: jobs that issue an extremely high number of RPC calls.
● Solution: JPM alerts on high-RPC jobs with advice to:
○ add a reducer for map-only jobs (hint; see the sketch below)
○ change map-side joins to reduce-side joins (pipeline optimization)
● Sample: the RPC calls for one job dropped from 43M to 46K.
[Charts: Cluster RPC Queue Time, Job Resource Usage Trend. Component: Metrics Engine]
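A hypothetical illustration of the "add a reducer for map-only jobs" hint (paths, filter, and partition count are invented): a map-only job writes one output file per input split, which can flood the NameNode with RPCs, so inserting a shuffle bounds the output file count.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("rpc-hint-demo").getOrCreate()
    import spark.implicits._

    spark.read.parquet("/path/to/wide_input")  // thousands of input splits
      .filter($"site_id" === 0)                // map-only transformation
      .repartition(200)                        // the added "reducer": at most 200 output files
      .write.parquet("/path/to/output")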
25. Reduce Account Usage
*HCU (Hadoop Compute Unit): 1 HCU equals 1 GB of memory used for 1 second, or 0.5 GB used for 2 seconds (illustrated below).
● Background: Spark jobs may request much more memory than they actually need.
● Solution: JPM highlights resource-wasting jobs with advice to:
○ apply the advisory memory configuration
○ combine SQL jobs that share the same table scan
● Sample: usage for the account b_seo_eng decreased from 500MB to 30MB, saving around 1.5% of the cluster.
[Components: Metrics Engine, Resource Analyzer, Catalog Analyzer]
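The HCU accounting in the footnote reduces to memory multiplied by time; a minimal sketch (the function name is mine):

    // HCU per the definition above: GB of memory held times seconds held,
    // so 1 GB for 1 s and 0.5 GB for 2 s both cost 1 HCU.
    def hcu(memoryGB: Double, seconds: Double): Double = memoryGB * seconds

    assert(hcu(1.0, 1.0) == hcu(0.5, 2.0))
    // Example: a 4 GB executor running for 10 minutes costs 4 * 600 = 2400 HCU.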
26. Repeatedly Failed Jobs
● Background: repeatedly failed jobs usually hide many optimization opportunities.
● Solution: on the JPM Spotlight page, these repeatedly failed jobs are grouped by
○ failure exception | user | diagnosis (see the sketch below)
○ JPM limits the resources of high-failure-rate jobs, stops 0%-success jobs once they exceed a threshold, and alerts the users (configurable).
● Sample: the stopped jobs save around 1.4% of cluster usage per week.
[Components: Metrics Engine, Resource Analyzer, Log Diagnoser]
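A sketch of the Spotlight-style grouping, assuming each failure record carries (exception, user, diagnosis); the field names and the repeat threshold are illustrative, not JPM's actual schema:

    case class JobFailure(user: String, exceptionClass: String, diagnosis: String)

    // Group failures by exception | user | diagnosis and keep only the
    // combinations that failed repeatedly (threshold is configurable).
    def spotlight(failures: Seq[JobFailure], threshold: Int = 3): Map[(String, String, String), Int] =
      failures
        .groupBy(f => (f.exceptionClass, f.user, f.diagnosis))
        .map { case (key, group) => key -> group.size }
        .filter { case (_, count) => count >= threshold }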
27. Optimize Job Paths with Data Lineage
● Background: over 80K apps run per day in our YARN clusters. Some of them are not developed to a common standard, and metadata is often unclear.
● Solution: JPM works out the data lineage by analyzing jobs and audit logs (a parsing sketch follows below), extracting the Hive metastore, and combining OIV (HDFS Offline Image Viewer) output. The following actions build on that lineage:
○ SQL combination
○ Hotspot detection and optimization
○ Retiring useless data/jobs
[Components: Catalog Analyzer, Data Lineage; inputs: audit log, OIV]
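For the audit-log input, a rough sketch of turning HDFS audit-log lines into lineage edges; the regex follows the standard hdfs-audit.log layout, and job attribution is simplified away:

    // A standard audit line looks like:
    //   ... allowed=true  ugi=b_seo_eng (auth:SIMPLE)  ip=/10.1.2.3
    //   cmd=open  src=/sys/edw/dw_lstg_item/orc  dst=null  perm=null
    val AuditLine = """.*ugi=(\S+).*cmd=(\w+)\s+src=(\S+).*""".r

    // (user, operation, path): "open" marks a read edge, "create" a write edge.
    def lineageEdge(line: String): Option[(String, String, String)] = line match {
      case AuditLine(user, cmd, path) if cmd == "open" || cmd == "create" =>
        Some((user, cmd, path))
      case _ => None
    }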
28.
● Sample 1: SQL combination / hotspot detection (illustrated below)
○ The SEO team had many batch jobs scanning the same big table without an intermediate table; the only difference in their outputs was the grouping condition.
● Sample 2: Retiring useless data/jobs
○ Many jobs have no downstream jobs, and their data has not been accessed in over 6 months.
Table/Folder                                   Savings
/sys/edw/dw_lstg_item/orc
/sys/edw/dw_lstg_item/orc_partitioned          Apollo (1.3%)
/sys/edw/dw_lstg_item_cold/orc
/sys/edw/dw_lstg_item_cold/orc_partitioned     Ares (0.4%)
/sys/edw/dw_checkout_trans/orc                 Ares (0.15%)
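A hypothetical illustration of the SQL combination from Sample 1, assuming a SparkSession named spark: two jobs grouping the same big table by different columns can share one scan via GROUPING SETS. The column names are invented; the table name comes from the paths above.

    // Instead of two full scans:
    //   SELECT site_id,       COUNT(*) FROM dw_lstg_item GROUP BY site_id
    //   SELECT leaf_categ_id, COUNT(*) FROM dw_lstg_item GROUP BY leaf_categ_id
    // one scan feeds both aggregations:
    val combined = spark.sql("""
      SELECT site_id, leaf_categ_id, COUNT(*) AS cnt
      FROM dw_lstg_item
      GROUP BY site_id, leaf_categ_id GROUPING SETS ((site_id), (leaf_categ_id))
    """)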
29. History-Based Optimization
● Background: this is an old topic but always useful. What we care about here are the workload and environment differences between runs of the same job.
● Solution: besides showing trends, JPM can:
○ analyze the entire workload across queue levels and the cluster environment
○ tell us the impact of queue and environment size
○ tell us configuration changes between runs (see the sketch below)
○ advise on the job scheduling strategy (WIP)
[Components: HBO, Configuration Diff, Resource Analyzer, Metrics Engine]
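A minimal sketch of the Configuration Diff idea, assuming the configs of two runs have already been fetched (e.g. from event logs) as maps; the data source and calling context are assumptions:

    // Returns every key whose value changed, was added, or was removed
    // between the previous run and the current one.
    def configDiff(prev: Map[String, String], curr: Map[String, String])
        : Map[String, (Option[String], Option[String])] =
      (prev.keySet ++ curr.keySet)
        .filter(k => prev.get(k) != curr.get(k))
        .map(k => k -> (prev.get(k), curr.get(k)))
        .toMap

    val yesterday = Map("spark.executor.memory" -> "8g")
    val today     = Map("spark.executor.memory" -> "4g")
    // configDiff(yesterday, today) == Map("spark.executor.memory" -> (Some("8g"), Some("4g")))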
30.
● Sample 1: Job slowness due to resource crisis
○ “My job is running 30 minutes slower than yesterday, what happened?”
31.
● Sample 2: Job failed due to an unexpected configuration change
○ "My job failed today, but I did nothing. Any changes from the platform side?"
32. Running-Job Issue Detection
● Background: slowness in critical jobs should be detected while they run, not after they miss their SLA. Hanging jobs should be distinguished from (long-)running jobs. Job/data development teams need aggregated job/cluster metrics ASAP.
● Solution: JPM lets users query a suspected slow running job on demand and get a report on the fly. JPM can currently identify five kinds of cases:
○ Queue resource overload
○ Slowdown by preemption
○ Shuffle data skew (heuristic sketch below)
○ Disk-failure-related issues
○ Known Spark/Hadoop bug detection
[Components: Resource Analyzer, Metrics Engine, Running Job Checker, Stack Analyzer]
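For the shuffle-skew case, one simple heuristic compares the largest task's shuffle read with the median; the 5x factor is an illustrative assumption, not JPM's actual threshold:

    // bytesPerTask: shuffle-read size per task, e.g. collected by a listener.
    def isShuffleSkewed(bytesPerTask: Seq[Long], factor: Double = 5.0): Boolean = {
      if (bytesPerTask.isEmpty) false
      else {
        val sorted = bytesPerTask.sorted
        val median = sorted(sorted.length / 2).max(1L)  // guard against all-zero reads
        sorted.last > factor * median
      }
    }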
34.
● Sample 2: known Spark/Hadoop bug auto-detection (WIP)
○ A job hung due to the known bug "[HDFS-10223] peerFromSocketAndKey performs SASL exchange before setting connection timeouts".
○ JPM snapshots the thread dump of the executor where a slow task is running and analyzes it automatically.
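A sketch of the matching step itself: grep the snapshotted thread dump for known stack signatures. Only the HDFS-10223 signature comes from the slide; the data structure is an assumption.

    case class KnownBug(ticket: String, stackSignature: String)

    // Signatures map a distinctive stack frame to the upstream ticket.
    val knownBugs = Seq(
      KnownBug("HDFS-10223", "peerFromSocketAndKey")
    )

    def matchKnownBugs(threadDump: String): Seq[KnownBug] =
      knownBugs.filter(bug => threadDump.contains(bug.stackSignature))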
35.
Some of the following issues may cause a hung task/job before Spark 2.4:
https://issues.apache.org/jira/browse/SPARK-22172
https://issues.apache.org/jira/browse/SPARK-22074
https://issues.apache.org/jira/browse/SPARK-18971
https://issues.apache.org/jira/browse/SPARK-21928
https://issues.apache.org/jira/browse/SPARK-22083
https://issues.apache.org/jira/browse/SPARK-14958
https://issues.apache.org/jira/browse/SPARK-20079
https://issues.apache.org/jira/browse/SPARK-13931
https://issues.apache.org/jira/browse/SPARK-19617
https://issues.apache.org/jira/browse/SPARK-23365
https://issues.apache.org/jira/browse/SPARK-21834
https://issues.apache.org/jira/browse/SPARK-19631
https://issues.apache.org/jira/browse/SPARK-21656
36. JPM frontend and UI
[Frontend layer diagram: a RESTful API layer backs the Portal UI (reads jobs, metrics, analysis, and suggestions), a Health Monitor (reads ES/Storm/metrics status), and a Configuration Manager and deploy tool (reads/writes the conf for each cluster).]
46. JPM vs. a Similar Product (open-source Dr. Elephant v2.0.6)
Scope: Dr. Elephant only looks at isolated applications; JPM has all user-related info plus cluster resource status.
Diagnostics: Dr. Elephant is based on metrics; JPM also aggregates failed-job logs to diagnose.
Scalability: Dr. Elephant uses a thread pool, which is hard to scale horizontally; JPM uses distributed streaming.
Maintenance: Dr. Elephant runs one instance per cluster; JPM is designed to cover multiple clusters within one instance.
Volume: Dr. Elephant's MySQL backend cannot store all task-level entities; JPM's Elasticsearch handles far larger volumes.
Veracity: Dr. Elephant samples to avoid OOM, sacrificing accuracy of results; JPM handles every task precisely.
Availability: Dr. Elephant has a single point of failure; JPM has none.
Variety: Dr. Elephant analyzes only metrics; JPM adds environment and cluster status.
Realtime: Dr. Elephant is weak here, since it depends on the Spark History Server (SHS); JPM is real-time.
Relationship: N/A for Dr. Elephant; JPM uses the data pipeline to detect issues.
Histories: N/A for Dr. Elephant; JPM does history-based analysis.
Runtime: N/A for Dr. Elephant; JPM analyzes running jobs.