Monitor Apache Spark 3
on Kubernetes using
Metrics and Plugins
Luca Canali
Data Engineer, CERN
About Luca
• Data Engineer at CERN
• Data analytics and Spark service, database services
• 20+ years with databases and data engineering
• Passionate about performance engineering
• Repos, blogs, presentations
• @LucaCanaliDB
https://github.com/lucacanali
http://cern.ch/canali
Agenda
• Apache Spark monitoring: ecosystem and motivations
• Spark metrics system
• Spark 3 plugins
• Metrics for Spark on K8S and cloud storage
• How you can run a Spark performance dashboard
Performance Troubleshooting Goals
• The key to good performance:
• You run good execution plans
• There are no serialization points
• Without these all bets are off!
• Attribution: I first heard this from Andrew Holdsworth (context: Oracle DB performance discussion ~2007)
• Spark users care about performance at scale
• Investigate execution plans and bottlenecks in the workload
• Useful tools are the Spark Web UI and metrics instrumentation!
Apache Spark Monitoring Ecosystem
• Web UI
• Details on jobs, stages, tasks, SQL, streaming, etc.
• Default URL: http://driver:4040
• https://spark.apache.org/docs/latest/web-ui.html
• Spark REST API + Spark Listener (Developer API)
• Exposes task metrics and executor metrics
• https://spark.apache.org/docs/latest/monitoring.html
• Spark Metrics System
• Implemented using the Dropwizard metrics library
Spark Metrics System
• Many metrics instrument Spark workloads:
• https://spark.apache.org/docs/latest/monitoring.html#metrics
• Metrics are emitted from the driver, executors and other Spark components
into sinks (several sinks available)
• Example of metrics: number of active tasks, jobs/stages completed and failed,
executor CPU used, executor run time, garbage collection time, shuffle metrics,
I/O metrics, metrics with memory usage details, etc.
Explore Spark Metrics
• Web UI Servlet Sink
• Metrics from the driver + executor (executor metrics available only if Spark is in local mode)
• By default: metrics are exposed using the metrics servlet on the WebUI
http://driver_host:4040/metrics/json
• Prometheus sink for the driver and for Spark standalone; it exposes metrics on the Web UI too
• *.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
• *.sink.prometheusServlet.path=/metrics/prometheus
• JMX Sink
• Use the JMX sink and explore metrics with jconsole
• Configuration: *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
Spark Metrics System and Monitoring Pipeline
How to Use Spark Metrics – Sink to InfluxDB
• Edit $SPARK_HOME/conf/metrics.properties
• Alternative: use the config parameters spark.metrics.conf.*
• Use this method if running on K8S with Spark 2.x or 3.0 (see SPARK-30985)
cat $SPARK_HOME/conf/metrics.properties
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=<graphiteEndPoint_influxDB_hostName>
*.sink.graphite.port=<graphite_listening_port>
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=lucatest
*.source.jvm.class=org.apache.spark.metrics.source.JvmSource

$SPARK_HOME/bin/spark-shell \
  --conf "spark.metrics.conf.*.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink" \
  --conf "spark.metrics.conf.*.sink.graphite.host"="<graphiteEndPoint_influxDB_hostName>" \
  ..etc..
Metrics Visualization with Grafana
• Metrics visualized as time series
• Metrics values visualized vs. time and detailed per executor
(Grafana panel examples: Number of Active Tasks, Executor JVM CPU Usage)
Metrics for Spark on Kubernetes
• Spark on Kubernetes has been GA since Spark 3.1.1
• Running Spark on cloud resources is now widespread
• Need for improved instrumentation in Spark
• Spark 3 metrics system can be extended with plugins
• Plugin metrics to monitor K8S pods’ resource usage (CPU, memory, network, etc.)
• Plugin instrumentation for cloud filesystems: S3A, GS, WASBS, OCI, CERN’s EOS, etc.
Spark Plugins and
Custom Extensions to the Metrics System
• Plugin interface introduced in Spark 3
• Plugins are user-provided code run at the start of the executors (and of the driver)
• Plugins make it possible to extend Spark metrics with custom code and instrumentation
A First Look at the Spark Plugins API
• Code snippets for demonstration
• A Spark plugin implementing a metric that reports the constant value 42 (a minimal sketch follows)
• Spark 3.x Plugin API: metricRegistry allows registering user-provided sources with the Spark metrics system
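The original slide shows this as a code screenshot. As a minimal sketch, assuming the hypothetical class name ConstantMetricPlugin and the Spark 3.x org.apache.spark.api.plugin API, such a plugin can look like this:

import java.util.{Map => JMap}
import com.codahale.metrics.Gauge
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

// Minimal sketch (hypothetical class name): the executor component registers
// a Dropwizard gauge named "constant42" that always reports the value 42
class ConstantMetricPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = null  // no driver-side component in this sketch

  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
    override def init(ctx: PluginContext, extraConf: JMap[String, String]): Unit = {
      // ctx.metricRegistry is the hook into the Spark metrics system:
      // sources registered here flow to the configured sinks
      ctx.metricRegistry.register("constant42", new Gauge[Int] {
        override def getValue: Int = 42
      })
    }
  }
}

Enable it with --conf spark.plugins=<your.package>.ConstantMetricPlugin; the gauge is then reported alongside the built-in metrics.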
‘Kick the Tires’ of Spark Plugins
▪ From https://github.com/cerndb/SparkPlugins
▪ RunOSCommandPlugin - runs an OS command as the executor starts
▪ DemoMetricsPlugin - shows how to integrate with Spark Metrics
bin/spark-shell --master k8s://https://<K8S URL> \
  --packages ch.cern.sparkmeasure:spark-plugins_2.12:0.1 \
  --conf spark.plugins=ch.cern.RunOSCommandPlugin,ch.cern.DemoMetricsPlugin
Plugins for Spark on Kubernetes
• Measure metrics related to pods’ resources usage
• Integrated with rest of Spark metrics
• Plugin code implemented using cgroup instrumentation
• Example: measure CPU from /sys/fs/cgroup/cpuacct/cpuacct.usage (a sketch of the technique follows the metrics list below)
bin/spark-shell --master k8s://https://<K8S URL> \
  --packages ch.cern.sparkmeasure:spark-plugins_2.12:0.1 \
  --conf spark.kubernetes.container.image=<registry>/spark:v311 \
  --conf spark.plugins=ch.cern.CgroupMetrics \
  ...
CgroupMetrics Plugin
▪ Metrics (gauges) in the ch.cern.CgroupMetrics plugin:
▪ CPUTimeNanosec: CPU time used by the processes in the cgroup
▪ this includes CPU used by Python processes (PySpark UDF)
▪ MemoryRss: number of bytes of anonymous and swap cache memory.
▪ MemorySwap: number of bytes of swap usage.
▪ MemoryCache: number of bytes of page cache memory.
▪ NetworkBytesIn: network traffic inbound.
▪ NetworkBytesOut: network traffic outbound.
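The actual plugin code is in cerndb/SparkPlugins; as an illustration of the underlying technique only, here is a minimal sketch reading a cgroup v1 counter (paths as on the slide; cgroup v2 uses different files):

import java.nio.file.{Files, Paths}

// Sketch: read a cumulative counter from the cgroup (v1) filesystem
object CgroupCounters {
  private def readLong(path: String): Long =
    new String(Files.readAllBytes(Paths.get(path))).trim.toLong

  // CPU time (nanoseconds) of all processes in the cgroup,
  // including Python workers running PySpark UDFs
  def cpuTimeNanos: Long = readLong("/sys/fs/cgroup/cpuacct/cpuacct.usage")
}

A plugin wraps such a value in a Dropwizard Gauge and registers it on ctx.metricRegistry, as in the earlier sketch.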
Plugin to Measure Cloud Filesystems
▪ Example of how to measure S3 throughput metrics
▪ Note: Apache Spark instruments only HDFS and the local filesystem
▪ The plugin uses the Hadoop client API for Hadoop-compatible filesystems (see the sketch below)
▪ Metrics: bytesRead, bytesWritten, readOps, writeOps
--conf spark.plugins=ch.cern.CloudFSMetrics
--conf spark.cernSparkPlugin.cloudFsName=<name of the filesystem>
(example: "s3a", "gs", "wasbs", "oci", "root", etc.)
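For illustration, a hedged sketch of the Hadoop client call behind these metrics (the plugin's real code is in cerndb/SparkPlugins); "s3a" stands in for the scheme set via spark.cernSparkPlugin.cloudFsName:

import scala.collection.JavaConverters._
import org.apache.hadoop.fs.FileSystem

// Per-scheme filesystem statistics collected by the Hadoop client
val s3aStats = FileSystem.getAllStatistics().asScala.find(_.getScheme == "s3a")
s3aStats.foreach { s =>
  println(s"bytesRead=${s.getBytesRead} bytesWritten=${s.getBytesWritten} " +
    s"readOps=${s.getReadOps} writeOps=${s.getWriteOps}")
}

The plugin registers these counters as Spark metric gauges rather than printing them.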
Tooling for a Spark Performance Dashboard
• Code and examples to get started:
• https://github.com/cerndb/spark-dashboard
• It simplifies the configuration of InfluxDB as a sink for Spark metrics
• Grafana dashboards with pre-built panels, graphs, and queries
• Option 1: Dockerfile
• Use this for testing locally
• From dockerhub: lucacanali/spark-dashboard:v01
• Option 2: Helm Chart
• Use for testing on Kubernetes
Tooling for Spark Plugins
• Code and examples to get started:
• https://github.com/cerndb/SparkPlugins
• --packages ch.cern.sparkmeasure:spark-plugins_2.12:0.1
• Plugins for OS metrics (Spark on K8S)
• --conf spark.plugins=ch.cern.CgroupMetrics
• Plugins for measuring cloud storage
• Hadoop-compatible filesystems: s3a, gs, wasbs, oci, root, etc.
• --conf spark.plugins=ch.cern.CloudFSMetrics
• --conf spark.cernSparkPlugin.cloudFsName=<filesystem name>
How to Deploy the Dashboard - Example
docker run --network=host -d lucacanali/spark-dashboard:v01
INFLUXDB_ENDPOINT=`hostname`

bin/spark-shell --master k8s://https://<K8S URL> \
  --packages ch.cern.sparkmeasure:spark-plugins_2.12:0.1 \
  --conf spark.plugins=ch.cern.CgroupMetrics \
  --conf "spark.metrics.conf.*.sink.graphite.class"="org.apache.spark.metrics.sink.GraphiteSink" \
  --conf "spark.metrics.conf.*.sink.graphite.host"=$INFLUXDB_ENDPOINT \
  --conf "spark.metrics.conf.*.sink.graphite.port"=2003 \
  --conf "spark.metrics.conf.*.sink.graphite.period"=10 \
  --conf "spark.metrics.conf.*.sink.graphite.unit"=seconds \
  --conf "spark.metrics.conf.*.sink.graphite.prefix"="luca" \
  --conf spark.metrics.appStatusSource.enabled=true
...
Spark Performance Dashboard
• Visualize Spark metrics
• Real-time + historical data
• Summaries and time series of key metrics
• Data for root-cause analysis
Dashboard Annotations
• Annotations on the workload graphs show:
• Start/end time for each job id, stage id, and SQL id
• Annotations are collected using the sparkMeasure InfluxDBSink listener, configured as follows:
INFLUXDB_HTTP_ENDPOINT="http://`hostname`:8086"

bin/spark-shell \
  --packages ch.cern.sparkmeasure:spark-measure_2.12:0.17 \
  --conf spark.sparkmeasure.influxdbURL=$INFLUXDB_HTTP_ENDPOINT \
  --conf spark.extraListeners=ch.cern.sparkmeasure.InfluxDBSink
Demo – Spark Performance Dashboard
(5 min)
Use Plugins to Instrument Custom Libraries
• Augment Spark metrics with metrics from custom code
• Example: an experimental way to measure I/O time with metrics (a sketch of the pattern follows below)
• Measure read time, seek time and CPU time during read operations for HDFS, S3A, etc.
▪ Custom S3A jar with time instrumentation (for Hadoop 3.2.0, use with Spark 3.0 and 3.1):
https://github.com/LucaCanali/hadoop/tree/s3aAndHDFSTimeInstrumentation
▪ Metrics: S3AReadTimeMuSec, S3ASeekTimeMuSec, S3AGetObjectMetadataMuSec
▪ Spark metrics with custom Spark plugin:
▪ --packages ch.cern.sparkmeasure:spark-plugins_2.12:0.1
▪ --conf spark.plugins=ch.cern.experimental.S3ATimeInstrumentation
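A hedged sketch of the general pattern (illustrative names, not the actual S3ATimeInstrumentation code): the instrumented library accumulates elapsed time in a shared counter, which a plugin then publishes as a gauge:

import java.util.concurrent.atomic.AtomicLong

// Hypothetical accumulator for time spent in read calls (microseconds)
object IOTimers {
  val readTimeMuSec = new AtomicLong(0L)

  // Wrap an I/O call and add its elapsed time to the counter, as the
  // custom S3A jar does around read/seek operations
  def timed[T](body: => T): T = {
    val t0 = System.nanoTime()
    try body
    finally readTimeMuSec.addAndGet((System.nanoTime() - t0) / 1000L)
  }
}

A plugin then registers IOTimers.readTimeMuSec as a gauge on ctx.metricRegistry, exactly as in the earlier plugin sketch.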
Lessons Learned and Further Work
▪ Feedback from deploying the dashboard at CERN
▪ We provide the Spark dashboard as an optional configuration for the CERN Jupyter-based data analysis platform
▪ There is a cognitive load in understanding the available metrics and the troubleshooting process
▪ We set data retention policies and in general take care not to overload InfluxDB
▪ Core Spark development
▪ A native InfluxDB sink would be useful (currently Spark has a Graphite sink)
▪ Spark is not yet fully instrumented; a few areas remain to be covered, for example instrumentation of I/O time and Python UDF run time
Conclusions
▪ Spark metrics and dashboard
▪ Provide extensive performance monitoring data. Complement the Web UI.
▪ Spark plugins
▪ Introduced in Spark 3.0, make it easy to augment the Spark metrics system.
▪ Use plugins to monitor Spark on Kubernetes and cloud filesystem usage.
▪ Tooling
▪ Get started with a dashboard based on InfluxDB and Grafana.
▪ Build your plugins and dashboards, and share your experience!
Acknowledgments and Links
▪ Thanks to CERN data analytics and Spark service team
▪ Thanks to Apache Spark committers and Spark community
▪ For their help with PRs: SPARK-22190, SPARK-25170, SPARK-25228, SPARK-26928, SPARK-27189, SPARK-28091, SPARK-29654, SPARK-30041, SPARK-30775, SPARK-31711
▪ Links:
• https://github.com/cerndb/SparkPlugins
• https://github.com/cerndb/spark-dashboard
• https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark
• https://github.com/LucaCanali/sparkMeasure