Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Cloud-Native Spark Scheduling with YuniKorn Scheduler
Li Gao
Tech lead and engineer @ Databricks Compute Fabric
Previous tech lead at data infrastructure @ Lyft
Weiwei Yang
Tech Lead @ Cloudera Compute Platform
Apache Hadoop Committer & PMC member
Previous tech lead at Real-time Compute Infra @ Alibaba
Agenda
Li Gao
Why Lyft is choosing Spark on K8s
The need for custom k8s scheduler for Spark
Weiwei Yang
Spark Scheduling with YuniKorn
Deep Dive into YuniKorn Features
Community and Roadmap
Role of K8s in Lyft’s Data Landscape
Why Choose K8s for Spark
▪ Containerized Spark compute to provide shared resources across different ML and ETL jobs
▪ Support for multiple Spark versions, Python versions, and version-controlled containers on the shared K8s clusters, for both faster iteration and stable production
▪ A single, unified infrastructure for both the majority of our data compute and microservices, with advanced, unified observability and resource isolation support
▪ Fine-grained access controls on shared clusters
The Spark K8s Infra @ Lyft
Multi-step creation for a Spark K8s job
[Diagram: components involved in creating a Spark K8s job: jobs, resource labels, cluster pool, K8s cluster, namespace group, namespace, Spark CRD, Spark pods, DataLake]
Problems of the existing Spark K8s infrastructure
▪ Complex layers of custom K8s controllers are needed to handle the scale of the Spark jobs
▪ Tight coupling of the controller layers amplifies latency issues in certain cases
▪ Priority queues between jobs, clusters, and namespaces are managed by multiple layers of controllers to achieve the desired performance
Why we need a customized K8s scheduler
▪ High latency (~100 seconds) is observed with the default scheduler on a single K8s cluster for large volumes of batch workloads
▪ Fair sharing among large batch jobs in the same resource pool is unpredictable with the default scheduler
▪ Mix of FIFO and FAIR requirements on shared job clusters
▪ The need for elastic, hierarchical priority management for jobs in K8s
▪ Richer, online visibility for users into the scheduling behavior
▪ Simplified layers of controllers with a custom K8s scheduler
Spark Scheduling with YuniKorn
Flavors of Running Spark on K8s
▪ Native Spark on K8s: identifies Spark jobs by pod labels
▪ Spark K8s Operator: identifies Spark jobs by CRDs (e.g. SparkApplication)
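To make the label-based flavor concrete, here is a minimal sketch of how a scheduler could group pods into Spark applications. It assumes pods are plain dicts shaped like Kubernetes Pod objects and relies on the `spark-app-selector` (app ID) and `spark-role` (driver/executor) labels that Spark's native K8s support sets on its pods; the helper name `group_spark_apps` is hypothetical.

```python
from collections import defaultdict

# Labels that Spark's native Kubernetes support sets on driver and executor pods.
APP_ID_LABEL = "spark-app-selector"
ROLE_LABEL = "spark-role"


def group_spark_apps(pods):
    """Group pod names into applications keyed by the Spark app ID label.

    Pods without the app ID label (or with an unknown role) are ignored.
    Returns {app_id: {"driver": [...], "executor": [...]}}.
    """
    apps = defaultdict(lambda: {"driver": [], "executor": []})
    for pod in pods:
        labels = pod.get("metadata", {}).get("labels", {})
        app_id = labels.get(APP_ID_LABEL)
        role = labels.get(ROLE_LABEL)
        if app_id is None or role not in ("driver", "executor"):
            continue
        apps[app_id][role].append(pod["metadata"]["name"])
    return dict(apps)


pods = [
    {"metadata": {"name": "job1-driver", "labels": {APP_ID_LABEL: "spark-001", ROLE_LABEL: "driver"}}},
    {"metadata": {"name": "job1-exec-1", "labels": {APP_ID_LABEL: "spark-001", ROLE_LABEL: "executor"}}},
    {"metadata": {"name": "web-7f9c", "labels": {"app": "web"}}},
]
print(group_spark_apps(pods))
```

With the operator flavor, the application identity would come from the SparkApplication resource instead of pod labels.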
Resource Scheduling in K8s
Scheduler workflow in plain language: the scheduler picks up one pod at a time, finds the best-fit node, and then launches the pod on that node.
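A simplified sketch of that pod-at-a-time loop: filter out nodes that cannot fit the pod, score the rest, and bind to the winner. The scoring rule (prefer the node with the most CPU left after placement) and the data shapes are assumptions for illustration, not kube-scheduler's actual plugin set. Note that each pod is decided in isolation; nothing here knows which application a pod belongs to.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: float
    free_mem: int  # MiB


@dataclass
class Pod:
    name: str
    cpu: float
    mem: int  # MiB


def schedule_one(pod, nodes):
    """Pick a node for a single pod: filter, score, then bind."""
    # Filter: keep only nodes that can fit the pod's requests.
    feasible = [n for n in nodes if n.free_cpu >= pod.cpu and n.free_mem >= pod.mem]
    if not feasible:
        return None  # the pod stays Pending
    # Score: most CPU left after placement wins; ties broken by node name.
    best = max(feasible, key=lambda n: (n.free_cpu - pod.cpu, n.name))
    # "Bind": reserve the pod's resources on the chosen node.
    best.free_cpu -= pod.cpu
    best.free_mem -= pod.mem
    return best.name


def schedule_all(pods, nodes):
    # One pod at a time, in queue order.
    return {p.name: schedule_one(p, nodes) for p in pods}


nodes = [Node("n1", 4, 8192), Node("n2", 2, 4096)]
pods = [Pod("a", 2, 2048), Pod("b", 2, 2048), Pod("c", 2, 2048), Pod("d", 1, 1024)]
print(schedule_all(pods, nodes))
```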
Spark on K8s: the scheduling challenges
▪ Job scheduling requirements
  ▪ Job ordering/queueing
  ▪ Job-level priority
  ▪ Resource fairness (between jobs / queues)
  ▪ Gang scheduling
▪ Resource sharing and utilization challenges
  ▪ Spark driver pods occupy all resources in a namespace
  ▪ Resource competition and deadlock between large jobs
  ▪ Misbehaving jobs can abuse resources
▪ High throughput
Workload types: ad-hoc queries, batch jobs, workflows (DAG), streaming
The need for a unified architecture across on-prem, cloud, multi-cloud, and hybrid cloud
The K8s default scheduler was NOT created to tackle these challenges
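The "deadlock between large jobs" problem is easy to reproduce on paper. In this hypothetical sketch, two jobs each need 6 slots (1 driver + 5 executors) on a 10-slot cluster, and we assume a job only makes progress once it holds all of them. Allocating pod by pod, interleaved between jobs, leaves both stuck; an all-or-nothing (gang) allocation runs one job and leaves the other waiting. Slot counts and function names are illustrative assumptions.

```python
def allocate_pod_by_pod(capacity, demands):
    """Interleave single-pod allocations across jobs until nothing fits."""
    held = {job: 0 for job in demands}
    free = capacity
    progress = True
    while progress:
        progress = False
        for job, need in demands.items():
            if held[job] < need and free > 0:
                held[job] += 1
                free -= 1
                progress = True
    return held


def allocate_gang(capacity, demands):
    """Give each job its full demand at once, or nothing (the job waits)."""
    held = {job: 0 for job in demands}
    free = capacity
    for job, need in demands.items():
        if need <= free:
            held[job] = need
            free -= need
    return held


def runnable(held, demands):
    # Under our assumption, a job runs only once it holds everything it needs.
    return sorted(job for job in demands if held[job] == demands[job])


demands = {"job-a": 6, "job-b": 6}
pod_by_pod = allocate_pod_by_pod(10, demands)
gang = allocate_gang(10, demands)
print(pod_by_pod, runnable(pod_by_pod, demands))  # both stuck at 5 slots
print(gang, runnable(gang, demands))              # job-a runs, job-b waits
```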
Apache YuniKorn (Incubating)
What is it:
▪ A standalone resource scheduler for K8s
▪ Focuses on building scheduling capabilities to empower Big Data on K8s
Simple to use:
▪ A stateless service on K8s
▪ Runs as a secondary K8s scheduler or as a replacement for the default scheduler
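When YuniKorn runs as a secondary scheduler, a pod opts in through the standard Kubernetes `spec.schedulerName` field. The sketch below builds such a manifest as a Python dict. It assumes the scheduler is deployed under the name `yunikorn` and that the `applicationId` and `queue` labels are how pods are mapped to apps and queues; verify the label names against the YuniKorn version you run. The image and queue values are placeholders.

```python
import json


def yunikorn_pod(name, app_id, queue, image, cpu="1", memory="1024Mi"):
    """Build a Pod manifest that asks to be scheduled by YuniKorn.

    Assumptions: the scheduler name is "yunikorn", and it reads the
    "applicationId" and "queue" labels to group pods and pick a queue.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "labels": {"applicationId": app_id, "queue": queue},
        },
        "spec": {
            # Standard Kubernetes field: bypasses the default scheduler.
            "schedulerName": "yunikorn",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    "resources": {"requests": {"cpu": cpu, "memory": memory}},
                }
            ],
        },
    }


manifest = yunikorn_pod("sleep-0", "app-0001", "root.sandbox", "example.com/spark:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.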
Resource Scheduling in YuniKorn (compared with the default scheduler)
[Diagram: apps are submitted to the API server (backed by etcd) and pods are launched by the kubelet. The default scheduler runs filter, score, and sort steps (plus extensions) for each pod; YuniKorn tracks apps, nodes, queues, and requests, and applies pluggable queue-sort, app-sort, and node-sort policies.]
YuniKorn's QUEUE and APP concepts are critical to providing advanced job scheduling and fine-grained resource management capabilities.
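The queue/app split shows up directly in the allocation order. Below is a minimal sketch of one allocation cycle: sort queues, then sort apps, then select a request, then select a node. It assumes a flat set of leaf queues ordered by usage relative to their guarantee, FIFO app ordering inside a queue, and nodes ordered by most free memory. All class and field names are hypothetical; real YuniKorn policies are pluggable and richer.

```python
from dataclasses import dataclass, field


@dataclass
class App:
    name: str
    submit_time: int
    pending: list  # memory asks (MiB) of pending pods, in request order


@dataclass
class Queue:
    name: str
    guaranteed: int  # MiB
    used: int = 0
    apps: list = field(default_factory=list)


def allocate_once(queues, nodes):
    """One cycle: sort queues -> sort apps -> select request -> select node.

    `nodes` maps node name -> free memory (MiB). Returns (queue, app, node)
    for the allocation made, or None if nothing could be placed.
    """
    # 1. Queues furthest below their guarantee go first.
    for q in sorted(queues, key=lambda q: (q.used / q.guaranteed, q.name)):
        # 2. Inside a queue, older apps go first (FIFO).
        for app in sorted(q.apps, key=lambda a: a.submit_time):
            if not app.pending:
                continue
            ask = app.pending[0]  # 3. the app's next request
            # 4. Nodes with the most free memory go first.
            for node in sorted(nodes, key=lambda n: (-nodes[n], n)):
                if nodes[node] >= ask:
                    nodes[node] -= ask
                    q.used += ask
                    app.pending.pop(0)
                    return q.name, app.name, node
    return None


queues = [
    Queue("root.a", guaranteed=4096, used=2048, apps=[App("a-job", 1, [1024])]),
    Queue("root.b", guaranteed=4096, apps=[App("b-late", 5, [1024]), App("b-early", 2, [512])]),
]
nodes = {"n1": 2048, "n2": 4096}
while (step := allocate_once(queues, nodes)) is not None:
    print(step)
```

Here `root.b` is served first because it is further below its guarantee, and within it `b-early` beats `b-late` on submission time.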
Main differences (YuniKorn vs. default scheduler)
▪ Scheduling at the app dimension: the app is a first-class citizen in YuniKorn, which schedules apps with respect to, e.g., their submission order, priority, and resource usage.
▪ Job ordering: YuniKorn supports FIFO/FAIR/Priority (WIP) job ordering policies.
▪ Fine-grained resource capacity management: cluster resources are managed with hierarchical queues; a queue provides guaranteed resources (min) and a resource quota (max).
▪ Resource fairness: inter-queue resource fairness.
▪ Native support for Big Data workloads: the default scheduler is mainly designed for long-running services. YuniKorn is designed for Big Data app workloads and natively supports Spark, Flink, TensorFlow, etc.
▪ Scale & performance: YuniKorn is optimized for performance and is suitable for high-throughput, large-scale environments.
Run Spark with YuniKorn
Submit a Spark job, either by:
1) Running spark-submit, or
2) Creating a SparkApplication CRD
Job is starting: the api-server creates the Spark driver pod (Pending), and the Spark job (e.g. Spark-job-001) is registered to YuniKorn in a leaf queue.
YuniKorn sorts queues -> sorts apps -> selects a request -> selects a node, then asks the api-server to bind the pod to that node.
Spark driver is running: the api-server binds the driver pod to the assigned node (Bound), and the driver pod starts.
Spark executors are created: the driver requests executor pods from the api-server, which creates them; the new executors are added to YuniKorn as pending requests.
Spark job is running: YuniKorn schedules and binds the executors.
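For option 1 (spark-submit), one possible way to route both driver and executor pods to YuniKorn is a pod template that sets `schedulerName`, plus a queue label on every pod. The sketch below only builds the command line; it assumes Spark 3.0+ (for `spark.kubernetes.{driver,executor}.podTemplateFile`), a scheduler named `yunikorn`, a `queue` label read by YuniKorn, and that YuniKorn derives the application ID from Spark's own `spark-app-selector` label. The API server URL, image, paths, and jar are placeholders.

```python
import json
import shlex


def yunikorn_pod_template():
    """Pod template handing driver/executor pods to YuniKorn (assumed name)."""
    return {"apiVersion": "v1", "kind": "Pod", "spec": {"schedulerName": "yunikorn"}}


def spark_submit_args(api_server, image, template_path, queue, main_class, app_jar):
    """Build a cluster-mode spark-submit command for Spark on K8s.

    The "queue" label name is an assumption about how YuniKorn picks the queue.
    """
    conf = {
        "spark.kubernetes.container.image": image,
        "spark.kubernetes.driver.podTemplateFile": template_path,
        "spark.kubernetes.executor.podTemplateFile": template_path,
        "spark.kubernetes.driver.label.queue": queue,
        "spark.kubernetes.executor.label.queue": queue,
    }
    args = [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
    ]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args + [app_jar]


# Save this to template_path before submitting (JSON is valid YAML).
print(json.dumps(yunikorn_pod_template()))
args = spark_submit_args(
    api_server="https://k8s-apiserver.example.com:6443",  # placeholder
    image="example.com/spark:3.0.0",  # placeholder
    template_path="/tmp/yunikorn-template.yaml",  # placeholder
    queue="root.sandbox",
    main_class="org.apache.spark.examples.SparkPi",
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",  # placeholder
)
print(shlex.join(args))
```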
Deep Dive into YuniKorn Features/Performance
Job Ordering
Why does this matter?
▪ If I submit my job earlier, I want it to run first
▪ I don't want my job to be starved because resources are used by others
▪ I have an urgent job, let me run first!
Per-queue sorting policy
▪ FIFO - order jobs by submission time
▪ FAIR - order jobs by resource consumption
▪ Priority (WIP, 0.9) - order jobs by job-level priorities within the same queue
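A minimal sketch of the two released per-queue policies. FIFO sorts by submission time; FAIR is approximated here as "least memory currently allocated first", which is an assumption, since the slide does not say which consumption metric the real policy uses. The Priority policy (WIP) is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    submit_time: int  # e.g. epoch seconds
    used_mem: int     # MiB currently allocated to the job


# Sort keys per policy; names are included last to make ordering deterministic.
SORT_KEYS = {
    "fifo": lambda j: (j.submit_time, j.name),
    "fair": lambda j: (j.used_mem, j.submit_time, j.name),
}


def order_jobs(jobs, policy):
    """Return job names in the order a queue would offer them resources."""
    try:
        key = SORT_KEYS[policy]
    except KeyError:
        raise ValueError(f"unknown policy: {policy}") from None
    return [j.name for j in sorted(jobs, key=key)]


jobs = [Job("etl-nightly", 100, 8192), Job("adhoc-query", 200, 0), Job("ml-train", 150, 2048)]
print(order_jobs(jobs, "fifo"))  # ['etl-nightly', 'ml-train', 'adhoc-query']
print(order_jobs(jobs, "fair"))  # ['adhoc-query', 'ml-train', 'etl-nightly']
```

Under FIFO the earliest job wins even though it already holds the most memory; under FAIR the job holding nothing goes first, which is what keeps late or small jobs from starving.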
Resource Quota Management: K8s Namespace ResourceQuota
K8s Namespace Resource Quota
▪ Defines resource limits
▪ Enforced by the quota admission controller
Problems
▪ Hard to control when resource quotas are overcommitted
▪ Users have no guaranteed resources
▪ Quota can be abused (e.g. by pending pods)
▪ No queueing for jobs…
▪ Low utilization?!
Namespace Resource Quota is suboptimal for supporting resource sharing between multiple tenants
Resource Quota Management: YuniKorn Queue Capacity
YuniKorn queues provide a better solution for managing resource quotas
▪ A queue can map to one (or more) namespaces automatically
▪ Capacity is elastic, from min to max
▪ Honors resource fairness
▪ Quota is only counted for pods that actually consume resources
▪ Enables job queueing
[Diagram: a namespace mapped to a YuniKorn queue with max CPU: 5, memory: 5120Mi, holding pods that request CPU: 1 / memory: 1024Mi, CPU: 2 / memory: 2048Mi, and CPU: 2 / memory: 2048Mi]
-> better resource sharing, ensure guarantee, enforce max
-> zero config queue mgmt
-> avoid starving jobs/users
-> accurate resource counting, improve utilization
-> jobs can be queued in the scheduler, keep client side logic simple
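The contrast between the two quota models can be sketched in a few lines. The namespace path mirrors ResourceQuota behavior (every non-terminal pod counts, Pending included, and a pod that would exceed the hard limit is rejected at creation). The queue path is a simplified model of the behavior described above, in which only pods holding resources count and a request that does not fit waits. The first example reuses the pod sizes from the diagram; the stuck 3-CPU pod and the 8 CPU / 8192Mi limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Res:
    """A resource vector: CPU cores and memory in MiB."""
    cpu: int
    mem: int

    def __add__(self, other):
        return Res(self.cpu + other.cpu, self.mem + other.mem)

    def fits_in(self, limit):
        return self.cpu <= limit.cpu and self.mem <= limit.mem


ZERO = Res(0, 0)


def namespace_quota_admit(created_pods, new_pod, hard):
    """ResourceQuota style: all created pods count, Pending ones included,
    and a pod that would exceed the hard limit is rejected outright."""
    used = sum(created_pods, ZERO)
    return "admitted" if (used + new_pod).fits_in(hard) else "rejected"


def queue_try_allocate(running_pods, new_pod, queue_max):
    """Queue style (simplified): only pods holding resources count against
    the queue max, and a request that does not fit waits in the queue."""
    used = sum(running_pods, ZERO)
    return "allocated" if (used + new_pod).fits_in(queue_max) else "queued"


# The three pods from the diagram fill a queue with max CPU 5 / 5120Mi exactly.
running = [Res(1, 1024), Res(2, 2048), Res(2, 2048)]
print(queue_try_allocate(running, Res(1, 1024), Res(5, 5120)))  # queued

# Hypothetical: a 3-CPU pod stuck in Pending still consumes namespace quota.
stuck = Res(3, 3072)
print(namespace_quota_admit(running + [stuck], Res(1, 1024), Res(8, 8192)))  # rejected
print(queue_try_allocate(running, Res(1, 1024), Res(8, 8192)))  # allocated
```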
Resource Fairness in YuniKorn Queues
Queue          Guaranteed resource (Mem)   Requests (NumOfPods * Mem)
root.default   500,000                     1000 * 10
root.search    400,000                     500 * 10
root.test      100,000                     200 * 10
root.sandbox   100,000                     200 * 50
Scheduling workloads with different requests in 4 queues with different guaranteed resources.
The usage ratios of all queues increased with a similar trend.
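Why the ratios rise together can be shown with a small simulation over the table above. It assumes the scheduler always serves the pending queue with the lowest usage-to-guarantee ratio, a one-resource simplification of cross-queue fairness rather than YuniKorn's exact algorithm, and it treats the table's numbers as unit-less. With this rule, the gap between the highest and lowest ratio never exceeds the largest single-pod step while every queue still has pending pods.

```python
# queue -> (guaranteed memory, number of pods, memory per pod), from the table.
QUEUES = {
    "root.default": (500_000, 1000, 10),
    "root.search": (400_000, 500, 10),
    "root.test": (100_000, 200, 10),
    "root.sandbox": (100_000, 200, 50),
}


def fair_allocate(queues, num_pods):
    """Allocate pods one at a time to the pending queue with the lowest
    usage/guaranteed ratio. Returns {queue: usage ratio}."""
    used = {q: 0 for q in queues}
    left = {q: pods for q, (_, pods, _) in queues.items()}
    for _ in range(num_pods):
        pending = [q for q in queues if left[q] > 0]
        if not pending:
            break
        q = min(pending, key=lambda name: (used[name] / queues[name][0], name))
        used[q] += queues[q][2]
        left[q] -= 1
    return {q: used[q] / queues[q][0] for q in queues}


ratios = fair_allocate(QUEUES, 1000)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
```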
Scheduler Throughput Benchmark
Schedule 50,000 pods on 2,000/4,000 nodes and compare scheduling throughput (pods per second allocated by the scheduler).
Red line: YuniKorn; green line: default scheduler.
▪ 50k pods on 2k nodes: 617 vs 263 (↑ 134%)
▪ 50k pods on 4k nodes: 373 vs 141 (↑ 164%)
Detail report:
https://github.com/apache/incubator-yunikorn-core/blob/master/docs/evaluate-perf-function-with-Kubemark.md
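The headline percentages follow from the raw rates. The arithmetic below gives about +134.6% and +164.5%, so the slide's figures appear to be these values rounded down. The "time to place all pods" column is only a rough estimate that assumes each scheduler sustains its reported rate for the whole run.

```python
PODS = 50_000
RESULTS = {  # case -> (YuniKorn pods/s, default scheduler pods/s)
    "50k pods on 2k nodes": (617, 263),
    "50k pods on 4k nodes": (373, 141),
}


def gain_pct(new, old):
    """Relative throughput improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100


for case, (yk, default) in RESULTS.items():
    print(f"{case}: +{gain_pct(yk, default):.1f}%, "
          f"~{PODS / yk:.0f}s vs ~{PODS / default:.0f}s for {PODS} pods")
```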
Fully K8s Compatible
▪ Supports K8s predicates
  ▪ Node selector
  ▪ Pod affinity/anti-affinity
  ▪ Taints and tolerations
  ▪ …
▪ Supports PersistentVolumeClaim and PersistentVolume
  ▪ Volume bindings
  ▪ Dynamic provisioning
▪ Publishes key scheduling events to the K8s event system
▪ Works with the cluster autoscaler
▪ Supports management commands
  ▪ cordon nodes
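As an illustration of what honoring K8s predicates involves, here is a simplified sketch of two of them, node selector and taints/tolerations, following the documented Kubernetes matching rules: an `Exists` toleration with no key matches every taint, the default operator is `Equal`, an empty toleration effect matches any effect, and only `NoSchedule`/`NoExecute` taints block placement. Affinity, `tolerationSeconds`, and the soft `PreferNoSchedule` effect are left out. The function names are hypothetical.

```python
def tolerates(taint, tolerations):
    """Return True if any toleration matches the taint (simplified rules)."""
    for tol in tolerations:
        # An empty toleration effect matches any taint effect.
        if tol.get("effect") and tol["effect"] != taint["effect"]:
            continue
        if tol.get("operator", "Equal") == "Exists":
            # Exists with no key tolerates every taint.
            if not tol.get("key") or tol["key"] == taint["key"]:
                return True
        elif tol.get("key") == taint["key"] and tol.get("value") == taint.get("value"):
            return True
    return False


def node_fits(pod, node):
    """Check a pod's nodeSelector and tolerations against one node."""
    # nodeSelector: every selector label must match the node's labels exactly.
    for key, value in pod.get("nodeSelector", {}).items():
        if node["labels"].get(key) != value:
            return False
    # Hard taints block pods that do not tolerate them.
    for taint in node.get("taints", []):
        if taint["effect"] in ("NoSchedule", "NoExecute") and not tolerates(
            taint, pod.get("tolerations", [])
        ):
            return False
    return True


batch_node = {
    "labels": {"pool": "batch"},
    "taints": [{"key": "dedicated", "value": "spark", "effect": "NoSchedule"}],
}
spark_pod = {
    "nodeSelector": {"pool": "batch"},
    "tolerations": [{"key": "dedicated", "operator": "Equal", "value": "spark", "effect": "NoSchedule"}],
}
print(node_fits(spark_pod, batch_node))                             # True
print(node_fits({"nodeSelector": {"pool": "batch"}}, batch_node))   # False: taint not tolerated
```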
YuniKorn Management Console
Compare YuniKorn with other K8s schedulers
(v = supported, x = not supported, v* = tracked in YUNIKORN-2)
Capability                                  K8s default   Kube-batch   YuniKorn
Resource sharing - hierarchy queues         x             x            v
Resource sharing - queue elastic capacity   x             x            v
Resource fairness - cross-queue             x             v            v
Resource fairness - user level              x             x            v
Resource fairness - app level               x             v            v
Preemption - basic                          v             v            v
Preemption - with fairness                  x             x            v
Gang scheduling                             x             v            v* (YUNIKORN-2)
Bin packing                                 v             v            v
Throughput: K8s default scheduler 260 allocs/s (2k nodes); Kube-batch ? (likely slower than kube-default, from [1]); YuniKorn 610 allocs/s (2k nodes)
[1] https://github.com/kubernetes-sigs/kube-batch/issues/930
Community, Summary and Next
Current Status
▪ Open-sourced on July 17, 2019, under the Apache 2.0 License
▪ Entered the Apache Incubator on Jan 21, 2020
▪ Latest stable version 0.8.0 released on May 4, 2020
▪ Diverse community with members from Alibaba, Cloudera,
Microsoft, LinkedIn, Apple, Tencent, Nvidia and more…
The Community
▪ Deployed in non-production K8s clusters
▪ Launched 100s of large jobs per day on some of the YuniKorn queues
▪ Reduced our large-job scheduling latency by a factor of ~3x at peak time
▪ Improved overall K8s cluster resource utilization efficiency (cost per compute) over the default kube-scheduler for mixed workloads
▪ FIFO and FAIR requests are met more frequently than before

▪ Shipping with Cloudera Public Cloud offerings
▪ Provides resource quota management and advanced job scheduling capabilities for Spark
▪ Responsible for both microservice and batch job scheduling
▪ Running on cloud with auto-scaling enabled

▪ Deployed on a pre-production on-prem cluster with ~100 nodes
▪ Plans to deploy on a 1000+ node production K8s cluster this year
▪ Leverages YuniKorn features such as hierarchical queues and resource fairness to run large-scale Flink jobs on K8s
▪ Gained 4x scheduling performance improvements
Roadmap
Current (0.8.0)
● Hierarchical queues
● Cross-queue fairness
● Fair/FIFO job ordering policies
● Fair/Bin-packing node sorting policies
● Self queue management
● Pluggable app discovery
● Metrics system and Prometheus integration
Upcoming (0.9.0), 3rd quarter of 2020
● Gang scheduling
● Job/task priority support (scheduling & preemption)
● Support for Spark dynamic allocation
Our Vision - Resource Mgmt for Big Data & ML
Unified Compute Platform for Big Data & ML
▪ Compute: data engineering, real-time streaming, machine learning
▪ Types: microservices, batch jobs, long-running workloads, interactive sessions, model serving
▪ Targets: multi-tenancy, SLA, resource utilization, cost management, budget
Join us in the YuniKorn Community!
▪ Project web site: http://yunikorn.apache.org/
▪ Github repo: apache/incubator-yunikorn-core
▪ Mailing list: dev@yunikorn.apache.org
▪ Slack channel:
▪ Bi-weekly/Monthly sync up meetings for different time zones
Thank you!!

 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
 
Jeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and QualityJeeves Grows Up: An AI Chatbot for Performance and Quality
Jeeves Grows Up: An AI Chatbot for Performance and Quality
 

Último

Data Analytics Fundamentals: data analytics types.potx
Data Analytics Fundamentals: data analytics types.potxData Analytics Fundamentals: data analytics types.potx
Data Analytics Fundamentals: data analytics types.potxEmmanuel Dauda
 
Paul Martin (Gartner) - Show Me the AI Money.pdf
Paul Martin (Gartner) - Show Me the AI Money.pdfPaul Martin (Gartner) - Show Me the AI Money.pdf
Paul Martin (Gartner) - Show Me the AI Money.pdfdcphostmaster
 
Bengaluru Tableau UG event- 2nd March 2024 Q1
Bengaluru Tableau UG event- 2nd March 2024 Q1Bengaluru Tableau UG event- 2nd March 2024 Q1
Bengaluru Tableau UG event- 2nd March 2024 Q1bengalurutug
 
Microeconomic Group Presentation Apple.pdf
Microeconomic Group Presentation Apple.pdfMicroeconomic Group Presentation Apple.pdf
Microeconomic Group Presentation Apple.pdfmxlos0
 
Using DAX & Time-based Analysis in Data Warehouse
Using DAX & Time-based Analysis in Data WarehouseUsing DAX & Time-based Analysis in Data Warehouse
Using DAX & Time-based Analysis in Data WarehouseThinkInnovation
 
Deloitte+RedCross_Talk to your data with Knowledge-enriched Generative AI.ppt...
Deloitte+RedCross_Talk to your data with Knowledge-enriched Generative AI.ppt...Deloitte+RedCross_Talk to your data with Knowledge-enriched Generative AI.ppt...
Deloitte+RedCross_Talk to your data with Knowledge-enriched Generative AI.ppt...Neo4j
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-Profits2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-ProfitsTimothy Spann
 
Neo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdf
Neo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdfNeo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdf
Neo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdfNeo4j
 
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptxSTOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptxFurkanTasci3
 
Stochastic Dynamic Programming and You.pptx
Stochastic Dynamic Programming and You.pptxStochastic Dynamic Programming and You.pptx
Stochastic Dynamic Programming and You.pptxjkmrshll88
 
Báo cáo Social Media Benchmark 2024 cho dân Marketing
Báo cáo Social Media Benchmark 2024 cho dân MarketingBáo cáo Social Media Benchmark 2024 cho dân Marketing
Báo cáo Social Media Benchmark 2024 cho dân MarketingMarketingTrips
 
Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...
Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...
Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...ferisulianta.com
 
How to Build an Experimentation Culture for Data-Driven Product Development
How to Build an Experimentation Culture for Data-Driven Product DevelopmentHow to Build an Experimentation Culture for Data-Driven Product Development
How to Build an Experimentation Culture for Data-Driven Product DevelopmentAggregage
 
Data Collection from Social Media Platforms
Data Collection from Social Media PlatformsData Collection from Social Media Platforms
Data Collection from Social Media PlatformsMahmoud Yasser
 
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdfNeo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdfNeo4j
 
PPT for Presiding Officer.pptxvvdffdfgggg
PPT for Presiding Officer.pptxvvdffdfggggPPT for Presiding Officer.pptxvvdffdfgggg
PPT for Presiding Officer.pptxvvdffdfggggbhadratanusenapati1
 
Enabling GenAI Breakthroughs with Knowledge Graphs
Enabling GenAI Breakthroughs with Knowledge GraphsEnabling GenAI Breakthroughs with Knowledge Graphs
Enabling GenAI Breakthroughs with Knowledge GraphsNeo4j
 
TCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTimothy Spann
 

Último (20)

Data Analytics Fundamentals: data analytics types.potx
Data Analytics Fundamentals: data analytics types.potxData Analytics Fundamentals: data analytics types.potx
Data Analytics Fundamentals: data analytics types.potx
 
Paul Martin (Gartner) - Show Me the AI Money.pdf
Paul Martin (Gartner) - Show Me the AI Money.pdfPaul Martin (Gartner) - Show Me the AI Money.pdf
Paul Martin (Gartner) - Show Me the AI Money.pdf
 
Bengaluru Tableau UG event- 2nd March 2024 Q1
Bengaluru Tableau UG event- 2nd March 2024 Q1Bengaluru Tableau UG event- 2nd March 2024 Q1
Bengaluru Tableau UG event- 2nd March 2024 Q1
 
Microeconomic Group Presentation Apple.pdf
Microeconomic Group Presentation Apple.pdfMicroeconomic Group Presentation Apple.pdf
Microeconomic Group Presentation Apple.pdf
 
Target_Company_Data_breach_2013_110million
Target_Company_Data_breach_2013_110millionTarget_Company_Data_breach_2013_110million
Target_Company_Data_breach_2013_110million
 
Using DAX & Time-based Analysis in Data Warehouse
Using DAX & Time-based Analysis in Data WarehouseUsing DAX & Time-based Analysis in Data Warehouse
Using DAX & Time-based Analysis in Data Warehouse
 
Deloitte+RedCross_Talk to your data with Knowledge-enriched Generative AI.ppt...
Deloitte+RedCross_Talk to your data with Knowledge-enriched Generative AI.ppt...Deloitte+RedCross_Talk to your data with Knowledge-enriched Generative AI.ppt...
Deloitte+RedCross_Talk to your data with Knowledge-enriched Generative AI.ppt...
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...
 
2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-Profits2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-Profits
 
Neo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdf
Neo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdfNeo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdf
Neo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdf
 
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptxSTOCK PRICE ANALYSIS  Furkan Ali TASCI --.pptx
STOCK PRICE ANALYSIS Furkan Ali TASCI --.pptx
 
Stochastic Dynamic Programming and You.pptx
Stochastic Dynamic Programming and You.pptxStochastic Dynamic Programming and You.pptx
Stochastic Dynamic Programming and You.pptx
 
Báo cáo Social Media Benchmark 2024 cho dân Marketing
Báo cáo Social Media Benchmark 2024 cho dân MarketingBáo cáo Social Media Benchmark 2024 cho dân Marketing
Báo cáo Social Media Benchmark 2024 cho dân Marketing
 
Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...
Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...
Prediction Of Cryptocurrency Prices Using Lstm, Svm And Polynomial Regression...
 
How to Build an Experimentation Culture for Data-Driven Product Development
How to Build an Experimentation Culture for Data-Driven Product DevelopmentHow to Build an Experimentation Culture for Data-Driven Product Development
How to Build an Experimentation Culture for Data-Driven Product Development
 
Data Collection from Social Media Platforms
Data Collection from Social Media PlatformsData Collection from Social Media Platforms
Data Collection from Social Media Platforms
 
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdfNeo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
Neo4j_Jesus Barrasa_The Art of the Possible with Graph.pptx.pdf
 
PPT for Presiding Officer.pptxvvdffdfgggg
PPT for Presiding Officer.pptxvvdffdfggggPPT for Presiding Officer.pptxvvdffdfgggg
PPT for Presiding Officer.pptxvvdffdfgggg
 
Enabling GenAI Breakthroughs with Knowledge Graphs
Enabling GenAI Breakthroughs with Knowledge GraphsEnabling GenAI Breakthroughs with Knowledge Graphs
Enabling GenAI Breakthroughs with Knowledge Graphs
 
TCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI Pipelines
 

Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler

  • 2. Cloud-Native Spark Scheduling with YuniKorn Scheduler
  • 3. Li Gao: Tech lead and engineer @ Databricks Compute Fabric; previously tech lead at data infrastructure @ Lyft. Weiwei Yang: Tech Lead @ Cloudera Compute Platform; Apache Hadoop Committer & PMC member; previously tech lead at Real-time Compute Infra @ Alibaba.
  • 4. Agenda
    ▪ Li Gao: Why Lyft is choosing Spark on K8s; the need for a custom K8s scheduler for Spark
    ▪ Weiwei Yang: Spark scheduling with YuniKorn; deep dive into YuniKorn features; community and roadmap
  • 5. Role of K8s in Lyft’s Data Landscape
  • 6. Why Choose K8s for Spark
    ▪ Containerized Spark compute provides shared resources across different ML and ETL jobs
    ▪ Support for multiple Spark versions, Python versions, and version-controlled containers on the shared K8s clusters, for both faster iteration and stable production
    ▪ A single, unified infrastructure for the majority of our data compute and microservices, with advanced, unified observability and resource-isolation support
    ▪ Fine-grained access controls on shared clusters
  • 7. The Spark K8s Infra @ Lyft
  • 8. Multi-step creation for a Spark K8s job (diagram: Resource Labels, Jobs, Cluster Pool, K8s Cluster, Namespace Group, Namespace, Spark CRD, Spark Pods, DataLake)
  • 9. Problems of the existing Spark K8s infrastructure
    ▪ Complexity: layers of custom K8s controllers are needed to handle the scale of the Spark jobs
    ▪ Tight coupling between controller layers amplifies latency issues in certain cases
    ▪ Priority queues between jobs, clusters, and namespaces are managed by multiple layers of controllers to achieve the desired performance
  • 10. Why we need a customized K8s scheduler
    ▪ High latency (~100 seconds) is observed with the default scheduler on a single K8s cluster running large volumes of batch workloads
    ▪ Fair sharing between large batch jobs in the same resource pool is unpredictable with the default scheduler
    ▪ A mix of FIFO and FAIR requirements on shared job clusters
    ▪ The need for elastic, hierarchical priority management for jobs in K8s
    ▪ Richer, online user visibility into scheduling behavior
    ▪ Simpler controller layering with a custom K8s scheduler
  • 12. Flavors of Running Spark on K8s
    ▪ Native Spark on K8s: Spark jobs are identified by pod labels
    ▪ Spark K8s Operator: Spark jobs are identified by CRDs (e.g. SparkApplication)
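With native spark-submit there is no job object, so a scheduler has to reassemble jobs from pod labels. A minimal sketch of that grouping follows. It assumes pods are plain dicts shaped like Kubernetes pod metadata and that the application ID sits in the `spark-app-selector` label and the role in the `spark-role` label, which is how Spark's Kubernetes backend labels its pods; the helper `make_pod` and the sample names and IDs are hypothetical.

```python
from collections import defaultdict

APP_LABEL = "spark-app-selector"  # assumed to carry the Spark application ID
ROLE_LABEL = "spark-role"         # assumed values: "driver" or "executor"


def group_spark_apps(pods: list[dict]) -> dict[str, dict[str, list[str]]]:
    """Group pod names by Spark application ID, split by driver/executor role."""
    apps = defaultdict(lambda: {"driver": [], "executor": []})
    for pod in pods:
        labels = pod.get("metadata", {}).get("labels", {})
        app_id = labels.get(APP_LABEL)
        role = labels.get(ROLE_LABEL)
        if app_id is None or role not in ("driver", "executor"):
            continue  # not a Spark pod (e.g. a microservice): ignore it
        apps[app_id][role].append(pod["metadata"]["name"])
    return dict(apps)


def make_pod(name: str, app_id: str, role: str) -> dict:
    """Hypothetical helper: a minimal pod dict with Spark-style labels."""
    return {"metadata": {"name": name, "labels": {APP_LABEL: app_id, ROLE_LABEL: role}}}


pods = [
    make_pod("job1-driver", "spark-001", "driver"),
    make_pod("job1-exec-1", "spark-001", "executor"),
    make_pod("job2-driver", "spark-002", "driver"),
    {"metadata": {"name": "web-7f9c", "labels": {"app": "web"}}},
]
apps = group_spark_apps(pods)
```

With the Spark K8s Operator, the SparkApplication CRD already names the job, so this reconstruction step is not needed.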
  • 13. Resource Scheduling in K8s
    ▪ The scheduler workflow in plain language: the scheduler picks up one pod at a time, finds the best-fit node, and then launches the pod on that node.
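The pod-at-a-time workflow on slide 13 can be sketched as a filter step followed by a score step. This is a simplified illustration, not kube-scheduler's code: nodes and pods are hypothetical dicts with only CPU and memory, and the score (most CPU left after placement) stands in for the real scoring plugins. Note that the loop never sees which job a pod belongs to.

```python
from typing import Optional


def schedule_one(pod: dict, nodes: dict[str, dict]) -> Optional[str]:
    """Place a single pod: filter infeasible nodes, score the rest, 'bind' the winner."""
    # Filter: keep nodes with enough free CPU and memory for this pod.
    feasible = [
        name for name, free in nodes.items()
        if free["cpu"] >= pod["cpu"] and free["mem"] >= pod["mem"]
    ]
    if not feasible:
        return None  # no node fits: the pod stays Pending
    # Score: prefer the node with the most CPU left after placement.
    best = max(feasible, key=lambda name: nodes[name]["cpu"] - pod["cpu"])
    # Bind: reserve the pod's resources on the chosen node.
    nodes[best]["cpu"] -= pod["cpu"]
    nodes[best]["mem"] -= pod["mem"]
    return best


nodes = {"node-a": {"cpu": 4, "mem": 8192}, "node-b": {"cpu": 2, "mem": 4096}}
pending = [{"name": f"p{i}", "cpu": 2, "mem": 2048} for i in range(1, 5)]
# One pod at a time, in queue order; the scheduler sees pods, never jobs.
placements = {pod["name"]: schedule_one(pod, nodes) for pod in pending}
```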
  • 14. Spark on K8s: the scheduling challenges
    ▪ Job scheduling requirements: job ordering/queueing; job-level priority; resource fairness (between jobs/queues); gang scheduling
    ▪ Resource sharing and utilization challenges: Spark driver pods occupy all resources in a namespace; resource competition and deadlock between large jobs; misbehaving jobs can abuse resources; high throughput
    ▪ Workloads: ad-hoc queries, batch jobs, workflows (DAG), streaming
    ▪ The need for a unified architecture for on-prem, cloud, multi-cloud, and hybrid cloud
    ▪ The K8s default scheduler was NOT created to tackle these challenges
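The "resource competition, deadlock between large jobs" problem on slide 14 is easy to reproduce in a toy model. The sketch below is hypothetical (one resource pool counted in cores, each job a driver plus N executors, and the simplifying assumption that a job only makes progress once it holds all of its pods). It compares admitting pods one at a time, round-robin across jobs, with an all-or-nothing gang check; it illustrates the idea only and is not YuniKorn's gang-scheduling implementation, which the slides list as upcoming.

```python
def pod_by_pod(capacity: int, jobs: dict[str, int]) -> dict[str, int]:
    """Admit pods (1 core each) round-robin across jobs until the pool is full."""
    granted = {name: 0 for name in jobs}
    free = capacity
    progress = True
    while free > 0 and progress:
        progress = False
        for name, needed in jobs.items():
            if free > 0 and granted[name] < needed:
                granted[name] += 1
                free -= 1
                progress = True
    return granted


def gang(capacity: int, jobs: dict[str, int]) -> dict[str, int]:
    """Admit a job only if all of its pods fit at once; otherwise it keeps waiting."""
    granted = {name: 0 for name in jobs}
    free = capacity
    for name, needed in jobs.items():
        if needed <= free:
            granted[name] = needed
            free -= needed
    return granted


def runnable(jobs: dict[str, int], granted: dict[str, int]) -> list[str]:
    """Toy assumption: a job makes progress only when it holds all of its pods."""
    return [name for name, needed in jobs.items() if granted[name] == needed]


jobs = {"job-a": 6, "job-b": 6}  # each: 1 driver + 5 executors, 1 core per pod
naive = pod_by_pod(8, jobs)      # both jobs hold part of the pool
gang_result = gang(8, jobs)      # job-b waits until job-a releases its pods
```

In the naive run each job holds 4 of the 8 cores and neither can finish; with the gang check one job runs to completion while the other waits.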
  • 15. Apache YuniKorn (Incubating)
    ▪ What is it: a standalone resource scheduler for K8s, focused on building scheduling capabilities that empower Big Data on K8s
    ▪ Simple to use: a stateless service on K8s that runs either as a secondary K8s scheduler or as a replacement for the default scheduler
  • 16. Resource Scheduling in YuniKorn (compared with the default scheduler) (diagram labels: Apps, API Server, ETCD, Resource Scheduler, Kubelet, Nodes, Queues, Request; Filter, Score, Sort, Extensions; Queue Sort, App Sort, Node Sort, Pluggable Policies; YuniKorn vs. Default Scheduler). YuniKorn's QUEUE and APP concepts are critical to providing advanced job scheduling and fine-grained resource management capabilities.
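Slide 16 contrasts a flat pod queue with YuniKorn's hierarchy of queue sort, app sort, and node sort (slide 18 spells it out as "Sort queues -> sort apps -> select request -> select node"). A minimal sketch of one such cycle is below. The data model and the policy choices (queues ordered by used/guaranteed, apps FIFO by submission time, nodes with the most free memory first) are illustrative assumptions, not YuniKorn's internals, and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class App:
    app_id: str
    submit_time: int
    pending: list[int]  # memory asks of pending pods, in request order


@dataclass
class LeafQueue:
    name: str
    guaranteed: int     # guaranteed memory
    used: int = 0
    apps: list[App] = field(default_factory=list)


def schedule_cycle(queues: list[LeafQueue], nodes: dict[str, int]) -> Optional[tuple[str, str, str]]:
    """Place one pending request; return (queue, app, node), or None if nothing fits."""
    # 1) Sort queues: least served (lowest used/guaranteed) first.
    for queue in sorted(queues, key=lambda q: q.used / q.guaranteed):
        # 2) Sort apps within the queue: FIFO by submission time.
        for app in sorted(queue.apps, key=lambda a: a.submit_time):
            if not app.pending:
                continue
            ask = app.pending[0]  # 3) Select the app's next request.
            # 4) Sort nodes: most free memory first; take the first that fits.
            for node in sorted(nodes, key=nodes.get, reverse=True):
                if nodes[node] >= ask:
                    nodes[node] -= ask
                    queue.used += ask
                    app.pending.pop(0)
                    return queue.name, app.app_id, node
    return None


nodes = {"node-1": 4096, "node-2": 8192}
queues = [
    LeafQueue("root.a", guaranteed=8192, apps=[App("app-1", 1, [2048, 2048])]),
    LeafQueue("root.b", guaranteed=4096, apps=[App("app-2", 2, [1024])]),
]
placements = []
while (result := schedule_cycle(queues, nodes)) is not None:
    placements.append(result)
```

Because queues are re-sorted every cycle, root.b gets its request in before root.a's second one, even though root.a's app was submitted first.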
  • 17. Main differences (YuniKorn vs. default scheduler)
    ▪ Scheduling at the app dimension: apps are first-class citizens in YuniKorn, which schedules apps with respect to, e.g., their submission order, priority, and resource usage
    ▪ Job ordering: YuniKorn supports FIFO/FAIR/Priority (WIP) job-ordering policies
    ▪ Fine-grained resource capacity management: cluster resources are managed with hierarchical queues; each queue provides guaranteed resources (min) and a resource quota (max)
    ▪ Resource fairness: inter-queue resource fairness
    ▪ Native support for Big Data workloads: the default scheduler is mainly for long-running services, while YuniKorn is designed for Big Data app workloads and natively supports Spark, Flink, TensorFlow, etc.
    ▪ Scale & performance: YuniKorn is optimized for performance and suits high-throughput, large-scale environments
  • 18. Run Spark with YuniKorn
    ▪ Submit a Spark job, either by 1) running spark-submit or 2) creating a SparkApplication CRD
    ▪ Job is starting: the API server creates the driver pod (Pending); the Spark job is registered to YuniKorn in a leaf queue
    ▪ Spark driver is running: YuniKorn sorts queues -> sorts apps -> selects a request -> selects a node, then asks the API server to bind the pod to that node; the API server binds the driver pod to the assigned node (Bound) and the driver pod starts
    ▪ Spark executors are created: the driver pod requests executor pods from the API server, which creates them; new executors are added to YuniKorn as pending requests
    ▪ Spark job is running: YuniKorn schedules and binds the executors
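For the spark-submit path on slide 18, the driver and executor pods need to reach YuniKorn and land in the right leaf queue. The hypothetical helper below builds such `--conf` flags. It assumes Spark 3.3 or later (where `spark.kubernetes.scheduler.name` is available), a YuniKorn deployment registered under the scheduler name `yunikorn`, and that the target queue is read from a pod label named `queue`; these conventions vary between YuniKorn versions and deployments, so treat them as assumptions to verify. `spark.kubernetes.driver.label.*` and `spark.kubernetes.executor.label.*` are standard Spark properties for adding pod labels.

```python
def yunikorn_spark_conf(queue: str, scheduler_name: str = "yunikorn") -> list[str]:
    """Build spark-submit --conf flags that hand driver and executor pods to
    YuniKorn and label them with a target leaf queue (label name assumed)."""
    conf = {
        # Spark 3.3+: scheduler name applied to driver and executor pods.
        "spark.kubernetes.scheduler.name": scheduler_name,
        # Same queue label on both roles so the whole app lands in one leaf queue.
        "spark.kubernetes.driver.label.queue": queue,
        "spark.kubernetes.executor.label.queue": queue,
    }
    args: list[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


args = yunikorn_spark_conf("root.sandbox")
```

These flags go alongside the usual spark-submit arguments (master URL, container image, application).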
  • 19. Deep Dive into YuniKorn Features/Performance
  • 20. Job Ordering
    ▪ Why does this matter? If I submit my job earlier, I want it to run first. I don't want my job to be starved because resources are used by others. I have an urgent job; let it run first!
    ▪ Per-queue sorting policy: FIFO orders jobs by submission time; FAIR orders jobs by resource consumption; Priority (WIP, 0.9) orders jobs by job-level priorities within the same queue
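The FIFO and FAIR policies on slide 20 amount to different sort keys over the same set of jobs. A minimal sketch, assuming each job records its submission time and its current allocation as a single number; the job names and values are hypothetical, and YuniKorn's real FAIR ordering may weigh multiple resources.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    submit_time: int  # e.g. seconds since epoch
    usage: int        # resources currently allocated (e.g. memory in MiB)


SORT_KEYS = {
    "fifo": lambda job: job.submit_time,               # earliest submission first
    "fair": lambda job: (job.usage, job.submit_time),  # least served first, FIFO tie-break
}


def order_jobs(jobs: list[Job], policy: str) -> list[str]:
    """Return job names in the order a queue with this policy would serve them."""
    if policy not in SORT_KEYS:
        raise ValueError(f"unknown policy: {policy}")
    return [job.name for job in sorted(jobs, key=SORT_KEYS[policy])]


jobs = [
    Job("etl-nightly", submit_time=100, usage=8192),
    Job("ml-train", submit_time=130, usage=2048),
    Job("adhoc-query", submit_time=160, usage=0),
]
fifo_order = order_jobs(jobs, "fifo")
fair_order = order_jobs(jobs, "fair")
```

Under FAIR the newly submitted adhoc-query jumps ahead because it holds nothing yet, which is how the policy keeps small jobs from starving; under FIFO it waits behind both earlier jobs.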
  • 21. Resource Quota Management: K8s Namespace ResourceQuota
    ▪ K8s namespace ResourceQuota defines resource limits and is enforced by the quota admission controller
    ▪ Problems: hard to control when resource quotas are overcommitted; users have no guaranteed resources; quota can be abused (e.g. by pending pods); no queueing for jobs; low utilization
    ▪ Namespace ResourceQuota is suboptimal for supporting resource sharing between multiple tenants
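Slide 21's point that quota "can be abused (e.g. by pending pods)" follows from when a namespace ResourceQuota is charged: at pod creation, for every non-terminal pod, whether or not it has been scheduled. The toy model below reproduces that effect with a single memory dimension; the class, method names, and numbers are hypothetical, and it leaves out everything else the real quota admission controller checks.

```python
class NamespaceQuota:
    """Toy model of a namespace memory quota that is charged at pod creation."""

    def __init__(self, hard_mem: int):
        self.hard_mem = hard_mem
        self.pods: dict[str, tuple[int, str]] = {}  # name -> (memory, phase)

    def used(self) -> int:
        # Pending and Running pods both count; only terminated pods free quota.
        return sum(mem for mem, phase in self.pods.values() if phase in ("Pending", "Running"))

    def create_pod(self, name: str, mem: int) -> bool:
        """Reject the pod outright if it would exceed the hard limit."""
        if self.used() + mem > self.hard_mem:
            return False  # no queueing: the client has to retry later
        self.pods[name] = (mem, "Pending")  # admitted but not yet scheduled
        return True


quota = NamespaceQuota(hard_mem=4096)
# Two pods that no node can currently host: admitted, but stuck in Pending.
quota.create_pod("stuck-1", 2048)
quota.create_pod("stuck-2", 2048)
# Another pod is rejected although nothing in the namespace is running.
accepted = quota.create_pod("other-driver", 1024)
```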
  • 22. Resource Quota Management: YuniKorn Queue Capacity
    ▪ YuniKorn queues provide an optimal solution for managing resource quotas:
      ▪ A queue can map to one (or more) namespaces automatically -> zero-config queue management
      ▪ Capacity is elastic from min to max -> better resource sharing, guarantees ensured, max enforced
      ▪ Resource fairness is honored -> avoids starving jobs/users
      ▪ Quota is only counted for pods that actually consume resources -> accurate resource counting, improved utilization
      ▪ Job queueing is enabled -> jobs can be queued in the scheduler, keeping client-side logic simple
    ▪ Diagram: a namespace with pods of CPU: 1/Memory: 1024Mi, CPU: 2/Memory: 2048Mi, and CPU: 2/Memory: 2048Mi, mapped to a YuniKorn queue with max CPU: 5, Memory: 5120Mi
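The elastic capacity on slide 22 can be modeled as a queue that may grow past its guaranteed (min) share while the cluster has idle resources, never past its max, and that holds requests instead of rejecting them. The sketch below assumes a single memory dimension, reuses the slide's pod sizes (1024Mi, 2048Mi, 2048Mi) and queue max (5120Mi), and adds a hypothetical min of 2048Mi and one extra 1024Mi pod; usage counts only pods that were actually allocated. It is an illustration, not YuniKorn's queue code.

```python
from collections import deque


class ElasticQueue:
    """Toy single-resource queue with a guaranteed (min) and a ceiling (max)."""

    def __init__(self, min_mem: int, max_mem: int):
        self.min_mem = min_mem
        self.max_mem = max_mem
        self.allocated = 0      # only allocated pods count toward usage
        self.waiting = deque()  # asks that do not fit yet are queued, not rejected

    def submit(self, mem: int) -> None:
        self.waiting.append(mem)

    def try_allocate(self, cluster_free: int) -> int:
        """Allocate waiting asks in order while they fit under max and in the
        cluster's free memory; return the amount newly allocated."""
        granted = 0
        while self.waiting:
            ask = self.waiting[0]
            if self.allocated + ask > self.max_mem or granted + ask > cluster_free:
                break  # the head ask keeps waiting in the queue
            self.waiting.popleft()
            self.allocated += ask
            granted += ask
        return granted

    def borrowed(self) -> int:
        """Usage above the guarantee, taken from otherwise idle capacity."""
        return max(0, self.allocated - self.min_mem)


queue = ElasticQueue(min_mem=2048, max_mem=5120)  # min is hypothetical, max from the slide
for mem in (1024, 2048, 2048, 1024):               # slide's three pods plus one extra
    queue.submit(mem)
granted = queue.try_allocate(cluster_free=8192)
```

The three slide pods fill the queue exactly to its max; the extra pod stays queued in the scheduler rather than being rejected at admission.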
  • 23. Resource Fairness in YuniKorn Queues
    ▪ Scheduling workloads with different requests in 4 queues with different guaranteed resources (queue: guaranteed resource (Mem); requests (number of pods × Mem)):
      ▪ root.default: 500,000; 1000 × 10
      ▪ root.search: 400,000; 500 × 10
      ▪ root.test: 100,000; 200 × 10
      ▪ root.sandbox: 100,000; 200 × 50
    ▪ Result: the usage ratios of the queues increased with a similar trend
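One way to read the result on slide 23 is that the scheduler keeps each queue's used/guaranteed ratio roughly equal. The single-resource sketch below is a simplification, not YuniKorn's fairness code: it replays the slide's four queues (guarantees and requests from the table, in the slide's unspecified memory unit) against a hypothetical cluster with 10,000 units free, handing each pod to the queue with the lowest ratio. Because demand exceeds the cluster, every queue should end up near the same ratio (10,000 / 1,100,000 ≈ 0.0091) even though their requests differ.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    guaranteed: int  # guaranteed memory
    pod_mem: int     # memory per pending pod
    pending: int     # number of pending pods
    used: int = 0

    def ratio(self) -> float:
        return self.used / self.guaranteed


def fair_schedule(queues: list[Queue], capacity: int) -> None:
    """Hand out one pod at a time to the least-served queue that still fits."""
    free = capacity
    while True:
        fits = [q for q in queues if q.pending > 0 and q.pod_mem <= free]
        if not fits:
            return
        q = min(fits, key=Queue.ratio)
        q.used += q.pod_mem
        q.pending -= 1
        free -= q.pod_mem


queues = [
    Queue("root.default", 500_000, pod_mem=10, pending=1000),
    Queue("root.search", 400_000, pod_mem=10, pending=500),
    Queue("root.test", 100_000, pod_mem=10, pending=200),
    Queue("root.sandbox", 100_000, pod_mem=50, pending=200),
]
fair_schedule(queues, capacity=10_000)
for q in queues:
    print(f"{q.name:13} used={q.used:5} ratio={q.ratio():.4f}")
```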
  • 24. Scheduler Throughput Benchmark
    ▪ Schedule 50,000 pods on 2,000/4,000 nodes and compare scheduling throughput (pods allocated by the scheduler per second); red line: YuniKorn, green line: default scheduler
    ▪ 50k pods on 2k nodes: 617 vs. 263 (↑ 134%)
    ▪ 50k pods on 4k nodes: 373 vs. 141 (↑ 164%)
    ▪ Detailed report: https://github.com/apache/incubator-yunikorn-core/blob/master/docs/evaluate-perf-function-with-Kubemark.md
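The percentages on slide 24 are relative gains in allocations per second, (YuniKorn / default - 1) × 100. A quick check with the slide's own numbers gives 134.6% and 164.5%, so the slide's labels correspond to truncating to whole percent.

```python
import math


def improvement_pct(yunikorn: float, default: float) -> float:
    """Relative throughput gain of YuniKorn over the default scheduler, in percent."""
    return (yunikorn / default - 1) * 100


gain_2k = improvement_pct(617, 263)  # 50k pods on 2k nodes
gain_4k = improvement_pct(373, 141)  # 50k pods on 4k nodes
# Truncating to whole percent reproduces the slide's labels.
labels = (math.floor(gain_2k), math.floor(gain_4k))
```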
  • 25. Fully K8s Compatible
    ▪ Supports K8s predicates: node selector; pod affinity/anti-affinity; taints and tolerations; …
    ▪ Supports PersistentVolumeClaim and PersistentVolume: volume bindings; dynamic provisioning
    ▪ Publishes key scheduling events to the K8s event system
    ▪ Works with the cluster autoscaler
    ▪ Supports management commands: cordon nodes
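As a reminder of what two of the predicates on slide 25 check, here is a simplified sketch of node-selector matching and taint/toleration filtering over plain dicts. It is neither YuniKorn's nor kube-scheduler's code: it handles only the `Equal`/`Exists` operators and the `NoSchedule` effect, omits affinity, and uses hypothetical node and pod data.

```python
def matches_node_selector(pod: dict, node: dict) -> bool:
    """Every key/value pair in the pod's nodeSelector must be among the node's labels."""
    labels = node.get("labels", {})
    return all(labels.get(k) == v for k, v in pod.get("nodeSelector", {}).items())


def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified match of one toleration against one taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False  # an empty effect tolerates every effect
    key = toleration.get("key")
    if toleration.get("operator", "Equal") == "Exists":
        return not key or key == taint["key"]  # Exists with no key tolerates any taint
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def node_fits(pod: dict, node: dict) -> bool:
    """Both predicates must pass; only NoSchedule taints are considered here."""
    if not matches_node_selector(pod, node):
        return False
    return all(
        any(tolerates(t, taint) for t in pod.get("tolerations", []))
        for taint in node.get("taints", [])
        if taint["effect"] == "NoSchedule"
    )


nodes = {
    "gpu-1": {"labels": {"pool": "gpu"},
              "taints": [{"key": "dedicated", "value": "ml", "effect": "NoSchedule"}]},
    "cpu-1": {"labels": {"pool": "batch"}, "taints": []},
}
executor = {"nodeSelector": {"pool": "batch"}}
trainer = {"nodeSelector": {"pool": "gpu"},
           "tolerations": [{"key": "dedicated", "operator": "Equal",
                            "value": "ml", "effect": "NoSchedule"}]}
intruder = {"nodeSelector": {"pool": "gpu"}}  # targets the GPU pool without a toleration


def candidates(pod: dict) -> list[str]:
    """Names of nodes that pass both predicates for this pod."""
    return [name for name, node in nodes.items() if node_fits(pod, node)]
```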
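  • 27. Compare YuniKorn with other K8s schedulers
    ▪ Capabilities compared: resource sharing (hierarchy queues, queue elastic capacity); resource fairness (cross-queue, user-level, app-level); preemption (basic, with fairness); gang scheduling; bin packing; throughput
    ▪ K8s default scheduler: basic preemption and bin packing only; no hierarchy queues, queue elastic capacity, cross-queue/user-level/app-level fairness, preemption with fairness, or gang scheduling; 260 allocs/s (2k nodes)
    ▪ Kube-batch: cross-queue fairness, app-level fairness, basic preemption, gang scheduling, and bin packing; no hierarchy queues, queue elastic capacity, user-level fairness, or preemption with fairness; throughput unknown, likely slower than the default scheduler [1]
    ▪ YuniKorn: all of the above capabilities (gang scheduling marked with * and tracked in YUNIKORN-2); 610 allocs/s (2k nodes)
    [1] https://github.com/kubernetes-sigs/kube-batch/issues/930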
  • 29. Current Status
    ▪ Open-sourced on July 17, 2019, under the Apache 2.0 License
    ▪ Entered the Apache Incubator on Jan 21, 2020
    ▪ Latest stable version 0.8.0 released on May 4, 2020
    ▪ Diverse community with members from Alibaba, Cloudera, Microsoft, LinkedIn, Apple, Tencent, Nvidia, and more…
  • 30. The Community
    ▪ Deployed in non-production K8s clusters
    ▪ Launched hundreds of large jobs per day on some of the YuniKorn queues
    ▪ Reduced scheduling latency for large jobs by a factor of ~3x at peak time
    ▪ Improved overall K8s cluster resource-utilization efficiency (cost per compute) over the default kube-scheduler for mixed workloads
    ▪ FIFO and FAIR requests are met more often than before
    ▪ Shipping with Cloudera Public Cloud offerings
    ▪ Provides resource quota management and advanced job scheduling capabilities for Spark
    ▪ Responsible for scheduling both microservices and batch jobs
    ▪ Running on cloud with auto-scaling enabled
    ▪ Deployed on a pre-production on-prem cluster with ~100 nodes
    ▪ Plan to deploy on a 1000+ node production K8s cluster this year
    ▪ Leverages YuniKorn features such as hierarchical queues and resource fairness to run large-scale Flink jobs on K8s
    ▪ Gained a 4x scheduling performance improvement
  • 31. Roadmap
    ▪ Current (0.8.0): hierarchical queues; cross-queue fairness; FAIR/FIFO job-ordering policies; FAIR/bin-packing node-sorting policies; self queue management; pluggable app discovery; metrics system and Prometheus integration
    ▪ Upcoming (0.9.0, 3rd quarter of 2020): gang scheduling; job/task priority support (scheduling & preemption); support for Spark dynamic allocation
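The two node-sorting policies in the 0.8.0 list differ mainly in which end of the "free resources" ordering they prefer. A minimal single-resource sketch, assuming "fair" means spreading pods onto the node with the most free memory and "bin-packing" means filling the node with the least free memory that still fits; YuniKorn's actual policies consider more than one resource, and the policy strings, node names, and sizes here are hypothetical.

```python
from typing import Optional


def pick_node(free: dict[str, int], ask: int, policy: str) -> Optional[str]:
    """Choose a node for one ask under a 'fair' (spread) or 'binpacking' policy."""
    fitting = [name for name, mem in free.items() if mem >= ask]
    if not fitting:
        return None
    if policy == "fair":
        return max(fitting, key=lambda name: free[name])  # most free memory first
    if policy == "binpacking":
        return min(fitting, key=lambda name: free[name])  # tightest fit first
    raise ValueError(f"unknown policy: {policy}")


def place_all(free: dict[str, int], asks: list[int], policy: str) -> list[Optional[str]]:
    """Place asks in order on a copy of the node state; return the chosen nodes."""
    free = dict(free)
    placed = []
    for ask in asks:
        node = pick_node(free, ask, policy)
        if node is not None:
            free[node] -= ask
        placed.append(node)
    return placed


nodes = {"n1": 8192, "n2": 8192, "n3": 8192}
asks = [2048, 2048, 2048]
spread = place_all(nodes, asks, "fair")
packed = place_all(nodes, asks, "binpacking")
```

Spreading keeps headroom on every node for later, larger asks; packing leaves n2 and n3 empty, which is what allows a cluster autoscaler to remove idle nodes.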
  • 32. Our Vision: Resource Management for Big Data & ML
    ▪ Computes: data engineering, real-time streaming, machine learning
    ▪ Types: microservices, batch jobs, long-running workloads, interactive sessions, model serving
    ▪ Targets: multi-tenancy, SLA, resource utilization, cost management, budget
    ▪ -> Unified compute platform for Big Data & ML
  • 33. Join us in the YuniKorn Community!
    ▪ Project website: http://yunikorn.apache.org/
    ▪ GitHub repo: apache/incubator-yunikorn-core
    ▪ Mailing list: dev@yunikorn.apache.org
    ▪ Slack channel:
    ▪ Bi-weekly/monthly sync-up meetings for different time zones