SlideShare una empresa de Scribd logo
1 de 31
Descargar para leer sin conexión
Data Processing and
Kubernetes
Anirudh Ramanathan (Google Inc.)
Agenda
• Basics of Kubernetes & Containers
• Motivation
• Apache Spark and HDFS on Kubernetes
• Data Processing Ecosystem
• Future Work
What is Kubernetes?
Kubernetes
Kubernetes is an open-source system
Kubernetes is an open-source system for
automating deployment, scaling, and
management
Kubernetes
Kubernetes is an open-source system for
automating deployment, scaling, and
management of containerized applications.
Kubernetes
‘Containerized’
Containers
• Repeatable Builds and
Workflows
• Application Portability
• High Degree of Control over
Software
• Faster Development Cycle
• Reduced dev-ops load
• Improved Infrastructure
Utilization
libs
app
kernel
libs
app
libs
app
libs
app
• Based on Google's experience running containers in
production for over 15 years
• Large OSS Community - 1200+ contributors and 45k+
commits
• Ecosystem and Partners - 100+ organizations involved
• One of the top 100 projects overall on GitHub - 23k+
stars
Statistics
Overview
At a Glance
kubelet
kubeletCLI
API
users master nodes
etcd
kubelet
scheduler
controllers
apiserver
UI
Nodes and Pods
Pod
Volume
Containers
Pod
Containers
8080 8080
• Pod is set of co-located
containers
• Created by declarative
specification
• Each pod has distinct IP
address
• Volumes local or
network-attached
8080
Volume
Controllers
● Drive current state -> desired state
● Act independently
● Recurring pattern in the system
Examples:
● Deployment
● DaemonSet
● StatefulSet
observe
diff
act
Motivation
• Resource sharing between batch, serving and stateful
workloads
– Streamlined developer experience
– Reduced operational costs
– Improved infrastructure utilization
• Kubernetes and the Container Ecosystem
– Lots of addon services: third-party logging, monitoring,
and security tools
– For example, the Istio project, announced May 24, by IBM,
Google and Lyft
Why Kubernetes?
Cluster Administration
Namespaces
Resource
Accounting
Logging
Monitoring
Resource
Quota
Pluggable
Authorization
Admission
Control
RBAC
• Launch Jobs as a particular
user into a specific
namespace
• RBAC and Namespace-level
resource quotas
• Audit logging for clusters
• Several monitoring solutions
to see node, cluster and
pod-level statistics
Data Processing
• Beta recently announced at Spark Summit 2017
• Google, Haiwen, Hyperpilot, Intel, Palantir, Pepperdata,
Red Hat, and growing.
Spark on Kubernetes
https://github.com/apache-spark-on-k8s/spar
k
Spark Core
Kubernetes Standalone YARN Mesos
GraphX SparkSQL MLlib Streaming
Spark on Kubernetes
Kubernetes
Integration
Container images with dependencies baked
in
Files from GCS/S3/HDFS/HTTP
File Staging Server
Staged files and
JARs
Several ways of running Spark Jobs along with their dependencies
on Kubernetes
Spark on Kubernetes
Spark Core Kubernetes Scheduler
Backend
Kubernetes Clusternew executors
remove executors
configuration
• Resource Requests
• Authnz
• Communication with K8s
State of Spark
Spark Streaming
Spark Shell
Client Mode
Python/R support
Cluster Mode
Java/Scala
Support
Dynamic
Allocation
Local File Staging High Availability
Spark SQL
GraphX MLlib
Dec 2016
Development
Began
Mar 2017
Alpha
Release
June 2017
Beta
Release
Nov 2016
Design
= supported but
untested
= not yet
supported
• Community driven effort to get HDFS running well on
Kubernetes
• Uses a helm chart to install onto a cluster
• Identified and solved several problems around data
locality when running Spark Jobs
HDFS on Kubernetes
https://github.com/apache-spark-on-k8s/kubernetes-HDFS
HDFS on Kubernetes
node A node B
Driver Pod Executor Pod 1 Executor Pod 2
10.0.0.2
196.0.0.5 196.0.0.6
10.0.0.3 10.0.1.2
Namenode Pod Datanode Pod 1 Datanode Pod 2
HDFS on Kubernetes -- Lessons Learned [Public]
Kimoon Kim (PepperData)
State of HDFS
• HDFS with basic data locality works!
• Future Work
– Remaining data locality issues -- rack locality, node
preference, etc
– Performance benchmarks and testing
– Kerberos support
– Namenode HA
Ecosystem
• Pipelines feature many other components.
• All of the below must run well on K8s
– Cassandra
– Kafka
– Zookeeper
– Elasticsearch, Kibana, etc
Data Pipelines are complicated!
• Cassandra:
https://github.com/kubernetes/examples/tree/master/cassandra
• Kafka:
https://github.com/kubernetes/contrib/tree/master/statefulsets/ka
fka
• Zookeeper:
https://github.com/kubernetes/charts/tree/master/incubator/zook
eeper
• zetcd: https://github.com/coreos/zetcd
• Elasticsearch Operator:
https://github.com/upmc-enterprises/elasticsearch-operator
Cassandra, Kafka and Zookeeper
Future Work
• Batch Scheduling and Resource Sharing
– Priorities and Preemption
• Storage
– Local Storage Provisioning
• Extensibility
– Kubernetes CustomResources (formerly
ThirdPartyResources)
– UI and Dashboard Improvements
• Cluster Federation and Multi-cloud deployments
• Get involved!
https://github.com/kubernetes/community/t
ree/master/sig-big-data
• SIG BigData weekly meeting open to all
(10am PT on Wednesdays) via Zoom:
http://zoom.us/my/sig.big.data
Future Work
Questions/Discussion

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
GCP - Continuous Integration and Delivery into Kubernetes with GitHub, Travis...
 
Cloud spanner architecture and use cases
Cloud spanner architecture and use casesCloud spanner architecture and use cases
Cloud spanner architecture and use cases
 
OpenShift Enterprise 3.1 vs kubernetes
OpenShift Enterprise 3.1 vs kubernetesOpenShift Enterprise 3.1 vs kubernetes
OpenShift Enterprise 3.1 vs kubernetes
 
Why kubernetes matters
Why kubernetes mattersWhy kubernetes matters
Why kubernetes matters
 
Kubernetes and Cloud Native Update Q4 2018
Kubernetes and Cloud Native Update Q4 2018Kubernetes and Cloud Native Update Q4 2018
Kubernetes and Cloud Native Update Q4 2018
 
Kubernetes and Istio
Kubernetes and IstioKubernetes and Istio
Kubernetes and Istio
 
Zephyr: Creating a Best-of-Breed, Secure RTOS for IoT
Zephyr: Creating a Best-of-Breed, Secure RTOS for IoTZephyr: Creating a Best-of-Breed, Secure RTOS for IoT
Zephyr: Creating a Best-of-Breed, Secure RTOS for IoT
 
DevNexus 2015: Kubernetes & Container Engine
DevNexus 2015: Kubernetes & Container EngineDevNexus 2015: Kubernetes & Container Engine
DevNexus 2015: Kubernetes & Container Engine
 
Episode 2: Deploying Kubernetes at Scale
Episode 2: Deploying Kubernetes at ScaleEpisode 2: Deploying Kubernetes at Scale
Episode 2: Deploying Kubernetes at Scale
 
Managing kubernetes deployment with operators
Managing kubernetes deployment with operatorsManaging kubernetes deployment with operators
Managing kubernetes deployment with operators
 
Top 3 reasons why you should run your Enterprise workloads on GKE
Top 3 reasons why you should run your Enterprise workloads on GKETop 3 reasons why you should run your Enterprise workloads on GKE
Top 3 reasons why you should run your Enterprise workloads on GKE
 
The Operator Pattern - Managing Stateful Services in Kubernetes
The Operator Pattern - Managing Stateful Services in KubernetesThe Operator Pattern - Managing Stateful Services in Kubernetes
The Operator Pattern - Managing Stateful Services in Kubernetes
 
Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...
Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...
Kubernetes Architecture | Understanding Kubernetes Components | Kubernetes Tu...
 
Architecture Overview: Kubernetes with Red Hat Enterprise Linux 7.1
Architecture Overview: Kubernetes with Red Hat Enterprise Linux 7.1Architecture Overview: Kubernetes with Red Hat Enterprise Linux 7.1
Architecture Overview: Kubernetes with Red Hat Enterprise Linux 7.1
 
Building Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and DockerBuilding Clustered Applications with Kubernetes and Docker
Building Clustered Applications with Kubernetes and Docker
 
Kubecon US 2019: Kubernetes Multitenancy WG Deep Dive
Kubecon US 2019: Kubernetes Multitenancy WG Deep DiveKubecon US 2019: Kubernetes Multitenancy WG Deep Dive
Kubecon US 2019: Kubernetes Multitenancy WG Deep Dive
 
GlueCon kubernetes & container engine
GlueCon kubernetes & container engineGlueCon kubernetes & container engine
GlueCon kubernetes & container engine
 
Enabling ceph-mgr to control Ceph services via Kubernetes
Enabling ceph-mgr to control Ceph services via KubernetesEnabling ceph-mgr to control Ceph services via Kubernetes
Enabling ceph-mgr to control Ceph services via Kubernetes
 
Ports, pods and proxies
Ports, pods and proxiesPorts, pods and proxies
Ports, pods and proxies
 
Kubecon seattle 2018 recap - Application Deployment aspects
Kubecon seattle 2018 recap - Application Deployment aspectsKubecon seattle 2018 recap - Application Deployment aspects
Kubecon seattle 2018 recap - Application Deployment aspects
 

Similar a Big data and Kubernetes

What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?
DataWorks Summit
 
Running secured Spark job in Kubernetes compute cluster and integrating with ...
Running secured Spark job in Kubernetes compute cluster and integrating with ...Running secured Spark job in Kubernetes compute cluster and integrating with ...
Running secured Spark job in Kubernetes compute cluster and integrating with ...
DataWorks Summit
 

Similar a Big data and Kubernetes (20)

[Spark Summit 2017 NA] Apache Spark on Kubernetes
[Spark Summit 2017 NA] Apache Spark on Kubernetes[Spark Summit 2017 NA] Apache Spark on Kubernetes
[Spark Summit 2017 NA] Apache Spark on Kubernetes
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Global azurebootcamp2019vancouver aks_presentation_by_ashprasad_arjavprasad
Global azurebootcamp2019vancouver aks_presentation_by_ashprasad_arjavprasadGlobal azurebootcamp2019vancouver aks_presentation_by_ashprasad_arjavprasad
Global azurebootcamp2019vancouver aks_presentation_by_ashprasad_arjavprasad
 
What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?What's the Hadoop-la about Kubernetes?
What's the Hadoop-la about Kubernetes?
 
Kubernetes Architecture - beyond a black box - Part 1
Kubernetes Architecture - beyond a black box - Part 1Kubernetes Architecture - beyond a black box - Part 1
Kubernetes Architecture - beyond a black box - Part 1
 
Running secured Spark job in Kubernetes compute cluster and integrating with ...
Running secured Spark job in Kubernetes compute cluster and integrating with ...Running secured Spark job in Kubernetes compute cluster and integrating with ...
Running secured Spark job in Kubernetes compute cluster and integrating with ...
 
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on Kubernetes
 
Spark volume requirements 2018
Spark volume requirements 2018Spark volume requirements 2018
Spark volume requirements 2018
 
[DevDay 2017] OpenShift Enterprise - Speaker: Linh Do - DevOps Engineer at Ax...
[DevDay 2017] OpenShift Enterprise - Speaker: Linh Do - DevOps Engineer at Ax...[DevDay 2017] OpenShift Enterprise - Speaker: Linh Do - DevOps Engineer at Ax...
[DevDay 2017] OpenShift Enterprise - Speaker: Linh Do - DevOps Engineer at Ax...
 
Container Conf 2017: Rancher Kubernetes
Container Conf 2017: Rancher KubernetesContainer Conf 2017: Rancher Kubernetes
Container Conf 2017: Rancher Kubernetes
 
01 - VMUGIT - Lecce 2018 - Fabio Rapposelli, VMware
01 - VMUGIT - Lecce 2018 - Fabio Rapposelli, VMware01 - VMUGIT - Lecce 2018 - Fabio Rapposelli, VMware
01 - VMUGIT - Lecce 2018 - Fabio Rapposelli, VMware
 
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
Room 2 - 6 - Đinh Tuấn Phong - Migrate opensource database to Kubernetes easi...
 
Deploying Anything as a Service (XaaS) Using Operators on Kubernetes
Deploying Anything as a Service (XaaS) Using Operators on KubernetesDeploying Anything as a Service (XaaS) Using Operators on Kubernetes
Deploying Anything as a Service (XaaS) Using Operators on Kubernetes
 
Centralizing Kubernetes and Container Operations
Centralizing Kubernetes and Container OperationsCentralizing Kubernetes and Container Operations
Centralizing Kubernetes and Container Operations
 
Building Cloud-Native Applications with Kubernetes, Helm and Kubeless
Building Cloud-Native Applications with Kubernetes, Helm and KubelessBuilding Cloud-Native Applications with Kubernetes, Helm and Kubeless
Building Cloud-Native Applications with Kubernetes, Helm and Kubeless
 
Meetup Kubernetes Rhein-Necker
Meetup Kubernetes Rhein-NeckerMeetup Kubernetes Rhein-Necker
Meetup Kubernetes Rhein-Necker
 
Docker kubernetes fundamental(pod_service)_190307
Docker kubernetes fundamental(pod_service)_190307Docker kubernetes fundamental(pod_service)_190307
Docker kubernetes fundamental(pod_service)_190307
 
A Primer on Kubernetes and Google Container Engine
A Primer on Kubernetes and Google Container EngineA Primer on Kubernetes and Google Container Engine
A Primer on Kubernetes and Google Container Engine
 
Moby KubeCon 2017
Moby KubeCon 2017Moby KubeCon 2017
Moby KubeCon 2017
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Big data and Kubernetes

  • 1. Data Processing and Kubernetes Anirudh Ramanathan (Google Inc.)
  • 2. Agenda • Basics of Kubernetes & Containers • Motivation • Apache Spark and HDFS on Kubernetes • Data Processing Ecosystem • Future Work
  • 4. Kubernetes Kubernetes is an open-source system
  • 5. Kubernetes is an open-source system for automating deployment, scaling, and management Kubernetes
  • 6. Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. Kubernetes
  • 8. Containers • Repeatable Builds and Workflows • Application Portability • High Degree of Control over Software • Faster Development Cycle • Reduced dev-ops load • Improved Infrastructure Utilization libs app kernel libs app libs app libs app
  • 9. • Based on Google's experience running containers in production for over 15 years • Large OSS Community - 1200+ contributors and 45k+ commits • Ecosystem and Partners - 100+ organizations involved • One of the top 100 projects overall on GitHub - 23k+ stars Statistics
  • 10.
  • 12. At a Glance kubelet kubeletCLI API users master nodes etcd kubelet scheduler controllers apiserver UI
  • 13. Nodes and Pods Pod Volume Containers Pod Containers 8080 8080 • Pod is set of co-located containers • Created by declarative specification • Each pod has distinct IP address • Volumes local or network-attached 8080 Volume
  • 14. Controllers ● Drive current state -> desired state ● Act independently ● Recurring pattern in the system Examples: ● Deployment ● DaemonSet ● StatefulSet observe diff act
  • 16. • Resource sharing between batch, serving and stateful workloads – Streamlined developer experience – Reduced operational costs – Improved infrastructure utilization • Kubernetes and the Container Ecosystem – Lots of addon services: third-party logging, monitoring, and security tools – For example, the Istio project, announced May 24, by IBM, Google and Lyft Why Kubernetes?
  • 17. Cluster Administration Namespaces Resource Accounting Logging Monitoring Resource Quota Pluggable Authorization Admission Control RBAC • Launch Jobs as a particular user into a specific namespace • RBAC and Namespace-level resource quotas • Audit logging for clusters • Several monitoring solutions to see node, cluster and pod-level statistics
  • 19. • Beta recently announced at Spark Summit 2017 • Google, Haiwen, Hyperpilot, Intel, Palantir, Pepperdata, Red Hat, and growing. Spark on Kubernetes https://github.com/apache-spark-on-k8s/spar k Spark Core Kubernetes Standalone YARN Mesos GraphX SparkSQL MLlib Streaming
  • 20. Spark on Kubernetes Kubernetes Integration Container images with dependencies baked in Files from GCS/S3/HDFS/HTTP File Staging Server Staged files and JARs Several ways of running Spark Jobs along with their dependencies on Kubernetes
  • 21. Spark on Kubernetes Spark Core Kubernetes Scheduler Backend Kubernetes Clusternew executors remove executors configuration • Resource Requests • Authnz • Communication with K8s
  • 22. State of Spark Spark Streaming Spark Shell Client Mode Python/R support Cluster Mode Java/Scala Support Dynamic Allocation Local File Staging High Availability Spark SQL GraphX MLlib Dec 2016 Development Began Mar 2017 Alpha Release June 2017 Beta Release Nov 2016 Design = supported but untested = not yet supported
  • 23. • Community driven effort to get HDFS running well on Kubernetes • Uses a helm chart to install onto a cluster • Identified and solved several problems around data locality when running Spark Jobs HDFS on Kubernetes https://github.com/apache-spark-on-k8s/kubernetes-HDFS
  • 24. HDFS on Kubernetes node A node B Driver Pod Executor Pod 1 Executor Pod 2 10.0.0.2 196.0.0.5 196.0.0.6 10.0.0.3 10.0.1.2 Namenode Pod Datanode Pod 1 Datanode Pod 2 HDFS on Kubernetes -- Lessons Learned [Public] Kimoon Kim (PepperData)
  • 25. State of HDFS • HDFS with basic data locality works! • Future Work – Remaining data locality issues -- rack locality, node preference, etc – Performance benchmarks and testing – Kerberos support – Namenode HA
  • 27. • Pipelines feature many other components. • All of the below must run well on K8s – Cassandra – Kafka – Zookeeper – Elasticsearch, Kibana, etc Data Pipelines are complicated!
  • 28. • Cassandra: https://github.com/kubernetes/examples/tree/master/cassandra • Kafka: https://github.com/kubernetes/contrib/tree/master/statefulsets/ka fka • Zookeeper: https://github.com/kubernetes/charts/tree/master/incubator/zook eeper • zetcd: https://github.com/coreos/zetcd • Elasticsearch Operator: https://github.com/upmc-enterprises/elasticsearch-operator Cassandra, Kafka and Zookeeper
  • 29. Future Work • Batch Scheduling and Resource Sharing – Priorities and Preemption • Storage – Local Storage Provisioning • Extensibility – Kubernetes CustomResources (formerly ThirdPartyResources) – UI and Dashboard Improvements • Cluster Federation and Multi-cloud deployments
  • 30. • Get involved! https://github.com/kubernetes/community/t ree/master/sig-big-data • SIG BigData weekly meeting open to all (10am PT on Wednesdays) via Zoom: http://zoom.us/my/sig.big.data Future Work