Storage Requirements and Options for Running Spark on Kubernetes

In a world of serverless computing, users tend to be frugal about spending on compute, storage, and other resources, and paying for them when they are not in use becomes a significant factor. Offering Spark as a service in the cloud presents unique challenges, and running Spark on Kubernetes adds more, especially around storage and persistence. Spark workloads have distinct storage requirements for intermediate data, long-term persistence, and shared file systems, and those requirements become much tighter when the same service must be offered to enterprises that have to manage GDPR and other compliance regimes such as ISO 27001 and HIPAA. This talk covers the challenges involved in providing serverless Spark clusters and shares the specific issues one can encounter when running large Kubernetes clusters in production, especially in scenarios related to persistence. It will help people using Kubernetes or the Docker runtime in production understand the storage options available, which are more suitable for running Spark workloads on Kubernetes, and what more can be done.

Storage requirements for running Spark workloads on Kubernetes
Rachit Arora
rachitar@in.ibm.com
IBM, India Software Labs

About Me
• Advisory Software Engineer @ IBM India Software Labs
• General Purpose Developer
• Love Containers & Kubernetes
• Conference traveler
• Upcoming book on Hadoop and Its Ecosystem
• Cricket fan, Foodie

Spark
Unified, open source, parallel data processing framework for Big Data analytics
• Spark SQL: interactive queries
• Spark Streaming: stream processing
• Spark MLlib: machine learning
• GraphX: graph computation
• Spark Core Engine
• Scheduler: YARN, Mesos, Standalone, Kubernetes

Typical Big Data Application
• Secure
• Catalog and Search
• Ingest & Store, Prepare, Analyze, Visualize
• Data Engineer, Data Scientist, Application Developer

Evolution of Spark Analytics
On Prem Install
• Acquire hardware
• Prepare machine
• Install Spark
• Retry
• Apply patches
• Security
• Upgrades
• Scale
• High availability
Virtualization
• Prepare VM imaging solution
• Network management
• High availability
• Patches
• Scale
Managed
• Configure cluster
• Customize
• Scale
• Pay even if idle
Serverless
• Run analytics

What Does Kubernetes Bring In?
• Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications.
• It manages containers for me
• It manages high availability
• It gives me the flexibility to choose the resources I want and the persistence I want
• Kubernetes – lots of add-on services: third-party logging, monitoring, and security tools
• Reduced operational costs
• Improved infrastructure utilization

Storage Requirements
• Distributed File System
• Local Scratch Space
• Fast disk writes – do NOT write to containers!
• User Library
• Logs
• History Server Events
• Configs
• Secrets
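
Two of the requirements above, History Server events and secrets, correspond to documented Spark properties. The following is a minimal sketch, not the speaker's configuration: it only builds the `--conf` arguments for `spark-submit`. The event log location, the secret name `s3-credentials`, and the mount path are hypothetical values chosen for illustration.

```python
def history_and_secret_conf(event_log_dir, secret_name, secret_mount_path):
    """Return Spark properties for event logging and a mounted Kubernetes Secret.

    All argument values are supplied by the caller; nothing here is specific
    to the deployment described in the slides.
    """
    return {
        # Write application events so a History Server can replay them later.
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": event_log_dir,
        # Mount an existing Kubernetes Secret into driver and executor pods.
        f"spark.kubernetes.driver.secrets.{secret_name}": secret_mount_path,
        f"spark.kubernetes.executor.secrets.{secret_name}": secret_mount_path,
    }


def to_submit_args(conf):
    """Flatten a property dict into a list of spark-submit arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Hypothetical values, for illustration only.
example = history_and_secret_conf(
    "file:///mnt/spark-events", "s3-credentials", "/etc/secrets"
)
print(" ".join(to_submit_args(example)))
```

The event log directory must be on storage that outlives the pods (a shared volume or an object store), otherwise the History Server has nothing to read once the application finishes.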

What can we leverage
• Distributed
• NFS
• PV to PVC (1 to 1 Mapping in most of the Cloud Providers)
• Big NFS – multiple PVs – quota
• HDFS – no direct support; it can be configured to work, but without data locality
• DBFS – the Databricks File System (DBFS), an S3-based distributed file system
• S3/Object Storage – performance concerns
• Portworx – under exploration
• GlusterFS
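
The PV-to-PVC option above can be sketched as follows, assuming an NFS-backed storage class that supports `ReadWriteMany` access. The claim name, size, storage class, and mount path are hypothetical; the Spark property names follow the volume-mounting scheme in the Spark on Kubernetes documentation (available from Spark 2.4 onward, as far as I am aware). The manifest is emitted as JSON, which `kubectl apply -f` accepts.

```python
import json


def shared_pvc_manifest(name, size, storage_class):
    """Build a PersistentVolumeClaim for storage shared by many pods."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets driver and executors mount the same claim;
            # the backing storage (for example NFS) must support it.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


def pvc_mount_conf(volume_name, claim_name, mount_path):
    """Return Spark properties that mount the claim in driver and executors."""
    conf = {}
    for role in ("driver", "executor"):
        prefix = (
            f"spark.kubernetes.{role}.volumes."
            f"persistentVolumeClaim.{volume_name}"
        )
        conf[f"{prefix}.mount.path"] = mount_path
        conf[f"{prefix}.mount.readOnly"] = "false"
        conf[f"{prefix}.options.claimName"] = claim_name
    return conf


# Hypothetical names and sizes, for illustration only.
manifest = shared_pvc_manifest("spark-shared-data", "100Gi", "nfs-client")
print(json.dumps(manifest, indent=2))
for key, value in sorted(pvc_mount_conf("shared", "spark-shared-data", "/data").items()):
    print(f"--conf {key}={value}")
```

Because a claim binds to exactly one volume, per-tenant isolation on a single large NFS export has to come from separate PVs (and quotas) rather than from the claim itself, which is the limitation the slide points at.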

What can we leverage
• Local temp dir scratch space
• emptyDir
• Clean delete? Needed before machines are returned
• HostPath
• You manage delete
• Logs
• emptyDir vs NFS
• Push to object store using fluentd (sidecar containers)
• Roll over
• Do not write to containers
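
As a rough illustration of the emptyDir option for scratch space, the sketch below builds Spark properties that mount an `emptyDir` volume in each executor. It assumes a Spark release in which volumes whose names start with `spark-local-dir-` are used as local spill directories (documented for Spark 3.x); the mount path and size limit are hypothetical.

```python
def scratch_conf(index, mount_path, size_limit):
    """Return Spark properties for an emptyDir scratch volume on executors.

    The volume name prefix "spark-local-dir-" is what tells Spark to use
    the mount for shuffle and spill data instead of the container layer.
    """
    name = f"spark-local-dir-{index}"
    prefix = f"spark.kubernetes.executor.volumes.emptyDir.{name}"
    return {
        f"{prefix}.mount.path": mount_path,
        f"{prefix}.mount.readOnly": "false",
        # Cap the scratch space so one job cannot fill the node's disk.
        f"{prefix}.options.sizeLimit": size_limit,
    }


# Hypothetical path and size, for illustration only.
for key, value in sorted(scratch_conf(1, "/tmp/spark-local", "20Gi").items()):
    print(f"--conf {key}={value}")
```

An `emptyDir` is removed by Kubernetes when the pod goes away, whereas a `hostPath` directory stays on the node until someone deletes it, which is the trade-off the slide summarizes as "you manage delete".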

What are we looking for?
• Image as Volume
• https://github.com/kubernetes/kubernetes/issues/831
• Flex Volume Plugin
• CSI
• Encrypted PVC options – Portworx
• PV to PVC 1-to-many mapping with isolation
• Config Map: Better support for updates
• Local
• Clean Delete for HIPAA
• Distributed
• Clean Delete for HIPAA
• PVC transfer across Namespaces

References
• IBM Watson Studio
https://datascience.ibm.com
• IBM Watson
https://www.ibm.com/analytics/us/en/watson-data-platform/tutorial/
• Analytics Engine
https://www.ibm.com/cloud/analytics-engine
• Apache Spark
• Kubernetes Scheduler Design & Discussion
• Kubernetes Clusters on IBM Cloud
Rachit Arora
rachitar@in.ibm.com
@rachit1arora

Thank you
Rachit Arora
rachitar@in.ibm.com
Twitter @rachit1arora

Typical Spark deployment (slide 7; the diagram was not captured in the transcript)

Editor's Notes

Spark is an open source, scalable, massively parallel, in-memory execution engine for analytics applications. Think of it as an in-memory layer that sits above multiple data stores, where data can be loaded into memory and analyzed in parallel across a cluster. Spark Core is the foundation of Spark and provides libraries for scheduling and basic I/O. Spark offers more than 100 high-level operators that make it easy to build parallel apps. Spark also includes prebuilt machine-learning algorithms and graph analysis algorithms that are written specifically to execute in parallel and in memory. It also supports interactive SQL processing of queries and real-time streaming analytics. You can write analytics applications in programming languages such as Java, Python, R, and Scala. You can run Spark using its standalone cluster mode, in the cloud, on Hadoop YARN, on Apache Mesos, or on Kubernetes, and access data in HDFS, Cassandra, HBase, Hive, object stores, and any Hadoop data source.
Prepare: Even though you have the right data, it may not be in the right format or structure for analysis. That is where data preparation comes in. Data engineers need to bring raw data into one interface from wherever it lives (on premises, in the cloud, or on your desktop), where it can then be shaped, transformed, explored, and prepared for analysis.
Data scientist: Primarily responsible for building predictive analytic models and insights. The data scientist analyzes data that has been cataloged and prepared by the data engineer, using machine learning tools like Watson Machine Learning, and builds applications using Jupyter Notebooks and RStudio.
Application developer: After the data scientist shares the analytical outputs, the application developer can build apps such as a cognitive chatbot. As the chatbot engages with customers, it will continuously improve its knowledge and help uncover new insights.
As a data scientist, this is what I was required to do. From on prem to virtualization: as demand for the service increased in my organization, I decided to move to virtualized VMs to handle many requests on demand, but much of the pain remained. Then I decided to try services offered in the cloud, such as EMR, IBM Analytics Engine, or Microsoft Insights, but there I need to order clusters, configure them to suit my workloads, and keep them running even when I do not want to use them. Cover what it takes to install a Hadoop/Spark cluster.
IBM Watson brings together data management, data policies, data preparation, and analysis capabilities into a common framework. You can index, discover, control, and share data with Watson Knowledge Catalog, refine and prepare the data with Data Refinery, then organize resources to analyze the same data with Watson Studio. The IBM Watson apps are fully integrated and use the same user interface and framework. You can pick whichever apps and tools you need for your organization. Watson Studio provides you with the environment and tools to solve your business problems by collaboratively analyzing data. What is Analytics Engine? You can use AE to build and deploy clusters within minutes, with a simplified user experience, scalability, and reliability. You can custom-configure the environment and scale on demand.
