Accelerating workloads and bursting
data with Google Dataproc & Alluxio
Enterprises are telling us
they need:
To respond to different business data needs with different
urgency and emphasis
● Create bespoke Hadoop clusters customized for any workload
● Use them for a minute or a year
A faster, more scalable way to get insights from data
● Get up and running without waiting for hardware or
software to be installed or configured
To get their people out of owning and monitoring
technology and back to innovating
● Design workflows that create clusters, complete jobs
end-to-end, and then delete themselves
To spend less money
● Create clusters in seconds
● Pay only for when the cluster is running
● Take advantage of preemptible VM instances
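The "create clusters, complete jobs end-to-end, and then delete themselves" workflow can be sketched as a context manager that always tears the cluster down, even when a job fails. This is a minimal illustration only: `FakeClusterService`, `ephemeral_cluster`, and `run_workflow` are hypothetical names, and the in-memory service stands in for a real cluster API rather than the Dataproc client.

```python
from contextlib import contextmanager


class FakeClusterService:
    """In-memory stand-in for a cluster API (hypothetical, not the Dataproc client)."""

    def __init__(self):
        self.active = set()
        self.log = []

    def create(self, name):
        self.active.add(name)
        self.log.append(("create", name))

    def delete(self, name):
        self.active.discard(name)
        self.log.append(("delete", name))


@contextmanager
def ephemeral_cluster(service, name):
    """Create a cluster, hand it to the caller, and always delete it afterwards."""
    service.create(name)
    try:
        yield name
    finally:
        # Runs on success and on failure, so no cluster is left running (and billed).
        service.delete(name)


def run_workflow(service, name, jobs):
    """Run each job (a callable taking the cluster name) on a short-lived cluster."""
    with ephemeral_cluster(service, name) as cluster:
        return [job(cluster) for job in jobs]


service = FakeClusterService()
results = run_workflow(
    service,
    "etl-nightly",
    [lambda c: f"job-1 on {c}", lambda c: f"job-2 on {c}"],
)
print(results)
print(service.log)
```

The cluster exists only for the duration of the `with` block, which is what makes "pay only for when the cluster is running" practical.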
Enterprise Hadoop cluster woes
You know that managing a
Hadoop cluster can be
frustrating and time
consuming
It’s a hassle to renew the license
on your on-premises system
It’s hard to scale compute or storage on demand
Maintaining the operations of your Hadoop
cluster takes too much time
Your system can’t keep up with forecasted
usage and data growth
Your legacy system busts
your budget
What is Cloud Dataproc?
Rapid cluster creation
Familiar open source tools
Google Cloud Platform’s fully managed Apache Spark and Apache Hadoop service
Ephemeral clusters on-demand
Customizable machines
Tightly Integrated
with other Google Cloud
Platform services
Fast
Things take seconds to
minutes, not hours or
weeks
Easy
Be an expert with
your data, not your
data infrastructure
Cost-effective
Pay for exactly what you
use to process your
data, not more
Google Cloud Dataproc vision
Disaggregation of storage and compute
Analysis
Cloud Datalab
Development & Test
Data sinks
Production
Cloud Dataproc
External applications
Storage
Cloud Storage
Application Logs
Storage
BigQuery
Development
Cloud Dataproc
Test
Cloud Dataproc
Data sources
Storage
Cloud Bigtable
Storage
Cloud Storage
Storage
BigQuery
Storage
Cloud Bigtable
Data science
Cluster monitoring
Monitor
Stackdriver
Logs
Logging
Ephemeral and long-lived clusters
Semi-long-lived clusters - group and select by label
Clusters per job
Cluster
Cloud Dataproc
Cluster
Cloud Dataproc
Cluster
Cloud Dataproc
Cloud Storage
Edge Nodes
Compute Engine
Clients
Development (Preview)
Production (1.2)
Prod 1
Cloud Dataproc
Dev cluster
Cloud Dataproc
Prod 2
Cloud Dataproc
Customers using Dataproc to Scale
Traditional Brick and Mortar Retailer
Challenge
To build machine learning models focused on fraud detection and inventory management
How Google Helped
Partnered with the retailer to think about both the digital experience and the in-store customer experience, especially to help them manage major retail events like Black Friday.
What they are running:
67 avg. clusters per day, 513 nodes per cluster
Products & Services:
BigQuery, Stackdriver, Compute, Cloud Storage, Bigtable, Dataflow, Dataproc, Pub/Sub, PSO & Support
Combining the best of open
source and cloud.
Cloud Dataproc
Introduction to Alluxio
The Alluxio Story
Originated as the Tachyon project at UC Berkeley’s AMPLab, created by then-Ph.D. student and now Alluxio CTO Haoyuan (H.Y.) Li.
2014
2015
Open Source project established & company to
commercialize Alluxio founded
Goal: Orchestrate data for the cloud for data-driven apps such as big data analytics, ML, and AI.
Focus: Accelerating modern app frameworks running on HDFS/S3/GCS-based data lakes or warehouses
Data Orchestration for the Cloud
Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface
HDFS Driver, Swift Driver, S3 Driver, NFS Driver
Enable innovation with any frameworks running
on data stored anywhere
Data Analyst
Data Engineer
Storage Ops
Data Scientist
Lines of Business
Data Orchestration for the Cloud
Cross-platform Security & Governance
Authentication
Kerberos, Delegation token, LDAP, AD
Authorization
FS security model, AWS IAM model, Ranger integration
Encryption
On the wire with TLS, at rest with client-side encryption
Audit Logging
Track accesses to all data
Enterprise Cloud Compute & Storage is Great… but Data got left behind
Compute: 2–5 mins, elastic
Storage: 2–5 mins, elastic
Dataset: 2–4 weeks, not elastic (request data, review request, find dataset, code script/job, run ETL jobs, grant permissions)
Public Cloud IaaS
Spark Presto Hive TensorFlow
Alluxio Data Orchestration and Control Service
Alluxio enables compute!
Alluxio Data Orchestration and Control Service
Accelerating Analytics in the cloud
Problem: Object stores have inconsistent performance for analytics and AI workloads
• SLAs are hard to achieve
• Metadata operations are expensive
• Copied data storage costs add up, making the solution expensive
Solution: Consistent High Performance
• Performance increases range from 1.5X to 10X
• AWS EMR & Google Dataproc integrations
• Fewer copies of data means lower costs
Presto & Alluxio on
Works well together…
[Charts comparing Presto with Presto + Alluxio: small range query response time (lower is better), large scan query response time (lower is better), concurrency (higher is better)]
• Query performance bottlenecks
• Unpredictable network I/O
• Query pattern: datasets modelled in a star schema could benefit from dimension table caching
• Presto + Alluxio
• Avoids unpredictable network I/O
• Consistent query latency
• Higher throughput and better concurrency
Alluxio in Dataproc
Using Alluxio with Google Dataproc
[Diagram: Google Dataproc cluster running Presto and Hive, each with a metadata & data cache; compute-driven, continuous sync]
What about remote data?
Bursting workloads to the cloud with remote data
Typical Restrictions
• Data cannot be persisted in a public cloud
• Additional I/O capacity cannot be added to existing Hadoop infrastructure
• On-prem level security needs to be maintained
• Network bandwidth utilization needs to be minimal
Options
• Lift and Shift
• Data copy by workload
• “Zero-copy” Bursting
Problem: HDFS cluster is compute-bound & complex to maintain
AWS Public Cloud IaaS
Spark Presto Hive TensorFlow
Alluxio Data Orchestration and Control Service
On Premises
Connectivity
Datacenter
Spark Presto Hive TensorFlow
Alluxio Data Orchestration and Control Service
Barrier 1: Prohibitive network latency
and bandwidth limits
• Makes hybrid analytics unfeasible
Barrier 2: Copying data to cloud
• Difficult to maintain copies
• Data security and governance
• Costs of another silo
Step 1: Hybrid Cloud for Burst Compute Capacity
• Orchestrates compute access to on-prem data
• Working set of data, not FULL set of data
• Local performance
• Scales elastically
• On-Prem Cluster Offload (both Compute & I/O)
Step 2: Online Migration of Data Per Policy
• Flexible timing to migrate, with fewer dependencies
• Instead of hard switch over, migrate at own pace
• Moves the data per policy – e.g. last 7 days
“Zero-copy” bursting to scale to the cloud
DEMO
Demo: initialization action installs Alluxio in Dataproc
[Diagram: Google Dataproc cluster running Presto and Hive, each with a metadata & data cache; compute-driven, continuous sync]
#1 - First, access data in Google Cloud Storage
#2 - Access data from a remote Hadoop cluster
Get Started with Alluxio on Dataproc
A single command creates a Dataproc cluster with Alluxio installed
$ gcloud dataproc clusters create roderickyao-alluxio \
    --initialization-actions gs://alluxio-public/enterprise-dataproc/2.1.0-1.0/alluxio-dataproc.sh \
    --metadata alluxio_root_ufs_uri=gs://ryao-test/alluxio-test/,\
alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<KEYID>;alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=<SECRET>",\
alluxio_license_base64=$(cat alluxio-enterprise-license.json | base64 | tr -d "\n"),\
alluxio_download_path=gs://ryao-test/alluxio-enterprise-2.1.0-1.0.tar.gz
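When the command is assembled by a script, the pieces of the `--metadata` value can be built programmatically. The sketch below only mirrors the keys and flags shown on the slide (comma-separated metadata pairs, semicolon-separated site properties, a base64-encoded license with no newlines). `build_create_command` is a hypothetical helper, and the bucket names and license bytes are placeholders.

```python
import base64


def build_create_command(cluster, init_action, root_ufs_uri,
                         site_properties, license_bytes, download_path):
    """Return the gcloud argument list for creating a cluster with Alluxio installed."""
    # b64encode emits a single line, matching `base64 | tr -d "\n"` in the shell version.
    license_b64 = base64.b64encode(license_bytes).decode("ascii")
    # Site properties are joined with ';' as on the slide, so they never
    # collide with the ',' that separates metadata pairs.
    props = ";".join(f"{key}={value}" for key, value in site_properties.items())
    metadata = ",".join([
        f"alluxio_root_ufs_uri={root_ufs_uri}",
        f"alluxio_site_properties={props}",
        f"alluxio_license_base64={license_b64}",
        f"alluxio_download_path={download_path}",
    ])
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        "--initialization-actions", init_action,
        "--metadata", metadata,
    ]


command = build_create_command(
    cluster="example-alluxio",  # placeholder cluster name
    init_action="gs://alluxio-public/enterprise-dataproc/2.1.0-1.0/alluxio-dataproc.sh",
    root_ufs_uri="gs://example-bucket/alluxio-test/",  # placeholder bucket
    site_properties={
        "alluxio.master.mount.table.root.option.fs.gcs.accessKeyId": "<KEYID>",
        "alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey": "<SECRET>",
    },
    license_bytes=b'{"placeholder": "license"}',  # stands in for the license file contents
    download_path="gs://example-bucket/alluxio-enterprise-2.1.0-1.0.tar.gz",
)
print(command)
```

Keeping the command as an argument list (rather than one string) means it can be handed to a process launcher without any shell quoting of the metadata value.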
Tutorial: Getting started with Dataproc and Alluxio
https://www.alluxio.io/products/google-cloud/gcp-dataproc-tutorial/
Resources
Alluxio Initialization Action
- https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/alluxio
Alluxio with Google Cloud Storage documentation
- https://docs.alluxio.io/ee/user/stable/en/ufs/GCS.html
Cloud Dataproc on
Kubernetes (Alpha)
Combining the best of open
source and cloud.
Cloud
Dataproc
Machine
Learning
ETL/ ELT SQL
Partner
Component
Secure Manage Support
Streaming
Jan ’19 - Kubernetes Operator for Apache Spark open sourced
Sept ’19 - Kubernetes Operator for Apache Flink open sourced
Alluxio – under the hood
Alluxio Reference Architecture
[Diagram components: Application, Alluxio Client, Alluxio Master, Standby Master (Zookeeper / RAFT), Alluxio Worker (RAM / SSD / HDD), WAN, Under Store 1, Under Store 2]
Data Elasticity via Unified Namespace
Enables effective data management across different Under Stores
- Uses Mounting with Transparent Naming
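Mounting with transparent naming can be pictured as a mount table that maps a path in the unified namespace to a location in whichever under store backs it. The sketch below is an illustration of longest-prefix mount resolution, not Alluxio's implementation. `MountTable` and the bucket and host names are hypothetical.

```python
class MountTable:
    """Maps paths in one namespace to URIs in different under stores."""

    def __init__(self):
        self._mounts = {}

    def mount(self, namespace_path, ufs_uri):
        """Attach an under-store location at a path in the unified namespace."""
        key = namespace_path.rstrip("/") or "/"
        self._mounts[key] = ufs_uri.rstrip("/")

    def resolve(self, path):
        """Translate a namespace path into the under-store URI that backs it."""
        best = None
        for mount_point in self._mounts:
            prefix = "/" if mount_point == "/" else mount_point + "/"
            if path == mount_point or path.startswith(prefix):
                # Prefer the most specific (longest) matching mount point.
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount covers {path}")
        relative = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return f"{base}/{relative}" if relative else base


table = MountTable()
table.mount("/trades", "gs://example-trades-bucket")  # hypothetical GCS bucket
table.mount("/customers", "hdfs://onprem-namenode:8020/customers")  # hypothetical HDFS path
print(table.resolve("/trades/us"))
print(table.resolve("/customers/2019/part-0.parquet"))
```

Applications only ever see `/trades/...` and `/customers/...`, so the backing store can change without touching job code.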
Feature Highlight: Data Caching for faster compute
[Animated diagram: Spark, Presto, Hive, and TensorFlow send data requests through a RAM tier to the Trades and Customers buckets. Sequence: read file /trades/us, read file /trades/us again, read file /trades/top. Path label: variable latency with throttling.]
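The read sequence on this slide (read /trades/us, read it again, then read /trades/top) corresponds to a read-through cache: the first read of a file goes to the remote store, and repeats are served locally. The sketch below is a toy model under stated assumptions: `ReadThroughCache` is a hypothetical class, the `fetch` callable stands in for the slow object-store read, and least-recently-used eviction is chosen only for illustration.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Serve repeated reads locally; fall back to a slow fetch on a miss."""

    def __init__(self, fetch, capacity):
        self._fetch = fetch          # callable: path -> bytes (the remote read)
        self._capacity = capacity    # maximum number of cached files
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self._entries:
            # Cache hit: mark as most recently used and skip the remote store.
            self._entries.move_to_end(path)
            self.hits += 1
            return self._entries[path]
        # Cache miss: pay the remote latency once, then keep a local copy.
        self.misses += 1
        data = self._fetch(path)
        self._entries[path] = data
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return data


remote_reads = []


def fetch_from_bucket(path):
    """Stand-in for an object-store read with variable latency."""
    remote_reads.append(path)
    return f"contents of {path}".encode()


cache = ReadThroughCache(fetch_from_bucket, capacity=2)
for path in ["/trades/us", "/trades/us", "/trades/top"]:
    cache.read(path)
print(remote_reads, cache.hits, cache.misses)
```

Only two of the three reads reach the bucket, which is the effect the animation illustrates.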
“Zero-copy” bursting under the hood
[Animated diagram: Spark, Presto, Hive, and TensorFlow send data requests through a RAM tier to the Trades and Customers directories. Sequence: read file /trades/us, read file /trades/us again, read file /trades/top. Path label: variable latency with throttling.]
Feature Highlight – Policy-driven Data Management
[Diagram: Spark, Presto, Hive, and TensorFlow over RAM, SSD, and disk tiers, with new trades arriving]
Policy defined: move data > 90 days old to GCS
Policy interval: every day (policy applied every day)
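The policy on this slide (move data more than 90 days old to GCS, evaluated daily) reduces to an age filter that runs on a schedule. The following sketch shows only the selection step for one run. `select_for_migration` and the file names are hypothetical, and how the selected files are actually moved is left out.

```python
from datetime import datetime, timedelta, timezone


def select_for_migration(files, now, max_age_days=90):
    """Return the paths whose last-modified time is more than max_age_days before now."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(path for path, modified in files.items() if modified < cutoff)


now = datetime(2019, 12, 1, tzinfo=timezone.utc)  # fixed "today" for a repeatable example
files = {
    "/trades/2019-07.parquet": now - timedelta(days=120),
    "/trades/2019-09.parquet": now - timedelta(days=90),   # exactly 90 days: not yet older
    "/trades/2019-11.parquet": now - timedelta(days=3),
}
to_move = select_for_migration(files, now)
print(to_move)
```

A daily scheduler would call this once per interval and hand the result to whatever performs the move.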
Questions?

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 

Accelerating workloads and bursting data with Google Dataproc & Alluxio

  • 1. Accelerating workloads and bursting data with Google Dataproc & Alluxio
  • 2. Enterprises are telling us they need: To respond to different business data needs with different urgency and emphasis ● Create bespoke Hadoop clusters customized for any workload ● Use them for a minute or a year. A faster, more scalable way to get insights from data ● Get up and running without waiting for hardware or software to be installed or configured. To get their people out of owning and monitoring technology and back to innovating ● Design workflows that create clusters, complete jobs end-to-end, and then delete themselves. To spend less money ● Create clusters in seconds ● Pay only for when the cluster is running ● Take advantage of preemptible VM instances
  • 3. Enterprise Hadoop cluster woes. You know that managing a Hadoop cluster can be frustrating and time-consuming: • It’s a hassle to renew the license on your on-premises system • It’s hard to scale compute or storage on-demand • Maintaining the operations of your Hadoop cluster takes too much time • Your system can’t keep up with forecasted usage and data growth • Your legacy system busts your budget
  • 4. What is Cloud Dataproc? Google Cloud Platform’s fully managed Apache Spark and Apache Hadoop service. • Rapid cluster creation • Familiar open source tools • Ephemeral clusters on-demand • Customizable machines • Tightly integrated with other Google Cloud Platform services
  • 5. Google Cloud Dataproc vision. Fast: Things take seconds to minutes, not hours or weeks. Easy: Be an expert with your data, not your data infrastructure. Cost-effective: Pay for exactly what you use to process your data, not more.
  • 6. Disaggregation of storage and compute. Diagram labels: Data sources; Data sinks; Development & Test; Production; Analysis and Data science (Cloud Datalab); Cloud Dataproc clusters for Development, Test, and Production; Storage (Cloud Storage, BigQuery, Cloud Bigtable); Application Logs; External applications; Cluster monitoring (Stackdriver Monitor, Logs, Logging)
  • 7. Ephemeral and long-lived clusters. Diagram labels: Clusters per job (Cloud Dataproc clusters, Cloud Storage); Semi-long-lived clusters, grouped and selected by label; Edge Nodes (Compute Engine); Clients; Development (Preview): Dev cluster (Cloud Dataproc); Production (1.2): Prod 1 and Prod 2 (Cloud Dataproc)
  • 9. Traditional Brick and Mortar Retailer. Challenge: To build machine learning models that focused on fraud detection and inventory management. How Google Helped: Partnered with the retailer to think about both the digital experience and the in-store customer experience, especially to help them manage major retail events like Black Friday. What they are running: 67 avg. clusters per day, 513 nodes per cluster. Products & Services: BigQuery, Stackdriver, Compute, Cloud Storage, PSO & Support, BigTable, Dataflow, Dataproc, Pub/Sub
  • 10. Combining the best of open source and cloud. Cloud Dataproc
  • 12. The Alluxio Story. Originated as the Tachyon project at UC Berkeley’s AMP Lab by then Ph.D. student and now Alluxio CTO, Haoyuan (H.Y.) Li. 2014, 2015: Open source project established and company to commercialize Alluxio founded. Goal: Orchestrate Data for the Cloud for data-driven apps such as Big Data Analytics, ML and AI. Focus: Accelerating modern app frameworks running on HDFS/S3/GCS-based data lakes or warehouses
  • 13. Data Orchestration for the Cloud. Enable innovation with any frameworks running on data stored anywhere. Interfaces: Java File API, HDFS Interface, S3 Interface, REST API, POSIX Interface. Drivers: HDFS Driver, Swift Driver, S3 Driver, NFS Driver. Roles: Data Analyst, Data Engineer, Storage Ops, Data Scientist, Lines of Business
  • 14. Data Orchestration for the Cloud: Cross-platform Security & Governance. Authentication: Kerberos, Delegation token, LDAP, AD. Authorization: FS security model, AWS IAM model, Ranger integration. Encryption: On the wire with TLS, at rest with client-side encryption. Audit Logging: Track accesses to all data
  • 15. Enterprise Cloud Compute & Storage is Great… but Data got left behind. Compute: 2–5 Mins, Elastic. Storage: 2–5 Mins, Elastic. Dataset: 2–4 Weeks, Not Elastic! (Request Data, Request Review, Find Dataset, Code Script/Job, Run ETL jobs, Grant Permissions)
  • 16. Accelerating Analytics in the cloud. Problem: Object Stores have inconsistent performance for analytics and AI workloads • SLAs are hard to achieve • Metadata operations are expensive • Copied data storage costs add up, making the solution expensive. Solution: Consistent High Performance • Performance increases range from 1.5X to 10X • AWS EMR & Google Dataproc integrations • Fewer copies of data means lower costs. Alluxio enables compute! Diagram labels: Spark, Presto, Hive, TensorFlow; Alluxio Data Orchestration and Control Service; Public Cloud IaaS
  • 17. Presto & Alluxio on Works well together… Charts comparing Presto with Presto + Alluxio: Small range query response time (lower is better); Large scan query response time (lower is better); Concurrency (higher is better). • Query performance bottlenecks • Unpredictable network IO • Query pattern: datasets modelled in star schema could benefit from dimension table caching. Presto + Alluxio: • Avoids unpredictable network • Consistent query latency • Higher throughput and better concurrency
  • 19. Using Alluxio with Google Dataproc. Diagram labels: Google Dataproc Cluster; Presto; Hive; Metadata & Data cache; Compute-driven; Continuous sync
  • 21. Bursting workloads to the cloud with remote data. Typical Restrictions • Data cannot be persisted in a public cloud • Additional I/O capacity cannot be added to existing Hadoop infrastructure • On-prem level security needs to be maintained • Network bandwidth utilization needs to be minimal. Options: Lift and Shift; Data copy by workload; “Zero-copy” Bursting
  • 22. “Zero-copy” bursting to scale to the cloud. Problem: HDFS cluster is compute-bound & complex to maintain. Barrier 1: Prohibitive network latency and bandwidth limits • Makes hybrid analytics unfeasible. Barrier 2: Copying data to cloud • Difficult to maintain copies • Data security and governance • Costs of another silo. Step 1: Hybrid Cloud for Burst Compute Capacity • Orchestrates compute access to on-prem data • Working set of data, not FULL set of data • Local performance • Scales elastically • On-Prem Cluster Offload (both Compute & I/O). Step 2: Online Migration of Data Per Policy • Flexible timing to migrate, with fewer dependencies • Instead of a hard switchover, migrate at own pace • Moves the data per policy, e.g. last 7 days. Diagram labels: AWS Public Cloud IaaS and On Premises Datacenter, each with Spark, Presto, Hive, TensorFlow on the Alluxio Data Orchestration and Control Service; Connectivity
  • 23. DEMO
  • 24. Demo: initialization action installs Alluxio in Dataproc. #1 - First access data in Google Cloud Storage. Diagram labels: Google Dataproc Cluster; Presto; Hive; Metadata & Data cache; Compute-driven; Continuous sync
  • 25. Demo: initialization action installs Alluxio in Dataproc. #2 - Access data from remote Hadoop cluster. Diagram labels: Google Dataproc Cluster; Presto; Hive; Metadata & Data cache; Compute-driven; Continuous sync
  • 26. Get Started with Alluxio on Dataproc. A single command creates a Dataproc cluster with Alluxio installed: $ gcloud dataproc clusters create roderickyao-alluxio --initialization-actions gs://alluxio-public/enterprise-dataproc/2.1.0-1.0/alluxio-dataproc.sh --metadata alluxio_root_ufs_uri=gs://ryao-test/alluxio-test/,alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<KEYID>;alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=<SECRET>",alluxio_license_base64=$(cat alluxio-enterprise-license.json | base64 | tr -d "\n"),alluxio_download_path=gs://ryao-test/alluxio-enterprise-2.1.0-1.0.tar.gz Tutorial: Getting started with Dataproc and Alluxio: https://www.alluxio.io/products/google-cloud/gcp-dataproc-tutorial/
  • 27. Resources • Alluxio Initialization Action: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/alluxio • Alluxio with Google Cloud Storage documentation: https://docs.alluxio.io/ee/user/stable/en/ufs/GCS.html
  • 28. Cloud Dataproc on Kubernetes (Alpha) Combining the best of open source and cloud.
  • 30. Jan ‘19: Kubernetes Operator for Apache Spark open sourced. Sept ‘19: Kubernetes Operator for Apache Flink open sourced.
  • 31. Alluxio – under the hood
  • 32. Alluxio Reference Architecture. Diagram labels: Application with Alluxio Client (two shown); Alluxio Master; Standby Master; Zookeeper / RAFT; Alluxio Worker with RAM / SSD / HDD (two shown); WAN; Under Store 1; Under Store 2
  • 33. Data Elasticity via Unified Namespace. Enables effective data management across different Under Stores. Uses Mounting with Transparent Naming
  • 34. Feature Highlight: Data Caching for faster compute. Diagram labels: Spark, Presto, Hive, TensorFlow; Framework; RAM; Data requests (Read file /trades/us, Read file /trades/us again, Read file /trades/top); Bucket Trades; Bucket Customers; Variable latency with throttling
  • 35. “Zero-copy” bursting under the hood. Diagram labels: Spark, Presto, Hive, TensorFlow; Framework; RAM; Data requests (Read file /trades/us, Read file /trades/us again, Read file /trades/top); Trades Directory; Customers Directory; Variable latency with throttling
  • 36. Feature Highlight – Policy-driven Data Management. New Trades Policy Defined: Move data > 90 days old to GCS. Policy interval: Every day. Policy applied every day. Diagram labels: Spark, Presto, Hive, TensorFlow; Framework; RAM, SSD, Disk
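
The age-based policy on slide 36 (move data more than 90 days old to GCS, evaluated every day) can be sketched roughly as below. This is only an illustration of the selection rule, not Alluxio's policy engine or API: `FileInfo`, `select_for_migration`, and the example paths are hypothetical names introduced here, and the actual data movement is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical record for one file in the namespace."""
    path: str
    last_modified: datetime


def select_for_migration(
    files: Iterable[FileInfo],
    now: datetime,
    max_age: timedelta = timedelta(days=90),
) -> List[str]:
    """Return paths of files strictly older than max_age as of `now`.

    A scheduler would call this once per policy interval (every day on
    the slide) and hand the result to whatever moves the data to GCS.
    """
    cutoff = now - max_age
    return [f.path for f in files if f.last_modified < cutoff]


if __name__ == "__main__":
    today = datetime(2020, 1, 1)
    listing = [
        FileInfo("/trades/2019-q1", today - timedelta(days=120)),
        FileInfo("/trades/last-week", today - timedelta(days=7)),
    ]
    for path in select_for_migration(listing, today):
        print("would move to GCS:", path)
```

Passing `now` explicitly rather than reading the clock inside the function keeps each daily evaluation reproducible, which makes the rule easy to check before wiring it to a real scheduler.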