Cloud Optimized Big Data

Joydeep Sen Sarma

What makes a big-data platform 'cloud-optimized'? Here's our (Qubole's) shot at it. @Cloud-Asia 2014.

Cloud-Optimized Big-Data as a Service
Joydeep Sen Sarma
Co-Founder Qubole, Apache-Hive

About Me
• @Facebook (2007-2011):
– First Hadoop Engineer
– Founder - Apache Hive project, PMC Member
– Contributor to Apache Hadoop/HBase
• Founder Qubole (2012-)
– Hadoop-as-a-Service
– 30+ customers: Pinterest, Quora, Mediamath, Tubemogul …
– Design/Code/Ops/Support/…

Big Data Cloud
• Elasticity:
– Workloads are Bursty
– Allows easy rolling upgrades and testing
• Lower TCO:
– Cloud Storage is Inexpensive (2-3c/GB/month – globally replicated)
– Zero cost to try new projects
– Upgrade to new hardware easily (no cluster migrations!)

Big Data Cloud
• Global:
– Easily set up where employees/customers/entities are located
• Collaboration:
– Zero-Copy sharing of data with Partners and across Departments
– Easy access to great public data sets
• As-a-Service delivery model vastly lowers Operational Cost

Cloud-Optimized Big Data?
• Optimized for lower TCO
• Optimized for Speed
• Optimized for Operations/Support

Cloud-Optimized Big Data
Optimized for lower TCO

Automated LifeCycle Mgmt
Example jobs (diagram label: AdCo Hadoop):

select t.county, count(1)
from (select transform(a.zip) using 'geo.py' as county
      from SMALL_TABLE a) t
group by t.county;

hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…

insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on (a.id = b.ad_id)
group by a.id, a.zip;

Auto-Scaling
insert overwrite table dest
select … from ads join campaigns
on … group by …;
[Diagram labels: StarCluster, Map Tasks, Reduce Tasks, Demand, Supply, AWS, Progress, Master, Slaves, Job Tracker]
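
The demand/supply idea on this slide can be sketched roughly as follows. This is a minimal illustration, not Qubole's scheduler: the function names, the `slots_per_node` parameter and the node bounds are all assumptions made for the example. Demand is taken to be the number of pending map and reduce tasks, and supply is the number of running slave nodes.

```python
import math


def target_node_count(pending_tasks, slots_per_node, min_nodes=1, max_nodes=100):
    """Return the node count that would give every pending task a slot.

    All parameters are hypothetical; a real scheduler would also look at
    job progress and at how long each node has already been billed.
    """
    needed = math.ceil(pending_tasks / slots_per_node)
    # Clamp supply to the configured bounds.
    return max(min_nodes, min(max_nodes, needed))


def scaling_action(pending_tasks, slots_per_node, current_nodes, **bounds):
    """Compare demand (tasks) with supply (nodes) and say what to do."""
    target = target_node_count(pending_tasks, slots_per_node, **bounds)
    delta = target - current_nodes
    if delta > 0:
        return ("add", delta)
    if delta < 0:
        return ("remove", -delta)
    return ("hold", 0)
```

With 95 pending tasks, 10 slots per node and 4 running nodes, the sketch asks for 6 more nodes; once the queue drains it asks to shrink back towards `min_nodes`.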

Spot Instances
On average 50-60% cheaper
• Fall back to regular instances when Spot is unavailable
• Replace regular instances with Spot when available
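
The two bullets above amount to a simple placement policy, sketched here under stated assumptions. The function names are hypothetical, `spot_available` stands in for whatever capacity signal the provider gives, and the default `spot_discount=0.55` is just the midpoint of the 50-60% figure on the slide, not a quoted price.

```python
def plan_purchase(nodes_needed, spot_available):
    """Split a request between Spot and regular instances."""
    spot = min(nodes_needed, spot_available)
    # Fallback: whatever Spot cannot cover is bought as regular instances.
    regular = nodes_needed - spot
    return {"spot": spot, "regular": regular}


def plan_replacement(running_regular, spot_available):
    """How many running regular instances could be swapped for Spot now."""
    return min(running_regular, spot_available)


def blended_hourly_cost(plan, regular_price, spot_discount=0.55):
    """Hourly cost of a plan; spot_discount is an assumed average saving."""
    spot_price = regular_price * (1 - spot_discount)
    return plan["spot"] * spot_price + plan["regular"] * regular_price
```

For example, a request for 10 nodes when only 6 Spot nodes are obtainable yields 6 Spot plus 4 regular instances, and the 4 regular ones become candidates for replacement as Spot capacity returns.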

Using Fast but ‘Thin’ nodes
• C3 instances: 50% better performance at 20% lower cost
• Little local storage

Using Fast but ‘Thin’ nodes
Modify Hadoop to use Network drives for overflow
[Diagram labels: Map-Reduce, HDFS, Local SSD, Disk I/O, Network Drives, Overflow]
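
One way to picture the overflow rule: write to local SSD while it has room, and use a network drive only when it does not. The sketch below is an assumption-laden illustration rather than the actual Hadoop modification; the `Volume` type, the mount paths in the usage note and the `reserve_bytes` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    path: str
    free_bytes: int
    is_local: bool  # True for local SSD, False for a network drive


def pick_spill_volume(size_bytes, volumes, reserve_bytes=0):
    """Prefer local SSD; overflow to a network drive only when needed."""
    fits = [v for v in volumes if v.free_bytes - reserve_bytes >= size_bytes]
    if not fits:
        raise OSError("no volume has room for the spill")
    # Local volumes sort first; among equals, pick the one with most free space.
    return min(fits, key=lambda v: (not v.is_local, -v.free_bytes))
```

Given `Volume("/mnt/ssd0", 100, True)` and `Volume("/mnt/net0", 10_000, False)`, a 50-byte spill lands on the SSD and a 500-byte spill overflows to the network drive.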

Cloud-Optimized Big Data
Optimized for Speed

Faster, Faster ..
• Optimize I/O to AWS S3
– Faster Split Computation (8x)
– Prefetching S3 files (30%)
– Zero-Copy writes to S3
• JVM Reuse (1.2-2x speedup)
• Columnar File Caches on local disks (1.2-2x speedup)
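
Prefetching can be illustrated with a small read-ahead loop: while the consumer processes one object, the next few downloads are already in flight. This is a generic sketch, not Qubole's S3 code; `fetch` is a hypothetical callable standing in for an S3 GET, and `depth` (assumed to be at least 1) is the number of downloads kept in flight.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def read_with_prefetch(keys, fetch, depth=2):
    """Yield fetch(key) for each key, in order, with read-ahead.

    Up to `depth` fetches run in background threads ahead of the consumer.
    """
    keys = iter(keys)
    with ThreadPoolExecutor(max_workers=depth) as pool:
        # Prime the pipeline with the first `depth` downloads.
        pending = deque(pool.submit(fetch, k) for k in islice(keys, depth))
        while pending:
            future = pending.popleft()
            # Start the next download before blocking on the current one.
            for key in islice(keys, 1):
                pending.append(pool.submit(fetch, key))
            yield future.result()
```

Results come back in the original key order, so a reader built on top of this sees the same byte stream as a sequential one, only with less time spent waiting.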

Faster, Faster ..
• 5x Faster than nearest competitor (Hive against S3)

Faster, Faster ..
• Presto-as-a-Service: 3-22x faster SQL against S3
– (as tested by customer)

Cloud-Optimized Big Data
Optimized for Operations/Support

Rolling Upgrades
• @Facebook – we spent months upgrading a large cluster
• @Qubole: Start new cluster, Reassign label
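
The "reassign label" step can be pictured as one level of indirection: jobs name a label, and the label points at whichever cluster is current. The sketch below is hypothetical throughout (class name, method names and cluster ids are invented for illustration and are not Qubole's API).

```python
class ClusterRegistry:
    """Maps stable labels to whichever cluster currently serves them."""

    def __init__(self):
        self._label_to_cluster = {}

    def assign(self, label, cluster_id):
        """Point `label` at `cluster_id`; return the cluster it replaced."""
        previous = self._label_to_cluster.get(label)
        self._label_to_cluster[label] = cluster_id
        return previous

    def resolve(self, label):
        """Return the cluster that jobs submitted to `label` should use."""
        return self._label_to_cluster[label]
```

An upgrade then reads as: start the new cluster, call `assign("production", "cluster-v2")`, and retire the returned old cluster once its running jobs finish. Jobs keep submitting to the `production` label and never need to change.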

Questions?
joydeep@qubole.com
@jsensarma
www.qubole.com

Support
• Chat
• Email

Visually browse Historical Jobs