Cloud Friendly Hadoop and Hive


DataWorks Summit


Cloud Friendly Hadoop & Hive

Joydeep Sen Sarma

Qubole

Agenda

• What is Qubole Data Service

• Hadoop as a Service in Cloud

• Hive as a Service in Cloud

Qubole Data Service

Explore – Integrate – Analyze – Schedule

[Architecture diagram. Interfaces: SDK, ODBC, API. Services: Oozie, Hive, Pig, Sqoop. Compute: Hadoop, AWS EC2. Data sources: Vertica, MySQL, AWS S3 (S3://adco/logs).]

Step 1 (Optional): Setup Hadoop

Step 2: Fire Away
hadoop jar -Dmapred.min.split.size=32000000
myapp.jar -partitioner .org.apache…

select t.county, count(1) from (select
transform(a.zip) using 'geo.py' as
county from SMALL_TABLE a) t
group by t.county;

AdCo Hadoop

insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on (a.id=b.ad_id)
group by a.id, a.zip;
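
The transform query above streams each selected row to an external script. A minimal sketch of what a script like geo.py could look like follows, assuming Hive's usual streaming convention of tab-separated columns on standard input and output. The lookup table, the UNKNOWN fallback, and the function names are hypothetical; the slides do not show the real script.

```python
import sys

# Hypothetical lookup table; a real script would load a full ZIP-to-county mapping.
ZIP_TO_COUNTY = {
    "94301": "Santa Clara",
    "94103": "San Francisco",
    "10001": "New York",
}


def county_for(zip_code, table=ZIP_TO_COUNTY):
    """Return the county for a ZIP code, or 'UNKNOWN' if it is not in the table."""
    return table.get(zip_code.strip(), "UNKNOWN")


def transform(lines, out):
    """Read one row per input line and write one county per output line."""
    for line in lines:
        # Columns arrive tab-separated; only the first one (the ZIP code) is used.
        zip_code = line.rstrip("\n").split("\t")[0]
        out.write(county_for(zip_code) + "\n")


def main():
    """Entry point when the file is used as a streaming script."""
    transform(sys.stdin, sys.stdout)
```

To use it as a streaming script, call main() at the bottom of the file. Keeping the logic in transform makes it easy to check locally on a few sample rows before running the query on a cluster.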
Come back anytime

Hadoop as Service
1. Detect when cluster is required
– Not all Hive statements require cluster (EXPLAIN/SHOW/..)

2. Atomically create cluster
– Long running process, concurrency control using MySQL

3. Shut down when not in use
– Do on hour boundary (whose?)
– Not if User Sessions are active!
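
Steps 1 and 3 above can be sketched as two small decision functions. This is only an illustration under stated assumptions: the list of metadata-only statement prefixes beyond EXPLAIN and SHOW is a guess, the hour boundary is assumed to be measured from cluster launch, and the 5-minute window is a hypothetical parameter.

```python
# Statement prefixes assumed not to need a cluster. The slide names EXPLAIN
# and SHOW; the remaining entries are illustrative guesses.
METADATA_ONLY = ("explain", "show", "describe", "use", "set")


def needs_cluster(statement):
    """Return True if a Hive statement is assumed to need a running cluster."""
    words = statement.strip().lower().split()
    if not words:
        return False
    return words[0] not in METADATA_ONLY


def should_shut_down(uptime_minutes, active_sessions, running_jobs, window_minutes=5):
    """Decide whether an idle cluster should be terminated now.

    The cluster is only stopped in the last few minutes of an hour, counted
    from launch (an assumption), and never while sessions or jobs are active.
    """
    if active_sessions > 0 or running_jobs > 0:
        return False
    minutes_into_hour = uptime_minutes % 60
    return minutes_into_hour >= 60 - window_minutes
```

For example, should_shut_down(57, 0, 0) is true, while the same call with one active session is false.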


Hadoop as Service
• Archive Job History/Logs to S3
– Transparent access to Old jobs

• Auto-Config different node types
– Use ALL ephemeral drives for HDFS/MR
– Use right number of slots per machine

• Scrub, Scrub, Scrub
– Bad Nodes, Bad Clusters, AWS timeouts


Scaling Up
insert overwrite table dest
select … from ads join
campaigns on … group by …;

[Diagram labels: Slaves, Progress, Map Tasks, Job Tracker, Reduce Tasks, Supply, Demand, Master, StarCluster, AWS]
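
The supply and demand labels in this slide suggest comparing the work remaining against the slots the cluster currently offers. The sketch below is one possible reading, not the method shown in the talk: the function name, its parameters, and the assumption that all remaining tasks should fit in a single wave are hypothetical.

```python
import math


def nodes_to_add(remaining_tasks, slots_per_node, current_nodes, max_nodes):
    """Estimate how many nodes to add so that demand fits the supply of slots.

    Demand is the number of remaining tasks; supply is current_nodes times
    slots_per_node. The result never exceeds the max_nodes cap.
    """
    if slots_per_node <= 0:
        raise ValueError("slots_per_node must be positive")
    # Nodes needed to run every remaining task at once (single-wave assumption).
    nodes_for_demand = math.ceil(remaining_tasks / slots_per_node)
    target = min(max(nodes_for_demand, current_nodes), max_nodes)
    return max(0, target - current_nodes)
```

With 50 remaining tasks, 4 slots per node, 10 current nodes, and a cap of 20, this asks for 3 more nodes.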

Scaling Down
1. On hour boundary – check if node is required:
– Can’t remove nodes with map-outputs (today)
– Don’t go below minimum cluster size

2. Remove node from Map-Reduce Cluster

3. Request HDFS Decommissioning – fast!
– Delete affected cache files instead of re-replicating
– One surviving replica and we are done.

4. Delete Instance
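
The check in step 1 can be sketched as a filter over the cluster's nodes. The Node fields, the use of running tasks as a stand-in for "required", the hour boundary counted from each node's launch, and the 5-minute window are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    uptime_minutes: int
    running_tasks: int
    has_map_outputs: bool


def removable_nodes(nodes, min_cluster_size, window_minutes=5):
    """Return the names of nodes that may be removed right now.

    A node qualifies if it is near its hour boundary, runs no tasks, and
    holds no map outputs. The cluster never shrinks below min_cluster_size.
    """
    spare = len(nodes) - min_cluster_size
    picked = []
    for node in nodes:
        if len(picked) >= spare:
            break
        near_boundary = node.uptime_minutes % 60 >= 60 - window_minutes
        idle = node.running_tasks == 0 and not node.has_map_outputs
        if near_boundary and idle:
            picked.append(node.name)
    return picked
```

Steps 2 to 4 (removing the node from Map-Reduce, HDFS decommissioning, and deleting the instance) would then be applied to each returned node in turn.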

Spot Instances

On average, 50-60% cheaper
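
As a rough worked example of what that discount means for a mixed cluster, the sketch below computes an hourly cost. The prices and node counts are made up for illustration, and the function name is hypothetical.

```python
def hybrid_hourly_cost(on_demand_price, on_demand_nodes, spot_nodes, spot_discount):
    """Hourly cost of a cluster mixing On-Demand and Spot nodes.

    spot_discount is the fraction saved on a Spot node, for example 0.5
    for "50% cheaper".
    """
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    spot_price = on_demand_price * (1.0 - spot_discount)
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price
```

With a hypothetical price of 1.00 per node-hour, 5 On-Demand nodes plus 5 Spot nodes at a 50% discount cost 7.50 per hour, against 10.00 for ten On-Demand nodes.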

Spot Instance: Challenges
• Can lose Spot nodes anytime
– Disastrous for HDFS
– Hybrid Mode: Use mix of On-Demand and Spot
– Hybrid Mode: Keep one replica in On-Demand nodes

• Spot Instances may not be available
– Timeout and use On-Demand nodes as fallback
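
The two hybrid-mode ideas and the fallback above can be sketched as follows. This is a simplified illustration, not HDFS's actual block placement policy: the function names, the node lists, and the timeout handling are assumptions.

```python
import random


def place_replicas(on_demand_nodes, spot_nodes, replication=3, rng=None):
    """Choose nodes for a block's replicas, keeping one on an On-Demand node."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if not on_demand_nodes:
        raise ValueError("hybrid mode needs at least one On-Demand node")
    rng = rng or random.Random()
    # The first replica always goes to an On-Demand node, so it survives
    # even if every Spot node is lost at once.
    first = rng.choice(list(on_demand_nodes))
    others = [n for n in list(on_demand_nodes) + list(spot_nodes) if n != first]
    extra = rng.sample(others, min(replication - 1, len(others)))
    return [first] + extra


def on_demand_fallback(requested, fulfilled_spot, waited_seconds, timeout_seconds):
    """Return how many On-Demand nodes to launch once the Spot request times out."""
    if waited_seconds < timeout_seconds:
        return 0
    return max(0, requested - fulfilled_spot)
```

Passing a seeded random.Random as rng makes the placement repeatable for testing.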


Query History/Results

Cheap to Test

• Evaluate expressions on sample data

• Run Query on Sample

Fastest Hive SaaS
• Works with Small Files!
– Faster Split Computation (8x)
– Prefetching S3 files (30%)

• Stable JVM Reuse!
– Fix re-entrancy issues
– 1.2-2x speedup

• Direct writes to S3
– HIVE-1620

• Columnar Cache
– Use HDFS as cache for S3
– Up to 5x faster for JSON data

• NEW – Multi-Tenant Hive Server

Questions?

@Qubole
Free Trial: www.qubole.com


