6. Step 2: Fire Away
hadoop jar -Dmapred.min.split.size=32000000
myapp.jar -partitioner .org.apache…
select t.county, count(1) from (select
transform(a.zip) using 'geo.py' as
a.county from SMALL_TABLE a) t
group by t.county;
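The slides do not show the contents of geo.py. As a rough sketch only: Hive's TRANSFORM streams each input row to the script as a tab-separated line on stdin and reads result rows back from stdout. The lookup table, the "UNKNOWN" default, and the function names below are hypothetical placeholders.

```python
import sys

# Hypothetical lookup table for illustration; a real script would load a
# complete zip-to-county dataset.
ZIP_TO_COUNTY = {
    "94301": "Santa Clara",
    "10001": "New York",
}


def county_for(zip_code):
    """Return the county for a zip code, or "UNKNOWN" if it is not in the table."""
    return ZIP_TO_COUNTY.get(zip_code.strip(), "UNKNOWN")


def transform(lines, out):
    """Read one zip code per input line and write one county per output line."""
    for line in lines:
        out.write(county_for(line.rstrip("\n")) + "\n")


def main():
    # Hive pipes the selected column in on stdin and reads results from stdout.
    transform(sys.stdin, sys.stdout)
```

When saved as the streaming script, the file would end with a call to `main()`.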
AdCo Hadoop
insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on (a.id=b.ad_id)
group by a.id, a.zip;
8. Hadoop as Service
1. Detect when cluster is required
– Not all Hive statements require cluster (EXPLAIN/SHOW/..)
2. Atomically create cluster
– Long running process, concurrency control using MySQL
3. Shutdown when not in use
– Do on hour boundary (whose?)
– Not if User Sessions are active!
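The three steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the service's actual code: sqlite3 stands in for MySQL so the example is self-contained, the `cluster_locks` table and all function names are hypothetical, and the hour boundary is assumed to be measured from the cluster's own launch time (the slide leaves "whose?" open).

```python
import sqlite3

# The slide names EXPLAIN and SHOW; any longer list is unknown.
NO_CLUSTER_PREFIXES = ("EXPLAIN", "SHOW")


def needs_cluster(statement):
    """Step 1: return False for statements that can run without a cluster."""
    words = statement.strip().split(None, 1)
    return bool(words) and words[0].upper() not in NO_CLUSTER_PREFIXES


def open_lock_store():
    """Create the hypothetical lock table (in-memory sqlite3 as a MySQL stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cluster_locks (account_id TEXT PRIMARY KEY, state TEXT)"
    )
    return conn


def try_claim_creation(conn, account_id):
    """Step 2: let exactly one caller win the right to create the cluster.

    The primary key makes the INSERT atomic: a second caller for the same
    account gets an IntegrityError and backs off.
    """
    try:
        with conn:
            conn.execute(
                "INSERT INTO cluster_locks (account_id, state) "
                "VALUES (?, 'STARTING')",
                (account_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def should_shut_down(minutes_since_start, active_sessions, window_minutes=5):
    """Step 3: shut down only near an hour boundary and with no active sessions."""
    if active_sessions > 0:
        return False
    return minutes_since_start % 60 >= 60 - window_minutes
```

Leaning on a unique key keeps the concurrency control inside the database, which fits a long-running creation process where several sessions may ask for a cluster at once.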
9. Hadoop as Service
• Archive Job History/Logs to S3
– Transparent access to Old jobs
• Auto-Config different node types
– Use ALL ephemeral drives for HDFS/MR
– Use right number of slots per machine
• Scrub, Scrub, Scrub
– Bad Nodes, Bad Clusters, AWS timeouts
10. Scaling Up
insert overwrite table dest
select … from ads join
campaigns on … group by …;
[Diagram labels: Master, Job Tracker, Slaves, Map Tasks, Reduce Tasks, Progress, Supply, Demand, StarCluster, AWS]
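The slide only hints at how supply and demand are compared. One plausible reading, sketched under assumptions: demand is the number of outstanding map and reduce tasks reported by the Job Tracker, supply is the number of task slots on the current slaves, and the cluster grows until supply covers demand or a size cap is reached. The function name, parameters, and the cap are hypothetical.

```python
import math


def nodes_to_add(outstanding_map_tasks, outstanding_reduce_tasks,
                 map_slots_per_node, reduce_slots_per_node,
                 current_nodes, max_nodes):
    """Return how many slave nodes to request so slot supply covers task demand."""
    nodes_for_maps = math.ceil(outstanding_map_tasks / map_slots_per_node)
    nodes_for_reduces = math.ceil(outstanding_reduce_tasks / reduce_slots_per_node)
    wanted = max(nodes_for_maps, nodes_for_reduces)
    # Never shrink here, and never grow past the configured cap.
    target = min(max(wanted, current_nodes), max_nodes)
    return max(0, target - current_nodes)
```

For example, 100 outstanding map tasks on nodes with 4 map slots each call for 25 nodes; with 5 nodes running and a cap of 20, the sketch requests 15 more.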
11. Scaling Down
1. On hour boundary – check if node is required:
– Can’t remove nodes with map-outputs (today)
– Don’t go below minimum cluster size
2. Remove node from Map-Reduce Cluster
3. Request HDFS Decommissioning – fast!
– Delete affected cache files instead of re-replicating
– One surviving replica and we are Done.
4. Delete Instance
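The checks in steps 1 and 3 could be sketched as below. This is a minimal illustration, assuming each node's hour boundary is counted from its own launch time and that block locations are available as a mapping from block id to the set of nodes holding a replica. The `Node` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    minutes_since_launch: int
    has_map_outputs: bool


def removable_nodes(nodes, min_cluster_size, window_minutes=5):
    """Step 1: pick nodes near their hour boundary that hold no map outputs,
    without letting the cluster drop below its minimum size."""
    budget = len(nodes) - min_cluster_size
    chosen = []
    for node in nodes:
        if budget <= 0:
            break
        near_boundary = node.minutes_since_launch % 60 >= 60 - window_minutes
        if near_boundary and not node.has_map_outputs:
            chosen.append(node.name)
            budget -= 1
    return chosen


def decommission_done(block_locations, leaving):
    """Step 3: the node can go once every block survives on some other node."""
    return all(holders - {leaving} for holders in block_locations.values())
```

Requiring only one surviving replica, rather than full re-replication, is what the slide credits for making decommissioning fast.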
13. Spot Instance: Challenges
• Can lose Spot nodes anytime
– Disastrous for HDFS
– Hybrid Mode: Use mix of On-Demand and Spot
– Hybrid Mode: Keep one replica in On-Demand nodes
• Spot Instances may not be available
– Timeout and use On-Demand nodes as fallback
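A rough sketch of the two mitigations above, under assumptions: replica placement always puts the first copy of a block on an On-Demand node, and node acquisition tries Spot first and tops up with On-Demand when the Spot request is not filled in time. `request_spot` and `request_on_demand` are hypothetical callables supplied by the caller, not real AWS APIs.

```python
import random


def place_replicas(on_demand_nodes, spot_nodes, replication=3, rng=random):
    """Choose nodes for one block, keeping the first replica on an On-Demand node."""
    if not on_demand_nodes:
        raise ValueError("hybrid mode needs at least one On-Demand node")
    first = rng.choice(on_demand_nodes)
    others = [n for n in on_demand_nodes + spot_nodes if n != first]
    extra = rng.sample(others, min(replication - 1, len(others)))
    return [first] + extra


def acquire_nodes(request_spot, request_on_demand, count, timeout_seconds):
    """Try Spot first; if the request is not filled in time, fall back to On-Demand.

    request_spot(count, timeout_seconds) returns the nodes granted before the
    timeout; request_on_demand(count) returns exactly `count` nodes.
    """
    nodes = list(request_spot(count, timeout_seconds))
    if len(nodes) < count:
        nodes = nodes + list(request_on_demand(count - len(nodes)))
    return nodes
```

With one replica pinned to On-Demand capacity, losing every Spot node at once still leaves a copy of each block in HDFS.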
14. Agenda
What is Qubole Data Service
Hadoop as a Service in Cloud
Hive as a Service in Cloud
16. Cheap to Test
Evaluate expressions on
sample data
Run Query on Sample
17. Fastest Hive SaaS
• Works with Small Files!
– Faster Split Computation (8x)
– Prefetching S3 files (30%)
• Stable JVM Reuse!
– Fix re-entrancy issues
– 1.2-2x speedup
• Direct writes to S3
– HIVE-1620
• Columnar Cache
– Use HDFS as cache for S3
– Up to 5x faster for JSON data
• NEW – Multi-Tenant Hive Server
18. Questions?
@Qubole
Free Trial:
www.qubole.com