Qubole @ AWS Meetup Bangalore - July 2015

Big Data as a Service
Joydeep Sen Sarma
Hariharan Iyer
Shubham Tagra

Introduction
• Introduction to Qubole
• How Qubole integrates with AWS:
– EC2
– S3
– RedShift
– Kinesis
• Hive vs. Presto vs. RedShift vs. Spark
• Qubole vs. EMR
3
Agenda

Introduction
• Founded ~ 10/2011
• Team:
– Founding Crew initial authors of Apache Hive, ran Data @Facebook
– + Notable Alumni from Greenplum/Vertica/EngineYard/Oracle/AWS etc
– + 50 engineers + 20 sales/mkting across Bangalore/Palo-Alto
• Financing:
– Completed Series-B 10/2014
– LightSpeed, Charles-River, Norwest, Anand/Venky
• Product: Qubole Data Service
– Big Data as a Service
– AWS/GCE/Azure
4
About Qubole

Introduction
Qubole Data Service

7
Introduction
Qubole Data Service

Introduction
• Self-Serve Big Data Analytics:
– Lack of Hadoop trained IT/engineers
– Team of Analysts
• Lowest TCO
– Cloud Optimized - takes full advantage of AWS
• Unified Platform for all Tools:
– Hive/Pig/Spark/SQL/Map-Reduce/Cascading/…
– Pick and Choose. Combine and Use
• Awesome Support and Solutions
8
Why Qubole

Self Service Analytics: Direct Access to Big Data
9

Self Service Analytics: Manage Clusters Easily
10

Self Service Analytics: Schedule Jobs
11

Self Service Analytics: Self Serve Dashboards using Notebooks
12

Qubole and EC2
– EC2
– S3
– RedShift
– Kinesis
• Qubole vs. EMR
13
Agenda

Qubole and EC2
• Custom AMIs for much faster boot-up
• Auto-termination
• Auto-scaling
• Spot Instances
• EBS
14
EC2 Magic Sauce

• Auto-start and termination
– Cluster starts automatically when you need to run a command
• Intelligent - no cluster required for metadata commands
– Terminated after couple of hours of Idle time
• Auto-scaling
– Min Size <= Cluster <= Max Size
15
Cluster LifeCycle Management
Qubole and EC2

16
Map Slots Reduce Slots
Slave
Slave
Slave
Slave
Slave
SELECT * FROM FOO JOIN BAR ON BAZ =
...
Auto-Scaling

• Upscaling
– Engine-specific algorithms
– Cannot just look at expected time (parallelism matters)
• Downscaling
– Decommissioning takes time
– Need to consider hour boundary
– Stuck on mapper output
• Output offloading
• AWS Integration
– Hour boundary
– Eventual consistency
17
Why is it hard?
Qubole and EC2

Min Cluster Size: 400
Max Cluster Size: 800
Time for which cluster size < max size: 49%
18
But it pays off!
Qubole and EC2

19
But it pays off!
Expected Compute Hours Compute hours saved Savings (%)
2902246.2 2107311.01 72.6
4655815.5 2105486.11 45.2
1698052.65 1658738.375 97.6
1776944.4 1476547.835 83.1
2063127.85 838628.7 40.6
919721.25 613630.955 66.7
Qubole and EC2

• Various configurations
– All nodes on-demand
– Minimum nodes on-demand, rest combination of on-demand & spot
– All nodes spot
• Minimum set has higher bid price => less likely to lose
– Up to 90% savings compared to on-demand price
20
Spot instances
Qubole and EC2

• Not always available
– Fall back to on-demand
• Increases overall cost of cluster
– Periodically replace extra on-demand instances when spot available
• Can go away at any time
– Hadoop has built-in resiliency
– Place replicas on stable instances
• Auto-scaling
– Must maintain requested ratio
21
Why is it hard?
Qubole and EC2

• Useful for newer instance types - c3/m3
– Low ephemeral storage
• Better performance/$
– Compared to older instance types with more storage
• Writes are changed
– Minimize writes to EBS volumes
– Use them only when ephemeral is near full
22
Elastic Block Store
Qubole and EC2

• Spot Fleets
• EBS-only instances
23
What’s coming next
Qubole and EC2

Resources
• Auto-scaling Hadoop Clusters in Qubole
• Spot Instances in Qubole Clusters
• Rebalancing Hadoop Clusters for Better Spot Utilization
• Improved Performance with Low-Ephemeral-Storage Instances
24
Qubole and EC2

Agenda
– EC2
– S3
– RedShift
– Kinesis
• Qubole vs. EMR
25
Agenda

Qubole and S3
• No real Rename (aka Move) operation
– renames are copies and expensive!
• S3 connection establishment is expensive
– ie. - small calls like getObjectDetails(key) and reads are expensive!
• S3 has bulk prefix listing
– listObjectsChunked(startKey, maxListing)
• Puts are atomic
– Objects created when object uploaded
– Unlike HDFS where files are created on first write
• MultiPart!
26
S3 != HDFS

Qubole and S3
• Naiive
• Smart
• Up to 1000x improvement
27
Prefix Listing
for path in [‘/x/y/a’, ‘/x/y/b’, ‘/x/z/c’, … ]:
result << listObject(path)
pathList = listPrefix(‘/x’)
while (entry = pathList.next()):
if entry in [‘/x/y/a’, ‘/x/y/b’, ‘/x/z/c’, … ]:
result << entry

Qubole and S3
• Split Computation
– Divide input files into tasks for Map-Reduce/Spark/Presto
• Recovering Partitions
• List Paths matching regex pattern (‘/x/y/z/*/*’)
• and many more ..
28
Prefix Listing - Use Cases

Qubole and S3
• Normally:
– Write data to temporary location - atomically rename to final location
• With S3:
– Write data to final location
– Atomic puts deal with speculation/retries
– Optional: Remove on Failure
• By default in Hive, DirectFileOutputCommitter in MR/Spark
• Tricky: retries/speculation must use same path
29
Direct Writes

Qubole and S3
• Naiive:
30
Pre-Fetching
Client S3
• Smart:
Client S3

S3 Local Disk Cache (Presto)
31

S3 Local Disk Cache (Presto)
32

Qubole and S3
• Populating Cache while performing Query may cause Slowness
• Large Files are split
– Cache Files or Splits?
• Should Caching Combine Small Files?
• Should Caching transform data into Columnar?
Watch out for Table Copies!
33
Why S3 Caching is hard

Qubole and S3
• Handle S3 Timeouts and Exceptions (truncated streams)
• Optimize away seek() operations
• Data Sharing across Organizations using Cross-Account Roles
34
Miscellaneous

Resources
● S3 Optimizations in Hive
○ http://www.qubole.com/optimizing-hadoop-for-s3-part-1/
● Caching in Presto
○ http://www.qubole.com/blog/product/caching-presto/
● Qubole vs. EMR Performance Comparison
○ http://www.qubole.com/a-performance-comparison-of-qubole-and-amazons-elastic-
mapreduce-emr/
● Data Sharing via AWS Roles:
○ https://qubole-eng.quora.com/Securely-sharing-data-across-Organizations-with-Qubole-2
Qubole and S3

Agenda
– EC2
– S3
– RedShift
– Kinesis
• Qubole vs. EMR
36
Agenda

• Introduction to Presto
• Brief Introduction to Kinesis
• Qubole’s Value add to Kinesis
• Brief introduction to Redshift
• Qubole’s Value add to Redshift
• When to use which system
37
Presto, Kinesis and RedShift
Presto, Kinesis, RedShift

• Interactive SQL Query Engine for Big Data
• Open source by Facebook in late 2013
• Follows ANSI-SQL
• Own execution engine
38
Presto, What is it?

● Extensibility
○ Pluggable Datasources
● Performance
○ In-memory Execution
○ Aggressive Pipelining
○ Highly efficient Java code
○ Dynamic Query Compilation
○ Vectorization
39
Why Presto?

● Smooth learning curve as it adheres to ANSI-SQL
● Active open source community
● Proven worth at scale in companies like Facebook, NetFlix, Airbnb
40
Why not something else?

● Self managed Presto clusters
● Auto-configured
● Autoscaling
● Data Caching
41
Benefits of Presto @Qubole

● Kinesis Connector
● S3 Optimizations
● Insert Support
● UDF support
42
Qubole’s Contribution

● Average times across the performance tests by MediaMath on a 22TB text format table,
partitioned on date, queried on partition with ~1.2b rows
http://www.qubole.com/blog/big-data/performance-testing-presto/
43
Comparison with Hive

● High Capacity Pipe for Real-Time Processing
● Key Concepts
○ Record
○ Streams
○ Shards
○ Checkpoints
44
Kinesis

● Streaming usecase
○ Spark
● SQL usecase
○ Via Hive
○ Via Presto
45
Qubole and Kinesis

● Kinesis Connector
46
Presto-Kinesis Integration

● Example
○ Step1: Define Schema
47

○ Step2: Run Query
48

● Datawarehouse service
● OLAP
● Storage + Compute
49
Redshift

● ETL Usecase
○ DBImport
○ DBExport
● Adhoc Query Usecase
○ Direct Query
○ Hive
○ Presto
50
Qubole and Redshift

Two ways to access Redshift via Presto
● Via Hive JDBC Storage Handler
51
Presto-Redshift integration

Two ways to access Redshift via Presto
● Via jdbc connector
52
Presto-Redshift integration

● Single Platform
● Consistent User Interface
● Cross-source Joins without data consolidation
53
New Opportunities

● Hive: ETL, huge Joins, Group By on high cardinality columns
● Redshift: Interactive Queries when data loading is acceptable
● Presto:
○ Interactive Queries
○ Direct Queries without loading
○ Joining data across Redshift, Kinesis, S3, MySql, Cassandra, etc
54
Hive? Presto? RedShift?

● Iterative Machine Learning
● In-Memory Computing
● Spark Streaming
55
Got Spark?

Resources
● Presto vs Hive
○ http://www.qubole.com/blog/big-data/performance-testing-presto/
● Presto Kinesis Integration
○ http://blogs.aws.amazon.com/bigdata/post/Tx2DDFNHXSAAH2G/Presto-Amazon-Kinesis-Connector-for-
Interactively-Querying-Streaming-Data
● Presto Kinesis Connector
○ https://github.com/qubole/presto-kinesis
● Hive Redshift/Jdbc Connector
○ http://www.qubole.com/blog/product/hive-jdbc-storage-handler/
● Hive Redshift/Jdbc Connector
○ https://github.com/qubole/Hive-JDBC-Storage-Handler

Agenda
– EC2
– S3
– RedShift
– Kinesis
• Qubole vs. EMR
57
Agenda

Thanks!
joydeep@qubole.com, hiyer@qubole.com, stagra@qubole.com
info@qubole.com

Qubole @ AWS Meetup Bangalore - July 2015

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (20)

Similar a Qubole @ AWS Meetup Bangalore - July 2015

Similar a Qubole @ AWS Meetup Bangalore - July 2015 (20)

Último

Último (20)

Qubole @ AWS Meetup Bangalore - July 2015