Qubole @ AWS Meetup Bangalore - July 2015
1. Big Data as a Service
Joydeep Sen Sarma
Hariharan Iyer
Shubham Tagra
2. Introduction
• Introduction to Qubole
• How Qubole integrates with AWS:
– EC2
– S3
– RedShift
– Kinesis
• Hive vs. Presto vs. RedShift vs. Spark
• Qubole vs. EMR
Agenda
3. Introduction
• Founded ~ 10/2011
• Team:
– Founding crew: initial authors of Apache Hive, ran Data at Facebook
– + Notable alumni from Greenplum/Vertica/EngineYard/Oracle/AWS etc.
– + 50 engineers + 20 sales/marketing across Bangalore/Palo Alto
• Financing:
– Completed Series-B 10/2014
– LightSpeed, Charles-River, Norwest, Anand/Venky
• Product: Qubole Data Service
– Big Data as a Service
– AWS/GCE/Azure
About Qubole
7. Introduction
• Self-Serve Big Data Analytics:
– Lack of Hadoop-trained IT staff/engineers
– Team of Analysts
• Lowest TCO
– Cloud Optimized - takes full advantage of AWS
• Unified Platform for all Tools:
– Hive/Pig/Spark/SQL/Map-Reduce/Cascading/…
– Pick and Choose. Combine and Use
• Awesome Support and Solutions
Why Qubole
13. Qubole and EC2
• Custom AMIs for much faster boot-up
• Auto-termination
• Auto-scaling
• Spot Instances
• EBS
EC2 Magic Sauce
14. • Auto-start and termination
– Cluster starts automatically when you need to run a command
• Intelligent - no cluster required for metadata commands
– Terminated after a couple of hours of idle time
• Auto-scaling
– Min Size <= Cluster <= Max Size
Cluster LifeCycle Management
Qubole and EC2
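The size constraint on this slide (Min Size <= Cluster <= Max Size) can be sketched as a simple clamp. This is only an illustration of the bound, not Qubole's scaling logic; the function name and its arguments are hypothetical.

```python
def target_cluster_size(desired_nodes: int, min_size: int, max_size: int) -> int:
    """Clamp the node count the workload asks for to [min_size, max_size].

    `desired_nodes` is assumed to come from some engine-specific estimate;
    how that estimate is produced is outside this sketch.
    """
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    return max(min_size, min(desired_nodes, max_size))
```

Whatever the workload requests, the cluster never shrinks below the configured minimum or grows past the configured maximum.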
15. Auto-Scaling
[Figure: a query (SELECT * FROM FOO JOIN BAR ON BAZ = ...) submitted to a cluster of five slave nodes; diagram labels: Map Slots, Reduce Slots]
16. • Upscaling
– Engine-specific algorithms
– Cannot just look at expected time (parallelism matters)
• Downscaling
– Decommissioning takes time
– Need to consider hour boundary
– Stuck on mapper output
• Output offloading
• AWS Integration
– Hour boundary
– Eventual consistency
Why is it hard?
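The "hour boundary" point can be illustrated with a small sketch. It assumes instances are billed per started hour (which is what the bullet suggests), so releasing an idle node early in its billing hour saves nothing. The function names and the 10-minute window are assumptions made for illustration, not Qubole's actual policy.

```python
def minutes_into_billing_hour(launch_epoch_s: float, now_epoch_s: float) -> float:
    """Minutes elapsed in the node's current billing hour (hours start at launch)."""
    return ((now_epoch_s - launch_epoch_s) % 3600) / 60.0


def should_release(launch_epoch_s: float, now_epoch_s: float,
                   is_idle: bool, window_min: float = 10.0) -> bool:
    """Release an idle node only when its paid-for hour is nearly used up.

    `window_min` is a hypothetical tuning knob: it must leave enough time
    for decommissioning to finish before the next hour is billed.
    """
    if not is_idle:
        return False
    remaining = 60.0 - minutes_into_billing_hour(launch_epoch_s, now_epoch_s)
    return remaining <= window_min
```

A node that went idle 20 minutes into its hour is kept around (it is already paid for and may pick up new work), while one that is idle at minute 55 becomes a candidate for removal.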
17. Min Cluster Size: 400
Max Cluster Size: 800
Time for which cluster size < max size: 49%
But it pays off!
18. But it pays off!
Expected compute hours    Compute hours saved    Savings (%)
2902246.2                 2107311.01             72.6
4655815.5                 2105486.11             45.2
1698052.65                1658738.375            97.6
1776944.4                 1476547.835            83.1
2063127.85                838628.7               40.6
919721.25                 613630.955             66.7
19. • Various configurations
– All nodes on-demand
– Minimum nodes on-demand, rest combination of on-demand & spot
– All nodes spot
• Minimum set has a higher bid price => less likely to be lost
– Up to 90% savings compared to on-demand price
Spot instances
20. • Not always available
– Fall back to on-demand
• Increases overall cost of cluster
– Periodically replace extra on-demand instances when spot available
• Can go away at any time
– Hadoop has built-in resiliency
– Place replicas on stable instances
• Auto-scaling
– Must maintain requested ratio
Why is it hard?
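The "must maintain requested ratio" constraint can be sketched as follows. This is a simplified illustration under stated assumptions: the ratio is applied to the whole cluster, spot capacity is always available, and `plan_new_nodes` and its parameters are hypothetical names rather than anything from Qubole's implementation.

```python
import math


def plan_new_nodes(current_on_demand: int, current_spot: int,
                   nodes_to_add: int, spot_fraction: float) -> tuple[int, int]:
    """Split an upscaling request into (on_demand_to_add, spot_to_add).

    The split is chosen so that, after scaling, the share of spot nodes
    gets as close as possible to `spot_fraction` without exceeding it.
    """
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    total_after = current_on_demand + current_spot + nodes_to_add
    wanted_spot = math.floor(total_after * spot_fraction)
    # Never add more nodes than requested, and never a negative number.
    spot_to_add = min(nodes_to_add, max(0, wanted_spot - current_spot))
    return nodes_to_add - spot_to_add, spot_to_add
```

If spot capacity is unavailable and on-demand nodes are used as a fallback, the same calculation can be rerun later to decide how many of those extra on-demand nodes to replace.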
21. • Useful for newer instance types - c3/m3
– Low ephemeral storage
• Better performance/$
– Compared to older instance types with more storage
• Write path is changed
– Minimize writes to EBS volumes
– Use them only when ephemeral is near full
Elastic Block Store
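The write policy on this slide (prefer ephemeral storage, spill to EBS only when ephemeral is near full) might look like the sketch below. The 90% threshold, the function name, and the string labels are assumptions for illustration only.

```python
def pick_write_volume(ephemeral_used_bytes: int, ephemeral_capacity_bytes: int,
                      write_bytes: int, near_full_fraction: float = 0.9) -> str:
    """Choose where a new write should land.

    Returns "ephemeral" while the write still fits under the near-full
    threshold, and "ebs" otherwise, so EBS writes stay rare.
    """
    projected = ephemeral_used_bytes + write_bytes
    if projected <= ephemeral_capacity_bytes * near_full_fraction:
        return "ephemeral"
    return "ebs"
```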
22. • Spot Fleets
• EBS-only instances
What’s coming next
23. Resources
• Auto-scaling Hadoop Clusters in Qubole
• Spot Instances in Qubole Clusters
• Rebalancing Hadoop Clusters for Better Spot Utilization
• Improved Performance with Low-Ephemeral-Storage Instances
25. Qubole and S3
• No real Rename (aka Move) operation
– renames are copies and expensive!
• S3 connection establishment is expensive
– i.e., small calls like getObjectDetails(key) and small reads are expensive!
• S3 has bulk prefix listing
– listObjectsChunked(startKey, maxListing)
• Puts are atomic
– Objects are created only when the upload completes
– Unlike HDFS, where files are created on first write
• Multipart upload!
S3 != HDFS
26. Qubole and S3
• Naive: one listing call per path
    for path in ['/x/y/a', '/x/y/b', '/x/z/c', … ]:
        result << listObject(path)
• Smart: one prefix listing, filtered locally
    pathList = listPrefix('/x')
    while (entry = pathList.next()):
        if entry in ['/x/y/a', '/x/y/b', '/x/z/c', … ]:
            result << entry
• Up to 1000x improvement
Prefix Listing
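The difference between the two approaches comes down to the number of round trips. The sketch below uses an in-memory stand-in for the object store that counts requests; `FakeObjectStore`, `naive_lookup`, and `smart_lookup` are hypothetical names, and the page size of 1000 keys per list request is an assumption of the fake (it matches the per-request cap on S3 list calls).

```python
class FakeObjectStore:
    """In-memory stand-in for an object store that counts round trips."""

    def __init__(self, keys, page_size=1000):
        self._keys = sorted(keys)
        self._key_set = set(keys)
        self._page_size = page_size
        self.requests = 0

    def exists(self, key):
        """One request per key, like a per-object metadata call."""
        self.requests += 1
        return key in self._key_set

    def list_prefix(self, prefix):
        """Yield keys under `prefix`, costing one request per page of results."""
        matching = [k for k in self._keys if k.startswith(prefix)]
        # An empty listing still costs one request.
        starts = range(0, len(matching), self._page_size) or [0]
        for start in starts:
            self.requests += 1
            yield from matching[start:start + self._page_size]


def naive_lookup(store, wanted):
    """One request per wanted path."""
    return [key for key in wanted if store.exists(key)]


def smart_lookup(store, wanted, prefix):
    """One paged listing of the common prefix, filtered locally."""
    wanted_set = set(wanted)
    return [key for key in store.list_prefix(prefix) if key in wanted_set]
```

With 5000 keys under `/x/` and 3000 wanted paths, the naive version issues 3000 requests and the prefix version issues 5, while both return the same keys. The real gain depends on connection cost and on how much of the prefix is actually wanted.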
27. Qubole and S3
• Split Computation
– Divide input files into tasks for Map-Reduce/Spark/Presto
• Recovering Partitions
• List Paths matching regex pattern (‘/x/y/z/*/*’)
• and many more ...
Prefix Listing - Use Cases
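Split computation is a natural consumer of a prefix listing, because listing results already carry object sizes and no per-file metadata call is needed. The sketch below shows a generic fixed-size splitting scheme; it is not the exact algorithm of any particular engine, and `compute_splits` is a hypothetical helper.

```python
def compute_splits(files, split_size):
    """Divide input files into (path, offset, length) tasks.

    `files` is an iterable of (path, size_in_bytes) pairs, e.g. taken
    from a prefix listing. Empty files produce no splits.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    for path, size in files:
        offset = 0
        while offset < size:
            length = min(split_size, size - offset)
            splits.append((path, offset, length))
            offset += length
    return splits
```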
28. Qubole and S3
• Normally:
– Write data to a temporary location, then atomically rename to the final location
• With S3:
– Write data to final location
– Atomic puts deal with speculation/retries
– Optional: Remove on Failure
• Default in Hive; DirectFileOutputCommitter in MR/Spark
• Tricky: retries/speculation must use the same path
Direct Writes
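The "same path" requirement can be sketched with a dictionary standing in for the object store. The output key is derived from the task ID only, never from the attempt number, so a retry or speculative attempt replaces the same object instead of leaving a duplicate. The function names and the `part-00000` style naming are assumptions for illustration.

```python
def direct_output_path(table_location: str, task_id: int) -> str:
    """Final object key for a task; deliberately independent of the attempt."""
    return f"{table_location.rstrip('/')}/part-{task_id:05d}"


def run_attempts(store: dict, table_location: str, task_id: int, attempts) -> None:
    """Simulate several attempts of one task writing directly to the final path.

    Each assignment models an atomic put: the object appears whole or not
    at all, and a later attempt simply replaces it.
    """
    for data in attempts:
        store[direct_output_path(table_location, task_id)] = data
```

If the path included the attempt number, two attempts of the same task would leave two objects in the final location, and readers would see duplicated rows.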
32. Qubole and S3
• Populating Cache while performing Query may cause Slowness
• Large Files are split
– Cache Files or Splits?
• Should Caching Combine Small Files?
• Should Caching transform data into Columnar?
Watch out for Table Copies!
Why S3 Caching is hard
33. Qubole and S3
• Handle S3 Timeouts and Exceptions (truncated streams)
• Optimize away seek() operations
• Data Sharing across Organizations using Cross-Account Roles
Miscellaneous
34. Resources
● S3 Optimizations in Hive
○ http://www.qubole.com/optimizing-hadoop-for-s3-part-1/
● Caching in Presto
○ http://www.qubole.com/blog/product/caching-presto/
● Qubole vs. EMR Performance Comparison
○ http://www.qubole.com/a-performance-comparison-of-qubole-and-amazons-elastic-mapreduce-emr/
● Data Sharing via AWS Roles:
○ https://qubole-eng.quora.com/Securely-sharing-data-across-Organizations-with-Qubole-2
Qubole and S3
36. • Introduction to Presto
• Brief Introduction to Kinesis
• Qubole’s Value add to Kinesis
• Brief introduction to Redshift
• Qubole’s Value add to Redshift
• When to use which system
Presto, Kinesis and RedShift
Presto, Kinesis, RedShift
37. • Interactive SQL Query Engine for Big Data
• Open-sourced by Facebook in late 2013
• Follows ANSI-SQL
• Own execution engine
Presto, What is it?
39. ● Smooth learning curve as it adheres to ANSI-SQL
● Active open source community
● Proven at scale at companies like Facebook, Netflix, Airbnb
Why not something else?
41. ● Kinesis Connector
● S3 Optimizations
● Insert Support
● UDF support
Qubole’s Contribution
42. ● Average times across the performance tests by MediaMath on a 22 TB text-format table, partitioned on date, queried on a partition with ~1.2 billion rows
http://www.qubole.com/blog/big-data/performance-testing-presto/
Comparison with Hive
43. ● High Capacity Pipe for Real-Time Processing
● Key Concepts
○ Record
○ Streams
○ Shards
○ Checkpoints
Kinesis
44. ● Streaming use case
○ Spark
● SQL use case
○ Via Hive
○ Via Presto
Qubole and Kinesis
50. Two ways to access Redshift via Presto
● Via Hive JDBC Storage Handler
Presto-Redshift integration
51. Two ways to access Redshift via Presto
● Via JDBC connector
Presto-Redshift integration
52. ● Single Platform
● Consistent User Interface
● Cross-source Joins without data consolidation
New Opportunities
53. ● Hive: ETL, huge Joins, Group By on high cardinality columns
● Redshift: Interactive Queries when data loading is acceptable
● Presto:
○ Interactive Queries
○ Direct Queries without loading
○ Joining data across Redshift, Kinesis, S3, MySQL, Cassandra, etc.
Hive? Presto? RedShift?