Big Data as a Service
Joydeep Sen Sarma
Hariharan Iyer
Shubham Tagra
Introduction
• Introduction to Qubole
• How Qubole integrates with AWS:
– EC2
– S3
– RedShift
– Kinesis
• Hive vs. Presto vs. RedShift vs. Spark
• Qubole vs. EMR
3
Agenda
Introduction
• Founded ~ 10/2011
• Team:
– Founding crew: initial authors of Apache Hive, ran Data @Facebook
– + Notable alumni from Greenplum/Vertica/EngineYard/Oracle/AWS etc.
– + 50 engineers + 20 sales/marketing across Bangalore/Palo Alto
• Financing:
– Completed Series-B 10/2014
– LightSpeed, Charles-River, Norwest, Anand/Venky
• Product: Qubole Data Service
– Big Data as a Service
– AWS/GCE/Azure
4
About Qubole
Introduction
5
Customers
Introduction
Qubole Data Service
7
Introduction
Qubole Data Service
Introduction
• Self-Serve Big Data Analytics:
– Lack of Hadoop trained IT/engineers
– Team of Analysts
• Lowest TCO
– Cloud Optimized - takes full advantage of AWS
• Unified Platform for all Tools:
– Hive/Pig/Spark/SQL/Map-Reduce/Cascading/…
– Pick and Choose. Combine and Use
• Awesome Support and Solutions
8
Why Qubole
Self Service Analytics: Direct Access to Big Data
9
Self Service Analytics: Manage Clusters Easily
10
Self Service Analytics: Schedule Jobs
11
Self Service Analytics: Self Serve Dashboards using Notebooks
12
Qubole and EC2
• Custom AMIs for much faster boot-up
• Auto-termination
• Auto-scaling
• Spot Instances
• EBS
14
EC2 Magic Sauce
• Auto-start and termination
– Cluster starts automatically when you need to run a command
– Intelligent: no cluster required for metadata commands
– Terminated after a couple of hours of idle time
• Auto-scaling
– Min Size <= Cluster <= Max Size
15
Cluster LifeCycle Management
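The lifecycle rules above can be sketched in a few lines. This is an illustration only, not Qubole's implementation: the names `ClusterPolicy`, `target_size`, `needs_cluster` and `should_terminate` are hypothetical, and the two-hour idle timeout is an assumed stand-in for "a couple of hours".

```python
from dataclasses import dataclass


@dataclass
class ClusterPolicy:
    min_size: int
    max_size: int
    idle_timeout_s: float = 2 * 3600  # assumed value for "a couple of hours"


def target_size(policy: ClusterPolicy, desired_nodes: int) -> int:
    """Clamp the size the workload asks for into [min_size, max_size]."""
    return max(policy.min_size, min(desired_nodes, policy.max_size))


def needs_cluster(command_kind: str) -> bool:
    """Metadata-only commands are served without starting a cluster."""
    return command_kind != "metadata"


def should_terminate(policy: ClusterPolicy, idle_seconds: float) -> bool:
    """Auto-terminate once the cluster has been idle for the timeout."""
    return idle_seconds >= policy.idle_timeout_s
```

With a policy of 400 to 800 nodes, a request for 1000 nodes is capped at 800 and a request for 100 is raised to 400.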
Qubole and EC2
16
[Diagram: the query SELECT * FROM FOO JOIN BAR ON BAZ = ... running across five slave nodes, each with map slots and reduce slots]
Auto-Scaling
• Upscaling
– Engine-specific algorithms
– Cannot just look at expected time (parallelism matters)
• Downscaling
– Decommissioning takes time
– Need to consider hour boundary
– Stuck on mapper output
• Output offloading
• AWS Integration
– Hour boundary
– Eventual consistency
17
Why is it hard?
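A minimal sketch of the downscaling checks listed above, assuming hourly billing per node. The function names and the ten-minute release window are hypothetical, not taken from the talk.

```python
def minutes_into_billing_hour(uptime_s: float) -> float:
    """Minutes elapsed in the node's current billed hour."""
    return (uptime_s % 3600) / 60.0


def can_release(uptime_s: float,
                running_tasks: int,
                has_unfetched_map_output: bool,
                window_min: float = 10.0) -> bool:
    """Release a node only when doing so wastes little paid time and loses no work.

    - near the hour boundary: the hour is already paid for, so leaving early saves nothing
    - no running tasks: decommissioning takes time, so only idle nodes qualify
    - no unfetched mapper output: reducers may still need it unless it was offloaded
    """
    near_boundary = minutes_into_billing_hour(uptime_s) >= 60.0 - window_min
    return near_boundary and running_tasks == 0 and not has_unfetched_map_output
```

A node 55 minutes into its hour with nothing running and no pending mapper output would be released; the same node at 20 minutes would be kept.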
Qubole and EC2
Min Cluster Size: 400
Max Cluster Size: 800
Time for which cluster size < max size: 49%
18
But it pays off!
Qubole and EC2
19
But it pays off!
Expected compute hours   Compute hours saved   Savings (%)
2902246.2                2107311.01            72.6
4655815.5                2105486.11            45.2
1698052.65               1658738.375           97.6
1776944.4                1476547.835           83.1
2063127.85               838628.7              40.6
919721.25                613630.955            66.7
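The Savings (%) column appears to be compute hours saved divided by expected compute hours. That reading is an assumption, and the helper name below is hypothetical; the sketch just makes the arithmetic explicit.

```python
def savings_percent(expected_hours: float, saved_hours: float) -> float:
    """Share of the expected compute hours that auto-scaling avoided, in percent."""
    return 100.0 * saved_hours / expected_hours


# First two rows of the table above
row1 = savings_percent(2902246.2, 2107311.01)
row2 = savings_percent(4655815.5, 2105486.11)
```

Rounded to one decimal place, these reproduce the first two values in the Savings (%) column.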
Qubole and EC2
• Various configurations
– All nodes on-demand
– Minimum nodes on-demand, rest combination of on-demand & spot
– All nodes spot
• Minimum set has a higher bid price => less likely to be lost
– Up to 90% savings compared to on-demand price
20
Spot instances
Qubole and EC2
• Not always available
– Fall back to on-demand
• Increases overall cost of cluster
– Periodically replace extra on-demand instances when spot available
• Can go away at any time
– Hadoop has built-in resiliency
– Place replicas on stable instances
• Auto-scaling
– Must maintain requested ratio
21
Why is it hard?
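One way to picture the bookkeeping behind these points, as a sketch under stated assumptions: `spot_ratio` is the requested share of spot nodes above the on-demand minimum, and all three function names are hypothetical.

```python
def split_request(extra_nodes: int, spot_ratio: float):
    """Split nodes above the on-demand minimum into (spot, on_demand)."""
    spot = round(extra_nodes * spot_ratio)
    return spot, extra_nodes - spot


def with_fallback(wanted_spot: int, wanted_on_demand: int, spot_granted: int):
    """If spot capacity falls short, make up the difference with on-demand nodes."""
    granted = min(wanted_spot, spot_granted)
    shortfall = wanted_spot - granted
    return granted, wanted_on_demand + shortfall


def rebalance_count(spot_now: int, extra_on_demand_now: int, spot_ratio: float) -> int:
    """How many extra on-demand nodes to swap for spot to restore the ratio."""
    total = spot_now + extra_on_demand_now
    target_spot = round(total * spot_ratio)
    return max(0, target_spot - spot_now)
```

Asking for 10 extra nodes at a ratio of 0.8 requests 8 spot and 2 on-demand. If only 5 spot nodes are granted, the cluster runs 5 spot and 5 on-demand, and a later rebalance would swap 3 on-demand nodes for spot.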
Qubole and EC2
• Useful for newer instance types - c3/m3
– Low ephemeral storage
• Better performance/$
– Compared to older instance types with more storage
• Write placement is changed
– Minimize writes to EBS volumes
– Use them only when ephemeral is near full
22
Elastic Block Store
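The write policy above amounts to a simple placement rule. A minimal sketch, where the function name and the 90% "near full" threshold are assumptions made for illustration:

```python
def pick_write_target(ephemeral_used_bytes: int,
                      ephemeral_total_bytes: int,
                      near_full: float = 0.9) -> str:
    """Prefer ephemeral storage; spill to EBS only when ephemeral is nearly full."""
    if ephemeral_total_bytes <= 0:
        return "ebs"  # no ephemeral storage at all
    used_fraction = ephemeral_used_bytes / ephemeral_total_bytes
    return "ebs" if used_fraction >= near_full else "ephemeral"
```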
Qubole and EC2
• Spot Fleets
• EBS-only instances
23
What’s coming next
Qubole and EC2
Resources
• Auto-scaling Hadoop Clusters in Qubole
• Spot Instances in Qubole Clusters
• Rebalancing Hadoop Clusters for Better Spot Utilization
• Improved Performance with Low-Ephemeral-Storage Instances
24
Qubole and S3
• No real Rename (aka Move) operation
– renames are copies and expensive!
• S3 connection establishment is expensive
– i.e., small calls like getObjectDetails(key) and small reads are expensive!
• S3 has bulk prefix listing
– listObjectsChunked(startKey, maxListing)
• Puts are atomic
– Object is created only when the upload completes
– Unlike HDFS where files are created on first write
• MultiPart!
26
S3 != HDFS
Qubole and S3
• Naive: one listing call per path
    for path in ['/x/y/a', '/x/y/b', '/x/z/c', … ]:
        result << listObject(path)
• Smart: one bulk prefix listing, filtered on the client
    pathList = listPrefix('/x')
    while (entry = pathList.next()):
        if entry in ['/x/y/a', '/x/y/b', '/x/z/c', … ]:
            result << entry
• Up to 1000x improvement
27
Prefix Listing
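The two listing strategies can be compared with a runnable sketch. `FakeStore` is a hypothetical in-memory stand-in for an object store that only counts remote calls; it is not the S3 API, and the page size of 1000 is an assumption.

```python
class FakeStore:
    """In-memory stand-in for an object store; counts simulated remote calls."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.calls = 0

    def list_object(self, key):
        self.calls += 1  # one round trip per key
        return [key] if key in self.keys else []

    def list_prefix(self, prefix, page_size=1000):
        matches = [k for k in self.keys if k.startswith(prefix)]
        # one round trip per page of results, at least one
        self.calls += max(1, -(-len(matches) // page_size))
        return matches


def naive_listing(store, paths):
    """One call per path."""
    result = []
    for path in paths:
        result.extend(store.list_object(path))
    return result


def smart_listing(store, prefix, paths):
    """One bulk listing under the common prefix, filtered locally."""
    wanted = set(paths)
    return [key for key in store.list_prefix(prefix) if key in wanted]
```

For 500 paths under `/x`, the naive version makes 500 calls while the prefix version makes one, and both return the same set of keys.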
Qubole and S3
• Split Computation
– Divide input files into tasks for Map-Reduce/Spark/Presto
• Recovering Partitions
• List Paths matching regex pattern (‘/x/y/z/*/*’)
• and many more ..
28
Prefix Listing - Use Cases
Qubole and S3
• Normally:
– Write data to temporary location - atomically rename to final location
• With S3:
– Write data to final location
– Atomic puts deal with speculation/retries
– Optional: Remove on Failure
• Enabled by default in Hive; DirectFileOutputCommitter in MR/Spark
• Tricky: retries/speculation must use same path
29
Direct Writes
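A sketch of why retries and speculative attempts must share one path: if the output key depends only on the task, a second attempt overwrites the same object instead of leaving a duplicate. `FakeObjectStore`, `final_path` and `run_task` are hypothetical names, and the `part-00007` naming is an assumption.

```python
class FakeObjectStore:
    """In-memory stand-in for a store with atomic puts."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        # The object becomes visible only once the whole payload is stored.
        self.objects[key] = data


def final_path(table_dir: str, task_id: int) -> str:
    """The key depends on the task only, never on the attempt number."""
    return "%s/part-%05d" % (table_dir, task_id)


def run_task(store, table_dir, task_id, attempt, produce):
    """Write the task output straight to its final location."""
    key = final_path(table_dir, task_id)
    store.put(key, produce(attempt))
    return key
```

Running attempts 0 and 1 of the same task leaves exactly one object in the store.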
Qubole and S3
• Naive: [diagram: Client / S3]
• Smart: [diagram: Client / S3]
30
Pre-Fetching
S3 Local Disk Cache (Presto)
31
S3 Local Disk Cache (Presto)
32
Qubole and S3
• Populating Cache while performing Query may cause Slowness
• Large Files are split
– Cache Files or Splits?
• Should Caching Combine Small Files?
• Should Caching transform data into Columnar?
Watch out for Table Copies!
33
Why S3 Caching is hard
Qubole and S3
• Handle S3 Timeouts and Exceptions (truncated streams)
• Optimize away seek() operations
• Data Sharing across Organizations using Cross-Account Roles
34
Miscellaneous
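One common way to optimize away seek() calls is to make them lazy: remember the requested offset and open a new ranged read only when data is actually needed. The sketch below shows that general technique and is not necessarily what Qubole implemented; `LazySeekReader` and `open_at` are hypothetical names.

```python
import io


class LazySeekReader:
    """seek() only records the offset; a new stream is opened on the next read()."""

    def __init__(self, open_at):
        self._open_at = open_at  # callable(offset) -> file-like; models a ranged GET
        self._stream = None
        self._stream_pos = 0     # offset the open stream is positioned at
        self._pos = 0            # offset the caller asked for
        self.opens = 0           # number of simulated remote opens

    def seek(self, offset):
        self._pos = offset       # no remote call here

    def read(self, n):
        if self._stream is None or self._stream_pos != self._pos:
            self._stream = self._open_at(self._pos)
            self._stream_pos = self._pos
            self.opens += 1
        data = self._stream.read(n)
        self._stream_pos += len(data)
        self._pos = self._stream_pos
        return data


payload = bytes(range(100))
reader = LazySeekReader(lambda offset: io.BytesIO(payload[offset:]))
```

Three seeks followed by one read cost a single open, and a seek to the current position costs nothing.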
Resources
● S3 Optimizations in Hive
○ http://www.qubole.com/optimizing-hadoop-for-s3-part-1/
● Caching in Presto
○ http://www.qubole.com/blog/product/caching-presto/
● Qubole vs. EMR Performance Comparison
○ http://www.qubole.com/a-performance-comparison-of-qubole-and-amazons-elastic-mapreduce-emr/
● Data Sharing via AWS Roles:
○ https://qubole-eng.quora.com/Securely-sharing-data-across-Organizations-with-Qubole-2
Qubole and S3
• Introduction to Presto
• Brief Introduction to Kinesis
• Qubole’s Value add to Kinesis
• Brief introduction to Redshift
• Qubole’s Value add to Redshift
• When to use which system
37
Presto, Kinesis and RedShift
Presto, Kinesis, RedShift
• Interactive SQL Query Engine for Big Data
• Open-sourced by Facebook in late 2013
• Follows ANSI-SQL
• Own execution engine
38
Presto, What is it?
Presto, Kinesis, RedShift
● Extensibility
○ Pluggable Datasources
● Performance
○ In-memory Execution
○ Aggressive Pipelining
○ Highly efficient Java code
○ Dynamic Query Compilation
○ Vectorization
39
Why Presto?
Presto, Kinesis, RedShift
● Smooth learning curve as it adheres to ANSI-SQL
● Active open source community
● Proven worth at scale in companies like Facebook, Netflix, Airbnb
40
Why not something else?
Presto, Kinesis, RedShift
● Self managed Presto clusters
● Auto-configured
● Autoscaling
● Data Caching
41
Benefits of Presto @Qubole
Presto, Kinesis, RedShift
● Kinesis Connector
● S3 Optimizations
● Insert Support
● UDF support
42
Qubole’s Contribution
Presto, Kinesis, RedShift
● Average times across performance tests run by MediaMath on a 22 TB text-format table, partitioned by date, querying a partition with ~1.2 billion rows
http://www.qubole.com/blog/big-data/performance-testing-presto/
43
Comparison with Hive
Presto, Kinesis, RedShift
● High Capacity Pipe for Real-Time Processing
● Key Concepts
○ Record
○ Streams
○ Shards
○ Checkpoints
44
Kinesis
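The key concepts above can be illustrated with a toy model. Kinesis maps a record's partition key to a shard through an MD5 hash over a 128-bit key space. Everything else here is a simplification: the class and function names are hypothetical, shards are assumed to split the hash space evenly, and sequence numbers are modelled as list indexes.

```python
import hashlib

HASH_SPACE = 2 ** 128  # size of the MD5 key space


def shard_for(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard, assuming evenly split hash key ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") * num_shards // HASH_SPACE


class ToyStream:
    """A stream is a set of shards; each shard is an ordered list of records."""

    def __init__(self, num_shards: int):
        self.shards = [[] for _ in range(num_shards)]

    def put(self, partition_key: str, data: bytes):
        shard = shard_for(partition_key, len(self.shards))
        self.shards[shard].append(data)
        return shard, len(self.shards[shard]) - 1  # (shard, sequence number)

    def read_after(self, shard: int, checkpoint: int):
        """Resume from a checkpoint: return records after the last processed one."""
        return self.shards[shard][checkpoint + 1:]
```

Records with the same partition key land in the same shard in order, and a consumer that checkpointed the first record resumes with the second.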
Presto, Kinesis, RedShift
● Streaming use case
○ Spark
● SQL use case
○ Via Hive
○ Via Presto
45
Qubole and Kinesis
Presto, Kinesis, RedShift
● Kinesis Connector
46
Presto-Kinesis Integration
Presto, Kinesis, RedShift
● Example
○ Step1: Define Schema
47
Presto-Kinesis Integration
Presto, Kinesis, RedShift
○ Step2: Run Query
48
Presto-Kinesis Integration
● Data warehouse service
● OLAP
● Storage + Compute
49
Redshift
Presto, Kinesis, RedShift
● ETL use case
○ DBImport
○ DBExport
● Ad hoc query use case
○ Direct Query
○ Hive
○ Presto
50
Qubole and Redshift
Presto, Kinesis, RedShift
Two ways to access Redshift via Presto
● Via Hive JDBC Storage Handler
51
Presto-Redshift integration
Presto, Kinesis, RedShift
Two ways to access Redshift via Presto
● Via JDBC connector
52
Presto-Redshift integration
Presto, Kinesis, RedShift
● Single Platform
● Consistent User Interface
● Cross-source Joins without data consolidation
53
New Opportunities
Presto, Kinesis, RedShift
● Hive: ETL, huge Joins, Group By on high cardinality columns
● Redshift: Interactive Queries when data loading is acceptable
● Presto:
○ Interactive Queries
○ Direct Queries without loading
○ Joining data across Redshift, Kinesis, S3, MySQL, Cassandra, etc.
54
Hive? Presto? RedShift?
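The guidance above can be condensed into a rough decision rule. This is a hypothetical sketch: the function name, the parameters and the order of the checks are assumptions, and real workloads need more nuance.

```python
def suggest_engine(workload: str,
                   data_in_redshift: bool = False,
                   sources: int = 1) -> str:
    """Pick an engine from coarse workload traits.

    workload: "etl" for heavy ETL / huge joins / high-cardinality group-bys,
              "interactive" for ad hoc queries.
    data_in_redshift: True when loading the data into Redshift is acceptable.
    sources: number of distinct data sources the query joins across.
    """
    if workload == "etl":
        return "Hive"
    if sources > 1:
        return "Presto"  # cross-source joins without consolidating the data
    if workload == "interactive" and data_in_redshift:
        return "Redshift"
    if workload == "interactive":
        return "Presto"  # direct queries without loading
    return "Hive"
```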
Presto, Kinesis, RedShift
● Iterative Machine Learning
● In-Memory Computing
● Spark Streaming
55
Got Spark?
Presto, Kinesis, RedShift
Resources
● Presto vs Hive
○ http://www.qubole.com/blog/big-data/performance-testing-presto/
● Presto Kinesis Integration
○ http://blogs.aws.amazon.com/bigdata/post/Tx2DDFNHXSAAH2G/Presto-Amazon-Kinesis-Connector-for-Interactively-Querying-Streaming-Data
● Presto Kinesis Connector
○ https://github.com/qubole/presto-kinesis
● Hive Redshift/JDBC Connector (blog post)
○ http://www.qubole.com/blog/product/hive-jdbc-storage-handler/
● Hive Redshift/JDBC Connector (source code)
○ https://github.com/qubole/Hive-JDBC-Storage-Handler
Thanks!
joydeep@qubole.com, hiyer@qubole.com, stagra@qubole.com
info@qubole.com

Más contenido relacionado

La actualidad más candente

Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Tugdual Grall
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesKelly Technologies
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick viewRajesh Nadipalli
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentationargonauts007
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!Databricks
 
Hadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureHadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureRyan Hennig
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillDataWorks Summit
 
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...Qubole
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseMichael Stack
 
Scaling Traffic from 0 to 139 Million Unique Visitors
Scaling Traffic from 0 to 139 Million Unique VisitorsScaling Traffic from 0 to 139 Million Unique Visitors
Scaling Traffic from 0 to 139 Million Unique VisitorsYelp Engineering
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...huguk
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Sujee Maniyam
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop EcosystemLior Sidi
 
Tuning up with Apache Tez
Tuning up with Apache TezTuning up with Apache Tez
Tuning up with Apache TezGal Vinograd
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksDataWorks Summit
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEDataWorks Summit/Hadoop Summit
 
Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureHadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureDataWorks Summit
 

La actualidad más candente (20)

Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015Proud to be Polyglot - Riviera Dev 2015
Proud to be Polyglot - Riviera Dev 2015
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Hadoop and HBase @eBay
Hadoop and HBase @eBayHadoop and HBase @eBay
Hadoop and HBase @eBay
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentation
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Hadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and FutureHadoop @ eBay: Past, Present, and Future
Hadoop @ eBay: Past, Present, and Future
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...
Data Warehouse Modernization - Big Data in the Cloud Success with Qubole on O...
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBaseHBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase
 
Scaling Traffic from 0 to 139 Million Unique Visitors
Scaling Traffic from 0 to 139 Million Unique VisitorsScaling Traffic from 0 to 139 Million Unique Visitors
Scaling Traffic from 0 to 139 Million Unique Visitors
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2
 
Hadoop Ecosystem
Hadoop EcosystemHadoop Ecosystem
Hadoop Ecosystem
 
Tuning up with Apache Tez
Tuning up with Apache TezTuning up with Apache Tez
Tuning up with Apache Tez
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNEGenerating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
 
Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureHadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and Future
 

Destacado

Qubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europeQubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europeJoydeep Sen Sarma
 
Getting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataGetting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataQubole
 
5 Crucial Considerations for Big data adoption
5 Crucial Considerations for Big data adoption5 Crucial Considerations for Big data adoption
5 Crucial Considerations for Big data adoptionQubole
 
Atlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesAtlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesQubole
 
Nw qubole overview_033015
Nw qubole overview_033015Nw qubole overview_033015
Nw qubole overview_033015Michael Mersch
 
Unlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWSUnlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWSAmazon Web Services
 
Qubole - Big data in cloud
Qubole - Big data in cloudQubole - Big data in cloud
Qubole - Big data in cloudDmitry Tolpeko
 
Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup   Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup Qubole
 
Cortana Analytics Suite
Cortana Analytics SuiteCortana Analytics Suite
Cortana Analytics SuiteJames Serra
 
Azure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsAzure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsITProceed
 
BIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - QuboleBIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - QuboleQubole
 
Fortinet Automates Migration onto Layered Secure Workloads
Fortinet Automates Migration onto Layered Secure WorkloadsFortinet Automates Migration onto Layered Secure Workloads
Fortinet Automates Migration onto Layered Secure WorkloadsAmazon Web Services
 
Azure ARM’d and Ready
Azure ARM’d and ReadyAzure ARM’d and Ready
Azure ARM’d and Readymscug
 
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...NoSQLmatters
 
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...Amazon Web Services
 
Big Data at Pinterest - Presented by Qubole
Big Data at Pinterest - Presented by QuboleBig Data at Pinterest - Presented by Qubole
Big Data at Pinterest - Presented by QuboleQubole
 
Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...yalisassoon
 

Destacado (20)

Qubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europeQubole hadoop-summit-2013-europe
Qubole hadoop-summit-2013-europe
 
Getting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big DataGetting to 1.5M Ads/sec: How DataXu manages Big Data
Getting to 1.5M Ads/sec: How DataXu manages Big Data
 
5 Crucial Considerations for Big data adoption
5 Crucial Considerations for Big data adoption5 Crucial Considerations for Big data adoption
5 Crucial Considerations for Big data adoption
 
Atlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slidesAtlanta Data Science Meetup | Qubole slides
Atlanta Data Science Meetup | Qubole slides
 
Nw qubole overview_033015
Nw qubole overview_033015Nw qubole overview_033015
Nw qubole overview_033015
 
Unlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWSUnlocking Self-Service Big Data Analytics on AWS
Unlocking Self-Service Big Data Analytics on AWS
 
Qubole - Big data in cloud
Qubole - Big data in cloudQubole - Big data in cloud
Qubole - Big data in cloud
 
Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup   Qubole presentation for the Cleveland Big Data and Hadoop Meetup
Qubole presentation for the Cleveland Big Data and Hadoop Meetup
 
RDO-Packstack Workshop
RDO-Packstack Workshop RDO-Packstack Workshop
RDO-Packstack Workshop
 
Cortana Analytics Suite
Cortana Analytics SuiteCortana Analytics Suite
Cortana Analytics Suite
 
Azure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsAzure stream analytics by Nico Jacobs
Azure stream analytics by Nico Jacobs
 
BIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - QuboleBIPD Tech Tuesday Presentation - Qubole
BIPD Tech Tuesday Presentation - Qubole
 
Creating a fortigate vpn network & security blog
Creating a fortigate vpn   network & security blogCreating a fortigate vpn   network & security blog
Creating a fortigate vpn network & security blog
 
Fortinet Automates Migration onto Layered Secure Workloads
Fortinet Automates Migration onto Layered Secure WorkloadsFortinet Automates Migration onto Layered Secure Workloads
Fortinet Automates Migration onto Layered Secure Workloads
 
Azure ARM’d and Ready
Azure ARM’d and ReadyAzure ARM’d and Ready
Azure ARM’d and Ready
 
Azure Document Db
Azure Document DbAzure Document Db
Azure Document Db
 
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
Benjamin Guinebertière - Microsoft Azure: Document DB and other noSQL databas...
 
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
AWS re:Invent 2016: How DataXu scaled its Attribution System to handle billio...
 
Big Data at Pinterest - Presented by Qubole
Big Data at Pinterest - Presented by QuboleBig Data at Pinterest - Presented by Qubole
Big Data at Pinterest - Presented by Qubole
 
Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...
 

Similar a Qubole @ AWS Meetup Bangalore - July 2015

Big Data Day LA 2015 - Lessons learned from scaling Big Data in the Cloud by...
Big Data Day LA 2015 -  Lessons learned from scaling Big Data in the Cloud by...Big Data Day LA 2015 -  Lessons learned from scaling Big Data in the Cloud by...
Big Data Day LA 2015 - Lessons learned from scaling Big Data in the Cloud by...Data Con LA
 
Efficient Kubernetes scaling using Karpenter
Efficient Kubernetes scaling using KarpenterEfficient Kubernetes scaling using Karpenter
Efficient Kubernetes scaling using KarpenterMarko Bevc
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted CloudColin Charles
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
 
Lc3 beijing-june262018-sahdev zala-guangya
Lc3 beijing-june262018-sahdev zala-guangyaLc3 beijing-june262018-sahdev zala-guangya
Lc3 beijing-june262018-sahdev zala-guangyaSahdev Zala
 
FOSS4G In The Cloud: Using Open Source to build Cloud based Spatial Infrastru...
FOSS4G In The Cloud: Using Open Source to build Cloud based Spatial Infrastru...FOSS4G In The Cloud: Using Open Source to build Cloud based Spatial Infrastru...
FOSS4G In The Cloud: Using Open Source to build Cloud based Spatial Infrastru...Mohamed Sayed
 
Leveraging Databricks for Spark Pipelines
Leveraging Databricks for Spark PipelinesLeveraging Databricks for Spark Pipelines
Leveraging Databricks for Spark PipelinesRose Toomey
 
Leveraging Databricks for Spark pipelines
Leveraging Databricks for Spark pipelinesLeveraging Databricks for Spark pipelines
Leveraging Databricks for Spark pipelinesRose Toomey
 
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...Radhika Puthiyetath
 
HPCC Systems 6.0.0 Highlights
HPCC Systems 6.0.0 HighlightsHPCC Systems 6.0.0 Highlights
HPCC Systems 6.0.0 HighlightsHPCC Systems
 
Cloud Overview
Cloud OverviewCloud Overview
Cloud Overviewiasaglobal
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computingiasaglobal
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...
H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...Lucidworks
 
Going Serverless - an Introduction to AWS Glue
Going Serverless - an Introduction to AWS GlueGoing Serverless - an Introduction to AWS Glue
Going Serverless - an Introduction to AWS GlueMichael Rainey
 
Metail at Cambridge AWS User Group Main Meetup #3
Metail at Cambridge AWS User Group Main Meetup #3Metail at Cambridge AWS User Group Main Meetup #3
Metail at Cambridge AWS User Group Main Meetup #3Gareth Rogers
 
Oracle Solutions on AWS : May 2014
Oracle Solutions on AWS : May 2014Oracle Solutions on AWS : May 2014
Oracle Solutions on AWS : May 2014Tom Laszewski
 
Journey of Kubernetes Scaling
Journey of Kubernetes ScalingJourney of Kubernetes Scaling
Journey of Kubernetes ScalingOpsta
 

Similar a Qubole @ AWS Meetup Bangalore - July 2015 (20)

Big Data Day LA 2015 - Lessons learned from scaling Big Data in the Cloud by...
Big Data Day LA 2015 -  Lessons learned from scaling Big Data in the Cloud by...Big Data Day LA 2015 -  Lessons learned from scaling Big Data in the Cloud by...
Big Data Day LA 2015 - Lessons learned from scaling Big Data in the Cloud by...
 
Efficient Kubernetes scaling using Karpenter
Efficient Kubernetes scaling using KarpenterEfficient Kubernetes scaling using Karpenter
Efficient Kubernetes scaling using Karpenter
 
MySQL in the Hosted Cloud
MySQL in the Hosted CloudMySQL in the Hosted Cloud
MySQL in the Hosted Cloud
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
Lc3 beijing-june262018-sahdev zala-guangya
Lc3 beijing-june262018-sahdev zala-guangyaLc3 beijing-june262018-sahdev zala-guangya
Lc3 beijing-june262018-sahdev zala-guangya
 
Qubole @ AWS Meetup Bangalore - July 2015

  • 1. Big Data as a Service • Joydeep Sen Sarma • Hariharan Iyer • Shubham Tagra
  • 2. Introduction: Agenda • Introduction to Qubole • How Qubole integrates with AWS: – EC2 – S3 – RedShift – Kinesis • Hive vs. Presto vs. RedShift vs. Spark • Qubole vs. EMR
  • 3. Introduction: About Qubole • Founded ~10/2011 • Team: – Founding crew were initial authors of Apache Hive and ran Data @Facebook – Notable alumni from Greenplum/Vertica/EngineYard/Oracle/AWS etc. – 50 engineers + 20 sales/marketing across Bangalore/Palo Alto • Financing: – Completed Series-B 10/2014 – LightSpeed, Charles-River, Norwest, Anand/Venky • Product: Qubole Data Service – Big Data as a Service – AWS/GCE/Azure
  • 7. Introduction: Why Qubole • Self-Serve Big Data Analytics: – Lack of Hadoop-trained IT/engineers – Team of Analysts • Lowest TCO – Cloud Optimized: takes full advantage of AWS • Unified Platform for all Tools: – Hive/Pig/Spark/SQL/Map-Reduce/Cascading/… – Pick and Choose. Combine and Use • Awesome Support and Solutions
  • 8. Self Service Analytics: Direct Access to Big Data
  • 9. Self Service Analytics: Manage Clusters Easily
  • 10. Self Service Analytics: Schedule Jobs
  • 11. Self Service Analytics: Self Serve Dashboards using Notebooks
  • 13. Qubole and EC2: EC2 Magic Sauce • Custom AMIs for much faster boot-up • Auto-termination • Auto-scaling • Spot Instances • EBS
  • 14. Qubole and EC2: Cluster LifeCycle Management • Auto-start and termination – Cluster starts automatically when you need to run a command • Intelligent: no cluster required for metadata commands – Terminated after a couple of hours of idle time • Auto-scaling – Min Size <= Cluster <= Max Size
  • 15. Auto-Scaling • [diagram: five slave nodes with map slots and reduce slots; query: SELECT * FROM FOO JOIN BAR ON BAZ = ...]
  • 16. Qubole and EC2: Why is it hard? • Upscaling – Engine-specific algorithms – Cannot just look at expected time (parallelism matters) • Downscaling – Decommissioning takes time – Need to consider hour boundary – Stuck on mapper output • Output offloading • AWS Integration – Hour boundary – Eventual consistency
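The "hour boundary" point on the downscaling slide can be illustrated with a small sketch. It assumes instances are billed per started hour, so releasing a node just before it completes a billed hour wastes the least paid time. The function names, the 10-minute window, and the node ids are hypothetical and are not taken from the deck or from Qubole's implementation.

```python
HOUR = 3600


def seconds_to_hour_boundary(uptime_seconds):
    """Seconds left until the node finishes its current billed hour."""
    return HOUR - uptime_seconds % HOUR


def pick_nodes_to_release(uptimes, surplus, window_seconds=600):
    """Choose up to `surplus` nodes that are close to an hour boundary.

    uptimes: dict mapping node id -> uptime in seconds (hypothetical input).
    Nodes outside the window are kept, since their current hour is
    already paid for and they may still be useful.
    """
    near_boundary = sorted(
        (seconds_to_hour_boundary(uptime), node)
        for node, uptime in uptimes.items()
        if seconds_to_hour_boundary(uptime) <= window_seconds
    )
    return [node for _, node in near_boundary[:surplus]]


if __name__ == "__main__":
    uptimes = {"a": 3300, "b": 1800, "c": 7100, "d": 3590}
    print(pick_nodes_to_release(uptimes, surplus=2))
```

A real downscaler would also have to wait for decommissioning and for mapper output to be offloaded, as the slide notes; the sketch covers only the timing decision.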
  • 17. Qubole and EC2: But it pays off! • Min Cluster Size: 400 • Max Cluster Size: 800 • Time for which cluster size < max size: 49%
  • 18. Qubole and EC2: But it pays off!
      Expected compute hours | Compute hours saved | Savings (%)
      2902246.2              | 2107311.01          | 72.6
      4655815.5              | 2105486.11          | 45.2
      1698052.65             | 1658738.375         | 97.6
      1776944.4              | 1476547.835         | 83.1
      2063127.85             | 838628.7            | 40.6
      919721.25              | 613630.955          | 66.7
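The savings column on slide 18 appears to be the hours saved divided by the expected hours. The short check below recomputes it from the figures on the slide; the formula is an inference from the numbers, not something the deck states, and the slide's percentages agree with it only to within about 0.1 percentage points.

```python
# (expected compute hours, compute hours saved, savings % as printed on the slide)
ROWS = [
    (2902246.2, 2107311.01, 72.6),
    (4655815.5, 2105486.11, 45.2),
    (1698052.65, 1658738.375, 97.6),
    (1776944.4, 1476547.835, 83.1),
    (2063127.85, 838628.7, 40.6),
    (919721.25, 613630.955, 66.7),
]


def savings_percent(expected_hours, saved_hours):
    """Share of the expected compute hours that was saved, in percent."""
    return 100.0 * saved_hours / expected_hours


if __name__ == "__main__":
    for expected, saved, reported in ROWS:
        computed = savings_percent(expected, saved)
        print(f"{expected:>12.2f} {saved:>12.3f} "
              f"computed={computed:.2f} slide={reported}")
```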
  • 19. Qubole and EC2: Spot instances • Various configurations – All nodes on-demand – Minimum nodes on-demand, rest a combination of on-demand & spot – All nodes spot • Minimum set has higher bid price => less likely to lose – Up to 90% savings compared to on-demand price
  • 20. Qubole and EC2: Why is it hard? • Not always available – Fall back to on-demand • Increases overall cost of cluster – Periodically replace extra on-demand instances when spot available • Can go away at any time – Hadoop has built-in resiliency – Place replicas on stable instances • Auto-scaling – Must maintain requested ratio
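One way to picture the "minimum nodes on-demand, rest a mix" configuration with on-demand fallback is the sketch below. The function name, the percentage parameter, and the idea of reporting a fallback count for later rebalancing are illustrative assumptions, not Qubole's actual algorithm.

```python
def plan_nodes(target_size, min_size, spot_percent, spot_available):
    """Split a target cluster size into on-demand and spot nodes.

    The first `min_size` nodes are always on-demand. Of the nodes beyond
    the minimum, `spot_percent` percent are requested as spot; whatever
    the spot market cannot supply falls back to on-demand.
    """
    size = max(target_size, min_size)
    extra = size - min_size
    wanted_spot = extra * spot_percent // 100
    spot = min(wanted_spot, spot_available)
    return {
        "on_demand": size - spot,
        "spot": spot,
        # Extra on-demand nodes that a periodic rebalancer could swap
        # for spot nodes once they become available again.
        "fallback_on_demand": wanted_spot - spot,
    }


if __name__ == "__main__":
    print(plan_nodes(target_size=20, min_size=10, spot_percent=50, spot_available=2))
```

Keeping the fallback count separate is what lets an auto-scaler restore the requested ratio later instead of silently drifting toward an all-on-demand cluster.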
  • 21. Qubole and EC2: Elastic Block Store • Useful for newer instance types (c3/m3) – Low ephemeral storage • Better performance/$ – Compared to older instance types with more storage • Writes are changed – Minimize writes to EBS volumes – Use them only when ephemeral is near full
  • 22. Qubole and EC2: What’s coming next • Spot Fleets • EBS-only instances
  • 23. Qubole and EC2: Resources • Auto-scaling Hadoop Clusters in Qubole • Spot Instances in Qubole Clusters • Rebalancing Hadoop Clusters for Better Spot Utilization • Improved Performance with Low-Ephemeral-Storage Instances
  • 25. Qubole and S3: S3 != HDFS • No real Rename (aka Move) operation – renames are copies and expensive! • S3 connection establishment is expensive – i.e., small calls like getObjectDetails(key) and reads are expensive! • S3 has bulk prefix listing – listObjectsChunked(startKey, maxListing) • Puts are atomic – Objects created when object uploaded – Unlike HDFS where files are created on first write • MultiPart!
  • 26. Qubole and S3: Prefix Listing • Up to 1000x improvement
      Naive:
          for path in ['/x/y/a', '/x/y/b', '/x/z/c', … ]:
              result << listObject(path)
      Smart:
          pathList = listPrefix('/x')
          while (entry = pathList.next()):
              if entry in ['/x/y/a', '/x/y/b', '/x/z/c', … ]:
                  result << entry
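The naive and smart listing strategies on slide 26 can be compared with a runnable toy model. `FakeStore` is a hypothetical in-memory stand-in that only counts requests; it is not an S3 client. The page size of 1000 keys is an assumption chosen to match the per-request limit of S3 listings, which is also where an "up to 1000x" reduction in calls would come from.

```python
class FakeStore:
    """In-memory stand-in for an object store that counts requests."""

    def __init__(self, keys):
        self._keys = sorted(keys)
        self._key_set = set(keys)
        self.calls = 0

    def list_object(self, key):
        """One request per key, like a per-path existence check."""
        self.calls += 1
        return [key] if key in self._key_set else []

    def list_prefix(self, prefix, page_size=1000):
        """Yield every key under `prefix`, one request per page of keys."""
        matching = [k for k in self._keys if k.startswith(prefix)]
        self.calls += 1  # first page
        for i, key in enumerate(matching):
            if i > 0 and i % page_size == 0:
                self.calls += 1  # next page
            yield key


def naive_listing(store, wanted):
    result = []
    for path in wanted:
        result.extend(store.list_object(path))
    return result


def smart_listing(store, prefix, wanted):
    wanted_set = set(wanted)
    return [key for key in store.list_prefix(prefix) if key in wanted_set]


if __name__ == "__main__":
    keys = [f"/x/y/{i}" for i in range(2500)]
    naive_store, smart_store = FakeStore(keys), FakeStore(keys)
    naive_listing(naive_store, keys)
    smart_listing(smart_store, "/x", keys)
    print("naive requests:", naive_store.calls)
    print("smart requests:", smart_store.calls)
```

The smart variant pays off when most keys under the prefix are wanted; if the prefix holds many unrelated keys, the filter discards most of each page and the advantage shrinks.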
  • 27. Qubole and S3: Prefix Listing - Use Cases • Split Computation – Divide input files into tasks for Map-Reduce/Spark/Presto • Recovering Partitions • List Paths matching regex pattern ('/x/y/z/*/*') • and many more ..
  • 28. Qubole and S3: Direct Writes • Normally: – Write data to temporary location, then atomically rename to final location • With S3: – Write data to final location – Atomic puts deal with speculation/retries – Optional: Remove on Failure • By default in Hive, DirectFileOutputCommitter in MR/Spark • Tricky: retries/speculation must use same path
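The "retries/speculation must use same path" point on slide 28 can be shown with a toy model. `FakeObjectStore`, `task_output_key`, and the `part-NNNNN` naming are hypothetical; this is not the DirectFileOutputCommitter, only an illustration of why a deterministic key plus an atomic put makes duplicate attempts harmless.

```python
class FakeObjectStore:
    """Toy object store: a put replaces the whole object in one step."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        # Readers see either the old object or the new one, never a partial write.
        self.objects[key] = bytes(data)


def task_output_key(table_path, task_id):
    """Final output key. It depends on the task only, never on the attempt."""
    return f"{table_path}/part-{task_id:05d}"


def run_attempt(store, table_path, task_id, rows, attempt=0):
    """Write a task's output straight to its final location.

    `attempt` is accepted but deliberately unused when building the key:
    a retry or speculative copy overwrites the same object instead of
    leaving a duplicate behind.
    """
    data = "\n".join(rows).encode("utf-8")
    store.put(task_output_key(table_path, task_id), data)


if __name__ == "__main__":
    store = FakeObjectStore()
    run_attempt(store, "/warehouse/t", 3, ["a", "b"], attempt=0)
    run_attempt(store, "/warehouse/t", 3, ["a", "b"], attempt=1)  # speculative copy
    print(sorted(store.objects))
```

If the attempt number were part of the key, both attempts would leave an object behind and the table would contain the task's rows twice.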
  • 29. Qubole and S3: Pre-Fetching • Naive: [diagram: Client, S3] • Smart: [diagram: Client, S3]
  • 30. S3 Local Disk Cache (Presto)
  • 31. S3 Local Disk Cache (Presto)
  • 32. Qubole and S3: Why S3 Caching is hard • Populating Cache while performing Query may cause Slowness • Large Files are split – Cache Files or Splits? • Should Caching Combine Small Files? • Should Caching transform data into Columnar? Watch out for Table Copies!
  • 33. Qubole and S3: Miscellaneous • Handle S3 Timeouts and Exceptions (truncated streams) • Optimize away seek() operations • Data Sharing across Organizations using Cross-Account Roles
  • 34. Qubole and S3: Resources ● S3 Optimizations in Hive ○ http://www.qubole.com/optimizing-hadoop-for-s3-part-1/ ● Caching in Presto ○ http://www.qubole.com/blog/product/caching-presto/ ● Qubole vs. EMR Performance Comparison ○ http://www.qubole.com/a-performance-comparison-of-qubole-and-amazons-elastic-mapreduce-emr/ ● Data Sharing via AWS Roles ○ https://qubole-eng.quora.com/Securely-sharing-data-across-Organizations-with-Qubole-2
  • 36. Presto, Kinesis, RedShift: Presto, Kinesis and RedShift • Introduction to Presto • Brief Introduction to Kinesis • Qubole’s Value add to Kinesis • Brief introduction to Redshift • Qubole’s Value add to Redshift • When to use which system
  • 37. Presto, Kinesis, RedShift: Presto, What is it? • Interactive SQL Query Engine for Big Data • Open-sourced by Facebook in late 2013 • Follows ANSI-SQL • Own execution engine
  • 38. Presto, Kinesis, RedShift: Why Presto? ● Extensibility ○ Pluggable Datasources ● Performance ○ In-memory Execution ○ Aggressive Pipelining ○ Highly efficient Java code ○ Dynamic Query Compilation ○ Vectorization
  • 39. Presto, Kinesis, RedShift: Why not something else? ● Smooth learning curve as it adheres to ANSI-SQL ● Active open source community ● Proven worth at scale in companies like Facebook, Netflix, Airbnb
  • 40. Presto, Kinesis, RedShift: Benefits of Presto @Qubole ● Self-managed Presto clusters ● Auto-configured ● Autoscaling ● Data Caching
  • 41. Presto, Kinesis, RedShift: Qubole’s Contribution ● Kinesis Connector ● S3 Optimizations ● Insert Support ● UDF support
  • 42. Presto, Kinesis, RedShift: Comparison with Hive ● Average times across the performance tests by MediaMath on a 22TB text-format table, partitioned on date, queried on a partition with ~1.2b rows ○ http://www.qubole.com/blog/big-data/performance-testing-presto/
  • 43. Presto, Kinesis, RedShift: Kinesis ● High Capacity Pipe for Real-Time Processing ● Key Concepts ○ Record ○ Streams ○ Shards ○ Checkpoints
  • 44. Presto, Kinesis, RedShift: Qubole and Kinesis ● Streaming use case ○ Spark ● SQL use case ○ Via Hive ○ Via Presto
  • 45. Presto, Kinesis, RedShift: Presto-Kinesis Integration ● Kinesis Connector
  • 46. Presto, Kinesis, RedShift: Presto-Kinesis Integration ● Example ○ Step 1: Define Schema
  • 47. Presto-Kinesis Integration ○ Step 2: Run Query
  • 48. Presto, Kinesis, RedShift: Redshift ● Data warehouse service ● OLAP ● Storage + Compute
  • 49. Presto, Kinesis, RedShift: Qubole and Redshift ● ETL use case ○ DBImport ○ DBExport ● Ad hoc query use case ○ Direct Query ○ Hive ○ Presto
  • 50. Presto, Kinesis, RedShift: Presto-Redshift integration • Two ways to access Redshift via Presto ● Via Hive JDBC Storage Handler
  • 51. Presto, Kinesis, RedShift: Presto-Redshift integration • Two ways to access Redshift via Presto ● Via JDBC connector
  • 52. Presto, Kinesis, RedShift: New Opportunities ● Single Platform ● Consistent User Interface ● Cross-source Joins without data consolidation
  • 53. Presto, Kinesis, RedShift: Hive? Presto? RedShift? ● Hive: ETL, huge Joins, Group By on high cardinality columns ● Redshift: Interactive Queries when data loading is acceptable ● Presto: ○ Interactive Queries ○ Direct Queries without loading ○ Joining data across Redshift, Kinesis, S3, MySQL, Cassandra, etc.
  • 54. Presto, Kinesis, RedShift: Got Spark? ● Iterative Machine Learning ● In-Memory Computing ● Spark Streaming
  • 55. Resources ● Presto vs Hive ○ http://www.qubole.com/blog/big-data/performance-testing-presto/ ● Presto Kinesis Integration ○ http://blogs.aws.amazon.com/bigdata/post/Tx2DDFNHXSAAH2G/Presto-Amazon-Kinesis-Connector-for-Interactively-Querying-Streaming-Data ● Presto Kinesis Connector ○ https://github.com/qubole/presto-kinesis ● Hive Redshift/Jdbc Connector ○ http://www.qubole.com/blog/product/hive-jdbc-storage-handler/ ● Hive Redshift/Jdbc Connector ○ https://github.com/qubole/Hive-JDBC-Storage-Handler