The document discusses Impala releases and roadmaps. It outlines key features released in different Impala versions, including SQL capabilities, performance improvements, and support for additional file formats and data types. It also describes Impala's performance advantages compared to other SQL-on-Hadoop systems and how its approach is expected to increasingly favor performance gains. Lastly, it encourages trying out Impala and engaging with their community.
13. Introduction
Tubular Labs
SaaS Platform for Online Video
Audience Development
(e.g. Big Data for YouTube videos)
David Koblas
VP Engineering, Tubular Labs
14. Overview
This presentation covers the work
Tubular Labs has done to use Impala as
one of the core components of our SaaS
platform. We'll go through the pipeline
for getting data into the system, how
we've distributed responsibility across
AWS instances, and other tips and tricks
for getting real-time responses to our
end-user queries over billions of data
points.
15. User Story: Audience Also Watches
For any YouTube video, can we figure out
who the audience is and what other
videos and channels they are watching?
We also want the ability to slice the
audience by demographic information.
…and have it all run interactively from a
web SaaS platform.
17. Technology Options
• Pre-compute (e.g. MapReduce)
• MySQL or similar
• Data Warehouse
• Impala or Redshift
• Homebrew
18. Impala 0.7
Now we have a technology
…
Make it interactive
…
and make a bet on Cloudera
19. Now We Have A Technology: Time To Make It Fast and Economical
Source: Tubular Labs
20. Pipeline
20
Loading
• Sqoop
- collect data from MySQL
• Hive
- preprocess data
Query
• Impala
- interactive display
• Python
- REST endpoint
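The query side of the pipeline (a Python REST endpoint in front of Impala) could look roughly like the sketch below. It is not Tubular Labs' actual code: `make_app`, `fake_run_query`, the `video_id` parameter, and the shape of the returned rows are all hypothetical, and the Impala call is replaced by a stub so the example runs with only the standard library.

```python
import json
from urllib.parse import parse_qs


def make_app(run_query):
    """Build a WSGI app that answers GET /?video_id=... with JSON.

    run_query is a hypothetical callable (video_id -> list of dicts) that
    would wrap whatever Impala client the deployment actually uses.
    """
    def respond(start_response, status, payload):
        body = json.dumps(payload).encode("utf-8")
        start_response(status, [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    def app(environ, start_response):
        params = parse_qs(environ.get("QUERY_STRING", ""))
        video_id = params.get("video_id", [None])[0]
        if video_id is None:
            return respond(start_response, "400 Bad Request",
                           {"error": "video_id is required"})
        rows = run_query(video_id)
        return respond(start_response, "200 OK",
                       {"video_id": video_id, "rows": rows})

    return app


def fake_run_query(video_id):
    # Stand-in for the Impala query; returns canned rows for illustration.
    return [{"channel": "example-channel", "viewers": 42}]
```

To try it locally, the app can be served with `wsgiref.simple_server.make_server("", 8000, make_app(fake_run_query)).serve_forever()`. Keeping the query function injectable means the endpoint can be tested without a running cluster.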
21. AWS EC2: Node types
• m1.xlarge
- 1.6TB of Instance Storage
- slow IO
• hi1.4xlarge
- 2TB of SSD
- expensive
Note: this would be an i2.4xlarge instance today
22. Managing costs
Problem
• hi1.4xlarge - expensive
• m1.xlarge - slow IO
Solution – HDFS rack replication for separation
• One copy of the data on each rack
• Hive creates tables on m1.xlarge instances
• Impala queries on hi1.4xlarge instances
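One way to get this separation is HDFS rack awareness: Hadoop calls a topology script (configured through `net.topology.script.file.name`) with one or more host addresses and reads back one rack path per address. With the default block placement policy, replicas are spread across more than one rack, so declaring the two instance types as two "racks" puts a copy of each block on both. The sketch below is a minimal topology script under that assumption; the IP addresses and rack names are hypothetical and not taken from the talk.

```python
import sys

# Hypothetical host-to-rack table. Real addresses would come from the
# cluster inventory; the rack names are made up for this example.
RACK_BY_HOST = {
    "10.0.1.11": "/rack-m1",   # m1.xlarge nodes: Hive preprocessing
    "10.0.1.12": "/rack-m1",
    "10.0.2.21": "/rack-hi1",  # hi1.4xlarge nodes: Impala queries
    "10.0.2.22": "/rack-hi1",
}
DEFAULT_RACK = "/default-rack"


def resolve_rack(host):
    """Return the rack path for a host, or the default rack if unknown."""
    return RACK_BY_HOST.get(host, DEFAULT_RACK)


def main(argv):
    # Hadoop passes one or more addresses as arguments and expects
    # one rack path per address on stdout.
    for host in argv:
        print(resolve_rack(host))


if __name__ == "__main__":
    main(sys.argv[1:])
```

This only controls where the data lives. Keeping Hive jobs on the m1.xlarge nodes and Impala daemons on the hi1.4xlarge nodes would still depend on which services are deployed on which instances.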
23. Interactive Performance
Problem
• Large tables take time to scan
• No indexes
• Need to deliver results in < 1 second
Solution – partitioning (duh!)
• Partitions are targeted to be between 100 and 200 MB
• The query log is your friend
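The partition sizing target can be checked with a little arithmetic. The sketch below estimates how many partitions a table needs to land near the middle of the 100 to 200 MB range and flags partitions that fall outside it. The function names, the 150 MB midpoint default, and the sample sizes in the test are assumptions for illustration, not figures from the talk.

```python
import math

MB = 1024 * 1024
TARGET_MIN = 100 * MB  # lower bound of the target partition size
TARGET_MAX = 200 * MB  # upper bound of the target partition size


def partition_count(table_bytes, target_bytes=150 * MB):
    """Number of partitions needed so each is roughly target_bytes."""
    return max(1, math.ceil(table_bytes / target_bytes))


def out_of_range(partition_sizes):
    """Return the partitions whose size falls outside the target range.

    partition_sizes maps a partition name to its size in bytes.
    """
    return {
        name: size
        for name, size in partition_sizes.items()
        if not TARGET_MIN <= size <= TARGET_MAX
    }
```

Partitions that come back from `out_of_range` are candidates for a finer or coarser partition key, and the query log shows which of them the slow queries actually scan.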
25. Summary
Impala can back your SaaS application
• We’re now running version 1.3
• We’re “spinning” 10TB of data
• Delivering queries in < 2 seconds
We’re hiring – but you already knew that.