4. Use Cases
● Big Batch Jobs
○ high throughput, fault tolerant, ETL
○ data spills to disk
○ Hive on Tez, Pig on Tez
● Ad hoc Queries
○ low latency, interactive, data exploration
○ in-memory, but limited data size
○ Impala, Redshift, Spark, Presto
5. Netflix Requirements
● SQL-like language
● Low latency for ad hoc queries
● Work well on AWS cloud
● Good integration with Hadoop stack
● Scale to 1000+ node cluster
● Open source with community support
9. Our Operations Environment
● Launch script on top of EMR
● Ganglia integration
● Usage graphs - concurrent queries & tasks
10. Current Deployment
● Presto in Production @ Netflix
● 100+ nodes Presto Cluster
● 1000+ queries running per day
● Presto queries run against the same petabyte-scale S3 data warehouse as Hive and Pig
11. Observed Performance @ Netflix
● Data in Sequence File Format
● One MapReduce Job SmallTableScan
○ MapReduce overhead dominates the query execution time
○ Presto is always ~10X faster than Hive
● One MapReduce Job BigTableScan
○ MapReduce overhead is marginal compared with big table scan time
○ Presto performs similarly to Hive
● Multiple MapReduce Aggregation
○ Presto is always > 10X faster than Hive
● Joins
○ Presto is always > 2X faster than Hive
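One way to read the scan results above is as a fixed-overhead model: each MapReduce job pays a startup cost regardless of data size, so the cost dominates small scans and vanishes into large ones. The sketch below is only an illustration under that assumption. The function name and all numbers are hypothetical, chosen to show the shape of the effect, and are not Netflix measurements.

```python
def speedup(job_overhead_s, scan_s, num_jobs=1):
    """Ratio of a runtime with per-job startup overhead to one without it.

    Assumption (hypothetical): both engines scan at the same rate, so the
    only difference is a fixed overhead paid once per MapReduce job.
    """
    with_overhead = num_jobs * job_overhead_s + scan_s
    without_overhead = scan_s
    return with_overhead / without_overhead


# Hypothetical inputs: 27 s of startup overhead per job.
small_scan = speedup(job_overhead_s=27.0, scan_s=3.0)      # overhead dominates
big_scan = speedup(job_overhead_s=27.0, scan_s=2700.0)     # scan dominates
multi_job = speedup(job_overhead_s=27.0, scan_s=3.0, num_jobs=3)
```

Under these made-up inputs the small scan shows a large ratio, the big scan a ratio close to 1, and chaining several jobs widens the gap further, which matches the qualitative pattern on this slide.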
12. What we are working on
● Support Parquet File Format
○ https://github.com/facebook/presto/pull/1147
○ Parquet performs similarly to Sequence, but not as fast as RCFile
● ODBC/JDBC driver for Presto
○ Support MicroStrategy running on Presto
13. Some inconveniences ...
● Support Server Side “Use Schema”
○ Workaround: Client Side “Use Schema” Or “Schema.Table”
● Recurse the partition directory
○ Behavior differs from Hive
● Metadata caching
○ Have to rerun the query a number of times to see a metadata change
● Extend JSON extract functions to allow . notation
○ json_extract_scalar(mapColumn, '$.namePart1.namePart2')
○ Workaround: regexp_extract
● WebUI running slow
○ load query task info on demand
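To make the JSON item above concrete, here is a small Python sketch of what dot-notation extraction means next to a regex-based workaround. It only illustrates the semantics; it is not Presto's implementation, and `extract_scalar` and `extract_with_regex` are hypothetical helper names.

```python
import json
import re


def extract_scalar(json_text, path):
    """Follow a '$.a.b' style path through nested JSON objects.

    Returns None if any key is missing or the result is not a scalar.
    """
    node = json.loads(json_text)
    for key in path.lstrip("$").strip(".").split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return None if isinstance(node, (dict, list)) else node


def extract_with_regex(json_text, key):
    """Regex workaround: first string value stored under `key`.

    Fragile by design: it ignores nesting and only matches string values.
    """
    match = re.search(r'"%s"\s*:\s*"([^"]*)"' % re.escape(key), json_text)
    return match.group(1) if match else None


row = '{"namePart1": {"namePart2": "value"}}'
by_path = extract_scalar(row, "$.namePart1.namePart2")
by_regex = extract_with_regex(row, "namePart2")
```

Both calls find the same value for this row, but the regex version would also match a `namePart2` key nested under a different parent, which is why path-based extraction is the preferable long-term fix.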
14. Features we would like
● Big table join
● User Defined Functions
● Break down one column value into several tuples
○ In Hive: lateral view explode json_tuple
● Decimal type
● Scheduler
● Writes
○ Insert overwrite
○ Alter table add partition
○ Parallel writes from workers (not client only)
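The "break down one column value into several tuples" item above refers to what Hive's lateral view explode does: one input row with a collection-valued column becomes one output row per element. A minimal Python sketch of that behavior follows, with rows modeled as dictionaries; the `explode` helper and the sample data are hypothetical.

```python
def explode(rows, column):
    """Yield one output row per element of the list stored in `column`.

    Rows whose list is empty produce no output, as with a Hive lateral
    view that does not use OUTER.
    """
    for row in rows:
        for item in row[column]:
            out = dict(row)      # copy the other columns unchanged
            out[column] = item   # replace the list with a single element
            yield out


rows = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": []},
]
exploded = list(explode(rows, "tags"))
```

Here the first row expands into two rows and the second, with an empty list, disappears from the result.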