Netflix running Presto in the AWS Cloud

Zhenxiao Luo

Outline
● BigDataPlatform@Netflix
● Use cases & requirements
● What we did
○ Reading/Writing from/to Amazon S3
○ Operations
○ Deployment
○ Performance
● What’s next?

Use Cases
● Big Batch Jobs
○ high throughput, fault tolerant, ETL
○ data spills to disk
○ Hive on Tez, Pig on Tez
● Adhoc Queries
○ low latency, interactive, data exploration
○ in-memory, but limited data size
○ Impala, Redshift, Spark, Presto

Netflix Requirements
● SQL-like language
● Low latency for ad hoc queries
● Works well on the AWS cloud
● Good integration with the Hadoop stack
● Scales to 1000+ node clusters
● Open source with community support

Reading/Writing to/from S3
● Option 1: Apache Hadoop NativeS3FileSystem
● Option 2: PrestoS3FileSystem
○ retry logic for read timeout
○ write directly to final S3 path
● Option 3: emrFileSystem
○ disable hadoop logging
○ disable hadoop FileSystem cache
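
The "retry logic for read timeout" bullet can be illustrated with a minimal sketch. This is not the PrestoS3FileSystem code: `ReadTimeout`, `read_with_retry`, and the backoff parameters are hypothetical names and values chosen for illustration, assuming a read operation that can simply be attempted again after a timeout.

```python
import time


class ReadTimeout(Exception):
    """Hypothetical stand-in for a storage read that timed out."""


def read_with_retry(read_once, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call read_once(), retrying on ReadTimeout with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return read_once()
        except ReadTimeout:
            if attempt == max_attempts:
                raise  # out of attempts: surface the timeout to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller would normally leave it at the default.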

Bug Fixes
● https://github.com/facebook/presto/commit/cf0b2d66f4050fb1959c832809fa76e323d6d46e
● https://github.com/facebook/presto/commit/594b06c3e93a482dc162d2c49c9bd265795efb86
● https://github.com/facebook/presto/pull/1147
● https://github.com/facebook/presto/pull/1300
● https://github.com/facebook/presto/issues/1285
● https://github.com/facebook/presto/issues/1264

Our Operations Environment
● Launch script on top of EMR
● Ganglia integration
● Usage graphs - concurrent queries & tasks

Current Deployment
● Presto in Production @ Netflix
● 100+ node Presto cluster
● 1000+ queries running per day
● Presto queries run against the same petabyte-scale S3 data warehouse as Hive and Pig

Observed Performance @ Netflix
● Data in SequenceFile format
● Small table scan (one MapReduce job in Hive)
○ MapReduce overhead dominates the query execution time
○ Presto is always ~10X faster than Hive
● Big table scan (one MapReduce job in Hive)
○ MapReduce overhead is marginal compared with the big table scan time
○ Presto performs similarly to Hive
● Aggregation (multiple MapReduce jobs in Hive)
○ Presto is always > 10X faster than Hive
● Joins
○ Presto is always > 2X faster than Hive

What we are working on
● Support Parquet File Format
○ https://github.com/facebook/presto/pull/1147
○ Parquet performs similarly to SequenceFile, but is not as fast as RCFile
● ODBC/JDBC driver for Presto
○ Support MicroStrategy running on Presto

Some inconveniences ...
● Support Server Side “Use Schema”
○ Workaround: Client Side “Use Schema” or “Schema.Table”
● Recurse the partition directory
○ Behavior differs from Hive
● Metadata caching
○ have to rerun the query a number of times to see the metadata change
● Extend JSON extract functions to allow . notation
○ json_extract_scalar(mapColumn, '$.namePart1.namePart2')
○ Workaround: regexp_extract
● WebUI running slow
○ load query task info on demand
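
To make the JSON item concrete, here is a small sketch of what resolving a dotted path such as '$.namePart1.namePart2' means, next to a regex-based workaround. Both functions are hypothetical illustrations in Python, not Presto's `json_extract_scalar` or `regexp_extract`; the regex version assumes string values and flat key matching.

```python
import json
import re


def extract_scalar(json_text, path):
    """Resolve a '$.a.b' style path; return None if absent or not a scalar."""
    if not path.startswith("$."):
        raise ValueError("path must start with '$.'")
    node = json.loads(json_text)
    for key in path[2:].split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    if isinstance(node, (dict, list)):
        return None
    return node


def extract_by_regex(json_text, key):
    """Regex workaround: first string value stored under `key`, at any depth."""
    match = re.search(r'"%s"\s*:\s*"([^"]*)"' % re.escape(key), json_text)
    return match.group(1) if match else None
```

The regex workaround ignores nesting, so it returns whichever occurrence of the key comes first in the text. That is one reason a path-aware function is preferable when the same key appears under several parents.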

Features we would like
● Big table join
● User Defined Functions
● Break down one column value into several tuples
○ In Hive: lateral view explode json_tuple
● Decimal type
● Scheduler
● Writes
○ Insert overwrite
○ Alter table add partition
○ Parallel writes from workers (not client only)
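
The "break down one column value into several tuples" request can be pictured with a short sketch of explode-style semantics. It is an illustration in Python, not Hive or Presto code: `explode_json_array` and the row layout are hypothetical, and it assumes the column holds a JSON array encoded as a string.

```python
import json


def explode_json_array(rows, column):
    """Yield one output row per element of the JSON array stored in `column`."""
    for row in rows:
        for element in json.loads(row[column]):
            exploded = dict(row)  # copy so the input row is left untouched
            exploded[column] = element
            yield exploded
```

As with a plain lateral view in Hive, a row whose array is empty produces no output rows in this sketch.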

Zhenxiao Luo, Senior Software Engineer @ Netflix