
Building a high-performance data lake analytics engine at Alibaba Cloud with Presto+Alluxio


  1. Building a High-performance Data Lake Analytics Engine at Alibaba Cloud with Presto+Alluxio. Zhenlin Ma
  2. Contents: 01 Introduction to DLA; 02 DLA Presto Architecture; 03 Optimizations on OSS Data Source
  3. Introduction to DLA. Data Lake Analytics (DLA) is a large-scale serverless data federation service on Alibaba Cloud. Key points: serverless data federation; database-like user experience; high performance.
  4. Introduction to DLA (overview diagram). Labels: Data Lake Storage (OSS), columnar tables; One Click Data Lake; DB - Data; Streaming - Data; Spark Streaming; LogService; Application Logs; Archived Transactional Data; Data Lake Engine (DLA): Serverless Spark (ETL & ML), Serverless Presto, Metadata Management, Auto Discovery; DW; DMS; APP; QuickBI.
  5. DLA Presto. Multi-Coordinators. Lake Formation: One Click Data Warehouse, metadata discovery. Enterprise-level access control. Cost: billing based on the volume of scanned data or on the number of compute units used. MySQL protocol support. Caching. Data sources: more than 15 types of data source are supported, including Alibaba Cloud OSS, ADB, Table Store, etc.
  6. Billing methods
  7. Contents: 01 Introduction to DLA; 02 DLA Presto Architecture; 03 Optimizations on OSS Data Source
  8. DLA Presto Architecture (diagram). Components: FrontNode; Unified Meta Service; Presto Clusters, consisting of a Default Cluster and a CU Cluster, each with a Coordinator and Workers; data sources: OSS, MySQL, SQLServer, PostgreSQL, Oracle, TableStore, MaxCompute, ElasticSearch, Druid, … Flow labels: SQL Dialect Transformation / Submit Query / Fetch Result; TableScan/Pushdown; Meta Operation. Callouts: MySQL Protocol; Multiple Charging Model; Unified Meta & Access Control.
  9. About Presto. Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Full memory processing: blazing fast, suitable for ad hoc queries, data exploration, and lightweight ETL. Full SQL semantics: compliant with ANSI SQL, so there is no need to worry about unsupported SQL syntax. Pluggable connectors. Great community.
  10. Challenges to DLA Presto (architecture diagram as on slide 8). Components: FrontNode; Unified Meta Service; Presto Clusters (Default Cluster and CU Cluster, each with a Coordinator and Workers); data sources: OSS, MySQL, SQLServer, PostgreSQL, Oracle, TableStore, MaxCompute, ElasticSearch, Druid, … Flow labels: SQL Dialect Transformation / Submit Query / Fetch Result; TableScan/Pushdown; Meta Operation.
  11. Challenges to DLA Presto (same diagram, annotated with challenges): request costs; bandwidth limit; performance of pulling large data (marked twice); latency of getting metadata/partitions; pressure on data source.
  12. Challenges to DLA Presto - Analysis. Chart axes: small data to big data; update frequently to update infrequently. Data sources on the chart: OTS, OSS, ODPS, ?, MySQL, Redis, MongoDB, PostgreSQL, … Big Data/NoSQL: performance of pulling large data. Online System: pressure on data source. Big Data/Offline: performance of pulling large data.
  13. Challenges to DLA Presto - Analysis (same chart as slide 12). Added: concurrency limitation; avoid reading master.
  14. Challenges to DLA Presto - Analysis (same chart as slide 12). Added: concurrency limitation; avoid reading master; caching.
  15. Challenges to DLA Presto - Analysis (same chart as slide 12). Added: concurrency limitation; avoid reading master; caching; pushdown.
  16. Solutions (architecture diagram as on slide 8, annotated). Components: FrontNode; unified metadata management; Presto Clusters (Default Cluster and CU Cluster, each with a Coordinator and Workers); data sources: OSS, MySQL, SQLServer, PostgreSQL, Oracle, TableStore, MaxCompute, ElasticSearch, Druid, … Flow labels: SQL rewrite / submit query / fetch query result; TableScan/Pushdown; metadata operations. Concern marked three times: impact on the source database. Solutions: decrease request count; Alluxio data cache; partition meta cache; splits cache; limit concurrency; read from the slave (replica) instead of the master; One Click Data Lake; pushdown.
  17. Contents: 01 Introduction to DLA; 02 DLA Presto Architecture; 03 Optimizations on OSS Data Source
  18. DLA Presto Optimizations on OSS Data Source: decreasing OSS API request count; Alluxio data cache.
  19. Decreasing OSS API request count - Background. Users report that the OSS calling fees are high, even higher than the DLA fees. OSS calling fees = actual calls × unit price per 10,000 calls / 10,000.
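
The fee formula on slide 19 is linear in the number of calls, so any reduction in call count carries over to the bill in the same proportion. A minimal sketch of the arithmetic follows; the function name and the numbers are hypothetical and are not real OSS pricing.

```python
def oss_calling_fee(actual_calls: int, unit_price_per_10k: float) -> float:
    """OSS calling fees = actual calls x unit price per 10,000 calls / 10,000."""
    return actual_calls * unit_price_per_10k / 10_000


# Hypothetical numbers, for illustration only (not real OSS pricing).
baseline_fee = oss_calling_fee(actual_calls=20_000, unit_price_per_10k=0.5)
# Issuing one tenth of the calls costs one tenth of the fee.
reduced_fee = oss_calling_fee(actual_calls=2_000, unit_price_per_10k=0.5)
```
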
  20. Decreasing OSS API request count. Hadoop FileSystem API invocation: read, read, …, seek(100), read, seek(128MB), read. Alibaba Cloud OSS API invocation: request #1 reads as much data as possible with one request; on a small seek, continue reading; on a big seek, start a new request (#2) and continue reading. Results: 1. API call count reduced to 1/10 for data stored in Text format. 2. API call count reduced to 1/3 for data stored in ORC/Parquet format. 3. Costs reduced by about 60% to 90% on average.
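
The seek rule on slide 20 can be sketched as a small request counter. This is an illustration under stated assumptions, not DLA's implementation: the class `CoalescingReader`, its fields, and the 1 MiB threshold are all hypothetical, and a request is modeled as an open-ended read that can skip a short gap without being reopened.

```python
from dataclasses import dataclass


@dataclass
class CoalescingReader:
    """Counts object-store requests for a sequence of seek/read calls.

    Assumption: a forward seek of at most `small_seek_threshold` bytes is
    served by skipping bytes on the open request; a bigger forward seek or
    any backward seek starts a new request.
    """
    small_seek_threshold: int = 1024 * 1024  # hypothetical threshold (1 MiB)
    position: int = 0           # logical file position requested by the caller
    stream_position: int = -1   # offset reached by the open request; -1 = none open
    request_count: int = 0

    def seek(self, offset: int) -> None:
        # Seeking is lazy: only the next read decides whether a request is needed.
        self.position = offset

    def read(self, length: int) -> None:
        gap = self.position - self.stream_position
        if self.stream_position < 0 or gap < 0 or gap > self.small_seek_threshold:
            # No open request, backward seek, or big seek: start a new request.
            self.request_count += 1
        # Otherwise the gap is small: skip it and continue on the open request.
        self.position += length
        self.stream_position = self.position


# The call sequence from the slide: read, read, seek(100), read, seek(128MB), read.
reader = CoalescingReader()
reader.read(10)
reader.read(10)
reader.seek(100)
reader.read(10)
reader.seek(128 * 1024 * 1024)
reader.read(10)
```

With this sequence the counter ends at two requests (#1 and #2 on the slide), whereas issuing one request per read would need four.
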
  21. Alluxio Data Cache - Why Alluxio. Proven solution: fully tested in Facebook/Netease/JD production environments. Efficiency: high concurrency; asynchronous write cache. Monitoring: easy to monitor and diagnose.
  22. Alluxio Data Cache - Local Cache vs. Cluster (diagram). Components: OSS; Presto Cluster (Coordinator, Workers); Alluxio Cluster (Master, Workers). Flow labels: read Alluxio; on cache miss; cache to Alluxio; return data.
  23. Alluxio Data Cache - Local Cache. The Alluxio data cache is a library residing in the Presto worker. Cached data is stored on local disk.
  24. Alluxio Data Cache - Local Cache. SOFT_AFFINITY makes a best attempt to assign the same split to the same worker when scheduling. Selection order: Preferred(0) -> Preferred(1) -> LeastBusy. Diagram labels: Preferred(0) (three times), Preferred(1), LeastBusy.
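
The selection order Preferred(0) -> Preferred(1) -> LeastBusy can be sketched as follows. This is a simplified illustration, not Presto's scheduler code: the function `pick_node`, the `running_splits` map, and the per-node cap argument are hypothetical names, and how the two preferred nodes are derived for a split is left out.

```python
from typing import Dict, List


def pick_node(preferred: List[str],
              running_splits: Dict[str, int],
              max_splits_per_node: int) -> str:
    """Return the first preferred node with spare capacity, else the least busy node."""
    for node in preferred:  # Preferred(0), then Preferred(1)
        if running_splits[node] < max_splits_per_node:
            return node
    # Both preferred nodes are full: fall back to the least busy worker.
    # Ties are broken by node name so the result is deterministic.
    return min(running_splits, key=lambda n: (running_splits[n], n))


# Hypothetical load: w0 is at the cap, so the split goes to Preferred(1).
load = {"w0": 2, "w1": 0, "w2": 1}
chosen = pick_node(preferred=["w0", "w1"], running_splits=load, max_splits_per_node=2)
```

A split that lands on the least busy node instead of a preferred one will usually miss the local cache, which is why the per-node cap matters for the hit ratio.
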
  25. Alluxio Data Cache - Cluster (diagram as on slide 22). Alluxio is a distributed caching service for Presto. Short-circuit reads are supported.
  26. Alluxio Data Cache - Local Cache vs. Cluster (diagram as on slide 22). Local Cache vs. Cluster: data is closer to the compute node; no extra nodes needed. Local Cache vs. Collocated Cluster: easy to maintain; no resources wasted if the user has no OSS data source.
  27. Alluxio Data Cache - Local Cache vs. Cluster. Local Cache vs. Cluster: data is closer to the compute node; no extra resources needed. Local Cache vs. Collocated Cluster: easy to maintain; no resources wasted if the user has no OSS data source.
  28. Alluxio Data Cache - Improvements in DLA. Community solution scenarios vs. DLA scenarios: queries mainly on Hive data sources vs. cannot assume that for a specific user; SSD vs. ultra cloud disk. Challenges: a performance improvement in the statistical sense may not be perceivable by users, so the cache hit ratio has to increase for every single query; low disk throughput limits the acceleration effect. Goals: increase the cache hit ratio for every single query; increase disk throughput.
  29. Alluxio Data Cache - Improvements in DLA. Increase cache hit ratio. Analysis: SOFT_AFFINITY selects Preferred(0) -> Preferred(1) -> LeastBusy, so the key is to submit more splits to the preferred nodes; the relevant setting is node-scheduler.max-splits-per-node. Change: increase node-scheduler.max-splits-per-node. Effect: cache hit ratio increased. Side effect: load across workers becomes unbalanced. Diagram: split1 to split6 distributed as 4 splits, 1 split, 1 split.
  30. Alluxio Data Cache - Improvements in DLA. Increase cache hit ratio: unbalanced load. HiveSplit preferred nodes: path.hashCode() % numWorkers. A big file generates more splits, so the corresponding worker gets more load. Splits of a big file need to be submitted to different nodes: (path.hashCode() + (start / (fileSize / numWorkers))) % numWorkers. Diagram: splits distributed as 2 splits, 2 splits, 2 splits.
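
The two preferred-node formulas on slide 30 can be compared with a short sketch. It is written under assumptions: `zlib.crc32` stands in for Java's `path.hashCode()` only because it is stable and non-negative, the function names and the example path are hypothetical, and the guard for very small files is an addition that the slide does not mention.

```python
import zlib
from collections import Counter


def path_hash(path: str) -> int:
    # Stand-in for path.hashCode(); crc32 is stable and non-negative.
    return zlib.crc32(path.encode("utf-8"))


def preferred_by_path(path: str, num_workers: int) -> int:
    """Original rule: every split of a file prefers the same worker."""
    return path_hash(path) % num_workers


def preferred_by_offset(path: str, start: int, file_size: int, num_workers: int) -> int:
    """Revised rule: the split's start offset shifts the preferred worker."""
    chunk = max(1, file_size // num_workers)  # added guard for files smaller than num_workers bytes
    return (path_hash(path) + start // chunk) % num_workers


# A hypothetical 600-byte file cut into six 100-byte splits on three workers.
path, file_size, num_workers = "oss://bucket/table/part-0.orc", 600, 3
starts = range(0, file_size, 100)
by_path = Counter(preferred_by_path(path, num_workers) for _ in starts)
by_offset = Counter(preferred_by_offset(path, s, file_size, num_workers) for s in starts)
```

In this example `by_path` puts all six splits on one worker, while `by_offset` gives each of the three workers two splits, matching the "2 splits, 2 splits, 2 splits" picture on the slide.
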
  31. Alluxio Data Cache - Improvements in DLA. Improve disk throughput. 20GB ultra disk throughput: write 109MB/s, read 108MB/s. Multiple disks: performance of 6 ultra disks is 600MB/s read/write. Implementation: page.path = $root/$page_path => page.path = $roots[page.hash % roots.size]/$page_path
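
The path change on slide 31 spreads cache pages over several disk roots by hashing. A minimal sketch, assuming `zlib.crc32` as a stand-in for `page.hash` and using hypothetical mount points and page names:

```python
import zlib
from typing import List


def page_path(roots: List[str], relative_page_path: str) -> str:
    """Map a cache page to one of several disk roots: $roots[page.hash % roots.size]/$page_path."""
    page_hash = zlib.crc32(relative_page_path.encode("utf-8"))  # stand-in for page.hash
    root = roots[page_hash % len(roots)]
    return f"{root}/{relative_page_path}"


# Hypothetical mount points for six ultra disks.
disk_roots = [f"/mnt/disk{i}/alluxio_cache" for i in range(6)]
example = page_path(disk_roots, "file_abc/page_42")
```

Because the root is derived only from the page's hash, a page always maps to the same disk, and with a single root the mapping reduces to the original `$root/$page_path`.
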
  32. Alluxio Data Cache - Performance. Environment: cluster of 16 nodes, each with 16 CPUs and 64GB; disk: 20GB ultra disk × 6; data: TPCH-1TB, ORC, stored on OSS. Queries chosen from TPCH: those that include a scan of table lineitem (the biggest table) and do not join three or more tables.
  33. Alluxio Data Cache - Performance. Test Result (chart).
  34. Future Plan. Alluxio Cluster: shared by multiple users; suitable when Presto auto-scales. Improvements for OSS data source: fragment result cache; query result cache; improve performance of querying small files.
  35. More Information about DLA • DLA Homepage: https://www.aliyun.com/product/datalakeanalytics • DLA SQL Introduction: https://developer.aliyun.com/article/770819 We are hiring :)
