
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark and Apache Hive

As Hadoop applications move into cloud deployments, object stores increasingly become the source and destination of data. But object stores are not filesystems: sometimes they are slower, and security works differently.

What are the secret settings for getting maximum performance from queries against data living in cloud object stores? The answers span the filesystem client, the file format and the query engine layers. They even extend to how you lay out the files: the directory structure and the names you give them.

We know these things from our work in all of these layers, from the benchmarking we've done, and from the support calls we get when people have problems. And now we'll show you.

This talk starts from the ground-up question "why isn't an object store a filesystem?", showing how that breaks fundamental assumptions in code and so causes performance issues you don't get when working with HDFS. We'll look at ways to get Apache Hive and Spark to work better, covering the optimizations that have been done to enable this and the work that is ongoing. Finally, we'll consider what your own code needs to do in order to adapt to cloud execution.

Speaker:
Sanjay Radia, Founder and Chief Architect, Hortonworks


Dancing Elephants - Efficiently Working with Object Stores from Apache Spark and Apache Hive

  1. Dancing Elephants: Working with Object Storage in Apache Spark and Hive. Sanjay Radia, June 2017.
  2. About the Speaker: Sanjay Radia, Chief Architect and Founder, Hortonworks. Part of the original Hadoop team at Yahoo! since 2007: Chief Architect of Hadoop Core at Yahoo!; Apache Hadoop PMC member and committer. Previously: data center automation, virtualization, Java, HA, operating systems, file systems; startups, Sun Microsystems, Inria … Ph.D., University of Waterloo.
  3. Why Cloud? • No upfront hardware costs, pay as you use • Elasticity • Often lower TCO • Natural data ingress point for IoT, mobile apps, .. • Business agility
  4. Key Architectural Considerations for Hadoop in the Cloud: shared data and storage; on-demand ephemeral workloads; elastic resource management; shared metadata, security and governance.
  5. Shared Data Requires Shared Metadata, Security, and Governance. Shared metadata across all workloads/clusters. Metadata considerations: tabular data metastore; lineage and provenance metadata; added upon ingest; updated as processing modifies data. Access and tag-based policies, plus audit logs. (Slide diagram: shared metadata policies such as classification, prohibition, time and location applied across streams, pipelines, feeds, tables, files and objects.)
  6. Shared Data and Cloud Storage. Cloud storage is the shared data lake: for both Hadoop and cloud-native (non-Hadoop) apps; lower cost (HDFS via EBS can get very expensive, and HDFS's role changes); built-in geo-distribution and DR. Challenges: cloud storage is designed for scale, low cost and geo-distribution; performance is slower since it was not designed for data-intensive apps; cloud storage is segregated from compute; the API and semantics are not those of a filesystem, especially with respect to consistency.
  7. Making Object Stores Work for Big Data Apps. Focus areas: address cloud storage consistency; performance (changes in connectors and frameworks); caching in memory and local storage. Other issues not covered in this talk: shared metastore, common governance and security across multiple clusters; columnar access control to tabular data (see the Hortonworks cloud offering).
  8. Cloud Storage Integration: Evolution for Agility. (Slide diagram: evolution of HDFS applications from using cloud storage for backup/restore and upload, through reading input and writing output directly against cloud storage with HDFS for temporary data, toward cloud storage as the persistent data lake on AWS and Azure.)
  9. Danger: object stores are not hierarchical filesystems. Their focus is cost and geo-distribution over consistency and performance.
  10. A Filesystem: Directories, Files, Data. (Slide diagram: a /work tree with pending/part-00, pending/part-01 and a complete/ directory; promoting output is a single metadata operation: rename("/work/pending/part-01", "/work/complete").)
  11. Object Store: hash(name) ⇒ data. (Slide diagram: object names hash to sets of storage servers, e.g. hash("/work/pending/part-01") ⇒ ["s02", "s03", "s04"], hash("/work/pending/part-00") ⇒ ["s01", "s02", "s04"].) There is no rename, hence: copy("/work/pending/part-01", "/work/complete/part-01") followed by delete("/work/pending/part-01"). See the sketch after the transcript for what this looks like through the Hadoop FileSystem API.
  12. Often: Eventually Consistent. (Slide diagram: after DELETE /work/pending/part-00 returns 200, subsequent GET /work/pending/part-00 requests can still return 200 with the old data.)
  13. Eventual Consistency Problems. When listing a directory: newly created files may not yet be visible, deleted ones may still be present. After updating a file: opening and reading the file may still return the previous data. After deleting a file: opening the file may succeed and return the old data. While reading an object: the object may be updated or deleted during the process.
  14. The Dangers of Eventual Consistency and Lack of Atomicity. Temp data leftovers: annoying garbage, or worse if a direct output committer is used. List inconsistency means new data may not be visible (Hadoop treats directories as containers of data). Lack of atomic rename() can leave output directories inconsistent. You can get bad or missing data and not even notice, especially if only a portion of your large dataset is missing.
  15. org.apache.hadoop.fs.FileSystem and its implementations: hdfs, s3a, wasb, adl, swift, gs.
  16. History of Object Storage Support (timeline 2006–2017): s3:// "inode on S3"; s3:// Amazon EMR S3 (proprietary); s3n:// "native" S3; s3a:// replaces s3n (Phase I: stabilize S3A; Phase II: speed and scale; Phase III: scale and consistency); swift:// OpenStack; wasb:// Azure WASB; adl:// Azure Data Lake; oss:// Aliyun; gs:// Google Cloud.
  17. Cloud Storage Connectors. Azure WASB: strongly consistent; good performance; well-tested on applications (incl. HBase). Azure ADL: strongly consistent; tuned for big data analytics workloads. AWS S3A: eventually consistent (consistency work recently completed by Hortonworks, Cloudera and others); performance improvements recent and in progress; active development in Apache. AWS EMRFS: proprietary connector used in EMR; optional strong consistency for a cost. Google Cloud Platform GCS: multiple configurable consistency policies; currently Google open source; good performance; could improve test coverage.
  18. Make Apache Hadoop at home in the cloud. Step 1: Hadoop runs great on Azure. Step 2: Be the best-performing option on EC2, i.e. beat proprietary solutions like EMR.
  19. Problem: S3 analytics is too slow/broken. 1. Analyze benchmarks and bug reports. 2. Optimize the non-I/O metadata operations (very cheap on HDFS). 3. Fix the read path for columnar data. 4. Fix the write path. 5. Improve query partitioning (not covered in this talk). 6. The commitment problem.
  20. (Slide chart: LLAP (single node) on AWS, TPC-DS queries at 200 GB scale; time spent in getFileStatus(), read() and readFully(pos).)
  21. Hadoop 2.8/HDP 2.6 transforms I/O performance! Key settings (see the sketch after the transcript for applying them from Spark):
      // forward seek by skipping stream
      fs.s3a.readahead.range=256K
      // faster backward seek for columnar storage
      fs.s3a.experimental.input.fadvise=random
      // write I/O: enhanced data upload (parallel background uploads), with additional flags for memory vs disk
      fs.s3a.fast.output.enabled=true
      fs.s3a.multipart.size=32M
      fs.s3a.fast.upload.active.blocks=8
      // additional per-bucket flags: see HADOOP-11694 for lots more!
  22. Every HTTP request is precious: HADOOP-13162: reduce the number of getFileStatus calls in mkdirs(); HADOOP-13164: optimize deleteUnnecessaryFakeDirectories(); HADOOP-13406: consider reusing FileStatus in delete() and mkdirs(); HADOOP-13145: DistCp to skip getFileStatus when not preserving metadata; HADOOP-13208: listFiles(recursive=true) to do a bulk listObjects. See HADOOP-11694.
  23. Caching in memory or on local disk (SSD) is even more relevant for slow cloud storage. Benchmarks != your queries, your data, your VMs, … but we think we've made a good start.
  24. S3 data source, 1 TB TPC-DS, LLAP vs Hive 1.x. (Slide chart: LLAP-1TB-TPCDS vs Hive-1-1TB-TPCDS runtimes; 1 TB TPC-DS ORC dataset on 3 x i2.4xlarge (16 CPU x 122 GB RAM x 4 SSD).)
  25. Rename Problem and Direct Output Committer
  26. The S3 Commitment Problem. rename() is used as the atomic commit transaction. On S3 the additional time for copy() + delete() is proportional to data * files; server-side copy makes this faster, but it is still a copy, and it is not atomic. Alternative: a direct output committer can solve the performance problem. BOTH can give wrong results: intermediate data may be visible; failures (task or job) leave storage in an unknown state; speculative execution makes it worse. By the way, compared to Azure Storage, S3 is slow (6-10+ MB/s).
  27. Spark's Direct Output Committer? Risk of data corruption.
  28. Netflix Staging Committer. 1. Save output to file://. 2. Task commit: upload to S3A as a multipart PUT, but do not complete the PUT; just save the information about it to hdfs://. 3. The normal commit protocol manages task and job data promotion in HDFS. 4. The final job committer reads the pending information and completes the final PUT, possibly from a different host (multiple files, hence not fully atomic, but the window is much smaller). Outcome: no visible overwrite until final job commit, so resilience and speculation are safe; task commit time = data/bandwidth; job commit time = POST * #files. (A configuration sketch for the committers this work led to follows the transcript.)
  29. Use the Hive Metastore to Commit Atomically. Work in progress: use the Hive metastore to record the commit (Databricks seems to have done something similar for Databricks Spark, i.e. proprietary). Fits into the Hive ACID work.
  30. S3Guard: fast, consistent S3 metadata (HADOOP-13445).
  31. S3Guard: Fast and Consistent S3 Metadata. Goals: provide consistent list and get-status operations on S3 objects written with S3Guard enabled (listStatus() and getFileStatus() after put and delete); performance improvements that impact real workloads; provide tools to manage the associated metadata and caching policies. Again, 100% open source in the Apache Hadoop community (Hortonworks, Cloudera, Western Digital, Disney …). Inspired by the Apache-licensed S3mper project from Netflix (apparently EMRFS's committer is also inspired by it, but was copied and kept proprietary). Seamless integration with S3AFileSystem.
  32. Use DynamoDB as a fast, consistent metadata store. (Slide diagram: DELETE, PUT and HEAD requests for part-00 are answered consistently via the metadata store, e.g. a HEAD after DELETE returns 404 rather than stale data.) A configuration sketch follows the transcript.
  33. Availability: read and write paths in HDP 2.6 and Apache Hadoop 2.8; S3Guard: preview of DynamoDB caching soon; zero-rename commit: work in progress.
  34. Summary. Cloud storage is the data lake in the cloud; HDFS plays a different role. Challenges: performance, consistency, correctness (output committer non-atomicity should not be ignored). We have made significant improvements: object store connectors; upper layers such as Hive and ORC; the S3Guard branch has been merged. LLAP acts as the cache for tabular data. Other considerations: shared metadata, security and governance (see the HDP Cloud offerings).
  35. Big thanks to: Rajesh Balamohan, Steve Loughran, Mingliang Liu, Chris Nauroth, Dominik Bialek, Ram Venkatesh, everyone in QE and RE, plus everyone who reviewed and tested patches, added their own, filed bug reports, and measured performance.
  36. Questions? sanjay@hortonworks.com @srr
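
The following sketch (not the S3A connector's actual internals) illustrates slides 10, 11 and 26 through the public Hadoop FileSystem API: against an object store, a "rename" decomposes into a copy of every byte followed by a delete, which is neither cheap nor atomic. The bucket name my-bucket and the paths are placeholders.

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    val conf = new Configuration()
    val fs   = FileSystem.get(new URI("s3a://my-bucket/"), conf)

    val src = new Path("/work/pending/part-01")
    val dst = new Path("/work/complete/part-01")

    // On HDFS, fs.rename(src, dst) is a single atomic metadata operation.
    // Against an object store, the same outcome decomposes into copy + delete:
    // every byte is rewritten, and a failure part-way through can leave the
    // data visible under both paths or neither.
    FileUtil.copy(fs, src, fs, dst, /* deleteSource = */ true, conf)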
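
Slide 21's settings are ordinary Hadoop configuration keys. One way (a sketch, not the only one) to apply them from Spark is the spark.hadoop. prefix, which Spark copies into the Hadoop configuration used by the S3A connector; exact key names vary between Hadoop releases.

    import org.apache.spark.sql.SparkSession

    // The fs.s3a.* keys below are the ones shown on slide 21; the spark.hadoop.
    // prefix hands them to the Hadoop configuration Spark builds internally.
    val spark = SparkSession.builder()
      .appName("s3a-io-tuning")
      .config("spark.hadoop.fs.s3a.readahead.range", "256K")
      .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
      .config("spark.hadoop.fs.s3a.fast.output.enabled", "true")
      .config("spark.hadoop.fs.s3a.multipart.size", "32M")
      .config("spark.hadoop.fs.s3a.fast.upload.active.blocks", "8")
      .getOrCreate()

    // The same keys can equally go into core-site.xml or spark-defaults.conf.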
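
Slides 28 and 33 describe the zero-rename committer work as in progress. For readers on later releases, here is a hedged sketch of the kind of configuration that work eventually led to; it assumes Hadoop 3.1 or later with the S3A committers plus Spark's optional spark-hadoop-cloud module, none of which ship with the Hadoop 2.8/HDP 2.6 versions the talk covers.

    import org.apache.spark.sql.SparkSession

    // Hedged sketch only: requires the S3A committers (Hadoop 3.1+) and the
    // spark-hadoop-cloud module on the classpath.
    val spark = SparkSession.builder()
      .appName("zero-rename-commit")
      // choose the Netflix-derived staging committer
      .config("spark.hadoop.fs.s3a.committer.name", "staging")
      // route Spark's commit protocol through the S3A committer factory
      .config("spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      .getOrCreate()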
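
Slides 30 to 32 describe S3Guard backing S3A metadata with DynamoDB. Once that work shipped in Apache Hadoop, enabling it came down to a few settings; the sketch below is hedged accordingly, with my-bucket and my-s3guard-table as placeholders.

    import org.apache.spark.sql.SparkSession

    // Hedged sketch of S3Guard settings as they later appeared in Apache Hadoop.
    // Set these before any s3a:// path is first accessed so the filesystem
    // instance picks them up.
    val spark = SparkSession.builder()
      .appName("s3guard-enabled")
      // back S3A metadata for this bucket with DynamoDB rather than relying on
      // eventually consistent S3 listings
      .config("spark.hadoop.fs.s3a.bucket.my-bucket.metadatastore.impl",
        "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
      .config("spark.hadoop.fs.s3a.s3guard.ddb.table", "my-s3guard-table")
      .config("spark.hadoop.fs.s3a.s3guard.ddb.table.create", "true")
      .getOrCreate()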
