
Ozone and HDFS's Evolution

HDFS has several strengths: it scales its IO bandwidth horizontally, scales its storage to petabytes, provides very low-latency metadata operations, and scales to over 60K concurrent clients. Hadoop 3.0 recently added erasure coding. One of HDFS's limitations is scaling the number of files and blocks in the system. We describe a radical change to Hadoop's storage infrastructure with the upcoming Ozone technology. It allows Hadoop to scale to tens of billions of files and blocks and, in the future, to ever larger numbers of smaller objects. Ozone fundamentally separates the namespace layer from the block layer, allowing new namespace layers to be added in the future. Further, the use of the Raft protocol has allowed the storage layer to be self-consistent. We show how this technology helps a Hadoop user and what it means for evolving HDFS in the future. We also cover the technical details of Ozone.

Speaker: Sanjay Radia, Chief Architect, Founder, Hortonworks


Ozone and HDFS's Evolution

  1. HDFS Scalability and Evolution: HDDS and Ozone. Sanjay Radia, Founder, Chief Architect, Hortonworks
  2. About the Speaker
     • Sanjay Radia: Chief Architect, Founder, Hortonworks
     • Apache Hadoop PMC member and committer
     • Part of the original Hadoop team at Yahoo! since 2007; Chief Architect of Hadoop Core at Yahoo!
     • Prior: data center automation, virtualization, Java, HA, operating systems, file systems; startup, Sun Microsystems, INRIA...
     • Ph.D., University of Waterloo
  3. HDFS: What It Does Well and Not So Well
     HDFS does well:
     • Scaling: IO, petabytes of storage, and clients
     • Horizontal scaling of IO and storage
     • Fast IO: scans and writes
     • 60K+ concurrent clients
     • Low-latency metadata operations
     • Fault-tolerant storage layer
     • Locality, replicas/reliability, and parallelism
     • Layering: namespace layer and storage layer
     • Security
     But namespace scaling is limited to about 500M files (192GB heap):
     • Scaling the namespace (500M files)
     • Scaling the block space
     • Scaling block reports
     • Scaling the DataNodes' block management
     • Further scaling of clients/RPC (150K+) is also needed
     Ironically, keeping the namespace in memory is both a strength and a weakness.
  4. Proof Points of Scaling Data, IO, Clients/RPC
     • Single organizations have over 600PB in HDFS
     • Single clusters hold over 200PB using federation
     • Large clusters of over 4K multi-core nodes bombard a single NameNode
     • Federation is the current scaling solution (for both namespace and operations); in deployment at Twitter, Yahoo, Facebook, and elsewhere
     Metadata in memory is the strength of the original GFS and HDFS design, but also its weakness in scaling the number of files and blocks.
  5. Scaling HDFS with HDDS and Ozone
  6. HDFS Layering (diagram): namespaces NS1...NSk, served by NameNodes NN-1...NN-k, sit above a block management layer; each namespace owns a block pool (Block Pool 1...k) carved out of common storage provided by DataNodes DN1...DNm. The layering separates the namespace from block storage.
  7. Solutions to Scaling Files, Blocks, Clients/RPC
     Scale the namespace:
     • Hierarchical file system: cache only the working set of the namespace in memory; partition it, either as a distributed namespace (transparent automatic partitioning) or as volumes (static partitioning)
     • Flat key-value store: cache only the working set of the namespace in memory; partition/shard the space (easy to hash; see the sketch below)
     Scale metadata clients/RPC:
     • Multi-thread the namespace manager
     • Partitioning/sharding
     Slow NN startup:
     • Cache only the working set in memory
     • Shard/partition the namespace
     Scale block management:
     • Containers of blocks (2GB-16GB+)
     • Significantly reduces the block map
     • Reduces the number of block/container reports
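To make the "easy to hash" point concrete, here is a toy sketch of picking a shard for a flat key by hashing its bucket component. This is purely illustrative: the shard count and key layout are assumptions, not the Ozone design.

```java
// Toy illustration of sharding a flat key space: choose a shard by hashing
// the bucket portion of the key. Key layout (/volume/bucket/objectKey) and
// shard count are illustrative assumptions.
public class ShardByBucket {
  static int shardFor(String key, int numShards) {
    String[] parts = key.split("/", 4);            // ["", volume, bucket, rest]
    String bucket = parts.length > 2 ? parts[2] : "";
    return Math.floorMod(bucket.hashCode(), numShards);
  }

  public static void main(String[] args) {
    System.out.println(shardFor("/home/john/foo/bar", 16));
    System.out.println(shardFor("/home/jane/foo/bar", 16));
  }
}
```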
  8. Scaling HDFS Must Scale Both the Namespace and the Block Layer
     • Scaling one is not sufficient
     Scalable block layer: Hadoop Distributed Data Storage (HDDS)
     • Containers of blocks, replicated as a group
     • Reduces the block map
     Scale the namespace: several approaches (not exclusive)
     • Partial namespace in memory
     • Shard the namespace
     • Use a flat namespace (KV namespace), which is easier to implement and scale: Ozone
  9. Evolution Towards a New HDFS (diagram): a scalable block layer (HDDS, containers of blocks) underpins both a flat KV namespace (Ozone, exposed through OzoneFS, a Hadoop-compatible FS) and, later, a new scalable NameNode with a hierarchical namespace (the new HDFS).
  10. HDFS, Ozone and Quadra on the Same Cluster: Shared Storage Servers and Shared Physical Storage (diagram)
     • DataNodes act as shared storage servers for HDFS blocks and Ozone/Quadra blocks, over shared physical storage
     • HDFS: scalable FS with a hierarchical namespace; Hadoop-compatible FS API (FileSystem or FileContext)
     • Ozone: highly scalable KV object store; flat namespace; S3 API
     • Quadra: raw storage volumes; raw storage API (LUN/EBS-like, SCSI); used under a Linux FS
  11. How It All Fits Together (diagram)
     • Existing HDFS: the old NameNode keeps the whole namespace in memory (File = Bid[]) plus a BlockMap (Bid -> IP address of DN) built from block reports; HDFS block storage on DataNodes maps Bid -> data
     • New: the Ozone Master (KV flat namespace) and a future scalable HDFS NameNode (hierarchical namespace) both store File (object) = Bid[], where Bid = Cid + LocalId
     • HDDS: container management and cluster membership, a ContainerMap (Cid -> IP address of DN) built from container reports, and container storage on DataNodes (Bid -> data, with blocks grouped in containers)
     • DataNodes and physical storage are shared between the old HDFS and HDDS; HDDS is a clean separation of the block layer
  12. Ozone/HDDS Can Be Used Separately, or Alongside HDFS
     • Initially HDFS remains the default FS; it has many features, so it cannot be replaced by OzoneFS on day one
     • OzoneFS sits alongside as an additional namespace, sharing DataNodes
     • Applications that work with a Hadoop-compatible FS on a KV store (Hive, Spark, ...) can use it
     • How is OzoneFS accessed? Use direct URIs for either HDFS or OzoneFS (see the sketch below), or mount it in HDFS or in ViewFS
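As a concrete illustration of the direct-URI access path, the sketch below opens a key through the ordinary Hadoop FileSystem API. The o3fs:// scheme and the bucket/volume/key names are assumptions based on later Apache Ozone releases and may not match the version described in this talk.

```java
// Minimal sketch: reading a key through the Hadoop-compatible OzoneFS.
// Scheme, bucket/volume names, and the key path are illustrative assumptions.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OzoneFsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Same FileSystem API that Hive/Spark already use; only the URI changes.
    FileSystem fs = FileSystem.get(URI.create("o3fs://b1.v1/"), conf);
    try (FSDataInputStream in = fs.open(new Path("/logs/part-0000"))) {
      byte[] buf = new byte[4096];
      int n = in.read(buf);
      System.out.println("read " + n + " bytes");
    }
  }
}
```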
  13. Scalable Block Layer: Hadoop Distributed Data Storage (HDDS)
     Containers: containers of blocks (2GB-16GB+)
     • Replicated as a group
     • Each container has a unique ContainerId; every block within a container has a block id: BlockId = ContainerId + LocalId (see the sketch below)
     CM (Container Manager):
     • Cluster membership
     • Receives container reports from DataNodes
     • Manages container replication
     • Maintains the container map (Cid -> IP address of DN)
     DataNodes (HDFS and HDDS can share DNs):
     • DataNodes contain a set of containers (just as they used to contain blocks)
     • DataNodes send container reports (like block reports) to the CM
     Container pools:
     • Just as blocks were in block pools, containers are in container pools; this allows independent namespaces to carve out their block space
     HDDS is a separate layer from the namespace layer (strictly separate, not almost).
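The sketch below is an illustrative value class, not the actual HDDS code; it only shows the block-id composition the slide describes: a container id plus an id local to that container.

```java
// Illustrative sketch (not the real HDDS classes): a block id composed of
// the id of its 2-16GB container plus an id local to that container.
public final class BlockId {
  private final long containerId; // identifies the container, replicated as a group
  private final long localId;     // identifies the block within that container

  public BlockId(long containerId, long localId) {
    this.containerId = containerId;
    this.localId = localId;
  }

  public long getContainerId() { return containerId; }
  public long getLocalId()     { return localId; }

  @Override
  public String toString() {
    // The namespace layer only stores (containerId, localId); the
    // ContainerMap (Cid -> DataNodes) resolves where the data lives.
    return containerId + ":" + localId;
  }
}
```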
  14. Key Ozone Characteristics, Compared with HDFS
     • Scale block management: containers of blocks (2GB to 16GB); 2-4GB containers initially give a 40-80x reduction in block reports and in the CM block map, reducing block-report load on DNs, masters, and the network
     • Scale the namespace: the Key Space Manager caches only the working set in memory; future scaling: a flat namespace is easy to shard (buckets are natural sharding points)
     • Scale the number of metadata clients/RPC: no single global lock as in the NN; metadata operations are simpler; sharding will help further
     • Fault tolerance: blocks inherit HDFS's block-layer fault tolerance; the namespace uses Raft rather than Journal Nodes, making HA easier
     • Manageability: a GC-bound or overloaded master is no longer an issue (it caches only the working set); Journal Nodes disappear since Raft is used; faster and more predictable failover; fast startup; faster upgrades
     • Retains HDFS semantics and performance: strong consistency, locality, fast scans, ...
     • Other: the OM can run on DataNodes, which is beneficial for small clusters or embedded systems
  15. Will OzoneFS's Key-Value Store Work with Hadoop Apps?
     • Two years ago: no. Today: yes.
     • Hive, Spark, and others are making sure they work on cloud KV object stores via HCFS, and customers are ensuring their own apps do too
     • Lack of real directories and their ACLs: fake directories plus bucket ACLs
     • S3's eventual consistency is being worked around with S3Guard (note: OzoneFS is consistent)
     • Lack of rename in S3 is being worked around: various direct output committers (early versions had issues); the Netflix direct committer, being replaced by Iceberg; or via the metastore (Databricks has a proprietary version; Hive's approach)
  16. Details of HDDS
  17. Container Structure (Using RocksDB)
     • An embedded LSM KV store (RocksDB): the BlockId is the key, the filename of the local chunk file is the value (see the sketch below)
     • Optimizations: small blocks (< 1MB) can be stored directly in RocksDB; chunk data is compacted to avoid lots of small files
     • This can be evolved over time
     (Diagram: the container index in LevelDB/RocksDB maps each key to a chunk data file name, offset, and length.)
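A minimal sketch of that container-index idea using the RocksDB Java bindings; the key/value encoding shown is an assumption for illustration, not the real HDDS on-disk schema.

```java
// Sketch of a container index with the RocksDB Java bindings (org.rocksdb):
// block ids are keys, the value points at the local chunk file.
// The "file,offset,length" encoding is illustrative, not the HDDS format.
import java.nio.charset.StandardCharsets;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class ContainerIndexSketch {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (Options opts = new Options().setCreateIfMissing(true);
         RocksDB db = RocksDB.open(opts, "/tmp/container-42-index")) {
      // BlockId "ContainerId:LocalId" -> chunk file name, offset, length
      db.put("42:1001".getBytes(StandardCharsets.UTF_8),
             "chunk_0007.dat,0,4194304".getBytes(StandardCharsets.UTF_8));
      byte[] loc = db.get("42:1001".getBytes(StandardCharsets.UTF_8));
      System.out.println(new String(loc, StandardCharsets.UTF_8));
    }
  }
}
```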
  18. Replication of Containers
     • Use Raft replication instead of the data pipeline, for both data and metadata
     • Proven to be correct; traditionally Raft is used for small updates and transactions, so it fits metadata well
     • Performance considerations: when writing the metadata into the Raft journal, put the data directly in container storage; keep the Raft journal on a separate disk for fast contiguous writes without seeking; spread the data across the other disks (see the sketch below)
     • The client uses the Raft protocol to write data to the DataNodes storing the container
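The sketch below is hypothetical: the interfaces are invented for illustration and are not the real HDDS/Ratis API. It only shows the design point that bulk chunk data goes straight to container storage, while the small commit record is what passes through the Raft journal on its own disk.

```java
// Hypothetical sketch (interfaces invented for illustration, not HDDS/Ratis):
// chunk data bypasses the Raft log; only a small commit record is appended
// to the Raft journal, which lives on a dedicated disk.
import java.util.concurrent.CompletableFuture;

interface ContainerStorage {
  void writeChunk(long containerId, long localId, byte[] chunk); // data disks
}

interface RaftJournal {
  CompletableFuture<Void> append(String commitRecord);           // journal disk
}

public class ContainerWriteSketch {
  private final ContainerStorage storage;
  private final RaftJournal journal;

  ContainerWriteSketch(ContainerStorage storage, RaftJournal journal) {
    this.storage = storage;
    this.journal = journal;
  }

  CompletableFuture<Void> writeBlockChunk(long cid, long lid, byte[] chunk) {
    storage.writeChunk(cid, lid, chunk);          // bulk data written directly
    return journal.append("commit " + cid + ":" + lid + " len=" + chunk.length);
  }
}
```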
  19. Open and Closed Containers
     Open (active writers):
     • Need at least (NumSpindles * NumDataNodes) open active containers to keep all DataNodes and all spindles busy (see the sketch below)
     • Clients can get locality on writes
     • Data is spread across all DataNodes, improving IO and the chance of locality
     Closed (typically when full, or after a past failure):
     • Why close a container on failure? We originally considered keeping it open and bringing in a new DN, but that means waiting for the data to copy; we decided to close it and have it re-replicated instead
     • A closed container can be reopened later, or merged with other closed containers (under design)
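A back-of-the-envelope sketch of the "NumSpindles * DataNodes" rule; the cluster size and spindle count are illustrative assumptions.

```java
// Back-of-the-envelope: how many containers must stay open so that every
// spindle on every DataNode can host an active writer. Numbers are assumed.
public class OpenContainerEstimate {
  public static void main(String[] args) {
    int dataNodes = 100;   // assumed cluster size
    int spindles = 12;     // assumed disks per DataNode
    int minOpenContainers = dataNodes * spindles;
    System.out.println("At least " + minOpenContainers
        + " open containers to keep all spindles busy"); // prints 1200
  }
}
```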
  20. Details of Ozone
  21. Ozone Master (diagram)
     • The Ozone Master holds the KV namespace in RocksDB: File (object) = Bid[], where Bid = Cid + LocalId
     • The CM holds the ContainerMap (Cid -> IP address of DN)
     • A client calls Open(key, ...) on the Ozone Master to get the block ids (bId[]), resolves container locations via GetBlockLocations(Bid) with a container-map cache, and then reads and writes directly against the DataNodes DN1...DNn
  22. Ozone APIs
     • Key: /VolumeName/BucketId/ObjectKey, e.g. /Home/John/foo/bar/zoo (see the sketch below)
     • ACLs at the volume and bucket level (the other "directories" are fake)
     • Future sharding at the bucket level
     • Ozone is consistent (unlike S3)
     • APIs: Ozone object API (RPC), S3 connector, and Hadoop FileSystem / FileContext connectors
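To illustrate the volume/bucket/key model, here is a sketch that writes one object through the Ozone client API as it appears in later Apache Ozone releases; the class and method names, and the volume/bucket/key names, are assumptions relative to the version described in this talk.

```java
// Sketch of writing an object under /volume/bucket/key, using the Ozone
// client API from later Apache Ozone releases (names may differ here).
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.apache.hadoop.ozone.client.ObjectStore;
import org.apache.hadoop.ozone.client.OzoneBucket;
import org.apache.hadoop.ozone.client.OzoneClient;
import org.apache.hadoop.ozone.client.OzoneClientFactory;
import org.apache.hadoop.ozone.client.OzoneVolume;
import org.apache.hadoop.ozone.client.io.OzoneOutputStream;

public class OzonePutExample {
  public static void main(String[] args) throws Exception {
    try (OzoneClient client = OzoneClientFactory.getRpcClient(new OzoneConfiguration())) {
      ObjectStore store = client.getObjectStore();
      store.createVolume("home");              // ACLs live at the volume level...
      OzoneVolume vol = store.getVolume("home");
      vol.createBucket("john");                // ...and at the bucket level
      OzoneBucket bucket = vol.getBucket("john");

      byte[] data = "hello ozone".getBytes(StandardCharsets.UTF_8);
      // "foo/bar/zoo" is just a key name; its path segments are fake directories.
      try (OzoneOutputStream out = bucket.createKey("foo/bar/zoo", data.length)) {
        out.write(data);
      }
    }
  }
}
```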
  23. Where Does the Ozone Master Run?
     Which node?
     • On a separate node with enough memory to cache the working set; caching the working set matters for large numbers of concurrent clients, and this option gives predictable performance for large clusters
     • On the DataNodes: how much memory is left for caching? Note that tasks and other services also run on DNs, since they are typically compute nodes as well
     Where is the storage for the Ozone KV metadata?
     • On local disk; if on a DN, is it a dedicated disk or shared with the DN?
     • In container storage (it uses RocksDB anyway); spread Ozone volumes across containers to gain performance, but this may limit volume size and force more Ozone volumes than the admin wants
  24. Quadra: LUN-like Raw-Block Storage, used for creating a mountable disk FS volume
  25. Quadra: Raw-Block Storage Volume (LUN)
     A LUN-like storage service where the blocks are stored on HDDS:
     • Volume: a raw-block device that can be used to create a mountable disk on Linux
     • Raw blocks are those of the native FS that will use the LUN volume; the raw-block size is dictated by the native FS, e.g. 4K for ext4
     • Raw blocks are the unit of IO for the native file system and the unit of read/write/update to HDDS
     • Ozone and Quadra share HDDS as a common storage backend
     • Current prototype: 1 raw block = 1 HDDS block (this will change later)
     • Can be used in Kubernetes for container state
  26. Status
     HDDS (block containers):
     • 2-4GB block containers initially: a 40-80x reduction in block reports and in the block map, reducing block-report pressure on the NN/Ozone Master
     • Initial version to scale to tens of billions of blocks
     Ozone Master:
     • Implemented using RocksDB (just like HDDS on the DNs)
     • Initial version to scale to 10 billion objects
     Current status and steps to GA:
     • Stabilize HDDS and Ozone; measure and improve performance
     • Add HA for the Ozone Master and Container Manager
     • Add security (security design completed and published)
     After GA:
     • Further stabilization and performance improvements
     • Transparent encryption
     • Erasure coding
     • Snapshots (or their equivalent)
     • ...
  27. Summary
     HDFS scale is proven in real production systems:
     • 4K+ node clusters
     • Raw storage of >200PB in a single federated NN cluster and >30PB in non-federated clusters
     • Scales to 60K+ concurrent clients bombarding the NN
     • But a very large number of small files is a challenge (500M files)
     HDDS + Ozone: scalable Hadoop storage
     • Retains HDFS block-storage fault tolerance, HDFS horizontal scaling of storage and IO, and HDFS's "move computation to storage"
     • HDDS block containers: initially scale to 10B blocks, later to 100B+ blocks (HDFS-7240)
     • Ozone: flat KV namespace plus a Hadoop-compatible FS (OzoneFS); initially scales to 10B files (HDFS-13074)
     • The community is working on a hierarchical namespace on HDDS (HDFS-10419)
  28. Thank You. Q&A
