
Hadoop Meetup Jan 2019 - HDFS Scalability and Consistent Reads from Standby Node


Konstantin Shvachko and Chen Liang of LinkedIn team up with Chao Sun of Uber to present the current state of, and future plans for, HDFS scalability, with an extended discussion of the newly introduced read-from-standby feature.

This is taken from the Apache Hadoop Contributors Meetup on January 30, hosted by LinkedIn in Mountain View.



  1. Consistent Reads from Standby Node
     Konstantin V Shvachko, Sr. Staff Software Engineer @LinkedIn
     Chen Liang, Senior Software Engineer @LinkedIn
     Chao Sun, Software Engineer @Uber
  2. Agenda: HDFS CONSISTENT READS FROM STANDBY
     • Motivation
     • Consistent Reads from Standby
     • Challenges
     • Design and Implementation
     • Next steps
  3. The Team
     • Konstantin Shvachko (LinkedIn)
     • Chen Liang (LinkedIn)
     • Erik Krogen (LinkedIn)
     • Chao Sun (Uber)
     • Plamen Jeliazkov (Paypal)
  4. Consistent Reads From Standby Nodes
  5. Motivation
     • 2x growth per year in workloads and size
     • Approaching Active NameNode performance limits rapidly
     • We need a scalability solution
     • Key insights:
       o Reads comprise 95% of all metadata operations in our practice
       o Standby Nodes are another source of truth for reads
     • Standby Nodes serving read requests:
       o Can substantially decrease Active NameNode workload
       o Allows the cluster to scale further!
  6. Architecture: ROLE OF STANDBY NODES
     [Diagram: DataNodes, Active NameNode, Standby NameNodes, JournalNodes; write and read paths]
     • Standby nodes have the same copy of all metadata (with some delay)
     • The Standby Node syncs edits from the Active NameNode
     • Standby nodes can potentially serve read requests
     • All reads can go to Standby nodes
     • OR, time-critical applications can still choose to read from the Active only
  7. Challenges
     [Diagram: DataNodes, Active NameNode, Standby NameNodes, JournalNodes; write and read paths]
     • Standby Node delay
       o The ANN writes edits to the JournalNodes, then the SbNN applies the edits from the JournalNodes
       o The delay is on the order of minutes
     • Consistency
       o If a client performs a read after a write, the client expects to see the state change
  8. Fast Journaling: DELAY REDUCTION
     • Fast edit tailing, HDFS-13150
     • The current JournalNode is slow: it serves whole segments of edits from disk
     • Optimizations on the JournalNode and SbNN (see the configuration sketch below)
       o The JournalNode caches recent edits in memory; only applied edits are served
       o The SbNN requests only recent edits through RPC calls
       o Falls back to the existing mechanism on error
     • Significantly reduces SbNN delay
       o Reduced from about 1 minute to 2-50 milliseconds
     • Standby node delay is no more than a few milliseconds in most cases
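For reference, a minimal sketch of the configuration that enables this fast-tailing path, using the dfs.ha.tail-edits.* keys described in the upstream Observer NameNode documentation. The values are illustrative; in a real deployment these are set in hdfs-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;

public class FastTailingConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Tail in-progress edit segments from the JournalNodes' in-memory cache
        // over RPC instead of reading only finalized segments from disk (HDFS-13150).
        conf.setBoolean("dfs.ha.tail-edits.in-progress", true);

        // Poll for new edits very frequently so the Standby/Observer stays within
        // a few milliseconds of the Active NameNode (the default period is 60s).
        conf.set("dfs.ha.tail-edits.period", "0ms");

        System.out.println("in-progress tailing = "
            + conf.getBoolean("dfs.ha.tail-edits.in-progress", false));
    }
}
```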
  9. Consistency Model
     • Consistency principle:
       o If client c1 modifies an object's state to id1 at time t1, then at any future time t2 > t1, c1 will see the state of that object at some id2 >= id1
     • Read-your-own-writes:
       o The client writes to the Active NameNode
       o Then reads from the Standby Node
       o The read should reflect the write
     [Diagram: client with lastSeenStateId = 100; Active NameNode at txnid = 100, Standby NameNode at txnid = 99, JournalNodes in between]
  10. Consistency Model (cont.)
     • Consistency principle:
       o If client c1 modifies an object's state to id1 at time t1, then at any future time t2 > t1, c1 will see the state of that object at some id2 >= id1
     • LastSeenStateId (see the sketch below)
       o A monotonically increasing id of the ANN namespace state (txnid)
       o Kept on the client side; the client's most recently observed ANN state
       o Sent to the SbNN; the SbNN only replies after it has caught up to this state
     [Diagram: client with lastSeenStateId = 100; Active NameNode at txnid = 100, Standby NameNodes and JournalNodes catching up]
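To make the protocol concrete, here is a hypothetical, heavily simplified sketch of the lastSeenStateId handshake. The class, field, and method names are invented for illustration; the real logic is built into the HDFS client and NameNode RPC layers.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; not the real HDFS classes.
public class StateIdSketch {
    // Client side: highest ANN transaction id observed (piggybacked on every RPC response).
    static final AtomicLong lastSeenStateId = new AtomicLong(0);

    // Observer side: transaction id of the last edit this node has applied.
    static volatile long observerAppliedTxnId = 0;

    static void clientWroteToActive(long annTxnIdAfterWrite) {
        lastSeenStateId.accumulateAndGet(annTxnIdAfterWrite, Math::max);
    }

    static String observerRead(long clientStateId) throws InterruptedException {
        // The Observer only answers once it has caught up to the client's state,
        // which guarantees read-your-own-writes.
        while (observerAppliedTxnId < clientStateId) {
            TimeUnit.MILLISECONDS.sleep(1);
        }
        return "metadata as of txnid " + observerAppliedTxnId;
    }

    public static void main(String[] args) throws Exception {
        clientWroteToActive(100);                                   // write handled by the Active NN
        observerAppliedTxnId = 99;                                  // Observer is slightly behind
        new Thread(() -> { observerAppliedTxnId = 100; }).start();  // edits tailed shortly after
        System.out.println(observerRead(lastSeenStateId.get()));
    }
}
```

The same "wait until caught up" rule is what the back-off optimization on a later slide relaxes when an Observer is too far behind.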
  11. Corner Case: Stale Reads
     • Case 1: Multiple client instances
       o DFSClient#1 writes to the ANN, DFSClient#2 reads from the SbNN
       o DFSClient#2's state is older than DFSClient#1's, so its read is out of sync
     • Case 2: Out-of-band communication
       o Client#1 writes to the ANN, then informs Client#2
       o Client#2 reads from the SbNN and does not see the write
     [Diagrams: "Read your own writes" and "Third-party communication" scenarios]
  12. msync API
     • Dealing with stale reads: FileSystem.msync()
       o Syncs between existing client instances
       o Forces the DFSClient to sync up to the most recent state of the ANN
     • Multiple client instances: call msync on DFSClient#2
     • Out-of-band communication: Client#2 calls msync before reading (see the example below)
     • "Always msync" mode, HDFS-14211
     [Diagrams: "Read your own writes" and "Third-party communication" scenarios]
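A short usage example of FileSystem.msync() (available since Hadoop 3.3.0) for the out-of-band case; the nameservice URI and file path are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MsyncExample {
    public static void main(String[] args) throws Exception {
        // "mycluster" is a placeholder HA nameservice id.
        FileSystem fs = FileSystem.get(new URI("hdfs://mycluster"), new Configuration());

        // Client#2 was told out-of-band that Client#1 just wrote this file.
        // msync() advances this client's lastSeenStateId to the Active NameNode's
        // current state, so a read served by an Observer cannot be stale.
        fs.msync();

        FileStatus status = fs.getFileStatus(new Path("/data/events/part-00000"));
        System.out.println("length = " + status.getLen());
    }
}
```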
  13. Robustness Optimization: Standby Node Back-off (REDIRECT WHEN TOO FAR BEHIND)
     • If a Standby node's state is too far behind, the client may retry on another node
       o e.g. the Standby node's machine is running slow
     • Standby Node back-off (see the sketch below)
       o 1: Upon receiving a request, if the Standby node finds itself too far behind the requested state, it rejects the request by throwing a retry exception
       o 2: If a request has been queued for a long time and the Standby still has not caught up, it rejects the request by throwing a retry exception
     • Client retry
       o Upon a retry exception, the client tries a different Standby node, or simply falls back to the ANN
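A hypothetical sketch of the two back-off rules above. The thresholds and exception type are invented for illustration; the actual Observer uses HDFS's internal RPC server and retry exceptions.

```java
// Hypothetical back-off sketch; not the real HDFS Observer code.
class ObserverBackoffSketch {
    static final long MAX_LAG_TXNS = 100_000; // assumed "too far behind" threshold
    static final long MAX_WAIT_MS = 1_000;    // assumed maximum time a read may wait in queue

    static class RetryAnotherNodeException extends RuntimeException {}

    volatile long appliedTxnId; // last edit this Observer has applied

    String serveRead(long clientStateId, long enqueueTimeMs) {
        // Rule 1: reject immediately if this node is too far behind the client's state.
        if (clientStateId - appliedTxnId > MAX_LAG_TXNS) {
            throw new RetryAnotherNodeException();
        }
        // Rule 2: reject if the request has already waited too long without catching up.
        if (appliedTxnId < clientStateId
                && System.currentTimeMillis() - enqueueTimeMs > MAX_WAIT_MS) {
            throw new RetryAnotherNodeException();
        }
        return "metadata as of txnid " + appliedTxnId;
    }
}
```

On either rejection the client-side proxy simply moves on to another NameNode, eventually falling back to the Active.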
  14. Configuration and Startup Process
     • Configuring NameNodes
       o NameNodes are transitioned between states via haadmin
       o Observer mode is similar to Standby, but it serves reads and does not perform checkpointing
       o All NameNodes start as checkpointing Standbys; a Standby can be transitioned to Active or Observer
     • Configuring the client (see the sketch below)
       o Configure it to use ObserverReadProxyProvider
       o If not, the client still works but only talks to the ANN
       o ObserverReadProxyProvider discovers the state of all NameNodes
     [Diagram: state transitions between Active, checkpointing Standby, and read-serving Observer]
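A minimal sketch of the client-side setting, shown programmatically for brevity; in practice it goes into hdfs-site.xml, and "mycluster" is a placeholder nameservice id. On the server side, the Observer NameNode documentation describes promoting a checkpointing Standby with the haadmin tool (hdfs haadmin -transitionToObserver).

```java
import org.apache.hadoop.conf.Configuration;

public class ObserverReadClientConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Route reads to Observer NameNodes; writes and msync() still go to the Active.
        // Without this provider the client keeps working, but talks only to the ANN.
        conf.set(
            "dfs.client.failover.proxy.provider.mycluster",
            "org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider");

        System.out.println(conf.get("dfs.client.failover.proxy.provider.mycluster"));
    }
}
```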
  15. Current Status
     • Tested and benchmarked
       o With YARN applications, e.g. TeraSort
       o With HDFS benchmarks, e.g. DFSIO
       o Run on a cluster with >100 nodes, with Kerberos and delegation tokens enabled
     • Merged to trunk (3.3.0)
     • Being backported to branch-2
     • Active work on further improvement/optimization
     • Has been running in production at Uber
  16. Background
     • Back in 2017, Uber's HDFS clusters were in bad shape
       o Rapid growth in the number of jobs accessing HDFS
       o Ingestion and ad-hoc jobs co-located on the same cluster
       o Lots of listing calls on very large directories (especially Hudi)
     • HDFS traffic composition: 96% reads, 4% writes
     • Presto is very sensitive to HDFS latency
       o Occupies ~20% of HDFS traffic
       o Only reads from HDFS, no writes
  17. Implementation & Timeline
     • Implementation (compared to the open source version)
       o No msync or fast edit log tailing
         - Only eventual consistency, with a maximum staleness of 10s
       o The Observer was NOT eligible for NN failover
       o Batched edits loading to reduce NN lock time when tailing edits
     • Timeline
       o 08/2017: finished the POC and basic testing in dev clusters
       o 12/2017: started collaborating with the HDFS open source community (e.g., LinkedIn, Paypal)
       o 12/2018: fully rolled out to Presto in production
       o Took multiple retries along the way
         - Disable access time (dfs.namenode.accesstime.precision, see the note below)
         - HDFS-13898, HDFS-13924
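For context on the access-time item: when dfs.namenode.accesstime.precision is non-zero, opening a file can update its access time, turning a nominal read into a namespace write that an Observer cannot serve. A minimal sketch of disabling it (a server-side hdfs-site.xml property, set programmatically here only for illustration):

```java
import org.apache.hadoop.conf.Configuration;

public class DisableAccessTime {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // 0 disables access-time updates; the default precision is one hour.
        conf.setLong("dfs.namenode.accesstime.precision", 0L);

        System.out.println(conf.getLong("dfs.namenode.accesstime.precision", -1L));
    }
}
```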
  18. Impact
     Compared to sending all traffic to the Active NameNode, the Observer NameNode improves overall throughput by ~20% (with roughly the same throughput from Presto), while RPC queue time has dropped ~30x.
  19. Impact (cont.)
     Presto listing-status call latency dropped 8-10x after migrating to the Observer.
  20. Next Steps
  21. Three-Stage Scalability Plan (2X GROWTH / YEAR IN WORKLOADS AND SIZE)
     • Stage I. Consistent reads from standby
       o Optimizes for reads: 95% of all operations
       o Consistent reading is a coordination problem
     • Stage II. In-memory Partitioned Namespace
       o Optimizes write operations
       o Eliminates the NameNode's global lock – fine-grained locking
     • Stage III. Dynamically Distributed Namespace Service
       o Linear scaling to accommodate increases in RPC load and metadata growth
     • HDFS-12943
  22. NameNode Current State (NAMENODE'S GLOBAL LOCK – PERFORMANCE BOTTLENECK)
     • Three main data structures
       o INodeMap: id -> INode
       o BlocksMap: key -> BlockInfo
       o DatanodeMap: don't split
     • GSet – an efficient HashMap implementation
       o hash(key) -> value
     • Global lock to update INodes and blocks (see the sketch below)
     [Diagram: the NameNode's FSNamesystem containing the INodeMap / directory tree (GSet: id -> INode), the BlocksMap / Block Manager (GSet: Block -> BlockInfo), and the DataNode Manager]
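A hypothetical, much-simplified sketch of the bottleneck being described: one namespace-wide lock guarding both maps, so every update serializes on it. Class and field names are invented; the real structures are FSNamesystem, the INodeMap, and the BlocksMap.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Invented names; a toy model of "one global lock over INodes and blocks".
class GlobalLockNamespace {
    private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);
    private final Map<Long, String> inodeMap = new HashMap<>();  // id -> INode (stubbed as String)
    private final Map<Long, String> blocksMap = new HashMap<>(); // block id -> BlockInfo (stub)

    void createInode(long id, String inode) {
        fsLock.writeLock().lock();   // every namespace mutation serializes on this single lock
        try {
            inodeMap.put(id, inode);
        } finally {
            fsLock.writeLock().unlock();
        }
    }

    void addBlock(long blockId, String blockInfo) {
        fsLock.writeLock().lock();   // block updates take the same namespace-wide lock
        try {
            blocksMap.put(blockId, blockInfo);
        } finally {
            fsLock.writeLock().unlock();
        }
    }

    String getInode(long id) {
        fsLock.readLock().lock();    // reads share the lock, but still contend with writers
        try {
            return inodeMap.get(id);
        } finally {
            fsLock.readLock().unlock();
        }
    }
}
```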
  23. Stage II. In-memory Partitioned Namespace (ELIMINATE NAMENODE'S GLOBAL LOCK)
     • PartitionedGSet: a two-level mapping
       o 1. RangeMap: keyRange -> GSet
       o 2. RangeGSet: key -> INode
     • Fine-grained locking (see the sketch below)
       o Individual locks per range
       o Different ranges are accessed in parallel
     [Diagram: NameNode with the INodeMap and the BlocksMap each as a Partitioned GSet (GSet-1 ... GSet-n), plus the DataNode Manager]
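A hypothetical sketch of the two-level idea, contrasting with the global-lock sketch above: a range map whose values are independent per-range sets, each guarded by its own lock. The range size, names, and locking scheme are invented for illustration and are not the actual PartitionedGSet / LatchLock implementation.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Invented names; a toy two-level "range map of GSets" with per-range locks.
class PartitionedInodeMap {
    private static final long RANGE_SIZE = 1L << 20; // assumed: 1M inode ids per range

    private static final class Partition {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        final ConcurrentMap<Long, String> gset = new ConcurrentHashMap<>(); // id -> INode (stub)
    }

    // Level 1: keyRange -> GSet, keyed by the range's starting id.
    private final ConcurrentSkipListMap<Long, Partition> rangeMap = new ConcurrentSkipListMap<>();

    private Partition partitionFor(long inodeId) {
        long rangeStart = (inodeId / RANGE_SIZE) * RANGE_SIZE;
        return rangeMap.computeIfAbsent(rangeStart, k -> new Partition());
    }

    void put(long inodeId, String inode) {
        Partition p = partitionFor(inodeId);
        p.lock.writeLock().lock();   // only this range is locked; other ranges proceed in parallel
        try {
            p.gset.put(inodeId, inode);
        } finally {
            p.lock.writeLock().unlock();
        }
    }

    String get(long inodeId) {
        Partition p = partitionFor(inodeId);
        p.lock.readLock().lock();
        try {
            return p.gset.get(inodeId);
        } finally {
            p.lock.readLock().unlock();
        }
    }
}
```

This toy version also illustrates the contention noted on the next slide: because inode ids come from an incrementing generator, newly created inodes all land in the highest range, so that one partition's lock stays hot.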
  24. Stage II. In-memory Partitioned Namespace (EARLY POC RESULTS)
     • PartitionedGSet: two-level mapping
     • LatchLock: swap the RangeMap lock for the GSet locks corresponding to the inode keys
     • Ran NNThroughputBenchmark creating 10 million directories
       o 30% throughput gain
       o Large batches of edits
     • Why not 100%?
       o The key is the inodeId – an incremental number generator
       o Contention on the last partition
       o Expect MORE
  25. Stage III. Dynamically Distributed Namespace (SCALABLE DATA AND METADATA)
     • Split the NameNode state across multiple servers based on key ranges
     • Each NameNode
       o Serves a designated range of INode keys (see the routing sketch below)
       o Keeps metadata in a PartitionedGSet
       o Can reassign certain subranges to adjacent nodes
     • Coordination Service (Ratis)
       o Changes the ranges served by NameNodes
       o Handles renames / moves, quotas
     [Diagram: NameNode 1 ... NameNode n, each with a partitioned INodeMap, a partitioned BlocksMap, and a DataNode Manager]
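Stage III is a design proposal, so the following is only a hypothetical sketch of how a client or router layer might map an inode key to the NameNode currently serving its range; all names and addresses are invented.

```java
import java.util.TreeMap;

// Invented names; a toy range-to-NameNode lookup for the Stage III proposal.
class NamespaceRouter {
    // Range start key -> address of the NameNode currently serving that range.
    // In the proposal this mapping would be kept consistent through the
    // coordination service (Ratis) as subranges move between nodes.
    private final TreeMap<Long, String> rangeToNameNode = new TreeMap<>();

    void assignRange(long rangeStartKey, String nameNodeAddress) {
        rangeToNameNode.put(rangeStartKey, nameNodeAddress);
    }

    String nameNodeFor(long inodeKey) {
        // The serving NameNode owns the greatest range start that is <= the key.
        return rangeToNameNode.floorEntry(inodeKey).getValue();
    }

    public static void main(String[] args) {
        NamespaceRouter router = new NamespaceRouter();
        router.assignRange(0L, "nn1.example.com:8020");
        router.assignRange(1_000_000L, "nn2.example.com:8020");
        System.out.println(router.nameNodeFor(1_234_567L)); // -> nn2.example.com:8020
    }
}
```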
  26. Thank You!
     Konstantin V Shvachko, Sr. Staff Software Engineer @LinkedIn
     Chen Liang, Senior Software Engineer @LinkedIn
     Chao Sun, Software Engineer @Uber
