Apache Hadoop 3.0 Community Update
About
Sanjay Radia
Chief Architect, Founder, Hortonworks
Part of the original Hadoop team at Yahoo! since 2007
– Chief Architect of Hadoop Core at Yahoo!
– Apache Hadoop PMC and Committer
Prior
Data center automation, virtualization, Java, HA, OSs, File Systems
Startup, Sun Microsystems, Inria …
Ph.D., University of Waterloo
Why Hadoop 3.0
Driving reasons, and some features taking advantage of 3.0:
– A lot of content in trunk that never made it into the 2.x branch
– JDK upgrade (on its own, this does not truly require bumping the major version number)
– Hadoop command scripts rewrite (incompatible; see the sketch below)
– Big features that need a stabilizing major release: erasure codes
– YARN: long-running services
– Ephemeral ports (incompatible)
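As a sketch of what the script rewrite means day to day (assuming a stock Hadoop 3 install; the old per-daemon wrapper scripts are deprecated in favor of a --daemon flag on the entry-point commands):

  # Hadoop 2 style (deprecated in 3.0): hadoop-daemon.sh start namenode
  hdfs --daemon start namenode
  yarn --daemon start resourcemanager
  mapred --daemon start historyserver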
Apache Hadoop 3.0
Key Takeaways
– HDFS: erasure codes
– YARN: long-running services, scheduler enhancements, isolation & Docker, new UI
– Lots of trunk content
– JDK8 and newer dependent libraries
Release Timeline
– 3.0.0-alpha1 – Sep 3, 2016
– alpha2 – Jan 25, 2017
– alpha3 – May 16, 2017
– alpha4 – Jul 7, 2017
– beta/GA – Q4 2017 (estimated)
Agenda
Major changes you should know before upgrading to Hadoop 3.0
– JDK upgrade
– Dependency upgrades
– Changes to the default ports for daemons/services
– Shell script rewrite
Features
– Hadoop Common
• Client-side classpath isolation
• Shell script rewrite
– HDFS/Storage
• Erasure coding
• Multiple standby NameNodes
• Intra-DataNode balancer
• Cloud storage: support for Azure Data Lake, S3 consistency & performance
– YARN
• Support for long-running services
• Scheduling enhancements: app/queue priorities, global scheduling, placement strategies
• New UI
• ATS v2
– MapReduce
• Task-level native optimization
HADOOP-11264
Hadoop Operations – JDK Upgrade
Minimum JDK for Hadoop 3.0.x is JDK8 (HADOOP-11858)
– Oracle JDK 7 reached end of life in April 2015!
Moving forward to use new features of JDK8
– Lambda expressions (starting to use these)
– Stream API
– Security enhancements
– Performance enhancements for HashMap, IO/NIO, etc.
Hadoop's evolution with JDK upgrades
– Hadoop 2.6.x – JDK 6, 7, 8 or later
– Hadoop 2.7.x/2.8.x/2.9.x – JDK 7, 8 or later
– Hadoop 3.0.x – JDK 8 or later
Change of Default Ports for Hadoop Services
Previously, the default ports of multiple Hadoop services fell in the Linux ephemeral port range (32768–61000)
– They can conflict with other apps running on the same node
– They can cause problems during a rolling restart if another app grabs the port
New ports (old → new):
– NameNode: 50470 → 9871, 50070 → 9870, 8020 → 9820
– Secondary NameNode: 50091 → 9869, 50090 → 9868
– DataNode: 50020 → 9867, 50010 → 9866, 50475 → 9865, 50075 → 9864
– KMS: 16000 → 9600
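A quick way to verify which ports a given cluster actually uses before and after the upgrade (a sketch; hdfs getconf prints the effective configuration, and the keys shown are the standard HDFS ones):

  hdfs getconf -confKey dfs.namenode.http-address    # 0.0.0.0:9870 with 3.0 defaults
  hdfs getconf -confKey dfs.namenode.https-address   # 0.0.0.0:9871
  hdfs getconf -confKey dfs.datanode.http.address    # 0.0.0.0:9864

If clients depend on the old 2.x ports, these keys can be pinned explicitly in hdfs-site.xml before upgrading.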
Classpath Isolation (HADOOP-11656)
Hadoop leaks lots of dependencies onto the application's classpath
○ Known offenders: Guava, Protobuf, Jackson, Jetty, …
○ Potential conflicts with your app's dependencies (no shading)
No separate HDFS client jar means server jars are leaked
● NN and DN libraries are pulled in even though they are not needed
HDFS-6200: Split the HDFS client into a separate JAR
HADOOP-11804: Shaded hadoop-client dependency
YARN-6466: Shade the task umbilical for a clean YARN container environment (ongoing)
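A sketch of how an application avoids the leakage in Hadoop 3, using the shaded artifacts from HADOOP-11804 (the 3.0.0 version string is an assumption for illustration):

  # Build against the shaded client instead of the fat hadoop-client:
  #   org.apache.hadoop:hadoop-client-api:3.0.0      (compile-time API, shaded)
  #   org.apache.hadoop:hadoop-client-runtime:3.0.0  (runtime, third-party deps relocated)
  # To see everything the unshaded classpath drags in on a cluster node:
  hadoop classpath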
HDFS
Support for Three NameNodes for HA
Intra-DataNode disk balancer (see the sketch after this list)
Cloud storage improvements (see afternoon talk)
Erasure coding
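A sketch of the intra-DataNode disk balancer workflow (HDFS-1312); the hostname and plan file name are illustrative, and the plan JSON is written out by the -plan step:

  hdfs diskbalancer -plan dn1.example.com                # compute a plan to even out that node's disks
  hdfs diskbalancer -execute dn1.example.com.plan.json   # run the generated plan
  hdfs diskbalancer -query dn1.example.com               # check progress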
Current (2.x) HDFS Replication Strategy
Three replicas by default
– 1st replica on the local node, local rack, or a random node
– 2nd and 3rd replicas on the same remote rack
– Reliability: tolerates 2 failures
Good data locality, local short-circuit reads
Multiple copies => parallel IO for parallel compute
Very fast block recovery and node recovery
– Parallel recovery – the bigger the cluster, the faster
– 10 TB node recovery: 30 seconds to a few hours
3x storage overhead vs 1.4–1.6x for erasure coding
– Remember that Hadoop's JBOD is very cheap
• 1/10 – 1/20 the cost of SANs
• 1/10 – 1/5 the cost of NFS
(Diagram: replica r1 on a DataNode in Rack I; replicas r2 and r3 on DataNodes in Rack II.)
Erasure Coding
k data blocks + m parity blocks (k + m)
– Example: Reed-Solomon 6+3
Reliability: tolerate m failures
Save disk space
Save I/O bandwidth on the write path
(Diagram: 6 data blocks b1–b6 plus 3 parity blocks P1–P3; 1.5x storage overhead, tolerates any 3 failures.)

                                3-replication   (6,3) Reed-Solomon
  Maximum fault tolerance             2                 3
  Disk usage (N bytes of data)       3N               1.5N
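The table rows follow directly from the code parameters; for a general (k, m) code storing N bytes of data:

\[
\text{storage used} = \frac{k+m}{k}\,N, \qquad \text{tolerated failures} = m
\]

For (6,3) Reed-Solomon this gives (9/6)N = 1.5N with m = 3 tolerated failures, versus 3N and 2 failures for 3-way replication.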
Block Reconstruction
Block reconstruction overhead
– Higher network bandwidth cost
– Extra CPU overhead
• Mitigations from the literature: Local Reconstruction Codes (LRC), Hitchhiker
Huang et al. Erasure Coding in Windows Azure Storage. USENIX ATC'12.
Sathiamoorthy et al. XORing elephants: novel erasure codes for big data. VLDB 2013.
Rashmi et al. A "Hitchhiker's" Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers. SIGCOMM'14.
(Diagram: blocks b1–b6 and parity blocks P1–P3, each on a different rack; reconstructing a block reads from multiple remote racks.)
Erasure Coding on Contiguous vs. Striped Blocks
Two approaches
EC on contiguous blocks
– Pros: better for locality
– Cons: small files cannot be handled
EC on striped blocks
– Pros: leverages multiple disks in parallel
– Pros: works for small files
– Cons: no data locality for readers
(Diagrams: striped layout – each stripe spreads data cells C1–C6 plus parity cells PC1–PC3 across the block group; contiguous layout – data blocks b1–b6 of files f1, f2, f3 bundled into a coding group that shares parity blocks P1–P3.)
Erasure Coding Zone
Create a zone on an empty directory
– Shell command:
hdfs erasurecode -createZone [-s <schemaName>] <path>
All files under a zone directory are automatically erasure coded
– Renames across zones with different EC schemas are disallowed
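Note this is the alpha-era CLI shown on the slide; by the 3.0 GA line, zones were generalized into per-directory EC policies and the subcommand was renamed to hdfs ec (the policy name below is the built-in default in 3.0):

  hdfs ec -enablePolicy -policy RS-6-3-1024k
  hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k
  hdfs ec -getPolicy -path /data/cold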
Write Pipeline for Replicated Files
Write pipeline to datanodes
Durability
– Uses 3 replicas to tolerate a maximum of 2 failures
Visibility
– Reads are supported on files that are being written
– Data can be made visible by hflush/hsync
Consistency
– A client can start reading from any replica and fail over to any other replica to read the same data
Appendable
– Files can be reopened for append
(Diagram: Writer → DN1 → DN2 → DN3 pipeline; data flows down the chain, acks flow back. DN = DataNode.)
Parallel Write for EC Files
Parallel write
– The client writes to a group of 9 datanodes at the same time
– Parity bits are calculated on the client side, at write time
Durability
– (6, 3)-Reed-Solomon can tolerate a maximum of 3 failures
Visibility (same as replicated files)
– Reads are supported on files that are being written
– Data can be made visible by hflush/hsync
Consistency
– A client can start reading from any 6 of the 9 replicas
– When reading from a datanode fails, the client can fail over to any other remaining replica to read the same data
Appendable (same as replicated files)
– Files can be reopened for append
(Diagram: Writer streams data in parallel to DN1–DN6 and parity to DN7–DN9, with acks from each; stripe size 1 MB.)
EC: Write Failure Handling
Datanode failure
– The client ignores the failed datanode and continues writing
– Able to tolerate 3 failures
– Requires at least 6 datanodes
– Missing blocks will be reconstructed later
(Diagram: same parallel-write layout as before; failed datanodes are simply skipped.)
Replication: Slow Writers & Replace-Datanode-on-Failure
Write pipeline for replicated files
– A datanode can be replaced in case of failure
Slow writers
– A write pipeline may last for a long time
– The probability of datanode failures increases over time
– Hence the need to replace datanodes on failure
EC files
– Do not support replace-datanode-on-failure
– Slow-writer handling is improved
(Diagram: replicated-write pipeline Writer → DN1 → DN2 → DN3, with DN4 substituted in after a failure.)
Reading with Parity Blocks
Parallel read
– Read from the 6 datanodes holding data blocks
– Supports both stateful read and pread
Block reconstruction
– Read parity blocks to reconstruct missing blocks
(Diagram: Reader fetches Block1–Block6 from DN1–DN6; when the datanode holding Block3 fails, Parity1 on DN7 is read to reconstruct it.)
EC Implications
File data is striped across multiple nodes and racks
Reads and writes are remote and cross-rack
Reconstruction is network-intensive: it reads k blocks cross-rack
– Need a fast network
• Requires high network bandwidth between client and server
• A dead DataNode implies high network traffic and long reconstruction times
Important to use the optimized ISA-L library for performance
– 1+ GB/s encode/decode speed, much faster than the Java implementation
– CPU is no longer the bottleneck
Need to combine data into larger files to avoid an explosion in replica count
– Bad: 1x1 GB file -> RS(10,4) -> 14x100 MB EC blocks (4.6x the # of replicas)
– Good: 10x1 GB file -> RS(10,4) -> 14x1 GB EC blocks (0.46x the # of replicas)
Works best for archival / cold data use cases
EC performance – write performance is faster with the right EC library (chart)
EC performance – TPC benchmark with no DataNode killed (chart)
EC performance – TPC benchmark with 2 DataNodes killed (chart)
Erasure coding status
Massive development effort by the Hadoop community
○ 20+ contributors from many companies (Hortonworks, Yahoo! Japan, Cloudera, Intel, Huawei, …)
○ 100s of commits over three years (started in 2014)
Erasure coding is feature complete!
Solidifying some user APIs in preparation for beta1
Current focus is on testing and integration efforts
○ Want the complete Hadoop stack to work with HDFS erasure coding enabled
○ Stress / endurance testing to ensure stability
Apache Hadoop 3.0 – YARN Enhancements
YARN Scheduling Enhancements
Support for Long Running Services
Re-architecture for YARN Timeline Service - ATS v2
Better elasticity and resource utilization
Better resource isolation and Docker!!
Better User Experiences
Other Enhancements
Scheduling Enhancements
Application priorities within a queue: YARN-1963
– In queue A, App1 > App2
Inter-queue priorities
– Q1 > Q2 irrespective of demand / capacity
– Previously based on unconsumed capacity
Affinity / anti-affinity: YARN-1042
– More constraints on placement
• Affinity to a rack (e.g. where a sibling container runs)
• Anti-affinity (e.g. HBase region servers)
Global scheduling: YARN-5139
– Gets rid of scheduling triggered on node heartbeats
– Replaced with a global scheduler that has parallel threads
• Globally optimal placement – expect evolution of the scheduler
• Critical for long-running services – they stick to the allocation, so it had better be a good one
• Enhanced container scheduling throughput (6x)
Scheduling Enhancements (Contd.)
CapacityScheduler improvements
– Queue Management Improvements
• More Dynamic Queue reconfiguration
• REST API support for queue management
– Absolute resource configuration support
– Priority Support in Application and Queue
– Preemption improvements
• Inter-Queue preemption support
Key Drivers for Long-Running Services
Consolidation of infrastructure
– Hadoop clusters have a lot of compute and storage resources (some unused)
• Can't I use Hadoop's resources for non-Hadoop load?
• OpenStack is hard to manage/operate – can I use YARN?
• VMs are expensive – can I use YARN?
• But does it support Docker? – Yes, we heard you
Hadoop-related data services that run outside a Hadoop cluster
– Why can't I run them in the Hadoop cluster?
Run Hadoop services (Hive, HBase) on YARN
– Run multiple instances
– Benefit from YARN's elasticity and resource management
Built-in Support for Long-Running Services in YARN
A native YARN framework: YARN-4692
• An abstract common framework (similar to Apache Slider) to support long-running services
• A more simplified API (to manage the service lifecycle)
• Better support for long-running services
Recognition of long-running services
• Affects the policies for preemption, container reservation, etc.
• Auto-restart of containers
• Containers for long-running services are restarted on the same node when there is local state
Service/application upgrade support – YARN-4726
• In general, services are expected to run long enough to cross versions
Dynamic container configuration
• Ask for just enough resources, and adjust them at runtime (memory is harder)
Service Discovery in YARN
Services can run on any YARN node; how do clients get their IP?
– A service can also move due to node failure
YARN service discovery via DNS: YARN-4757
– Exposes existing service information in the YARN registry via DNS
• The YARN service registry's records are converted into DNS entries
– Discovery of container IPs and service ports via standard DNS lookups
• Application:
– zkapp1.user1.yarncluster.com -> 192.168.10.11:8080
• Container:
– container-1454001598828-0001-01-00004.yarncluster.com -> 192.168.10.18
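Because discovery is plain DNS, any resolver works; a sketch assuming the YARN-4757 registry DNS server runs at registrydns.example.com (hostname and record names are illustrative):

  dig @registrydns.example.com zkapp1.user1.yarncluster.com A
  dig @registrydns.example.com container-1454001598828-0001-01-00004.yarncluster.com A

No YARN client libraries are needed on the lookup side, which is the point of using DNS.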
A More Powerful YARN
Elastic resource model
– Dynamic resource configuration (YARN-291)
• Allows tuning a NodeManager's resources down/up at runtime
– E.g. helps when Hadoop cluster nodes are shared with other workloads
– E.g. Hadoop-on-Hadoop allows flexible resource allocation
– Graceful decommissioning of NodeManagers (YARN-914)
• Drains a node that is being decommissioned so running containers can finish
• E.g. removing a node for maintenance, spot pricing on cloud, …
Efficient resource utilization
– Support for container resizing (YARN-1197)
• Allows applications to change the size of an existing container
• E.g. long-running services
More Powerful YARN (Contd.)
Resource isolation
– Resource isolation support for disk and network
• YARN-2619 (disk), YARN-2140 (network)
• Containers get a fair share of disk and network resources using cgroups
– Docker support in the LinuxContainerExecutor (YARN-3611)
• Docker containers can be launched alongside ordinary process containers
• Packaging and resource isolation
– Packaging made easier, e.g. TensorFlow
• Complements YARN's support for long-running services (see the sketch below)
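A sketch of launching a Docker container via YARN's bundled distributed-shell application, assuming Docker support has been enabled on the NodeManagers (the jar path and image are assumptions; the YARN_CONTAINER_RUNTIME_* environment variables select the Docker runtime):

  DSHELL_JAR=$HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar
  yarn jar $DSHELL_JAR -jar $DSHELL_JAR \
    -num_containers 1 \
    -shell_command "cat /etc/os-release" \
    -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker \
    -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=library/centos:7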
Docker on YARN & YARN on YARN – YCloud
(Diagram: a YARN cluster running Hadoop apps – MR, Tez, Spark – plus TensorFlow and an inner YARN cluster that itself runs MR, Tez, and Spark.)
Can use YARN to test Hadoop!!
YARN New UI (YARN-3368) (screenshots)
Other YARN work planned in Hadoop 3.X
Resource profiles (YARN-3926)
– Users can specify resource profile name instead of individual resources
– Resource types read via a config file
YARN federation (YARN-2915)
– Allows YARN to scale out to tens of thousands of nodes
– Cluster of clusters which appear as a single cluster to an end user
Compatibility & Testing
Compatibility
Preserves wire-compatibility with Hadoop 2 clients
○ Impossible to coordinate upgrading off-cluster Hadoop clients
Will support rolling upgrade from Hadoop 2 to Hadoop 3
○ Can’t take downtime to upgrade a business-critical cluster
Not fully preserving API compatibility!
○ Dependency version bumps
○ Removal of deprecated APIs and tools
○ Shell script rewrite, rework of Hadoop tools scripts
○ Incompatible bug fixes
Testing and validation
Extended alpha → beta → GA plan designed for stabilization
EC already has some usage in production (700 nodes at Yahoo! Japan)
– Hortonworks has worked closely with this very large customer
Hortonworks is integrating and testing HDP 3
– Integrating with all components of HDP stack
– HDP2 ++ integration tests
Cloudera is also testing Hadoop 3 as part of their stack
Plans for extensive HDFS EC testing by Hortonworks and Cloudera
Happy synergy between the 2.8.x and 3.0.x lines
– They share much of the same code; fixes flow into both
– Yahoo! deployments are based on 2.8.0
Summary : What’s new in Apache Hadoop 3.0?
Storage Optimization
HDFS: Erasure codes
Improved Utilization
YARN: Long Running Services
YARN: Scheduler Enhancements
Additional Workloads
YARN: Docker & Isolation
Easier to Use
New User Interface
Refactor Base
Lots of Trunk content
JDK8 and newer dependent libraries
Thank you!
Reminder: BoFs on Thursday
Editor's notes
Data trends
– From characteristics of the data to data consumption & interaction
– According to IBM, every day we create 2.5 quintillion bytes of data – so much that 90% of the data in the world today has been created in the last two years
– Insight from data is a key competitive differentiator
– Open source is evolving and adapting with these trends the fastest
– Adopting Hadoop is not a destination but a journey
On striped EC: it enables online EC, which bypasses the conversion phase and immediately saves storage space; this is especially desirable in clusters with high-end networking. Second, it naturally distributes a small file to multiple DataNodes and eliminates the need to bundle multiple files into a single coding group.
On queue priorities: previously based on unconsumed capacity – if a queue at 70% capacity has lots of unconsumed capacity, it is scheduled first. Now you can say that the 30% queue is higher priority.
The original YARN design was not just for batch jobs – we started with that, but the design was general.
Graceful degradation: remove nodes gracefully – especially for cloud, if you are using spot pricing.
On the new UI: app-centric (top two left pictures); node-centric; resource-centric (load vs. capacity, overall and by queue); cluster-centric (node summary, heatmap of resource usage across nodes).