1. Design, Scale & Performance of the MapR Distribution
M.C. Srivas
CTO, MapR Technologies, Inc.
6/29/2011 © MapR Technologies, Inc. 1
2. Outline of Talk
• What does MapR do?
• Motivation: why build this?
• Distributed NameNode Architecture
• Scalability factors
• Programming model
• Distributed transactions in MapR
• Performance across a variety of loads
3. Complete Distribution
• Integrated, tested, hardened
• Super simple
• Unique advanced features
• 100% compatible with MapReduce, HBase, HDFS APIs
• No recompile required; drop in and use now
4. MapR Areas of Development
• MapReduce
• HBase
• Ecosystem
• Storage
• Management Services
5. JIRAs Open For Year(s)
• HDFS-347 – 7/Dec/08 - Streaming perf sub-optimal
• HDFS-273, 395 – 7/Mar/07 – DFS scalability problems, optimize block-reports
• HDFS-222, 950 – Concatenate files into larger files
• Tom White on 2/Jan/09: "Small files are a big problem for Hadoop ... 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible."
• HDFS Append – no 'blessed' Apache Hadoop distro has the fix
• HDFS-233 – 25/Jun/08 – Snapshot support
• Dhruba Borthakur on 10/Feb/09: "...snapshots can be designed very elegantly only if there is complete separation between namespace management and block management."
6. Observations on Apache Hadoop
• Inefficient (HDFS-347)
• Scaling problems (HDFS-273)
• NameNode bottleneck (HDFS-395)
• Limited number of files (HDFS-222)
• Admin overhead significant
• NameNode failure loses data
• Not trusted as permanent store
• Write-once: data lost unless file closed
• hflush/hsync – unrealistic to expect folks to re-write apps
[Chart: read/write throughput in MB/sec, raw hardware vs. HDFS]
7. MapR Approach
• Some are architectural issues
• Change at that level is a big deal
– Will not be accepted unless proven
– Hard to prove without building it first
• Build it and prove it
– Improve reliability significantly
– Make it tremendously faster at the same time
– Enable new class of apps (e.g., real-time analytics)
8. HDFS Architecture Review
Files are sharded into blocks, distributed across DataNodes.
NameNode (NN) holds in memory:
• Directories and files
• Block replica locations
DataNodes (DN):
• Serve and store blocks
• Have no idea about files/dirs
All metadata ops go to the NN.
9. HDFS Architecture Review
Each DataNode (DN) reports its blocks to the NameNode (NN).
• A large DN sends ~60K blocks per report (256M x 60K = 15T = 5 disks @ 3T each)
• Beyond 100K blocks per report, load becomes extreme
• Restarting a 40GB NN takes 1-2 hours
The addressing unit is an individual block:
• A flat block-address space forces DNs to send giant block-reports
• The NN can hold about ~300M blocks max, limiting cluster size to tens of petabytes
• Increasing the block size to compensate negatively impacts map/reduce
10. How to Scale
• Central meta server does not scale
– Make every server a meta-data server too
– But need memory for map/reduce
• Must page meta-data to disk
• Reduce size of block-reports
– while increasing number of blocks per DN
• Reduce memory footprint of location service
– cannot add memory indefinitely
• Need fast-restart (HA)
11. MapR Goal: Scale to 1000X
             HDFS          MapR
# files      150 million   1 trillion
# data       10-50 PB      1-10 exabytes
# nodes      2,000         10,000+
• Full random read/write semantics: export via NFS and other protocols
• Enterprise-class reliability: instant restart, snapshots, mirrors, no single point of failure, …
• Run close to hardware speeds: at extreme scale, efficiency matters extremely; exploit emerging technology like SSDs and 10GbE
12. MapR's Distributed NameNode
Files/directories are sharded into blocks, which are placed into mini-NameNodes (containers) on disks.
• Each container holds directories & files and data blocks
• Containers are 16-32GB segments of disk, placed on nodes
• Containers are replicated across servers
• No need to manage containers directly; use MapR Volumes
Patent pending.
13. MapR Volumes
Significant advantages over "cluster-wide" or "file-level" approaches: volumes allow management attributes to be applied in a scalable way, at a very granular level, and with flexibility.
Example volume tree: /projects/tahoe, /projects/yosemite, /user/msmith, /user/bjohnson
Per-volume attributes:
• Replication factor
• Scheduled mirroring
• Scheduled snapshots
• Data placement control
• Usage tracking
• Administrative permissions
100K volumes are OK; create as many as desired!
14. MapR Distributed NameNode
Containers are tracked globally.
• A container location map records, for each container, the servers holding its replicas (e.g. S1, S2, S4)
• Clients cache containers & server info for extended periods
• A client fetches container locations from the map, then contacts a server directly to read data from the container
15. MapR's Distr NameNode Scaling
• Containers represent 16-32GB of data; each can hold up to 1 billion files and directories
• 100M containers = ~2 exabytes (a very large cluster)
• 250 bytes of DRAM to cache a container: 25GB caches all containers for a 2EB cluster
• But that is not necessary; the map can page to disk, and a typical large 10PB cluster needs only 2GB
• Container-reports are 100x - 1000x smaller than HDFS block-reports, so the map can serve 100x more data-nodes
• Increase container size to 64GB to serve a 4EB cluster; map/reduce is not affected
16. MapR Distr NameNode HA
MapR:
1. apt-get install mapr-cldb while the cluster is online
Apache Hadoop*:
1. Stop cluster very carefully
2. Move fs.checkpoint.dir onto NAS (e.g. NetApp)
3. Install, configure DRBD + Heartbeat packages
   i. yum -y install drbd82 kmod-drbd82 heartbeat
   ii. chkconfig --add heartbeat (both machines)
   iii. edit /etc/drbd.conf on 2 machines
   iv-xxxix. make raid-0 md, ask drbd to manage raid md, zero it if drbd dies & try again
   xl. mkfs ext3 on it, mount /hadoop (both machines)
   xli. install all rpms in /hadoop, but don't run them yet (chkconfig off)
   xlii. umount /hadoop (!!)
   xliii. edit 3 files /etc/ha.d/* to configure heartbeat
   ...
40. Restart cluster. If any problems, start at /var/log/ha.log for hints on what went wrong.
*As described in www.cloudera.com/blog/2009/07/hadoop-ha-configuration (author: Christophe Bisciglia, Cloudera).
17. Step Back & Rethink Problem
Big disruption in the hardware landscape:

                  Year 2000   Year 2012
# cores per box   2           128
DRAM per box      4GB         512GB
# disks per box   250+        12
Disk capacity     18GB        6TB
Cluster size      2-10        10,000

• No spin-locks/mutexes; 10,000+ threads
• Minimal footprint: preserve resources for the app
• Rapid re-replication; scale to several exabytes
18. MapR's Programming Model
Written in C++ and asynchronous throughout:
    ioMgr->read(…, callbackFunc, void *arg)
• Each module runs requests from its request-queue
• One OS thread per cpu-core
• Dispatch: map container -> queue -> cpu-core
• Callback guaranteed to be invoked on the same core, so no mutexes are needed
• When load increases, add a cpu-core and move some queues to it
• State machines on each queue:
  – 'thread stack' is 4K; 10,000+ threads cost ~40M
  – context-switch is 3 instructions; 250K context-switches/core/sec is OK!
19. MapR on Linux
• User-space process: avoids system crashes
• Minimal footprint: preserves cpu, memory & resources for the app
  – uses only 1/5th of system memory
  – runs on 1 or 2 cores; others left for the app
• Emphasis on efficiency: avoids lots of layering
  – raw devices, direct-IO; doesn't use the Linux VM
• CPU/memory firewalls implemented: runaway tasks no longer impact system processes
20. Random Writing in MapR
• A client writing data asks for a 64M block
• The container location map creates a container (or attaches to an existing one), picking a master and 2 replica slaves (e.g. master S2, replicas S4 and S5)
• The client then writes each next chunk directly to the master
21. MapR's Distributed NameNode
Distributed transactions stitch containers together.
• Each node uses a write-ahead log
• Supports both value-logging and operational-logging
  – value log record = { disk-offset, old, new }
  – op log record = { op-details, undo-op, redo-op }
• Recovery in 2 seconds
• 'Global ids' enable participation in distributed transactions
22. 2-Phase Commit Unsuitable
• BeginTrans .. work .. Commit (C = coordinator, P = participant)
On app-commit:
• C force-writes its log, then sends prepare to each P
• P sends prepare-ack and gives up its right to abort
• P waits for C, even across crashes/reboots
• P unlocks only when commit is received from C
Problems:
• Too many message exchanges
• A single failure can lock up the entire cluster
23. Quorum-completion Unsuitable
• BeginTrans .. work .. Commit (C = coordinator, P = participant)
On app-commit:
• C broadcasts prepare
• If a majority responds, C commits; if not, the cluster goes into election mode
• If no majority is found, everything fails
Problems:
• Update throughput is very poor
• Does not work with < N/2 nodes
• Monolithic. Hierarchical? Cycles? Oh no!!
24. MapR Lockless Transactions
• BeginTrans + work + Commit
• No explicit commit; uses rollback instead
• Confirm callback is piggy-backed on other messages; undo on confirmed failure
• Any replica can confirm
• Update throughput very high
• No locks held across messages
• Crash resistant; cycles OK
Patent pending.
25. Small Files (Apache Hadoop, 10 nodes)
Op: create file, write 100 bytes, close.
[Chart: file-create rate (files/sec) vs. # of files (millions), out-of-box and tuned configurations]
Notes:
- NN not replicated
- NN uses 20G DRAM
- DN uses 2G DRAM
26. MapR Distributed NameNode
Same 10 nodes, but with 3x replication added …
[Chart: create rate for 100-byte files (files/sec) vs. # of files (millions); a "Test stopped here" annotation marks the end of the run]
27. MapR's Data Integrity
• End-to-end check-sums on all data (not optional)
  – computed in the client's memory, written to disk at the server
  – on read, validated at both client & server
• RPC packets have their own independent check-sum, detecting RPC msg corruption
• Transactional with ACID semantics
  – metadata, including the log itself, is check-summed
  – allocation bitmaps are written to two places (dual blocks)
• Automatic compression built in
28. MapR’s Random-Write Eases Data Import
With MapR, use NFS:
1. mount /mapr – real-time, HA
Otherwise, use Flume/Scribe:
1. Set up sinks (find unused machines??)
2. Set up intrusive agents
   i. tail("xxx"), tailDir("y")
   ii. agentBESink
3. All reliability levels lose data: best-effort, one-shot, disk fail-over, end-to-end
4. Data not available now
29. MapR's Streaming Performance
[Charts: sequential read/write throughput in MB/sec on 11 x 7200rpm SATA and 11 x 15Krpm SAS disks; raw hardware vs. MapR vs. Hadoop; higher is better]
Tests: i. 16 streams x 120GB; ii. 2000 streams x 1GB
30. HBase on MapR
YCSB Insert with 1 billion 1K records
10+1 node cluster: 8 cores, 24GB DRAM, 11 x 1TB 7200 RPM disks per node
[Chart: insert throughput in thousands of records/sec, MapR vs. Apache, with WAL off and WAL on; higher is better]
31. HBase on MapR
YCSB Random Read with 1 billion 1K records
10+1 node cluster: 8 cores, 24GB DRAM, 11 x 1TB 7200 RPM disks per node
[Chart: read throughput in records/sec, MapR vs. Apache, for Zipfian and uniform key distributions; higher is better]
32. Terasort on MapR
10+1 nodes: 8 cores, 24GB DRAM, 11 x 1TB SATA 7200 rpm disks per node
[Chart: elapsed time in minutes for 1.0TB and 3.5TB sorts, MapR vs. Hadoop; lower is better]
33. PigMix on MapR
[Chart: PigMix run time in seconds, MapR vs. Hadoop; lower is better]
34. Summary
• Fully HA: JobTracker, snapshots, mirrors; multi-cluster capable
• Super simple to manage
• NFS mountable, with complete read/write semantics; file contents visible immediately
• MapR has signed the Apache CCLA
  – Zookeeper, Mahout, YCSB, HBase fixes contributed
  – Continuing to contribute more and more
Download it at www.mapr.com