1. Design, Scale & Performance of the MapR Distribution
M.C. Srivas
CTO, MapR Technologies, Inc.
6/29/2011 © MapR Technologies, Inc. 1
2. Outline of Talk
• What does MapR do?
• Motivation: why build this?
• Distributed NameNode Architecture
• Scalability factors
• Programming model
• Distributed transactions in MapR
• Performance across a variety of loads
3. Complete Distribution
• Integrated, tested, hardened
• Super simple
• Unique advanced features
• 100% compatible with MapReduce, HBase, HDFS APIs
• No recompile required; drop in and use now
4. MapR Areas of Development
• MapReduce
• HBase
• Ecosystem
• Storage
• Management Services
5. JIRAs Open For Year(s)
• HDFS-347 – 7/Dec/08 - Streaming perf sub-optimal
• HDFS-273, 395 – 7/Mar/07 – DFS scalability problems, optimize block-reports
• HDFS-222, 950 – Concatenate files into larger files
• Tom White on 2/Jan/09: "Small files are a big problem for Hadoop ... 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible."
• HDFS Append – no 'blessed' Apache Hadoop distro has the fix
• HDFS-233 – 25/Jun/08 – Snapshot support
• Dhruba Borthakur on 10/Feb/09: "...snapshots can be designed very elegantly only if there is complete separation between namespace management and block management."
6. Observations on Apache Hadoop
• Inefficient (HDFS-347)
• Scaling problems (HDFS-273)
• NameNode bottleneck (HDFS-395)
• Limited number of files (HDFS-222)
• Admin overhead significant
• NameNode failure loses data
• Not trusted as permanent store
• Write-once: data lost unless file closed
• hflush/hsync – unrealistic to expect folks to re-write apps
[Chart: read/write throughput in MB/sec, raw hardware vs. HDFS]
7. MapR Approach
• Some are architectural issues
• Change at that level is a big deal
– Will not be accepted unless proven
– Hard to prove without building it first
• Build it and prove it
– Improve reliability significantly
– Make it tremendously faster at the same time
– Enable new class of apps (e.g., real-time analytics)
8. HDFS Architecture Review
Files are sharded into blocks, distributed across DataNodes.
NameNode (NN) holds in memory:
• Directories and files
• Block replica locations
DataNodes (DN):
• Serve and store blocks
• Have no idea about files/dirs
All metadata ops go to the NN.
9. HDFS Architecture Review
Each DataNode (DN) reports its blocks to the NameNode (NN).
• A large DN sends ~60K blocks per report (256M x 60K = 15T = 5 disks @ 3T each)
• Beyond 100K blocks per report, load becomes extreme
• Restarting a 40GB NN takes 1-2 hours
The addressing unit is an individual block:
• A flat block-address space forces DNs to send giant block-reports
• The NN can hold about ~300M blocks max, limiting cluster size to tens of petabytes
• Increasing the block size to compensate negatively impacts map/reduce
10. How to Scale
• Central meta server does not scale
– Make every server a meta-data server too
– But need memory for map/reduce
• Must page meta-data to disk
• Reduce size of block-reports
– while increasing number of blocks per DN
• Reduce memory footprint of location service
– cannot add memory indefinitely
• Need fast-restart (HA)
11. MapR Goal: Scale to 1000X
             HDFS          MapR
# files      150 million   1 trillion
# data       10-50 PB      1-10 exabytes
# nodes      2,000         10,000+
• Full random read/write semantics: export via NFS and other protocols
• Enterprise-class reliability: instant restart, snapshots, mirrors, no single point of failure, …
• Run close to hardware speeds: at extreme scale, efficiency matters extremely; exploit emerging technology like SSDs and 10GbE
12. MapR's Distributed NameNode
Files/directories are sharded into blocks, which are placed into mini-NameNodes (containers) on disks.
• Each container holds directories & files and data blocks
• Containers are 16-32GB segments of disk, placed on nodes
• Containers are replicated across servers
• No need to manage containers directly; use MapR Volumes
Patent pending.
13. MapR Volumes
Significant advantages over "cluster-wide" or "file-level" approaches: volumes allow management attributes to be applied in a scalable way, at a very granular level, and with flexibility.
Example volume tree: /projects/tahoe, /projects/yosemite, /user/msmith, /user/bjohnson
Per-volume attributes:
• Replication factor
• Scheduled mirroring
• Scheduled snapshots
• Data placement control
• Usage tracking
• Administrative permissions
100K volumes are OK; create as many as desired!
14. MapR Distributed NameNode
Containers are tracked globally.
• A container location map records, for each container, the servers holding its replicas (e.g. S1, S2, S4)
• Clients cache containers & server info for extended periods
• A client fetches container locations from the map, then contacts a server directly to read data from the container
15. MapR's Distr NameNode Scaling
• Containers represent 16-32GB of data; each can hold up to 1 billion files and directories
• 100M containers = ~2 exabytes (a very large cluster)
• 250 bytes of DRAM to cache a container: 25GB caches all containers for a 2EB cluster
• But that is not necessary; the map can page to disk, and a typical large 10PB cluster needs only 2GB
• Container-reports are 100x - 1000x smaller than HDFS block-reports, so the map can serve 100x more data-nodes
• Increase container size to 64GB to serve a 4EB cluster; map/reduce is not affected
16. MapR Distr NameNode HA
MapR:
1. apt-get install mapr-cldb while the cluster is online
Apache Hadoop*:
1. Stop cluster very carefully
2. Move fs.checkpoint.dir onto NAS (e.g. NetApp)
3. Install, configure DRBD + Heartbeat packages
   i. yum -y install drbd82 kmod-drbd82 heartbeat
   ii. chkconfig --add heartbeat (both machines)
   iii. edit /etc/drbd.conf on 2 machines
   iv-xxxix. make raid-0 md, ask drbd to manage raid md, zero it if drbd dies & try again
   xl. mkfs ext3 on it, mount /hadoop (both machines)
   xli. install all rpms in /hadoop, but don't run them yet (chkconfig off)
   xlii. umount /hadoop (!!)
   xliii. edit 3 files /etc/ha.d/* to configure heartbeat
   ...
40. Restart cluster. If any problems, start at /var/log/ha.log for hints on what went wrong.
*As described in www.cloudera.com/blog/2009/07/hadoop-ha-configuration (author: Christophe Bisciglia, Cloudera).
17. Step Back & Rethink Problem
Big disruption in the hardware landscape:

                  Year 2000   Year 2012
# cores per box   2           128
DRAM per box      4GB         512GB
# disks per box   250+        12
Disk capacity     18GB        6TB
Cluster size      2-10        10,000

• No spin-locks/mutexes; 10,000+ threads
• Minimal footprint: preserve resources for the app
• Rapid re-replication; scale to several exabytes
18. MapR's Programming Model
Written in C++ and asynchronous throughout:
    ioMgr->read(…, callbackFunc, void *arg)
• Each module runs requests from its request-queue
• One OS thread per cpu-core
• Dispatch: map container -> queue -> cpu-core
• Callback guaranteed to be invoked on the same core, so no mutexes are needed
• When load increases, add a cpu-core and move some queues to it
• State machines on each queue:
  – 'thread stack' is 4K; 10,000+ threads cost ~40M
  – context-switch is 3 instructions; 250K context-switches/core/sec is OK!
19. MapR on Linux
• User-space process: avoids system crashes
• Minimal footprint: preserves cpu, memory & resources for the app
  – uses only 1/5th of system memory
  – runs on 1 or 2 cores; others left for the app
• Emphasis on efficiency: avoids lots of layering
  – raw devices, direct-IO; doesn't use the Linux VM
• CPU/memory firewalls implemented: runaway tasks no longer impact system processes
20. Random Writing in MapR
• A client writing data asks for a 64M block
• The container location map creates a container (or attaches to an existing one), picking a master and 2 replica slaves (e.g. master S2, replicas S4 and S5)
• The client then writes each next chunk directly to the master
21. MapR's Distributed NameNode
Distributed transactions stitch containers together.
• Each node uses a write-ahead log
• Supports both value-logging and operational-logging
  – value log record = { disk-offset, old, new }
  – op log record = { op-details, undo-op, redo-op }
• Recovery in 2 seconds
• 'Global ids' enable participation in distributed transactions
22. 2-Phase Commit Unsuitable
• BeginTrans .. work .. Commit (C = coordinator, P = participant)
On app-commit:
• C force-writes its log, then sends prepare to each P
• P sends prepare-ack and gives up its right to abort
• P waits for C, even across crashes/reboots
• P unlocks only when commit is received from C
Problems:
• Too many message exchanges
• A single failure can lock up the entire cluster
23. Quorum-completion Unsuitable
• BeginTrans .. work .. Commit (C = coordinator, P = participant)
On app-commit:
• C broadcasts prepare
• If a majority responds, C commits; if not, the cluster goes into election mode
• If no majority is found, everything fails
Problems:
• Update throughput is very poor
• Does not work with < N/2 nodes
• Monolithic. Hierarchical? Cycles? Oh no!!
24. MapR Lockless Transactions
• BeginTrans + work + Commit
• No explicit commit; uses rollback instead
• Confirm callback is piggy-backed on other messages; undo on confirmed failure
• Any replica can confirm
• Update throughput very high
• No locks held across messages
• Crash resistant; cycles OK
Patent pending.
25. Small Files (Apache Hadoop, 10 nodes)
Op: create file, write 100 bytes, close.
[Chart: file-create rate (files/sec) vs. # of files (millions), out-of-box and tuned configurations]
Notes:
- NN not replicated
- NN uses 20G DRAM
- DN uses 2G DRAM
26. MapR Distributed NameNode
Same 10 nodes, but with 3x replication added …
[Chart: create rate for 100-byte files (files/sec) vs. # of files (millions); a "Test stopped here" annotation marks the end of the run]
27. MapR's Data Integrity
• End-to-end check-sums on all data (not optional)
  – computed in the client's memory, written to disk at the server
  – on read, validated at both client & server
• RPC packets have their own independent check-sum, detecting RPC msg corruption
• Transactional with ACID semantics
  – metadata, including the log itself, is check-summed
  – allocation bitmaps are written to two places (dual blocks)
• Automatic compression built in
28. MapR’s Random-Write Eases Data Import
With MapR, use NFS:
1. mount /mapr – real-time, HA
Otherwise, use Flume/Scribe:
1. Set up sinks (find unused machines??)
2. Set up intrusive agents
   i. tail("xxx"), tailDir("y")
   ii. agentBESink
3. All reliability levels lose data: best-effort, one-shot, disk fail-over, end-to-end
4. Data not available now
29. MapR's Streaming Performance
[Charts: sequential read/write throughput in MB/sec on 11 x 7200rpm SATA and 11 x 15Krpm SAS disks; raw hardware vs. MapR vs. Hadoop; higher is better]
Tests: i. 16 streams x 120GB; ii. 2000 streams x 1GB
30. HBase on MapR
YCSB Insert with 1 billion 1K records
10+1 node cluster: 8 cores, 24GB DRAM, 11 x 1TB 7200 RPM disks per node
[Chart: insert throughput in thousands of records/sec, MapR vs. Apache, with WAL off and WAL on; higher is better]
31. HBase on MapR
YCSB Random Read with 1 billion 1K records
10+1 node cluster: 8 cores, 24GB DRAM, 11 x 1TB 7200 RPM disks per node
[Chart: read throughput in records/sec, MapR vs. Apache, for Zipfian and uniform key distributions; higher is better]
32. Terasort on MapR
10+1 nodes: 8 cores, 24GB DRAM, 11 x 1TB SATA 7200 rpm disks per node
[Chart: elapsed time in minutes for 1.0TB and 3.5TB sorts, MapR vs. Hadoop; lower is better]
33. PigMix on MapR
[Chart: PigMix run time in seconds, MapR vs. Hadoop; lower is better]
34. Summary
• Fully HA: JobTracker, snapshots, mirrors; multi-cluster capable
• Super simple to manage
• NFS mountable, with complete read/write semantics; file contents visible immediately
• MapR has signed the Apache CCLA
  – Zookeeper, Mahout, YCSB, HBase fixes contributed
  – Continuing to contribute more and more
Download it at www.mapr.com