Inside MapR's M7

Slide 2
Me, Us
Ted Dunning, Chief Application Architect, MapR
Committer and PMC member: Mahout, ZooKeeper, Drill
Bought the beer at the first HUG
MapR
Distributes more open source components for Hadoop
Adds major technology for performance, HA, and industry standard APIs
Tonight
Hashtags - #mapr #fast
See also - @ApacheMahout @ApacheDrill
@ted_dunning and @mapR
Slide 3
MapR does MapReduce (fast)
TeraSort Record
1 TB in 54 seconds
1003 nodes
MinuteSort Record
1.5 TB in 59 seconds
2103 nodes
Slide 4

MapR: Lights Out Data Center Ready

Reliable Compute
• Automated stateful failover
• Automated re-replication
• Self-healing from HW and SW failures
• Load balancing
• Rolling upgrades
• No lost jobs or data
• Five nines (99.999%) of uptime

Dependable Storage
• Business continuity with snapshots and mirrors
• Recover to a point in time
• End-to-end checksumming
• Strong consistency
• Built-in compression
• Mirror between two sites by RTO policy
Slide 6
Part 1:
What’s past is prologue
HBase is really good
except when it isn’t
but it has a heart of gold
Slide 13

HBase Table Architecture

Tables are divided into key ranges (regions)
Regions are served by nodes (RegionServers)
Columns are divided into access groups (column families)

[Diagram: a table grid with column families CF1-CF5 across the top and regions R1-R4 down the side]
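As a concrete illustration, here is a minimal sketch using the classic (0.94-era) HBase Java client; the table name, family names, and split points are made up. The table is declared with its column families up front and pre-split into key ranges so that several regions serve it from the start:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTableExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // One table with two access groups: each column family is stored
    // and compacted separately within every region.
    HTableDescriptor desc = new HTableDescriptor("mytable");
    desc.addFamily(new HColumnDescriptor("cf1"));
    desc.addFamily(new HColumnDescriptor("cf2"));

    // Pre-split into key ranges, so four regions serve the table
    // from the start: (-inf,g), [g,n), [n,t), [t,+inf).
    byte[][] splits = { "g".getBytes(), "n".getBytes(), "t".getBytes() };
    admin.createTable(desc, splits);
  }
}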
Slide 14

HBase Architecture is Better

Strong consistency model
– when a write returns, all readers will see the same value
– "eventually consistent" is often "eventually inconsistent"
Scan works
– does not broadcast
– ring-based NoSQL databases (e.g., Cassandra, Riak) suffer on scans
Scales automatically
– splits when regions become too large
– uses HDFS to spread data, manage space
Integrated with Hadoop
– map-reduce on HBase is straightforward
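For example, a bounded range scan with the stock HBase client (a sketch; the table and row keys are hypothetical) touches only the regions whose key ranges overlap the request, instead of broadcasting to every node:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeScanExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");

    // A bounded scan: the client walks only the regions whose key
    // ranges overlap [row100, row200), rather than querying every
    // node as ring-partitioned stores must.
    Scan scan = new Scan(Bytes.toBytes("row100"), Bytes.toBytes("row200"));
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()));
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}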
Slide 15

But ... how well do you know HBCK?

a.k.a. HBase Recovery

HBASE-5843: Improve HBase MTTR – Mean Time To Recover
HBASE-6401: HBase may lose edits after a crash on Hadoop 1.0.3
– which uses appends
HBASE-3809: .META. may not come back online if ….
etc.
About 40-50 JIRAs on this topic

Very complex algorithm to assign a region
– and still does not get it right on reboot
Slide 16

HBase Issues

Reliability
• Compactions disrupt operations
• Very slow crash recovery
• Unreliable splitting

Business continuity
• Common hardware/software issues cause downtime
• Administration requires downtime
• No point-in-time recovery
• Complex backup process

Performance
• Many bottlenecks result in low throughput
• Limited data locality
• Limited # of tables

Manageability
• Compactions, splits and merges must be done manually (in reality)
• Basic operations like backup or table rename are complex
Slide 17

Examples: Performance Issues

Limited support for multiple column families: HBase has issues handling multiple column families due to compactions. The standard HBase documentation recommends no more than 2-3 column families. (HBASE-3149)

Limited data locality: HBase does not take block locations into account when assigning regions. After a reboot, RegionServers are often reading data over the network rather than from local drives. (HBASE-4755, HBASE-4491)

Cannot utilize disk space: HBase RegionServers struggle with more than 50-150 regions per RegionServer, so a commodity server can only handle about 1 TB of HBase data, wasting disk space. (http://hbase.apache.org/book/important_configurations.html, http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/)

Limited # of tables: A single cluster can only handle several tens of tables effectively. (http://hbase.apache.org/book/important_configurations.html)
Slide 18

Examples: Manageability Issues

Manual major compactions: HBase major compactions are disruptive, so production clusters keep them disabled and rely on the administrator to trigger compactions manually. (http://hbase.apache.org/book.html#compaction)

Manual splitting: HBase auto-splitting does not work properly in a busy cluster, so users must pre-split a table based on their estimate of data size/growth. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)

Manual merging: HBase does not automatically merge regions that are too small. The administrator must take down the cluster and trigger the merges manually.

Basic administration is complex: Renaming a table requires copying all the data. Backing up a cluster is a complex process. (HBASE-643)
Slide 19

Examples: Reliability Issues

Compactions disrupt HBase operations: I/O bursts overwhelm nodes. (http://hbase.apache.org/book.html#compaction)

Very slow crash recovery: A RegionServer crash can cause data to be unavailable for up to 30 minutes while WALs are replayed for impacted regions. (HBASE-1111)

Unreliable splitting: Region splitting may cause data to be inconsistent and unavailable. (http://chilinglam.blogspot.com/2011/12/my-experience-with-hbase-dynamic.html)

No client throttling: The HBase client can easily overwhelm RegionServers and cause downtime. (HBASE-5161, HBASE-5162)
Slide 20
One Issue – Crash Recovery Too Slow
HBASE-1111 superseded by HBASE-5843 which is blocked by
HDFS-3912 HBASE-6736 HBASE-6970 HBASE-7989 HBASE-6315
HBASE-7815 HBASE-6737 HBASE-6738 HBASE-7271 HBASE-7590
HBASE-7756 HBASE-8204 HBASE-5992 HBASE-6156 HBASE-6878
HBASE-6364 HBASE-6713 HBASE-5902 HBASE-4755 HBASE-7006
HDFS-2576 HBASE-6309 HBASE-6751 HBASE-6752 HBASE-6772
HBASE-6773 HBASE-6774 HBASE-7246 HBASE-7334 HBASE-5859
HBASE-6058 HBASE-6290 HBASE-7213 HBASE-5844 HBASE-5924
HBASE-6435 HBASE-6783 HBASE-7247 HBASE-7327 HDFS-4721
HBASE-5877 HBASE-5926 HBASE-5939 HBASE-5998 HBASE-6109
HBASE-6870 HBASE-5930 HDFS-4754 HDFS-3705
Slide 22
RegionServers are problematic
Coordinating 3 separate distributed systems is very hard
– HBase, HDFS, ZK
– Each of these systems has multiple internal systems
– Too many races, too many undefined properties
Distributed transaction framework not available
– Too many failures to deal with
Java GC wipes out the RS from time to time
– Cannot use -Xmx20g for a RS
Hence all the bugs
– HBCK is your "friend"
Slide 24

HDFS Architecture Review

Files are broken into blocks
– distributed across DataNodes
NameNode holds (in DRAM)
– directories, files
– block replica locations
DataNodes
– serve blocks
– no idea about files/dirs
All ops go to NN

[Diagram: files sharded into blocks, with DataNodes storing the blocks]
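To see this division of labor from client code, here is a minimal sketch using the standard Hadoop FileSystem API (the path is hypothetical): both calls are metadata operations answered by the NameNode, and the returned locations name the DataNodes that hold each block.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Both calls below are metadata operations answered by the
    // NameNode; no DataNode is contacted until blocks are read.
    FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation b : blocks) {
      System.out.println("offset " + b.getOffset()
          + " length " + b.getLength()
          + " hosts " + String.join(",", b.getHosts()));
    }
    fs.close();
  }
}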
Slide 25

A File at the NameNode

NameNode holds in-memory
– dir hierarchy ("names")
– file attrs ("inode")
– composite file structure: an array of block-ids

A 1-byte file in HDFS costs
– 1 HDFS "block" on 3 DNs
– 3 entries in the NN totaling 1K of DRAM
Slide 26

NN Scalability Problems

DN reports blocks to NN
– 128 MB blocks
– 12 TB of disk => each DN sends ~100K blocks/report
– the RPC on the wire is 4 MB
– causes extreme load at both the DN and the NN
With NN-HA, DNs send dual block-reports
– one to primary, one to secondary
– doubles the load on the DN
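Spelling out the slide's arithmetic (block size, disk capacity, and RPC size are the figures above; the bytes-per-entry number is derived from them):

    12 TB / 128 MB per block ≈ 98,000 blocks per report, i.e. roughly 100K
    4 MB RPC / 100K block entries ≈ 40 bytes per block entry

Every DataNode periodically ships a multi-megabyte report, and the NameNode must absorb one from every node in the cluster.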
Slide 27

Scaling Parameters

Unit of I/O
– 4K/8K (8K in MapR)
Unit of chunking (a map-reduce split)
– 10-100's of megabytes
Unit of resync (a replica)
– 10-100's of gigabytes
– container in MapR
Unit of administration (snap, repl, mirror, quota, backup)
– 1 gigabyte - 1000's of terabytes
– volume in MapR
– what data is affected by my missing blocks?

[Diagram: scale ladder from i/o (10^3) to map-red (10^6) to resync (10^9) to admin; the HDFS 'block' is a single abstraction stretched across these scales]
Slide 28

MapR's No-NameNode Architecture

HDFS Federation
• Multiple single points of failure
• Limited to 50-200 million files
• Performance bottleneck
• Commercial NAS required

MapR (distributed metadata)
• HA w/ automatic failover
• Instant cluster restart
• Up to 1 trillion files
• 20x higher performance
• 100% commodity hardware

[Diagram: federated NameNodes, each owning a slice of the namespace and backed by a NAS appliance, vs. MapR metadata spread across ordinary DataNodes]
Slide 29

MapR's Distributed NameNode

Files/directories are sharded into blocks, which are placed into mini NNs (containers) on disks
– containers are 16-32 GB segments of disk, placed on nodes
Each container contains
– directories & files
– data blocks
Replicated on servers
Millions of containers in a typical cluster

Patent Pending
Slide 30

M7 Containers

Container holds many files
– regular, dir, symlink, btree, chunk-map, region-map, …
– all random-write capable
– each can hold 100's of millions of files
Container is replicated to servers
– unit of resynchronization
Region lives entirely inside 1 container
– all files + WALs + btrees + bloom filters + range maps
Slide 31

Read-write Replication

Writes are synchronous
– all copies have the same data
Data is replicated in a "chain" fashion
– better bandwidth, utilizes full-duplex network links well
Meta-data is replicated in a "star" manner
– better response time, bandwidth not a concern
– data can also be done this way

[Diagram: clients writing through chain and star replication topologies]
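Below is a toy model of the chain idea (all names hypothetical, not MapR code): the ack unwinds only after the tail has written, which yields the synchronous, all-copies-identical behavior above, and each hop can forward downstream while still receiving from upstream, exploiting full-duplex links.

import java.util.ArrayList;
import java.util.List;

// A toy model of chain replication. Each replica applies the write
// locally, then forwards it to the next replica in the chain; the ack
// unwinds only after the tail has written, so when the client's call
// returns, every copy holds the same data.
class Replica {
  private final Replica next;                 // null at the tail
  private final List<byte[]> log = new ArrayList<>();

  Replica(Replica next) {
    this.next = next;
  }

  // Returns only once this replica and everything downstream has written.
  void write(byte[] data) {
    log.add(data);                            // durable write in a real system
    if (next != null) {
      // In a real pipeline, forwarding this packet overlaps with
      // receiving the following one, so each full-duplex link carries
      // traffic in both directions at once.
      next.write(data);
    }
  }
}

public class ChainReplicationDemo {
  public static void main(String[] args) {
    Replica tail = new Replica(null);
    Replica middle = new Replica(tail);
    Replica head = new Replica(middle);
    head.write("hello".getBytes());           // acked by the whole chain
    System.out.println("write acknowledged by all three replicas");
  }
}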
Slide 33

Failure Handling

Containers are managed at the CLDB, the Container Location DataBase (HB, container-reports)

HB loss + upstream entity reports failure => server dead
– increment epoch at CLDB
– rearrange replication

Exact same code for files and M7 tables
No ZK needed at this level
Slide 34

Benchmark: File creates (100B)

Hardware: 10 nodes, 2 x 4 cores, 24 GB RAM, 12 x 1 TB 7200 RPM
Same 10 nodes, but with 3X replication

[Charts: file creates/s vs. number of files; the MapR distribution sustains its rate out to billions of files, while the other distribution peaks near 350 creates/s and stalls near 1.3M files]

                  MapR      Other      Advantage
Rate (creates/s)  14-16K    335-360    40x
Scale (files)     6B        1.3M       4615x
Slide 35

Recap

HBase has a good basis
– but is handicapped by HDFS
– but can't do without HDFS
– HBase can't be fixed in isolation
Separating key storage scaling parameters is key
– allows an additional layer of storage indirection
– results in huge scaling and performance improvements
Low-level transactions are hard
– but allow a R/W file system and decentralized meta-data
– also allow non-file implementations
Slide 37
An Outline of Important Factors
Start with MapR FS (mutability, transactions, real snapshots)
C++ not Java (data never moves, better control)
Lockless design, custom queue executive (3 ns switch)
New RPC layer (> 1 M RPC / s)
Cut out the middle man (single hop to data)
Hybridize log-structured merge trees and B-trees
Adjust sizes and fanouts
Don’t be silly
Slide 38
An Outline of Important Factors
Start with MapR FS (mutability, transactions, real snapshots)
C++ not Java (data never moves, better control)
Lockless design, custom queue executive (3 ns switch)
New RPC layer (> 1 M RPC / s)
Cut out the middle man (single hop to data)
Hybridize log-structured merge trees and B-trees
Adjust sizes and fanouts
Don’t be silly
We get these all for free by putting tables into MapR FS
Slide 39

M7: Tables Integrated into Storage

No extra daemons to manage
One hop to data
Superior caching policies
No JVM problems
Slide 41

Why Not Java?

Disclaimer: I am a pro-Java bigot
But that only goes so far …
Consider the memory size of
    struct {x, y}[] a;
Consider also interpreting data as it arrives from the wire
Consider the problem of writing a micro-stack queue executive with hundreds of thousands of threads and a 3 ns context switch
Consider the problem of a core-locked process running a cache-aware, lock-free, zero-copy queue of tasks
Consider the GC-free lifestyle
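To make the struct question concrete: in C or C++, an array of struct {int x, y;} is one contiguous block of 8 bytes per element. The Java sketch below shows the contrast; the overhead figures in the comments are typical 64-bit HotSpot numbers, not guarantees.

// Illustrates the "memory size of struct {x, y}[] a" question.
public class StructLayout {
  static final class Point {
    int x, y;
  }

  public static void main(String[] args) {
    int n = 1_000_000;

    // Idiomatic Java: an array of references. Each slot holds a 4-8 byte
    // pointer to a separately allocated heap object carrying a 12-16 byte
    // header plus the 8 bytes of payload: roughly 3-4x the C++ footprint,
    // and the objects may be scattered, so a linear pass chases pointers.
    Point[] boxed = new Point[n];
    for (int i = 0; i < n; i++) {
      boxed[i] = new Point();
    }

    // The workaround C++ does not need: structure-of-arrays. Two flat
    // primitive arrays, contiguous and header-free per element, but the
    // programmer must keep the pair in sync by hand.
    int[] xs = new int[n];
    int[] ys = new int[n];

    System.out.println("allocated " + boxed.length + " boxed points and "
        + xs.length + "/" + ys.length + " flat coordinates");
  }
}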
Slide 42
At What Cost
But writing performant C++ is hard
Managing low-level threads is hard
Implementing very fast failure recovery is hard
Doing manual memory allocation is hard (and dangerous)
Benefits outweigh costs with the right dev team
Benefits dwarfed by the costs with the wrong dev team
Slide 44

M7 Table Architecture

[Diagram: a table is split into tablets, each tablet into partitions, and each partition into segments]
Slide 45

M7 Table Architecture

[Same diagram: table -> tablet -> partition -> segment]

This structure is internal and not user-visible
Slide 46
Multi-level Design
Fixed number of levels like HBase
Specialized fanout to match sizes to device physics
Mutable file system allows chimeric LSM-tree / B-tree
Sized to match container structure
Guaranteed locality
– If the data moves, the new node will handle it
– If the node fails, the new node will handle it
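Here is a toy sketch of the LSM-flush idea behind such a hybrid (an illustration only, not MapR's actual on-disk format): recent writes collect in a sorted in-memory buffer, and because the underlying store is mutable, a flush can merge into the persistent tree in place instead of rewriting immutable files the way HBase must.

import java.util.TreeMap;

// A toy two-level store in the LSM spirit. Recent writes land in a
// sorted in-memory buffer; when the buffer fills, it is merged into a
// persistent sorted structure.
public class TwoLevelStore {
  private final TreeMap<String, String> memBuffer = new TreeMap<>();
  private final TreeMap<String, String> diskTree = new TreeMap<>(); // stand-in for an on-disk B-tree
  private final int flushThreshold;

  public TwoLevelStore(int flushThreshold) {
    this.flushThreshold = flushThreshold;
  }

  public void put(String key, String value) {
    memBuffer.put(key, value);
    if (memBuffer.size() >= flushThreshold) {
      flush();
    }
  }

  public String get(String key) {
    // Check the newest level first, then the on-disk tree.
    String v = memBuffer.get(key);
    return v != null ? v : diskTree.get(key);
  }

  private void flush() {
    // In-place merge: a mutable store updates the existing tree rather
    // than rewriting whole immutable files as an LSM on HDFS must.
    diskTree.putAll(memBuffer);
    memBuffer.clear();
  }

  public static void main(String[] args) {
    TwoLevelStore store = new TwoLevelStore(2);
    store.put("a", "1");
    store.put("b", "2");   // hits the threshold and triggers a flush
    store.put("c", "3");
    System.out.println(store.get("a") + store.get("b") + store.get("c"));
  }
}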
Slide 48
RPC Reimplementation
At very high data rates, protobuf is too slow
– Not good as an envelope, still a great schema definition language
– Most systems never hit this limit
Alternative 1
– Lazy parsing allows deferral of content parsing
– Naïve implementation imposes (yet another) extra copy
Alternative 2
– Bespoke parsing of envelope from the wire
– Content packages can land fully aligned and ready for battle directly from
the wire
Let’s use BOTH ideas
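A sketch of both ideas combined (the envelope format is invented for illustration): parse a tiny bespoke envelope straight off the wire, keep the content as an unparsed zero-copy slice, and defer any schema-driven (e.g., protobuf) parse until the payload is actually needed.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class LazyRpcFrame {
  final int methodId;
  private final ByteBuffer payload;   // unparsed slice, zero-copy view

  private LazyRpcFrame(int methodId, ByteBuffer payload) {
    this.methodId = methodId;
    this.payload = payload;
  }

  // Bespoke envelope parse: read two fixed-width fields, then slice
  // the payload without copying or interpreting it.
  static LazyRpcFrame fromWire(ByteBuffer wire) {
    int methodId = wire.getInt();
    int length = wire.getInt();
    ByteBuffer body = wire.slice();   // shares the underlying bytes
    body.limit(length);
    return new LazyRpcFrame(methodId, body);
  }

  // Lazy step: only called if this handler needs the content. A real
  // system would hand the slice to a schema-driven parser here.
  String decodePayload() {
    return StandardCharsets.UTF_8.decode(payload.duplicate()).toString();
  }

  public static void main(String[] args) {
    byte[] msg = "hello".getBytes(StandardCharsets.UTF_8);
    ByteBuffer wire = ByteBuffer.allocate(8 + msg.length);
    wire.putInt(42).putInt(msg.length).put(msg).flip();

    LazyRpcFrame frame = LazyRpcFrame.fromWire(wire);
    System.out.println("method " + frame.methodId
        + " payload \"" + frame.decodePayload() + "\"");
  }
}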
Slide 50

Don't Be Silly

Detailed review of the code revealed an extra copy
– it was subtle. Really.
Performance increased when the extra copy was eliminated
Not as easy to spot as it sounds
– but absolutely still worth finding and fixing
Slide 52

Server Reboot

Full container-reports are tiny
– CLDB needs 2 GB of DRAM for a 1000-node cluster
Volumes come online very fast
– each volume is independent of the others
– online as soon as the min-repl number of containers is ready
– no need to wait for the whole cluster
  (e.g., HDFS waits for 99.9% of blocks to report)
1000-node cluster restart < 5 mins
Slide 53

M7 provides Instant Recovery

0-40 microWALs per region
– idle WALs go to zero quickly, so most are empty
– a region is up before all of its microWALs are recovered
– microWALs are recovered in the background, in parallel
– when a key is accessed, its microWAL is recovered inline
– 1000-10000x faster recovery

Why doesn't HBase do this?
– M7 leverages unique MapR-FS capabilities and is not impacted by HDFS limitations
– no limit to the # of files on disk
– no limit to the # of open files
– the I/O path translates random writes to sequential writes on disk
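A sketch of that recovery policy (all class and method names hypothetical): at region open, every microWAL is queued for background replay; a read blocks only on the one microWAL that covers its key, replaying it inline if needed.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MicroWalRecovery {
  private final Map<Integer, CompletableFuture<Void>> replays =
      new ConcurrentHashMap<>();
  private final ExecutorService pool = Executors.newFixedThreadPool(4);

  // Called when the region opens: schedule all replays, don't wait.
  public void startBackgroundRecovery(List<Integer> microWalIds) {
    for (int id : microWalIds) {
      replays.computeIfAbsent(id,
          walId -> CompletableFuture.runAsync(() -> replay(walId), pool));
    }
  }

  // Called on the read path: block only on the single microWAL that
  // covers this key, starting an inline replay if none is running yet.
  public void ensureRecovered(byte[] key) {
    int id = microWalFor(key);
    replays.computeIfAbsent(id,
        walId -> CompletableFuture.runAsync(() -> replay(walId), pool))
        .join();
  }

  private int microWalFor(byte[] key) {
    return Math.floorMod(key[0], 40);   // toy mapping of key -> microWAL
  }

  private void replay(int walId) {
    // A real implementation would re-apply this WAL's edits here.
  }

  public static void main(String[] args) {
    MicroWalRecovery r = new MicroWalRecovery();
    r.startBackgroundRecovery(Arrays.asList(0, 1, 2, 3));
    r.ensureRecovered("row42".getBytes());   // waits on one microWAL only
    System.out.println("key readable before full recovery completes");
    r.pool.shutdown();
  }
}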
Slide 54
Other M7 Features
Smaller disk footprint
– M7 never repeats the key or column name
Columnar layout
– M7 supports 64 column families
– in-memory column-families
Online admin
– M7 schema changes on the fly
– delete/rename/redistribute tables
Slide 55

Binary Compatible

HBase applications work "as is" with M7
– no need to recompile (binary compatible)
Can run M7 and HBase side-by-side on the same cluster
– e.g., during a migration
– can access both an M7 table and an HBase table in the same program
Use the standard Apache HBase CopyTable tool to copy a table from HBase to M7 or vice versa, viz.,

% hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=/user/srivas/mytable oldtable
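Per this slide, M7 tables are addressed by their MapR-FS path while Apache HBase tables keep plain names, so one program can open both through the same client API. A sketch using the 0.94-era API and the table path from the CopyTable example above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SideBySideExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Apache HBase table, addressed by its plain table name.
    HTable hbaseTable = new HTable(conf, "oldtable");

    // M7 table, addressed by its path in MapR-FS (the path from the
    // CopyTable example above). Same client API, no recompile.
    HTable m7Table = new HTable(conf, "/user/srivas/mytable");

    Get get = new Get(Bytes.toBytes("row1"));
    Result fromHBase = hbaseTable.get(get);
    Result fromM7 = m7Table.get(get);
    System.out.println("hbase empty? " + fromHBase.isEmpty()
        + ", m7 empty? " + fromM7.isEmpty());

    hbaseTable.close();
    m7Table.close();
  }
}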
Slide 59

Recap

HBase has some excellent core ideas
– but is burdened by years of technical debt
– much of the debt was charged on the HDFS credit card
MapR FS provides an ideal substrate for an HBase-like service
– one hop from client to data
– many problems never arise in the first place
– other problems have relatively simple solutions on a better foundation
Practical results bear out the theory
Slide 60

Me, Us

Ted Dunning, Chief Application Architect, MapR
Committer and PMC member: Mahout, ZooKeeper, Drill
Bought the beer at the first HUG

MapR
Distributes more open source components for Hadoop
Adds major technology for performance and HA
Adds industry standard APIs

Tonight
Hashtags - #nosqlnow #mapr #fast
See also - @ApacheMahout @ApacheDrill
@ted_dunning and @mapR
Editor's notes: The NameNode in Hadoop today is a single point of failure, a scalability limitation, and a performance bottleneck. With MapR there is no dedicated NameNode; the NameNode function is distributed across the cluster. This provides major advantages in terms of HA, data-loss avoidance, scalability, and performance. With other distributions you have a bottleneck regardless of the number of nodes in the cluster, and the most files you can support is about 200M, and that only with an extremely high-end server. 50% of the Hadoop processing at Facebook goes to packing and unpacking files to work around this limitation. MapR scales uniformly.