In version 2.4, MongoDB introduces hash-based sharding, a new option for distributing data in sharded collections. Hash-based sharding and range-based sharding offer different advantages for MongoDB users deploying large-scale systems. In this talk, we'll provide an overview of this new feature and discuss when to use hash-based sharding and when to use range-based sharding.
5. What Is A Shard Key?
• Shard key is used to partition your collection
• Shard key must exist in every document
• Shard key is immutable
• Shard key values are immutable
• Shard key must be indexed
• Shard key is used to route requests to shards
11. Chunk Splitting
[Figure: sample documents {x: -9}, {x: -5}, {x: 0}, {x: 6}, {x: 7}, {x: 10}, with chunk boundaries marked at 0]
• A chunk is split once it exceeds the maximum size
• There is no split point if all documents have the same shard key
• Chunk split is a logical operation (no data is moved)
• If a split creates too large a discrepancy in chunk counts across the cluster, a balancing round starts
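The splitting rules above can be sketched in a few lines of plain JavaScript. This is only an illustration: `splitChunk` and `maxDocs` are hypothetical names, the real limit is a size in bytes rather than a document count, and choosing the median key as the split point is an assumption made for the sketch.

```javascript
// Split a chunk (modelled as an array of shard key values) once it grows
// past maxDocs. maxDocs stands in for the real limit, which is a byte size.
function splitChunk(keys, maxDocs) {
  if (keys.length <= maxDocs) return [keys]; // under the limit: nothing to do

  const sorted = [...keys].sort((a, b) => a - b);
  if (sorted[0] === sorted[sorted.length - 1]) {
    return [sorted]; // every document has the same shard key: no split point
  }

  // Assumption: split at the median key. If the median equals the smallest
  // key, move to the next larger key so the left chunk is never empty.
  let splitPoint = sorted[Math.floor(sorted.length / 2)];
  if (splitPoint === sorted[0]) {
    splitPoint = sorted.find((k) => k > sorted[0]);
  }

  // A split is logical: the same keys are described by two ranges,
  // [min, splitPoint) and [splitPoint, max). No data moves between shards.
  return [
    sorted.filter((k) => k < splitPoint),
    sorted.filter((k) => k >= splitPoint),
  ];
}

console.log(splitChunk([0, 6, 7, -5, 10, -9], 4));
console.log(splitChunk([5, 5, 5, 5, 5], 2));
```

The second call returns a single oversized chunk, which is why a shard key with many repeated values can leave chunks that can never be split.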
12. Data Distribution
• MinKey to 0 lives on Shard1
• 0 to MaxKey lives on Shard2
• Mongos routes queries appropriately
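As a rough sketch of how routing over these two ranges could work, the snippet below models the chunk table from this slide in plain JavaScript. The names `chunks`, `routeByKey` and `shardsForRange` are hypothetical, and `-Infinity` / `Infinity` merely stand in for MinKey / MaxKey. Each chunk includes its minimum and excludes its maximum.

```javascript
// Stand-ins for BSON MinKey / MaxKey (illustrative only).
const MIN_KEY = -Infinity;
const MAX_KEY = Infinity;

// Hypothetical chunk table matching the slide: each chunk covers [min, max).
const chunks = [
  { min: MIN_KEY, max: 0, shard: "Shard1" },
  { min: 0, max: MAX_KEY, shard: "Shard2" },
];

// Find the shard that owns a single shard key value.
function routeByKey(chunkTable, x) {
  const chunk = chunkTable.find((c) => x >= c.min && x < c.max);
  return chunk ? chunk.shard : null;
}

// Find every shard that a range query lo <= x < hi has to visit.
function shardsForRange(chunkTable, lo, hi) {
  const hit = chunkTable.filter((c) => lo < c.max && hi > c.min);
  return [...new Set(hit.map((c) => c.shard))];
}

console.log(routeByKey(chunks, -5));        // Shard1
console.log(routeByKey(chunks, 0));         // Shard2 (min value is included)
console.log(shardsForRange(chunks, 1, 10)); // [ 'Shard2' ]
console.log(shardsForRange(chunks, -3, 4)); // [ 'Shard1', 'Shard2' ]
```

A query that includes the shard key can be sent to only the shards whose chunks overlap it; a query without the shard key has no such filter and must go to every shard.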
50. Under the Hood
• Create a hashed index used for sharding
• Uses the first 64 bits of the MD5 hash of the field
• Hash both data and BSON type
• Represented as a NumberLong in the shell
51. Hash on both data and BSON type
// hash on 1 as an integer
> db.runCommand({_hashBSONElement:1})
{
    "key" : 1,
    "seed" : 0,
    "out" : NumberLong("5902408780260971510"),
    "ok" : 1
}
// hash on "1" as a string
> db.runCommand({_hashBSONElement:"1"})
{
    "key" : "1",
    "seed" : 0,
    "out" : NumberLong("-2448670538483119681"),
    "ok" : 1
}
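To make the idea of hashing both the type and the data concrete, here is a small Node.js sketch. It is not MongoDB's implementation: the type tags, the separator byte and the little-endian read are all assumptions, and the input is not BSON-encoded, so the numbers it prints will not match the `_hashBSONElement` output above. It only shows why `1` and `"1"` end up with unrelated hashes.

```javascript
const crypto = require("crypto");

// Hypothetical type tags; real BSON has its own numeric type codes.
function typeTag(value) {
  if (typeof value === "number") return "number";
  if (typeof value === "string") return "string";
  throw new TypeError("unsupported type in this sketch");
}

// Hash the type tag together with the data, then keep the first 64 bits
// of the MD5 digest as a signed integer (byte order is an assumption).
function hashElement(value) {
  const hash = crypto.createHash("md5");
  hash.update(typeTag(value)); // the type goes into the hash...
  hash.update("\0");           // separator so tag and data cannot run together
  hash.update(String(value));  // ...followed by the data
  return hash.digest().readBigInt64LE(0);
}

console.log(hashElement(1));
console.log(hashElement("1"));
console.log(hashElement(1) !== hashElement("1")); // true
```

The result is a `BigInt` because a signed 64-bit value does not fit safely in a JavaScript number, which mirrors why the shell shows the hash as a `NumberLong`.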
62. Limitations
• Cannot use a compound key
• Key cannot have an array value
• Incompatible with tag aware sharding
– Tags would be assigned the value of the hash, not the value of the underlying key
• Key with poor cardinality is going to give a hash with poor cardinality
– Floating point numbers are squashed. E.g. 100.4 will be hashed as 100
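The last two limitations can be demonstrated with a short Node.js sketch. `sketchHash` and `distinctHashes` are hypothetical helpers, and the hash is only a stand-in (MD5 over a simple string encoding, not MongoDB's BSON-based hash). The sketch assumes that numbers are truncated to integers before hashing, as the 100.4 example on this slide describes.

```javascript
const crypto = require("crypto");

// Illustrative hash only: numbers are truncated to integers first, then the
// first 64 bits of an MD5 digest are read. Not MongoDB's exact encoding.
function sketchHash(value) {
  const normalized = typeof value === "number" ? Math.trunc(value) : value;
  const input = typeof normalized + ":" + String(normalized);
  return crypto.createHash("md5").update(input).digest().readBigInt64LE(0);
}

// Count how many distinct hashed values a list of shard key values produces.
function distinctHashes(values) {
  return new Set(values.map((v) => sketchHash(v).toString())).size;
}

// Floats collapse onto the same hash as their integer part.
console.log(sketchHash(100.4) === sketchHash(100)); // true

// Hashing cannot create cardinality: two distinct inputs, two distinct hashes.
console.log(distinctHashes(["US", "UK", "US", "UK", "US"]));
```

A field with only a handful of distinct values therefore still maps to only a handful of hash values, and every document sharing a value lands in the same chunk.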
63. Summary
• There are 3 different approaches to sharding
• Hashed shard keys give great distribution
• Hashed shard keys are good for equality queries
• Pick the right shard key for your application
Remind everyone what a sharded cluster is. We will take a close look at how sharded clusters work and at the new hashed shard key feature of 2.4.
Isolating queries (to a few shards)
Scatter-gather (high latency but not bad)
Hash keys
Min value included
Max value not included
Balancer is running on mongos
Once the difference in chunks between the most dense shard and the least dense shard is above the migration threshold, a balancing round starts
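The trigger condition in this note can be written down as a small check. This is a sketch, not the balancer's code: `needsBalancing`, `nextMigration` and the threshold of 8 used below are hypothetical, and the real threshold depends on the server version and the total number of chunks.

```javascript
// Decide whether a balancing round should start.
// chunkCounts maps shard name -> number of chunks it holds.
// migrationThreshold is a hypothetical parameter for this sketch.
function needsBalancing(chunkCounts, migrationThreshold) {
  const counts = Object.values(chunkCounts);
  if (counts.length < 2) return false; // nothing to balance against
  const diff = Math.max(...counts) - Math.min(...counts);
  return diff > migrationThreshold; // "above the migration threshold"
}

// Pick one migration: from the most dense shard to the least dense shard.
function nextMigration(chunkCounts) {
  const entries = Object.entries(chunkCounts).sort((a, b) => b[1] - a[1]);
  return { from: entries[0][0], to: entries[entries.length - 1][0] };
}

const counts = { Shard1: 12, Shard2: 2, Shard3: 5 };
console.log(needsBalancing(counts, 8)); // true: 12 - 2 = 10 is above 8
console.log(nextMigration(counts));     // { from: 'Shard1', to: 'Shard2' }
```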
Moved chunk on shard2 should be gray
Source shard deletes moved data
Must wait for open cursors to either close or time out
NoTimeout cursors may prevent the release of the lock
Mongos releases the balancer lock after old chunks are deleted
Moving data is expensive (I/O, network bandwidth)
Moving many chunks takes a long time (can only move one chunk at a time)
Balancing and migrations compete for resources with your application
The mongos does not have to load the whole set into memory since each shard sorts locally. The mongos can just getMore from the shards as needed and incrementally return the results to the client.
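The incremental merge described here can be sketched with JavaScript generators. The function name `mergeSorted` and the use of array iterators in place of shard cursors are assumptions for illustration; the point is that only one pending result per shard is held at a time, and a shard is asked for more only when its current result has been returned.

```javascript
// Each shard returns its results already sorted; the router merges lazily.
function* mergeSorted(shardCursors, compare) {
  // Hold exactly one pending result per shard cursor.
  const heads = shardCursors.map((it) => ({ it, cur: it.next() }));
  while (true) {
    let best = null;
    for (const h of heads) {
      if (h.cur.done) continue; // this shard is exhausted
      if (best === null || compare(h.cur.value, best.cur.value) < 0) best = h;
    }
    if (best === null) return; // every shard is exhausted
    yield best.cur.value;
    best.cur = best.it.next(); // analogous to a getMore on that shard only
  }
}

// Two locally sorted result sets, standing in for two shards.
const shard1 = [{ x: -9 }, { x: -5 }][Symbol.iterator]();
const shard2 = [{ x: 0 }, { x: 6 }, { x: 7 }, { x: 10 }][Symbol.iterator]();

const merged = [...mergeSorted([shard1, shard2], (a, b) => a.x - b.x)];
console.log(merged.map((d) => d.x)); // [ -9, -5, 0, 6, 7, 10 ]
```

Because the merge is a generator, a client that stops reading early never forces the remaining results to be fetched from the shards.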
What’s the solution to sharding on incremental values as a shard key?
Uses the hashed index
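A quick simulation shows why a hashed key helps with incrementing values. The sketch below is plain Node.js with hypothetical helpers (`sketchHash`, `rangeShard`, `hashedShard`, `countByShard`); it reuses the two-chunk layout split at 0 from the earlier slides and uses MD5 over a string encoding as a stand-in for MongoDB's hash, so exact counts are only indicative.

```javascript
const crypto = require("crypto");

// Stand-in hash: first 64 bits of MD5 over the decimal string of the key.
function sketchHash(n) {
  return crypto.createHash("md5").update(String(n)).digest().readBigInt64LE(0);
}

// Range-based: chunks split at 0 on the raw key value.
function rangeShard(x) {
  return x < 0 ? "Shard1" : "Shard2";
}

// Hash-based: the same split at 0, but applied to the hashed value.
function hashedShard(x) {
  return sketchHash(x) < 0n ? "Shard1" : "Shard2";
}

function countByShard(keys, route) {
  const counts = { Shard1: 0, Shard2: 0 };
  for (const k of keys) counts[route(k)] += 1;
  return counts;
}

// 1000 inserts with an incrementing key: 1, 2, 3, ...
const keys = Array.from({ length: 1000 }, (_, i) => i + 1);
const byRange = countByShard(keys, rangeShard);
const byHash = countByShard(keys, hashedShard);

console.log(byRange); // every insert lands on Shard2
console.log(byHash);  // inserts are spread across both shards
```

The cost is the one noted in the summary: neighbouring key values no longer live in neighbouring chunks, so a range query on the key cannot be routed to a few shards and has to go to all of them.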
Range Based - best
Hash Based - uniform writes but not routed range queries
Tag Aware