Hellenic MongoDB user group - Introduction to sharding

Introduction to sharding
Christos Soulios
Software Architect, Persado
(christos.soulios@persado.com)

15 Jan 2013

Let's start with an example:

 We have launched our latest and greatest web application
 We use the MongoDB database, which is fast and cool
 We have even set up replication for high availability
 Our application turns out to be popular and we are already
planning our next project
 Cool!


Unfortunately, our website becomes too popular too fast,
and this causes problems.


MongoDB problems when dataset grows

 Dataset does not fit on local disks.
Solution: Let’s buy more disks
 Database indexes do not fit in memory. They have to be paged
in and out. Database becomes sluggish
Solution: Let’s buy more memory
 High-throughput write operations cause high contention on
the infamous MongoDB locks
Now what?

We need to scale horizontally. We need sharding


What is sharding?

 Sharding is automatic data partitioning
 Distributes data evenly across cluster nodes (called shards)
 Allows for seamless querying. Almost no functionality is lost
compared with a single master
 Keeps database consistent


How sharding works

 Collection data is broken into chunks based on the range of a
selected collection field. This field is called the shard key
 Chunks are evenly distributed across shards. Each data chunk is
controlled by a single shard
 Special config servers are responsible for storing which shard
controls which chunks
 Database clients communicate with the shards through the mongos
router process
 The mongos router looks to the client just like a normal mongod
server. Sharding is transparent to the client
 For each database operation, the mongos router uses the shard key
and the chunk metadata from the config servers (which it caches) to
redirect the operation to the correct shards
 As more data is inserted, chunks are split into smaller ranges
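The routing and splitting steps above can be sketched in a few lines of plain JavaScript. This is only an illustration, not how mongos is implemented: the chunk table, the shard names, the split point and the helpers findChunk and splitChunk are all hypothetical, and -Infinity/Infinity stand in for the lowest and highest possible key values.

```javascript
// Hypothetical chunk table, as the config servers might record it: each chunk
// covers the half-open shard key range [min, max) and lives on one shard.
const chunkTable = [
  { min: -Infinity, max: 1000, shard: "shard0" },
  { min: 1000, max: 3000, shard: "shard1" },
  { min: 3000, max: Infinity, shard: "shard2" },
];

// Find the chunk whose range contains the given shard key value.
function findChunk(chunks, keyValue) {
  return chunks.find((c) => keyValue >= c.min && keyValue < c.max);
}

// Split one chunk in two at splitPoint; both halves stay on the same shard.
function splitChunk(chunks, chunk, splitPoint) {
  const position = chunks.indexOf(chunk);
  const left = { min: chunk.min, max: splitPoint, shard: chunk.shard };
  const right = { min: splitPoint, max: chunk.max, shard: chunk.shard };
  chunks.splice(position, 1, left, right);
}

console.log(findChunk(chunkTable, 45).shard);   // shard0
console.log(findChunk(chunkTable, 4503).shard); // shard2

// The last chunk has grown too large, so it is split at a made-up key value.
splitChunk(chunkTable, findChunk(chunkTable, 4503), 5000);
console.log(chunkTable.length);                 // 4
```

A split only changes the metadata: both halves stay on the same shard until the balancer decides to move one of them.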


Example (Users collection)
{'user_id' : 45,
'username': 'asterix',
'email' : 'asterix@google.com',
'last_login': '11/11/2012'
},
{'user_id' : 4503,
'username': 'gandalf',
'email' : 'gandalf_rules@yahoo.com',
'last_login': '01/14/2013'
},
{'user_id' : 1153,
'username': 'superman',
'email' : 'superman@superdomain.com',
'last_login': '10/30/2012'
},
{'user_id' : 5434,
'username': 'darth_vader',
'email' : 'darth@stardestroyer.org',
'last_login': '07/01/2012'
}

> db.runCommand( { shardcollection: "test.users", key: { username: 1 } } )


Shard architecture (sharding by user_id)


Database operations

 All queries are routed through the mongos process
 Insert operations are routed by shard key. The shard key is
required
 Querying by shard key routes the query only to the shards that
hold the matching chunks
 Querying by non-shard key scatters the query to all shards
and gathers results
 Updates and deletes behave like queries
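The difference between a targeted query and a scatter-gather query can be sketched as follows. The chunk layout, the shard names and the targetShards helper are hypothetical, and the sketch only handles an equality match on a numeric shard key, which is far less than a real router does.

```javascript
// Hypothetical layout: chunks are ranges of user_id, one chunk per shard.
const userChunks = [
  { min: -Infinity, max: 2000, shard: "shard0" },
  { min: 2000, max: 5000, shard: "shard1" },
  { min: 5000, max: Infinity, shard: "shard2" },
];

// Return the shards a query must visit. An equality match on the shard key
// targets a single shard; any other query is sent to every shard.
function targetShards(chunks, shardKeyField, query) {
  if (shardKeyField in query) {
    const value = query[shardKeyField];
    const chunk = chunks.find((c) => value >= c.min && value < c.max);
    return [chunk.shard];
  }
  // No shard key in the query: scatter to all shards, then gather results.
  return [...new Set(chunks.map((c) => c.shard))];
}

// Targeted: only the shard holding user_id 4503 is contacted.
console.log(targetShards(userChunks, "user_id", { user_id: 4503 }));
// Scatter-gather: username is not the shard key here, so all shards are asked.
console.log(targetShards(userChunks, "user_id", { username: "gandalf" }));
```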


Data balancing

 System becomes unbalanced when one shard stores more
data chunks than others
 Data is automatically balanced without intervention from the
client application or the administrator


Data balancing

 The range of the loaded shard is split and chunks are migrated
to other shards


Data balancing

 Config servers are updated using a two-phase commit process to
ensure database consistency
 System ends up balanced
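The balancing idea can be illustrated with a toy balancer. This is a sketch under simplifying assumptions, not MongoDB's actual algorithm: the balance and chunkCounts helpers, the shard names and the threshold of one chunk are all made up, and a "migration" here only reassigns ownership without copying any data.

```javascript
// Count how many chunks each shard owns.
function chunkCounts(chunks, shards) {
  const counts = Object.fromEntries(shards.map((s) => [s, 0]));
  for (const c of chunks) counts[c.shard] += 1;
  return counts;
}

// Toy balancer: move one chunk at a time from the fullest shard to the
// emptiest one until their chunk counts differ by at most `threshold`.
// The threshold is an assumption for this sketch, not MongoDB's value.
function balance(chunks, shards, threshold = 1) {
  let migrations = 0;
  for (;;) {
    const counts = chunkCounts(chunks, shards);
    const sorted = [...shards].sort((a, b) => counts[a] - counts[b]);
    const emptiest = sorted[0];
    const fullest = sorted[sorted.length - 1];
    if (counts[fullest] - counts[emptiest] <= threshold) return migrations;
    // Reassign ownership of one chunk; a real migration also copies the data.
    chunks.find((c) => c.shard === fullest).shard = emptiest;
    migrations += 1;
  }
}

const shardNames = ["shard0", "shard1", "shard2"];
// Start unbalanced: all six chunks live on shard0.
const demoChunks = Array.from({ length: 6 }, (_, i) => ({ id: i, shard: "shard0" }));
const moved = balance(demoChunks, shardNames);
console.log(moved);                               // 4
console.log(chunkCounts(demoChunks, shardNames)); // two chunks on every shard
```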


Choosing a shard key

 Choosing a good shard key is critical
 Once chosen, we are stuck with it
 Shard key must be immutable
 Should distribute data load evenly across shards
 Should be of high cardinality. Enumerated values are not good
shard keys
 Should not be monotonically increasing. ObjectIds, dates or database
sequences are not good shard keys, because they create hotspots
 Should be used by most critical queries to provide query isolation.
Avoid scatter-gather queries
 Should provide good data affinity to avoid disk-to-memory transfers
(random values are not good shard keys)
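The hotspot problem with increasing keys can be made concrete with a small simulation. Everything here is assumed for illustration: the split points, the key values, the formula that produces the scattered keys, and the helpers chunkIndexFor and insertsPerChunk.

```javascript
// Hypothetical split points: four chunks over keys 0..3999, one per shard.
const splitPoints = [1000, 2000, 3000];

// Index of the chunk that owns a key (0 = lowest range, 3 = highest range).
function chunkIndexFor(key) {
  let index = 0;
  while (index < splitPoints.length && key >= splitPoints[index]) index += 1;
  return index;
}

// Count how many of the given inserts land in each chunk.
function insertsPerChunk(keys) {
  const counts = new Array(splitPoints.length + 1).fill(0);
  for (const key of keys) counts[chunkIndexFor(key)] += 1;
  return counts;
}

// 100 inserts with an ever-increasing key (like a sequence or a timestamp).
const increasingKeys = Array.from({ length: 100 }, (_, i) => 3900 + i);
// 100 inserts whose key is unrelated to insertion order (made-up formula).
const scatteredKeys = Array.from({ length: 100 }, (_, i) => (i * 1009) % 4000);

console.log(insertsPerChunk(increasingKeys)); // [ 0, 0, 0, 100 ]
console.log(insertsPerChunk(scatteredKeys));  // [ 25, 25, 25, 25 ]
```

The increasing key sends every new insert to the last chunk, so one shard takes all the writes. The scattered key spreads writes evenly, but as noted above, fully random keys give poor data affinity, so neither extreme is a good choice on its own.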



Know your data. It is important
 What is the expected dataset size?
 What is the write throughput?
 What does the data look like? Which fields are random or increasing?
Are there low-cardinality fields?
 Can we identify any access patterns for reads?
 What data is indexed?
 What is the active working set? Is there historical data that
is no longer used after some time?



 It is not trivial
 Most of the time there is no single field that can be used as
the shard key
 We have to invent one



 Applications usually access recently inserted data more often
 What about a compound shard key?
 What about a combination of a coarsely ascending field and a
commonly queried search key?
 The coarsely ascending key should have a few hundred chunks
per value. This provides good data locality and even
distribution
 Search key provides query isolation

Rule of thumb: {coarseLocality: 1, search: 1}


Example (Tweets collection)
{user: 'asterix',
ts: ISODate("2013-01-14T22:53:33.123Z"),
month: '2013-01',
retweets: 45,
client: 'TweetDeck',
text: 'MongoDB sharding is super cool!'
}

We are typically looking for the latest tweets of a user.

 Therefore, a combination of the 'month' and 'user' fields would make a
good shard key
 The month field is coarsely ascending, so only the latest tweets
need to be kept in memory
 The user field is a commonly searched key
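To see why a compound key on month and user keeps related tweets together, the sketch below orders a few made-up tweets the way such a key would order them: by month first, then by user. The compareKeys helper and the sample data are hypothetical, and only the two key fields are shown.

```javascript
// Compare two compound shard key values {month, user} field by field:
// month decides first, user breaks ties within the same month.
function compareKeys(a, b) {
  if (a.month !== b.month) return a.month < b.month ? -1 : 1;
  if (a.user !== b.user) return a.user < b.user ? -1 : 1;
  return 0;
}

// Made-up tweets, reduced to the two shard key fields.
const tweets = [
  { month: "2013-01", user: "gandalf" },
  { month: "2012-12", user: "asterix" },
  { month: "2013-01", user: "asterix" },
  { month: "2012-12", user: "gandalf" },
  { month: "2013-01", user: "asterix" },
];

const ordered = [...tweets].sort(compareKeys);
console.log(ordered.map((t) => `${t.month}/${t.user}`));
// 2012-12/asterix, 2012-12/gandalf, 2013-01/asterix, 2013-01/asterix, 2013-01/gandalf
```

All tweets of the latest month end up next to each other at the high end of the key range, and within a month the tweets of one user are adjacent, so a query for a user's latest tweets touches a narrow key range.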


Conclusion

 Sharding allows MongoDB databases to scale horizontally
 Shard balancing is performed automatically by the system
 Sharding is transparent to the client application
 Choosing a good shard key is critical
 Choosing a good shard key is not trivial
 Be creative and experiment with your data before choosing
the shard key

