2.
Agenda
Overview
• What is sharding?
• Why should I use sharding, and what should I use it for?
Building your First Sharded Cluster
• What do I need to know to succeed with sharding?
Q&A
3.
What is Sharding?
Sharding is a means of partitioning data across servers to enable:
• Scale: needed by modern applications to support massive workloads and data volumes.
• Geo-Locality: to support geographically distributed deployments and an optimal UX for customers across vast geographies.
• Hardware Optimizations: to balance performance vs. cost.
• Lower Recovery Times: to make Recovery Time Objectives (RTO) feasible.
4.
What is Sharding?
Sharding involves a shard key, defined by a data modeler, that describes the partition space of a data set.
MongoDB provides 3 types of sharding strategies:
• Ranged
• Hashed
• Tag-aware
Data is partitioned into chunks by the shard key, and these chunks are distributed evenly across shards that reside on many physical servers.
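The partitioning idea above can be sketched in a few lines. This is a toy model, not MongoDB's balancer: the chunk boundaries, shard names, and round-robin assignment are all hypothetical, chosen only to show how a shard key value maps to a chunk and a chunk to a shard.

```python
# Illustrative range partitioning: documents fall into chunks by shard
# key value, and chunks are spread across shards. The bounds and shard
# names are made up; MongoDB splits and balances chunks dynamically.
CHUNK_BOUNDS = [0, 250, 500, 750, 1000]  # hypothetical uId boundaries
SHARDS = ["shard0", "shard1"]

def chunk_index(uid):
    """Find which chunk a shard key value falls into."""
    for i in range(len(CHUNK_BOUNDS) - 1):
        if CHUNK_BOUNDS[i] <= uid < CHUNK_BOUNDS[i + 1]:
            return i
    raise ValueError("uId outside chunk bounds")

def shard_for(uid):
    """Assign chunks to shards round-robin (a stand-in for balancing)."""
    return SHARDS[chunk_index(uid) % len(SHARDS)]

print(shard_for(100))  # shard0 (chunk 0)
print(shard_for(300))  # shard1 (chunk 1)
```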
6.
Hash Sharding
Hash sharding is a special case of range sharding: the range is over hashed key values.
[Diagram: chunks of contiguous MD5 hash ranges distributed across shards]
MongoDB applies an MD5 hash to the key when a hashed shard key is used:
Hash Shard Key(deviceId) = MD5(deviceId)
This ensures data is distributed randomly within the range of MD5 values.
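The effect of hashing the key can be seen with a small simulation. This is an illustrative stand-in for MongoDB's md5-based hashed shard key, not its actual implementation: sequential device IDs, which would pile into one chunk under plain range sharding, spread almost evenly once hashed.

```python
import hashlib

def hash_chunk(device_id, num_chunks=4):
    """Map a key to a chunk by hashing it first (illustrative only)."""
    digest = hashlib.md5(device_id.encode()).hexdigest()
    return int(digest[:8], 16) % num_chunks

# Monotonically increasing IDs land in chunks near-uniformly.
counts = [0] * 4
for i in range(10_000):
    counts[hash_chunk("device-%d" % i)] += 1
print(counts)  # each count is roughly 2500
```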
7.
Tag-aware Sharding
Tag-aware sharding allows a subset of shards to be tagged and assigned to a sub-range of the shard key.

Collection: Users, Shard Key: {uId, regionCode}

Tag  | Start          | End
West | MinKey, MinKey | MaxKey, 50
East | MinKey, 50     | MaxKey, MaxKey

[Diagram: Shard1 and Shard2 tagged West; Shard3 and Shard4 tagged East; each shard is a replica set with secondaries]

Example: sharding user data belonging to users from 100 "regions". Assign regions 1-50 to the West and regions 51-100 to the East.
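The region-to-tag assignment in the example can be sketched as a lookup. The ranges mirror the example above (regions 1-50 to West, 51-100 to East); the function name and data structure are hypothetical, standing in for what MongoDB's config servers track via tag ranges.

```python
# Hypothetical tag-range lookup matching the example: regions 1-50 are
# tagged West, 51-100 East. MongoDB itself stores such ranges in the
# cluster metadata; this only illustrates the mapping.
TAG_RANGES = [
    ("West", 1, 50),
    ("East", 51, 100),
]

def tag_for_region(region_code):
    for tag, lo, hi in TAG_RANGES:
        if lo <= region_code <= hi:
            return tag
    raise ValueError("region %d outside tagged ranges" % region_code)

print(tag_for_region(7))   # West
print(tag_for_region(51))  # East
```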
8.
Applying Sharding

Usage                 | Required Strategy
Scale                 | Range or Hash
Geo-Locality          | Tag-aware
Hardware Optimization | Tag-aware
Lower Recovery Times  | Range or Hash
10.
Typical Small Deployment
A replica set is highly available but not scalable.
• Writes: limited by the capacity of the primary's host.
• Reads, when immediate consistency matters: limited by the capacity of the primary's host.
• Reads, when eventual consistency is acceptable: limited by the capacity of the available replica set members.
11.
Sharded Architecture
• Auto-balancing: data is partitioned based on a shard key and automatically balanced across shards by MongoDB.
• Query Routing: database operations are transparently routed across the cluster through a routing proxy process (software).
• Horizontal Scalability: load is distributed and resources are pooled across commodity servers, increasing read/write capacity.
• Decoupling for Development and Operational Simplicity: Ops can add capacity without app dependencies or Dev involvement.
12.
Value of Scale-out Architecture
[Chart: System Capacity vs. $. Scale-up (apps of the past) hits hard limits; modern apps trend toward scale-out to 100-1000s of servers, achieving high capacity/$, optimizing on capacity/$, and retaining the ability to scale down and re-allocate resources.]
13.
Sharding for Geo-Locality
Adobe Cloud Services, among other popular consumer and enterprise services, uses sharding to run servers across multiple data centers spanning geographies.
• Amazon: every 100 ms of delay resulted in a 1% loss of sales.
• Google: a half-second delay caused a 20% drop in traffic.
• Aberdeen Group: a 1-second delay in page-load time caused
  o 11% fewer page views
  o 16% decrease in customer satisfaction
  o 7% loss in conversions
Network latency from the US West coast to the East coast is ~80 ms.
14.
Multi-Active DCs via Tag-aware Sharding
Collection: Users, Shard Key: {regionCode, uId}

Tag  | Start          | End
West | MinKey, MinKey | MaxKey, 50
East | MinKey, 50     | MaxKey, MaxKey

[Diagram: WEST and EAST data centers. West-tagged shards keep their primaries in the West DC (priority=10) with secondaries in the East DC (priority=5), and vice versa for East-tagged shards. Updates route to the shard owning the region's range; local reads stay in-region, e.g. with Read Preference = Nearest.]
15.
Optimizing Latency and Cost

Magnitudes of difference in speed:
Event      | Latency | Normalized to 1 s
RAM access | 120 ns  | 6 min
SSD access | 150 µs  | 6 days
HDD access | 10 ms   | 12 months

Magnitudes of difference in cost:
Storage Type | Avg. Cost ($/GB) | Cost at 100 TB ($)
RAM          | 5.50             | 550K
SSD          | 0.50-1.00        | 50K to 100K
HDD          | 0.03             | 3K
16.
Optimizing Latency and Cost
Use Case: sensor data collected from millions of devices. The data is used for real-time decision automation, real-time monitoring, and historical reporting.

Data Type               | Description                                         | Latency SLA             | Data Volume
Meta Data               | Fast look-ups to drive real-time decisions          | 95th percentile < 1 ms  | < 1 TB
Last 90 days of Metrics | 95+% of data reported and monitored                 | 95th percentile < 30 ms | < 10 TB
Historic                | Used for historic reporting; accessed infrequently  | 95th percentile < 2 s   | > 100 TB
17.
Hardware Optimizations

Collection | Shard Key
Meta       | DeviceId
Metrics    | DeviceId, Timestamp

Collection | Tag     | Start               | End
Meta       | Cache   | MinKey              | MaxKey
Metrics    | Flash   | MinKey, MinKey      | MaxKey, 90 days ago
Metrics    | Archive | MinKey, 90 days ago | MaxKey, MaxKey

Shard hardware by tag:
• Cache: high memory ratio, fast cores, HDD
• Flash: medium memory ratio, high compute, SSDs
• Archive: low memory ratio, medium compute, HDD

[Diagram: each tag's shards are replica sets with multiple secondaries]
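The tiering rule behind the tag table can be sketched as a small function. This is a hypothetical helper, assuming (per the data-type table on the previous slide) that Flash-tagged shards hold the most recent 90 days of metrics and Archive-tagged shards hold the rest; the function name is invented for illustration.

```python
from datetime import datetime, timedelta, timezone

def tier_for(collection, timestamp=None):
    """Hypothetical tiering rule: Meta documents live on Cache shards;
    Metrics newer than 90 days on Flash, older on Archive."""
    if collection == "Meta":
        return "Cache"
    cutoff = datetime.now(timezone.utc) - timedelta(days=90)
    return "Flash" if timestamp >= cutoff else "Archive"

now = datetime.now(timezone.utc)
print(tier_for("Meta"))                                 # Cache
print(tier_for("Metrics", now))                         # Flash
print(tier_for("Metrics", now - timedelta(days=200)))   # Archive
```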
18.
Restoration Times
Scenario: an application bug causes logical corruption of the data, and the database needs to be rolled back to a previous point in time (PIT). What RTO does your business require in this event?
Total DB snapshot size = 100 TB:
• N = 10: 10 x 10 TB snapshots generated and/or transferred in parallel.
• N = 100: 100 x 1 TB snapshots generated and/or transferred in parallel; potentially 10x faster restoration time.
[Diagram: per-shard tar.gz snapshot archives restored in parallel]
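The 10x claim follows from simple arithmetic: restore time is driven by the largest single snapshot when all shards restore in parallel. A back-of-envelope model, where the 200 MB/s per-node restore rate is a hypothetical illustration rather than a figure from the slides:

```python
# Restore-time model for the scenario above, assuming one snapshot per
# shard restored in parallel at a fixed per-node rate (hypothetical).
def restore_hours(total_tb, num_shards, mb_per_sec=200.0):
    per_shard_mb = total_tb * 1_000_000 / num_shards
    return per_shard_mb / mb_per_sec / 3600

print(round(restore_hours(100, 10), 1))   # 13.9 hours with 10 shards
print(round(restore_hours(100, 100), 1))  # 1.4 hours with 100 shards
```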
20.
Life-cycle of Sharding
Design & Development → Test/QA → Pre-Production → Production
Product Definition: starts with an idea to build something big!
Predictive Maintenance Platform: a cloud platform for building predictive maintenance applications and services, such as a service that monitors various vehicle components by collecting data from sensors and automatically prescribes actions to take.
• Allow tenants to register, ingest and modify data collected by sensors
• Define and apply workflows and business rules
• Publish and subscribe to notifications
• Data access API for things like reporting
21.
Life-cycle of Sharding
Design & Development → Test/QA → Pre-Production → Production
Data Modeling: Do I need to shard?
• Throughput: data from millions of sensors updated in real-time.
• Latency: the value of certain attributes needs to be accessed with 95th percentile < 10 ms to support real-time decisions and automation.
• Volume: 1 TB of data collected per day, retained for 5 years.
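The volume requirement alone settles the question. A quick check of the arithmetic stated above:

```python
# 1 TB/day retained for 5 years: far beyond what a single replica set
# can reasonably hold, so sharding is required on volume alone.
tb_per_day = 1
retention_days = 5 * 365
total_tb = tb_per_day * retention_days
print(total_tb)  # 1825 TB, i.e. roughly 1.8 PB
```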
22.
Life-cycle of Sharding
Design & Development → Test/QA → Pre-Production → Production
Data Modeling: Select a Good Shard Key
This is a critical step:
• Sharding is only as effective as the shard key.
• Shard key attributes are immutable.
• Re-sharding is non-trivial: it requires re-partitioning the data.
31.
Index Locality
[Diagram: a random-access index (Key = Hash(DeviceId)) spreads reads and writes across the whole MD5 range, so most of the index must sit in RAM; a right-balanced index (Key = DeviceId, Timestamp) concentrates activity at the recent end (day 0, day 1, day 2, ... per device), so only the working set needs to be resident.]
A right-balanced index may only need to be partially in RAM to be effective.
33.
Life-cycle of Sharding
Design & Development → Test/QA → Pre-Production → Production
Performance Testing: Avoid Pitfalls
Pitfall: splits happen on demand as chunks grow, and migrations happen when an imbalance is detected. Load-testing an unprepared cluster triggers both mid-test, making it appear that sharding results in massive performance degradation.
Best Practices:
• Pre-split:
1. Hashed shard key: specify numInitialChunks: http://docs.mongodb.org/manual/reference/command/shardCollection/
2. Custom shard key: create a pre-split script: http://docs.mongodb.org/manual/tutorial/create-chunks-in-sharded-cluster/
• Run mongos (the query router) on the app server if possible.
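A pre-split script for a custom key ultimately boils down to a list of evenly spaced boundary values. The arithmetic below is purely illustrative (it is not a MongoDB API): it computes the hex boundaries one would feed to the split commands described in the tutorial linked above.

```python
# Compute evenly spaced split points across a hex key space, the kind
# of list a pre-split script would pass to the cluster's split command.
def split_points(num_chunks, bits=32):
    """Boundaries dividing [0, 2^bits) into num_chunks equal ranges."""
    step = (1 << bits) // num_chunks
    return [format(i * step, "08x") for i in range(1, num_chunks)]

print(split_points(4))  # ['40000000', '80000000', 'c0000000']
```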
34.
Life-cycle of Sharding
Design & Development → Test/QA → Pre-Production → Production
Capacity Planning: How many shards do I need?
Sizing:
• What are the total resources required for your initial deployment?
• What are the ideal hardware specs, and the number of shards necessary?
Capacity Planning: create a model for scaling MongoDB for your specific app.
• How do I determine when more shards need to be added?
• How much capacity do I gain by adding a shard?
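One way to frame the sizing question: shard count is driven by whichever resource runs out first. The per-shard figures below are hypothetical placeholders; in practice they would come from load testing or a domain expert's analysis, as the next slides discuss.

```python
import math

# Toy sizing model: take the max of the shard counts implied by each
# resource dimension. Per-shard capacities are invented for illustration.
def shards_needed(total_storage_tb, total_ops_per_sec,
                  per_shard_tb=2.0, per_shard_ops=20_000):
    by_storage = math.ceil(total_storage_tb / per_shard_tb)
    by_throughput = math.ceil(total_ops_per_sec / per_shard_ops)
    return max(by_storage, by_throughput)

print(shards_needed(30, 100_000))  # 15: storage-bound (max of 15 and 5)
```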
35.
How Many Servers?

Strategy                 | Accuracy                                                          | Level of Effort | Feasibility of Early Project Analysis
Domain Expert            | High to Low: inversely related to the complexity of the application | Low             | Yes
Empirical (Load Testing) | High                                                              | High            | Unlikely
36.
Domain Expert
Model and Load Definition → Business Solution Analysis → Resource Analysis → Hardware Specification
• What is the document model? Collections, documents, indexes.
• What are the major operations?
  - Throughput
  - Latency
• What is the working set? E.g., the last 90 days of orders.
Normally performed by a MongoDB Solution Architect: http://bit.ly/1rkXcfN
37.
Domain Expert
Model and Load Definition → Business Solution Analysis → Resource Analysis → Hardware Specification

Resource | Methodology
RAM      | Standard: working set + indexes. Adjust up or down depending on latency vs. cost requirements. Very large clusters should account for connection pooling/thread overhead (1 MB per active thread).
IOPS     | Primarily based on throughput requirements: writes + an estimate of query page faults. Assume random IO. Account for replication, journal and log (note: sequential IO). Ideally estimated empirically through prototype testing; experts can use experience from similar applications as an estimate. Spot testing may be needed.
Storage  | Estimate using throughput, document and index size approximations, and retention requirements. Account for overhead like fragmentation if applicable.
CPU      | Rarely the bottleneck; MongoDB is a lot less CPU-intensive than RDBMSs. Current commodity CPU specs will suffice.
Network  | Estimate using throughput and document size approximations.
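The RAM row above reduces to a short formula. A minimal sketch with made-up inputs; only the 1 MB-per-active-thread overhead comes from the table, the rest are hypothetical example figures.

```python
# RAM methodology from the table: working set + indexes, plus ~1 MB of
# overhead per active connection thread. Input figures are invented.
def ram_gb(working_set_gb, index_gb, active_threads):
    thread_overhead_gb = active_threads * 1 / 1024  # 1 MB per thread
    return working_set_gb + index_gb + thread_overhead_gb

print(round(ram_gb(200, 50, 2048), 1))  # 252.0 GB for this example
```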
38.
Sizing by Empirical Testing
• Sizing can be more accurately obtained by prototyping your application and performing load tests on selected hardware.
• Capacity planning can be accomplished simultaneously through load testing.
• Past webinars: http://www.mongodb.com/presentations/webinar-capacity-planning
Strategy:
1. Implement a prototype that can at least simulate the major workloads.
2. Select an economical server that you plan to scale out on.
3. Saturate a single replica set or shard (maintaining latency SLAs as needed). Address bottlenecks, optimize, and repeat.
4. Add an additional shard (as well as mongos instances and clients as needed). Saturate and confirm roughly linear scaling.
5. Repeat step 4 until you are able to model capacity gains (throughput + latency) versus the number of physical servers.
39.
Operational Scale
Design & Development → Test/QA → Pre-Production → Production
Business Critical Operations: How do I manage 100s to 1000s of nodes?
MongoDB Management Services (MMS): https://mms.mongodb.com
• Real-time monitoring and visualization of cluster health
• Alerting
• Automated cluster provisioning
• Automation of daily operational tasks like no-downtime upgrades
• Centralized configuration management
• Automated PIT snapshotting of clusters
• PITR support for sharded clusters
42.
Get Expert Advice on Scaling. For Free.
For a limited time, if you're considering a commercial relationship with MongoDB, you can sign up for a free one-hour consult about scaling with one of our MongoDB Engineers.
Sign Up: http://bit.ly/1rkXcfN
EA Sports FIFA: the world's best-selling sports video game franchise. MongoDB stores user data and game state for millions of players.
Yandex: the largest search engine in Russia uses MongoDB to manage all user and metadata for its file-sharing service. MongoDB has scaled to support tens of billions of objects and TBs of data, growing at 10 million new file uploads per day.
eBay
Foursquare: used by over 50 million people worldwide, who have checked in over 6 billion times, with millions more added every day. MongoDB is Foursquare's main database, supporting hundreds of thousands of operations per second and storing all check-ins and history, user and venue data, along with reviews.
AHL, a part of Man Group plc, is a quantitative investment manager based in London and Hong Kong, with over $11.3 billion in assets under management. After evaluating multiple technology options, AHL used MongoDB to replace its relational and specialised "tick" databases. MongoDB supports 250 million ticks per second, at 40x lower cost than the legacy technologies it replaced.
Adobe: many of the world's most recognizable brands use Adobe Experience Manager to accelerate development of digital experiences that increase customer loyalty, engagement and demand. Adobe uses MongoDB to store petabytes of data for the large-scale content repositories underpinning Experience Manager.
MongoDB MMS Back-up: 2PB
McAfee: MongoDB powers McAfee Global Threat Intelligence (GTI), a cloud-based intelligence service that correlates data from millions of sensors around the globe. Billions of documents are stored and analyzed in MongoDB to deliver real-time threat intelligence to other McAfee end-client products.
Carfax: CARFAX relies on its Vehicle History database to connect potential buyers with used vehicles in their area, and for analytics to guide the business. To improve customer experience, CARFAX migrated to MongoDB which now manages over 13 billion documents, before replication across multiple data centers.
Limited capacity: monolithic architecture, siloed data access, or complex application-level sharding that is difficult and expensive to implement successfully.
Scale down: systems with seasonal load (e.g. online games, popular at launch and fading over time). Sharding enables scaling down to reallocate resources on bare-metal servers; you can't do the same with a database appliance.
A hash key isn't always a good key. It can be good for the most simplistic key-value access patterns, but any query retrieving multiple documents, such as range-based queries, can lead to scatter-gather queries, which have high overhead and impair the ability to scale linearly.