This session introduces the basic components of high availability before taking a deep dive into MongoDB replication. We'll explore some of MongoDB replication's advanced capabilities and best practices for ensuring data durability and redundancy. We'll also look at various deployment scenarios and disaster recovery configurations.
2. @spf13
AKA Steve Francia
15+ years building the internet
Father, husband, skateboarder
Chief Solutions Architect @ 10gen, responsible for drivers, integrations, web & docs
3. Agenda
• Intro to replication
• How MongoDB does Replication
• Configuring a ReplicaSet
• Advanced Replication
• Durability
• High Availability Scenarios
9. Use cases
• High Availability (auto-failover)
• Read Scaling (extra copies to read from)
• Backups
  • Delayed Copy (fat finger)
  • Online, Point in Time (PiT) backups
• Use (hidden) replica for secondary workload
  • Analytics
  • Data processing
  • Integration with external systems
12. Types of outage
Planned
• Hardware upgrade
• O/S or file-system tuning
• Relocation of data to new file-system /
storage
• Software upgrade
Unplanned
• Hardware failure
• Data center failure
• Region outage
• Human error
• Application corruption
20. Replica Set features
• A cluster of N servers
• Any (one) node can be primary
• Consensus election of primary
• Automatic failover
• Automatic recovery
• All writes to primary
• Reads can be to primary
(default) or a secondary
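A minimal mongo shell sketch of the last two points (collection and document names are made up): writes always go to the primary, and a connection must opt in before it can read from a secondary.

    // Writes are always sent to the primary.
    db.users.insert({ name: "alice" })

    // By default a connection to a secondary refuses reads.
    // Opt this connection in to secondary reads:
    rs.slaveOk()   // shell helper; equivalent to db.getMongo().setSlaveOk()
    db.users.find({ name: "alice" })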
30. How Is Data Replicated?
• Change operations are written to the oplog
• The oplog is a capped collection (fixed size)
  • Must have enough space to allow new secondaries to catch up (from scratch or from a backup)
  • Must have enough space to cope with any applicable slaveDelay
• Secondaries query the primary’s oplog and apply what they find
• All replicas contain an oplog
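A quick way to look at the oplog from the mongo shell (a sketch; local.oplog.rs is where a replica set keeps its oplog):

    // The oplog is a capped collection in the "local" database.
    var local = db.getSiblingDB("local")
    local.oplog.rs.find().sort({ $natural: -1 }).limit(1)   // most recent operation

    // Oplog size and the time window it currently covers:
    db.printReplicationInfo()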
38. Managing a Replica Set
rs.conf()
Shell helper: get current configuration
rs.initiate(<cfg>);
Shell helper: initiate replica set
rs.reconfig(<cfg>)
Shell helper: reconfigure a replica set
rs.add("hostname:<port>")
Shell helper: add a new member
rs.remove("hostname:<port>")
Shell helper: remove a member
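Putting these helpers together, a sketch of bringing up a three-member replica set (the set name rs0, hostnames, and ports are placeholders; each mongod is assumed to have been started with --replSet rs0):

    rs.initiate({
      _id: "rs0",
      members: [
        { _id: 0, host: "node1.example.com:27017" },
        { _id: 1, host: "node2.example.com:27017" },
        { _id: 2, host: "node3.example.com:27017" }
      ]
    })

    rs.conf()                              // inspect the current configuration
    rs.add("node4.example.com:27017")      // add a new member
    rs.remove("node4.example.com:27017")   // remove it again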
42. Managing a Replica Set
rs.status()
Reports status of the replica set from one
node's point of view
rs.stepDown(<secs>)
Request the primary to step down
rs.freeze(<secs>)
Prevents the member from seeking election for <secs> seconds, keeping the current primary/secondary roles in place
Use during backups
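A sketch of how these helpers might be used around maintenance or a backup (the 120-second windows are arbitrary):

    rs.status()      // member states, health, and replication progress

    // On the primary: hand over to a secondary and stay out of elections
    // for 120 seconds.
    rs.stepDown(120)

    // On the secondary being backed up: refuse to become primary for
    // 120 seconds while the backup runs.
    rs.freeze(120)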
51. Other member types
• Arbiters
  • Don’t store a copy of the data
  • Vote in elections
  • Used as a tie breaker
• Hidden
  • Not reported in isMaster
  • Hidden from slaveOk reads
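A configuration sketch that includes both member types (hostnames are placeholders; a hidden member is also given priority 0 so it can never be elected):

    rs.initiate({
      _id: "rs0",
      members: [
        { _id: 0, host: "node1.example.com:27017" },
        { _id: 1, host: "node2.example.com:27017" },
        { _id: 2, host: "arbiter.example.com:27017", arbiterOnly: true },
        { _id: 3, host: "analytics.example.com:27017", hidden: true, priority: 0 }
      ]
    })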
53. Priorities
• Priority: a number between 0 and 100
• Used during an election:
• Most up to date
• Highest priority
• Less than 10s behind failed Primary
• Allows weighting of members during
failover
58. Priorities - example
A B C D E
p:10 p:10 p:1 p:1 p:0
• Assuming all members are up to date
• Members A or B will be chosen first
• Highest priority
• Members C or D will be chosen when:
• A and B are unavailable
• A and B are not up to date
• Member E is never chosen
• priority:0 means it cannot be elected
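Expressed as configuration, the A through E example might look like this sketch (assuming members[0] through members[4] correspond to A through E):

    var cfg = rs.conf()
    cfg.members[0].priority = 10   // A
    cfg.members[1].priority = 10   // B
    cfg.members[2].priority = 1    // C
    cfg.members[3].priority = 1    // D
    cfg.members[4].priority = 0    // E: can never be elected primary
    rs.reconfig(cfg)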
68. Write Concern
w: the number of servers to replicate to (or majority)
wtimeout: timeout in ms waiting for replication
j: wait for journal sync
tags: ensure replication to n nodes of a given tag
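In the mongo shell of this era, these options are passed to getLastError after a write; a sketch (collection name and values are illustrative):

    db.orders.insert({ item: "abc", qty: 1 })

    // Wait until the write has replicated to a majority of members and been
    // journaled on the primary, or give up after 5 seconds.
    db.runCommand({ getLastError: 1, w: "majority", j: true, wtimeout: 5000 })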
69. Fire and Forget
[Diagram: driver sends write to primary; primary applies it in memory]
• Operations are applied in memory
• No waiting for persistence to disk
• MongoDB clients do not block waiting to confirm the operation completed
70. Wait for error
[Diagram: driver sends write then getLastError to primary; primary applies the write in memory]
• Operations are applied in memory
• No waiting for persistence to disk
• MongoDB clients do block waiting to confirm the operation completed
71. Wait for journal sync
[Diagram: driver sends write then getLastError with j:true; primary applies the write in memory and writes it to the journal]
• Operations are applied in memory
• Wait for persistence to journal
• MongoDB clients do block waiting to confirm the operation completed
72. Wait for fsync
[Diagram: driver sends write then getLastError with fsync:true; primary applies the write in memory, writes to the journal (if enabled), then fsyncs]
• Operations are applied in memory
• Wait for persistence to journal
• Wait for persistence to disk
• MongoDB clients do block waiting to confirm the operation completed
73. Wait for replication
[Diagram: driver sends write then getLastError with w:2; primary applies the write in memory and replicates it to a secondary]
• Operations are applied in memory
• No waiting for persistence to disk
• Waiting for replication to n nodes
• MongoDB clients do block waiting to confirm the operation completed
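The acknowledged variants above map to getLastError options roughly as follows (a sketch; values are illustrative):

    db.orders.insert({ item: "abc", qty: 1 })   // fire and forget if nothing follows

    db.runCommand({ getLastError: 1 })                        // wait for error
    db.runCommand({ getLastError: 1, j: true })               // wait for journal sync
    db.runCommand({ getLastError: 1, fsync: true })           // wait for fsync
    db.runCommand({ getLastError: 1, w: 2, wtimeout: 500 })   // wait for replication to 2 nodes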
74. Tagging
• Control over where data is written to
• Each member can have one or more tags:
  tags: {dc: "stockholm"}
  tags: {dc: "stockholm", ip: "192.168", rack: "row3-rk7"}
• Replica set defines rules for where data resides
• Rules defined in RS config... can change without changing application code
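A sketch of putting tags into the replica set config and defining a custom write-concern mode on top of them (the mode name MultipleDC, tag values, and hosts are all placeholders):

    var cfg = rs.conf()
    cfg.members[0].tags = { dc: "stockholm", rack: "row3-rk7" }
    cfg.members[1].tags = { dc: "stockholm", rack: "row3-rk8" }
    cfg.members[2].tags = { dc: "london",    rack: "row1-rk1" }

    // "MultipleDC" is satisfied once the write is on members in 2 distinct dc values.
    cfg.settings = { getLastErrorModes: { MultipleDC: { dc: 2 } } }
    rs.reconfig(cfg)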
77. Single Node
• Downtime inevitable
• If node crashes human intervention might be needed
• Should absolutely run with journaling to prevent data loss
78. Replica Set 1
[Diagram: replica set with an arbiter in a single datacenter]
• Single datacenter
• Single switch & power
• One node failure
• Automatic recovery of single node crash
• Points of failure:
  • Power
  • Network
  • Datacenter
79. Replica Set 2
[Diagram: replica set with an arbiter spread across multiple power/network zones in one datacenter]
• Single datacenter
• Multiple power/network zones
• Automatic recovery of single node crash
• w=2 not viable as losing 1 node means no writes
• Points of failure:
  • Datacenter
  • Two node failure
80. Replica Set 3
• Single datacenter
• Multiple power/network
zones
• Automatic recovery of
single node crash
• w=2 viable as 2/3 online
• Points of failure:
• Datacenter
• Two node failure
82. Replica Set 4
• Multi datacenter
• DR node for safety
• Can't do multi data
center durable write
safely since only 1 node
in distant DC
83. Replica Set 5
• Three data centers
• Can survive full data
center loss
• Can do w= { dc : 2 } to
guarantee write in 2
data centers
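Continuing the tagging sketch from earlier, a write guaranteed to reach two data centers could then be acknowledged with the custom mode (MultipleDC is the placeholder mode defined above):

    db.orders.insert({ item: "abc", qty: 1 })
    db.runCommand({ getLastError: 1, w: "MultipleDC", wtimeout: 5000 })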
84.
Set size | Use?    | Data Protection | High Availability | Notes
One      | X       | No              | No                | Must use --journal to protect against crashes
Two      |         | Yes             | No                | On loss of one member, surviving member is read only
Three    | Typical | Yes             | Yes - 1 failure   | On loss of one member, surviving two members can elect a new primary
Four     | X       | Yes             | Yes - 1 failure*  | * On loss of two members, surviving two members are read only
Five     |         | Yes             | Yes - 2 failures  | On loss of two members, surviving three members can elect a new primary
85. http://spf13.com
http://github.com/spf13
@spf13
Questions?
download at mongodb.org
We’re hiring!! Contact us at jobs@10gen.com