In this webinar Ger Hartnett, Director of Engineering, Technical Services, talked about what happened when a data centre outage caused chaos and uncovered some significant flaws in a disaster recovery plan. It was late on a Friday evening, 17TB of data was at risk, and there was uncertainty about the reliability of the backups. The Technical Services team had until Monday morning to get everything back to normal.
3. Before we start
●The main talk should take 30-35 minutes
●You can submit questions via the chat box
●We’ll answer as many as possible at the end
●We are recording and will send slides Friday
●The next webinar in the series will take place on Wednesday 23rd March
4. A quick poll - add a word to the chat to let me know your perspective
●You work in operations
●You work in development
●You have a MongoDB system in production
●You have contacted MongoDB Technical Services (support)
5. Stories
●We collect observations about common mistakes to share the experience of many
●Names have been changed to protect the (mostly) innocent
●No animals were harmed during the making of this presentation (but maybe some DBAs and engineers had light emotional scarring)
●While you might be new to MongoDB, we have deep experience that you can leverage
6. The Stories (part one today)
1. Discovering a DR flaw during a data centre outage
2. Complex documents, memory and an upgrade “surprise”
3. Wild success “uncovers” the wrong shard key
8. Story #1: Recovering from a disaster
●Prospect in the process of signing up for a subscription
●Called us late on a Friday: a data centre power outage had taken down 30+ servers (11 shards)
●When they started bringing up the first shard, the nodes crashed with data corruption
●17TB of data, very little free disk space, JOURNALLING DISABLED!
9. Recovery plan
●Multi-site team worked with the customer over the weekend to put a plan in place
●Stop everything
●Repair config servers with mongodump / mongorestore (sketched below)
●In each replica set
o Start secondary in read-only mode
o Mount NFS storage for repaired files
o Repair a former primary node
o Iterative rsync to seed a secondary
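As a rough illustration of the config-server step, a dump-and-restore repair could look like the lines below. The hostnames, ports, and paths are placeholders rather than details from the talk; mongodump and mongorestore are the standard MongoDB utilities.

  # Dump whatever is still readable from the damaged config server (placeholder host/port)
  mongodump --host cfg1.example.internal --port 27019 --out /backup/configdump
  # Restore that dump into a freshly started, empty config server
  mongorestore --host cfg1.example.internal --port 27019 /backup/configdump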
10. Recovering each shard
1. Start secondary read only
2. Mount NFS storage for repair
3. Repair former primary node
4. Iterative rsync to seed a secondary
(Diagram: replica set with one Primary and two Secondaries)
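A hedged sketch of what steps 2-4 might look like on one shard member; the mount point, data paths, and hostnames are illustrative, and --repair / --repairpath are options of the MMAPv1-era mongod the customer would have been running at the time.

  # 2. Mount NFS storage to give the repair enough working space (placeholder export and mount point)
  mount -t nfs nas.example.internal:/export/repair /mnt/repair
  # 3. Repair the former primary's data files, pointing the repair at the NFS mount
  #    because local free disk space was scarce
  mongod --dbpath /data/db --repair --repairpath /mnt/repair
  # 4. Seed a secondary by copying the repaired data files; repeat the rsync until a final
  #    pass (run with mongod stopped) has almost nothing left to transfer
  rsync -av /data/db/ secondary1.example.internal:/data/db/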
11. Implementing the plan
●Multiple calls checking progress at every step
●Config servers repaired
●Read-only shards started
●Repairing each shard primary while doing document count checks (some documents missing, 9k on one shard)
●Provided a method using “dump --repair” and diff to recover most of the 9k missing documents (sketched below)
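For context, the count checks and the salvage step could be approximated as below. The database and collection names are made up for illustration, and the --repair option of mongodump shown here is the MMAPv1-only variant available in MongoDB versions of that period.

  # Compare document counts on the repaired node against expected figures (names are placeholders)
  mongo shard1-primary.example.internal/app --eval 'db.orders.count()'
  # Salvage whatever the damaged files still hold
  mongodump --repair --out /backup/salvage
  # Diff the salvaged dump against the repaired data set to find the missing documents,
  # then re-insert them with mongorestore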
13. Aftermath and lessons learned
●Signed up for a subscription
●Enabled journalling (config fragment below)
●Added a second data centre with a replica set member in each
●Put together disaster recovery procedures and backups, and tested them
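For the journalling change specifically, a minimal mongod.conf fragment (the YAML configuration format available since MongoDB 2.6) would look like the following; the file path and dbPath are assumptions for illustration.

  # /etc/mongod.conf (fragment)
  storage:
    dbPath: /data/db
    journal:
      enabled: true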
14. Key takeaways for you
●If you are departing significantly from the standard configuration, check with us (e.g. if you think journalling is a bad idea)
●Use two data centres in different buildings, on different flood plains, and not in the path of the same storm (e.g. secondaries in AWS); a replica set sketch follows below
●DR/backups are useless if you haven’t tested them
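One hedged way to express the two-data-centre advice for a single shard is a replica set spanning both sites, as in the sketch below; the set name and hostnames are purely illustrative and not taken from the talk.

  # Initiate a three-member replica set with its third member in the second data centre
  # (hostnames are placeholders)
  mongo --host dc1-node1.example.internal --eval '
    rs.initiate({
      _id: "shard1",
      members: [
        { _id: 0, host: "dc1-node1.example.internal:27017" },
        { _id: 1, host: "dc1-node2.example.internal:27017" },
        { _id: 2, host: "dc2-node1.example.internal:27017" }
      ]
    })'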
17. Questions
●You can submit questions via the chat box
●I’ll read out the question before answering
●We are recording and will send slides Friday
●Part 2: the next webinar will take place on Wednesday 23rd March
Some stories are borrowed, some merged into a single narrative.
Some of the people who inspired them may well be here in this room today.
Bill's Bulk Updates randomly affected an ever-larger data set.
To cope with the database size, Bill added more shards.
The cluster scaled linearly, as intended.