Pythian is a global data and infrastructure management company. We employ more than 250 DBAs and we support more than 6,000 databases. If you need help – call us!
BAAG’s main idea is to eliminate guesswork from the decision-making process. This is especially important in my work: if I’m guessing, I might be wrong; if I base decisions on facts, the chances of being right are higher. And for the scope of this presentation, it’s also important to make sure your RMAN scripts are written so that they minimize guesswork during restore situations.
If backups are successful and a test recovery succeeded, does it mean everything is OK? No.
Test often.
Document important information (where are the backups stored?). You say, “I have just a few databases, I know where they are” – but what if you have 10? What if you have 20?
In the best scenario you might lose time looking for backups and the information needed for recovery. And nothing can be worse than the CEO watching over your shoulder while you browse the filesystem looking for backups.
Be sceptical. Prepare for the worst. Hope for the best.
Let’s start with a few general thoughts to prepare your brain for this discussion about backup scripts.
A common misconception is that modern disk arrays are extremely reliable. It’s not true – there are still failure scenarios where all data on the array is lost simultaneously. The most typical example, besides physical damage, is a firmware update. So even on the fanciest disk array you don’t really have backups unless you store a copy of the backup elsewhere.
This somewhat extends the previous thought… Imagine you have a database where you take archived log backups every 4 hours. You back up each archived log twice, that is, you delete it only after it’s included in 2 archived log backups. What happens if the same tape cartridge is used for 2 consecutive backups and it’s damaged? You lose an archived log.
State risks clearly. Do it often. “I can’t recover this database if it breaks.”
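The “two copies of every archived log” policy above can be expressed directly in RMAN instead of being approximated with time-based deletes. A minimal sketch, assuming an `sbt` (tape) channel is already configured:

```
-- Back up only those archived logs that don't yet have 2 backups on tape
BACKUP DEVICE TYPE sbt ARCHIVELOG ALL NOT BACKED UP 2 TIMES;

-- Delete only archived logs already included in 2 tape backups
DELETE ARCHIVELOG ALL BACKED UP 2 TIMES TO DEVICE TYPE sbt;
```

With this, a log that missed a backup simply stays on disk until it reaches 2 copies, instead of being silently aged out.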
This is a typical output log created by the “backup database” command. Do you see any issues? Shout if you know what’s wrong!
Keep in mind the statement below: “Prepare all you may need for smooth recovery while working on backup procedures.” Does it provide enough information for most situations?
Is this better?
/// You see the exact command used to take the backup – this immediately gives you a lot of information about the data included in this backup.
/// You see the exact timestamps and are able to understand the oldest point in time you can recover to using this backup. It also gives you a hint about how much archived redo you have to apply on top of the backup.
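One simple way to get the exact commands into the log is RMAN’s echo setting. A minimal sketch (the backup command itself is just an illustration):

```
-- Echo every command into the output log, so the log shows exactly
-- what was run, not just the results
SET ECHO ON;
BACKUP DATABASE PLUS ARCHIVELOG;
```

Combined with `log=` on the `rman` invocation, this preserves both the command and its timestamps for whoever has to restore later.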
date +%s – seconds since the 1st of January, 1970.
Why do we need to know how long the backups take? It helps in planning the backup schedules. Look at the history and check the elapsed time trend.
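The timing wrapper described here can be sketched in a few lines of shell. The `sleep` is a stand-in for the real RMAN invocation, which is an assumption, not part of the original script:

```shell
# Record wall-clock duration of the backup run using date +%s
start=$(date +%s)
sleep 2   # placeholder for: rman target / cmdfile=backup_db.rman log=backup_db.log
end=$(date +%s)
elapsed=$(( end - start ))
echo "Backup elapsed seconds: ${elapsed}"
```

Logging `elapsed` on every run gives you the history needed to spot the trend as the database grows.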
Another example of a script I’ve seen… /explain the script/ Does anyone think it’s a good script?
Don’t use CROSSCHECK in backup scripts.
What if an archived log “disappears” before it’s backed up? My colleague recently encountered a situation where an archived log “disappeared” while it was being archived, during a FS resize operation. We didn’t crosscheck inside the RMAN script, so we were alerted immediately after the backup failed, and we solved the issue by running an incremental backup ASAP.
CROSSCHECK has to be a manual activity, executed by DBAs to resolve issues. If you run CROSSCHECK and DELETE EXPIRED within the script, you lose the archived logs and don’t even find out about it.
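As a manual, interactive activity, the crosscheck sequence looks like this – run by a DBA after investigating, never inside the scheduled script:

```
CROSSCHECK ARCHIVELOG ALL;
LIST EXPIRED ARCHIVELOG ALL;   -- review what went missing first
DELETE EXPIRED ARCHIVELOG ALL; -- only after the gap is understood and addressed
```

The point is the human step between LIST and DELETE: the script version skips it and destroys the evidence.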
/ explain the script / OK… You see “include current controlfile” is in red – there must be something wrong with it. And that’s correct: we are making the controlfile backup outdated immediately, because it’s taken before the rest of the backup completes and so doesn’t record it.
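A minimal sketch of one way to avoid the stale controlfile: back it up explicitly as the last step, so it records the datafile and archived log backups that preceded it:

```
RUN {
  BACKUP DATABASE;
  BACKUP ARCHIVELOG ALL;
  BACKUP CURRENT CONTROLFILE;  -- taken last, so it knows about the backups above
}
```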
Now you must be thinking, “Come on – we all use controlfile autobackups,” and we should.
Do not rely on a single backup. Always have a plan B.
Talk about these options:
REDUNDANCY 1, and REDUNDANCY 1 + 2 copies – data files are read once even if there are 2 copies, so if a memory corruption occurs during the backup, you might take a corrupted backup (in both copies).
REDUNDANCY X – not good, as you never know the recovery window (i.e. someone might take a one-off backup before maintenance).
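The baseline discussed here is the stored retention policy; the value 2 below reflects the “always have a plan B” recommendation, not a universal setting:

```
-- Keep the two most recent backups of every datafile and controlfile
CONFIGURE RETENTION POLICY TO REDUNDANCY 2;
```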
Deleting archived logs based on time only is dangerous. How do you make sure the archived logs have been backed up? As was previously explained, we should also check the number of backups taken for each archived log before deleting it. And a deletion policy should be set to APPLIED ON STANDBY if a standby database is used.
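Both safeguards can be enforced through the archived log deletion policy. A sketch, assuming an `sbt` tape device; pick the clause that matches your environment:

```
-- Without a standby: only allow deletion once a log has 2 tape backups
CONFIGURE ARCHIVELOG DELETION POLICY TO BACKED UP 2 TIMES TO DEVICE TYPE sbt;

-- With a standby: only allow deletion once the log is applied on the standby
CONFIGURE ARCHIVELOG DELETION POLICY TO APPLIED ON ALL STANDBY;
```

Once set, even a manual `DELETE ARCHIVELOG` respects the policy instead of trusting timestamps.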
It’s important how you start RMAN from your backup scripts. If the target database and catalog database connection information is passed as parameters when RMAN is started, unavailability of the catalog database prevents RMAN from starting at all, and you will not take the backup.
Better approach….
Another method to accomplish the same thing is to take the backup without connecting to the RMAN catalog. Then, after the backup completes, resync the catalog.
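A sketch of this split, assuming the second session is run later (manually or from a separate job) connected to both the target and the catalog:

```
-- Nightly backup session: target only, no catalog dependency
BACKUP DATABASE PLUS ARCHIVELOG;

-- Later, in a session connected to both target and catalog:
RESYNC CATALOG;
```

If the catalog host is down, the backup still runs; only the resync waits.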
There are a number of reasons why RMAN stored settings might change: some will be valid, planned changes to the backup policy; other changes might be temporary fixes or workarounds for a specific purpose. If not all DBAs are 100% sure of every specific backup configuration, situations where some stored settings are accidentally changed can happen. Here are a few examples:
1. A DBA temporarily reduces the retention settings to free additional room for archived log backups because of unusual peak activity in the database.
2. Parallelism settings might be temporarily changed to take a one-off backup.
If you don’t have a catalog database but use tapes for backups, looking up the controlfile autobackup can take a long time.
Document the settings before executing your scripts. I find it’s much better for the stability of backup scripts if the required settings are hardcoded in the backup script. But to do that, we probably need a solution to save and restore the settings that were present when the backup script started. / Explain the implementation /
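A minimal sketch of the idea (the specific CONFIGURE values are illustrative, not the talk’s actual settings):

```
-- Record the current stored configuration in the log before touching anything,
-- so the pre-run state is documented and can be restored by hand if needed
SHOW ALL;

-- Hardcode the settings this script depends on, instead of trusting
-- whatever was left in the stored configuration
CONFIGURE RETENTION POLICY TO REDUNDANCY 2;
CONFIGURE CONTROLFILE AUTOBACKUP ON;

BACKUP DATABASE PLUS ARCHIVELOG;
```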
Backup validation and alerts – I’ve often seen this done wrong, or done incompletely. So how do you monitor the backup jobs? Here are a few examples I’ve seen that are not very thorough:
- We don’t report at all.
- A DBA logs on to the server and checks logs sometimes.
- Backup logs are sent to a shared email address (good!).
- The DBA on duty checks emails (what if no one is available, or no email is received?).
- We check the RMAN command’s error code ($?) and send an email.
The email-based approach, where the DBA is supposed to check alerts in the morning, is surprisingly popular, but it is definitely not good enough!
I would suggest a more thorough analysis… / walk through the slide /
-- ALERT the on-call DBA about any failure immediately. Full backups are usually very resource-consuming – do you want to re-run the full backup in the morning, after you find out it failed?
-- ALERT about long-running backups. Why? Because if the backup runs too long it threatens to impact the end-user experience. Understand why backups take longer and take action to avoid impacting users. Plan the thresholds so that you have enough time for action – i.e. if the time is slowly increasing as the database grows, give yourself enough time to be able to tune the backup.
Additionally, the notifications alone are not enough! You have to make sure your database is safely backed up based on the business requirements. Implement another check, probably even running on a different server (not on the same DB server), to verify that:
- All datafiles have been backed up in the last 24 hours.
- We have enough backups to satisfy the retention settings.
- No datafiles contain unrecoverable operations.
This is extremely important, because these checks will let you know if the backups silently stop running – i.e. on Security-Enhanced Linux, cron stops working when the password for the OS user expires!
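The first check can be done with a plain query against the target’s dynamic views. A sketch, assuming the monitoring server can query the target (the 24-hour window is illustrative):

```sql
-- Datafiles with no backup recorded in the last 24 hours
SELECT d.file#, d.name
FROM   v$datafile d
WHERE  NOT EXISTS (
         SELECT 1
         FROM   v$backup_datafile b
         WHERE  b.file# = d.file#
         AND    b.completion_time > SYSDATE - 1
       );
```

Any rows returned mean the backup job failed, or silently never ran, regardless of whether an alert email went out.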
What happens if you have REDUNDANCY 1? The previous backup and archived logs are removed immediately after the current backup completes. When the tape backup runs it picks up 1 full backup plus half a day of archived logs, so in the end the backups on tape will contain a full backup for each day, some archived logs after each full backup, and then a gap before the next backup.
REDUNDANCY 2 resolves the problem and ensures a continuous redo stream on tape, as it will always keep all archived logs between the last two backups. REDUNDANCY 2 requires space for 3 backups.
You should remove files from disk based on the disk retention.
You should remove obsolete files’ information from the RMAN repository based on the tape retention.
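A sketch of keeping the two retentions separate; the 7-day window and `sbt` device are assumptions for illustration:

```
-- Disk copies: delete per the (shorter) disk retention
DELETE NOPROMPT OBSOLETE RECOVERY WINDOW OF 7 DAYS DEVICE TYPE DISK;

-- Tape: once the media manager has expired old tapes, remove their
-- now-stale records from the RMAN repository
CROSSCHECK BACKUP DEVICE TYPE sbt;
DELETE NOPROMPT EXPIRED BACKUP DEVICE TYPE sbt;
```

Note this tape-side crosscheck is the deliberate, maintenance-time kind, as discussed earlier – not something buried in the nightly backup script.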