In the past, Harry Splinter, Database Reliability Engineer at OptimaData, wrote about SAP ASE backups under the title "Thoughts on backups," in which he also paid attention to recovery. During his career as a DBA, he encountered quite a few situations in which recovery was the last resort. This sometimes led to unpleasant surprises. In this blog he shares his experiences. To learn from, but also to amuse.
Everyone is familiar with changing tapes and moving them to a secure off-site location or a vault. Restoring from tape is no different from reading data back from disk. What matters, of course, is that the data can actually be read back. Even if the backup software reports that a backup was successful, a tape unit may be "out of sync": the write head is not properly aligned. It sounds complicated, but the result is simply that the backups are unreadable. Fortunately, this shortcoming can be fixed with the help of an engineer, but all tapes have to be re-read and re-written. It happened to me once, and it was a costly, week-long job to restore everything. In this case, by coincidence, we had copied a backup of this "not production critical" database to a test environment the day before.
Database corruption is also not always detected by the backup software. It is important to run DBCC checks or other checks regularly so that corruption is recognized and repaired in time. Cross-database queries are another cause of inconsistency. Point-in-time recovery, a technique that restores a database to a specific moment in the past, is then virtually impossible. If you don't take this into account, recovering a single database is not sufficient: all related databases must be restored, and data loss can still occur. Backups are often performed serially in a database management system. This also affects the recovery time and the point in time you choose when other systems depend on the data.
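To make the effect of serial backups concrete, here is a minimal sketch in Python. It assumes that each backup reflects the state of its database at roughly the moment the backup finished, and it uses made-up database names and times. The spread between the earliest and latest finish time is then the window in which related databases may disagree with each other after a restore.

```python
from datetime import datetime, timedelta


def consistency_gap(finish_times: dict[str, datetime]) -> timedelta:
    """Return the spread between the earliest and latest backup finish time."""
    if not finish_times:
        raise ValueError("no backups given")
    return max(finish_times.values()) - min(finish_times.values())


# Hypothetical finish times of three related databases, backed up one after another.
backups = {
    "sales": datetime(2024, 1, 8, 1, 10),
    "billing": datetime(2024, 1, 8, 1, 45),
    "reporting": datetime(2024, 1, 8, 2, 30),
}

print(consistency_gap(backups))  # prints 1:20:00
```

The larger this gap, the more it matters which point in time you pick for the related databases and the systems that depend on them.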
Performing a backup over the network, to disk or to other media can also go wrong. The causes vary: data lost through network failures (dropped data packets), disk faults or memory leaks.
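One way to catch this kind of silent damage is to compare a checksum of the backup file at the source with a checksum of the copy at the destination. The sketch below uses Python's standard hashlib module; the function names are my own and the paths you pass in are up to you. Note that a matching checksum only shows that the copy is identical to the source, not that the backup itself can be restored.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 of a file, reading it in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source_path: str, copy_path: str) -> bool:
    """Return True when both files have the same SHA-256 checksum."""
    return file_sha256(source_path) == file_sha256(copy_path)
```

For a backup written over the network, you would compute the checksum on the database host and again on the backup target, and compare the two values.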
The human factor
Deleting data, dropping a table or a database, or some other destructive operation happens more often than we would like. Often the cause is the lack of an access and permissions policy, for which we ourselves are partly responsible. Recovery is the lifesaver here too, but it is often a bit more complex when the entire database may not be restored in place and the lost data has to come from a copy restored elsewhere. Sufficient space within the environment or on the machine is therefore critical. Building a recovery environment and keeping it on hand can greatly speed up the turnaround time of the recovery process.
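Whether there is room for such a restore is easy to check ahead of time. The following Python sketch uses the standard shutil.disk_usage call; the function names and the 20% headroom are assumptions for illustration, not a recommendation for any specific system.

```python
import shutil


def enough_space(free_bytes: int, required_bytes: int, headroom: float = 1.2) -> bool:
    """Return True when free space covers the restore size plus some headroom."""
    return free_bytes >= required_bytes * headroom


def can_restore_to(target_dir: str, required_bytes: int, headroom: float = 1.2) -> bool:
    """Check the free space on the filesystem that holds target_dir."""
    free_bytes = shutil.disk_usage(target_dir).free
    return enough_space(free_bytes, required_bytes, headroom)
```

A call such as can_restore_to("/restore", 500 * 1024**3) would then tell you whether a 500 GiB database fits on a (hypothetical) /restore filesystem with 20% to spare.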
"It's on the company wiki" is an often-heard answer when we ask where the recovery plan is. Putting it into practice often leaves much to be desired. Under the pressure of day-to-day work, this task is quickly postponed. Moreover, there is often only one employee who is familiar with the contents of the plan and practiced in carrying it out, and it is not a job for someone who has no experience with it. In today's society, data availability is a high priority and a lot is invested in it (think of terms like Always On, cluster software, monitoring, 24/7 support, 99.999% uptime). Recovery is almost always an item on the agenda as well, but with so much attention going to availability it does not always get the attention it should.
Peace, Regularity and Recovery
Peace of mind for the DBA comes with the regularity with which recovery plans are practiced. Spreading the knowledge and keeping track of it gives confidence. Don't lean on one person; make sure several employees know how to execute the plan and have practiced it regularly. In smaller organizations this can also be a delegated responsibility. In this regard, a good procedural description of the recovery plan is very important. Recovery should be as simple as backing up.
Monday night, five past twelve. The phone rings:
"Good evening, this is your colleague. In one hour we are gathering for a disaster recovery. The disk array has crashed and we have to fall back on backups. The database servers are no longer accessible."
Colleague: "No problem, I'm coming. I know from experience that the recovery process will take about three hours."
What would be your response?
Restful sleep thanks to a HealthCheck
To get a quick picture of the state of your backups and the efficiency of your recovery process, and to investigate where improvements can be made, OptimaData's database HealthCheck is a good starting point. All the checkpoints I mention here are part of it as standard, and the HealthCheck includes much more than just backup and recovery. If you run it regularly, you can sleep easy. And if you do get a wake-up call, at least you'll have your answer ready.