
The Linux Philosophy for SysAdmins, Tenet 19—Backup everything frequently
Author’s note: This article is excerpted in part from chapter 21 of my book, The Linux Philosophy for SysAdmins, with some changes to update the information in it and to better fit this format.
Nothing can ever go wrong with my computer and I will never lose my data. If you believe that, I have a bridge to sell you. No, this isn't an April Fools' joke.
Data loss
Without going into detail about my own stupidity, here are a few reasons why we may lose data at inopportune times. Of course there is no opportune time to lose data.
Self-inflicted data loss comes in many forms. The most common form is the erasure of one or more important files or directories.
Sometimes erasing needed files is accidental: I erase a bunch of old files in a directory, and it turns out later that one or two are still needed. More often, for me at least, I actually look at the files and decide they are no longer needed, only to find a day, or two, or a week after I delete them that I still need at least some of them. I have also made significant changes to a file and saved it, only to discover later that I made changes, and especially deletions, that I should not have.
Clearly it is necessary to pay attention when deleting files or making changes to them. That still won’t keep us from deleting data we may need later.
Power failures can occur for many reasons, and momentary power failures can shut down a computer just as irrevocably as longer ones. Regardless of the reason for the power failure, there is the danger of losing data, especially from documents that have not been saved. Modern hard drives and filesystems employ strategies that help to minimize the probability of data loss, but it still happens.
I have had my share of power failures. Back before modern journaling filesystems like EXT3 and EXT4, I did experience some serious data loss. One way to help prevent data loss due to power failures is to invest in uninterruptible power supplies (UPSs) that maintain power on the hosts long enough to perform a shutdown, either manually or triggered by the power failure itself.
Electromagnetic interference (EMI) is electromagnetic radiation of various types from many different sources. This radiation can interfere with the correct operation of any electronic device, including computers.
When I worked for IBM in their PC customer support center in Atlanta, Georgia, our first office was about a mile from, and directly on the centerline of, the Dobbins Air Force Base runway. Military aircraft of all types flew in and out 24 hours a day. There were times when the high-powered military radars would cause multiple systems to crash at the same time. It was just a fact of life in that environment.
Lightning, static electricity, microwaves, old CRT displays, radio-frequency bursts on a ground line: all of these and more can cause problems. Good grounding can reduce the effects of all of these types of EMI, but it does not make our computers completely immune to them.
Hard drive failures also cause data loss. The most common failures in today's computers occur in devices that have moving mechanical components; cooling fans lead the frequency list, and hard drives are a close second. Modern hard drives have SMART capabilities that enable predictive failure analysis, and Linux can monitor these drives and send an email to root indicating that failure is imminent. Do not ignore those emails, because replacing a hard drive before it fails is far less trouble than replacing one after it fails and then hoping the backups are up to date.
Disgruntled employees can maliciously destroy data. Proper security procedures can mitigate this type of threat, but backups are still handy.
Theft is also a way to lose data. Soon after we moved to Raleigh, NC, in 1993, there was a series of articles in the local paper and on TV covering the tribulations of a scientist at one of our better-known universities. This scientist kept all of his data on a single computer. He did have a backup, to another hard drive on that same computer. When the computer was stolen from his office, all of his experimental data went missing as well, and it was never recovered.
This is one very good reason to keep backups separate from the host being backed up.
Natural disasters occur. Fire, flood, hurricanes, tornadoes, mud slides, tsunamis, and many other kinds of disasters can destroy computers and locally stored backups as well. I can guarantee that, even if I have a good backup, I will never take the time during a fire, tornado, or other natural disaster that places me in imminent danger to save the backups.
Malware is software that can be used for various malicious purposes including destroying or deleting your data.
Ransomware is a specific form of malware that encrypts your data and holds it for ransom. If you pay the ransom you may get the key that will allow you to decrypt your data – if you are lucky.
So as you can see, there are many ways to lose your data. My intent with this list of possible ways in which data can be damaged or lost is to scare you into doing backups. Did it work?
My problem
A few years ago, I encountered a problem in the form of a hard drive crash that destroyed the data in my home directory. I had been expecting this for some time so it came as no surprise.
The first indication I had that something was wrong was a series of emails from the S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) enabled hard drive on which my home directory resided. Each of these emails indicated that one or more sectors had become defective and that the defective sectors had been taken off-line and reserved sectors allocated in their place. This is normal operation; hard drives are intentionally designed with reserved sectors for just this reason.
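Those emails come from the smartd daemon, which is part of the smartmontools package. As a minimal sketch, assuming your distribution keeps the configuration in /etc/smartd.conf (some put it in /etc/smartmontools/smartd.conf), a single DEVICESCAN directive is enough to have smartd watch all drives and mail root when it detects trouble, once the service is enabled:

DEVICESCAN -H -l error -l selftest -m root

# systemctl enable --now smartd

On some distributions the service is named smartmontools rather than smartd, so check before enabling it.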
I put my curiosity to use when these error messages started arriving in my email inbox, in the months before the final failure. I first used the smartctl command to view the internal statistics for the hard drive in question. The original, defective hard drive has since been replaced, but, yes, I keep some old, defective devices for teachable moments like this. I installed the damaged drive in my docking station to demonstrate what the results from a defective hard drive look like, using the following command.
# smartctl -x /dev/sdi | less
SMART reports can be a bit confusing. The web page, “Understanding SMART Reports,” can help somewhat with that. Wikipedia also has an interesting page on this technology. I recommend reading those documents before attempting to interpret the SMART results.
The SMART data provides basic information about the hard drive's capabilities and attributes, such as brand, model, and serial number. This is interesting and good information to have.
The smartctl command then displays the raw data accumulated in the hardware registers on the drive. The raw values are not particularly helpful for some of the error rates; the normalized "Value" column is usually more useful. Read the referenced web pages to understand a bit about why. In general, a number like 100 in the Value column means 100% good, while a low number like 001 means the drive is close to failure, with roughly 99% of its useful life used up. It is really very strange.
The command also lists errors and information about them when they occur. This is the most helpful part of the output. I do not try to analyze every error; I simply look to see if there are multiple errors.
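To look at just the error log rather than paging through the full report, smartctl can display that section by itself. This assumes the same test device, /dev/sdi, in the docking station.

# smartctl -l error /dev/sdi | less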
The point of this is that I could see that the drive was going to fail sooner rather than later.
I decided I would wait to see what else occurred before I replaced the hard drive — you know — just to satisfy my curiosity about what would eventually happen. The failure numbers were not as bad in the beginning. The error count rose to 1350 at the time of the catastrophic failure.
Some testing of over 67,800 SMART drives by a cloud company named Backblaze provides some statistically based insight into failure rates of hard drives that experienced various numbers of reported errors. This web page is the first I have found that demonstrates a statistically relevant correlation between reported SMART errors and actual failure rates. Their web page also helped improve my understanding of the five SMART attributes that they found should be closely monitored.
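As a quick way to check only those attributes, something like the following works. The attribute IDs here, 5 (Reallocated Sectors Count), 187 (Reported Uncorrectable Errors), 188 (Command Timeout), 197 (Current Pending Sector Count), and 198 (Offline Uncorrectable), are the five I understand Backblaze to highlight; verify them against their article, and substitute your own device name for my test drive.

# smartctl -A /dev/sdi | awk '$1==5 || $1==187 || $1==188 || $1==197 || $1==198'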
In my opinion, the bottom line of the Backblaze analysis is that hard drives should be replaced as soon as possible after they begin to report errors in any of the five statistics they recommend monitoring. My experience seems to confirm that. My drive failed within a couple of months of the first indications that there was a problem. The number of errors my drive experienced before it failed beyond recovery was very high, and I was very lucky to have been able to recover from several errors that caused the /home filesystem to switch to read-only (ro) mode, which is what Linux does when it determines that a filesystem is unstable and cannot be trusted.
Recovery
So that was all just a long way to say that the drive containing my home directory failed catastrophically. Recovery was straightforward if a bit time-consuming.
I turned off the computer, removed the defective 320GB SATA drive, replaced it with a new 1TB SATA drive because I wanted to use the extra space for other storage later, and turned the computer back on. I created a physical volume (PV) that takes up all of the space on the drive, and then a volume group (VG) that fills the PV. I used 250GB of that space for a logical volume (LV) that was to become the /home filesystem. I then created an EXT4 filesystem on the logical volume and used the e2label command to give it the label "home," because I mount filesystems using labels. At that point the replacement drive was ready, so I mounted it on /home.
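For anyone who wants a sketch of that sequence, the commands look something like the following. The device name, volume group name, and logical volume name are placeholders, so substitute the ones that fit your system.

# pvcreate /dev/sdb
# vgcreate vg01 /dev/sdb
# lvcreate -L 250G -n home vg01
# mkfs.ext4 /dev/vg01/home
# e2label /dev/vg01/home home
# mount LABEL=home /home

A matching entry in /etc/fstab that mounts the filesystem by LABEL=home makes the new volume mount automatically at boot.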
As a result of the method I use to create my backups, it is only necessary for me to use a simple copy command to restore the entire home directory to the newly installed drive.
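Because my backups are rsync-based directory trees, each backup is an ordinary, browsable copy of the filesystem, and a full restore is just a recursive copy back into place. As an illustration only, assuming a backup drive mounted at /media/Backups with the most recent backup in a directory named latest (the paths here are hypothetical, not the ones my own backup scripts use):

# rsync -aH /media/Backups/latest/home/ /home/

The -a option preserves ownership, permissions, and timestamps, and -H preserves hard links.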
After restoring the data to my /home directory I logged in using my non-privileged user ID and checked things out. Everything worked as expected and all of my data had been restored correctly.
Recovery testing
No backup regimen would be complete without testing. You should regularly test recovery of random files or entire directory structures to ensure not only that the backups are working, but that the data in the backups can be recovered for use after a disaster. I have seen too many instances where a backup could not be restored for one reason or another and valuable data was lost because the lack of testing prevented discovery of the problem.
Just select a file or directory to test and restore it to a test location such as /tmp so that you won't overwrite a file that may have been updated since the backup was performed. Verify that the files' contents are as you expect them to be. Restoring files from a backup made with rsync is simply a matter of finding the file you want in the backup and then copying it to the location where you want to restore it.
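A test restore of a single file might look like this, again with hypothetical paths and file names; the diff at the end confirms whether the restored copy matches the current version, or shows what has changed since the backup was made.

$ cp /media/Backups/latest/home/myuser/Documents/notes.txt /tmp/
$ diff /tmp/notes.txt ~/Documents/notes.txt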
I have had a few circumstances where I have had to restore individual files and, occasionally, a complete directory structure. I have had to restore the entire contents of a hard drive on a couple of occasions, as I discussed earlier in this article. Most of the time this has been self-inflicted, when I accidentally deleted a file or directory. At least a few times it has been due to a crashed hard drive. So those backups do come in handy.
Summary
Backups are an incredibly important part of our jobs as SysAdmins and individual users. I have experienced many instances where backups have enabled rapid operational recovery for places I have worked as well as for my own business and personal data.
There are many options for performing and maintaining data backups. I do what works for me and have never had a situation where I lost more than a few hours' worth of data.
Like everything else, backups are all about what you need. Whatever you do – do something! Figure out how much pain you would have if you lost everything – data, computers, hard copy records – everything. The pain includes the cost of replacing the hardware and the cost of the time required to restore data that was backed up and to recover data that was not backed up. Then plan and implement your backup systems and procedures accordingly.