The Linux Philosophy for SysAdmins, Tenet 09 — Test Early, Test Often
Author’s note: This article is excerpted in part from chapter 11 of my book, The Linux Philosophy for SysAdmins, with some changes to update the information in it and to better fit this format.
You know, it’s as easy to forget to write about testing the programs I write as it is to overlook testing the programs themselves. Why is that?
I wish I had a definitive answer. In some ways it is like documentation. Once the program seems to work, we just want to get on with whatever task caused us to write the program in the first place.
“There is always one more bug.”
— Lubarsky’s Law of Cybernetic Entomology
Lubarsky — whoever they might be — is correct. We can never find all of the bugs in our code. For every one I find there always seems to be another that crops up, usually at a very inopportune time.
In my article, Always use shell scripts, I started to talk about testing and the process I use for testing. This article covers testing in more detail. You will learn about how testing affects the ultimate outcome of the many tasks SysAdmins do. You’ll also learn that testing is an integral part of the Linux Philosophy for SysAdmins.
However, testing is not just about programs. It is also about verifying that the problems we are supposed to have resolved – whether caused by hardware, software, or the seemingly endless ways that users can find to break things – actually have been resolved. These problems can be with application or utility software we wrote, system software, applications, and hardware. Just as importantly, testing is also about ensuring that the code is easy to use and the interface makes sense to the user.
Procedures
One of the jobs I had in a previous life was as a tester for Linux-based appliances at Cisco. I developed test plans, wrote Tcl/Expect code to implement them, and helped trace the root causes of failures. I enjoyed that job and learned a lot from it.
Following a well-defined procedure when writing and testing shell scripts can contribute to consistent and high quality results. My procedures are simple.
- Create the test plan, at least a simple one.
- Start testing right at the beginning of development.
- Perform a final test when the code is complete.
- Move to production and test more.
Create a test plan
Testing is hard work and it requires a well designed test plan based on the requirements statements. Regardless of the circumstances, start with a test plan. Even a very basic test plan provides some assurance that testing will be consistent and will cover the required functionality of the code.
Any good plan includes tests to verify that the code does everything it is supposed to. That is, if you enter X and click on button Y, you should get Z as the result. So you write a test that creates those conditions and then verifies that Z is the result.
The best plans include tests to determine how well the code fails. I found this out the hard way back when I got my first IBM PC in 1982.
The PC had just been announced in August of 1981 and employee purchases were not begun until early 1982. There were not a lot of programs out there, especially for kids. I wanted to introduce my son to the PC but could find nothing appropriate so I wrote a little program in BASIC that I thought he would enjoy. Frankly, I don’t even remember what it was supposed to do.
I tested that program every way I could think of. It did everything it was supposed to do. Then I turned the computer over to my son and walked out of the room. I had not gone very far when he yelled, “Dad! Is it supposed to do this?” It wasn’t. I asked him what he did and he described some very strange set of keystrokes and I said, “You are not supposed to do that,” and immediately realized how silly that would have sounded to him.
My problem was that I had not tested how the program would react to unexpected input. That does seem to be a common problem with programs of all kinds. But I never forgot that particular lesson. As a result I always try to include code that tests for unexpected input and then I test to ensure that the program detects it and fails gracefully.
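For example, here is a minimal, hypothetical sketch of the kind of check I mean. The argument name, the usage message, and the task itself are purely illustrative; the point is that unexpected input produces a clear error message and a non-zero exit code instead of strange behavior.

#!/usr/bin/env bash
# Hypothetical example: verify that the single argument is a positive
# integer before using it, and fail gracefully if it is not.

Count="$1"

if [[ ! "$Count" =~ ^[0-9]+$ ]]
then
   echo "ERROR: '$Count' is not a positive integer." >&2
   echo "Usage: $(basename "$0") <count>" >&2
   exit 1
fi

# Only reached with sane input.
for (( i = 1; i <= Count; i++ ))
do
   echo "$i"
done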
There are lots of different formats for test plans. I have worked with the full range from having it all in my head, to a few notes jotted down on a sheet of paper, to a complex set of forms that required a full description of each test, which functional code it would test, what the test would accomplish, and what the inputs and results should be.
Speaking as a SysAdmin who is also a tester, I try to take the middle ground. Having at least a short, written test plan will ensure consistency from one test run to the next. How much detail you need depends upon how formal your development and test procedures are.
Test Plan Content
All of the sample test plan documents I found using Google were complex and intended for large organizations with a very formal development and test process. Although those test plans would be good for people with “Test” in their job title, they do not apply well to System Administrators and our more chaotic and time-constrained working conditions. As in most other aspects of our jobs, we need to be creative. So here is a short list of things you should consider including in your test plan. Modify it to suit your needs.
- The name and a short description of the software being tested.
- A description of the software features to be tested.
- The starting conditions for each test.
- The procedures to follow for each test.
- A description of the desired outcome for each test.
- Include specific tests designed to test for negative outcomes.
- Tests for how the program handles unexpected input.
- A clear description of what constitutes pass or fail for each test.
- Fuzzy testing, which will be described below.
This short list should give you some ideas for creating your own test plans. For most SysAdmins this should be kept simple and fairly informal.
Start testing at the beginning
I always start testing my shell scripts as soon as I complete the first portion that is executable. This is true whether I am writing a short command line program or a script that is an executable file.
I usually start creating new programs with the shell script template that you had an opportunity to explore in my article, Always use shell scripts. I write the code for the Help procedure and test it. This is usually a trivial part of the process but it helps me get started and ensures that things in the template are working properly at the outset. At this point it is easy to fix problems with the template portions of the script, or to modify it to meet specific needs that the standard template cannot.
When the template and Help procedure are working, I move on to creating the body of the program by adding comments to document the programming steps required to meet the program specifications. After that I start adding code to meet the requirements stated in each comment. This code will probably require adding variables which are initialized in that section of the template — which is now becoming our shell script.
This is where testing is more than just entering data and verifying the results. It takes a bit of extra work. Sometimes I add a command that simply prints the intermediate result of the code I just wrote and verify that. Other times, for more complex scripts, I add a -t option for “test mode.” In this case the internal test code is only executed when the -t option is entered at the command line.
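A minimal sketch of what such a test-mode option can look like follows. The option letter, variable name, and messages here are my assumptions for illustration; they are not lifted from any particular script.

#!/usr/bin/env bash
# Hypothetical sketch of a -t "test mode" option. Intermediate results are
# printed only when the option is given on the command line.

TestMode=0

while getopts "th" option
do
   case $option in
      t) TestMode=1 ;;
      h) echo "Usage: $(basename "$0") [-t] [-h]" ; exit 0 ;;
      *) echo "ERROR: Invalid option." >&2 ; exit 1 ;;
   esac
done

# ... functional code produces an intermediate result ...
Result="some intermediate value"

# Print the intermediate result only in test mode.
if (( TestMode ))
then
   echo "TEST: intermediate result is: $Result"
fi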
Final testing
After the code is complete I go back through a complete test of all the features and functions using known inputs to produce specific outputs. I also test for some random inputs to see if the program can handle unexpected input now that it is complete.
Final testing is intended to verify that the program is functioning essentially as intended now that it is complete. A large part of the final test is to ensure that functions that worked earlier in the development cycle have not been broken by code added or changed later in the cycle.
If you have been testing the script as you added new code to it, there should be no surprises during this final test. Wrong! There are always surprises during final testing. Always. Expect those surprises and be ready to spend some time fixing them. If there were never any bugs discovered during final testing, there would be no point in doing a final test, would there?
Testing in Production
Huh — what?
“Not until a program has been in production for at least six months will the most harmful error be discovered.”
— Troutman’s Programming Postulates
Yes, testing in production is now considered normal and desirable. Having been a tester myself, this actually does seem reasonable. “But wait! That’s dangerous,” you say. My experience is that it is no more dangerous than extensive and rigorous testing in a dedicated test environment. In some cases there is no choice because there is no test environment — only production.
This was the case in one of my jobs, the one where I was responsible for maintaining a large number of Perl CGI scripts that generated dynamic pages for a web site. The entire web site for this huge organization’s email management interface ran on a single Dell desktop system that was already very old at the time. That was our critical server. I had an even older Dell desktop from which I would log in to the server to do my programming. Both of these computers ran an early version of Red Hat Linux.
The only option we had to work with was to make many critical changes on the fly in the middle of the day and then test in production. What fun that was!
Eventually we obtained a couple of additional old desktops to use as development and test environments, but it was a nail-biting challenge until we did. Part of the reason for the lack of equipment on which to run this large email system was that it started out as a small pilot test for one department. It grew rapidly out of control with more departments asking to join as soon as they heard about it. Pilot tests are never funded and are usually lucky to be gifted with another department’s old and unwanted equipment.
So SysAdmins are no strangers to the need to test new or revised scripts in production. Any time a script is moved into production, that move becomes the ultimate test. The production environment itself constitutes the most critical part of that test. Nothing that can be dreamed up by testers in a test environment can fully replicate the true production environment.
The allegedly new practice of testing in production is just the recognition of what we SysAdmins have known all along. The best test is production — so long as it is not the only test.
After the final test the program can move into production. Production is always a test of its own. Writing code in an isolated development and test environment is in no way representative of the conditions encountered in a true production environment.
Always expect new bugs to surface in production no matter how well the script was written and tested. As Troutman’s postulate says, the most harmful error won’t be discovered for quite some time after a program has been put in production and everyone has come to assume that the results are always correct. The most harmful bugs are not the ones that cause the programs to crash; they are the ones that quietly result in incorrect results.
Continue checking the results the script produces even after it has gone into production. Look for the next bug and you will eventually find it.
Fuzzy testing
This is another of those buzzwords that caused me to roll my eyes when I first heard it. I learned that its essential meaning is simple – have someone bang on the keys until something happens and see how well the program handles it. But there really is more to it than that.
Fuzzy testing is a bit like the time my son broke my code in less than a minute with his random input. Most test plans utilize very specific input that generates a specific result or output. Regardless of whether the test is for a positive or negative outcome as success, it is still controlled and the inputs and results are specified and expected, such as a specific error message for a specific failure mode.
Fuzzy testing is about dealing with randomness in all aspects of the test such as starting conditions, very random and unexpected input, random combinations of options selected, low memory, high levels of CPU contention with other programs, multiple instances of the program under test, and any other random conditions that you can think of to be applied to the tests.
I try to do some fuzzy testing right from the beginning. If the bash script cannot deal with significant randomness in its very early stages then it is unlikely to get better as we add more code. This is also a good time to catch these problems and fix them while the code is relatively simple. A bit of fuzzy testing at each stage of completion is also useful in locating problems before they get masked by even more code.
After the code is completed I like to do some more extensive fuzzy testing. Always do some fuzzy testing. I have certainly been surprised by some of the results I have encountered. It is easy to test for the expected things, but users do not usually do the expected things with a script.
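One crude way to approximate this from the command line is a loop that throws random arguments at the script and flags any run that ends badly. This sketch assumes the script under test is ./mymotd and that an exit code greater than 1 means something worse than a handled usage error; both are assumptions for illustration.

#!/usr/bin/env bash
# Crude command-line fuzzing sketch: feed random garbage arguments to the
# script under test and report any run with a surprising exit code.

for i in {1..50}
do
   # Build one random argument of 1 to 16 printable characters.
   RandomArg=$(tr -dc 'a-zA-Z0-9!@#%^&*()_+=' < /dev/urandom | head -c $(( RANDOM % 16 + 1 )))
   ./mymotd "$RandomArg" > /dev/null 2>&1
   Status=$?
   if (( Status > 1 ))
   then
      echo "Run $i: unexpected exit code $Status for input '$RandomArg'"
   fi
done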
Automated testing
Testing can be automated but most of the work we do as SysAdmins comes with intrinsic time pressures that preclude taking the time to write code to test our code. Those pressures are the reason most code we write is quick and dirty. So we write code and test it in a hurry.
It is possible to use tools like Tcl/Expect to write a complex test suite for our shell scripts. I never had time to do anything that formal as a SysAdmin and I expect you don’t either. The most test automation I have ever done in my role as a SysAdmin is to write a very short script to sequence through a set of commands to verify a few critical aspects of the script under test. Most of the time I test manually at each step of the way and when the program is complete. Using bash history can be a reasonable substitute and provides at least some semi-automated testing.
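Here is a sketch of the sort of very short test driver I mean. The script name ./mymotd, the option letters, and the expected exit codes are assumptions for illustration; adjust them to whatever you are testing.

#!/usr/bin/env bash
# Semi-automated test sequence: run the script under test with a few known
# option sets and compare the exit codes against what we expect.

declare -i Failures=0

RunTest()
{
   local Description="$1"
   local ExpectedCode="$2"
   shift 2
   ./mymotd "$@" > /dev/null 2>&1
   local ActualCode=$?
   if (( ActualCode == ExpectedCode ))
   then
      echo "PASS: $Description"
   else
      echo "FAIL: $Description (expected $ExpectedCode, got $ActualCode)"
      Failures+=1
   fi
}

RunTest "help option"    0 -h
RunTest "version option" 0 -V
RunTest "invalid option" 1 -z

echo "$Failures test(s) failed."
exit $Failures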
My Process
A few years ago, I developed and tested a program that lists some interesting information and statistics about Linux hosts. It’s a fairly long script that has been well tested. Typical output from this script is shown in Figure 1.
#######################################################################
# MOTD for Mon Sep 30 03:41:04 AM EDT 2024
# HOST NAME: david.both.org
# Machine Type: physical machine.
# Host architecture: X86_64
#----------------------------------------------------------------------
# System Serial No.: System Serial Number
# System UUID: 27191c80-d7da-11dd-9360-b06ebf3a431f
# Motherboard Mfr: ASUSTeK COMPUTER INC.
# Motherboard Model: TUF X299 MARK 2
# Motherboard Serial: 170807951700403
# BIOS Release Date: 07/11/2017
#----------------------------------------------------------------------
# CPU Model: Intel(R) Core(TM) i9-7960X CPU @ 2.80GHz
# CPU Data: 1 Sixteen Core package with 32 CPUs
# CPU Architecture: x86_64
# HyperThreading: Yes
# Max CPU MHz: 7300.0000
# Current CPU MHz: 3599.987
# Min CPU MHz: 1200.0000
#----------------------------------------------------------------------
# RAM: 62.461 GB
# SWAP: 7.999 GB
#----------------------------------------------------------------------
# Install Date: Wed 24 Apr 2024 08:46:05 AM EDT
# Linux Distribution: Fedora 40 (Forty) X86_64
# Kernel Version: 6.10.10-200.fc40.x86_64
#----------------------------------------------------------------------
# Disk Partition Info
# Filesystem Size Used Avail Use% Mounted on
# /dev/mapper/vg01-root 9.8G 1.5G 7.9G 16% /
# /dev/mapper/vg01-usr 59G 44G 13G 78% /usr
# efivarfs 128K 110K 14K 90% /sys/firmware/efi/efivars
# /dev/nvme0n1p2 4.9G 500M 4.1G 11% /boot
# /dev/nvme0n1p1 5.0G 20M 5.0G 1% /boot/efi
# /dev/mapper/vg01-tmp 15G 4.9M 14G 1% /tmp
# /dev/mapper/vg02-home 246G 74G 159G 32% /home
# /dev/mapper/vg01-var 49G 7.1G 40G 16% /var
# /dev/mapper/vg03-Virtual 916G 518G 359G 60% /Virtual
# /dev/mapper/vg01-ansible 15G 216M 14G 2% /home/dboth/development/ansible
# /dev/mapper/vg04-VMArchives 787G 304G 444G 41% /VMArchives
# /dev/mapper/vg04-stuff 246G 98G 135G 43% /stuff
# /dev/sde1 458G 260G 176G 60% /run/media/dboth/USB-X47GF
#----------------------------------------------------------------------
# LVM Physical Volume Info
# PV VG Fmt Attr PSize PFree
# /dev/nvme0n1p3 vg01 lvm2 a-- <466.94g <316.94g
# /dev/nvme1n1 vg02 lvm2 a-- <476.94g <226.94g
# /dev/sda vg03 lvm2 a-- 931.51g 0
# /dev/sdb vg04 lvm2 a-- <2.73t 1.70t
# /dev/sdc1 vg_Backups lvm2 a-- <3.64t <8.90g
#######################################################################
# Note: This MOTD file gets updated automatically every day.
# Changes to this file will be automatically overwritten!
#######################################################################
Figure 1: The text output of my program is used to generate the /etc/motd file each day.
My systems all run this script as a cron job to generate a report every day that I store as /etc/motd, which is the message of the day file. Whenever anyone logs in using a remote terminal or one of the virtual consoles, the MOTD is displayed. This gives me an instant view into the daily status of each host I log in to.
You can download this script, mymotd, from our Downloads page so you can follow along with the steps below. Be sure to set the ownership to root:root and the permissions to 774 so that only root can run it. This script does have options; you can run it as root from the command line with the -h option for a bit of help. It requires no options to produce the desired output, which goes to STDOUT. The output is redirected to /etc/motd to create the file.
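For example, assuming you saved the downloaded script as /usr/local/bin/mymotd – the location itself is an assumption; put it wherever suits you – the ownership and permissions can be set like this:

# The path is an assumption; use the location where you placed the script.
chown root:root /usr/local/bin/mymotd
chmod 774 /usr/local/bin/mymotd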
I also created a short script that runs this one and redirects the output to /etc/motd, and placed it in /etc/cron.daily. Scripts placed in /etc/cron.daily are run automatically once daily by the cron service.
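The wrapper can be as small as this sketch; the path to mymotd is the same assumed location as above, and the file name under /etc/cron.daily is up to you.

#!/usr/bin/env bash
# Hypothetical /etc/cron.daily/mymotd: regenerate the MOTD once each day.
/usr/local/bin/mymotd > /etc/motd

Remember to make the wrapper executable so that the cron service can run it.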
Before I started coding, I created a set of requirements and then a simple test plan.
Requirements for motd script
A simple set of requirements will help us design the program and keep us on point for the specific features we want to include. These requirements should work just fine yet leave leeway for creativity.
- All output goes to STDOUT
- Provide an option to print the script release version
- Provide an option to print the GPL license
- Provide an option to print the software version
- Print the following data in a pleasing — or at least organized — format
- A header with the current date and time
- The host name
- Machine type – VM or physical
- Host hardware architecture X86_64 or i386
- Motherboard vendor and model
- CPU model and hyperthreading status
- The amount of RAM (GB)
- The amount of Swap space (GB)
- The date Linux was installed
- The Linux distribution
- The kernel version
- Disk partition information
- LVM physical volume information
- Include comments to describe the code
- Options should not be required to produce the desired output
This seems like a long list but it is quite short compared to some sets of requirements I have seen. I created the original bash script from a similar set of requirements.
The only issue someone might have with this list is the term “pleasing.” Who knows what might be pleasing versus displeasing to whoever will be using this script? So for this experiment, pleasing will be what I say it is. In other environments, many pages of requirements might be needed to define the explicit format of the output. In the SysAdmin environment, pleasing is usually whatever works for us or for whoever asked for the program to be written.
Test plan for mymotd script
Our test plan is simple and straightforward.
- Verify that the program only runs as the root user.
- Verify that the help (-h) option displays the correct help information.
- Verify that the GPL (-g) option displays the GPL license statement.
- Verify that all output data specified in the requirements are produced.
- Verify that all of the printed output is correct for the system on which testing is performed.
- Verify that the values for numeric output are correct by comparison with other sources. It is probable that some of these numbers will change between runs and when compared with other sources due to the dynamic nature of any running computer, but they should be reasonably close.
- Ensure that incorrect option selections produce an appropriate error code; a quick check for this is sketched just after this list.
- If possible, test on multiple systems including physical hardware and VMs to verify correct results for different conditions. Include Intel, AMD, and ARM hardware.
- If possible, test with multiple Linux distributions.
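A quick manual check for the incorrect-option item above can be as simple as the following; the -z option and the expectation of a non-zero exit code are assumptions for illustration.

# Deliberately pass an option the script does not support, then inspect $?.
./mymotd -z
echo "Exit code: $?"   # Anything non-zero here means the bad option was caught.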
This simple test plan is everything we need to know when testing our script. We know the outputs that we need to check because they are defined in the program requirements.
The last two items may not be possible within the context of a learning environment, but it is always something to consider. Differing environments should produce different results with this script. It is helpful to test outside our actual dev/test environment to ensure that the logic and results are accurate for those other environments. Testing in production can help with this.
Developing the script
I copied a template that I use as a starting point for all my scripts. That template, BashTemplate.sh, can also be obtained from our Downloads page. I changed the script name internally, added a short description, and changed the Help procedure to match the features of our program. I tested the help option and the option that displays the GNU license.
I have a sanity check to ensure that the script is being run by root. I also have a check to ensure that the script is running on a Linux host. As a bash script it would be compatible with various Unix systems, but several of the Linux-specific functions would fail.
At this point, I ran two quick tests. One to ensure that root can still use the program and one to ensure that non-root users cannot.
After adding a sanity check to ensure that this program is being run on a Linux system, I tested again to ensure that it still runs on a Linux system. I don’t have any non-Linux systems, so I could test this only for a positive outcome – that is, that we are running on Linux – but not for a negative outcome on a non-Linux host.
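The checks themselves can be quite short. This is a minimal sketch of the idea, not the exact code from mymotd:

# Ensure the script is being run by root.
if [[ $(id -u) -ne 0 ]]
then
   echo "ERROR: This script must be run as root." >&2
   exit 1
fi

# Ensure the script is running on a Linux host.
if [[ $(uname -s) != "Linux" ]]
then
   echo "ERROR: This script runs only on Linux." >&2
   exit 1
fi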
All programs should have a version number. So I added that and tested to ensure that the -V (Version) option works. Even adding just a small bit of code should trigger a new round of testing. But you are not really done with testing. You should do some additional testing of the previously coded functions to ensure that they have not been broken by this new code.
At this point the basics are complete. We have a partial script that displays help and the GNU license statement, performs a couple of sanity checks, and displays the version number. We do not need to add any options to the case statement because everything we need is already there.
At this point I started working on the main body of the program.
Summary
You get the idea. Every time I add a bit of functional code I test that new code to see that it works as defined in the requirements, and that the new code doesn’t break existing code. If an error occurs, fix it and retest.
Testing code for a SysAdmin is much like writing it — fast and less than rigorous. I hope that last bit sounds better than “fast and loose.” Writing code quickly usually means testing it quickly as well. That does not mean that testing shell scripts should be haphazard. “Test early, test often” is a good mantra for making testing part of the coding process. With the application of this tenet, testing shell scripts becomes second nature and an integral part of the act of writing the script in the first place.
I found the mymotd script in this article to be a good excuse for rewriting my original version. I wrote the original script in 2007, and my better understanding of hardware architecture and reporting in Linux has helped me with this version. My knowledge of Linux tools, whether new or not, has also improved and given me more flexibility to simplify this version of the script.
Notice in the code, if you downloaded it, that the variable names are all ones that make sense in terms of the task being performed. It is possible to look at the variable name and get an idea of the kind of data it is supposed to contain. This also helps make testing easier.