
How I recovered from another self-inflicted problem
And how I got there in the first place.
Those of you who read my articles and books know that I do updates frequently to ensure that I have the most recent security patches on my systems. I also do it to get the latest functional fixes.
The problem
This past weekend I installed the latest updates because there were several important security patches including some in the kernel. I always install updates on my primary workstation first, then on the server that hosts this web site, the Linux computer that is my router and firewall, and then the rest of my desktops and laptops.
After installing the updates on my primary workstation using my doUpdates.sh script, my testing showed no problems. To perform the rest of my updates, I use an Ansible playbook to install updates on the server and firewall/router serially, i.e., server first then the router. After that the playbook installs updates on the rest of my systems in parallel.
Everything went well with updates on the server and router, but when those were finished and Ansible tried to update the rest of my hosts, it just hung.
Problem determination
I started by verifying that both the server and router were up and running, had completed their updates, and rebooted properly. Which they had.
Ansible had not indicated problems with reaching the hosts to be updated, but hung without any indication that updates were available. I wrote the playbook to report on how many updates were available for each host, since they are all different. At this point it looked like a connectivity problem, but since I could access both server and router, it wasn’t an internal issue.
I tried pinging example.com[1] from my workstation, but that failed. I then tried pinging example.com from the router, and that failed too. Next I ran the dig command on both my workstation and the router. On my workstation it failed with an error indicating that the name server, which is also my web server and every other server on my network, was not responding.
On the router, however, the results of the dig example.com command were different. Initially, after a timeout, it reported that the DNS server was not responding. It then returned a result from a secondary, external name server that provided the IP address of example.com.
I could then ping example.com from the router but not from my workstation. So the router was not routing, although I could think of no reason why that would be. Making a router of a Linux system is quite easy, as I explain in my article, How to tune the Linux kernel with the /proc filesystem. My router has been running for years without a problem, but I checked it anyway.
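The reasoning above can be sketched as a small shell helper. This is only an illustration: the function names can_ping and interpret are hypothetical, the -W timeout option assumes the Linux iputils version of ping, and the "no" branches are an assumption about what one would check next rather than anything that happened here.

```shell
# can_ping HOST: succeed if one ICMP echo request gets a reply.
# -c 1 sends a single packet; -W 2 waits at most 2 seconds (Linux iputils ping).
can_ping() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# interpret ROUTER WORKSTATION: each argument is "yes" or "no", the result of
# running can_ping against an outside host on that machine.
# Prints a suggested next step (hypothetical wording).
interpret() {
    case "$1/$2" in
        yes/yes)      echo "both reach the outside: look elsewhere" ;;
        yes/no)       echo "suspect forwarding on the router" ;;
        no/yes|no/no) echo "suspect the router uplink or DNS" ;;
        *)            echo "usage: interpret yes|no yes|no" >&2; return 2 ;;
    esac
}

# Example, run on each machine in turn:
#   can_ping example.com && echo yes || echo no
```

In my case the router could reach example.com and the workstation could not, which is the "yes no" combination.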
One of the first things I need to tune on the Linux host I use for a router is to enable it to be a router. This is accomplished by setting the contents of the file /proc/sys/net/ipv4/ip_forward to 1 so that it will forward packets from one network to another; from my internal network to the external Internet. I checked that file and it contained a 0, not the required 1.
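A minimal sketch of that check follows. The function name forwarding_status is hypothetical, and the optional file argument is there only so the function can be tried against an ordinary file instead of the real /proc entry.

```shell
# forwarding_status [FILE]: print "enabled" if FILE contains 1, otherwise
# "disabled". Returns 0 only when forwarding is enabled.
# FILE defaults to the kernel's IPv4 forwarding switch.
forwarding_status() {
    fs_file="${1:-/proc/sys/net/ipv4/ip_forward}"
    # An unreadable file is treated the same as a disabled switch.
    fs_value=$(cat "$fs_file" 2>/dev/null) || fs_value=""
    if [ "$fs_value" = "1" ]; then
        echo "enabled"
        return 0
    fi
    echo "disabled"
    return 1
}

# Example on the router:
#   forwarding_status
```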
Since the problem began immediately after the router was rebooted, the value had evidently not been set properly during the Linux startup sequence.
Problem resolution
The first thing I did to fix this problem was to change the value of /proc/sys/net/ipv4/ip_forward to 1, and test that routing was now working. Which it was.
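The runtime change can be sketched like this. The function name enable_forwarding is hypothetical, writing to the real /proc file requires root, and a value set this way lasts only until the next reboot. Running sysctl -w net.ipv4.ip_forward=1 as root should have the same effect.

```shell
# enable_forwarding [FILE]: write 1 to the forwarding switch, then read it
# back to confirm. FILE defaults to the real /proc entry (root required).
# This changes the running kernel only; it does not survive a reboot.
enable_forwarding() {
    ef_file="${1:-/proc/sys/net/ipv4/ip_forward}"
    echo 1 > "$ef_file" || return 1
    [ "$(cat "$ef_file")" = "1" ]
}

# Example on the router, as root:
#   enable_forwarding && echo "forwarding is on"
```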
Next I had to explore why it didn’t get set properly. As I describe in the article linked above, I set this by adding a line to the Local-sysctl.conf file. The whole file is shown as it should be in Figure 1.
################################################################################
# Local-sysctl.conf #
# #
# Local kernel option settings. #
# Install this file in the /etc/sysctl.d directory. #
# #
# Use the command: sysctl -p /etc/sysctl.d/local-sysctl.conf to activate. #
# #
################################################################################
################################################################################
# Local Network settings - Specifically to disable IPV6 #
################################################################################
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
################################################################################
# And to make this a router #
################################################################################
net.ipv4.ip_forward = 1
################################################################################
# Virtual Memory Swappiness #
################################################################################
# Set swappiness
vm.swappiness = 13
Figure 1. The line net.ipv4.ip_forward = 1 turns on kernel routing.
I discovered that the line net.ipv4.ip_forward = 1 was not present in the file on my router. I added it back in so that routing would be properly configured during startup. The problem was resolved at that point.
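I made that edit by hand, but the same repair could be scripted roughly as follows. The function names has_setting and ensure_setting are hypothetical, and the sketch only appends a missing line. It does not deal with a file that already sets the same key to a different value.

```shell
# has_setting FILE KEY VALUE: succeed if FILE contains an uncommented
# "KEY = VALUE" line (whitespace around the "=" is optional).
has_setting() {
    # Escape the dots in KEY so grep matches them literally.
    hs_key=$(printf '%s' "$2" | sed 's/\./\\./g')
    grep -Eq "^[[:space:]]*${hs_key}[[:space:]]*=[[:space:]]*$3[[:space:]]*\$" "$1"
}

# ensure_setting FILE KEY VALUE: append "KEY = VALUE" to FILE unless an
# equivalent line is already there, so running it twice is harmless.
ensure_setting() {
    has_setting "$1" "$2" "$3" && return 0
    printf '%s = %s\n' "$2" "$3" >> "$1"
}

# Example on the router, as root:
#   ensure_setting /etc/sysctl.d/local-sysctl.conf net.ipv4.ip_forward 1
```

After the file is corrected, the sysctl -p command shown in the file's header activates it without a reboot.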
Determining the root cause
But there was still the question of why the problem occurred in the first place. What did I know?
First, all my other hosts have the same Local-sysctl.conf file — except for the router which must contain that line.
Second, that file on the router didn’t contain that line or its related comments.
So the file must have been overwritten since the previous reboot. Only I could have done that, knowingly or not.
So what had I done that caused that file to be overwritten? This is actually the easy question to answer.
I use Ansible to ensure that many configuration items on all my hosts are the same, both when they’re initially installed, and when I need to update certain files. The default Local-sysctl.conf file is one of those files. But the copy of that file that my playbook distributes doesn’t contain the ip_forward line. And I had run that playbook a few days prior to distribute some new configurations to all my hosts. Everything was fine until I rebooted and the file without that required line was used during startup, which resulted in a router that didn’t route.
So — once again I’m the root cause. I’ve borked my own system. But then fixed it.
What I learned
This time I’ve learned that my Ansible playbooks must take into account the needs of individual systems that have special configurations. It’s really not one size fits all.
My next task is to rewrite a couple of sections of my playbook.
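Whatever form the playbook changes take, a check along these lines could flag host-specific settings before a default file replaces the one already on a host. This is a sketch under assumptions, not part of my playbook: the function names settings_in and lost_settings are hypothetical, and it compares only active key and value pairs, ignoring comments and spacing.

```shell
# settings_in FILE: print each active setting in FILE as "key=value",
# with comments and blank lines dropped, whitespace removed, and sorted.
settings_in() {
    grep -Ev '^[[:space:]]*(#|$)' "$1" | tr -d ' \t' | sort
}

# lost_settings DEPLOYED NEW: print the settings that exist in DEPLOYED but
# not in NEW, i.e. what would disappear if NEW overwrote DEPLOYED.
lost_settings() {
    ls_old=$(mktemp) && ls_new=$(mktemp) || return 2
    settings_in "$1" > "$ls_old"
    settings_in "$2" > "$ls_new"
    # comm -23 keeps the lines that appear only in the first sorted file.
    comm -23 "$ls_old" "$ls_new"
    rm -f "$ls_old" "$ls_new"
}

# Example, with ./Local-sysctl.conf standing in for the default copy:
#   lost_settings /etc/sysctl.d/local-sysctl.conf ./Local-sysctl.conf
```

Run against my router before the overwrite, a check like this would have printed net.ipv4.ip_forward=1 as a setting about to be lost.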
[1] Yes, example.com, example.org, and example.net are real domains that are intended for testing.