How to untangle the systemd-resolved DNS mess


Finally — I think I got it all.

The change from the venerable nsswitch and NetworkManager to systemd-resolved has damaged and slowed name services. The result of this resolver change was apparent in a number of symptoms. The most noticeable was the inability to find the addresses of most remote servers, which resulted in timeouts. When connections were made, they were very slow to respond. I didn’t really understand how much slower until after I fixed the problem.

I run my own name server using BIND. I started this soon after I began learning Linux as a way to overcome the horrible name services provided by my series of ISPs. They were very slow and would fail intermittently, always at the most inopportune times for me. It was far less trouble for me to start my own name service and that has been the case — until systemd-resolved forced its way onto my Fedora systems. All of them.

Determining the problem

A bit of problem determination showed that even connecting to name servers at Google DNS would time out.

# dig www.both.org
;; communications error to 8.8.8.8#53: timed out
;; communications error to 8.8.8.8#53: timed out
;; communications error to 8.8.8.8#53: timed out
;; communications error to 8.8.4.4#53: timed out
; <<>> DiG 9.18.28 <<>> www.both.org
;; global options: +cmd
;; no servers could be reached
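When the output is longer than this, it can help to tally the timeouts per server. The following is a minimal sketch, not part of my original troubleshooting; the function name count_dig_timeouts is my own invention, and it assumes the error lines have exactly the format shown above.

```shell
# Count "timed out" errors per name server in dig output read from stdin.
# Prints one line per server in the form "<count> <server>".
count_dig_timeouts() {
    awk '/communications error to .*: timed out/ {
             server = $5              # for example 8.8.8.8#53:
             sub(/#.*/, "", server)   # strip the "#53:" port suffix
             hits[server]++
         }
         END { for (s in hits) print hits[s], s }' | sort -k2
}

# Typical use (stderr is merged in case the errors are printed there):
#   dig www.both.org 2>&1 | count_dig_timeouts
```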

We not only had this issue at the beach, but also on my home lab network. This resulted in my partner’s ire when she couldn’t find information about the WNBA games she loves to watch. I set out to explore this problem and find a fix.

I started with resolv.conf. By default, /etc/resolv.conf is a symlink to a stub file, as seen here.

lrwxrwxrwx. 1 root root 39 Apr 14 18:58 resolv.conf -> ../run/systemd/resolve/stub-resolv.conf

That resolv.conf file contains the following. The resolver is the local host at 127.0.0.53. The comments in this file are enlightening. I had no idea that this had supplanted the previous resolver and the resolv.conf managed by NetworkManager. This extracts the resolver function from NetworkManager but leaves it with the rest of its network management responsibilities.

# This is /run/systemd/resolve/stub-resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 127.0.0.53
options edns0 trust-ad
search both.org
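A quick way to see which servers a resolv.conf file actually names is to pull out the nameserver lines. This is a small sketch with a helper name of my own choosing (list_nameservers). It defaults to /etc/resolv.conf and ignores comments and other directives.

```shell
# Print the nameserver addresses from a resolv.conf-style file, one per line.
# With no argument, /etc/resolv.conf is read.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Typical use:
#   list_nameservers
#   list_nameservers /run/systemd/resolve/stub-resolv.conf
```

If the only address printed is 127.0.0.53, the host is sending every query through the systemd-resolved stub.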

The /etc/nsswitch.conf file is used to determine the order in which various resources are accessed for various services including host name resolution. This file has also changed and contains some weird logic in the hosts line. Based on my experiments, it’s this logic that appears to slow things down, along with the use of mdns4_minimal and resolve sources. There’s also this new thing called authselect which now generates the nsswitch.conf file.

The original file is /etc/authselect/nsswitch.conf and /etc/nsswitch.conf is a symlink to that file.

# ll nsswitch.conf 
lrwxrwxrwx. 1 root root 29 Jun 10 11:26 nsswitch.conf -> /etc/authselect/nsswitch.conf

This is the default nsswitch.conf.

# Generated by authselect
# Do not modify this file manually, use authselect instead. Any user changes will be overwritten.
# You can stop authselect from managing your configuration by calling 'authselect opt-out'.
# See authselect(8) for more details.

# In order of likelihood of use to accelerate lookup.
passwd:     files systemd
shadow:     files
group:      files [SUCCESS=merge] systemd
hosts:      files mdns4_minimal [NOTFOUND=return] resolve [!UNAVAIL=return] myhostname dns
services:   files
netgroup:   files
automount:  files

aliases:    files
ethers:     files
gshadow:    files
networks:   files dns
protocols:  files
publickey:  files
rpc:        files
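The hosts line is easier to reason about when each source and action is listed in lookup order. In glibc's NSS syntax a bracketed action applies to the source just before it: [NOTFOUND=return] stops the lookup if mdns4_minimal reports that the name was not found, and [!UNAVAIL=return] returns whatever resolve answered unless systemd-resolved is unavailable. As I read it, myhostname and dns are therefore consulted only when systemd-resolved is not running. The sketch below simply prints the entries in order; the function name hosts_sources is hypothetical.

```shell
# Print the entries on the "hosts:" line of an nsswitch.conf file,
# numbered in the order in which they are consulted.
# With no argument, /etc/nsswitch.conf is read.
hosts_sources() {
    awk '$1 == "hosts:" {
             for (i = 2; i <= NF; i++)
                 printf "%d %s\n", i - 1, $i
         }' "${1:-/etc/nsswitch.conf}"
}

# Typical use:
#   hosts_sources /etc/authselect/nsswitch.conf
```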

I also found many named errors in the systemd journal.

Sep 15 01:41:27 yorktown.both.org named[1464]: SERVFAIL unexpected RCODE resolving 'dns-02.as49870.net/A/IN': 116.203.70.186#53
Sep 15 02:00:14 yorktown.both.org named[1464]: loop detected resolving 'ns6.pinterest.com/A'
Sep 15 02:00:14 yorktown.both.org named[1464]: loop detected resolving 'ns5.pinterest.com/A'
Sep 15 02:24:03 yorktown.both.org named[1464]: REFUSED unexpected RCODE resolving '218.67.58.103.in-addr.arpa/PTR/IN': 103.58.117.2#53
Sep 15 02:34:38 yorktown.both.org named[1464]:   validating in-addr.arpa/SOA: got insecure response; parent indicates it should be secure
Sep 15 02:34:53 yorktown.both.org named[1464]: REFUSED unexpected RCODE resolving '29.140.3.106.in-addr.arpa/PTR/IN': 101.251.253.10#53
Sep 15 02:34:54 yorktown.both.org named[1464]: REFUSED unexpected RCODE resolving '29.140.3.106.in-addr.arpa/PTR/IN': 221.228.99.114#53
Sep 15 02:34:54 yorktown.both.org named[1464]: REFUSED unexpected RCODE resolving '29.140.3.106.in-addr.arpa/PTR/IN': 38.83.110.66#53
Sep 15 02:34:54 yorktown.both.org named[1464]: REFUSED unexpected RCODE resolving '29.140.3.106.in-addr.arpa/PTR/IN': 38.123.104.98#53
Sep 15 02:34:54 yorktown.both.org named[1464]: REFUSED unexpected RCODE resolving '29.140.3.106.in-addr.arpa/PTR/IN': 118.193.16.194#53
Sep 15 02:34:57 yorktown.both.org named[1464]: REFUSED unexpected RCODE resolving '29.140.3.106.in-addr.arpa/PTR/IN': 38.123.104.98#53
Sep 15 02:34:57 yorktown.both.org named[1464]: REFUSED unexpected RCODE resolving '29.140.3.106.in-addr.arpa/PTR/IN': 38.83.110.66#53
Sep 15 02:34:57 yorktown.both.org named[1464]: REFUSED unexpected RCODE resolving '29.140.3.106.in-addr.arpa/PTR/IN': 101.251.253.10#53
Sep 15 02:34:58 yorktown.both.org named[1464]: REFUSED unexpected RCODE resolving '29.140.3.106.in-addr.arpa/PTR/IN': 221.228.99.114#53
Sep 15 02:34:58 yorktown.both.org named[1464]: REFUSED unexpected RCODE resolving '29.140.3.106.in-addr.arpa/PTR/IN': 118.193.16.194#53
Sep 15 02:39:21 yorktown.both.org named[1464]:   validating meric.net.tr/SOA: no valid signature found
Sep 15 02:39:21 yorktown.both.org named[1464]:   validating host-193.111.76.147.meric.net.tr/NSEC: no valid signature found
Sep 15 02:51:35 yorktown.both.org named[1464]: connection refused resolving 'ns4.151.net/A/IN': 206.130.183.2#53
Sep 15 02:51:35 yorktown.both.org named[1464]: connection refused resolving 'ns3.151.net/A/IN': 206.130.183.2#53
Sep 15 02:51:36 yorktown.both.org named[1464]: SERVFAIL unexpected RCODE resolving '73.65.126.216.in-addr.arpa/PTR/IN': 206.130.183.8#53
Sep 15 02:51:36 yorktown.both.org named[1464]: SERVFAIL unexpected RCODE resolving '73.65.126.216.in-addr.arpa/PTR/IN': 206.130.183.4#53
Sep 15 02:51:58 yorktown.both.org named[1464]:   validating in-addr.arpa/SOA: got insecure response; parent indicates it should be secure
Sep 15 03:09:31 yorktown.both.org named[1464]:   validating in-addr.arpa/SOA: got insecure response; parent indicates it should be secure
Sep 15 03:09:32 yorktown.both.org named[1464]:   validating in-addr.arpa/SOA: got insecure response; parent indicates it should be secure
Sep 15 03:09:32 yorktown.both.org named[1464]:   validating in-addr.arpa/SOA: got insecure response; parent indicates it should be secure
Sep 15 03:09:40 yorktown.both.org named[1464]: connection refused resolving '134.96.90.45.in-addr.arpa/PTR/IN': 194.9.6.179#53
Sep 15 03:09:45 yorktown.both.org named[1464]: DNS format error from 103.27.236.70#53 resolving 125.145.118.45.in-addr.arpa/PTR for 192.168.0.254#39447: s>
Sep 15 03:10:06 yorktown.both.org named[1464]: REFUSED unexpected RCODE resolving 'revers.hostmaster.uz/A/IN': 91.213.99.17#53
Sep 15 03:10:06 yorktown.both.org named[1464]: REFUSED unexpected RCODE resolving 'ns3.hostmaster.uz/A/IN': 91.213.99.17#53
Sep 15 03:10:10 yorktown.both.org named[1464]: REFUSED unexpected RCODE resolving '209.21.126.101.in-addr.arpa/PTR/IN': 180.184.1.164#53
Sep 15 03:10:11 yorktown.both.org named[1464]: REFUSED unexpected RCODE resolving '209.21.126.101.in-addr.arpa/PTR/IN': 180.184.1.53#53
Sep 15 03:10:11 yorktown.both.org named[1464]: REFUSED unexpected RCODE resolving '209.21.126.101.in-addr.arpa/PTR/IN': 180.184.68.13#53
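With this many entries, a tally by message type shows what dominates. The following is a rough sketch that assumes the message wording shown above; the function name summarize_named_errors and its categories are my own grouping, not anything defined by BIND.

```shell
# Tally named journal messages by type from text read on stdin.
# Prints "<count><TAB><type>" lines, most frequent first.
summarize_named_errors() {
    awk '/named\[[0-9]+\]:/ {
             if      (/REFUSED unexpected RCODE/)  kind = "REFUSED"
             else if (/SERVFAIL unexpected RCODE/) kind = "SERVFAIL"
             else if (/connection refused/)        kind = "connection refused"
             else if (/loop detected/)             kind = "loop detected"
             else if (/validating /)               kind = "DNSSEC validation"
             else if (/DNS format error/)          kind = "format error"
             else                                  kind = "other"
             count[kind]++
         }
         END { for (k in count) printf "%d\t%s\n", count[k], k }' | sort -rn
}

# Typical use, assuming the BIND unit is named.service:
#   journalctl -u named --since today | summarize_named_errors
```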

Several on-line resources indicate that these errors are caused by configuration issues for the domain’s name services. The comments on these articles suggest that the domain admins should fix their problems but recognize that’s unlikely to happen. So, once again, we must implement our own changes to fix someone else’s problem.

To summarize the root cause of these related resolver problems: much of the work towards the goal of “Linux on the desktop” is intended to make things easier for the end user. That objective has resulted in many changes, introduced along with various systemd services, that configure the host’s network connection more automatically. These changes have repeatedly introduced new problems into the name resolution process. I’m pretty sure that these problems would cause all but the most technical desktop users to abandon any attempt to use Linux.

Resolving the problem

It takes several steps to resolve this problem. This section describes each step and why it’s needed as part of the complete solution.

1. Stop and disable the Avahi service

The Avahi web site describes Avahi better than I can.

Avahi is a system which facilitates service discovery on a local network via the mDNS/DNS-SD protocol suite. This enables you to plug your laptop or computer into a network and instantly be able to view other people who you can chat with, find printers to print to or find files being shared. Compatible technology is found in Apple MacOS X (branded “Bonjour” and sometimes “Zeroconf”).

Avahi is the basis for many of the good things that end-user simplification can support. However, it’s not going to be needed once we disable some of the other services that it supports.

# systemctl disable --now avahi-daemon.service

2. Stop and disable the Avahi daemon socket

The Avahi daemon socket is a part of the Avahi service. When a program requests Avahi services, it does so through the socket rather than directly to the service itself. The socket then sends the request to the service. Other systemd services also work this way. The socket won’t be required since we’ve disabled the Avahi service. A socket like this could also cause the service it belongs to to start even though the service is disabled. We don’t want to allow that to happen.

# systemctl disable --now avahi-daemon.socket

3. Stop and disable the systemd-resolved service

The systemd-resolved service is the root cause of the problems we’re having so we disable it. The systemd-resolved man page states its purpose succinctly.

systemd-resolved is a system service that provides network name resolution to local applications. It implements a caching and validating DNS/DNSSEC stub resolver, as well as an LLMNR and MulticastDNS resolver and responder. Local applications may submit network name resolution requests via three interfaces

The man page then describes the interfaces it exposes to programs and gives a high-level statement about how to access it as a resolver.

# systemctl disable --now systemd-resolved.service
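After the first three steps it is worth confirming that all three units are both disabled and stopped, since a leftover socket could restart its service. This is a minimal sketch built on the standard systemctl is-enabled and is-active subcommands; the wrapper function report_units is hypothetical.

```shell
# Print "<unit> <enabled-state> <active-state>" for each unit given.
# "unknown" is printed when systemctl returns no text for a unit.
report_units() {
    for unit in "$@"; do
        enabled="$(systemctl is-enabled "$unit" 2>/dev/null)"
        active="$(systemctl is-active "$unit" 2>/dev/null)"
        printf '%s %s %s\n' "$unit" "${enabled:-unknown}" "${active:-unknown}"
    done
}

# Typical use:
#   report_units avahi-daemon.service avahi-daemon.socket systemd-resolved.service
```

Each line should report the unit as disabled and inactive before you move on to the next step.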

4. Delete the /etc/resolv.conf link

The system resolver library examines the /etc/resolv.conf file to determine which servers to use for name resolution. Up to three servers are supported in a simple list format. This file also defines the domain in which to search for host names when a simple hostname such as host is provided rather than the FQDN (Fully Qualified Domain Name) such as host.example.com. Here’s a typical example as configured by systemd-resolved.

# This is /run/systemd/resolve/stub-resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 127.0.0.53
options edns0 trust-ad
search example.org

Only one nameserver is specified in this file: the local host. The systemd-resolved service and Avahi search the local network for other local hosts using systemd-resolved and can configure name resolution so that the hosts can talk amongst themselves. If a name server is found, such as the one provided by a wired or wireless router, systemd-resolved can use it to perform name resolution for external hosts such as www.both.org.

If there’s no locally accessible name server, external name resolution is not possible. This is what happened to me at the beach. The local name server at the hotel was intermittent so no name resolution was possible. I could, however, still ping remote hosts such as www.both.org using the IP address. Yes, this is an edge case. But it clearly does happen.

So we delete the existing resolv.conf file. We won’t create a new resolv.conf file because once we get the rest of this mess sorted, NetworkManager will create a usable one. The NetworkManager service is responsible for creating the /etc/resolv.conf file at boot time if it doesn’t exist. If systemd-resolved is running, the file above is created, which is not the one we want.

# rm -f  /etc/resolv.conf

5. Delete the /etc/nsswitch.conf link

The man page for nsswitch.conf provides a brief description of the uses for this file.

The Name Service Switch (NSS) configuration file, /etc/nsswitch.conf, is used by the GNU C Library and certain other applications to determine
the sources from which to obtain name-service information in a range of categories, and in what order. Each category of information is identi‐
fied by a database name.

My testing determined that the /etc/nsswitch.conf file shown at the beginning of this article is directly responsible for the slow resolution speeds I encountered, whether at the beach, or here in my home lab. If you look back at that file, the logic in the hosts line seems to be the cause.

We don’t need, or really want, to delete the actual nsswitch.conf file. We’ll just delete the symbolic link (symlink) in /etc.

# rm -f  /etc/nsswitch.conf
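Both deletions in steps 4 and 5 are meant to remove a symlink, not the file it points to. If you prefer a more cautious variant than a bare rm -f, something like the following sketch records where the link pointed and refuses to touch a regular file. The function name and the note-file argument are my own assumptions, not part of the procedure above.

```shell
# Remove a path only if it is a symlink, recording its target first.
# Arguments: the path to remove, and a file in which to note the old target.
remove_symlink_with_note() {
    link="$1"
    note="$2"
    if [ -L "$link" ]; then
        readlink "$link" > "$note"   # remember where the link pointed
        rm -f "$link"
    else
        echo "$link is not a symlink; leaving it alone" >&2
        return 1
    fi
}

# Typical use, as root:
#   remove_symlink_with_note /etc/resolv.conf /root/resolv.conf.target
#   remove_symlink_with_note /etc/nsswitch.conf /root/nsswitch.conf.target
```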

6. Create a revised nsswitch.conf

Since we deleted the symlink to this file in the previous step, we need to create a new version of it. After the next step, it won’t be changed or overwritten. I copied the original from /etc/authselect/nsswitch.conf to /etc so that it’s not a symlink, then made my changes to the copy. My file looks like this.

# Do not modify this file manually, use authselect instead. Any user changes will be overwritten.
# You can stop authselect from managing your configuration by calling 'authselect opt-out'.
# See authselect(8) for more details.

# In order of likelihood of use to accelerate lookup.
passwd:     files sss systemd
shadow:     files
group:      files [SUCCESS=merge] sss [SUCCESS=merge] systemd
# hosts:      files mdns4_minimal [NOTFOUND=return] resolve [!UNAVAIL=return] myhostname dns
hosts:      files myhostname dns
services:   files sss
netgroup:   files sss
automount:  files sss

aliases:    files
ethers:     files
gshadow:    files
networks:   files dns
protocols:  files
publickey:  files
rpc:        files
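The copy and the hosts edit can be done in one pass rather than by hand. This is a sketch under the assumption that the only change wanted is the hosts line; the function name rewrite_hosts_line is hypothetical. It keeps the old line as a comment, as in my file above, and leaves every other line untouched.

```shell
# Copy an nsswitch.conf, commenting out the original hosts line and
# adding the simpler one directly below it.
# Arguments: source file, destination file.
rewrite_hosts_line() {
    src="$1"
    dst="$2"
    awk '/^hosts:/ {
             print "# " $0                          # keep the old line for reference
             print "hosts:      files myhostname dns"
             next
         }
         { print }' "$src" > "$dst"
}

# Typical use, as root, after the symlink has been removed:
#   rewrite_hosts_line /etc/authselect/nsswitch.conf /etc/nsswitch.conf
```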

7. Opt out of authselect

To prevent authselect from changing /etc/nsswitch.conf, we opt out.

# authselect opt-out

You can now safely ignore the warning on the first line of the file and make changes to it manually.

8. Restart NetworkManager

The last step is to restart the NetworkManager service. This will create a new /etc/resolv.conf, and utilize the new nsswitch.conf file we created.

# systemctl restart NetworkManager.service

NetworkManager creates the new /etc/resolv.conf using the data provided by the DHCP server for the network. For many stand-alone systems in home and office, this DHCP server is usually the wired/wireless router. My resolv.conf file contains the information I configured for this interface using NetworkManager Connection Files.

root@david:/etc# cat resolv.conf 
# Generated by NetworkManager
search both.org
nameserver 192.168.0.52
nameserver 8.8.8.8
nameserver 8.8.4.4

The server at 192.168.0.52 is my internal server. It handles name services for the local network with zone files and uses the top level DNS servers for external network name resolution.

If you want to change the network configuration provided by a DHCP server, you can explicitly configure the network interface using NetworkManager Connection Files.
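For illustration only, here is a sketch that prints the [ipv4] section such a connection file might contain, keeping DHCP addressing but pinning the DNS servers and search domain. The function name ipv4_dns_section is hypothetical and this is not a copy of my actual connection file. The keys shown (method, dns, dns-search, ignore-auto-dns) are standard keyfile settings, but check the NetworkManager documentation before relying on them.

```shell
# Print an [ipv4] keyfile section that keeps DHCP addressing but pins DNS.
# Arguments: the search domain, then one or more name server addresses.
ipv4_dns_section() {
    search="$1"
    shift
    printf '[ipv4]\n'
    printf 'method=auto\n'
    printf 'dns='
    for server in "$@"; do
        printf '%s;' "$server"     # keyfile lists are semicolon-separated
    done
    printf '\n'
    printf 'dns-search=%s;\n' "$search"
    printf 'ignore-auto-dns=true\n'
}

# Typical use:
#   ipv4_dns_section both.org 192.168.0.52 8.8.8.8 8.8.4.4
```

The resulting section belongs in the connection's file under /etc/NetworkManager/system-connections/, and NetworkManager needs to reload the connection before the change takes effect.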

At this point name services are using NSSwitch with a decent and reliable resolv.conf.

Concluding thoughts

Based on my experimentation, the nsswitch.conf file generated by authselect, and dependence on the Avahi daemon to locate services such as network configuration and other hosts on the local network, slow the entire process to the point of uselessness. I think it’s fine to aim at Linux on the desktop as a goal. It’s an admirable goal and I’ll be happy when Linux is on the majority of desktops. While this may work for minimally-technical users once the problems are resolved, it can cause issues for those of us SysAdmins who’ve had things well-configured and working for years.

Don’t misunderstand me. I’m not suddenly saying I hate systemd. That’s not it at all. What I am saying is that the unintended consequences of these developer decisions can cause SysAdmins pain as they try to determine what’s changed and how to fix it. It’s simply that what’s good for one set of users is not necessarily good for other sets of users. My use case is significantly different from that of non-technical users.

In previous articles about fixing the resolver problems, I was able to resolve the issues at hand, but after this last round of extreme symptoms, I finally realized the extent of the multiple root causes. Part of the issue is that various systemd name service tools have been added over a period of time rather than all at once. This article considers all of the currently known root causes for name service resolution issues related to systemd-resolved and explains how to resolve them.

I hadn’t realized how lengthy the delays in name resolution were until after resolving this problem. Web pages that took minutes to load — and some never did with all the external links they use to load pictures and advertisements — now take only a second or so. Tests using the dig command show name resolution times of around 100 milliseconds (msec) for sites that were not currently in the cache of my name server.
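To repeat that kind of measurement, the query time can be pulled from the statistics dig prints at the end of its output. This is a minimal sketch with a helper name of my own (dig_query_msec); it assumes the usual ";; Query time: N msec" line is present.

```shell
# Extract the query time in milliseconds from dig output read on stdin.
dig_query_msec() {
    awk '/^;; Query time:/ { print $4 }'
}

# Typical use:
#   dig www.both.org | dig_query_msec
```

Run it twice for the same name: the second result is normally much lower because the answer is then served from the name server's cache.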

I suggest reading the man pages for each of the files mentioned in this article as there is additional important information about each that can be very helpful.
