Why I won’t use the BtrFS filesystem


Over the more than twenty-five years I have been using Linux, the default filesystem for Red Hat Linux (not RHEL) and Fedora has evolved considerably. EXT2, the second extended filesystem, was the default when I started using Linux, and it had many drawbacks, not least of which was that it could take hours and sometimes days to recover from an improper shutdown such as a power failure. Now at EXT4, the extended filesystem can recover from such events in only seconds. It is fast and works very well in concert with logical volume management (LVM) to provide a flexible and powerful filesystem structure that works well in many storage environments.

BtrFS1, or the B-Tree filesystem, is a relatively new filesystem that employs a Copy-on-Write2 (CoW) strategy. Copy-on-Write differs significantly from the journaling strategy EXT4 uses to commit data to the storage medium. The next two paragraphs are extremely simplified, conceptual summaries of how they work.

  • With a journaling filesystem, new or revised data is first stored in a fixed-size journal. When all of the data has been committed to the journal, it is then written to the main data space of the storage device, either into newly allocated blocks or to replace the modified blocks. The journal is marked as committed when the write operation is complete.
  • In a BtrFS Copy-on-Write filesystem, the original data is not touched. New or revised data is written to a completely new location on the storage device. When the data has been completely written, the pointer to the now-old data is simply changed to point to the new data in an atomic operation that minimizes the possibility of data corruption. The storage space containing the old data is then released for re-use. A short demonstration of this behavior appears below.

BtrFS is usually pronounced as “betterfs.”
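
To see the Copy-on-Write behavior described above in action, you can make a reflink copy of a file on any mounted BtrFS filesystem; the copy shares the original's data extents until one of them is modified. This is a minimal sketch, and the /mnt/test mount point and file names are my own assumptions:

$ cd /mnt/test                           # any BtrFS mount point will do
$ dd if=/dev/urandom of=testfile bs=1M count=1024
$ cp --reflink=always testfile refcopy   # returns instantly; no data blocks are copied
$ btrfs filesystem du -s .               # most of the space shows up as shared

Modifying refcopy later causes only the changed extents to be written to a new location, which is the CoW strategy at work.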

BtrFS is also designed to be fault-tolerant and self-healing when errors occur. It is intended to be easy to maintain. It has built-in volume management, which means the separate logical volume management (LVM) layer used to provide that functionality beneath the EXT4 filesystem is not needed.
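
As a sketch of what built-in volume management looks like in practice, a single mkfs.btrfs command can create one filesystem that pools multiple devices. The device names here are hypothetical:

# Pool two devices into one BtrFS filesystem, mirroring metadata (raid1)
# and striping data (raid0) across them.
$ mkfs.btrfs -m raid1 -d raid0 /dev/sdb /dev/sdc
$ mount /dev/sdb /mnt/pool    # mounting any member device mounts the whole volume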

BtrFS is the default Fedora filesystem starting with Fedora 333 but this can be easily overridden during the installation procedure.

Based on a 2007 paper by IBM researcher Ohad Rodeh, the BtrFS filesystem was designed at Oracle Corporation for use in their version of Linux. In addition to being a general-purpose filesystem, it was intended to address a different and more specific set of problems than the EXT filesystems do. The BtrFS filesystem is designed to accommodate huge storage devices with capacities that don't even exist yet and large amounts of data, especially massively large databases in highly transactional environments.

A very complete set of BtrFS documentation is available from the BtrFS project web site.4

Warning!!! Red Hat no longer supports BtrFS.

Red Hat has removed all support for BtrFS in RHEL 9, which is an implicit statement that it does not trust this filesystem. The fact that BtrFS supersedes EXT4 as the default filesystem for Fedora, despite its removal from RHEL, means that we need to know more than just a little about it. But there are some important problems to consider, and I will cover those as we proceed through this article.

When upgrading Fedora from one release version to another, such as from Fedora 39 to Fedora 40, the procedure does not convert existing EXT filesystems to BtrFS. This is a good thing as you will see.

BtrFS vs EXT4

Although BtrFS has many interesting features, I think the best way to describe the functional differences from the standpoint of the SysAdmin is that BtrFS combines the functions of a journaling filesystem (EXT4) with the volume management capabilities of LVM. Many of its other features are designed for large commercial use cases and provide little benefit over EXT4 to smaller businesses, individual users, or even larger users unless they have some specific requirements that call for the use of BtrFS.
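
To make that comparison concrete, here is roughly what producing the same pooled storage looks like with the traditional LVM+EXT4 stack, again using hypothetical device names. What BtrFS does in a single mkfs command takes several distinct layers here:

# Physical volumes, a volume group, a logical volume, then the filesystem.
$ pvcreate /dev/sdb /dev/sdc
$ vgcreate pool00 /dev/sdb /dev/sdc
$ lvcreate -l 100%FREE -n data00 pool00
$ mkfs.ext4 /dev/pool00/data00
$ mount /dev/pool00/data00 /mnt/pool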

BtrFS uses a different strategy for space allocation than EXT4 but it maintains some of the same meta-structures such as inodes and directories.

Filesystem structure with BtrFS

The BtrFS metadata structure and data allocation strategies on the storage device are very different from those of EXT4 on LVM. For this reason, partitioning a storage device on a new system during initial installation works differently. Some things we used to do with LVM/EXT are no longer available, and other things are a bit … strange compared to my previous experience. Different tools show this in different ways.

The first time I installed a new VM using BtrFS I took the default storage partitioning option, not really knowing what to expect. Figure 1 illustrates the result after a default installation on a 120GB (virtual) storage device. The -T option displays the filesystem type. The root ( / ) and /home filesystems are both located on /dev/sda3, which is a BtrFS partition, and both appear to have 119G of space available.

$ df -Th
Filesystem     Type      Size  Used Avail Use% Mounted on
/dev/sda3      btrfs     119G  2.8G  115G   3% /
devtmpfs       devtmpfs  4.0M     0  4.0M   0% /dev
tmpfs          tmpfs     3.9G   12K  3.9G   1% /dev/shm
tmpfs          tmpfs     1.6G  1.2M  1.6G   1% /run
tmpfs          tmpfs     3.9G   28K  3.9G   1% /tmp
/dev/sda3      btrfs     119G  2.8G  115G   3% /home
/dev/sda2      ext4      974M  259M  648M  29% /boot
tmpfs          tmpfs     794M  124K  794M   1% /run/user/1000

Figure 1: Storage configuration after a default installation of Fedora.

You can see that the lsblk command in Figure 2 shows both the / and /home mount points on /dev/sda3, making it clear that both filesystems are located on the same partition. The -f option can be used to display additional information about the filesystems.

$ lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda      8:0    0  120G  0 disk 
├─sda1   8:1    0    1M  0 part 
├─sda2   8:2    0    1G  0 part /boot
└─sda3   8:3    0  119G  0 part /home
                                /
sr0     11:0    1 1024M  0 rom  
zram0  252:0    0  7.8G  0 disk [SWAP]

Figure 2: The lsblk command shows both / and /home filesystems located on the /dev/sda3 partition.

How it works

So how is it that the /dev/sda3 partition, which is 119GB in size, can support two filesystems, each of which appears to be 119GB in size?

Well, it doesn’t. Not really.

The / and /home filesystems are called subvolumes on the BtrFS filesystem.5 The /dev/sda3 partition is used as a storage location for those subvolumes and the storage space in that partition is used as a pool of available storage for any and all subvolumes created on that partition. When additional space is needed to store a new file or an expansion of an existing file on one of the subvolumes, the space is assigned to that subvolume and removed from the available storage pool.
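
You can see those subvolumes directly with the btrfs subvolume list command. On a default Fedora installation they are named root and home; the ID and generation numbers in this sketch are illustrative, not real output:

$ sudo btrfs subvolume list /
ID 256 gen 4711 top level 5 path home
ID 257 gen 4712 top level 5 path root

The btrfs filesystem usage / command shows the state of the shared pool itself, which is why df reports the same available space for both mount points.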

Notes on an edge case BtrFS failure

While I was creating Experiment 34-4 for Volume 2 of my Using and Administering Linux – Zero to SysAdmin: 2nd Edition course, I inadvertently managed to find an edge case6 in which the BtrFS filesystem failed. My job at Cisco was as a tester, so I tend to stress test anything I'm working on, and that's what happened here.

The failure was apparently due in part to the unusual makeup of the BtrFS volume: two small 2GB storage devices and one large 20GB device. I wanted to create 1.5 million small files for testing, and that failed about 90 percent of the way through the process. I knew there was a problem when the job had run for more than 2.5 hours on a VM on my primary workstation and never completed: the metadata structures had filled up, leaving no room to create new files even though data space was still available.

The metadata structures had been relegated to the RAID1 array consisting of the two 2GB devices, which did not contain enough space to hold the metadata for 1.5 million files. In a more typical configuration this might not have occurred.
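
On a still-healthy filesystem, you can watch for this condition before it becomes fatal: btrfs filesystem usage breaks out data, metadata, and system allocations per device. The /TestFS mount point and the balance example below are illustrative, and a balance can only help while the filesystem remains writable:

$ sudo btrfs filesystem usage /TestFS   # check the Metadata lines and "Device unallocated"
# If metadata is nearly full while data space remains, balancing
# under-used data chunks may return space to the unallocated pool.
$ sudo btrfs balance start -dusage=50 /TestFS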

It was difficult to tell how many files were actually created because the filesystem failed in such a way that the kernel, as it is supposed to do, placed it into read-only mode to protect the filesystem from further damage. However, the btrfs filesystem usage /TestFS command showed that approximately 1.35 million files had been created.

Although it should have been possible to use the -o remount option of the mount program to atomically remount the filesystem, that also failed. Unmounting and then mounting as separate operations also failed. Only a reboot succeeded in remounting the BtrFS filesystem in read/write mode.
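
For reference, these are the recovery attempts just described, expressed as commands; /TestFS stands in for the actual mount point. On my system each of them failed, and only a reboot cleared the read-only state:

$ sudo mount -o remount,rw /TestFS   # atomic remount: failed
$ sudo umount /TestFS                # unmount, then mount as separate
$ sudo mount /TestFS                 # operations: also failed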

I tried to delete the files that had already been created using rm -f * as root, but that just hung with no apparent effect. After a long wait it finally displayed an error message to the effect that the argument list was too long, meaning that the file glob, *, expanded to more file names than a single rm command can accept.

I then tried a different approach with the command for X in `ls` ; do echo "Working on $X" ; rm -f $X ; done, and that worked for a while but eventually ended in another hang, with the filesystem in read-only mode again.
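
For what it is worth, the usual way around the argument-list limit is to let find unlink the files itself, so that no huge command line is ever built. Whether this would have survived the repeated drops to read-only mode here, I can't say:

# Delete the files one at a time without building a giant argument list.
$ find /TestFS -xdev -type f -delete
# Or batch them through xargs, which splits the list automatically.
$ find /TestFS -xdev -type f -print0 | xargs -0 rm -f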

To make a long story short, after some additional testing I decided to recreate the BtrFS volume and reduce the number of files to be created to 50,000 while increasing their size to take up more space. This worked well, and the resulting experiment worked as it should.

However, one lesson I learned is that strange and unexpected failures can still occur. Therefore, it’s important for us to understand the technologies that are used in our devices so that we can be effective in finding and fixing or circumventing the root causes of the problems we encounter.

And yes, I did test this edge case in an EXT4/LVM setup by converting those three devices on my VM to EXT4 on LVM. It worked much faster and created 1.5 million files without any problems. However, after checking the number of inodes remaining in the EXT4 filesystem after creating those 1.5 million files, I discovered that it could only have held about 38,000 additional files before it, too, failed. All filesystems have limits.
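
Checking remaining inode capacity on an EXT filesystem is a one-liner, which is how I arrived at that estimate; the mount point here is illustrative:

$ df -i /TestFS   # the IFree column shows how many more files the filesystem can hold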

Although this situation is an extremely unusual edge case, I still plan to continue using LVM+EXT4 on my personal systems. This decision is not due solely to the edge case itself but rather to the fact that the entire filesystem became unusable after the problem. That is not a situation I want to encounter in a production environment, whether in my home office or in a huge organization.

Another factor in my decision to remain with LVM+EXT4 filesystems is an article published on Ars Technica by Jim Salter, "Examining btrfs, Linux's perpetually half-finished filesystem."7 In this article Salter describes some of the interesting features of BtrFS but also covers in some detail the problems extant in the filesystem, concluding with the statement:

“Btrfs’ refusal to mount degraded, automatic mounting of stale disks, and lack of automatic stale disk repair/recovery do not add up to a sane way to manage a ‘redundant’ storage system.”

The final factor I considered is the fact that Red Hat has removed all support for BtrFS from its RHEL flagship operating system. That is not comforting at all.

Summary

The BtrFS filesystem can be used to hide the complexities of traditional Linux filesystem structures for relatively non-technical users such as individuals or businesses who just want to get their work done. BtrFS is the default filesystem on Fedora and other distributions in part because of its apparent simplicity from the viewpoint of the non-technical user.

The edge case failure I experienced while developing Experiment 34-4 for my book serves to illustrate that BtrFS is still not ready for use except in the simplest of environments, where a single BtrFS volume is the filesystem for a single storage device. The lack of accurate documentation caused me to spend hours researching commands and their syntax. Additionally, my performance experiments show that BtrFS is much slower at certain tasks, such as creating large numbers of files very rapidly.

The BtrFS development team’s response to a query I submitted indicated that concentrating on edge cases was not helpful. I can’t ignore the fact that it was so easy to randomly create a test case that failed.

I strongly recommend using LVM+EXT4 and will continue to do so myself for the foreseeable future. Stay away from multidisk BtrFS volumes at all costs.


  1. Wikipedia, BtrFS, https://en.wikipedia.org/wiki/BtrFS ↩︎
  2. Wikipedia, Copy-on-Write, https://en.wikipedia.org/wiki/Copy-on-write ↩︎
  3. Murphy, Chris, Fedora Magazine, Btrfs Coming to Fedora 33, https://fedoramagazine.org/btrfs-coming-to-fedora-33/, 08/24/2020 ↩︎
  4. BtrFS documentation, https://BtrFS.readthedocs.io/en/latest/index.html ↩︎
  5. Notice the two meanings of “filesystem” in this sentence. ↩︎
  6. "Edge case" is a testing term for a condition that occurs at an extreme maximum or minimum value. In this case, 1.5 million files exceeded the capacity of the available metadata space. Wikipedia, Edge Case, https://en.wikipedia.org/wiki/Edge_case ↩︎
  7. Salter, Jim, Examining btrfs, Linux’s perpetually half-finished filesystem, https://arstechnica.com/gadgets/2021/09/examining-btrfs-linuxs-perpetually-half-finished-filesystem/, Ars Technica, 09/24/2021 ↩︎
