The Linux Philosophy for SysAdmins, Tenet 01 — Data Streams, the universal interface

Last Updated on August 10, 2024 by David Both

Author’s note: This article is excerpted in part from chapter 3 of my book, Linux Philosophy for SysAdmins, with some changes.

Everything in Linux revolves around streams of data — particularly streams of text. This is also true of the Linux Philosophy for SysAdmins. Data streams are at the core of many of the tenets.

I recently searched for “data stream” and most of the top hits were concerned with processing huge amounts of streaming data as single entities, such as streaming video and audio, or with financial institutions processing streams consisting of huge numbers of individual transactions. That is not what we are talking about here, although the concept is the same, and a case could be made that current applications use the stream processing functions of Linux as the model for processing many types of data.

In the Unix and Linux worlds, a stream is a flow of text data that originates at some source. The stream may flow to one or more programs that transform it in some way, and then it may be stored in a file or displayed in a terminal session. As a SysAdmin, your job is intimately associated with manipulating the creation and flow of these data streams. In this article we will explore data streams – what they are, how to create them, and a little bit about how to use them.

None of the dictionaries I checked have an entry for “data stream,” but Wikipedia does have a very technical one. So I give you my own, made-up, but somewhat less technical, and more general definition.

“A data stream is a series of encoded characters that represent specific information. Modern encoding is typically in UTF-8 format. A data stream may be static when recorded on a storage device such as a hard disk drive (HDD) or solid state drive (SSD). It may be dynamic when read from the storage device and transmitted to another device within the same computer or to a remote computer.”

— David Both

More specifically, a text stream is a data stream that consists solely of ASCII text characters.

Text streams – a universal interface

The use of Standard Input/Output (STDIO) for program input and output is a key foundation of the Linux way of doing things. STDIO was first developed for Unix and has found its way into most other operating systems since then, including DOS, Windows, and Linux.

This is the Unix philosophy: “Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.”

— Doug McIlroy, Basics of the Unix Philosophy

STDIO was developed by Ken Thompson as a part of the infrastructure required to implement pipes on early versions of Unix. Programs that implement STDIO use standardized file handles for input and output rather than files that are stored on a disk or other recording media. STDIO is best described as a buffered data stream and its primary function is to stream data from the output of one program, file, or device to the input of another program, file, or device.

STDIO file handles

There are three STDIO data streams, each of which is automatically opened as a file at the startup of a program – well, those programs that use STDIO. Each STDIO data stream is associated with a file handle, which is just a set of metadata that describes the attributes of the file. File handles 0, 1, and 2 are explicitly defined by convention and long practice as STDIN, STDOUT, and STDERR, respectively.

  • STDIN, File handle 0, is standard input which is usually input from the keyboard. STDIN can be redirected from any file including device files instead of the keyboard. It is not common to need to redirect STDIN but it can be done.
  • STDOUT, File handle 1, is standard output which sends the data stream to the display by default. It is common to redirect STDOUT to a file or to pipe it to another program for further processing.
  • STDERR is associated with File handle 2. The data stream for STDERR is also usually sent to the display.

If STDOUT is redirected to a file, STDERR continues to be displayed on the screen. This ensures that even though the data stream itself is no longer displayed on the terminal, the user will still see any errors resulting from execution of the program. STDERR can also be redirected to the same device as STDOUT, or passed on to the next transformer program in a pipeline.
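Here is a quick sketch of these redirections using arbitrary file names; /etc exists and /nonexistent does not, so ls produces data on both STDOUT and STDERR.

ls /etc /nonexistent > /tmp/stdout.txt                       # STDOUT to a file, STDERR still appears on the terminal
ls /etc /nonexistent > /tmp/stdout.txt 2> /tmp/stderr.txt    # each stream redirected to its own file
ls /etc /nonexistent 2>&1 | grep "No such"                   # merge STDERR into STDOUT and pipe both to grep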

STDIO is implemented in the standard C library and exposed through the stdio.h header file, which can be included in the source code of programs so that it is compiled into the resulting executable. This makes it easy to include STDIO in new programs.

Preparation

This article has some short experiments to illustrate the effects — and sometimes the side-effects — of the tenet. So I suggest using a non-production virtual machine for performing the experiments in this article. At least one of the experiments for this tenet will prevent users from logging in to the desktop on the VM. Future articles about other tenets of the philosophy will also require the use of this VM to enable similar illustrations.

I used the following fairly minimal specifications for my VM.

Item            Minimum Value
RAM             8GB
Disk            120GB
CPUs            4
Display RAM     128MB

Figure 1: VM Specifications

I use the following disk partitioning and LVM setup for the virtual drive. I changed the volume group name to vg01 because it’s shorter and easier to type than the default name. Note that the specific partitions and their sizes are designed to work with certain experiments; using a different drive configuration may change those results.

Mount Point   Device      Type             Volume Group   Filesystem Type   Size
BIOS Boot     /dev/sda1   Raw Partition                   BIOS              2MiB
/boot         /dev/sda2   Raw Partition                   EXT4              2GiB
/             /dev/sda3   Logical Volume   vg01           EXT4              5GiB
/usr          /dev/sda3   Logical Volume   vg01           EXT4              30GiB
/var          /dev/sda3   Logical Volume   vg01           EXT4              30GiB
/home         /dev/sda3   Logical Volume   vg01           EXT4              10GiB
/tmp          /dev/sda3   Logical Volume   vg01           EXT4              10GiB

Figure 2: Use this disk scheme for the VM.

I installed Fedora 40 on my VM but any distribution with a desktop GUI should work.

Generating data streams

Most of the GNU Core Utilities use STDIO for their output stream, and those that generate data streams, rather than acting to transform an existing data stream in some way, can be used to create the data streams we will use for our experiments. Data streams can be as short as one line or even a single character, and as long as needed.1

Some GNU core utilities are designed specifically to produce streams of data. The yes command produces a continuous data stream that consists of repetitions of the data string provided as the argument. The generated data stream will continue until it is interrupted with Ctrl-C, which is displayed on the screen as ^C.

Enter the command as shown and let it run for a few seconds. Press Ctrl-C when you get tired of watching the same string of data scroll by.

[tuser1@testvm1 ~]$ yes 123465789-abcdefg
123465789-abcdefg
123465789-abcdefg
123465789-abcdefg
123465789-abcdefg
<SNIP>
123465789-abcdefg
123465789-abcdefg
123465789-abcdefg
1234^C

There are many ways this might be useful. For example, you might wish to automate the process of responding to the seemingly interminable requests for “y” input from the fsck program as it fixes problems on a hard drive; this can save a lot of presses on the “y” key (see the sketch after the listing below). To see how this works, first try the yes command again without a string argument; the default output is a string of “y” characters.

[student@f26vm ~]$ yes 
y
y
<SNIP>
y
y
^C
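Returning to the fsck example mentioned above, here is a minimal sketch of piping the yes output into fsck to answer its repair prompts. The logical volume name is just an example taken from this VM's disk layout, and the filesystem must be unmounted before running fsck against it.

umount /home
yes | fsck /dev/mapper/vg01-home

The fsck program also has a -y option that accomplishes much the same thing, but piping yes into a program works even for tools that offer no such option.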

And now, here is something that you should most definitely not try. When run as root on Fedora and other Red Hat-related distributions – where rm is aliased to rm -i for the root user – the rm * command will attempt to erase every file in the present working directory (PWD), but it asks you to enter “y” for each file to verify that you actually want to delete it. This is a safety feature, but it means more typing. The CLI program below supplies the response of “y” to each request from the rm command and so deletes all of the files. Such a result would probably be undesirable.

Warning! Do not run this command because it will delete all of the files in the present working directory.

yes | rm *

The root user could also use rm -f *, which would forcibly delete all of the files in the PWD without any prompting. The -f means “force” the deletions. That is also something you should not do.

Test a theory with yes

Another option for using the yes command is to fill a filesystem with a single file containing some arbitrary and pretty much irrelevant data in order to – well – fill up the filesystem. I have used this technique to test what happens to a Linux host when a particular directory becomes full. In the specific instance where I used this technique, I was testing a theory because a customer was having problems and could not log in to the desktop on their Linux host.

The symptom was that the user could not log in to the desktop. I used the yes tool to create a data stream which I redirected into a file in the /tmp directory in order to fill it up. I then tried to log in at a virtual console using Ctrl-Alt-F3. That worked, so I was able to explore the host and determine that the root cause of the problem was that the /tmp directory was full of files that had filled the / (root) filesystem. The first command below fills /tmp with a single file; the second shows the full filesystem.

root@testvm2:~# yes 123456789-abcdefgh >> /tmp/testfile.txt
yes: standard output: No space left on device

root@testvm2:~# df -h
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/vg01-root  4.9G   40M  4.6G   1% /
/dev/mapper/vg01-usr    40G   12G   26G  31% /usr
devtmpfs               4.0M     0  4.0M   0% /dev
tmpfs                  3.9G     0  3.9G   0% /dev/shm
tmpfs                  1.6G  1.3M  1.6G   1% /run
/dev/sda2              2.0G  360M  1.5G  20% /boot
/dev/mapper/vg01-var    40G  630M   37G   2% /var
/dev/mapper/vg01-tmp   9.8G  9.8G     0 100% /tmp
/dev/mapper/vg01-home  9.8G   59M  9.2G   1% /home
tmpfs                  794M  168K  794M   1% /run/user/0
tmpfs                  794M  212K  794M   1% /run/user/1002
tmpfs                  794M  176K  794M   1% /run/user/983
root@testvm2:~# ll /tmp
total 10200264
drwx------. 2 root root       16384 Jun 20 12:35 lost+found
<SNIP>
-rw-r--r--  1 root root 10444967936 Jul 16 16:07 testfile.txt
-rw-r--r--  1 root root       33232 Jul 15 08:01 updates.list
root@testvm2:~#

At this point, try to log in to the desktop as user tuser1. The login will look like it’s going to work, but it simply returns you to the login screen.

I used the simple test in this experiment on the /tmp directory of one of my own computers as part of determining my customer’s problem. After /tmp filled up, users were no longer able to log in to a GUI desktop, but they could still log in at the consoles. That is because logging in to a GUI desktop creates files in the /tmp directory; there was no room left, so the desktop login failed. A console login does not create new files in /tmp, so those logins succeeded. My customer had not tried logging in at the console because they were not familiar with the CLI.

After testing this on my own system as verification, I used the console to log in to the customer’s host and found a number of large files taking up all of the space in the /tmp directory. I deleted those, then helped the customer determine how the files were being created, and we were able to put a stop to that.

As root, delete testfile.txt and verify that the user can now log in to the desktop.

This next experiment shows what happens on a host that doesn’t have a separate /tmp filesystem. To perform this experiment, comment out the line for /tmp in /etc/fstab and reboot. Then log in and run the following commands.

root@testvm3:~# yes 123456789-abcdefgh >> /tmp/testfile.txt
yes: standard output: No space left on device
root@testvm3:~# df -h
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/vg01-root  4.9G  102M  4.5G   3% /
/dev/mapper/vg01-usr    30G   12G   17G  43% /usr
devtmpfs               4.0M     0  4.0M   0% /dev
tmpfs                  3.9G     0  3.9G   0% /dev/shm
tmpfs                  1.6G  1.3M  1.6G   1% /run
tmpfs                  3.9G  3.9G     0 100% /tmp
/dev/sda2              2.0G  310M  1.5G  17% /boot
/dev/mapper/vg01-home  9.8G   51M  9.2G   1% /home
/dev/mapper/vg01-var    30G  578M   28G   3% /var
tmpfs                  794M  240K  794M   1% /run/user/1001
tmpfs                  794M  204K  794M   1% /run/user/0
tmpfs                  794M  212K  794M   1% /run/user/983
root@testvm3:~# lsblk
NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda             8:0    0  120G  0 disk
├─sda1          8:1    0    2M  0 part
├─sda2          8:2    0    2G  0 part /boot
└─sda3          8:3    0  118G  0 part
  ├─vg01-root 253:0    0    5G  0 lvm  /
  ├─vg01-usr  253:1    0   30G  0 lvm  /usr
  ├─vg01-var  253:2    0   30G  0 lvm  /var
  └─vg01-home 253:3    0   10G  0 lvm  /home
sr0            11:0    1 1024M  0 rom
zram0         252:0    0  7.7G  0 disk [SWAP]
root@testvm3:~# ll /tmp
total 4062244
drwx------ 3 root root         60 Jul 16 12:22 systemd-private-f97b70<snip>1ba-chronyd.service-F9Pg92
<SNIP>
drwx------ 3 root root         60 Jul 16 12:22 systemd-private-f97b70<snip>1ba-upower.service-fvqIl2
-rw-r--r-- 1 root root 4159737856 Jul 16 15:36 testfile.txt
root@testvm3:~#

What happens when you try to log in to the desktop? After answering that question, uncomment the /tmp line in the fstab file and reboot.

One thing I discovered in performing this experiment is that, without its own logical volume, the /tmp directory is mounted as tmpfs – a virtual filesystem in RAM. It’s no longer a simple directory on the root filesystem, nor is it a partition or logical volume mounted on the /tmp mountpoint. So the file filled RAM rather than a filesystem located on a disk device. The total amount of RAM in your system will make a difference in how long it takes to fill this virtual RAM filesystem.
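A quick way to confirm what is actually mounted on /tmp on your own VM is the findmnt command; the output varies by system, but with the fstab entry commented out the SOURCE column shows tmpfs rather than a device mapper volume.

findmnt /tmp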

The symptoms in this experiment are that it takes a long time to log in and log out. Because /tmp is now a virtual filesystem that is recreated, empty, at each boot, a reboot can temporarily circumvent this problem. But that won’t necessarily prevent /tmp from filling up again.

Exploring a drive

It is now time to do a little exploring, and to be as safe as possible you absolutely must use the VM. I suggest making a snapshot of your VM so you can revert to it if the VM gets damaged.

In this experiment we will look at some of the filesystem structures. Let’s start with something simple but a bit dangerous. You might be at least somewhat familiar with the dd command. Commonly expanded as “disk dump,” it is called “disk destroyer” by many SysAdmins for good reason: many of us have inadvertently destroyed the contents of an entire hard drive or partition using the dd command. That is why we will be very careful.

As root in a terminal session, use the dd command to view the boot record of the storage device. The bs= argument is not what you might think; it simply specifies a block size, 512 bytes in this case, and the count= argument specifies the number of blocks to dump to STDIO. The of= argument, when present, specifies the target device or file for the data stream. In this case, since no output file is specified, the data stream goes to STDOUT – the display.

Let’s start by viewing the first sector on the drive, the boot sector of /dev/sda.

root@testvm3:~# dd if=/dev/sda bs=512 count=1
Z������}�f�ƈd�@f�D�������@�����f�f�`|f���uNf�\|f1�f�4��1�f�t;}7���0����Z�ƻp��1۸�r��`���1������a�&Z|��}���}�4��}�.���GRUB GeomHard DiskRead Error
����<u����������U�
1+0 records in
1+0 records out
512 bytes copied, 0.000108519 s, 4.7 MB/s
root@testvm3:~#

This command prints the text of the boot record, which is the first block on the disk – any disk. This disk is formatted using the GUID partition table so this is only a “protective MBR” that prevents the disk from appearing empty to older tools.

I have added a couple of line feeds after the boot record itself in order to separate the end of the data in the sector from the information printed by the dd command itself. The last three lines contain data about the number of records and bytes processed by dd. The sector itself looks random except for a few readable strings, and the � replacement characters obscure the non-printable bytes.

Now use the dd command to view the first sector of each of the other partitions on the disk, sda1 through sda3. What does that tell you?
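One way to do that is with a simple loop over the partition device files – a minimal sketch that reuses the same dd arguments as above.

for part in /dev/sda1 /dev/sda2 /dev/sda3 ; do echo "== $part =="; dd if=$part bs=512 count=1 ; done

As with the boot record, most of what scrolls by will be unreadable, which leads us to the next step.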

We can make the output data stream easier to interpret with the very flexible od command. We can pipe the data stream from dd through od with the following result.

root@testvm3:~# dd if=/dev/sda bs=512 count=1 | od

I won’t reproduce that stream completely, but it converts the data to octal and displays it in a more structured format. However, that’s still not very helpful. Fortunately, the od command has some options that can help. We can also skip the dd command entirely, since od allows us to specify the number of bytes to be read and it can read the storage device directly because “Everything is a file.”

An asterisk, aka splat or star, indicates one or more repeats of the preceding line. The a argument to the -t (type) option tells od to display the data as named ASCII characters, and appending z to it adds a column on the right that shows the printable characters themselves rather than their names. That right-hand column is easier to read when readable data is found in the stream.

root@testvm3:~# od -taz -N 1024 /dev/sda
0000000   k   c dle nul nul nul nul nul nul nul nul nul nul nul nul nul  >.c..............<
0000020 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul  >................<
*
0000120 nul nul nul nul nul nul nul nul nul nul nul nul nul  bs nul nul  >................<
0000140 nul nul nul nul del   z dle dle   v   B nul   t enq   v   B   p  >...........t...p<
0000160   t stx   2 nul   j   y   | nul nul   1   @  so   X  so   P   <  >t....y|..1......<
0000200 nul  sp   {  sp   d   |   < del   t stx  bs   B   R   > nul   }  >. ..d|<.t...R..}<
0000220   h etb soh   > enq   |   4   A   ;   *   U   M dc3   Z   R   r  >.....|.A..U..ZRr<
0000240   = soh   {   U   *   u   7 etx   a soh   t   2   1   @  ht   D  >=..U.u7...t21..D<
0000260 eot   @  bs   D del  ht   D stx   G eot dle nul   f  vt  rs   \  >.@.D..D.....f..\<
0000300   |   f  ht   \  bs   f  vt  rs   `   |   f  ht   \  ff   G   D  >|f.\.f..`|f.\..D<
0000320 ack nul   p   4   B   M dc3   r enq   ; nul   p   k   v   4  bs  >..p.B..r...p.v..<
<SNIP>
0000540  so   F   |   s   %  us   a del   &   Z   |   > ack   }   k etx  >......a.&Z|..}..<
0000560   > nak   }   h   4 nul   > sub   }   h   . nul   M can   k   ~  >..}.4...}.......<
0000600   G   R   U   B  sp nul   G   e   o   m nul   H   a   r   d  sp  >GRUB .Geom.Hard <
0000620   D   i   s   k nul   R   e   a   d nul  sp   E   r   r   o   r  >Disk.Read. Error<
0000640  cr  nl nul   ; soh nul   4  so   M dle   ,   < nul   u   t   C  >...........<.u..<
0000660 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul  >................<
0000700 stx nul   n del del del soh nul nul nul del del del  so nul nul  >................<
0000720 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul  >................<
*
0000760 nul nul nul nul nul nul nul nul nul nul nul nul nul nul   U   *  >..............U.<
0001000   E   F   I  sp   P   A   R   T nul nul soh nul   \ nul nul nul  >EFI PART....\...<
0001020   4   z   M   7 nul nul nul nul soh nul nul nul nul nul nul nul  >..M.............<
0001040 del del del  so nul nul nul nul   " nul nul nul nul nul nul nul  >........".......<
0001060   ^ del del  so nul nul nul nul dc3   $   V ack  rs   C stx   H  >.........$.....H<
0001100   , dc3  si   0   |   S   {   K stx nul nul nul nul nul nul nul  >....|S{.........<
0001120 nul nul nul nul nul nul nul nul   o   p   ]  si nul nul nul nul  >................<
0001140 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul  >................<
*
0002000
root@testvm3:~#

Suppose we want to save an exact image of a partition or the entire hard drive. This would create a massive data stream and we don’t have room on our VM to store it all. But let’s look at how to do that using smaller amounts of data. We’ll just do 100 sectors for this exercise.

root@testvm3:~# dd if=/dev/sda bs=512 count=100 > /tmp/disk-image.txt

Now use the od command to view that data stream.
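For example, using the same od options as before and paging the result with less:

od -taz /tmp/disk-image.txt | less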

ISO images

We’ve already looked at some data streams and how to create them. But let’s look at something a little more mundane that I use quite regularly. Every time a new release of Fedora hits the mirrors, I download it. From there I need to install that ISO image on a USB thumb drive to create a bootable Live image. For this exercise download and place the ISO image for Fedora 40 in /tmp. Then insert a USB thumb drive in your Linux host.

We’ll determine the device file for the USB drive using dmesg, then install the ISO image on that device. On my host the device is /dev/sdh, as shown by the last entries in the dmesg data stream; the device name on your host will almost certainly be different, so be sure to identify the correct one. We’ll write the ISO image to the whole device and not to one of its partitions.

root@david:/tmp# dmesg
[1262807.278432] usb-storage 1-14.3:1.0: USB Mass Storage device detected
[1262807.279463] scsi host10: usb-storage 1-14.3:1.0
[1262808.324404] scsi 10:0:0:0: Direct-Access     General  UDisk            5.00 PQ: 0 ANSI: 2
[1262808.324841] sd 10:0:0:0: Attached scsi generic sg10 type 0
[1262808.326028] sd 10:0:0:0: [sdh] 15394600 512-byte logical blocks: (7.88 GB/7.34 GiB)
[1262808.326162] sd 10:0:0:0: [sdh] Write Protect is off
[1262808.326169] sd 10:0:0:0: [sdh] Mode Sense: 0b 00 00 08
[1262808.326293] sd 10:0:0:0: [sdh] No Caching mode page found
[1262808.326298] sd 10:0:0:0: [sdh] Assuming drive cache: write through
[1262808.342996] GPT:Primary header thinks Alt. header is not at the end of the disk.
[1262808.343004] GPT:3326887 != 15394599
[1262808.343009] GPT:Alternate GPT header not at the end of the disk.
[1262808.343012] GPT:3326887 != 15394599
[1262808.343015] GPT: Use GNU Parted to correct GPT errors.
[1262808.343037]  sdh: sdh1 sdh2 sdh3
[1262808.343441] sd 10:0:0:0: [sdh] Attached SCSI removable disk
root@david:/tmp# dd if=Fedora-Xfce-Live-x86_64-40-1.14.iso of=/dev/sdh bs=2048 
922846+0 records in
922846+0 records out
1889988608 bytes (1.9 GB, 1.8 GiB) copied, 1817.28 s, 1.0 MB/s
root@david:/tmp#

This use of the dd command to install the ISO image on the USB device creates a bootable, Live Fedora USB that’s suitable for demonstrating Linux, testing hardware, and installing Linux. This is done without the use of any special tools to make the device bootable. And it’s done in one simple step.

Randomness

It turns out that randomness is a desirable thing in computers. Who knew?

There are a number of reasons that SysAdmins might want to generate a stream of random data. A stream of random data is sometimes useful to overwrite the contents of a complete partition, such as /dev/sda1, or even the entire hard drive as in /dev/sda.

Although deleting files may seem permanent, it is not. Many forensic tools are available and can be used by trained forensic specialists to easily recover files that have supposedly been deleted. It is much more difficult to recover files that have been overwritten by random data. I have frequently needed not just to delete all of the data on a hard drive, but to overwrite it so it cannot be recovered. I do this for customers and friends who have “gifted” me with their old computers for reuse or recycling.

Regardless of what ultimately happens to the computers, I promise the people who donate them that I will scrub all of the data from the hard drives. I remove the drives from the computer, put them in my plug-in hard drive docking station, and overwrite all of the data. Instead of just spewing the random data to STDOUT, I redirect it to the device file for the hard drive that needs to be overwritten — but don’t do that here.
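For illustration only, the command looks something like the sketch below. The device name is a deliberate placeholder; pointed at a real drive, this irrecoverably destroys everything on it, so do not run it.

dd if=/dev/urandom of=/dev/sdX bs=4M status=progress    # /dev/sdX is a placeholder – do NOT run this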

Enter this command to print an unending stream of random data to STDOUT.

[student@testvm1 ~]$ cat /dev/urandom

Use Ctrl-C to break out and stop the stream of data. Try this with the dd and od commands.
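For example, this grabs a small, arbitrary sample of the random stream and formats it with the same od options used earlier.

dd if=/dev/urandom bs=512 count=2 | od -taz

The od command alone can do much the same thing with its -N option, for example od -taz -N 1024 /dev/urandom.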

If you are extremely paranoid, the shred command can be used to overwrite individual files as well as partitions and complete drives. It can write over the device as many times as needed for you to feel secure, using multiple passes of both random data and specifically sequenced patterns designed to prevent even the most sensitive equipment from recovering any data from the hard drive. By default shred uses its own internal pseudo-random generator for the random passes, but it can also draw its random data from a source such as the /dev/urandom device.
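A minimal sketch of shred against a whole device looks like this; once again /dev/sdX is a placeholder, so substitute – very carefully – the device you actually intend to wipe.

shred -v -n 3 /dev/sdX    # -v shows progress, -n 3 makes three overwrite passes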

Random data is also used as the input seed to programs that generate random passwords, and to programs that generate random data and numbers for use in scientific and statistical calculations.

Summary

So far you’ve learned that STDIO is nothing more than streams of data. This data can be almost anything, from the output of a command that lists the files in a directory, to an unending stream from a special device like /dev/urandom, to a stream that contains all of the raw data from a hard drive or a partition. You have also learned some interesting methods for generating different types of data streams, and how to use the dd and od commands to explore the contents of a data stream.

Any device on a Linux computer can be treated like a data stream. You can use ordinary tools like dd, od, and cat to dump data from a device into a STDIO data stream that can be processed using other ordinary Linux tools.

So far we have not done anything with these data streams except to look at them. But wait – there’s more! The next episode is coming soon…


  1. A data stream taken from special device files such as random, urandom, and zero, for example, can continue forever without some form of external termination, such as the user entering Ctrl-C, a limiting argument to the command, or a system failure. ↩︎
