The Linux Philosophy for SysAdmins, Tenet 02 — Transforming Data Streams


Last Updated on August 10, 2024 by David Both

Author’s note: This article is excerpted in part from chapter 4 of my book, The Linux Philosophy for SysAdmins, with some changes.

My article about Tenet 1 covered the universality of data streams. This article introduces the use of pipes to connect streams of data from one utility program to another using STDIO. You will learn that the function of these programs is to transform the data in some manner. You will also learn about using redirection to send the data stream to a file.

The primary task of each filter program is to transform the incoming data from STDIO in a specific way as intended by the SysAdmin and to send the transformed data to STDOUT for possible use by another program or redirection to a file. These programs can add data to a stream, modify the data in some amazing ways, sort it, rearrange the data in each line, perform operations based on the contents of the data stream, and so much more.
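
For example, here is a minimal sketch of such a pipeline. It assumes the plain text file /etc/services, which is present on most Linux hosts: grep selects lines based on their content, awk rearranges the fields in each line, sort orders the result, and nl adds line numbers to the data stream.

[tuser1@testvm1 ~]$ grep "^ssh" /etc/services | awk '{print $2, $1}' | sort | nl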

Data streams as raw materials

Data streams are the raw materials upon which the Core Utilities and many other CLI tools perform their work. As its name implies, a data stream is a stream of data being passed from one file, device, or program to another using STDIO.

Data streams can be manipulated by inserting filter programs into the stream using pipes. Each filter program is used by the SysAdmin to perform some transformative operation on the data in the stream, thus changing its contents in some manner. Redirection can then be used at the end of the pipeline to direct the data stream to a file. That file could be an actual data file on the hard drive, or a device file such as a drive partition, a printer, a terminal, a pseudo-terminal, or any other device connected to a computer.
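
As a quick illustration of redirecting a data stream to a device file, you can send text directly to another terminal session. This sketch assumes that a second terminal is open on the same host and that its pseudo-terminal device is /dev/pts/1; use the who command to find the correct device name on your own system. The text appears on the second terminal's display.

[tuser1@testvm1 ~]$ echo "Hello from another terminal" > /dev/pts/1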

The ability to manipulate these data streams using these small yet powerful filter programs is central to the power of the Linux command line interface. Many of the Core Utilities are transformer programs and use STDIO.

Pipe dreams

Pipes are critical to our ability to do the amazing things on the command line, so much so that I think it is important to recognize that they were invented by Douglas McIlroy during the early days of Unix. Thanks, Doug! The Princeton University web site has a fragment of an interview with McIlroy in which he discusses the creation of the pipe and the beginnings of the Unix Philosophy.

Notice the use of pipes in the simple command line program shown here, which lists each logged-in user a single time no matter how many logins they have active. First log in as multiple users, and also log in multiple times as at least one user; if necessary, create some test users for this. Enter the command shown below on one line.

[tuser1@testvm1 ~]$ w | tail -n +3 | awk '{print $1}' | sort | uniq
root
tuser1

The results from this command are two lines of data showing that the users root and tuser1 are both logged in. It does not show how many times each user is logged in.

A string of programs connected with pipes is called a pipeline, and the programs that use STDIO are officially referred to as filters. Pipes – represented by the vertical bar ( | ) – are the syntactical glue, the operator, that connects these command line utilities together. Pipes allow the Standard Output of one command to be “piped”, that is, streamed, to the Standard Input of the next command.

[tuser1@testvm1 ~]$ echo "Pipes were invented by Doug McIlroy. Thanks, Doug." | cowsay
 ______________________________________ 
/ Pipes were invented by Doug McIlroy. \
\ Thanks, Doug.                        /
 -------------------------------------- 
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

The vertical bar plus ampersand ( |& ) operator can be used to pipe STDERR along with STDOUT to the STDIN of the next command. This makes it possible to capture and record the STDERR data stream for problem determination.
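
Here is one simple way to see the difference. In this sketch, the nonexistent directory causes ls to write an error message to STDERR; with |& that message flows into sort along with the normal output, whereas with a plain pipe it would go straight to the display.

[tuser1@testvm1 ~]$ ls /etc/hosts /nonexistent |& sort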

Think about how this program would have to work if we could not pipe the data stream from one command to the next. The first command would perform its task on the data and then the output from that command would have to be saved in a file. The next command would have to read the stream of data from the intermediate file and perform its modification of the data stream, sending its own output to a new, temporary data file. The third command would have to take its data from the second temporary data file and perform its own manipulation of the data stream and then store the resulting data stream in yet another temporary file. At each step the data file names would have to be transferred from one command to the next in some way.
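
Just to make the point, here is a rough sketch of what the earlier pipeline would look like without pipes, using temporary files with arbitrary names to carry the data stream from one command to the next (the > redirection operator used here to save output to a file is covered later in this article).

[tuser1@testvm1 ~]$ w > /tmp/step1.txt
[tuser1@testvm1 ~]$ tail -n +3 /tmp/step1.txt > /tmp/step2.txt
[tuser1@testvm1 ~]$ awk '{print $1}' /tmp/step2.txt > /tmp/step3.txt
[tuser1@testvm1 ~]$ sort /tmp/step3.txt > /tmp/step4.txt
[tuser1@testvm1 ~]$ uniq /tmp/step4.txt
[tuser1@testvm1 ~]$ rm /tmp/step?.txt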

I can’t even stand to think about that because it is so complex. Remember that simplicity rocks!

Building pipelines

When I am doing something new or solving a new problem, I usually do not just type in a complete Bash command pipeline, like the one shown above, off the top of my head. I usually start with just one or two commands in the pipeline and build from there by adding more commands to further process the data stream. This allows me to view the state of the data stream after each command in the pipeline and make corrections as needed.

Enter the commands shown below, one line at a time, and observe the changes in the data stream as each new utility is appended to the pipeline.

[tuser1@testvm1 ~]$ w

[tuser1@testvm1 ~]$ w | tail -n +3

[tuser1@testvm1 ~]$ w | tail -n +3 | awk '{print $1}'

[tuser1@testvm1 ~]$ w | tail -n +3 | awk '{print $1}' | sort

[tuser1@testvm1 ~]$ w | tail -n +3 | awk '{print $1}' | sort | uniq

The results of this experiment illustrate the changes to the data stream performed by each of the transformer utility programs in the pipeline. It is possible to build up very complex pipelines that can transform the data stream using many different utilities that work with STDIO.
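
As an optional next step, you can add the -c option to uniq to also count how many times each user is logged in, which the original pipeline does not show.

[tuser1@testvm1 ~]$ w | tail -n +3 | awk '{print $1}' | sort | uniq -c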

Redirection

Redirection is the capability to redirect the STDOUT data stream of a program to a file instead of to the default target of the display. The greater-than character ( > ) is the syntactical symbol for redirection. Next you'll redirect the output data stream of the df -h command to the file diskusage.txt. Redirecting the STDOUT of a command in this way can be used to create a file containing the results from that command.

[tuser1@testvm1 ~]$ df -h > diskusage.txt

There is no output to the terminal from this command unless there is an error. That is because the STDOUT data stream is redirected to the file, while STDERR is still directed to the display. You can view the contents of the file you just created using this next command.

tuser1@testvm1:~$ cat diskusage.txt
Filesystem             Size  Used Avail Use% Mounted on
/dev/mapper/vg01-root  4.9G   38M  4.6G   1% /
/dev/mapper/vg01-usr    30G  9.9G   19G  36% /usr
devtmpfs               4.0M     0  4.0M   0% /dev
tmpfs                  3.9G     0  3.9G   0% /dev/shm
tmpfs                  1.6G  1.2M  1.6G   1% /run
/dev/sda2              4.9G  366M  4.3G   8% /boot
/dev/mapper/vg01-home  4.9G   39M  4.6G   1% /home
/dev/mapper/vg01-var    30G  596M   28G   3% /var
/dev/mapper/vg01-tmp   9.8G  2.2M  9.3G   1% /tmp
tmpfs                  794M  104K  794M   1% /run/user/983
tmpfs                  794M   96K  794M   1% /run/user/0
tmpfs                  794M   96K  794M   1% /run/user/1002
tuser1@testvm1:~$

When using the > symbol for redirection, the specified file is created if it does not already exist. If it does already exist, its contents are overwritten by the data stream from the command. You can use double greater-than symbols, >>, to append the new data stream to any existing content in the file.

This command appends the new data stream to the end of the existing file.

tuser1@testvm1:~$ df -h >> diskusage.txt

You can use cat and/or less to view the diskusage.txt file in order to verify that the new data was appended to the end of the file.
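
One quick way to verify the append without paging through the whole file is to count the header lines; after appending, the file should contain two of them.

tuser1@testvm1:~$ grep -c Filesystem diskusage.txt
2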

The < (less than) symbol redirects data to the STDIN of the program. You might want to use this method to input data from a file to STDIN of a command that does not take a filename as an argument but that does use STDIN. Although input sources can be redirected to STDIN, such as a file that is used as input to grep, it is generally not necessary as grep also takes a filename as an argument to specify the input source. Most other commands also take a filename as an argument for their input source.
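
For example, both of these commands produce the same matching lines from the diskusage.txt file created earlier. The first passes the file name as an argument to grep, while the second redirects the file's contents to grep's STDIN.

tuser1@testvm1:~$ grep tmpfs diskusage.txt
tuser1@testvm1:~$ grep tmpfs < diskusage.txt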

Another example of redirection to STDIN uses the od command. The -N 200 option limits the output to the specified number of bytes; without it, you would need to press Ctrl-C to terminate the otherwise endless data stream from /dev/urandom.

This command illustrates the use of redirection as input to STDIN.

tuser1@testvm1:~$ od -c -N 200 < /dev/urandom
0000000  \n 335 366 315   f   V 267 343 236 375   H   @ 250 252 364  \r
0000020 375 314   $       F 320 021   j   R   =   9 215 266   - 231 237
0000040   ,   i   G   ' 325 330 251 250 345 341 212 322 360     232   Q
0000060 374 347   4   f   F   u   R   o 237 240 004 310   F 354 265   2
0000100 365 030 267 305   Z 231 307 264 322  \t 211 347 241   \   X   -
0000120   ] 257 310   ? 315 330   V   ,   l 020 356 265 034 204 177 317
0000140 034   4   :  \n 373   D 350 215 001   7   Y   { 325   D   6   9
0000160 220  \t 202   e 205   {   8   m   L   M   #   > 362 334 241   7
0000200 326 033   <   X 347 036   +   + 363   V   - 025   }  \0   @   9
0000220   F 006 265   a 002 213   9 005 220 304 373 200 314 230 363   F
0000240 274 364 354   a   p 247 203 337 237 271 366 037 325 231   ,   m
0000260  \n 357 301   4 305 302   * 306 261 347 207 016   x   T 005   &
0000300   f 020   V 257 202  \f   0 266
0000310
tuser1@testvm1:~$

Redirection can be the source or the termination of a pipeline. Because it is so seldom needed as input, redirection is usually used as termination of a pipeline.
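
Here is a small sketch of a pipeline that uses redirection at both ends: the input to sort is redirected from the diskusage.txt file created earlier, and the final result is stored in a new file (the name diskusage-sorted.txt is arbitrary) rather than displayed.

tuser1@testvm1:~$ sort < diskusage.txt | uniq > diskusage-sorted.txt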

The pipeline challenge

I wrote prolifically for Opensource.com before Red Hat terminated it, and a few years ago I posed a challenge for readers, one that involves pipes as a required component of the solution. It is a simple problem with a solution that I use frequently.

I now offer this as a challenge to readers of Both.org. Unfortunately, Both.org is an all-volunteer organization with no financial support (it all runs on my own server), so we have no prizes to award. However, send your solution to us at challenge@both.org and we'll print the best ones in a future article, along with the winners from the previous use of this challenge.

The Problem

I have a number of computers configured to send administrative emails to my own email account. I have configured procmail on my email server to move most of these administrative emails into a single folder to make it easy to find them all. Over the previous couple of years I had collected over 50,000 emails in that folder. Those emails consisted of output from rkhunter (Rootkit Hunter), logwatch, cron jobs, and Fail2Ban, among others.

I was interested in extracting more information from the data, specifically the data from Fail2Ban, which is Free Open Source Software (FOSS) that dynamically bans the IP addresses of hosts that attempt to access my own hosts maliciously, primarily my firewalls that face the Internet. Fail2Ban does this by adding rules to IPTables. Each time an IP address is banned for multiple failed SSH login attempts, Fail2Ban sends an email. I wanted a list of the IP addresses that were being banned.

The Objective

The objective of this challenge is to create a single command line program to count the number of emails from each IP address that has attempted to access my hosts using SSH. Entrants will download the admin.index file, which contains CSV data exported from my email client with more than 50,000 subject lines extracted from the emails. All of the subject lines are included in that data, so part of the task is to extract only the subject lines pertaining to banned SSH connections.

The Rules

  1. This solution must be a command line program only one line long. Line wrapping is permitted.
  2. The correct solution must list only the lines that contain the IP addresses that were banned by fail2ban due to multiple attempts to login to my firewall host via SSH.
  3. The solution must use pipes to channel the data stream from one command to the next.
  4. The resulting list must be sorted.
  5. For extra credit the results could include the name of the country of each IP address.
  6. I, David Both, am the sole judge of this challenge and my decisions are final.
  7. All entries become the property of Both.org and may be included in one or more future articles under a CC BY-SA 4.0 license.
  8. All code submitted for consideration must be licensed under the GPL V3 or later. You must state this explicitly in the email containing your submission.
  9. Previous winners of this challenge are not eligible to win this time.
  10. Solutions previously published at Opensource.com will not be eligible to win.
  11. No prizes will be awarded.

The categories

We will have at least one winner but no more than three winners in each of the following categories. All entries must produce the correct results.

  1. First entry with correct results.
  2. Shortest solutions.
  3. Most creative solutions.
  4. Extra credit solutions.

Data streams with wget

The wget command is an excellent example of a tool that creates a data stream. In this case the source data is a downloadable file on my personal web site. The wget command initiates the download and stores the received data stream in a file on the receiving host. Remember, a file is just a data stream that is recorded on a storage device such as a hard drive or an SSD. If no target file name is specified, the default filename is that of the file being downloaded.

The wget command is not interactive, so it can be used directly at the command line or in scripts to automate downloads. Use the wget command to download the file from my web site.

tuser1@testvm1:~$ wget https://www.both.org/downloads/admin.index
admin.index    100% [===================================================================>]    6.32M   83.58MB/s
                    [Files: 1  Bytes: 6.32M [14.11MB/s] Redirects: 0  Todo: 0  Errors: 0 ]
tuser1@testvm1:~$ 

Submit your solutions to challenge@both.org. We’ll print the best ones, up to three, in each category.

Summary

It is only with the use of pipes and redirection that many of the tenets of the Linux Philosophy for SysAdmins make sense. It is the pipes that transport STDIO data streams from one program or file to another. In this article you have learned that piping streams of data through one or more transformer programs supports powerful and flexible manipulation of the data in those streams.

Each of the programs in the pipelines demonstrated in these experiments is small, and each does one thing well. They are also transformers; that is, they take Standard Input, process it in some way, and then send the result to Standard Output. Implementing these programs as transformers that send processed data streams from their own Standard Output to the Standard Input of other programs is complementary to, and necessary for, the implementation of pipes as a Linux tool.

We’ll print the best solutions to the challenge in a future article. That will also include my own solution as well as the solutions from the challenge when it was posted at Opensource.com.