Writing your own ‘fmt’ program
The original Bell Labs Unix system was based on a then-novel idea: Everything is a text file. Your documents? Text files. Your email? Text files. You might run a separate program to process those plain text files into a more suitable format, such as using nroff to prepare documents for printing on a TeleType printer, or using troff to format documents for printing on a phototypesetter – but in the end, everything was a text file.
Yet there’s an inherent problem when working with text files: If you continue to edit a text file in a plain editor like vi
(or before that, ed
) you will quickly reach a point where the lines are not the same length. While this isn’t a problem for processing documents using nroff or troff, this can make other files look less “nice” for humans to read.
Reformatting text files
Unix and Unix-like systems provide tools to help reformat a text document to make it look great for the rest of us. One such program is fmt
, which first appeared in the Berkeley Software Distribution of Unix (commonly known as “BSD” Unix). GNU has a similar fmt
program that does pretty much the same thing, although differs in a few implementation details such as handling nroff files.
fmt
had a specific use case: split long lines to be shorter, and glue short lines together to be longer. fmt
has some interesting peculiarities due to its history, but the simple description is the program makes text files easier to read by making each line about the same length.
But with a little programming, you can write your own version of fmt
. Let’s write a simple version that collects words and fills paragraphs. We’ll call this program fill
so it doesn’t get confused with fmt
which does things a little differently.
Outlining the program
Before we write this program, let’s start with an outline. We’ll need a main
function that will read a list of files. For each file, the program should call a function that collects words and fills paragraphs in the file. We’ll call that other function fill_file
.
To process a file, it’s easiest to read one line at a time and process them. We can do that in another function, which we’ll call fill_line
.
In the C programming language, the strtok
function interprets a string and returns the next “token” based on a set of delimiters. If we use various whitespace characters such as space, tab, and newline as hte delimiters, the tokens are words. This allows us to read lines one a time, and collect words from them.
By keeping track of how much text we’ve printed as output, we can fill paragraphs up to a certain line length. We’ll keep this program as simple as possible and hard-code the target line length.
We can describe the program operation at a high level with this pseudo-code:
main() {
for every file:
fill_file(file)
if no files:
file_file(stdin)
exit
}
fill_file(file) {
for every line in the file:
fill_line(line)
return
}
fill_line(line) {
if line is empty:
print blank line
return
for every word in the line:
if we can fit the word on the output:
print word
else:
start a new line
print word
return
}
The implementation will require more details, but that is the overall structure we’ll use in our program.
The ‘main’ program
With the pseudo-code as a guide, we can construct a program to collect words and fill paragraphs. First, let’s start with the main
program function:
int main(int argc, char **argv)
{
int i;
FILE *in;
for (i = 1; i < argc; i++) {
in = fopen(argv[i], "r");
if (in) {
fill_file(in, stdout);
fclose(in);
}
else {
fputs("cannot open file: ", stderr);
fputs(argv[i], stderr);
fputc('\n', stderr);
}
}
if (argc == 1) {
fill_file(stdin, stdout);
}
return 0;
}
This reads a list of files from the command line. The program opens the file, and passes it to the fill_file
function for processing. If there are no files on the command line (argc == 1
) then the program uses standard input as the input to fill_file
.
I decided to write the fill_file
function so it takes both an input and output file pointer. This doesn’t really add much complexity in printing the output, but it makes the program more flexible if we decide to add a command line parameter like -o file
to save all output directly to a file instead of printing it to standard output.
Reading the file
If we leave the main
function to manage opening and closing the input files, we can focus the fill_file
function on reading lines one at a time from the input. A simple way to read a line of input is with the fgets
function from the standard C library. This reads a line up to a certain length into memory. However, that leaves us stuck with a predetermined size of input lines; lines that are longer will get split, possibly in the middle of a word.
A more flexible approach uses getline
, which reads an arbitrary amount of data into memory. If the memory is too small, getline
automatically reallocates more room to fit the data. This makes the fill_file
function a very brief one:
void fill_file(FILE *in, FILE *out)
{
char *line = NULL;
size_t linesize = 0;
size_t length = 0;
while (getline(&line, &linesize, in) != -1) {
length = fill_line(line, length, out);
}
fputc('\n', out); /* trailing newline */
free(line);
}
Processing lines
The most complicated function in the program processes the input lines, splits them into words, and prints the output. Fortunately, the strtok
function makes this somewhat easier: call strtok
with the string to read the first word, then call strtok
with a “zero” value (called NULL
) to read the words that follow on the line. strtok
returns NULL
when there are no more words to find.
size_t fill_line(char *line, size_t length, FILE *out)
{
char *word;
size_t wordlen, linelen;
if (is_empty(line, DELIM)) {
if (length > 0) {
fputc('\n', out); /* break prev line */
}
fputc('\n', out); /* add blank line */
return 0;
}
linelen = length;
word = strtok(line, DELIM);
while (word) {
wordlen = strlen(word);
if ((linelen + 1 + wordlen) > MAX_LENGTH) {
fputc('\n', out);
linelen = 0;
}
if (linelen > 0) {
fputc(' ', out);
linelen++;
}
fputs(word, out);
linelen += wordlen;
word = strtok(NULL, DELIM); /* get next token */
}
return linelen;
}
The key to this function is tracking how much has been written to the output. In the while
loop, the function uses strlen
to determine how many letters are in the word. If the next input word won’t fit on the current output line, it starts a new line before printing the word. The function also adds a single space between words, except at the start of a new line.
Finding empty lines
The fill_line
function relies on a new function to determine if a line is empty. A too-simple approach would use strlen
to determine if the string had a zero length, but this does not account for lines that are just one or more spaces or tabs.
To correctly determine if a line is empty, we need to write our own function. The is_empty
function reads a string (a line of input) and a list of delimiters (the same delimiters used to split words) and returns a false value if any character in the string is not a delimiter. Only if the function reaches the end of the string will it return a true value; this is possible if the string contains only delimiters, or is a zero-length string.
int is_empty(char *str, const char *whitesp)
{
char *s;
s = str;
while (s[0]) {
if (strchr(whitesp, s[0]) == NULL) {
return 0;
}
s++;
}
return 1;
}
Putting it all together
Those four functions are all we need to create a simple program that collects words and fills paragraphs. As we put it all together, we’ll also need to provide the programming overhead to specify the “include” files for the C programming language library functions, such as string.h
for functions that work on strings and stdlib.h
to use the functions that allocate and free memory.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAX_LENGTH 66
#define DELIM " \t\n"
int is_empty(char *str, const char *whitesp)
{
...
}
size_t fill_line(char *line, size_t length, FILE *out)
{
...
}
void fill_file(FILE *in, FILE *out)
{
...
}
int main(int argc, char **argv)
{
...
}
This also uses #define
to create a constant value (called a macro) for the maximum line length, at 66 characters. Because we set this as a predetermined value, you will need to update the program if you want to use a different line length. A more robust version of this program might scan the command line arguments (such as with getopt
) to interpret the user’s preferred line length. But for this version of the program, we’ll keep it simple and use a hard-coded value.
Collect words and fill paragraphs
This new program (which we’ll call fill
) will collect words from the input and fill paragraphs on the output. To generate a version of the program you can run, save the source code in a file called fill.c
and use your system’s C compiler like this:
$ gcc -o fill fill.c
Let’s exercise the program with a test file. Create a plain text file on your system, one with different line lengths, and possibly extra spaces between words and lines. I’ve saved my test file as t.txt
:
$ cat t.txt
One blank line before this one.
This is a test file
with lines that are different lengths.
This is the start of a new paragraph,
which should start on a new line.
Some more text to cause a line break.
Two blank lines before this one.
A-really-long-line-without-spaces-or-tabs-that-goes-beyond-80-columns,-at-84-columns
The fill
program will read the input and collect words on each line, then will fill output lines, up to the specified length of 66 characters. The program starts a new paragraph when it finds empty lines. Note that multiple empty lines in the input file also get printed as empty lines in the output:
$ ./fill t.txt
One blank line before this one. This is a test file with lines
that are different lengths.
This is the start of a new paragraph, which should start on a new
line. Some more text to cause a line break.
Two blank lines before this one.
A-really-long-line-without-spaces-or-tabs-that-goes-beyond-80-columns,-at-84-columns
The last line is quite long, and demonstrates why it’s important for programs to avoid arbitrary limits. If the fill
program used fgets
to read lines one a time, we risk splitting long lines because of the limits in fgets
. In the worst case, we might have a line that consists of multiple words, but gets split in the middle of a word because the line is too long. Using getline
reads the entire line, at the expense of a little extra memory. On modern systems with lots of memory, this is usually a safe trade-off.