Using ‘awk’ to filter text


I use Markdown to write drafts of technical articles. I find writing in Markdown makes it easy for me to stay focused on what I’m writing rather than what it will look like.

When I’m writing an article, I also like to keep track of my word count. There’s no magic “word count” for technical articles – they can be as long or as short as needed to cover the material – but I still like to keep most of my technical articles between 800 and 1,000 words. Articles that provide a “deep dive” on a highly technical topic (such as programming) might be much longer, up to 2,000 words.

I don’t want to include the code in my word count, but the wc tool treats every bracket, parenthesis, and anything else that’s surrounded by whitespace as a “word.” Because the code is part of the Markdown file, using wc to count words will include all of my sample code. For example, wc counts 26 “words” in this simple “hello world” program:

#include <stdio.h>

int main()
{
    int i;

    for (i = 1; i <= 10; i = i + 1) {
        puts("Hello world");
    }

    return 0;
}

But how do you count words in an article when that article has lots of code samples? All it takes is knowing a little about using awk to filter text.

The basics of awk scripts

Awk is a simple yet powerful scripting language developed by Al Aho, Peter Weinberger, and Brian Kernighan of Bell Labs. In fact, the command name awk was formed from the first letter of each of their last names.

Awk is perhaps best explained as a scripting language that takes actions based on matching conditions. An awk script is a series of rules, each with the general form:

condition { actions }

In awk, a condition can be a regular expression inside slashes, such as /^a/ to match any line that starts with the letter ‘a’, or a relational expression like i==4 for when the variable i has the value 4, or a special pattern like BEGIN, which matches once before awk reads any input, or END, which matches once after all input has been read. You can form more complex conditions with those basics.
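
As a quick illustration of these condition types, here is a minimal sketch that assumes a POSIX shell and a made-up four-line input. For the relational expression it uses NR, awk’s built-in count of the lines read so far.

```shell
printf 'apple\nbanana\navocado\ncherry\n' | awk '
BEGIN   { print "start" }        # runs once, before any input is read
/^a/    { print "regex:", $0 }   # regular expression: lines starting with a
NR == 4 { print "line 4:", $0 }  # relational expression: only the fourth line
END     { print "done" }         # runs once, after the last line
'
```

This prints “start” first, then “regex: apple” and “regex: avocado” for the two matching lines, then “line 4: cherry”, and finally “done”.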

To make processing text files easier for you, awk also splits lines into tokens or fields that you can access as $1, $2, and so on. The field value $0 indicates the entire line. Awk also provides variables that you can access from within scripts, such as NR as the number of “records” or lines processed so far, or NF as the number of fields on the current line.
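
To make the fields and built-in variables concrete, here is a small sketch; the two input lines are invented for illustration. The rule has no condition, so awk applies it to every line.

```shell
printf 'one two three\nfour five\n' | awk '
{ print "line " NR " has " NF " fields; the first is " $1 }
'
```

The first line of output reads “line 1 has 3 fields; the first is one” and the second reads “line 2 has 2 fields; the first is four”.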

Actions can be any series of awk instructions. Awk instructions are very similar to C programming instructions: if you know a little C, you can quickly learn awk. For example, let’s say I want to set a variable called aline to 1 whenever awk encounters a line that starts with the letter a:

/^a/ { aline = 1; }

The extra spaces within the curly braces aren’t needed; I included them only to make this easier to read. You could also write that awk statement like this:

/^a/ {aline=1;}

Or maybe I want to just increment the aline variable, such as to count the number of lines that start with the letter a. This is easy to do, as well. In awk, a variable that hasn’t been assigned yet acts as zero when used as a number, so I can write this:

/^a/ {aline=aline+1}

You can start to see how awk operates by recognizing a pattern (such as /^a/ to match a regular expression) and then taking an action (like adding 1 to the aline variable). This simple pattern-action format makes awk both simple and flexible.
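
To see the counter in action, here is a minimal sketch on a made-up three-line input. The END rule is my addition so the script has something to print; it is not part of the one-line examples above.

```shell
printf 'apple\nbanana\navocado\n' | awk '
/^a/ { aline = aline + 1 }   # count lines that start with a
END  { print aline + 0 }     # adding 0 forces a number even if nothing matched
'
```

With this input the script prints 2, because “apple” and “avocado” both start with the letter a.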

Using awk to recognize code blocks

Markdown is a lightweight document markup system that uses plain text files as input. You usually convert Markdown into some other format, such as HTML. And that’s exactly how I use Markdown to write my article drafts; I’ll write a draft in Markdown, then convert it into an HTML document using the pandoc command.

To insert a block of code, such as some sample code in a programming article, you surround the sample code with a “code fence” of three “backticks.” These “backticks” make it easy to match the start and end of sample code using awk. In other words, I want awk to take action whenever it finds three “backticks” in a Markdown file. I’ll start by incrementing a variable called text every time we encounter the three “backticks” delimiter:

/```/ { text=text+1; }

Since we only need to add 1 to the text variable, we can instead use the ++ notation, like this:

/```/ { text++; }

The first time we find three “backticks” in a Markdown file, that marks the end of regular article text and the beginning of sample code. The sample code continues until the next series of three “backticks.” This means that the variable text will always have an even value (0, 2, 4, 6, …) for regular body text within a Markdown file, and an odd value (1, 3, 5, 7, …) for sample code.

An easy way to determine if a value is even or odd is to use % to calculate the modulo, or the remainder after dividing by another number. For example, 5 divided by 2 is 2 with a remainder of 1, so 5%2 is 1. Any even number has a modulo of 0 when divided by 2.
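
As a small sketch of the modulo in awk itself, this loop prints a few counter values next to their remainder after dividing by 2. It runs entirely in a BEGIN rule, so it needs no input file.

```shell
awk 'BEGIN {
    # print each counter value beside its remainder after dividing by 2
    for (text = 0; text <= 4; text++)
        print text, text % 2
}'
```

The even values 0, 2, and 4 are paired with a remainder of 0, and the odd values 1 and 3 with a remainder of 1.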

We can use this to only print lines from a Markdown file that are regular body text, when text has an even value:

(text%2)==0 {print;}

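In this case, the pattern is (text%2)==0, which calculates the modulo of text with 2 to determine if the value is even (the modulo is zero). If it is, then awk prints the line. When this rule follows the counting rule in the same script, awk increments the counter before it runs this test, so the opening “backticks” line is skipped but the closing “backticks” line is printed.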
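
To watch the counter change line by line, here is a hedged sketch that prints the value of text beside each input line. The five-line input is invented, and the fence shell variable is a hypothetical helper: it builds the three-backtick delimiter from octal escapes and passes it to awk with -v, where index() finds it anywhere in the line, in place of the regular expression used in the article.

```shell
# \140 is the octal escape for a backtick; building the fence this way keeps
# a literal fence out of this example.
fence=$(printf '\140\140\140')

printf '%s\n' 'Intro text' "$fence" 'int x;' "$fence" 'Closing text' |
awk -v fence="$fence" '
index($0, fence) { text++ }               # fence line: bump the counter
                 { print text + 0, $0 }   # show the counter beside every line
'
```

The counter reads 0 for “Intro text”, 1 for the opening fence and the code line, and 2 for the closing fence and “Closing text”.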

Counting words in an article

Let’s say I have this sample Markdown file called hello.md, which contains headings, paragraph text, and sample code:

# Hello world

Here is how you can write your first "Hello world" program in C:

```
#include <stdio.h>

int main()
{
  puts("Hello world");
  return 0;
}
```

And now you're ready to learn programming!

This file contains 35 words, according to the wc command:

$ wc -w hello.md
35 hello.md

But this includes the sample code, which I don’t want to include in the final word count. We can use this 2-line awk script called text.awk to match lines with three “backticks” and only print the parts of the article that are regular text:

/```/ {text++;}
(text%2)==0 {print;}

Now we can use the awk command with the -f option to specify the script file, to filter the Markdown file before passing the results to wc to count the words:

$ awk -f text.awk hello.md | wc -w
24

For very short awk scripts like this, you can also provide the entire awk script as a single command-line argument, usually enclosed in single quotes so the shell doesn’t interpret it. When you use this method to run an awk script, you list the condition-action pairs one after another on the same line. This means we can rewrite the command line like this:

$ awk '/```/ {text++;} (text%2)==0 {print;}' hello.md | wc -w
24
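
The count of 24 includes one closing “backticks” line, which the filter prints because the counter is even again by the time the second rule runs. If you would rather leave both delimiter lines out of the count, one possible variation (a sketch, not the script used in this article) adds awk’s next statement so that a delimiter line is never printed. The fence shell variable is a hypothetical helper that builds the delimiter from octal escapes, and the input recreates the hello.md sample.

```shell
# \140 is the octal escape for a backtick; this avoids a literal fence here.
fence=$(printf '\140\140\140')

# Recreate the hello.md sample and filter it before counting words.
printf '%s\n' \
  '# Hello world' \
  '' \
  'Here is how you can write your first "Hello world" program in C:' \
  '' \
  "$fence" \
  '#include <stdio.h>' \
  '' \
  'int main()' \
  '{' \
  '  puts("Hello world");' \
  '  return 0;' \
  '}' \
  "$fence" \
  '' \
  'And now you'\''re ready to learn programming!' |
awk -v fence="$fence" '
index($0, fence) { text++; next }   # fence line: toggle, then skip to the next line
(text % 2) == 0  { print }          # even counter: regular body text
' | wc -w
```

With both delimiter lines dropped, wc reports 23 words for the sample.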

In my real-world example, I had written a draft article in Markdown about programming, called copyfile.md. According to the wc command, this file had over 2,200 words, including source code:

$ wc -w copyfile.md
2274 copyfile.md

Using the short awk command to filter out the sample code, and running the result through the wc command to count words, tells me the file has just under 1,900 words of actual text:

$ awk '/```/ {text++;} (text%2)==0 {print;}' copyfile.md | wc -w
1884