Searching text files from the command line

I like to write articles about a variety of topics, and I am a regular contributor to several websites. I also manage a few websites; one is about IT leadership, another is about open source software. On these websites, the web content management system is file-based, which means it stores articles and some metadata as files under a special data area. The files are combined into web pages when a reader views an article. Storing the content as files has several advantages, including the ability to view the files directly on the server and to use standard Linux tools to examine them.

And recently, I did exactly that. I was asked to contribute an article from one of the websites as a “chapter” in a book. We have a lot of articles on the website, but I wanted to examine only the articles written by me, and to identify the articles that had the longest word count. If I could generate the list of my longest articles as a list of URLs, I could click on each to read it; then I could decide which article to contribute to the book. Here’s how I did that.

How the files are stored

The data area of the content management system is organized more or less in parallel to how users access the website, although it is a separate area that is not directly accessible on the web. Articles are organized by date and then by topic, with a directory at each level.

In the “topic” directory, the content management system keeps a few metadata files, including a file called author that lists who wrote the article. The article content itself is saved in a file called content.html. That means an article called “leadership” that was published on August 1, 2022 is stored in a path like 2022/08/01/leadership, which contains a file called content.html and another called author, plus other metadata files.
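To experiment with the commands in this article without touching a real website, a throwaway data area with the same shape can be built in a temporary directory. This is only a sketch: the email address and the article text are made up for illustration, and a real data area would hold other metadata files as well.

```shell
#!/bin/bash
# Build a small throwaway data area (all contents are hypothetical)
data=$(mktemp -d)
mkdir -p "$data/2022/08/01/leadership"

# One author file and one content file, as described above
echo 'jhall@example.com' > "$data/2022/08/01/leadership/author"
echo '<p>Lead by example.</p>' > "$data/2022/08/01/leadership/content.html"

# List every entry, as paths relative to the top of the data area
layout=$(cd "$data" && find 2022 | LC_ALL=C sort)
echo "$layout"
```

The listing shows one directory per level (year, month, day, topic), with the two files at the bottom.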

Finding files

For my example, I wanted to generate a list of URLs to the longest articles I had written on the website. That meant I needed to search each author file for my email address, and then count the words in the accompanying content.html file.

Searching for a list of all author files is pretty straightforward using the find command. This is an extremely flexible tool to identify files in your filesystem that match certain rules. The basic usage is:

$ find {path} {rules} {actions}

The find command has a bunch of rules you can apply, but I only needed a few of them. The -type f rule will match regular files, while the -name author rule will match an entry called author. Use both together to find regular files named author. The default action is to print the entries. For example, to simply list the files in the data area named author, I might run this command:

$ find data/ -type f -name author
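To see what each rule contributes, here is a small sketch that runs find against a temporary sample tree. The tree is hypothetical: it contains one regular file named author and, purely for illustration, one directory that is also named author.

```shell
#!/bin/bash
# Hypothetical sample tree: a regular file and a directory, both named "author"
data=$(mktemp -d)
mkdir -p "$data/2022/08/01/leadership" "$data/2022/08/02/author"
echo 'jhall@example.com' > "$data/2022/08/01/leadership/author"

cd "$data" || exit 1

# -name alone matches any entry called "author", including the directory
by_name=$(find . -name author | LC_ALL=C sort)

# Adding -type f narrows the match to regular files only
files_only=$(find . -type f -name author)

echo "$by_name"
echo "--"
echo "$files_only"
```

The first listing has two entries and the second has only one, which is why the article combines both rules.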

Counting words

However, I wanted to take a second action: I needed to count the words in the content.html file. I can’t do that directly using the find command, but I figured the easiest way around that was to write a short Bash script. The script accepts the name of an author file and searches it for my email address. If that matches, it uses the path to the author file to find the content.html file, and uses wc to count the words. I saved this short script as wc.bash in my home directory:

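#!/bin/bash
# Usage: wc.bash path/to/author
# The directory that holds the author file also holds content.html
d=$(dirname "$1")
# Only count words if the author file mentions me
if grep -q jhall "$1" ; then
    words=$(cat "$d/content.html" | wc -w)
    # $words is left unquoted so any padding from wc is dropped
    echo $words "https://www.example.com/$d"
fi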

The dirname command prints the directory portion of a path, dropping the final component. That allows the script to locate the content.html file that sits next to the author file.
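As a quick illustration of what dirname does with the kind of path that find passes to the script, here is a minimal sketch. The path is just the example from earlier in the article; basename is shown only for contrast and is not used in wc.bash.

```shell
#!/bin/bash
# A path of the kind find would hand to the script
f=2022/08/01/leadership/author

d=$(dirname "$f")     # everything except the last component
b=$(basename "$f")    # only the last component

echo "$d"                # 2022/08/01/leadership
echo "$b"                # author
echo "$d/content.html"   # 2022/08/01/leadership/content.html
```

Neither command checks whether the path exists; both work purely on the text of the path.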

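I used wc indirectly, by piping the file through cat, because wc prints the file name after the count when it is given a file as an argument. Reading from standard input gives only the number, which I could then combine with the URL to the article in a separate echo command.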
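The difference between giving wc a file name and giving it standard input is easy to check on a temporary file. This sketch uses input redirection instead of cat, which feeds wc the same way without starting an extra process; the sample text is made up.

```shell
#!/bin/bash
# Temporary sample file with three words (hypothetical content)
f=$(mktemp)
echo 'one two three' > "$f"

# With a file argument, wc prints the count followed by the file name
with_name=$(wc -w "$f")

# Reading standard input, wc prints only the count
count_only=$(wc -w < "$f")

echo "$with_name"
echo "$count_only"

rm -f "$f"
```

Some versions of wc pad the number with leading spaces, so it is safest to treat the result as a number rather than compare it as text.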

With this script, I was able to update my find command with a new action: for each regular file named author, execute the wc.bash script with the author file. Note that the -exec action uses {} as a placeholder for the matching entry, and requires ; to terminate the action. Because ; is also a special shell character (to separate commands on a single command line) I needed to protect it by writing it as \; instead.

$ find 202[34] -type f -name author -exec bash $HOME/wc.bash {} \; | sort -nr | head -5

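This find command line looks in both the 2023 and 2024 directories for any regular files called author, then runs the wc.bash script on each one. The script checks the author file for my email address; if I’m the author, the script then uses wc to count the words in content.html and prints the count followed by the URL to the article. Because the script builds each URL from the relative path that find reports, the command needs to run from the top of the data area.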
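To watch how -exec fills in the {} placeholder, it can help to run a harmless command such as echo in place of the script. The sketch below uses a hypothetical sample tree and also shows the + terminator, which is an alternative to \; that passes many matches to one invocation.

```shell
#!/bin/bash
# Hypothetical sample tree with one author file per year
data=$(mktemp -d)
mkdir -p "$data/2023/01/02/alpha" "$data/2024/03/04/beta"
touch "$data/2023/01/02/alpha/author" "$data/2024/03/04/beta/author"

cd "$data" || exit 1

# \; runs the command once per match, with {} replaced by that one path
per_file=$(find 202[34] -type f -name author -exec echo got {} \;)

# + appends as many matches as fit to a single invocation
batched=$(find 202[34] -type f -name author -exec echo got {} +)

echo "$per_file"
echo "$batched"
```

The first form prints one line per file and the second prints a single line listing both files. Since wc.bash only looks at its first argument, it needs the \; form.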

The output of the find actions then gets sent to the sort command to sort by numbers (-n) in reverse order (-r), and then to the head command to print only the first 5 results. These are the 5 longest articles written by me, with the word count and a URL to the article:

$ find 202[34] -type f -name author -exec bash $HOME/wc.bash {} \; | sort -nr | head -5
1985 https://www.example.com/2024/02/12/workplans
1712 https://www.example.com/2024/05/20/resume
1706 https://www.example.com/2024/06/17/interview
1668 https://www.example.com/2024/06/10/digitaltwins
1461 https://www.example.com/2024/02/05/smallbusiness
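The -n option matters as soon as the word counts have different numbers of digits. This sketch sorts a few made-up lines in the same "count URL" format, once numerically and once as plain text; the counts and URLs are hypothetical and chosen only to show the difference.

```shell
#!/bin/bash
# Hypothetical word counts and URLs in the format the script prints
sample=$(printf '%s\n' \
  '985 https://www.example.com/a' \
  '1712 https://www.example.com/b' \
  '1706 https://www.example.com/c')

# Numeric reverse sort: the largest count comes first
numeric=$(echo "$sample" | sort -nr | head -n 2)

# Plain text reverse sort: "9" sorts after "1", so 985 wrongly comes first
textual=$(echo "$sample" | LC_ALL=C sort -r | head -n 2)

echo "$numeric"
echo "--"
echo "$textual"
```

Here head -n 2 is the same as the shorter head -2 form used in the article, with a smaller number to fit the sample.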