Searching text files from the command line
I like to write articles about a variety of topics, and I am a regular contributor to several websites. I also manage a few websites; one is about IT leadership, another is about open source software. On these websites, the web content management system is file-based, which means it stores articles and some metadata as files under a special data area. The files are combined into web pages when a reader views an article. Storing the content as files has several advantages, including the ability to view the files directly on the server and to examine them with standard Linux tools.
And recently, I did exactly that. I was asked to contribute an article from one of the websites as a “chapter” in a book. We have a lot of articles on the website, but I wanted to examine only the articles written by me, and to identify the ones with the longest word counts. If I could generate a list of my longest articles as URLs, I could click through to read each one, then decide which article to contribute to the book. Here’s how I did that.
How the files are stored
The data area of the content management system is organized more or less in parallel to how users access the website, although it’s a separate area that’s not directly accessible on the web. Articles are organized into directories by date and topic.
In the “topic” directory, the content management system keeps a few metadata files, including a file called author that lists who wrote the article. The article content itself is saved in a file called content.html. That means an article called “leadership” that was published on August 1, 2022 will be stored in a path like 2022/08/01/leadership and will have a file called content.html and another called author, plus other metadata files.
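For example, the directory for that article might look something like this. This is just a sketch; I’m showing only the two files that matter for this task and omitting the other metadata files:
$ ls 2022/08/01/leadership/
author  content.html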
Finding files
For my example, I wanted to generate a list of URLs to the longest articles I had written on the website. That meant I needed to search each author file for my email address, and then count the words in the accompanying content.html file.
Searching for a list of all author files is pretty straightforward using the find command. This is an extremely flexible tool to identify files in your filesystem that match certain rules. The basic usage is:
$ find {path} {rules} {actions}
The find command has a bunch of rules you can apply, but I only needed a few of them. The -type f rule will match regular files, while the -name author rule will match an entry called author. Use both together to find regular files named author. The default action is to print the entries. For example, to simply list the files in the data area named author, I might run this command:
$ find data/ -type f -name author
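The output is just the path to each matching author file, one per line. Using the sample article from above, the results might look something like this (illustrative paths, not my actual data):
$ find data/ -type f -name author
data/2022/08/01/leadership/author
data/2024/02/12/workplans/author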
Counting words
However, I wanted to take a second action: I needed to count the words in the content.html file. I can’t do that directly using the find command, but I figured the easiest way around that was to write a short Bash script. This script accepted the name of an author file, and searched it for my email address. If that matched, then it used the path to the author file to find the content.html file, and used wc to count the words. I saved this short script as wc.bash in my home directory:
#!/bin/bash
# $1 is the path to an author file; content.html lives in the same directory
d=$(dirname "$1")
# check whether this author file contains my email address
grep -q jhall "$1"
if [ $? -eq 0 ] ; then
  # count the words in the article, then print the count plus the article's URL
  words=$(cat "$d/content.html" | wc -w)
  echo $words https://www.example.com/$d
fi
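Before wiring the script into the find command, you can test it by hand on a single author file from inside the data area. If the author file contains my email address, the script prints the word count and the article’s URL; otherwise it prints nothing. The count shown here is just an illustration:
$ bash $HOME/wc.bash 2022/08/01/leadership/author
1250 https://www.example.com/2022/08/01/leadership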
The dirname command prints the directory portion of a path, which allows the script to locate the matching content.html file that goes with the author file.
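For example, given the path to an author file, dirname strips off the final component and prints the directory that contains it:
$ dirname 2022/08/01/leadership/author
2022/08/01/leadership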
I used wc indirectly, through cat, because wc prints the file name along with the count when you give it a file as an argument; reading the file through a pipe produces just the number. That let me use a separate echo command to print the word count plus the URL to the article on the website.
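To see the difference, here’s how the two forms behave on the sample article (the count is illustrative):
$ wc -w 2022/08/01/leadership/content.html
1250 2022/08/01/leadership/content.html
$ cat 2022/08/01/leadership/content.html | wc -w
1250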
With this script, I was able to update my find command with a new action: for each regular file named author, execute the wc.bash script with that author file as its argument. Note that the -exec action uses {} as a placeholder for the matching entry, and requires ; to terminate the action. Because ; is also a special shell character (it separates commands on a single command line), I needed to protect it by writing it as \; instead.
$ find 202[34] -type f -name author -exec bash $HOME/wc.bash {} \; | sort -nr | head -5
This find command line looks in both the 2023 and 2024 directories for any regular files called author, then runs the wc.bash script on each one. The script checks the author file for my email address; if I’m the author, the script then uses wc to count the words in content.html and prints the word count and the article’s URL.
The output of the find actions then gets sent to the sort command to sort numerically (-n) in reverse order (-r), and head prints only the first 5 results. These are the 5 longest articles written by me, with the word count and a URL to each article:
$ find 202[34] -type f -name author -exec bash $HOME/wc.bash {} \; | sort -nr | head -5
1985 https://www.example.com/2024/02/12/workplans
1712 https://www.example.com/2024/05/20/resume
1706 https://www.example.com/2024/06/17/interview
1668 https://www.example.com/2024/06/10/digitaltwins
1461 https://www.example.com/2024/02/05/smallbusiness
Searching text files
The find command is a powerful and flexible tool to locate files under a path. You can use it just to print the matching filenames, or use -exec like I did to perform secondary actions. Every Unix-like system should include find; it has been a core part of Unix since the First Edition (November 1971). Learn more about find in the online manual page, in section 1:
$ man 1 find