Processing files with ‘find’ and ‘xargs’
I manage several websites, including Technically We Write and Coaching Buttons, and recently I wanted to see how much I had written over the last year. I wanted more than just a count of articles; I was also curious how many words I had written for each article, and in total.
I manage these websites using a static website generator, which means the website content is saved in plain text files. That makes the website very fast, but it also lets me use standard Linux commands to examine the content, including the word count. Here’s how I counted words for the articles I wrote in 2024:
Finding my articles
The website content is stored in a directory for the year, such as 2024 for all articles published during 2024. Every article is saved in its own directory, which also contains some plain text metadata; one file is called author and contains the author’s username. My username is jhall.
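For example, with a hypothetical article directory named writing-tips, checking that metadata file might look like this:
$ cat 2024/writing-tips/author
jhall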
With this information, I could look for all files called author under the 2024 directory that contained the text jhall, using this find command:
$ find 2024 -type f -name author -exec grep -q 'jhall' {} \; -print
The find command operates on a set of files, usually a directory tree, to locate files and directories that match a pattern. In this command, find looks for all files (-type f) with a specific name (-name author). For each matching file, find runs the grep command to look for text in the file (-exec grep -q 'jhall' {} \;).
The grep command runs silently (-q) and returns a “success” value if it finds jhall in the file. Note that find uses {} as a placeholder for whatever filename matches the earlier pattern, and the command executed by the -exec action must end with a semicolon, which I’ve protected from Bash interpretation by using a backslash (\;).
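As an aside, grep can print the names of matching files itself with the -l option, so a similar list could come from letting find batch the filenames with + instead of \; (a sketch of an alternative, not the command I used):
$ find 2024 -type f -name author -exec grep -l 'jhall' {} +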
Putting everything together, I can use find to locate every author file that contains jhall, and print each filename (the -print action). With this find command, and the wc command to count lines in the output, I can see that I wrote about half of the articles for Technically We Write in 2024:
$ cd technicallywewrite
$ find 2024 -type f -name author -exec grep -q 'jhall' {} \; -print | wc -l
47
$ find 2024 -type f -name author -print | wc -l
104
Running the same set of commands on the other website shows that I wrote about a third of the articles for Coaching Buttons:
$ cd coachingbuttons
$ find 2024 -type f -name author -exec grep -q 'jhall' {} \; -print | wc -l
24
$ find 2024 -type f -name author -print | wc -l
76
Counting my words
For each matching author file, I wanted to count the words in the article. Each article’s main content is stored in an HTML file called content.html, saved in the same directory as the author metadata file. To get a list of all the content files, I first save a list of the matching author files, then replace the author text with content.html on each line of the list.
For example, running this command from the Technically We Write website prints all author files that contain my username, then uses sed to change author (but only at the end of a line) to content.html, before saving the list in a plain text file:
$ find 2024 -type f -name author -exec grep -q 'jhall' {} \; -print | sed -e 's/author$/content.html/' > ~/tww.list
I can also run the same command from the Coaching Buttons website:
$ find 2024 -type f -name author -exec grep -q 'jhall' {} \; -print | sed -e 's/author$/content.html/' > ~/cb.list
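To see exactly what the sed expression does to one line, here’s a quick sketch with a made-up article path:
$ echo 2024/example-article/author | sed -e 's/author$/content.html/'
2024/example-article/content.html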
To count the total words, I only need to run the wc command against each file in the list. One way to do that is with the $() Bash expansion, which inserts the contents of the list of filenames as arguments to the wc command, showing that I wrote almost 59,600 words in 47 articles for Technically We Write in 2024:
$ wc -l < ~/tww.list
47
$ wc -w $(cat ~/tww.list) | tail -1
59592 total
But running a command with a long list of files (especially where each file might have a long path) can exceed the maximum length of the command line. To avoid this, the more typical way to run a command against a list from a file is with the xargs command. This runs a command as though you had specified each filename on the command line. If the command line would get too long, xargs automatically breaks up the list and runs the other command multiple times.
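You can watch xargs do this batching with the -n option, which limits how many arguments each invocation receives; this little sketch runs echo with two arguments at a time:
$ printf '%s\n' one two three four | xargs -n 2 echo
one two
three four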
To accommodate possibly running wc more than once (which would result in multiple total output lines), I’ll add the --total=never command line option to suppress the totals, and pass the output through gawk to print the sum of the word counts:
$ wc -l < ~/tww.list
47
$ xargs wc -w --total=never < ~/tww.list | gawk '{tot += $1} END {print tot}'
59592
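When the list is short enough for wc to run just once, newer versions of GNU coreutils also support --total=only, which prints only the grand total; a sketch, assuming a new enough wc:
$ wc -w --total=only $(cat ~/tww.list)
59592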
Running the same commands from the Coaching Buttons website shows that I wrote over 22,000 words in 24 articles during 2024:
$ wc -l < ~/cb.list
24
$ xargs wc -w --total=never < ~/cb.list | gawk '{tot += $1} END {print tot}'
22047
Processing files with ‘find’ and ‘xargs’
A core tenet of the Linux Philosophy is to store everything in plain text files. This makes it easy to work with them using the Linux command line, which provides a ton of useful utilities to process text. Two powerful commands that I used here are find, to locate matching files and directories and print the results as a list, and xargs, to run a command against a list of files.
If you look at how I’ve written my commands, you can see this in action. I used find to match files that contained my username, and saved the results to a file. Then I used that list with xargs to count the words in all the files, and print the result. This is made possible because each command does one thing and operates on plain text, making the overall process a series of small steps.
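If you don’t need to keep the intermediate list, the same small steps chain into a single pipeline; this sketch assumes, as above, that none of the file paths contain spaces:
$ find 2024 -type f -name author -exec grep -q 'jhall' {} \; -print | sed -e 's/author$/content.html/' | xargs wc -w --total=never | gawk '{tot += $1} END {print tot}'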