Check spelling at the command line


I like to write a lot of articles about Linux, FreeDOS, programming, and open source software. I am pretty confident of my spelling and grammar, but sometimes it’s nice to run a quick spell-check to see if I have any spelling errors in my document. When I write documents in LibreOffice, I can use the spell-checker that’s built into LibreOffice, or I can rely on the “red squiggle underline” to catch spelling errors as I type.

But I write most of my articles in Markdown. If I’m using Vim, I can turn on the automatic inline spell-checker using this Vim command:

:set spell spelllang=en_us

And that does a great job of catching spelling errors as I type. But I prefer to type without distraction, and capture my thoughts quickly. Going back to fix typos while I type is very distracting for me. Instead, I wanted to have a command line tool where I could run a quick command to check for any spelling errors in my document after I’d finished writing it.

Original spell-checkers

The original Unix introduced the typo command in Unix 3rd Edition. The spell command was added in its place starting with Unix 6th Edition. Each command does basically the same thing: checks every word in a document, and prints a sorted list of unique misspelled words.

When I was an undergraduate, our campus lab of Sun computers had ispell, an interactive version of spell that showed each misspelled word in context in the text document and suggested correct spellings. These days, GNU Aspell replaces ispell for checking spelling at the command line on Linux and other Unix-like systems.

Building your own spell-checker

You can build your own version of the original typo command using a few Linux commands. At a high level, you can do this by breaking apart a document into words, sorting that list of words and removing duplicates, then comparing that list to a list of correctly spelled words.

It’s easiest to convert all words to lowercase, and that’s how the original typo command worked. To convert all text to lowercase, use the tr (transliterate) command, replacing all uppercase letters with lowercase letters:

tr 'A-Z' 'a-z'

Next, remove all punctuation from the input. You can also use tr to do this, with the -d (delete) option:

tr -d '.,:;"?!@()'

Then, break up the text so that each word appears on its own line. A simple way to do this is to use the tr command to convert spaces to newlines:

tr ' ' '\n'

From there, you can use sort to sort the list, and uniq to remove any duplicates:

sort | uniq
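As a quick check of the steps so far, here is a minimal sketch that chains them together and runs them on one sample sentence. The function name split_words and the sample text are made up for illustration; they are not part of the typo script.

```shell
#!/bin/bash
# split_words is a hypothetical helper name used only for this illustration.
# It reads text on standard input and prints one lowercase word per line,
# sorted, with duplicates removed.
split_words() {
    tr 'A-Z' 'a-z' | tr -d '.,:;"?!@()' | tr ' ' '\n' | sort | uniq
}

# "The" and "the" collapse into one entry, as do the two copies of "cat".
printf 'The cat saw the other cat.\n' | split_words
```

This should print cat, other, saw, and the, each on its own line.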

The last step uses the comm (common lines) command to compare two files: the list of words from the document and a list of correctly spelled words. The comm program assumes both lists are sorted in the same way, and it produces output where the lines unique to the first file are in one column, lines unique to the second file are in a second column, and the lines common to both appear in a third column.

When comparing lists of words, that means the correctly spelled words (words that appear in both the document and the list of correctly spelled words) will be in column 3, while misspelled words (words that appear only in the document) will be in column 1. To display only column 1 (misspelled words) we need to disable columns 2 and 3 with the -2 and -3 options:

comm -2 -3 - "$words"
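To make the three columns concrete, the sketch below runs comm on two tiny sorted lists, first with no options and then with -2 -3. The file names and their contents are invented for this example.

```shell
#!/bin/bash
# Two small sorted lists stand in for the document words and the dictionary.
# The file names and contents are made up for this illustration.
tmpdir=$(mktemp -d)
printf 'apple\nbananna\ncherry\n' > "$tmpdir/doc"
printf 'apple\nbanana\ncherry\n' > "$tmpdir/dict"

# With no options, comm prints three columns, indented by tabs:
# column 1 = only in doc, column 2 = only in dict, column 3 = in both.
comm "$tmpdir/doc" "$tmpdir/dict"

# Suppress columns 2 and 3 to keep only the lines unique to the first file.
only_in_doc=$(comm -2 -3 "$tmpdir/doc" "$tmpdir/dict")
echo "only in doc: $only_in_doc"

rm -r "$tmpdir"
```

In the first listing, bananna appears unindented in column 1, and it is the only line left once columns 2 and 3 are suppressed.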

To put that all together, the full command line looks like this:

cat "$@" | tr 'A-Z' 'a-z' | tr -d '.,:;"?!@()' | tr ' ' '\n' | sort | uniq | comm -2 -3 - "$words"

This requires a sorted list of correctly spelled words. Every Unix-like system should have this list saved as /usr/share/dict/words, but the list may not be sorted in the same way that the sort command would generate, so I like to work with a local copy. My full typo script looks like this:

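#!/bin/bash
# Local copy of the word list, sorted the same way that sort orders the document
words="$HOME/lib/words.tmp"

# Build the sorted copy the first time the script runs
[ -f "$words" ] || sort /usr/share/dict/words > "$words"

# Lowercase, strip punctuation, put one word per line, then print words not in the list
cat "$@" | tr 'A-Z' 'a-z' | tr -d '.,:;"?!@()' | tr ' ' '\n' | sort | uniq | comm -2 -3 - "$words"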
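To find out whether a word list is already in the order that sort would produce, the sort command has a -c option that checks the order without sorting anything. The following is only a sketch: is_sorted is a hypothetical helper name, and the two sample files are made up for illustration.

```shell
#!/bin/bash
# is_sorted is a hypothetical helper, not part of the typo script.
# sort -c only checks the order; it exits with a nonzero status
# (and prints a diagnostic, hidden here) if the file is out of order.
is_sorted() {
    sort -c "$1" 2>/dev/null
}

# Made-up sample lists for illustration
tmpdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$tmpdir/ordered"
printf 'banana\napple\ncherry\n' > "$tmpdir/unordered"

for f in "$tmpdir/ordered" "$tmpdir/unordered"; do
    if is_sorted "$f"; then
        echo "$(basename "$f"): already sorted"
    else
        echo "$(basename "$f"): needs sorting"
    fi
done

rm -r "$tmpdir"
```

Running the same check against /usr/share/dict/words shows whether the local sorted copy is really needed on a given system.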

Let’s test it! Say I have a one-line document called test.md that contains a single misspelled word:

This is a sample document with a mspelled word.

If I run the typo script against this file, I get the one misspelled word as the only output:

$ typo test.md
mspelled

A streamlined version

That typo script works well for me, but I’ve experimented with other ways to implement it. The basic steps remain the same, but I wanted to use the character class model from GNU tr to do the same thing. One way to start is with the -c (complement) option, which converts any character that is not a letter into a newline character:

tr -c '[:alpha:]' '\n'

Then I wanted to immediately reduce the work for following steps by removing the blank lines. The grep command can do this easily:

grep -v '^$'

The remaining steps are much as they were in the original. The script converts uppercase letters to lowercase letters, this time using the [:upper:] and [:lower:] character classes instead of A-Z and a-z:

tr '[:upper:]' '[:lower:]'

Finally, sort the words with sort and remove duplicates with uniq before comparing the output with comm:

sort | uniq | comm -2 -3 - "$words"
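Here is a short sketch of the character-class steps up to, but not including, the comm comparison. The function name unique_words and the sample sentence are hypothetical. The sample includes punctuation (a double hyphen) that the fixed delete list in the first version would not have removed.

```shell
#!/bin/bash
# unique_words is a hypothetical name for the steps described above.
# Every non-letter becomes a newline, blank lines are dropped, and the
# remaining words are lowercased, sorted, and de-duplicated.
unique_words() {
    tr -c '[:alpha:]' '\n' | grep -v '^$' | \
        tr '[:upper:]' '[:lower:]' | sort | uniq
}

printf 'Hello, world -- hello again!\n' | unique_words
```

This should print again, hello, and world, each on its own line.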

The full command line looks like this:

cat "$@" | tr -c '[:alpha:]' '\n' | grep -v '^$' | \
 tr '[:upper:]' '[:lower:]' | sort | uniq | comm -2 -3 - "$words"

I followed the same basic steps to prepare the sorted list of correctly spelled words, based on the /usr/share/dict/words file. To accommodate any words that I use but that aren’t in the system list, such as when I write about FreeDOS, I combine the system list with a list of my own words saved in $HOME/lib/mywords:

cat /usr/share/dict/words "$HOME/lib/mywords" | \
 tr '[:upper:]' '[:lower:]' | sort | uniq > "$words"

My new typo script looks like this:

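#!/bin/bash
# Cached, lowercased, sorted list of correctly spelled words
words="$HOME/lib/words.tmp"

# Build the list on first run from the system dictionary plus my own words
if [ ! -f "$words" ] ; then
 cat /usr/share/dict/words "$HOME/lib/mywords" | \
  tr '[:upper:]' '[:lower:]' | sort | uniq > "$words"
fi

# Split into words, lowercase them, and print any word not in the list
cat "$@" | tr -c '[:alpha:]' '\n' | grep -v '^$' | \
 tr '[:upper:]' '[:lower:]' | sort | uniq | comm -2 -3 - "$words"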

The script does mostly the same job as the previous typo script. For example, it finds the same misspelled word from the previous example:

$ typo test.md 
mspelled
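One consequence of caching the word list is that words added to mywords later are not picked up until words.tmp is deleted and rebuilt. A possible refinement, sketched here as an assumption rather than as part of the script, is to rebuild whenever a source file is newer than the cached list, using the shell's -nt (newer than) test. The function name needs_rebuild is hypothetical, and this sketch only reports what it would do.

```shell
#!/bin/bash
# needs_rebuild is a hypothetical helper: it succeeds when the cached
# list ($1) is missing, or when any of the remaining files is newer than it.
needs_rebuild() {
    cache=$1
    shift
    [ -f "$cache" ] || return 0
    for src in "$@"; do
        [ "$src" -nt "$cache" ] && return 0
    done
    return 1
}

words="$HOME/lib/words.tmp"
mywords="$HOME/lib/mywords"

# Report only; a real script would regenerate "$words" in the first branch.
if needs_rebuild "$words" /usr/share/dict/words "$mywords"; then
    echo "word list should be rebuilt"
else
    echo "word list is up to date"
fi
```

In the typo script, the same test could replace the [ ! -f ... ] check so that edits to mywords take effect on the next run.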

The limitation in this “improved” version is that the tr -c command treats hyphens and apostrophes as word separators. A word like hadn't gets split into hadn and t, so hadn is reported as a “misspelled” word even though hadn't is in the list of correctly spelled words. However, as a quick spell-check tool, this works well enough for me.
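To see the split in action, the sketch below compares the tr -c step from the script with a variant that adds the apostrophe to the set of characters to keep. The variant is an assumption, not part of the script above, and both function names are hypothetical. It would also leave single quotes used as quotation marks attached to words, so it trades one kind of false positive for another, and it does nothing for hyphenated words.

```shell
#!/bin/bash
# Both function names are hypothetical, for illustration only.

# The split used in the script: every non-letter becomes a newline.
split_letters_only() {
    tr -c '[:alpha:]' '\n' | grep -v '^$'
}

# Assumed variant: also treat the apostrophe as part of a word.
split_keep_apostrophe() {
    tr -c "[:alpha:]'" '\n' | grep -v '^$'
}

printf "I hadn't seen it\n" | split_letters_only
echo "---"
printf "I hadn't seen it\n" | split_keep_apostrophe
```

The first list contains hadn and t as separate words; the second keeps hadn't intact.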
