Regular Expressions #3: grep — Data flow and building blocks
Blockitecture Project #2e by The Marmot, CC BY 2.0
In Regular Expressions #1: Introduction, I used the relatively simple example of grep to illustrate what they are and why they’re useful. In Regular Expressions #2: An example, we looked at a more complex example of the uses of regular expressions. In this third of four articles you’ll learn how to make tighter matches with your regexes.
Now let’s take a deeper look at how they’re created. Because GNU grep is one of the tools I use the most (that provides a more or less standardized implementation of regular expressions), I will use that set of expressions as the basis for this article. We will then look at sed (another tool that uses regular expressions) in a later article.
The regular expression tools covered in this series, such as grep and sed, are line-based by default. A pattern created by a combination of one or more expressions is compared against each line of a data stream. When a match is made, an action is taken on that line as prescribed by the tool being used.
For example, when a pattern match occurs with grep, the usual action is to pass that line to STDOUT and discard lines that do not match the pattern. As we saw in Getting started with regular expressions: An example, the -v option reverses those actions, so that the lines with matches are discarded.
Each line of the data stream is evaluated on its own. Think of each data stream line as a record, where the tools that use regexes process one record at a time. When a match is made, an action defined by the tool in use is taken on the line that contains the matching string.
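The record-at-a-time flow described above can be sketched with a tiny data stream. The three lines in the stream variable are made up for illustration and do not come from the sample file used later in this article.

```shell
# Hypothetical three-line data stream; grep evaluates each line on its own.
stream='alpha one
beta two
alpha three'

# Lines containing "alpha" are passed to STDOUT; the other line is discarded.
printf '%s\n' "$stream" | grep 'alpha'

# -v reverses the action: only lines without a match are printed.
printf '%s\n' "$stream" | grep -v 'alpha'
```

The first command prints the two "alpha" lines, and the second prints only "beta two".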
Regex building blocks
Figure 1 contains a list of the basic building block expressions and metacharacters implemented by the GNU grep command (and most other regex implementations), and their descriptions. When used in a pattern, each of these expressions or metacharacters matches a single character in the data stream being parsed.
Expression | Description |
---|---|
Alphanumeric characters Literals A-Z,a-z,0-9 | All alphanumeric and some punctuation characters are considered as literals. Thus the letter a in a regex will always match the letter “a” in the data stream being parsed. There is no ambiguity for these characters. Each literal character matches one and only one character. |
. (dot) | The dot (.) metacharacter is the most basic form of expression. It matches any single character in the position it is encountered in a pattern. So the pattern b.g would match “big,” “bigger,” “bag,” “baguette,” and “bog,” but not “dog,” “blog,” “hug,” “lag,” “gag,” “leg,” etc. |
Bracket expression [list of characters] | GNU grep calls this a bracket expression, and it is the same as a set for the Bash shell. The brackets enclose a list of characters to match for a single character location in the pattern. [abcdABCD] matches the letters “a,” “b,” “c,” or “d” in either upper- or lowercase. [a-dA-D] specifies a range of characters that creates the same match. [a-zA-Z] matches the alphabet in upper- and lowercase. |
[:class name:] Character classes | This is a POSIX attempt at regex standardization. The class names are supposed to be obvious. For example, the [:alnum:] class matches all alphanumeric characters. Other classes are [:digit:], which matches any one digit 0-9, [:alpha:], [:space:], and so on. Note that there may be issues due to differences in the sorting sequences in different locales. Read the grep man page for details. |
^ and $ Anchors | These two metacharacters match the beginning and ending of a line, respectively. They are said to anchor the rest of the pattern to either the beginning or end of a line. The expression ^b.g would only match “big,” “bigger,” “bag,” etc., as shown above if they occur at the beginning of the line being parsed. The pattern b.g$ would match “big” or “bag” only if they occur at the end of the line, but not “bigger.” |
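To see the building blocks in Figure 1 side by side, here is a minimal sketch that runs a few patterns against a short word list. The list in the words variable is invented for this illustration and is not part of the sample file.

```shell
# Hypothetical word list, one entry per line.
words='big
bigger
baguette
dog
blog
a bag'

# Dot: any single character between "b" and "g".
printf '%s\n' "$words" | grep 'b.g'      # big, bigger, baguette, a bag

# ^ anchors the pattern to the beginning of the line.
printf '%s\n' "$words" | grep '^b.g'     # big, bigger, baguette

# $ anchors the pattern to the end of the line.
printf '%s\n' "$words" | grep 'b.g$'     # big, a bag

# Bracket expression: "d" or "l" in that one position.
printf '%s\n' "$words" | grep '[dl]og$'  # dog, blog
```

Single quotes are used here so that the shell passes the pattern to grep unchanged.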
Example: TOC entries
Let’s explore these building blocks before continuing on with some of the modifiers. The text file we will use for this is from a lab project I created for an old Linux class I used to teach. It was originally in a LibreOffice Writer odt file but I saved it to an ASCII text file. Most of the formatting of things like tables was removed, but the result is a long ASCII text file that we can use for this series of experiments.
Let’s take a look at an example to explore what we’ve just learned. First, make the ~/testing directory your PWD (create it if you didn’t already in the previous article in this series), and then download the sample file from GitHub.
$ wget https://raw.githubusercontent.com/opensourceway/reg-ex-examples/master/Experiment_6-3.txt
To begin, use the less command to look at and explore the Experiment_6-3.txt file for a few minutes to get an idea of its content.
Now, let’s use some simple grep expressions to extract lines from the input data stream. The Table of Contents (TOC) contains a list of projects and their respective page numbers in the PDF document. Let’s extract the TOC starting with lines ending in two digits:
$ grep "[0-9][0-9]$" Experiment_6-3.txt
This command is not really what we want. It displays all lines that end in two digits and misses TOC entries with only one digit. We’ll look at how to deal with an expression for one or more digits in a later experiment. Looking at the whole file in less, we could do something like this.
$ grep "^Lab Project" Experiment_6-3.txt | grep "[0-9]$"
This command is much closer to what we want, but it is not quite there. We get some lines from later in the document that also match these expressions. If you study the extra lines and look at those in the complete document, you can see why they match while not being part of the TOC.
This command also misses TOC entries that do not start with “Lab Project.” Sometimes this result is the best you can do, and it does give a better look at the TOC than we had before. We will look at how to combine these two grep instances into a single one in a later experiment.
Now, let’s modify this command a bit and use the POSIX expression. Note the double square brackets ([[ ]]) around it. Using single brackets would generate an error message.
$ grep "^Lab Project" Experiment_6-3.txt | grep "[[:digit:]]$"
This command gives the same results as the previous attempt.
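If you want to check this behavior without the sample file, the following sketch uses four made-up lines that imitate TOC entries. The titles and page numbers in the toc variable are hypothetical.

```shell
# Hypothetical lines imitating TOC entries and body text.
toc='Lab Project 1: Hypothetical title 7
Lab Project 2: Another title 12
Chapter text that ends in 2016
Lab Project notes with no page number'

# Two digits at the end: misses the single-digit page and catches "2016".
printf '%s\n' "$toc" | grep '[0-9][0-9]$'

# Anchored prefix plus a trailing digit: only the two TOC-style lines.
printf '%s\n' "$toc" | grep '^Lab Project' | grep '[0-9]$'

# The POSIX class inside a bracket expression selects the same lines.
printf '%s\n' "$toc" | grep '^Lab Project' | grep '[[:digit:]]$'
```

The last two pipelines print the same two lines, which mirrors the result seen with the real file.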
Example: systemd
Let’s look for something different in the same file:
$ grep systemd Experiment_6-3.txt
This command lists all occurrences of “systemd” in the file. Try using the -i option to ensure that you get all instances, including those that start with uppercase letters (the official form of “systemd” is all lowercase). Or, you could change the literal expression to Systemd.
Count the number of lines containing the string systemd. I always use -i to ensure that all instances of the search expression are found regardless of case:
$ grep -i systemd Experiment_6-3.txt | wc
20 478 3098
As you can see, I have 20 lines, and you should have the same number.
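As an aside, grep can count matching lines itself with the -c option, which this article does not otherwise use. The sketch below runs on four invented lines rather than the sample file, so the counts differ from the 20 shown above.

```shell
# Hypothetical text; the last line contains "systemd" twice.
text='systemd starts services
Systemd is capitalized here
no match on this line
systemd and systemd twice'

# Case-sensitive count of matching lines.
printf '%s\n' "$text" | grep -c 'systemd'    # 2

# -i also counts the capitalized form. -c counts lines, not occurrences.
printf '%s\n' "$text" | grep -ic 'systemd'   # 3
```

Because -c counts lines, the line that mentions systemd twice is counted once, which is the same thing the first column of wc reports in the pipeline above.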
Example: Metacharacters
Here is an example of matching a metacharacter: the left bracket ([). First, let’s try without doing anything special:
$ grep -i "[" Experiment_6-3.txt
grep: Invalid regular expression
This error occurs because [ is interpreted as a metacharacter. We need to escape this character with a backslash (\) so that it is interpreted as a literal character and not as a metacharacter:
$ grep -i "\[" Experiment_6-3.txt
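Here is a small sketch of the escape on two invented lines. It also shows grep's -F option, which is not covered in this article and treats the whole pattern as a fixed string, so no escaping is needed.

```shell
# Hypothetical input: only the first line contains a left bracket.
lines='see item [1] below
no brackets here'

# Escaped bracket: matched as a literal character.
printf '%s\n' "$lines" | grep '\['

# -F disables regex interpretation, so a bare [ is already literal.
printf '%s\n' "$lines" | grep -F '['
```

Both commands print only the first line.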
Most metacharacters lose their special meaning when used inside bracket expressions:
- To include a literal ], place it first in the list.
- To include a literal ^, place it anywhere but first.
- To include a literal -, place it last.
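The placement rules above can be tried on a few made-up lines. The # character in two of the patterns is only a filler so that the list has more than one member. It does not occur in the input.

```shell
# Hypothetical input lines, each containing at most one special character.
lines='array[0]
closing ] only
caret ^ here
dash - here
plain text'

# ] placed first in the list is a literal.
printf '%s\n' "$lines" | grep '[]]'     # array[0], closing ] only

# ^ placed anywhere but first is a literal.
printf '%s\n' "$lines" | grep '[#^]'    # caret ^ here

# - placed last is a literal.
printf '%s\n' "$lines" | grep '[#-]'    # dash - here

# All three rules combined in one bracket expression.
printf '%s\n' "$lines" | grep '[]^-]'   # every line except "plain text"
```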
Repetition
Regular expressions can be modified using operators that let you specify zero, one, or more repetitions of a character or expression. These repetition operators are placed immediately following the literal character or metacharacter used in the pattern. Figure 2 lists these operators and their uses.
Operator | Description |
---|---|
? | In regexes the ? means zero or one occurrence of the preceding character. So for example, drives? matches “drive” and “drives.” It never matches the “r” in “driver,” although grep still selects a line containing “driver” because that word contains “drive.” This result is a bit different from the behavior of ? in a glob. |
* | The character preceding the * will be matched zero or more times without limit. In this example, drives* matches “drive,” “drives,” and “drivesss.” As with ?, a line containing “driver” is still selected because it contains “drive.” Again, this is a bit different from the behavior of * in a glob. |
+ | The character preceding the + will be matched one or more times. The character must exist in the line at least once for a match to occur. As one example, drives+ matches “drives,” and “drivesss” but not “drive” or “driver.” |
{n} | This operator matches the preceding character exactly n times. The expression drives{2} matches exactly “drivess,” not “drive,” “drives,” or “drivesss” as whole strings. However, because “drivesss” and “drivesssss” contain the string drivess, a match occurs on that substring, so grep still selects those lines. |
{n,} | This operator matches the preceding character n or more times. The expression drives{2,} matches “drivess,” “drivesss,” and any greater number of trailing “s” characters, but not “drive” or “drives.” |
{,m} | This operator matches the preceding character no more than m times. The expression drives{,2} matches “drive,” “drives,” and “drivess,” but not the whole of “drivesss” or any greater number of trailing “s” characters. Once again, because “drivesssss” contains the string drivess, a match occurs. |
{n,m} | This operator matches the preceding character at least n times, but no more than m times. The expression drives{1,3} matches “drives,” “drivess,” and “drivesss,” but not the whole of “drivessss” or any greater number of trailing “s” characters. Once again, because “drivesssss” contains a matching string, a match occurs. |
As an example, run each of the following commands and examine the results carefully, so that you understand what is happening:
$ grep -E "files?" Experiment_6-3.txt
$ grep -Ei "drives*" Experiment_6-3.txt
$ grep -Ei "drives+" Experiment_6-3.txt
$ grep -Ei "drives{2}" Experiment_6-3.txt
$ grep -Ei "drives{2,}" Experiment_6-3.txt
$ grep -Ei "drives{,2}" Experiment_6-3.txt
$ grep -Ei "drives{2,3}" Experiment_6-3.txt
Be sure to experiment with these modifiers on other text in the sample file.
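Because grep selects any line that contains a match, it can be hard to see exactly which strings an operator accepts. As a sketch, the commands below add the -x option, which this article does not otherwise use and which requires the pattern to match the whole line. The word list is invented for the illustration.

```shell
# Hypothetical word list, one word per line.
words='drive
drives
drivess
drivesss
driver'

# Without -x, every line is selected because each one contains "drive".
printf '%s\n' "$words" | grep -Ec 'drives?'       # 5

# With -x, the pattern must account for the entire line.
printf '%s\n' "$words" | grep -Ex 'drives?'       # drive, drives
printf '%s\n' "$words" | grep -Ex 'drives*'       # drive, drives, drivess, drivesss
printf '%s\n' "$words" | grep -Ex 'drives+'       # drives, drivess, drivesss
printf '%s\n' "$words" | grep -Ex 'drives{2}'     # drivess
printf '%s\n' "$words" | grep -Ex 'drives{2,}'    # drivess, drivesss
printf '%s\n' "$words" | grep -Ex 'drives{1,3}'   # drives, drivess, drivesss
```

The whole-line results line up with the whole-word descriptions in Figure 2, while the first command shows the substring behavior you get in normal use.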
Metacharacter modifiers
There are still some interesting and important modifiers that we need to explore. Figure 3 describes these additional metacharacters.
Modifier | Description |
---|---|
\< | This special expression matches the empty string at the beginning of a word. The expression \<fun would match “fun” and, when case is ignored with -i, “Function,” but not “refund.” |
\> | This special expression matches the empty string at the end of a word, so the word may be followed by a space, by punctuation, or by the end of the line. So environment\> matches “environment” whether it is followed by a space, a comma, or a period, but not “environments” or “environmental.” |
^ | In a bracket expression, this operator negates the list of characters when it appears first. Thus, while [a-c] matches “a,” “b,” or “c,” in that position of the pattern, [^a-c] matches anything but “a,” “b,” or “c.” |
| | When used in a regex, the | metacharacter is a logical “or” operator. It is officially called the infix or alternation operator. We have already encountered this one in Getting started with regular expressions: An example, where we saw that the regex "Team|^\s*$" means, “a line with ‘Team’ or (|) an empty line that has zero, one, or more whitespace characters such as spaces, tabs, and other unprintable characters.” |
( and ) | The parentheses ( and ) allow us to ensure a specific sequence of pattern comparison, as might be used for logical comparisons in a programming language. |
We now have a way to specify word boundaries with the \< and \> metacharacters. This means that we can now be even more explicit with our patterns. We can also use logic in more complex patterns. As an example, start with a couple of simple patterns. This first one selects all instances of drives but not drive, drivess, or additional trailing “s” characters:
$ grep -Ei "\<drives\>" Experiment_6-3.txt
Now let’s build up a search pattern to locate references to tar (the tape archive command) and related references. The first two iterations display more than just tar-related lines:
$ grep -Ei "tar" Experiment_6-3.txt
$ grep -Ei "\<tar" Experiment_6-3.txt
$ grep -Ein "\<tar\>" Experiment_6-3.txt
The -n option in the last command above displays the line numbers for each line in which a match occurred. This option can assist in locating specific instances of the search pattern.
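The narrowing effect of the word-boundary metacharacters is easier to see on a handful of lines. This sketch uses four invented lines, not the sample file, and relies on the same GNU grep \< and \> expressions used above.

```shell
# Hypothetical lines: "tar" appears as a word, inside a word,
# at the start of a word, and at the end of a word.
text='use tar to archive
restart the service
open the tarball
a guitar solo'

printf '%s\n' "$text" | grep -c 'tar'        # 4: every line contains "tar"
printf '%s\n' "$text" | grep -c '\<tar'      # 2: "tar" and "tarball"
printf '%s\n' "$text" | grep -c 'tar\>'      # 2: "tar" and "guitar"

# Both boundaries plus -n: only the standalone word, with its line number.
printf '%s\n' "$text" | grep -n '\<tar\>'    # 1:use tar to archive
```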
Tip: Matching lines of data can extend beyond a single screen, especially when searching a large file. You can pipe the resulting data stream through the less utility and then use the less search facility which implements regexes, too, to highlight the occurrences of matches to the search pattern. The search argument in less is: \<tar\>.
This next pattern searches for “shell script,” “shell program,” “shell variable,” “shell environment,” or “shell prompt” in our test document. The parentheses alter the logical order in which the pattern comparisons are resolved:
$ grep -Eni "\<shell (script|program|variable|environment|prompt)" Experiment_6-3.txt
Remove the parentheses from the preceding command and run it again to see the difference.
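The same comparison can be sketched on four invented lines, which makes the effect of the parentheses easy to count. The contents of the text variable are hypothetical.

```shell
# Hypothetical lines: only the first two contain "shell" followed by a listed word.
text='write a shell script
set a shell variable
the prompt changed
a program crashed'

# With parentheses: "shell " must be followed by one of the alternatives.
printf '%s\n' "$text" | grep -Ec '\<shell (script|program|variable|environment|prompt)'   # 2

# Without parentheses: the alternatives are "\<shell script", "program",
# "variable", "environment", and "prompt", each tried on its own.
printf '%s\n' "$text" | grep -Ec '\<shell script|program|variable|environment|prompt'     # 4
```

Without the grouping, any line that merely mentions "program" or "prompt" is selected, which is why the ungrouped pattern returns more lines.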
Wrapping up
Although we have now explored the basic building blocks of regular expressions in grep, there is an infinite variety of ways in which they can be combined to create complex yet elegant search patterns. However, grep is a search tool, and does not provide any direct capability to edit or modify a line of text in the data stream when a match is made. For that purpose, we need a tool like sed, which I cover in the next article of this series.
Note: This series is a slightly modified version from Chapter 25 of Volume 2 of my Linux self-study trilogy, Using and Administering Linux: Zero to SysAdmin, 2nd Edition.