Regular Expressions #4: Pulling it all together
Refine your understanding of building regular expressions with grep and sed.
In Regular Expressions #1: Introduction, I used the relatively simple example of grep to illustrate what regular expressions are and why they’re useful. In Regular Expressions #2: An example, we walked through a more complex example that cleans up lists of names and email addresses so they are consistent and parseable. After our dive into Regular Expressions #3: grep — Data flow and building blocks, where we looked at regular expressions in more detail, it’s now time to explore ways in which we can shorten and simplify the command-line program from the first example. Although many Linux commands and tools implement regular expressions, we’ll focus here on grep and sed.
Preparation
Download the file used in this article, Experiment_6-1.txt.
$ wget https://raw.githubusercontent.com/opensourceway/reg-ex-examples/master/Experiment_6-1.txt
Example: Simplifying the mailing list program
First, let’s look back at our first example, where we built the command-line interface (CLI) program in Figure 1.
$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" | awk '{print $1" "$2" <"$3">"}'
Figure 1: A rather complex Regular Expression like this one can seem obscure until we learn how REGEXs work.
You might find the regular expressions easier to read at this point, but this program can be simplified.
cat and grep
Let’s start by focusing on the beginning of the command, which involves cat and grep. We can combine the first two grep statements, which originally look like this:
$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" <SNIP>
Tip: When the STDOUT from grep is not piped through another utility, and when using a terminal emulator that supports color, the regex matches are highlighted in the output data stream.
Here is the revised command up to this point:
$ cat Experiment_6-1.txt | grep -vE "Team|^\s*$"
Here, we’ve added the -E option, which specifies extended regex. According to the grep man page, “In GNU grep there is no difference in available functionality between basic and extended syntaxes.” The functionality may be the same, but the syntax is not: our new combined expression fails without the -E option, because in basic syntax a bare | is a literal character and alternation has to be written as \| instead. Run the following to see the results:
$ cat Experiment_6-1.txt | grep -vE "Team|^\s*$"
Try it without the -E option.
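To make the difference between the two syntaxes concrete, here is a small sketch. It assumes GNU grep (both \s and \| are GNU extensions), and the sample lines are made up for illustration; they are not the contents of Experiment_6-1.txt.

```shell
# Made-up sample lines standing in for the article's data file (an assumption).
sample='Team 1
Leader Alice Example alice@example.com

Bob Sample bob@example.com'

# Extended syntax (-E): a bare | means "or".
ere=$(printf '%s\n' "$sample" | grep -vE 'Team|^\s*$')

# Basic syntax (no -E): a bare | is a literal pipe character, so no line
# matches and -v passes every line through, header and blank line included.
bre_bare=$(printf '%s\n' "$sample" | grep -v 'Team|^\s*$')

# Basic syntax with the backslashed alternation \| behaves like the -E form.
bre_escaped=$(printf '%s\n' "$sample" | grep -v 'Team\|^\s*$')

printf '%s\n' "$ere"
```

With this input, only the two name lines should survive in the -E and \| versions, while the bare | version returns the input unchanged.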
Now, let’s look at cat. The grep tool can also read data from a file, so we can eliminate the cat command entirely:
$ grep -vE "Team|^\s*$" Experiment_6-1.txt
This change and the previous one together leave us with the following, somewhat simplified CLI program:
$ grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" | awk '{print $1" "$2" <"$3">"}'
This command is shorter and will execute faster because grep only needs to parse the data stream once.
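If you want to convince yourself that the shorter front end produces the same result, a quick check along these lines may help. It is only a sketch: it assumes GNU grep and mktemp, and it writes made-up sample data to a temporary file instead of using Experiment_6-1.txt.

```shell
# Made-up sample data in a temporary file (an assumption, not the real file).
tmpfile=$(mktemp)
printf 'Team 1\nLeader Alice Example alice@example.com\n\n[Bob] (Sample) bob@example.com\n' > "$tmpfile"

# Original front end: cat plus two grep processes.
three_procs=$(cat "$tmpfile" | grep -v Team | grep -v '^\s*$')

# Simplified front end: a single grep that reads the file itself.
one_proc=$(grep -vE 'Team|^\s*$' "$tmpfile")

rm -f "$tmpfile"

if [ "$three_procs" = "$one_proc" ]; then
    echo "same output"
fi
```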
Note: Again, it is important to realize that this solution is not the only one. There are different methods in Bash for producing the same output, and there are other languages like Python and Perl that can also be used. And, of course, there are always LibreOffice Writer macros. But, I can always count on Bash as part of any Linux distribution. I can perform these tasks using Bash programs on any Linux computer, even one without a GUI desktop, or one that has a GUI desktop but does not have LibreOffice installed.
sed
We can also simplify the sed command. The sed utility not only allows searching for text that matches a regex pattern, it can also modify, delete, or replace the matched text. I use sed at the command line and in Bash shell scripts as a fast and easy way to locate text and alter it. The name sed stands for stream editor because it operates on data streams in the same manner as other tools that can transform a data stream. Most of those changes involve selecting specific lines from the data stream and passing them on to another transformer program.
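As a rough illustration of those three kinds of change, the sketch below runs sed on a made-up input line (not taken from the article's data file) and keeps each result in a variable so it can be inspected.

```shell
# Made-up input line for illustration only.
line='Leader Alice Example alice@example.com'

# Replace: substitute the first match of the regex with new text.
replaced=$(printf '%s\n' "$line" | sed 's/[Ll]eader/Lead/')

# Delete only the matched text: substitute it with an empty string.
stripped=$(printf '%s\n' "$line" | sed 's/[Ll]eader *//')

# Delete whole lines that match: the d command.
deleted=$(printf '%s\nBob Sample bob@example.com\n' "$line" | sed '/[Ll]eader/d')

printf '%s\n' "$replaced" "$stripped" "$deleted"
```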
Note: Many people call tools like grep filter programs, because they filter unwanted lines out of the data stream. I prefer the term transformers, because tools like sed and awk do more than just filter. They can test content for various string combinations and alter the matching content in many different ways. Tools like sort, head, tail, uniq, fmt, and more all transform the data stream in some way.
We have already seen sed in action, but now, with an understanding of regular expressions, we can better analyze and understand our earlier usage. It is possible to combine four of the five expressions used in the sed command into a single expression. The sed command now has two expressions instead of five:
sed -e "s/[Ll]eader//" -e "s/[]()\[]//g"
This format makes it a bit difficult to understand the more complex expression. Note that no matter how many expressions a single sed command contains, the data stream is only parsed once to match all of the expressions. Let’s examine the revised expression more closely:
-e "s/[]()\[]//g"
Inside a set, most metacharacters lose their special meaning, and a ] that appears immediately after the opening [ is taken as a literal member of the set rather than as its end. So, in the code above, the first [ opens the set, the ] that follows it is a literal character to match, and the last ] closes the set. The characters in between are not interpreted as metacharacters.
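We also need to match [ as a literal character in order to remove it from the data stream. Outside a set, sed interprets [ as a metacharacter, so it would have to be escaped as \[ to be read as a literal [. Inside a set, [ is already literal and the backslash (\) is simply one more member of the set, so the \[ in the middle matches both a literal [ and a literal backslash. The expression still does what we want as long as the data contains no backslashes that need to be kept; s/[]()[]//g does the same job without the backslash.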
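A quick way to see which characters the set really contains is to feed sed a couple of test strings. This is a minimal sketch using made-up input; it relies on the POSIX rule that a ] placed first in a set is literal and that a backslash inside a set is an ordinary character.

```shell
# ] placed first is a literal member; this set holds ] ( ) and [.
no_brackets=$(printf '%s\n' '[Alice] (Example)' | sed 's/[]()[]//g')

# The form with \[ also lists a backslash as a member, so it strips any
# backslashes in the input as well as the brackets and parentheses.
with_backslash=$(printf '%s\n' 'a\b [c]' | sed 's/[]()\[]//g')

printf '%s\n' "$no_brackets" "$with_backslash"
```

The first result should be Alice Example, and the second should come out as ab c, with the backslash removed along with the brackets.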
Let’s plug this new version into the CLI script and test it:
$ grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/[]()\[]//g"
I know what you are asking: “Why not place the \[ after the [ that opens the set, and before the ] character?” Try it as I did:
$ grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/[\[]()]//g"
I thought that should work, but it does not. Little unexpected results like this make it clear that we must be careful and test each regex carefully to ensure that it actually does what we intend.
After some experimentation of my own, I discovered that what matters is the position of the ] character rather than that of the \[. A ] is treated as a literal member of the set only when it comes first, immediately after the opening [. In the failing version, the first ] ends the set early, so the regex only matches a [ or a backslash followed by the literal text ()]. This behavior is noted in the grep man page, which I probably should have read first. However, I find that experimentation reinforces the things I read, and I usually discover more interesting things than what I was looking for.
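The failing expression can be probed the same way. The sketch below uses made-up input strings and assumes sed's default basic syntax, in which unescaped ( and ) outside a set are literal characters.

```shell
# [\[] is a complete set (backslash or [); the rest, ()], is literal text,
# so ordinary bracketed names are left untouched.
unchanged=$(printf '%s\n' '[Alice] (Example)' | sed 's/[\[]()]//g')

# The regex does match when that exact sequence appears in the input.
sequence=$(printf '%s\n' 'x[()]y' | sed 's/[\[]()]//g')

printf '%s\n' "$unchanged" "$sequence"
```

Here the first string should pass through unmodified, while the second should be reduced to xy.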
Adding the last component, the awk statement, our optimized program looks like this, and the results are exactly what we want:
$ grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/[]()\[]//g" | awk '{print $1" "$2" <"$3">"}'
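If you do not have the data file at hand, the whole pipeline can be tried on a few invented lines. This is a sketch under stated assumptions: the sample text is made up and only imitates the kind of input described in this series, it is fed on STDIN rather than read from Experiment_6-1.txt, and GNU grep is assumed for \s.

```shell
# Invented sample input (an assumption; not the real Experiment_6-1.txt).
sample='Team 1
Leader Alice Example alice@example.com
[Bob] (Sample) bob@example.com

Team 2
leader Carol Demo [carol@example.com]'

# Same three stages as the optimized program, reading from STDIN.
result=$(printf '%s\n' "$sample" |
    grep -vE 'Team|^\s*$' |
    sed -e 's/[Ll]eader//' -e 's/[]()\[]//g' |
    awk '{print $1" "$2" <"$3">"}')

printf '%s\n' "$result"
```

Each surviving line should come out in the form First Last <address>, for example Alice Example <alice@example.com>.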
Other tools that implement regular expressions
Many Linux tools implement regular expressions. Most of those implementations are very similar to that of awk, grep, and sed, so it should be easy to learn the differences. Although we have not looked in detail at awk, it is a powerful text-processing language that also implements regexes.
Most of the more advanced text editors use regexes. Vim, gVim, Kate, and GNU Emacs are no exceptions. The less utility implements regexes, as does LibreOffice Writer’s search and replace facility. Programming languages like Perl, awk, and Python also contain implementations of regexes, which makes them well suited to writing tools for text manipulation.
Summary
This series has provided a brief introduction to the complex world of regular expressions. We have explored the regex implementation in the grep utility in just enough depth to give you an idea of some of the amazing things that can be accomplished with regexes. We have also looked at several Linux tools and programming languages that also implement regexes.
But make no mistake! We have only scratched the surface of these tools, and of regular expressions. There is much more to learn, and, as the list below shows, there are some excellent resources for doing so.
Resources
I have found some excellent resources for learning about regular expressions. There are more than I have listed here, but these are the ones I have found to be particularly useful:
- The grep man page has a good reference but is not appropriate for learning about regular expressions.
- The O’Reilly book, Mastering Regular Expressions, by Jeffrey E. F. Friedl, is a good tutorial and reference for regular expressions. I recommend it for anyone who is or wants to be a Linux sysadmin because you will use regular expressions.
- The O’Reilly book sed & awk: UNIX Power Tools, by Arnold Robbins and Dale Dougherty, is another good one. It covers both of these powerful tools and it also has an excellent discussion of regular expressions.
There are also some good web sites that can help you learn about regular expressions, and which provide interesting and useful cookbook-style regex examples. Some of them ask for money in return for using them. Jason Baker, my Technical Reviewer for Volumes 1 and 2 of the 1st Edition of my Using and Administering Linux course, suggests regexcrossword.com as a good learning tool.
Note: This series is a slightly modified version of Chapter 25 of Volume 2 of my Linux self-study trilogy, Using and Administering Linux: Zero to SysAdmin, 2nd Edition.