Using Regular Expressions
 
Larry Reznick 
Regular expressions, which provide a means of representing
string 
patterns for searches, are supported by most of the
common UNIX utilities, 
yet many system administrators do not know how to use
them. Since 
regular expressions can be used in combination with
the existing UNIX 
editors and utilities to simplify a number of important
tasks, it's 
worthwhile to learn to work with them. 
Possibly the hardest part of mastering regular expressions is
understanding their meaning. Learning what each individual character
means is simple enough, but deciphering a particular regular expression
filled with cryptic, write-only symbols seems to be more than most
people want to do. (Sometimes it seems as if guru-hood should be
bestowed on you if you can only figure out what those funny-looking
characters are doing.)
The easiest way I know of to interpret a complex regular
expression 
is to consider each character as a separate command
in combination 
with the commands that come before it. In other words,
don't worry 
so much about the whole thing, but only about one character
at a time. 
And, since regular expressions are used in string searching commands,
always start out with the words "Find a string composed of ..."
For instance, say that you encounter the following regular
expression 
in a sed command: 
 
s/		*/	/g 
 
(where there are two tabs before the * and also 
one between the second and third /'s). The s is the
sed substitute command, and the first slash specifies
the beginning 
of the search string. (It also serves as the delimiter
between the 
search and replace strings;  if a slash is to be searched
for, any 
other character may be used for the delimiter.) Up to the star, then,
the expression reads "Find a string composed of a tab followed by
another tab." The star signifies "zero or more of the previous
character," so we have "Find a string composed of a tab followed by
another tab, zero or more, and replace it with a single tab." In other
words, any run of one or more tabs collapses to a single tab.
The trailing g tells sed to apply this globally throughout
the line. Without it, the replacement would apply only
to the first 
such match found in the line -- any others in the same
line would 
be left alone. 
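As a minimal sketch of the substitution just described, the following
shell fragment builds the tab with printf rather than typing it, so the
pattern survives copying. The function name collapse_tabs and the
sample lines are made up for illustration, and tr is used only to make
the surviving tabs visible as > characters.

```shell
# Build a literal tab with printf so it is not lost when the script is copied.
TAB=$(printf '\t')

# Hypothetical helper: collapse each run of one or more tabs into a single tab.
collapse_tabs() {
    sed "s/${TAB}${TAB}*/${TAB}/g"
}

# With g, every run on the line is collapsed.
printf 'a\t\t\tb\tc\t\td\n' | collapse_tabs | tr '\t' '>'              # prints: a>b>c>d

# Without g, only the first run is collapsed; the second is left alone.
printf 'a\t\tb\t\tc\n' | sed "s/${TAB}${TAB}*/${TAB}/" | tr '\t' '>'   # prints: a>b>>c
```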
The same syntax -- but with the :s command -- could
be 
used within a single file by the ex editor underneath
vi, 
but this would apply only to that one file, while sed
could 
be made to apply to many files. Another approach to
the same problem 
would be to use awk, which adds a metacharacter that
sed and 
ex do not understand, the + symbol, which means "one
or more." To get sed to do "one or more,"
the first 
appearance of the tab character had to be explicitly
typed, and then 
the second one had to be specified with the star. If the star had
been applied to the one-and-only tab, the "zero or more" definition
would have caused a substitution whenever a tab was not found as well
as when it was! (Because the empty string also counts as "zero tabs,"
a tab would be inserted at every position in the line -- try it
sometime, but do not save the result. Pipe the output through
cat -tve and the tab characters will appear as ^I symbols.) With awk,
though, you could specify the match by entering the + after a single
tab, which would say, "Find a string composed of a tab, one or more."
This kind of replacement
is better done with sed than awk, though, because sed
automatically outputs everything that does not match
as well as the 
results of the replacement when a match is found, while
awk would 
have to be explicitly told what to do with the non-matching
lines 
as well as the matching lines. 
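The contrast can be sketched as follows. This assumes a POSIX awk (nawk
or later) that provides gsub and understands \t inside a pattern; the
helper name collapse_tabs_awk is hypothetical. Note the explicit print,
which is what the paragraph above means by telling awk what to do with
every line.

```shell
TAB=$(printf '\t')

# awk: + means "one or more", so a single tab in the pattern is enough.
# The explicit print emits every line, whether or not anything matched.
collapse_tabs_awk() {
    awk '{ gsub(/\t+/, "\t"); print }'
}

printf 'a\t\t\tb\tc\n' | collapse_tabs_awk | tr '\t' '>'    # prints: a>b>c

# The pitfall in sed: with the star on the only tab, "zero tabs" also
# matches, so tabs appear in a line that contained none.
printf 'abc\n' | sed "s/${TAB}*/${TAB}/g" | tr '\t' '>'
```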
An example of a more complex set of regular expressions
can be found 
in the sending of man pages to the printer. In SVR4
and SCO's 
current version 4 of their SVR3, the man command now
outputs 
the actual characters for creating boldface and underlined
characters. 
The boldface is done by backspacing and overstriking
the same character 
several times before moving on to the next character,
while the underlining 
is done by writing the underscore character, then backspacing
and 
writing the actual character. On a fast terminal this
can be nice 
to read (although terminal handlers usually will not
show these unless 
the output is piped through the /usr/ucb/ul program,
which 
adds ANSI escape sequences to use the various modes
the terminal can 
produce); but on a slow terminal, such as over a modem
dialup line 
or, even worse, on the printer, this can make things
agonizingly slow. 
Regular expressions make it easy to prevent the command
from outputting 
the boldface and underline characters. Notice that the
character preceding 
the backspace will either be the underscore or the first
instance 
of a repeated character for bolding. The character following
the backspace 
is the only one wanted for the output (unless it happens
to be another 
of the bolding characters, but if it is, it will be
followed by yet 
another backspace). So, the trick is to eliminate any
character followed 
immediately by the backspace, as well as the backspace
itself: 
 
man whatever |
sed 's/.\^H//g' 
 
where "whatever" represents the man page 
to be filtered by the sed command, and the ^H represents
a backspace keystroke. Again, the sed substitute command
is 
used. The regular expression says, "Find a string
composed of 
any character followed by a backspace." The backslash
(\) before 
the backspace character escapes the backspace so that
it will be interpreted 
as a backspace character, not the usual backspacing
action that your 
keyboard filter might perform. Any string that matches
this gets replaced 
by nothing, which deletes it from the output. This operation
is done 
globally throughout the input line, and, since sed acts
on 
all the lines input, it will be performed throughout
the file. Since 
the output automatically goes to the standard output,
if you want 
to see the man page on the screen, simply pipe it through
your 
favorite pager. If it should be printed, pipe this to
the print spooler. 
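Because man output varies from system to system, the filter is easiest
to try on simulated input. The sketch below builds the backspace with
printf instead of typing it at the keyboard; the helper name
strip_overstrike and the sample text are assumptions made for
illustration.

```shell
# A literal backspace, built with printf instead of typed at the keyboard.
BS=$(printf '\b')

# Hypothetical helper: delete every "any character + backspace" pair.
strip_overstrike() {
    sed "s/.${BS}//g"
}

# Simulated man output: a bold "NAME" (each letter overstruck once)
# followed by an underlined "ls" (underscore, backspace, letter).
printf 'N\bNA\bAM\bME\bE _\bl_\bs\n' | strip_overstrike      # prints: NAME ls
```

In use, the helper would sit between man and the pager or print
spooler, in the same position as the sed command shown earlier.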
But, if ANSI escape sequences are built into the output,
say because 
you have set your PAGER variable to automatically route
the output 
of man through the /usr/ucb/ul program, how do you get
rid of those when you want to pipe the output to the
printer? Most 
of the ANSI escape sequences are of the form 
 
ESC [ params char 
 
where ESC stands for the escape character, which 
appears as ^[ to cat -tve; params indicates an 
optional number with multiple numbers separated by a
semicolon (;); 
and char refers to some alphabetic or punctuation symbol
representing 
the particular ANSI code action to be performed. 
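To see what such a sequence looks like, it can help to generate one and
display it in visible form. The sketch below uses sed's own l command
as an alternative to cat -tve; the helper name show_controls is
hypothetical, and the sample uses the common "bold on" and "reset"
sequences.

```shell
# Hypothetical helper: sed's l command writes each line in an unambiguous
# form, with non-printing characters as backslash escapes and $ at the end.
show_controls() {
    sed -n l
}

# ESC [ 1 m turns bold on and ESC [ 0 m resets it; ESC appears as \033.
printf '\033[1mbold\033[0m\n' | show_controls    # prints: \033[1mbold\033[0m$
```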
You must use regular expressions to deal with this because
the params 
and the char could be almost anything, and the params
might even be nonexistent due to reasonable default
values. Begin
with "Find a string composed of an ESC followed by a bracket,"
which would be \^[\[ (the ^[ is a representation of the escape
character, which the preceding backslash causes to be uninterpreted
by the keyboard handler; the bracket itself must be escaped since it
is a regular expression metacharacter that must function here as
a normal character).
To represent the optional digits, use [0-9]*, which
says, "any 
of the characters in the range 0 to 9, zero or more."
The bracket 
characters delimit a set of characters to be treated
as a single regular 
expression character (any of the set may be matched),
so the star 
applies to all of those in the set. This will match
any number, no 
matter how many digits there are, yet because of the
"zero or 
more" interpretation of the star, the case where
no digits are 
found will also match. Remember, too, that multiple
numbers could 
occur, such as 123;456;789, so you must include the
semicolon 
with the digits, thus [0-9;]* becomes the correct subexpression. 
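A quick way to check how much of a sequence this subexpression consumes
is to replace the matched prefix with a visible marker. In the sketch
below, the helper name mark_params and the <CSI> marker are made up for
illustration, and LC_ALL=C is set on the assumption that plain ASCII
ordering is wanted.

```shell
ESC=$(printf '\033')

# Hypothetical helper: replace the "ESC [ params" prefix with a visible
# marker so it is easy to see how much [0-9;]* consumed.
mark_params() {
    LC_ALL=C sed "s/${ESC}\[[0-9;]*/<CSI>/g"
}

# No parameter, one parameter, and two parameters separated by a semicolon.
printf '\033[mA\033[1mB\033[1;31mC\n' | mark_params    # prints: <CSI>mA<CSI>mB<CSI>mC
```

In all three cases only the final m is left behind, which shows that
the star covers the empty, single-number, and multiple-number forms
alike.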
Finally, any upper-case alphabetic character, many of the lower-case
characters, and two of the punctuation marks (specifically,
@ 
and `) might follow the optional number, and in a few
cases, a 
single space might precede the character. These characters
identify 
exactly which control function is to be used. The ANSI
and ISO committees 
specified that any of the characters between 40 hex
and 6F 
hex inclusive (except for those between 5B hex and 5F
hex inclusive) may be used without the space, and any
between 40 
hex and 52 hex, inclusive (except for those between
4A 
hex and 4D hex inclusive) may be used with the space.
We 
probably do not have to get quite that picky and could
simply represent 
this as [ @-o], which says, "any of either the
space character 
or the characters ranging from @ to o." 
The problem with this formulation is that the bracketed set matches
exactly one character. If a space is present, the match ends at the
space itself, even though the character that should follow it has not
yet been consumed. As a result, the expression completes even if
nothing but a space comes up. To avoid this, we might write instead
 *[@-o] (a space, then a star, then the bracketed set), which says,
"a space, zero or more, followed by any of the characters @ to o."
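The difference between the two formulations can be sketched on a
made-up sequence that contains the optional space. The sample text and
the helper name sample are assumptions for illustration; LC_ALL=C is
set so that the range @-o follows ASCII order.

```shell
ESC=$(printf '\033')

# Hypothetical sample: "x", then ESC [ 1 SPACE @, then "y".
sample() {
    printf 'x\033[1 @y\n'
}

# One set that includes the space: the match ends at the space itself,
# so the @ is left behind in the output.
sample | LC_ALL=C sed "s/${ESC}\[[0-9;]*[ @-o]//g"     # prints: x@y

# Optional spaces, then a mandatory final character: the whole sequence goes.
sample | LC_ALL=C sed "s/${ESC}\[[0-9;]* *[@-o]//g"    # prints: xy
```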
Now, \^[\[[0-9;]* *[@-o] becomes the full expression. Combining it
with the sed command line that eliminates the underlining
and boldfacing, we would have:
 
sed -e 's/.\^H//g' \
    -e 's/\^[\[[0-9;]* *[@-o]//g'
 
which would receive data piped into it from the man
command. (Multiple expressions are needed since two
separate searching 
operations are to be applied to every single line of
input.) 
There is another possible problem: if, because of an error in
generating the ANSI/ISO codes, more than one space appeared before the
action character, this expression would accept that as legitimate and
remove all those spaces along with the rest of the sequence. However,
since the intention here was not to handle escape code syntax checking
issues, this regular expression
will probably suffice. The ? ("zero or one")
metacharacter, 
available in awk and egrep, could handle this problem
by limiting 
acceptable values to either zero or one matching space,
but no more. 
Although the sed program does not recognize that particular
metacharacter, it does acknowledge the range (interval)
metacharacters, \{ and \}, which can be used to duplicate this
functionality. By following the space with \{0,1\}, you can specify
"a space occurring between 0 and 1 times."
So, the final sed command is: 
 
sed -e 's/.\^H//g' \
    -e 's/\^[\[[0-9;]* \{0,1\}[@-o]//g'
 
which translates as, "first expression: substitute,
find a string composed of any character followed by
the backspace, 
replace it with nothing, globally," and "second
expression: 
substitute, find a string composed of an escape character
followed 
by a bracket followed by any of the digits or a semicolon,
zero or 
more times, followed by a space, occurring between 0
and 1 times, 
followed by any of the characters between @ and o inclusive,
replace 
it with nothing, globally." 
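Put together, the final command might be wrapped up as follows. This is
a sketch rather than a drop-in tool: it swaps the typed ^H and ^[
characters for variables built with printf, the helper name clean_man
is hypothetical, and LC_ALL=C is an added assumption so that the range
@-o follows ASCII order.

```shell
BS=$(printf '\b')      # literal backspace
ESC=$(printf '\033')   # literal escape

# Hypothetical helper combining both substitutions from the text.
clean_man() {
    LC_ALL=C sed -e "s/.${BS}//g" \
                 -e "s/${ESC}\[[0-9;]* \{0,1\}[@-o]//g"
}

# Simulated input: an overstruck N, bold on/off around "foo", an
# underlined x, and a sequence with one space before its final character.
printf 'N\bNAME \033[1mfoo\033[0m _\bx\033[2 @y\n' | clean_man    # prints: NAME foo xy
```

With \{0,1\} in place, a malformed sequence containing two spaces no
longer matches and is passed through untouched instead of being
silently removed.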
The use of regular expression metacharacters is similar
to programming 
a pattern-matching-oriented little language. By examining
each of 
the regular expression metacharacters individually,
rather than trying 
to interpret the entire collection of cryptic symbols,
you can find 
and manipulate just about any pattern of characters.
Combining regular 
expressions with the common UNIX utilities enhances
the functionality 
of those utilities. In addition, making the expressions
available 
in various scripts that you or your users can work with
will make 
many jobs simpler -- while relieving you of the need
to write new 
tools. 
 
 About the Author
 
Larry Reznick has been programming professionally since
1978. He is 
currently working on systems programming in UNIX and
DOS. He teaches
C language courses at American River College in Sacramento
and is the
owner of Rezolution Technical Books. He can be reached
via email at:
rezbook!reznick@csusac.ecs.csus.edu. 
 
 