In a world filled with strings (filenames and the like) it is often very useful
to be able to talk about many items at once. This is typically done with some
form of pattern to express the class or group of items you are interested in.
This might be all the filenames which end in .txt or all lines in a log file
with the word ERROR in them.
There are two common families of pattern languages in use on Unix-like machines: regular expressions and shell globs. The latter is a smaller language than the former, and there are many variants of each. We will tackle shell globs first.
Shell Globs
When using the shell, you will sometimes want to express a pattern for multiple
files at once. For example, you might want to delete all the backup files in a
directory, which would be a command along the lines of rm *~. This command
comprises two parts: the program to run (rm) and the argument(s) given to it
(*~). In this example the argument to the command is a shell glob and will
be expanded by the shell into a list of all the files whose name ends with a
tilde. The expanded list will be passed to the rm program.
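The expansion step can be seen on its own. The sketch below uses Python's glob module, which does a similar expansion to the shell's, against a scratch directory. The expand helper and the file names are made up for illustration, and nothing is deleted.

```python
import glob
import os
import tempfile

def expand(pattern, names):
    """Create empty files called `names` in a scratch directory and
    return the sorted names that `pattern` expands to there."""
    with tempfile.TemporaryDirectory() as scratch:
        for name in names:
            # Create an empty file with this name.
            open(os.path.join(scratch, name), "w").close()
        # Escape the directory part so only `pattern` is treated as a glob.
        matches = glob.glob(os.path.join(glob.escape(scratch), pattern))
        return sorted(os.path.basename(m) for m in matches)

# Hypothetical directory contents: two backup files and two others.
files = ["notes.txt", "notes.txt~", "todo~", "report.pdf"]
expanded = expand("*~", files)
print(expanded)  # the list of names a program such as rm would be given
```

The point to take away is that the program never sees the glob itself, only the list of names it expanded to.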
Most shells have similar glob rules, and they usually consist of:
- A marker for zero-or-more characters: *
- A marker for exactly one character: ?
- A way to express one of a certain set of characters: [...]
- A way to express a choice of one or more strings: {...,...}
- A way to escape any of the above special characters: \
Some shells offer more globbing patterns, but the above are the most common. In all cases, globs are matched against the names of files in the filesystem. There is also a C library function, fnmatch(3), which does this kind of matching; its name is simply a contraction of "file name match".
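If you would like to try these rules without touching any real files, Python's fnmatch module offers a matcher in the same spirit as fnmatch(3). This is only a sketch with made-up names. Python's version understands *, ? and [...] but not {...,...} or backslash escapes, so a literal special character is written inside brackets instead.

```python
from fnmatch import fnmatchcase

# (pattern, name, expected result); all names are hypothetical.
cases = [
    ("*.txt", "notes.txt", True),       # * matches zero or more characters
    ("*.txt", ".txt", True),            # ... including none at all
    ("file?.log", "file1.log", True),   # ? matches exactly one character
    ("file?.log", "file10.log", False),
    ("[abc]at", "cat", True),           # [...] matches one character from the set
    ("[abc]at", "rat", False),
    ("what[?]", "what?", True),         # brackets make the ? literal here
    ("what[?]", "whats", False),
]

# fnmatchcase takes the name first and the pattern second.
results = [fnmatchcase(name, pattern) for pattern, name, _ in cases]
for (pattern, name, _), matched in zip(cases, results):
    print(f"{pattern!r} against {name!r}: {matched}")
```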
In our example, the glob *~ simply means zero or more characters followed by
a tilde, so it will match filenames which end in a tilde, which is the common
way to indicate a 'backup' file in Unix.
If you want to know more about shell globs, see:
- fnmatch(3)
- The 'Pattern Matching' section of bash(1)
- The equivalent section of your favoured shell manual page.
Regular Expressions
Regular expressions (often shortened to regexps) feature in many programs
although they are most commonly encountered as part of using the shell-related
tools grep, awk and sed or as part of other scripting languages such as
perl or python. Some scripting languages such as lua have other pattern
languages which are similar, although not identical, to regexps.
As with shell globs, regexps have a common core 'language' and then there are a multitude of variants. The common core properties of regexps are:
- A marker for exactly one character: .
- A way to group atoms (e.g. characters, classes or groups): (...)
- A way to indicate a single character from a class: [...]
- A way to indicate zero-or-one repetitions of the previous atom: ?
- A way to indicate zero-or-more repetitions of the previous atom: *
- A way to indicate one-or-more repetitions of the previous atom: +
- A way to escape any of the above special characters: \
- A way to anchor the start of the matched string: ^
- A way to anchor the end of the matched string: $
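Each of the properties above can be tried out in a few lines. The sketch below uses Python's re module, which is just one of the many regexp variants, and the test strings are made up for illustration.

```python
import re

# (regexp, text, expected); re.fullmatch requires the whole text to match.
cases = [
    (r"c.t", "cat", True),           # . matches exactly one character
    (r"c.t", "ct", False),
    (r"(ab)+", "ababab", True),      # (...) groups atoms, so + applies to "ab"
    (r"[0-9]x", "7x", True),         # [...] matches one character from the class
    (r"colou?r", "color", True),     # ? allows zero or one "u"
    (r"colou?r", "colouur", False),
    (r"ab*", "a", True),             # * allows zero or more "b"
    (r"ab+", "a", False),            # + requires at least one "b"
    (r"ab+", "abbb", True),
    (r"a\.b", "a.b", True),          # \ removes the special meaning of .
    (r"a\.b", "axb", False),
]
results = [re.fullmatch(rx, text) is not None for rx, text, _ in cases]

# ^ and $ matter when searching inside a longer string, such as a log line.
starts_with_error = re.search(r"^ERROR", "ERROR: disk full") is not None
error_not_at_start = re.search(r"^ERROR", "no ERROR here") is not None
ends_with_full = re.search(r"full$", "ERROR: disk full") is not None

print(results)
print(starts_with_error, error_not_at_start, ends_with_full)
```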
In regular expressions, the shell glob example we used above would be ^.*~$
and would be read as "starting at the start of the input string, any character
zero or more times, then a tilde, and then the end of the input string". As
you can see, regexps are not intrinsically anchored to the start and end of the
input, unlike shell globs. This is both very powerful and potentially
confusing, because if you omitted the ^ and $ then the regexp .*~ would match
the file name wibble~foobar, which is clearly not the intention of the glob.
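The difference anchoring makes is easy to check for yourself. Here is a small sketch comparing the glob with the anchored and unanchored regexps, using Python's fnmatch and re modules as stand-ins for the shell and for tools such as grep. The names list is made up for illustration.

```python
import re
from fnmatch import fnmatchcase

names = ["wibble~", "wibble~foobar"]  # hypothetical file names

# The glob must match the whole name.
glob_hits = [n for n in names if fnmatchcase(n, "*~")]

# The anchored regexp behaves like the glob.
anchored_hits = [n for n in names if re.search(r"^.*~$", n)]

# Without anchors the regexp only needs to match somewhere in the name.
unanchored_hits = [n for n in names if re.search(r".*~", n)]

# The part of wibble~foobar that the unanchored regexp actually matched.
matched_part = re.search(r".*~", "wibble~foobar").group()

print(glob_hits, anchored_hits, unanchored_hits, matched_part)
```

The unanchored regexp is satisfied by the leading wibble~ and simply ignores the rest of the name.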
If you wish to know more about regular expressions, then see:
- PCRE's documentation, for one different variant of regular expressions.
- Wikipedia's article on regexps, which is pretty good and goes into more of the formal theory of regular languages.
Hopefully now, regular expressions won't scare you when you see them.