An expression is a string of characters. Those characters having an interpretation above and beyond their literal meaning are called metacharacters. A quote symbol, for example, may denote speech by a person, ditto, or a meta-meaning for the symbols that follow. Regular Expressions are sets of characters and/or metacharacters that match (or specify) patterns.
A Regular Expression contains one or more of the following:
The main uses for Regular Expressions (REs) are text searches and string manipulation. An RE matches a single character or a set of characters -- a string or a part of a string.
"1133*" matches 11 + one or more 3's + possibly other characters: 113, 1133, 111312, and so forth.
"13." matches 13 + at least one of any character (including a space): 1133, 11333, but not 13 (additional character missing).
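The asterisk and dot patterns above can be checked quickly with grep. This is a minimal sketch, assuming a POSIX grep; the sample strings and the variable names star_matches and dot_matches are made up for illustration.

```shell
# Hypothetical sample strings, one per line, fed to grep.
# -c makes grep print the number of matching lines.

# "1133*" is 113 followed by zero or more 3's, found anywhere in the line.
star_matches=$(printf '%s\n' 113 1133 111312 12 | grep -c '1133*')

# "13." is 13 followed by one more character of any kind.
dot_matches=$(printf '%s\n' 1133 11333 13 | grep -c '13.')

echo "lines matching 1133*: $star_matches"   # 3 (all but 12)
echo "lines matching 13.:   $dot_matches"    # 2 (all but 13)
```

Note that 111312 matches only because it contains 113; an RE without anchors may match anywhere in a line.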
The dollar sign -- $ -- at the end of an RE matches the end of a line.
"XXX$" matches XXX at the end of a line.
"^$" matches blank lines.
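A short sketch of the end-of-line anchor and the blank-line pattern, assuming a POSIX grep. The three-line input and the variable names are hypothetical.

```shell
# Hypothetical three-line input; the second line is empty.
sample=$(printf 'ends with XXX\n\nXXX in the middle\n')

end_count=$(printf '%s\n' "$sample" | grep -c 'XXX$')   # XXX at end of line
blank_count=$(printf '%s\n' "$sample" | grep -c '^$')   # empty lines

echo "lines ending in XXX: $end_count"   # 1
echo "blank lines: $blank_count"         # 1
```

A line holding only spaces or tabs is not empty, so "^$" would not count it.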
Brackets -- [...] -- enclose a set of characters to match in a single RE.
"[xyz]" matches the characters x, y, or z.
"[c-n]" matches any of the characters in the range c to n.
"[B-Pk-y]" matches any of the characters in the ranges B to P and k to y.
"[a-z0-9]" matches any lowercase letter or any digit.
"[^b-d]" matches all characters except those in the range b to d. This is an instance of ^ negating or inverting the meaning of the following RE (taking on a role similar to ! in a different context).
Combined sequences of bracketed characters match common word patterns. "[Yy][Ee][Ss]" matches yes, Yes, YES, yEs, and so forth. "[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]" matches any string in the format of a U.S. Social Security number (three digits, two digits, four digits, separated by hyphens).
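The bracket expressions above might be exercised as follows. This is only a sketch, assuming a POSIX grep; every sample string (including the number-like ones) is invented, and LC_ALL=C is set as a precaution so that ranges follow plain ASCII order.

```shell
# All sample strings are made up. LC_ALL=C keeps ranges such as b-d
# in plain ASCII order, whatever the current locale.

# Any capitalization of "yes".
yes_count=$(printf '%s\n' yes Yes YES yEs no |
            LC_ALL=C grep -c '[Yy][Ee][Ss]')

# Lines containing a character outside the range b to d.
not_b_to_d=$(printf '%s\n' a b c d e |
             LC_ALL=C grep -c '[^b-d]')

# Three digits, two digits, four digits, separated by hyphens.
ssn_pattern='[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]'
ssn_count=$(printf '%s\n' 123-45-6789 12-345-6789 'id 987-65-4321' |
            LC_ALL=C grep -c "$ssn_pattern")

echo "yes variants: $yes_count"      # 4
echo "not in b-d:   $not_b_to_d"     # 2 (a and e)
echo "NNN-NN-NNNN:  $ssn_count"      # 2
```

The last pattern is unanchored, so it would also match the same shape embedded in a longer run of digits.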
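The backslash -- \ -- escapes a special character, which means that character gets interpreted literally. A "\$" reverts to its literal meaning of "$", rather than its RE meaning of end-of-line. Likewise a "\\" has the literal meaning of "\".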
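To see the difference between an escaped and an unescaped dollar sign, something like the following can be tried. A minimal sketch, assuming a POSIX grep; the sample lines and variable names are hypothetical.

```shell
# Hypothetical sample lines. Single quotes keep the shell from
# touching the backslashes, so grep sees them unchanged.

# \$ is a literal dollar sign.
dollar_lines=$(printf '%s\n' 'price $5' 'price 5' | grep -c '\$')

# An unescaped $ at the end of the RE is the end-of-line anchor.
ends_in_5=$(printf '%s\n' 'price $5' 'price 5' | grep -c '5$')

# \\ is a literal backslash.
backslash_lines=$(printf '%s\n' 'a\b' 'ab' | grep -c '\\')

echo "lines containing \$: $dollar_lines"           # 1
echo "lines ending in 5: $ends_in_5"                # 2
echo "lines containing a backslash: $backslash_lines"   # 1
```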
Escaped "angle brackets" -- \<...\> -- mark word boundaries.
The angle brackets must be escaped, since otherwise they have only their literal character meaning.
"\<the\>" matches the word "the", but not the words "them", "there", "other", etc.
bash$ cat textfile
This is line 1, of which there is only one instance.
This is the only instance of line 2.
This is line 3, another line.
This is line 4.

bash$ grep 'the' textfile
This is line 1, of which there is only one instance.
This is the only instance of line 2.
This is line 3, another line.

bash$ grep '\<the\>' textfile
This is the only instance of line 2.
The only way to be certain that a particular RE works is to test it.
The question mark -- ? -- matches zero or one of the previous RE. It is generally used for matching single characters.
The plus -- + -- matches one or more of the previous RE. It serves a role similar to the *, but does not match zero occurrences.
# GNU versions of sed and grep can use "+" in a basic RE,
# but it needs to be escaped. gawk takes it unescaped.

echo a111b | sed -ne '/a1\+b/p'
echo a111b | grep 'a1\+b'
echo a111b | gawk '/a1+b/'
# All of the above are equivalent.

# Thanks, S.C.
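With extended REs, the ? and + need no escaping. The following is a sketch using grep -E (the POSIX spelling of egrep); the sample words and variable names are invented for illustration.

```shell
# Hypothetical sample strings; grep -E selects extended REs,
# -c counts matching lines.

# "u?" allows zero or one u between colo and r.
optional=$(printf '%s\n' color colour colouur | grep -Ec 'colou?r')

# "1+" requires at least one 1 between a and b.
one_or_more=$(printf '%s\n' ab a1b a111b | grep -Ec 'a1+b')

echo "colou?r matches: $optional"     # 2 (color, colour)
echo "a1+b matches:    $one_or_more"  # 2 (a1b, a111b)
```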
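Escaped "curly brackets" -- \{ \} -- indicate the number of occurrences of the preceding RE to match. It is necessary to escape the curly brackets since they have only their literal character meaning otherwise. This usage was not part of the original basic RE set, although POSIX basic REs now include it.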
"[0-9]\{5\}" matches exactly five digits (characters in the range of 0 to 9).
Curly brackets are not available as an RE in the "classic" (non-POSIX compliant) version of awk. However, gawk has the --re-interval option that permits them (without being escaped).
Perl and some egrep versions do not require escaping the curly brackets.
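The escaped and unescaped forms of the curly brackets can be compared side by side. A minimal sketch, assuming a POSIX grep; the sample lines and variable names are hypothetical.

```shell
# Hypothetical sample lines.

# Basic RE: the curly brackets are escaped.
basic=$(printf '%s\n' 1234 12345 'zip 90210' | grep -c '[0-9]\{5\}')

# Extended RE (grep -E): no escaping needed.
extended=$(printf '%s\n' 1234 12345 'zip 90210' | grep -Ec '[0-9]{5}')

# Anchoring both ends restricts the match to lines of exactly five digits.
exact=$(printf '%s\n' 12345 123456 | grep -c '^[0-9]\{5\}$')

echo "basic:    $basic"      # 2
echo "extended: $extended"   # 2
echo "exact:    $exact"      # 1
```

Without the anchors, 123456 would match as well, since it contains a run of five digits.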
bash$ egrep 're(a|e)d' misc.txt
People who read seem to be better informed than those who do not.
The clarinet produces sound by the vibration of its reed.
Some versions of sed, ed, and ex support escaped versions of the extended Regular Expressions described above, as do the GNU utilities.
This is an alternate method of specifying a range of characters to match.
POSIX character classes generally require quoting or double brackets ([[ ]]).
These character classes may even be used with globbing, to a limited extent.
To see POSIX character classes used in scripts, refer to Example 15-18 and Example 15-19.
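As a quick illustration of the points above, here is a sketch that uses character classes both in a grep RE and in a glob pattern inside a case statement. The sample strings and variable names are made up.

```shell
# Hypothetical sample lines. The class name sits inside a bracket
# expression, hence the double brackets, and the RE is quoted.

digit_lines=$(printf '%s\n' abc a1c 'x y' | grep -c '[[:digit:]]')
space_lines=$(printf '%s\n' abc a1c 'x y' | grep -c '[[:space:]]')
upper_only=$(printf '%s\n' ABC AbC | grep -c '^[[:upper:]]*$')

# The same classes work, to a limited extent, in glob patterns.
case "7" in
  [[:digit:]]) kind=digit ;;
  *)           kind=other ;;
esac

echo "lines with a digit:    $digit_lines"   # 1
echo "lines with whitespace: $space_lines"   # 1
echo "all-uppercase lines:   $upper_only"    # 1
echo "7 is classified as:    $kind"          # digit
```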
Sed, awk, and Perl, used as filters in scripts, take REs as arguments when "sifting" or transforming files or I/O streams. See Example A-12 and Example A-17 for illustrations of this.
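A minimal sketch of this filter usage, with sed transforming a stream and awk sifting it. The input lines and variable names are hypothetical, and this is not taken from the Examples cited above.

```shell
# sed as a filter: strip trailing whitespace from every line.
trimmed=$(printf 'alpha   \nbeta\t\n' | sed 's/[[:space:]]*$//')

# awk as a filter: print the second field of lines that look like "id NUMBER".
# awk uses extended REs, so the + is not escaped.
ids=$(printf '%s\n' 'id 42' 'name x' 'id 7' |
      awk '/^id [0-9]+$/ { print $2 }')

printf '%s\n' "$trimmed"   # alpha and beta, without trailing blanks
printf '%s\n' "$ids"       # 42 and 7, one per line
```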
The standard reference on this complex topic is Friedl's Mastering Regular Expressions. Sed & Awk, by Dougherty and Robbins also gives a very lucid treatment of REs. See the Bibliography for more information on these books.
[1] Since sed, awk, and grep process single lines, there will usually not be a newline to match. In those cases where there is a newline in a multiple line expression, the dot will match the newline.