The Unix Shell: Regular Expressions

Regular Expressions

A regular expression (regex) is a text pattern used to search, replace and validate text. Regular expressions resemble the Unix wildcards used in globbing, but are much more powerful.

Regular expressions are used in many Unix commands such as find and grep, and also within most programming languages such as R and Python.

We only show basic usage here to get you started. To get practice, first spend some time at https://regex101.com to get a better understanding of how to use regular expressions, then find out how to use them in your text editor to do a search and replace.

Matching characters

We will practice using grep, which prints every input line that contains a match for the pattern and prints nothing if no line matches.

In [56]:
man grep | head -n 20

GREP(1)                   BSD General Commands Manual                  GREP(1)

NAME
     grep, egrep, fgrep, zgrep, zegrep, zfgrep -- file pattern searcher

SYNOPSIS
     grep [-abcdDEFGHhIiJLlmnOopqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
          [-e pattern] [-f file] [--binary-files=value] [--color[=when]]
          [--colour[=when]] [--context[=num]] [--label] [--line-buffered]
          [--null] [pattern] [file ...]

DESCRIPTION
     The grep utility searches any given input files, selecting lines that
     match one or more patterns.  By default, a pattern matches an input line
     if the regular expression (RE) in the pattern matches the input line
     without its trailing newline.  An empty expression matches every line.
     Each input line that matches at least one of the patterns is written to
     the standard output.

Literal character match

In [70]:
echo abcd | grep abcd
abcd
In [71]:
echo abcd | grep bc
abcd

No match for ac

In [4]:
echo abcd | grep ac
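
Because grep prints nothing when no line matches, scripts usually test its exit status instead: 0 when at least one line matched, 1 when none did. The following is a minimal sketch of that idea; has_match is a hypothetical helper name used only for illustration, not a standard command.

```shell
# has_match PATTERN TEXT is a hypothetical helper, not part of grep.
has_match() {
    # -q suppresses output, so only the exit status is used
    if printf '%s\n' "$2" | grep -q -e "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

has_match bc abcd   # prints: match
has_match ac abcd   # prints: no match
```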

Case insensitive match

In [9]:
echo abcd | grep -i A
abcd
In [8]:
echo abcd | grep A

Matching any single character

The . matches exactly one character.

In [5]:
echo abcd | grep a.c
abcd
In [6]:
echo abcd | grep a..c

In [7]:
echo abcd | grep a..d
abcd
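
Since . stands for any character, matching a literal period requires escaping it with a backslash. A short sketch with illustrative input strings; the patterns are single-quoted so the shell passes the backslash through to grep unchanged.

```shell
# An unescaped . matches any character, so a.c also matches "abc"
echo abc | grep 'a.c'                        # prints: abc
# Escaping the . restricts the match to a literal period
echo abc | grep 'a\.c' || echo "no match"    # prints: no match
echo a.c | grep 'a\.c'                       # prints: a.c
```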

Matching a character set

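The patterns are quoted so that the shell does not try to expand the brackets as a filename glob before grep sees them.

In [7]:
echo a2b | grep '[0123456789]'
a2b
In [8]:
echo a2b | grep '[0-9]'
a2b
In [11]:
echo a2b | grep '[abc]'
a2b
In [12]:
echo a2b | grep '[def]'

In [9]:
echo a2b | grep '[a-z]'
a2b
In [10]:
echo a2b | grep '[A-Z]'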

Exceptions

A ^ as the first character inside a character set negates it: the set then matches any single character NOT listed.

In [46]:
echo a2b | grep '[A-Z]'

In [47]:
echo a2b | grep '[^A-Z]'
a2b
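
A negated set still has to match one character, so a line is selected as soon as it contains at least one character outside the set. That is different from asking for lines with no characters from the set at all, which is what the -v option (invert the match) is for. A small sketch using digits and illustrative inputs:

```shell
# [^0-9] needs just one non-digit somewhere in the line
echo 12a | grep '[^0-9]'                     # prints: 12a
# Every character is a digit, so nothing satisfies [^0-9]
echo 123 | grep '[^0-9]' || echo "no match"  # prints: no match
# To select lines that contain no digits at all, invert the match with -v
echo abc | grep -v '[0-9]'                   # prints: abc
```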

Pre-defined character sets

Many useful sets of characters (e.g. all digits) have been pre-defined as character classes that you can use in your regular expressions. A class name such as [:digit:] is only valid inside a character set, hence the brackets within brackets. Character classes are a bit clumsy in the Unix shell, but simpler forms are often used in programming languages (e.g. ‘\d’ instead of ‘[[:digit:]]’).

In [40]:
echo a2b | grep '[[:alpha:]]'
a2b
In [43]:
echo a2b | grep '[[:digit:]]'
a2b
In [44]:
echo a2b | grep '[[:punct:]]'

In [45]:
echo a2,b | grep '[[:punct:]]'
a2,b
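
To see which characters a class actually matches, it can help to print only the matched text rather than the whole line. The sketch below assumes a grep that supports the -o option (it appears in the option list of the man page shown earlier, but it is not guaranteed on every system); the input string is illustrative.

```shell
# -o prints each match on its own line instead of the whole input line
echo 'a2,b7' | grep -o '[[:digit:]]'     # prints 2 and 7 on separate lines
echo 'a2,b7' | grep -o '[[:punct:]]'     # prints: ,
# A class can be negated like any other set member
echo 'a2,b7' | grep -o '[^[:alpha:]]'    # prints 2 , and 7 on separate lines
```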

Alternative expressions

In [105]:
echo cat | grep -E '(cat|dog)'
cat
In [106]:
echo dog | grep -E '(cat|dog)'
dog
In [107]:
echo fox | grep -E '(cat|dog)'
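
The parentheses matter once the alternatives are only part of a longer pattern, because | otherwise splits the entire expression in two. A brief sketch with illustrative words:

```shell
# The parentheses limit the alternation to the single letter a or e
echo grey | grep -E 'gr(a|e)y'                    # prints: grey
echo gray | grep -E 'gr(a|e)y'                    # prints: gray
# Without parentheses the alternatives are the whole strings "gra" and "ey"
echo key | grep -E 'gra|ey'                       # prints: key
echo key | grep -E 'gr(a|e)y' || echo "no match"  # prints: no match
```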

Character set modifiers

Anchors

^ indicates start of line and $ indicates end of line.

In [22]:
echo abcd | grep ^ab
abcd
In [23]:
echo abcd | grep ab$

In [24]:
echo abcd | grep ^cd

In [25]:
echo abcd | grep cd$
abcd
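
Using both anchors together forces the pattern to account for the whole line, which is the usual basis for validating input. A minimal sketch with illustrative inputs:

```shell
# With both anchors, the line must be exactly "abcd"
echo abcd | grep '^abcd$'                        # prints: abcd
echo xabcd | grep '^abcd$' || echo "no match"    # prints: no match
# Without anchors, a match anywhere in the line is enough
echo xabcd | grep 'abcd'                         # prints: xabcd
```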

Repeating characters

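  • ‘+’ matches one or more of the preceding character or character set
  • ‘*’ matches zero or more of the preceding character or character set
  • ‘{m,n}’ matches between m and n repeats of the preceding character or character set

Unescaped + and {m,n} act as repetition operators only when grep is given the -E (extended regular expression) option.

In [92]: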
echo abbbcd | grep abcd

In [96]:
echo abbbcd | grep -E 'ab+cd'
abbbcd
In [97]:
echo abbbcd | grep -E 'ab*cd'
abbbcd
In [98]:
echo abbbcd | grep -E 'ab{1,5}cd'
abbbcd
In [99]:
echo abbbcd | grep -E 'a[bc]+d'
abbbcd
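
Repetition counts combined with anchors and character sets are enough for simple input validation. The sketch below checks whether a string has the shape YYYY-MM-DD; looks_like_date is a hypothetical helper name, and the check covers the shape only, not whether the date exists.

```shell
# looks_like_date is a hypothetical helper, not a standard command.
# It checks the shape YYYY-MM-DD only, not whether the date is real.
looks_like_date() {
    if printf '%s\n' "$1" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}$'; then
        echo valid
    else
        echo invalid
    fi
}

looks_like_date 2024-01-31   # prints: valid
looks_like_date 24-1-31      # prints: invalid
looks_like_date 2024-13-99   # prints: valid (right shape, impossible date)
```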

Matching words with word boundaries

In [86]:
echo 'other ones go together' | grep 'the'
other ones go together
In [87]:
echo 'other ones go together' | grep '\<the\>'

In [91]:
echo 'other ones go together' | grep '\<other\>'
other ones go together
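
Here \< and \> match the empty string at the start and at the end of a word. When the whole pattern should be a complete word, the -w option is a common shorthand. The sketch assumes a grep that supports -w (it is listed in the man page shown earlier, but it is not part of every grep).

```shell
# -w only accepts matches that form a whole word
echo 'other ones go together' | grep -w 'other'                  # prints the line
# "the" occurs only inside longer words, so -w rejects it
echo 'other ones go together' | grep -w 'the' || echo "no match" # prints: no match
```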

Capture groups and back references

In [114]:
echo "123_456_123_456" | grep -E '([0-9]+).*\1'
123_456_123_456
In [116]:
echo "123_456_123_456" | grep -E '([0-9]+)_([0-9]+)_\1_\2'
123_456_123_456
In [117]:
echo "123_456_123_123" | grep -E '([0-9]+)_([0-9]+)_\1_\2'
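
Capture groups are also what make search and replace flexible, because the replacement text can reuse whatever each group matched. grep only searches, so this sketch uses sed instead, assuming a sed that accepts -E for extended regular expressions (GNU and BSD sed both do); the input strings are illustrative.

```shell
# Swap the two numbers: \1 and \2 refer to the first and second group
echo "123_456" | sed -E 's/([0-9]+)_([0-9]+)/\2_\1/'   # prints: 456_123
# Wrap every run of digits in angle brackets; g replaces all matches
echo "123_456" | sed -E 's/([0-9]+)/<\1>/g'            # prints: <123>_<456>
```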

Exercise

1. Does this match? Why or why not?

echo fox | grep -E '[cat|dog]'