The Unix Shell: Regular Expressions¶

Regular Expressions¶

A regular expression (regex) is a text pattern used to search, replace and validate text. Regular expressions resemble the Unix wildcards used in globbing, but are much more powerful.
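
To make the comparison with globbing concrete, here is a minimal sketch that tests the same file name with a glob and with a regular expression. The variable names (name, glob_result, regex_result) are only illustrative, and the anchors and escaped dot used in the regex are covered in later sections.

```shell
name="report.txt"

# Glob: * on its own stands for any run of characters.
case "$name" in
  *.txt) glob_result="match" ;;
  *)     glob_result="no match" ;;
esac

# Regex: . stands for any one character and * repeats the item before it,
# so .* plays the role of the glob *. The literal dot must be escaped.
if printf '%s\n' "$name" | grep -q '^.*\.txt$'; then
  regex_result="match"
else
  regex_result="no match"
fi

echo "glob: $glob_result, regex: $regex_result"
```

Both tests succeed here, but the pattern is spelled differently in each syntax.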

Regular expressions are used in many Unix commands such as find and grep, and also within most programming languages such as R and Python.

We only show basic usage here to get you started. To get practice, first spend some time at https://regex101.com to get a better understanding of how to use regular expressions, then find out how to use them in your text editor to do a search and replace.

Matching characters¶

We will practice using grep, which prints every input line that contains a match for the pattern and prints nothing when no line matches.

In [1]:

grep --help | head -n 20

usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
        [-e pattern] [-f file] [--binary-files=value] [--color=when]
        [--context[=num]] [--directories=action] [--label] [--line-buffered]
        [--null] [pattern] [file ...]

Literal character match¶

In [2]:

echo abcd | grep abcd

abcd

In [3]:

echo abcd | grep bc

abcd

No match for `ac`¶

In [4]:

echo abcd | grep ac
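
Because a failed match prints nothing, it can help to look at the exit status of grep instead. The sketch below assumes the standard behaviour that grep exits with 0 when a line matches and 1 when none does; found_bc and found_ac are illustrative variable names.

```shell
# -q suppresses the output, so only the exit status is used.
if echo abcd | grep -q bc; then found_bc=yes; else found_bc=no; fi
if echo abcd | grep -q ac; then found_ac=yes; else found_ac=no; fi

echo "bc: $found_bc, ac: $found_ac"   # bc: yes, ac: no
```

This is the usual way to use a regular expression as a condition in a shell script.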

Case insensitive match¶

In [12]:

echo abcd | grep -i A

abcd

In [13]:

echo abcd | grep A

Matching any single character¶

The . matches exactly one character.

In [14]:

echo abcd | grep a.c

abcd

In [15]:

echo abcd | grep a..c

In [16]:

echo abcd | grep a..d

abcd
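
Since . matches any character, it also matches a literal period. A short sketch of how escaping changes this, using grep -c to count matching lines (the variable names are illustrative):

```shell
# An unescaped . matches any character, so a.c matches both lines.
any=$(printf 'a.c\nabc\n' | grep -c 'a.c')

# Escaping the dot restricts the match to a literal period.
literal=$(printf 'a.c\nabc\n' | grep -c 'a\.c')

echo "unescaped: $any, escaped: $literal"   # unescaped: 2, escaped: 1
```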

Matching a character set¶

In [17]:

echo a2b | grep '[0123456789]'

a2b

In [18]:

echo a2b | grep '[0-9]'

a2b

In [19]:

echo a2b | grep '[abc]'

a2b

In [20]:

echo a2b | grep '[def]'

In [21]:

echo a2b | grep '[a-z]'

a2b

In [22]:

echo a2b | grep '[A-Z]'

Exceptions¶

A ^ at the start of a character set negates it: the set then matches any single character NOT listed.

In [23]:

echo a2b | grep '[A-Z]'

In [24]:

echo a2b | grep '[^A-Z]'

a2b
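
To see which characters a negated set actually matches, we can ask grep to print only the matched pieces. This sketch relies on the -o option, which is not part of POSIX but is available in both GNU and BSD grep; the variable names are illustrative.

```shell
# -o prints each matched piece on its own line; tr joins them back together.
non_digits=$(echo a2b | grep -o '[^0-9]' | tr -d '\n')
echo "$non_digits"   # ab

# A line made only of digits has nothing outside the set, so it does not match.
if echo 222 | grep -q '[^0-9]'; then only_digits=no; else only_digits=yes; fi
echo "$only_digits"  # yes
```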

Pre-defined character sets¶

Many useful sets of characters (e.g. all digits) have been pre-defined as character classes that you can use in your regular expressions. Character classes are a bit clumsy in the Unix shell, because the class name must itself sit inside a character set (e.g. [[:digit:]]), but simpler forms are often used in programming languages (e.g. \d instead of [[:digit:]]).

In [25]:

echo a2b | grep '[[:alpha:]]'

a2b

In [26]:

echo a2b | grep '[[:digit:]]'

a2b

In [27]:

echo a2b | grep '[[:punct:]]'

In [28]:

echo a2,b | grep '[[:punct:]]'

a2,b
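
The following sketch extracts the text matched by a character class instead of printing the whole line. It assumes a grep that supports -o (both GNU and BSD grep do); the sample strings and variable names are made up for illustration.

```shell
# The class name [:digit:] goes inside a character set: [[:digit:]].
# With -E, + repeats the set so the whole number is matched at once.
digits=$(echo 'room 42b' | grep -oE '[[:digit:]]+')
echo "$digits"   # 42

# [[:punct:]] picks out the comma.
punct=$(echo 'a2,b' | grep -o '[[:punct:]]')
echo "$punct"    # ,
```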

Alternative expressions¶

We use the -E option here to avoid having to escape special characters.

-E, --extended-regexp
        Interpret pattern as an extended regular expression (i.e. force
        grep to behave as egrep).

In [29]:

echo cat | grep -E '(cat|dog)'

cat

Without `-E`¶

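In basic (non-extended) mode, we need to escape the characters (, | and ) to give them their special meaning. Alternation with \| is an extension to POSIX basic regular expressions, so some grep implementations may not support it.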

In [30]:

echo cat | grep '\(cat\|dog\)'

cat
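
If escaping is a concern, alternatives can also be given as separate patterns with the standard -e option. A small sketch comparing the two approaches on a made-up three-line input (animals, with_e and extended are illustrative names):

```shell
animals=$(printf 'cat\ndog\nfox\n')

# Several -e patterns: a line is selected if it matches any of them.
with_e=$(printf '%s\n' "$animals" | grep -c -e cat -e dog)

# The same selection written as an extended regular expression.
extended=$(printf '%s\n' "$animals" | grep -cE 'cat|dog')

echo "with -e: $with_e, with -E: $extended"   # with -e: 2, with -E: 2
```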

We love dogs as well¶

In [31]:

echo dog | grep -E '(cat|dog)'

dog

But not foxes¶

In [32]:

echo fox | grep -E '(cat|dog)'

Be careful - use of square brackets means something different¶

In [33]:

echo fox | grep -E '[cat|dog]'

fox
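
The reason fox matches is that [cat|dog] is a character set, not a choice between two words. This sketch uses -o (available in GNU and BSD grep) to show the piece that matched; the variable names are illustrative.

```shell
# [cat|dog] is a set of single characters: c, a, t, |, d, o, g.
# Only the "o" in "fox" belongs to it.
set_match=$(echo fox | grep -oE '[cat|dog]')

# (cat|dog) needs the whole word "cat" or "dog".
if echo fox | grep -qE '(cat|dog)'; then group_match=yes; else group_match=no; fi

echo "set matched: $set_match, group matched: $group_match"
# set matched: o, group matched: no
```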

Character set modifiers¶

Anchors¶

^ indicates start of line and $ indicates end of line.

In [34]:

echo abcd | grep ^ab

abcd

In [35]:

echo abcd | grep ab$

In [36]:

echo abcd | grep ^cd

In [37]:

echo abcd | grep cd$

abcd
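
Using both anchors together requires the pattern to cover the whole line. A minimal sketch on a made-up three-line input, where lines, whole, starts and ends are illustrative names:

```shell
lines=$(printf 'abcd\nabcde\nxabcd\n')

# ^abcd$ selects only the line that is exactly "abcd".
whole=$(printf '%s\n' "$lines" | grep '^abcd$')

# Each anchor on its own is less strict: count the matching lines.
starts=$(printf '%s\n' "$lines" | grep -c '^abcd')   # abcd, abcde
ends=$(printf '%s\n' "$lines" | grep -c 'abcd$')     # abcd, xabcd

echo "whole: $whole, starts: $starts, ends: $ends"   # whole: abcd, starts: 2, ends: 2
```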

Repeating characters¶

+ matches one or more repeats of the preceding character or character set.
* matches zero or more repeats of the preceding character or character set.
{m,n} matches between m and n repeats of the preceding character or character set.

In [38]:

echo abbbcd | grep abcd

In [39]:

echo abbbcd | grep -E ab+cd

abbbcd

In [40]:

echo abbbcd | grep -E 'ab*cd'

abbbcd

In [41]:

echo abbbcd | grep -E 'ab{1,5}cd'

abbbcd

In [42]:

echo abbbcd | grep -E 'a[bc]+d'

abbbcd
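
To compare the three repetition operators side by side, the sketch below counts how many of four made-up strings (with 0, 1, 3 and 6 repeats of b) each pattern accepts. Anchors are added so that the whole line must match; the variable names are illustrative.

```shell
samples=$(printf '%s\n' acd abcd abbbcd abbbbbbcd)

# + needs at least one b, so "acd" is rejected.
plus=$(printf '%s\n' "$samples" | grep -cE '^ab+cd$')

# * also accepts zero b's, so all four lines match.
star=$(printf '%s\n' "$samples" | grep -cE '^ab*cd$')

# {1,3} accepts one to three b's: only "abcd" and "abbbcd".
bounded=$(printf '%s\n' "$samples" | grep -cE '^ab{1,3}cd$')

echo "plus: $plus, star: $star, bounded: $bounded"   # plus: 3, star: 4, bounded: 2
```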

Matching words with word boundaries¶

\< and \> match the start and the end of a word. That is, \<foo\> matches foo in foo bar or bar foo, but not in foobar or barfoo.

In [43]:

echo 'other ones go together' | grep 'the'

other ones go together

In [44]:

echo 'other ones go together' | grep '\<the\>'

In [45]:

echo 'other ones go together' | grep '\<other\>'

other ones go together
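
When the whole pattern should be a word, the -w option gives the same effect as wrapping it in \< and \>. This is a sketch assuming a grep that supports -w, which is not in POSIX but is available in GNU and BSD grep; the variable names are illustrative.

```shell
text='other ones go together'

# "the" only occurs inside longer words here, so neither form matches.
if echo "$text" | grep -q '\<the\>'; then boundary=match; else boundary="no match"; fi
if echo "$text" | grep -qw 'the'; then w_flag=match; else w_flag="no match"; fi

# As a standalone word it does match.
if echo 'over the hill' | grep -qw 'the'; then standalone=match; else standalone="no match"; fi

echo "$boundary / $w_flag / $standalone"   # no match / no match / match
```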

Capture groups and back references¶

Parentheses capture the text matched by the enclosed pattern; \1, \2 and so on then match that same captured text again later in the pattern.

In [46]:

echo "123_456_123_456" | grep -E '([0-9]+).*\1'

123_456_123_456

In [47]:

echo "123_456_123_456" | grep -E '([0-9]+)_([0-9]+)_\1_\2'

123_456_123_456

In [48]:

echo "123_456_123_123" | grep -E '([0-9]+)_([0-9]+)_\1_\2'
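
Capture groups are also what makes search and replace useful. As a minimal sketch, sed (not otherwise used in this section) can reuse the captured text in the replacement; the -E option for extended syntax is assumed to be available, as it is in GNU and BSD sed, and swapped is an illustrative name.

```shell
# Capture the two numbers, then write them back in reverse order.
swapped=$(echo "123_456" | sed -E 's/([0-9]+)_([0-9]+)/\2_\1/')
echo "$swapped"   # 456_123
```

Here \1 and \2 refer to the first and second parenthesised group, exactly as in the grep patterns above.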
