{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Unix Shell: Regular Expressions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regular Expressions\n",
    "\n",
    "A regular expression (regex) is a text **pattern** that can be used for searching and replacing. Regular expressions are similar to Unix wild cards used in globbing, but much more powerful, and can be used to search, replace and validate text.\n",
    "\n",
    "Regular expressions are used in many Unix commands such as `find` and  `grep`, and also within most programming languages such as R and Python.\n",
    "\n",
    "We only show basic usage here to get you started. To get practice, first spend some time at <https://regex101.com> to get a better understanding of how to use regular expressions, then find out how to use them in your text editor to do a search and replace."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Matching characters\n",
    "\n",
    "We will practice using `grep`. If a successful match is found, the line of text will be returned; otherwise nothing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "GREP(1)                   BSD General Commands Manual                  GREP(1)\n",
      "\n",
      "NAME\n",
      "     grep, egrep, fgrep, zgrep, zegrep, zfgrep -- file pattern searcher\n",
      "\n",
      "SYNOPSIS\n",
      "     grep [-abcdDEFGHhIiJLlmnOopqRSsUVvwxZ] [-A num] [-B num] [-C[num]]\n",
      "          [-e pattern] [-f file] [--binary-files=value] [--color[=when]]\n",
      "          [--colour[=when]] [--context[=num]] [--label] [--line-buffered]\n",
      "          [--null] [pattern] [file ...]\n",
      "\n",
      "DESCRIPTION\n",
      "     The grep utility searches any given input files, selecting lines that\n",
      "     match one or more patterns.  By default, a pattern matches an input line\n",
      "     if the regular expression (RE) in the pattern matches the input line\n",
      "     without its trailing newline.  An empty expression matches every line.\n",
      "     Each input line that matches at least one of the patterns is written to\n",
      "     the standard output.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "man grep | head -n 20"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Literal character match"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "abcd\n"
     ]
    }
   ],
   "source": [
    "echo abcd | grep abcd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "abcd\n"
     ]
    }
   ],
   "source": [
    "echo abcd | grep bc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### No match for `ac`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "ename": "",
     "evalue": "1",
     "output_type": "error",
     "traceback": []
    }
   ],
   "source": [
    "echo abcd | grep ac"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Case insensitive match"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "abcd\n"
     ]
    }
   ],
   "source": [
    "echo abcd | grep -i A"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "ename": "",
     "evalue": "1",
     "output_type": "error",
     "traceback": []
    }
   ],
   "source": [
    "echo abcd | grep A"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Matching any single character\n",
    "\n",
    "The `.` matches exactly one character."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "abcd\n"
     ]
    }
   ],
   "source": [
    "echo abcd | grep a.c"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "ename": "",
     "evalue": "1",
     "output_type": "error",
     "traceback": []
    }
   ],
   "source": [
    "echo abcd | grep a..c"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "abcd\n"
     ]
    }
   ],
   "source": [
    "echo abcd | grep a..d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Matching a character set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a2b\n"
     ]
    }
   ],
   "source": [
    "echo a2b | grep [0123456789]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a2b\n"
     ]
    }
   ],
   "source": [
    "echo a2b | grep [0-9]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a2b\n"
     ]
    }
   ],
   "source": [
    "echo a2b | grep [abc]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "ename": "",
     "evalue": "1",
     "output_type": "error",
     "traceback": []
    }
   ],
   "source": [
    "echo a2b | grep [def]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a2b\n"
     ]
    }
   ],
   "source": [
    "echo a2b | grep [a-z]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "ename": "",
     "evalue": "1",
     "output_type": "error",
     "traceback": []
    }
   ],
   "source": [
    "echo a2b | grep [A-Z]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exceptions\n",
    "\n",
    "The `^` within a character set says match anything NOT in the set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "ename": "",
     "evalue": "1",
     "output_type": "error",
     "traceback": []
    }
   ],
   "source": [
    "echo a2b | grep [A-Z]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a2b\n"
     ]
    }
   ],
   "source": [
    "echo a2b | grep [^A-Z]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pre-defined character sets\n",
    "\n",
    "Many useful sets of characters (e.g. all digits) have been pre-defined as [character classes](https://www.gnu.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressions.html) that you can use in your regular expressions. Character classes are a bit clumsy in the Unix shell, but simpler forms are often used in programming languages (e.g. '\\d' instead of '[:digit:]')."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a2b\n"
     ]
    }
   ],
   "source": [
    "echo a2b | grep ['[:alpha:]']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a2b\n"
     ]
    }
   ],
   "source": [
    "echo a2b | grep ['[:digit:]']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "ename": "",
     "evalue": "1",
     "output_type": "error",
     "traceback": []
    }
   ],
   "source": [
    "echo a2b | grep ['[:punct:]']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "a2,b\n"
     ]
    }
   ],
   "source": [
    "echo a2,b | grep ['[:punct:]']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Alternative expressions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cat\n"
     ]
    }
   ],
   "source": [
    "echo cat | grep -E '(cat|dog)'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dog\n"
     ]
    }
   ],
   "source": [
    "echo dog | grep -E '(cat|dog)'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {},
   "outputs": [
    {
     "ename": "",
     "evalue": "1",
     "output_type": "error",
     "traceback": []
    }
   ],
   "source": [
    "echo fox | grep -E '(cat|dog)'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Character set modifiers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Anchors\n",
    "\n",
    "`^` indicates start of line and `$` indicates end of line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "abcd\n"
     ]
    }
   ],
   "source": [
    "echo abcd | grep ^ab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "ename": "",
     "evalue": "1",
     "output_type": "error",
     "traceback": []
    }
   ],
   "source": [
    "echo abcd | grep ab$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "ename": "",
     "evalue": "1",
     "output_type": "error",
     "traceback": []
    }
   ],
   "source": [
    "echo abcd | grep ^cd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "abcd\n"
     ]
    }
   ],
   "source": [
    "echo abcd | grep cd$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Repeating characters\n",
    "\n",
    "- `+` matches one or more of the preceding character set\n",
    "- '*' matches zero or more of the preceding character set\n",
    "- '{m, n}' matches between m and n repeats of the preceding character set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "ename": "",
     "evalue": "1",
     "output_type": "error",
     "traceback": []
    }
   ],
   "source": [
    "echo abbbcd | grep abcd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "abbbcd\n"
     ]
    }
   ],
   "source": [
    "echo abbbcd | grep -E ab+cd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "abbbcd\n"
     ]
    }
   ],
   "source": [
    "echo abbbcd | grep -E ab*cd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "abbbcd\n"
     ]
    }
   ],
   "source": [
    "echo abbbcd | grep -E 'ab{1,5}cd'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "abbbcd\n"
     ]
    }
   ],
   "source": [
    "echo abbbcd | grep -E a[bc]+d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Matching words with word boundaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "other ones go together\n"
     ]
    }
   ],
   "source": [
    "echo 'other ones go together' | grep 'the'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [
    {
     "ename": "",
     "evalue": "1",
     "output_type": "error",
     "traceback": []
    }
   ],
   "source": [
    "echo 'other ones go together' | grep '<the>'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "other ones go together\n"
     ]
    }
   ],
   "source": [
    "echo 'other ones go together' | grep '\\<other\\>'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Capture groups and back references"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "123_456_123_456\n"
     ]
    }
   ],
   "source": [
    "echo \"123_456_123_456\" | grep -E '([0-9]+).*\\1'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "123_456_123_456\n"
     ]
    }
   ],
   "source": [
    "echo \"123_456_123_456\" | grep -E '([0-9]+)_([0-9]+)_\\1_\\2'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {},
   "outputs": [
    {
     "ename": "",
     "evalue": "1",
     "output_type": "error",
     "traceback": []
    }
   ],
   "source": [
    "echo \"123_456_123_123\" | grep -E '([0-9]+)_([0-9]+)_\\1_\\2'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1**. Does this match? Why or why not?\n",
    "\n",
    "```bash\n",
    "echo fox | grep -E '[cat|dog]'\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Bash",
   "language": "bash",
   "name": "bash"
  },
  "language_info": {
   "codemirror_mode": "shell",
   "file_extension": ".sh",
   "mimetype": "text/x-sh",
   "name": "bash"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}