{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The Unix Shell: Writing Shell Scripts\n", "\n", "Shell commands constitute a programming language, and command-line programs known as shell scripts can be written to perform complex tasks.\n", "\n", "This notebook provides only a brief overview. Shell scripts have many traps and pitfalls for the unwary, and for complex tasks we generally prefer languages with more consistent syntax, such as Python or R. However, shell scripts are used extensively in domains such as the preprocessing of genomics data, so shell scripting is a useful tool to know about." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Assigning variables\n", "\n", "We assign variables using `=` and retrieve their values using `$`. It is customary to spell shell variable names in ALL_CAPS." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello Joe\n", "Hello Joe\n" ] } ], "source": [ "NAME='Joe'\n", "echo \"Hello $NAME\"\n", "echo \"Hello ${NAME}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Single and double quotes\n", "\n", "The main difference between the use of '' and \"\" is that variable expansion occurs only within double quotes. For plain text, they are equivalent." ] }, { "cell_type": "code", "execution_count": 222, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "${NAME}\n" ] } ], "source": [ "echo '${NAME}'" ] }, { "cell_type": "code", "execution_count": 223, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Joe\n" ] } ], "source": [ "echo \"${NAME}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use of braces\n", "\n", "Wrapping the variable name in braces, as in `${NAME}`, unambiguously specifies the variable of interest. I suggest you always use them as a defensive programming technique." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello Joel\n" ] } ], "source": [ "echo \"Hello ${NAME}l\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Without braces, the shell looks for a variable named `NAMEl`, which is not defined and so expands to an empty string!" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello \n" ] } ], "source": [ "echo \"Hello $NAMEl\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One quirk of shell syntax is already apparent: there cannot be spaces before or after the `=` in an assignment. With a space, the shell tries to run a command instead, as the error messages below show." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bash: Joe: command not found\n", "Hello \n" ] } ], "source": [ "NAME2= 'Joe'\n", "echo \"Hello ${NAME2}\"" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bash: NAME3: command not found\n", "Hello \n" ] } ], "source": [ "NAME3 ='Joe'\n", "echo \"Hello ${NAME3}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assigning commands to variables" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/cliburn/_teach/bios-821/lessons\n" ] } ], "source": [ "pwd" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Users/cliburn/_teach/bios-821\n", "lessons\n" ] } ], "source": [ "CUR_DIR=$(pwd)\n", "dirname \"${CUR_DIR}\"\n", "basename \"${CUR_DIR}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Working with numbers\n", "\n", "**Careful**: Note the use of DOUBLE parentheses, `$(( ))`, to trigger evaluation of an arithmetic expression. Single parentheses, `$( )`, run a command instead." 
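, "\n", "\n", "As a minimal sketch of what `$(( ))` does and does not do: bash arithmetic works on integers only, so division truncates. The variable names `TOTAL`, `COUNT`, `MEAN` and `REMAINDER` are made up for this illustration.\n", "\n", "```bash\n", "TOTAL=$((1+2+3+4))\n", "COUNT=4\n", "# Integer division: 10 / 4 truncates to 2\n", "MEAN=$((TOTAL / COUNT))\n", "# The modulo operator gives the remainder: 10 % 4 is 2\n", "REMAINDER=$((TOTAL % COUNT))\n", "echo \"${MEAN} remainder ${REMAINDER}\"\n", "```\n", "\n", "For results with a fractional part, an external tool such as `bc` is needed."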
] }, { "cell_type": "code", "execution_count": 226, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10\n" ] } ], "source": [ "NUM=$((1+2+3+4))\n", "echo ${NUM}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `seq` generates a range of numbers" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "2\n", "3\n" ] } ], "source": [ "seq 3" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n", "3\n", "4\n", "5\n" ] } ], "source": [ "seq 2 5" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5\n", "7\n", "9\n" ] } ], "source": [ "seq 5 2 9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Attempt to find sum of first 5 numbers" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "26288 1\n" ] } ], "source": [ "seq 5 | sum" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "cksum(1), sum(1) - display file checksums and block counts\n", "sum(n) - Calculate a sum(1) compatible checksum\n" ] } ], "source": [ "whatis sum " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Doing this is surprisingly tricky\n", "\n", "This is another reason to use Python or R, where `sum` is just `sum`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first use process substitution, `<(seq 5)`, to treat the output of `seq 5` as a file, then pass it to `paste`, which joins the lines of the file into a single line with `+` as the delimiter." 
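, "\n", "\n", "The same joining step can also be sketched with an ordinary pipe instead of presenting the output of `seq 5` as a file. Here `-` tells `paste` to read standard input, and bash arithmetic expansion evaluates the resulting string. The variable names `EXPR` and `SUM` are made up for this illustration.\n", "\n", "```bash\n", "# Join the lines of seq's output with +, reading from standard input (-)\n", "EXPR=$(seq 5 | paste -s -d+ -)\n", "echo \"${EXPR}\"\n", "# Bash evaluates the string held in EXPR as an arithmetic expression\n", "SUM=$((EXPR))\n", "echo \"${SUM}\"\n", "```\n", "\n", "This avoids an external calculator, but it only works for integers."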
] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "paste(1) - merge corresponding or subsequent lines of files\n" ] } ], "source": [ "whatis paste" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1+2+3+4+5\n" ] } ], "source": [ "paste -s -d+ <(seq 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This string is then passed to `bc`, which evaluates strings as mathematical expressions." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bc(1) - An arbitrary precision calculator language\n" ] } ], "source": [ "whatis bc" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15\n" ] } ], "source": [ "paste -s -d+ <(seq 5) | bc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing functions\n", "\n", "Functions in bash have this structure\n", "\n", "```bash\n", "function_name () {\n", " commands\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Function arguments\n", "\n", "Function arguments are retrieved via special symbols known as positional parameters." 
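, "\n", "\n", "As a small sketch of how positional parameters behave, the hypothetical function `show_args` below prints the argument count `$#` and then each argument in turn. Quoting `\"$@\"` keeps an argument that contains spaces in one piece.\n", "\n", "```bash\n", "show_args () {\n", "    # $# holds the number of arguments passed to the function\n", "    echo \"count: $#\"\n", "    # \"$@\" expands to each argument as a separate word\n", "    for ARG in \"$@\"; do\n", "        echo \"arg: ${ARG}\"\n", "    done\n", "}\n", "\n", "show_args one \"two three\"\n", "```\n", "\n", "With the call shown, the function sees two arguments, the second being `two three`."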
] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": true }, "outputs": [], "source": [ "f () {\n", " echo $0\n", " echo $1\n", " echo $2\n", " echo $@\n", "}" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/bin/bash\n", "one\n", "two\n", "one two three\n" ] } ], "source": [ "f one two three" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Writing a sum function" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": true }, "outputs": [], "source": [ "f_sum () {\n", " paste -s -d+ \"$1\" | bc\n", "}" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "15\n" ] } ], "source": [ "f_sum <(seq 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Function to extract first line of a set of files" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hello.txt\tstderr.txt\tstdout.txt\ttest1.txt\n" ] } ], "source": [ "ls *.txt" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": true }, "outputs": [], "source": [ "headers () {\n", " # Quote \"$@\" so that file names containing spaces stay intact\n", " for FILE in \"$@\"; do\n", " echo -n \"${FILE}: \"\n", " head -n 1 \"${FILE}\"\n", " done\n", "}" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hello.txt: 1 Hello, bash\n", "stderr.txt: mkdir: foo/bar: No such file or directory\n", "stdout.txt: test1.txt: One, two buckle my shoe\n" ] } ], "source": [ "headers *.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Branching" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using if to check for file existence" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "1 Hello, bash\n", "2 Hello, again\n", "3 Hello\n", "4 again\n" ] } ], "source": [ "if [ -f hello.txt ]; then\n", " cat hello.txt\n", "else\n", " echo \"No such file\"\n", "fi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Downloading remote files" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WGET(1) GNU Wget WGET(1)\n", "\n", "\n", "\n", "NAME\n", " Wget - The non-interactive network downloader.\n", "\n", "SYNOPSIS\n", " wget [option]... [URL]...\n", "\n", "DESCRIPTION\n", " GNU Wget is a free utility for non-interactive download of files from\n", " the Web. It supports HTTP, HTTPS, and FTP protocols, as well as\n", " retrieval through HTTP proxies.\n", "\n", " Wget is non-interactive, meaning that it can work in the background,\n", " while the user is not logged on. This allows you to start a retrieval\n", " and disconnect from the system, letting Wget finish the work. 
By\n", " contrast, most of the Web browsers require constant user's presence,\n", " which can be a great hindrance when transferring a lot of data.\n" ] } ], "source": [ "man wget | head -n 20" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A data frame with 2000 observations on the following 8 variables.\n", " rank\n", " the ranking of the company.\n", " name\n", " the name of the company.\n", " country\n", " a factor giving the country the company is situated in.\n", " category\n", " a factor describing the products the company produces.\n", " sales\n", " the amount of sales of the company in billion USD.\n", " profits\n", " the profit of the company in billion USD.\n", " assets\n", " the assets of the company in billion USD.\n", " marketvalue\n", " the market value of the company in billion USD.\n" ] } ], "source": [ "wget -qO- https://vincentarelbundock.github.io/Rdatasets/doc/HSAUR/Forbes2000.html \\\n", " | html2text | head -n 27 | tail -n 17" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "if [ ! -f \"data/forbes.csv\" ]; then\n", " wget https://vincentarelbundock.github.io/Rdatasets/csv/HSAUR/Forbes2000.csv \\\n", " -O data/forbes.csv\n", "fi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conditional evaluation with `test`\n", "\n", "The `[ -f hello.txt ]` syntax is equivalent to `test -f hello.txt`, where `test` is a shell command with a large range of operators and flags that you can view in the man page." 
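, "\n", "\n", "A small sketch of this equivalence: both forms run the same check and report the result through the exit status `$?`, where 0 means true. The path `/no/such/file` is a made-up name that is assumed not to exist, and the `STATUS_*` variable names are for illustration only.\n", "\n", "```bash\n", "# The root directory exists, so both forms succeed with status 0\n", "test -d /\n", "STATUS_TEST=$?\n", "[ -d / ]\n", "STATUS_BRACKET=$?\n", "# A missing file makes the check fail with a non-zero status\n", "[ -f /no/such/file ]\n", "STATUS_MISSING=$?\n", "echo \"${STATUS_TEST} ${STATUS_BRACKET} ${STATUS_MISSING}\"\n", "```\n", "\n", "An `if` statement simply branches on this exit status."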
] }, { "cell_type": "code", "execution_count": 209, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "TEST(1) BSD General Commands Manual TEST(1)\n", "\n", "NAME\n", " test, [ -- condition evaluation utility\n", "\n", "SYNOPSIS\n", " test expression\n", " [ expression ]\n", "\n", "DESCRIPTION\n", " The test utility evaluates the expression and, if it evaluates to true,\n", " returns a zero (true) exit status; otherwise it returns 1 (false). If\n", " there is no expression, test also returns 1 (false).\n", "\n", " All operators and flags are separate arguments to the test utility.\n", "\n", " The following primaries are used to construct expression:\n", "\n", " -b file True if file exists and is a block special file.\n", "\n", " -c file True if file exists and is a character special file.\n", "\n", " -d file True if file exists and is a directory.\n", "\n", " -e file True if file exists (regardless of type).\n", "\n", " -f file True if file exists and is a regular file.\n", "\n", " -g file True if file exists and its set group ID flag is set.\n" ] } ], "source": [ "man test | head -n 30" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### For loop" ] }, { "cell_type": "code", "execution_count": 210, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hello.txt\n", "stderr.txt\n", "stdout.txt\n", "test1.txt\n" ] } ], "source": [ "for FILE in $(ls *txt); do\n", " echo $FILE\n", "done" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### While loop" ] }, { "cell_type": "code", "execution_count": 221, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10\n", "9\n", "8\n", "7\n", "6\n", "5\n", "4\n", "3\n", "2\n", "1\n" ] } ], "source": [ "COUNTER=10\n", "while [ $COUNTER -gt 0 ]; do\n", " echo $COUNTER\n", " COUNTER=$(($COUNTER - 1))\n", "done" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "**Careful**: Note that `<` is the redirection operator, and hence will lead to an infinite loop. Use `-lt` for less than and `-gt` for greater than, `==` for equality and `!=` for inequality." ] }, { "cell_type": "code", "execution_count": 225, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10\n", "9\n", "8\n", "7\n", "6\n", "5\n", "4\n", "3\n", "2\n", "1\n" ] } ], "source": [ "COUNTER=10\n", "while [ $COUNTER != 0 ]; do\n", " echo $COUNTER\n", " COUNTER=$(($COUNTER - 1))\n", "done" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Shell script" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From now on, we will write the shell script using an editor for convenience. For a syntax-highlighted display, I use a non-standard Python program `pygmentize` that you can install with \n", "\n", "```\n", "pip install pygments\n", "```\n", "\n", "but you can also just use `cat` to display the file contents.\n", "\n", "A shell script is traditionally given the extension `.sh`. There are a few things to note:\n", "\n", "1. To make the script standalone, you need to add `#!/path/to/shell` in the first line. Otherwise you need to call the script with `bash /path/to/script` instead of just `/path/to/script`.\n", "2. To make the script executable, change the file permissions to executable with `chmod +x /path/to/script`\n", "3. Shell arguments are similar to function arguments - i.e. `$1`, `$2`, `$@` etc. Another useful variable is `$#` which gives the number of command line arguments." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Find default shell to use" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/bin/bash\n" ] } ], "source": [ "which bash" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Display script" ] }, { "cell_type": "code", "execution_count": 199, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[37m#!/bin/bash\u001b[39;49;00m\n", "\n", "\u001b[34mif\u001b[39;49;00m [ -f \u001b[31m$1\u001b[39;49;00m ]; \u001b[34mthen\u001b[39;49;00m\n", " cat \u001b[31m$1\u001b[39;49;00m\n", "\u001b[34melse\u001b[39;49;00m\n", " \u001b[36mecho\u001b[39;49;00m \u001b[33m\"\u001b[39;49;00m\u001b[33mNo such file: \u001b[39;49;00m\u001b[31m$1\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\n", "\u001b[34mfi\u001b[39;49;00m\n" ] } ], "source": [ "pygmentize -g scripts/cat_if_exists.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Give executable permission" ] }, { "cell_type": "code", "execution_count": 200, "metadata": { "collapsed": true }, "outputs": [], "source": [ "chmod +x scripts/cat_if_exists.sh" ] }, { "cell_type": "code", "execution_count": 202, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 Hello, bash\n", "2 Hello, again\n", "3 Hello\n", "4 again\n" ] } ], "source": [ "scripts/cat_if_exists.sh hello.txt" ] }, { "cell_type": "code", "execution_count": 204, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "No such file: goodbye.txt\n" ] } ], "source": [ "scripts/cat_if_exists.sh goodbye.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading a file line by line\n", "\n", "We will write a script to extract headers from a FASTA Nucleic Acid (FNA) file. Headers in FASTA format are lines that begin with the `>` character." 
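, "\n", "\n", "One possible way to do this, not necessarily how the course script does it, is sketched below as a function that reads standard input line by line and prints only the lines starting with `>`. The function name `extract_headers` and the tiny inline sample are made up for this illustration.\n", "\n", "```bash\n", "extract_headers () {\n", "    # IFS= and -r keep leading whitespace and backslashes intact\n", "    while IFS= read -r LINE; do\n", "        # Print the line only if its first character is >\n", "        case \"${LINE}\" in\n", "            '>'*) echo \"${LINE}\" ;;\n", "        esac\n", "    done\n", "}\n", "\n", "printf '>seq 1\\nacgt\\n\\n>seq 2\\nttga\\n' | extract_headers\n", "```\n", "\n", "Using `case` with a quoted pattern sidesteps the empty-line problem that an unquoted `[ ... ]` comparison can run into."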
] }, { "cell_type": "code", "execution_count": 235, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">random sequence 1 consisting of 1000 bases.\n", "acggacaaacggttgatgtggttcttcgcaggatgcgccaaagtgtttacaaggctggta\n", "aactgagaatgtgcttgttccccgtctcacgcaaagatatgaggcgtaagagaccgacat\n", "attccctcctccataggtctttttgattattgatcactgcttcgccacccttagcgtggt\n", "gtctttcatagtctcaccgttaaacggcgacgttcgtgaacctgctcagtccctaaactc\n", "gataacaatcgggctgtgttggaagctagtattatcggcattcaggtagtagtcccccgg\n", "actagcacggtccgggtctggttgcacatacatggtagcgaaattccgctcctccagccc\n", "agaataaaggtagaagaccaatgcccgggtaaaaaactcaacgagtaggtcccacgatta\n", "tctgagtggtgaactatgctgaggacgacaatatcatcggagtgttcactagggtgcggg\n", "gttgactataagtgtagtctgatcatagagactccgcatattcggctacgctctataact\n", "aatttgacgaatgctgcgaacgcacctgcgtatcgcttccttctaacctcaggcggtcat\n", "tatcatgtcaaacaacaagagtaggtttatggcatcgacacgcatgactgcgtaacgagt\n", "cacacgccagacgtctaagcagtgcaatgccagcgtctatgaagctcttaattagcgggt\n", "ttacacttgcattgagtgaaatgtgccaagagcctactacaacccgcagccggcatatgg\n", "gatcaagcgaggcaatttgatgcgcccccaaagcacgcgaaaaaagagcttggacccgga\n", "agaaaacgatgttctgggtccgtcaagcctgcgtacagcttatccaacttttaagtggac\n", "gtgtccgcagacaagcacacagggagggctcgccaaaaaaattgctgtatctagtacaag\n", "gtagctaatagctccggaccgaccacctttccggactgcc\n", "\n", ">random sequence 2 consisting of 1000 bases.\n", "tgcgcattctcctatacatatgacgatctggtaccatgcgatagcggtcgccgagataat\n", "ataccaaaagacatatgtcttctccgcaccctgttcctcctaccagccacaggctctgca\n", "gcctctctcactccccgatcgagaaagattgggggttaacaataacactttttacgtcgg\n" ] } ], "source": [ "cat data/example.fna | head -n 23" ] }, { "cell_type": "code", "execution_count": 232, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[37m#!/bin/bash\u001b[39;49;00m\n", "\n", "\u001b[34mwhile\u001b[39;49;00m \u001b[36mread\u001b[39;49;00m LINE\n", " \u001b[34mdo\u001b[39;49;00m\n", " \u001b[34mif\u001b[39;49;00m [ 
\u001b[33m\"\u001b[39;49;00m\u001b[33m${\u001b[39;49;00m\u001b[31mLINE\u001b[39;49;00m:\u001b[31m0\u001b[39;49;00m:\u001b[31m1\u001b[39;49;00m\u001b[33m}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m == \u001b[33m'>'\u001b[39;49;00m ]; \u001b[34mthen\u001b[39;49;00m\n", " \u001b[36mecho\u001b[39;49;00m \u001b[31m$LINE\u001b[39;49;00m\n", " \u001b[34mfi\u001b[39;49;00m\n", " \u001b[34mdone\u001b[39;49;00m \n" ] } ], "source": [ "pygmentize scripts/extract_headers.sh" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Careful**: You need to put all variables in the test condition within double quotes. If not, when the variable is empty or undefined (e.g. empty line) it vanishes and leaves `[ == '>' ]` which raises a syntax error." ] }, { "cell_type": "code", "execution_count": 227, "metadata": { "collapsed": true }, "outputs": [], "source": [ "chmod +x scripts/extract_headers.sh" ] }, { "cell_type": "code", "execution_count": 231, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">random sequence 1 consisting of 1000 bases.\n", ">random sequence 2 consisting of 1000 bases.\n", ">random sequence 3 consisting of 1000 bases.\n", ">random sequence 4 consisting of 1000 bases.\n", ">random sequence 5 consisting of 1000 bases.\n", ">random sequence 6 consisting of 1000 bases.\n", ">random sequence 7 consisting of 1000 bases.\n", ">random sequence 8 consisting of 1000 bases.\n", ">random sequence 9 consisting of 1000 bases.\n", ">random sequence 10 consisting of 1000 bases.\n" ] } ], "source": [ "cat data/example.fna | scripts/extract_headers.sh" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Bash", "language": "bash", "name": "bash" }, "language_info": { "codemirror_mode": "shell", "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" } }, "nbformat": 4, "nbformat_minor": 2 }