The Unix Shell: Writing Shell Scripts

Shell commands constitute a programming language, and command-line programs known as shell scripts can be written to perform complex tasks.

This section provides only a brief overview: shell scripts have many traps and pitfalls for the unwary, and we generally prefer languages with more consistent syntax, such as Python or R, for complex tasks. However, shell scripts are used extensively in domains such as the preprocessing of genomics data, so they are a useful tool to know.

Assigning variables

We assign variables using = and recall them by using $. It is customary to spell shell variable names in ALL_CAPS.

In [1]:
NAME='Joe'
echo "Hello $NAME"
echo "Hello ${NAME}"
Hello Joe
Hello Joe

Single and double quotes

The main difference between single quotes ('') and double quotes ("") is that variable expansion only occurs within double quotes. For plain text, they are equivalent.

In [2]:
echo '${NAME}'
${NAME}
In [3]:
echo "${NAME}"
Joe
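
As a small sketch of the same rule, the snippet below puts a literal dollar sign next to a variable. The variable names LITERAL and EXPANDED are made up for this illustration.

```shell
NAME='Joe'
# Single quotes keep every character literal, including $ and braces
LITERAL='Cost: $5 for ${NAME}'
# Double quotes expand variables; a backslash keeps a literal dollar sign
EXPANDED="Cost: \$5 for ${NAME}"
echo "${LITERAL}"
echo "${EXPANDED}"
```

The first line is printed exactly as typed, while the second ends in "for Joe".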

Use of curly braces

Use of curly braces unambiguously specifies the variable of interest. I suggest you always use them as a defensive programming technique.

In [4]:
echo "Hello ${NAME}l"
Hello Joel

Without the braces, the shell looks for a variable called NAMEl, which is not defined and so expands to an empty string!

In [5]:
echo "Hello $NAMEl"
Hello

One quirk of shell syntax is worth noting right away: there cannot be spaces before or after the = in an assignment.

In [6]:
NAME2= 'Joe'
echo "Hello ${NAME2}"
bash: Joe: command not found
Hello

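The previous instruction sets NAME2 to the empty string, and only for the duration of one command, then tries to execute Joe as that command. Since the command fails, NAME2 is still unset when echo runs.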

In [7]:
NAME3 ='Joe'
echo "Hello ${NAME3}"
bash: NAME3: command not found
Hello

The previous instruction runs the command NAME3 with ='Joe' (that is, the string =Joe) as its argument.
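
The NAME2= 'Joe' case behaves as it does because the shell accepts an assignment placed directly before a command and applies it to that command only. A minimal sketch of this feature, with the variable names GREETING and SEEN chosen for illustration:

```shell
GREETING='Hello'
# The assignment prefix is visible only to the one command that follows it
SEEN=$(GREETING='Goodbye' sh -c 'echo "${GREETING}"')
echo "Child command saw: ${SEEN}"
echo "Current shell has: ${GREETING}"
```

The child command sees Goodbye, while the value in the current shell remains Hello.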

Assigning command output to variables

In [8]:
pwd
/Users/cliburn/_teach/HTS_SummerCourse_2017/Materials/Computation/Wk4_Day3_PM
In [9]:
CUR_DIR=$(pwd)
dirname "${CUR_DIR}"
basename "${CUR_DIR}"
/Users/cliburn/_teach/HTS_SummerCourse_2017/Materials/Computation
Wk4_Day3_PM
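
The $( ) form captures whatever the enclosed command prints, so it also works with pipelines and can be nested. The following is only a sketch; the variable names and the path /usr/local/bin are chosen for illustration.

```shell
# Capture the output of a pipeline; tr strips the padding some wc versions add
N_LINES=$(printf 'a\nb\nc\n' | wc -l | tr -d ' ')
# Substitutions can be nested: the inner dirname runs first
PARENT=$(basename "$(dirname /usr/local/bin)")
echo "${N_LINES}"
echo "${PARENT}"
```

Here N_LINES holds 3, and PARENT holds local, the last component of /usr/local.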

Working with numbers

Careful: Note the use of double parentheses, $(( )), to trigger evaluation of an arithmetic expression. Only integer arithmetic is supported.

In [10]:
NUM=$((1+2+3+4))
echo ${NUM}
10
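
A short sketch of a few more arithmetic operators. Inside $(( )) variables can be written without the leading $, and division truncates to an integer. The variable names are illustrative.

```shell
A=7
B=2
SUM=$((A + B))         # 9
QUOTIENT=$((A / B))    # 3, because integer division truncates
REMAINDER=$((A % B))   # 1
echo "${SUM} ${QUOTIENT} ${REMAINDER}"
```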

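seq generates a range of numbers. With one argument it counts from 1; with two it counts from the first to the second; with three, the middle argument is the step size.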

In [11]:
seq 3
1
2
3
In [12]:
seq 2 5
2
3
4
5
In [13]:
seq 5 2 9
5
7
9
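
One option that may be handy when numbering files is -w, which pads the numbers with leading zeros to equal width. A minimal sketch, where the variable name SAMPLE_IDS is made up:

```shell
# -w pads with leading zeros so that all numbers have the same width
SAMPLE_IDS=$(seq -w 8 10)
echo "${SAMPLE_IDS}"
```

This prints 08, 09 and 10 on separate lines.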

Branching

Using if to check for file existence

Note that the test condition is enclosed in square brackets, and that the spaces just inside each bracket are required.

In [14]:
if [ -f hello.txt ]; then
    cat hello.txt
else
    echo "No such file"
fi
1 Hello, bash
2 Hello, again
3 Hello
4 again
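
Additional cases can be chained with elif. The sketch below classifies a path; the variable names TARGET and KIND are illustrative, and / is used only because it exists as a directory on any Unix system.

```shell
TARGET=/
if [ -d "${TARGET}" ]; then
    KIND='directory'
elif [ -f "${TARGET}" ]; then
    KIND='regular file'
else
    KIND='missing'
fi
echo "${TARGET}: ${KIND}"
```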

Downloading remote files

In [15]:
man wget | head -n 20
WGET(1)                            GNU Wget                            WGET(1)



NAME
       Wget - The non-interactive network downloader.

SYNOPSIS
       wget [option]... [URL]...

DESCRIPTION
       GNU Wget is a free utility for non-interactive download of files from
       the Web.  It supports HTTP, HTTPS, and FTP protocols, as well as
       retrieval through HTTP proxies.

       Wget is non-interactive, meaning that it can work in the background,
       while the user is not logged on.  This allows you to start a retrieval
       and disconnect from the system, letting Wget finish the work.  By
       contrast, most of the Web browsers require constant user's presence,
       which can be a great hindrance when transferring a lot of data.
In [16]:
wget -qO- https://vincentarelbundock.github.io/Rdatasets/doc/HSAUR/Forbes2000.html \
    | html2text | head -n 27  | tail -n 17
A data frame with 2000 observations on the following 8 variables.
  rank
      the ranking of the company.
  name
      the name of the company.
  country
      a factor giving the country the company is situated in.
  category
      a factor describing the products the company produces.
  sales
      the amount of sales of the company in billion USD.
  profits
      the profit of the company in billion USD.
  assets
      the assets of the company in billion USD.
  marketvalue
      the market value of the company in billion USD.
In [17]:
if [ ! -f "data/forbes.csv" ]; then
    wget https://vincentarelbundock.github.io/Rdatasets/csv/HSAUR/Forbes2000.csv \
    -O data/forbes.csv
fi

Conditional evaluation with test

The [ -f hello.txt ] syntax is equivalent to test -f hello.txt, where test is a shell command with a large range of operators and flags that you can view in the man page.

TEST(1)                   BSD General Commands Manual                  TEST(1)

NAME
     test, [ -- condition evaluation utility

SYNOPSIS
     test expression
     [ expression ]

DESCRIPTION
     The test utility evaluates the expression and, if it evaluates to true,
     returns a zero (true) exit status; otherwise it returns 1 (false).  If
     there is no expression, test also returns 1 (false).

     All operators and flags are separate arguments to the test utility.

     The following primaries are used to construct expression:

     -b file       True if file exists and is a block special file.

     -c file       True if file exists and is a character special file.

     -d file       True if file exists and is a directory.

     -e file       True if file exists (regardless of type).

     -f file       True if file exists and is a regular file.

     -g file       True if file exists and its set group ID flag is set.
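
Since test and [ are ordinary commands, their result is an exit status that can be inspected with $?. A small sketch, with DIR_STATUS and FILE_STATUS as illustrative names and /no/such/file assumed not to exist:

```shell
# A true condition gives exit status 0
test -d /
DIR_STATUS=$?
# A false condition gives exit status 1; the || branch records it
[ -f /no/such/file ] || FILE_STATUS=$?
echo "test -d / returned ${DIR_STATUS}"
echo "[ -f /no/such/file ] returned ${FILE_STATUS}"
```

This is why if works with any command, not only with [ ]: it simply branches on the exit status.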

Looping

For loop

In [18]:
for FILE in *ipynb; do
    echo "$FILE"
done
The_Unix_Shell_01___File_and_Directory_Management.ipynb
The_Unix_Shell_03___Working_with_Text.ipynb
The_Unix_Shell_04___Regular_Expresssions.ipynb
The_Unix_Shell_05___Finding_Stuff.ipynb
The_Unix_Shell_06___Shell_Scripts.ipynb
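
One pitfall worth knowing: looping over the output of ls splits file names at spaces, whereas looping over a glob pattern keeps each name intact. The sketch below shows the difference in a throwaway directory created with mktemp; the file names are made up.

```shell
# Work in a temporary directory so that nothing real is touched
WORK_DIR=$(mktemp -d)
touch "${WORK_DIR}/a.txt" "${WORK_DIR}/b c.txt"

N_GLOB=0
for FILE in "${WORK_DIR}"/*.txt; do
    N_GLOB=$((N_GLOB + 1))      # one iteration per file
done

N_LS=0
for FILE in $(ls "${WORK_DIR}"); do
    N_LS=$((N_LS + 1))          # 'b c.txt' is split into two words
done

echo "glob: ${N_GLOB} iterations, ls: ${N_LS} iterations"
rm -r "${WORK_DIR}"
```

With two files, the glob loop runs twice but the ls loop runs three times.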

While loop

In [19]:
COUNTER=10
while [ $COUNTER -gt 0 ]; do
    echo $COUNTER
    COUNTER=$(($COUNTER - 1))
done
10
9
8
7
6
5
4
3
2
1

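Careful: Inside [ ], < and > are the redirection operators rather than comparisons. For example, [ $COUNTER > 0 ] writes to a file named 0 and always succeeds, which leads to an infinite loop. For integers, use -lt for less than, -gt for greater than, -eq for equality and -ne for inequality. The == and != operators compare strings, which is why != 0 works in the countdown loop below.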
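
The difference between comparing as strings and comparing as integers shows up with values such as 007 and 7. A minimal sketch, with illustrative variable names:

```shell
# = compares character by character, so '007' and '7' differ as strings
if [ '007' = '7' ]; then AS_STRING='equal'; else AS_STRING='different'; fi
# -eq compares the integer values, so 007 and 7 are equal
if [ '007' -eq '7' ]; then AS_NUMBER='equal'; else AS_NUMBER='different'; fi
echo "as strings: ${AS_STRING}, as numbers: ${AS_NUMBER}"
```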

In [20]:
COUNTER=10
while [ $COUNTER != 0 ]; do
    echo $COUNTER
    COUNTER=$(($COUNTER - 1))
done
10
9
8
7
6
5
4
3
2
1

Shell script

From now on, we will write shell scripts in an editor for convenience. For a syntax-highlighted display, I use pygmentize, a Python program that is not part of the standard library and that you can install with

pip install pygments

but you can also just use cat to display the file contents.

A shell script is traditionally given the extension .sh. There are a few things to note:

  1. To make the script standalone, add a line of the form #!/path/to/shell as the first line. Otherwise you need to call the script with bash /path/to/script instead of just /path/to/script.
  2. To make the script executable, change the file permissions with chmod +x /path/to/script
  3. Script arguments work like function arguments and are available as $1, $2 and so on, with $@ expanding to all of them. Another useful variable is $#, which gives the number of command line arguments.
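
To see the argument variables in action without writing a script file, the sketch below uses set -- to fill in the positional parameters by hand, as if a script had been called with three arguments. The argument values are made up.

```shell
# 'set --' replaces the positional parameters, simulating a call such as
#   /path/to/script first 'second arg' third
set -- first 'second arg' third
echo "Number of arguments: $#"
echo "First argument: $1"
N=0
for ARG in "$@"; do         # quoted "$@" keeps 'second arg' as one item
    N=$((N + 1))
    echo "${N}: ${ARG}"
done
```

The loop runs three times, because "$@" in double quotes preserves each argument even when it contains spaces.
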
In [21]:
which bash
/bin/bash
In [22]:
pygmentize -g scripts/cat_if_exists.sh
#!/bin/bash

if [ -f "$1" ]; then
    cat "$1"
else
    echo "No such file: $1"
fi
In [23]:
chmod +x scripts/cat_if_exists.sh
In [24]:
scripts/cat_if_exists.sh hello.txt
1 Hello, bash
2 Hello, again
3 Hello
4 again
In [25]:
scripts/cat_if_exists.sh goodbye.txt
No such file: goodbye.txt

Reading a file line by line

We will write a script to extract headers from a FASTA Nucleic Acid (FNA) file. Headers in FASTA format are lines that begin with the > character.

In [26]:
cat data/example.fna | head -n 23
>random sequence 1 consisting of 1000 bases.
acggacaaacggttgatgtggttcttcgcaggatgcgccaaagtgtttacaaggctggta
aactgagaatgtgcttgttccccgtctcacgcaaagatatgaggcgtaagagaccgacat
attccctcctccataggtctttttgattattgatcactgcttcgccacccttagcgtggt
gtctttcatagtctcaccgttaaacggcgacgttcgtgaacctgctcagtccctaaactc
gataacaatcgggctgtgttggaagctagtattatcggcattcaggtagtagtcccccgg
actagcacggtccgggtctggttgcacatacatggtagcgaaattccgctcctccagccc
agaataaaggtagaagaccaatgcccgggtaaaaaactcaacgagtaggtcccacgatta
tctgagtggtgaactatgctgaggacgacaatatcatcggagtgttcactagggtgcggg
gttgactataagtgtagtctgatcatagagactccgcatattcggctacgctctataact
aatttgacgaatgctgcgaacgcacctgcgtatcgcttccttctaacctcaggcggtcat
tatcatgtcaaacaacaagagtaggtttatggcatcgacacgcatgactgcgtaacgagt
cacacgccagacgtctaagcagtgcaatgccagcgtctatgaagctcttaattagcgggt
ttacacttgcattgagtgaaatgtgccaagagcctactacaacccgcagccggcatatgg
gatcaagcgaggcaatttgatgcgcccccaaagcacgcgaaaaaagagcttggacccgga
agaaaacgatgttctgggtccgtcaagcctgcgtacagcttatccaacttttaagtggac
gtgtccgcagacaagcacacagggagggctcgccaaaaaaattgctgtatctagtacaag
gtagctaatagctccggaccgaccacctttccggactgcc

>random sequence 2 consisting of 1000 bases.
tgcgcattctcctatacatatgacgatctggtaccatgcgatagcggtcgccgagataat
ataccaaaagacatatgtcttctccgcaccctgttcctcctaccagccacaggctctgca
gcctctctcactccccgatcgagaaagattgggggttaacaataacactttttacgtcgg
In [27]:
pygmentize scripts/extract_headers.sh
#!/bin/bash

# IFS= and -r make read keep leading whitespace and backslashes unchanged
while IFS= read -r LINE; do
    if [ "${LINE:0:1}" == '>' ]; then
        echo "${LINE}"
    fi
done

The ${X:m:n} expression extracts n characters of X, starting at the zero-based offset m.

In [28]:
LINE=">random sequence 1 consisting of 1000 bases."
echo "${LINE:0:1}"
>
In [29]:
echo "${LINE:5:10}"
om sequenc
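
Two related bash expansions, sketched here with illustrative variable names: leaving out the length takes everything up to the end of the string, and ${#X} gives the length of X.

```shell
LINE=">random sequence 1 consisting of 1000 bases."
FIRST="${LINE:0:1}"     # 1 character starting at offset 0
REST="${LINE:1}"        # no length given: everything from offset 1 onwards
LENGTH="${#LINE}"       # number of characters in LINE
echo "${FIRST}"
echo "${REST}"
echo "${LENGTH}"
```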

Careful: Put every variable in a test condition within double quotes. Otherwise, when the variable is empty or undefined (e.g. on an empty line), it vanishes and leaves [ == '>' ], which raises a syntax error.
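
The effect of a missing pair of quotes can be checked directly. In this sketch, EMPTY stands in for an empty line or a missing argument, and the other variable names are illustrative. Note the last case: with an unquoted empty variable, [ -f ] does not fail but is silently true, because a single-argument test only checks that the string -f is non-empty.

```shell
EMPTY=''

# Quoted: [ receives an empty first operand, so the comparison is simply false
[ "${EMPTY}" = '>' ] && QUOTED='true' || QUOTED='false'

# Unquoted: the operand vanishes and [ = '>' ] is an error (message hidden here)
[ ${EMPTY} = '>' ] 2>/dev/null && UNQUOTED='true' || UNQUOTED="error, exit status $?"

# Unquoted with -f: [ -f ] only checks that the string "-f" is non-empty
[ -f ${EMPTY} ] && UNQUOTED_F='true' || UNQUOTED_F='false'
[ -f "${EMPTY}" ] && QUOTED_F='true' || QUOTED_F='false'

echo "quoted comparison:   ${QUOTED}"
echo "unquoted comparison: ${UNQUOTED}"
echo "unquoted -f:         ${UNQUOTED_F}"
echo "quoted -f:           ${QUOTED_F}"
```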

In [30]:
chmod +x scripts/extract_headers.sh
In [31]:
cat data/example.fna | scripts/extract_headers.sh
>random sequence 1 consisting of 1000 bases.
>random sequence 2 consisting of 1000 bases.
>random sequence 3 consisting of 1000 bases.
>random sequence 4 consisting of 1000 bases.
>random sequence 5 consisting of 1000 bases.
>random sequence 6 consisting of 1000 bases.
>random sequence 7 consisting of 1000 bases.
>random sequence 8 consisting of 1000 bases.
>random sequence 9 consisting of 1000 bases.
>random sequence 10 consisting of 1000 bases.
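
For this particular task, the same headers could also be extracted with grep, which is a different technique from the read loop in the script. The sketch below uses a tiny made-up FASTA string rather than data/example.fna so that it runs anywhere.

```shell
# A tiny stand-in for a FASTA file (made-up sequences)
FASTA=$(printf '>seq 1\nacgt\n\n>seq 2\nttga\n')
# ^> anchors the match to lines that start with >
HEADERS=$(printf '%s\n' "${FASTA}" | grep '^>')
echo "${HEADERS}"
```

The read loop remains useful when each line needs more processing than a single pattern match.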