The Unix Shell: Writing Shell Scripts¶
The shell commands constitute a programming language, and command line programs known as shell scripts can be written to perform complex tasks.
This is only a brief overview. Shell scripts have many traps and pitfalls for the unwary, and we generally prefer languages with more consistent syntax, such as Python or R, for complex tasks. However, shell scripts are used extensively in domains such as the preprocessing of genomics data, so shell scripting is a useful tool to know.
Assigning variables¶
We assign variables using = and recall them using $. It is customary to spell shell variable names in ALL_CAPS.
In [1]:
NAME='Joe'
echo "Hello $NAME"
echo "Hello ${NAME}"
Hello Joe
Hello Joe
Single and double quotes¶
The main difference between single quotes ('') and double quotes ("") is that variable expansion only occurs within double quotes. For plain text, they are equivalent.
In [2]:
echo '${NAME}'
${NAME}
In [3]:
echo "${NAME}"
Joe
Use of curly braces¶
Use of curly braces unambiguously specifies the variable of interest. I suggest you always use them as a defensive programming technique.
In [4]:
echo "Hello ${NAME}l"
Hello Joel
Without braces, the shell looks for a variable named NAMEl. $NAMEl is not defined, so it expands to an empty string!
In [5]:
echo "Hello $NAMEl"
Hello
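A common place where braces matter is when building file names from a variable. The following is a minimal sketch; the variable SAMPLE and the file suffix are hypothetical names chosen only for illustration.

```shell
#!/bin/bash
# SAMPLE and the suffix are hypothetical names used only for illustration.
SAMPLE='ctrl'

# Without braces, bash looks up a variable called SAMPLE_R1, which is unset here.
WRONG="$SAMPLE_R1.fastq"

# With braces, the variable name ends at the closing brace.
RIGHT="${SAMPLE}_R1.fastq"

echo "without braces: '${WRONG}'"
echo "with braces:    '${RIGHT}'"
```

The underscore is a legal character in a variable name, which is why the first form silently picks up the wrong variable.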
One of the quirks of shell scripts is already present: there cannot be spaces before or after the = in an assignment.
In [6]:
NAME2= 'Joe'
echo "Hello ${NAME2}"
bash: Joe: command not found
Hello
The previous instruction sets NAME2 to the empty string (only in the environment of that one command), then tries to execute Joe as a command.
In [7]:
NAME3 ='Joe'
echo "Hello ${NAME3}"
bash: NAME3: command not found
Hello
The previous instruction runs the command NAME3 with =Joe as its argument.
Assigning command output to variables¶
In [8]:
pwd
/Users/cliburn/_teach/HTS_SummerCourse_2017/Materials/Computation/Wk4_Day3_PM
In [9]:
CUR_DIR=$(pwd)
dirname "${CUR_DIR}"
basename "${CUR_DIR}"
/Users/cliburn/_teach/HTS_SummerCourse_2017/Materials/Computation
Wk4_Day3_PM
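The $(...) form works with any command or pipeline, and substitutions can be nested. A small sketch, using made-up text and a made-up path rather than files from this course:

```shell
#!/bin/bash
# Capture the output of a pipeline in a variable with $(...).
# The sample text is made up for this example.
N_LINES=$(printf 'a\nb\nc\n' | wc -l)

# Command substitutions can be nested; the quotes keep paths with spaces intact.
# The path is hypothetical and does not need to exist.
PARENT_NAME=$(basename "$(dirname "/tmp/project/data/reads.txt")")

echo "lines: ${N_LINES}"
echo "parent directory: ${PARENT_NAME}"
```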
Working with numbers¶
Careful: Note the use of double parentheses to trigger evaluation of a mathematical expression.
In [10]:
NUM=$((1+2+3+4))
echo ${NUM}
10
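One thing to keep in mind is that shell arithmetic works on integers only. A short sketch, with variable names chosen just for this example:

```shell
#!/bin/bash
# Shell arithmetic is integer-only: division truncates.
A=7
B=2
QUOTIENT=$((A / B))        # 3, not 3.5
REMAINDER=$((A % B))       # 1
# Inside $(( )), variables can be written with or without the $ prefix.
TOTAL=$(( (A + B) * 2 ))   # 18

echo "${A} / ${B} = ${QUOTIENT} remainder ${REMAINDER}"
echo "total: ${TOTAL}"
```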
seq generates a range of numbers¶
In [11]:
seq 3
1
2
3
In [12]:
seq 2 5
2
3
4
5
In [13]:
seq 5 2 9
5
7
9
Branching¶
Using if to check for file existence¶
Note that the test condition is written inside square brackets, with a space after [ and before ].
In [14]:
if [ -f hello.txt ]; then
    cat hello.txt
else
    echo "No such file"
fi
1 Hello, bash
2 Hello, again
3 Hello
4 again
Downloading remote files¶
In [15]:
man wget | head -n 20
WGET(1) GNU Wget WGET(1)
NAME
Wget - The non-interactive network downloader.
SYNOPSIS
wget [option]... [URL]...
DESCRIPTION
GNU Wget is a free utility for non-interactive download of files from
the Web. It supports HTTP, HTTPS, and FTP protocols, as well as
retrieval through HTTP proxies.
Wget is non-interactive, meaning that it can work in the background,
while the user is not logged on. This allows you to start a retrieval
and disconnect from the system, letting Wget finish the work. By
contrast, most of the Web browsers require constant user's presence,
which can be a great hindrance when transferring a lot of data.
In [16]:
wget -qO- https://vincentarelbundock.github.io/Rdatasets/doc/HSAUR/Forbes2000.html \
| html2text | head -n 27 | tail -n 17
A data frame with 2000 observations on the following 8 variables.
rank
the ranking of the company.
name
the name of the company.
country
a factor giving the country the company is situated in.
category
a factor describing the products the company produces.
sales
the amount of sales of the company in billion USD.
profits
the profit of the company in billion USD.
assets
the assets of the company in billion USD.
marketvalue
the market value of the company in billion USD.
In [17]:
if [ ! -f "data/forbes.csv" ]; then
    wget https://vincentarelbundock.github.io/Rdatasets/csv/HSAUR/Forbes2000.csv \
        -O data/forbes.csv
fi
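The download-only-if-missing pattern can be wrapped in a function for reuse. This is a sketch, not part of wget: fetch_if_missing is a hypothetical helper name, and it assumes wget is installed.

```shell
#!/bin/bash
# fetch_if_missing is a hypothetical helper name, not part of wget.
# Usage: fetch_if_missing URL DEST
fetch_if_missing() {
    local url="$1"
    local dest="$2"
    if [ -f "${dest}" ]; then
        echo "Already have ${dest}, skipping download"
        return 0
    fi
    # Make sure the target directory exists before wget writes into it.
    mkdir -p "$(dirname "${dest}")"
    wget -q "${url}" -O "${dest}"
}
```

It would be called as fetch_if_missing followed by the URL and the destination path, for example the Forbes2000.csv URL and data/forbes.csv used above.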
Conditional evaluation with test¶
The [ -f hello.txt ] syntax is equivalent to test -f hello.txt, where test is a shell command with a large range of operators and flags that you can view in the man page.
TEST(1) BSD General Commands Manual TEST(1)
NAME
test, [ -- condition evaluation utility
SYNOPSIS
test expression
[ expression ]
DESCRIPTION
The test utility evaluates the expression and, if it evaluates to true,
returns a zero (true) exit status; otherwise it returns 1 (false). If
there is no expression, test also returns 1 (false).
All operators and flags are separate arguments to the test utility.
The following primaries are used to construct expression:
-b file True if file exists and is a block special file.
-c file True if file exists and is a character special file.
-d file True if file exists and is a directory.
-e file True if file exists (regardless of type).
-f file True if file exists and is a regular file.
-g file True if file exists and its set group ID flag is set.
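Because test reports its result through the exit status, you can inspect it with $? just like any other command. A minimal sketch in a throwaway directory (the file name notes.txt is made up):

```shell
#!/bin/bash
# Work in a throwaway directory so the example does not touch real files.
WORK_DIR=$(mktemp -d)
touch "${WORK_DIR}/notes.txt"

# An exit status of 0 means true; anything else means false.
test -d "${WORK_DIR}"
IS_DIR=$?

# WORK_DIR is a directory, not a regular file, so -f is false here.
IS_FILE=0
[ -f "${WORK_DIR}" ] || IS_FILE=$?

[ -e "${WORK_DIR}/notes.txt" ]
EXISTS=$?

echo "directory? ${IS_DIR}  regular file? ${IS_FILE}  exists? ${EXISTS}"
rm -r "${WORK_DIR}"
```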
Looping¶
For loop¶
In [18]:
for FILE in *ipynb; do
    echo "$FILE"
done
The_Unix_Shell_01___File_and_Directory_Management.ipynb
The_Unix_Shell_03___Working_with_Text.ipynb
The_Unix_Shell_04___Regular_Expresssions.ipynb
The_Unix_Shell_05___Finding_Stuff.ipynb
The_Unix_Shell_06___Shell_Scripts.ipynb
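Looping over a glob pattern directly, with the loop variable quoted, handles file names that contain spaces. A sketch using a temporary directory and made-up file names:

```shell
#!/bin/bash
# Create a few empty files in a throwaway directory (names are made up).
WORK_DIR=$(mktemp -d)
for I in $(seq 3); do
    touch "${WORK_DIR}/sample ${I}.txt"
done

# Loop over the glob; quoting "${FILE}" keeps names with spaces in one piece.
COUNT=0
for FILE in "${WORK_DIR}"/*.txt; do
    # If nothing matches, the pattern is left as-is, so check the file exists.
    [ -e "${FILE}" ] || continue
    basename "${FILE}"
    COUNT=$((COUNT + 1))
done
echo "found ${COUNT} files"
rm -r "${WORK_DIR}"
```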
While loop¶
In [19]:
COUNTER=10
while [ "$COUNTER" -gt 0 ]; do
    echo "$COUNTER"
    COUNTER=$(($COUNTER - 1))
done
10
9
8
7
6
5
4
3
2
1
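Careful: Inside [ ], < and > are redirection operators, not comparisons. For example, [ $COUNTER > 0 ] writes to a file named 0 and is always true, which leads to an infinite loop. Use -lt for less than and -gt for greater than, and -eq for equality and -ne for inequality when comparing integers. The == and != operators compare strings.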
In [20]:
COUNTER=10
while [ "$COUNTER" != 0 ]; do
    echo "$COUNTER"
    COUNTER=$(($COUNTER - 1))
done
10
9
8
7
6
5
4
3
2
1
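The integer comparison operators can be sketched as follows. The last two lines illustrate the difference between comparing as strings and comparing as integers; the values are arbitrary.

```shell
#!/bin/bash
# Count up with -lt, then compare integers with -eq.
COUNTER=0
while [ "${COUNTER}" -lt 3 ]; do
    COUNTER=$((COUNTER + 1))
done

if [ "${COUNTER}" -eq 3 ]; then
    RESULT="reached 3"
else
    RESULT="stopped early"
fi
echo "${RESULT}"

# "03" and "3" differ as strings but are equal as integers.
[ "03" != "3" ] && echo "different strings"
[ "03" -eq "3" ] && echo "equal integers"
```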
Shell script¶
From now on, we will write the shell script using an editor for convenience. For a syntax-highlighted display, I use a non-standard Python program pygmentize that you can install with pip install pygments, but you can also just use cat to display the file contents.
A shell script is traditionally given the extension .sh. There are a few things to note:
- To make the script standalone, you need to add #!/path/to/shell in the first line. Otherwise you need to call the script with bash /path/to/script instead of just /path/to/script.
- To make the script executable, change the file permissions to executable with chmod +x /path/to/script
- Shell arguments are similar to function arguments, i.e. $1, $2, $@ etc. Another useful variable is $# which gives the number of command line arguments.
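The positional parameters behave the same way inside a shell function, which makes them easy to try out without creating a script file. A sketch, where show_args is a hypothetical function name:

```shell
#!/bin/bash
# show_args is a hypothetical function; inside a function, $1, $#, and "$@"
# behave the same way as they do for a script's command-line arguments.
show_args() {
    echo "number of arguments: $#"
    echo "first argument: $1"
    # "$@" expands to each argument as a separate word, preserving spaces.
    for ARG in "$@"; do
        echo "arg: ${ARG}"
    done
}

show_args one "two words" three
```

Quoting "$@" matters here: without the quotes, "two words" would be split into two separate arguments.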
In [21]:
which bash
/bin/bash
In [22]:
pygmentize -g scripts/cat_if_exists.sh
#!/bin/bash
if [ -f "$1" ]; then
    cat "$1"
else
    echo "No such file: $1"
fi
In [23]:
chmod +x scripts/cat_if_exists.sh
In [24]:
scripts/cat_if_exists.sh hello.txt
1 Hello, bash
2 Hello, again
3 Hello
4 again
In [25]:
scripts/cat_if_exists.sh goodbye.txt
No such file: goodbye.txt
Reading a file line by line¶
We will write a script to extract headers from a FASTA Nucleic Acid
(FNA) file. Headers in FASTA format are lines that begin with the >
character.
In [26]:
cat data/example.fna | head -n 23
>random sequence 1 consisting of 1000 bases.
acggacaaacggttgatgtggttcttcgcaggatgcgccaaagtgtttacaaggctggta
aactgagaatgtgcttgttccccgtctcacgcaaagatatgaggcgtaagagaccgacat
attccctcctccataggtctttttgattattgatcactgcttcgccacccttagcgtggt
gtctttcatagtctcaccgttaaacggcgacgttcgtgaacctgctcagtccctaaactc
gataacaatcgggctgtgttggaagctagtattatcggcattcaggtagtagtcccccgg
actagcacggtccgggtctggttgcacatacatggtagcgaaattccgctcctccagccc
agaataaaggtagaagaccaatgcccgggtaaaaaactcaacgagtaggtcccacgatta
tctgagtggtgaactatgctgaggacgacaatatcatcggagtgttcactagggtgcggg
gttgactataagtgtagtctgatcatagagactccgcatattcggctacgctctataact
aatttgacgaatgctgcgaacgcacctgcgtatcgcttccttctaacctcaggcggtcat
tatcatgtcaaacaacaagagtaggtttatggcatcgacacgcatgactgcgtaacgagt
cacacgccagacgtctaagcagtgcaatgccagcgtctatgaagctcttaattagcgggt
ttacacttgcattgagtgaaatgtgccaagagcctactacaacccgcagccggcatatgg
gatcaagcgaggcaatttgatgcgcccccaaagcacgcgaaaaaagagcttggacccgga
agaaaacgatgttctgggtccgtcaagcctgcgtacagcttatccaacttttaagtggac
gtgtccgcagacaagcacacagggagggctcgccaaaaaaattgctgtatctagtacaag
gtagctaatagctccggaccgaccacctttccggactgcc
>random sequence 2 consisting of 1000 bases.
tgcgcattctcctatacatatgacgatctggtaccatgcgatagcggtcgccgagataat
ataccaaaagacatatgtcttctccgcaccctgttcctcctaccagccacaggctctgca
gcctctctcactccccgatcgagaaagattgggggttaacaataacactttttacgtcgg
In [27]:
pygmentize scripts/extract_headers.sh
#!/bin/bash
# Read standard input line by line; IFS= and -r keep each line exactly as written.
while IFS= read -r LINE
do
    if [ "${LINE:0:1}" == '>' ]; then
        echo "$LINE"
    fi
done
The ${X:m:n} expression extracts n characters of X starting at offset m¶
In [28]:
LINE=">random sequence 1 consisting of 1000 bases."
echo "${LINE:0:1}"
>
In [29]:
echo "${LINE:5:10}"
om sequenc
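A few related forms of the same bash expansion, sketched on the header line used above. The variable names other than LINE are chosen just for this example.

```shell
#!/bin/bash
LINE=">random sequence 1 consisting of 1000 bases."

FIRST="${LINE:0:1}"     # 1 character starting at offset 0
NO_MARKER="${LINE:1}"   # omit the length to take everything from offset 1
WORD="${LINE:1:6}"      # 6 characters starting at offset 1
LENGTH="${#LINE}"       # ${#X} gives the number of characters in X

echo "${FIRST}"
echo "${NO_MARKER}"
echo "${WORD}"
echo "${LENGTH}"
```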
Careful: You need to put all variables in the test condition within double quotes. If not, when the variable is empty or undefined (e.g. empty line) it vanishes and leaves [ == '>' ] which raises a syntax error.
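The effect of the quotes can be checked directly with an empty variable. In this sketch the error message from the unquoted test is sent to /dev/null and only its exit status is recorded:

```shell
#!/bin/bash
# An empty line, as read would produce for a blank line in the input.
LINE=""

# Quoted: the test sees an empty string as its first argument and is simply false.
if [ "${LINE:0:1}" == '>' ]; then
    QUOTED_RESULT="header"
else
    QUOTED_RESULT="not a header"
fi
echo "quoted: ${QUOTED_RESULT}"

# Unquoted: the empty expansion disappears, leaving [ == '>' ], which is an error.
UNQUOTED_STATUS=0
[ ${LINE:0:1} == '>' ] 2>/dev/null || UNQUOTED_STATUS=$?
echo "unquoted exit status: ${UNQUOTED_STATUS}"
```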
In [30]:
chmod +x scripts/extract_headers.sh
In [31]:
cat data/example.fna | scripts/extract_headers.sh
>random sequence 1 consisting of 1000 bases.
>random sequence 2 consisting of 1000 bases.
>random sequence 3 consisting of 1000 bases.
>random sequence 4 consisting of 1000 bases.
>random sequence 5 consisting of 1000 bases.
>random sequence 6 consisting of 1000 bases.
>random sequence 7 consisting of 1000 bases.
>random sequence 8 consisting of 1000 bases.
>random sequence 9 consisting of 1000 bases.
>random sequence 10 consisting of 1000 bases.
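For comparison, the same headers can be pulled out with grep instead of a read loop. This is a different technique from the script above, shown here on a tiny FASTA file with made-up sequences:

```shell
#!/bin/bash
# Build a tiny FASTA file with made-up sequences for the demonstration.
FASTA=$(mktemp)
printf '>seq 1\nacgt\nttga\n>seq 2\nggcc\n' > "${FASTA}"

# grep keeps the lines that start with '>'; with -c it counts them instead.
HEADERS=$(grep '^>' "${FASTA}")
N_HEADERS=$(grep -c '^>' "${FASTA}")

echo "${HEADERS}"
echo "number of headers: ${N_HEADERS}"
rm "${FASTA}"
```

The read loop is still worth knowing, since it generalizes to per-line processing that a single grep cannot express.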