The Unix Shell: Writing Shell Scripts¶
The shell commands constitute a programming language, and command line programs known as shell scripts can be written to perform complex tasks.
This section gives only a brief overview. Shell scripts have many traps and pitfalls for the unwary, and we generally prefer languages with more consistent syntax, such as Python or R, for complex tasks. However, shell scripts are used extensively in domains such as the preprocessing of genomics data, so shell scripting is a useful tool to know.
Assigning variables¶
We assign variables using = and recall them by using $. It is
customary to spell shell variable names in ALL_CAPS.
In [12]:
NAME='Joe'
echo "Hello $NAME"
echo "Hello ${NAME}"
Hello Joe
Hello Joe
Single and double quotes¶
The main difference between single quotes ('...') and double quotes ("...") is that variable expansion only occurs within double quotes. For plain text, they are equivalent.
In [222]:
echo '${NAME}'
${NAME}
In [223]:
echo "${NAME}"
Joe
Use of braces¶
Braces unambiguously delimit the name of the variable of interest. I suggest you always use them as a defensive programming technique.
In [13]:
echo "Hello ${NAME}l"
Hello Joel
$NAMEl is not defined, and so expands to an empty string!
In [14]:
echo "Hello $NAMEl"
Hello
One of the quirks of shell scripts is already present - there cannot be
spaces before or after the = in an assignment.
In [10]:
NAME2= 'Joe'
echo "Hello ${NAME2}"
bash: Joe: command not found
Hello
In [11]:
NAME3 ='Joe'
echo "Hello ${NAME3}"
bash: NAME3: command not found
Hello
Assigning commands to variables¶
In [16]:
pwd
/Users/cliburn/_teach/bios-821/lessons
In [15]:
CUR_DIR=$(pwd)
dirname ${CUR_DIR}
basename ${CUR_DIR}
/Users/cliburn/_teach/bios-821
lessons
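Command substitution works with any command whose output you want to capture. Here is a minimal sketch using a made-up path (EXAMPLE_PATH is an assumption chosen for illustration, and it does not need to exist). Putting the variable in double quotes keeps a path containing spaces in one piece.

```shell
# EXAMPLE_PATH is a hypothetical path used only for illustration
EXAMPLE_PATH='/tmp/my project/lessons'

# Capture the output of each command in a variable
PARENT=$(dirname "${EXAMPLE_PATH}")
LEAF=$(basename "${EXAMPLE_PATH}")

echo "Parent: ${PARENT}"
echo "Leaf: ${LEAF}"
```

This should print `Parent: /tmp/my project` followed by `Leaf: lessons`.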
Working with numbers¶
Careful: Note the use of DOUBLE parentheses to trigger evaluation of a mathematical expression.
In [226]:
NUM=$((1+2+3+4))
echo ${NUM}
10
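Arithmetic expansion also works with variables, which can be named inside the double parentheses without a leading $. The short sketch below uses two illustrative variables A and B. Note that the shell only does integer arithmetic, so division truncates.

```shell
A=7
B=2

# Variables can be referenced by name inside $(( ))
SUM=$((A + B))
QUOTIENT=$((A / B))    # integer division: the fractional part is dropped
REMAINDER=$((A % B))

echo "${SUM} ${QUOTIENT} ${REMAINDER}"
```

This prints `9 3 1`.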
seq generates a range of numbers¶
In [24]:
seq 3
1
2
3
In [25]:
seq 2 5
2
3
4
5
In [29]:
seq 5 2 9
5
7
9
Attempt to find sum of first 5 numbers¶
In [32]:
seq 5 | sum
26288 1
In [35]:
whatis sum
cksum(1), sum(1) - display file checksums and block counts
sum(n) - Calculate a sum(1) compatible checksum
Doing this is surprisingly tricky¶
Another reason to use Python or R where sum is just sum.
We first use process substitution to treat the output of seq 5 as a
file, then pass it to paste which inserts the + delimiter
between each line in the file.
In [51]:
whatis paste
paste(1) - merge corresponding or subsequent lines of files
In [47]:
paste -s -d+ <(seq 5)
1+2+3+4+5
This string is then passed to bc, which evaluates strings as
mathematical expressions.
In [49]:
whatis bc
bc(1) - An arbitrary precision calculator language
In [48]:
paste -s -d+ <(seq 5) | bc
15
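For whole numbers, an alternative that avoids paste and bc is to accumulate the total in a loop with arithmetic expansion. This is a sketch of that alternative rather than a replacement for the pipeline above, and TOTAL and N are illustrative variable names.

```shell
# Add up the integers printed by seq, one at a time
TOTAL=0
for N in $(seq 5); do
    TOTAL=$((TOTAL + N))
done
echo "${TOTAL}"
```

This prints `15`. Unlike bc, this approach only handles integers.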
Writing functions¶
Functions in bash have this structure
function_name () {
commands
}
Function arguments¶
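Function arguments are retrieved via special symbols known as positional parameters: $1, $2 and so on, with $@ expanding to all of them. Note that $0 is not a function argument; it holds the name of the shell or script.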
In [55]:
f () {
echo $0
echo $1
echo $2
echo $@
}
In [56]:
f one two three
/bin/bash
one
two
one two three
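Two related details are worth a quick sketch: $# gives the number of arguments, and writing "$@" in double quotes keeps an argument that contains spaces as a single item. The function name describe_args is made up for this illustration.

```shell
# describe_args is a hypothetical function name used for illustration
describe_args () {
    echo "Number of arguments: $#"
    # Quoting "$@" preserves each argument exactly as it was passed
    for ARG in "$@"; do
        echo "Argument: ${ARG}"
    done
}

describe_args one "two three"
```

This prints a count of 2, then `Argument: one` and `Argument: two three` on separate lines.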
Function to extract first line of a set of files¶
In [58]:
ls *.txt
hello.txt stderr.txt stdout.txt test1.txt
In [96]:
headers () {
for FILE in "$@"; do
echo -n "${FILE}: "
head -n 1 "${FILE}"
done
}
In [99]:
headers *.txt
hello.txt: 1 Hello, bash
stderr.txt: mkdir: foo/bar: No such file or directory
stdout.txt: test1.txt: One, two buckle my shoe
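In the output above, stdout.txt is empty, so its label runs straight into the next file's line. One way to avoid that is to read the first line into a variable and always print a complete line. This is a sketch under that assumption; first_lines and FIRST are made-up names, and the demonstration files are created in a temporary directory.

```shell
# first_lines is a hypothetical variant of headers
first_lines () {
    for FILE in "$@"; do
        FIRST=''
        # read fails on an empty file and leaves FIRST empty
        IFS= read -r FIRST < "${FILE}" || true
        printf '%s: %s\n' "${FILE}" "${FIRST}"
    done
}

# Try it on throwaway files, one of them empty and one with a space in its name
DEMO_DIR=$(mktemp -d)
printf 'first line\nsecond line\n' > "${DEMO_DIR}/my notes.txt"
: > "${DEMO_DIR}/empty.txt"
first_lines "${DEMO_DIR}"/*.txt
rm -r "${DEMO_DIR}"
```

Passing the glob directly, rather than the output of ls, lets the shell hand over each file name intact.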
Branching¶
Using if to check for file existence¶
In [126]:
if [ -f hello.txt ]; then
cat hello.txt
else
echo "No such file"
fi
1 Hello, bash
2 Hello, again
3 Hello
4 again
Downloading remote files¶
In [1]:
man wget | head -n 20
WGET(1) GNU Wget WGET(1)
NAME
Wget - The non-interactive network downloader.
SYNOPSIS
wget [option]... [URL]...
DESCRIPTION
GNU Wget is a free utility for non-interactive download of files from
the Web. It supports HTTP, HTTPS, and FTP protocols, as well as
retrieval through HTTP proxies.
Wget is non-interactive, meaning that it can work in the background,
while the user is not logged on. This allows you to start a retrieval
and disconnect from the system, letting Wget finish the work. By
contrast, most of the Web browsers require constant user's presence,
which can be a great hindrance when transferring a lot of data.
In [2]:
wget -qO- https://vincentarelbundock.github.io/Rdatasets/doc/HSAUR/Forbes2000.html \
| html2text | head -n 27 | tail -n 17
A data frame with 2000 observations on the following 8 variables.
rank
the ranking of the company.
name
the name of the company.
country
a factor giving the country the company is situated in.
category
a factor describing the products the company produces.
sales
the amount of sales of the company in billion USD.
profits
the profit of the company in billion USD.
assets
the assets of the company in billion USD.
marketvalue
the market value of the company in billion USD.
In [3]:
if [ ! -f "data/forbes.csv" ]; then
wget https://vincentarelbundock.github.io/Rdatasets/csv/HSAUR/Forbes2000.csv \
-O data/forbes.csv
fi
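The same check-then-download pattern can be wrapped in a function so that it can be reused for other files. This is only a sketch: fetch_if_missing is a made-up name, and it assumes wget is installed. It takes the URL as its first argument and the destination path as its second.

```shell
# fetch_if_missing is a hypothetical helper: $1 is the URL, $2 the destination
fetch_if_missing () {
    if [ -f "$2" ]; then
        echo "Already present: $2"
    else
        # Create the destination directory if needed, then download quietly
        mkdir -p "$(dirname "$2")"
        wget -q "$1" -O "$2"
    fi
}

# Example call (not run here, since it needs network access):
# fetch_if_missing https://vincentarelbundock.github.io/Rdatasets/csv/HSAUR/Forbes2000.csv data/forbes.csv
```

Calling it a second time with the same destination skips the download.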
Conditional evaluation with test¶
The [ -f hello.txt ] syntax is equivalent to test -f hello.txt,
where test is a shell command with a large range of operators and
flags that you can view in the man page.
In [209]:
man test | head -n 30
TEST(1) BSD General Commands Manual TEST(1)
NAME
test, [ -- condition evaluation utility
SYNOPSIS
test expression
[ expression ]
DESCRIPTION
The test utility evaluates the expression and, if it evaluates to true,
returns a zero (true) exit status; otherwise it returns 1 (false). If
there is no expression, test also returns 1 (false).
All operators and flags are separate arguments to the test utility.
The following primaries are used to construct expression:
-b file True if file exists and is a block special file.
-c file True if file exists and is a character special file.
-d file True if file exists and is a directory.
-e file True if file exists (regardless of type).
-f file True if file exists and is a regular file.
-g file True if file exists and its set group ID flag is set.
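To see a few of these primaries in action, here is a small sketch that classifies a path using -d, -f and -e. The function name describe_path is made up for the example, and it mixes the test and [ ] spellings on purpose to show that they are interchangeable.

```shell
# describe_path is a hypothetical helper name
describe_path () {
    if test -d "$1"; then
        echo "directory"
    elif [ -f "$1" ]; then
        echo "regular file"
    elif [ -e "$1" ]; then
        echo "other"
    else
        echo "missing"
    fi
}

# Try it on a temporary directory with one file in it
DEMO_DIR=$(mktemp -d)
touch "${DEMO_DIR}/a.txt"
describe_path "${DEMO_DIR}"
describe_path "${DEMO_DIR}/a.txt"
describe_path "${DEMO_DIR}/b.txt"
rm -r "${DEMO_DIR}"
```

This prints `directory`, `regular file` and `missing`, one per line.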
Looping¶
For loop¶
In [210]:
for FILE in *txt; do
echo "$FILE"
done
hello.txt
stderr.txt
stdout.txt
test1.txt
While loop¶
In [221]:
COUNTER=10
while [ $COUNTER -gt 0 ]; do
echo $COUNTER
COUNTER=$(($COUNTER - 1))
done
10
9
8
7
6
5
4
3
2
1
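Careful: Inside [ ], < and > are redirection operators, not comparisons.
For example, [ $COUNTER > 0 ] writes to a file named 0 and leaves a test
that is always true, which leads to an infinite loop. For integers, use
-lt for less than, -gt for greater than, -eq for equality and -ne for
inequality. The == and != operators compare their operands as strings.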
In [225]:
COUNTER=10
while [ $COUNTER != 0 ]; do
echo $COUNTER
COUNTER=$(($COUNTER - 1))
done
10
9
8
7
6
5
4
3
2
1
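The integer operators can be combined in a small function, sketched below. The name compare_ints is made up for the example. The first call is the interesting one, since 9 is less than 10 as a number even though the text "9" sorts after the text "10".

```shell
# compare_ints is a hypothetical helper that compares two integers
compare_ints () {
    if [ "$1" -lt "$2" ]; then
        echo "less"
    elif [ "$1" -eq "$2" ]; then
        echo "equal"
    else
        echo "greater"
    fi
}

compare_ints 9 10
compare_ints 10 10
compare_ints 10 9
```

This prints `less`, `equal` and `greater`, one per line.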
Shell script¶
From now on, we will write the shell script using an editor for
convenience. For a syntax-highlighted display, I use a non-standard
Python program pygmentize that you can install with
pip install pygments
but you can also just use cat to display the file contents.
A shell script is traditionally given the extension .sh. There are a
few things to note:
- To make the script standalone, you need to add #!/path/to/shell in the first line. Otherwise you need to call the script with bash /path/to/script instead of just /path/to/script.
- To make the script executable, change the file permissions to executable with chmod +x /path/to/script
- Shell arguments are similar to function arguments - i.e. $1, $2, $@ etc. Another useful variable is $# which gives the number of command line arguments.
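The three points above can be tried out end to end with a throwaway script. This sketch writes a small script to a temporary file, makes it executable and runs it with two arguments. It assumes bash lives at /bin/bash and that files in the temporary directory may be executed.

```shell
# Write a small script to a temporary file; the quoted 'EOF' stops the
# current shell from expanding $# and $@ while writing it
SCRIPT=$(mktemp)
cat > "${SCRIPT}" <<'EOF'
#!/bin/bash
# Report how many arguments were given, then list them one per line
echo "Got $# argument(s)"
for ARG in "$@"; do
    echo "- ${ARG}"
done
EOF

# Make it executable, then run it directly thanks to the #! line
chmod +x "${SCRIPT}"
OUTPUT=$("${SCRIPT}" alpha "beta gamma")
echo "${OUTPUT}"
rm "${SCRIPT}"
```

This should report 2 arguments and list `alpha` and `beta gamma` on separate lines.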
In [132]:
which bash
/bin/bash
In [199]:
pygmentize -g scripts/cat_if_exists.sh
#!/bin/bash
if [ -f "$1" ]; then
cat "$1"
else
echo "No such file: $1"
fi
In [200]:
chmod +x scripts/cat_if_exists.sh
In [202]:
scripts/cat_if_exists.sh hello.txt
1 Hello, bash
2 Hello, again
3 Hello
4 again
In [204]:
scripts/cat_if_exists.sh goodbye.txt
No such file: goodbye.txt
Reading a file line by line¶
We will write a script to extract headers from a FASTA Nucleic Acid
(FNA) file. Headers in FASTA format are lines that begin with the >
character.
In [235]:
cat data/example.fna | head -n 23
>random sequence 1 consisting of 1000 bases.
acggacaaacggttgatgtggttcttcgcaggatgcgccaaagtgtttacaaggctggta
aactgagaatgtgcttgttccccgtctcacgcaaagatatgaggcgtaagagaccgacat
attccctcctccataggtctttttgattattgatcactgcttcgccacccttagcgtggt
gtctttcatagtctcaccgttaaacggcgacgttcgtgaacctgctcagtccctaaactc
gataacaatcgggctgtgttggaagctagtattatcggcattcaggtagtagtcccccgg
actagcacggtccgggtctggttgcacatacatggtagcgaaattccgctcctccagccc
agaataaaggtagaagaccaatgcccgggtaaaaaactcaacgagtaggtcccacgatta
tctgagtggtgaactatgctgaggacgacaatatcatcggagtgttcactagggtgcggg
gttgactataagtgtagtctgatcatagagactccgcatattcggctacgctctataact
aatttgacgaatgctgcgaacgcacctgcgtatcgcttccttctaacctcaggcggtcat
tatcatgtcaaacaacaagagtaggtttatggcatcgacacgcatgactgcgtaacgagt
cacacgccagacgtctaagcagtgcaatgccagcgtctatgaagctcttaattagcgggt
ttacacttgcattgagtgaaatgtgccaagagcctactacaacccgcagccggcatatgg
gatcaagcgaggcaatttgatgcgcccccaaagcacgcgaaaaaagagcttggacccgga
agaaaacgatgttctgggtccgtcaagcctgcgtacagcttatccaacttttaagtggac
gtgtccgcagacaagcacacagggagggctcgccaaaaaaattgctgtatctagtacaag
gtagctaatagctccggaccgaccacctttccggactgcc
>random sequence 2 consisting of 1000 bases.
tgcgcattctcctatacatatgacgatctggtaccatgcgatagcggtcgccgagataat
ataccaaaagacatatgtcttctccgcaccctgttcctcctaccagccacaggctctgca
gcctctctcactccccgatcgagaaagattgggggttaacaataacactttttacgtcgg
In [232]:
pygmentize scripts/extract_headers.sh
#!/bin/bash
while IFS= read -r LINE
do
if [ "${LINE:0:1}" == '>' ]; then
echo "$LINE"
fi
done
Careful: You need to put all variables in the test condition within
double quotes. If not, when the variable is empty or undefined (e.g.
empty line) it vanishes and leaves [ == '>' ] which raises a syntax
error.
In [227]:
chmod +x scripts/extract_headers.sh
In [231]:
cat data/example.fna | scripts/extract_headers.sh
>random sequence 1 consisting of 1000 bases.
>random sequence 2 consisting of 1000 bases.
>random sequence 3 consisting of 1000 bases.
>random sequence 4 consisting of 1000 bases.
>random sequence 5 consisting of 1000 bases.
>random sequence 6 consisting of 1000 bases.
>random sequence 7 consisting of 1000 bases.
>random sequence 8 consisting of 1000 bases.
>random sequence 9 consisting of 1000 bases.
>random sequence 10 consisting of 1000 bases.
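The same line-by-line filter can also be written with a case statement, which matches on a pattern instead of slicing out the first character. This is a sketch of that alternative, not the script used above, and print_fasta_headers is a made-up name. The demonstration input is a tiny invented FASTA snippet that includes an empty line.

```shell
# print_fasta_headers is a hypothetical function name
print_fasta_headers () {
    # IFS= and -r keep leading spaces and backslashes in each line intact
    while IFS= read -r LINE; do
        case "${LINE}" in
            '>'*) printf '%s\n' "${LINE}" ;;
        esac
    done
}

printf '>seq 1\nacgt\n\n>seq 2\nttga\n' | print_fasta_headers
```

This prints `>seq 1` and `>seq 2`. For this particular task, `grep '^>'` gives the same result in a single command.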