Using `bash` in bioinformatics¶

The Unix commands shown in this notebook are quite intimidating at first. There’s a lot of material becaue I try to cover all the stuff you willl need to construct a typicla bioinformatics pipeline.

Try to understand the expressions and play aound to see what varying them will do. You do not need to memorize the commands now - you can always come back to this notebook for reference. You will also get more practice when you do the Bioiinformatics sessions to construct Unix pipelines for processing your RNA-seq data sets.

Hack to handle broekn pipes - IGNORE.

In [1]:

cleanup () {
    :
}

trap "cleanup" SIGPIPE

Treat unset variables as an error, and immediately exit.

In [2]:

set -u

0.1. Files, directories and working with text¶

Quickly reveiw what the following navigation and file/directory manipulation commands do:

>, >>, <, |, ls, cd, pwd, cp, mv, rm, mkdir, rmdir

This session dives quite deep into Unix shell scripting and invovles quite a bit of text processing. We review some basic commands for working with text in Unix.

This is the file we will warm up with¶

We create it using a HEREDOC that allows us to construct multi-line files and redirrect it to the file junk.txt

In [3]:

cat <<EOF > junk.txt
1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
4. TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
EOF

We can see the contents of a file with `cat`¶

In [4]:

cat junk.txt

1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
4. TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG

Using `head` and `tail`¶

In [5]:

head -3 junk.txt

1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA

In [6]:

tail -3 junk.txt

2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
4. TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG

In [7]:

tail +3 junk.txt

3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
4. TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG

Pipling to chain commnads¶

In [8]:

head -3 junk.txt | tail -2

2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA

Searching within a file with `grep`¶

grep takes a REGEX (regular expression) as the query and a file (or files) to search
Normally grep shows lines that match the REGEX argument
The -v flag shows the lines that DO NOT match the REGEX argument

In [9]:

grep "[2|4]" junk.txt

2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
4. TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG

In [10]:

grep -v "[2|4]" junk.txt

1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA

Use a regular expression with `grep`¶

In [11]:

grep "CCC.*CCC" junk.txt

1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG

Get rid of numbers with `cut`¶

cut extracts columns separated by delimiters - here the delimiter is a space -d ' ' - and we cut the second column -f 2

In [12]:

cut -d ' ' -f 2 junk.txt

CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG

Get rid of numbers with `grep`¶

The -o flag to grep shows only the strings mathcing the regular expression

In [13]:

grep -o '[A-Z]*' junk.txt

CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG

Extract only odd numbered lines with `sed`¶

The -n says not to print anything
Then we print p the lines starting at 1 with a step of 2

In [14]:

sed -n "1~2p" junk.txt

1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA

Use `sed` for substitution¶

The syntax is s/PATTERN/REPLACMENT/g/

The s/// measn substitute
The g means replace globally; otherwise just replaces the first patttern found
[0-9][0-9]* is a regular expression matching a digit followed by zero or more digits
\. matches the literal .
[[:space:]] matches a single space character

In [15]:

sed "s/[0-9][0-9]*\.[[:space:]]//g" junk.txt

CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG

Use `sed` to remove dupplicate characters using a REGEX capture group¶

In [16]:

sed -E "s/(.)\1*/\1/g" junk.txt

1. CTACTACTACTCAGTCGAGCGACATAGCTG
2. GTCATATGTCTGTACGTATGCACTGATCATACTGATCGAGC
3. GACGATATCAGACGACATGTCGTCTGACACGA
4. TCTCGTCGCTGCGTGTCTCTCAGTGCAGTACG

Convert DNA to RNA with `tr`¶

In [17]:

cat junk.txt | tr T U

1. CCCUAACCCCCUAACCCCCUAACCCUCAGUCGGGGAGGCGACAAUAGCUG
2. GUCAUAUGUUCUGUACGUUAUUGGCCAACUGAUCAUACCUGAAUCGAGCC
3. GAACCGGGAUUAUCAAAGACGAACAUGGUCGGGUCCUUGAACCAAACGAA
4. UCUCCGUCCGCUGGCGUGUUUUUCUUUUCUCAAGUGGGCAAGUUACCCGG

Reverse complement with `tr`¶

In [18]:

grep -o '[ACTG]*' junk.txt | tr ACGT TGCA | rev

CAGCTATTGTCGCCTCCCCGACTGAGGGTTAGGGGGTTAGGGGGTTAGGG
GGCTCGATTCAGGTATGATCAGTTGGCCAATAACGTACAGAACATATGAC
TTCGTTTGGTTCAAGGACCCGACCATGTTCGTCTTTGATAATCCCGGTTC
CCGGGTAACTTGCCCACTTGAGAAAAGAAAAACACGCCAGCGGACGGAGA

Split command over multiple lines¶

Use \ to continue command to next line

In [19]:

grep -o '[ACTG]*' junk.txt | tr \
ACGT TGCA | rev

CAGCTATTGTCGCCTCCCCGACTGAGGGTTAGGGGGTTAGGGGGTTAGGG
GGCTCGATTCAGGTATGATCAGTTGGCCAATAACGTACAGAACATATGAC
TTCGTTTGGTTCAAGGACCCGACCATGTTCGTCTTTGATAATCCCGGTTC
CCGGGTAACTTGCCCACTTGAGAAAAGAAAAACACGCCAGCGGACGGAGA

Actually the line continuation character is not necesssary if you break at a pipe |

In [20]:

grep -o '[ACTG]*' junk.txt |
tr ACGT TGCA |
rev

CAGCTATTGTCGCCTCCCCGACTGAGGGTTAGGGGGTTAGGGGGTTAGGG
GGCTCGATTCAGGTATGATCAGTTGGCCAATAACGTACAGAACATATGAC
TTCGTTTGGTTCAAGGACCCGACCATGTTCGTCTTTGATAATCCCGGTTC
CCGGGTAACTTGCCCACTTGAGAAAAGAAAAACACGCCAGCGGACGGAGA

0.2 Getting files¶

We will download Chromosomes 7 and 8 of Crytococcus neoformans in FASTA format from NCBI Nucleotide. We can find their IDs through a web query.

We find that the IDs are

CM000046.1 for chormosome 7
CM000047.1 for chormosome 8

We will use the NCBI Entrez tool efetch to download the FASTA files to a folder seqs.

In [21]:

cd
mkdir -p seqs
efetch -db=nuccore -format=fasta -id=CM000046.1 > seqs/CM000046.1.fa
efetch -db=nuccore -format=fasta -id=CM000047.1 > seqs/CM000047.1.fa

In [22]:

ls seqs

CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak

In [23]:

head seqs/CM000046.1.fa

>CM000046.1 Cryptococcus neoformans var. neoformans B-3501A chromosome 7, whole genome shotgun sequence
TGGTCTTATGAGGAAGAGGAGTTTGGATTATTTTTTCTTTTCTTTAAAAAGTTGTTTATTTAAGTAGTTT
CTTTAATTCGGGTAACACACACGACAACCCAATAAATTAAACAACGAAAAAATGCAACCTCTATAACCCC
CTAACCCTTTTATTCGGGTAACACACACGACAACCCAATAAATTAAACAACGAAAAATGCAACCTCTATA
ACCCCCGAAAAGGATTGGTGTCGAGTTAGTAAAGGCGAGGGCTAGGACGCTGGTCTTGTGAGGAAGAGGA
GGTTGGATTATTTTTTCTTTTCTTTAATAAGTTGTTTATTTAAGTAGTTTCTTTTATTCGGGTAACACAC
ACGACAACCCAATAAATTAAACAACGAAAAATGCAACCTCTATAACCCCCATCCAAGGTGCCACGGTTGA
GATTGAGCCCCACACACTGTCGGGACAAAGGAGAAACGACCTTCGGGTCAGAGGTTCCAGCGCTCTGGCC
TTCACTGACTACGACCTGAAGGTTTACTCCCTCGGGGACCGAGACGCGAGGAGCACAGCCACCCCCAGCA
CCCCCAACAGCAAACTGGCCGACTTCTGCTTGGACCGGTGCGTGAACTGGCTCGACAAGGTGGGTCAGGT

In [24]:

head seqs/CM000047.1.fa

>CM000047.1 Cryptococcus neoformans var. neoformans B-3501A chromosome 8, whole genome shotgun sequence
ATGGCCTTCCTGACTACGACCTGAGGTTTACTCCCTCGGCGACCGAGACGCGAGGAGCACAGCCACCCCA
AGCACCCCCAACAGCAAGCTGGCCGAATTCTGCTTGGACCGGTGCGTGAACTGGCTCGACAAGGTGGGTC
AGGTCGTCTCGAAGAACGCTCCGAAAGTCACTGGTGGGGTCTTTAAACCGATCATCCTTTCCACTGGTGG
CCTGATGAGCAGGAGCACAGCAGACGAATGGAAGGAGTGGAGGGAGGCGATGCCGGTGGGGGGGTTCGAG
AAAATGGAGAAACGGATTGGTGTCGAGTTAGTAAAGGCGAGGGCTAGGACGCTGGTCTTGTGAGGAAGAG
GAGGTTGGATTATTTTTTCTTTTCTTTAAAAAGTTGTTTATTTAAGTAGTTTCTTTAATTCGGGCAACCC
ACACGACAACCCAATAAATTAAACAACGAAAAATGCAACCTCTATAACCCTCAATAGCTCTGGCTGGTAC
AGATGGGCAGAAACTCACCTTGCTCCCACGGCATACATCCCTCCCAATAAGTATCTGCAGATGGCGAATA
AACCCCATGCTCTGCGCAAATATATTGAAGGCTGACGACTTTGATGTTATGATCCACCCCATAAACGCTC

0.3 Compressing ard archiving files¶

In Bioinformatics the most common formats are gzip and bgzip.

In [25]:

cd ~/seqs
ls

CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak

Compressing with `gzip`¶

In [26]:

gzip *fa
ls

CM000046.1.fa.gz  CM000047.1.fa.bak  CM000047.1.fa.gz

Uncompressing with `gzip`¶

In [27]:

gunzip *gz
ls

CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak

Compressing with `bgzip`¶

bgzip (Blocked GNU Zip Format) is commonly used in Bioinformatics, and allows random access of the content. BAM, BCF and VCF file formats are typically bgzip compressed.

Unfortunately bgzip does not take multiple files, and we have to use the xargs command to pipe in one file at a time.

In [28]:

xargs --help | grep '\-n,'

  -n, --max-args=MAX-ARGS      use at most MAX-ARGS arguments per command line

In [29]:

ls *fa | xargs -n1 bgzip
ls

CM000046.1.fa.gz  CM000047.1.fa.bak  CM000047.1.fa.gz

Note that `bgzip` files can be uncompressed with `gzip`¶

In [30]:

gunzip *gz
ls

CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak

Viewing compressed files¶

Some utilities can work directly with gzipped files. One that we will be using extensively is zcat.

In [31]:

ls *fa | xargs -n1 bgzip
ls

CM000046.1.fa.gz  CM000047.1.fa.bak  CM000047.1.fa.gz

In [32]:

zcat CM000046.1.fa.gz | head

>CM000046.1 Cryptococcus neoformans var. neoformans B-3501A chromosome 7, whole genome shotgun sequence
TGGTCTTATGAGGAAGAGGAGTTTGGATTATTTTTTCTTTTCTTTAAAAAGTTGTTTATTTAAGTAGTTT
CTTTAATTCGGGTAACACACACGACAACCCAATAAATTAAACAACGAAAAAATGCAACCTCTATAACCCC
CTAACCCTTTTATTCGGGTAACACACACGACAACCCAATAAATTAAACAACGAAAAATGCAACCTCTATA
ACCCCCGAAAAGGATTGGTGTCGAGTTAGTAAAGGCGAGGGCTAGGACGCTGGTCTTGTGAGGAAGAGGA
GGTTGGATTATTTTTTCTTTTCTTTAATAAGTTGTTTATTTAAGTAGTTTCTTTTATTCGGGTAACACAC
ACGACAACCCAATAAATTAAACAACGAAAAATGCAACCTCTATAACCCCCATCCAAGGTGCCACGGTTGA
GATTGAGCCCCACACACTGTCGGGACAAAGGAGAAACGACCTTCGGGTCAGAGGTTCCAGCGCTCTGGCC
TTCACTGACTACGACCTGAAGGTTTACTCCCTCGGGGACCGAGACGCGAGGAGCACAGCCACCCCCAGCA
CCCCCAACAGCAAACTGGCCGACTTCTGCTTGGACCGGTGCGTGAACTGGCTCGACAAGGTGGGTCAGGT

In [33]:

gunzip *gz
ls

CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak

Archival with `tar`¶

Often we want to compress an entire direcotry (possibly with subdirectories). In Unix, this is most commonly done with tar. Flags to note

c compress
x extract
z gzip or gunzip
f files
v verbose

In [34]:

tar --help | head -8

Usage: tar [OPTION...] [FILE]...
GNU 'tar' saves many files together into a single tape or disk archive, and can
restore individual files from the archive.

Examples:
  tar -cf archive.tar foo bar  # Create archive.tar from files foo and bar.
  tar -tvf archive.tar         # List all files in archive.tar verbosely.
  tar -xf archive.tar          # Extract all files from archive.tar.

In [35]:

cd
ls seqs

CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak

In [36]:

tar czvf seqs.tar.gz seqs
ls seqs*

seqs/
seqs/CM000046.1.fa
seqs/CM000047.1.fa
seqs/CM000047.1.fa.bak
seqs.tar.gz

seqs:
CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak

View contents of archive¶

In [37]:

tar tzvf seqs.tar.gz

drwxr-xr-x cliburn/cliburn   0 2018-06-21 10:25 seqs/
-rw-r--r-- cliburn/cliburn 1434513 2018-06-21 10:25 seqs/CM000046.1.fa
-rw-r--r-- cliburn/cliburn 1194378 2018-06-21 10:25 seqs/CM000047.1.fa
-rw-r--r-- cliburn/cliburn 1194398 2018-06-21 10:19 seqs/CM000047.1.fa.bak

Remove direcotry and regenerate from archive¶

In [38]:

rm -rf seqs
ls seqs*

seqs.tar.gz

In [39]:

tar xzvf seqs.tar.gz
ls seqs*

seqs/
seqs/CM000046.1.fa
seqs/CM000047.1.fa
seqs/CM000047.1.fa.bak
seqs.tar.gz

seqs:
CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak

In [40]:

rm seqs.tar.gz

0.4 Checksums¶

When working with genomic data, we deal with very large files. There is a small risk that these files will be corrupted over time or during data transfer. To ensure that files are not changed, we use a “checksum” function. This is a function that generates an long, essentially random number called a checksum that represents the contents of the file. When the file contents change, so will the checksum. In theory, there is a very small probability that two different files generate the same checksum, but in practice the probability is too small to worry about.

There are several different algorithms for generating the checksums, and at least 3 Unix commands to do so, but they all work very similarly for our purposes.

The strategy is:

Generate and store a checksum together with a data file whose integrity you care about
When you use or download the data, re-generate the checksum (using the same algorithm e.g. MD5) and compare with the checksum

In [41]:

ls ~/seqs

CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak

In [42]:

md5sum ~/seqs/*fa > MD5_CHECKSUM

We check that the files are unchanged wth the `-c` flag¶

In [43]:

md5sum -c MD5_CHECKSUM

/home/cliburn/seqs/CM000046.1.fa: OK
/home/cliburn/seqs/CM000047.1.fa: OK

Suppsoe one of the files is corrupted¶

In [44]:

echo 'Oops I did it again' >> seqs/CM000047.1.fa

In [45]:

md5sum -c MD5_CHECKSUM  \
| cat # this is jusst so we can Run All

/home/cliburn/seqs/CM000046.1.fa: OK
md5sum: /home/cliburn/seqs/CM000047.1.fa: FAILED
WARNING: 1 computed checksum did NOT match

Using `diff` to find changes¶

Let’s rename the altered file, re-download the orginal and see what was changed.

In [46]:

mv seqs/CM000047.1.fa seqs/CM000047.1.fa.bak

In [47]:

efetch -db=nuccore -format=fasta -id=CM000047.1 > seqs/CM000047.1.fa

In [48]:

diff -u seqs/CM000047.1.fa seqs/CM000047.1.fa.bak \
| cat # this is jusst so we can Run All

--- seqs/CM000047.1.fa  2018-06-21 10:25:49.253700667 -0400
+++ seqs/CM000047.1.fa.bak      2018-06-21 10:25:48.425709045 -0400
@@ -16820,3 +16820,4 @@
 CCGTGTCGGCACGCATCTGATTTTTTTCTTATTATTCATAACAACGGCTACAGATCACGCAAGCGCATCG
 ACTCCGCTGCTGCGAAACACCACGAGTACTGCAACCATTGCTGCGGATGAGCTTAGGGGGTTAGGGGGTT
 AGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGG
+Oops I did it again

Explanaiton from [official documentation]

Header

--- seqs/CM000047.1.fa  2018-06-21 09:58:18.246185758 -0400
+++ seqs/CM000047.1.fa.bak  2018-06-21 09:55:55.967707764 -0400

Explanation

The unified output format starts with a two-line header, which looks like this:

--- from-file from-file-modification-time
+++ to-file to-file-modification-time
The timestamp looks like ‘2002-02-21 23:30:39.942229878 -0800’ to indicate the date, time with fractional seconds, and time zone. The fractional seconds are omitted on hosts that do not support fractional timestamps.

You can change the header’s content with the --label=label option. See Alternate Names.

Hunk

@@ -16820,3 +16820,4 @@
 CCGTGTCGGCACGCATCTGATTTTTTTCTTATTATTCATAACAACGGCTACAGATCACGCAAGCGCATCG
 ACTCCGCTGCTGCGAAACACCACGAGTACTGCAACCATTGCTGCGGATGAGCTTAGGGGGTTAGGGGGTT
 AGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGG
+Oops I did it again

Explanation

``` @@ from-file-line-numbers to-file-line-numbers @@ line-from-either-file line-from-either-file…

If a hunk contains just one line, only its start line number appears. Otherwise its line numbers look like ‘start,count’. An empty hunk is considered to start at the line that follows the hunk.

If a hunk and its context contain two or more lines, its line numbers look like ‘start,count’. Otherwise only its end line number appears. An empty hunk is considered to end at the line that precedes the hunk.

The lines common to both files begin with a space character. The lines that actually differ between the two files have one of the following indicator characters in the left print column:

‘+’ A line was added here to the first file.

‘-’ A line was removed here from the first file. ```

Verify¶

We can confirm that a line was added at the position given by diff

In [49]:

cat seqs/CM000047.1.fa | tail +16820 | head -3

CCGTGTCGGCACGCATCTGATTTTTTTCTTATTATTCATAACAACGGCTACAGATCACGCAAGCGCATCG
ACTCCGCTGCTGCGAAACACCACGAGTACTGCAACCATTGCTGCGGATGAGCTTAGGGGGTTAGGGGGTT
AGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGG

In [50]:

cat seqs/CM000047.1.fa.bak | tail +16820 | head -4

CCGTGTCGGCACGCATCTGATTTTTTTCTTATTATTCATAACAACGGCTACAGATCACGCAAGCGCATCG
ACTCCGCTGCTGCGAAACACCACGAGTACTGCAACCATTGCTGCGGATGAGCTTAGGGGGTTAGGGGGTT
AGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGG
Oops I did it again

1. Working with FASTA files¶

From NCBI FASTA specification

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:

        >P01013 GENE X PROTEIN (OVALBUMIN-RELATED)
        QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
        KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
        VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
        FLFLIKHNPTNTIVYFGRYWSP

Note that a FASTA file may contain mulitple sequences, each with its own description line.

Set up variables for files we will be using¶

In [51]:

DATA_DIR='/data/hts/2018/Granek_4837_180427A5'
FASTA='cneo040623.scaffs.fa.gz'
FASTQ='10_MA_C_S3_L003_R1_001.fastq.gz'
GTF='Cryptococcus_neoformans.ASM9104v1.39.gtf.gz'

This is how you get the value of a shell variable¶

In [52]:

echo $DATA_DIR

/data/hts/2018/Granek_4837_180427A5

List files in data direcotry¶

In [53]:

ls $DATA_DIR | head -3

10_MA_C_S3_L001_R1_001.fastq
10_MA_C_S3_L002_R1_001.fastq.gz
10_MA_C_S3_L003_R1_001.fastq.gz

Inspect the contents of a zipped file.¶

zcat is like cat but works on gzipped files.

In [54]:

zcat $DATA_DIR/$FASTA | head

>chr01.b3501.040506 5 contigs (2263073 bp)
CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
GGATTTATACAACGATAAGAGGCTATCCGCTGACTATAAATTGTGTTAGC
TGATCCAGATCCGCAGTTGGCACAAACCGATGTTTCCTTTTCCTCAGGTC
CTGAATATTTCTCTTCCATCAAGGCTCCCAACCCCGAGGGGAGTATTTCC
ACCCGATCTGATTCCAAGAGGTCCAGCGTCAACCAGGTACGTGAATGTTT
TCTTCTTTAGTGGTGGGGACCCTATGCTTGTCCCACAGAACTGCACTGAA

Show headers only¶

In [55]:

zcat $DATA_DIR/$FASTA   | grep '^>' | head

>chr01.b3501.040506 5 contigs (2263073 bp)
>chr02.b3501.040506 6 contigs (1607892 bp)
>chr03.b3501.040506 6 contigs (2025976 bp)
>chr04.b3501.040623 5 contigs (1762351 bp)
>chr05.b3501.040506 7 contigs (1432964 bp)
>chr06.b3501.040616 4 contigs (1411173 bp)
>chr07.b3501.040506 6 contigs (1356306 bp)
>chr08.b3501.040506 3 contigs (1153453 bp)
>chr09.b3501.040506 5 contigs (1072068 bp)
>chr10.b3501.040506 3 contigs (1017008 bp)

Show headers with line numbers¶

In [56]:

zcat $DATA_DIR/$FASTA| grep -n '^>' | head

1:>chr01.b3501.040506 5 contigs (2263073 bp)
45264:>chr02.b3501.040506 6 contigs (1607892 bp)
77423:>chr03.b3501.040506 6 contigs (2025976 bp)
117944:>chr04.b3501.040623 5 contigs (1762351 bp)
153193:>chr05.b3501.040506 7 contigs (1432964 bp)
181854:>chr06.b3501.040616 4 contigs (1411173 bp)
210079:>chr07.b3501.040506 6 contigs (1356306 bp)
237207:>chr08.b3501.040506 3 contigs (1153453 bp)
260278:>chr09.b3501.040506 5 contigs (1072068 bp)
281721:>chr10.b3501.040506 3 contigs (1017008 bp)

Count number of headers (one for each crhomosome)¶

In [57]:

zcat $DATA_DIR/$FASTA  | grep -c '^>'

Extract first part of the sequence from chromosome 7¶

In [58]:

zcat $DATA_DIR/$FASTA | sed -n '210079,210082p'

>chr07.b3501.040506 6 contigs (1356306 bp)
TGGTCTTATGAGGAAGAGGAGTTTGGATTATTTTTTCTTTTCTTTAAAAA
GTTGTTTATTTAAGTAGTTTCTTTAATTCGGGTAACACACACGACAACCC
AATAAATTAAACAACGAAAAAATGCAACCTCTATAACCCCCTAACCCTTT

Extract first line of chromosome 7 and save to file¶

In [59]:

zcat $DATA_DIR/$FASTA | sed -n '210080p' > dna.txt

Inspect the file we just saved¶

In [60]:

cat dna.txt

TGGTCTTATGAGGAAGAGGAGTTTGGATTATTTTTTCTTTTCTTTAAAAA

Find complement¶

In [61]:

cat dna.txt | tr ACGT TGCA

ACCAGAATACTCCTTCTCCTCAAACCTAATAAAAAAGAAAAGAAATTTTT

Find reverse complement¶

In [62]:

cat dna.txt | tr ACGT TGCA | rev

TTTTTAAAGAAAAGAAAAAATAATCCAAACTCCTCTTCCTCATAAGACCA

Delete A and T¶

In [63]:

cat dna.txt | tr -d 'AT'

GGCGGGGGGGGGCC

Only count non A and T¶

In [64]:

cat dna.txt | tr -d 'AT' | tr -d '\n' | wc -c

In [65]:

man tr | grep '\-d,' -A2

       -d, --delete
              delete characters in SET1, do not translate

Count C and G¶

In [66]:

grep '[CG]' -o dna.txt | wc -l

In [67]:

man grep | grep '\-o,' -A2

       -o, --only-matching
              Print  only  the  matched  (non-empty) parts of a matching line,
              with each such part on a separate output line.

Specify data directory as a shell variable¶

2. Working with FASTQ files¶

From Wikipedia

A FASTQ file normally uses four lines per sequence.

Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

Specify target file as a shell variable¶

Show the contents of a zipped file¶

In [68]:

zcat $DATA_DIR/$FASTQ | head -n 10

@NB501800:128:HTYKFBGX5:3:11401:19195:1019 1:N:0:ATTACTCG+AGGATAGG
NGTAAGGGTCTGGGGGACGTTGTTGTGGATGGAAACAACCCTGGCCTCTCGGGGCTGAGCAGGGAGTCGCTTGAGC
+
#AAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEE/<<
@NB501800:128:HTYKFBGX5:3:11401:24339:1020 1:N:0:ATTACTCG+AGGATAGG
NCTCGCCTGGGGTGCTTGCGTGTCCCTGTCCTTGGCCCTGGTCACCGGGGAGATGCACATCATCTTCTTCTGGCTC
+
#AAAAEAAEEEEEEEEEEEEEEEE/EEE/EEEEEEEEEEEEEEEE<A/EEEE6EEE//EEEAEEEEEEE/EAEEEA
@NB501800:128:HTYKFBGX5:3:11401:24140:1020 1:N:0:ATTACTCG+AGGATAGG
NTGATCACGGGAGACAAGCTAGAGATGCAAAATGCCATAGTACGATCATCCCATACAAGATGGGTGAGAGAAAAAG

Extracting just the sequence¶

We need to extract every 4th line starting from the 2nd. We cna use a tool called sed (stream editor) to do so.

In [69]:

zcat $DATA_DIR/$FASTQ | sed -n '2~4p;' | head

NGTAAGGGTCTGGGGGACGTTGTTGTGGATGGAAACAACCCTGGCCTCTCGGGGCTGAGCAGGGAGTCGCTTGAGC
NCTCGCCTGGGGTGCTTGCGTGTCCCTGTCCTTGGCCCTGGTCACCGGGGAGATGCACATCATCTTCTTCTGGCTC
NTGATCACGGGAGACAAGCTAGAGATGCAAAATGCCATAGTACGATCATCCCATACAAGATGGGTGAGAGAAAAAG
NGGGCCTTGGTCCTGCCCTCAGCCTTGATAGGACCCTTGCCGTCGATGGGGTTACCCAAAGCGTCAACAACTCGGC
NACCTTTCGGAGACCTCGGTGGGTCTTCCTGGGGAGACGGGTAGTACCCCATCGGGCGGTAACACCCTCGTAACCG
NCCGAACTGTTGAAAAGAGCTTCCACGCTCCAAAGGCAGGGACAAGAGTAAAGAGCCACCACAGCCAATCTCCAAA
NGCAAACTGAACGTCCTCAGTCTCCATTCGCTGGAAGTCCGATTCAGGACGGGCAGCGTCATCGATAGAGAATGGA
NGCATTACCAGATGCACTGATAGCAATGATCGTCTCATCTTCCAAGTACTTGCTCACAGCGATACTATCACTCAAA
GGCGATCTTCCTCTTGGGGAAGTGAACGTTCTTCTTGTGCTGCTTCTCGGTGATAGCAGAAGCCTCCTTGGTAGTC
CGGCGATGATGAGAGTGAAGGGATCTTGCCACATGAGAGGGGATGATCGACGGGGACGAGGACGGATGGGGATGAT

Explanation

sed -n '2~4p'

-n means to suppress all output
2~4p means print (p) every 4th line starting at line 2

Joining lines¶

Getting sequence as a single string. We just need to delete newlline (:raw-latex:`\n`) characters.

Note: We can break long commands uisng the line continuation character

In [70]:

zcat $DATA_DIR/$FASTQ | \
sed -n '2~4p;' | \
head | \
tr -d '\n'

NGTAAGGGTCTGGGGGACGTTGTTGTGGATGGAAACAACCCTGGCCTCTCGGGGCTGAGCAGGGAGTCGCTTGAGCNCTCGCCTGGGGTGCTTGCGTGTCCCTGTCCTTGGCCCTGGTCACCGGGGAGATGCACATCATCTTCTTCTGGCTCNTGATCACGGGAGACAAGCTAGAGATGCAAAATGCCATAGTACGATCATCCCATACAAGATGGGTGAGAGAAAAAGNGGGCCTTGGTCCTGCCCTCAGCCTTGATAGGACCCTTGCCGTCGATGGGGTTACCCAAAGCGTCAACAACTCGGCNACCTTTCGGAGACCTCGGTGGGTCTTCCTGGGGAGACGGGTAGTACCCCATCGGGCGGTAACACCCTCGTAACCGNCCGAACTGTTGAAAAGAGCTTCCACGCTCCAAAGGCAGGGACAAGAGTAAAGAGCCACCACAGCCAATCTCCAAANGCAAACTGAACGTCCTCAGTCTCCATTCGCTGGAAGTCCGATTCAGGACGGGCAGCGTCATCGATAGAGAATGGANGCATTACCAGATGCACTGATAGCAATGATCGTCTCATCTTCCAAGTACTTGCTCACAGCGATACTATCACTCAAAGGCGATCTTCCTCTTGGGGAAGTGAACGTTCTTCTTGTGCTGCTTCTCGGTGATAGCAGAAGCCTCCTTGGTAGTCCGGCGATGATGAGAGTGAAGGGATCTTGCCACATGAGAGGGGATGATCGACGGGGACGAGGACGGATGGGGATGAT

3.Working wtih GTF files¶

From UCSD Genome Browser

The GTF (General Transfer Format) is an extension of GFF version 2
and used to represent transcription models. GFF (General Feature Format)
consists of one line per feature, each containing 9 columns of data.

Fields

Fields are tab-separated. Also, all but the final field in each
feature line must contain a value; "empty" columns are denoted
with a '.'

    seqname   - name of the chromosome or scaffold; chromosome names
                without a 'chr'
    source    - name of the program that generated this feature, or
                the data source (database or project name)
    feature   - feature type name. Current allowed features are
                {gene, transcript, exon, CDS, Selenocysteine, start_codon,
                stop_codon and UTR}
    start     - start position of the feature, with sequence numbering
                starting at 1.
    end       - end position of the feature, with sequence numbering
                starting at 1.
    score     - a floating point value indiciating the score of a feature
    strand    - defined as + (forward) or - (reverse).
    frame     - one of '0', '1' or '2'. Frame indicates the number of base pairs
                before you encounter a full codon. '0' indicates the feature
                begins with a whole codon. '1' indicates there is an extra
                base (the 3rd base of the prior codon) at the start of this feature.
                '2' indicates there are two extra bases (2nd and 3rd base of the
                prior exon) before the first codon. All values are given with
                relation to the 5' end.
    attribute - a semicolon-separated list of tag-value pairs (separated by a space),
                providing additional information about each feature. A key can be
                repeated multiple times.

Attributes

The following attributes are available. All attributes are semi-colon
separated pairs of keys and values.

- gene_id: The stable identifier for the gene
- gene_version: The stable identifier version for the gene
- gene_name: The official symbol of this gene
- gene_source: The annotation source for this gene
- gene_biotype: The biotype of this gene
- transcript_id: The stable identifier for this transcript
- transcript_version: The stable identifier version for this transcript
- transcript_name: The symbold for this transcript derived from the gene name
- transcript_source: The annotation source for this transcript
- transcript_biotype: The biotype for this transcript
- exon_id: The stable identifier for this exon
- exon_version: The stable identifier version for this exon
- exon_number: Position of this exon in the transcript
- ccds_id: CCDS identifier linked to this transcript
- protein_id: Stable identifier for this transcript's protein
- protein_version: Stable identifier version for this transcript's protein
- tag: A collection of additional key value tags
- transcript_support_level: Ranking to assess how well a transcript is supported (from 1 to 5)

Inspect the GTF file¶

In [71]:

zcat $DATA_DIR/$GTF | head

#!genome-build ASM9104v1
#!genome-version ASM9104v1
#!genome-date 2016-06
#!genome-build-accession GCA_000091045.1
#!genebuild-last-updated 2016-06
1       ena     gene    4839    6236    .       +       .       gene_id "CNA00010"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      4839    6236    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    4839    5178    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-1"; exon_version "1";
1       ena     CDS     4914    5178    .       +       0       gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AAW41410"; protein_version "1";
1       ena     start_codon     4914    4916    .       +       0       gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";

Ignore comment lines version 1¶

This uses tail to take all lines starting from line 6

In [72]:

zcat $DATA_DIR/$GTF | tail +6 | head -3

1       ena     gene    4839    6236    .       +       .       gene_id "CNA00010"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      4839    6236    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    4839    5178    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-1"; exon_version "1";

Ingore comment lines 2¶

This uses grep to ignore (-v) lines starting with ^#

In [73]:

zcat $DATA_DIR/$GTF | grep -v '^#' | head -3

1       ena     gene    4839    6236    .       +       .       gene_id "CNA00010"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      4839    6236    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    4839    5178    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-1"; exon_version "1";

Find all unique feature types in GTF¶

In [74]:

zcat $DATA_DIR/$GTF | grep -v '^#' | cut -f 3 | sort | uniq

CDS
exon
five_prime_utr
gene
start_codon
stop_codon
three_prime_utr
transcript

Get counts of all unique feature types in GTF¶

In [75]:

zcat $DATA_DIR/$GTF | grep -v '^#' | cut -f 3 | sort | uniq -c

  44079 CDS
  44344 exon
   6738 five_prime_utr
   6886 gene
   6894 start_codon
   2369 stop_codon
   6718 three_prime_utr
   7117 transcript

Find the the attributes of the first 3 exons¶

In [76]:

zcat $DATA_DIR/$GTF | grep -v '^#' | awk '$3 == "exon"' | head -3

1       ena     exon    4839    5178    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-1"; exon_version "1";
1       ena     exon    5241    5377    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "2"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-2"; exon_version "1";
1       ena     exon    5453    5560    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "3"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-3"; exon_version "1";

Store attributes of all genes in a file¶

It is common to use Unix commands to do simple processing and save the results to a file, then use a scripting language such as R or Python to furhter process the file.

In [77]:

zcat $DATA_DIR/$GTF | grep -v '^#' | awk '$3 == "gene"' | cut -f 9 > genes.txt

In [78]:

head -6 genes.txt

gene_id "CNA00010"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00020"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00030"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00040"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00050"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00060"; gene_source "ena"; gene_biotype "protein_coding";

4. Looping¶

Loop throgh first 10 lines¶

Creae a counter variable LINE_NUM and initialize it to 0
Redirect genes.txt to a while loop
Whlie loop reads each line into the variable LINE
The [[ CONDITION ]] tests if CONDITION is true
If LINE_NUM is less than -lt 10 print LINE and increment LINE_NUM
Note that (( EXPR )) is used to evaluate numeric expressions

In [79]:

LINE_NUM=0
while read LINE
do
    if [[ $LINE_NUM -lt 10 ]]; then
        echo $LINE
        (( LINE_NUM++ ))
    fi
done < genes.txt

gene_id "CNA00010"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00020"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00030"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00040"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00050"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00060"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00070"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00075"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00080"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00090"; gene_source "ena"; gene_biotype "protein_coding";

Loop through first 10 tRNAs¶

Almost the same as previous program but we add a new condition to check tha the gene_biotype is tRNA using a REGULAr EXPRESSION via the match operator =~

.*gene_biotype[[:space:]]\"tRNA\".* means

.*: Match any number of any character
gene_biotype: Match the string gene_biotype
[[:space:]]: Match a single space cahracter
\"tRNA\": Math the string “tRNA” - we escape the double quotes to match linteral double quote characters
.*: Match any number of any character

In [80]:

LINE_NUM=0
while read LINE
do
    if [[ $LINE_NUM -lt 10 ]] && [[ $LINE =~ .*gene_biotype[[:space:]]\"tRNA\".* ]] ; then
        echo $LINE
        ((LINE_NUM++))
    fi
done < genes.txt

gene_id "CNA02110"; gene_name "CNA02110"; gene_source "ena"; gene_biotype "tRNA";
gene_id "EBG00005237496"; gene_name "tRNA"; gene_source "Rfam"; gene_biotype "tRNA";
gene_id "CNA02190"; gene_name "CNA02190"; gene_source "ena"; gene_biotype "tRNA";
gene_id "EBG00005237490"; gene_name "tRNA"; gene_source "Rfam"; gene_biotype "tRNA";
gene_id "CNA02200"; gene_name "CNA02200"; gene_source "ena"; gene_biotype "tRNA";
gene_id "CNA02280"; gene_name "CNA02280"; gene_source "ena"; gene_biotype "tRNA";
gene_id "EBG00005237502"; gene_name "tRNA"; gene_source "Rfam"; gene_biotype "tRNA";
gene_id "CNA02320"; gene_name "CNA02320"; gene_source "ena"; gene_biotype "tRNA";
gene_id "EBG00005237484"; gene_name "tRNA"; gene_source "Rfam"; gene_biotype "tRNA";
gene_id "CNA03230"; gene_name "CNA03230"; gene_source "ena"; gene_biotype "tRNA";

Loop through files in a directory and print the filename followed by the first line¶

Use a COMMAND EXPRESSION $(EXPR) to list files matching the glob pattern
We loop through each file using a for statement
For each file, extract the first line with head -1 and save in variable FIRST
The 2>/dev/null ignores output from the stderr stream
We use another COMMAND EXPRESSION to print the filename withou the directory (using basename)
We then print the first line storred in FIRST

In [81]:

for FILE in $(ls $DATA_DIR/*_MA_J_S20_*gz)
do
    FIRST=$(zcat $FILE 2>/dev/null | head -1)
    echo $(basename $FILE)
    echo $FIRST
    echo
done

11_MA_J_S20_L001_R1_001.fastq.gz
@NB501800:128:HTYKFBGX5:1:11101:26760:1056 1:N:0:CGCTCATT+NGGATAGG

11_MA_J_S20_L002_R1_001.fastq.gz
@NB501800:128:HTYKFBGX5:2:11101:20327:1041 1:N:0:CGCTCATT+AGGATAGN

11_MA_J_S20_L003_R1_001.fastq.gz
@NB501800:128:HTYKFBGX5:3:11401:10896:1024 1:N:0:CGCTCATT+AGGATAGG

11_MA_J_S20_L004_R1_001.fastq.gz
@NB501800:128:HTYKFBGX5:4:11401:8259:1032 1:N:0:CGCTCATT+AGGATAGG

Using bash in bioinformatics¶