Using bash in bioinformatics

The Unix commands shown in this notebook are quite intimidating at first. There’s a lot of material becaue I try to cover all the stuff you willl need to construct a typicla bioinformatics pipeline.

Try to understand the expressions and play aound to see what varying them will do. You do not need to memorize the commands now - you can always come back to this notebook for reference. You will also get more practice when you do the Bioiinformatics sessions to construct Unix pipelines for processing your RNA-seq data sets.

Hack to handle broekn pipes - IGNORE.

In [1]:
cleanup () {
    :
}

trap "cleanup" SIGPIPE

Treat unset variables as an error, and immediately exit.

In [2]:
set -u

0.1. Files, directories and working with text

Quickly reveiw what the following navigation and file/directory manipulation commands do:

>, >>, <, |, ls, cd, pwd, cp, mv, rm, mkdir, rmdir

This session dives quite deep into Unix shell scripting and invovles quite a bit of text processing. We review some basic commands for working with text in Unix.

This is the file we will warm up with

We create it using a HEREDOC that allows us to construct multi-line files and redirrect it to the file junk.txt

In [3]:
cat <<EOF > junk.txt
1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
4. TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
EOF

We can see the contents of a file with cat

In [4]:
cat junk.txt
1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
4. TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG

Using head and tail

In [5]:
head -3 junk.txt
1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
In [6]:
tail -3 junk.txt
2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
4. TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
In [7]:
tail +3 junk.txt
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
4. TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG

Pipling to chain commnads

In [8]:
head -3 junk.txt | tail -2
2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA

Searching within a file with grep

  • grep takes a REGEX (regular expression) as the query and a file (or files) to search
  • Normally grep shows lines that match the REGEX argument
  • The -v flag shows the lines that DO NOT match the REGEX argument
In [9]:
grep "[2|4]" junk.txt
2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
4. TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
In [10]:
grep -v "[2|4]" junk.txt
1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA

Use a regular expression with grep

In [11]:
grep "CCC.*CCC" junk.txt
1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG

Get rid of numbers with cut

cut extracts columns separated by delimiters - here the delimiter is a space -d ' ' - and we cut the second column -f 2

In [12]:
cut -d ' ' -f 2 junk.txt
CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG

Get rid of numbers with grep

The -o flag to grep shows only the strings mathcing the regular expression

In [13]:
grep -o '[A-Z]*' junk.txt
CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG

Extract only odd numbered lines with sed

  • The -n says not to print anything
  • Then we print p the lines starting at 1 with a step of 2
In [14]:
sed -n "1~2p" junk.txt
1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA

Use sed for substitution

The syntax is s/PATTERN/REPLACMENT/g/

  • The s/// measn substitute
  • The g means replace globally; otherwise just replaces the first patttern found
  • [0-9][0-9]* is a regular expression matching a digit followed by zero or more digits
  • \. matches the literal .
  • [[:space:]] matches a single space character
In [15]:
sed "s/[0-9][0-9]*\.[[:space:]]//g" junk.txt
CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG

Use sed to remove dupplicate characters using a REGEX capture group

In [16]:
sed -E "s/(.)\1*/\1/g" junk.txt
1. CTACTACTACTCAGTCGAGCGACATAGCTG
2. GTCATATGTCTGTACGTATGCACTGATCATACTGATCGAGC
3. GACGATATCAGACGACATGTCGTCTGACACGA
4. TCTCGTCGCTGCGTGTCTCTCAGTGCAGTACG

Convert DNA to RNA with tr

In [17]:
cat junk.txt | tr T U
1. CCCUAACCCCCUAACCCCCUAACCCUCAGUCGGGGAGGCGACAAUAGCUG
2. GUCAUAUGUUCUGUACGUUAUUGGCCAACUGAUCAUACCUGAAUCGAGCC
3. GAACCGGGAUUAUCAAAGACGAACAUGGUCGGGUCCUUGAACCAAACGAA
4. UCUCCGUCCGCUGGCGUGUUUUUCUUUUCUCAAGUGGGCAAGUUACCCGG

Reverse complement with tr

In [18]:
grep -o '[ACTG]*' junk.txt | tr ACGT TGCA | rev
CAGCTATTGTCGCCTCCCCGACTGAGGGTTAGGGGGTTAGGGGGTTAGGG
GGCTCGATTCAGGTATGATCAGTTGGCCAATAACGTACAGAACATATGAC
TTCGTTTGGTTCAAGGACCCGACCATGTTCGTCTTTGATAATCCCGGTTC
CCGGGTAACTTGCCCACTTGAGAAAAGAAAAACACGCCAGCGGACGGAGA

Split command over multiple lines

Use \ to continue command to next line

In [19]:
grep -o '[ACTG]*' junk.txt | tr \
ACGT TGCA | rev
CAGCTATTGTCGCCTCCCCGACTGAGGGTTAGGGGGTTAGGGGGTTAGGG
GGCTCGATTCAGGTATGATCAGTTGGCCAATAACGTACAGAACATATGAC
TTCGTTTGGTTCAAGGACCCGACCATGTTCGTCTTTGATAATCCCGGTTC
CCGGGTAACTTGCCCACTTGAGAAAAGAAAAACACGCCAGCGGACGGAGA

Actually the line continuation character is not necesssary if you break at a pipe |

In [20]:
grep -o '[ACTG]*' junk.txt |
tr ACGT TGCA |
rev
CAGCTATTGTCGCCTCCCCGACTGAGGGTTAGGGGGTTAGGGGGTTAGGG
GGCTCGATTCAGGTATGATCAGTTGGCCAATAACGTACAGAACATATGAC
TTCGTTTGGTTCAAGGACCCGACCATGTTCGTCTTTGATAATCCCGGTTC
CCGGGTAACTTGCCCACTTGAGAAAAGAAAAACACGCCAGCGGACGGAGA

0.2 Getting files

We will download Chromosomes 7 and 8 of Crytococcus neoformans in FASTA format from NCBI Nucleotide. We can find their IDs through a web query.

We find that the IDs are

CM000046.1 for chormosome 7
CM000047.1 for chormosome 8

We will use the NCBI Entrez tool efetch to download the FASTA files to a folder seqs.

In [21]:
cd
mkdir -p seqs
efetch -db=nuccore -format=fasta -id=CM000046.1 > seqs/CM000046.1.fa
efetch -db=nuccore -format=fasta -id=CM000047.1 > seqs/CM000047.1.fa
In [22]:
ls seqs
CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak
In [23]:
head seqs/CM000046.1.fa
>CM000046.1 Cryptococcus neoformans var. neoformans B-3501A chromosome 7, whole genome shotgun sequence
TGGTCTTATGAGGAAGAGGAGTTTGGATTATTTTTTCTTTTCTTTAAAAAGTTGTTTATTTAAGTAGTTT
CTTTAATTCGGGTAACACACACGACAACCCAATAAATTAAACAACGAAAAAATGCAACCTCTATAACCCC
CTAACCCTTTTATTCGGGTAACACACACGACAACCCAATAAATTAAACAACGAAAAATGCAACCTCTATA
ACCCCCGAAAAGGATTGGTGTCGAGTTAGTAAAGGCGAGGGCTAGGACGCTGGTCTTGTGAGGAAGAGGA
GGTTGGATTATTTTTTCTTTTCTTTAATAAGTTGTTTATTTAAGTAGTTTCTTTTATTCGGGTAACACAC
ACGACAACCCAATAAATTAAACAACGAAAAATGCAACCTCTATAACCCCCATCCAAGGTGCCACGGTTGA
GATTGAGCCCCACACACTGTCGGGACAAAGGAGAAACGACCTTCGGGTCAGAGGTTCCAGCGCTCTGGCC
TTCACTGACTACGACCTGAAGGTTTACTCCCTCGGGGACCGAGACGCGAGGAGCACAGCCACCCCCAGCA
CCCCCAACAGCAAACTGGCCGACTTCTGCTTGGACCGGTGCGTGAACTGGCTCGACAAGGTGGGTCAGGT
In [24]:
head seqs/CM000047.1.fa
>CM000047.1 Cryptococcus neoformans var. neoformans B-3501A chromosome 8, whole genome shotgun sequence
ATGGCCTTCCTGACTACGACCTGAGGTTTACTCCCTCGGCGACCGAGACGCGAGGAGCACAGCCACCCCA
AGCACCCCCAACAGCAAGCTGGCCGAATTCTGCTTGGACCGGTGCGTGAACTGGCTCGACAAGGTGGGTC
AGGTCGTCTCGAAGAACGCTCCGAAAGTCACTGGTGGGGTCTTTAAACCGATCATCCTTTCCACTGGTGG
CCTGATGAGCAGGAGCACAGCAGACGAATGGAAGGAGTGGAGGGAGGCGATGCCGGTGGGGGGGTTCGAG
AAAATGGAGAAACGGATTGGTGTCGAGTTAGTAAAGGCGAGGGCTAGGACGCTGGTCTTGTGAGGAAGAG
GAGGTTGGATTATTTTTTCTTTTCTTTAAAAAGTTGTTTATTTAAGTAGTTTCTTTAATTCGGGCAACCC
ACACGACAACCCAATAAATTAAACAACGAAAAATGCAACCTCTATAACCCTCAATAGCTCTGGCTGGTAC
AGATGGGCAGAAACTCACCTTGCTCCCACGGCATACATCCCTCCCAATAAGTATCTGCAGATGGCGAATA
AACCCCATGCTCTGCGCAAATATATTGAAGGCTGACGACTTTGATGTTATGATCCACCCCATAAACGCTC

0.3 Compressing ard archiving files

In Bioinformatics the most common formats are gzip and bgzip.

In [25]:
cd ~/seqs
ls
CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak

Compressing with gzip

In [26]:
gzip *fa
ls
CM000046.1.fa.gz  CM000047.1.fa.bak  CM000047.1.fa.gz

Uncompressing with gzip

In [27]:
gunzip *gz
ls
CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak

Compressing with bgzip

bgzip (Blocked GNU Zip Format) is commonly used in Bioinformatics, and allows random access of the content. BAM, BCF and VCF file formats are typically bgzip compressed.

Unfortunately bgzip does not take multiple files, and we have to use the xargs command to pipe in one file at a time.

In [28]:
xargs --help | grep '\-n,'
  -n, --max-args=MAX-ARGS      use at most MAX-ARGS arguments per command line
In [29]:
ls *fa | xargs -n1 bgzip
ls
CM000046.1.fa.gz  CM000047.1.fa.bak  CM000047.1.fa.gz

Note that bgzip files can be uncompressed with gzip

In [30]:
gunzip *gz
ls
CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak

Viewing compressed files

Some utilities can work directly with gzipped files. One that we will be using extensively is zcat.

In [31]:
ls *fa | xargs -n1 bgzip
ls
CM000046.1.fa.gz  CM000047.1.fa.bak  CM000047.1.fa.gz
In [32]:
zcat CM000046.1.fa.gz | head
>CM000046.1 Cryptococcus neoformans var. neoformans B-3501A chromosome 7, whole genome shotgun sequence
TGGTCTTATGAGGAAGAGGAGTTTGGATTATTTTTTCTTTTCTTTAAAAAGTTGTTTATTTAAGTAGTTT
CTTTAATTCGGGTAACACACACGACAACCCAATAAATTAAACAACGAAAAAATGCAACCTCTATAACCCC
CTAACCCTTTTATTCGGGTAACACACACGACAACCCAATAAATTAAACAACGAAAAATGCAACCTCTATA
ACCCCCGAAAAGGATTGGTGTCGAGTTAGTAAAGGCGAGGGCTAGGACGCTGGTCTTGTGAGGAAGAGGA
GGTTGGATTATTTTTTCTTTTCTTTAATAAGTTGTTTATTTAAGTAGTTTCTTTTATTCGGGTAACACAC
ACGACAACCCAATAAATTAAACAACGAAAAATGCAACCTCTATAACCCCCATCCAAGGTGCCACGGTTGA
GATTGAGCCCCACACACTGTCGGGACAAAGGAGAAACGACCTTCGGGTCAGAGGTTCCAGCGCTCTGGCC
TTCACTGACTACGACCTGAAGGTTTACTCCCTCGGGGACCGAGACGCGAGGAGCACAGCCACCCCCAGCA
CCCCCAACAGCAAACTGGCCGACTTCTGCTTGGACCGGTGCGTGAACTGGCTCGACAAGGTGGGTCAGGT
In [33]:
gunzip *gz
ls
CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak

Archival with tar

Often we want to compress an entire direcotry (possibly with subdirectories). In Unix, this is most commonly done with tar. Flags to note

  • c compress
  • x extract
  • z gzip or gunzip
  • f files
  • v verbose
In [34]:
tar --help | head -8
Usage: tar [OPTION...] [FILE]...
GNU 'tar' saves many files together into a single tape or disk archive, and can
restore individual files from the archive.

Examples:
  tar -cf archive.tar foo bar  # Create archive.tar from files foo and bar.
  tar -tvf archive.tar         # List all files in archive.tar verbosely.
  tar -xf archive.tar          # Extract all files from archive.tar.
In [35]:
cd
ls seqs
CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak
In [36]:
tar czvf seqs.tar.gz seqs
ls seqs*
seqs/
seqs/CM000046.1.fa
seqs/CM000047.1.fa
seqs/CM000047.1.fa.bak
seqs.tar.gz

seqs:
CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak

View contents of archive

In [37]:
tar tzvf seqs.tar.gz
drwxr-xr-x cliburn/cliburn   0 2018-06-21 10:25 seqs/
-rw-r--r-- cliburn/cliburn 1434513 2018-06-21 10:25 seqs/CM000046.1.fa
-rw-r--r-- cliburn/cliburn 1194378 2018-06-21 10:25 seqs/CM000047.1.fa
-rw-r--r-- cliburn/cliburn 1194398 2018-06-21 10:19 seqs/CM000047.1.fa.bak

Remove direcotry and regenerate from archive

In [38]:
rm -rf seqs
ls seqs*
seqs.tar.gz
In [39]:
tar xzvf seqs.tar.gz
ls seqs*
seqs/
seqs/CM000046.1.fa
seqs/CM000047.1.fa
seqs/CM000047.1.fa.bak
seqs.tar.gz

seqs:
CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak
In [40]:
rm seqs.tar.gz

0.4 Checksums

When working with genomic data, we deal with very large files. There is a small risk that these files will be corrupted over time or during data transfer. To ensure that files are not changed, we use a “checksum” function. This is a function that generates an long, essentially random number called a checksum that represents the contents of the file. When the file contents change, so will the checksum. In theory, there is a very small probability that two different files generate the same checksum, but in practice the probability is too small to worry about.

There are several different algorithms for generating the checksums, and at least 3 Unix commands to do so, but they all work very similarly for our purposes.

The strategy is:

  • Generate and store a checksum together with a data file whose integrity you care about
  • When you use or download the data, re-generate the checksum (using the same algorithm e.g. MD5) and compare with the checksum
In [41]:
ls ~/seqs
CM000046.1.fa  CM000047.1.fa  CM000047.1.fa.bak
In [42]:
md5sum ~/seqs/*fa > MD5_CHECKSUM

We check that the files are unchanged wth the -c flag

In [43]:
md5sum -c MD5_CHECKSUM
/home/cliburn/seqs/CM000046.1.fa: OK
/home/cliburn/seqs/CM000047.1.fa: OK

Suppsoe one of the files is corrupted

In [44]:
echo 'Oops I did it again' >> seqs/CM000047.1.fa
In [45]:
md5sum -c MD5_CHECKSUM  \
| cat # this is jusst so we can Run All
/home/cliburn/seqs/CM000046.1.fa: OK
md5sum: /home/cliburn/seqs/CM000047.1.fa: FAILED
WARNING: 1 computed checksum did NOT match

Using diff to find changes

Let’s rename the altered file, re-download the orginal and see what was changed.

In [46]:
mv seqs/CM000047.1.fa seqs/CM000047.1.fa.bak
In [47]:
efetch -db=nuccore -format=fasta -id=CM000047.1 > seqs/CM000047.1.fa
In [48]:
diff -u seqs/CM000047.1.fa seqs/CM000047.1.fa.bak \
| cat # this is jusst so we can Run All
--- seqs/CM000047.1.fa  2018-06-21 10:25:49.253700667 -0400
+++ seqs/CM000047.1.fa.bak      2018-06-21 10:25:48.425709045 -0400
@@ -16820,3 +16820,4 @@
 CCGTGTCGGCACGCATCTGATTTTTTTCTTATTATTCATAACAACGGCTACAGATCACGCAAGCGCATCG
 ACTCCGCTGCTGCGAAACACCACGAGTACTGCAACCATTGCTGCGGATGAGCTTAGGGGGTTAGGGGGTT
 AGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGG
+Oops I did it again

Explanaiton from [official documentation]

Header

--- seqs/CM000047.1.fa  2018-06-21 09:58:18.246185758 -0400
+++ seqs/CM000047.1.fa.bak  2018-06-21 09:55:55.967707764 -0400

Explanation

The unified output format starts with a two-line header, which looks like this:

--- from-file from-file-modification-time
+++ to-file to-file-modification-time
The timestamp looks like ‘2002-02-21 23:30:39.942229878 -0800’ to indicate the date, time with fractional seconds, and time zone. The fractional seconds are omitted on hosts that do not support fractional timestamps.

You can change the header’s content with the --label=label option. See Alternate Names.

Hunk

@@ -16820,3 +16820,4 @@
 CCGTGTCGGCACGCATCTGATTTTTTTCTTATTATTCATAACAACGGCTACAGATCACGCAAGCGCATCG
 ACTCCGCTGCTGCGAAACACCACGAGTACTGCAACCATTGCTGCGGATGAGCTTAGGGGGTTAGGGGGTT
 AGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGG
+Oops I did it again

Explanation

``` @@ from-file-line-numbers to-file-line-numbers @@ line-from-either-file line-from-either-file…

If a hunk contains just one line, only its start line number appears. Otherwise its line numbers look like ‘start,count’. An empty hunk is considered to start at the line that follows the hunk.

If a hunk and its context contain two or more lines, its line numbers look like ‘start,count’. Otherwise only its end line number appears. An empty hunk is considered to end at the line that precedes the hunk.

The lines common to both files begin with a space character. The lines that actually differ between the two files have one of the following indicator characters in the left print column:

‘+’ A line was added here to the first file.

‘-’ A line was removed here from the first file. ```

Verify

We can confirm that a line was added at the position given by diff

In [49]:
cat seqs/CM000047.1.fa | tail +16820 | head -3
CCGTGTCGGCACGCATCTGATTTTTTTCTTATTATTCATAACAACGGCTACAGATCACGCAAGCGCATCG
ACTCCGCTGCTGCGAAACACCACGAGTACTGCAACCATTGCTGCGGATGAGCTTAGGGGGTTAGGGGGTT
AGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGG
In [50]:
cat seqs/CM000047.1.fa.bak | tail +16820 | head -4
CCGTGTCGGCACGCATCTGATTTTTTTCTTATTATTCATAACAACGGCTACAGATCACGCAAGCGCATCG
ACTCCGCTGCTGCGAAACACCACGAGTACTGCAACCATTGCTGCGGATGAGCTTAGGGGGTTAGGGGGTT
AGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGG
Oops I did it again

1. Working with FASTA files

From NCBI FASTA specification

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:

        >P01013 GENE X PROTEIN (OVALBUMIN-RELATED)
        QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
        KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
        VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
        FLFLIKHNPTNTIVYFGRYWSP

Note that a FASTA file may contain mulitple sequences, each with its own description line.

Set up variables for files we will be using

In [51]:
DATA_DIR='/data/hts/2018/Granek_4837_180427A5'
FASTA='cneo040623.scaffs.fa.gz'
FASTQ='10_MA_C_S3_L003_R1_001.fastq.gz'
GTF='Cryptococcus_neoformans.ASM9104v1.39.gtf.gz'

This is how you get the value of a shell variable

In [52]:
echo $DATA_DIR
/data/hts/2018/Granek_4837_180427A5

List files in data direcotry

In [53]:
ls $DATA_DIR | head -3
10_MA_C_S3_L001_R1_001.fastq
10_MA_C_S3_L002_R1_001.fastq.gz
10_MA_C_S3_L003_R1_001.fastq.gz

Inspect the contents of a zipped file.

zcat is like cat but works on gzipped files.

In [54]:
zcat $DATA_DIR/$FASTA | head
>chr01.b3501.040506 5 contigs (2263073 bp)
CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
GGATTTATACAACGATAAGAGGCTATCCGCTGACTATAAATTGTGTTAGC
TGATCCAGATCCGCAGTTGGCACAAACCGATGTTTCCTTTTCCTCAGGTC
CTGAATATTTCTCTTCCATCAAGGCTCCCAACCCCGAGGGGAGTATTTCC
ACCCGATCTGATTCCAAGAGGTCCAGCGTCAACCAGGTACGTGAATGTTT
TCTTCTTTAGTGGTGGGGACCCTATGCTTGTCCCACAGAACTGCACTGAA

Show headers only

In [55]:
zcat $DATA_DIR/$FASTA   | grep '^>' | head
>chr01.b3501.040506 5 contigs (2263073 bp)
>chr02.b3501.040506 6 contigs (1607892 bp)
>chr03.b3501.040506 6 contigs (2025976 bp)
>chr04.b3501.040623 5 contigs (1762351 bp)
>chr05.b3501.040506 7 contigs (1432964 bp)
>chr06.b3501.040616 4 contigs (1411173 bp)
>chr07.b3501.040506 6 contigs (1356306 bp)
>chr08.b3501.040506 3 contigs (1153453 bp)
>chr09.b3501.040506 5 contigs (1072068 bp)
>chr10.b3501.040506 3 contigs (1017008 bp)

Show headers with line numbers

In [56]:
zcat $DATA_DIR/$FASTA| grep -n '^>' | head
1:>chr01.b3501.040506 5 contigs (2263073 bp)
45264:>chr02.b3501.040506 6 contigs (1607892 bp)
77423:>chr03.b3501.040506 6 contigs (2025976 bp)
117944:>chr04.b3501.040623 5 contigs (1762351 bp)
153193:>chr05.b3501.040506 7 contigs (1432964 bp)
181854:>chr06.b3501.040616 4 contigs (1411173 bp)
210079:>chr07.b3501.040506 6 contigs (1356306 bp)
237207:>chr08.b3501.040506 3 contigs (1153453 bp)
260278:>chr09.b3501.040506 5 contigs (1072068 bp)
281721:>chr10.b3501.040506 3 contigs (1017008 bp)

Count number of headers (one for each crhomosome)

In [57]:
zcat $DATA_DIR/$FASTA  | grep -c '^>'
14

Extract first part of the sequence from chromosome 7

In [58]:
zcat $DATA_DIR/$FASTA | sed -n '210079,210082p'
>chr07.b3501.040506 6 contigs (1356306 bp)
TGGTCTTATGAGGAAGAGGAGTTTGGATTATTTTTTCTTTTCTTTAAAAA
GTTGTTTATTTAAGTAGTTTCTTTAATTCGGGTAACACACACGACAACCC
AATAAATTAAACAACGAAAAAATGCAACCTCTATAACCCCCTAACCCTTT

Extract first line of chromosome 7 and save to file

In [59]:
zcat $DATA_DIR/$FASTA | sed -n '210080p' > dna.txt

Inspect the file we just saved

In [60]:
cat dna.txt
TGGTCTTATGAGGAAGAGGAGTTTGGATTATTTTTTCTTTTCTTTAAAAA

Find complement

In [61]:
cat dna.txt | tr ACGT TGCA
ACCAGAATACTCCTTCTCCTCAAACCTAATAAAAAAGAAAAGAAATTTTT

Find reverse complement

In [62]:
cat dna.txt | tr ACGT TGCA | rev
TTTTTAAAGAAAAGAAAAAATAATCCAAACTCCTCTTCCTCATAAGACCA

Delete A and T

In [63]:
cat dna.txt | tr -d 'AT'
GGCGGGGGGGGGCC

Only count non A and T

In [64]:
cat dna.txt | tr -d 'AT' | tr -d '\n' | wc -c
14
In [65]:
man tr | grep '\-d,' -A2
       -d, --delete
              delete characters in SET1, do not translate

Count C and G

In [66]:
grep '[CG]' -o dna.txt | wc -l
14
In [67]:
man grep | grep '\-o,' -A2
       -o, --only-matching
              Print  only  the  matched  (non-empty) parts of a matching line,
              with each such part on a separate output line.

Specify data directory as a shell variable

2. Working with FASTQ files

From Wikipedia

A FASTQ file normally uses four lines per sequence.

Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

Specify target file as a shell variable

Show the contents of a zipped file

In [68]:
zcat $DATA_DIR/$FASTQ | head -n 10
@NB501800:128:HTYKFBGX5:3:11401:19195:1019 1:N:0:ATTACTCG+AGGATAGG
NGTAAGGGTCTGGGGGACGTTGTTGTGGATGGAAACAACCCTGGCCTCTCGGGGCTGAGCAGGGAGTCGCTTGAGC
+
#AAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEE/<<
@NB501800:128:HTYKFBGX5:3:11401:24339:1020 1:N:0:ATTACTCG+AGGATAGG
NCTCGCCTGGGGTGCTTGCGTGTCCCTGTCCTTGGCCCTGGTCACCGGGGAGATGCACATCATCTTCTTCTGGCTC
+
#AAAAEAAEEEEEEEEEEEEEEEE/EEE/EEEEEEEEEEEEEEEE<A/EEEE6EEE//EEEAEEEEEEE/EAEEEA
@NB501800:128:HTYKFBGX5:3:11401:24140:1020 1:N:0:ATTACTCG+AGGATAGG
NTGATCACGGGAGACAAGCTAGAGATGCAAAATGCCATAGTACGATCATCCCATACAAGATGGGTGAGAGAAAAAG

Extracting just the sequence

We need to extract every 4th line starting from the 2nd. We cna use a tool called sed (stream editor) to do so.

In [69]:
zcat $DATA_DIR/$FASTQ | sed -n '2~4p;' | head
NGTAAGGGTCTGGGGGACGTTGTTGTGGATGGAAACAACCCTGGCCTCTCGGGGCTGAGCAGGGAGTCGCTTGAGC
NCTCGCCTGGGGTGCTTGCGTGTCCCTGTCCTTGGCCCTGGTCACCGGGGAGATGCACATCATCTTCTTCTGGCTC
NTGATCACGGGAGACAAGCTAGAGATGCAAAATGCCATAGTACGATCATCCCATACAAGATGGGTGAGAGAAAAAG
NGGGCCTTGGTCCTGCCCTCAGCCTTGATAGGACCCTTGCCGTCGATGGGGTTACCCAAAGCGTCAACAACTCGGC
NACCTTTCGGAGACCTCGGTGGGTCTTCCTGGGGAGACGGGTAGTACCCCATCGGGCGGTAACACCCTCGTAACCG
NCCGAACTGTTGAAAAGAGCTTCCACGCTCCAAAGGCAGGGACAAGAGTAAAGAGCCACCACAGCCAATCTCCAAA
NGCAAACTGAACGTCCTCAGTCTCCATTCGCTGGAAGTCCGATTCAGGACGGGCAGCGTCATCGATAGAGAATGGA
NGCATTACCAGATGCACTGATAGCAATGATCGTCTCATCTTCCAAGTACTTGCTCACAGCGATACTATCACTCAAA
GGCGATCTTCCTCTTGGGGAAGTGAACGTTCTTCTTGTGCTGCTTCTCGGTGATAGCAGAAGCCTCCTTGGTAGTC
CGGCGATGATGAGAGTGAAGGGATCTTGCCACATGAGAGGGGATGATCGACGGGGACGAGGACGGATGGGGATGAT

Explanation

sed -n '2~4p'

  • -n means to suppress all output
  • 2~4p means print (p) every 4th line starting at line 2

Joining lines

Getting sequence as a single string. We just need to delete newlline (:raw-latex:`\n`) characters.

Note: We can break long commands uisng the line continuation character

In [70]:
zcat $DATA_DIR/$FASTQ | \
sed -n '2~4p;' | \
head | \
tr -d '\n'
NGTAAGGGTCTGGGGGACGTTGTTGTGGATGGAAACAACCCTGGCCTCTCGGGGCTGAGCAGGGAGTCGCTTGAGCNCTCGCCTGGGGTGCTTGCGTGTCCCTGTCCTTGGCCCTGGTCACCGGGGAGATGCACATCATCTTCTTCTGGCTCNTGATCACGGGAGACAAGCTAGAGATGCAAAATGCCATAGTACGATCATCCCATACAAGATGGGTGAGAGAAAAAGNGGGCCTTGGTCCTGCCCTCAGCCTTGATAGGACCCTTGCCGTCGATGGGGTTACCCAAAGCGTCAACAACTCGGCNACCTTTCGGAGACCTCGGTGGGTCTTCCTGGGGAGACGGGTAGTACCCCATCGGGCGGTAACACCCTCGTAACCGNCCGAACTGTTGAAAAGAGCTTCCACGCTCCAAAGGCAGGGACAAGAGTAAAGAGCCACCACAGCCAATCTCCAAANGCAAACTGAACGTCCTCAGTCTCCATTCGCTGGAAGTCCGATTCAGGACGGGCAGCGTCATCGATAGAGAATGGANGCATTACCAGATGCACTGATAGCAATGATCGTCTCATCTTCCAAGTACTTGCTCACAGCGATACTATCACTCAAAGGCGATCTTCCTCTTGGGGAAGTGAACGTTCTTCTTGTGCTGCTTCTCGGTGATAGCAGAAGCCTCCTTGGTAGTCCGGCGATGATGAGAGTGAAGGGATCTTGCCACATGAGAGGGGATGATCGACGGGGACGAGGACGGATGGGGATGAT

3.Working wtih GTF files

From UCSD Genome Browser

The GTF (General Transfer Format) is an extension of GFF version 2
and used to represent transcription models. GFF (General Feature Format)
consists of one line per feature, each containing 9 columns of data.

Fields

Fields are tab-separated. Also, all but the final field in each
feature line must contain a value; "empty" columns are denoted
with a '.'

    seqname   - name of the chromosome or scaffold; chromosome names
                without a 'chr'
    source    - name of the program that generated this feature, or
                the data source (database or project name)
    feature   - feature type name. Current allowed features are
                {gene, transcript, exon, CDS, Selenocysteine, start_codon,
                stop_codon and UTR}
    start     - start position of the feature, with sequence numbering
                starting at 1.
    end       - end position of the feature, with sequence numbering
                starting at 1.
    score     - a floating point value indiciating the score of a feature
    strand    - defined as + (forward) or - (reverse).
    frame     - one of '0', '1' or '2'. Frame indicates the number of base pairs
                before you encounter a full codon. '0' indicates the feature
                begins with a whole codon. '1' indicates there is an extra
                base (the 3rd base of the prior codon) at the start of this feature.
                '2' indicates there are two extra bases (2nd and 3rd base of the
                prior exon) before the first codon. All values are given with
                relation to the 5' end.
    attribute - a semicolon-separated list of tag-value pairs (separated by a space),
                providing additional information about each feature. A key can be
                repeated multiple times.

Attributes

The following attributes are available. All attributes are semi-colon
separated pairs of keys and values.

- gene_id: The stable identifier for the gene
- gene_version: The stable identifier version for the gene
- gene_name: The official symbol of this gene
- gene_source: The annotation source for this gene
- gene_biotype: The biotype of this gene
- transcript_id: The stable identifier for this transcript
- transcript_version: The stable identifier version for this transcript
- transcript_name: The symbold for this transcript derived from the gene name
- transcript_source: The annotation source for this transcript
- transcript_biotype: The biotype for this transcript
- exon_id: The stable identifier for this exon
- exon_version: The stable identifier version for this exon
- exon_number: Position of this exon in the transcript
- ccds_id: CCDS identifier linked to this transcript
- protein_id: Stable identifier for this transcript's protein
- protein_version: Stable identifier version for this transcript's protein
- tag: A collection of additional key value tags
- transcript_support_level: Ranking to assess how well a transcript is supported (from 1 to 5)

Inspect the GTF file

In [71]:
zcat $DATA_DIR/$GTF | head
#!genome-build ASM9104v1
#!genome-version ASM9104v1
#!genome-date 2016-06
#!genome-build-accession GCA_000091045.1
#!genebuild-last-updated 2016-06
1       ena     gene    4839    6236    .       +       .       gene_id "CNA00010"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      4839    6236    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    4839    5178    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-1"; exon_version "1";
1       ena     CDS     4914    5178    .       +       0       gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AAW41410"; protein_version "1";
1       ena     start_codon     4914    4916    .       +       0       gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";

Ignore comment lines version 1

This uses tail to take all lines starting from line 6

In [72]:
zcat $DATA_DIR/$GTF | tail +6 | head -3
1       ena     gene    4839    6236    .       +       .       gene_id "CNA00010"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      4839    6236    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    4839    5178    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-1"; exon_version "1";

Ingore comment lines 2

This uses grep to ignore (-v) lines starting with ^#

In [73]:
zcat $DATA_DIR/$GTF | grep -v '^#' | head -3
1       ena     gene    4839    6236    .       +       .       gene_id "CNA00010"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      4839    6236    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    4839    5178    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-1"; exon_version "1";

Find all unique feature types in GTF

In [74]:
zcat $DATA_DIR/$GTF | grep -v '^#' | cut -f 3 | sort | uniq
CDS
exon
five_prime_utr
gene
start_codon
stop_codon
three_prime_utr
transcript

Get counts of all unique feature types in GTF

In [75]:
zcat $DATA_DIR/$GTF | grep -v '^#' | cut -f 3 | sort | uniq -c
  44079 CDS
  44344 exon
   6738 five_prime_utr
   6886 gene
   6894 start_codon
   2369 stop_codon
   6718 three_prime_utr
   7117 transcript

Find the the attributes of the first 3 exons

In [76]:
zcat $DATA_DIR/$GTF | grep -v '^#' | awk '$3 == "exon"' | head -3
1       ena     exon    4839    5178    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-1"; exon_version "1";
1       ena     exon    5241    5377    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "2"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-2"; exon_version "1";
1       ena     exon    5453    5560    .       +       .       gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "3"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-3"; exon_version "1";

Store attributes of all genes in a file

It is common to use Unix commands to do simple processing and save the results to a file, then use a scripting language such as R or Python to furhter process the file.

In [77]:
zcat $DATA_DIR/$GTF | grep -v '^#' | awk '$3 == "gene"' | cut -f 9 > genes.txt
In [78]:
head -6 genes.txt
gene_id "CNA00010"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00020"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00030"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00040"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00050"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00060"; gene_source "ena"; gene_biotype "protein_coding";

4. Looping

Loop throgh first 10 lines

  • Creae a counter variable LINE_NUM and initialize it to 0
  • Redirect genes.txt to a while loop
  • Whlie loop reads each line into the variable LINE
  • The [[ CONDITION ]] tests if CONDITION is true
  • If LINE_NUM is less than -lt 10 print LINE and increment LINE_NUM
  • Note that (( EXPR )) is used to evaluate numeric expressions
In [79]:
LINE_NUM=0
while read LINE
do
    if [[ $LINE_NUM -lt 10 ]]; then
        echo $LINE
        (( LINE_NUM++ ))
    fi
done < genes.txt
gene_id "CNA00010"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00020"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00030"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00040"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00050"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00060"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00070"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00075"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00080"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00090"; gene_source "ena"; gene_biotype "protein_coding";

Loop through first 10 tRNAs

Almost the same as previous program but we add a new condition to check tha the gene_biotype is tRNA using a REGULAr EXPRESSION via the match operator =~

.*gene_biotype[[:space:]]\"tRNA\".* means

  • .*: Match any number of any character
  • gene_biotype: Match the string gene_biotype
  • [[:space:]]: Match a single space cahracter
  • \"tRNA\": Math the string “tRNA” - we escape the double quotes to match linteral double quote characters
  • .*: Match any number of any character
In [80]:
LINE_NUM=0
while read LINE
do
    if [[ $LINE_NUM -lt 10 ]] && [[ $LINE =~ .*gene_biotype[[:space:]]\"tRNA\".* ]] ; then
        echo $LINE
        ((LINE_NUM++))
    fi
done < genes.txt
gene_id "CNA02110"; gene_name "CNA02110"; gene_source "ena"; gene_biotype "tRNA";
gene_id "EBG00005237496"; gene_name "tRNA"; gene_source "Rfam"; gene_biotype "tRNA";
gene_id "CNA02190"; gene_name "CNA02190"; gene_source "ena"; gene_biotype "tRNA";
gene_id "EBG00005237490"; gene_name "tRNA"; gene_source "Rfam"; gene_biotype "tRNA";
gene_id "CNA02200"; gene_name "CNA02200"; gene_source "ena"; gene_biotype "tRNA";
gene_id "CNA02280"; gene_name "CNA02280"; gene_source "ena"; gene_biotype "tRNA";
gene_id "EBG00005237502"; gene_name "tRNA"; gene_source "Rfam"; gene_biotype "tRNA";
gene_id "CNA02320"; gene_name "CNA02320"; gene_source "ena"; gene_biotype "tRNA";
gene_id "EBG00005237484"; gene_name "tRNA"; gene_source "Rfam"; gene_biotype "tRNA";
gene_id "CNA03230"; gene_name "CNA03230"; gene_source "ena"; gene_biotype "tRNA";

Loop through files in a directory and print the filename followed by the first line

  • Use a COMMAND EXPRESSION $(EXPR) to list files matching the glob pattern
  • We loop through each file using a for statement
  • For each file, extract the first line with head -1 and save in variable FIRST
  • The 2>/dev/null ignores output from the stderr stream
  • We use another COMMAND EXPRESSION to print the filename withou the directory (using basename)
  • We then print the first line storred in FIRST
In [81]:
for FILE in $(ls $DATA_DIR/*_MA_J_S20_*gz)
do
    FIRST=$(zcat $FILE 2>/dev/null | head -1)
    echo $(basename $FILE)
    echo $FIRST
    echo
done
11_MA_J_S20_L001_R1_001.fastq.gz
@NB501800:128:HTYKFBGX5:1:11101:26760:1056 1:N:0:CGCTCATT+NGGATAGG

11_MA_J_S20_L002_R1_001.fastq.gz
@NB501800:128:HTYKFBGX5:2:11101:20327:1041 1:N:0:CGCTCATT+AGGATAGN

11_MA_J_S20_L003_R1_001.fastq.gz
@NB501800:128:HTYKFBGX5:3:11401:10896:1024 1:N:0:CGCTCATT+AGGATAGG

11_MA_J_S20_L004_R1_001.fastq.gz
@NB501800:128:HTYKFBGX5:4:11401:8259:1032 1:N:0:CGCTCATT+AGGATAGG