Note: Make a copy of this notebook to work in.

Bash Scratchpad

Hack for SIGPIPE error in Jupyter notebook

In [1]:
cleanup () {
    :
}

trap "cleanup" SIGPIPE

Safety first

In [2]:
set -u

Challenge

You will download and extract some information about Cryptococcus neoformans by investigating its GTF file located at

ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz

File and directory management

In [3]:
pwd
/local_data/notebooks/cliburn/HTS2018-notebooks/cliburn
In [4]:
ls
 Bash_Exercise_1.ipynb            'R_Graphic_ ggplot2.ipynb'
 Bash_Exercise_1_Solutions.ipynb   R_Graphics_Base.ipynb
 Bash_Exercise_2.ipynb             R_Graphics_Exercise.ipynb
 Bash_Exercise_2_Solutoins.ipynb   R_Graphics_Exercise_Solutions.ipynb
 Bash_in_Jupyter.ipynb             R_Graphics_Overview.ipynb
'Bash Tutorial.ipynb'              R_tidyverse_1.ipynb
 data                              R_tidyverse_2.ipynb
 figs                              R_tidyverse_3.ipynb
 Process-RNA-seq-counts.ipynb      R_tidyyverse_Exercise.ipynb
 ref                               R_tidyyverse_Exercise_Solutions.ipynb
In [5]:
rm -rf ref
In [6]:
mkdir ref
In [7]:
cd ref

Using variables

In [10]:
FOO=foo
In [11]:
echo $FOObar
bash: FOObar: unbound variable

In [12]:
echo ${FOO}bar
foobar

Downlaoding files

In [13]:
URL='ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz'
In [14]:
echo $URL
ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
In [15]:
wget $URL
--2018-07-09 08:18:10--  ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
           => ‘Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.197.94
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.197.94|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99 ... done.
==> SIZE Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz ... 1796344
==> PASV ... done.    ==> RETR Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz ... done.
Length: 1796344 (1.7M) (unauthoritative)

Cryptococcus_neofor 100%[===================>]   1.71M  1.06MB/s    in 1.6s

2018-07-09 08:18:13 (1.06 MB/s) - ‘Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz’ saved [1796344]

Working with compressed files

In [16]:
pwd
/local_data/notebooks/cliburn/HTS2018-notebooks/cliburn/ref
In [17]:
ls
Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
In [18]:
gunzip Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
In [19]:
ls
Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf

Inspecting the GTF file

A GTF file has some header lines, followed by tabular data in 9 columns:

  1. chromosome name > chr{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,M}
  2. annotation source > {ENSEMBL,HAVANA}
  3. feature-type > {gene,transcript,exon,CDS,UTR,start_codon,stop_codon,Selenocysteine}
  4. genomic start location > integer-value (1-based)
  5. genomic end location > integer-value
  6. score (not used) > .
  7. genomic strand > {+,-}
  8. genomic phase (for CDS features) > {0,1,2,.}
  9. additional information as key-value pairs > (format: key “value”;)

See GTF3 in ths Sequence Ontology.

In [20]:
GTF=$(ls)
In [21]:
echo $GTF
Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf
In [22]:
head $GTF
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
1       ena     CDS     5494    5645    .       -       0       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1       ena     start_codon     5643    5645    .       -       0       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
In [23]:
tail $GTF
Mt      ena     stop_codon      23840   23842   .       +       0       gene_id "CNAG_09011"; transcript_id "AFR99113"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Mt      Ensembl_Fungi   gene    23909   23980   .       +       .       gene_id "ENSRNA049545749"; gene_name "tRNA-Val"; gene_source "Ensembl_Fungi"; gene_biotype "tRNA";
Mt      Ensembl_Fungi   transcript      23909   23980   .       +       .       gene_id "ENSRNA049545749"; transcript_id "ENSRNA049545749-T1"; gene_name "tRNA-Val"; gene_source "Ensembl_Fungi"; gene_biotype "tRNA"; transcript_source "Ensembl_Fungi"; transcript_biotype "tRNA";
Mt      Ensembl_Fungi   exon    23909   23980   .       +       .       gene_id "ENSRNA049545749"; transcript_id "ENSRNA049545749-T1"; exon_number "1"; gene_name "tRNA-Val"; gene_source "Ensembl_Fungi"; gene_biotype "tRNA"; transcript_source "Ensembl_Fungi"; transcript_biotype "tRNA"; exon_id "ENSRNA049545749-E1";
Mt      ena     gene    24096   24851   .       +       .       gene_id "CNAG_09012"; gene_source "ena"; gene_biotype "protein_coding";
Mt      ena     transcript      24096   24851   .       +       .       gene_id "CNAG_09012"; transcript_id "AFR99114"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Mt      ena     exon    24096   24851   .       +       .       gene_id "CNAG_09012"; transcript_id "AFR99114"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR99114-1";
Mt      ena     CDS     24096   24848   .       +       0       gene_id "CNAG_09012"; transcript_id "AFR99114"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR99114"; protein_version "1";
Mt      ena     start_codon     24096   24098   .       +       0       gene_id "CNAG_09012"; transcript_id "AFR99114"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Mt      ena     stop_codon      24849   24851   .       +       0       gene_id "CNAG_09012"; transcript_id "AFR99114"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";

Combining operations with cat and pipe

In [24]:
cat $GTF | head -8
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
In [25]:
cat $GTF | head -8 | tail -3
1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";

Filtering comment lines

Using tail

In [26]:
cat $GTF | tail +6 | head -3
1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";

Using cat and regualr expressions

In [27]:
cat << EOF > vietnam.txt
In 1965 Vietnam seemed like just another foreign war but it wasn't
It was different in many ways, as so were those that did the fighting
In World War II the average age of the combat soldier was 26
In Vietnam he was 19
EOF
In [28]:
cat vietnam.txt
In 1965 Vietnam seemed like just another foreign war but it wasn't
It was different in many ways, as so were those that did the fighting
In World War II the average age of the combat soldier was 26
In Vietnam he was 19
In [29]:
cat vietnam.txt | egrep 'combat'
In World War II the average age of the combat soldier was 26
In [30]:
cat vietnam.txt | egrep 'wa.s'
It was different in many ways, as so were those that did the fighting
In [31]:
cat vietnam.txt | egrep 'c.*'
In World War II the average age of the combat soldier was 26
In [32]:
cat vietnam.txt | egrep 'it'
In 1965 Vietnam seemed like just another foreign war but it wasn't
In [33]:
cat vietnam.txt | egrep  -i 'it'
In 1965 Vietnam seemed like just another foreign war but it wasn't
It was different in many ways, as so were those that did the fighting
In [34]:
cat vietnam.txt | egrep  -i '^it'
It was different in many ways, as so were those that did the fighting
In [35]:
cat vietnam.txt | egrep  -i '[0-9]+'
In 1965 Vietnam seemed like just another foreign war but it wasn't
In World War II the average age of the combat soldier was 26
In Vietnam he was 19
In [36]:
cat vietnam.txt | egrep  -i -v '[0-9]+'
It was different in many ways, as so were those that did the fighting
In [37]:
cat vietnam.txt | egrep  -i -o '[0-9]+'
1965
26
19

Usng regular eexpressions with grep

In [38]:
cat $GTF |
grep -v '^#' |
head -3
1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";

Creating new files with redirection operators

In [39]:
cat $GTF | grep '^#' > header.txt
In [40]:
cat  $GTF | grep -v '^#' > info.txt
In [41]:
head -3 info.txt
1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
In [42]:
cat info.txt >> header.txt
In [43]:
head header.txt
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
1       ena     CDS     5494    5645    .       -       0       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1       ena     start_codon     5643    5645    .       -       0       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";

Cutting columns from tabular data

In [44]:
cat info.txt | cut -f3 | head -10
gene
transcript
exon
CDS
start_codon
exon
CDS
exon
CDS
exon
In [45]:
cat info.txt | cut -f1,3-5 | head -3
1       gene    100     5645
1       transcript      100     5645
1       exon    5494    5645
In [46]:
cat info.txt | cut -f1,3-5 | tr '\t' ',' > info.csv
In [47]:
head info.csv
1,gene,100,5645
1,transcript,100,5645
1,exon,5494,5645
1,CDS,5494,5645
1,start_codon,5643,5645
1,exon,5322,5422
1,CDS,5322,5422
1,exon,3958,5263
1,CDS,3958,5263
1,exon,3206,3890
In [48]:
cat info.csv | cut -f2 -d',' | head -3
gene
transcript
exon

Sorting and counting

In [49]:
cat info.txt | cut -f3 | head -10 | sort
CDS
CDS
CDS
exon
exon
exon
exon
gene
start_codon
transcript
In [50]:
cat info.txt | cut -f3 | head -10 | sort -r
transcript
start_codon
gene
exon
exon
exon
exon
CDS
CDS
CDS
In [51]:
cat info.txt | cut -f3 | sort | uniq
CDS
exon
five_prime_utr
gene
start_codon
stop_codon
three_prime_utr
transcript
In [52]:
cat info.txt | cut -f3 | sort | uniq -c
  49063 CDS
  52036 exon
   6923 five_prime_utr
   8497 gene
   7860 start_codon
   3167 stop_codon
   7034 three_prime_utr
   9348 transcript

Features on a chromosome - using awk

In [53]:
cat info.txt | awk -F '\t' '$1=="Mt"' | head -3
Mt      ena     gene    20      2196    .       +       .       gene_id "CNAG_09000"; gene_source "ena"; gene_biotype "protein_coding";
Mt      ena     transcript      20      2196    .       +       .       gene_id "CNAG_09000"; transcript_id "AFR99102"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Mt      ena     exon    20      233     .       +       .       gene_id "CNAG_09000"; transcript_id "AFR99102"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR99102-1";
In [54]:
cat info.txt | awk -F '\t' '$1=="2" {print $3}' | sort | uniq -c
   4241 CDS
   4434 exon
    597 five_prime_utr
    706 gene
    681 start_codon
    278 stop_codon
    606 three_prime_utr
    780 transcript