Note: Make a copy of this notebook to work in.

Bash Scratchpad¶

Hack for SIGPIPE error in Jupyter notebook¶

In [1]:

cleanup () {
    :
}

trap "cleanup" SIGPIPE

Safety first¶

In [2]:

set -u

Challenge¶

You will download and extract some information about Cryptococcus neoformans by investigating its GTF file located at

ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz

File and directory management¶

In [3]:

pwd

/local_data/notebooks/cliburn/HTS2018-notebooks/cliburn

In [4]:

ls

 Bash_Exercise_1.ipynb            'R_Graphic_ ggplot2.ipynb'
 Bash_Exercise_1_Solutions.ipynb   R_Graphics_Base.ipynb
 Bash_Exercise_2.ipynb             R_Graphics_Exercise.ipynb
 Bash_Exercise_2_Solutoins.ipynb   R_Graphics_Exercise_Solutions.ipynb
 Bash_in_Jupyter.ipynb             R_Graphics_Overview.ipynb
'Bash Tutorial.ipynb'              R_tidyverse_1.ipynb
 data                              R_tidyverse_2.ipynb
 figs                              R_tidyverse_3.ipynb
 Process-RNA-seq-counts.ipynb      R_tidyyverse_Exercise.ipynb
 ref                               R_tidyyverse_Exercise_Solutions.ipynb

In [5]:

rm -rf ref

In [6]:

mkdir ref

In [7]:

cd ref

Using variables¶

In [10]:

FOO=foo

In [11]:

echo $FOObar

bash: FOObar: unbound variable

In [12]:

echo ${FOO}bar

foobar

Downlaoding files¶

In [13]:

URL='ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz'

In [14]:

echo $URL

ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz

In [15]:

wget $URL

--2018-07-09 08:18:10--  ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
           => ‘Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.197.94
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.197.94|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99 ... done.
==> SIZE Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz ... 1796344
==> PASV ... done.    ==> RETR Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz ... done.
Length: 1796344 (1.7M) (unauthoritative)

Cryptococcus_neofor 100%[===================>]   1.71M  1.06MB/s    in 1.6s

2018-07-09 08:18:13 (1.06 MB/s) - ‘Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz’ saved [1796344]

Working with compressed files¶

In [16]:

pwd

/local_data/notebooks/cliburn/HTS2018-notebooks/cliburn/ref

In [17]:

ls

Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz

In [18]:

gunzip Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz

In [19]:

ls

Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf

Inspecting the GTF file¶

A GTF file has some header lines, followed by tabular data in 9 columns:

chromosome name > chr{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,M}
annotation source > {ENSEMBL,HAVANA}
feature-type > {gene,transcript,exon,CDS,UTR,start_codon,stop_codon,Selenocysteine}
genomic start location > integer-value (1-based)
genomic end location > integer-value
score (not used) > .
genomic strand > {+,-}
genomic phase (for CDS features) > {0,1,2,.}
additional information as key-value pairs > (format: key “value”;)

See GTF3 in ths Sequence Ontology.

In [20]:

GTF=$(ls)

In [21]:

echo $GTF

Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf

In [22]:

head $GTF

#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
1       ena     CDS     5494    5645    .       -       0       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1       ena     start_codon     5643    5645    .       -       0       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";

In [23]:

tail $GTF

Mt      ena     stop_codon      23840   23842   .       +       0       gene_id "CNAG_09011"; transcript_id "AFR99113"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Mt      Ensembl_Fungi   gene    23909   23980   .       +       .       gene_id "ENSRNA049545749"; gene_name "tRNA-Val"; gene_source "Ensembl_Fungi"; gene_biotype "tRNA";
Mt      Ensembl_Fungi   transcript      23909   23980   .       +       .       gene_id "ENSRNA049545749"; transcript_id "ENSRNA049545749-T1"; gene_name "tRNA-Val"; gene_source "Ensembl_Fungi"; gene_biotype "tRNA"; transcript_source "Ensembl_Fungi"; transcript_biotype "tRNA";
Mt      Ensembl_Fungi   exon    23909   23980   .       +       .       gene_id "ENSRNA049545749"; transcript_id "ENSRNA049545749-T1"; exon_number "1"; gene_name "tRNA-Val"; gene_source "Ensembl_Fungi"; gene_biotype "tRNA"; transcript_source "Ensembl_Fungi"; transcript_biotype "tRNA"; exon_id "ENSRNA049545749-E1";
Mt      ena     gene    24096   24851   .       +       .       gene_id "CNAG_09012"; gene_source "ena"; gene_biotype "protein_coding";
Mt      ena     transcript      24096   24851   .       +       .       gene_id "CNAG_09012"; transcript_id "AFR99114"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Mt      ena     exon    24096   24851   .       +       .       gene_id "CNAG_09012"; transcript_id "AFR99114"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR99114-1";
Mt      ena     CDS     24096   24848   .       +       0       gene_id "CNAG_09012"; transcript_id "AFR99114"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR99114"; protein_version "1";
Mt      ena     start_codon     24096   24098   .       +       0       gene_id "CNAG_09012"; transcript_id "AFR99114"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Mt      ena     stop_codon      24849   24851   .       +       0       gene_id "CNAG_09012"; transcript_id "AFR99114"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";

Combining operations with cat and pipe¶

In [24]:

cat $GTF | head -8

#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";

In [25]:

cat $GTF | head -8 | tail -3

1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";

Filtering comment lines¶

Using `tail`¶

In [26]:

cat $GTF | tail +6 | head -3

1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";

Using `cat` and regualr expressions¶

In [27]:

cat << EOF > vietnam.txt
In 1965 Vietnam seemed like just another foreign war but it wasn't
It was different in many ways, as so were those that did the fighting
In World War II the average age of the combat soldier was 26
In Vietnam he was 19
EOF

In [28]:

cat vietnam.txt

In 1965 Vietnam seemed like just another foreign war but it wasn't
It was different in many ways, as so were those that did the fighting
In World War II the average age of the combat soldier was 26
In Vietnam he was 19

In [29]:

cat vietnam.txt | egrep 'combat'

In World War II the average age of the combat soldier was 26

In [30]:

cat vietnam.txt | egrep 'wa.s'

It was different in many ways, as so were those that did the fighting

In [31]:

cat vietnam.txt | egrep 'c.*'

In World War II the average age of the combat soldier was 26

In [32]:

cat vietnam.txt | egrep 'it'

In 1965 Vietnam seemed like just another foreign war but it wasn't

In [33]:

cat vietnam.txt | egrep  -i 'it'

In 1965 Vietnam seemed like just another foreign war but it wasn't
It was different in many ways, as so were those that did the fighting

In [34]:

cat vietnam.txt | egrep  -i '^it'

It was different in many ways, as so were those that did the fighting

In [35]:

cat vietnam.txt | egrep  -i '[0-9]+'

In 1965 Vietnam seemed like just another foreign war but it wasn't
In World War II the average age of the combat soldier was 26
In Vietnam he was 19

In [36]:

cat vietnam.txt | egrep  -i -v '[0-9]+'

It was different in many ways, as so were those that did the fighting

In [37]:

cat vietnam.txt | egrep  -i -o '[0-9]+'

1965
26
19

Usng regular eexpressions with `grep`¶

In [38]:

cat $GTF |
grep -v '^#' |
head -3

1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";

Creating new files with redirection operators¶

In [39]:

cat $GTF | grep '^#' > header.txt

In [40]:

cat  $GTF | grep -v '^#' > info.txt

In [41]:

head -3 info.txt

1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";

In [42]:

cat info.txt >> header.txt

In [43]:

head header.txt

#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1       ena     gene    100     5645    .       -       .       gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1       ena     transcript      100     5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1       ena     exon    5494    5645    .       -       .       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
1       ena     CDS     5494    5645    .       -       0       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1       ena     start_codon     5643    5645    .       -       0       gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";

Cutting columns from tabular data¶

In [44]:

cat info.txt | cut -f3 | head -10

gene
transcript
exon
CDS
start_codon
exon
CDS
exon
CDS
exon

In [45]:

cat info.txt | cut -f1,3-5 | head -3

1       gene    100     5645
1       transcript      100     5645
1       exon    5494    5645

In [46]:

cat info.txt | cut -f1,3-5 | tr '\t' ',' > info.csv

In [47]:

head info.csv

1,gene,100,5645
1,transcript,100,5645
1,exon,5494,5645
1,CDS,5494,5645
1,start_codon,5643,5645
1,exon,5322,5422
1,CDS,5322,5422
1,exon,3958,5263
1,CDS,3958,5263
1,exon,3206,3890

In [48]:

cat info.csv | cut -f2 -d',' | head -3

gene
transcript
exon

Sorting and counting¶

In [49]:

cat info.txt | cut -f3 | head -10 | sort

CDS
CDS
CDS
exon
exon
exon
exon
gene
start_codon
transcript

In [50]:

cat info.txt | cut -f3 | head -10 | sort -r

transcript
start_codon
gene
exon
exon
exon
exon
CDS
CDS
CDS

In [51]:

cat info.txt | cut -f3 | sort | uniq

CDS
exon
five_prime_utr
gene
start_codon
stop_codon
three_prime_utr
transcript

In [52]:

cat info.txt | cut -f3 | sort | uniq -c

  49063 CDS
  52036 exon
   6923 five_prime_utr
   8497 gene
   7860 start_codon
   3167 stop_codon
   7034 three_prime_utr
   9348 transcript

Features on a chromosome - using `awk`¶

In [53]:

cat info.txt | awk -F '\t' '$1=="Mt"' | head -3

Mt      ena     gene    20      2196    .       +       .       gene_id "CNAG_09000"; gene_source "ena"; gene_biotype "protein_coding";
Mt      ena     transcript      20      2196    .       +       .       gene_id "CNAG_09000"; transcript_id "AFR99102"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Mt      ena     exon    20      233     .       +       .       gene_id "CNAG_09000"; transcript_id "AFR99102"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR99102-1";

In [54]:

cat info.txt | awk -F '\t' '$1=="2" {print $3}' | sort | uniq -c

   4241 CDS
   4434 exon
    597 five_prime_utr
    706 gene
    681 start_codon
    278 stop_codon
    606 three_prime_utr
    780 transcript