Note: Make a copy of this notebook to work in.
Bash Scratchpad¶
Hack for SIGPIPE error in Jupyter notebook¶
In [1]:
cleanup () {
:
}
trap "cleanup" SIGPIPE
Safety first¶
In [2]:
set -u
Challenge¶
You will download and extract some information about Cryptococcus neoformans by investigating its GTF file located at
File and directory management¶
In [3]:
pwd
/local_data/notebooks/cliburn/HTS2018-notebooks/cliburn
In [4]:
ls
Bash_Exercise_1.ipynb 'R_Graphic_ ggplot2.ipynb'
Bash_Exercise_1_Solutions.ipynb R_Graphics_Base.ipynb
Bash_Exercise_2.ipynb R_Graphics_Exercise.ipynb
Bash_Exercise_2_Solutoins.ipynb R_Graphics_Exercise_Solutions.ipynb
Bash_in_Jupyter.ipynb R_Graphics_Overview.ipynb
'Bash Tutorial.ipynb' R_tidyverse_1.ipynb
data R_tidyverse_2.ipynb
figs R_tidyverse_3.ipynb
Process-RNA-seq-counts.ipynb R_tidyyverse_Exercise.ipynb
ref R_tidyyverse_Exercise_Solutions.ipynb
In [5]:
rm -rf ref
In [6]:
mkdir ref
In [7]:
cd ref
Using variables¶
In [10]:
FOO=foo
In [11]:
echo $FOObar
bash: FOObar: unbound variable
In [12]:
echo ${FOO}bar
foobar
Downlaoding files¶
In [13]:
URL='ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz'
In [14]:
echo $URL
ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
In [15]:
wget $URL
--2018-07-09 08:18:10-- ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
=> ‘Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz’
Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.197.94
Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.197.94|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99 ... done.
==> SIZE Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz ... 1796344
==> PASV ... done. ==> RETR Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz ... done.
Length: 1796344 (1.7M) (unauthoritative)
Cryptococcus_neofor 100%[===================>] 1.71M 1.06MB/s in 1.6s
2018-07-09 08:18:13 (1.06 MB/s) - ‘Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz’ saved [1796344]
Working with compressed files¶
In [16]:
pwd
/local_data/notebooks/cliburn/HTS2018-notebooks/cliburn/ref
In [17]:
ls
Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
In [18]:
gunzip Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
In [19]:
ls
Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf
Inspecting the GTF file¶
A GTF file has some header lines, followed by tabular data in 9 columns:
- chromosome name > chr{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y,M}
- annotation source > {ENSEMBL,HAVANA}
- feature-type > {gene,transcript,exon,CDS,UTR,start_codon,stop_codon,Selenocysteine}
- genomic start location > integer-value (1-based)
- genomic end location > integer-value
- score (not used) > .
- genomic strand > {+,-}
- genomic phase (for CDS features) > {0,1,2,.}
- additional information as key-value pairs > (format: key “value”;)
See GTF3 in ths Sequence Ontology.
In [20]:
GTF=$(ls)
In [21]:
echo $GTF
Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf
In [22]:
head $GTF
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1 ena gene 100 5645 . - . gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1 ena transcript 100 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1 ena exon 5494 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
1 ena CDS 5494 5645 . - 0 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1 ena start_codon 5643 5645 . - 0 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
In [23]:
tail $GTF
Mt ena stop_codon 23840 23842 . + 0 gene_id "CNAG_09011"; transcript_id "AFR99113"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Mt Ensembl_Fungi gene 23909 23980 . + . gene_id "ENSRNA049545749"; gene_name "tRNA-Val"; gene_source "Ensembl_Fungi"; gene_biotype "tRNA";
Mt Ensembl_Fungi transcript 23909 23980 . + . gene_id "ENSRNA049545749"; transcript_id "ENSRNA049545749-T1"; gene_name "tRNA-Val"; gene_source "Ensembl_Fungi"; gene_biotype "tRNA"; transcript_source "Ensembl_Fungi"; transcript_biotype "tRNA";
Mt Ensembl_Fungi exon 23909 23980 . + . gene_id "ENSRNA049545749"; transcript_id "ENSRNA049545749-T1"; exon_number "1"; gene_name "tRNA-Val"; gene_source "Ensembl_Fungi"; gene_biotype "tRNA"; transcript_source "Ensembl_Fungi"; transcript_biotype "tRNA"; exon_id "ENSRNA049545749-E1";
Mt ena gene 24096 24851 . + . gene_id "CNAG_09012"; gene_source "ena"; gene_biotype "protein_coding";
Mt ena transcript 24096 24851 . + . gene_id "CNAG_09012"; transcript_id "AFR99114"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Mt ena exon 24096 24851 . + . gene_id "CNAG_09012"; transcript_id "AFR99114"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR99114-1";
Mt ena CDS 24096 24848 . + 0 gene_id "CNAG_09012"; transcript_id "AFR99114"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR99114"; protein_version "1";
Mt ena start_codon 24096 24098 . + 0 gene_id "CNAG_09012"; transcript_id "AFR99114"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Mt ena stop_codon 24849 24851 . + 0 gene_id "CNAG_09012"; transcript_id "AFR99114"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Combining operations with cat and pipe¶
In [24]:
cat $GTF | head -8
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1 ena gene 100 5645 . - . gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1 ena transcript 100 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1 ena exon 5494 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
In [25]:
cat $GTF | head -8 | tail -3
1 ena gene 100 5645 . - . gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1 ena transcript 100 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1 ena exon 5494 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
Filtering comment lines¶
Using tail
¶
In [26]:
cat $GTF | tail +6 | head -3
1 ena gene 100 5645 . - . gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1 ena transcript 100 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1 ena exon 5494 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
Using cat
and regualr expressions¶
In [27]:
cat << EOF > vietnam.txt
In 1965 Vietnam seemed like just another foreign war but it wasn't
It was different in many ways, as so were those that did the fighting
In World War II the average age of the combat soldier was 26
In Vietnam he was 19
EOF
In [28]:
cat vietnam.txt
In 1965 Vietnam seemed like just another foreign war but it wasn't
It was different in many ways, as so were those that did the fighting
In World War II the average age of the combat soldier was 26
In Vietnam he was 19
In [29]:
cat vietnam.txt | egrep 'combat'
In World War II the average age of the combat soldier was 26
In [30]:
cat vietnam.txt | egrep 'wa.s'
It was different in many ways, as so were those that did the fighting
In [31]:
cat vietnam.txt | egrep 'c.*'
In World War II the average age of the combat soldier was 26
In [32]:
cat vietnam.txt | egrep 'it'
In 1965 Vietnam seemed like just another foreign war but it wasn't
In [33]:
cat vietnam.txt | egrep -i 'it'
In 1965 Vietnam seemed like just another foreign war but it wasn't
It was different in many ways, as so were those that did the fighting
In [34]:
cat vietnam.txt | egrep -i '^it'
It was different in many ways, as so were those that did the fighting
In [35]:
cat vietnam.txt | egrep -i '[0-9]+'
In 1965 Vietnam seemed like just another foreign war but it wasn't
In World War II the average age of the combat soldier was 26
In Vietnam he was 19
In [36]:
cat vietnam.txt | egrep -i -v '[0-9]+'
It was different in many ways, as so were those that did the fighting
In [37]:
cat vietnam.txt | egrep -i -o '[0-9]+'
1965
26
19
Usng regular eexpressions with grep
¶
In [38]:
cat $GTF |
grep -v '^#' |
head -3
1 ena gene 100 5645 . - . gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1 ena transcript 100 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1 ena exon 5494 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
Creating new files with redirection operators¶
In [39]:
cat $GTF | grep '^#' > header.txt
In [40]:
cat $GTF | grep -v '^#' > info.txt
In [41]:
head -3 info.txt
1 ena gene 100 5645 . - . gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1 ena transcript 100 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1 ena exon 5494 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
In [42]:
cat info.txt >> header.txt
In [43]:
head header.txt
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1 ena gene 100 5645 . - . gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1 ena transcript 100 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1 ena exon 5494 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
1 ena CDS 5494 5645 . - 0 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1 ena start_codon 5643 5645 . - 0 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Cutting columns from tabular data¶
In [44]:
cat info.txt | cut -f3 | head -10
gene
transcript
exon
CDS
start_codon
exon
CDS
exon
CDS
exon
In [45]:
cat info.txt | cut -f1,3-5 | head -3
1 gene 100 5645
1 transcript 100 5645
1 exon 5494 5645
In [46]:
cat info.txt | cut -f1,3-5 | tr '\t' ',' > info.csv
In [47]:
head info.csv
1,gene,100,5645
1,transcript,100,5645
1,exon,5494,5645
1,CDS,5494,5645
1,start_codon,5643,5645
1,exon,5322,5422
1,CDS,5322,5422
1,exon,3958,5263
1,CDS,3958,5263
1,exon,3206,3890
In [48]:
cat info.csv | cut -f2 -d',' | head -3
gene
transcript
exon
Sorting and counting¶
In [49]:
cat info.txt | cut -f3 | head -10 | sort
CDS
CDS
CDS
exon
exon
exon
exon
gene
start_codon
transcript
In [50]:
cat info.txt | cut -f3 | head -10 | sort -r
transcript
start_codon
gene
exon
exon
exon
exon
CDS
CDS
CDS
In [51]:
cat info.txt | cut -f3 | sort | uniq
CDS
exon
five_prime_utr
gene
start_codon
stop_codon
three_prime_utr
transcript
In [52]:
cat info.txt | cut -f3 | sort | uniq -c
49063 CDS
52036 exon
6923 five_prime_utr
8497 gene
7860 start_codon
3167 stop_codon
7034 three_prime_utr
9348 transcript
Features on a chromosome - using awk
¶
In [53]:
cat info.txt | awk -F '\t' '$1=="Mt"' | head -3
Mt ena gene 20 2196 . + . gene_id "CNAG_09000"; gene_source "ena"; gene_biotype "protein_coding";
Mt ena transcript 20 2196 . + . gene_id "CNAG_09000"; transcript_id "AFR99102"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Mt ena exon 20 233 . + . gene_id "CNAG_09000"; transcript_id "AFR99102"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR99102-1";
In [54]:
cat info.txt | awk -F '\t' '$1=="2" {print $3}' | sort | uniq -c
4241 CDS
4434 exon
597 five_prime_utr
706 gene
681 start_codon
278 stop_codon
606 three_prime_utr
780 transcript