Using bash
in bioinformatics¶
The Unix commands shown in this notebook are quite intimidating at first. There’s a lot of material becaue I try to cover all the stuff you willl need to construct a typicla bioinformatics pipeline.
Try to understand the expressions and play aound to see what varying them will do. You do not need to memorize the commands now - you can always come back to this notebook for reference. You will also get more practice when you do the Bioiinformatics sessions to construct Unix pipelines for processing your RNA-seq data sets.
Hack to handle broekn pipes - IGNORE.
In [1]:
cleanup () {
:
}
trap "cleanup" SIGPIPE
Treat unset variables as an error, and immediately exit.
In [2]:
set -u
0.1. Files, directories and working with text¶
Quickly reveiw what the following navigation and file/directory manipulation commands do:
>, >>, <, |, ls, cd, pwd, cp, mv, rm, mkdir, rmdir
This session dives quite deep into Unix shell scripting and invovles quite a bit of text processing. We review some basic commands for working with text in Unix.
This is the file we will warm up with¶
We create it using a HEREDOC that allows us to construct multi-line
files and redirrect it to the file junk.txt
In [3]:
cat <<EOF > junk.txt
1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
4. TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
EOF
We can see the contents of a file with cat
¶
In [4]:
cat junk.txt
1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
4. TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
Using head
and tail
¶
In [5]:
head -3 junk.txt
1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
In [6]:
tail -3 junk.txt
2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
4. TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
In [7]:
tail +3 junk.txt
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
4. TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
Pipling to chain commnads¶
In [8]:
head -3 junk.txt | tail -2
2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
Searching within a file with grep
¶
grep
takes a REGEX (regular expression) as the query and a file (or files) to search- Normally grep shows lines that match the REGEX argument
- The
-v
flag shows the lines that DO NOT match the REGEX argument
In [9]:
grep "[2|4]" junk.txt
2. GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
4. TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
In [10]:
grep -v "[2|4]" junk.txt
1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
Use a regular expression with grep
¶
In [11]:
grep "CCC.*CCC" junk.txt
1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
Get rid of numbers with cut
¶
cut
extracts columns separated by delimiters - here the delimiter is
a space -d ' '
- and we cut the second column -f 2
In [12]:
cut -d ' ' -f 2 junk.txt
CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
Get rid of numbers with grep
¶
The -o
flag to grep
shows only the strings mathcing the regular
expression
In [13]:
grep -o '[A-Z]*' junk.txt
CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
Extract only odd numbered lines with sed
¶
- The
-n
says not to print anything - Then we print
p
the lines starting at 1 with a step of 2
In [14]:
sed -n "1~2p" junk.txt
1. CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
3. GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
Use sed
for substitution¶
The syntax is s/PATTERN/REPLACMENT/g/
- The
s///
measn substitute - The
g
means replace globally; otherwise just replaces the first patttern found [0-9][0-9]*
is a regular expression matching a digit followed by zero or more digits\.
matches the literal.
[[:space:]]
matches a single space character
In [15]:
sed "s/[0-9][0-9]*\.[[:space:]]//g" junk.txt
CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
Use sed
to remove dupplicate characters using a REGEX capture group¶
In [16]:
sed -E "s/(.)\1*/\1/g" junk.txt
1. CTACTACTACTCAGTCGAGCGACATAGCTG
2. GTCATATGTCTGTACGTATGCACTGATCATACTGATCGAGC
3. GACGATATCAGACGACATGTCGTCTGACACGA
4. TCTCGTCGCTGCGTGTCTCTCAGTGCAGTACG
Convert DNA to RNA with tr
¶
In [17]:
cat junk.txt | tr T U
1. CCCUAACCCCCUAACCCCCUAACCCUCAGUCGGGGAGGCGACAAUAGCUG
2. GUCAUAUGUUCUGUACGUUAUUGGCCAACUGAUCAUACCUGAAUCGAGCC
3. GAACCGGGAUUAUCAAAGACGAACAUGGUCGGGUCCUUGAACCAAACGAA
4. UCUCCGUCCGCUGGCGUGUUUUUCUUUUCUCAAGUGGGCAAGUUACCCGG
Reverse complement with tr
¶
In [18]:
grep -o '[ACTG]*' junk.txt | tr ACGT TGCA | rev
CAGCTATTGTCGCCTCCCCGACTGAGGGTTAGGGGGTTAGGGGGTTAGGG
GGCTCGATTCAGGTATGATCAGTTGGCCAATAACGTACAGAACATATGAC
TTCGTTTGGTTCAAGGACCCGACCATGTTCGTCTTTGATAATCCCGGTTC
CCGGGTAACTTGCCCACTTGAGAAAAGAAAAACACGCCAGCGGACGGAGA
Split command over multiple lines¶
Use \
to continue command to next line
In [19]:
grep -o '[ACTG]*' junk.txt | tr \
ACGT TGCA | rev
CAGCTATTGTCGCCTCCCCGACTGAGGGTTAGGGGGTTAGGGGGTTAGGG
GGCTCGATTCAGGTATGATCAGTTGGCCAATAACGTACAGAACATATGAC
TTCGTTTGGTTCAAGGACCCGACCATGTTCGTCTTTGATAATCCCGGTTC
CCGGGTAACTTGCCCACTTGAGAAAAGAAAAACACGCCAGCGGACGGAGA
Actually the line continuation character is not necesssary if you break
at a pipe |
In [20]:
grep -o '[ACTG]*' junk.txt |
tr ACGT TGCA |
rev
CAGCTATTGTCGCCTCCCCGACTGAGGGTTAGGGGGTTAGGGGGTTAGGG
GGCTCGATTCAGGTATGATCAGTTGGCCAATAACGTACAGAACATATGAC
TTCGTTTGGTTCAAGGACCCGACCATGTTCGTCTTTGATAATCCCGGTTC
CCGGGTAACTTGCCCACTTGAGAAAAGAAAAACACGCCAGCGGACGGAGA
0.2 Getting files¶
We will download Chromosomes 7 and 8 of Crytococcus neoformans in FASTA format from NCBI Nucleotide. We can find their IDs through a web query.
We find that the IDs are
CM000046.1 for chormosome 7
CM000047.1 for chormosome 8
We will use the NCBI Entrez tool efetch
to download the FASTA files
to a folder seqs
.
In [21]:
cd
mkdir -p seqs
efetch -db=nuccore -format=fasta -id=CM000046.1 > seqs/CM000046.1.fa
efetch -db=nuccore -format=fasta -id=CM000047.1 > seqs/CM000047.1.fa
In [22]:
ls seqs
CM000046.1.fa CM000047.1.fa CM000047.1.fa.bak
In [23]:
head seqs/CM000046.1.fa
>CM000046.1 Cryptococcus neoformans var. neoformans B-3501A chromosome 7, whole genome shotgun sequence
TGGTCTTATGAGGAAGAGGAGTTTGGATTATTTTTTCTTTTCTTTAAAAAGTTGTTTATTTAAGTAGTTT
CTTTAATTCGGGTAACACACACGACAACCCAATAAATTAAACAACGAAAAAATGCAACCTCTATAACCCC
CTAACCCTTTTATTCGGGTAACACACACGACAACCCAATAAATTAAACAACGAAAAATGCAACCTCTATA
ACCCCCGAAAAGGATTGGTGTCGAGTTAGTAAAGGCGAGGGCTAGGACGCTGGTCTTGTGAGGAAGAGGA
GGTTGGATTATTTTTTCTTTTCTTTAATAAGTTGTTTATTTAAGTAGTTTCTTTTATTCGGGTAACACAC
ACGACAACCCAATAAATTAAACAACGAAAAATGCAACCTCTATAACCCCCATCCAAGGTGCCACGGTTGA
GATTGAGCCCCACACACTGTCGGGACAAAGGAGAAACGACCTTCGGGTCAGAGGTTCCAGCGCTCTGGCC
TTCACTGACTACGACCTGAAGGTTTACTCCCTCGGGGACCGAGACGCGAGGAGCACAGCCACCCCCAGCA
CCCCCAACAGCAAACTGGCCGACTTCTGCTTGGACCGGTGCGTGAACTGGCTCGACAAGGTGGGTCAGGT
In [24]:
head seqs/CM000047.1.fa
>CM000047.1 Cryptococcus neoformans var. neoformans B-3501A chromosome 8, whole genome shotgun sequence
ATGGCCTTCCTGACTACGACCTGAGGTTTACTCCCTCGGCGACCGAGACGCGAGGAGCACAGCCACCCCA
AGCACCCCCAACAGCAAGCTGGCCGAATTCTGCTTGGACCGGTGCGTGAACTGGCTCGACAAGGTGGGTC
AGGTCGTCTCGAAGAACGCTCCGAAAGTCACTGGTGGGGTCTTTAAACCGATCATCCTTTCCACTGGTGG
CCTGATGAGCAGGAGCACAGCAGACGAATGGAAGGAGTGGAGGGAGGCGATGCCGGTGGGGGGGTTCGAG
AAAATGGAGAAACGGATTGGTGTCGAGTTAGTAAAGGCGAGGGCTAGGACGCTGGTCTTGTGAGGAAGAG
GAGGTTGGATTATTTTTTCTTTTCTTTAAAAAGTTGTTTATTTAAGTAGTTTCTTTAATTCGGGCAACCC
ACACGACAACCCAATAAATTAAACAACGAAAAATGCAACCTCTATAACCCTCAATAGCTCTGGCTGGTAC
AGATGGGCAGAAACTCACCTTGCTCCCACGGCATACATCCCTCCCAATAAGTATCTGCAGATGGCGAATA
AACCCCATGCTCTGCGCAAATATATTGAAGGCTGACGACTTTGATGTTATGATCCACCCCATAAACGCTC
0.3 Compressing ard archiving files¶
In Bioinformatics the most common formats are gzip
and bgzip
.
In [25]:
cd ~/seqs
ls
CM000046.1.fa CM000047.1.fa CM000047.1.fa.bak
Compressing with bgzip
¶
bgzip
(Blocked GNU Zip Format) is commonly used in Bioinformatics,
and allows random access of the content. BAM, BCF and VCF file formats
are typically bgzip
compressed.
Unfortunately bgzip
does not take multiple files, and we have to use
the xargs
command to pipe in one file at a time.
In [28]:
xargs --help | grep '\-n,'
-n, --max-args=MAX-ARGS use at most MAX-ARGS arguments per command line
In [29]:
ls *fa | xargs -n1 bgzip
ls
CM000046.1.fa.gz CM000047.1.fa.bak CM000047.1.fa.gz
Note that bgzip
files can be uncompressed with gzip
¶
In [30]:
gunzip *gz
ls
CM000046.1.fa CM000047.1.fa CM000047.1.fa.bak
Viewing compressed files¶
Some utilities can work directly with gzipped files. One that we will be
using extensively is zcat
.
In [31]:
ls *fa | xargs -n1 bgzip
ls
CM000046.1.fa.gz CM000047.1.fa.bak CM000047.1.fa.gz
In [32]:
zcat CM000046.1.fa.gz | head
>CM000046.1 Cryptococcus neoformans var. neoformans B-3501A chromosome 7, whole genome shotgun sequence
TGGTCTTATGAGGAAGAGGAGTTTGGATTATTTTTTCTTTTCTTTAAAAAGTTGTTTATTTAAGTAGTTT
CTTTAATTCGGGTAACACACACGACAACCCAATAAATTAAACAACGAAAAAATGCAACCTCTATAACCCC
CTAACCCTTTTATTCGGGTAACACACACGACAACCCAATAAATTAAACAACGAAAAATGCAACCTCTATA
ACCCCCGAAAAGGATTGGTGTCGAGTTAGTAAAGGCGAGGGCTAGGACGCTGGTCTTGTGAGGAAGAGGA
GGTTGGATTATTTTTTCTTTTCTTTAATAAGTTGTTTATTTAAGTAGTTTCTTTTATTCGGGTAACACAC
ACGACAACCCAATAAATTAAACAACGAAAAATGCAACCTCTATAACCCCCATCCAAGGTGCCACGGTTGA
GATTGAGCCCCACACACTGTCGGGACAAAGGAGAAACGACCTTCGGGTCAGAGGTTCCAGCGCTCTGGCC
TTCACTGACTACGACCTGAAGGTTTACTCCCTCGGGGACCGAGACGCGAGGAGCACAGCCACCCCCAGCA
CCCCCAACAGCAAACTGGCCGACTTCTGCTTGGACCGGTGCGTGAACTGGCTCGACAAGGTGGGTCAGGT
In [33]:
gunzip *gz
ls
CM000046.1.fa CM000047.1.fa CM000047.1.fa.bak
Archival with tar
¶
Often we want to compress an entire direcotry (possibly with
subdirectories). In Unix, this is most commonly done with tar
. Flags
to note
c
compressx
extractz
gzip or gunzipf
filesv
verbose
In [34]:
tar --help | head -8
Usage: tar [OPTION...] [FILE]...
GNU 'tar' saves many files together into a single tape or disk archive, and can
restore individual files from the archive.
Examples:
tar -cf archive.tar foo bar # Create archive.tar from files foo and bar.
tar -tvf archive.tar # List all files in archive.tar verbosely.
tar -xf archive.tar # Extract all files from archive.tar.
In [35]:
cd
ls seqs
CM000046.1.fa CM000047.1.fa CM000047.1.fa.bak
In [36]:
tar czvf seqs.tar.gz seqs
ls seqs*
seqs/
seqs/CM000046.1.fa
seqs/CM000047.1.fa
seqs/CM000047.1.fa.bak
seqs.tar.gz
seqs:
CM000046.1.fa CM000047.1.fa CM000047.1.fa.bak
View contents of archive¶
In [37]:
tar tzvf seqs.tar.gz
drwxr-xr-x cliburn/cliburn 0 2018-06-21 10:25 seqs/
-rw-r--r-- cliburn/cliburn 1434513 2018-06-21 10:25 seqs/CM000046.1.fa
-rw-r--r-- cliburn/cliburn 1194378 2018-06-21 10:25 seqs/CM000047.1.fa
-rw-r--r-- cliburn/cliburn 1194398 2018-06-21 10:19 seqs/CM000047.1.fa.bak
Remove direcotry and regenerate from archive¶
In [38]:
rm -rf seqs
ls seqs*
seqs.tar.gz
In [39]:
tar xzvf seqs.tar.gz
ls seqs*
seqs/
seqs/CM000046.1.fa
seqs/CM000047.1.fa
seqs/CM000047.1.fa.bak
seqs.tar.gz
seqs:
CM000046.1.fa CM000047.1.fa CM000047.1.fa.bak
In [40]:
rm seqs.tar.gz
0.4 Checksums¶
When working with genomic data, we deal with very large files. There is a small risk that these files will be corrupted over time or during data transfer. To ensure that files are not changed, we use a “checksum” function. This is a function that generates an long, essentially random number called a checksum that represents the contents of the file. When the file contents change, so will the checksum. In theory, there is a very small probability that two different files generate the same checksum, but in practice the probability is too small to worry about.
There are several different algorithms for generating the checksums, and at least 3 Unix commands to do so, but they all work very similarly for our purposes.
The strategy is:
- Generate and store a checksum together with a data file whose integrity you care about
- When you use or download the data, re-generate the checksum (using the same algorithm e.g. MD5) and compare with the checksum
In [41]:
ls ~/seqs
CM000046.1.fa CM000047.1.fa CM000047.1.fa.bak
In [42]:
md5sum ~/seqs/*fa > MD5_CHECKSUM
We check that the files are unchanged wth the -c
flag¶
In [43]:
md5sum -c MD5_CHECKSUM
/home/cliburn/seqs/CM000046.1.fa: OK
/home/cliburn/seqs/CM000047.1.fa: OK
Suppsoe one of the files is corrupted¶
In [44]:
echo 'Oops I did it again' >> seqs/CM000047.1.fa
In [45]:
md5sum -c MD5_CHECKSUM \
| cat # this is jusst so we can Run All
/home/cliburn/seqs/CM000046.1.fa: OK
md5sum: /home/cliburn/seqs/CM000047.1.fa: FAILED
WARNING: 1 computed checksum did NOT match
Using diff
to find changes¶
Let’s rename the altered file, re-download the orginal and see what was changed.
In [46]:
mv seqs/CM000047.1.fa seqs/CM000047.1.fa.bak
In [47]:
efetch -db=nuccore -format=fasta -id=CM000047.1 > seqs/CM000047.1.fa
In [48]:
diff -u seqs/CM000047.1.fa seqs/CM000047.1.fa.bak \
| cat # this is jusst so we can Run All
--- seqs/CM000047.1.fa 2018-06-21 10:25:49.253700667 -0400
+++ seqs/CM000047.1.fa.bak 2018-06-21 10:25:48.425709045 -0400
@@ -16820,3 +16820,4 @@
CCGTGTCGGCACGCATCTGATTTTTTTCTTATTATTCATAACAACGGCTACAGATCACGCAAGCGCATCG
ACTCCGCTGCTGCGAAACACCACGAGTACTGCAACCATTGCTGCGGATGAGCTTAGGGGGTTAGGGGGTT
AGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGG
+Oops I did it again
Explanaiton from [official documentation]
Header
--- seqs/CM000047.1.fa 2018-06-21 09:58:18.246185758 -0400
+++ seqs/CM000047.1.fa.bak 2018-06-21 09:55:55.967707764 -0400
Explanation
The unified output format starts with a two-line header, which looks like this:
--- from-file from-file-modification-time
+++ to-file to-file-modification-time
The timestamp looks like ‘2002-02-21 23:30:39.942229878 -0800’ to indicate the date, time with fractional seconds, and time zone. The fractional seconds are omitted on hosts that do not support fractional timestamps.
You can change the header’s content with the --label=label option. See Alternate Names.
Hunk
@@ -16820,3 +16820,4 @@
CCGTGTCGGCACGCATCTGATTTTTTTCTTATTATTCATAACAACGGCTACAGATCACGCAAGCGCATCG
ACTCCGCTGCTGCGAAACACCACGAGTACTGCAACCATTGCTGCGGATGAGCTTAGGGGGTTAGGGGGTT
AGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGG
+Oops I did it again
Explanation
``` @@ from-file-line-numbers to-file-line-numbers @@ line-from-either-file line-from-either-file…
If a hunk contains just one line, only its start line number appears. Otherwise its line numbers look like ‘start,count’. An empty hunk is considered to start at the line that follows the hunk.
If a hunk and its context contain two or more lines, its line numbers look like ‘start,count’. Otherwise only its end line number appears. An empty hunk is considered to end at the line that precedes the hunk.
The lines common to both files begin with a space character. The lines that actually differ between the two files have one of the following indicator characters in the left print column:
‘+’ A line was added here to the first file.
‘-’ A line was removed here from the first file. ```
Verify¶
We can confirm that a line was added at the position given by diff
In [49]:
cat seqs/CM000047.1.fa | tail +16820 | head -3
CCGTGTCGGCACGCATCTGATTTTTTTCTTATTATTCATAACAACGGCTACAGATCACGCAAGCGCATCG
ACTCCGCTGCTGCGAAACACCACGAGTACTGCAACCATTGCTGCGGATGAGCTTAGGGGGTTAGGGGGTT
AGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGG
In [50]:
cat seqs/CM000047.1.fa.bak | tail +16820 | head -4
CCGTGTCGGCACGCATCTGATTTTTTTCTTATTATTCATAACAACGGCTACAGATCACGCAAGCGCATCG
ACTCCGCTGCTGCGAAACACCACGAGTACTGCAACCATTGCTGCGGATGAGCTTAGGGGGTTAGGGGGTT
AGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGGGTTAGGGG
Oops I did it again
1. Working with FASTA files¶
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:
>P01013 GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
Note that a FASTA file may contain mulitple sequences, each with its own description line.
Set up variables for files we will be using¶
In [51]:
DATA_DIR='/data/hts/2018/Granek_4837_180427A5'
FASTA='cneo040623.scaffs.fa.gz'
FASTQ='10_MA_C_S3_L003_R1_001.fastq.gz'
GTF='Cryptococcus_neoformans.ASM9104v1.39.gtf.gz'
This is how you get the value of a shell variable¶
In [52]:
echo $DATA_DIR
/data/hts/2018/Granek_4837_180427A5
List files in data direcotry¶
In [53]:
ls $DATA_DIR | head -3
10_MA_C_S3_L001_R1_001.fastq
10_MA_C_S3_L002_R1_001.fastq.gz
10_MA_C_S3_L003_R1_001.fastq.gz
Inspect the contents of a zipped file.¶
zcat
is like cat
but works on gzipped files.
In [54]:
zcat $DATA_DIR/$FASTA | head
>chr01.b3501.040506 5 contigs (2263073 bp)
CCCTAACCCCCTAACCCCCTAACCCTCAGTCGGGGAGGCGACAATAGCTG
GTCATATGTTCTGTACGTTATTGGCCAACTGATCATACCTGAATCGAGCC
GAACCGGGATTATCAAAGACGAACATGGTCGGGTCCTTGAACCAAACGAA
TCTCCGTCCGCTGGCGTGTTTTTCTTTTCTCAAGTGGGCAAGTTACCCGG
GGATTTATACAACGATAAGAGGCTATCCGCTGACTATAAATTGTGTTAGC
TGATCCAGATCCGCAGTTGGCACAAACCGATGTTTCCTTTTCCTCAGGTC
CTGAATATTTCTCTTCCATCAAGGCTCCCAACCCCGAGGGGAGTATTTCC
ACCCGATCTGATTCCAAGAGGTCCAGCGTCAACCAGGTACGTGAATGTTT
TCTTCTTTAGTGGTGGGGACCCTATGCTTGTCCCACAGAACTGCACTGAA
Show headers only¶
In [55]:
zcat $DATA_DIR/$FASTA | grep '^>' | head
>chr01.b3501.040506 5 contigs (2263073 bp)
>chr02.b3501.040506 6 contigs (1607892 bp)
>chr03.b3501.040506 6 contigs (2025976 bp)
>chr04.b3501.040623 5 contigs (1762351 bp)
>chr05.b3501.040506 7 contigs (1432964 bp)
>chr06.b3501.040616 4 contigs (1411173 bp)
>chr07.b3501.040506 6 contigs (1356306 bp)
>chr08.b3501.040506 3 contigs (1153453 bp)
>chr09.b3501.040506 5 contigs (1072068 bp)
>chr10.b3501.040506 3 contigs (1017008 bp)
Show headers with line numbers¶
In [56]:
zcat $DATA_DIR/$FASTA| grep -n '^>' | head
1:>chr01.b3501.040506 5 contigs (2263073 bp)
45264:>chr02.b3501.040506 6 contigs (1607892 bp)
77423:>chr03.b3501.040506 6 contigs (2025976 bp)
117944:>chr04.b3501.040623 5 contigs (1762351 bp)
153193:>chr05.b3501.040506 7 contigs (1432964 bp)
181854:>chr06.b3501.040616 4 contigs (1411173 bp)
210079:>chr07.b3501.040506 6 contigs (1356306 bp)
237207:>chr08.b3501.040506 3 contigs (1153453 bp)
260278:>chr09.b3501.040506 5 contigs (1072068 bp)
281721:>chr10.b3501.040506 3 contigs (1017008 bp)
Extract first part of the sequence from chromosome 7¶
In [58]:
zcat $DATA_DIR/$FASTA | sed -n '210079,210082p'
>chr07.b3501.040506 6 contigs (1356306 bp)
TGGTCTTATGAGGAAGAGGAGTTTGGATTATTTTTTCTTTTCTTTAAAAA
GTTGTTTATTTAAGTAGTTTCTTTAATTCGGGTAACACACACGACAACCC
AATAAATTAAACAACGAAAAAATGCAACCTCTATAACCCCCTAACCCTTT
Extract first line of chromosome 7 and save to file¶
In [59]:
zcat $DATA_DIR/$FASTA | sed -n '210080p' > dna.txt
Inspect the file we just saved¶
In [60]:
cat dna.txt
TGGTCTTATGAGGAAGAGGAGTTTGGATTATTTTTTCTTTTCTTTAAAAA
Find complement¶
In [61]:
cat dna.txt | tr ACGT TGCA
ACCAGAATACTCCTTCTCCTCAAACCTAATAAAAAAGAAAAGAAATTTTT
Find reverse complement¶
In [62]:
cat dna.txt | tr ACGT TGCA | rev
TTTTTAAAGAAAAGAAAAAATAATCCAAACTCCTCTTCCTCATAAGACCA
Delete A and T¶
In [63]:
cat dna.txt | tr -d 'AT'
GGCGGGGGGGGGCC
Only count non A and T¶
In [64]:
cat dna.txt | tr -d 'AT' | tr -d '\n' | wc -c
14
In [65]:
man tr | grep '\-d,' -A2
-d, --delete
delete characters in SET1, do not translate
Count C and G¶
In [66]:
grep '[CG]' -o dna.txt | wc -l
14
In [67]:
man grep | grep '\-o,' -A2
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
Specify data directory as a shell variable¶
2. Working with FASTQ files¶
From Wikipedia
A FASTQ file normally uses four lines per sequence.
Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
Specify target file as a shell variable¶
Show the contents of a zipped file¶
In [68]:
zcat $DATA_DIR/$FASTQ | head -n 10
@NB501800:128:HTYKFBGX5:3:11401:19195:1019 1:N:0:ATTACTCG+AGGATAGG
NGTAAGGGTCTGGGGGACGTTGTTGTGGATGGAAACAACCCTGGCCTCTCGGGGCTGAGCAGGGAGTCGCTTGAGC
+
#AAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEE/<<
@NB501800:128:HTYKFBGX5:3:11401:24339:1020 1:N:0:ATTACTCG+AGGATAGG
NCTCGCCTGGGGTGCTTGCGTGTCCCTGTCCTTGGCCCTGGTCACCGGGGAGATGCACATCATCTTCTTCTGGCTC
+
#AAAAEAAEEEEEEEEEEEEEEEE/EEE/EEEEEEEEEEEEEEEE<A/EEEE6EEE//EEEAEEEEEEE/EAEEEA
@NB501800:128:HTYKFBGX5:3:11401:24140:1020 1:N:0:ATTACTCG+AGGATAGG
NTGATCACGGGAGACAAGCTAGAGATGCAAAATGCCATAGTACGATCATCCCATACAAGATGGGTGAGAGAAAAAG
Extracting just the sequence¶
We need to extract every 4th line starting from the 2nd. We cna use a
tool called sed
(stream editor) to do so.
In [69]:
zcat $DATA_DIR/$FASTQ | sed -n '2~4p;' | head
NGTAAGGGTCTGGGGGACGTTGTTGTGGATGGAAACAACCCTGGCCTCTCGGGGCTGAGCAGGGAGTCGCTTGAGC
NCTCGCCTGGGGTGCTTGCGTGTCCCTGTCCTTGGCCCTGGTCACCGGGGAGATGCACATCATCTTCTTCTGGCTC
NTGATCACGGGAGACAAGCTAGAGATGCAAAATGCCATAGTACGATCATCCCATACAAGATGGGTGAGAGAAAAAG
NGGGCCTTGGTCCTGCCCTCAGCCTTGATAGGACCCTTGCCGTCGATGGGGTTACCCAAAGCGTCAACAACTCGGC
NACCTTTCGGAGACCTCGGTGGGTCTTCCTGGGGAGACGGGTAGTACCCCATCGGGCGGTAACACCCTCGTAACCG
NCCGAACTGTTGAAAAGAGCTTCCACGCTCCAAAGGCAGGGACAAGAGTAAAGAGCCACCACAGCCAATCTCCAAA
NGCAAACTGAACGTCCTCAGTCTCCATTCGCTGGAAGTCCGATTCAGGACGGGCAGCGTCATCGATAGAGAATGGA
NGCATTACCAGATGCACTGATAGCAATGATCGTCTCATCTTCCAAGTACTTGCTCACAGCGATACTATCACTCAAA
GGCGATCTTCCTCTTGGGGAAGTGAACGTTCTTCTTGTGCTGCTTCTCGGTGATAGCAGAAGCCTCCTTGGTAGTC
CGGCGATGATGAGAGTGAAGGGATCTTGCCACATGAGAGGGGATGATCGACGGGGACGAGGACGGATGGGGATGAT
Explanation
sed -n '2~4p'
-n
means to suppress all output2~4p
means print (p) every 4th line starting at line 2
Joining lines¶
Getting sequence as a single string. We just need to delete newlline (:raw-latex:`\n`) characters.
Note: We can break long commands uisng the line continuation character
In [70]:
zcat $DATA_DIR/$FASTQ | \
sed -n '2~4p;' | \
head | \
tr -d '\n'
NGTAAGGGTCTGGGGGACGTTGTTGTGGATGGAAACAACCCTGGCCTCTCGGGGCTGAGCAGGGAGTCGCTTGAGCNCTCGCCTGGGGTGCTTGCGTGTCCCTGTCCTTGGCCCTGGTCACCGGGGAGATGCACATCATCTTCTTCTGGCTCNTGATCACGGGAGACAAGCTAGAGATGCAAAATGCCATAGTACGATCATCCCATACAAGATGGGTGAGAGAAAAAGNGGGCCTTGGTCCTGCCCTCAGCCTTGATAGGACCCTTGCCGTCGATGGGGTTACCCAAAGCGTCAACAACTCGGCNACCTTTCGGAGACCTCGGTGGGTCTTCCTGGGGAGACGGGTAGTACCCCATCGGGCGGTAACACCCTCGTAACCGNCCGAACTGTTGAAAAGAGCTTCCACGCTCCAAAGGCAGGGACAAGAGTAAAGAGCCACCACAGCCAATCTCCAAANGCAAACTGAACGTCCTCAGTCTCCATTCGCTGGAAGTCCGATTCAGGACGGGCAGCGTCATCGATAGAGAATGGANGCATTACCAGATGCACTGATAGCAATGATCGTCTCATCTTCCAAGTACTTGCTCACAGCGATACTATCACTCAAAGGCGATCTTCCTCTTGGGGAAGTGAACGTTCTTCTTGTGCTGCTTCTCGGTGATAGCAGAAGCCTCCTTGGTAGTCCGGCGATGATGAGAGTGAAGGGATCTTGCCACATGAGAGGGGATGATCGACGGGGACGAGGACGGATGGGGATGAT
3.Working wtih GTF files¶
From UCSD Genome Browser
The GTF (General Transfer Format) is an extension of GFF version 2
and used to represent transcription models. GFF (General Feature Format)
consists of one line per feature, each containing 9 columns of data.
Fields
Fields are tab-separated. Also, all but the final field in each
feature line must contain a value; "empty" columns are denoted
with a '.'
seqname - name of the chromosome or scaffold; chromosome names
without a 'chr'
source - name of the program that generated this feature, or
the data source (database or project name)
feature - feature type name. Current allowed features are
{gene, transcript, exon, CDS, Selenocysteine, start_codon,
stop_codon and UTR}
start - start position of the feature, with sequence numbering
starting at 1.
end - end position of the feature, with sequence numbering
starting at 1.
score - a floating point value indiciating the score of a feature
strand - defined as + (forward) or - (reverse).
frame - one of '0', '1' or '2'. Frame indicates the number of base pairs
before you encounter a full codon. '0' indicates the feature
begins with a whole codon. '1' indicates there is an extra
base (the 3rd base of the prior codon) at the start of this feature.
'2' indicates there are two extra bases (2nd and 3rd base of the
prior exon) before the first codon. All values are given with
relation to the 5' end.
attribute - a semicolon-separated list of tag-value pairs (separated by a space),
providing additional information about each feature. A key can be
repeated multiple times.
Attributes
The following attributes are available. All attributes are semi-colon
separated pairs of keys and values.
- gene_id: The stable identifier for the gene
- gene_version: The stable identifier version for the gene
- gene_name: The official symbol of this gene
- gene_source: The annotation source for this gene
- gene_biotype: The biotype of this gene
- transcript_id: The stable identifier for this transcript
- transcript_version: The stable identifier version for this transcript
- transcript_name: The symbold for this transcript derived from the gene name
- transcript_source: The annotation source for this transcript
- transcript_biotype: The biotype for this transcript
- exon_id: The stable identifier for this exon
- exon_version: The stable identifier version for this exon
- exon_number: Position of this exon in the transcript
- ccds_id: CCDS identifier linked to this transcript
- protein_id: Stable identifier for this transcript's protein
- protein_version: Stable identifier version for this transcript's protein
- tag: A collection of additional key value tags
- transcript_support_level: Ranking to assess how well a transcript is supported (from 1 to 5)
Inspect the GTF file¶
In [71]:
zcat $DATA_DIR/$GTF | head
#!genome-build ASM9104v1
#!genome-version ASM9104v1
#!genome-date 2016-06
#!genome-build-accession GCA_000091045.1
#!genebuild-last-updated 2016-06
1 ena gene 4839 6236 . + . gene_id "CNA00010"; gene_source "ena"; gene_biotype "protein_coding";
1 ena transcript 4839 6236 . + . gene_id "CNA00010"; transcript_id "AAW41410"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1 ena exon 4839 5178 . + . gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-1"; exon_version "1";
1 ena CDS 4914 5178 . + 0 gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AAW41410"; protein_version "1";
1 ena start_codon 4914 4916 . + 0 gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
Ignore comment lines version 1¶
This uses tail
to take all lines starting from line 6
In [72]:
zcat $DATA_DIR/$GTF | tail +6 | head -3
1 ena gene 4839 6236 . + . gene_id "CNA00010"; gene_source "ena"; gene_biotype "protein_coding";
1 ena transcript 4839 6236 . + . gene_id "CNA00010"; transcript_id "AAW41410"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1 ena exon 4839 5178 . + . gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-1"; exon_version "1";
Ingore comment lines 2¶
This uses grep
to ignore (-v
) lines starting with ^#
In [73]:
zcat $DATA_DIR/$GTF | grep -v '^#' | head -3
1 ena gene 4839 6236 . + . gene_id "CNA00010"; gene_source "ena"; gene_biotype "protein_coding";
1 ena transcript 4839 6236 . + . gene_id "CNA00010"; transcript_id "AAW41410"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1 ena exon 4839 5178 . + . gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-1"; exon_version "1";
Find all unique feature types in GTF¶
In [74]:
zcat $DATA_DIR/$GTF | grep -v '^#' | cut -f 3 | sort | uniq
CDS
exon
five_prime_utr
gene
start_codon
stop_codon
three_prime_utr
transcript
Get counts of all unique feature types in GTF¶
In [75]:
zcat $DATA_DIR/$GTF | grep -v '^#' | cut -f 3 | sort | uniq -c
44079 CDS
44344 exon
6738 five_prime_utr
6886 gene
6894 start_codon
2369 stop_codon
6718 three_prime_utr
7117 transcript
Find the the attributes of the first 3 exons¶
In [76]:
zcat $DATA_DIR/$GTF | grep -v '^#' | awk '$3 == "exon"' | head -3
1 ena exon 4839 5178 . + . gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-1"; exon_version "1";
1 ena exon 5241 5377 . + . gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "2"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-2"; exon_version "1";
1 ena exon 5453 5560 . + . gene_id "CNA00010"; transcript_id "AAW41410"; exon_number "3"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AAW41410-3"; exon_version "1";
Store attributes of all genes in a file¶
It is common to use Unix commands to do simple processing and save the results to a file, then use a scripting language such as R or Python to furhter process the file.
In [77]:
zcat $DATA_DIR/$GTF | grep -v '^#' | awk '$3 == "gene"' | cut -f 9 > genes.txt
In [78]:
head -6 genes.txt
gene_id "CNA00010"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00020"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00030"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00040"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00050"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00060"; gene_source "ena"; gene_biotype "protein_coding";
4. Looping¶
Loop throgh first 10 lines¶
- Creae a counter variable
LINE_NUM
and initialize it to 0 - Redirect
genes.txt
to a while loop - Whlie loop reads each line into the variable
LINE
- The
[[ CONDITION ]]
tests if CONDITION is true - If LINE_NUM is less than
-lt
10 print LINE and increment LINE_NUM - Note that
(( EXPR ))
is used to evaluate numeric expressions
In [79]:
LINE_NUM=0
while read LINE
do
if [[ $LINE_NUM -lt 10 ]]; then
echo $LINE
(( LINE_NUM++ ))
fi
done < genes.txt
gene_id "CNA00010"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00020"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00030"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00040"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00050"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00060"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00070"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00075"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00080"; gene_source "ena"; gene_biotype "protein_coding";
gene_id "CNA00090"; gene_source "ena"; gene_biotype "protein_coding";
Loop through first 10 tRNAs¶
Almost the same as previous program but we add a new condition to check
tha the gene_biotype is tRNA using a REGULAr EXPRESSION via the match
operator =~
.*gene_biotype[[:space:]]\"tRNA\".*
means
.*
: Match any number of any charactergene_biotype
: Match the string gene_biotype[[:space:]]
: Match a single space cahracter\"tRNA\"
: Math the string “tRNA” - we escape the double quotes to match linteral double quote characters.*
: Match any number of any character
In [80]:
LINE_NUM=0
while read LINE
do
if [[ $LINE_NUM -lt 10 ]] && [[ $LINE =~ .*gene_biotype[[:space:]]\"tRNA\".* ]] ; then
echo $LINE
((LINE_NUM++))
fi
done < genes.txt
gene_id "CNA02110"; gene_name "CNA02110"; gene_source "ena"; gene_biotype "tRNA";
gene_id "EBG00005237496"; gene_name "tRNA"; gene_source "Rfam"; gene_biotype "tRNA";
gene_id "CNA02190"; gene_name "CNA02190"; gene_source "ena"; gene_biotype "tRNA";
gene_id "EBG00005237490"; gene_name "tRNA"; gene_source "Rfam"; gene_biotype "tRNA";
gene_id "CNA02200"; gene_name "CNA02200"; gene_source "ena"; gene_biotype "tRNA";
gene_id "CNA02280"; gene_name "CNA02280"; gene_source "ena"; gene_biotype "tRNA";
gene_id "EBG00005237502"; gene_name "tRNA"; gene_source "Rfam"; gene_biotype "tRNA";
gene_id "CNA02320"; gene_name "CNA02320"; gene_source "ena"; gene_biotype "tRNA";
gene_id "EBG00005237484"; gene_name "tRNA"; gene_source "Rfam"; gene_biotype "tRNA";
gene_id "CNA03230"; gene_name "CNA03230"; gene_source "ena"; gene_biotype "tRNA";
Loop through files in a directory and print the filename followed by the first line¶
- Use a COMMAND EXPRESSION
$(EXPR)
to list files matching the glob pattern - We loop through each file using a
for
statement - For each file, extract the first line with
head -1
and save in variableFIRST
- The
2>/dev/null
ignores output from thestderr
stream - We use another COMMAND EXPRESSION to print the filename withou the
directory (using
basename
) - We then print the first line storred in
FIRST
In [81]:
for FILE in $(ls $DATA_DIR/*_MA_J_S20_*gz)
do
FIRST=$(zcat $FILE 2>/dev/null | head -1)
echo $(basename $FILE)
echo $FIRST
echo
done
11_MA_J_S20_L001_R1_001.fastq.gz
@NB501800:128:HTYKFBGX5:1:11101:26760:1056 1:N:0:CGCTCATT+NGGATAGG
11_MA_J_S20_L002_R1_001.fastq.gz
@NB501800:128:HTYKFBGX5:2:11101:20327:1041 1:N:0:CGCTCATT+AGGATAGN
11_MA_J_S20_L003_R1_001.fastq.gz
@NB501800:128:HTYKFBGX5:3:11401:10896:1024 1:N:0:CGCTCATT+AGGATAGG
11_MA_J_S20_L004_R1_001.fastq.gz
@NB501800:128:HTYKFBGX5:4:11401:8259:1032 1:N:0:CGCTCATT+AGGATAGG