Working with the Unix Shell 2
Exercise 1.1. Download a compressed data archive from https://www.dropbox.com/s/vivut71p4bkurhw/data.tar.gz (10 points)
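One possible way to do the download is sketched below, assuming `curl` is installed. The helper name `fetch` is hypothetical and not part of the exercise. The `-L` flag follows redirects, which share links often use, and `-f` makes curl fail on an HTTP error instead of saving the error page. Whether this link serves the raw archive rather than an HTML preview is an assumption worth checking afterwards with `file data.tar.gz`.

```shell
# fetch URL DEST: download URL and save it as DEST (hypothetical helper).
#   -f  fail on HTTP errors    -sS  quiet, but still report errors
#   -L  follow redirects       -o   write to the named file
fetch() {
    curl -fsSL -o "$2" "$1"
}

# Usage (needs network access):
#   fetch https://www.dropbox.com/s/vivut71p4bkurhw/data.tar.gz data.tar.gz
```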
In [ ]:
Exercise 1.2. Regenerate the original data folder from data.tar.gz.
Change directory into the data folder. List the files in the folder.
(10 points)
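The three steps can be sketched as a single helper, assuming the archive is a gzip-compressed tar file. The function name `unpack_and_list` is hypothetical, and the folder name is passed as an argument because the name the archive unpacks to (presumably `data`) is an assumption; `tar -tzf data.tar.gz` lists the contents without extracting if you want to check first.

```shell
# unpack_and_list ARCHIVE DIR: extract ARCHIVE, enter DIR, list its files.
# (hypothetical helper; DIR is the folder the archive is assumed to contain)
unpack_and_list() {
    tar -xzf "$1" &&   # x = extract, z = decompress with gzip, f = archive file
    cd "$2" &&
    ls
}

# Usage:
#   unpack_and_list data.tar.gz data
```

Because the function runs in the current shell, the working directory is the data folder afterwards.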
In [ ]:
Exercise 1.3. Check if any files have been corrupted using the MD5SUM
checksum file, and note the FILENAME of any corrupted file. Delete any
corrupted files. (10 points)
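A minimal sketch of the check, assuming GNU `md5sum` (from coreutils) and a checksum file in the format that `md5sum` itself writes. The helper name `list_corrupted` is hypothetical, and the checksum file is passed as an argument rather than hard-coded, since its exact name in the archive is an assumption.

```shell
# list_corrupted CHECKSUM_FILE: print the name of every file whose MD5 digest
# does not match the digest recorded in CHECKSUM_FILE (hypothetical helper).
list_corrupted() {
    # -c verifies the listed files; --quiet hides the "OK" lines, so only
    # lines such as "name: FAILED" remain. sed keeps just the file name.
    md5sum --quiet -c "$1" 2>/dev/null | sed -n 's/: FAILED$//p'
}

# Usage, from inside the data folder (checksum file name assumed):
#   list_corrupted MD5SUM                      # note the FILENAME printed
#   list_corrupted MD5SUM | while IFS= read -r f; do rm -- "$f"; done
```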
In [ ]:
Exercise 1.4. Replace the corrupted file with a correct copy from
https://www.dropbox.com/s/vf8qcoj07mcq7wn/FILENAME. You will need to
replace FILENAME with the correct filename as noted earlier. Check
that there are no more md5sum errors. Go back to the original
directory. (10 points)
In [ ]:
Exercise 2.1. (20 points) Write a script extract.sh that extracts only
the raw sequence letters and quality scores from a FASTQ file. For
example, if test.fq consists of
@071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+071112_SLXA-EAS1_s_7:5:1:817:345
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
@071112_SLXA-EAS1_s_7:5:1:801:338
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
+071112_SLXA-EAS1_s_7:5:1:801:338
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
then
cat test.fq | bash extract.sh
should print
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
For all of Exercise 2, create test.fq, and run your script on test.fq to check that it gets the expected results.
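One way to approach this is sketched below. It assumes every FASTQ record occupies exactly four lines (header, sequence, `+` line, quality), as in the example, so the wanted lines are the 2nd and 4th of each group of four. It would not handle files where a sequence is wrapped over several lines. The block first recreates test.fq from the exercise text, then writes the script and runs it.

```shell
# Create the example input from the exercise text.
cat > test.fq <<'EOF'
@071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+071112_SLXA-EAS1_s_7:5:1:817:345
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
@071112_SLXA-EAS1_s_7:5:1:801:338
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
+071112_SLXA-EAS1_s_7:5:1:801:338
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
EOF

# Write extract.sh. The script reads FASTQ text from standard input.
cat > extract.sh <<'EOF'
#!/bin/bash
# Print line 2 (sequence) and line 4 (quality) of every 4-line record.
awk 'NR % 4 == 2 || NR % 4 == 0'
EOF

# Run the script on the example file.
cat test.fq | bash extract.sh
```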
In [ ]:
Exercise 2.2. (20 points) Write a script extract_seq.sh that extracts
only the raw sequence letters from a FASTQ file and combines them into
a single string. For example,
cat test.fq | bash extract_seq.sh
should print
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCGTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
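A possible sketch, again assuming strictly four-line FASTQ records: keep only the sequence line of each record, then delete the newlines between them. The block below only writes the script file; it reads its input from standard input when run.

```shell
# Write extract_seq.sh.
cat > extract_seq.sh <<'EOF'
#!/bin/bash
# Keep line 2 (the sequence) of every 4-line record, then remove the
# newlines so that all sequences join into one string.
awk 'NR % 4 == 2' | tr -d '\n'
echo    # end the output with a single newline
EOF
```

It can then be run as `cat test.fq | bash extract_seq.sh`.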
In [ ]:
Exercise 2.3. (20 points) Write a script calc_gc.sh that estimates the
GC ratio (the fraction of all bases that are either G or C) from a
FASTQ file. For example, the GC ratio of test.fq is .5138 (37 of 72
bases, truncated to four decimal places).
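A minimal sketch of one approach, assuming four-line FASTQ records and reading from standard input. Only the sequence lines are counted, because quality strings can themselves contain the letters G and C. The result is truncated rather than rounded to four decimal places, which is an assumption chosen so that the example's 37/72 prints as 0.5138 and not 0.5139.

```shell
# Write calc_gc.sh.
cat > calc_gc.sh <<'EOF'
#!/bin/bash
# For each sequence line (line 2 of every 4-line record), add its length to
# the base total and its number of G/C characters to the GC total.
awk 'NR % 4 == 2 {
         total += length($0)
         gc    += gsub(/[GCgc]/, "")   # gsub returns the number of matches
     }
     END {
         # Truncate (rather than round) to four decimal places.
         if (total > 0) printf "%.4f\n", int(gc / total * 10000) / 10000
     }'
EOF
```

It can then be run as `cat test.fq | bash calc_gc.sh`.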
In [ ]: