Working with the Unix Shell 2

Exercise 1.1. Download a compressed data archive from https://www.dropbox.com/s/vivut71p4bkurhw/data.tar.gz (10 points)

In [ ]:

Exercise 1.2. Regenerate the original data folder from data.tar.gz. Change directory into the data folder. List the files in the folder. (10 points)
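
As a starting point, the steps can be sketched as a small shell function. The name unpack_and_list is hypothetical, and the sketch assumes the archive unpacks into a single top-level folder whose name you pass as the second argument.

```shell
# Hypothetical helper: unpack a gzip-compressed tar archive, change into the
# folder it creates, and list that folder's contents.
#   $1 = archive name (for example data.tar.gz)
#   $2 = folder the archive is assumed to unpack into (for example data)
unpack_and_list() {
    tar -xzf "$1" || return 1   # -x extract, -z decompress gzip, -f archive file
    cd "$2" || return 1         # stays in effect for the calling shell
    ls
}
```

Calling unpack_and_list data.tar.gz data from an interactive shell leaves you inside the data folder, because the function runs in the current shell rather than in a subshell.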

In [ ]:

Exercise 1.3. Using the MD5SUM checksum file, check whether any files have been corrupted, and note the FILENAME of each corrupted file. Delete any corrupted files. (10 points)
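
One way to approach the check is sketched below, assuming GNU md5sum is installed. The function name list_corrupted is hypothetical, and the name of the checksum file is passed as its argument.

```shell
# Hypothetical helper: print the names of files whose MD5 digest does not
# match the entry recorded in a checksum file (assumes GNU md5sum).
#   $1 = checksum file, in the format written by md5sum itself
list_corrupted() {
    # md5sum -c prints "NAME: OK" or "NAME: FAILED" for each listed file.
    # Files that cannot be read are reported differently and are not
    # caught by this pattern.
    md5sum -c "$1" 2>/dev/null | grep ': FAILED$' | sed 's/: FAILED$//'
}
```

Run it inside the data folder, note the name it prints, and then delete that file with rm.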

In [ ]:

Exercise 1.4. Replace the corrupted file with a correct copy from https://www.dropbox.com/s/vf8qcoj07mcq7wn/FILENAME. You will need to replace FILENAME with the correct filename as noted earlier. Check that there are no more md5sum errors. Go back to the original directory. (10 points)

In [ ]:

Exercise 2.1. (20 points) Write a script extract.sh that extracts only the raw sequence letters and quality scores from a FASTQ file. For example, if test.fq consists of

@071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+071112_SLXA-EAS1_s_7:5:1:817:345
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
@071112_SLXA-EAS1_s_7:5:1:801:338
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
+071112_SLXA-EAS1_s_7:5:1:801:338
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI

then

cat test.fq | bash extract.sh

should print

GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI

For all of Exercise 2, create test.fq, and run your script on test.fq to check that it gets the expected results.
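
A minimal sketch of this setup follows. It writes test.fq with the two records shown above and then writes one possible extract.sh. The sketch assumes every FASTQ record is exactly four lines (header, sequence, separator, quality), with no wrapped sequence lines.

```shell
# Create test.fq with the two example records.
cat > test.fq <<'EOF'
@071112_SLXA-EAS1_s_7:5:1:817:345
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+071112_SLXA-EAS1_s_7:5:1:817:345
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
@071112_SLXA-EAS1_s_7:5:1:801:338
GTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
+071112_SLXA-EAS1_s_7:5:1:801:338
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII6IBI
EOF

# Write a candidate extract.sh that reads FASTQ from standard input.
cat > extract.sh <<'EOF'
#!/usr/bin/env bash
# Keep lines 2 (sequence) and 4 (quality) of each 4-line record.
awk 'NR % 4 == 2 || NR % 4 == 0'
EOF
chmod +x extract.sh
```

Selecting by line position rather than by the first character is deliberate: a quality string may itself begin with @ or +, so filtering out lines that start with those characters could drop real data.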

In [ ]:

Exercise 2.2. (20 points) Write a script extract_seq.sh that extracts only the raw sequence letters from a FASTQ file and combines them into a single string. For example,

cat test.fq | bash extract_seq.sh

should print

GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCGTTCAGGGATACGACGTTTGTATTTTAAGAATCTGA
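
One possible extract_seq.sh is sketched below, again assuming four-line FASTQ records with unwrapped sequence lines.

```shell
# Write a candidate extract_seq.sh that reads FASTQ from standard input.
cat > extract_seq.sh <<'EOF'
#!/usr/bin/env bash
# Keep line 2 of each 4-line record and delete the newlines between reads.
awk 'NR % 4 == 2' | tr -d '\n'
echo   # finish the combined string with a single newline
EOF
```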

In [ ]:

Exercise 2.3. (20 points) Write a script calc_gc.sh that estimates the GC ratio (the fraction of all bases that are either G or C) from a FASTQ file. For example, the GC ratio of test.fq is .5138.
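
A hedged sketch of one possible calc_gc.sh follows. It assumes four-line FASTQ records and counts both upper-case and lower-case G and C.

```shell
# Write a candidate calc_gc.sh that reads FASTQ from standard input.
cat > calc_gc.sh <<'EOF'
#!/usr/bin/env bash
# For each sequence line (line 2 of every 4-line record), add its length to
# the base total, then count G/C characters via gsub's return value.
awk 'NR % 4 == 2 {
         total += length($0)          # measure before gsub edits the line
         gc += gsub(/[GCgc]/, "")     # gsub returns the number of matches
     }
     END { if (total > 0) printf "%.4f\n", gc / total }'
EOF
```

This sketch rounds to four decimal places. The two reads in test.fq contain 37 G or C bases out of 72, so it prints 0.5139; the .5138 quoted in the exercise is the same ratio truncated rather than rounded.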

In [ ]: