Homework: Counting k-mers

Q1. Finding sequencing errors.

Q1. Finding sequencing errors.

We generated a simulated set of sequenced DNA reads of the 16 yeast chromosomes at 30-fold coverage in the file data/HW_KMER/reads.txt. Each read has a length of 75 bases. Some of the reads have sequencing errors. We would like to identify which lines in the file contains reads with sequencing errors.

(Part 1) One simple way to do this is by k-mer (a DNA sequence of length k) counting. Consider the k-mers generated by a shifting window of width k for each read. If the sequence has no error, each k-mer will likely be found in multiple copies due to the 30-fold coverage. However, if there is a random sequencing error, the k-mer will likely be found in only 1 or a few copies (depends on length of k). Use k=15. Identify the lines likely to contain reads with sequencing errors as any line containing a k-mer with a count of only 1. Because we are dealing with a potentially large number of reads, write the code so that it minimizes memory usage using streams. You may use the toolz library if you think it will be helpful. (50 points)

(Part 2) The actual line numbers for reads with sequencing errors is found in ‘data/HW_KMER/error_lines.txt. Line numbers start with 0, not 1. Of the reads that you believe to have errors, how many are false positives? For each false positive read, identify the chromosome(s) and starting position(s) it comes from. (50 points)

Chromosome sequences in FASTA format are in

data/HW_KMER/chr01.fas
data/HW_KMER/chr02.fas
data/HW_KMER/chr03.fas
data/HW_KMER/chr04.fas
data/HW_KMER/chr05.fas
data/HW_KMER/chr06.fas
data/HW_KMER/chr07.fas
data/HW_KMER/chr08.fas
data/HW_KMER/chr09.fas
data/HW_KMER/chr10.fas
data/HW_KMER/chr11.fas
data/HW_KMER/chr12.fas
data/HW_KMER/chr13.fas
data/HW_KMER/chr14.fas
data/HW_KMER/chr15.fas
data/HW_KMER/chr16.fas

Hints:

  1. Check your code on a smaller version of the original data or your own simulated data before applying it to the full data set (which can take a long time).
  2. You only need to consider k-mers from sliding windows generated for a single line. Ignore k-mers that wrap across two lines.
  3. When generating k-mers, also keep a record of the line number that the k-mer was generated from. This avoids having to later search all reads to find out where a k-mer comes from, which is a very expensive operation.
In [ ]: