Lab11B: Spark
Background
You will use Spark and Python to process genomic data: about 3 billion nucleotides in the human genome and a smaller number for the roundworm (nematode) C. elegans. The genome sequences are stored as FASTA files. For the purposes of this exercise, treat lower and upper case as the same. Recall that FASTA files have comment lines starting with ‘>’ that must be excluded from the analysis. For the exercises below, assume that k=20 for the k-mers.
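As a minimal sketch of the two preprocessing rules above (skip lines starting with ‘>’ and ignore case), the following plain-Python helper can be tried without a cluster. The function name `clean_fasta_lines` and the tiny example record are made up for illustration and are not part of the lab data.

```python
def clean_fasta_lines(lines):
    """Yield upper-cased sequence lines, skipping FASTA header lines."""
    for line in lines:
        line = line.strip()
        # Header/comment lines start with '>' and carry no sequence data.
        if not line or line.startswith('>'):
            continue
        # Treat lower and upper case as the same by normalising to upper case.
        yield line.upper()


# Tiny made-up FASTA record, for illustration only.
example = [
    '>chrI example header',
    'acgtACGT',
    'ttgaCCan',
]
cleaned = list(clean_fasta_lines(example))
print(cleaned)  # ['ACGTACGT', 'TTGACCAN']
```

The same two steps map naturally onto a `filter` followed by a `map` on a Spark RDD of lines.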
In [1]:
%%spark
Starting Spark application
SparkSession available as 'spark'.
In [3]:
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
path = hadoop.fs.Path('/data/c_elegans')
for f in fs.get(conf).listStatus(path):
    print(f.getPath())
hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.I.fa
hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.II.fa
hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.III.fa
hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.IV.fa
hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.V.fa
hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.X.fa
Exercise 2 (50 points)
Write a program using Spark to find the 5 most common k-mers (sliding windows of length k) in the human genome. Ignore case when processing k-mers. You can work one line at a time; we will ignore k-mers that wrap around lines. You should write a function that takes a path to FASTA files and a value for k, and returns a key-value RDD of k-mer counts. Remember to strip comment lines that begin with ‘>’ from the analysis.
Use k=20
Note: The textFile method takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Please set this parameter to 60; it will speed up processing.
Check: Use the C. elegans genome at /data/c_elegans. You should get
[
(u'ATATATATATATATATATAT', 2217),
(u'TATATATATATATATATATA', 2184),
(u'CTCTCTCTCTCTCTCTCTCT', 1373),
(u'TCTCTCTCTCTCTCTCTCTC', 1361),
(u'AGAGAGAGAGAGAGAGAGAG', 1033)
]
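One possible shape for the function described in this exercise is sketched below; it is an illustration under stated assumptions, not the reference solution. The names `kmers`, `kmer_counts` and `top_kmers` are hypothetical, and `sc` is assumed to be the notebook's SparkContext. Only the window helper is exercised here, since it needs no cluster.

```python
from operator import add


def kmers(line, k):
    """Return all length-k windows of one sequence line (no wrap-around)."""
    # A line shorter than k yields an empty list.
    return [line[i:i + k] for i in range(len(line) - k + 1)]


def kmer_counts(sc, path, k, partitions=60):
    """Return an RDD of (k-mer, count) pairs for the FASTA files under path.

    `sc` is assumed to be the notebook's SparkContext; `partitions` is
    passed as the optional second argument of textFile.
    """
    return (sc.textFile(path, partitions)
              .filter(lambda line: not line.startswith('>'))  # drop headers
              .flatMap(lambda line: kmers(line.strip().upper(), k))
              .map(lambda kmer: (kmer, 1))
              .reduceByKey(add))


def top_kmers(counts, n=5):
    """Return the n most frequent (k-mer, count) pairs."""
    return counts.takeOrdered(n, key=lambda kv: -kv[1])


# The window helper can be checked locally without a cluster.
windows = kmers('ACGTAC', 4)
print(windows)  # ['ACGT', 'CGTA', 'GTAC']
```

On the cluster, a call such as `top_kmers(kmer_counts(sc, '/data/c_elegans', 20))` could then be compared against the expected list above.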
In [ ]:
Exercise 3 (10 points)
As a simple QC measure, we can assume that the k-mers that have a count of only 1 are due to sequencing errors. Put all the k-mers with a count of 2 or more in a Spark DataFrame with two columns (sequence, count).
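A hedged sketch of this step: filter the (k-mer, count) pairs, then hand them to `createDataFrame` with the two column names. The names `seen_more_than_once` and `kmers_to_dataframe` are hypothetical, `spark` is assumed to be the notebook's SparkSession, and `counts` is assumed to be the key-value RDD from Exercise 2. Only the predicate is run here.

```python
def seen_more_than_once(pair):
    """True for a (k-mer, count) pair whose count is 2 or more."""
    return pair[1] >= 2


def kmers_to_dataframe(spark, counts):
    """Build a (sequence, count) DataFrame from an RDD of (k-mer, count) pairs.

    `spark` is assumed to be the SparkSession and `counts` an RDD of
    (k-mer, count) tuples; both are placeholders for the notebook's objects.
    """
    return spark.createDataFrame(counts.filter(seen_more_than_once),
                                 ['sequence', 'count'])


# The predicate can be checked on plain tuples (made-up data).
sample = [('ACGT', 1), ('TTTT', 2), ('GGGG', 7)]
kept = [p for p in sample if seen_more_than_once(p)]
print(kept)  # [('TTTT', 2), ('GGGG', 7)]
```

Filtering on the RDD before building the DataFrame keeps the singleton k-mers from ever being converted.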
In [ ]:
Exercise 4 (10 points)
Find all k-mers with count greater than 1 that are palindromes.
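The exercise does not say which sense of "palindrome" is meant, so the sketch below shows both common readings as an assumption: a plain text palindrome (the string equals its reverse) and a biological palindrome (the string equals its reverse complement). The function names are hypothetical.

```python
COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}


def is_text_palindrome(seq):
    """True if seq reads the same forwards and backwards."""
    return seq == seq[::-1]


def is_revcomp_palindrome(seq):
    """True if seq equals its own reverse complement."""
    # Any base outside ACGT (e.g. N) is treated as a non-palindrome here.
    if any(base not in COMPLEMENT for base in seq):
        return False
    revcomp = ''.join(COMPLEMENT[base] for base in reversed(seq))
    return seq == revcomp


print(is_text_palindrome('ACGGCA'))     # True
print(is_revcomp_palindrome('ACGT'))    # True
print(is_revcomp_palindrome('ACGGCA'))  # False
```

Either predicate could be applied to an RDD of (k-mer, count) pairs with something like `rdd.filter(lambda kv: kv[1] > 1 and is_text_palindrome(kv[0]))`.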
In [ ]: