Lab11B: Spark

Background

You will use Spark and Python to process genomic data: about 3 billion nucleotides in the human genome and a much smaller number for the roundworm (nematode) C. elegans. The genome sequences are stored as FASTA files. For the purposes of this exercise, treat lower and upper case as the same. Recall that FASTA files have comment lines starting with ‘>’ that must be excluded from the analysis. For the exercises below, assume that k=20 for the k-mers.
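The two preprocessing rules above (skip comment lines, ignore case) can be tried locally before involving Spark. The following is a minimal sketch in plain Python; the function name `sequence_lines` and the sample header text are hypothetical and only serve as illustration.

```python
def sequence_lines(lines):
    """Yield upper-cased sequence lines, skipping FASTA comment lines."""
    for line in lines:
        line = line.strip()
        # Drop blank lines and comment (header) lines that start with '>'
        if not line or line.startswith('>'):
            continue
        # Upper-case so that 'acgt' and 'ACGT' are treated as the same
        yield line.upper()


# Hypothetical miniature FASTA content, not taken from the real files
example = ['>made-up header line', 'gcctaagcct', 'AAGCCTaagc', '']
cleaned = list(sequence_lines(example))
print(cleaned)
```

The same two checks can later be expressed as a `filter` and a `map` on an RDD of lines.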

In [1]:

%%spark

Starting Spark application

ID	YARN Application ID	Kind	State	Spark UI	Driver log	Current session?
135	application_1522938745830_0060	pyspark	idle	Link	Link	✔

SparkSession available as 'spark'.

In [3]:

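# Reach the Hadoop FileSystem API through the JVM gateway of the SparkContext
hadoop = sc._jvm.org.apache.hadoop

fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
path = hadoop.fs.Path('/data/c_elegans')

# List the FASTA files stored under the C. elegans directory
for f in fs.get(conf).listStatus(path):
    print(f.getPath())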

hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.I.fa
hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.II.fa
hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.III.fa
hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.IV.fa
hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.V.fa
hdfs://vcm-2167.oit.duke.edu:8020/data/c_elegans/Caenorhabditis_elegans.WBcel235.dna.chromosome.X.fa

Exercise 2 (50 points)

Write a program using Spark to find the 5 most common k-mers (shifting windows of length k) in the human genome. Ignore case when processing k-mers. You can work one line at a time: we will ignore k-mers that wrap around lines. You should write a function that takes a path to FASTA files and a value for k, and returns a key-value RDD of k-mer counts. Remember to strip comment lines that begin with ‘>’ from the analysis.

Use k=20
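To make "shifting windows of length k" concrete, here is a small local sketch in plain Python that counts k-mers line by line, so windows never span two lines. It is meant only as a reference for tiny inputs, not as the Spark solution; the names `kmers` and `top_kmers` and the toy data are assumptions made for illustration.

```python
from collections import Counter


def kmers(line, k):
    """Return every window of length k found in one sequence line."""
    # A line shorter than k gives an empty range, hence no k-mers
    return [line[i:i + k] for i in range(len(line) - k + 1)]


def top_kmers(lines, k, n=5):
    """Count k-mers per line (no wrap-around) and return the n most common."""
    counts = Counter()
    for line in lines:
        if line.startswith('>'):
            continue  # skip FASTA comment lines
        counts.update(kmers(line.strip().upper(), k))
    return counts.most_common(n)


# Toy input with k=4 so the windows are easy to read by eye
toy = ['>toy sequence', 'ATATAT', 'atatcg']
print(kmers('ATATAT', 4))      # ['ATAT', 'TATA', 'ATAT']
print(top_kmers(toy, 4, n=1))  # [('ATAT', 3)]
```

A line of length n contributes n - k + 1 windows, which is why each 6-letter toy line yields three 4-mers.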

Note: The textFile method takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Please set this parameter to 60; it will speed up processing.
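One possible shape for the requested function is sketched below. It is an illustration under stated assumptions rather than the reference solution: `sc` is assumed to be the SparkContext already available in the notebook session and is passed in explicitly here, and the names `line_to_pairs`, `count_kmers` and `top_counts` are hypothetical. Only standard RDD methods are used (`textFile`, `filter`, `flatMap`, `reduceByKey`, `takeOrdered`).

```python
from operator import add


def line_to_pairs(line, k):
    """Turn one sequence line into (k-mer, 1) pairs, ignoring case."""
    line = line.strip().upper()
    return [(line[i:i + k], 1) for i in range(len(line) - k + 1)]


def count_kmers(sc, path, k, num_partitions=60):
    """Return an RDD of (k-mer, count) pairs for the FASTA files under path.

    Assumes sc is an existing SparkContext, as in this notebook session.
    """
    # Second argument asks Spark for at least this many partitions
    lines = sc.textFile(path, num_partitions)
    # Drop FASTA comment lines before extracting k-mers
    sequences = lines.filter(lambda line: not line.startswith('>'))
    pairs = sequences.flatMap(lambda line: line_to_pairs(line, k))
    return pairs.reduceByKey(add)


def top_counts(counts, n=5):
    """Return the n pairs with the largest counts from a (k-mer, count) RDD."""
    return counts.takeOrdered(n, key=lambda pair: -pair[1])
```

With these definitions, `top_counts(count_kmers(sc, '/data/c_elegans', 20))` would be the call to compare against the check values. Because `sc` is a global in the notebook, the first parameter could be dropped to match the two-argument signature the exercise asks for.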

Check: Use the C. elegans genome at /data/c_elegans. You should get

[
(u'ATATATATATATATATATAT', 2217),
(u'TATATATATATATATATATA', 2184),
(u'CTCTCTCTCTCTCTCTCTCT', 1373),
(u'TCTCTCTCTCTCTCTCTCTC', 1361),
(u'AGAGAGAGAGAGAGAGAGAG', 1033)
]

In [ ]:

Exercise 3 (10 points)

As a simple QC measure, we can assume that the k-mers that have a count of only 1 are due to sequencing errors. Put all the k-mers with a count of 2 or more in a Spark DataFrame with two columns (sequence, count).
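The filtering step can be sketched as follows, assuming `counts` is a key-value RDD of (k-mer, count) pairs as produced in Exercise 2 and `spark` is the SparkSession announced at the top of the notebook. The helper names `is_repeated` and `repeated_kmers_df` are hypothetical.

```python
def is_repeated(pair, min_count=2):
    """True if a (k-mer, count) pair was seen at least min_count times."""
    _, count = pair
    return count >= min_count


def repeated_kmers_df(spark, counts, min_count=2):
    """Build a DataFrame with columns (sequence, count) from a counts RDD.

    Assumes spark is an existing SparkSession and counts is an RDD of
    (k-mer, count) tuples.
    """
    kept = counts.filter(lambda pair: is_repeated(pair, min_count))
    # Column names are given as a list; column types are inferred from the data
    return spark.createDataFrame(kept, ['sequence', 'count'])
```

Filtering on the RDD before building the DataFrame keeps the singleton k-mers, which are likely the large majority, out of the conversion step.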

In [ ]:

Exercise 4 (10 points)

Find all k-mers with count greater than 1 that are palindromes.
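The exercise does not say which sense of "palindrome" is meant, so the sketch below covers two readings: the literal one (the string equals its reverse) and, as an alternative assumption, the one common in genomics (the sequence equals its reverse complement). All function names here are hypothetical, and mapping unknown symbols to 'N' is a simplifying choice made for this example.

```python
# Watson-Crick base pairing; 'N' (unknown base) is paired with itself
COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'N': 'N'}


def is_palindrome(seq):
    """Literal reading: the sequence reads the same backwards."""
    return seq == seq[::-1]


def reverse_complement(seq):
    """Complement each base, then reverse; unknown symbols become 'N'."""
    return ''.join(COMPLEMENT.get(base, 'N') for base in reversed(seq))


def is_revcomp_palindrome(seq):
    """Genomic reading: the sequence equals its own reverse complement."""
    return seq == reverse_complement(seq)


most_common = 'AT' * 10  # the top 20-mer from the C. elegans check
print(is_palindrome(most_common))          # False
print(is_revcomp_palindrome(most_common))  # True
```

Either predicate can be applied to the sequence column of the Exercise 3 result, or to the keys of the counts RDD with `filter`. Note that no even-length literal palindrome can be a reverse-complement palindrome only by accident: the two definitions pick out different sets, so the choice of reading changes the answer.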

In [ ]: