Homework 12

Note 1: You should do this in a PySpark notebook, not a Python 3 notebook. Execute the first 3 cells that are provided.

Note 2: In the following exercises, keep the amount of data returned by Spark to the local notebook session to the minimum needed for that exercise. In other words, all the work should be done via distributed computing and not by returning a large collection that is then processed in regular Python.

Note 3: To minimize waiting times, do the exercises using the C. elegans genome. Data for the human genome is at /data/human/*fa, but since it takes a long time, running your code against the human genome is optional.


In [1]:
import os
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'


Starting Spark application
ID    YARN Application ID    Kind       State    Spark UI    Driver log    Current session?
1                            pyspark    idle
SparkContext available as 'sc'.
HiveContext available as 'sqlContext'.

In [2]:
%%spark

In [3]:
# Change path when debugging is complete to work on the human genome

# fasta_path = '/data/human/*fa'
fasta_path = '/data/c_elegans/*fa'

Exercise 1 (50 points)

Write a program using Spark to find the 5 most common k-mers (shifting windows of length k) in the human genome. Ignore case when processing k-mers. You can work one line at a time - we will ignore k-mers that wrap around lines. You should write a function that takes a path to FASTA files and a value for k, and returns a key-value RDD of k-mer counts. Remember to strip comment lines that begin with '>' from the analysis.

Use k = 20.

Note: The textFile method takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Please set this parameter to 60 - it will speed up processing.

Check: Use the C. elegans genome at /data/c_elegans/*fa. You should get

[
(u'ATATATATATATATATATAT', 2168), 
(u'TATATATATATATATATATA', 2142), 
(u'CTCTCTCTCTCTCTCTCTCT', 1337), 
(u'TCTCTCTCTCTCTCTCTCTC', 1327), 
(u'AGAGAGAGAGAGAGAGAGAG', 1007)
]
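
One possible approach (a sketch, not the required solution): strip the '>' comment lines, upper-case each line, emit every length-k window within a line, and reduce by key. The function name kmer_counts and the use of takeOrdered to pull back only the top 5 pairs are illustrative choices.

def kmer_counts(fasta_path, k, min_partitions=60):
    """Return an RDD of (k-mer, count) pairs for the FASTA files at fasta_path."""
    lines = sc.textFile(fasta_path, min_partitions)
    seqs = (lines
            .filter(lambda line: not line.startswith('>'))   # strip FASTA comment lines
            .map(lambda line: line.strip().upper()))          # ignore case
    kmers = seqs.flatMap(lambda seq: [seq[i:i + k] for i in range(len(seq) - k + 1)])
    return kmers.map(lambda kmer: (kmer, 1)).reduceByKey(lambda a, b: a + b)

counts = kmer_counts(fasta_path, 20)
counts.takeOrdered(5, key=lambda kv: -kv[1])   # only 5 (k-mer, count) pairs reach the driver

All counting happens in the cluster; only the final five pairs are returned to the notebook session.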

In [ ]:

Exercise 2 (10 points)

Find all k-mers that are palindromes (i.e. the sequence is the same when read back-to-front). How many are there?
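
A minimal sketch, assuming counts is the (k-mer, count) RDD produced in Exercise 1 with k = 20; only a single integer comes back to the driver.

palindromes = counts.filter(lambda kv: kv[0] == kv[0][::-1])   # sequence equals its reverse
palindromes.count()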


In [ ]:

Exercise 3 (10 points)

As a simple QC measure, we can assume that the k-mers that have a count of only 1 are due to sequencing errors. Put all the k-mers with a count of 2 or more in a Spark DataFrame with two columns (sequence, count). Count how many rows in the DataFrame have counts between 5 and 10 (inclusive of both 5 and 10).
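
One possible sketch, again assuming counts is the (k-mer, count) RDD from Exercise 1; the DataFrame is built with the provided sqlContext, and only the final row count is returned to the driver.

# Keep only k-mers seen at least twice, as columns (sequence, count).
df = sqlContext.createDataFrame(counts.filter(lambda kv: kv[1] >= 2),
                                ['sequence', 'count'])
df.filter((df['count'] >= 5) & (df['count'] <= 10)).count()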


In [ ]:

Exercise 4 (30 points)

Make a Markov transition matrix for any nucleotide ('A', 'C', 'T', 'G') to any other nucleotide. The (i, j) entry should indicate the probability of finding the jth nucleotide immediately after the ith nucleotide in the genome. For example, the entry (0, 2) shows the probability of finding a T immediately following an A. The matrix should have shape (4, 4).
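
A possible sketch: count adjacent-base transitions within each line (transitions that wrap around line boundaries are ignored and ambiguous bases such as N are skipped - both are assumptions here), collect the at most 16 pair counts, and normalize each row. The names bases, idx, and transition are illustrative.

import numpy as np

bases = ['A', 'C', 'T', 'G']                    # row/column order used in the exercise
idx = {b: i for i, b in enumerate(bases)}

lines = (sc.textFile(fasta_path, 60)
           .filter(lambda line: not line.startswith('>'))
           .map(lambda line: line.strip().upper()))

# Emit ((previous base, next base), 1) for every adjacent pair within a line,
# skipping ambiguous bases such as N.
pairs = lines.flatMap(lambda seq: [((seq[i], seq[i + 1]), 1)
                                   for i in range(len(seq) - 1)
                                   if seq[i] in idx and seq[i + 1] in idx])
pair_counts = pairs.reduceByKey(lambda a, b: a + b).collect()   # at most 16 pairs come back

transition = np.zeros((4, 4))
for (prev, nxt), n in pair_counts:
    transition[idx[prev], idx[nxt]] = n
transition /= transition.sum(axis=1, keepdims=True)             # each row sums to 1
transition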


In [ ]: