In [1]:

    
%matplotlib inline
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np

Q1 (10 points)

The heart dataframe contains the survival time after receiving a heart transplant, the age of the patient, and whether or not the survival time was censored.

Number of Observations - 69
Number of Variables - 3

Variable name definitions:

survival - Days after surgery until death
censors - indicates whether an observation is censored (1 means uncensored)
age - age at the time of surgery

Answer the following questions with respect to the heart data set:

Sort the data frame by age in descending order (oldest at top) without making a copy.
How many patients were censored?
What is the average age for uncensored patients under the age of 45?
Find the mean and standard deviation of age and survival time for each value of the censoring variable.
Plot the linear regression of survival (y-axis) against age (x-axis) conditioned on censoring (i.e. either have two separate plots or a single plot using color to distinguish censored and uncensored patients).
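
The first four tasks can be sketched in pandas as follows. This is only an illustration under stated assumptions: the frame `sample` is a hypothetical stand-in built from the six rows printed by `heart.head(n=6)` in this notebook, not the full 69-row data set, and the variable names are made up for the example. The plotting task is not covered.

```python
import pandas as pd

# Hypothetical toy frame: the six rows printed by heart.head(n=6).
# The real heart data set has 69 rows, so these numbers are not the answers.
sample = pd.DataFrame({
    "survival": [15.0, 3.0, 624.0, 46.0, 127.0, 64.0],
    "censors": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "age": [54.3, 40.4, 51.0, 42.5, 48.0, 54.6],
})

# Sort in place by age, oldest first; inplace=True avoids returning a new frame.
sample.sort_values("age", ascending=False, inplace=True)

# censors == 1 marks an uncensored observation, so censored rows have censors == 0.
n_censored = int((sample["censors"] == 0).sum())

# Mean age of uncensored patients younger than 45.
young_uncensored = sample[(sample["censors"] == 1) & (sample["age"] < 45)]
mean_age = young_uncensored["age"].mean()

# Mean and standard deviation of age and survival for each censoring value.
summary = sample.groupby("censors")[["age", "survival"]].agg(["mean", "std"])
print(summary)
```

The same calls should apply unchanged to the full `heart` frame, assuming its column names match the variable definitions listed above.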



In [2]:

    
heart = sm.datasets.heart.load_pandas().data
heart.head(n=6)



In [ ]:

Q2 (10 points)

Write a flatmap function that works like map, except that the function given returns a list for each element, and the resulting list of lists is then flattened (4 points).

In other words, flatmap takes two arguments, a function and a list (or other iterable), just like map. However, the function given as the first argument takes a single argument and returns a list (or other iterable). In order to get a simple list back, we need to unravel the resulting list of lists, hence the flatten part.

For example,

flatmap(lambda x: x.split(), ["hello world", "the quick dog"])

should return

["hello", "world", "the", "quick", "dog"]
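
One possible sketch of such a function is shown here. It assumes that flattening by exactly one level is what is wanted and uses `itertools.chain.from_iterable` for the flatten step; other approaches (a nested list comprehension, for instance) would work equally well.

```python
from itertools import chain


def flatmap(func, iterable):
    """Apply func to each item, then flatten the resulting iterables by one level."""
    return list(chain.from_iterable(map(func, iterable)))


words = flatmap(lambda x: x.split(), ["hello world", "the quick dog"])
print(words)
```

Because `map` is lazy and `chain.from_iterable` consumes it one item at a time, no intermediate list of lists is built before flattening.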



In [ ]:

Q3 (10 points)

An affine transformation of a vector $x$ is the operation $Ax + b$, where $A$ is a matrix and $b$ is a vector.

Write a function to perform an affine transformation.
Write a function to reverse the affine transformation.
Generate a random 3 by 3 matrix $A$ and random 3-vectors $x$ and $b$, all drawn from the standard uniform distribution with random seed = 1234. Perform the affine transformation and save the result as $y$. Then perform the reverse affine transform on $y$ to recover the original vector $x$.
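
A minimal sketch of the two functions follows. The function names and the order in which $A$, $x$ and $b$ are drawn after seeding are assumptions made for illustration, and the reverse step assumes $A$ is invertible, which holds with probability 1 for a matrix with independent uniform entries. Since $y = Ax + b$, the reverse transform is $x = A^{-1}(y - b)$, computed here by solving a linear system rather than forming the inverse explicitly.

```python
import numpy as np


def affine(A, x, b):
    """Return the affine transformation A x + b."""
    return A @ x + b


def affine_inverse(A, y, b):
    """Recover x from y = A x + b by solving A x = y - b (A must be invertible)."""
    return np.linalg.solve(A, y - b)


np.random.seed(1234)
A = np.random.random((3, 3))  # assumed draw order: A first, then x, then b
x = np.random.random(3)
b = np.random.random(3)

y = affine(A, x, b)
x_recovered = affine_inverse(A, y, b)
print(np.allclose(x, x_recovered))
```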

Q4 (10 points)

You are given the following DNA sequence in FASTA format.

dna = '''> A simulated DNA sequence.
TTAGGCAGTAACCCCGCGATAGGTAGAGCACGCAATCGTCAAGGCGTGCGGTAGGGCTTCCGTGTCTTACCCAAAGAAAC
GACGTAACGTTCCCCGGGCGGTTAAACCAAATCCACTTCACCAACGGCATAACGCGAAGCCCAAACTAAATCGCGCTCGA
GCGGACGCACATTCGCTAGGCTGTGTAGGGGCAGTCTCCGTTAAGGACGATTACCACGTGATGGTAGTTCGCAACATTGG
ACTGTCGGGAATTCCCGAAGGCACTTAAGCGGAGTCTTAGCGTACAGTAACGCAGTCCCGCGTGAACGACTGACAGATGA
'''

Remove the comment line and combine the 4 lines of nucleotide symbols into a single string.
Count the frequency of all 16 two-letter combinations in the string.
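
The two steps can be sketched on a short made-up sequence. The string `toy` is hypothetical and stands in for `dna`, and the sketch assumes that overlapping pairs are counted (positions 0-1, 1-2, 2-3, and so on); counting non-overlapping pairs would need a step of 2 instead.

```python
from collections import Counter
from itertools import product

# Hypothetical short sequence in the same FASTA layout as dna.
toy = '''> A short made-up sequence.
ACGT
ACGA
'''

# Drop the comment line (it starts with '>') and join the remaining lines.
lines = toy.strip().splitlines()
seq = "".join(line.strip() for line in lines if not line.startswith(">"))

# Count every overlapping two-letter window.
pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))

# Report all 16 combinations, including those that never occur (count 0).
counts = {"".join(p): pairs["".join(p)] for p in product("ACGT", repeat=2)}
print(counts)
```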



In [ ]:

Q5 (10 points)

The code given below performs a stochastic gradient descent to fit a quadratic polynomial to $n$ data points. Make the code run faster by:

Using numba JIT
Using Cython

Some test code is provided. Please run this for your optimized versions to confirm that they give the same results as the original code.



In [ ]:

    
def sgd(b, x, y, max_iter, alpha):
    n = x.shape[0]
    for i in range(max_iter):
        for j in range(n):
            b[0] -= alpha * (2*(b[0] + b[1]*x[j] + b[2]*x[j]**2 - y[j]))
            b[1] -= alpha * (2*x[j] * (b[0] + b[1]*x[j] + b[2]*x[j]**2 - y[j]))
            b[2] -= alpha * (2*x[j]**2 * (b[0] + b[1]*x[j] + b[2]*x[j]**2 - y[j]))
    return b



In [ ]:

    
np.random.seed(12345)
n = 10000
x = np.linspace(0, 10, n)
y = 2*x**2 + 6*x + 3 + np.random.normal(0, 5, n)
k = 100
alpha = 0.00001

b0 = np.random.random(3) 
b = sgd(b0, x, y, k, alpha)

yhat = b[0] + b[1]*x + b[2]*x**2
idx = sorted(np.random.choice(n, 100))
plt.scatter(x[idx], y[idx])
plt.plot(x[idx], yhat[idx], c='red')
pass
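
For the numba part, one possible sketch is given here. It assumes that every coefficient is meant to be updated with `-=`, and the name `sgd_numba` is made up for the example. The `try`/`except` fallback is only a convenience so the cell still runs (without any speed-up) when numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # numba is not available: fall back to a no-op decorator (no speed-up).
    def njit(func):
        return func


@njit
def sgd_numba(b, x, y, max_iter, alpha):
    """Same loop as sgd; b is updated in place and returned."""
    n = x.shape[0]
    for i in range(max_iter):
        for j in range(n):
            b[0] -= alpha * (2*(b[0] + b[1]*x[j] + b[2]*x[j]**2 - y[j]))
            b[1] -= alpha * (2*x[j] * (b[0] + b[1]*x[j] + b[2]*x[j]**2 - y[j]))
            b[2] -= alpha * (2*x[j]**2 * (b[0] + b[1]*x[j] + b[2]*x[j]**2 - y[j]))
    return b


# Small check on hypothetical data; pass a copy because b is modified in place.
x_small = np.linspace(0, 1, 20)
y_small = 2*x_small**2 + 6*x_small + 3
b_start = np.array([0.1, 0.2, 0.3])
b_fit = sgd_numba(b_start.copy(), x_small, y_small, 5, 0.01)
print(b_fit)
```

When timing the compiled version, note that the first call includes compilation, so it is fairer to time a second call.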



Out[2]:

	survival	censors	age
0	15.0	1.0	54.3
1	3.0	1.0	40.4
2	624.0	1.0	51.0
3	46.0	1.0	42.5
4	127.0	1.0	48.0
5	64.0	1.0	54.6