In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np

Q1 (10 points)

The heart dataframe contains the survival time after receiving a heart transplant, the age of the patient, and whether or not the survival time was censored.

  • Number of Observations - 69
  • Number of Variables - 3

Variable name definitions:

  • survival - Days after surgery until death
  • censors - indicates whether an observation is censored; 1 means uncensored
  • age - age at the time of surgery

Answer the following questions with respect to the heart data set:

  • Sort the data frame by age in descending order (oldest at top) without making a copy
  • How many patients were censored?
  • What is the average age for uncensored patients under the age of 45?
  • Find the mean and standard deviation of age and survival time for each value of the censoring variable.
  • Plot the linear regression of survival (y-axis) against age (x-axis) conditioned on censoring (i.e. either have two separate plots or a single plot using color to distinguish censored and uncensored patients).
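The first four tasks above map onto a handful of pandas idioms. The sketch below illustrates them on a small made-up frame, not on the real heart data: the name `toy` and all of its values are hypothetical. It also assumes that `censors == 0` marks a censored observation, which is inferred from the note that 1 is uncensored. The regression plot is not covered here.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the heart data.
# These values are made up for illustration only.
toy = pd.DataFrame({
    "survival": [10.0, 200.0, 35.0, 400.0, 90.0, 60.0],
    "censors": [1.0, 0.0, 1.0, 0.0, 1.0, 1.0],
    "age": [50.0, 38.0, 44.0, 61.0, 41.0, 47.0],
})

# Sort by age, oldest first, modifying the frame in place (no copy returned).
toy.sort_values("age", ascending=False, inplace=True)

# Count censored rows (assumption: 0 means censored).
n_censored = int((toy["censors"] == 0).sum())

# Mean age of uncensored patients younger than 45.
young_uncensored = (toy["censors"] == 1) & (toy["age"] < 45)
mean_age = toy.loc[young_uncensored, "age"].mean()

# Mean and standard deviation of age and survival for each censoring value.
summary = toy.groupby("censors")[["age", "survival"]].agg(["mean", "std"])

print(n_censored, mean_age)
print(summary)
```

The same calls should carry over to `heart` by replacing `toy` with the loaded data frame.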

In [2]:
heart = sm.datasets.heart.load_pandas().data
heart.head(n=6)


Out[2]:
   survival  censors   age
0      15.0      1.0  54.3
1       3.0      1.0  40.4
2     624.0      1.0  51.0
3      46.0      1.0  42.5
4     127.0      1.0  48.0
5      64.0      1.0  54.6

In [ ]:

Q2 (10 points)

Write a flatmap function that works like map, except that the function given returns a list for each item, and the resulting list of lists is then flattened (4 points).

In other words, flatmap takes two arguments, a function and a list (or other iterable), just like map. However, the function given as the first argument takes a single argument and returns a list (or other iterable). In order to get a simple list back, we need to unravel the resulting list of lists, hence the flatten part.

For example,

flatmap(lambda x: x.split(), ["hello world", "the quick dog"])

should return

["hello", "world", "the", "quick", "dog"]

In [ ]:

Q3 (10 points)

An affine transformation of a vector $x$ is the operation $Ax + b$, where $A$ is a matrix and $b$ is a vector.

  • Write a function to perform an affine transformation.
  • Write a function to reverse the affine transformation.
  • Perform an affine transformation using a random 3 by 3 matrix $A$ and random 3-vectors $x$ and $b$, all drawn from the standard uniform distribution with random seed = 1234, and save the result as $y$. Perform the reverse affine transform on $y$ to recover the original vector $x$.
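Reversing $y = Ax + b$ amounts to solving $Ax = y - b$ for $x$, which requires $A$ to be invertible. The sketch below is one way to set this up with NumPy. The function names `affine` and `affine_inverse` are made up for this example, and the order in which $A$, $x$ and $b$ are drawn after seeding is an assumption, since the question does not specify it.

```python
import numpy as np


def affine(A, x, b):
    """Return the affine transformation A x + b."""
    return A @ x + b


def affine_inverse(A, y, b):
    """Recover x from y = A x + b by solving A x = y - b.

    Uses a linear solve instead of forming the inverse of A explicitly.
    """
    return np.linalg.solve(A, y - b)


np.random.seed(1234)
# Assumed draw order: A first, then x, then b.
A = np.random.random((3, 3))
x = np.random.random(3)
b = np.random.random(3)

y = affine(A, x, b)
x_recovered = affine_inverse(A, y, b)
print(np.allclose(x, x_recovered))
```

`np.linalg.solve` raises `LinAlgError` if $A$ is singular, which is the case in which the transformation cannot be reversed.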

Q4 (10 points)

You are given the following DNA sequence in FASTA format.

dna = '''> A simulated DNA sequence.
TTAGGCAGTAACCCCGCGATAGGTAGAGCACGCAATCGTCAAGGCGTGCGGTAGGGCTTCCGTGTCTTACCCAAAGAAAC
GACGTAACGTTCCCCGGGCGGTTAAACCAAATCCACTTCACCAACGGCATAACGCGAAGCCCAAACTAAATCGCGCTCGA
GCGGACGCACATTCGCTAGGCTGTGTAGGGGCAGTCTCCGTTAAGGACGATTACCACGTGATGGTAGTTCGCAACATTGG
ACTGTCGGGAATTCCCGAAGGCACTTAAGCGGAGTCTTAGCGTACAGTAACGCAGTCCCGCGTGAACGACTGACAGATGA
'''
  • Remove the comment line and combine the 4 lines of nucleotide symbols into a single string.
  • Count the frequency of all 16 two-letter combinations in the string.
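A minimal sketch of both steps follows. It assumes that the two-letter combinations are counted over overlapping windows (positions 0-1, 1-2, 2-3, and so on); if non-overlapping pairs are intended, the loop step would need to change. The names `seq` and `counts` are chosen for this example.

```python
from itertools import product

dna = '''> A simulated DNA sequence.
TTAGGCAGTAACCCCGCGATAGGTAGAGCACGCAATCGTCAAGGCGTGCGGTAGGGCTTCCGTGTCTTACCCAAAGAAAC
GACGTAACGTTCCCCGGGCGGTTAAACCAAATCCACTTCACCAACGGCATAACGCGAAGCCCAAACTAAATCGCGCTCGA
GCGGACGCACATTCGCTAGGCTGTGTAGGGGCAGTCTCCGTTAAGGACGATTACCACGTGATGGTAGTTCGCAACATTGG
ACTGTCGGGAATTCCCGAAGGCACTTAAGCGGAGTCTTAGCGTACAGTAACGCAGTCCCGCGTGAACGACTGACAGATGA
'''

# Drop the comment line (it starts with '>') and join the remaining lines.
seq = ''.join(
    line.strip()
    for line in dna.splitlines()
    if line.strip() and not line.startswith('>')
)

# Start every one of the 16 combinations at zero so that none is missing
# from the result even if it never occurs.
counts = {''.join(pair): 0 for pair in product('ACGT', repeat=2)}

# Slide a window of width 2 along the sequence (overlapping pairs).
for i in range(len(seq) - 1):
    counts[seq[i:i + 2]] += 1

for pair, count in sorted(counts.items()):
    print(pair, count)
```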

In [ ]:

Q5 (10 points)

The code given below performs a stochastic gradient descent to fit a quadratic polynomial to $n$ data points. Make the code run faster by:

  • Using numba JIT
  • Using Cython

Some test code is provided. Please run this for your optimized versions to confirm that they give the same results as the original code.


In [ ]:
def sgd(b, x, y, max_iter, alpha):
    n = x.shape[0]
    for i in range(max_iter):
        for j in range(n):
            b[0] -= alpha * (2*(b[0] + b[1]*x[j] + b[2]*x[j]**2 - y[j]))
            b[1] -= alpha * (2*x[j] * (b[0] + b[1]*x[j] + b[2]*x[j]**2 - y[j]))
            b[2] -= alpha * (2*x[j]**2 * (b[0] + b[1]*x[j] + b[2]*x[j]**2 - y[j]))
    return b
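As a rough sketch of the numba route, the same loop can be passed to `numba.njit` unchanged, since it only uses scalar arithmetic on NumPy arrays. The names `sgd_py` and `sgd_fast` are made up for this example, and the code assumes that all three coefficient updates are meant to use `-=`. If numba is not installed, the fallback below simply runs the plain Python function, so no speedup should be expected in that case.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without numba (no speedup).
    def njit(func):
        return func


def sgd_py(b, x, y, max_iter, alpha):
    """Plain Python SGD for y ~ b[0] + b[1]*x + b[2]*x**2 (updates b in place)."""
    n = x.shape[0]
    for i in range(max_iter):
        for j in range(n):
            # The residual is recomputed after each coefficient update,
            # which keeps the same update order as the original loop.
            b[0] -= alpha * (2*(b[0] + b[1]*x[j] + b[2]*x[j]**2 - y[j]))
            b[1] -= alpha * (2*x[j] * (b[0] + b[1]*x[j] + b[2]*x[j]**2 - y[j]))
            b[2] -= alpha * (2*x[j]**2 * (b[0] + b[1]*x[j] + b[2]*x[j]**2 - y[j]))
    return b


# Compile the same function body with numba's nopython mode.
sgd_fast = njit(sgd_py)

# Small check that both versions agree. Each call gets its own copy of the
# starting coefficients because the function modifies b in place.
rng = np.random.RandomState(0)
x_small = np.linspace(0, 10, 50)
y_small = 2*x_small**2 + 6*x_small + 3 + rng.normal(0, 5, 50)
b_start = rng.random_sample(3)

b_ref = sgd_py(b_start.copy(), x_small, y_small, 5, 0.00001)
b_jit = sgd_fast(b_start.copy(), x_small, y_small, 5, 0.00001)
print(np.allclose(b_ref, b_jit))
```

The first call to a jitted function includes compilation time, so any timing comparison should use a second call.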

In [ ]:
np.random.seed(12345)
n = 10000
x = np.linspace(0, 10, n)
y = 2*x**2 + 6*x + 3 + np.random.normal(0, 5, n)
k = 100
alpha = 0.00001

b0 = np.random.random(3) 
b = sgd(b0, x, y, k, alpha)

yhat = b[0] + b[1]*x + b[2]*x**2
idx = sorted(np.random.choice(n, 100))
plt.scatter(x[idx], y[idx])
plt.plot(x[idx], yhat[idx], c='red')
pass

In [ ]: