Machine Learning


In [ ]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np

from sklearn import datasets
from sklearn.decomposition import PCA

Protein structures

  • Write a function that loads the x, y, and z coordinates of all CA atoms from a PDB file.
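
One possible starting point is sketched here. It assumes the file follows the standard fixed-width PDB layout, where the atom name sits in columns 13-16 and the x, y, and z coordinates sit in columns 31-38, 39-46, and 47-54 of each ATOM record. The function name load_ca_coordinates is made up for this sketch. It does not handle multiple models or alternate locations, so a file containing those would yield duplicate CA entries.

```python
import numpy as np


def load_ca_coordinates(pdb_file):
    """Return an (N, 3) array of x, y, z coordinates for the CA atoms in a PDB file."""
    coords = []
    with open(pdb_file) as handle:
        for line in handle:
            # ATOM records hold the polymer atoms; the atom name occupies
            # columns 13-16 (0-based slice 12:16) of the fixed-width format.
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                x = float(line[30:38])  # columns 31-38
                y = float(line[38:46])  # columns 39-46
                z = float(line[46:54])  # columns 47-54
                coords.append((x, y, z))
    # reshape keeps the (N, 3) shape even when no CA atoms are found
    return np.array(coords, dtype=float).reshape(-1, 3)
```

Calling load_ca_coordinates("homolog-1.pdb") would then return one row per CA atom, with columns x, y, z.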

In [ ]:

  • Load in the pdb files homolog-1.pdb and homolog-2.pdb into separate numpy arrays.

In [ ]:

  • Plot x vs. y for the two proteins on the same graph.
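
A minimal plotting sketch, assuming each protein is already held in an (N, 3) numpy array with columns x, y, z. The helper name plot_xy and the default labels are illustrative only.

```python
import numpy as np
from matplotlib import pyplot as plt


def plot_xy(coords_1, coords_2, labels=("homolog-1", "homolog-2")):
    """Plot x vs. y for two coordinate arrays on a single set of axes."""
    fig, ax = plt.subplots()
    for coords, label in zip((coords_1, coords_2), labels):
        # Column 0 is x and column 1 is y; connecting the points traces the chain.
        ax.plot(coords[:, 0], coords[:, 1], "o-", markersize=3, label=label)
    ax.set_xlabel("x (Å)")
    ax.set_ylabel("y (Å)")
    ax.set_aspect("equal")  # same scale on both axes so shapes are not distorted
    ax.legend()
    return ax
```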

In [ ]:

  • Perform a principal component analysis using sklearn.decomposition.PCA on each set of coordinates separately, then transform each set onto its own PCA axes.
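
The step above could be sketched as follows, assuming each structure is an (N, 3) coordinate array. The helper name to_pca_axes is hypothetical. A separate PCA object is fitted for every structure, so each one ends up centered on its own mean and expressed along its own principal axes.

```python
import numpy as np
from sklearn.decomposition import PCA


def to_pca_axes(coords):
    """Fit a 3-component PCA to one structure and return its coordinates on the PCA axes."""
    pca = PCA(n_components=3)
    # fit_transform centers the data and projects it onto the principal axes,
    # ordered from largest to smallest variance.
    return pca.fit_transform(coords)
```

The sign of each principal axis is arbitrary, so two transformed structures may come out mirrored along one or more axes.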

In [ ]:

  • Plot the transformed coordinates on top of one another.

In [ ]:

  • Can you explain the result?

In [ ]:

Worm Population

You are studying a mixed population of C. elegans worms and would like to figure out how many classes of worms are present. You measure 10 different features (things like worm length, fecundity, etc.) for 50,000 individuals. The dataset is in pca_dataset.csv, with the features in columns across the top (numbered 0 to 9) and the individuals in rows.

  • Use PCA to decide how many worm classes you can discriminate. NOTE: Make sure you exclude the worm number (leftmost column) from the analysis.
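
A rough sketch of one way to set this up. It assumes pca_dataset.csv is comma-delimited with a single header row and the worm number in the first column; if the file is laid out differently the loading step will need adjusting. The function names are made up for illustration. Features measured on very different scales are often standardized before PCA; that choice is left to the reader here.

```python
import numpy as np
from sklearn.decomposition import PCA


def load_worm_features(csv_file):
    """Load the feature table and drop the worm-number column."""
    # Assumption: one header row, comma-delimited, worm number in column 0.
    data = np.loadtxt(csv_file, delimiter=",", skiprows=1)
    return data[:, 1:]


def project_onto_pcs(features, n_components=2):
    """Return the PCA scores and the fraction of variance each component explains."""
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(features)
    return scores, pca.explained_variance_ratio_
```

A scatter plot of scores[:, 0] against scores[:, 1] is then a natural place to look for distinct clusters of worms.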

In [ ]:

  • How many principal components do you have to look at to capture 90% of the variation in the worm features?
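
One way to answer this is to fit a PCA with all components and accumulate explained_variance_ratio_ until it reaches the target. The sketch below assumes the features are in an (n_worms, n_features) array; the helper name n_components_for_variance is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA


def n_components_for_variance(features, threshold=0.90):
    """Return the smallest number of components whose cumulative explained variance reaches threshold."""
    pca = PCA().fit(features)  # keep every component
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # Index of the first cumulative value >= threshold, converted to a count.
    n = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding leaving the final cumulative sum just under the threshold.
    return min(n, len(cumulative))
```

Plotting the cumulative sum against the component number shows the same information graphically.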

In [ ]:

  • You measure the features of a new worm that was sent to the lab. Its feature set is below. Does this worm belong to one of your classes? If so, which one?
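
The exercise does not prescribe a method for assigning the new worm. The sketch below is one possible approach, not the intended solution: it projects the new worm with the PCA fitted on the population, groups the population scores with k-means (a technique chosen here only for illustration), and reports how far the new worm lies from its nearest cluster center. The function name, the use of KMeans, and the n_classes argument (the number of classes you decided on earlier) are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


def assign_new_worm(features, new_worm, n_classes, n_components=2):
    """Return (class label, distance to class center, largest member distance)."""
    # Fit the PCA on the population only, then reuse it for the new worm.
    pca = PCA(n_components=n_components).fit(features)
    scores = pca.transform(features)

    # Assumption: k-means on the PCA scores stands in for the visually identified classes.
    kmeans = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit(scores)

    new_score = pca.transform(np.asarray(new_worm, dtype=float).reshape(1, -1))
    label = int(kmeans.predict(new_score)[0])
    center = kmeans.cluster_centers_[label]

    # Compare the new worm's distance with the spread of existing class members.
    distance = float(np.linalg.norm(new_score[0] - center))
    member_distances = np.linalg.norm(scores[kmeans.labels_ == label] - center, axis=1)
    return label, distance, float(member_distances.max())
```

A distance well beyond the largest member distance suggests the worm does not belong to any of the existing classes, even though k-means always reports a nearest one.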

In [ ]: