This is the section of the class where we learn how to make a computer look at our data and identify patterns that we didn't know to look for. The first part of this module begins with videos that give a brief background and introduction. In the following units, we'll start putting this vocabulary to use!
You can find Casey's introduction to machine learning for GCB 535 here: https://youtu.be/Cj_giNsKZYc
You can find Casey's discussion of different classes of machine learning methods here: https://youtu.be/4n2m3bLY2ps
Q1: What type of question would you address with an unsupervised algorithm?
Q2: Would you use a supervised or unsupervised algorithm to find genes involved in mitochondrial biogenesis if you have already identified a few genes that play a role in the process?
Q3: Why?
Q4: For the situation described in Q2, what are the Features, Examples, Labels, and Predictions?
You can find Casey's discussion of how you might structure an analysis to use a supervised algorithm to predict the effective therapeutic dose of a drug here: https://youtu.be/9N19ogr9mZc
Q5: Why are the samples in Video 2 features, while the samples here are examples?
You can find Casey's discussion of how you might look for disease subtypes with unsupervised algorithms here: https://youtu.be/y400v_AAJSE
Q6: What are the Features, Examples, and Labels for the question discussed in Video 4?
Let's meet our first machine learning algorithm: k-means clustering. K-means has been used to identify subtypes of disease. For example, we discuss this paper by Tothill et al. in our k-means introduction video. Before you dive into the nuts and bolts of an implementation of k-means clustering, let's try to get an intuitive understanding of how this method works: https://youtu.be/qL7TBaMtooM
Q7: Is k-means clustering a supervised or unsupervised algorithm?
In [0]:
%matplotlib inline
# this "magic" line lets us display figures inline in an IPython notebook
import random
import sys
from math import sqrt
import matplotlib.pyplot as plt
import numpy as np
The next function is used to assign an observation to the centroid that is nearest to it.
In [0]:
def assign_nearest(centroids, point):
    """
    assigns the point to its nearest centroid

    params:
    centroids - a list of centroids, each of which has 2 dimensions
    point - a point, which has two dimensions

    returns:
    the index of the centroid the point is closest to.
    """
    nearest_idx = 0
    nearest_dist = sys.float_info.max  # largest float on your computer
    for i in range(len(centroids)):
        # Euclidean distance: sqrt((x1-x2)^2 + (y1-y2)^2)
        dist = sqrt((centroids[i][0] - point[0])**2 + (centroids[i][1] - point[1])**2)
        if dist < nearest_dist:  # smallest distance seen so far
            nearest_idx = i
            nearest_dist = dist
    return nearest_idx
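As a quick sanity check, you can call assign_nearest directly. This toy cell is our addition (the centroids and point are made up): the point (9, 9) is much closer to the centroid at (10, 10) than to the one at (0, 0), so it should print 1.

In [0]:
# toy check of assign_nearest (illustrative only)
print(assign_nearest([(0, 0), (10, 10)], (9, 9)))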
The next function actually performs k-means clustering. You need to understand how the algorithm works at the level of the video lecture. You don't need to understand every line of this, but you should feel free to dive in if you're interested!
In [0]:
def kmeans(data, k):
    """
    performs k-means clustering for two-dimensional data.

    params:
    data - A numpy array of shape (N, 2)
    k - The number of clusters.

    returns:
    a dictionary with three elements
    - ['centroids']: a list of the final centroid positions.
    - ['members']: a list [one per point] giving the index of the centroid
      each point was assigned to at the conclusion of clustering.
    - ['paths']: a list [one per centroid] of lists [one per iteration]
      containing the positions occupied by each centroid.
    """
    # http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.ndarray.shape.html#numpy.ndarray.shape
    # .shape returns the size of the input numpy array in each dimension
    # if there are not 2 dimensions, we can't handle it here.
    if data.ndim != 2 or data.shape[1] != 2:
        return 'This implementation only supports two dimensional data.'
    if data.shape[0] < k:
        return 'This implementation requires at least as many points as clusters.'

    # pick k random points from the data as the initial centroids
    # (we sample indices so this works on a numpy array)
    centroids = []
    for idx in random.sample(range(len(data)), k):
        # note the use of tuples here: tuples are hashable, so we can keep
        # a set of previously seen centroid positions below
        centroids.append(tuple(data[idx].tolist()))
    paths = []
    for i in range(k):
        paths.append([centroids[i], ])

    # we'll store all previous states
    # so if we ever hit the same point again we know to stop
    previous_states = set()

    # continue until we repeat the same centroid positions
    assignments = None
    while tuple(centroids) not in previous_states:
        previous_states.add(tuple(centroids))
        # assignment step: attach each point to its nearest centroid
        assignments = []
        for point in data:
            assignments.append(assign_nearest(centroids, point))
        # update step: move each centroid to the mean of its members
        centroids_sum = []  # per-centroid running sum of member positions
        centroids_n = []  # per-centroid count of members
        for i in range(k):
            centroids_sum.append((0, 0))
            centroids_n.append(0)
        for i in range(len(assignments)):
            centroid = assignments[i]
            centroids_n[centroid] += 1  # found a new member of this centroid
            # add the point's coordinates to the running sum
            centroids_sum[centroid] = (centroids_sum[centroid][0] + data[i][0],
                                       centroids_sum[centroid][1] + data[i][1])
        for i in range(k):
            # note: this simple implementation assumes every centroid keeps at
            # least one member; an empty cluster would cause a division by zero
            new_centroid = (centroids_sum[i][0] / centroids_n[i],
                            centroids_sum[i][1] / centroids_n[i])
            centroids[i] = new_centroid
            paths[i].append(new_centroid)

    r_dict = {}
    r_dict['centroids'] = centroids
    r_dict['paths'] = paths
    r_dict['members'] = assignments
    return r_dict
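To get a feel for what kmeans returns, here is a small made-up example of our own: four points that form two obvious pairs. With k=2, the centroids should end up near (0, 0.5) and (10, 10.5), and the members list should put each pair in the same cluster.

In [0]:
# toy run of kmeans on four hand-picked points (illustrative only)
toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
toy_result = kmeans(toy, 2)
print(toy_result['centroids'])
print(toy_result['members'])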
This next cell is full of plotting code. It uses a library called matplotlib to show the k-means clustering results. Specifically, it shows the path each centroid took, where the centroids ended up, and which points were assigned to them. Feel free to take a look at this, but understanding it goes beyond the scope of the class.
In [0]:
def plot_km(km, points):
    """
    Plots the results of a kmeans run.

    params:
    km - a kmeans result object that contains centroids, paths, and members
    points - the numpy array of points that was clustered

    returns:
    nothing; displays a matplotlib figure
    """
    (xmin, ymin) = np.amin(points, axis=0)
    (xmax, ymax) = np.amax(points, axis=0)
    plt.figure(1)
    plt.clf()
    # draw each data point as a small black dot
    plt.plot(points[:, 0], points[:, 1], 'k.', markersize=2)
    # draw the path each centroid took across iterations
    for path in km['paths']:
        nppath = np.asarray(path)
        plt.plot(nppath[:, 0], nppath[:, 1])
    # Plot the calculated centroids as a red X
    centroids = np.asarray(km['centroids'])
    plt.scatter(centroids[:, 0], centroids[:, 1],
                marker='x', s=169, linewidths=3,
                color='r', zorder=10)
    plt.title('K-means clustering of simulated data.\n'
              'estimated (red), path (lines)')
    plt.xlim(xmin, xmax)
    plt.ylim(ymin, ymax)
    plt.xticks(())
    plt.yticks(())
    plt.show()
In [0]:
pop = np.loadtxt('kmeans-population.csv', delimiter=',')
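If you'd like to check what was loaded (an optional step we added), the array's shape tells you how many points there are and that each has two dimensions:

In [0]:
# optional sanity check: N points, each with 2 coordinates
print(pop.shape)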
Now we can use the k-means function to cluster! In this case, we're saying we want to find three clusters.
In [0]:
km_result = kmeans(pop, 3)
Now we can plot the results!
In [0]:
plot_km(km_result, pop)
Woo! You're done with this prelab! Feel free to run the k-means clustering and plotting cells a few times to see how the algorithm behaves with different random starting centroids; you can also try a different number of clusters, as in the cell below. For our in-class exercise, we're going to perform k-means clustering in an exercise we call The Duck Strikes Back.
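For example, this extra cell (our addition; the choice of k=5 is arbitrary) re-runs the clustering with five clusters:

In [0]:
# re-run with a different, arbitrary number of clusters (illustrative only)
plot_km(kmeans(pop, 5), pop)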
The k-means implementation above is functional and could be used in practice. However, a much more optimized implementation is available in the scikit-learn package, which we're going to use for the supervised machine learning applications in this course. For more information on using that implementation, check out the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
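As a minimal sketch of what that looks like (assuming scikit-learn is installed; the variable names and random_state here are our choices), the cell below clusters the same data with sklearn.cluster.KMeans: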
In [0]:
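# minimal sketch using scikit-learn's KMeans on the same data
from sklearn.cluster import KMeans

sk_km = KMeans(n_clusters=3, random_state=0)  # three clusters, fixed seed
sk_km.fit(pop)  # pop is the (N, 2) array loaded earlier
print(sk_km.cluster_centers_)  # final centroid positions
print(sk_km.labels_[:10])  # cluster assignments for the first ten points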