This notebook is intended to introduce you to running an IPython (Jupyter) notebook and to familiarize you with some basics of numpy, matplotlib, and sklearn, which you'll use extensively in this course. Read through the commands, try making changes, and make sure you understand how the plots below are generated.
In your projects, you should focus on making your code as readable as possible. Use lots of comments -- see the code below -- and try to prefer clarity over compact code.
You should also familiarize yourself with the various keyboard shortcuts for moving between cells and running cells. Ctrl-ENTER runs a cell, while shift-ENTER runs a cell and advances focus to the next cell.
The first code cell just contains setup calls -- importing libraries and some other global settings to make things run smoothly.
In [4]:
# Import a bunch of libraries.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline
Start by setting the random seed so that results are the same each time the notebook runs. We'll use the random number generator later when we add noise to our data.
In [7]:
np.random.seed(100)
Generate evenly spaced X values in [0, 1] using linspace. X is a numpy array -- specifically an "ndarray", numpy's n-dimensional array type. Try looking at the documentation for ndarray: http://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html
In [39]:
# How many samples to generate. Try adjusting this value.
n_samples = 20
X = np.linspace(0, 1, n_samples)
# Inspect X.
print(X)
print(type(X))
print(X.shape)
Let's create a "true" function that we will try to approximate with a model, below. We'll use Python's lambda syntax, which makes it easy to define a simple function in a single line. See the Python documentation on lambda expressions for more details.
In [30]:
# Set the true function as a piece of a cosine curve.
true_function = lambda x: np.cos(1.5 * np.pi * x)
# Try it out. Notice that you can apply the function to a scalar or an array.
print(true_function(0))
print(true_function(0.5))
print(true_function(np.array([0, 0.5])))
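For comparison, here is the same function written with an ordinary def statement (the name true_function_def is used only for this illustration); the lambda above is just a more compact way of writing the same thing.
In [ ]:
# The def version of the same function, shown only for comparison with the lambda above.
def true_function_def(x):
    return np.cos(1.5 * np.pi * x)

# The two versions agree.
print(true_function_def(0.5) == true_function(0.5))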
Now, let's generate noisy observations of our true function. This simulates something like the situation we encounter in the real world: we observe noisy data from which we'd like to infer a model.
In [35]:
# Generate true y values.
y = true_function(X)
# Print the values of y, rounded to the nearest hundredth.
print(['%.2f' % i for i in y])
# Add random noise to y.
# The randn function samples random numbers from the standard Normal distribution.
# Multiplying by 0.2 scales the standard deviation of the noise to 0.2.
y += np.random.randn(n_samples) * 0.2
# Print the noise-added values of y for comparison.
print(['%.2f' % i for i in y])
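If you want to convince yourself that the multiplier really controls the spread of the noise, here is a small optional check (the sample size of 100000 is an arbitrary choice, just large enough for a stable estimate):
In [ ]:
# Optional sanity check: randn() * 0.2 should have mean near 0 and standard deviation near 0.2.
noise = np.random.randn(100000) * 0.2
print(np.mean(noise))
print(np.std(noise))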
OK. Now we have some outputs, y, that we want to predict, and some inputs, X. In this course, our outputs will always be 1-dimensional. Our inputs will usually have more than 1 dimension -- we'll call these dimensions our features. But here, for simplicity, we have just a single feature.
Since the machine learning classes in sklearn expect a 2-dimensional array of input feature vectors (one row per example), we need to turn each scalar input x in X into a feature vector [x].
In [40]:
# Another way to do this is np.transpose([X]). See the numpy documentation on
# array indexing for details.
X = X[:, np.newaxis]
print(X)
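As a side note, several common numpy idioms produce the same (n, 1) column shape; the sketch below checks this on a small throwaway array v, which is not used anywhere else in the notebook.
In [ ]:
# Three equivalent ways to turn a 1-d array into a column of feature vectors.
v = np.array([0.0, 0.5, 1.0])
print(v[:, np.newaxis].shape)
print(np.transpose([v]).shape)
print(v.reshape(-1, 1).shape)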
Since you're already familiar with linear regression, let's try that first. Check out the sklearn documentation for linear regression:
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
In [51]:
# Try setting fit_intercept=False as well.
lr = LinearRegression(fit_intercept=True)
lr.fit(X, y)
print(lr.intercept_)
print(lr.coef_)
print('Estimated function: y = %.2f + %.2fx' % (lr.intercept_, lr.coef_[0]))
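As a quick aside (not part of the original exercise), you can also quantify how well this line fits: LinearRegression's score() method returns R^2, the coefficient of determination, on whatever data you pass it, and predict() returns the fitted values. Here we evaluate on the training data itself, which is fine for illustration.
In [ ]:
# Quantify the fit on the training data. An R^2 of 1.0 would be a perfect fit.
print(lr.score(X, y))
# The model's fitted values for the training inputs, rounded for readability.
print(['%.2f' % p for p in lr.predict(X)])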
Approximating a cosine function with a linear model doesn't work so well. By adding polynomial transformations of our feature(s), we can fit more complex functions. This is often called polynomial regression. Take a look at the sklearn documentation for the PolynomialFeatures preprocessor:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html
You'll notice that sklearn classes share many of the same method names, like fit() and fit_transform().
In [57]:
# Try increasing the degree past 2.
poly = PolynomialFeatures(degree=2, include_bias=False)
X2 = poly.fit_transform(X)
print(X2)
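As a small illustration of the shared API mentioned above, fit_transform() is a convenience that combines fit() and transform(); the sketch below (using throwaway names poly2 and X2_alt) checks that the two paths give the same result.
In [ ]:
# fit_transform() is equivalent to calling fit() and then transform().
poly2 = PolynomialFeatures(degree=2, include_bias=False)
poly2.fit(X)
X2_alt = poly2.transform(X)
print(np.allclose(X2, X2_alt))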
Now let's fit a linear model where the input features are (x, x^2).
In [59]:
lr = LinearRegression(fit_intercept=True)
lr.fit(X2, y)
print(lr.intercept_)
print(lr.coef_)
print('Estimated function: y = %.2f + %.2fx + %.2fx^2' % (lr.intercept_, lr.coef_[0], lr.coef_[1]))
Let's put everything together and try some plotting. We can use sklearn's Pipeline framework to connect the 2 operations, PolynomialFeatures and LinearRegression, both of which have a fit() method.
In [60]:
# Below, we'll fit polynomials to the noisy data with these degrees.
degrees = [1, 4, 15]
# Initialize a new plot.
plt.figure(figsize=(14, 4))
# We'll create a subplot for each value in the degrees list.
for i in range(len(degrees)):
    # The subplots are all on the same row.
    ax = plt.subplot(1, len(degrees), i + 1)
    # Turn off tick marks to keep things clean.
    plt.setp(ax, xticks=(), yticks=())
    # Set up the polynomial features preprocessor.
    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    # Use sklearn's Pipeline to string together the 2 operations.
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X, y)
    # Show samples from the fitted function.
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    # Show the true function.
    plt.plot(X_test, true_function(X_test), label="True function")
    # Show the original noisy samples.
    plt.scatter(X, y, label="Samples")
    # Add a few more labels to the plot.
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((-.05, 1.05))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree %d" % degrees[i])
# Render the plots.
plt.show()
The machine learning lesson here is that we want the simplest model that still fits the data well. The degree 1 model, while very small (only 2 parameters), clearly doesn't fit the observed data well. The degree 15 model fits the observed data extremely well, but is unlikely to generalize to new data. This is a case of "overfitting", which often happens when we try to estimate too many parameters from too few examples. The degree 4 model appears to be a good balance of small model size and good generalization.
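One way to make this lesson quantitative rather than just visual is cross-validation. The sketch below is one possible approach, not part of the original exercise: it uses sklearn's cross_val_score (imported from sklearn.model_selection, its location in recent sklearn versions) to estimate how well each degree generalizes to held-out data. The scoring is negative mean squared error, so values closer to 0 are better; the high-degree model's score will typically look far worse than its near-perfect fit to the training data would suggest.
In [ ]:
# Estimate generalization for each degree with 5-fold cross-validation.
# cross_val_score lives in sklearn.model_selection in recent sklearn versions.
from sklearn.model_selection import cross_val_score

for degree in [1, 4, 15]:
    pipeline = Pipeline([
        ("polynomial_features", PolynomialFeatures(degree=degree, include_bias=False)),
        ("linear_regression", LinearRegression()),
    ])
    # Negative mean squared error: closer to 0 is better.
    scores = cross_val_score(pipeline, X, y,
                             scoring="neg_mean_squared_error", cv=5)
    print("Degree %2d: mean CV score = %.3f" % (degree, scores.mean()))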