Lab 10


In [ ]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab10.ok')

Today's lab is about the computational nuts and bolts of Ordinary Least Squares regression and featurization. For simplicity, we will use synthetic datasets.

Basic practice

In this first dataset, the feature columns are named x0, x1, and so on, and the response column is y. Here $n = 1000$ and $p = 10$, and every value is a real number.


In [ ]:
data0 = pd.read_csv("data0.csv")
data0.head()

Recall from lecture that a least-squares regression problem can be cast as:

$$\theta^{\text{OLS}} = \arg \min_{\theta \in \mathbb{R}^p} (Y - X \theta)^T (Y - X \theta)$$

Here $Y \in \mathbb{R}^n$ are the true values we're trying to match, one per datum; $X \in \mathbb{R}^{n \times p}$ are the features, one row per datum; and $\theta \in \mathbb{R}^p$ is the parameter vector for a linear prediction function. So the output of the $\arg \min$ is a vector of length $p$.

We could solve this with numerical optimization, which we saw briefly in the previous lab. We can also solve it by a direct computation. Taking the derivative (technically, the gradient) of the minimand with respect to $\theta$ and setting it equal to 0, we get:

$$X^T X \theta = X^T Y$$
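
In case you want to see the intermediate step: the gradient of the minimand with respect to $\theta$ is

$$\nabla_\theta \, (Y - X \theta)^T (Y - X \theta) = -2 X^T (Y - X \theta),$$

and setting this to zero and rearranging gives the equation above.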

This linear equation in $\theta$ is known as the normal equations, and its solution can be computed in various ways. Here are a few:

  1. Inverting the matrix $(X^T X)$ and computing $(X^T X)^{-1} X^T Y$.
  2. Solving the equation directly for $\theta$ without inverting $(X^T X)$. (Here are some examples.)
  3. Minimizing $(Y - X \theta)^T (Y - X \theta)$ using numerical optimization. (Here is a list of some of the specialized methods that have been developed.)

We'll look at methods 1 and 2.
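
As a warm-up, here is a minimal sketch of both methods on a tiny made-up system (A and b below are invented purely for illustration; they are not the lab data):

import numpy as np
import scipy as sc
import scipy.linalg  # make sure the linalg submodule is loaded

# A tiny made-up 2-by-2 system standing in for X^T X theta = X^T Y.
A = np.array([[2., 1.],
              [1., 3.]])   # plays the role of X^T X
b = np.array([5., 10.])    # plays the role of X^T Y

theta_via_inverse = sc.linalg.inv(A) @ b   # method 1: invert, then multiply
theta_via_solve = sc.linalg.solve(A, b)    # method 2: solve without forming the inverse

Both calls give the same answer here; the differences between the two methods only matter for speed and numerical stability, which you'll explore below.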

Question 1

After loading the data, form the matrix $X$, which should be an $n$ by $p$ NumPy array. Also form the array $Y$, which should be a one-dimensional NumPy array of length $n$.

Hint: The DataFrame method as_matrix will be useful.

Hint 2: If x is a DataFrame, then x[x.columns.difference(['col', 'another_col'])] is a copy of x without the two named columns.
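
For reference, here is a minimal sketch of the two hints on a tiny invented DataFrame (toy is not the lab data):

import pandas as pd

# A tiny invented DataFrame, just to illustrate the hint pattern.
toy = pd.DataFrame({'x0': [1., 2.], 'x1': [3., 4.], 'y': [5., 6.]})
toy_X = toy[toy.columns.difference(['y'])].as_matrix()   # 2-by-2 array of features
toy_Y = toy['y'].as_matrix()                             # length-2 array of responses

(In recent versions of pandas, as_matrix has been removed; .values does the same thing.)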


In [ ]:
X = ...
Y = ...
X, Y

In [ ]:
_ = ok.grade('q1')
_ = ok.backup()

Question 2

Form the matrix $X^T X$. It should be a $p$ by $p$ NumPy array. Name the result XtX.

Hint: The operator @ will multiply two matrices. You can also use a.dot(b).

Hint 2: X.T is the transpose of X.
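
For example, on a small made-up array:

import numpy as np

M = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])   # shape (3, 2), a stand-in for X
M.T @ M                    # shape (2, 2), a stand-in for X^T X
M.T.dot(M)                 # the same product, written with .dot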


In [ ]:
XtX = ...
XtX

In [ ]:
_ = ok.grade('q2')
_ = ok.backup()

Question 3

Invert the matrix XtX and use it to solve the normal equations for $\theta$. Name the result theta3.

Hint: Use scipy to do the linear algebra; it is customary to import it as sc. Note that importing scipy alone does not automatically load its submodules, so the cell below also imports scipy.linalg. Then use the function sc.linalg.inv.


In [ ]:
import scipy as sc
import scipy.linalg  # importing scipy alone does not automatically load the linalg submodule
XtXinv = ...
theta3 = ...
theta3

In [ ]:
_ = ok.grade('q3')
_ = ok.backup()

Question 4

What is the squared-error loss (the average of the squares of the residuals) for your theta3 on this dataset?


In [ ]:
avg_squared_loss4 = np.mean((X @ theta3 - Y)**2)
avg_squared_loss4

In [ ]:
_ = ok.grade('q4')
_ = ok.backup()

Question 5

Use sc.linalg.solve to solve for $\theta$ without inverting $X^T X$.


In [ ]:
theta5 = sc.linalg.solve(XtX, X.T @ Y)
theta5

In [ ]:
_ = ok.grade('q5')
_ = ok.backup()

Question 6

Are the differences between theta5 and theta3 large? Investigate by computing the absolute differences theta5 - theta3 and the relative differences (theta5 - theta3) / theta5.


In [ ]:
absolute_diffs = ...
relative_diffs = ...

In [ ]:
# Run this cell to see histograms of the differences.
# You should see that they are not very large.
plt.hist(absolute_diffs)
plt.show()
plt.hist(relative_diffs);

Question 7

Use scikit-learn to find $\theta$, as follows:

  • Import the module sklearn.linear_model.
  • Call LinearRegression to produce a least-squares linear regression model. Be sure to use settings that will produce the same model as we've been using. At least 1 default option is incorrect!
  • Call the fit method of the model to fit it to $X$ and $Y$.
  • The coef_ field of the model will now contain $\theta$.

Use the names provided in the skeleton.

Hint: Remember, if you type "linear_model.LinearRegression(" and then hit Shift+Tab, you'll see the documentation for the LinearRegression function.
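
For orientation, here is a minimal sketch of the fit/coef_ workflow on a tiny made-up dataset (X_toy and Y_toy are invented; the fit_intercept setting is shown only because the model we have been fitting by hand has no separate intercept term, so check the documentation to decide which defaults you actually need to change):

from sklearn import linear_model
import numpy as np

# A tiny made-up dataset, just to show the shape of the API.
X_toy = np.array([[1., 0.],
                  [0., 1.],
                  [1., 1.]])
Y_toy = np.array([2., 3., 5.])

toy_model = linear_model.LinearRegression(fit_intercept=False)  # assumption: no intercept, matching the by-hand model
toy_model.fit(X_toy, Y_toy)
toy_model.coef_   # the fitted parameter vector, one entry per column of X_toy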


In [ ]:
from sklearn import linear_model
lin_model7 = ...
...
theta7 = ...
theta7

In [ ]:
_ = ok.grade('q7')
_ = ok.backup()

Question 8

Using the predict method of the model you created, compute a prediction for each row of $X$. (These should be the same as X @ theta7, but it is useful to learn the API.) Use these predictions to compute the average squared loss as in question 4.
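
Continuing the toy example from the previous sketch, the predict-then-score pattern looks like this (again, X_toy and Y_toy are invented, not the lab data):

from sklearn import linear_model
import numpy as np

X_toy = np.array([[1., 0.],
                  [0., 1.],
                  [1., 1.]])
Y_toy = np.array([2., 3., 5.])
toy_model = linear_model.LinearRegression(fit_intercept=False).fit(X_toy, Y_toy)

toy_predictions = toy_model.predict(X_toy)           # one prediction per row of X_toy
toy_loss = np.mean((toy_predictions - Y_toy) ** 2)   # average squared loss, as in question 4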


In [ ]:
avg_squared_loss8 = ...
avg_squared_loss8

In [ ]:
_ = ok.grade('q8')
_ = ok.backup()

One-hot encoding and numerical stability

The next cell loads a simple, slightly more realistic dataset.

We are trying to predict a person's age on the basis of their height and name. We have a dataset of 5 people. (This sounds silly, but given a larger dataset, it could be a reasonable idea. Names go in and out of fashion over time. For example, someone named Sierra in the United States was most likely born in the 1990s, while someone named Judith was most likely born in the 1940s.)


In [ ]:
people = pd.DataFrame({
        'name':        ['Aditya', 'Basil', 'Caty', 'Delilah', 'Delilah'],
        'height (ft)': [5,        5,       2,      4,         4        ],
        'age (yrs)':   [29,       17,      3,      7,         12       ]
    })
people

Of course, if we attempt to run something like:

X_people = people[['name', 'height (ft)']].as_matrix()
Y_people = people['age (yrs)'].as_matrix()
sc.linalg.solve(X_people.T @ X_people, X_people.T @ Y_people)

...it won't work, because some entries of X_people are strings.

Recall that a typical solution to this is one-hot encoding: we transform the strings into several 0/1 features, one for each distinct string value in the dataset. The package sklearn.feature_extraction provides many kinds of featurization, including sklearn.feature_extraction.DictVectorizer, which one-hot encodes strings.

Question 9

Use DictVectorizer to one-hot encode the name column of people. Then, create a feature matrix X_people containing both the height (ft) column and the one-hot-encoded name features. Also create a response array Y_people containing the ages.


In [ ]:
# Run this cell and examine its output.  Use it to complete
# the next cell.
from sklearn import feature_extraction
one_hot_encoder = feature_extraction.DictVectorizer()
list_of_name_dicts = people[['name']].to_dict(orient='records')
names_encoded = one_hot_encoder.fit_transform(list_of_name_dicts).toarray()
names_encoded_tbl = pd.DataFrame(names_encoded, columns=one_hot_encoder.get_feature_names())
names_encoded_tbl

In [ ]:
...
X_people = ...
Y_people = ...
X_people, Y_people

In [ ]:
_ = ok.grade('q9')
_ = ok.backup()

Question 10

Read the following 4 cells, run them, and examine their output. What do you notice?


In [ ]:
np.linalg.matrix_rank(X_people.T @ X_people)

In [ ]:
inv_people = sc.linalg.inv(X_people.T @ X_people)
inv_people

In [ ]:
theta_people_via_inv = sc.linalg.inv(X_people.T @ X_people) @ X_people.T @ Y_people
theta_people_via_inv

In [ ]:
(X_people.T @ X_people @ theta_people_via_inv) - (X_people.T @ Y_people)

Write your answer here, replacing this text.

The feature matrix X_people is rank deficient, meaning its rank is lower than it could be because some of its rows or columns are linearly dependent. Specifically, X_people has 5 columns and 5 rows, and the last two rows are equal, so it has rank 4. That means X_people.T @ X_people also has rank 4 and is therefore not invertible. Unfortunately, sc.linalg.inv produces garbage output in such situations.

One solution is to try sc.linalg.solve or linear_model.LinearRegression. You're welcome to do that. However, a student proposes a different solution to this problem:

Let's add tiny random numbers (say, on the order of 1e-15) to each entry of X_people. It shouldn't make any qualitative difference in the features, and it will make X_people full-rank, since the last two rows won't be exactly the same any more.

Question 11

Adding small numbers to a matrix is called a perturbation. Try it out. Produce a matrix called X_people_perturbed that's a copy of X_people, but with tiny random numbers added to each element. Then solve the new least-squares problem by inverting (X_people_perturbed.T @ X_people_perturbed). Call the result theta_people_perturbed.


In [ ]:
# The noise matrix is provided for you.  It contains
# numbers sampled uniformly at random from the interval
# [-1e-15, 1e-15].
noise = (np.random.rand(*X_people.shape) - .5)*2*1e-15
X_people_perturbed = ...
theta_people_perturbed = ...
theta_people_perturbed

In [ ]:
_ = ok.grade('q11')
_ = ok.backup()

Question 12

Rerun the cell above a few times. Do you notice anything that suggests we have (or haven't) actually solved the problem?

Write your answer here, replacing this text.

Perturbing a matrix that is rank-deficient will technically make it invertible, but it remains ill-conditioned. That means that computer solvers will work poorly, and the solution doesn't really make sense because it is highly sensitive to small changes in the input data. (It is numerically unstable.) That is especially true when we use sc.linalg.inv, but you would notice a similar problem with sc.linalg.solve.

Feature matrices in the wild are not often rank deficient, and they can typically be made full-rank by inconsequential changes to feature values, as you've seen here. However, feature matrices are sometimes ill-conditioned. The phenomenon is not limited to one-hot encoding; it's just convenient to demonstrate in that case.

Fortunately, there is an easy way to check whether a matrix is ill-conditioned: its condition number. Higher condition numbers are worse. As a rule of thumb, 1e10 is considered a very high condition number.
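
To see what this looks like in practice, here is a small made-up illustration: a matrix with two nearly identical rows has an enormous condition number, while a well-behaved matrix does not.

import numpy as np

well_behaved = np.array([[2., 0.],
                         [0., 1.]])
nearly_singular = np.array([[1., 1.],
                            [1., 1. + 1e-12]])   # two almost-identical rows

np.linalg.cond(well_behaved)      # 2.0 -- nothing to worry about
np.linalg.cond(nearly_singular)   # on the order of 1e12, far beyond the 1e10 rule of thumb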

Question 13

Use the function np.linalg.cond to compute the condition numbers of X_people and of X_people_perturbed.


In [ ]:
X_people_condition_number = np.linalg.cond(X_people)
X_people_perturbed_condition_number = np.linalg.cond(X_people_perturbed)
print("Condition number for X_people:\t\t\t{:.4e}".format(X_people_condition_number))
print("Condition number for X_people_perturbed:\t{:.4e}".format(X_people_perturbed_condition_number))

In [ ]:
_ = ok.grade('q13')
_ = ok.backup()

Soon we will see regularization, a method that (among other benefits) can alleviate numerical instability.

In the last section, you'll investigate how linear regression works on a much larger dataset.

Scaling up

The $X$ matrix from the first dataset has 10,000 entries, which is not particularly large. Let's see whether your computer can handle a much larger dataset: $n = 100,000$ and $p = 1000$.

The mathematical simplicity of least-squares linear regression is a big advantage here. In fact, the main computational difficulties with the next dataset are (1) getting the data to your computer and (2) holding it in memory. To alleviate the first problem, we will synthesize the dataset directly, rather than loading it from a file.

After reading the following note, run the next cell to generate your big dataset.

Note: If it takes more than 30 seconds to run the next cell, the generated matrix is probably too large to fit in memory on your computer. (It contains 100 million floating-point numbers, or roughly 800 megabytes of data.) You will probably have to restart the kernel. Then, try reducing p in the call to generate_dataset, say to 500 or 200.
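
For reference, the arithmetic behind that estimate: $100{,}000 \times 1000$ entries $\times$ 8 bytes per 64-bit float $= 8 \times 10^8$ bytes, or roughly 800 megabytes.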


In [ ]:
def generate_dataset(n, p):
    error_magnitude = 10
    theta_magnitude = 100
    X = np.random.randn(n, p)
    e = np.random.randn(n)*error_magnitude
    true_theta = (np.random.rand(p)-.5)*2*theta_magnitude
    Y = X @ true_theta + e
    return (X, Y, true_theta)

X_big, Y_big, true_theta_big = generate_dataset(n = 100000, p = 1000)

The features are in the matrix X_big, and the response values are in Y_big. Notice also that the true $\theta$ is stored in true_theta_big.

Question 14

Perform least-squares linear regression for this dataset using any method you wish.


In [ ]:
theta14 = ...
theta14

In [ ]:
_ = ok.grade('q14')
_ = ok.backup()

Question 15

Compare your theta14 to true_theta_big, the true value that generated the data. There are 1000 differences, so we can't just read them. Instead, compute a histogram to show the distribution of differences. (It may be informative to divide the differences by the average magnitude of the elements of true_theta_big.)
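
One possible approach, sketched below (it is not the only reasonable plot): scale the differences by the average magnitude of the true parameters and histogram the result.

scaled_diffs = (theta14 - true_theta_big) / np.mean(np.abs(true_theta_big))
plt.hist(scaled_diffs, bins=50)
plt.xlabel('difference relative to average parameter magnitude')
plt.show()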


In [ ]:
...

You should find that most parameters are well-estimated. Note that this depends on the amount of data -- if we had only 2000 samples, the errors would be much larger. If you are interested, it would be informative to try that yourself by generating another dataset.

Question 16

The function time below will time any function you wish to run. See how long it takes to run the following steps for the big dataset:

  1. Compute $X^T X$.
  2. Compute $X^T Y$.
  3. Having precomputed $X^T X$, compute $(X^T X)^{-1}$.
  4. Having precomputed $(X^T X)^{-1}$ and $X^T Y$, compute $(X^T X)^{-1} X^T Y$.
  5. Having precomputed $X^T X$ and $X^T Y$, solve for $\theta$ (using sc.linalg.solve).

In [ ]:
def time(f, count):
    """Report how long it takes to call f.
    
    IMPORTANT NOTE:
    The count should be chosen judiciously.  Try 1 to start,
    just to make sure your function doesn't take forever to
    run.  Then try 10, 100, etc, until it takes at least a
    second for this function to finish timing.
    
    Args:
      f (function): A function that takes no arguments and
                    performs some computation.
      count (int): The number of times to call the function.
                   If a function is very fast, calling it
                   more times will produce a more accurate
                   estimate of its speed.
    
    Returns:
      float: The length of time it takes (in seconds) to call
             f, on average.
    
    Example:
    >>> time(lambda: 2 + 2, 1000000)
    7.233020500279963e-08
    """
    import timeit
    return min(timeit.repeat(f, repeat=3, number=count)) / count
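
As a usage sketch, timing step 1 could look like this (starting with count = 1, as the docstring advises, and increasing it only if the call is fast):

seconds_per_XtX = time(lambda: X_big.T @ X_big, 1)
seconds_per_XtX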

In [ ]:
...

In [ ]:
...

In [ ]:
X_big_t_X_big = X_big.T @ X_big # This should be useful.
...

In [ ]:
X_big_t_Y_big = X_big.T @ Y_big
X_big_t_X_big_inv = sc.linalg.inv(X_big.T @ X_big)
...

In [ ]:
...

Question 17

Solving a linear system is generally thought to be slow. Based on your timing results, when $n$ is much larger than $p$, which step in linear regression is the slow one? (Note that a trick can be used to get similar results if $p$ is much larger than $n$! Numerical methods like stochastic gradient descent can be faster still.)

Write your answer here, replacing this text.

Submitting your assignment

If you made a good-faith effort to complete the lab, change i_finished_the_lab to True in the cell below. In any case, run the cells below to submit the lab.


In [ ]:
i_finished_the_lab = False

In [ ]:
_ = ok.grade('qcompleted')
_ = ok.backup()

In [ ]:
_ = ok.submit()

Now, run this code in your terminal to make a git commit, which saves a snapshot of your changes. The last command runs git push, which will send your work to your personal GitHub repo.

# Tell git to commit your changes to this notebook
git add sp17/lab/lab10/lab10.ipynb

# Tell git to make the commit
git commit -m "lab10 finished"

# Send your updates to your personal private repo
git push origin master