In [ ]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
!pip install -U okpy
from client.api.notebook import Notebook
ok = Notebook('lab10.ok')
Today's lab is about the computational nuts and bolts of Ordinary Least Squares regression and featurization. For simplicity, we will use synthetic datasets.
In this first dataset, the feature columns are x0
, x1
, etc, and the response column is y
. $n = 1000$ and $p = 10$. Everything is a real number.
In [ ]:
data0 = pd.read_csv("data0.csv")
data0.head()
Recall from lecture that a least-squares regression problem can be cast as:
$$\theta^{\text{OLS}} = \arg \min_{\theta \in \mathbb{R}^p} (Y - X \theta)^T (Y - X \theta)$$Here $Y \in \mathbb{R}^n$ are the true values we're trying to match, one per datum; $X \in \mathbb{R}^{n \times p}$ are the features, one row per datum; and $\theta \in \mathbb{R}^p$ is the parameter vector for a linear prediction function. So the output of the $\arg \min$ is a vector of length $p$.
We could solve this with numerical optimization, which we saw briefly in the previous lab. We can also solve it by a direct computation. Taking the derivative (technically, the gradient) of the minimand with respect to $\theta$ and setting it equal to 0, we get:
$$X^T X \theta = X^T Y$$The solution to this linear equation in $\theta$ can be computed in various ways. Here are a few:
We'll look at methods 1 and 2.
After loading the data, form the matrix $X$, which should be an $n$ by $p$ NumPy array. Also form the array $Y$, which should be a NumPy array with dimension $n$.
Hint: The DataFrame method as_matrix
will be useful.
Hint 2: If x
is a DataFrame, then x[x.columns.difference(['col', 'another_col'])]
is a copy of x
without the two named columns.
In [ ]:
X = ...
Y = ...
X, Y
In [ ]:
_ = ok.grade('q1')
_ = ok.backup()
In [ ]:
XtX = ...
XtX
In [ ]:
_ = ok.grade('q2')
_ = ok.backup()
In [ ]:
import scipy as sc
XtXinv = ...
theta3 = ...
theta3
In [ ]:
_ = ok.grade('q3')
_ = ok.backup()
In [ ]:
avg_squared_loss4 = np.mean((X @ theta3 - Y)**2)
avg_squared_loss4
In [ ]:
_ = ok.grade('q4')
_ = ok.backup()
In [ ]:
theta5 = sc.linalg.solve(XtX, X.T @ Y)
theta5
In [ ]:
_ = ok.grade('q5')
_ = ok.backup()
In [ ]:
absolute_diffs = ...
relative_diffs = ...
In [ ]:
# Run this cell to see histograms of the differences.
# You should see that they are not very large.
plt.hist(absolute_diffs)
plt.show()
plt.hist(relative_diffs);
Use scikit-learn
to find $\theta$, as follows:
sklearn.linear_model
.LinearRegression
to produce a least-squares linear regression model. Be sure to use settings that will produce the same model as we've been using. At least 1 default option is incorrect!fit
method of the model to fit it to $X$ and $Y$.coef_
field of the model will now contain $\theta$.Use the names provided in the skeleton.
Hint: Remember, if you type "linear_model.LinearRegression(
" and then hit Shift+Tab, you'll see the documentation for the LinearRegression
function.
In [ ]:
from sklearn import linear_model
lin_model7 = ...
...
theta7 = ...
theta7
In [ ]:
_ = ok.grade('q7')
_ = ok.backup()
In [ ]:
avg_squared_loss8 = ...
avg_squared_loss8
In [ ]:
_ = ok.grade('q8')
_ = ok.backup()
The next cell loads a simple, slightly more realistic dataset.
We are trying to predict a person's age on the basis of their height and name. We have a dataset of 5 people. (This sounds silly, but given a larger dataset, it could be a reasonable idea. Names go in and out of fashion over time. For example, someone named Sierra in the United States is likely born in the 1990s, while someone named Judith is likely born in the 1940s.)
In [ ]:
people = pd.DataFrame({
'name': ['Aditya', 'Basil', 'Caty', 'Delilah', 'Delilah'],
'height (ft)': [5, 5, 2, 4, 4 ],
'age (yrs)': [29, 17, 3, 7, 12 ]
})
people
Of course, if we attempt to run something like:
X_people = people[['name', 'height (ft)']].as_matrix()
Y_people = people['age (yrs)'].as_matrix()
sc.linalg.solve(X_people.T @ X_people, X_people.T @ Y_people)
...it won't work, because some entries of X_people
are strings.
Recall that a typical solution to this is one-hot encoding: we transform the strings into several 0/1 features, with one feature for each string in the dataset. The package sklearn.feature_extraction
provides many kinds of featurization, including sklearn.feature_extraction.DictVectorizer
, which one-hot encodes strings.
In [ ]:
# Run this cell and examine its output. Use it to complete
# the next cell.
from sklearn import feature_extraction
one_hot_encoder = feature_extraction.DictVectorizer()
list_of_name_dicts = people[['name']].to_dict(orient='records')
names_encoded = one_hot_encoder.fit_transform(list_of_name_dicts).toarray()
names_encoded_tbl = pd.DataFrame(names_encoded, columns=one_hot_encoder.get_feature_names())
names_encoded_tbl
In [ ]:
...
X_people = ...
Y_people = ...
X_people, Y_people
In [ ]:
_ = ok.grade('q9')
_ = ok.backup()
In [ ]:
np.linalg.matrix_rank(X_people.T @ X_people)
In [ ]:
inv_people = sc.linalg.inv(X_people.T @ X_people)
inv_people
In [ ]:
theta_people_via_inv = sc.linalg.inv(X_people.T @ X_people) @ X_people.T @ Y_people
theta_people_via_inv
In [ ]:
(X_people.T @ X_people @ theta_people_via_inv) - (X_people.T @ Y_people)
Write your answer here, replacing this text.
The feature matrix X_people
is rank deficient, meaning its rank is lower than it could be because some of its rows or columns are linearly dependent. Specifically, X_people
has 5 columns and 5 rows, and the last two rows are equal, so it has rank 4. That means X_people.T @ X_people
also has rank 4 and is therefore not invertible. Unfortunately, sc.linalg.inv
produces garbage output in such situations.
One solution is to try sc.linalg.solve
or linear_model.LinearRegression
. You're welcome to do that. However, a student proposes a different solution to this problem:
Let's add tiny random numbers (say, on the order of 1e-15) to each entry of
X_people
. It shouldn't make any qualitative difference in the features, and it will makeX_people
full-rank, since the last two rows won't be exactly the same any more.
Adding small numbers to a matrix is called a perturbation. Try it out. Produce a matrix called X_people_perturbed
that's a copy of X_people
, but with tiny random numbers added to each element. Then solve the new least-squares problem by inverting (X_people_perturbed.T @ X_people_perturbed)
. Call the result theta_people_perturbed
.
In [ ]:
# The noise matrix is provided for you. It contains
# numbers sampled uniformly at random from the interval
# [-1e-15, 1e-15].
noise = (np.random.rand(*X_people.shape) - .5)*2*1e-15
X_people_perturbed = ...
theta_people_perturbed = ...
theta_people_perturbed
In [ ]:
_ = ok.grade('q11')
_ = ok.backup()
Write your answer here, replacing this text.
Perturbing a matrix that is rank-deficient will technically make it invertible, but it remains ill-conditioned. That means that computer solvers will work poorly, and the solution doesn't really make sense because it is highly sensitive to small changes in the input data. (It is numerically unstable.) That is especially true when we use sc.linalg.inv
, but you would notice a similar problem with sc.linalg.solve
.
Feature matrices in the wild are not often rank deficient, and they can typically be made full-rank by inconsequential changes to feature values, as you've seen here. However, feature matrices are sometimes ill-conditioned. The phenomenon is not limited to one-hot encoding; it's just convenient to demonstrate in that case.
Fortunately, there is an easy way to check whether a matrix is ill-conditioned: its condition number. Higher condition numbers are worse. As a rule of thumb, 1e10 is considered a very high condition number.
In [ ]:
X_people_condition_number = np.linalg.cond(X_people)
X_people_perturbed_condition_number = np.linalg.cond(X_people_perturbed)
print("Condition number for X_people:\t\t\t{:.4e}".format(X_people_condition_number))
print("Condition number for X_people_perturbed:\t{:.4e}".format(X_people_perturbed_condition_number))
In [ ]:
_ = ok.grade('q13')
_ = ok.backup()
Soon we will see regularization, a method that (among other benefits) can alleviate numerical instability.
In the last section, you'll investigate how linear regression works on a much larger dataset.
Our $X$ matrix has 10,000 entries, which is not particularly large. Let us see if your computer can handle a much larger dataset: $n = 100,000$ and $p = 1000$.
The mathematical simplicity of least-squares linear regression is a big advantage here. In fact, the main computational difficulties with the next dataset are (1) getting the data to your computer and (2) holding it in memory. To alleviate the first problem, we will synthesize the dataset directly, rather than loading it from a file.
After reading the following note, run the next cell to generate your big dataset.
Note: If it takes more than 30 seconds to run the next cell, the generated matrix is probably too large to fit in memory on your computer. (It contains 100 million floating-point numbers, or roughly 800 megabytes of data.) You will probably have to restart the kernel. Then, try reducing p
in the call to generate_dataset
, say to 500 or 200.
In [ ]:
def generate_dataset(n, p):
error_magnitude = 10
theta_magnitude = 100
X = np.random.randn(n, p)
e = np.random.randn(n)*error_magnitude
true_theta = (np.random.rand(p)-.5)*2*theta_magnitude
Y = X @ true_theta + e
return (X, Y, true_theta)
X_big, Y_big, true_theta_big = generate_dataset(n = 100000, p = 1000)
The features are in the matrix X_big
, and the response values are in Y_big
. Notice also that the true $\theta$ is stored in true_theta_big
.
In [ ]:
theta14 = ...
theta14
In [ ]:
_ = ok.grade('q14')
_ = ok.backup()
Compare your theta14
to true_theta_big
, the true value that generated the data. There are 1000 differences, so we can't just read them. Instead, compute a histogram to show the distribution of differences. (It may be informative to divide the differences by the average magnitude of the elements of true_theta_big
.)
In [ ]:
...
You should find that most parameters are well-estimated. Note that this depends on the amount of data -- if we had only 2000 samples, the errors would be much larger. If you are interested, it would be informative to try that yourself by generating another dataset.
The function time
below will time any function you wish to run. See how long it takes to run the following steps for the big dataset:
sc.linalg.solve
).
In [ ]:
def time(f, count):
"""Report how long it takes to call f.
IMPORTANT NOTE:
The count should be chosen judiciously. Try 1 to start,
just to make sure your function doesn't take forever to
run. Then try 10, 100, etc, until it takes at least a
second for this function to finish timing.
Args:
f (function): A function that takes no arguments and
performs some computation.
count (int): The number of times to call the function.
If a function is very fast, calling it
more times will produce a more accurate
estimate of its speed.
Returns:
float: The length of time it takes (in seconds) to call
f, on average.
Example:
>>> time(lambda: 2 + 2, 1000000)
7.233020500279963e-08
"""
import timeit
return min(timeit.repeat(f, repeat=3, number=count)) / count
In [ ]:
...
In [ ]:
...
In [ ]:
X_big_t_X_big = X_big.T @ X_big # This should be useful.
...
In [ ]:
X_big_t_Y_big = X_big.T @ Y_big
X_big_t_X_big_inv = sc.linalg.inv(X_big.T @ X_big)
...
In [ ]:
...
Solving a linear system is generally thought to be slow. Based on your timing results, when $n$ is much larger than $p$, which step in linear regression is the slow one? (Note that a trick can be used to get similar results if $p$ is much larger than $n$! Numerical methods like stochastic gradient descent can be faster still.)
Write your answer here, replacing this text.
In [ ]:
i_finished_the_lab = False
In [ ]:
_ = ok.grade('qcompleted')
_ = ok.backup()
In [ ]:
_ = ok.submit()
Now, run this code in your terminal to make a
git commit
that saves a snapshot of your changes in git
. The last line of the cell
runs git push, which will send your work to your personal Github repo.
# Tell git to commit your changes to this notebook
git add sp17/lab/lab10/lab10.ipynb
# Tell git to make the commit
git commit -m "lab10 finished"
# Send your updates to your personal private repo
git push origin master