In [3]:
%%capture
%run "14 - Simple Linear Regression.ipynb"

The Model

Suppose we put forth a new hypothesis that makes use of more independent variables: $$minutes = \alpha + \beta_1 friends + \beta_2 work\ hours + \beta_3 phd + \epsilon$$

In multiple regression, we denote the vector of parameters as $\beta$. Then we use vectors for both $x_i$ and $\beta$:

beta = [alpha, beta_1, ..., beta_k]

x_i = [1, x_i1, ..., x_ik]

beta = [0.9, 0.7, 0.8, 0.6]

x = [1,  # constant term
     49, # number of friends
     4,  # work hours per day
     0]  # does not have a PhD

In Simple Linear Regression, our model was: $$y_i = \alpha + \beta x_i + \epsilon_i$$ In a multiple regression we will use: $$y_i = \alpha + \beta_1 x_{i1} + ... + \beta_k x_{ik} + \epsilon_i$$


In [4]:
def predict(x_i, beta):
    return dot(x_i, beta)
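For concreteness, here is the prediction for the example beta and x above, with `dot` re-defined from the earlier linear-algebra notebook so the sketch is self-contained:

```python
def dot(v, w):
    """Sum of componentwise products."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def predict(x_i, beta):
    return dot(x_i, beta)

beta = [0.9, 0.7, 0.8, 0.6]  # [alpha, beta_1, beta_2, beta_3]
x_i = [1,   # constant term
       49,  # number of friends
       4,   # work hours per day
       0]   # does not have a PhD

print(predict(x_i, beta))  # 0.9 + 0.7*49 + 0.8*4 + 0.6*0 ≈ 38.4
```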

Further Assumptions Of The Least Squares Model

For our model to be reasonable, we need a couple of further assumptions:

  • Columns of x are linearly independent. If this is not true, it will be impossible to estimate beta.
  • Columns of x are all uncorrelated with the errors $\epsilon$.
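To see why linear independence matters, consider a hypothetical sketch in which one column of x duplicates another: two different parameter vectors then produce identical predictions on every input, so least squares has no way to prefer one over the other:

```python
def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

# The third column is an exact copy of the second,
# so the columns are linearly dependent.
x_i = [1, 49, 49]

beta_a = [0.9, 0.7, 0.0]
beta_b = [0.9, 0.0, 0.7]  # weight shifted between the duplicate columns

# Identical predictions: the data cannot identify a unique beta.
print(dot(x_i, beta_a), dot(x_i, beta_b))
```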

Fitting The Model

As before, we choose beta to minimize the sum of squared errors, which we'll do with gradient descent.


In [5]:
def error(x_i, y_i, beta):
    return y_i - predict(x_i, beta)

def squared_error(x_i, y_i, beta):
    return error(x_i, y_i, beta) ** 2

def squared_error_gradient(x_i, y_i, beta):
    """Gradient of the squared error with respect to each beta_j."""
    return [-2 * x_ij * error(x_i, y_i, beta) for x_ij in x_i]

def estimate_beta(x, y):
    beta_initial = [random.random() for x_i in x[0]]  # one parameter per column of x
    return minimize_stochastic(squared_error,
                               squared_error_gradient,
                               x, y,
                               beta_initial,
                               0.001)

x = [1, # constant term,
     49, # number of friends
     4, # work hours per day
     0] # does not have a PhD

random.seed(0)
beta = estimate_beta([x], daily_minutes_clean)
beta


Out[5]:
[0.8564580492426055,
 1.347728091101083,
 0.46871637170108277,
 0.25891675029296335]

The output above does not match the book; I could not get the code to work as shown.
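One likely culprit is that `estimate_beta([x], daily_minutes_clean)` passes a single input row against the full list of y values. As a sanity check that the error and gradient functions themselves are sound, here is a hedged sketch that runs plain (non-stochastic) gradient descent on a small synthetic dataset with a known beta:

```python
import random

def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def predict(x_i, beta):
    return dot(x_i, beta)

def error(x_i, y_i, beta):
    return y_i - predict(x_i, beta)

def squared_error_gradient(x_i, y_i, beta):
    return [-2 * x_ij * error(x_i, y_i, beta) for x_ij in x_i]

random.seed(0)
true_beta = [4.0, 2.0]                        # intercept 4, slope 2
xs = [[1, i / 10] for i in range(100)]        # constant term plus one feature
ys = [predict(x_i, true_beta) for x_i in xs]  # noiseless targets

beta = [random.random(), random.random()]
step = 0.01
for _ in range(5000):
    # average the gradient over the whole dataset (batch gradient descent)
    grad = [0.0, 0.0]
    for x_i, y_i in zip(xs, ys):
        g = squared_error_gradient(x_i, y_i, beta)
        grad = [a + b / len(xs) for a, b in zip(grad, g)]
    beta = [b - step * g for b, g in zip(beta, grad)]

print(beta)  # recovers values close to [4.0, 2.0]
```

When the rows of x and the y values line up, the gradient machinery recovers the true coefficients, which suggests the problem in the cell above is the mismatched inputs rather than the gradient code.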

Digression: The Bootstrap

Suppose we have a sample of n points and want to estimate the median, but we don't know the underlying distribution. In this scenario we can bootstrap: repeatedly draw new samples of n points with replacement from our data and compute the median of each resample. The spread of those bootstrap medians tells us how much confidence to place in our estimate.


In [6]:
def bootstrap_sample(data):
    return [random.choice(data) for _ in data]

def bootstrap_statistic(data, stats_fn, num_samples):
    return [stats_fn(bootstrap_sample(data)) for _ in range(num_samples)]

In [11]:
close_to_100 = [99.5 + random.random() for _ in range(101)]

far_from_100 = ([99.5 + random.random()] +
                [random.random() for _ in range(50)] +
                [200 + random.random() for _ in range(50)])

In [12]:
bootstrap_statistic(close_to_100, median, 100)[:5]


Out[12]:
[100.07109087767446,
 100.07109087767446,
 99.97100234853127,
 99.95912040920284,
 99.92467543047273]

In [13]:
bootstrap_statistic(far_from_100, median, 100)[:5]


Out[13]:
[200.1243777414685,
 200.01680008514114,
 200.01680008514114,
 200.03822444747576,
 200.14616387265144]

In [14]:
standard_deviation(close_to_100), standard_deviation(far_from_100)


Out[14]:
(0.27414331046697443, 99.98457857099415)

Standard Errors of Regression Coefficients


In [ ]:
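The idea (not reproduced with `estimate_beta` above, since that cell didn't run cleanly) is to bootstrap the (x_i, y_i) pairs, re-estimate beta on each resample, and take each coefficient's standard deviation across resamples as its standard error. Here is a hedged sketch using a closed-form simple-regression fit as a stand-in for `estimate_beta`, so the example stays runnable:

```python
import math
import random

def bootstrap_sample(data):
    return [random.choice(data) for _ in data]

def fit_simple(pairs):
    """Closed-form least squares for one feature: returns (alpha, beta)."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs) /
             sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

random.seed(0)
data = [(x, 4 + 2 * x + random.gauss(0, 1)) for x in range(100)]

# re-fit the model on bootstrap resamples of the (x, y) pairs...
estimates = [fit_simple(bootstrap_sample(data)) for _ in range(100)]

# ...then the standard deviation of each coefficient across resamples
# is its bootstrap standard error
ses = {}
for j, name in enumerate(["alpha", "beta"]):
    vals = [est[j] for est in estimates]
    mean = sum(vals) / len(vals)
    ses[name] = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
print(ses)
```

A coefficient whose bootstrap standard error is small relative to its estimate is one the data pins down precisely; the same recipe applies to the multiple-regression beta once `estimate_beta` is fixed.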