Linear Regression


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
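# 100 feature values drawn uniformly from [0, 1)
X = np.random.rand(100)
# Target: y = x plus Gaussian noise with standard deviation 0.1
y = X + 0.1 * np.random.randn(100)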

In [3]:
plt.scatter(X, y);
plt.show()


These steps follow the recipe laid out by Jake VanderPlas in his excellent text, Python Data Science Handbook. He has kindly made all of the book's code available on GitHub as well.

Step 1. Choose a class of model.

In this case we use linear regression.


In [4]:
from sklearn.linear_model import LinearRegression

Step 2. Choose model hyperparameters.


In [5]:
model = LinearRegression(fit_intercept=True)

Step 3. Arrange data into a features matrix and target vector.


In [6]:
X = X.reshape(-1, 1)

In [7]:
X.shape


Out[7]:
(100, 1)
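scikit-learn estimators expect the features as a 2-D array of shape (n_samples, n_features), which is why the 1-D array is reshaped above. The following is a small sketch of two equivalent ways to do this. The names x, X_a and X_b are illustrative only and are not part of the original notebook.

```python
import numpy as np

x = np.random.rand(100)      # 1-D array, shape (100,)

# -1 asks NumPy to infer the number of rows; 1 is the number of features
X_a = x.reshape(-1, 1)

# Equivalent: index with np.newaxis to append a trailing axis
X_b = x[:, np.newaxis]

print(x.shape, X_a.shape, X_b.shape)
print(np.array_equal(X_a, X_b))
```

The target vector y can stay 1-D, with shape (n_samples,).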

Step 4. Fit the model to your data.


In [8]:
model.fit(X, y)


Out[8]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [9]:
model.coef_


Out[9]:
array([ 0.97408915])

In [10]:
model.intercept_


Out[10]:
0.022535905418693603
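Since the data were generated as y = x plus noise, a slope near 1 and an intercept near 0 are what we would expect. As a rough cross-check, the same ordinary least squares fit can be sketched with NumPy alone. The seed and the variable names below are assumptions made for reproducibility, so the numbers will not match the outputs above exactly.

```python
import numpy as np

rng = np.random.default_rng(0)           # seed is an assumption, for reproducibility only
x = rng.random(100)
y = x + 0.1 * rng.standard_normal(100)

# Design matrix: a column of ones (intercept) next to the feature column
A = np.column_stack([np.ones_like(x), x])

# Solve the least squares problem min ||A @ beta - y||^2
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = beta

print(intercept, slope)
```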

If you are statistically trained, you would normally dig into further information, such as checking the normality of the residuals and testing for autocorrelation. You may also want to evaluate the estimated parameters. Those are valid statistical modelling questions.

The focus of machine learning is on prediction. You will not find this information in the output of scikit-learn's LinearRegression. Do take note of this key difference between statistics and machine learning.
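If you do want such diagnostics, a few of them can be computed by hand. The sketch below is one possible approach using NumPy only, not something scikit-learn provides. It refits the line on freshly generated data (the seed is an assumption), then computes the Durbin-Watson statistic for autocorrelation, the standard errors of the estimates, and the sample skewness of the residuals as a crude normality indicator.

```python
import numpy as np

rng = np.random.default_rng(0)           # seed is an assumption, for reproducibility only
x = rng.random(100)
y = x + 0.1 * rng.standard_normal(100)

# Ordinary least squares fit with an intercept column
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

# Durbin-Watson statistic: values near 2 suggest little first-order autocorrelation
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Standard errors from the usual OLS covariance estimate sigma^2 * (A^T A)^-1
n, p = A.shape
sigma2 = resid @ resid / (n - p)
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))

# Sample skewness: roughly 0 for symmetric (e.g. normal) residuals
skew = np.mean(resid ** 3) / np.mean(resid ** 2) ** 1.5

print("Durbin-Watson:", dw)
print("std errors (intercept, slope):", std_err)
print("residual skewness:", skew)
```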

Step 5. Predict labels for unknown data.


In [11]:
x_test = np.linspace(0, 1)
x_test


Out[11]:
array([ 0.        ,  0.02040816,  0.04081633,  0.06122449,  0.08163265,
        0.10204082,  0.12244898,  0.14285714,  0.16326531,  0.18367347,
        0.20408163,  0.2244898 ,  0.24489796,  0.26530612,  0.28571429,
        0.30612245,  0.32653061,  0.34693878,  0.36734694,  0.3877551 ,
        0.40816327,  0.42857143,  0.44897959,  0.46938776,  0.48979592,
        0.51020408,  0.53061224,  0.55102041,  0.57142857,  0.59183673,
        0.6122449 ,  0.63265306,  0.65306122,  0.67346939,  0.69387755,
        0.71428571,  0.73469388,  0.75510204,  0.7755102 ,  0.79591837,
        0.81632653,  0.83673469,  0.85714286,  0.87755102,  0.89795918,
        0.91836735,  0.93877551,  0.95918367,  0.97959184,  1.        ])

In [12]:
y_pred = model.predict(x_test.reshape(-1,1))

In [13]:
plt.scatter(X, y)
plt.plot(x_test, y_pred);
plt.show()
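For a fitted linear regression, predicting amounts to evaluating intercept + coef * x. As a sketch, the line plotted above can be reproduced by hand from the values reported in Out[9] and Out[10]. The name y_manual is illustrative only.

```python
import numpy as np

coef = 0.97408915                   # value reported in Out[9]
intercept = 0.022535905418693603    # value reported in Out[10]

x_test = np.linspace(0, 1)          # 50 evenly spaced points by default

# Evaluate the fitted line directly
y_manual = intercept + coef * x_test

print(y_manual[0], y_manual[-1])
```

Up to floating-point rounding, these values should agree with what model.predict returns for the same inputs.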