Linear Regression

In this example we will be exploring the estimation of subjective wine quality for Portuguese wines via linear regression. We will implement linear regression in two different ways: a) using some lines of python code with gradient descend as an optimization procedure; b)using machine learning libraries like python's sklearn that have highly optimized implementations of many of these algorithms.

1) Reading, inspecting and preparing the data

We use a Portuguese red and white wine database downloaded from http://archive.ics.uci.edu/ml/datasets/Wine+Quality and described in:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems>, Elsevier, 47(4):547-553. ISSN: 0167-9236.

In this exercise we will use the red wine, which is what I enjoy drinking more...


In [1]:
%matplotlib inline

import pandas as pd #used for reading/writing data 
import numpy as np #numeric library library
from matplotlib import pyplot as plt #used for plotting
import sklearn #machine learning library

wineData = pd.read_csv('data/winequality/winequality-red.csv', sep=';')

We can inspect the contents of the data: We look at the first 10 entries to get a feeling of how the data is structured. Then we can also see some statistics of each column.


In [4]:
wineData.head(10)


Out[4]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
5 7.4 0.66 0.00 1.8 0.075 13.0 40.0 0.9978 3.51 0.56 9.4 5
6 7.9 0.60 0.06 1.6 0.069 15.0 59.0 0.9964 3.30 0.46 9.4 5
7 7.3 0.65 0.00 1.2 0.065 15.0 21.0 0.9946 3.39 0.47 10.0 7
8 7.8 0.58 0.02 2.0 0.073 9.0 18.0 0.9968 3.36 0.57 9.5 7
9 7.5 0.50 0.36 6.1 0.071 17.0 102.0 0.9978 3.35 0.80 10.5 5

In [5]:
wineData.describe()


Out[5]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.636023
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.807569
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 3.000000
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 5.000000
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 6.000000
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.000000
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.000000

All columns except the last one are the input parameters of the system, obtained from real vines obtained through physicochemical tests. The last column shows the output value, based on sensory tests. To work with this data we will need to split it into two variables:


In [6]:
wineDataInput = wineData.drop('quality', axis=1)
wineDataOutput = wineData['quality']

3) Linear regression training using gradient descend way

To perform optimization by gradient descent, we will need to implement the following formulas. In all of these formulas, h(x) is the prediction (i.e. output) of x by the linear regressor.

Cost (m is the length of X):

$$\frac{1}{2m}\sum\limits_{i=1}^{m}(h_\theta(x^i)-y^i)^2$$

Gradient of cost with respect to theta (m is the length of X):

$$\frac{1}{m}(h_\theta(x)-y)$$

Update rule (for every j in X, m is the length of X):

$$\theta_j = \theta_j + \alpha\sum\limits_{i=1}^{m}(y^i - h_\theta(x^i))x^i_j$$

(Code adapted from https://kastnerkyle.github.io/posts/linear-regression/)

3.1) Features normalization

The first good thing to do is to normalize the input data to go have zero mean and unit variance. Without normalization we have a high chance of divergence in datasets containing input variables with very different dynamic ranges. In the current database if we do not apply normalization (try it below by commenting out the scale line) we will need to use very small (0.0005 or lower) learning rate to avoid the algorithm to diverge. Also we need many steps to get a reasonable minimum (1000) When normalizing features we can use a much bigger learning rate without issues, with a higher rate (0.05) and only 100 steps or so. Note that every database is a different world.


In [7]:
#prepare the data
from sklearn import preprocessing

#adapt output values to an array form
wineDataOutputArray = np.asarray(wineDataOutput.astype(float))
wineDataOutputArray = wineDataOutputArray[:, np.newaxis]
                       
#adapt input features as an array form, with an extra '1' for \theta_0 term (bias term)
wineDataInputArray_original = np.asarray(wineDataInput.astype(float))
### comment the next line out to test regression on unnormalized data
wineDataInputArray = preprocessing.scale(wineDataInputArray_original) 
wineDataInputArray = np.hstack((np.ones((wineDataInputArray.shape[0], 1)), wineDataInputArray))

#visualization of the normalized and unnormalized
f, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, sharex='col', sharey='row')
f.set_figheight(7)
f.set_figwidth(15)
ax1.plot(wineDataInputArray_original[0:20, 0])
ax2.plot(wineDataInputArray_original[0:20, 1])
ax3.plot(wineDataInputArray_original[0:20, 2])
ax4.plot(wineDataInputArray[0:20, 1]) #note we added a column of 1's
ax5.plot(wineDataInputArray[0:20, 2])
ax6.plot(wineDataInputArray[0:20, 3])


Out[7]:
[<matplotlib.lines.Line2D at 0x7f5ce24bee80>]

3.2) Linear regression implementation in Python code

We implement the gradient descend algorithm to estimate the best value for $\theta$ parameters. The result of this function are the best $\theta$ parameters and the mean squared error error in each step.


In [8]:
%time
#Initialize theta to zeros                                                      
learning_rate = alpha = 0.05                                                                    
iters = 1000    

def mean_squared_error(theta,X,y):                                                     
    m = y.shape[0]                      
    return 1. / (2. * m) * np.sum((np.dot(X, theta) - y) ** 2.)                            
                        
def l1_error(theta, X, y):
    m = y.shape[0]                      
    return 1. / m * np.sum(abs((np.dot(X, theta) - y)))                                
    
def gradient_update(theta,X,y):                                                
    m = y.shape[0]     
    return 1. / m * np.dot(X.T, (np.dot(X,theta) - y))                                 
                                                                                
def gradient_descent(X,y,alpha,iters):                                    
    m = y.shape[0]
    all_cost = []
    #Initialize theta to zeros
    all_theta = [np.zeros((X.shape[1],1))] #array of vectors   

    for i in range(iters):
        cost = mean_squared_error(all_theta[-1], X, y)
        all_cost.append(cost)
        all_theta.append(all_theta[-1] - float(alpha) * gradient_update(all_theta[-1], X, y))                       
    return all_theta,all_cost                                                       
                                                                             
#Perform linear regression via gradient descent
all_theta, all_cost = gradient_descent(wineDataInputArray, wineDataOutputArray, alpha, iters)


CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 4.29 µs

let's take a look at the mean square error w.r.t. each iteration, and the final one.


In [9]:
plt.plot(all_cost[0:100])


Out[9]:
[<matplotlib.lines.Line2D at 0x7f5ce23cef60>]

In [10]:
mean_squared_error(all_theta[-1], wineDataInputArray, wineDataOutputArray)


Out[10]:
0.20838390311799945

As the value to be estimated is a subjective metric between 3 and 8 we would be most interested in knowing what is the average deviation from the output value that our hypotheses values get. We can compute this with the L1 error:


In [11]:
l1_error(all_theta[-1], wineDataInputArray, wineDataOutputArray)


Out[11]:
0.50048160226452787

In this particular case we could consider that the outputs should be cathegorized after the regression. Of course, we could had approached the problem as a classification one, but this is next class...

In here one quick change we can do is to map all output hypotheses to the nearest integer, and compute error again.


In [12]:
def l1_error_int(theta, X, y):
    m = y.shape[0]                      
    return 1. / m * np.sum(abs((np.rint(np.dot(X, theta)) - y)))

l1_error_int(all_theta[-1], wineDataInputArray, wineDataOutputArray)


Out[12]:
0.43964978111319575

A bit of improvement over standard L1 error.

Over all, there is lots of room for improvement, but we see the system is able to get scores that are in the same ballpark as the desired output.

Next, let's take a look at how the $\theta$ parameters have evolved over time to stable values:


In [13]:
aa = list(map(list, zip(*all_theta)))
f, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, sharex='col', sharey='row')
f.set_figheight(7)
f.set_figwidth(15)
ax1.plot(aa[1])
ax2.plot(aa[2])
ax3.plot(aa[3])
ax4.plot(aa[4])
ax5.plot(aa[5])
ax6.plot(aa[6])


Out[13]:
[<matplotlib.lines.Line2D at 0x7f5cdfec39b0>]

Also, we can see in a scatter prot how the input vs output looks like:


In [14]:
plt.scatter(wineDataOutputArray, np.dot(wineDataInputArray, all_theta[-1]), color='black')
plt.ylabel('Hypothesis regression')
plt.xlabel('Known output value')


Out[14]:
<matplotlib.text.Text at 0x7f5ce24034a8>

It is clear that our regressor is clearly under-shooting the hypothesized value, as it estimates values around 6 to 7 for known values of 8

2) Running linear regression on the data with sklearn

Although it is nice to implement algorithms ourselves and know all the nuts and bolts around them, we can also rely on ready-made algorithms through the sklearn library.

These have the advantage that internally they do not perform iterations in the same way as we do in the gradient descend, but implement optimization methods (or use a closed-for solution) to speedup the convergence and therefore make it more efficient. Although we might not be able to see much in this small database, note the result of running the algorithm above and the one in this section by using the %time command.


In [16]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression(normalize=True)

As a reminder, here we want to fit the following: $$y = \theta_0 + \sum_{i=1}^{N} \theta_i x_i$$ Where $x_i$ is the $i^{th}$ input feature ($N=11$ in our case) with coefficients $\theta_i$ to be estimated (called "coefs_" in sklearn), and $\theta_0$ the offset (called "itercept_" in sklearn

Note that in here we do not need to define any learning rate or number of iterations, as the particular implementation we use of sklearn solves the equation in exact form. This is possible here because the number of wines is not too big, ;but becomes highly intractable if we have many thousands of data points.

Note also that we have enabled the normalization or input parameters by passing "normalize=True" to the constructor.


In [17]:
%time
lm.fit(wineDataInput, wineDataOutput)


CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 5.48 µs
Out[17]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)

We can observe the estimated coefficients and the intercept (i.e. the $\theta_0$ parameter)


In [18]:
lm.intercept_


Out[18]:
21.96520844945173

In [19]:
lm.coef_


Out[19]:
array([  2.49905527e-02,  -1.08359026e+00,  -1.82563948e-01,
         1.63312698e-02,  -1.87422516e+00,   4.36133331e-03,
        -3.26457970e-03,  -1.78811638e+01,  -4.13653144e-01,
         9.16334413e-01,   2.76197699e-01])

Let's use these parameters to predict the outputs on our input set:


In [20]:
predictedQuality = lm.predict(wineDataInput)
predictedQuality[0:10]


Out[20]:
array([ 5.03285045,  5.13787975,  5.20989474,  5.69385794,  5.03285045,
        5.06557035,  5.1070693 ,  5.34370699,  5.33670489,  5.65840581])

In [21]:
plt.scatter(wineDataOutput, predictedQuality,  color='black')
plt.ylabel('Hypothesis regression')
plt.xlabel('Known output value')


Out[21]:
<matplotlib.text.Text at 0x7f5cdd9a30f0>

And we can compute the same metrics, which are consistent with our implementation above:


In [22]:
print(lm.score(wineDataInput, wineDataOutput))
print(1. / wineDataOutput.shape[0] * np.sum(abs(predictedQuality - wineDataOutput)))


0.360551703039
0.500489963564

4) Train versus test datasets

In reality we would never do what we just did. Our experiment is ill-defined, as we use the same data for training and for testing. We should split the data into two subsets, being the training one much bigger, and the testing one not seen at all by the system at training time, except to compute the prediction on the input variable.


In [23]:
from sklearn.model_selection import train_test_split

wineDataInput_train, wineDataInput_test, wineDataOutput_train, wineDataOutput_test = train_test_split(wineDataInput, wineDataOutput, test_size=0.3, random_state=0)
lm2 = LinearRegression(normalize=True)
lm2.fit(wineDataInput_train, wineDataOutput_train)
predicted = lm2.predict(wineDataInput_test)
print(lm.score(wineDataInput_test, wineDataOutput_test))
print(1. / wineDataOutput_test.shape[0] * np.sum(abs(predicted - wineDataOutput_test)))


0.337666758102
0.487126216459

Similarly consistent with results obtained above (the small discrepancy of +0.2 is due to the fact that this test data was unknown to training in this case)