In this example we explore the estimation of subjective wine quality for Portuguese wines via linear regression. We will implement linear regression in two different ways: a) using a few lines of Python code with gradient descent as the optimization procedure; b) using machine learning libraries such as Python's sklearn, which provide highly optimized implementations of many of these algorithms.
We use a Portuguese red and white wine database downloaded from http://archive.ics.uci.edu/ml/datasets/Wine+Quality and described in:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In this exercise we will use the red wine, which is the one I enjoy drinking the most...
In [1]:
%matplotlib inline
import pandas as pd #used for reading/writing data
import numpy as np #numeric library
from matplotlib import pyplot as plt #used for plotting
import sklearn #machine learning library
wineData = pd.read_csv('data/winequality/winequality-red.csv', sep=';')
We can inspect the contents of the data: We look at the first 10 entries to get a feeling of how the data is structured. Then we can also see some statistics of each column.
In [4]:
wineData.head(10)
Out[4]:
In [5]:
wineData.describe()
Out[5]:
All columns except the last one are the input parameters of the system, obtained from real wines through physicochemical tests. The last column holds the output value, based on sensory tests. To work with this data we need to split it into two variables:
In [6]:
wineDataInput = wineData.drop('quality', axis=1)
wineDataOutput = wineData['quality']
To perform optimization by gradient descent, we will need to implement the following formulas. In all of them, $h_\theta(x)$ is the prediction (i.e. output) of the linear regressor for input $x$.
Cost (m is the length of X):
$$\frac{1}{2m}\sum\limits_{i=1}^{m}(h_\theta(x^i)-y^i)^2$$Gradient of cost with respect to $\theta_j$ (m is the length of X):
$$\frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^i)-y^i)x^i_j$$Update rule (for every feature j, m is the length of X):
$$\theta_j := \theta_j + \frac{\alpha}{m}\sum\limits_{i=1}^{m}(y^i - h_\theta(x^i))x^i_j$$(Code adapted from https://kastnerkyle.github.io/posts/linear-regression/)
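Before implementing the full algorithm, we can do a quick sanity check (on a small made-up dataset, not the wine data) that the analytic gradient formula matches a finite-difference approximation of the cost:

```python
import numpy as np

# Toy dataset: 5 samples, bias column plus 2 features (values are illustrative only)
rng = np.random.RandomState(0)
X = np.hstack([np.ones((5, 1)), rng.randn(5, 2)])
y = rng.randn(5, 1)
theta = np.zeros((3, 1))
m = y.shape[0]

def cost(t):
    return 1. / (2. * m) * np.sum((X.dot(t) - y) ** 2)

# Gradient from the formula above, in vectorized form
analytic = 1. / m * X.T.dot(X.dot(theta) - y)

# Central finite-difference approximation of the same gradient
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.shape[0]):
    e = np.zeros_like(theta)
    e[j] = eps
    numeric[j] = (cost(theta + e) - cost(theta - e)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```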
A good first step is to normalize the input data to have zero mean and unit variance. Without normalization there is a high chance of divergence in datasets whose input variables have very different dynamic ranges. In the current database, if we do not apply normalization (try it below by commenting out the scale line) we need a very small learning rate (0.0005 or lower) to keep the algorithm from diverging, and many steps (around 1000) to reach a reasonable minimum. With normalized features we can use a much larger learning rate (0.05) without issues, and only 100 steps or so. Note that every database is a different world.
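As a quick illustration of what preprocessing.scale does, on a toy matrix (values made up) with very different dynamic ranges per column, each scaled column ends up with zero mean and unit variance:

```python
import numpy as np
from sklearn import preprocessing

# Two columns with very different ranges (illustrative values)
X = np.array([[1., 1000.],
              [2., 2000.],
              [3., 3000.]])
Xs = preprocessing.scale(X)

print(Xs.mean(axis=0))  # ~[0. 0.]
print(Xs.std(axis=0))   # [1. 1.]
```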
In [7]:
#prepare the data
from sklearn import preprocessing
#adapt output values to an array form
wineDataOutputArray = np.asarray(wineDataOutput.astype(float))
wineDataOutputArray = wineDataOutputArray[:, np.newaxis]
#adapt input features as an array form, with an extra '1' for \theta_0 term (bias term)
wineDataInputArray_original = np.asarray(wineDataInput.astype(float))
### comment the next line out to test regression on unnormalized data
wineDataInputArray = preprocessing.scale(wineDataInputArray_original)
wineDataInputArray = np.hstack((np.ones((wineDataInputArray.shape[0], 1)), wineDataInputArray))
#visualization of the unnormalized (top) and normalized (bottom) features
f, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, sharex='col', sharey='row')
f.set_figheight(7)
f.set_figwidth(15)
ax1.plot(wineDataInputArray_original[0:20, 0])
ax2.plot(wineDataInputArray_original[0:20, 1])
ax3.plot(wineDataInputArray_original[0:20, 2])
ax4.plot(wineDataInputArray[0:20, 1]) #note we added a column of 1's
ax5.plot(wineDataInputArray[0:20, 2])
ax6.plot(wineDataInputArray[0:20, 3])
Out[7]:
In [8]:
%%time
learning_rate = alpha = 0.05
iters = 1000

def mean_squared_error(theta, X, y):
    m = y.shape[0]
    return 1. / (2. * m) * np.sum((np.dot(X, theta) - y) ** 2.)

def l1_error(theta, X, y):
    m = y.shape[0]
    return 1. / m * np.sum(abs(np.dot(X, theta) - y))

def gradient_update(theta, X, y):
    m = y.shape[0]
    return 1. / m * np.dot(X.T, (np.dot(X, theta) - y))

def gradient_descent(X, y, alpha, iters):
    m = y.shape[0]
    all_cost = []
    #Initialize theta to zeros
    all_theta = [np.zeros((X.shape[1], 1))] #array of vectors
    for i in range(iters):
        cost = mean_squared_error(all_theta[-1], X, y)
        all_cost.append(cost)
        all_theta.append(all_theta[-1] - float(alpha) * gradient_update(all_theta[-1], X, y))
    return all_theta, all_cost

#Perform linear regression via gradient descent
all_theta, all_cost = gradient_descent(wineDataInputArray, wineDataOutputArray, alpha, iters)
Let's take a look at the mean squared error at each iteration, and at the final one.
In [9]:
plt.plot(all_cost[0:100])
Out[9]:
In [10]:
mean_squared_error(all_theta[-1], wineDataInputArray, wineDataOutputArray)
Out[10]:
As the value to be estimated is a subjective score between 3 and 8, we are most interested in knowing the average deviation of our hypotheses from the output values. We can compute this with the L1 error:
In [11]:
l1_error(all_theta[-1], wineDataInputArray, wineDataOutputArray)
Out[11]:
In this particular case we could consider that the outputs should be categorized after the regression. Of course, we could have approached the problem as a classification one, but that is for next class...
One quick change we can make here is to round every output hypothesis to the nearest integer and compute the error again.
In [12]:
def l1_error_int(theta, X, y):
    m = y.shape[0]
    return 1. / m * np.sum(abs(np.rint(np.dot(X, theta)) - y))
l1_error_int(all_theta[-1], wineDataInputArray, wineDataOutputArray)
Out[12]:
A bit of improvement over the standard L1 error.
Overall, there is plenty of room for improvement, but we see the system is able to produce scores in the same ballpark as the desired output.
Next, let's take a look at how the $\theta$ parameters have evolved over time to stable values:
In [13]:
aa = list(map(list, zip(*all_theta)))
f, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, sharex='col', sharey='row')
f.set_figheight(7)
f.set_figwidth(15)
ax1.plot(aa[1])
ax2.plot(aa[2])
ax3.plot(aa[3])
ax4.plot(aa[4])
ax5.plot(aa[5])
ax6.plot(aa[6])
Out[13]:
We can also see in a scatter plot how the predicted output compares with the known output:
In [14]:
plt.scatter(wineDataOutputArray, np.dot(wineDataInputArray, all_theta[-1]), color='black')
plt.ylabel('Hypothesis regression')
plt.xlabel('Known output value')
Out[14]:
It is clear that our regressor is under-shooting the hypothesized value, as it estimates values around 6 to 7 for known values of 8.
Although it is nice to implement algorithms ourselves and know all the nuts and bolts around them, we can also rely on ready-made algorithms from the sklearn library.
These have the advantage that internally they do not iterate the way our gradient descent does, but use optimization methods (or a closed-form solution) to speed up convergence and therefore make it more efficient. Although we might not see much of a difference in this small database, compare the timing of the algorithm above and the one in this section using the %time command.
In [16]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression(normalize=True)
As a reminder, here we want to fit the following: $$y = \theta_0 + \sum_{i=1}^{N} \theta_i x_i$$ Where $x_i$ is the $i^{th}$ input feature ($N=11$ in our case) with coefficients $\theta_i$ to be estimated (called "coef_" in sklearn), and $\theta_0$ the offset (called "intercept_" in sklearn).
Note that here we do not need to define any learning rate or number of iterations, as this sklearn implementation solves the equation in exact form. This is possible here because the number of wines is not too big, but it becomes intractable with many thousands of data points.
Note also that we have enabled normalization of the input parameters by passing "normalize=True" to the constructor.
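To make "exact form" concrete, here is a minimal sketch of the normal-equation solution $\theta = (X^TX)^{-1}X^Ty$ on synthetic data (the arrays and the true coefficients below are made up for illustration, not taken from the wine dataset):

```python
import numpy as np

# Synthetic data generated from known coefficients plus a little noise
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
Xb = np.hstack([np.ones((100, 1)), X])  # prepend a bias column for theta_0
true_theta = np.array([1., 2., -1., 0.5])
y = Xb.dot(true_theta) + 0.01 * rng.randn(100)

# Solve the normal equation via a least-squares solver (numerically stabler
# than explicitly inverting X^T X)
theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(theta)  # close to [1, 2, -1, 0.5]
```

No learning rate or iteration count appears anywhere: the solver computes the minimizer directly, which is why it gets expensive when the data matrix grows very large.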
In [17]:
%time lm.fit(wineDataInput, wineDataOutput)
Out[17]:
We can observe the estimated coefficients and the intercept (i.e. the $\theta_0$ parameter)
In [18]:
lm.intercept_
Out[18]:
In [19]:
lm.coef_
Out[19]:
Let's use these parameters to predict the outputs on our input set:
In [20]:
predictedQuality = lm.predict(wineDataInput)
predictedQuality[0:10]
Out[20]:
In [21]:
plt.scatter(wineDataOutput, predictedQuality, color='black')
plt.ylabel('Hypothesis regression')
plt.xlabel('Known output value')
Out[21]:
And we can compute the same metrics, which are consistent with our implementation above:
In [22]:
print(lm.score(wineDataInput, wineDataOutput))
print(1. / wineDataOutput.shape[0] * np.sum(abs(predictedQuality - wineDataOutput)))
In reality we would never do what we just did. Our experiment is ill-defined, as we use the same data for training and for testing. We should split the data into two subsets: a much bigger training one, and a testing one that the system never sees at training time and is only used to compute predictions.
In [23]:
from sklearn.model_selection import train_test_split
wineDataInput_train, wineDataInput_test, wineDataOutput_train, wineDataOutput_test = train_test_split(wineDataInput, wineDataOutput, test_size=0.3, random_state=0)
lm2 = LinearRegression(normalize=True)
lm2.fit(wineDataInput_train, wineDataOutput_train)
predicted = lm2.predict(wineDataInput_test)
print(lm2.score(wineDataInput_test, wineDataOutput_test))
print(1. / wineDataOutput_test.shape[0] * np.sum(abs(predicted - wineDataOutput_test)))
The results are similarly consistent with those obtained above (the small discrepancy of +0.2 is due to the fact that in this case the test data was unseen during training).