In this example we explore the estimation of subjective wine quality for Portuguese wines via linear regression. We will implement linear regression in two different ways: a) using a few lines of Python code with gradient descent as the optimization procedure; b) using machine learning libraries such as Python's sklearn, which provide highly optimized implementations of many of these algorithms.
We use a Portuguese red and white wine database downloaded from http://archive.ics.uci.edu/ml/datasets/Wine+Quality and described in:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In this exercise we will use the red wine, which is the one I enjoy drinking the most...
In [1]:
%matplotlib inline
import pandas as pd #used for reading/writing data
import numpy as np #numeric library
from matplotlib import pyplot as plt #used for plotting
import sklearn #machine learning library
wineData = pd.read_csv('data/winequality/winequality-red.csv', sep=';')
We can inspect the contents of the data: We look at the first 10 entries to get a feeling of how the data is structured. Then we can also see some statistics of each column.
In [4]:
wineData.head(10)
Out[4]:
In [5]:
wineData.describe()
Out[5]:
All columns except the last one are the input parameters of the system, obtained from real wines through physicochemical tests. The last column holds the output value, based on sensory tests. To work with this data we need to split it into two variables:
In [6]:
wineDataInput = wineData.drop('quality', axis=1)
wineDataOutput = wineData['quality']
To perform optimization by gradient descent, we will need to implement the following formulas. In all of them, $h_\theta(x)$ is the prediction (i.e. output) of the linear regressor for input $x$.
Cost (m is the length of X):
$$\frac{1}{2m}\sum\limits_{i=1}^{m}(h_\theta(x^i)-y^i)^2$$Gradient of cost with respect to $\theta_j$ (m is the length of X):
$$\frac{1}{m}\sum\limits_{i=1}^{m}(h_\theta(x^i)-y^i)x^i_j$$Update rule (for every feature j, m is the length of X):
$$\theta_j := \theta_j + \frac{\alpha}{m}\sum\limits_{i=1}^{m}(y^i - h_\theta(x^i))x^i_j$$(Code adapted from https://kastnerkyle.github.io/posts/linear-regression/)
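Before implementing the full algorithm, we can do a quick sanity check (on a small made-up dataset, not the wine data) that the analytic gradient formula matches a finite-difference approximation of the cost:

```python
import numpy as np

# Toy dataset: 5 samples, bias column plus 2 features (values are illustrative only)
rng = np.random.RandomState(0)
X = np.hstack([np.ones((5, 1)), rng.randn(5, 2)])
y = rng.randn(5, 1)
theta = np.zeros((3, 1))
m = y.shape[0]

def cost(t):
    return 1. / (2. * m) * np.sum((X.dot(t) - y) ** 2)

# Gradient from the formula above, in vectorized form
analytic = 1. / m * X.T.dot(X.dot(theta) - y)

# Central finite-difference approximation of the same gradient
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.shape[0]):
    e = np.zeros_like(theta)
    e[j] = eps
    numeric[j] = (cost(theta + e) - cost(theta - e)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```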
A good first step is to normalize the input data to have zero mean and unit variance. Without normalization there is a high chance of divergence in datasets whose input variables have very different dynamic ranges. In the current database, if we do not apply normalization (try it below by commenting out the scale line) we need a very small learning rate (0.0005 or lower) to keep the algorithm from diverging, and many steps (around 1000) to reach a reasonable minimum. With normalized features we can use a much larger learning rate (0.05) without issues, and only 100 steps or so. Note that every database is a different world.
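As a quick illustration of what preprocessing.scale does, on a toy matrix (values made up) with very different dynamic ranges per column, each scaled column ends up with zero mean and unit variance:

```python
import numpy as np
from sklearn import preprocessing

# Two columns with very different ranges (illustrative values)
X = np.array([[1., 1000.],
              [2., 2000.],
              [3., 3000.]])
Xs = preprocessing.scale(X)

print(Xs.mean(axis=0))  # ~[0. 0.]
print(Xs.std(axis=0))   # [1. 1.]
```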
In [7]:
#prepare the data
from sklearn import preprocessing
#adapt output values to an array form
wineDataOutputArray = np.asarray(wineDataOutput.astype(float))
wineDataOutputArray = wineDataOutputArray[:, np.newaxis]
#adapt input features as an array form, with an extra '1' for \theta_0 term (bias term)
wineDataInputArray_original = np.asarray(wineDataInput.astype(float))
### comment the next line out to test regression on unnormalized data
wineDataInputArray = preprocessing.scale(wineDataInputArray_original)
wineDataInputArray = np.hstack((np.ones((wineDataInputArray.shape[0], 1)), wineDataInputArray))
#visualization of the unnormalized (top) and normalized (bottom) features
f, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, sharex='col', sharey='row')
f.set_figheight(7)
f.set_figwidth(15)
ax1.plot(wineDataInputArray_original[0:20, 0])
ax2.plot(wineDataInputArray_original[0:20, 1])
ax3.plot(wineDataInputArray_original[0:20, 2])
ax4.plot(wineDataInputArray[0:20, 1]) #note we added a column of 1's
ax5.plot(wineDataInputArray[0:20, 2])
ax6.plot(wineDataInputArray[0:20, 3])
Out[7]:
In [8]:
%%time
learning_rate = alpha = 0.05
iters = 1000

def mean_squared_error(theta, X, y):
    m = y.shape[0]
    return 1. / (2. * m) * np.sum((np.dot(X, theta) - y) ** 2.)

def l1_error(theta, X, y):
    m = y.shape[0]
    return 1. / m * np.sum(abs(np.dot(X, theta) - y))

def gradient_update(theta, X, y):
    m = y.shape[0]
    return 1. / m * np.dot(X.T, (np.dot(X, theta) - y))

def gradient_descent(X, y, alpha, iters):
    m = y.shape[0]
    all_cost = []
    #Initialize theta to zeros
    all_theta = [np.zeros((X.shape[1], 1))] #array of vectors
    for i in range(iters):
        cost = mean_squared_error(all_theta[-1], X, y)
        all_cost.append(cost)
        all_theta.append(all_theta[-1] - float(alpha) * gradient_update(all_theta[-1], X, y))
    return all_theta, all_cost

#Perform linear regression via gradient descent
all_theta, all_cost = gradient_descent(wineDataInputArray, wineDataOutputArray, alpha, iters)
Let's take a look at the mean squared error at each iteration, and at the final one.
In [9]:
plt.plot(all_cost[0:100])
Out[9]:
In [10]:
mean_squared_error(all_theta[-1], wineDataInputArray, wineDataOutputArray)
Out[10]:
As the value to be estimated is a subjective score between 3 and 8, we are most interested in knowing the average deviation of our hypotheses from the output values. We can compute this with the L1 error:
In [11]:
l1_error(all_theta[-1], wineDataInputArray, wineDataOutputArray)
Out[11]:
In this particular case we could consider that the outputs should be categorized after the regression. Of course, we could have approached the problem as a classification one, but that is for next class...
One quick change we can make here is to round every output hypothesis to the nearest integer and compute the error again.
In [12]:
def l1_error_int(theta, X, y):
    m = y.shape[0]
    return 1. / m * np.sum(abs(np.rint(np.dot(X, theta)) - y))
l1_error_int(all_theta[-1], wineDataInputArray, wineDataOutputArray)
Out[12]:
A bit of improvement over the standard L1 error.
Overall, there is plenty of room for improvement, but we see the system is able to produce scores in the same ballpark as the desired output.
Next, let's take a look at how the $\theta$ parameters have evolved over time to stable values:
In [13]:
aa = list(map(list, zip(*all_theta)))
f, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, sharex='col', sharey='row')
f.set_figheight(7)
f.set_figwidth(15)
ax1.plot(aa[1])
ax2.plot(aa[2])
ax3.plot(aa[3])
ax4.plot(aa[4])
ax5.plot(aa[5])
ax6.plot(aa[6])
Out[13]:
We can also see in a scatter plot how the predicted output compares with the known output:
In [14]:
plt.scatter(wineDataOutputArray, np.dot(wineDataInputArray, all_theta[-1]), color='black')
plt.ylabel('Hypothesis regression')
plt.xlabel('Known output value')
Out[14]:
It is clear that our regressor is under-shooting the hypothesized value, as it estimates values around 6 to 7 for known values of 8.
Although it is nice to implement algorithms ourselves and know all the nuts and bolts around them, we can also rely on ready-made algorithms from the sklearn library.
These have the advantage that internally they do not iterate the way our gradient descent does, but use optimization methods (or a closed-form solution) to speed up convergence and therefore make it more efficient. Although we might not see much of a difference in this small database, compare the timing of the algorithm above and the one in this section using the %time command.
In [16]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression(normalize=True)
As a reminder, here we want to fit the following: $$y = \theta_0 + \sum_{i=1}^{N} \theta_i x_i$$ Where $x_i$ is the $i^{th}$ input feature ($N=11$ in our case) with coefficients $\theta_i$ to be estimated (called "coef_" in sklearn), and $\theta_0$ the offset (called "intercept_" in sklearn).
Note that here we do not need to define any learning rate or number of iterations, as this sklearn implementation solves the equation in exact form. This is possible here because the number of wines is not too big, but it becomes intractable with many thousands of data points.
Note also that we have enabled normalization of the input parameters by passing "normalize=True" to the constructor.
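To make "exact form" concrete, here is a minimal sketch of the normal-equation solution $\theta = (X^TX)^{-1}X^Ty$ on synthetic data (the arrays and the true coefficients below are made up for illustration, not taken from the wine dataset):

```python
import numpy as np

# Synthetic data generated from known coefficients plus a little noise
rng = np.random.RandomState(0)
X = rng.randn(100, 3)
Xb = np.hstack([np.ones((100, 1)), X])  # prepend a bias column for theta_0
true_theta = np.array([1., 2., -1., 0.5])
y = Xb.dot(true_theta) + 0.01 * rng.randn(100)

# Solve the normal equation via a least-squares solver (numerically stabler
# than explicitly inverting X^T X)
theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(theta)  # close to [1, 2, -1, 0.5]
```

No learning rate or iteration count appears anywhere: the solver computes the minimizer directly, which is why it gets expensive when the data matrix grows very large.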
In [17]:
%time lm.fit(wineDataInput, wineDataOutput)
Out[17]:
We can observe the estimated coefficients and the intercept (i.e. the $\theta_0$ parameter)
In [18]:
lm.intercept_
Out[18]:
In [19]:
lm.coef_
Out[19]:
Let's use these parameters to predict the outputs on our input set:
In [20]:
predictedQuality = lm.predict(wineDataInput)
predictedQuality[0:10]
Out[20]:
In [21]:
plt.scatter(wineDataOutput, predictedQuality, color='black')
plt.ylabel('Hypothesis regression')
plt.xlabel('Known output value')
Out[21]:
And we can compute the same metrics, which are consistent with our implementation above:
In [22]:
print(lm.score(wineDataInput, wineDataOutput))
print(1. / wineDataOutput.shape[0] * np.sum(abs(predictedQuality - wineDataOutput)))
In reality we would never do what we just did. Our experiment is ill-defined, as we use the same data for training and for testing. We should split the data into two subsets: a much bigger training one, and a testing one that the system never sees at training time and is only used to compute predictions.
In [23]:
from sklearn.model_selection import train_test_split
wineDataInput_train, wineDataInput_test, wineDataOutput_train, wineDataOutput_test = train_test_split(wineDataInput, wineDataOutput, test_size=0.3, random_state=0)
lm2 = LinearRegression(normalize=True)
lm2.fit(wineDataInput_train, wineDataOutput_train)
predicted = lm2.predict(wineDataInput_test)
print(lm2.score(wineDataInput_test, wineDataOutput_test))
print(1. / wineDataOutput_test.shape[0] * np.sum(abs(predicted - wineDataOutput_test)))
The results are similarly consistent with those obtained above (the small discrepancy of +0.2 is due to the fact that in this case the test data was unseen during training).