<img src="http://imgur.com/gallery/pHkA0", alt="let the reader be warned, i'm not a wine expert">
The wine industry makes billions in revenue each year. Wine is touted as everything from a health product to a status symbol. Needless to say, wine is a big deal. But to the layman, determining what's a good purchase is often a guessing game. Labels are confusing and the entire business is coated in mystery. Today we're going to take a stab at lessening the confusion.
We're going to determine the relationship between the quantity of alcohol in a wine and its quality using data from the UCI Machine Learning Repository.
In order to test this, our null hypothesis is going to be that there is no relationship. Our alternative hypothesis will be that there is. We'll put this into the context of our model (linear regression) after a brief explanation.
The stronger the relationship between our predictor and response variable, the better linear regression will work.
To reflect that this is a model estimated from a sample and not the true population, all our estimated variables put on hats.
$$ \hat{y} = \hat{\beta_{0}} + \hat{\beta_{1}} * x $$So in the context of linear regression, our hypotheses with only symbols:

$$ H_{0}: \beta_{1} = 0 \qquad H_{a}: \beta_{1} \neq 0 $$
We need to figure out if our slope differs significantly from 0. Since significance is subjective, we have to choose a point beyond which we consider a result significant. This point is called alpha, and we're going to set it to 0.05.
$$ \alpha = 0.05 $$To re-orient ourselves, what we're trying to do is find out how confident we are that the slope of the regression line we found is close to the true slope: that is, the range of values we think the true slope could take, based on the estimated slope. The probability of finding a slope at least as extreme as ours, given that the true slope is 0, is called the p-value, and computing it is part of a t-test. Specifically, the p-value is the proportion of scores that would fall beyond our t-score.
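To make the p-value idea concrete, here's a minimal sketch with scipy; the t-score and degrees of freedom below are hypothetical placeholders, not values from our wine data:

```python
import scipy.stats

t_score = 2.5             # hypothetical t-score
degrees_of_freedom = 100  # hypothetical sample size minus 2

# proportion of the t distribution falling beyond +/- t_score (two-tailed),
# using the survival function sf (1 - cdf)
p_value = 2 * scipy.stats.t.sf(t_score, degrees_of_freedom)
```

Since the p-value here is below 0.05, this hypothetical slope would be deemed significant at our chosen alpha.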
Let's make a quick list of the items we'll need to create our model: the slope, R, and R squared. We're going to use Python's pandas library, which puts our data in an easy-to-manipulate table called a DataFrame.
In [8]:
import pandas as pd
import matplotlib.pyplot as plt
from math import sqrt
%matplotlib inline
In [9]:
red_wine_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(red_wine_URL, sep=";")
In [10]:
df.head(5)
Out[10]:
A scatter plot will help us get an idea if there is any relationship.
In [11]:
df.plot(kind='scatter', x='alcohol', y='quality')
Out[11]:
The data is really cramped; it would help to see the density of our points. Let's adjust the alpha (transparency) and s (size) to get a better idea.
In [12]:
df.plot(kind='scatter', x='alcohol', y='quality', s=50, alpha=0.03)
Out[12]:
Ok, well, the slope, if any, is positive but very weak.
The equation for the slope of the regression line can be constructed as the covariance of the predictor and response over the variance of the predictor:

$$ \hat{\beta_{1}} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} $$
In [13]:
slope = df.alcohol.cov(df.quality)/df.alcohol.var()
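As a quick sanity check of the covariance-over-variance formula, here's a sketch on made-up toy data (not our wine data), comparing against numpy's least-squares fit:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([2., 4., 5., 4., 6.])

# slope as the sample covariance of x and y over the sample variance of x
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# numpy's degree-1 polynomial fit recovers the same slope
fit_slope, fit_intercept = np.polyfit(x, y, 1)
```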
Let's overlay that line on our scatter plot. First, the equation for the intercept is as follows: the intercept equals the mean of the response variable minus the product of the estimated slope and the mean of the predictor variable.

$$ \hat{\beta_{0}} = \bar{y} - \hat{\beta_{1}} \bar{x} $$
In [14]:
intercept = df.quality.mean() - (slope * df.alcohol.mean())
In [15]:
expected = (slope * df.alcohol) + intercept
With the fitted values in hand, let's plot the regression line over the scatter plot:
In [16]:
plt.scatter(x=df.alcohol, y=df.quality, s=70, alpha=0.03)
plt.plot(df.alcohol, expected, color="brown")
plt.show()
Degrees of freedom are the number of ways in which our calculation of the slope is free to vary. Another, maybe easier, way to think of it: as we add more predictor variables (we only have the one here), the more we erode our calculation with complications. The general formula for degrees of freedom is:

degrees of freedom = (number of observations) - (number of predictor variables) - 1 for the intercept (since it helps predict too)

In symbols:
In [17]:
degrees_of_freedom = df.shape[0] - 1 - 1
In [18]:
degrees_of_freedom
Out[18]:
The equation for the standard error of the slope is: the standard error of the slope equals the standard error of the estimate divided by the square root of the sum of squared deviations of the predictor variable.

$$ SE(\hat{\beta_{1}}) = \frac{\sqrt{\frac{1}{n-2}\sum_{i}(y_{i} - \hat{y}_{i})^{2}}}{\sqrt{\sum_{i}(x_{i} - \bar{x})^{2}}} $$

Expanded out, that's:
In [19]:
n = df.shape[0]
num = sqrt((1.0/(n-2) * ((df.quality - expected)**2).sum()))
dem = sqrt(((df.alcohol - df.alcohol.mean())**2).sum())
standard_error_of_the_slope = num/dem
standard_error_of_the_slope
Out[19]:
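To check that this formula matches a library implementation, here's a sketch on random toy data (not our wine data), comparing against `scipy.stats.linregress`, which reports the same slope standard error:

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2 * x + rng.normal(size=50)
n = len(x)

# slope and intercept from the same formulas we used above
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()
residuals = y - (slope * x + intercept)

# residual standard error (n-2 degrees of freedom) over the spread of x
se_slope = np.sqrt((residuals**2).sum() / (n - 2)) / np.sqrt(((x - x.mean())**2).sum())

result = scipy.stats.linregress(x, y)
```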
In [20]:
# the residual standard error uses n-2 degrees of freedom; pandas' std lets us set that via ddof
standard_error_of_the_slope = (df.quality - expected).std(ddof=2)/sqrt(((df.alcohol - df.alcohol.mean())**2).sum())
standard_error_of_the_slope
Out[20]:
The t-score can be found with the following formula: the t-score equals the estimated slope minus the hypothesized slope, divided by the standard error of the slope.

$$ t = \frac{\hat{\beta_{1}} - \beta_{1}}{SE(\hat{\beta_{1}})} $$

But $\beta_{1}$ is considered zero if the hypothesis is that x (alcohol) and y (quality) aren't related.
In [21]:
t_score = (slope - 0)/standard_error_of_the_slope
t_score
Out[21]:
The t critical value is the point beyond which our finding (our slope) is deemed significant.
In [22]:
import scipy.stats
# recall that we set our alpha to equal 0.05
alpha = 0.05
# we divide alpha by two because our test is two-tailed (i.e. we care if our slope is off in either direction)
# ppf is the inverse of the cdf, giving the t value below which 1 - alpha/2 of the distribution lies
t_critical_value = scipy.stats.t.ppf(1 - alpha/2, degrees_of_freedom)
t_critical_value
Out[22]:
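A note on scipy here: `scipy.stats.t.ppf` is the inverse of the cdf, so for a two-tailed test the critical value comes from `ppf(1 - alpha/2, dof)`. A minimal sketch, using a hypothetical large degrees-of-freedom value:

```python
import scipy.stats

alpha = 0.05
dof = 1597  # hypothetical large sample

# the value of t below which 97.5% of the distribution lies
crit = scipy.stats.t.ppf(1 - alpha/2, dof)
```

With many degrees of freedom, the t distribution approaches the standard normal, so this critical value lands close to the familiar 1.96.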
Lets recap the most relevant values:
In [23]:
t_score
Out[23]:
In [24]:
t_critical_value
Out[24]:
We reject the null hypothesis when our t_score is greater than our t_critical_value (strictly, when its absolute value is, but our slope is positive).
In [25]:
reject_null = t_score > t_critical_value
reject_null
Out[25]:
Before we go any further, let's make sure we did everything right. You may be happy to know I wasted a lot of time figuring out how to do things the hard(er) way.
In [26]:
# statsmodels contains a lot of great tools for statisticians
import statsmodels.formula.api as smf
est = smf.ols(formula="quality ~ alcohol", data=df).fit()
# we'll check if their t-score is roughly equal to our t_score
round(est.tvalues.alcohol,5) == round(t_score, 5)
Out[26]:
Concerning our hypothesis:
In [27]:
print "with an alpha score of {0}, we found a t-critical value of {1} and a t score of \
{2} and so we {3}reject the Null hypothesis".format(alpha, round(t_critical_value,2), round(t_score,2), {False: "Failed to ", True: ""}[reject_null])
This means that we can be sure there is a significant, though slight, positive correlation between the quantity of alcohol in a wine and its "quality".
With statsmodels' help, we can find out how much of the variation in quality is explained by variation in alcohol by examining the coefficient of determination (adjusted R squared):
In [28]:
est.rsquared_adj
Out[28]:
Combining the two ideas, it's clear that while there is a relationship, it's weak.
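The coefficient of determination can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares; a minimal sketch on made-up toy data:

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([2., 4., 5., 4., 6.])

slope, intercept = np.polyfit(x, y, 1)
predicted = slope * x + intercept

sse = ((y - predicted)**2).sum()  # residual sum of squares
sst = ((y - y.mean())**2).sum()   # total sum of squares
r_squared = 1 - sse / sst
```

For simple regression like this, R squared equals the squared correlation between x and y.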
So what can we conclude? Well, like so many studies, we can conclude that more work needs to be done. The logical next step would be to run a multiple regression using the other features we ignored, making sure we prune them intelligently.