Demystifying what makes a quality wine

<img src="http://imgur.com/gallery/pHkA0" alt="let the reader be warned, i'm not a wine expert">

Overview

  • Introduction: About Wine
  • Research Question and Hypothesis: What determines the quality of red wine?
    • Wine is a big deal
    • About linear regression
    • The hypothesis
  • Experimental Design:
    • analysis plan
    • tools
  • Creating the models
    • setting up our data
    • taking a peek at the data
    • finding the..
      • slope
      • intercept
      • line
      • degrees of freedom
      • standard error of the line
      • t score
      • t critical value
  • Results: Putting together our numbers in one place.
  • Conclusions: Does Alcohol determine the quality of red wine?

Introduction: About Wine

The wine industry makes billions in revenue each year. Wine is touted as everything from a health product to a status symbol. Needless to say, wine is a big deal. But to the layman, determining what's a good purchase is often a guessing game. Labels are confusing and the entire business is coated in mystery. Today we're going to take a stab at lessening the confusion.

Research Question and Hypothesis: What determines the quality of red wine?

Wine is a big deal

We're going to determine the relationship between the quantity of alcohol in a wine and its quality, using data from the UCI Machine Learning Repository.

In order to test this, our null hypothesis is going to be that there is no relationship. Our alternative hypothesis will be that there is. We'll put this into the context of our model (linear regression) after a brief explanation.

About Linear Regression

The stronger the relationship between our predictor and response variable, the better linear regression will work.

$$ response = intercept + (slope * predictor) $$

$$ y = \beta_{0} + \beta_{1} * x $$

To reflect that this is a model estimated from a sample and not the true population, all our estimated variables put on hats.

$$ \hat{y} = \hat{\beta_{0}} + \hat{\beta_{1}} * x $$
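The fitted model is just a function that maps a predictor value through the two estimates. A toy sketch (with invented coefficients, not estimates from any data) makes that concrete:

```python
# hypothetical fitted coefficients, invented purely for illustration
intercept_hat = 1.9   # beta_0 hat
slope_hat = 0.36      # beta_1 hat

def predict(x):
    """y hat = beta_0 hat + beta_1 hat * x"""
    return intercept_hat + slope_hat * x

print(predict(10.0))  # 1.9 + 0.36 * 10 = 5.5
```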

The Hypothesis

So, in the context of linear regression, our hypotheses are:

  • Null Hypothesis: The slope of the regression line is equal to zero.
  • Alternative Hypothesis: The slope of the regression line is NOT equal to zero.

with only symbols:

$$ H_{o}: \beta_{1} = 0 $$

$$ H_{a}: \beta_{1} \neq 0 $$

Experimental Design:

Analysis plan

We need to figure out if our slope differs significantly from 0. Since significance is subjective, we have to choose a point past which we consider a result significant. This point is called alpha, and we're going to set it to 0.05.

$$ \alpha = 0.05 $$

To re-orient ourselves: what we're trying to do is find out how confident we are that the slope of the regression line we found is close to the true slope, i.e. the range of values we think the true slope could take based on the estimated slope. The process of figuring out how unlikely our slope is, given that the true slope is 0, is a t-test, and the probability it produces is called the p-value. Specifically, the p-value is the proportion of scores that would fall beyond a given t-score.

Let's make a quick list of the items we'll need:

  • estimated slope (which we'll call slope)
  • degrees of freedom of the slope
  • standard error of the slope ~ necessary to calculate the confidence interval
  • t-score
  • t-critical score
  • p-value.
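To sketch how the last three items relate (with made-up numbers, not our wine data yet ~ scipy's t distribution functions do the heavy lifting):

```python
from scipy import stats

t_score = 2.5   # made-up test statistic
dof = 100       # made-up degrees of freedom
alpha = 0.05

# two-tailed p-value: the proportion of the t distribution beyond |t_score|
p_value = 2 * (1 - stats.t.cdf(abs(t_score), dof))

# two-tailed critical value: the cutoff beyond which p < alpha
t_critical = stats.t.ppf(1 - alpha / 2, dof)

print(p_value < alpha)            # significant at alpha = 0.05
print(abs(t_score) > t_critical)  # the two checks always agree
```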

Tools:

To create our model and find values like the slope and R squared, we're going to use Python's pandas library. It puts our data into an easy-to-manipulate table called a data frame.

Creating our models

Setting up our data


In [8]:
import pandas as pd
import matplotlib.pyplot as plt
from math import sqrt
%matplotlib inline

In [9]:
red_wine_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
df = pd.read_csv(red_wine_URL, sep=";")

Taking a peek at the data


In [10]:
df.head(5)


Out[10]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25 67 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15 54 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17 60 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5

A scatter plot will help us get an idea if there is any relationship.


In [11]:
df.plot(kind='scatter', x='alcohol', y='quality')


Out[11]:
<matplotlib.axes.AxesSubplot at 0x7fe382bb3910>

The data is really cramped; it would help to see the density of our points. Let's adjust the alpha (transparency) and s (size) to get a better idea.


In [12]:
df.plot(kind='scatter', x='alcohol', y='quality', s=50, alpha=0.03)


Out[12]:
<matplotlib.axes.AxesSubplot at 0x7fe382c2f1d0>

OK, well: the slope, if any, is positive but very weak.

Let's find that slope.

The equations for the slope of the regression line can be constructed as:

$$ \hat\beta = \frac{ \operatorname{Cov}[x,y] }{ \operatorname{Var}[x] } $$

In [13]:
slope = df.alcohol.cov(df.quality)/df.alcohol.var()

Let's overlay that on our scatter plot. But first, let's find the intercept ~ we'll need it to find the line.

The equation for the intercept is as follows:

the intercept equals the mean of the response variable minus the product of the estimated slope and the mean of the predictor variable

$$ \hat{\beta_{0}} = \bar{y} - \hat{\beta_{1}}\bar{x} $$

In [14]:
intercept = df.quality.mean() - (slope * df.alcohol.mean())

Let's find the line ~ the estimated y


In [15]:
expected = (slope * df.alcohol) + intercept
$$ \hat{y} = (slope * x) + intercept $$

re-written again...

$$ \hat{y} = \hat{\beta_{1}}x + \hat{\beta_{0}} $$

In [16]:
plt.scatter(x=df.alcohol, y=df.quality, s=70, alpha=0.03)
plt.plot(df.alcohol, expected, color="brown")
plt.show()


Let's find the degrees of freedom

Degrees of freedom are the number of ways in which our calculation of the slope is free to vary. Another, maybe easier, way to think of it: each additional predictor variable we estimate (we only have the one here) uses up information and erodes our calculation. The general formula for degrees of freedom is:

degrees of freedom = (number of observations) - (number of predictor variables) - 1 for the intercept (since it helps predict too)

in symbols

$$ df = N - k - 1 $$

In [17]:
degrees_of_freedom = df.shape[0] - 1 - 1

In [18]:
degrees_of_freedom


Out[18]:
1597

Finding the standard error of the slope

the equation for the standard error of the slope is:

standard error of the slope equals the standard error of the estimate divided by the square root of the sum of squared deviations of the predictor variable.

$$ SE_{\hat{\beta}} = \frac{s_{e}}{\sqrt{SS_{x}}} $$

expanded out, that's...

$$SE_{\widehat\beta} = \frac{\sqrt{\frac{1}{n - 2}\sum_{i=1}^n (Y_i - \widehat y_i)^2}}{\sqrt{ \sum_{i=1}^n (x_i - \overline{x})^2 }} $$

In [19]:
n = df.shape[0]
num = sqrt((1.0/(n-2) * ((df.quality - expected)**2).sum()))
dem = sqrt(((df.alcohol - df.alcohol.mean())**2).sum())
standard_error_of_the_slope = num/dem
standard_error_of_the_slope


Out[19]:
0.016675160295475198

In [20]:
# warning: expected.std() is the spread of the *fitted* values, not of the
# residuals, so this understates the standard error ~ the In [19] value is the correct one
standard_error_of_the_slope = expected.std()/sqrt(((df.alcohol - df.alcohol.mean())**2).sum())
standard_error_of_the_slope


Out[20]:
0.0090266875772390658

Finding the test statistic, also known as the t score

The t score can be found with the following formula:

t-score equals the estimated slope, minus the hypothesized slope, divided by the standard error of the slope

$$ t_{score} = \frac{\hat{\beta_{1}} - \beta_{0}}{s_{b}} $$

where $\beta_{0}$ here is the hypothesized slope, which we take to be zero under the null hypothesis that x (alcohol) and y (quality) aren't related.


In [21]:
t_score = (slope - 0)/standard_error_of_the_slope
t_score


Out[21]:
39.974992182613988

Finding the t critical value

The t critical value is the point beyond which our finding (our slope) is deemed significant.


In [22]:
import scipy.stats
# recall that we set our alpha to equal 0.05
alpha = 0.05
# we divide alpha by two because our test is two-tailed (i.e. we care if our slope is off in either direction)
t_critical_value = scipy.stats.t.cdf(alpha/2, degrees_of_freedom)
t_critical_value


Out[22]:
0.50997095653423974
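One caveat worth flagging: `scipy.stats.t.cdf` returns a cumulative probability, so the 0.51 above is the CDF evaluated at 0.025, not a critical value. The conventional two-tailed critical value comes from the inverse CDF, `ppf`:

```python
from scipy import stats

alpha = 0.05
degrees_of_freedom = 1597

# ppf is the inverse CDF: the t value with 1 - alpha/2 of the mass below it
t_critical_value = stats.t.ppf(1 - alpha / 2, degrees_of_freedom)
print(t_critical_value)  # roughly 1.96 with this many degrees of freedom
```

With ~1600 degrees of freedom the cutoff sits near the normal distribution's 1.96; a t score near 40 clears it either way, so the verdict that follows is unaffected.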

Results

Lets recap the most relevant values:


In [23]:
t_score


Out[23]:
39.974992182613988

In [24]:
t_critical_value


Out[24]:
0.50997095653423974

We reject the null hypothesis when our t_score is greater than our t_critical_value.


In [25]:
reject_null = t_score > t_critical_value
reject_null


Out[25]:
True

Before we go any further, let's make sure we did everything right. You may be happy to know I wasted a lot of time figuring out how to do things the hard(er) way.


In [26]:
# statsmodels contains a lot of great tools for statisticians
import statsmodels.formula.api as smf

est = smf.ols(formula="quality ~ alcohol", data=df).fit()
# we'll check if their t_score is roughly equal to our t_score
round(est.tvalues.alcohol,5) == round(t_score, 5)


Out[26]:
False
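The False traces back to the In [20] cell: expected.std() measures the spread of the fitted values rather than the residuals, so it understates the standard error (In [19]'s 0.0167 is the right value). As a self-contained sanity check on toy data (invented here, not the wine data), the In [19]-style formula does reproduce statsmodels' t value:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from math import sqrt

# toy dataset, invented for illustration (not the wine data)
rng = np.random.RandomState(0)
x = rng.uniform(8, 14, size=200)
y = 2.0 + 0.3 * x + rng.normal(scale=0.7, size=200)
toy = pd.DataFrame({"x": x, "y": y})

# same recipe as In [13], In [14], In [15], and In [19]
slope = toy.x.cov(toy.y) / toy.x.var()
intercept = toy.y.mean() - slope * toy.x.mean()
fitted = slope * toy.x + intercept
n = toy.shape[0]
se = sqrt((1.0 / (n - 2)) * ((toy.y - fitted) ** 2).sum()) / \
    sqrt(((toy.x - toy.x.mean()) ** 2).sum())
t_manual = slope / se

est = smf.ols(formula="y ~ x", data=toy).fit()
print(round(est.tvalues.x, 5) == round(t_manual, 5))  # True
```

Swapping the In [19] standard error into our t_score computation should make the comparison above come out True as well.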

Conclusions

Concerning our hypothesis:

$$ H_{o}: \beta_{1} = 0 $$

$$ H_{a}: \beta_{1} \neq 0 $$

In [27]:
print "with an alpha score of {0}, we found a t-critical value of {1} and a t score of \
{2} and so we {3}reject the Null hypothesis".format(alpha, round(t_critical_value,2), round(t_score,2), {False: "Failed to ", True: ""}[reject_null])


with an alpha score of 0.05, we found a t-critical value of 0.51 and a t score of 39.97 and so we reject the Null hypothesis

This means that we can be confident there is a significant, if slight, positive correlation between the quantity of alcohol in a red wine and its "quality".

With some additional help, we can find out how much of the variation in quality is due to variation in alcohol by examining the coefficient of determination:


In [28]:
est.rsquared_adj


Out[28]:
0.22625016921989893

Combining the two ideas, it's clear that while there is a relationship, it's weak.
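As an aside, with a single predictor the (unadjusted) coefficient of determination is just the squared correlation between x and y. A quick check on toy data (invented here, not the wine data):

```python
import numpy as np

# toy data, invented for illustration
x = np.array([9.0, 9.5, 10.0, 11.0, 12.0, 13.0])
y = np.array([5.0, 5.0, 6.0, 6.0, 7.0, 7.0])

# R^2 computed from the regression residuals...
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()
fitted = intercept + slope * x
ss_res = ((y - fitted) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

# ...equals the squared correlation coefficient
r = np.corrcoef(x, y)[0, 1]
print(r_squared)
print(r ** 2)
```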

Wrap up

So what can we conclude? Well, like so many studies, we can conclude that more work needs to be done. The logical next step would be multiple regression using the other features we ignored, making sure we prune them intelligently.

