We look at two datasets related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests.
Vinho verde is a unique product from the Minho (northwest) region of Portugal. Medium in alcohol, is it particularly appreciated due to its freshness (specially in the summer).
The source states that these datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones).
If we can establish this relationship, then of significance would be the sommeliers' reputation who perform wine tasting all the time and providing a certificate of merit or grading to sommeliers such that based on the correlation as stated previously, one could establish the quality of wine and wine quality as set by the taster.
This is really interesting to the massive wine market, wine consumers, wine tasters and to the reputation of esteemed wine-breweries! A predictive model developed on this data is expected to provide guidance to vineyards regarding quality and price expected on their produce without heavy reliance on volatility of wine tasters.
A research question relevant to the topic is: Is there a correlation between physiochemical makeup of the wine and quality of wine?
For this, we need to explore the physiochemical attributes given in the dataset and analyis the statistical variables to find out if there is any possibility of a strong correlation. If not, we reject the proposition that there is any correlation. Which brings us close to framing the hypothesis.
Null Hypothesis H0 : There is no significant relationship between any of the physiochemical constituents and the quality of Red Wine.
Which means the alternative hypothesis should be:
Alternative Hypothesis HA: There is evidence from Pearson's R values that there is a significant relation between any of the physiochemical constituents of and the quality of Red Wine. We accept the alternative hypothesis HA if we get a Pearson's R of atleast 0.60.
The Dataset contains the following two samples (with sample size):
We note that specific attributes have been included per sample as given here: Input variables (based on physicochemical tests):
alcohol
Output variable (based on sensory data):
quality (score between 0 and 10)
We also observe that:
Lets first sort the tables in a proper format and analyze the data afterwards.
In [108]:
#import matplotlib
%automagic
%matplotlib inline
import numpy as np
import pandas
from pandas import DataFrame
import os
import csv
def read_write_wine_csv(fpath_name, fpath_new_name):
"""
Takes a raw unformatted csv file as input.
Returns a sorted by column, formatted csv file as output
"""
#open the input file for reading csv
with open(fpath_name) as f:
wine = csv.reader(f, delimiter = ';', quoting=csv.QUOTE_NONNUMERIC)
#open output file for writing to csv
with open(fpath_new_name, 'wb+') as new_wine_file:
#get headers with serial no. as first column
headers = ['sno'] + list(wine.next())
#write the new file with headers as first row using DictWriter
csv_dct_writer = csv.DictWriter(new_wine_file, fieldnames = headers)
csv_dct_writer.writerow({col:col for col in headers})
#write rest of the rows
[csv_dct_writer.writerow(dict(zip(headers,[line_num]+line))) for line_num, line in enumerate(wine)]
In [86]:
#Get white wine data sorted
input_csv = os.getcwd() + r'\wine_data\winequality-white.csv'
output_csv = os.getcwd() + r'\wine_data\new_white_wine.csv'
read_write_wine_csv(input_csv, output_csv)
#Get red wine data sorted
input_csv = os.getcwd() + r'\wine_data\winequality-red.csv'
output_csv = os.getcwd() + r'\wine_data\new_red_wine.csv'
read_write_wine_csv(input_csv, output_csv)
In [188]:
#Get dataset for analysis
white_wine_df = DataFrame.from_csv(open(os.getcwd()+r'\wine_data\new_white_wine.csv'))
red_wine_df = DataFrame.from_csv(open(os.getcwd()+r'\wine_data\new_red_wine.csv'))
Now that we have the data in our comfortable format, it will be interesting to show that there is a relation between physiochemical attributes of wine and quality of wine based on wine tasting.
We take into account all indicators per sample and perform the following exploratory analysis:
We take the histogram plots of each attribute in the Red Wine dataset.
Alongwith taking histogram plots of each attribute, we will clean out any outliers from all the attributes to find out unbiased standard deviation.
In [189]:
#histogram analysis
from iqr import quartile_range, outliers_min_max, filter_outlier_df
#fixed acidity
fixed_acidity_df = filter_outlier_df(red_wine_df['fixed acidity'])
hist(red_wine_df['fixed acidity'], color = 'red', label = 'fixed acidity')
pyplot.legend(loc='upper right')
Out[189]:
In [190]:
#volatile acidity
volatile_acidity_df = filter_outlier_df(red_wine_df['volatile acidity'])
hist(red_wine_df['volatile acidity'], color = 'black', label = 'volatile acidity')
pyplot.legend(loc='upper right')
Out[190]:
In [191]:
#citric acid
citric_acid_df = filter_outlier_df(red_wine_df['citric acid'])
hist(red_wine_df['citric acid'], color = 'green', label = 'citric acid')
pyplot.legend(loc = 'upper right')
Out[191]:
In [192]:
#residual sugar
residual_sugar_df = filter_outlier_df(red_wine_df['residual sugar'])
hist(red_wine_df['residual sugar'], color = 'blue', label = 'residual sugar')
pyplot.legend(loc = 'upper right')
Out[192]:
In [193]:
#chlorides
chlorides_df = filter_outlier_df(red_wine_df['chlorides'])
hist(red_wine_df['chlorides'], color = 'cyan', label = 'chlorides')
pyplot.legend(loc = 'upper right')
Out[193]:
In [194]:
#free sulfur dioxide
free_sulfur_dioxide_df = filter_outlier_df(red_wine_df['free sulfur dioxide'])
hist(red_wine_df['free sulfur dioxide'], color = 'yellow', label = 'free sulfur dioxide')
pyplot.legend(loc = 'upper right')
Out[194]:
In [195]:
#total sulfur dioxide
total_sulfure_dioxide_df = filter_outlier_df(red_wine_df['total sulfur dioxide'])
hist(red_wine_df['total sulfur dioxide'], color = 'yellow', label = 'total sulfur dioxide')
pyplot.legend(loc = 'upper right')
Out[195]:
In [196]:
#density
density_df = filter_outlier_df(red_wine_df['density'])
hist(red_wine_df['density'], color = 'cyan', label = 'density')
pyplot.legend(loc = 'upper right')
Out[196]:
In [197]:
#pH
pH_df = filter_outlier_df(red_wine_df['pH'])
hist(red_wine_df['pH'], color = 'blue', label = 'pH')
pyplot.legend(loc = 'upper right')
Out[197]:
It is interesting to observe the pH values are distributed normally across the data points.
In [198]:
#sulphates
sulphates_df = filter_outlier_df(red_wine_df['sulphates'])
hist(red_wine_df['sulphates'], color = 'yellow', label = 'sulphates')
pyplot.legend(loc = 'upper right')
Out[198]:
In [199]:
#alcohol
alcohol_df = filter_outlier_df(red_wine_df['alcohol'])
hist(red_wine_df['alcohol'], color = 'blue', label = 'alcohol')
pyplot.legend(loc = 'upper right')
Out[199]:
In [200]:
#quality
quality_df = filter_outlier_df(red_wine_df['quality'])
hist(red_wine_df['quality'], color = 'gold', label = 'quality')
pyplot.legend(loc = 'upper right')
Out[200]:
In [201]:
#find out STD DEV of each of attributes - unbiased Std Dev
red_wine_std_ds = pandas.Series(red_wine_df.std(), name=['Original Data'])
red_wine_std_ds.sort()
#Join the individual filtered attribute series
quartile_filtered_red_wine_df = pandas.concat([density_df, chlorides_df, pH_df, sulphates_df, volatile_acidity_df, citric_acid_df, quality_df, alcohol_df, residual_sugar_df, fixed_acidity_df, free_sulfur_dioxide_df, total_sulfure_dioxide_df], axis=1)
#find STD DEV of filtered datasets attributes - unbiased Std Dev
filtered_red_wine_std_ds = pandas.Series(quartile_filtered_red_wine_df.std(), name=['Filtered Data'])
filtered_red_wine_std_ds.sort()
cmp_std_red_wine_df = DataFrame(data = {'Original STD DEV':red_wine_std_ds, 'Filtered STD DEV': filtered_red_wine_std_ds}, dtype=float64).sort(columns=['Filtered STD DEV'], kind='quicksort')
print "\t\tUnbiased Std Dev\n"
print cmp_std_red_wine_df
We filter the outliers by finding the lower and upper bound of the sorted individual attributes. It can be observed from the Filtered STD DEV table that it gives a slimmer standard deviation for :
and increases for other attributes.
This can make for finding correlation between density vs quality and chlorides vs quality or doing a multiregression among density vs chlorides and quality, we still need to investigate further.
Let us take the filtered dataset and perform:
Let us take the quartile filtered data and find the Coefficient of Determination using Pearson's R for all possible combinations for the 11 input attributes as Dataset df.
First, let us perform the univariate analysis by finding descriptive statistics variables.
In [202]:
import math
from itertools import permutations
from scipy.stats import pearsonr
#from linear_regression import r_squared, line_fitting
#from py_variance_std import critical_t
red_wine_dof = len(quartile_filtered_red_wine_df) - 2
#critical_t_val = critical_t(95, red_wine_dof, 0)
critical_t_per = 0.05
#Drop all NaN rows
quartile_filtered_red_wine_df.dropna(inplace=True)
#Drop all null or nan rows and drop the Quality Column
quartile_filtered_red_wine_df = quartile_filtered_red_wine_df[quartile_filtered_red_wine_df.notnull()]\
[(quartile_filtered_red_wine_df != math.isnan) & (quartile_filtered_red_wine_df != np.NaN)].\
drop(['quality'], axis=1)
df = quartile_filtered_red_wine_df
#get all permutations of column pairs
keys = df.columns
attr_permutations = permutations(keys, 2)
coeff_r = {}
[coeff_r.setdefault(each_comb[0], {}).update({each_comb[1]:pearsonr(df[each_comb[0]],\
df[each_comb[1]])[0]}) \
for each_comb in attr_permutations]
df = DataFrame(data=coeff_r, columns=keys, index=keys).fillna(1)
#correlation table
print df
Now that we have correlation dataset df, let us summarize the descriptive analysis in a table.
In [203]:
from itertools import izip
from py_variance_std import se
mean_tbl = quartile_filtered_red_wine_df[['citric acid', 'total sulfur dioxide', 'free sulfur dioxide', 'fixed acidity']].mean()
min_tbl = quartile_filtered_red_wine_df[['citric acid', 'total sulfur dioxide', 'free sulfur dioxide', 'fixed acidity']].min()
max_tbl = quartile_filtered_red_wine_df[['citric acid', 'total sulfur dioxide', 'free sulfur dioxide', 'fixed acidity']].max()
#get unbiased std deviation of quartile filtered dataset
std_tbl = (quartile_filtered_red_wine_df[['citric acid', 'total sulfur dioxide', 'free sulfur dioxide', 'fixed acidity']]).std()
median_tbl = quartile_filtered_red_wine_df[['citric acid', 'total sulfur dioxide', 'free sulfur dioxide', 'fixed acidity']].median()
q1_tbl = red_wine_df[['citric acid','total sulfur dioxide','free sulfur dioxide','fixed acidity']].quantile(q=0.25, axis=0)
q3_tbl = red_wine_df[['citric acid','total sulfur dioxide','free sulfur dioxide','fixed acidity']].quantile(q=0.75, axis=0)
#calculate std err in a dict and create its dataframe
std_err_dct = {index:se(sd, len(df[index])) \
for sd,index in izip(std_tbl, \
list(std_tbl.index))}
std_err_tbl = DataFrame(data=std_err_dct, index = ['std err']).transpose()
#create a DF of all tables and join std_err_tbl
tbl = DataFrame(data=[mean_tbl, min_tbl, max_tbl, std_tbl, median_tbl, q1_tbl, q3_tbl], index = ['mean', 'min', 'max', 'std','median','Q1','Q3']).transpose().join([std_err_tbl])
#calculate range b/w max and min values
data_range = DataFrame(data = tbl['max'] - tbl['min'], columns=['range'])
#calculate Quartile range between Q3 and Q1
iqr = DataFrame(data = tbl['Q3'] - tbl['Q1'], columns=['iqr'])
#join tbl with range and Quartile data
tbl = tbl.join([data_range, iqr])
print tbl.transpose()
We observe that:
We conclude that since range is greater than iqr even after filtering out the outliers, the quartile range is lesser than the range of extreme data points which means that we need to reduce the range less than iqr. This implies that we still need to clean the outliers.
Now lets look at the high correlation value r, in the range 0.40 <= r < 1 and see which attributes share these values.
In [204]:
#High correlations (≥ 40% in absolute value) are identified
limit_df = df[(df >= .40) & (df < 1)]
red_wine_correl = {}
for each in permutations(keys, 2):
col, attr = each
#condition to negate nan/null
if limit_df[col][attr] > 0.:
if attr in red_wine_correl:
if (col in red_wine_correl[attr]):
continue
red_wine_correl.setdefault(col, []).append(attr)
print red_wine_correl
We find out that there seems to be correlation between Citric acid and Fixed acidity, density and fixed acidity and density and chlorides and between free and total sulfur dioxide.
Here are the scatter plots for these attributes, using the filtered dataset for red wine.
In [231]:
from linear_regression import trace_line
#Generate Scatter Plots of the red_wine_correl attributes
#scatter plot for citric acid and fixed acidity
#filter out the outliers and recreate DF and drop all null values
citric_fixed_acid_filter_df = DataFrame(data=[filter_outlier_df(quartile_filtered_red_wine_df['citric acid']), \
filter_outlier_df(quartile_filtered_red_wine_df['fixed acidity'])]).transpose().dropna()
scatter(citric_fixed_acid_filter_df['citric acid'],citric_fixed_acid_filter_df['fixed acidity'])
#R value
citric_fixed_acid_r = pearsonr(citric_fixed_acid_filter_df['citric acid'],citric_fixed_acid_filter_df['fixed acidity'])
print citric_fixed_acid_r
#plot line
x,y = zip(*trace_line(citric_fixed_acid_filter_df['citric acid'],citric_fixed_acid_filter_df['fixed acidity']))
plot(x,y, color='red')
Out[231]:
In [233]:
#filter out the outliers and recreate DF and drop all null values
density_fixed_acid_filter_df = DataFrame(data=[filter_outlier_df(quartile_filtered_red_wine_df['density']), \
filter_outlier_df(quartile_filtered_red_wine_df['fixed acidity'])]).transpose().dropna()
#scatter plot for density and fixed acidity
scatter(density_fixed_acid_filter_df['density'],density_fixed_acid_filter_df['fixed acidity'])
#R value
density_fixed_acid_r = pearsonr(density_fixed_acid_filter_df['density'],density_fixed_acid_filter_df['fixed acidity'])
print density_fixed_acid_r
#plot line
x,y = zip(*trace_line(density_fixed_acid_filter_df['density'],density_fixed_acid_filter_df['fixed acidity']))
plot(x,y, color='green')
Out[233]:
In [234]:
#filter out the outliers and recreate DF and drop all null values
density_chlorides_filter_df = DataFrame(data=[filter_outlier_df(quartile_filtered_red_wine_df['density']), \
filter_outlier_df(quartile_filtered_red_wine_df['chlorides'])]).transpose().dropna()
#scatter plot for density and chlorides
scatter(density_chlorides_filter_df['density'], density_chlorides_filter_df['chlorides'])
#R value
density_chlorides_r = pearsonr(density_chlorides_filter_df['density'],\
density_chlorides_filter_df['chlorides'])
print density_chlorides_r
#plot line
x,y = zip(*trace_line(density_chlorides_filter_df['density'], density_chlorides_filter_df['chlorides']))
plot(x,y, color='red')
Out[234]:
In [235]:
sulfur_r_filter_df = DataFrame(data=[filter_outlier_df(quartile_filtered_red_wine_df['free sulfur dioxide']), \
filter_outlier_df(quartile_filtered_red_wine_df['total sulfur dioxide'])]).transpose().dropna()
#scatter plot for free sulfur dioxide and total sulfur dioxide
scatter(sulfur_r_filter_df['free sulfur dioxide'], sulfur_r_filter_df['total sulfur dioxide'])
#R value
sulfur_r = pearsonr(sulfur_r_filter_df['free sulfur dioxide'],\
sulfur_r_filter_df['total sulfur dioxide'])
print sulfur_r
#plot line
x,y = zip(*trace_line(sulfur_r_filter_df['free sulfur dioxide'], sulfur_r_filter_df['total sulfur dioxide']))
plot(x,y, color='red')
Out[235]:
Noticeable correlations:
Let us test the scatterplot between:
In [236]:
#generate a DF for all above filtered dataframes with quality attribute
filtered_df = {attr_name:filter_outlier_df(quartile_filtered_red_wine_df[attr_name]) \
for attr_name in df.columns.tolist()}
filtered_df = DataFrame(data=filtered_df).join(red_wine_df['quality']).dropna()
#generate scatterplots
scatter(filtered_df['free sulfur dioxide'], filtered_df['quality'])
print pearsonr(filtered_df['free sulfur dioxide'], filtered_df['quality'])
#plot line
x,y = zip(*trace_line(filtered_df['free sulfur dioxide'], filtered_df['quality']))
plot(x,y, color='red')
Out[236]:
In [237]:
scatter(filtered_df['total sulfur dioxide'], filtered_df['quality'])
print pearsonr(filtered_df['total sulfur dioxide'], filtered_df['quality'])
#plot line
x,y = zip(*trace_line(filtered_df['total sulfur dioxide'], filtered_df['quality']))
plot(x,y, color='red')
Out[237]:
In [238]:
scatter(filtered_df['citric acid'], filtered_df['quality'])
print pearsonr(filtered_df['citric acid'], filtered_df['quality'])
#plot line
x,y = zip(*trace_line(filtered_df['citric acid'], filtered_df['quality']))
plot(x,y, color='red')
Out[238]:
In [239]:
scatter(filtered_df['fixed acidity'], filtered_df['quality'])
print pearsonr(filtered_df['fixed acidity'], filtered_df['quality'])
#plot line
x,y = zip(*trace_line(filtered_df['fixed acidity'], filtered_df['quality']))
plot(x,y, color='red')
Out[239]:
From the Pearson R of the scatter plots we conclude there is no significant correlation between (fixed acidity, citric acid, total sulfur dioxide, free sulfur dioxide) and quality as the output variable.
We are unable to establish any significance to the positive correlation between citric vs fixed acidity and quality nor with free vs total sulfur and quality.
We see that density and chlorides provided minimal standard deviations so we proceed further for a univariate and bivariate analysis to test their correlations with other attributes and quality finally.
We find out the Pearson's R and summarize the descriptive analysis. We fid out the data still has outliers from the extreme range and inter-quartile range. We find out the probable lines of best fit could be among citric acid, (having the minimnum standard error out of all probably correlation attributes) free and total sulfur dioxide, density, chlorides and fixed acidity.
As a result, we re run the filtering process to clear out any outliers from these attributes and run a scatter plot among each of the select physiochemical attributes against quality. But we find that the line of best fit and the scatterplot don't quite do justice and are in agreement with the Pearson's R showing no to littler significance between the correlation of the select physiochemical attributes and the quality of wine.
Therefore, we fail to reject the Null Hypothesis H0 and conclude that the dataset given is not appropriate for this test.
But for ages Sommeliers have been grading wine based on taste and taste is a side-effect of the physiochemical makeup of the wine. Which leaves us with one possibility and that is of wine-maturity and time.
Thus, a Pre- and Post- Time-period analysis is needed where we record the physiochemical data for a considerable period of time, say over 50 years and record the same attributes at the time of brewing wine and after 50 years and what changes have the physiochemical scores have gone through as well as comparing the Quality of wine for atleast these two given periods of time.
This way, we will have a Pre- and Post- dataset with the same number of attributes. With this kind of pair of dataset, we need to test the Null Hypothesis that the Quality of wine has remaind the same and the mean Quality of the two datasets is the same while the Alternative Hypothesis would be that there is significant difference between the two mean Quality attributes. Since, there could be 2 or more datasets over an interval of time, a T-test would be a better measure of significance. If we negate the criteria of independent datasets, a One Way ANOVA test could also be performed over datasets over periods of intervals.
If we find there is a significant difference between the quality per dataset, then we could perform a T-test on the physiochemical makeup between the two datasets. From there, we might be able to infer some significant change in the physiochemical property that is correlated to the change in the quality of the wine over the period of time and come to a conclusion.
In [ ]: