The purpose of this study is to determine how well a model can predict the perceived quality of wine based on some of its most relevant physical and chemical properties. The dataset was taken from: 'P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009'.
There are two datasets, one for red wine and another for white wine. Both contain the same variables but a different number of instances. Only one of the datasets will be used for the analysis.
In [17]:
%matplotlib notebook
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import IPython
from IPython.display import display
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer
from sklearn import model_selection as cross_validation  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
In [16]:
raw_df_red = pd.read_csv("winequality-red.csv", sep =';')
raw_df_white = pd.read_csv("winequality-white.csv", sep =';')
In [3]:
raw_df_red.describe()
Out[3]:
In [4]:
raw_df_white.describe()
Out[4]:
The white wine dataset is chosen for this exercise because it contains more instances (4898).
In [5]:
raw_df_white.info()
The dataset contains no missing values and no non-numerical data.
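The statement above can also be checked programmatically instead of being read off the `info()` output. The following is a minimal sketch, not part of the original analysis: the helper `summarize_cleanliness` and the small frame `demo_df` are hypothetical names introduced for illustration, and `demo_df` is only a stand-in for `raw_df_white`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def summarize_cleanliness(df):
    """Return (total number of missing cells, list of non-numeric column names)."""
    n_missing = int(df.isnull().sum().sum())
    non_numeric = [col for col in df.columns if not is_numeric_dtype(df[col])]
    return n_missing, non_numeric


# Hypothetical stand-in data; in the notebook, pass raw_df_white instead.
demo_df = pd.DataFrame({
    "alcohol": [9.4, 10.1, 11.0],
    "pH": [3.2, 3.0, 3.3],
    "quality": [5, 6, 6],
})

n_missing, non_numeric = summarize_cleanliness(demo_df)
print(n_missing, non_numeric)
```

A result of zero missing cells and an empty list of non-numeric columns would support the statement; anything else would point to rows or columns that need cleaning before modelling.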
In [5]:
X = raw_df_white.iloc[:, :-1].values  # independent variables (all columns except the last, 'quality')
y = raw_df_white['quality'].values  # dependent variable (quality score)
# Hold out 20% of the rows as a test set; random_state fixes the shuffle for reproducibility
X_train_white, X_test_white, y_train_white, y_test_white = train_test_split(X, y, test_size=0.2, random_state=0)
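To see what an 80/20 split produces for a dataset of this size, the sketch below applies `train_test_split` to synthetic arrays with the same shape as the white wine data (4898 rows, 11 feature columns). The arrays `X_demo` and `y_demo` are hypothetical stand-ins filled with random values, not the actual wine measurements.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in arrays with the white wine dataset's shape:
# 4898 rows, 11 feature columns, and integer labels in the range 3 to 9.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4898, 11))
y_demo = rng.integers(3, 10, size=4898)

# Same split settings as in the cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print("train:", X_tr.shape, "test:", X_te.shape)
```

Checking the shapes right after splitting is a cheap way to confirm that no rows were dropped and that features and labels stay aligned.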
In [7]:
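# Note: despite their names, X_train and y_train hold the full white wine dataset here, not the training split
X_train = raw_df_white.iloc[:, :-1]
y_train = raw_df_white['quality']
# Pairwise scatter plots of all features, colored by quality score
pd.plotting.scatter_matrix(X_train, c=y_train, figsize=(30, 30), marker='o',
                           hist_kwds={'bins': 20}, s=60, alpha=0.7)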