Wine quality:

The purpose of this study is to determine how well a model can predict the Percieved Quality of based in some of the most relevant physical and chemical properties of wine. The dataset was taken from: 'P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009'

The there are two datasets, one for red wine and another for white wines. Both contain the same variables but with different number of instances. Only one of the dataset will be chosen to perform the analysis.


In [17]:
%matplotlib notebook
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import scipy as sp
import IPython
from IPython.display import display
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer
from sklearn import cross_validation
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

Importing the dataset:


In [16]:
raw_df_red = pd.read_csv("winequality-red.csv", sep =';')
raw_df_white = pd.read_csv("winequality-white.csv", sep =';')

Exploring the datasets:


In [3]:
raw_df_red.describe()


Out[3]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.636023
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.807569
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 3.000000
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 5.000000
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 6.000000
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.000000
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.000000

In [4]:
raw_df_white.describe()


Out[4]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
count 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000
mean 6.854788 0.278241 0.334192 6.391415 0.045772 35.308085 138.360657 0.994027 3.188267 0.489847 10.514267 5.877909
std 0.843868 0.100795 0.121020 5.072058 0.021848 17.007137 42.498065 0.002991 0.151001 0.114126 1.230621 0.885639
min 3.800000 0.080000 0.000000 0.600000 0.009000 2.000000 9.000000 0.987110 2.720000 0.220000 8.000000 3.000000
25% 6.300000 0.210000 0.270000 1.700000 0.036000 23.000000 108.000000 0.991723 3.090000 0.410000 9.500000 5.000000
50% 6.800000 0.260000 0.320000 5.200000 0.043000 34.000000 134.000000 0.993740 3.180000 0.470000 10.400000 6.000000
75% 7.300000 0.320000 0.390000 9.900000 0.050000 46.000000 167.000000 0.996100 3.280000 0.550000 11.400000 6.000000
max 14.200000 1.100000 1.660000 65.800000 0.346000 289.000000 440.000000 1.038980 3.820000 1.080000 14.200000 9.000000

The dataset that will be chosen for this exercise will be the white wine dataset since it contains more instances(4898).


In [5]:
raw_df_white.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
fixed acidity           4898 non-null float64
volatile acidity        4898 non-null float64
citric acid             4898 non-null float64
residual sugar          4898 non-null float64
chlorides               4898 non-null float64
free sulfur dioxide     4898 non-null float64
total sulfur dioxide    4898 non-null float64
density                 4898 non-null float64
pH                      4898 non-null float64
sulphates               4898 non-null float64
alcohol                 4898 non-null float64
quality                 4898 non-null int64
dtypes: float64(11), int64(1)
memory usage: 459.3 KB

The dataset does not contain missing values or non numerical data.

White wine training and test set selection:

To select the training and test set the method .train_test_split() will be used.


In [5]:
X = raw_df_white.iloc[:,:-1].values # independent variables X
y = raw_df_white['quality'].values # dependent Variables y

X_train_white, X_test_white, y_train_white, y_test_white = cross_validation.train_test_split(X, y, test_size = 0.2, random_state = 0)

Visual Data Exploration


In [7]:
X_train = raw_df_white.iloc[:,:-1]
y_train = raw_df_white['quality']
pd.plotting.scatter_matrix(X_train, c = y_train, figsize = (30, 30), marker ='o', hist_kwds = {'bins': 20},
                            s = 60, alpha = 0.7)


Out[7]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000018D9F34F438>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA13C9D30>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA13FF048>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018D9F348B38>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018D9F3682E8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018D9F368390>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA149C080>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA14C3710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA14ECDA0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA151D470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1543B00>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA15761D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA159E860>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA15C4EF0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA15F55C0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA161EC50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA164F320>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA16769B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA16A9080>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA16CF710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA16F6DA0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1728470>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1750B00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA17821D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA17A9860>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA17D1EF0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA18015C0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1827C50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1858320>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA18839B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA18B1080>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA18DA710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1901DA0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1932470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA195AB00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA198B1D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA19B2860>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA19DBEF0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1A0C5C0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1A34C50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1A67320>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1A8C9B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1ABE080>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1AE6710>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1B0DDA0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1B3D470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1B65B00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1B981D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1BBF860>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1BE7EF0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1C185C0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1C3FC50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1C6F320>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1C979B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1CC9080>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1CF0710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1D18DA0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1D4A470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1D70B00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1DA11D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1DC9860>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1DF2EF0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1E235C0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1E4AC50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1E7D320>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1EA49B0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1ED4080>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1EFB710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1F25DA0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1F55470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1F7DB00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1FAD1D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1FD5860>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA1FFCEF0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA202E5C0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2056C50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2087320>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA20AE9B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA20DF080>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2105710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA212EDA0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2161470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2187B00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA21B81D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA21E1860>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2208EF0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA22395C0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2262C50>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2294320>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA22B89B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA22EC080>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2315710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA233ADA0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA236D470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2395B00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA23C41D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA23EB860>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2416EF0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA24455C0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA246CC50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA24A0320>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA24C59B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA24F7080>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA251E710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2548DA0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2577470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA259FB00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA25CF1D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA25F9860>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA261EEF0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA26505C0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2678C50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA26AB320>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA26D29B0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2702080>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA272A710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2752DA0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2783470>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA27A9B00>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA27DB1D0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000018DA2803860>]],
      dtype=object)

Scaling the independent variables using standard scaler. Since the dependent variable is ordinal, there is no need to scale it.


In [8]:
plt.boxplot(X_train_white, manage_xticks = False)
plt.yscale("symlog")
plt.xlabel("Features")
plt.ylabel("Target Variable")
plt.show()


Here, the range of values vary drastically. Standardize scaling will be performed in order to reduce that variability:


In [6]:
scaler = StandardScaler()
#scaler =  MinMaxScaler()
#scaler = Normalizer()
X_train_white = scaler.fit(X_train_white).transform(X_train_white)
X_test_white = scaler.fit(X_test_white).transform(X_test_white)

In [11]:
plt.boxplot(X_train_white, manage_xticks = False)
plt.yscale("symlog")
plt.xlabel("Features")
plt.ylabel("Target Variable")
plt.show()


After scaling, The variability of has been reduced significantly

Dimensionality reduction

Performing PCA to check the most relevant variables:


In [12]:
from sklearn.decomposition import PCA
pca = PCA(n_components = None) # input a number for feature extraction

X_train_white = pca.fit_transform(X_train_white)
X_test_white = pca.transform(X_test_white)
explained_var = pca.explained_variance_ratio_
explained_var


Out[12]:
array([0.29631274, 0.14101785, 0.11148285, 0.09476468, 0.08750276,
       0.08318285, 0.06648231, 0.05468109, 0.03713087, 0.02616993,
       0.00127206])

Performing Kernel PCA to check the most relevant variables:


In [13]:
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components = None, kernel = 'rbf') # input a number for feature extraction

X_train_white = kpca.fit_transform(X_train_white)
X_test_white = kpca.transform(X_test_white)

In this case, there will be no feature extraction because the best results are obtained when the variables are untouched.

KNN


In [10]:
knn = KNeighborsClassifier(n_neighbors = 10, metric = 'manhattan', weights = 'distance', algorithm = 'auto')
knn.fit(X_train_white, y_train_white)
predicted_knn = knn.predict(X_test_white)
# print("Predictions: {}".format(predicted_knn))

Performing cross validation:


In [11]:
scores = cross_val_score(knn, X = X_train_white, y = y_train_white)
print ("Cross Validation Scores: {}".format(scores))


Cross Validation Scores: [0.65343511 0.62222222 0.64620107]

Reporting scores:


In [12]:
report = classification_report(y_test_white, predicted_knn)
print (report)


             precision    recall  f1-score   support

          3       0.00      0.00      0.00         9
          4       1.00      0.12      0.21        51
          5       0.67      0.66      0.66       295
          6       0.60      0.73      0.66       409
          7       0.58      0.57      0.57       183
          8       0.67      0.24      0.36        33

avg / total       0.64      0.62      0.61       980

c:\users\franc\appdata\local\programs\python\python36\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

According to the classification report. The model has an accuracy of 0.64 using KNN. The best results are obtained when using all the independent variables.

Finding the best parameters for KNN:

In this case, gridsearchCV and a simple loop will be used in order to find the optimal hyperparameters for KNN


In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
       
params2 = [{'n_neighbors': [1,10,50,100], 'algorithm': ['auto','ball_tree','kd_tree' ], 
            'weights': ['uniform', 'distance'], 'metric': ['minkowski', 'manhattan']}]
                  
grid_search = GridSearchCV(estimator = knn, param_grid = params2, scoring = 'accuracy', cv = 5, n_jobs = 1)
grid_search = grid_search.fit(X_train_white, y_train_white)
accuracy = grid_search.best_score_
best_params = grid_search.best_params_

Due to very slow processing time. This gridsearch calculation had to be done using atom.

Best scores : 0.6687085247575294

Best Params : {'algorithm': 'auto', 'metric': 'minkowski', 'n_neighbors': 50, 'weights': 'distance'}


In [ ]:
print(accuracy)
print(best_params)

In [13]:
train_accuracy = []
test_accuracy = []

neighbors = range(1,100,10)
algorithms = ['auto', 'ball_tree', 'kd_tree']
weights = ['uniform', 'distance']

for i in neighbors:
    knn = KNeighborsClassifier(n_neighbors = i, metric = 'manhattan', weights = 'distance', algorithm = 'auto')
    knn.fit(X_train_white, y_train_white)
    train_accuracy.append(knn.score(X_train_white, y_train_white))
    test_accuracy.append(knn.score(X_test_white, y_test_white))
plt.plot(neighbors, train_accuracy, label = 'Train set accuracy')
plt.plot(neighbors, test_accuracy, label = 'Test set accuracy')
plt.ylabel("Accuracy")
plt.xlabel("Number of neighbors")
plt.legend()
plt.show()


As it can be appreciated on the figure, the accuracy of the training set is very high which indicates a significant amount of overfitting.

Kernel SVC:


In [14]:
from sklearn.svm import SVC
svm = SVC(C = 1000, kernel = 'rbf', gamma = 1)
svm.fit(X_train_white, y_train_white)
predicted = svm.predict(X_test_white)

#print("Predictions: {}".format(predicted))

scores = cross_val_score(svm, X = X_train_white, y = y_train_white)
report = classification_report(y_test_white, predicted)
print (report)

# print ("Cross Validation Scores: {}".format(scores))


             precision    recall  f1-score   support

          3       0.00      0.00      0.00         9
          4       0.80      0.08      0.14        51
          5       0.74      0.60      0.66       295
          6       0.58      0.82      0.68       409
          7       0.69      0.54      0.60       183
          8       0.75      0.27      0.40        33

avg / total       0.66      0.64      0.61       980

c:\users\franc\appdata\local\programs\python\python36\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

Finding the best parameters for Kernel SVC:


In [ ]:
params = [{'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [1, 0.1, 0.01, 0.001]}]
     
         
                  
grid_search = GridSearchCV(estimator = svm, param_grid = params, scoring = 'accuracy', cv = 5, n_jobs =1)
grid_search = grid_search.fit(X_train_white, y_train_white)
accuracySVC = grid_search.best_score_
best_paramsSVC = grid_search.best_params_

Due to very slow processing time. This gridsearch calculation had to be done using IDE.

Best Score: 0.6482899438489025

Best Params : {'C': 10, 'gamma': 1, 'kernel': 'rbf'}


In [15]:
train_accuracy = []
test_accuracy = []

Ci = [10, 100, 1000]

for i in Ci:
    svm = SVC(C = i, kernel = 'rbf', gamma = 1) # try rbf, linear and poly
    svm.fit(X_train_white, y_train_white)
    train_accuracy.append(svm.score(X_train_white, y_train_white))
    test_accuracy.append(svm.score(X_test_white, y_test_white))
plt.plot(Ci, train_accuracy, label = 'Train set accuracy')
plt.plot(Ci, test_accuracy, label = 'Test set accuracy')
plt.ylabel("Accuracy")
plt.xlabel("C")
plt.legend()
plt.show()


As it happened with the previous model, the accuracy of the training set is very high which indicates a significant amount of overfitting.

Descretization of the input variables in two clasess:


In [37]:
Xi = raw_df_white.iloc[:,:-1] # independent variables X
yi = raw_df_white['quality']

Xi = pd.DataFrame(Xi)
Xi.describe()


Out[37]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
count 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000
mean 6.854788 0.278241 0.334192 6.391415 0.045772 35.308085 138.360657 0.994027 3.188267 0.489847 10.514267
std 0.843868 0.100795 0.121020 5.072058 0.021848 17.007137 42.498065 0.002991 0.151001 0.114126 1.230621
min 3.800000 0.080000 0.000000 0.600000 0.009000 2.000000 9.000000 0.987110 2.720000 0.220000 8.000000
25% 6.300000 0.210000 0.270000 1.700000 0.036000 23.000000 108.000000 0.991723 3.090000 0.410000 9.500000
50% 6.800000 0.260000 0.320000 5.200000 0.043000 34.000000 134.000000 0.993740 3.180000 0.470000 10.400000
75% 7.300000 0.320000 0.390000 9.900000 0.050000 46.000000 167.000000 0.996100 3.280000 0.550000 11.400000
max 14.200000 1.100000 1.660000 65.800000 0.346000 289.000000 440.000000 1.038980 3.820000 1.080000 14.200000

In [47]:
Xi = raw_df_white.iloc[:,:-1] # independent variables X
yi = raw_df_white['quality']

Xi = pd.DataFrame(Xi)


# transforming to boolean integers.

Xi['fixed acidity'] = (Xi.iloc[:, 0] >= 6.8)*1
Xi['volatile acidity'] = (Xi.iloc[:, 1] >= 0.26)*1
Xi['citric acid'] = (Xi.iloc[:, 2] >= 0.32)*1
Xi['residual sugar'] = (Xi.iloc[:, 3] >= 5.2)*1
Xi['chlorides'] = (Xi.iloc[:, 4] >= 0.043)*1
Xi['free sulfur dioxide'] = (Xi.iloc[:, 5] >= 34)*1
Xi['total sulfur dioxide'] = (Xi.iloc[:, 6] >= 134)*1
Xi['density'] = (Xi.iloc[:, 7] >= 0.9937)*1
Xi['pH'] = (Xi.iloc[:, 8] >= 3.18)*1
Xi['sulphates'] = (Xi.iloc[:, 9] >= 0.47)*1
Xi['alcohol'] = (Xi.iloc[:, 10] >= 10.4)*1

KNN with discrete


In [50]:
X_train_white, X_test_white, y_train_white, y_test_white = cross_validation.train_test_split(Xi, y, test_size = 0.2, random_state = 0)

In [71]:
#scaler = StandardScaler()
scaler =  MinMaxScaler()
#scaler = Normalizer()
X_train_white = scaler.fit(X_train_white).transform(X_train_white)
X_test_white = scaler.fit(X_test_white).transform(X_test_white)

In [78]:
knn = KNeighborsClassifier(n_neighbors = 50, metric = 'manhattan', weights = 'distance', algorithm = 'auto')
knn.fit(X_train_white, y_train_white)
predicted_knn = knn.predict(X_test_white)

In [79]:
scores = cross_val_score(knn, X = X_train_white, y = y_train_white)
print ("Cross Validation Scores: {}".format(scores))


Cross Validation Scores: [0.52900763 0.49348659 0.48119724]

In [80]:
report = classification_report(y_test_white, predicted_knn)
print (report)


             precision    recall  f1-score   support

          3       0.00      0.00      0.00         9
          4       0.42      0.10      0.16        51
          5       0.53      0.58      0.56       295
          6       0.52      0.64      0.57       409
          7       0.51      0.34      0.41       183
          8       0.38      0.18      0.24        33

avg / total       0.51      0.52      0.50       980


In [82]:
svm = SVC(C = 10, kernel = 'rbf', gamma = 1)
svm.fit(X_train_white, y_train_white)
predicted = svm.predict(X_test_white)

#print("Predictions: {}".format(predicted))

scores = cross_val_score(svm, X = X_train_white, y = y_train_white)
report = classification_report(y_test_white, predicted)
print (report)


             precision    recall  f1-score   support

          3       0.00      0.00      0.00         9
          4       0.00      0.00      0.00        51
          5       0.57      0.48      0.52       295
          6       0.49      0.74      0.59       409
          7       0.50      0.32      0.39       183
          8       0.00      0.00      0.00        33

avg / total       0.47      0.51      0.47       980

c:\users\franc\appdata\local\programs\python\python36\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

Performing the discretization of the input variables does not increase accuracy, on the contrary, it significantly reduces it.

Continuing in part two.......


In [ ]: