Setup

  • Python version: 2.7.10
  • Data file: wine_data.csv, expected in the same directory as the notebook.

Reading data


In [1]:
%matplotlib inline
import matplotlib.colors
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.read_csv('wine_data.csv')

In [2]:
# View a sample of the data (rows 1-5)
df.iloc[1:6]  # .ix is deprecated; .iloc selects by position


Out[2]:
   class  alcohol  malic_acid   ash  alcalinity  magnesium  phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  OD280_OD315  proline
1      1    13.20        1.78  2.14        11.2        100     2.65        2.76                  0.26             1.28             4.38  1.05         3.40     1050
2      1    13.16        2.36  2.67        18.6        101     2.80        3.24                  0.30             2.81             5.68  1.03         3.17     1185
3      1    14.37        1.95  2.50        16.8        113     3.85        3.49                  0.24             2.18             7.80  0.86         3.45     1480
4      1    13.24        2.59  2.87        21.0        118     2.80        2.69                  0.39             1.82             4.32  1.04         2.93      735
5      1    14.20        1.76  2.45        15.2        112     3.27        3.39                  0.34             1.97             6.75  1.05         2.85     1450

Plots

Alcohol vs Hue


In [3]:
# Color points by class: column 0 is the class label,
# column 1 is alcohol, column 11 is hue
cmap = matplotlib.colors.ListedColormap(["red", "cyan", "blue"])
s = plt.scatter(df.iloc[:, 1], df.iloc[:, 11], c=df.iloc[:, 0], cmap=cmap)
plt.xlabel('Alcohol')
plt.ylabel('Hue')
plt.show()


From the plot above, there is no strong linear relationship between hue and alcohol content. However, the three classes form fairly distinct clusters with only slight overlap, so a distance-based method like kNN might classify reasonably well using just these two features; a quick sketch follows.
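
To test that intuition, here is a minimal sketch of a kNN classifier on just these two features. It assumes scikit-learn is installed; k=5 and the 5-fold cross-validation are arbitrary, untuned choices.


In [ ]:
# Hypothetical check: kNN on (alcohol, hue) only -- assumes scikit-learn
# (on versions older than 0.18 the import path is sklearn.cross_validation)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X = df[['alcohol', 'hue']].values  # the two features plotted above
y = df['class'].values

knn = KNeighborsClassifier(n_neighbors=5)  # k=5 is an untuned default
print(cross_val_score(knn, X, y, cv=5).mean())

Since kNN is distance based, scaling the two features first would likely change the result; this sketch skips that for brevity.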

All pairwise feature graphs (13 choose 2)


In [28]:
cmap = matplotlib.colors.ListedColormap(["red", "cyan", "blue"])
# Plot every pair of the 13 features (columns 1-13), colored by class
for i in range(1, 13):
    for j in range(i + 1, 14):
        s = plt.scatter(df.iloc[:, i], df.iloc[:, j], c=df.iloc[:, 0], cmap=cmap)
        plt.xlabel(df.columns.values[i])
        plt.ylabel(df.columns.values[j])
        plt.show()
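
As an aside, a scatter-plot matrix packs all of these pairs into a single figure. A sketch, assuming a pandas version that ships pandas.plotting.scatter_matrix (older releases expose it as pandas.tools.plotting.scatter_matrix):


In [ ]:
# One-figure alternative to the 78 separate plots above
from pandas.plotting import scatter_matrix

scatter_matrix(df.iloc[:, 1:], c=df.iloc[:, 0], cmap=cmap, figsize=(14, 14))
plt.show()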


  • The flavanoids vs alcohol graph gives a good separation among the categories.
  • We can also see a linear relationship between total phenols and flavanoids (although not sufficient for class separation on its own). Redundancy between features like this is exactly what motivates using PCA for dimensionality reduction; a quick numeric check follows below.
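
A quick numeric check of the phenols/flavanoids relationship, using only pandas:


In [ ]:
# Pearson correlation between total phenols and flavanoids
print(df['phenols'].corr(df['flavanoids']))

# Full feature-feature correlation matrix, to spot other strongly related pairs
print(df.iloc[:, 1:].corr().round(2))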

Look for linear separation using a single feature


In [23]:
# Plot each feature against the class label. Without a c= argument the
# cmap is ignored, so pass the class column as the color values.
for i in range(1, 14):
    s = plt.scatter(df.iloc[:, i], df.iloc[:, 0], c=df.iloc[:, 0], cmap=cmap)
    plt.xlabel(df.columns.values[i])
    plt.ylabel('Category')
    plt.show()


  • This simple experiment makes it clear that a linear separator on any single feature would not be enough.
  • However, very low and very high proline values each appear to come mostly from a single class, as the proline graph shows; the cross-tab sketch below makes this concrete.
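
To make the proline observation concrete, here is a small sketch that bins proline and tabulates the classes per bin. The bin edges are eyeballed from the plot and purely illustrative, not fitted thresholds.


In [ ]:
# Cross-tabulate class against coarse proline bins (edges are eyeballed,
# purely illustrative -- not fitted thresholds)
bins = pd.cut(df['proline'], bins=[0, 600, 1000, 2000],
              labels=['low', 'mid', 'high'])
print(pd.crosstab(bins, df['class']))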

Checking if PCA would help


In [11]:
# Scale each feature: subtract the mean and divide by the range (max - min),
# so no single feature dominates the covariance matrix
df_norm = (df.iloc[:, 1:] - df.iloc[:, 1:].mean()) / (df.iloc[:, 1:].max() - df.iloc[:, 1:].min())
cov_mat = df_norm.cov()

In [6]:
print(np.linalg.eigvals(cov_mat))  # eigenvalues: variances along the principal directions
print(np.diagonal(cov_mat))        # diagonal: current per-feature variances


[ 0.2200922   0.10246084  0.04624247  0.04011226  0.03005877  0.02516286
  0.01978926  0.00440241  0.0074605   0.00687688  0.01301012  0.01228411
  0.01215769]
[ 0.04564144  0.04874375  0.02152324  0.02963303  0.02410082  0.04657426
  0.044407    0.05513932  0.03260005  0.0391272   0.03453299  0.06763628
  0.05045102]

It looks like PCA would help: the eigenvalues of the covariance matrix are the variances of the data along the principal directions, and they are much more spread out than the diagonal entries of the current covariance matrix. Most of the variance concentrates in a few leading components, as quantified below.
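
To quantify this, we can check how much of the total variance the leading eigenvalues capture. A short sketch using only numpy, continuing from cov_mat above:


In [ ]:
# Fraction of the total variance carried by each principal component
eigvals = np.sort(np.linalg.eigvalsh(cov_mat))[::-1]  # descending order
explained = eigvals / eigvals.sum()
print(np.round(explained, 3))             # per-component share
print(np.round(np.cumsum(explained), 3))  # cumulative share

Going by the eigenvalues printed above, the first two components alone account for roughly 60% of the total variance, so projecting onto a few principal components should retain most of the structure.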

