Create a classifier to predict the wine color from wine quality attributes using this dataset: http://archive.ics.uci.edu/ml/datasets/Wine+Quality

The data is in the database we've been using:

  • host='training.c1erymiua9dx.us-east-1.rds.amazonaws.com'
  • database='training'
  • port=5432
  • user='dot_student'
  • password='qgis'
  • table name = 'winequality'

In [40]:
import pandas as pd
%matplotlib inline
from sklearn import datasets
from sklearn import tree
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in newer versions
import matplotlib.pyplot as plt
import numpy as np
from numpy import array

Query for the data and create a numpy array


In [4]:
import pg8000
conn = pg8000.connect(host='training.c1erymiua9dx.us-east-1.rds.amazonaws.com', port=5432, database='training', user='dot_student', password='qgis')

In [5]:
cursor = conn.cursor()

In [7]:
cursor.execute("select column_name from information_schema.columns where table_name='winequality'")
column_name=[]
for item in cursor.fetchall():
    column_name.append(item)
    print(item)


['fixed_acidity']
['volatile_acidity']
['citric_acid']
['residual_sugar']
['chlorides']
['free_sulfur_dioxide']
['total_sulfur_dioxide']
['density']
['ph']
['sulphates']
['alcohol']
['color']
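
Since each fetched row comes back as a one-element list, it can be handy to flatten the names into plain strings; this is just an optional convenience step, and column_names is a new name introduced here:


In [ ]:
# flatten the one-element lists into plain column-name strings
column_names = [row[0] for row in column_name]
column_names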

In [21]:
cursor.execute("select * from winequality")
data=[]
for item in cursor.fetchall():
    #print(item)
    data.append(item)
type(data)
my_array= array(data)
type(my_array)


Out[21]:
numpy.ndarray
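
If you prefer to stay in pandas, an equivalent way to pull the table is to hand the same query and connection to read_sql. This is only a sketch: pandas officially supports SQLAlchemy connectables, so passing the raw pg8000 connection may raise a warning, but the query is the same one used above.


In [ ]:
# read the whole table into a DataFrame over the existing pg8000 connection
wine_df = pd.read_sql("select * from winequality", conn)
wine_df.head()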

Split the data into features (x) and target (y, the last column in the table)

Remember you can cast the results into a numpy array and then slice out what you want


In [42]:
x = my_array[:, :11]   # feature columns
y = my_array[:, 11:]   # two-dimensional target array
y2 = my_array[:, 11]   # one-dimensional target array (this is the one we need)
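
One caveat before fitting: because the color column is text, array(data) usually comes back as an object (or string) array, so it can be worth casting the feature slice to floats explicitly. A minimal check, assuming the x and y2 slices above:


In [ ]:
# the mixed-type rows make my_array non-numeric, so cast the features to floats
x = x.astype(float)
x.shape, y2.shape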

Create a decision tree with the data


In [38]:
dt = DecisionTreeClassifier()

In [39]:
dt = dt.fit(x, y2)  # fit on the one-dimensional target noted above

Run 10-fold cross validation on the model


In [46]:
scores = cross_val_score(dt, x, y2, cv=10)

In [47]:
scores


Out[47]:
array([ 0.97846154,  0.98307692,  0.97538462,  0.97846154,  0.98615385,
        0.98      ,  0.97538462,  0.96923077,  0.98613251,  0.97222222])
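
To summarize the ten folds in a single number, the usual report is the mean (and spread) of scores:


In [ ]:
# average accuracy across the 10 folds, plus the spread between folds
print(scores.mean(), scores.std())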

If you have time, calculate the feature importances and graph them, based on the code in the slides from last class


In [ ]:
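# A sketch of the feature-importance plot, assuming the tree dt fitted above and the
# column order returned by the information_schema query; the exact styling in the
# class slides may differ.
feature_names = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar',
                 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density',
                 'ph', 'sulphates', 'alcohol']

importances = dt.feature_importances_   # one value per feature column, same order as x
order = np.argsort(importances)         # plot from least to most important

plt.barh(range(len(order)), importances[order])
plt.yticks(range(len(order)), [feature_names[i] for i in order])
plt.xlabel('feature importance')
plt.show()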