Wine Selection

Framing

I want to buy a fine wine but I have no idea about wine selection.I'm not good at wine tasting.

I will use the data and understand what goes into making fine wine


In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [ ]:
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (13,8)

In [ ]:
df = pd.read_csv("./winequality-red.csv")
df.head()

In [ ]:
df.shape

Wine Category

Let's create a new column 'category' which signifies the category of wine - High (1) or Low (0)

Wine with quality > 6 is considered to be High quality, rest are Low quality


In [ ]:
#df.loc[df.b > 0, 'd'] = 1
df.loc[df.quality > 5, 'category'] = 1
df.loc[df.quality <= 5, 'category'] = 0

This is the frequency count for each category


In [ ]:
df.category.value_counts()

In [ ]:
df.head()

Visual Exploration

Let's see how the columns are related

To start, lets take 2 variables at a time to explore data

Correlation


In [ ]:
df.corr()

In [ ]:
from pandas.tools.plotting import scatter_matrix
scatter_matrix(df, figsize=(15,15), diagonal='kde')

Alcohol vs Category


In [ ]:
df.plot(x="alcohol", y="category", kind="scatter")

Exercise: Volatile Acidity vs Category


In [ ]:

3 variable visualization

Let's add one more dimension to get more sense what is correlated

Alcohol vs Volatile Acidity vs Category


In [ ]:
#df.plot(x="alcohol", y="volatile acidity", kind="scatter", c="category")
ax = df[df.category == 1].plot(x="alcohol", y="volatile acidity", kind="scatter", color="red", label="HIGH", s=100, alpha=0.5)
df[df.category == 0].plot(x="alcohol", y="volatile acidity", kind="scatter", color="green", label="LOW", s=100, alpha=0.5, ax=ax)

In [ ]:
pd.set_option("precision",3)

Time to build a predictive model

Let's build a model that can predict the category of wine, given information about alcohol content and volatile acidity

Building a predictive model involves training the model with historical data known as training data. Once we have the model trained, the model can predict labels (in this case, the category of wine) for the given features (test data) We have 1600 rows of the wine data, lets split this data into 80:20 ratio as training:testingg data

Why do we need to do this?

We can compare the predicted label with the actual label. By doing this, we can measure how accurate our model is.


In [ ]:
df.shape

In [ ]:
df_train = df.iloc[:1280,]
df_test = df.iloc[1280:,]

In [ ]:
X_train = df_train["volatile acidity"]
y_train = df_train["category"]

In [ ]:
X_test = df_test["volatile acidity"]
y_test = df_test["category"]

In [ ]:
X_train = X_train.reshape(X_train.shape[0],1)

X_test = X_test.reshape(X_test.shape[0],1)

In [ ]:
from sklearn.linear_model import LogisticRegression

In [ ]:
logistic_model = LogisticRegression()

In [ ]:
logistic_model.fit(X_train, y_train)

In [ ]:
sns.lmplot(data=df, x="alcohol", y="category", logistic=True)

It’s a bird… it’s a plane… it… depends on your classifier’s threshold -- Sancho McCann


In [ ]:
predicted = logistic_model.predict(X_test)

In [ ]:
df_compare = pd.DataFrame()
df_compare["actual"] = y_test
df_compare["predicted"] = predicted
df_compare["volatile acidity"] = df_test["volatile acidity"]

In [ ]:
ax=df_compare.plot(x="volatile acidity", y="actual", kind="scatter", color="blue", label="actual")
df_compare.plot(x="volatile acidity", y="predicted", kind="scatter", color="red", label="predicted", ax=ax)

Let's add more features - volatile acidity, sulphates, alcohol to predict the category

2 variable model


In [ ]:
df_train = df.iloc[:1280,]
df_test = df.iloc[1280:,]

In [ ]:
X_train = df_train[["sulphates", "alcohol"]]
y_train = df_train["category"]

In [ ]:
X_test = df_test[["sulphates", "alcohol"]]
y_test = df_test["category"]

In [ ]:
logistic_model = LogisticRegression()

logistic_model.fit(X_train, y_train)

In [ ]:
predicted = logistic_model.predict(X_test)

In [ ]:
df_compare = pd.DataFrame()
df_compare["actual"] = y_test
df_compare["predicted"] = predicted
df_compare["sulphates"] = df_test["sulphates"]
df_compare["alcohol"] = df_test["alcohol"]

In [ ]:
df_compare.head()

In [ ]:
ax = df_compare[df_compare.actual == 1].plot(x="alcohol", y="sulphates", kind="scatter", color="red", label="HIGH", s=100, alpha=0.5)
df_compare[df_compare.actual == 0].plot(x="alcohol", y="sulphates", kind="scatter", color="green", label="LOW", s=100, alpha=0.5, ax=ax)

In [ ]:
ax = df_compare[df_compare.predicted == 1].plot(x="alcohol", y="sulphates", kind="scatter", color="red", label="HIGH", s=100, alpha=0.5)
df_compare[df_compare.predicted == 0].plot(x="alcohol", y="sulphates", kind="scatter", color="green", label="LOW", s=100, alpha=0.5, ax=ax)

Accuracy Metrics

  • AUC
  • ROC
  • Misclassification Rate
  • Confusion Matrix
  • Precision & Recall

Confusion Matrix

Calculate True Positive Rate

TPR = TP / (TP+FN)

Calculate False Positive Rate

FPR = FP / (FP+TN)

In [ ]:


In [ ]:

Precise & Recall

AUC-ROC for the model


In [ ]:
from sklearn import metrics

In [ ]:
#ols_auc = metrics.roc_auc_score(df_compare.actual, df_compare.predicted)
fpr, tpr, thresholds = metrics.roc_curve(df_compare.actual, df_compare.predicted)
plt.plot(fpr, tpr)
plt.plot([0,1],[0,1])

In [ ]: