These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
The attributes (donated by Riccardo Leardi, iclea@anchem.unige.it) are:
1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline
Number of Instances:
class 1: 59
class 2: 71
class 3: 48
Number of Attributes: 13
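The class counts above are somewhat uneven, which matters later when the folds are stratified and the scores are macro-averaged. A minimal sketch, using only the counts listed above (the variable names are illustrative), that computes the total number of rows and the share of each class:

```python
# Class counts as listed in the dataset description above.
class_counts = {1: 59, 2: 71, 3: 48}

total = sum(class_counts.values())
proportions = {label: count / total for label, count in class_counts.items()}

print(total)
for label, share in proportions.items():
    print('class %d: %.1f%%' % (label, 100 * share))
```

A stratified split aims to keep roughly these proportions in every fold.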
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
In [4]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
data_df = pd.read_csv(url, header=None)
In [5]:
data_df.head()
Out[5]:
In [6]:
cols = """Alcohol
Malic acid
Ash
Alcalinity of ash
Magnesium
Total phenols
Flavanoids
Nonflavanoid phenols
Proanthocyanins
Color intensity
Hue
OD280_OD315 of diluted wines
Proline""".split('\n')
cols = [c.lower().replace(' ', '_') for c in cols]
In [7]:
data_df.columns = ['target'] + cols
In [8]:
data_df.head()
Out[8]:
In [9]:
import seaborn as sns
corrs = data_df.drop('target', axis=1).corr()
sns.heatmap(corrs)
Out[9]:
In [10]:
features = data_df.drop('target', axis=1)
target = data_df['target']
In [16]:
from sklearn.feature_selection import chi2, SelectKBest
s = SelectKBest(chi2, k=5)
s.fit(X=features, y=target)
scores = pd.Series(s.scores_, index=features.columns)
scores.sort_values(ascending=False).plot(kind='bar')
Out[16]:
In [18]:
scores.sort_values(ascending=False).iloc[1:].plot(kind='bar')
Out[18]:
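The second plot drops the highest-scoring feature, presumably so that the remaining bars are readable on a shared axis. One likely reason a single score can dwarf the rest is that the chi-squared statistic grows with the magnitude of the feature values. The sketch below is a hand-written, pure-Python version for a single feature. `chi2_score` and the toy data are hypothetical and not part of scikit-learn. It assumes the statistic is the sum over classes of (observed - expected)^2 / expected, where "observed" is the per-class sum of the feature and "expected" is the class frequency times the overall sum.

```python
from collections import defaultdict

def chi2_score(values, labels):
    """Chi-squared statistic of one non-negative feature against class labels."""
    n = len(values)
    total = sum(values)
    observed = defaultdict(float)   # per-class sum of the feature
    class_size = defaultdict(int)   # number of samples per class
    for v, y in zip(values, labels):
        observed[y] += v
        class_size[y] += 1
    score = 0.0
    for y in class_size:
        # Expected per-class sum if the feature were independent of the class.
        expected = class_size[y] / n * total
        score += (observed[y] - expected) ** 2 / expected
    return score

# Toy data (made up for illustration only).
toy_values = [1.0, 2.0, 3.0, 4.0]
toy_labels = [0, 0, 1, 1]

base = chi2_score(toy_values, toy_labels)
scaled = chi2_score([100 * v for v in toy_values], toy_labels)
print(base, scaled)
```

Multiplying the toy feature by 100 multiplies its score by 100, so raw chi2 scores of features measured on very different scales are not directly comparable.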
In [44]:
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold
def cross_validate_with_top_n_features(n_features=2, n_folds=5):
    # Feature selection sits inside the pipeline, so it is refit on each training fold.
    selector = SelectKBest(chi2, k=n_features)
    estimator = GaussianNB()
    pipeline = make_pipeline(selector, estimator)
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True)
    scores = cross_val_score(pipeline, X=features, y=target, cv=cv, scoring='f1_macro')
    score_series = pd.Series(data=scores)
    return score_series

# One column of per-fold macro-F1 scores for each k from 2 to 9.
cv_results = {'cv_with_%s_features' % k: cross_validate_with_top_n_features(n_features=k)
              for k in range(2, 10)}
cv_results = pd.concat(cv_results, axis=1)
In [47]:
cv_results.plot(kind='bar', figsize=(16, 10))
Out[47]:
In [46]:
cv_results.mean(axis=0).plot(kind='barh')
Out[46]:
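Because `StratifiedKFold` is created with `shuffle=True` and no `random_state`, each call draws different folds, so differences between the columns mix the effect of the number of features with fold-to-fold noise. A minimal sketch of a reproducible variant follows. It assumes a fixed seed is acceptable (the function name `cross_validate_seeded` and its `seed` parameter are illustrative), and it uses scikit-learn's bundled `load_wine` copy of the data, with classes labeled 0 to 2, so that it runs without the download.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Bundled copy of the wine data; features are non-negative, as chi2 requires.
X, y = load_wine(return_X_y=True)

def cross_validate_seeded(n_features=2, n_folds=5, seed=0):
    """Per-fold macro-F1 scores with the same shuffled folds on every call."""
    pipeline = make_pipeline(SelectKBest(chi2, k=n_features), GaussianNB())
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    return cross_val_score(pipeline, X, y, cv=cv, scoring='f1_macro')

first = cross_validate_seeded(n_features=5)
second = cross_validate_seeded(n_features=5)
print(np.allclose(first, second))
```

With the seed fixed, every value of k is evaluated on the same folds, which makes the per-k means easier to compare.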