Balance Scale Classification - UCI

Analysis of the UCI Balance Scale Dataset.

Get the Data


In [1]:
import pandas as pd
import numpy as np

%pylab inline
pylab.style.use('ggplot')


Populating the interactive namespace from numpy and matplotlib

In [2]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data'

balance_df = pd.read_csv(url, header=None)

In [3]:
balance_df.columns = ['class_name', 'left_weight', 'left_distance', 'right_weight', 'right_distance']

In [4]:
balance_df.head()


Out[4]:
class_name left_weight left_distance right_weight right_distance
0 B 1 1 1 1
1 R 1 1 1 2
2 R 1 1 1 3
3 R 1 1 1 4
4 R 1 1 1 5
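The UCI description of this dataset defines the class by a torque comparison: the scale tips to whichever side has the larger weight × distance product, and balances when the two products are equal. A minimal sketch of that rule follows, checked against the five rows printed above. The helper name `torque_label` and the hand-typed `head_df` are illustrative and not part of the original notebook.

```python
import pandas as pd


def torque_label(left_weight, left_distance, right_weight, right_distance):
    """Return 'L', 'R' or 'B' by comparing weight * distance on each side."""
    left = left_weight * left_distance
    right = right_weight * right_distance
    if left > right:
        return 'L'
    if left < right:
        return 'R'
    return 'B'


# The five rows shown by balance_df.head(), typed in by hand (no download needed).
head_df = pd.DataFrame(
    [['B', 1, 1, 1, 1],
     ['R', 1, 1, 1, 2],
     ['R', 1, 1, 1, 3],
     ['R', 1, 1, 1, 4],
     ['R', 1, 1, 1, 5]],
    columns=['class_name', 'left_weight', 'left_distance',
             'right_weight', 'right_distance'])

# Apply the rule to the four feature columns, in their stored order.
predicted = [torque_label(*row)
             for row in head_df.drop(columns='class_name').itertuples(index=False)]

print(predicted == head_df['class_name'].tolist())
```

Because the label is a deterministic function of products of the features, any model that treats the four columns as independent can only approximate it.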

Check for Class Imbalance


In [5]:
counts = balance_df['class_name'].value_counts()
counts.plot(kind='bar')


Out[5]:
[Figure: bar chart of the number of rows per class_name]
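The bar chart gives the imbalance visually; the same information as proportions can be sketched without downloading anything. The block below assumes the file contains every combination of the four attributes with values 1 to 5, labelled by the torque rule (the larger weight × distance product wins). `grid_df` is a hypothetical reconstruction, not the downloaded `balance_df`.

```python
from itertools import product

import pandas as pd

# Rebuild the assumed full grid: 5 values for each of 4 attributes.
rows = []
for lw, ld, rw, rd in product(range(1, 6), repeat=4):
    left, right = lw * ld, rw * rd
    label = 'L' if left > right else 'R' if left < right else 'B'
    rows.append((label, lw, ld, rw, rd))

grid_df = pd.DataFrame(
    rows,
    columns=['class_name', 'left_weight', 'left_distance',
             'right_weight', 'right_distance'])

# Absolute counts and relative frequencies per class.
grid_counts = grid_df['class_name'].value_counts()
proportions = grid_df['class_name'].value_counts(normalize=True)

print(grid_counts)
print(proportions.round(3))
```

Under that assumption the balanced class makes up 49 of 625 rows (about 7.8%), while L and R have 288 rows each. If the downloaded file matches the grid, `counts` in the cell above should show the same numbers.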

Feature Importances

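Next, we score each feature with a one-way ANOVA F-test (f_classif). The F-test itself does not require non-negative feature values (that constraint applies to the chi-squared test, chi2), but we check the sign of the features anyway as a quick sanity check.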


In [6]:
from sklearn.feature_selection import f_classif

In [7]:
features = balance_df.drop('class_name', axis=1)
names = balance_df['class_name']

In [8]:
# sum of the negative feature values per column; all zeros means no negatives
features[features < 0].sum(axis=0)


Out[8]:
left_weight       0.0
left_distance     0.0
right_weight      0.0
right_distance    0.0
dtype: float64
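To make the role of the sign check concrete, here is a small sketch on synthetic data (the arrays `X` and `y` are made up for illustration and are not the Balance Scale data). It shows that `f_classif` gives the same F-statistics after every feature is shifted below zero, whereas `chi2` rejects negative input with a `ValueError`.

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif

rng = np.random.default_rng(0)

# Hypothetical data: 90 rows, 4 integer features in 1..5, 3 classes of 30 rows.
X = rng.integers(1, 6, size=(90, 4)).astype(float)
y = np.repeat(['L', 'B', 'R'], 30)

# The ANOVA F-statistic is unchanged when a constant is subtracted from a feature.
f_original, _ = f_classif(X, y)
f_shifted, _ = f_classif(X - 10.0, y)  # every value is now negative
print(np.allclose(f_original, f_shifted))

# chi2, by contrast, refuses negative feature values.
try:
    chi2(X - 10.0, y)
    chi2_rejected = False
except ValueError:
    chi2_rejected = True
print(chi2_rejected)
```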

In [9]:
# f_classif returns one-way ANOVA F-statistics (not t-statistics) and their p-values
f_stats, p_vals = f_classif(features, names)

feature_importances = pd.DataFrame(np.column_stack([f_stats, p_vals]),
                                   index=features.columns.copy(),
                                   columns=['f_stats', 'p_vals'])
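As a cross-check on what `f_classif` computes, the sketch below compares it with `scipy.stats.f_oneway` applied to each feature separately, splitting the values by class. The data is synthetic and the names `X`, `y`, `f_sklearn` and `f_scipy` are illustrative only.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(1)

# Hypothetical data: 90 rows, 4 integer features in 1..5, 3 classes of 30 rows.
X = rng.integers(1, 6, size=(90, 4)).astype(float)
y = np.repeat(['L', 'B', 'R'], 30)

f_sklearn, p_sklearn = f_classif(X, y)

# One-way ANOVA per feature: one group of values for each class label.
f_scipy = []
for j in range(X.shape[1]):
    groups = [X[y == label, j] for label in np.unique(y)]
    f_scipy.append(f_oneway(*groups).statistic)

print(np.allclose(f_sklearn, f_scipy))
```

Each score looks at one feature in isolation, so a feature that only matters through its product with another column can still receive a modest F-statistic.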

In [10]:
feature_importances.plot(subplots=True, kind='bar')
plt.xticks(rotation=30)


Out[10]:
(array([0, 1, 2, 3]), <a list of 4 Text xticklabel objects>)

In [11]:
import seaborn as sns

for colname in balance_df.columns.drop('class_name'):
    fg = sns.FacetGrid(col='class_name', data=balance_df)
    fg = fg.map(pylab.hist, colname)



In [12]:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.naive_bayes import GaussianNB

estimator = GaussianNB()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=12345)

f1 = cross_val_score(estimator, features, names, cv=cv, scoring='f1_micro')
pd.Series(f1).plot(title='F1 Score (Micro)', kind='bar')


Out[12]:
[Figure: bar chart titled "F1 Score (Micro)", one bar per cross-validation fold]

In [14]:
estimator = GaussianNB()
# same folds as the previous cell, so the two metrics are directly comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=12345)

accuracy = cross_val_score(estimator, features, names, cv=cv, scoring='accuracy')
pd.Series(accuracy).plot(title='Accuracy', kind='bar')


Out[14]:
[Figure: bar chart titled "Accuracy", one bar per cross-validation fold]
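For a single-label multiclass problem, micro-averaged F1 and accuracy are the same quantity, so the two charts above should show identical bars. The sketch below checks this on a reconstructed grid rather than the downloaded file; it assumes the dataset is the full set of combinations of values 1 to 5 labelled by the torque rule, and `grid_df`, `X` and `y` are hypothetical names.

```python
from itertools import product

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Rebuild the assumed full grid and label each row by the torque rule.
rows = []
for lw, ld, rw, rd in product(range(1, 6), repeat=4):
    left, right = lw * ld, rw * rd
    label = 'L' if left > right else 'R' if left < right else 'B'
    rows.append((label, lw, ld, rw, rd))

grid_df = pd.DataFrame(
    rows,
    columns=['class_name', 'left_weight', 'left_distance',
             'right_weight', 'right_distance'])
X = grid_df.drop('class_name', axis=1)
y = grid_df['class_name']

# A fixed random_state makes both calls use the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=12345)
f1_micro = cross_val_score(GaussianNB(), X, y, cv=cv, scoring='f1_micro')
accuracy = cross_val_score(GaussianNB(), X, y, cv=cv, scoring='accuracy')

print(np.allclose(f1_micro, accuracy))
```

A metric that is more sensitive to the small B class, such as `scoring='f1_macro'`, would give a different picture from accuracy.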