In [1]:
import pandas as pd
import numpy as np
%pylab inline
pylab.style.use('ggplot')
In [2]:
# Letter-recognition dataset from the UCI Machine Learning repository.
# The CSV has no header row (header=None), so columns are attached later.
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/letter-recognition.data'
letter_df = pd.read_csv(url, header=None)
In [4]:
# Quick preview of the raw frame (columns are still unnamed integers).
letter_df.head()
Out[4]:
Next we attach the column names.
In [6]:
s = """ 1. lettr capital letter (26 values from A to Z)
2. x-box horizontal position of box (integer)
3. y-box vertical position of box (integer)
4. width width of box (integer)
5. high height of box (integer)
6. onpix total # on pixels (integer)
7. x-bar mean x of on pixels in box (integer)
8. y-bar mean y of on pixels in box (integer)
9. x2bar mean x variance (integer)
10. y2bar mean y variance (integer)
11. xybar mean x y correlation (integer)
12. x2ybr mean of x * x * y (integer)
13. xy2br mean of x * y * y (integer)
14. x-ege mean edge count left to right (integer)
15. xegvy correlation of x-ege with y (integer)
16. y-ege mean edge count bottom to top (integer)
17. yegvx correlation of y-ege with x (integer)"""
lines = [l.strip() for l in s.split('\n')]
feature_names = [l.split()[1] for l in lines]
feature_names = [f.replace('-', '_') for f in feature_names]
In [7]:
# Attach the parsed attribute names; the first column ('lettr') is the label.
letter_df.columns = feature_names
In [8]:
# Re-check the frame now that the columns are named.
letter_df.head()
Out[8]:
In [11]:
# Class-balance check: per-letter frequencies as horizontal bars,
# sorted in reverse alphabetical order before plotting.
letter_counts = letter_df['lettr'].value_counts()
reverse_alpha = letter_counts.sort_index(ascending=False)
reverse_alpha.plot(kind='barh')
Out[11]:
All the classes are represented in a fairly balanced manner, so in this instance we do not need to take extra steps to address class imbalance.
In [18]:
# Split predictors from the target label.
features_df = letter_df.drop('lettr', axis=1)
letters = letter_df['lettr']
import seaborn as sns  # NOTE(review): better placed in the top import cell
# Pairwise feature correlations, rendered as an annotated heatmap.
f_corrs = features_df.corr()
fig, ax = pylab.subplots(figsize=(12, 12))
sns.heatmap(f_corrs, annot=True, ax=ax)
Out[18]:
The first 5 features, x_box, y_box, width, high, onpix are highly correlated with each other.
Note: it is the chi-squared (`chi2`) feature-selection score, not the ANOVA F-test (`f_classif`), that requires all feature values to be non-negative; still, let us verify that none of the features take negative values.
In [28]:
# Count negative entries per feature column. A boolean reduction states the
# intent directly; the original mask-and-sum (features_df[features_df < 0]
# .sum(axis=0)) relied on NaN-skipping sums, where an all-NaN column
# silently prints 0.0 and obscures what is actually being checked.
(features_df < 0).sum(axis=0)
Out[28]:
The above condition holds in our case - none of the features have negative values.
In [19]:
from sklearn.feature_selection import f_classif
In [26]:
# ANOVA F-test of each feature against the class labels. f_classif returns
# F statistics and p-values, so name the first array accordingly — the
# original `t_stats` name was misleading (these are not t statistics).
f_stats, p_vals = f_classif(features_df, letters)
f_test_results = pd.DataFrame(np.column_stack([f_stats, p_vals]),
                              index=features_df.columns.copy(),
                              columns=['test_statistic', 'p_value'])
In [27]:
# One bar subplot per column: F statistic and p-value for each feature.
f_test_results.plot(kind='bar', subplots=True)
Out[27]:
In [44]:
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.naive_bayes import GaussianNB
In [45]:
# 10-fold stratified CV of Gaussian Naive Bayes on the 5 features with the
# highest ANOVA F scores, scored by macro-averaged F1.
estimator = GaussianNB()
selector = SelectKBest(f_classif, k=5)
pipeline = Pipeline([
    ('selector', selector),
    ('model', estimator)
])
# random_state pins the shuffled fold assignment so the scores are
# reproducible on a fresh run and comparable across cells.
cross_validator = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, features_df, letters,
                         cv=cross_validator, scoring='f1_macro')
score_1 = pd.Series(scores)
In [46]:
# Per-fold macro-F1 of the top-5-feature Naive Bayes model.
score_1.plot(kind='bar')
Out[46]:
In [47]:
# Same CV protocol as above, now keeping the top 10 features.
estimator = GaussianNB()
selector = SelectKBest(f_classif, k=10)
pipeline = Pipeline([
    ('selector', selector),
    ('model', estimator)
])
# random_state pins the shuffled fold assignment for reproducibility.
cross_validator = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, features_df, letters,
                         cv=cross_validator, scoring='f1_macro')
score_2 = pd.Series(scores)
In [48]:
# Side-by-side per-fold scores: top-5 vs top-10 features.
combined_scores = pd.concat([score_1, score_2], axis=1, keys=['cv_top5', 'cv_top10'])
combined_scores.plot(kind='bar')
Out[48]:
In [49]:
# Same CV protocol again, keeping the top 15 of the 16 features.
estimator = GaussianNB()
selector = SelectKBest(f_classif, k=15)
pipeline = Pipeline([
    ('selector', selector),
    ('model', estimator)
])
# random_state pins the shuffled fold assignment for reproducibility.
cross_validator = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, features_df, letters,
                         cv=cross_validator, scoring='f1_macro')
score_3 = pd.Series(scores)
In [50]:
# Side-by-side per-fold scores: top-10 vs top-15 features.
combined_scores2 = pd.concat([score_2, score_3], axis=1, keys=['cv_top10', 'cv_top15'])
combined_scores2.plot(kind='bar')
Out[50]:
We get a significant (in layman's terms, not in the statistical-testing sense) boost in accuracy by going from the top 5 to the top 10 features, but the further improvement from the top 15 features over the top 10 is marginal.
In [54]:
# Pairwise scatter plots of the five features with the largest F statistics,
# colored by letter class, to eyeball separability.
top_5_feature_names = f_test_results.nlargest(5, columns='test_statistic').index
pairplot_df = features_df.loc[:, top_5_feature_names].copy()
pairplot_df['letter'] = letters
sns.pairplot(pairplot_df, hue='letter')
Out[54]:
In [58]:
from sklearn.svm import SVC
In [59]:
# Same selection + 10-fold CV protocol with an RBF-kernel SVM (C=100)
# instead of Naive Bayes, restricted to the top 5 features.
estimator = SVC(C=100.0, kernel='rbf')
selector = SelectKBest(f_classif, k=5)
pipeline = Pipeline([
    ('selector', selector),
    ('model', estimator)
])
# random_state pins the shuffled fold assignment for reproducibility.
cross_validator = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, features_df, letters,
                         cv=cross_validator, scoring='f1_macro')
svm_5 = pd.Series(scores)
In [60]:
# Per-fold macro-F1 of the SVM restricted to the top 5 features.
svm_5.plot(kind='bar', title='10 Fold CV with SVM (top 5 features)')
Out[60]:
In [62]:
# Naive Bayes with 15 features vs SVM with only 5, per fold.
combined_3 = pd.concat([score_3, svm_5], axis=1, keys=['Gaussian_15', 'svm_5'])
combined_3.plot(kind='bar')
Out[62]:
In [68]:
# RBF SVM again, now with the top 10 features.
estimator = SVC(C=100.0, kernel='rbf')
selector = SelectKBest(f_classif, k=10)
pipeline = Pipeline([
    ('selector', selector),
    ('model', estimator)
])
# random_state pins the shuffled fold assignment for reproducibility.
cross_validator = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, features_df, letters,
                         cv=cross_validator, scoring='f1_macro')
svm_10 = pd.Series(scores)
In [72]:
# SVM per-fold scores: top-5 vs top-10 features.
combined_4 = pd.concat([svm_5, svm_10], axis=1, keys=['svm_5', 'svm_10'])
combined_4.plot(kind='bar')
Out[72]:
In [70]:
# RBF SVM with the top 15 features.
estimator = SVC(C=100.0, kernel='rbf')
selector = SelectKBest(f_classif, k=15)
pipeline = Pipeline([
    ('selector', selector),
    ('model', estimator)
])
# random_state pins the shuffled fold assignment for reproducibility.
cross_validator = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, features_df, letters,
                         cv=cross_validator, scoring='f1_macro')
svm_15 = pd.Series(scores)
In [71]:
# SVM per-fold scores: top-10 vs top-15 features.
combined_5 = pd.concat([svm_10, svm_15], axis=1, keys=['svm_10', 'svm_15'])
combined_5.plot(kind='bar')
Out[71]: