A gentle walkthrough of a Kaggle classification problem

  • Date: May 21, 2015
  • Instructor: Kian Ho
  • Prerequisites:
    • familiarity with Python and basic statistics/machine learning concepts

Review

This session covered some basic operations in pandas and scikit-learn that can be used to produce a rudimentary Kaggle entry for the Otto Group product classification problem. The session was highly interactive, with plenty of questions and discussion about the pros and cons of the machine learning design decisions in this walkthrough.

Dependencies

  • Download and unzip the train.csv.zip and test.csv.zip files from the Otto challenge website into the same directory as this notebook.
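
A quick sanity check that the unzipped files are in place before proceeding (a minimal sketch; the filenames are those given above):


In [ ]:
import os

# Confirm the unzipped Kaggle data files sit alongside this notebook.
for fname in ("train.csv", "test.csv"):
    assert os.path.exists(fname), "%s is missing, see Dependencies above" % fname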

Obligatory house-keeping import(s)


In [ ]:
import pandas
import numpy

Import the classifiers to evaluate


In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

Import the log-loss (as per Kaggle) and cross-validation functions for classifier evaluation.


In [91]:
from sklearn.metrics import log_loss
from sklearn.cross_validation import cross_val_score, StratifiedKFold
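
For reference, the multi-class log-loss that Kaggle uses for this challenge is

$$\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(p_{ij})$$

where $N$ is the number of rows, $M$ is the number of classes, $y_{ij}$ is 1 if row $i$ belongs to class $j$ (0 otherwise), and $p_{ij}$ is the predicted probability that row $i$ belongs to class $j$. Lower is better.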

Data Preparation

Load the Otto training dataset into a dataframe


In [5]:
df = pandas.read_csv("train.csv", index_col=0)

Inspect the first 10 rows of the dataset


In [41]:
df.head(10)


Out[41]:
feat_1 feat_2 feat_3 feat_4 feat_5 feat_6 feat_7 feat_8 feat_9 feat_10 ... feat_85 feat_86 feat_87 feat_88 feat_89 feat_90 feat_91 feat_92 feat_93 target
id
1 1 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 Class_1
2 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 Class_1
3 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 Class_1
4 1 0 0 1 6 1 5 0 0 1 ... 0 1 2 0 0 0 0 0 0 Class_1
5 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 1 0 0 0 Class_1
6 2 1 0 0 7 0 0 0 0 0 ... 0 3 0 0 0 0 2 0 0 Class_1
7 2 0 0 0 0 0 0 2 0 1 ... 1 1 0 0 0 0 0 0 1 Class_1
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 Class_1
9 0 0 0 0 0 0 0 4 0 0 ... 0 2 0 0 0 0 0 0 1 Class_1
10 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 1 0 0 Class_1

10 rows × 94 columns

Inspect the last five rows of the dataset


In [42]:
df.tail()


Out[42]:
feat_1 feat_2 feat_3 feat_4 feat_5 feat_6 feat_7 feat_8 feat_9 feat_10 ... feat_85 feat_86 feat_87 feat_88 feat_89 feat_90 feat_91 feat_92 feat_93 target
id
61874 1 0 0 1 1 0 0 0 0 0 ... 1 0 0 0 0 0 0 2 0 Class_9
61875 4 0 0 0 0 0 0 0 0 0 ... 0 2 0 0 2 0 0 1 0 Class_9
61876 0 0 0 0 0 0 0 3 1 0 ... 0 3 1 0 0 0 0 0 0 Class_9
61877 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 3 10 0 Class_9
61878 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 2 0 Class_9

5 rows × 94 columns

Perform a more detailed visual inspection of the attribute values, for example:

  • check if there are missing (NaN/null) values
  • check if all attributes are of a specific data type

In [7]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 61878 entries, 1 to 61878
Data columns (total 94 columns):
feat_1     61878 non-null int64
feat_2     61878 non-null int64
feat_3     61878 non-null int64
feat_4     61878 non-null int64
feat_5     61878 non-null int64
feat_6     61878 non-null int64
feat_7     61878 non-null int64
feat_8     61878 non-null int64
feat_9     61878 non-null int64
feat_10    61878 non-null int64
feat_11    61878 non-null int64
feat_12    61878 non-null int64
feat_13    61878 non-null int64
feat_14    61878 non-null int64
feat_15    61878 non-null int64
feat_16    61878 non-null int64
feat_17    61878 non-null int64
feat_18    61878 non-null int64
feat_19    61878 non-null int64
feat_20    61878 non-null int64
feat_21    61878 non-null int64
feat_22    61878 non-null int64
feat_23    61878 non-null int64
feat_24    61878 non-null int64
feat_25    61878 non-null int64
feat_26    61878 non-null int64
feat_27    61878 non-null int64
feat_28    61878 non-null int64
feat_29    61878 non-null int64
feat_30    61878 non-null int64
feat_31    61878 non-null int64
feat_32    61878 non-null int64
feat_33    61878 non-null int64
feat_34    61878 non-null int64
feat_35    61878 non-null int64
feat_36    61878 non-null int64
feat_37    61878 non-null int64
feat_38    61878 non-null int64
feat_39    61878 non-null int64
feat_40    61878 non-null int64
feat_41    61878 non-null int64
feat_42    61878 non-null int64
feat_43    61878 non-null int64
feat_44    61878 non-null int64
feat_45    61878 non-null int64
feat_46    61878 non-null int64
feat_47    61878 non-null int64
feat_48    61878 non-null int64
feat_49    61878 non-null int64
feat_50    61878 non-null int64
feat_51    61878 non-null int64
feat_52    61878 non-null int64
feat_53    61878 non-null int64
feat_54    61878 non-null int64
feat_55    61878 non-null int64
feat_56    61878 non-null int64
feat_57    61878 non-null int64
feat_58    61878 non-null int64
feat_59    61878 non-null int64
feat_60    61878 non-null int64
feat_61    61878 non-null int64
feat_62    61878 non-null int64
feat_63    61878 non-null int64
feat_64    61878 non-null int64
feat_65    61878 non-null int64
feat_66    61878 non-null int64
feat_67    61878 non-null int64
feat_68    61878 non-null int64
feat_69    61878 non-null int64
feat_70    61878 non-null int64
feat_71    61878 non-null int64
feat_72    61878 non-null int64
feat_73    61878 non-null int64
feat_74    61878 non-null int64
feat_75    61878 non-null int64
feat_76    61878 non-null int64
feat_77    61878 non-null int64
feat_78    61878 non-null int64
feat_79    61878 non-null int64
feat_80    61878 non-null int64
feat_81    61878 non-null int64
feat_82    61878 non-null int64
feat_83    61878 non-null int64
feat_84    61878 non-null int64
feat_85    61878 non-null int64
feat_86    61878 non-null int64
feat_87    61878 non-null int64
feat_88    61878 non-null int64
feat_89    61878 non-null int64
feat_90    61878 non-null int64
feat_91    61878 non-null int64
feat_92    61878 non-null int64
feat_93    61878 non-null int64
target     61878 non-null object
dtypes: int64(93), object(1)

The output above suggests that each feature column is a 64-bit integer. Is this what we were expecting? I would typically add explicit checks for such assumptions using assertions, as below, or a series of if-statements. Note that there may be other attribute characteristics worth checking that aren't mentioned here, e.g. do the features contain strange, unexpected values?


In [43]:
# Explicitly check that every feature column is a 64-bit integer
assert (df.drop("target", axis=1).dtypes == numpy.int64).all()
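
Following the suggestion above, here is one more hedged check, under the assumption (mine, not stated by the challenge) that the anonymised features are non-negative counts:


In [ ]:
# Assumption: the anonymised Otto features are non-negative counts, so any
# negative value would be a strange, unexpected value worth investigating.
features = df.drop("target", axis=1)
assert (features >= 0).all().all(), "unexpected negative feature values"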

Check the number of target labels.


In [44]:
len(df.target.unique())


Out[44]:
9

Therefore, this is a multi-class classification problem.


In [47]:
df.target.unique()


Out[47]:
array(['Class_1', 'Class_2', 'Class_3', 'Class_4', 'Class_5', 'Class_6',
       'Class_7', 'Class_8', 'Class_9'], dtype=object)
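
It is also worth checking how balanced the classes are; the class_weight="auto" option used later re-weights examples inversely to class frequency, which matters most when the classes are skewed:


In [ ]:
# Count how many training examples fall into each class.
df.target.value_counts()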

Convert the dataset into numpy arrays so that it can be processed by the scikit-learn classifiers. (The sklearn-pandas package improves the interoperability of pandas dataframes with sklearn, if you'd rather avoid the manual conversion.) Note the differences between pandas dataframes and numpy matrices/arrays, illustrated below.


In [9]:
X, y = df.drop("target", axis=1).values, df.target.values

View the converted dataset


In [10]:
X


Out[10]:
array([[ 1,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0],
       ..., 
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 1,  0,  0, ...,  3, 10,  0],
       [ 0,  0,  0, ...,  0,  2,  0]])

In [11]:
y


Out[11]:
array(['Class_1', 'Class_1', 'Class_1', ..., 'Class_9', 'Class_9',
       'Class_9'], dtype=object)
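
To make the dataframe/array distinction concrete: the numpy array keeps the values but drops the index and column labels.


In [ ]:
# The dataframe carries an index and column labels; the numpy array does not.
df.shape, X.shape  # (61878, 94) including the target column vs. (61878, 93) features only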

Model Training

Instantiate the candidate classifiers. I've selected only two to keep the training time down.


In [59]:
classifiers = [
    LogisticRegression(class_weight="auto"),
    RandomForestClassifier(n_estimators=100)
]

Train each classifier and estimate its generalisation error using the mean log-loss over k-fold cross-validation, storing each result in a list.

NOTE: cross_val_score returns a negative value for scoring functions in which "lower is better" (see GitHub), therefore the sign must be flipped.

NOTE: this is a crude approach to model selection and fitting, used here for brevity and demonstration; more robust methodologies would be used in production. A separate session dedicated to classifier training and evaluation is perhaps in order.
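
To make the sign convention concrete, here is a hedged sketch that scores a single held-out split with log_loss directly (train_test_split lives in sklearn.cross_validation in this era of scikit-learn; the 25% split size and the seed are arbitrary choices):


In [ ]:
from sklearn.cross_validation import train_test_split

# Hold out 25% of the training data and score it with log_loss directly;
# unlike cross_val_score's "log_loss" scorer, no sign flip is involved here.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
holdout_clf = LogisticRegression(class_weight="auto").fit(X_tr, y_tr)
log_loss(y_te, holdout_clf.predict_proba(X_te))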


In [60]:
k = 3 # no. of folds
RANDOM_SEED = 0 # any fixed value will do; defined here so the folds are reproducible
results_rows = []
for clf in classifiers:
    cv = StratifiedKFold(y, n_folds=k, random_state=RANDOM_SEED)
    cv_score = -cross_val_score(clf, X, y, cv=cv, scoring="log_loss")
    results_rows.append((clf.__class__.__name__, cv_score.mean(), cv_score.std()))

Store the results in a dataframe and sort by the log-loss in ascending order.


In [61]:
results_df = pandas.DataFrame(results_rows, columns=["clf", "mean", "SD"])
results_df.sort("mean", ascending=True, inplace=True)
results_df


Out[61]:
clf mean SD
1 RandomForestClassifier 0.621202 0.011696
0 LogisticRegression 0.721064 0.002514

Classify the test data

Load the test data into a pandas dataframe. This would also be a good time to check the correctness of the test data, as per the checks used for the training data (a sketch follows the preview below).


In [75]:
test_df = pandas.read_csv("test.csv", index_col=0)
test_df.head()


Out[75]:
feat_1 feat_2 feat_3 feat_4 feat_5 feat_6 feat_7 feat_8 feat_9 feat_10 ... feat_84 feat_85 feat_86 feat_87 feat_88 feat_89 feat_90 feat_91 feat_92 feat_93
id
1 0 0 0 0 0 0 0 0 0 3 ... 0 0 11 1 20 0 0 0 0 0
2 2 2 14 16 0 0 0 0 0 0 ... 0 0 0 0 0 4 0 0 2 0
3 0 1 12 1 0 0 0 0 0 0 ... 0 0 0 0 2 0 0 0 0 1
4 0 0 0 1 0 0 0 0 0 0 ... 0 3 1 0 0 0 0 0 0 0
5 1 0 0 1 0 0 1 2 0 3 ... 0 0 0 0 0 0 0 9 0 0

5 rows × 93 columns
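
As mentioned above, the training-data checks can be mirrored on the test set. A minimal sketch:


In [ ]:
# Mirror the earlier training-data checks on the test set.
assert (test_df.dtypes == numpy.int64).all()
assert list(test_df.columns) == list(df.columns.drop("target"))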

Train the RandomForest classifier on the entire training dataset.


In [66]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)


Out[66]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)

Predict the test data labels.


In [69]:
y_pred = clf.predict(test_df.values)
y_pred


Out[69]:
array(['Class_4', 'Class_6', 'Class_6', ..., 'Class_3', 'Class_2',
       'Class_2'], dtype=object)

However, the submissions for this challenge are required to be in the following CSV format:

id,Class_1,Class_2,Class_3,Class_4,Class_5,Class_6,Class_7,Class_8,Class_9
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.2,0.3,0.3,0.0,0.0,0.1,0.1,0.0
...
etc.

where each non-id value indicates the predicted probability for the corresponding class. To obtain this, use the predict_proba method for the selected classifier.


In [70]:
y_prob = clf.predict_proba(test_df.values)
y_prob


Out[70]:
array([[ 0.03,  0.14,  0.26, ...,  0.03,  0.01,  0.01],
       [ 0.01,  0.07,  0.01, ...,  0.02,  0.32,  0.03],
       [ 0.  ,  0.01,  0.  , ...,  0.  ,  0.01,  0.  ],
       ..., 
       [ 0.  ,  0.37,  0.4 , ...,  0.04,  0.01,  0.02],
       [ 0.  ,  0.55,  0.13, ...,  0.  ,  0.  ,  0.01],
       [ 0.  ,  0.46,  0.41, ...,  0.04,  0.  ,  0.  ]])

Each column of y_prob is ordered according to the following class labels:


In [85]:
clf.classes_


Out[85]:
array(['Class_1', 'Class_2', 'Class_3', 'Class_4', 'Class_5', 'Class_6',
       'Class_7', 'Class_8', 'Class_9'], dtype=object)
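
A final sanity check before writing the submission: each row of predicted probabilities should sum to (approximately) one.


In [ ]:
# Every row of class probabilities should sum to ~1.
assert numpy.allclose(y_prob.sum(axis=1), 1.0)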

Write the predicted probabilities to submission.csv in the submission format using a dataframe.


In [89]:
pandas.DataFrame(y_prob, columns=clf.classes_, index=test_df.index).to_csv("submission.csv")

Re-inspect the submission


In [90]:
pandas.read_csv("submission.csv", index_col=0).head()


Out[90]:
Class_1 Class_2 Class_3 Class_4 Class_5 Class_6 Class_7 Class_8 Class_9
id
1 0.03 0.14 0.26 0.47 0.00 0.05 0.03 0.01 0.01
2 0.01 0.07 0.01 0.01 0.01 0.52 0.02 0.32 0.03
3 0.00 0.01 0.00 0.01 0.00 0.97 0.00 0.01 0.00
4 0.01 0.56 0.28 0.08 0.00 0.00 0.01 0.00 0.06
5 0.08 0.01 0.00 0.00 0.00 0.08 0.05 0.24 0.54