A gentle walkthrough of a Kaggle classification problem

  • Date: May 21, 2015
  • Instructor: Kian Ho
  • Prerequisites:
    • familiarity with Python and basic statistics/machine learning concepts

Review

This session covered some basic operations in pandas and scikit-learn that can be used to produce a rudimentary Kaggle entry for the Otto Group product classification problem. The session was highly interactive, with plenty of questions and discussion about the pros and cons of the machine learning design decisions in this walkthrough.

Dependencies

  • Download and unzip the train.csv.zip and test.csv.zip files from the Otto challenge website into the same directory as this notebook.
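
A quick sanity check that the unzipped files are in place before proceeding (a minimal sketch; the filenames are those given above):


In [ ]:
import os

# Confirm the unzipped Kaggle data files sit alongside this notebook.
for fname in ("train.csv", "test.csv"):
    assert os.path.exists(fname), "%s is missing, see Dependencies above" % fname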

Obligatory house-keeping import(s)


In [ ]:
import pandas
import numpy

Import the classifiers to evaluate


In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

Import the log-loss (as per Kaggle) and cross-validation functions for classifier evaluation.


In [91]:
from sklearn.metrics import log_loss
from sklearn.cross_validation import cross_val_score, StratifiedKFold
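
For reference, the multi-class log-loss that Kaggle uses for this challenge is

$$\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(p_{ij})$$

where $N$ is the number of rows, $M$ is the number of classes, $y_{ij}$ is 1 if row $i$ belongs to class $j$ (0 otherwise), and $p_{ij}$ is the predicted probability that row $i$ belongs to class $j$. Lower is better.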

Data Preparation

Load the Otto training dataset into a dataframe


In [5]:
df = pandas.read_csv("train.csv", index_col=0)

Inspect the first 10 rows of the dataset


In [41]:
df.head(10)


Out[41]:
feat_1 feat_2 feat_3 feat_4 feat_5 feat_6 feat_7 feat_8 feat_9 feat_10 ... feat_85 feat_86 feat_87 feat_88 feat_89 feat_90 feat_91 feat_92 feat_93 target
id
1 1 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 Class_1
2 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 Class_1
3 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 Class_1
4 1 0 0 1 6 1 5 0 0 1 ... 0 1 2 0 0 0 0 0 0 Class_1
5 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 1 0 0 0 Class_1
6 2 1 0 0 7 0 0 0 0 0 ... 0 3 0 0 0 0 2 0 0 Class_1
7 2 0 0 0 0 0 0 2 0 1 ... 1 1 0 0 0 0 0 0 1 Class_1
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 Class_1
9 0 0 0 0 0 0 0 4 0 0 ... 0 2 0 0 0 0 0 0 1 Class_1
10 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 1 0 0 Class_1

10 rows × 94 columns

Inspect the last five rows of the dataset


In [42]:
df.tail()


Out[42]:
feat_1 feat_2 feat_3 feat_4 feat_5 feat_6 feat_7 feat_8 feat_9 feat_10 ... feat_85 feat_86 feat_87 feat_88 feat_89 feat_90 feat_91 feat_92 feat_93 target
id
61874 1 0 0 1 1 0 0 0 0 0 ... 1 0 0 0 0 0 0 2 0 Class_9
61875 4 0 0 0 0 0 0 0 0 0 ... 0 2 0 0 2 0 0 1 0 Class_9
61876 0 0 0 0 0 0 0 3 1 0 ... 0 3 1 0 0 0 0 0 0 Class_9
61877 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 3 10 0 Class_9
61878 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 2 0 Class_9

5 rows × 94 columns

Perform a more detailed visual inspection of the attribute values, for example:

  • check if there are missing (NaN/null) values
  • check if all attributes are of a specific data type

In [7]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 61878 entries, 1 to 61878
Data columns (total 94 columns):
feat_1     61878 non-null int64
feat_2     61878 non-null int64
feat_3     61878 non-null int64
feat_4     61878 non-null int64
feat_5     61878 non-null int64
feat_6     61878 non-null int64
feat_7     61878 non-null int64
feat_8     61878 non-null int64
feat_9     61878 non-null int64
feat_10    61878 non-null int64
feat_11    61878 non-null int64
feat_12    61878 non-null int64
feat_13    61878 non-null int64
feat_14    61878 non-null int64
feat_15    61878 non-null int64
feat_16    61878 non-null int64
feat_17    61878 non-null int64
feat_18    61878 non-null int64
feat_19    61878 non-null int64
feat_20    61878 non-null int64
feat_21    61878 non-null int64
feat_22    61878 non-null int64
feat_23    61878 non-null int64
feat_24    61878 non-null int64
feat_25    61878 non-null int64
feat_26    61878 non-null int64
feat_27    61878 non-null int64
feat_28    61878 non-null int64
feat_29    61878 non-null int64
feat_30    61878 non-null int64
feat_31    61878 non-null int64
feat_32    61878 non-null int64
feat_33    61878 non-null int64
feat_34    61878 non-null int64
feat_35    61878 non-null int64
feat_36    61878 non-null int64
feat_37    61878 non-null int64
feat_38    61878 non-null int64
feat_39    61878 non-null int64
feat_40    61878 non-null int64
feat_41    61878 non-null int64
feat_42    61878 non-null int64
feat_43    61878 non-null int64
feat_44    61878 non-null int64
feat_45    61878 non-null int64
feat_46    61878 non-null int64
feat_47    61878 non-null int64
feat_48    61878 non-null int64
feat_49    61878 non-null int64
feat_50    61878 non-null int64
feat_51    61878 non-null int64
feat_52    61878 non-null int64
feat_53    61878 non-null int64
feat_54    61878 non-null int64
feat_55    61878 non-null int64
feat_56    61878 non-null int64
feat_57    61878 non-null int64
feat_58    61878 non-null int64
feat_59    61878 non-null int64
feat_60    61878 non-null int64
feat_61    61878 non-null int64
feat_62    61878 non-null int64
feat_63    61878 non-null int64
feat_64    61878 non-null int64
feat_65    61878 non-null int64
feat_66    61878 non-null int64
feat_67    61878 non-null int64
feat_68    61878 non-null int64
feat_69    61878 non-null int64
feat_70    61878 non-null int64
feat_71    61878 non-null int64
feat_72    61878 non-null int64
feat_73    61878 non-null int64
feat_74    61878 non-null int64
feat_75    61878 non-null int64
feat_76    61878 non-null int64
feat_77    61878 non-null int64
feat_78    61878 non-null int64
feat_79    61878 non-null int64
feat_80    61878 non-null int64
feat_81    61878 non-null int64
feat_82    61878 non-null int64
feat_83    61878 non-null int64
feat_84    61878 non-null int64
feat_85    61878 non-null int64
feat_86    61878 non-null int64
feat_87    61878 non-null int64
feat_88    61878 non-null int64
feat_89    61878 non-null int64
feat_90    61878 non-null int64
feat_91    61878 non-null int64
feat_92    61878 non-null int64
feat_93    61878 non-null int64
target     61878 non-null object
dtypes: int64(93), object(1)

The output above suggests that each feature column is a 64-bit integer. Is this what we were expecting? I would typically add explicit checks for such assumptions using assertions, as below, or a series of if-statements. Note that there may be other attribute characteristics worth checking that aren't mentioned here, e.g. do the features contain strange, unexpected values?


In [43]:
# Explicitly check that every feature column is a 64-bit integer
assert (df.drop("target", axis=1).dtypes == numpy.int64).all()
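
Following the suggestion above, here is one more hedged check, under the assumption (mine, not stated by the challenge) that the anonymised features are non-negative counts:


In [ ]:
# Assumption: the anonymised Otto features are non-negative counts, so any
# negative value would be a strange, unexpected value worth investigating.
features = df.drop("target", axis=1)
assert (features >= 0).all().all(), "unexpected negative feature values"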

Check the number of target labels.


In [44]:
len(df.target.unique())


Out[44]:
9

Therefore, this is a multi-class classification problem.


In [47]:
df.target.unique()


Out[47]:
array(['Class_1', 'Class_2', 'Class_3', 'Class_4', 'Class_5', 'Class_6',
       'Class_7', 'Class_8', 'Class_9'], dtype=object)
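
It is also worth checking how balanced the classes are; the class_weight="auto" option used later re-weights examples inversely to class frequency, which matters most when the classes are skewed:


In [ ]:
# Count how many training examples fall into each class.
df.target.value_counts()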

Convert the dataset into numpy arrays so that it can be processed by the scikit-learn classifiers. (The sklearn-pandas package improves the interoperability of pandas dataframes with sklearn, if you'd rather avoid the manual conversion.) Note the differences between pandas dataframes and numpy matrices/arrays, illustrated below.


In [9]:
X, y = df.drop("target", axis=1).values, df.target.values

View the converted dataset


In [10]:
X


Out[10]:
array([[ 1,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0],
       ..., 
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 1,  0,  0, ...,  3, 10,  0],
       [ 0,  0,  0, ...,  0,  2,  0]])

In [11]:
y


Out[11]:
array(['Class_1', 'Class_1', 'Class_1', ..., 'Class_9', 'Class_9',
       'Class_9'], dtype=object)
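
To make the dataframe/array distinction concrete: the numpy array keeps the values but drops the index and column labels.


In [ ]:
# The dataframe carries an index and column labels; the numpy array does not.
df.shape, X.shape  # (61878, 94) including the target column vs. (61878, 93) features only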

Model Training

Instantiate the candidate classifiers. I've selected only two to keep the training time down.


In [59]:
classifiers = [
    LogisticRegression(class_weight="auto"),
    RandomForestClassifier(n_estimators=100)
]

Train each classifier and estimate its generalisation error using the mean log-loss over k-fold cross-validation, storing each result in a list.

NOTE: cross_val_score returns a negative value for scoring functions in which "lower is better" (see GitHub), therefore the sign must be flipped.

NOTE: this is a crude approach to model selection and fitting, used here for brevity and demonstration; more robust methodologies would be used in production. A separate session dedicated to classifier training and evaluation is perhaps in order.
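
To make the sign convention concrete, here is a hedged sketch that scores a single held-out split with log_loss directly (train_test_split lives in sklearn.cross_validation in this era of scikit-learn; the 25% split size and the seed are arbitrary choices):


In [ ]:
from sklearn.cross_validation import train_test_split

# Hold out 25% of the training data and score it with log_loss directly;
# unlike cross_val_score's "log_loss" scorer, no sign flip is involved here.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
holdout_clf = LogisticRegression(class_weight="auto").fit(X_tr, y_tr)
log_loss(y_te, holdout_clf.predict_proba(X_te))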


In [60]:
k = 3 # no. of folds
RANDOM_SEED = 0 # any fixed value will do; defined here so the folds are reproducible
results_rows = []
for clf in classifiers:
    cv = StratifiedKFold(y, n_folds=k, random_state=RANDOM_SEED)
    cv_score = -cross_val_score(clf, X, y, cv=cv, scoring="log_loss")
    results_rows.append((clf.__class__.__name__, cv_score.mean(), cv_score.std()))

Store the results in a dataframe and sort by the log-loss in ascending order.


In [61]:
results_df = pandas.DataFrame(results_rows, columns=["clf", "mean", "SD"])
results_df.sort("mean", ascending=True, inplace=True)
results_df


Out[61]:
clf mean SD
1 RandomForestClassifier 0.621202 0.011696
0 LogisticRegression 0.721064 0.002514

Classify the test data

Load the test data into a pandas dataframe. This would also be a good time to check the correctness of the test data, as per the checks used for the training data (a sketch follows the preview below).


In [75]:
test_df = pandas.read_csv("test.csv", index_col=0)
test_df.head()


Out[75]:
feat_1 feat_2 feat_3 feat_4 feat_5 feat_6 feat_7 feat_8 feat_9 feat_10 ... feat_84 feat_85 feat_86 feat_87 feat_88 feat_89 feat_90 feat_91 feat_92 feat_93
id
1 0 0 0 0 0 0 0 0 0 3 ... 0 0 11 1 20 0 0 0 0 0
2 2 2 14 16 0 0 0 0 0 0 ... 0 0 0 0 0 4 0 0 2 0
3 0 1 12 1 0 0 0 0 0 0 ... 0 0 0 0 2 0 0 0 0 1
4 0 0 0 1 0 0 0 0 0 0 ... 0 3 1 0 0 0 0 0 0 0
5 1 0 0 1 0 0 1 2 0 3 ... 0 0 0 0 0 0 0 9 0 0

5 rows × 93 columns
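
As mentioned above, the training-data checks can be mirrored on the test set. A minimal sketch:


In [ ]:
# Mirror the earlier training-data checks on the test set.
assert (test_df.dtypes == numpy.int64).all()
assert list(test_df.columns) == list(df.columns.drop("target"))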

Train the RandomForest classifier on the entire training dataset.


In [66]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)


Out[66]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)

Predict the test data labels.


In [69]:
y_pred = clf.predict(test_df.values)
y_pred


Out[69]:
array(['Class_4', 'Class_6', 'Class_6', ..., 'Class_3', 'Class_2',
       'Class_2'], dtype=object)

However, the submissions for this challenge are required to be in the following CSV format:

id,Class_1,Class_2,Class_3,Class_4,Class_5,Class_6,Class_7,Class_8,Class_9
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.2,0.3,0.3,0.0,0.0,0.1,0.1,0.0
...
etc.

where each non-id value indicates the predicted probability for the corresponding class. To obtain this, use the predict_proba method for the selected classifier.


In [70]:
y_prob = clf.predict_proba(test_df.values)
y_prob


Out[70]:
array([[ 0.03,  0.14,  0.26, ...,  0.03,  0.01,  0.01],
       [ 0.01,  0.07,  0.01, ...,  0.02,  0.32,  0.03],
       [ 0.  ,  0.01,  0.  , ...,  0.  ,  0.01,  0.  ],
       ..., 
       [ 0.  ,  0.37,  0.4 , ...,  0.04,  0.01,  0.02],
       [ 0.  ,  0.55,  0.13, ...,  0.  ,  0.  ,  0.01],
       [ 0.  ,  0.46,  0.41, ...,  0.04,  0.  ,  0.  ]])

Each column of y_prob is ordered according to the following class labels:


In [85]:
clf.classes_


Out[85]:
array(['Class_1', 'Class_2', 'Class_3', 'Class_4', 'Class_5', 'Class_6',
       'Class_7', 'Class_8', 'Class_9'], dtype=object)
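
A final sanity check before writing the submission: each row of predicted probabilities should sum to (approximately) one.


In [ ]:
# Every row of class probabilities should sum to ~1.
assert numpy.allclose(y_prob.sum(axis=1), 1.0)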

Write the predicted probabilities to submission.csv in the submission format using a dataframe.


In [89]:
pandas.DataFrame(y_prob, columns=clf.classes_, index=test_df.index).to_csv("submission.csv")

Re-inspect the submission


In [90]:
pandas.read_csv("submission.csv", index_col=0).head()


Out[90]:
Class_1 Class_2 Class_3 Class_4 Class_5 Class_6 Class_7 Class_8 Class_9
id
1 0.03 0.14 0.26 0.47 0.00 0.05 0.03 0.01 0.01
2 0.01 0.07 0.01 0.01 0.01 0.52 0.02 0.32 0.03
3 0.00 0.01 0.00 0.01 0.00 0.97 0.00 0.01 0.00
4 0.01 0.56 0.28 0.08 0.00 0.00 0.01 0.00 0.06
5 0.08 0.01 0.00 0.00 0.00 0.08 0.05 0.24 0.54