This session covered some basic operations in pandas and scikit-learn that can be used to produce a rudimentary Kaggle entry for the Otto Group product classification problem. The session was highly interactive, with plenty of questions and discussion about the pros and cons of the machine learning design decisions in this walkthrough.
Obligatory housekeeping import(s)
In [ ]:
import pandas
import numpy
Import the classifiers to evaluate
In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
Import the log-loss metric (as used by Kaggle for this competition) and the cross-validation utilities for classifier evaluation. Note that in scikit-learn 0.18 and later these cross-validation utilities live in sklearn.model_selection rather than sklearn.cross_validation.
In [91]:
from sklearn.metrics import log_loss
from sklearn.cross_validation import cross_val_score, StratifiedKFold
Load the Otto training dataset into a dataframe
In [5]:
df = pandas.read_csv("train.csv", index_col=0)
Inspect the first 10 rows of the dataset
In [41]:
df.head(10)
Out[41]:
Inspect the last 10 rows of the dataset
In [42]:
df.tail(10)
Out[42]:
Perform a more detailed visual inspection of the attribute values, for example:
In [7]:
df.info()
The output above suggests that each column/feature is a 64-bit integer. Is this what we were expecting? I would typically add explicit checks for such assumptions using assertions, as below, or a series of if-statements. Note that there may be other attribute characteristics worth checking that aren't covered here, e.g. do the features contain strange, unexpected values? (A sketch of some further checks follows the next cell.)
In [43]:
# Explicitly check that every feature column is a 64-bit integer;
# comparing the dtypes directly avoids an ambiguous-truth-value error
# if more than one dtype were present
assert (df.drop("target", axis=1).dtypes == numpy.int64).all()
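For example, here is a minimal sketch of some further sanity checks. The specific checks are assumptions about the Otto data (whose features are event counts) rather than part of the original session:
In [ ]:
features = df.drop("target", axis=1)

# No missing values anywhere in the dataset
assert not df.isnull().values.any()

# The Otto features are event counts, so they should be non-negative
assert (features >= 0).values.all()

# Summary statistics for a quick eyeball of ranges and outliers
features.describe()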
Check the number of target labels.
In [44]:
len(df.target.unique())
Out[44]:
There are nine distinct target labels, so this is a multi-class classification problem.
In [47]:
df.target.unique()
Out[47]:
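It is also worth checking how the examples are distributed across these labels, since class imbalance motivates choices made later, such as class_weight and stratified cross-validation. A quick, optional way to do this:
In [ ]:
# Number of training examples per class
df.target.value_counts()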
Convert the dataset into numpy arrays so that it can be processed by the scikit-learn classifiers. (The sklearn-pandas package improves the interoperability of pandas dataframes with sklearn; a brief sketch follows below.) Note the differences between pandas dataframes and numpy matrices/arrays.
In [9]:
X, y = df.drop("target", axis=1).values, df.target.values
View the converted dataset
In [10]:
X
Out[10]:
In [11]:
y
Out[11]:
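As an aside, a minimal sketch of the sklearn-pandas route might look like the following. The DataFrameMapper usage below is illustrative only (it assumes the sklearn-pandas package is installed) and isn't required for the rest of this walkthrough:
In [ ]:
from sklearn_pandas import DataFrameMapper

# Pass each feature column through unchanged; a transformer such as
# sklearn.preprocessing.StandardScaler could be supplied in place of None
feature_cols = [col for col in df.columns if col != "target"]
mapper = DataFrameMapper([(feature_cols, None)])
X_mapped = mapper.fit_transform(df)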
Instantiate the candidate classifiers. I've selected only two to keep the training time down.
In [59]:
classifiers = [
    LogisticRegression(class_weight="auto"),
    RandomForestClassifier(n_estimators=100)
]
Train and evaluate the generalisation error of each classifier using the mean log-loss over k-fold cross-validation, storing each result in a list.
NOTE: cross_val_score returns a negative value for scoring functions where "lower is better" (see the discussion on GitHub), so the sign must be flipped.
NOTE: this is a crude approach to model selection and fitting, used here for brevity and demonstrative purposes; more robust methodologies are used in production (a grid-search sketch follows the results table below). A separate session dedicated to classifier training and evaluation is perhaps in order.
In [60]:
RANDOM_SEED = 42  # arbitrary seed for reproducibility; the original cell referenced this name without defining it
k = 3  # no. of folds
results_rows = []
for clf in classifiers:
    # shuffle=True is required for random_state to take effect
    cv = StratifiedKFold(y, n_folds=k, shuffle=True, random_state=RANDOM_SEED)
    cv_score = -cross_val_score(clf, X, y, cv=cv, scoring="log_loss")
    results_rows.append((clf.__class__.__name__, cv_score.mean(), cv_score.std()))
Store the results in a dataframe and sort by the log-loss in ascending order.
In [61]:
results_df = pandas.DataFrame(results_rows, columns=["clf", "mean", "SD"])
results_df.sort_values("mean", ascending=True, inplace=True)
results_df
Out[61]:
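As a taste of a more systematic approach, a grid search over hyperparameters could replace the manual loop above. The following is a minimal sketch only; the parameter grid is an arbitrary assumption for illustration, not a recommendation:
In [ ]:
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer releases

# Deliberately tiny, assumed example grid; a real search would be broader
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(), param_grid,
                      scoring="log_loss", cv=k)
search.fit(X, y)
search.best_params_, search.best_score_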
Load the test data into a pandas dataframe. This would also be a good time to check the correctness of the test data, as per the checks used for the training data.
In [75]:
test_df = pandas.read_csv("test.csv", index_col=0)
test_df.head()
Out[75]:
Train the RandomForest classifier on the entire training dataset.
In [66]:
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)
Out[66]:
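As an optional aside (not part of the original session), a fitted random forest exposes impurity-based feature importances, which can hint at which features carry the most signal:
In [ ]:
# Rank the ten most important features according to the fitted forest
feature_cols = df.drop("target", axis=1).columns
pandas.Series(clf.feature_importances_, index=feature_cols).sort_values(ascending=False).head(10)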
Predict the test data labels.
In [69]:
y_pred = clf.predict(test_df.values)
y_pred
Out[69]:
However, the submissions for this challenge are required to be in the following CSV format:
id,Class_1,Class_2,Class_3,Class_4,Class_5,Class_6,Class_7,Class_8,Class_9
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.2,0.3,0.3,0.0,0.0,0.1,0.1,0.0
...
etc.
where each non-id value indicates the predicted probability of the corresponding class. To obtain these probabilities, use the predict_proba method of the selected classifier.
In [70]:
y_prob = clf.predict_proba(test_df.values)
y_prob
Out[70]:
Each column of y_prob is ordered according to the following class labels:
In [85]:
clf.classes_
Out[85]:
Write the predicted probabilities to submission.csv in the submission format using a dataframe.
In [89]:
pandas.DataFrame(y_prob, columns=clf.classes_, index=test_df.index).to_csv("submission.csv")
Re-inspect the submission.
In [90]:
pandas.read_csv("submission.csv", index_col=0).head()
Out[90]:
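As a final optional sanity check (not performed in the original session), verify that each row of predicted probabilities sums to one and that there is one row per test example:
In [ ]:
# Each row of class probabilities should sum to (approximately) 1
assert numpy.allclose(y_prob.sum(axis=1), 1.0)

# One row per test example, one column per class
assert y_prob.shape == (len(test_df), len(clf.classes_))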