My first quick attempt at the Forest Cover Type Prediction Kaggle competition, one of the recommended starter projects I'm working through as part of my ML Study Curriculum. The goal is to preprocess the data, explore the performance of a few algorithms, and make some initial submissions.
Let's load the labeled training dataset into a pandas DataFrame and take a gander. The dataset section of the competition also summarizes the variables in the dataset.
In [1]:
import pandas as pd
labeled_df = pd.read_csv('train.csv')
In [2]:
labeled_df.head()
Out[2]:
In [3]:
print("Variables with missing rows:")
labeled_df.isnull().sum()
Out[3]:
It looks like our job is easier than with the Titanic competition: the categorical variables are already one-hot encoded, and there are no missing values.
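The one-hot claim can be checked rather than eyeballed. Here is a minimal sketch, assuming each row should have exactly one active column per group; `one_hot_group_is_valid` and `toy_df` are hypothetical names made up for illustration, and the toy frame only stands in for `labeled_df`.

```python
import pandas as pd

def one_hot_group_is_valid(df, prefix):
    """True if the columns starting with `prefix` hold only 0/1 and sum to 1 per row."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    block = df[cols]
    only_binary = block.isin([0, 1]).all().all()
    one_per_row = (block.sum(axis=1) == 1).all()
    return bool(only_binary and one_per_row)

# tiny stand-in for labeled_df (values are illustrative, not taken from the dataset)
toy_df = pd.DataFrame({
    'Elevation': [2596, 2804, 2785],
    'Wilderness_Area1': [1, 0, 0],
    'Wilderness_Area2': [0, 1, 0],
    'Wilderness_Area3': [0, 0, 1],
})
toy_ok = one_hot_group_is_valid(toy_df, 'Wilderness_Area')
```

On the real data the same call would be made with `labeled_df` and the prefixes `'Wilderness_Area'` and `'Soil_Type'`.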
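We still need to take care of scaling: standardizing the quantitative variables and recoding the one-hot columns from 0/1 to -1/1.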
In [4]:
# %load preprocess.py
from sklearn.preprocessing import StandardScaler
import functools
import operator

def make_preprocessor(td, column_summary):
    # Fit the scaler once on the reference data `td` and keep ahold of it, so
    # every dataset we preprocess later is scaled with the same parameters.
    stdsc = StandardScaler()
    stdsc.fit(td[column_summary['quantitative']])

    def scale_q(df, column_summary):
        # standardize the quantitative columns (note: assigns into df in place)
        df[column_summary['quantitative']] = stdsc.transform(df[column_summary['quantitative']])
        return df, column_summary

    def scale_binary_c(df, column_summary):
        # rebuild the one-hot column names, e.g. 'Soil_Type' and 7 -> 'Soil_Type7'
        binary_cs = [['{}{}'.format(c, v) for v in vs] for c, vs in column_summary['categorical'].items()]
        all_binary_cs = functools.reduce(operator.add, binary_cs)
        # recode 0/1 indicators as -1/1: where() keeps the 1s and replaces the rest
        df[all_binary_cs] = df[all_binary_cs].where(df[all_binary_cs] == 1, -1)
        return df, column_summary

    def preprocess(df):
        fns = [scale_q, scale_binary_c]
        cs = column_summary
        for fn in fns:
            df, cs = fn(df, cs)
        return df

    return preprocess, column_summary
In [5]:
preprocess, column_summary = make_preprocessor(
    labeled_df,
    {
        'quantitative': ['Elevation', 'Aspect', 'Slope',
                         'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology',
                         'Horizontal_Distance_To_Roadways',
                         'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
                         'Horizontal_Distance_To_Fire_Points'],
        'categorical': {
            'Wilderness_Area': [1, 2, 3, 4],
            'Soil_Type': list(range(1, 41))
        },
        'ordinal': {
        }
    }
)
In [6]:
labeled_df_wrangled = preprocess(labeled_df)
labeled_df_wrangled.head()
Out[6]:
In [7]:
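def extract_X_y(df):
    # features are every column except the first (Id) and the last (Cover_Type);
    # the last column is the label
    return df.iloc[:, 1:-1].values, df.iloc[:, -1].values

X, y = extract_X_y(labeled_df_wrangled)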
In [10]:
import numpy as np
from sklearn.decomposition import PCA

pca_explore = PCA()
X_train_pca = pca_explore.fit_transform(X)
pca_explore.explained_variance_ratio_
# report cumulative explained variance until we cross 95%
for i, cum_var in enumerate(np.cumsum(pca_explore.explained_variance_ratio_)):
    print("with {}/{} dimensions we preserve {:.2f} of variance".format(i + 1, X.shape[1], cum_var))
    if cum_var >= .95:
        break
In [11]:
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

plt.bar(
    range(1, 26),
    pca_explore.explained_variance_ratio_[:25],
    alpha=0.5, align='center',
    label='individual explained variance')
plt.step(
    range(1, 26),
    np.cumsum(pca_explore.explained_variance_ratio_[:25]),
    where='mid', label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc='center right')
plt.tight_layout()
plt.show()
Let's compare the performance of a few off-the-shelf models. We split the labeled data 70/30 into training and test sets so we can see not only how well each model fits the data, but also how well it generalizes to an unseen portion.
For each model, let's also try it on a PCA-reduced dataset. We showed we can retain 95% of the variance of our dataset with the first 24 principal components, so it will be interesting to see whether we retain model performance while saving on fit and predict time.
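The component count for a given variance target can also be computed directly instead of read off a printout. A minimal NumPy sketch of that calculation follows; `components_for_variance` and `toy_X` are hypothetical names, and the toy matrix is made up so the answer is easy to verify by hand.

```python
import numpy as np

def components_for_variance(X, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches `threshold`."""
    X_centered = X - X.mean(axis=0)
    # squared singular values of the centered data are proportional to
    # the variance captured by each principal component
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratio)
    # index of the first entry >= threshold, converted to a 1-based count
    n = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n, len(cumulative))

# toy data: the first axis carries roughly 99% of the variance
toy_X = np.array([[10.0, 0.0], [-10.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
n_95 = components_for_variance(toy_X, 0.95)
n_999 = components_for_variance(toy_X, 0.999)
```

scikit-learn's `PCA` can also take a fraction such as `n_components=0.95` (with the full SVD solver) and pick the count itself, which avoids hard-coding a number like 24.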
In [12]:
# sklearn.cross_validation was removed from scikit-learn; train_test_split
# now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split

labeled_df_train, labeled_df_test = train_test_split(labeled_df, test_size=0.3, random_state=0)
labeled_df_train, labeled_df_test = labeled_df_train.copy(), labeled_df_test.copy()
(labeled_df_train.shape, labeled_df_test.shape)
Out[12]:
We need a preprocess function that is fit to just the training split, so that when we evaluate model performance on the unseen test split, the test data is transformed with the parameters learned from the training data and nothing leaks in from the test split.
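To make the fit/transform distinction concrete, here is a small sketch in plain NumPy of what fitting on the training split and reusing those parameters amounts to. `fit_standardizer` is a hypothetical helper written for illustration (the notebook itself relies on `StandardScaler`), and the numbers are made up.

```python
import numpy as np

def fit_standardizer(X_train):
    """Learn per-column mean and standard deviation from the training split only
    and return a function that applies them to any other split."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero

    def transform(X):
        return (X - mean) / std

    return transform

train = np.array([[2000.0], [3000.0]])   # e.g. two elevations in the training split
test = np.array([[3500.0]])              # an unseen elevation
transform = fit_standardizer(train)
train_scaled = transform(train)          # centered on the training mean
test_scaled = transform(test)            # expressed in training-split units
```

The test value lands outside the training range after scaling, which is expected: the test split is described in terms of the training distribution rather than its own.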
In [13]:
preprocess_for_model_evaluation, _ = make_preprocessor(
    labeled_df_train,
    {
        'quantitative': ['Elevation', 'Aspect', 'Slope',
                         'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology',
                         'Horizontal_Distance_To_Roadways',
                         'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
                         'Horizontal_Distance_To_Fire_Points'],
        'categorical': {
            'Wilderness_Area': [1, 2, 3, 4],
            'Soil_Type': list(range(1, 41))
        },
        'ordinal': {
        }
    }
)
labeled_df_train_wrangled = preprocess_for_model_evaluation(labeled_df_train)
labeled_df_test_wrangled = preprocess_for_model_evaluation(labeled_df_test)
labeled_df_train_wrangled.head()
Out[13]:
In [14]:
X_train, y_train = extract_X_y(labeled_df_train_wrangled)
X_test, y_test = extract_X_y(labeled_df_test_wrangled)

# fit PCA on the training split only, then project both splits
pca_evaluate = PCA(n_components=24)
pca_evaluate.fit(X_train)
X_train_pca = pca_evaluate.transform(X_train)
X_test_pca = pca_evaluate.transform(X_test)
(X_train.shape, X_test.shape)
Out[14]:
In [15]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def labeled_models():
    return [
        ("logistic_regression", LogisticRegression(C=100.0, random_state=0)),
        ("decision_tree6", DecisionTreeClassifier(criterion='entropy', max_depth=6, random_state=0)),
        ("random_forest10", RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=1, n_jobs=2)),
        ("linear_svm", SVC(kernel='linear', C=1.0, random_state=0)),
        ("kernel_svm", SVC(kernel='rbf', random_state=0, gamma=0.10, C=10.0))
    ]
In [16]:
from sklearn.metrics import accuracy_score
import time

for label, model in labeled_models():
    for pca in [False, True]:
        if pca:
            X_train_eval = X_train_pca
            X_test_eval = X_test_pca
        else:
            X_train_eval = X_train
            X_test_eval = X_test
        # time.clock() was removed in Python 3.8; perf_counter() replaces it
        pre_fit = time.perf_counter()
        model.fit(X_train_eval, y_train)
        post_fit = time.perf_counter()
        fit_time = post_fit - pre_fit
        training_fit = accuracy_score(y_train, model.predict(X_train_eval))
        test_accuracy = accuracy_score(y_test, model.predict(X_test_eval))
        # predict time covers predictions on both the training and test splits
        predict_time = time.perf_counter() - post_fit
        print("{}{} training/test accuracy: {:.2f} | {:.2f} runtimes: {:.2f} (fit), {:.2f} (predict)".format(
            'PCA: ' if pca else '',
            label,
            training_fit,
            test_accuracy,
            fit_time,
            predict_time))
        if label.endswith('svm'):
            print(" num support vectors: {}".format(model.n_support_))
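The loop above times each model by reading a clock before and after the call. As a sketch of the same idea, a hypothetical helper `timed` (not part of the notebook) wraps any call using `time.perf_counter`, the standard-library timer that remains available now that `time.clock` is gone as of Python 3.8.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# stand-in workload; in the notebook this could be model.fit or model.predict
total, elapsed = timed(sum, range(1000000))
```

Timing `fit` and each `predict` call separately this way would also keep the training-set and test-set prediction times apart.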
Interesting to see that the linear models perform significantly worse. The non-linear models, while still beating the linear ones on the test split, appear to suffer from high variance: their test accuracy falls well short of their training accuracy.
Also: dang, it's a bummer that SVM prediction times are so slow. That could be a drawback of having a non-parametric model, where the prediction time is a function of the number of support vectors.
On a related note, this is the first case where PCA seems to come in handy: it cuts the kernel SVM's fit and prediction times without hurting test accuracy.
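To see why prediction cost tracks the number of support vectors, here is a rough NumPy sketch of the RBF decision function for the binary case: every query point is compared against every support vector. The function name and the toy values are hypothetical; the argument names merely mirror the fitted attributes `support_vectors_`, `dual_coef_` and `intercept_` of `SVC`, which applies this kind of sum per pair of classes in the multiclass setting.

```python
import numpy as np

def rbf_decision_function(X, support_vectors, dual_coef, intercept, gamma):
    """Binary RBF-kernel SVM score: sum_i dual_coef[i] * K(sv_i, x) + intercept."""
    # pairwise squared distances, shape (n_queries, n_support_vectors):
    # the work grows linearly with the number of support vectors
    diffs = X[:, np.newaxis, :] - support_vectors[np.newaxis, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    kernel = np.exp(-gamma * sq_dists)
    return kernel @ dual_coef + intercept

# made-up model with two support vectors of opposite sign
support_vectors = np.array([[0.0, 0.0], [2.0, 0.0]])
dual_coef = np.array([1.0, -1.0])
queries = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]])
scores = rbf_decision_function(queries, support_vectors, dual_coef, intercept=0.0, gamma=0.1)
```

The first two queries sit on the support vectors and get scores of opposite sign, while the midpoint scores zero. With thousands of support vectors, each prediction repeats that distance computation thousands of times, and reducing the feature dimension with PCA shrinks each one.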
In [17]:
unlabeled_df = pd.read_csv('test.csv')
unlabeled_df_preprocessed = preprocess(unlabeled_df)
unlabeled_df_preprocessed.head()
Out[17]:
In [18]:
X_submit = unlabeled_df_preprocessed.iloc[:, 1:].values
In [19]:
for label, model in labeled_models()[:-2]:
    model.fit(X, y)
    print("{} full training set accuracy: {:.2f}".format(
        label,
        accuracy_score(y, model.predict(X))))
    submission_df = unlabeled_df[['Id']].copy()
    submission_df['Cover_Type'] = model.predict(X_submit)
    fname = "forest-cover-type-submission-{}.csv".format(label)
    submission_df.to_csv(fname, index=False)
    print(" Saved {}".format(fname))
| alg | 70/30 training fit | 70/30 test accuracy | training time (s) | prediction time (s) | full training fit | full test accuracy (Kaggle submission) |
|---|---|---|---|---|---|---|
| Logistic Regression | 0.68 | 0.67 | 3.07 | 0.01 | 0.60 | 0.55999 |
| Decision Tree (depth 6) | 0.70 | 0.68 | 0.06 | 0.00 | 0.69 | 0.57956 |
| Random Forest (10 trees) | 0.99 | 0.82 | 0.23 | 0.04 | 0.99 | 0.71758 |
| Kernel SVM | 0.91 | 0.82 | 3.44 | 6.2 | 0.90 | 0.72143 |
| Kernel SVM on PCA-reduced data | 0.90 | 0.82 | 2.27 | 3.26 | | |
OK, not bad: we can see a clear performance improvement in tree-based methods over logistic regression.
The random forest's 99% training fit did not generalize to the test split, and accordingly its Kaggle submission on the larger test set scored only about 72%.
Tree based methods do not require scaling as the decision boundaries can be chosen anywhere within the range of values for each given variable.
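A toy illustration of that point: a single-split "stump" written in plain NumPy (a simplified stand-in for a real decision tree, with made-up data and hypothetical names) picks the same feature and the same partition of the rows whether or not the features are standardized, because standardizing preserves the ordering of values within each column.

```python
import numpy as np

def best_stump(X, y):
    """Return (feature, threshold, errors) for the single split with the
    fewest misclassifications when each side predicts its majority class."""
    best = None
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # candidate thresholds are midpoints between consecutive distinct values
        for t in (values[:-1] + values[1:]) / 2.0:
            left = X[:, j] <= t
            errors = 0
            for side in (left, ~left):
                _, counts = np.unique(y[side], return_counts=True)
                errors += side.sum() - counts.max()
            if best is None or errors < best[2]:
                best = (j, t, int(errors))
    return best

# made-up rows: column 0 plays the role of elevation, column 1 of a distance
X_raw = np.array([[1800.0, 10.0], [2100.0, 250.0], [2500.0, 30.0],
                  [2900.0, 200.0], [3200.0, 90.0], [3400.0, 310.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

raw_split = best_stump(X_raw, y_toy)
std_split = best_stump(X_std, y_toy)
# the thresholds differ numerically, but the rows sent left are identical
same_partition = np.array_equal(X_raw[:, raw_split[0]] <= raw_split[1],
                                X_std[:, std_split[0]] <= std_split[1])
```

A full tree repeats this search recursively, so the same argument applies at every node.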
So far, I still trained the tree based methods on a scaled dataset out of convenience, but now I'm curious whether the scaling might actually hurt performance.
Let's try again on an unscaled dataset.
In [20]:
# Caution: preprocess_for_model_evaluation() assigns into the frame it is given,
# so labeled_df_train and labeled_df_test were already scaled in place in In [13].
# For a truly unscaled comparison, split and copy the raw data before preprocessing.
X_train_unscaled, y_train_unscaled = extract_X_y(labeled_df_train)
X_test_unscaled, y_test_unscaled = extract_X_y(labeled_df_test)
for label, model in [
        ("decision_tree6", DecisionTreeClassifier(criterion='entropy', max_depth=6, random_state=0)),
        ("random_forest10", RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=1, n_jobs=2)),]:
    model.fit(X_train_unscaled, y_train_unscaled)
    print("{} training/test accuracy: {:.2f} | {:.2f}".format(
        label,
        accuracy_score(y_train_unscaled, model.predict(X_train_unscaled)),
        accuracy_score(y_test_unscaled, model.predict(X_test_unscaled))))
There doesn't appear to be any difference locally. We can still submit to Kaggle to double check.
In [21]:
X_unscaled, y_unscaled = extract_X_y(labeled_df)
X_submit_unscaled = unlabeled_df.iloc[:, 1:].values
for label, model in [
        ("decision_tree6", DecisionTreeClassifier(criterion='entropy', max_depth=6, random_state=0)),
        ("random_forest10", RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=1, n_jobs=2)),]:
    model.fit(X_unscaled, y_unscaled)
    print("{} full training set accuracy: {:.2f}".format(
        label,
        accuracy_score(y_unscaled, model.predict(X_unscaled))))
    submission_df = unlabeled_df[['Id']].copy()
    submission_df['Cover_Type'] = model.predict(X_submit_unscaled)
    fname = "forest-cover-type-submission-unscaled-{}.csv".format(label)
    submission_df.to_csv(fname, index=False)
    print("saved {}".format(fname))
It turns out this gets the exact same submission score (could have diffed the submission files locally to confirm first, whoops).
So it's good to know that, at least in this case, there is no need to prepare a separate unscaled dataset for training and predicting with tree-based methods; we can use the same scaled dataset across all models.