The time has arrived, the time to learn. Will we succeed? Let's find out.
In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'
In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import cm as cmap
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
sns.set(font='sans')
When I needed to select a training algorithm, I didn't know which of the ones I had studied would work best. So I asked my thesis supervisor, Fernando Sancho Ph.D., and his recommendation, based on his experience with related projects, was to use random forests.
In the feature_columns variable, shown in the next piece of code, we can see which attributes are going to be used for prediction.
In [3]:
labelize_columns = ['medallion', 'hack_license', 'vendor_id']
interize_columns = ['pickup_month', 'pickup_weekday', 'pickup_non_working_today', 'pickup_non_working_tomorrow']
feature_columns = ['medallion', 'hack_license', 'vendor_id', 'pickup_month', 'pickup_weekday', 'pickup_day',
'pickup_time_in_mins', 'pickup_non_working_today', 'pickup_non_working_tomorrow', 'fare_amount',
'surcharge', 'tolls_amount', 'passenger_count', 'trip_time_in_secs', 'trip_distance', 'pickup_longitude',
'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']
class_column = 'tip_label'
In [4]:
data = pd.read_csv('../data/dataset/dataset.csv')
Before starting the training, we need to transform the non-numeric attributes into numeric ones, so that they can be used with scikit-learn.
In [5]:
for column in labelize_columns:
    real_column = data[column].values
    le = LabelEncoder()
    le.fit(real_column)
    labelized_column = le.transform(real_column)
    data[column] = labelized_column
    le = None
    real_column = None
    labelized_column = None
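As a quick sanity check on what the loop above does: LabelEncoder maps each distinct string to an integer, sorted alphabetically. A minimal sketch with made-up medallion IDs (hypothetical values, not from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Toy example with made-up medallion IDs.
ids = ['9A11', '2B45', '9A11', '7C03']

le = LabelEncoder()
encoded = le.fit_transform(ids)

# Classes are sorted alphabetically: '2B45' -> 0, '7C03' -> 1, '9A11' -> 2.
print(list(encoded))      # [2, 0, 2, 1]
print(list(le.classes_))  # ['2B45', '7C03', '9A11']
```

Note that the encoded integers carry no real order; random forests tolerate this reasonably well, but it is something to keep in mind.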
In [6]:
for column in interize_columns:
    data[column] = data[column].astype(int)
Let's start the training! We are going to train a random forest model with 256 trees, evaluated over 10 iterations of a stratified shuffle split that holds out 10% of the data each time.
In [7]:
data_features = data[feature_columns].values
data_classes = data[class_column].values
In [8]:
cross_validation = StratifiedShuffleSplit(data_classes, n_iter=10, test_size=0.1, random_state=0)
scores = []
confusion_matrices = []
for train_index, test_index in cross_validation:
    data_features_train, data_classes_train = data_features[train_index], data_classes[train_index]
    data_features_test, data_classes_test = data_features[test_index], data_classes[test_index]
    '''
    You need at least 16GB of RAM for predicting 6 classes with 256 trees.
    Of course, you can use a lower number of trees, but you'll gradually notice worse results.
    '''
    clf = RandomForestClassifier(n_estimators=256, n_jobs=-1)
    clf.fit(data_features_train, data_classes_train)
    # Save the score.
    test_score = clf.score(data_features_test, data_classes_test)
    scores.append(test_score)
    # Save the confusion matrix.
    data_classes_pred = clf.predict(data_features_test)
    cm = confusion_matrix(data_classes_test, data_classes_pred)
    confusion_matrices.append(cm)
    clf = None
print 'Accuracy mean: ' + str(np.mean(scores))
print 'Accuracy std: ' + str(np.std(scores))
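As an aside: the `sklearn.cross_validation` module used above was later deprecated, and in modern scikit-learn the same splitter lives in `sklearn.model_selection` with a slightly different interface (`n_iter` became `n_splits`, and the labels are passed to `split()` rather than the constructor). A minimal sketch on synthetic data, assuming a recent scikit-learn version:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic imbalanced two-class problem: 80 zeros, 20 ones.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# n_iter became n_splits; labels go to split() instead of the constructor.
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)
for train_index, test_index in sss.split(X, y):
    # Each 10-element test fold preserves the 80/20 class ratio: exactly 2 ones.
    assert len(test_index) == 10
    assert (y[test_index] == 1).sum() == 2
```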
A prediction accuracy of 52.95%. What happened?
As I am not a machine learning expert, I'm not 100% sure what caused this bad result. It is a sign that I still have more machine learning theory to study — something I'm willing to do, spoiler, especially after the results we will obtain in the next notebook.
To try to find the reason for the poor accuracy, let's use another measurement tool: a confusion matrix.
In [9]:
classes = [' ', '[0-10)', '[10-15)', '[15-20)', '[20-25)', '[25-30)', '[30-inf)']
# Sum the confusion matrices from all the cross-validation iterations.
cm = np.sum(confusion_matrices, axis=0)
fig, axes = plt.subplots()
colorbar = axes.matshow(cm, cmap=cmap.Blues)
fig.colorbar(colorbar, ticks=[0, 25000, 50000, 75000, 100000, 125000, 150000, 175000, 200000, 225000, 250000])
axes.set_xlabel('Predicted class', fontsize=15)
axes.set_ylabel('True class', fontsize=15)
axes.set_xticklabels(classes)
axes.set_yticklabels(classes)
axes.tick_params(labelsize=12)
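To make the "everything lands in two classes" pattern easier to quantify, the summed confusion matrix can be row-normalized so each row shows the fraction of a true class assigned to each predicted class. A minimal sketch on a made-up 3×3 matrix (hypothetical counts, not the real results above):

```python
import numpy as np

# Hypothetical summed confusion matrix for 3 classes.
cm = np.array([[50, 40, 10],
               [30, 60, 10],
               [20, 70, 10]])

# Divide each row by its total: entries become per-true-class fractions.
cm_normalized = cm.astype(float) / cm.sum(axis=1, keepdims=True)
print(cm_normalized[0])  # [0.5 0.4 0.1]
```

Rows that concentrate their mass in the same one or two columns are exactly the collapse pattern visible in the plot.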
This is pretty strange. It looks like nearly all the predictions land in just two of the classes! Let's check how the tip is distributed in the dataset.
In [10]:
tip = data.groupby('tip_perc').size()
tip.index = np.floor(tip.index)
ax = tip.groupby(tip.index).sum().plot(kind='bar', figsize=(15, 5))
ax.set_xlabel('floor(tip_perc)', fontsize=18)
ax.set_ylabel('number of trips', fontsize=18)
ax.tick_params(labelsize=12)
tip = None
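The floor-and-regroup step above collapses fractional tip percentages into integer bins before plotting. A minimal sketch with made-up tip percentages and trip counts:

```python
import numpy as np
import pandas as pd

# Made-up data: trip counts indexed by fractional tip percentage.
tip = pd.Series([3, 5, 2, 7], index=[19.2, 19.8, 20.0, 20.5])

# Flooring the index merges 19.2 and 19.8 into bin 19, and 20.0 and 20.5 into bin 20.
tip.index = np.floor(tip.index)
binned = tip.groupby(tip.index).sum()
print(binned.to_dict())  # {19.0: 8, 20.0: 9}
```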
Looking at the previous figure, we can say that the social norm is to tip 20% of the fare. Perhaps that is the real question: whether a tip will be above or below that norm.
To answer it, let's replace the classes with only two:
$$ \text{``}< 20\text{''} \quad \text{and} \quad \text{``}\geq 20\text{''} $$
In [11]:
tip_labels = ['< 20', '>= 20']
tip_ranges_by_label = [[0.0, 20.0], [20.0, 51.0]]
for i, tip_label in enumerate(tip_labels):
    tip_mask = ((data.tip_perc >= tip_ranges_by_label[i][0]) & (data.tip_perc < tip_ranges_by_label[i][1]))
    # Use .loc to avoid assigning through a chained-indexing copy.
    data.loc[tip_mask, 'tip_label'] = tip_label
tip_mask = None
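As an aside, the same two-way relabelling can be written with pandas.cut, which handles the interval edges in a single call. A sketch with hypothetical tip percentages, assuming the same [0, 20) / [20, 51) semantics as above:

```python
import pandas as pd

# Hypothetical tip percentages.
df = pd.DataFrame({'tip_perc': [5.0, 19.9, 20.0, 35.0]})

# right=False makes the bins [0, 20) and [20, 51), matching the masks above.
df['tip_label'] = pd.cut(df['tip_perc'], bins=[0.0, 20.0, 51.0],
                         labels=['< 20', '>= 20'], right=False)
print(list(df['tip_label']))  # ['< 20', '< 20', '>= 20', '>= 20']
```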
In [12]:
data.to_csv('../data/dataset/dataset.csv', index=False)
Will this change work? Let's find it out in the next notebook.