This notebook uses various models from scikit-learn to produce a solution for the Kaggle Titanic problem. It's worth noting that none of these methods produced a particularly good solution, with accuracy rates generally in the low 70% range. The best solution I had used a neural network.
In [2]:
from __future__ import print_function
from __future__ import division
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, RandomTreesEmbedding, BaggingClassifier
from sklearn.svm import SVC, LinearSVC, LinearSVR, NuSVC, NuSVR
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, ExtraTreeClassifier, ExtraTreeRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
import sys
The method modify_data below does some feature engineering on a pandas DataFrame. We create a new DataFrame keeping only the features we want to build our decision models from. I played around with this for a while, and this set of features just represents the place I stopped, not necessarily what worked best.
In [5]:
def modify_data(base_df):
    new_df = pd.DataFrame()
    new_df['Gender'] = base_df.Sex.map(lambda x: 1 if x.lower() == 'female' else 0)
    # Median fare per passenger class, used to fill missing fares.
    fares_by_class = base_df.groupby('Pclass').Fare.median()
    def getFare(example):
        if pd.isnull(example['Fare']):
            example['Fare'] = fares_by_class[example['Pclass']]
        return example
    new_df['Fare'] = base_df.apply(getFare, axis=1)['Fare']
    new_df['Family'] = (base_df.Parch + base_df.SibSp) > 0
    new_df['Family'] = new_df['Family'].map(lambda x: 1 if x else 0)
    new_df['GenderFam'] = new_df['Gender'] + new_df['Family']
    # Titles like Mr., Mrs., and Dr. sit between the comma and the period
    # in names such as "Braund, Mr. Owen Harris".
    new_df['Title'] = base_df.Name.map(lambda x: x.split(',')[1].split('.')[0].strip())
    new_df['Rich'] = base_df.Pclass == 1
    return new_df
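The class-median fare fill that the getFare helper performs is worth seeing in isolation. This is a minimal pure-Python sketch of the same idea, using hypothetical toy data rather than the real Titanic file:

```python
# Hypothetical illustration of the class-median fare fill performed
# inside modify_data: missing fares are replaced by the median fare
# of the passenger's class.
from statistics import median

passengers = [
    {'Pclass': 1, 'Fare': 80.0},
    {'Pclass': 1, 'Fare': 60.0},
    {'Pclass': 3, 'Fare': 7.25},
    {'Pclass': 3, 'Fare': None},   # missing fare to be filled
    {'Pclass': 3, 'Fare': 8.05},
]

# Median fare per passenger class, ignoring missing values.
fares_by_class = {}
for pclass in {p['Pclass'] for p in passengers}:
    known = [p['Fare'] for p in passengers
             if p['Pclass'] == pclass and p['Fare'] is not None]
    fares_by_class[pclass] = median(known)

# Fill each missing fare with its class median.
for p in passengers:
    if p['Fare'] is None:
        p['Fare'] = fares_by_class[p['Pclass']]
```

In the notebook itself, `base_df.groupby('Pclass').Fare.median()` computes the same per-class medians in one call.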
Load the training and test data and run them through modify_data. y contains the known correct values for the training data, and ids
holds the passenger IDs needed for submission.
The fillna call replaces any remaining unknown values in the data.
Finally, the for loop replaces non-numeric values with numeric identifiers. The Title field will contain strings like Mr., Mrs., and Dr., which are translated to numeric values here.
In [7]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
y = train.Survived.values
ids = test['PassengerId'].values
train = modify_data(train)
test = modify_data(test)
train = train.fillna(-1)
test = test.fillna(-1)
for f in train.columns:
    if train[f].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train[f].values) + list(test[f].values))
        train[f] = lbl.transform(list(train[f].values))
        test[f] = lbl.transform(list(test[f].values))
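For readers unfamiliar with LabelEncoder, here is a minimal pure-Python sketch of what it does with the Title column: each distinct string is assigned an integer code, in sorted order (the toy title list here is illustrative, not data from the notebook):

```python
# A pure-Python sketch of LabelEncoder's behavior: each distinct
# string maps to an integer code, assigned in sorted order.
titles = ['Mr', 'Mrs', 'Dr', 'Mr', 'Miss']

classes = sorted(set(titles))                 # the learned vocabulary
code_for = {label: i for i, label in enumerate(classes)}
encoded = [code_for[t] for t in titles]       # integer-coded column
```

Note that the notebook fits the encoder on the train and test values combined, so both sets share one consistent mapping.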
These are the models we'll use. Notice that adding a new one is basically a trivial operation. I did a small bit of parameter tuning for the first three.
In [8]:
models = {'xgb': xgb.XGBClassifier(n_estimators=2700,
                                   nthread=-1,
                                   max_depth=12,
                                   learning_rate=0.09,
                                   silent=True,
                                   subsample=0.8,
                                   colsample_bytree=0.75),
          'rf': RandomForestClassifier(n_estimators=150, criterion='gini'),
          'linearsvc': LinearSVC(C=0.13, loss='hinge'),
          'linearsvr': LinearSVR(),
          'nusvc': NuSVC(),
          'nusvr': NuSVR(),
          'dtc': DecisionTreeClassifier(),
          'dtr': DecisionTreeRegressor(),
          'etc': ExtraTreeClassifier(),
          'etr': ExtraTreeRegressor(),
          'rfr': RandomForestRegressor(),
          'bc': BaggingClassifier(),
          'lr': LinearRegression(),
          'logit': LogisticRegression()}
Now we'll run the data through our models, using cross validation to evaluate each one.
In [10]:
model_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, train, y, cv=3)
    model_scores[name] = scores.mean()
    print(name, model_scores[name])
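cross_val_score handles the fold splitting and scoring internally. As an illustrative sketch (not sklearn's actual implementation), the k-fold index splitting underneath it can be written by hand like this:

```python
# A hand-rolled sketch of k-fold index splitting, the mechanism behind
# cross_val_score(cv=3): partition the sample indices into k folds,
# holding each fold out in turn as the test set.
def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Earlier folds absorb the remainder when n is not divisible by k.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(10, 3))
```

A model would be fit on each train_idx slice and scored on the matching test_idx slice; the mean of the k scores is what the notebook records in model_scores.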
The rest just writes out results. The only interesting thing here is that I created a basic ensemble from the above models, in which each "good" model (cross-validation score of at least 0.76) gets a vote.
In [11]:
pred_array = []
for m, score in model_scores.items():
    # Only models that scored reasonably well get a vote in the ensemble.
    if score < 0.76:
        continue
    model = models[m].fit(train, y)
    preds = model.predict(test)
    pred_array.append(preds)
    results = pd.DataFrame({'PassengerId': ids, 'Survived': preds})
    results['PassengerId'] = results['PassengerId'].astype('int')
    results.to_csv('output/test_results_{}.csv'.format(m), index=False)
# Sum the votes across models and apply a majority rule.
ensemble_preds = [0] * len(ids)
for p in pred_array:
    ensemble_preds = [a + b for a, b in zip(ensemble_preds, p)]
votes = [0 if a < len(pred_array) / 2 else 1 for a in ensemble_preds]
results = pd.DataFrame({'PassengerId': ids, 'Survived': votes})
results['PassengerId'] = results['PassengerId'].astype('int')
results.to_csv('output/test_results_ensemble.csv', index=False)
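The majority vote can be seen in miniature with toy data. This hypothetical example uses three mock prediction vectors for four passengers, not real model output:

```python
# A compact restatement of the majority-vote ensemble: each model's
# prediction vector contributes one vote per passenger, and a passenger
# is marked survived when at least half the models predict survival.
pred_array = [
    [1, 0, 1, 0],  # mock predictions from model A
    [1, 1, 0, 0],  # model B
    [1, 0, 0, 1],  # model C
]

# Column-wise vote totals, then the majority threshold.
totals = [sum(col) for col in zip(*pred_array)]
votes = [0 if t < len(pred_array) / 2 else 1 for t in totals]
```

With a strict `<` comparison against half the model count, an exact tie (possible with an even number of voters) resolves to 1, i.e. survived.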