Predicting which passengers survived the sinking of the Titanic

  • using a random forest this time

In [1]:
import csv
import numpy as np
import pandas as pd

In [2]:
# We can use the pandas library in python to read in the csv file.
# This creates a pandas dataframe and assigns it to the titanic variable.
titanic = pd.read_csv("data/train.csv")

# Print the first 5 rows of the dataframe.
print(titanic.head(5))


   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex  Age  SibSp  \
0                            Braund, Mr. Owen Harris    male   22      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1   
2                             Heikkinen, Miss. Laina  female   26      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35      1   
4                           Allen, Mr. William Henry    male   35      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

In [3]:
print(titanic.describe())


       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
  • Since Age is missing values (only 714 of the 891 rows have one), we'll clean it up by filling in the median age; a quick check of the remaining gaps follows the cell below

In [4]:
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

Non-numeric columns

  • We have to either exclude our non-numeric columns when we train our algorithm (Name, Sex, Cabin, Embarked, and Ticket), or find a way to convert them to numeric columns.
  • We'll ignore the Ticket, Cabin, and Name columns. There isn't much information we can extract from them. Most of the values in the Cabin column are missing (only 204 values out of 891 rows), and it likely isn't a particularly informative column in the first place (the quick check after this list confirms how sparse it is).
  • The Ticket and Name columns are unlikely to tell us much without some domain knowledge about what the ticket numbers mean, and about which names correlate with characteristics like large or rich families.
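
As that quick check, here is a small supplementary cell (not in the original notebook) that lists which columns are already numeric and counts how sparse Cabin really is:

In [ ]:
# Columns that are already numeric and usable as-is.
print(titanic.select_dtypes(include=[np.number]).columns.tolist())
# Cabin is mostly empty -- only 204 of the 891 training rows have a value.
print(titanic["Cabin"].notnull().sum())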

Converting the Sex column

  • The Sex column is non-numeric, but we want to keep it around
  • We first have to find all the unique values in the column (we know male and female are there, but did whoever recorded the dataset use another code for missing values?)
  • We'll then assign a code of 0 to male and a code of 1 to female

In [5]:
# Find all the unique genders -- the column appears to contain only male and female.
print(titanic["Sex"].unique())

# Replace all the occurrences of male with the number 0.
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1


['male' 'female']

Converting the Embarked column


In [6]:
# Find all the unique values for "Embarked".
print(titanic["Embarked"].unique())

titanic["Embarked"] = titanic["Embarked"].fillna("S")
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2


['S' 'C' 'Q' nan]
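
The same encodings can be written more compactly with a dictionary and Series.map. The cell below is only an illustrative alternative to the .loc assignments above; it works on a fresh copy (named demo purely for this sketch) so it doesn't clash with the columns we just converted:

In [ ]:
# Equivalent, more compact encoding using Series.map on a fresh copy of the data.
demo = pd.read_csv("data/train.csv")
demo["Sex"] = demo["Sex"].map({"male": 0, "female": 1})
demo["Embarked"] = demo["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
print(demo["Sex"].value_counts())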

In [7]:
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm with the default parameters
# n_estimators is the number of trees we want to build
# min_samples_split is the minimum number of rows we need to make a split
# min_samples_leaf is the minimum number of samples allowed at a leaf (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)

scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())


0.801346801347

Parameter Tuning

  • train more trees to increase accuracy
  • tweak min_samples_split and min_samples_leaf to help prevent overfitting (a grid-search sketch for automating this follows the next cell)

In [8]:
alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)

scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())


0.820426487093
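
Rather than tweaking these parameters by hand, a grid search can try the combinations for us. The sketch below uses scikit-learn's GridSearchCV with an illustrative parameter grid; on older scikit-learn versions (matching the cross_validation import used here) it lives in sklearn.grid_search, while newer versions move it to sklearn.model_selection:

In [ ]:
# Small, illustrative grid search over the forest size and split/leaf settings.
from sklearn.grid_search import GridSearchCV  # use sklearn.model_selection on newer versions

param_grid = {
    "n_estimators": [50, 150, 300],
    "min_samples_split": [2, 4, 8],
    "min_samples_leaf": [1, 2, 4],
}
grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
grid.fit(titanic[predictors], titanic["Survived"])
print(grid.best_params_)
print(grid.best_score_)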

Generating new features

  • The length of the name -- this could correlate with how rich the person was, and therefore their position on the Titanic.
  • The total number of people in a family (SibSp + Parch).

In [9]:
# Generating a familysize column
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]

# The .apply method generates a new series
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))

Extract the title using a regexp and map each unique title to an integer value


In [10]:
import re
import pandas

# A function to get the title from a name.
def get_title(name):
    # Use a regular expression to search for a title.  Titles always consist of capital and lowercase letters, and end with a period.
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""

# Get all the titles and print how often each one occurs.
titles = titanic["Name"].apply(get_title)
print(pandas.value_counts(titles))

# Map each title to an integer.  Some titles are very rare, and are compressed into the same codes as other titles.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k,v in title_mapping.items():
    titles[titles == k] = v

# Verify that we converted everything.
print(pandas.value_counts(titles))

# Add in the title column.
titanic["Title"] = titles


Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Col           2
Major         2
Mlle          2
Countess      1
Ms            1
Lady          1
Jonkheer      1
Don           1
Mme           1
Capt          1
Sir           1
Name: Name, dtype: int64
1     517
2     183
3     125
4      40
5       7
6       6
7       5
10      3
8       3
9       2
Name: Name, dtype: int64
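
To make the regular expression concrete, here is a quick spot check of get_title on two names from the training data (purely illustrative):

In [ ]:
# Spot-check the title extraction on names we saw in the head() output earlier.
print(get_title("Braund, Mr. Owen Harris"))    # prints Mr
print(get_title("Heikkinen, Miss. Laina"))     # prints Miss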

Family groups

We can also generate a feature indicating which family people are in. Because survival likely depended heavily on your family and the people around you, this has a good chance of being a useful feature.

To get this, we'll concatenate someone's last name with FamilySize to get a unique family id. We'll then be able to assign a code to each person based on their family id.


In [11]:
import operator

# A dictionary mapping family name to id
family_id_mapping = {}

# A function to get the id given a row
def get_family_id(row):
    # Find the last name by splitting on a comma
    last_name = row["Name"].split(",")[0]
    # Create the family id
    family_id = "{0}{1}".format(last_name, row["FamilySize"])
    # Look up the id in the mapping
    if family_id not in family_id_mapping:
        if len(family_id_mapping) == 0:
            current_id = 1
        else:
            # Get the maximum id from the mapping and add one to it if we don't have an id
            current_id = (max(family_id_mapping.items(), key=operator.itemgetter(1))[1] + 1)
        family_id_mapping[family_id] = current_id
    return family_id_mapping[family_id]

# Get the family ids with the apply method
family_ids = titanic.apply(get_family_id, axis=1)

# There are a lot of family ids, so we'll compress all families with fewer than 3 relatives aboard (FamilySize under 3) into one code.
family_ids[titanic["FamilySize"] < 3] = -1

# Print the count of each unique id.
print(pandas.value_counts(family_ids))

titanic["FamilyId"] = family_ids


-1      800
 14       8
 149      7
 63       6
 50       6
 59       6
 17       5
 384      4
 27       4
 25       4
 162      4
 8        4
 84       4
 340      4
 43       3
 269      3
 58       3
 633      2
 167      2
 280      2
 510      2
 90       2
 83       1
 625      1
 376      1
 449      1
 498      1
 588      1
dtype: int64
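
As a side note, the same last-name + FamilySize key can be built vectorised and turned into integer codes with pandas.factorize. This is only a sketch of an alternative: unlike the family_id_mapping dictionary above, factorize assigns codes per call, so it would not reuse the training ids when we process the test set later.

In [ ]:
# Vectorised sketch of the same idea (illustrative only; the notebook keeps using family_id_mapping).
family_key = titanic["Name"].str.split(",").str[0] + titanic["FamilySize"].astype(str)
codes, uniques = pd.factorize(family_key)
print(len(uniques))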

In [12]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_selection import SelectKBest, f_classif

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]

# Perform feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

# Get the raw p-values for each feature, and transform from p-values into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores.  See how "Pclass", "Sex", "Title", and "Fare" are the best?
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

# Pick only the four best features.
predictors = ["Pclass", "Sex", "Fare", "Title"]

alg = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=8, min_samples_leaf=4)

scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)

# Take the mean of the scores (because we have one for each fold)
print(scores.mean())


0.811447811448
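
Since the bar chart itself isn't reproduced here, the same information can be printed directly. The cell below recomputes the scores for the ten candidate predictors (the names selection_predictors and feature_scores are introduced just for this sketch) and lists them highest first:

In [ ]:
# Print each candidate predictor next to its univariate score, highest first.
selection_predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]
feature_scores = -np.log10(selector.pvalues_)
for name, score in sorted(zip(selection_predictors, feature_scores), key=lambda pair: -pair[1]):
    print("{0}: {1:.2f}".format(name, score))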

In [15]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold
import numpy as np

# The algorithms we want to ensemble.
# We're using the more linear predictors for the logistic regression, and everything with the gradient boosting classifier.
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]],
    [LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]

# Initialize the cross validation folds
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    # Make predictions for each algorithm on each fold
    for alg, predictors in algorithms:
        # Fit the algorithm on the training data.
        alg.fit(titanic[predictors].iloc[train,:], train_target)
        # Select and predict on the test fold.  
        # The .astype(float) is necessary to convert the dataframe to all floats and avoid an sklearn error.
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test,:].astype(float))[:,1]
        full_test_predictions.append(test_predictions)
    # Use a simple ensembling scheme -- just average the predictions to get the final classification.
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    # Any value over .5 becomes a 1 prediction, and anything at or below .5 becomes a 0 prediction.
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)

# Put all the predictions together into one array.
predictions = np.concatenate(predictions, axis=0)

# Compute accuracy as the fraction of predictions that match the training labels.
accuracy = (predictions == titanic["Survived"]).mean()
print(accuracy)


0.819304152637

Run it over the test data


In [17]:
import pandas
titanic_test = pandas.read_csv("data/test.csv")
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0 
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")

titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2

# First, we'll add titles to the test set.
titles = titanic_test["Name"].apply(get_title)
# We're adding the Dona title to the mapping, because it's in the test set, but not the training set
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2, "Dona": 10}
for k,v in title_mapping.items():
    titles[titles == k] = v
titanic_test["Title"] = titles
# Check the counts of each unique title.
print(pandas.value_counts(titanic_test["Title"]))

# Now, we add the family size column.
titanic_test["FamilySize"] = titanic_test["SibSp"] + titanic_test["Parch"]

# Now we can add family ids.
# We'll use the same ids that we did earlier.
print(family_id_mapping)

family_ids = titanic_test.apply(get_family_id, axis=1)
family_ids[titanic_test["FamilySize"] < 3] = -1
titanic_test["FamilyId"] = family_ids

titanic_test["NameLength"] = titanic_test["Name"].apply(lambda x: len(x))


1     240
2      79
3      72
4      21
7       2
6       2
10      1
5       1
Name: Title, dtype: int64
{"O'Sullivan0": 426, 'Mangan0': 620, 'Lindqvist1': 543, 'Denkoff0': 297, 'Rouse0': 413, 'Berglund0': 207, 'Meo0': 142, 'Arnold-Franchi1': 49, 'Chronopoulos1': 71, 'Skoog5': 63, 'Widener2': 329, 'Pengelly0': 217, 'Goncalves0': 400, 'Myhrman0': 626, 'Beane1': 456, 'Moss0': 104, 'Carlsson0': 610, 'Nicholls2': 136, 'Jussila1': 110, 'Jussila0': 483, 'Long0': 632, 'Wheadon0': 33, 'Connolly0': 261, 'Hansen2': 680, 'Stephenson1': 493, 'Davies0': 336, 'Silven2': 359, 'Vanden Steen0': 311, 'Astor1': 571, 'Patchett0': 480, 'Johanson0': 184, 'Coleridge0': 220, 'Christmann0': 87, 'Carter3': 340, 'Compton2': 665, 'Carter1': 226, 'Turkula0': 414, 'Hassab0': 558, 'Saad0': 566, 'Mellors0': 208, 'Mamee0': 36, 'Madsen0': 119, 'Anderson0': 395, 'Kraeff0': 42, 'Robbins0': 468, 'Lundahl0': 522, 'Gilinski0': 490, 'Porter0': 107, 'Sdycoff0': 352, 'Green0': 204, 'Bishop1': 263, 'Sinkkonen0': 603, 'Otter0': 640, 'Dahl0': 299, 'Troutt0': 582, 'Samaan2': 48, 'Edvardsson0': 553, 'Petroff0': 98, 'Burke0': 134, 'Cardeza1': 556, 'Hawksford0': 599, 'Somerton0': 416, 'Healy0': 248, 'Andersson0': 137, 'Fortune5': 27, 'Andersson6': 14, 'Johnson2': 9, 'Johnson0': 273, 'Coxon0': 91, 'Banfield0': 696, 'McCarthy0': 7, 'Panula5': 50, 'West3': 58, 'Kallio0': 374, 'Hoyt1': 206, 'Hoyt0': 638, 'Mannion0': 589, 'Touma2': 232, 'Futrelle1': 4, 'Jardin0': 507, 'Lemore0': 437, 'Davies2': 460, 'Robert1': 630, 'Butt0': 451, 'Wick2': 286, 'Emanuel0': 628, 'Ibrahim Shawah0': 643, 'Carbines0': 175, 'McEvoy0': 583, 'Moutal0': 75, 'Ling0': 157, 'Watson0': 552, 'Lahoud0': 443, 'Sedgwick0': 302, 'Beesley0': 22, 'Sheerlinck0': 79, 'Hunt0': 218, 'Clifford0': 408, 'Stranden0': 602, 'Ahlin1': 40, 'Woolner0': 55, 'Caram1': 482, 'Sandstrom2': 11, 'Nakid2': 333, 'Frost0': 412, 'Moen0': 73, 'Henry0': 239, 'Persson1': 241, 'Mudd0': 671, 'Glynn0': 32, 'Bowerman1': 312, 'Daniel0': 505, 'Nysten0': 132, 'Saalfeld0': 270, 'Turcin0': 173, 'Nasser1': 10, 'Waelens0': 78, 'Homer0': 502, 'Graham1': 242, 'Graham0': 699, 'Brown0': 177, 'Dorking0': 256, 'Bing0': 72, 'Blank0': 191, 'Tornquist0': 245, 'Olsen1': 180, 'Olsen0': 144, 'Davison1': 304, 'Butler0': 544, 'Calderhead0': 575, 'Cor0': 530, 'Kalvik0': 534, 'Hagland1': 386, 'Boulos2': 131, 'Boulos0': 497, 'Leitch0': 496, 'Endres0': 581, 'Fleming0': 275, 'Natsch1': 247, 'Garfirth0': 614, 'Blackwell0': 300, 'Thayer2': 461, 'Holm0': 657, 'Rekic0': 105, 'Allen0': 5, 'Hendekovic0': 282, 'Svensson0': 424, 'Montvila0': 698, 'Perreault0': 441, 'Francatelli0': 278, 'Gronnestad0': 621, 'Maisner0': 399, 'Radeff0': 537, 'Daly0': 432, 'Jarvis0': 488, 'Bowen0': 515, 'Risien0': 453, 'Walker0': 436, 'Klaber0': 578, 'Wright0': 466, 'Artagaveytia0': 419, 'Navratil2': 138, 'Hays2': 658, 'Seward0': 383, 'Hays0': 279, 'Appleton2': 478, 'Fry0': 654, 'Romaine0': 171, 'de Messemaeker1': 469, 'Gustafsson0': 331, 'Gustafsson2': 101, 'Lulic0': 659, 'Beckwith2': 225, 'Sutton0': 516, 'Cunningham0': 355, 'Behr0': 700, 'Rothschild1': 434, 'Stankovic0': 257, 'Abelson1': 277, 'Lobb1': 230, 'Shellard0': 423, 'Knight0': 593, 'Silverthorne0': 572, "O'Leary0": 535, 'Saundercock0': 13, 'Baxter1': 114, 'Elias0': 624, 'Nankoff0': 598, 'Mitkoff0': 533, 'Lines1': 677, 'Burns0': 298, "O'Driscoll0": 47, 'Cumings1': 2, 'Culumovic0': 674, 'Penasco y Castellana1': 276, 'Parkes0': 251, 'Duff Gordon1': 467, 'Sloper0': 24, 'Strom2': 228, 'Ringhini0': 327, 'Barah0': 616, 'Richards2': 350, 'Crease0': 67, 'Cavendish1': 600, 'Bjornstrom-Steffansson0': 371, 'Masselmani0': 20, 'Moubarek2': 65, 'Turja0': 555, 'Goldsmith2': 154, 'Fahlstrom0': 210, 'Holverson1': 35, 
'Herman3': 510, 'Nenkoff0': 205, 'Nysveen0': 292, 'Eklund0': 617, 'Paulner0': 487, 'Nilsson0': 284, 'Dick1': 563, 'Dantcheff0': 639, 'Swift0': 682, 'Humblen0': 570, 'McCormack0': 661, 'Duane0': 253, 'Ekstrom0': 121, 'Wiseman0': 366, 'Smith0': 160, 'Marvin1': 604, 'Cleaver0': 576, 'Gill0': 683, 'Cameron0': 193, 'Augustsson0': 663, 'Petranec0': 97, 'Stahelin-Maeglin0': 523, 'McDermott0': 80, 'Dean3': 90, 'Laleff0': 691, 'Meyer0': 649, 'Meyer1': 34, 'Johansson0': 100, 'Cacic0': 405, 'Ryerson4': 280, 'Becker3': 167, 'Tobin0': 627, 'Murphy1': 219, 'Goldschmidt0': 93, 'Kink2': 68, 'Weisz1': 125, 'Hold1': 215, 'Stanley0': 420, 'Minahan2': 222, 'Toufik0': 450, 'Minahan1': 354, 'Olsvigen0': 559, 'Odahl0': 307, 'Coleff0': 435, 'Vestrom0': 15, 'Hood0': 70, 'Yousseff0': 421, 'Barton0': 109, 'Icard0': 61, 'Smiljanic0': 148, 'Harmer0': 634, 'Ford4': 84, 'Maioni0': 428, 'Abbing0': 675, 'Oreskovic0': 347, 'Sundman0': 356, 'Kink-Heilmann2': 168, 'Lehmann0': 339, 'Dennis0': 288, 'del Carlo1': 316, 'Najib0': 690, 'Ohman0': 465, 'McCoy2': 272, 'Osen0': 129, 'Pernot0': 166, 'Kent0': 415, 'Turpin1': 41, 'Zabour1': 108, 'Alexander0': 650, 'Palsson4': 8, 'Hogeboom1': 618, 'Keefe0': 404, 'Eustis1': 422, 'Marechal0': 669, 'Uruchurtu0': 30, 'Downton0': 485, 'Ilmakangas1': 591, 'Corn0': 147, 'Parrish1': 236, 'Jerwan0': 406, 'Aks1': 678, 'Lam0': 565, 'Slayter0': 290, 'Weir0': 567, 'Frolicher2': 454, 'Danbom2': 365, 'Alhomaki0': 670, 'Brown2': 548, 'Faunthorpe1': 53, 'Givard0': 195, 'Leyson0': 213, 'Potter1': 692, 'Emir0': 26, 'Slocovski0': 85, 'Celotti0': 86, 'Rood0': 169, 'Betros0': 330, 'Larsson0': 211, 'Andreasson0': 88, 'Vande Walle0': 183, 'Morley0': 396, 'Trout0': 344, 'Thorneycroft1': 372, 'Shorney0': 92, 'Attalah0': 111, 'Ward0': 235, 'Yousif0': 310, 'Wiklund1': 325, 'Thorne0': 233, 'Warren1': 320, 'Milling0': 398, 'Rommetvedt0': 545, 'Pickard0': 370, 'Hassan0': 592, 'Heininen0': 655, 'Bazzani0': 200, 'Plotcharsky0': 335, 'Lindell1': 503, 'Caldwell2': 76, 'Meanwell0': 474, 'Bystrom0': 684, 'Harrison0': 238, 'Goldenberg1': 388, 'Webber0': 117, 'Shelley1': 693, 'Mayne0': 577, 'Honkanen0': 198, 'Karlsson0': 410, 'Calic0': 153, 'Eitemiller0': 538, 'Chapman0': 568, 'Asplund6': 25, 'Peuchen0': 385, 'Cairns0': 244, 'Bailey0': 611, 'Hanna0': 268, 'Garside0': 481, 'Mernagh0': 179, 'Staneff0': 74, 'Toomey0': 393, 'Canavan0': 425, 'Dakic0': 560, 'Foo0': 529, 'Pasic0': 666, 'Novel0': 57, 'Lemberopolous0': 673, 'Hocking4': 625, 'Moran1': 106, 'Abbott2': 252, 'Kvillner0': 377, 'Elias2': 309, 'Doharr0': 476, 'Norman0': 472, 'Leader0': 641, 'Smart0': 402, 'White1': 99, 'Gale1': 348, 'Doling1': 95, 'Moor1': 607, 'Taussig2': 237, 'Pinsky0': 174, 'Gallagher0': 573, 'Markun0': 694, 'de Mulder0': 258, 'Kassem0': 444, 'Yrois0': 182, 'Kantor1': 96, 'Sobey0': 126, 'Shutes0': 506, 'Brocklebank0': 509, 'Cherry0': 234, 'Dooley0': 701, 'Hedman0': 646, 'Madigan0': 181, 'Parr0': 524, 'Campbell0': 401, 'Mullens0': 569, 'Charters0': 363, 'Jonsson0': 477, 'Ostby1': 54, 'Wilhelms0': 551, 'Williams1': 145, 'Williams0': 18, 'Maenpaa0': 221, 'Stone0': 662, 'Lennon1': 46, 'Olsson0': 254, 'Harknett0': 214, 'Horgan0': 508, 'Hart0': 353, 'Hart2': 283, 'Ilett0': 82, 'Baumann0': 156, 'Youseff0': 185, 'Reeves0': 240, 'Thomas1': 645, 'Jensen0': 527, 'Jensen1': 584, 'Cribb1': 150, 'Ross0': 486, 'Andersen-Jensen1': 176, 'Gilnagh0': 146, 'Moran0': 6, 'Coelho0': 123, 'Razi0': 679, 'Reuchlin0': 660, 'Kiernan1': 196, 'Partner0': 295, 'Jermyn0': 322, 'Salkjelsvik0': 103, 'Lahtinen2': 281, 'Drew2': 358, 'Hosono0': 260, 'Danoff0': 289, 'van Billiard2': 143, 
'Bracken0': 203, 'Coutts2': 305, 'Connors0': 113, 'Jenkin0': 69, 'Hodges0': 586, 'Vovk0': 442, 'Backstrom1': 188, 'Devaney0': 44, 'Backstrom3': 83, 'Barbara1': 317, "O'Connell0": 520, 'Vande Velde0': 608, 'Reynaldo0': 380, 'Isham0': 163, "O'Brien0": 463, "O'Brien1": 170, 'Giglio0': 130, 'Ponesell0': 644, 'Nicola-Yarred1': 39, 'Angle1': 439, 'Mellinger1': 246, 'Newell1': 197, 'Lang0': 431, 'Davidson1': 549, 'Richard0': 127, 'Ball0': 293, 'Drazenoic0': 122, 'Robins1': 124, 'Richards5': 376, 'Nosworthy0': 51, 'Moore0': 116, 'Newsom2': 128, 'Sjoblom0': 635, 'Zimmerman0': 364, 'Kenyon1': 392, 'Fynney0': 21, 'Laroche3': 43, 'Renouf1': 409, 'Silvey1': 375, 'Christy2': 484, 'Renouf3': 588, 'Theobald0': 612, 'Pekoniemi0': 112, 'Bateman0': 140, 'Spencer1': 31, 'Allison3': 269, 'Cohen0': 186, 'Dimic0': 306, 'Bourke2': 172, 'Farthing0': 447, 'Hamalainen2': 224, 'Karaic0': 504, 'Pettersson0': 648, 'Landergren0': 328, 'Willey0': 532, 'Braund1': 1, 'Slemen0': 652, 'Jalsevac0': 390, 'Harder1': 324, 'Elsbury0': 494, 'Carrau0': 81, 'Sutehall0': 697, 'Sawyer0': 554, 'Vander Planke1': 19, 'Peters0': 557, 'Vander Planke2': 38, 'Dahlberg0': 695, 'Sharp0': 462, 'Sirota0': 667, 'Barkworth0': 521, 'Mack0': 623, 'Sjostedt0': 212, 'Slabenoff0': 499, 'Dodge2': 382, 'Mineff0': 266, 'Hocking3': 449, 'Hickman2': 115, 'Perkin0': 194, 'Stewart0': 64, 'Giles1': 681, 'Lindblom0': 250, 'Meek0': 357, 'Morrow0': 470, 'Newell2': 539, 'Beavan0': 326, 'Balkic0': 688, 'van Melkebeke0': 687, 'Foreman0': 387, 'Hampe0': 378, 'Birkeland0': 351, 'Mockler0': 315, 'Millet0': 391, 'Peter2': 120, 'Yasbeck1': 512, 'Osman0': 642, 'Buss0': 337, 'Sunderland0': 202, 'Lefebre4': 162, 'Bryhl1': 590, 'McGovern0': 314, 'Nye0': 66, 'Davis0': 525, 'Aubart0': 323, 'Spedden2': 287, 'Harris0': 201, 'Harris1': 62, 'McGough0': 433, 'Hewlett0': 16, 'Cann0': 37, 'Mallet2': 656, 'Kirkland0': 517, 'Ridsdale0': 446, 'Connaghton0': 605, 'Stoytcheff0': 475, 'Hippach1': 294, 'Dowdell0': 77, 'Klasen2': 161, 'Rothes0': 613, "O'Connor0": 394, 'Ivanoff0': 597, 'Clarke1': 367, 'Scanlan0': 403, 'Troupiansky0': 595, 'Reed0': 227, 'Rintamaki0': 492, 'Chambers1': 587, 'Haas0': 265, 'Greenfield1': 94, 'Jansson0': 341, 'Gheorgheff0': 362, 'Mitchell0': 550, 'Bonnell0': 12, 'Murdlin0': 491, 'Lester0': 651, 'Van Impe2': 361, 'Simonius-Blumer0': 531, 'Torber0': 501, 'Gillespie0': 585, 'Naidenoff0': 259, 'McNamee1': 601, 'de Pelsmaeker0': 255, 'Quick2': 429, 'Byles0': 139, 'Sage10': 149, 'Frolicher-Stehli2': 489, 'Peduzzi0': 389, 'Chibnall1': 155, 'Hakkarainen1': 133, 'Watt0': 151, 'Sirayanian0': 60, 'Leonard0': 165, 'Bissette0': 243, 'Adahl0': 319, 'LeRoy0': 452, 'Tomlin0': 653, 'Kimball1': 513, 'Gavey0': 511, 'Mionoff0': 102, 'Farrell0': 445, 'Goodwin7': 59, 'Strom1': 187, 'Ali0': 192, 'Funk0': 313, 'Van der hoef0': 158, 'Roebling0': 686, 'Simmons0': 473, 'Asim0': 318, 'Rosblom2': 231, 'Lovell0': 209, 'Phillips0': 368, 'Leinonen0': 526, 'Markoff0': 676, 'Levy0': 264, 'Frauenthal2': 540, 'Frauenthal1': 296, 'Baclini3': 384, 'Johannesen-Bratthammer0': 381, 'Jonkoff0': 609, 'Williams-Lambert0': 308, 'Serepeca0': 672, 'Longley0': 518, "O'Dwyer0": 28, 'Kelly0': 271, 'Douglas1': 457, 'Crosby2': 455, 'Niskanen0': 345, 'Collander0': 301, 'Gee0': 397, 'Tikkanen0': 334, 'McKane0': 342, 'Molson0': 418, 'Bostandyeff0': 519, 'Ayoub0': 631, 'Rogers0': 45, 'Chip0': 668, 'Cook0': 546, 'Stead0': 229, 'Soholt0': 580, 'Moraweck0': 285, 'Bidois0': 332, 'Taylor1': 547, 'Bradley0': 430, 'Fischer0': 561, 'Barber0': 262, 'Albimona0': 189, 'Hale0': 164, 'Flynn0': 369, 'Sadlier0': 338, 'Keane0': 274, 
'Young0': 291, 'Lindahl0': 223, 'Rugg0': 56, 'Johnston3': 633, 'Petterson1': 379, 'Pavlovic0': 440, 'Sivic0': 471, 'Hirvonen1': 411, 'Allum0': 664, 'Harrington0': 500, 'Windelov0': 417, 'Lesurer0': 596, 'Pain0': 343, 'Lievens0': 622, 'Fox0': 303, 'Lewy0': 267, 'McMahon0': 118, 'Greenberg0': 579, 'Heikkinen0': 3, 'Louch1': 373, 'Chapman1': 495, 'Berriman0': 594, 'Moussa0': 321, 'Ryan0': 438, 'Madill1': 562, 'Duran y More1': 685, 'Bengtsson0': 152, 'Harper1': 52, 'Kilgannon0': 629, 'Badt0': 541, 'Salonen0': 448, 'Guggenheim0': 636, 'Widegren0': 349, 'Brewe0': 619, 'Sivola0': 159, 'Nirva0': 615, 'Rice5': 17, 'Andrews0': 647, 'Andrews1': 249, 'Collyer2': 216, 'Strandberg0': 407, 'Colley0': 542, 'Nicholson0': 458, 'Chaffee1': 89, 'Gaskell0': 637, 'Wells2': 606, 'Leeni0': 464, 'Todoroff0': 29, 'Adams0': 346, 'Pears1': 141, 'Sagesser0': 528, 'Carr0': 190, 'McGowan0': 23, 'Hansen0': 514, 'Hansen1': 574, 'Karun1': 564, 'Rush0': 479, 'Matthews0': 360, 'Laitinen0': 427, 'Padro y Manent0': 459, 'Andrew0': 135, 'Vander Cruyssen0': 689, 'Jacobsohn1': 199, 'Lurette0': 178, 'Jacobsohn3': 498, 'Hegarty0': 536}

In [18]:
predictors = ["Pclass", "Sex", "Age", "Fare", "Embarked", "FamilySize", "Title", "FamilyId"]

algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), predictors],
    [LogisticRegression(random_state=1), ["Pclass", "Sex", "Fare", "FamilySize", "Title", "Age", "Embarked"]]
]

full_predictions = []
for alg, predictors in algorithms:
    # Fit the algorithm using the full training data.
    alg.fit(titanic[predictors], titanic["Survived"])
    # Predict using the test dataset.  We have to convert all the columns to floats to avoid an error.
    predictions = alg.predict_proba(titanic_test[predictors].astype(float))[:,1]
    full_predictions.append(predictions)

# The gradient boosting classifier generates better predictions, so we weight it higher.
predictions = (full_predictions[0] * 3 + full_predictions[1]) / 4

predictions[predictions <= .5] = 0
predictions[predictions > .5] = 1
predictions = predictions.astype(int)
submission = pandas.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })
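
Before writing the file, a quick sanity check (not in the original notebook): the frame should have one row per test passenger and exactly the two columns Kaggle expects.

In [ ]:
# Sanity check the submission: one row per passenger in test.csv, columns PassengerId and Survived.
print(submission.shape)
print(submission.head())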

In [19]:
# Write the Kaggle submission file with the PassengerId and Survived columns.
submission.to_csv("kaggle.csv", index=False)