My experiments with the Kaggle Titanic challenge.

I copied the initial code and got many of the ideas for this first Kaggle adventure from here.

I will later compact the important stuff from here into a kernel on my Kaggle account.


In [1]:
import pandas as pd 
from pandas import Series, DataFrame
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.cross_validation import cross_val_score
from sklearn import cross_validation
from sklearn.metrics import accuracy_score
from sklearn.cross_validation import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.grid_search import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.cross_validation import StratifiedKFold
from sklearn.ensemble.gradient_boosting import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier

import xgboost as xgb
from xgboost import plot_importance

In [2]:
train_df = pd.read_csv("train.csv",dtype={"Age":np.float64},)
#train_df.head()

In [3]:
# how many non-null Age values are there?
train_df['Age'].count()


Out[3]:
714

In [4]:
# how many ages are NaN?
train_df['Age'].isnull().sum()


Out[4]:
177

In [5]:
# plot ages of training data set, with NaN's removed
if False:
    train_df['Age'].dropna().astype(int).hist(bins=70)
print 'Mean age = ',train_df['Age'].dropna().astype(int).mean()


Mean age =  29.6792717087

Let's see where the passengers embarked.


In [6]:
#train_df['Embarked'].head()

In [7]:
#train_df.info()

In [8]:
train_df['Embarked'].isnull().sum()


Out[8]:
2

In [9]:
train_df["Embarked"].count()


Out[9]:
889

In [10]:
if False:
    sns.countplot(x="Embarked",data=train_df)

In [11]:
if False:
    sns.countplot(x='Survived',hue='Embarked',data=train_df,order=[0,1])

OK, so clearly more people embarked at S, and their survival seems disproportionate. Let's check that.


In [12]:
if False:
    embark_survive_perc = train_df[["Embarked", "Survived"]].groupby(['Embarked'],as_index=False).mean()
    sns.barplot(x='Embarked', y='Survived', data=embark_survive_perc,order=['S','C','Q'])

Interesting: those from C actually had a higher rate of survival. So knowing more people from your home town didn't help.

Next, did how much they paid have an effect?


In [13]:
if False:
    train_df['Fare'].astype(int).plot(kind='hist',bins=100, xlim=(0,50))

In [14]:
# get the fares for passengers who survived and who didn't
if False:
    fare_not_survived = train_df["Fare"].astype(int)[train_df["Survived"] == 0]
    fare_survived     = train_df["Fare"].astype(int)[train_df["Survived"] == 1]

    # get average and std of the fare for survived/not-survived passengers
    average_fare = DataFrame([fare_not_survived.mean(), fare_survived.mean()])
    std_fare     = DataFrame([fare_not_survived.std(), fare_survived.std()])

    average_fare.index.names = std_fare.index.names = ["Survived"]
    average_fare.plot(yerr=std_fare,kind='bar',legend=False)

Before digging into how the ages factor in, let's take the advice of others and replace the NaNs with random values.


In [15]:
import scipy.stats as stats

# column 'Age' has some NaN values
# A simple approximation of the age distribution is a Gaussian, though this is not especially accurate.
# Let's make a vector of random ages centered on the mean, with a width of the std,
# truncated to the observed min/max.
lower, upper = train_df['Age'].min(), train_df['Age'].max()
mu, sigma = train_df["Age"].mean(), train_df["Age"].std()

# number of rows
n = train_df.shape[0]

print 'max: ',train_df['Age'].max()
print 'min: ',train_df['Age'].min()

# vector of random values using the truncated normal distribution.  
X = stats.truncnorm((lower - mu) / sigma, (upper - mu) / sigma, loc=mu, scale=sigma)
rands = X.rvs(n)

# boolean mask of the entries in the original column that are NOT NaN
idx = np.isfinite(train_df['Age'])

# use the indexes to replace the NON-NaNs in the random array with the good values from the original array
rands[idx.values] = train_df[idx]['Age'].values

## At this point rands is now the cleaned column of data we wanted, so push it in to the original df
train_df['Age'] = rands

"""
## we will make a new column with Nan's replaced, then push that into the original df
n = train_df.shape[0] # number of rows
#randy = np.random.randint(average_age_train - std_age_train, average_age_train + std_age_train, size = n)
# draw from a gaussian instead of simple uniform
# note this uses a 'standard gauss' and that tneeds to have its var and mean shifted
randy = np.random.randn(n)*std_age_train + average_age_train
idx = np.isfinite(train_df['Age']) # gives a boolean index for the NaNs in the df's column
randy[idx.values] = train_df[idx]['Age'].values  ## idexing the values of randy with this
#now have updated column, next push into original df
train_df['Age'] = randy
"""

print 'After this Gaussian replacement, there are: ',train_df['Age'].isnull().sum()
print 'max: ',train_df['Age'].max()
print 'min: ',train_df['Age'].min()


max:  80.0
min:  0.42
After this Gaussian replacement, there are:  0
max:  80.0
min:  0.42

In [16]:
# plot new Age Values
if False:
    train_df['Age'].hist(bins=70)
# Compare this to the histogram a few cells up of the raw ages with the NaNs dropped.  Not much difference, actually.

Let's perform the same NaN replacement for 'Age' on the test data as well.


In [17]:
## let's pull in the test data
test_df = pd.read_csv("test.csv",dtype={"Age":np.float64},)
#test_df.head()

In [18]:
#### Do the same for the test data
# column 'Age' has some NaN values
# A simple approximation of the age distribution is a Gaussian, though this is not especially accurate.
# Let's make a vector of random ages centered on the mean, with a width of the std,
# truncated to the observed min/max.
lower, upper = test_df['Age'].min(), test_df['Age'].max()
mu, sigma = test_df["Age"].mean(), test_df["Age"].std()

# number of rows
n = test_df.shape[0]

print 'max: ',test_df['Age'].max()
print 'min: ',test_df['Age'].min()

# vector of random values using the truncated normal distribution.  
X = stats.truncnorm((lower - mu) / sigma, (upper - mu) / sigma, loc=mu, scale=sigma)
rands = X.rvs(n)

# boolean mask of the entries in the original column that are NOT NaN
idx = np.isfinite(test_df['Age'])

# use the indexes to replace the NON-NaNs in the random array with the good values from the original array
rands[idx.values] = test_df[idx]['Age'].values

## At this point rands is now the cleaned column of data we wanted, so push it in to the original df
test_df['Age'] = rands


max:  76.0
min:  0.17

In [ ]:


In [ ]:


In [19]:
#test_df['Age'].hist(bins=70)

In [20]:
## Let's make a couple nice plots of survival vs age
# peaks for survived/not survived passengers by their age
if False:
    facet = sns.FacetGrid(train_df, hue="Survived",aspect=4)
    #facet.map(sns.kdeplot,'Age',shade= True) # This keeps crashing the kernel, but I don't know why!
    facet.set(xlim=(0, train_df['Age'].astype(int).max()))
    facet.add_legend()

In [21]:
# average survived passengers by age
if False:
    fig, axis1 = plt.subplots(1,1,figsize=(18,4))
    average_age = train_df[["Age", "Survived"]].groupby(['Age'],as_index=False).mean()
    sns.barplot(x='Age', y='Survived', data=average_age)
    print 'max: ',train_df['Age'].astype(int).max()
    print 'min: ',train_df['Age'].astype(int).min()

In [22]:
# Cabin
if False:
    # It has a lot of NaN values, so it probably wouldn't contribute much to the predictions anyway
    train_df.drop("Cabin",axis=1,inplace=True)
    test_df.drop("Cabin",axis=1,inplace=True)
## OR convert NaNs to 'U' meaning 'Unknown' and map all to new columns
if True:
    # Code based on that here: http://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html
    # replacing missing cabins with U (for Unknown)
    train_df.Cabin.fillna('U',inplace=True)
    # mapping each Cabin value with the cabin letter
    train_df['Cabin'] = train_df['Cabin'].map(lambda c : c[0])
    # dummy encoding ...
    cabin_dummies = pd.get_dummies(train_df['Cabin'],prefix='Cabin')
    train_df = pd.concat([train_df,cabin_dummies],axis=1)
    train_df.drop('Cabin',axis=1,inplace=True)
    
    # replacing missing cabins with U (for Unknown)
    test_df.Cabin.fillna('U',inplace=True)
    # mapping each Cabin value with the cabin letter
    test_df['Cabin'] = test_df['Cabin'].map(lambda c : c[0])
    # dummy encoding ...
    cabin_dummies = pd.get_dummies(test_df['Cabin'],prefix='Cabin')
    test_df = pd.concat([test_df,cabin_dummies],axis=1)
    test_df.drop('Cabin',axis=1,inplace=True)

In [23]:
#train_df.head()
#test_df.head()

In [24]:
#train_df.head()

The following cell introduces 4 new features:

  • FamilySize : the total number of relatives aboard, including the passenger (him/her)self.
  • Singleton : a boolean variable for families of size = 1
  • SmallFamily : a boolean variable for families of 2 <= size <= 4
  • LargeFamily : a boolean variable for families of size >= 5

In [25]:
# Family

# Code based on that here: http://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html
# introducing a new feature : the size of families (including the passenger)
train_df['FamilySize'] = train_df['Parch'] + train_df['SibSp'] + 1
# introducing other features based on the family size
train_df['Singleton'] = train_df['FamilySize'].map(lambda s : 1 if s == 1 else 0)
train_df['SmallFamily'] = train_df['FamilySize'].map(lambda s : 1 if 2<=s<=4 else 0)
train_df['LargeFamily'] = train_df['FamilySize'].map(lambda s : 1 if 5<=s else 0)

# Code based on that here: http://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html
# introducing a new feature : the size of families (including the passenger)
test_df['FamilySize'] = test_df['Parch'] + test_df['SibSp'] + 1
# introducing other features based on the family size
test_df['Singleton'] = test_df['FamilySize'].map(lambda s : 1 if s == 1 else 0)
test_df['SmallFamily'] = test_df['FamilySize'].map(lambda s : 1 if 2<=s<=4 else 0)
test_df['LargeFamily'] = test_df['FamilySize'].map(lambda s : 1 if 5<=s else 0)

if False:

    # Instead of having two columns, Parch & SibSp,
    # we can have a single column indicating whether the passenger had any family member aboard,
    # i.e. test whether having any family member (parent, sibling, spouse, etc.) increases the chance of survival.
    train_df['Family'] = train_df["Parch"] + train_df["SibSp"]
    train_df.loc[train_df['Family'] > 0, 'Family'] = 1
    train_df.loc[train_df['Family'] == 0, 'Family'] = 0

    test_df['Family'] = test_df["Parch"] + test_df["SibSp"]
    test_df.loc[test_df['Family'] > 0, 'Family'] = 1
    test_df.loc[test_df['Family'] == 0, 'Family'] = 0

    # drop Parch & SibSp
    train_df = train_df.drop(['SibSp','Parch'], axis=1)
    test_df    = test_df.drop(['SibSp','Parch'], axis=1)

# plot
if False:
    fig, (axis1,axis2) = plt.subplots(1,2,sharex=True,figsize=(10,5))

    # sns.factorplot('Family',data=train_df,kind='count',ax=axis1)
    sns.countplot(x='Family', data=train_df, order=[1,0], ax=axis1)

    # average of survived for those who had/didn't have any family member
    family_perc = train_df[["Family", "Survived"]].groupby(['Family'],as_index=False).mean()
    sns.barplot(x='Family', y='Survived', data=family_perc, order=[1,0], ax=axis2)

    axis1.set_xticklabels(["With Family","Alone"], rotation=0)

In [26]:
# Sex

# As we saw, children (age < ~16) aboard seem to have a higher chance of survival.
# So we can classify passengers as male, female, or child.
def get_person(passenger):
    age,sex = passenger
    return 'child' if age < 16 else sex
    
train_df['Person'] = train_df[['Age','Sex']].apply(get_person,axis=1)
test_df['Person']    = test_df[['Age','Sex']].apply(get_person,axis=1)

# No need to use Sex column since we created Person column
train_df.drop(['Sex'],axis=1,inplace=True)
test_df.drop(['Sex'],axis=1,inplace=True)

# create dummy variables for Person column
person_dummies_titanic  = pd.get_dummies(train_df['Person'])
person_dummies_titanic.columns = ['Child','Female','Male']
#person_dummies_titanic.drop(['Male'], axis=1, inplace=True)

person_dummies_test  = pd.get_dummies(test_df['Person'])
person_dummies_test.columns = ['Child','Female','Male']
#person_dummies_test.drop(['Male'], axis=1, inplace=True)

train_df = train_df.join(person_dummies_titanic)
test_df    = test_df.join(person_dummies_test)
if False:
    fig, (axis1,axis2) = plt.subplots(1,2,figsize=(10,5))

    # sns.factorplot('Person',data=train_df,kind='count',ax=axis1)
    sns.countplot(x='Person', data=train_df, ax=axis1)

    # average of survived for each Person(male, female, or child)
    person_perc = train_df[["Person", "Survived"]].groupby(['Person'],as_index=False).mean()
    sns.barplot(x='Person', y='Survived', data=person_perc, ax=axis2, order=['male','female','child'])

train_df.drop(['Person'],axis=1,inplace=True)
test_df.drop(['Person'],axis=1,inplace=True)

Not surprisingly, women and children had higher survival rates.


In [27]:
# Pclass

# sns.factorplot('Pclass',data=titanic_df,kind='count',order=[1,2,3])
if False:
    sns.factorplot('Pclass','Survived',order=[1,2,3], data=train_df,size=5)

# create dummy variables for Pclass column
pclass_dummies_titanic  = pd.get_dummies(train_df['Pclass'])
pclass_dummies_titanic.columns = ['Class_1','Class_2','Class_3']
#pclass_dummies_titanic.drop(['Class_3'], axis=1, inplace=True)

pclass_dummies_test  = pd.get_dummies(test_df['Pclass'])
pclass_dummies_test.columns = ['Class_1','Class_2','Class_3']
#pclass_dummies_test.drop(['Class_3'], axis=1, inplace=True)

train_df.drop(['Pclass'],axis=1,inplace=True)
test_df.drop(['Pclass'],axis=1,inplace=True)

train_df = train_df.join(pclass_dummies_titanic)
test_df    = test_df.join(pclass_dummies_test)

In [28]:
# Ticket
# Code based on that here: http://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html
# a function that extracts each prefix of the ticket, returns 'XXX' if no prefix (i.e the ticket is a digit)
def cleanTicket(ticket):
    ticket = ticket.replace('.','')
    ticket = ticket.replace('/','')
    ticket = ticket.split()
    ticket = map(lambda t : t.strip() , ticket)
    ticket = list(filter(lambda t : not t.isdigit(), ticket))
    if len(ticket) > 0:
        return ticket[0]
    else: 
        return 'XXX'
    
train_df['Ticket'] = train_df['Ticket'].map(cleanTicket)
tickets_dummies = pd.get_dummies(train_df['Ticket'],prefix='Ticket')
train_df = pd.concat([train_df, tickets_dummies],axis=1)
train_df.drop('Ticket',inplace=True,axis=1)

test_df['Ticket'] = test_df['Ticket'].map(cleanTicket)
tickets_dummies = pd.get_dummies(test_df['Ticket'],prefix='Ticket')
test_df = pd.concat([test_df, tickets_dummies],axis=1)
test_df.drop('Ticket',inplace=True,axis=1)

In [29]:
train_df.head()


Out[29]:
PassengerId Survived Name Age SibSp Parch Fare Embarked Cabin_A Cabin_B ... Ticket_SOPP Ticket_SOTONO2 Ticket_SOTONOQ Ticket_SP Ticket_STONO Ticket_STONO2 Ticket_SWPP Ticket_WC Ticket_WEP Ticket_XXX
0 1 0 Braund, Mr. Owen Harris 22.0 1 0 7.2500 S 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0 1 0 71.2833 C 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 3 1 Heikkinen, Miss. Laina 26.0 0 0 7.9250 S 0 0 ... 0 0 0 0 0 1 0 0 0 0
3 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 53.1000 S 0 0 ... 0 0 0 0 0 0 0 0 0 1
4 5 0 Allen, Mr. William Henry 35.0 0 0 8.0500 S 0 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 58 columns


In [ ]:


In [30]:
# Title
# a map of more aggregated titles
Title_Dictionary = {
                    "Capt":       "Officer",
                    "Col":        "Officer",
                    "Major":      "Officer",
                    "Jonkheer":   "Royalty",
                    "Don":        "Royalty",
                    "Sir" :       "Royalty",
                    "Dr":         "Officer",
                    "Rev":        "Officer",
                    "the Countess":"Royalty",
                    "Dona":       "Royalty",
                    "Mme":        "Mrs",
                    "Mlle":       "Miss",
                    "Ms":         "Mrs",
                    "Mr" :        "Mr",
                    "Mrs" :       "Mrs",
                    "Miss" :      "Miss",
                    "Master" :    "Master",
                    "Lady" :      "Royalty"

                    }
# we extract the title from each name
train_df['Title'] = train_df['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())
# we map each title
train_df['Title'] = train_df.Title.map(Title_Dictionary)
#train_df.head()
# we extract the title from each name
test_df['Title'] = test_df['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())
# we map each title
test_df['Title'] = test_df.Title.map(Title_Dictionary)
#test_df.head()

In [31]:
# encoding in dummy variable
titles_dummies = pd.get_dummies(train_df['Title'],prefix='Title')
train_df = pd.concat([train_df,titles_dummies],axis=1)
titles_dummies = pd.get_dummies(test_df['Title'],prefix='Title')
test_df = pd.concat([test_df,titles_dummies],axis=1)
# removing the title variable
train_df.drop('Title',axis=1,inplace=True)
test_df.drop('Title',axis=1,inplace=True)

In [32]:
# Convert categorical column values to ordinal for model fitting
if False:
    le_title = LabelEncoder()
    # To convert to ordinal:
    train_df.Title = le_title.fit_transform(train_df.Title)
    test_df.Title = le_title.fit_transform(test_df.Title)
    # To convert back to categorical:
    #train_df.Title = le_title.inverse_transform(train_df.Title)
    #train_df.head()
    #test_df.head()

Also unsurprising: the higher the booking class, the higher the chance of survival.




Now let's get to actually training and building a model to make predictions with!




Problems with the raw data, and how they are handled below:

  • a couple of NaNs in 'Embarked': fill with the most common value ('S') and dummy-encode
  • 'Name' strings can't be converted to anything useful directly (the titles were already extracted above), so drop the column
  • a missing 'Fare' value in the test set: fill it with the mean
  • 'Ticket' was already converted to prefix dummies above, so nothing more to do there
  • 'PassengerId' has no predictive value; it is kept for the submission file but excluded from scaling

In [33]:
#train_df.drop(['Embarked'], axis=1,inplace=True)
#test_df.drop(['Embarked'], axis=1,inplace=True)
# only for test_df, since there is a missing "Fare" value
# could use mean or median here.
test_df["Fare"].fillna(test_df["Fare"].mean(), inplace=True)
train_df.drop(['Name'], axis=1,inplace=True)
test_df.drop(['Name'], axis=1,inplace=True)

In [34]:
#train_df.drop(['Ticket'], axis=1,inplace=True)
#test_df.drop(['Ticket'], axis=1,inplace=True)

In [35]:
#train_df.drop(['PassengerId'], axis=1,inplace=True)
#test_df.drop(['PassengerId'], axis=1,inplace=True)

In [36]:
# Only in train_df: fill the two missing values with the most common value, which is "S".
train_df["Embarked"] = train_df["Embarked"].fillna("S")
# Option 1: keep the Embarked column in the predictions, remove the "S" dummy variable,
# and keep "C" & "Q", since they seem to have a good survival rate.

# Option 2: don't create dummy variables for Embarked at all and just drop it,
# since logically Embarked doesn't seem useful for prediction.
# Here all three dummies are kept.

embark_dummies_train  = pd.get_dummies(train_df['Embarked'])
#embark_dummies_train.drop(['S'], axis=1, inplace=True)

embark_dummies_test  = pd.get_dummies(test_df['Embarked'])
#embark_dummies_test.drop(['S'], axis=1, inplace=True)

train_df = train_df.join(embark_dummies_train)
test_df    = test_df.join(embark_dummies_test)

train_df.drop(['Embarked'], axis=1,inplace=True)
test_df.drop(['Embarked'], axis=1,inplace=True)

Next, scale all features (except PassengerId) so they lie in comparable ranges.


In [37]:
## Scale all features except passengerID
features = list(train_df.columns)
features.remove('PassengerId')
train_df[features] = train_df[features].apply(lambda x: x/x.max(), axis=0)

features = list(test_df.columns)
features.remove('PassengerId')
test_df[features] = test_df[features].apply(lambda x: x/x.max(), axis=0)

In [38]:
train_df.head()


Out[38]:
PassengerId Survived Age SibSp Parch Fare Cabin_A Cabin_B Cabin_C Cabin_D ... Ticket_XXX Title_Master Title_Miss Title_Mr Title_Mrs Title_Officer Title_Royalty C Q S
0 1 0.0 0.2750 0.125 0.0 0.014151 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
1 2 1.0 0.4750 0.125 0.0 0.139136 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
2 3 1.0 0.3250 0.000 0.0 0.015469 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 4 1.0 0.4375 0.125 0.0 0.103644 0.0 0.0 1.0 0.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
4 5 0.0 0.4375 0.000 0.0 0.015713 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0

5 rows × 65 columns


In [39]:
test_df.head()


Out[39]:
PassengerId Age SibSp Parch Fare Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E ... Ticket_XXX Title_Master Title_Miss Title_Mr Title_Mrs Title_Officer Title_Royalty C Q S
0 892 0.453947 0.000 0.000000 0.015282 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1 893 0.618421 0.125 0.000000 0.013663 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
2 894 0.815789 0.000 0.000000 0.018909 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
3 895 0.355263 0.000 0.000000 0.016908 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
4 896 0.289474 0.125 0.111111 0.023984 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0

5 rows × 60 columns

Match up the dataframe columns by removing dummy columns that are not present in both the training and test sets.


In [40]:
## Remove extra columns in training DF that are not in test DF
train_cs = list(train_df.columns)
train_cs.remove('Survived')
test_cs = list(test_df.columns)
for c in train_cs:
    if c not in test_cs:
        print repr(c)+' not in test columns, so removing it from training df'
        train_df.drop([c], axis=1,inplace=True)
for c in test_cs:
    if c not in train_cs:
        print repr(c)+' not in training columns, so removing it from test df'
        test_df.drop([c], axis=1,inplace=True)


'Cabin_T' not in test columns, so removing it from training df
'Ticket_AS' not in test columns, so removing it from training df
'Ticket_CASOTON' not in test columns, so removing it from training df
'Ticket_Fa' not in test columns, so removing it from training df
'Ticket_LINE' not in test columns, so removing it from training df
'Ticket_PPP' not in test columns, so removing it from training df
'Ticket_SCOW' not in test columns, so removing it from training df
'Ticket_SOP' not in test columns, so removing it from training df
'Ticket_SP' not in test columns, so removing it from training df
'Ticket_SWPP' not in test columns, so removing it from training df
'Ticket_A' not in training columns, so removing it from test df
'Ticket_AQ3' not in training columns, so removing it from test df
'Ticket_AQ4' not in training columns, so removing it from test df
'Ticket_LP' not in training columns, so removing it from test df
'Ticket_SCA3' not in training columns, so removing it from test df
'Ticket_STONOQ' not in training columns, so removing it from test df

In [41]:
if False:
    print '\nFor train_df:'
    for column in train_df:
        print "# Nans in column '"+column+"' are: "+str(train_df[column].isnull().sum())
        print 'min: ',train_df[column].min()
        print 'max: ',train_df[column].max()

    print '\nFor test_df:'
    for column in test_df:
        print "# Nans in column '"+column+"' are: "+str(test_df[column].isnull().sum())
        print 'min: ',test_df[column].min()
        print 'max: ',test_df[column].max()

In [42]:
# define training and testing sets
X_train = train_df.drop("Survived",axis=1)
Y_train = train_df["Survived"]
X_test  = test_df.copy()

In [43]:
X_train.head()


Out[43]:
PassengerId Age SibSp Parch Fare Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E ... Ticket_XXX Title_Master Title_Miss Title_Mr Title_Mrs Title_Officer Title_Royalty C Q S
0 1 0.2750 0.125 0.0 0.014151 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
1 2 0.4750 0.125 0.0 0.139136 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
2 3 0.3250 0.000 0.0 0.015469 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 4 0.4375 0.125 0.0 0.103644 0.0 0.0 1.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
4 5 0.4375 0.000 0.0 0.015713 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0

5 rows × 54 columns


In [44]:
X_test.head()


Out[44]:
PassengerId Age SibSp Parch Fare Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E ... Ticket_XXX Title_Master Title_Miss Title_Mr Title_Mrs Title_Officer Title_Royalty C Q S
0 892 0.453947 0.000 0.000000 0.015282 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
1 893 0.618421 0.125 0.000000 0.013663 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
2 894 0.815789 0.000 0.000000 0.018909 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
3 895 0.355263 0.000 0.000000 0.016908 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0
4 896 0.289474 0.125 0.111111 0.023984 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0

5 rows × 54 columns

Feature Selection


In [45]:
clf = ExtraTreesClassifier(n_estimators=200)
clf = clf.fit(X_train, Y_train)
features = pd.DataFrame()
features['feature'] = X_train.columns
features['importance'] = clf.feature_importances_
features.sort_values(by='importance', ascending=False)


Out[45]:
feature importance
0 PassengerId 0.122308
1 Age 0.120202
4 Fare 0.104768
19 Male 0.104242
47 Title_Mr 0.078914
18 Female 0.061384
22 Class_3 0.042198
46 Title_Miss 0.034166
48 Title_Mrs 0.028270
12 Cabin_U 0.027821
13 FamilySize 0.022124
20 Class_1 0.021310
16 LargeFamily 0.021214
15 SmallFamily 0.018908
2 SibSp 0.017252
21 Class_2 0.014073
44 Ticket_XXX 0.013138
3 Parch 0.012543
53 S 0.012191
17 Child 0.011023
51 C 0.010887
9 Cabin_E 0.009562
14 Singleton 0.008793
45 Title_Master 0.008343
52 Q 0.007854
6 Cabin_B 0.007619
8 Cabin_D 0.006894
29 Ticket_PC 0.006557
7 Cabin_C 0.006040
40 Ticket_STONO 0.005513
49 Title_Officer 0.004529
26 Ticket_CA 0.004020
24 Ticket_A5 0.003598
42 Ticket_WC 0.002862
5 Cabin_A 0.002462
37 Ticket_SOPP 0.002193
25 Ticket_C 0.002134
39 Ticket_SOTONOQ 0.001974
41 Ticket_STONO2 0.001831
10 Cabin_F 0.001481
11 Cabin_G 0.001265
50 Title_Royalty 0.000846
27 Ticket_FC 0.000809
43 Ticket_WEP 0.000771
30 Ticket_PP 0.000719
34 Ticket_SCPARIS 0.000599
23 Ticket_A4 0.000405
36 Ticket_SOC 0.000359
35 Ticket_SCParis 0.000358
28 Ticket_FCC 0.000335
33 Ticket_SCAH 0.000149
38 Ticket_SOTONO2 0.000074
32 Ticket_SCA4 0.000069
31 Ticket_SC 0.000044

Select top features for use in models


In [46]:
model = SelectFromModel(clf, prefit=True)
X_train_new = model.transform(X_train)
X_train_new.shape

X_test_new = model.transform(X_test)
X_test_new.shape


Out[46]:
(418, 14)
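
As a quick sanity check, here is a small sketch (assuming the model and X_train objects from the cell above; not run here) that lists which columns SelectFromModel actually kept, via its boolean support mask.


In [ ]:
# Sketch only: inspect which feature columns SelectFromModel kept
kept_cols = X_train.columns[model.get_support()]
print 'number of selected features: ', len(kept_cols)
print list(kept_cols)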

In [47]:
# Logistic Regression
logreg = LogisticRegression()

logreg.fit(X_train_new, Y_train)

Y_pred = logreg.predict(X_test_new)

print('standard score ', logreg.score(X_train_new, Y_train))
print('cv score ',np.mean(cross_val_score(logreg, X_train_new, Y_train, cv=10)))


('standard score ', 0.83164983164983164)
('cv score ', 0.82940926115083413)

In [48]:
# Support Vector Machines
svc = SVC()

svc.fit(X_train_new, Y_train)

Y_pred = svc.predict(X_test_new)

#svc.score(X_train, Y_train)
print('standard score ', svc.score(X_train_new, Y_train))
print('cv score ',np.mean(cross_val_score(svc, X_train_new, Y_train, cv=10)))


('standard score ', 0.83726150392817056)
('cv score ', 0.52969072749971624)

In [53]:
# Random Forests
random_forest = RandomForestClassifier(n_estimators=300)
random_forest.fit(X_train_new, Y_train)
Y_pred = random_forest.predict(X_test_new)
print('standard score ', random_forest.score(X_train_new, Y_train))
print('cv score ',np.mean(cross_val_score(random_forest, X_train_new, Y_train, cv=10)))


('standard score ', 1.0)
('cv score ', 0.81927136533878109)

In [55]:
acc = []
mx_v = 0
mx_e = 0
ests = range(10,500,10)
if False:
    for est in ests:
        random_forest = RandomForestClassifier(n_estimators=est)
        random_forest.fit(X_train_new, Y_train)
        Y_pred = random_forest.predict(X_test_new)
        #predictions = model.predict(X_test)
        #accuracy = accuracy_score(y_test, predictions)
        accuracy = np.mean(cross_val_score(random_forest, X_train_new, Y_train, cv=5))* 100.0
        acc.append(accuracy)
        if acc[-1]>mx_v:
            mx_v = acc[-1]
            mx_e = est
    print("maxes were: ",(mx_e,mx_v))
        
    fig = plt.figure(figsize=(7,5))     
    subPlot = fig.add_subplot(111)
    subPlot.plot(ests,acc,linewidth=3)


('maxes were: ', (270, 82.27588112151733))

In [50]:
# From Comment by 'Ewald' at:
# https://www.kaggle.com/c/job-salary-prediction/forums/t/4000/how-to-add-crossvalidation-to-scikit-randomforestregressor
if True:
    num_folds = 10
    num_instances = len(X_train_new)
    seed = 7
    num_trees = 300
    max_features = 'auto'
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features,
    min_samples_leaf=50)
    results= cross_validation.cross_val_score(model, X_train_new, Y_train, cv=kfold, n_jobs=-1)
    print(results.mean())


0.801335830212

In [65]:
# Another form of K-fold and hyperparameter tuning from:
#http://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html
forest = RandomForestClassifier(max_features='sqrt')

parameter_grid = {
                 'max_depth' : [3,4,5,6,7],
                 'n_estimators': [50,100,130,175,200,210,240,250],
                 'criterion': ['gini','entropy']
                 }

# use a distinct name so we don't shadow the sklearn cross_validation module imported above
skf = StratifiedKFold(Y_train, n_folds=5)
import timeit
tic=timeit.default_timer()
grid_search = GridSearchCV(forest,
                           param_grid=parameter_grid,
                           cv=skf)

grid_search.fit(X_train_new, Y_train)

print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))
toc = timeit.default_timer()
print("It took: ",toc-tic)


Best score: 0.829405162738
Best parameters: {'n_estimators': 250, 'criterion': 'gini', 'max_depth': 3}
('It took: ', 188.01947593688965)

In [62]:
# K Nearest Neighbors 
knn = KNeighborsClassifier(n_neighbors = 50)

knn.fit(X_train_new, Y_train)

Y_pred = knn.predict(X_test_new)

#knn.score(X_train_new, Y_train)

print('standard score ', knn.score(X_train_new, Y_train))
print('cv score ',np.mean(cross_val_score(knn, X_train_new, Y_train, cv=10)))


('standard score ', 0.62177328843995505)
('cv score ', 0.59032743161956647)

In [51]:
# Gaussian Naive Bayes
gaussian = GaussianNB()

gaussian.fit(X_train_new, Y_train)

Y_pred = gaussian.predict(X_test_new)

#gaussian.score(X_train, Y_train)
print('standard score ', gaussian.score(X_train_new, Y_train))
print('cv score ',np.mean(cross_val_score(gaussian, X_train_new, Y_train, cv=10)))


('standard score ', 0.81481481481481477)
('cv score ', 0.8137660310974919)

In [54]:
# get Correlation Coefficient for each feature using Logistic Regression
coeff_df = DataFrame(train_df.columns.delete(0))
coeff_df.columns = ['Features']
coeff_df["Coefficient Estimate"] = pd.Series(logreg.coef_[0])

# preview
#coeff_df

Next: try boosting with scikit-learn's GradientBoostingClassifier (imported above but not used yet) instead of XGBoost, to see how it changes the results above. Then try single-layer neural networks, and after that multi-layer networks, to see how they perform.
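
Below is a rough, untested sketch of that first step, assuming the X_train_new and Y_train arrays from the feature-selection cells above; the hyperparameter values are placeholder guesses, not tuned results.


In [ ]:
# Sketch only: boosting with sklearn's GradientBoostingClassifier instead of XGBoost.
# GradientBoostingClassifier is already imported at the top of the notebook;
# the hyperparameters here are placeholders and would need tuning.
gbc = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)
gbc.fit(X_train_new, Y_train)
print('standard score ', gbc.score(X_train_new, Y_train))
print('cv score ', np.mean(cross_val_score(gbc, X_train_new, Y_train, cv=10)))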




XGBoost stuff





In [41]:
if False:
    submission = pd.DataFrame({
            "PassengerId": test_df["PassengerId"],
            "Survived": Y_pred
        })
    submission.to_csv('submission.csv', index=False)

In [42]:
if False:
    ### Using XGboost
    #X_train = train_df.drop("Survived",axis=1)
    #train_X = train_df.drop("Survived",axis=1).as_matrix()
    X_train_new
    #X_train_new, Y_train, X_test_new
    #train_y = train_df["Survived"]
    Y_train
    #test_X = test_df.drop("PassengerId",axis=1).copy().as_matrix()
    X_test_new
    model = xgb.XGBClassifier(max_depth=10, n_estimators=300, learning_rate=0.05)
    model.fit(X_train_new, Y_train)
    predictions = model.predict(X_test_new)
    # plot feature importance
    plot_importance(model)
    plt.show()
    #X_train, X_test, y_train, y_test = train_test_split(X_train_new, Y_train, test_size=0.33)
    #accuracy = accuracy_score(y_test, predictions)
    #print("Accuracy: %.2f%%" % (accuracy * 100.0))

In [ ]:


In [43]:
# basic try at iterative training with XGboost
#train_X = train_df.drop("Survived",axis=1).as_matrix()
#train_y = train_df["Survived"]
#test_X = test_df.drop("PassengerId",axis=1).copy().as_matrix()
# fit model on all training data
acc = []
mx_v = 0
mx_e = 0
ests = range(10,500,10)
if False:
    for est in ests:
        #print est
        model = xgb.XGBClassifier(max_depth=5, n_estimators=est, learning_rate=0.05)
        X_train, X_test, y_train, y_test = train_test_split(X_train_new, Y_train, test_size=0.33)#, random_state=7)
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        accuracy *= 100.0
        acc.append(accuracy)
        #print("Accuracy: %.2f%%" % (accuracy))
        if acc[-1]>mx_v:
            mx_v = acc[-1]
            mx_e = est
    print("maxes were: ",(mx_e,mx_v))

    fig = plt.figure(figsize=(7,5))     
    subPlot = fig.add_subplot(111)
    subPlot.plot(ests,acc,linewidth=3)


model = xgb.XGBClassifier(max_depth=5, n_estimators=300, learning_rate=0.05)
for i in range(10):
    print "Iteration: "+str(i)
    # split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X_train_new, Y_train, test_size=0.33)#, random_state=7)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Accuracy: %.2f%%" % (accuracy * 100.0))

print "After rounds of training.  Results on original training data:"
predictions = model.predict(train_X)
accuracy = accuracy_score(train_y, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))


Iteration: 0
Accuracy: 80.34%
Iteration: 1
Accuracy: 73.90%
Iteration: 2
Accuracy: 77.63%
Iteration: 3
Accuracy: 80.00%
Iteration: 4
Accuracy: 81.36%
Iteration: 5
Accuracy: 78.64%
Iteration: 6
Accuracy: 80.00%
Iteration: 7
Accuracy: 77.63%
Iteration: 8
Accuracy: 80.68%
Iteration: 9
Accuracy: 81.36%

In [41]:
# use feature importance for feature selection
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
import timeit
tic=timeit.default_timer()
# load data
#dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = X_train_new
Y = Y_train
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = XGBClassifier(max_depth=10, nthread=100, n_estimators=300, learning_rate=0.05)
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
toc = timeit.default_timer()
print("It took: ",toc-tic)

# Fit model using each importance as a threshold
if False:
    thresholds = sort(model.feature_importances_)
    for thresh in thresholds:
        # select features using threshold
        selection = SelectFromModel(model, threshold=thresh, prefit=True)
        select_X_train = selection.transform(X_train)
        # train model
        selection_model = XGBClassifier(max_depth=10, nthread=100, n_estimators=300, learning_rate=0.05)
        selection_model.fit(select_X_train, y_train)
        # eval model
        select_X_test = selection.transform(X_test)
        y_pred = selection_model.predict(select_X_test)
        predictions = [round(value) for value in y_pred]
        accuracy = accuracy_score(y_test, predictions)
        print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))


Accuracy: 76.61%
('It took: ', 5.418628931045532)

In [42]:
cv_params = {'max_depth': [3], 'min_child_weight': [1]}
ind_params = {'learning_rate': 0.05, 'n_estimators': 100, 'nthread':100, 'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8, 
             'objective': 'binary:logistic'}
optimized_GBM = GridSearchCV(xgb.XGBClassifier(**ind_params), cv_params, 
                             scoring = 'accuracy', cv = 2, n_jobs = -1) 

import timeit
tic=timeit.default_timer()
#X_train_new, Y_train, X_test_new

#optimized_GBM.fit(X_train_new, Y_train)

#optimized_GBM.grid_scores_
toc = timeit.default_timer()
print("It took: ",toc-tic)


('It took: ', 0.002824068069458008)

In [ ]: