We import the original train.csv and test.csv files and use PassengerID as the index column.
The clean_data function then performs the following:
The clean_data function: drops the Name, Ticket and Cabin columns, which we are currently not using; adjusts the Fare column to indicate difference from the median fare paid by class; cleans the Age column; and cleans the SibSp and Parch columns. The cleaned data is saved to cl_train.csv and cl_test.csv.
In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
import pandas as pd

# Load the cleaned training data, indexed by passenger id.
train = pd.read_csv('cl_train.csv', index_col='PassengerId')

# One-hot encode the categorical columns.
train = pd.get_dummies(train, columns=['Sex', 'Pclass', 'Embarked'])

# Hold out a cross-validation split.
y = train['Survived']
X = train.drop('Survived', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=53)

# Standardize features; the scaler is fit on the training split only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Degree-3 polynomial features feeding an L2-regularized logistic model.
pipeline = Pipeline([
    ('polynomial_features', PolynomialFeatures(degree=3, include_bias=True)),
    ('logistic_regression', LogisticRegression(C=0.005)),
])

# Fit, then report accuracy on the train and hold-out splits.
pipeline.fit(X_train, y_train)
print('Logistic Regression Train Score: %s' % pipeline.score(X_train, y_train))
print('Logistic Regression CV Score: %s' % pipeline.score(X_test, y_test))
In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load the cleaned training data, indexed by passenger id.
train = pd.read_csv('cl_train.csv', index_col='PassengerId')

# Impute missing 'Embarked' values with 'S' (most common).
# Plain assignment instead of fillna(..., inplace=True): inplace on a column
# selection is the chained-assignment pattern deprecated in pandas 2.x.
train['Embarked'] = train['Embarked'].fillna('S')

# Encode categorical variables as integer codes.
le = LabelEncoder()
train['Sex'] = le.fit_transform(train['Sex'])
train['Embarked'] = le.fit_transform(train['Embarked'])

# Create a cross-validation split.
X = train.drop('Survived', axis=1)
y = train['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=134)

# Random forest: 300 trees, depth capped at 6.
clf = RandomForestClassifier(n_estimators=300, max_depth=6)

# Fit, then report accuracy and per-feature importances.
clf.fit(X_train, y_train)
print('Random Forest Train Score: %s' % clf.score(X_train, y_train))
print('Random Forest CV Score: %s' % clf.score(X_test, y_test))
print('Feature Importance:\n%s' % pd.Series(clf.feature_importances_,
                                            index=X_train.columns))
In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import pandas as pd

# Load the cleaned training data, indexed by passenger id.
train = pd.read_csv('cl_train.csv', index_col='PassengerId')

# One-hot encode the categorical columns.
train = pd.get_dummies(train, columns=['Sex', 'Pclass', 'Embarked'])

# Hold out a cross-validation split.
y = train['Survived']
X = train.drop('Survived', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=116)

# Standardize features; the scaler is fit on the training split only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# RBF support vector machine with C=5.
clf = SVC(C=5, gamma='auto')

# Fit, then report accuracy on the train and hold-out splits.
clf.fit(X_train, y_train)
print('SVC Train Score: %s' % clf.score(X_train, y_train))
print('SVC CV Score: %s' % clf.score(X_test, y_test))
The model is trained on cl_train.csv and predictions are generated for cl_test.csv. Hyperparameter format: degree / C.
In [254]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np

# Load the cleaned training and test data, indexed by passenger id.
train = pd.read_csv('cl_train.csv', index_col='PassengerId')
test = pd.read_csv('cl_test.csv', index_col='PassengerId')

# Create training set X and y.
X_train = train.drop('Survived', axis=1)
y_train = train['Survived']

# Combine X train and test so one-hot encoding yields identical columns
# for both (a category absent from one set would otherwise misalign them).
tr_len = len(X_train)
df = pd.concat(objs=[X_train, test], axis=0)

# Create dummy variables on the combined train/test frame.
df = pd.get_dummies(df, columns=['Sex', 'Pclass', 'Embarked'])

# Split back into X train and test by row position (explicit .iloc).
X_train = df.iloc[:tr_len]
test = df.iloc[tr_len:]

# Feature scaling: fit on the training rows only, then apply to test.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(test)

# L2 logistic polynomial regression, degree 3, C = 0.005.
# (Previous comment claimed C = 1, contradicting the code below.)
polynomial_features = PolynomialFeatures(degree=3, include_bias=True)
logistic_regression = LogisticRegression(C=0.005)
pipeline = Pipeline([('polynomial_features', polynomial_features),
                     ('logistic_regression', logistic_regression)])

# Fit and predict.
pipeline.fit(X_train, y_train)
prediction = pipeline.predict(X_test)

# Save survival predictions to a CSV file.
predicted = np.column_stack((test.index.values, prediction))
np.savetxt("pr_logistic.csv", predicted.astype(int), fmt='%d', delimiter=",",
           header="PassengerId,Survived", comments='')
The model is trained on cl_train.csv and predictions are generated for cl_test.csv. Hyperparameter format: n_estimators / max_depth.
In [234]:
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# Load the cleaned training and test data, indexed by passenger id.
train = pd.read_csv('cl_train.csv', index_col='PassengerId')
test = pd.read_csv('cl_test.csv', index_col='PassengerId')

# Create training set X and y.
X_train = train.drop('Survived', axis=1)
y_train = train['Survived']

# Combine X train and test so the label encoders see every category.
tr_len = len(X_train)
df = pd.concat(objs=[X_train, test], axis=0)

# Impute missing 'Embarked' values with 'S' (most common).
# Plain assignment instead of fillna(..., inplace=True): inplace on a column
# selection is the chained-assignment pattern deprecated in pandas 2.x.
df['Embarked'] = df['Embarked'].fillna('S')

# Encode categorical variables as integer codes.
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])
df['Embarked'] = le.fit_transform(df['Embarked'])

# Split back into X train and test by row position (explicit .iloc).
X_train = df.iloc[:tr_len]
test = df.iloc[tr_len:]

# Random forest: 300 trees, depth capped at 6.
# (Previous comment claimed 200 estimators / max depth 10, contradicting the code.)
clf = RandomForestClassifier(n_estimators=300, max_depth=6)

# Fit and predict.
clf.fit(X_train, y_train)
prediction = clf.predict(test)

# Save survival predictions to a CSV file.
predicted = np.column_stack((test.index.values, prediction))
np.savetxt("pr_forest.csv", predicted.astype(int), fmt='%d', delimiter=",",
           header="PassengerId,Survived", comments='')
The model is trained on cl_train.csv and predictions are generated for cl_test.csv. Hyperparameter format: gamma / C.
In [29]:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import pandas as pd
import numpy as np

# Load the cleaned training and test data, indexed by passenger id.
train = pd.read_csv('cl_train.csv', index_col='PassengerId')
test = pd.read_csv('cl_test.csv', index_col='PassengerId')

# Separate features from the target.
y_train = train['Survived']
X_train = train.drop('Survived', axis=1)

# Stack train and test rows so one-hot encoding produces matching columns.
tr_len = len(X_train)
df = pd.concat(objs=[X_train, test], axis=0)
df = pd.get_dummies(df, columns=['Sex', 'Pclass', 'Embarked'])

# Undo the stacking: first tr_len rows are train, the remainder is test.
X_train = df[:tr_len]
test = df[tr_len:]

# Standardize features; the scaler is fit on the training rows only.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(test)

# RBF support vector machine with C=3.
clf = SVC(C=3, gamma='auto')
clf.fit(X_train, y_train)
prediction = clf.predict(X_test)

# Write PassengerId,Survived rows out as the submission CSV.
predicted = np.column_stack((test.index.values, prediction))
np.savetxt("pr_SVM.csv", predicted.astype(int), fmt='%d', delimiter=",",
           header="PassengerId,Survived", comments='')