This notebook uses the dataset from Kaggle's Titanic Competition (https://www.kaggle.com/c/titanic) to train decision-tree and random-forest classifiers and produce predictions for the test dataset.
author: drublackberry (github)
In [1]:
train_size = 80 # % of the training set used for training
N_MonteCarlo = 50 # number of runs for the monte-carlo analysis
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pydot
from IPython.display import Image, display
# Load the CSV files
myRawTrainDf = pd.read_csv('train.csv', index_col=0)
myRawTestDf = pd.read_csv('test.csv', index_col=0)
# Add Survived column before merging
myRawTestDf['Survived'] = np.nan
myRawTestDf = myRawTestDf[myRawTrainDf.columns]
# Merge (DataFrame.append was removed in pandas 2.0)
myRawDf = pd.concat([myRawTrainDf, myRawTestDf])
In [3]:
# Inspect the names to see if something can be done
myRawDf['Name'].head(10)
Out[3]:
Let's create two extra columns with the title and the surname to be used as features.
In [4]:
import re
def getTitle(aName):
    '''Finds the title in the name'''
    myPosStart = aName.find(',')
    myPosEnd = aName.find('.')
    return re.sub('[^A-Za-z0-9]+', '', aName[myPosStart:myPosEnd])

def getSurname(aName):
    '''Finds the surname in the name'''
    myPos = aName.find(',')
    return re.sub('[^A-Za-z0-9]+', '', aName[:myPos])
myInDf = myRawDf.copy()
myInDf['Title'] = [getTitle(x) for x in myInDf['Name']]
myInDf['Surname'] = [getSurname(x) for x in myInDf['Name']]
# Get a sample
pd.concat([myInDf.head(3), myInDf.tail(3)])
Out[4]:
In order to be able to plot and perform regressions (if needed) one can assign a number to each string for each feature.
In [5]:
def assignNumericalType(aSeries):
    '''Assigns a numerical code to each distinct string value'''
    val = aSeries.unique()
    myDict = {val[x]: x for x in range(len(val))}
    myDict[np.nan] = np.nan  # Ensure nan stays nan
    aOut = [myDict[x] for x in aSeries]
    return aOut

# Convert strings to numerical type
for myCol in myInDf.columns:
    if type(myInDf[myCol].dropna().iloc[0]) == str:
        myInDf[myCol] = assignNumericalType(myInDf[myCol])
# Get a sample
pd.concat([myInDf.head(3), myInDf.tail(3)])
Out[5]:
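As a side note, pandas ships a built-in equivalent of `assignNumericalType`: `pd.factorize` maps each distinct value to an integer code in order of appearance and marks missing values with a `-1` sentinel. A minimal sketch on a toy Series (not the notebook's data):

```python
import numpy as np
import pandas as pd

# Toy example: factorize assigns codes in order of appearance; NaN becomes -1
s = pd.Series(['Mr', 'Mrs', 'Mr', np.nan, 'Miss'])
codes, uniques = pd.factorize(s)
# Replace the -1 sentinel with NaN to mirror assignNumericalType's behaviour
encoded = pd.Series(codes, index=s.index).replace(-1, np.nan)
```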
In [6]:
# Exploratory data analysis
for myFeature in myInDf.columns:
    if myFeature != 'Survived' and (len(myInDf[myFeature]) > len(myInDf[myFeature].unique())):
        myInDf.pivot(columns='Survived', values=myFeature).plot(kind='hist', stacked=True, bins=20)
        plt.title(myFeature)
        plt.show()
In [7]:
# Do a correlation plot
cax = plt.matshow(myInDf.corr().abs())
plt.colorbar(cax)
plt.xticks(range(len(myInDf.columns)), myInDf.columns, rotation='vertical')
plt.yticks(range(len(myInDf.columns)), myInDf.columns, rotation='horizontal')
plt.show()
In [8]:
myMind = pd.MultiIndex.from_product([['Training', 'Test',],['Missing', 'Total']])
myMissingDf = pd.DataFrame(columns=myMind, index=myInDf.columns)
myMissingDf['Test', 'Missing'] = myInDf[myInDf['Survived'].isnull()].isnull().sum()
myMissingDf['Test', 'Total'] = myInDf[myInDf['Survived'].isnull()].isnull().count()
myMissingDf['Training', 'Missing'] = myInDf[myInDf['Survived'].notnull()].isnull().sum()
myMissingDf['Training', 'Total'] = myInDf[myInDf['Survived'].notnull()].isnull().count()
myMissingDf
Out[8]:
Given the results, the Age feature is especially interesting: it has a fair number of missing values and a high correlation with survival.
In [9]:
myInDf.corr()[[x for x in myInDf.columns if x != 'Age']].loc['Age'].plot(kind='bar')
plt.title('Correlation coefficient of Age with other features')
plt.show()
In [10]:
# Recall the distribution of the age
myInDf.pivot(columns='Survived', values='Age').plot(kind='hist', stacked=True, bins=20)
plt.title('Age before interpolation of nan')
plt.show()
In [11]:
# Predict the age with a linear regression
from sklearn import linear_model
from sklearn.preprocessing import normalize
def predictValueByLinearRegression(aInDf, aFeatureToUse, aFeatureToPredict):
    '''Fits a linear regression on the rows where aFeatureToPredict is known
    and predicts it for the rows where it is missing'''
    myFeatures = aFeatureToUse + [aFeatureToPredict]  # avoid mutating the caller's list
    myDf = aInDf[myFeatures]
    # Train
    myX = myDf.dropna()[[x for x in myDf.columns if x != aFeatureToPredict]]
    myY = myDf.dropna()[aFeatureToPredict]
    myLR = linear_model.LinearRegression()
    myLR.fit(myX, myY)
    # Predict
    myX = myDf[myDf[aFeatureToPredict].isnull()][[x for x in myDf.columns if x != aFeatureToPredict]]
    return myLR.predict(myX)
In [12]:
# Assign
myInDf.loc[myInDf.isnull()['Age'], 'Age'] = predictValueByLinearRegression(myInDf, ['Sex', 'SibSp', 'Parch', 'Fare', 'Title'], 'Age')
In [13]:
# Check the histogram again to see the distribution
myInDf.pivot(columns='Survived', values='Age').plot(kind='hist', stacked=True, bins=20)
plt.title('Age after linear regression')
plt.show()
In [14]:
myRawDf[myRawDf['Embarked'].isnull()]
Out[14]:
Both of them were in cabin B28. At which port did the passengers in nearby cabins board? And is it reasonable to assume that the tickets were sold in order?
In [15]:
# Get passengers with similar tickets
myRawDf[myRawDf['Ticket'].map(lambda x: '1135' in x or '1136' in x)]
Out[15]:
One can see that a number of passengers with a similar fare price, a cabin in the same section, and a ticket number close to the missing ones embarked at 'C'. Let's assume that is their port of origin.
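The same reasoning can be sketched as a lookup: among first-class passengers whose fare is close to the 80.0 the two incomplete rows paid, take the most common embarkation port. The rows below are illustrative toy data, not the real passenger records:

```python
import pandas as pd

# Toy data mimicking the pattern observed above (not the actual dataset)
df = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 2],
    'Fare':     [79.2, 80.0, 81.0, 30.0, 80.0],
    'Embarked': ['C', 'C', 'C', 'S', 'S'],
})
# First-class passengers with a fare near 80.0
similar = df[(df['Pclass'] == 1) & (df['Fare'].between(75, 85))]
# Most common port among them
most_common = similar['Embarked'].mode().iloc[0]
```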
In [16]:
# Assign a numerical value
myInDf.loc[myInDf['Embarked'].isnull(), 'Embarked'] = myInDf.loc[55]['Embarked']
# Check
myInDf.loc[[62,830]]
Out[16]:
In [17]:
myRawDf[myRawDf['Fare'].isnull()]
Out[17]:
Let's look at how the fare correlates with the passenger class, sex, age and embarkation port.
In [18]:
myInDf.corr()[[x for x in myInDf.columns if x != 'Fare']].loc['Fare'].plot(kind='bar')
plt.title('Correlation coefficient of Fare with other features')
plt.show()
In [19]:
# Run a linear regression with the features that are most correlated
myInDf.loc[myInDf['Fare'].isnull(), 'Fare'] = predictValueByLinearRegression(myInDf, ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Title'], 'Fare')
In [20]:
# Check the final
myInDf.loc[1044]
Out[20]:
In [21]:
myInDf.isnull().sum()
Out[21]:
The 'Cabin' feature is largely missing and difficult to infer from other features, so for the moment let's just discard it.
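The notebook drops Cabin implicitly further down, by restricting the model to columns with no missing values. An explicit drop would look like this sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy frame with a mostly-missing column, as 'Cabin' is in the real dataset
df = pd.DataFrame({'Cabin': [np.nan, 'B28', np.nan, np.nan],
                   'Fare':  [7.25, 80.0, 8.05, 13.0]})
# Discard the column before modelling
df = df.drop(columns=['Cabin'])
```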
In [22]:
from sklearn import tree
from sklearn.metrics import precision_score, recall_score, f1_score
def splitDataset (aDf, aFrac):
'''Splits a Df in a training and a validation dataset randomly'''
aTrainDf = aDf.sample(frac=aFrac/100.)
myValInd = [ind for ind in aDf.index if ind not in aTrainDf.index]
aValDf = aDf.loc[myValInd]
# Create X and Y datasets
aXtrain = aTrainDf[[x for x in aTrainDf.columns if x!='Survived']]
aYtrain = aTrainDf['Survived']
aXval = aValDf[[x for x in aTrainDf.columns if x!='Survived']]
aYval = aValDf['Survived']
return aXtrain, aYtrain, aXval, aYval
def assessPerformance (aX, aY, aClf):
'''Prints the performance of a certain machine learning algorithm'''
myYpred = aClf.predict(aX)
aPrecision = precision_score(aY, myYpred)
aRecall = recall_score(aY, myYpred)
aF1score = f1_score(aY, myYpred)
return aPrecision, aRecall, aF1score
def trainPredictAndAnalyzeDecisionTree (aDf, aDepth=None, draw=False):
# Build a decision tree classifier
myXtrain, myYtrain, myXval, myYval = splitDataset (aDf, train_size)
myClf = tree.DecisionTreeClassifier(max_depth=aDepth)
myClf = myClf.fit(myXtrain, myYtrain)
aTrainPrecision, aTrainRecall, aTrainF1 = assessPerformance(myXtrain, myYtrain, myClf)
aValPrecision, aValRecall, aValF1 = assessPerformance(myXval, myYval, myClf)
if draw:
# Draw the decision tree
myDotData = tree.export_graphviz(myClf, feature_names=myXtrain.columns, out_file='tree.dot' )
(myGraph,) = pydot.graph_from_dot_file('tree.dot')
myPlt = Image(myGraph.create_png())
myGraph.write_png('tree.png')
display(myPlt)
return aTrainPrecision, aTrainRecall, aTrainF1, aValPrecision, aValRecall, aValF1
def runMonteCarlo (aDf, aF, *args):
myPerfDf = pd.DataFrame(columns=['Train Precision', 'Train Recall', 'Train F1', 'Val Precision', 'Val Recall', 'Val F1'])
for i in range(N_MonteCarlo):
myPerfDf.loc[i] = aF(aDf, *args)
return myPerfDf
In [23]:
# Calling and drawing a decision tree with max_depth
# Note that this call removes all the nan (large part of the dataset) and just displays a decision tree for illustration
foo = trainPredictAndAnalyzeDecisionTree(myInDf.dropna(), aDepth=3, draw=True)
In [24]:
# Do not use features which have any value missing
myCompleteFeatures = [x for x in myInDf.columns if myInDf.isnull().sum()[x] ==0]
myCompleteFeatures.append('Survived')
myStats = runMonteCarlo(myInDf[myCompleteFeatures].dropna(), trainPredictAndAnalyzeDecisionTree).describe() # capture for later
myStats
Out[24]:
In [25]:
from sklearn.ensemble import RandomForestClassifier
def trainPredictAndAnalyzeRandomForest(aDf, *args):
    '''Builds a random forest classifier and reports training/validation performance'''
    if len(args) > 0:
        myNfeat = args[0]
    else:
        myNfeat = 'sqrt'  # sklearn's default; the former 'auto' alias was removed
    # Build a random forest classifier
    myXtrain, myYtrain, myXval, myYval = splitDataset(aDf, train_size)
    myClf = RandomForestClassifier(max_features=myNfeat)
    myClf = myClf.fit(myXtrain, myYtrain)
    aTrainPrecision, aTrainRecall, aTrainF1 = assessPerformance(myXtrain, myYtrain, myClf)
    aValPrecision, aValRecall, aValF1 = assessPerformance(myXval, myYval, myClf)
    return aTrainPrecision, aTrainRecall, aTrainF1, aValPrecision, aValRecall, aValF1
runMonteCarlo(myInDf[myCompleteFeatures].dropna(), trainPredictAndAnalyzeRandomForest).describe()
Out[25]:
Looking at these results, it seems there is still room for improvement on the validation sets. Let's vary the number of features used in the random forest to see if we can boost performance.
In [26]:
# Create a MultiIndex dataframe to store the stats of all the different runs
myNfeat = np.arange(2,11)
myIndex = pd.MultiIndex.from_product([myNfeat, myStats.columns])
myDf = pd.DataFrame(index=myIndex, columns=myStats.index)
for myN in myNfeat:
    myDf.loc[myN, :].iloc[:] = runMonteCarlo(myInDf[myCompleteFeatures].dropna(), trainPredictAndAnalyzeRandomForest, myN).describe().iloc[:].transpose()
In [27]:
for i in ['Val Precision', 'Val Recall', 'Val F1']:
    myDf.reset_index().set_index('level_1').loc[i][['level_0', 'min', 'max', 'mean']].set_index('level_0').plot()
    plt.title(i)
    plt.xlabel('max_features')
    plt.show()
It seems that max_features does not have a large effect and the data is not overfit. Let's keep sklearn's default value (sqrt(num_features)).
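For reference, with `max_features='sqrt'` each split of each tree considers roughly floor(sqrt(n_features)) candidate features. A tiny helper to see the numbers involved; the feature counts passed in are illustrative assumptions, not taken from the notebook:

```python
import math

def features_per_split(n_features):
    # Approximation of sklearn's 'sqrt' rule: at least one feature per split
    return max(1, math.floor(math.sqrt(n_features)))
```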
In [28]:
# Train the classifier
myUsedFeatures = [x for x in myCompleteFeatures if x != 'Survived']
myXtrain = myInDf[myCompleteFeatures].dropna()[myUsedFeatures]
myYtrain = myInDf['Survived'].dropna()
myClf = RandomForestClassifier()
myClf = myClf.fit(myXtrain, myYtrain)
aTrainPrecision, aTrainRecall, aTrainF1 = assessPerformance(myXtrain, myYtrain, myClf)
print('Precision on full training set ' + str(aTrainPrecision))
print('Recall on full training set ' + str(aTrainRecall))
print('F1 on full training set ' + str(aTrainF1))
In [36]:
# Run the prediction on the test set
myXtest = myInDf.loc[myInDf['Survived'].isnull(), myUsedFeatures]
myOut = pd.Series(myClf.predict(myXtest), index=myXtest.index)
myOut = myOut.astype(int)  # np.int was removed in NumPy 1.24
myOut.to_csv('solution.csv', header=['Survived'])
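As a sanity check before submitting, the written file can be round-tripped to confirm Kaggle's expected layout (a PassengerId index plus a single integer Survived column). A sketch with made-up passenger ids, writing to an in-memory buffer instead of solution.csv:

```python
import io
import pandas as pd

# Build a small Series shaped like the notebook's myOut (ids are illustrative)
out = pd.Series([0, 1, 0],
                index=pd.Index([892, 893, 894], name='PassengerId'),
                name='Survived')
# Write and read back the same way the submission file is produced
buf = io.StringIO()
out.to_csv(buf, header=['Survived'])
buf.seek(0)
check = pd.read_csv(buf, index_col=0)
```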
In [37]:
myOut.head(5)
Out[37]:
In [ ]: