For this lecture we will be working with the Titanic data set from Kaggle. It is a very famous data set and is often a student's first step in machine learning!
We'll be trying to predict a binary class: survived or deceased. Let's begin implementing logistic regression in Python for classification.
We'll use a "semi-cleaned" version of the Titanic data set. If you use the data set hosted directly on Kaggle, you may need to do some additional cleaning not shown in this lecture notebook.
Let's import some libraries to get started!
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
train = pd.read_csv('titanic_train.csv')
In [3]:
train.head(25)
Out[3]:
In [4]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[4]:
Roughly 20 percent of the Age data is missing. That proportion is likely small enough for reasonable replacement with some form of imputation. The Cabin column, however, is missing too much data to be useful at a basic level. We'll probably drop it later, or change it to another feature like "Cabin Known: 1 or 0".
Let's continue by visualizing some more of the data! Check out the video for full explanations of these plots; this code just serves as a reference.
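To put numbers on what the heatmap shows, one option is to compute the fraction of missing values per column. The sketch below uses a small made-up frame (`toy`) rather than `titanic_train.csv`, and the `CabinKnown` column name is only an illustration of the "Cabin Known: 1 or 0" idea, not something the lecture defines.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (values are made up for illustration).
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# isnull() gives booleans; their column-wise mean is the fraction missing.
missing_fraction = toy.isnull().mean()
print(missing_fraction)

# Hypothetical "Cabin Known" indicator: 1 if a cabin is recorded, else 0.
toy["CabinKnown"] = toy["Cabin"].notnull().astype(int)
print(toy["CabinKnown"].tolist())
```

On the real data the same `train.isnull().mean()` call should report the share of missing values in every column at once.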
In [5]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',data=train,palette='RdBu_r')
Out[5]:
In [9]:
# sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train,palette='RdBu_r')
Out[9]:
In [10]:
# sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data=train,palette='rainbow')
Out[10]:
In [11]:
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(train['Age'].dropna(),kde=False,color='darkred',bins=30)
Out[11]:
In [12]:
train['Age'].hist(bins=30,color='darkred',alpha=0.7)
Out[12]:
In [13]:
sns.countplot(x='SibSp',data=train)
Out[13]:
In [14]:
train['Fare'].hist(color='green',bins=40,figsize=(8,4))
Out[14]:
In [15]:
# Plotly Express now ships inside the plotly package; the alias is kept as pex
import plotly.express as pex
In [17]:
pex.histogram(data_frame=train, x='Fare', nbins=30)
In [6]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')
Out[6]:
We can see the wealthier passengers in the higher classes tend to be older, which makes sense. We'll use the average age of each class to impute missing Age values based on Pclass.
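The per-class averages can be read off a grouped summary instead of the plot. Here is a minimal sketch on a made-up frame; the toy ages are chosen so the means come out round, and the real values depend on the actual data.

```python
import numpy as np
import pandas as pd

# Toy data: a few passengers per class, one with a missing age.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3],
    "Age":    [38.0, 36.0, 30.0, 28.0, 22.0, 26.0, np.nan],
})

# mean() skips NaN by default, so the missing age does not distort the result.
age_by_class = toy.groupby("Pclass")["Age"].mean()
print(age_by_class)
```

Swapping `.mean()` for `.median()` gives the value drawn as the center line of each box in the boxplot.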
In [7]:
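def impute_age(cols):
    # cols is one row holding the Age and Pclass values
    Age = cols['Age']
    Pclass = cols['Pclass']
    if pd.isnull(Age):
        # Fill a missing age with the typical age for the passenger's class
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age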
Now apply that function!
In [8]:
train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)
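The row-wise `apply` above is easy to read, but the same fill can also be written without a Python-level loop. The following is a sketch of one alternative, using `map` and `fillna` on a made-up frame; the `class_age` dictionary simply repeats the values hard-coded in `impute_age`.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing ages (values are illustrative only).
toy = pd.DataFrame({
    "Age":    [np.nan, 40.0, np.nan, np.nan],
    "Pclass": [1, 2, 2, 3],
})

# Same per-class fill values as impute_age.
class_age = {1: 37, 2: 29, 3: 24}

# map() builds a Series of fill values aligned with the rows;
# fillna() only uses it where Age is missing.
toy["Age"] = toy["Age"].fillna(toy["Pclass"].map(class_age))
print(toy)
```

Known ages are left untouched, so the second row keeps its value of 40.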
Now let's check that heat map again!
In [9]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[9]:
Great! Let's go ahead and drop the Cabin column and the row in Embarked that is NaN.
In [10]:
train.drop('Cabin',axis=1,inplace=True)
In [11]:
train.head(50)
Out[11]:
In [12]:
train.shape
Out[12]:
In [13]:
train.dropna(inplace=True)
In [14]:
train.shape
Out[14]:
In [15]:
train.info()
In [16]:
sex = pd.get_dummies(train['Sex'],drop_first=True)
embark = pd.get_dummies(train['Embarked'],drop_first=True)
In [17]:
embark.head()
Out[17]:
In [18]:
sex.head()
Out[18]:
In [19]:
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
In [20]:
train = pd.concat([train,sex,embark],axis=1)
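For readers unsure what `drop_first=True` does in the cells above, here is a small sketch on made-up data. With k categories, `get_dummies` creates k indicator columns, and `drop_first=True` drops the first (alphabetically), since it can be inferred from the others.

```python
import pandas as pd

# Toy categorical columns (values are illustrative only).
toy = pd.DataFrame({
    "Sex":      ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# "female" is dropped, leaving a single "male" indicator column.
sex = pd.get_dummies(toy["Sex"], drop_first=True)

# "C" is dropped, leaving "Q" and "S"; a row with both unset means "C".
embark = pd.get_dummies(toy["Embarked"], drop_first=True)

print(pd.concat([toy, sex, embark], axis=1))
```

Dropping one level avoids feeding the model perfectly collinear columns.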
In [21]:
train.head()
Out[21]:
In [22]:
from sklearn.model_selection import train_test_split
In [23]:
X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived',axis=1),
                                                    train['Survived'], test_size=0.30,
                                                    random_state=101)
In [24]:
from sklearn.linear_model import LogisticRegression
In [29]:
logmodel = LogisticRegression(verbose=1)
In [30]:
logmodel.fit(X_train,y_train)
Out[30]:
In [31]:
logmodel.coef_
Out[31]:
In [32]:
logmodel.intercept_
Out[32]:
In [52]:
predictions = logmodel.predict(X_test)
Let's move on to evaluate our model!
We can check precision, recall, and f1-score using a classification report!
In [42]:
from sklearn.metrics import classification_report
In [43]:
print(classification_report(y_test,predictions))
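To see where the numbers in the report come from, the sketch below rebuilds precision, recall, and f1 for the positive class from a confusion matrix. The labels are made up for illustration and are not the model's actual predictions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up true labels and predictions (1 = survived, 0 = deceased).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted survivors, how many survived
recall = tp / (tp + fn)     # of the actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

Both printed lines should agree, since the sklearn scoring functions apply the same formulas.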
Not so bad! You might want to explore other feature engineering, as well as the other file, titanic_test.csv:
In [38]:
test_df = pd.read_csv('titanic_test.csv')
test_df.head()
Out[38]:
In [39]:
test_df.shape
Out[39]:
In [48]:
test_df.iloc[0]
Out[48]: