In [2]:
import pandas as pd
import numpy as np
In [9]:
data_path = "C:/Users/Rishu/Desktop/dATA/titanic/"
titanic_train = pd.read_csv(data_path+"train.csv")
titanic_train.head()
Out[9]:
In [8]:
titanic_train.info()
In [5]:
titanic_train.isnull().sum()
Out[5]:
In [6]:
titanic_train.describe()
Out[6]:
In [20]:
# Mean of the known ages; Series.mean() skips NaN by default
mean_age = titanic_train['Age'].mean()
# Replace the missing ages with that mean
titanic_train['Age'] = titanic_train['Age'].fillna(mean_age)
Compute the mean age of all passengers, excluding the NaN values, then fill the missing ages with that mean.
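A minimal sketch of the same mean imputation on a toy frame, so the effect is visible without loading the CSV. The `toy` DataFrame, the name `toy_mean_age`, and the ages are made-up illustration values, not Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from train.csv): two ages are missing
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 30.0]})

# Series.mean() skips NaN by default, so only the three known ages count
toy_mean_age = toy["Age"].mean()

# Fill the gaps with that mean
toy["Age"] = toy["Age"].fillna(toy_mean_age)

print(toy_mean_age)
print(toy["Age"].isnull().sum())
```

Filling with the mean keeps the column mean unchanged but reduces its variance; the median is a common alternative when the ages are skewed.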
In [27]:
# Fill missing cabins with the placeholder 'N'
titanic_train['Cabin'] = titanic_train['Cabin'].fillna('N')
# Keep only the first character of each cabin value
titanic_train['Cabin'] = titanic_train['Cabin'].apply(lambda x: x[0])
For the Cabin feature, first fill the null/NaN values with a default value, 'N'. The remaining cabin values are noisy, so keep only the first character of each value using a lambda function. This brings consistency to the column.
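The order of the two steps matters: a NaN is a float, so indexing it with `x[0]` before filling would raise a TypeError. Below is a small sketch of the same idea on made-up cabin strings, using the vectorized `.str[0]` accessor as an alternative to the lambda (a swapped-in technique, not the approach used in the cell above). The `toy_cabins` values and the name `first_char` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values, including missing entries and a multi-cabin string
toy_cabins = pd.Series(["C85", np.nan, "E46", "B57 B59 B63", np.nan])

# Fill first so every entry is a string, then take the first character
first_char = toy_cabins.fillna("N").str[0]

print(first_char.tolist())
```

After this step every entry is a single character, with 'N' marking the rows where no cabin was recorded.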