Titanic ML Kernel

Importing pandas and NumPy


In [2]:
import pandas as pd
import numpy as np

Loading the Titanic dataset from the downloaded CSV file


In [9]:
data_path = "C:/Users/Rishu/Desktop/dATA/titanic/"

titanic_train = pd.read_csv(data_path+"train.csv")

titanic_train.head()


Out[9]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

Understanding the Dataset

  • Information about the training data

In [8]:
titanic_train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
  • Counting the NULL/NaN values in each feature

In [5]:
titanic_train.isnull().sum()


Out[5]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
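
The counts above are easier to compare as shares of the 891 rows. A minimal sketch, assuming the three non-zero counts from Out[5] are typed in by hand; the names null_counts, n_rows and null_share are illustrative and do not appear in the original notebook:

```python
import pandas as pd

# Non-zero null counts copied from the isnull().sum() output above.
null_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891  # number of entries reported by titanic_train.info()

# Fraction of rows that are missing for each affected feature.
null_share = null_counts / n_rows
print(null_share.round(3))
```

On the loaded frame itself, titanic_train.isnull().mean() should give the same shares directly, since the mean of a boolean column is the fraction of True values.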
  • Summary statistics for the numeric features

In [6]:
titanic_train.describe()


Out[6]:
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Fixing missing values and noise in the dataset

  • Feature: Age

In [20]:
# Mean of the known ages; mean() skips NaN by default
mean_age = titanic_train['Age'].mean()
# Replace the missing ages with that mean
titanic_train['Age'] = titanic_train['Age'].fillna(mean_age)

The mean age is computed over the passengers whose age is known (mean() ignores NaN), and the 177 missing ages are then replaced with that mean.
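
The same two steps can be tried on a tiny Series to see what they do. This is only a sketch: the values below are made up and stand in for the Age column, and the names ages and filled are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Age column (values are invented for illustration).
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan])

mean_age = ages.mean()          # NaN entries are skipped by default
filled = ages.fillna(mean_age)  # returns a new Series with no NaN left

print(mean_age)
print(filled.tolist())
```

Filling with the mean leaves the column mean unchanged but reduces its spread, because every imputed value sits exactly at the centre.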

  • Feature: Cabin

In [27]:
#print(titanic_train['Cabin'].head())

# Mark missing cabins with the placeholder 'N'
titanic_train['Cabin'] = titanic_train['Cabin'].fillna('N')
# Keep only the first character of each cabin value
titanic_train['Cabin'] = titanic_train['Cabin'].apply(lambda x: x[0])

For the Cabin feature, the Null/NaN values are first filled with the default value N. The remaining cabin values are noisy as well, so a lambda function keeps only the first character of each one. This brings consistency to the feature.
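
A small sketch of the effect, using a handful of sample cabin strings rather than the full column. It uses the vectorised .str[0] accessor instead of the lambda, which is an alternative spelling and not the notebook's own code; the names cabins and decks are illustrative.

```python
import numpy as np
import pandas as pd

# Sample cabin values; NaN marks passengers with no recorded cabin.
cabins = pd.Series(["C85", np.nan, "C123", "E46", np.nan])

# Fill the gaps with the placeholder 'N', then keep the first character.
decks = cabins.fillna("N").str[0]

print(decks.tolist())  # ['C', 'N', 'C', 'E', 'N']
```

After this step the feature holds a small set of single-letter categories, with N standing for "cabin unknown".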

  • Feature: Embarked

In [ ]: