In [1]:
%matplotlib notebook
The original data set is from Kaggle.
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we will complete the analysis of what sorts of people were likely to survive. In particular, we will apply the tools of machine learning to predict which passengers survived the tragedy.
Variable | Definition | Key
---|---|---
PassengerId | Passenger ID |
Survival | Survival | 0 = No, 1 = Yes
Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
Name | Passenger name |
Sex | Sex |
Age | Age in years |
SibSp | # of siblings / spouses aboard the Titanic |
Parch | # of parents / children aboard the Titanic |
Ticket | Ticket number |
Fare | Passenger fare |
Cabin | Cabin number |
Embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton
1. Load the train data (titanic_train.csv) and test data (titanic_test.csv)
In [2]:
import pandas
#load train and test data
titanic_df = pandas.read_csv("titanic_train.csv")
test_df = pandas.read_csv("titanic_test.csv")
2. Take a look at the train data and test data
The train data:
In [3]:
titanic_df.head()
Out[3]:
The test data:
In [4]:
test_df.head()
Out[4]:
3. Get the general information about the data
We can get the total count and basic statistics of each feature:
In [5]:
print("-----------------Train Data-------------")
titanic_df.info()
In [6]:
titanic_df.describe()
Out[6]:
In [7]:
print("-----------------Test Data-------------")
test_df.info()
In [8]:
test_df.describe()
Out[8]:
The meaning of each item is given in the table above. We can see that there are missing values in the "Age" and "Cabin" columns.
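For instance, a quick way to verify this is to count the missing entries per column with pandas' isnull().sum(); a minimal sketch using the DataFrames loaded above:
In [ ]:
# count missing values per column in the train and test data
print(titanic_df.isnull().sum())
print(test_df.isnull().sum())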
In [9]:
# drop columns that we will not use as features;
# keep PassengerId in test_df because it is needed later for the submission file
titanic_df = titanic_df.drop(['PassengerId','Name','Ticket'], axis=1)
test_df = test_df.drop(['Name','Ticket'], axis=1)
In [10]:
titanic_df.head()
Out[10]:
In [11]:
test_df.head()
Out[11]:
The "Cabin" column has a lot of missing values, so it is hard to make use of it in the analysis. We just drop it.
In [12]:
titanic_df.drop("Cabin",axis=1,inplace=True)
test_df.drop("Cabin",axis=1,inplace=True)
"Age" column is also missing in train and test data. We need to fill it. Here we can generate the age in one std within the mean.
In [13]:
import numpy as np
import matplotlib.pyplot as plt
fig, (axis1,axis2) = plt.subplots(1,2,figsize=(10,4))
axis1.set_title('Original Age values - Titanic')
axis2.set_title('New Age values - Titanic')
# get average, std, and number of NaN values in titanic_df
average_age_titanic = titanic_df["Age"].mean()
std_age_titanic = titanic_df["Age"].std()
count_nan_age_titanic = titanic_df["Age"].isnull().sum()
# get average, std, and number of NaN values in test_df
average_age_test = test_df["Age"].mean()
std_age_test = test_df["Age"].std()
count_nan_age_test = test_df["Age"].isnull().sum()
# generate random numbers between (mean - std) & (mean + std)
rand_1 = np.random.randint(average_age_titanic - std_age_titanic, average_age_titanic + std_age_titanic, size = count_nan_age_titanic)
rand_2 = np.random.randint(average_age_test - std_age_test, average_age_test + std_age_test, size = count_nan_age_test)
# plot original Age values
# NOTE: drop all null values, and convert to int
titanic_df['Age'].dropna().astype(int).hist(bins=70, ax=axis1)
# test_df['Age'].dropna().astype(int).hist(bins=70, ax=axis1)
# fill NaN values in Age column with random values generated
titanic_df["Age"][np.isnan(titanic_df["Age"])] = rand_1
test_df["Age"][np.isnan(test_df["Age"])] = rand_2
# convert from float to int
titanic_df['Age'] = titanic_df['Age'].astype(int)
test_df['Age'] = test_df['Age'].astype(int)
# plot new Age Values
titanic_df['Age'].hist(bins=70, ax=axis2)
plt.show()
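Note that this imputation is random, so the filled ages change from run to run. If reproducible results are wanted, one possible sketch is to seed NumPy's random number generator before the imputation cell above (the seed value 0 is an arbitrary choice):
In [ ]:
# optional: seed the RNG so the random age imputation is reproducible
# (run this before drawing rand_1 / rand_2 above)
np.random.seed(0)
rand_1 = np.random.randint(average_age_titanic - std_age_titanic,
                           average_age_titanic + std_age_titanic,
                           size=count_nan_age_titanic)
print(rand_1[:5])  # same first few values on every run once the seed is fixed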
"Embark" also misses two values. We can fill it with the most occurred value, which is "S".
In [14]:
titanic_df.mode()
Out[14]:
In [15]:
titanic_df["Embarked"] = titanic_df["Embarked"].fillna("S")
"Fair" in the test data is also missing one value. We can fill it with just median value.
In [16]:
test_df["Fare"].fillna(test_df["Fare"].median(), inplace=True)
titanic_df['Fare'] = titanic_df['Fare'].astype(int)
test_df['Fare'] = test_df['Fare'].astype(int)
After that, we can check the train and test data again:
In [17]:
print("-----------------Train Data-------------")
titanic_df.info()
print("-----------------Test Data-------------")
test_df.info()
In [18]:
print(titanic_df["Sex"].unique())
1. The "Sex" column is categorical; let's transform it to integers.
In [19]:
titanic_df.loc[titanic_df["Sex"] == "male", "Sex"] = 0
titanic_df.loc[titanic_df["Sex"] == "female", "Sex"] = 1
test_df.loc[test_df["Sex"] == "male", "Sex"] = 0
test_df.loc[test_df["Sex"] == "female", "Sex"] = 1
titanic_df['Sex'] = titanic_df['Sex'].astype(int)
test_df['Sex'] = test_df['Sex'].astype(int)
2."Embarked" column has 3 different values, and we can divide it to 3 diffrent columns
In [20]:
print(titanic_df["Embarked"].value_counts())
In [21]:
embark_dummies_titanic = pandas.get_dummies(titanic_df['Embarked'],prefix='embarked')
embark_dummies_test = pandas.get_dummies(test_df['Embarked'],prefix='embarked')
titanic_df = titanic_df.join(embark_dummies_titanic)
test_df = test_df.join(embark_dummies_test)
titanic_df.drop(['Embarked'], axis=1,inplace=True)
test_df.drop(['Embarked'], axis=1,inplace=True)
titanic_df.head()
Out[21]:
In [22]:
test_df.head()
Out[22]:
3."Pclass" column is also a catategorial variables
In [23]:
pclass_dummies_titanic = pandas.get_dummies(titanic_df['Pclass'],prefix='pclass')
pclass_dummies_test = pandas.get_dummies(test_df['Pclass'], prefix='pclass')
titanic_df = titanic_df.join(pclass_dummies_titanic)
test_df = test_df.join(pclass_dummies_test)
titanic_df.drop(['Pclass'], axis=1,inplace=True)
test_df.drop(['Pclass'], axis=1,inplace=True)
titanic_df.head()
Out[23]:
In [24]:
test_df.head()
Out[24]:
At this point, we have finished all the preprocessing of the data.
In [25]:
print("-----------------Train Data-------------")
titanic_df.info()
print("-----------------Test Data-------------")
test_df.info()
In [26]:
# features and target for training; keep PassengerId out of the test features
X_train = titanic_df.drop("Survived",axis=1)
Y_train = titanic_df["Survived"]
X_test = test_df.drop("PassengerId",axis=1).copy()
In [27]:
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt
X_train_np = X_train.values
Y_train_np = Y_train.values
X_train_headers = X_train.columns.values
#fig, axes = plt.subplots(11, 1, figsize=(10, 20))
survived = X_train_np[Y_train_np == 1]
unsurvived = X_train_np[Y_train_np == 0]
# ax = axes.ravel()
# for i in range(11):
# _, bins = np.histogram(X_train_np[:, i], bins=50)
# ax[i].hist(survived[:, i], bins=bins, color='red', alpha=.5)
# ax[i].hist(unsurvived[:, i], bins=bins, color='yellow', alpha=.5)
# ax[i].set_title(X_train_headers[i])
# ax[i].set_yticks(())
# ax[0].set_xlabel("Feature")
# ax[0].set_ylabel("Frequency")
# ax[0].legend(["Survived", "Unsurvived"], loc="best")
# fig.tight_layout()
# plt.show()
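The cell above imports SelectKBest and f_classif but never actually applies them. As an aside, a minimal sketch (not part of the original analysis) of how those univariate ANOVA F-scores could be used to rank the features might look like this:
In [ ]:
# rank features by univariate ANOVA F-score (higher = stronger relation to Survived)
selector = SelectKBest(f_classif, k='all')
selector.fit(X_train_np, Y_train_np)
for name, score in sorted(zip(X_train_headers, selector.scores_), key=lambda p: -p[1]):
    print("%-12s %.2f" % (name, score))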
In [28]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
logreg.score(X_train, Y_train)
Out[28]:
In [29]:
from pandas import DataFrame
# get Correlation Coefficient for each feature using Logistic Regression
coeff_df = DataFrame(titanic_df.columns.delete(0))
coeff_df.columns = ['Features']
coeff_df["Coefficient Estimate"] = pandas.Series(logreg.coef_[0])
# preview
coeff_df
Out[29]:
Why random forest? A random forest is derived from decision trees. In a decision tree, each feature is processed separately, and the possible splits of the data don't depend on scaling, so no preprocessing such as normalization or standardization of features is needed for decision tree algorithms. In particular, decision trees work well when you have features on completely different scales, or a mix of binary and continuous features.
However, a main drawback of decision trees is that they tend to overfit the training data. Random forests are one way to address this problem. A random forest is essentially a collection of decision trees, where each tree is slightly different from the others. The idea behind random forests is that each tree might do a relatively good job of predicting but will likely overfit in different ways, so we can reduce the amount of overfitting by averaging their results. Random forests get their name from injecting randomness into the tree-building process to ensure each tree is different.
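To see this overfitting argument in action, one possible check (a sketch, not part of the original notebook; random_state=0 is an arbitrary choice) is to compare a single decision tree with a random forest using 5-fold cross-validation: both fit the training data almost perfectly, but the averaged forest usually generalizes better.
In [ ]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# a single, fully grown decision tree vs. a forest of 100 such trees
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
for name, model in [("decision tree", tree), ("random forest", forest)]:
    scores = cross_val_score(model, X_train, Y_train, cv=5)
    print("%s: mean CV accuracy = %.3f" % (name, scores.mean()))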
In [31]:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
Out[31]:
In [37]:
def plot_importance_feature(model):
    plt.figure(0)
    n_features = X_train_np.shape[1]
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), X_train_headers)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    plt.show()
plot_importance_feature(random_forest)
Ways to improve:
In [43]:
result = pandas.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": Y_pred
})
result.to_csv('titanic_result.csv', index=False)
In [44]:
result.head()
Out[44]: