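Lab: Exploring Titanic Passenger Survival with Decision Trees

Getting Started

In the guided project, you explored the Titanic survival data and made predictions about passenger survival. In that project, you built a decision tree by hand, choosing at each stage the feature most correlated with survival. Fortunately, that is exactly how decision trees work! In this lab, we'll make the process much faster by implementing a decision tree in sklearn.

We'll start by loading the dataset and displaying a few of its rows.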



In [ ]:

    
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

# Pretty display for notebooks
%matplotlib inline

# Set a random seed
import random
random.seed(42)

# Load the dataset
in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)

# Print the first few entries of the RMS Titanic data
display(full_data.head())

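These are the features recorded for each passenger:

Survived: survival outcome (0 = did not survive; 1 = survived)
Pclass: socio-economic class (1 = upper class; 2 = middle class; 3 = lower class)
Name: name of the passenger
Sex: sex of the passenger
Age: age of the passenger (some entries are NaN)
SibSp: number of siblings and spouses aboard with the passenger
Parch: number of parents and children aboard with the passenger
Ticket: ticket number of the passenger
Fare: fare paid by the passenger
Cabin: cabin number of the passenger (some entries are NaN)
Embarked: port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Since we're interested in the survival outcome of each passenger or crew member, we can remove the Survived feature from this dataset and store it in a separate variable, outcomes. We will use these outcomes as our prediction targets.
Run the code cell below to remove the Survived feature from the dataset and store it in outcomes.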



In [ ]:

    
# Store the 'Survived' feature in a new variable and remove it from the dataset
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)

# Show the new dataset with 'Survived' removed
display(features_raw.head())

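The same Titanic sample data is now displayed with the Survived feature removed from the DataFrame. Note that features_raw (the passenger data) and outcomes (the survival outcomes) are now paired: for any passenger features_raw.loc[i], the survival outcome is outcomes[i].

Preprocessing the data

Now let's preprocess the data. First, we'll one-hot encode the features.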



In [ ]:

    
features = pd.get_dummies(features_raw)
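To make the effect of one-hot encoding concrete, here is a minimal sketch on a tiny hand-made DataFrame. The three rows and the name `toy` are hypothetical stand-ins, not taken from titanic_data.csv. It shows that pd.get_dummies leaves numeric columns untouched and replaces each text column with one indicator column per distinct value.

```python
import pandas as pd

# Hypothetical three-row frame that mimics a few of the Titanic columns.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None],
    "Sex": ["male", "female", "female"],
    "Name": ["Ann", "Bob", "Cid"],
})

# Numeric columns pass through unchanged; every text column is replaced
# by one indicator column per distinct value it contains.
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded.shape)
```

Note that a column whose values are all different, such as Name here, produces one new column per row. On the full dataset the same presumably applies to Name, Ticket, and Cabin, so the encoded frame can be much wider than the raw one.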

Now fill in any blanks with 0.



In [ ]:

    
features = features.fillna(0.0)
display(features.head())
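As a quick illustration of what this step does, the sketch below applies fillna(0.0) to a small hypothetical frame (the values and the name `toy` are made up for the example). It also shows median filling for comparison. Median filling is not what this lab uses and is included only as a common alternative.

```python
import pandas as pd

# Hypothetical frame with one missing value per column.
toy = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Fare": [7.25, 71.28, None],
})

# What the lab does: every NaN becomes 0.0.
filled = toy.fillna(0.0)

# A common alternative (not used in this lab): fill each column
# with its own median, computed over the non-missing entries.
median_filled = toy.fillna(toy.median())

print(filled)
print(median_filled)
```

With zero filling, the tree sees a missing age as a real age of 0, which is worth keeping in mind when interpreting splits on Age.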

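(TODO) Training the model

Now we're ready to train a model in sklearn. First, split the data into training and testing sets. Then train the model on the training set.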



In [ ]:

    
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size=0.2, random_state=42)



In [ ]:

    
# Import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier

# TODO: Define the classifier, and fit it to the data
model = None
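If you are unsure what "define the classifier and fit it" looks like, the following is a minimal sketch on synthetic data rather than a solution to the cell above. The data from make_classification and the names `X_demo`, `y_demo`, and `demo_model` are assumptions made so that the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the lab uses the Titanic features).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Define a decision tree with default settings and fit it to the training split.
demo_model = DecisionTreeClassifier(random_state=42)
demo_model.fit(X_tr, y_tr)

# The fitted model can now make predictions on unseen rows.
print(demo_model.predict(X_te[:5]))
print(demo_model.get_depth())
```

In the lab itself, the same two steps would be applied to X_train and y_train, with the result stored in model.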

Testing the model

Now let's see how the model does. We'll calculate the accuracy on both the training set and the testing set.



In [ ]:

    
# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

Exercise: Improving the model

The result is a high training accuracy but a lower testing accuracy. We may be overfitting a bit.
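The gap between training and testing accuracy can be reproduced on synthetic data. The sketch below is an illustration under assumptions: the noisy dataset from make_classification and the names `X_demo`, `y_demo`, and `scores` are not part of the lab. It compares a tree limited to depth 2 with an unrestricted one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of the labels flipped at random (an assumption),
# so that memorizing the training set cannot generalize perfectly.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Compare a shallow tree (max_depth=2) with an unrestricted one (max_depth=None).
scores = {}
for depth in (2, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print("max_depth =", depth, "train/test accuracy:", scores[depth])
```

The unrestricted tree keeps splitting until it classifies every training row correctly, noise included. How much of that carries over to the test split depends on the data, so compare the two printed pairs rather than assuming a particular outcome.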

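Now it's your turn! Train a new model and try specifying some parameters to improve the testing accuracy, such as:

max_depth
min_samples_leaf
min_samples_split

You can use your intuition, trial and error, or even grid search!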

Challenge: Try to reach 85% accuracy on the testing set. If you need a hint, you can look at the solution notebook that follows.
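For the grid search option, one possible shape is sketched below using sklearn's GridSearchCV. It is not the solution notebook's approach. The synthetic data, the candidate values in `param_grid`, and the names `X_demo`, `y_demo`, and `search` are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the lab uses the Titanic features).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Candidate values for the three parameters named above (illustrative only).
param_grid = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_leaf": [1, 3, 6],
    "min_samples_split": [2, 6, 10],
}

# Try every combination with 5-fold cross-validation on the training split.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print(search.best_params_)
# score() uses the best combination, refit on the whole training split.
print(search.score(X_te, y_te))
```

Because the combinations are compared by cross-validation on the training split only, the test split stays untouched until the final score.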



In [ ]:

    
# TODO: Train the model

# TODO: Make predictions

# TODO: Calculate the accuracy