In [1]:
# Import the Pandas library
import pandas as pd
kaggle_path = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/"
# Load the train and test datasets to create two DataFrames
train_url = kaggle_path + "train.csv"
train = pd.read_csv(train_url)
test_url = kaggle_path + "test.csv"
test = pd.read_csv(test_url)
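A quick sanity check (my own addition, not part of the original exercise) confirms that both DataFrames loaded correctly; the Kaggle Titanic training set should have 891 rows and the test set 418:
# Optional sanity check on the loaded DataFrames
print(train.shape)   # expected (891, 12) for the Kaggle Titanic training set
print(test.shape)    # expected (418, 11); the test set has no 'Survived' column
print(train.head())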
In [2]:
# Training Dataset
target = train['Survived'].values
train['Age'] = train['Age'].fillna(train['Age'].median())
train.loc[train['Sex'] == 'male', 'Sex'] = 0
train.loc[train['Sex'] == 'female', 'Sex'] = 1
# Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna('S')
# Convert the Embarked classes to integer form
train.loc[train['Embarked'] == 'S', 'Embarked'] = 0
train.loc[train['Embarked'] == 'C', 'Embarked'] = 1
train.loc[train['Embarked'] == 'Q', 'Embarked'] = 2
In [3]:
# Test Dataset
test['Age'] = test['Age'].fillna(test['Age'].median())
test.loc[test['Sex'] == 'male', 'Sex'] = 0
test.loc[test['Sex'] == 'female', 'Sex'] = 1
# Impute the Embarked variable
test["Embarked"] = test["Embarked"].fillna('S')
# Convert the Embarked classes to integer form
test.loc[test['Embarked'] == 'S', 'Embarked'] = 0
test.loc[test['Embarked'] == 'C', 'Embarked'] = 1
test.loc[test['Embarked'] == 'Q', 'Embarked'] = 2
# Impute the single missing Fare value in the test set (row 152) with the median
test.loc[152, 'Fare'] = test['Fare'].median()
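Before modelling, it is worth verifying that none of the columns we plan to use still contain missing values. This check is my own addition rather than part of the original exercise:
# Sanity check: the selected feature columns should contain no NaNs
cols = ["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]
print(train[cols].isnull().sum())
print(test[cols].isnull().sum())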
A detailed study of Random Forests would take this tutorial a bit too far. However, since it is a commonly used machine learning technique, gaining a general understanding of it in Python won't hurt.
In layman's terms, the Random Forest technique addresses the overfitting problem you faced with decision trees. It grows multiple (very deep) classification trees on the training set. At prediction time, each tree produces a prediction, and every outcome counts as a vote. For example, if you have trained 3 trees, with 2 saying a passenger in the test set will survive and 1 saying he will not, the passenger is classified as a survivor. Growing many deep (individually overfit) trees, but letting the majority vote decide the final classification, keeps the overall model from overfitting.
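To make the voting idea concrete, here is a minimal sketch (my own illustration, not part of the tutorial) that tallies the votes of three hypothetical trees for a single passenger:
# Hypothetical predictions from three trees for one passenger (1 = survived)
tree_votes = [1, 1, 0]
# Majority vote: the most frequent prediction wins
majority = max(set(tree_votes), key=tree_votes.count)
print(majority)  # prints 1, so the passenger is classified as a survivor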
Building a random forest in Python looks almost the same as building a decision tree, so we can jump right in. There are two key differences, however: a different class is used, and a new argument is necessary. We also need to import the corresponding class from scikit-learn.
Use the RandomForestClassifier() class instead of the DecisionTreeClassifier() class.
n_estimators needs to be set when using the RandomForestClassifier() class. This argument sets the number of trees you wish to grow and combine by majority vote.
The latest training and testing data are preloaded for you.
Build the random forest with n_estimators set to 100.
Fit your random forest model with inputs features_forest and target.
Compute the classifier predictions on the selected test set features.
In [4]:
# Import the `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier
# We want the Pclass, Age, Sex, Fare, SibSp, Parch, and Embarked variables
features_forest = train[["Pclass", "Age", "Sex", "Fare",
"SibSp", "Parch", "Embarked"]].values
# Building and fitting my_forest
forest = RandomForestClassifier(max_depth=10, min_samples_split=2,
                                n_estimators=100, random_state=1)
my_forest = forest.fit(features_forest, target)
# Print the score of the fitted random forest
print(my_forest.score(features_forest, target))
# Compute predictions on our test set features
# then print the length of the prediction vector
test_features = test[["Pclass", "Age", "Sex", "Fare",
"SibSp", "Parch", "Embarked"]].values
pred_forest = my_forest.predict(test_features)
print(len(pred_forest))
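If you later want to turn these predictions into a Kaggle submission, a minimal sketch could look like the following; the file name is my own choice, and Kaggle expects exactly the PassengerId and Survived columns:
# Sketch of a submission file: PassengerId plus the predicted Survived values
submission = pd.DataFrame({"PassengerId": test["PassengerId"],
                           "Survived": pred_forest})
submission.to_csv("my_forest_solution.csv", index=False)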
Remember how we looked at the .feature_importances_ attribute for decision trees? Well, you can request the same attribute from your random forest and interpret the relevance of the included variables. You might also want to compare the models in a quick and easy way. For this, we can use the .score() method. The .score() method takes the feature data and the target vector and computes the mean accuracy of your model. You can apply this method to both the forest and the individual tree. Remember, this measure should be high but not extreme, because that would be a sign of overfitting.
For this exercise, you have my_forest and my_tree_two available to you. The features and target arrays are also ready for use.
Explore the feature importance for both models
Compare the mean accuracy score of the two models
In [5]:
from sklearn import tree
# Create a new array with the added features: features_two
features_two = train[["Pclass", "Age", "Sex", "Fare",
                      "SibSp", "Parch", "Embarked"]].values
# Control overfitting by setting "max_depth" to 10
# and "min_samples_split" to 5: my_tree_two
max_depth = 10
min_samples_split = 5
my_tree_two = tree.DecisionTreeClassifier(max_depth=max_depth,
                                          min_samples_split=min_samples_split,
                                          random_state=1)
my_tree_two = my_tree_two.fit(features_two, target)
In [6]:
# Request and print the `.feature_importances_` attribute
print(my_tree_two.feature_importances_)
print(my_forest.feature_importances_)
# Compute and print the mean accuracy score for both models
print(my_tree_two.score(features_two, target))
print(my_forest.score(features_two, target))
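Keep in mind that these are training-set accuracies, which are optimistic. If you want a less biased comparison (this goes beyond the original exercise), a short cross-validation sketch, assuming a recent scikit-learn where cross_val_score lives in sklearn.model_selection, could look like this:
# Optional: estimate out-of-sample accuracy with 5-fold cross-validation
from sklearn.model_selection import cross_val_score
print(cross_val_score(forest, features_forest, target, cv=5).mean())
print(cross_val_score(my_tree_two, features_two, target, cv=5).mean())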
Based on your findings in the previous exercise, determine which feature was most important, and for which model. After this final exercise, you will be able to submit your random forest model to Kaggle! Use my_forest, my_tree_two, and .feature_importances_ to answer the question.
The most important feature was "Age", but it was more significant for "my_tree_two"
*The most important feature was "Sex", but it was more significant for "my_tree_two"
The most important feature was "Sex", but it was more significant for "my_forest"
The most important feature was "Age", but it was more significant for "my_forest"