In [1]:
# Import the Pandas library
import pandas as pd
kaggle_path = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/"
# Load the train and test datasets to create two DataFrames
train_url = kaggle_path + "train.csv"
train = pd.read_csv(train_url)
test_url = kaggle_path + "test.csv"
test = pd.read_csv(test_url)
In the previous chapter, you did all the slicing and dicing yourself to find subsets that have a higher chance of surviving. A decision tree automates this process for you and outputs a classification model or classifier.
Conceptually, the decision tree algorithm starts with all the data at the root node and scans all the variables for the best one to split on. Once a variable is chosen, you do the split and go down one level (or one node) and repeat. The final nodes at the bottom of the decision tree are known as terminal nodes, and the majority vote of the observations in that node determines how to predict for new observations that end up in that terminal node.
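To make the majority-vote rule concrete, here is a tiny sketch with made-up leaf values (hypothetical, not taken from the Titanic data): the prediction for a terminal node is simply the most common outcome among the training observations that landed there.
# Toy sketch of the majority-vote rule at one terminal node (hypothetical values)
from collections import Counter
leaf_outcomes = [1, 1, 0, 1, 0]  # made-up Survived values for observations in one node
prediction = Counter(leaf_outcomes).most_common(1)[0][0]
print(prediction)  # 1, because survivors are the majority in this node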
First, let's import the necessary libraries:
Import the numpy library as np
From sklearn import the tree
In [2]:
# Import the Numpy library
import numpy as np
# Import 'tree' from scikit-learn library
from sklearn import tree
Before you can begin constructing your trees, you need to get your hands dirty and clean the data so that you can use all the features available to you. In the first chapter, we saw that the Age variable had some missing values. Missingness is a whole subject in and of itself, but we will use a simple imputation technique where we substitute each missing value with the median of all present values.
train["Age"] = train["Age"].fillna(train["Age"].median())
Another problem is that the Sex and Embarked variables are categorical but in a non-numeric format. Thus, we will need to assign each class a unique integer so that Python can handle the information. Embarked also has some missing values, which you should impute with the most common class of embarkation, which is "S".
Assign the integer 0 to all males and 1 to all females
Impute missing values in Embarked with class S. Use the .fillna() method.
Replace each class of Embarked with a unique integer: 0 for S, 1 for C, and 2 for Q.
Print the Sex and Embarked columns
In [3]:
# Substitute each missing value with the median of all present values
train['Age'] = train['Age'].fillna(train['Age'].median())
print(train['Age'])
In [4]:
# Convert the male and female groups to integer form
# train["Sex"][train["Sex"] == "male"] = 0
# train["Sex"][train["Sex"] == "female"] = 1
train.loc[train['Sex'] == 'male', 'Sex'] = 0
train.loc[train['Sex'] == 'female', 'Sex'] = 1
print(train['Sex'])
In [5]:
# Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna('S')
# Convert the Embarked classes to integer form
# train["Embarked"][train["Embarked"] == "S"] = 0
# train["Embarked"][train["Embarked"] == "C"] = 1
# train["Embarked"][train["Embarked"] == "Q"] = 2
train.loc[train['Embarked'] == 'S', 'Embarked'] = 0
train.loc[train['Embarked'] == 'C', 'Embarked'] = 1
train.loc[train['Embarked'] == 'Q', 'Embarked'] = 2
# Print the Embarked column
print(train['Embarked'])
You will use the scikit-learn and numpy libraries to build your first decision tree. scikit-learn can be used to create tree objects from the DecisionTreeClassifier class. The methods that we will use take numpy arrays as inputs and therefore we will need to create those from the DataFrame that we already have. We will need the following to build a decision tree:
target: A one-dimensional numpy array containing the target/response from the train data. (Survived in your case)
features: A multidimensional numpy array containing the features/predictors from the train data. (ex. Sex, Age)
Take a look at the sample code below to see what this would look like:
target = train["Survived"].values
features = train[["Sex", "Age"]].values
my_tree = tree.DecisionTreeClassifier()
my_tree = my_tree.fit(features, target)
One way to quickly see the result of your decision tree is to see the importance of the features that are included. This is done by requesting the .feature_importances_ attribute of your tree object. Another quick metric is the mean accuracy that you can compute using the .score() function with features_one and target as arguments.
Ok, time for you to build your first decision tree in Python! The train and testing data from chapter 1 are available in your workspace.
Build the target and features_one numpy arrays. The target will be based on the Survived column in train. The features array will be based on the variables Passenger Class (Pclass), Sex, Age, and Passenger Fare (Fare).
Build a decision tree my_tree_one to predict survival using features_one and target
Look at the importance of features in your tree and compute the score
In [6]:
target = train['Survived'].values
features = train[['Sex', 'Age']].values
my_tree = tree.DecisionTreeClassifier()
my_tree = my_tree.fit(features, target)
In [7]:
# Print the train data to see the available features
print(train)
# Create the target and features numpy arrays: target, features_one
target = train['Survived'].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values
# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)
# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))
The feature_importances_ attribute makes it simple to interpret the significance of the predictors you include. Based on your decision tree, what variable plays the most important role in determining whether or not a passenger survived? Your model (my_tree_one) is available in the console; a small sketch for pairing feature names with importances follows the answer options below.
Passenger Class
Sex/Gender
Passenger Fare
Age
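One way to read off the answer (a small sketch; the list of names below simply mirrors the column order used to build features_one) is to pair each feature name with its importance:
# Pair each feature name with its importance from my_tree_one
feature_names = ["Pclass", "Sex", "Age", "Fare"]
for name, importance in zip(feature_names, my_tree_one.feature_importances_):
    print(name, importance)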
To send a submission to Kaggle you need to predict survival for the observations in the test set. In the last exercise of the previous chapter, we created simple predictions based on a single subset. Luckily, with our decision tree, we can make use of some simple functions to "generate" our answer without having to manually perform subsetting.
First, you make use of the .predict() method, which you call on the model (my_tree_one) and provide with the feature values from the dataset for which predictions need to be made (test). To extract the features we will need to create a numpy array in the same way as we did when training the model. However, we need to take care of a small but important problem first. There is a missing value in the Fare feature that needs to be imputed.
Next, you need to make sure your output is in line with the submission requirements of Kaggle: a csv file with exactly 418 entries and two columns: PassengerId and Survived. Then use the code provided to make a new data frame using DataFrame(), and create a csv file using the to_csv() method from Pandas.
Impute the missing value for Fare in row 153 (index 152) with the median of the column.
Make a prediction on the test set using the .predict() method and my_tree_one. Assign the result to my_prediction.
Create a data frame my_solution containing the solution and the passenger ids from the test set. Make sure the solution is in line with the standards set forth by Kaggle by naming the column appropriately.
In [8]:
# Convert the male and female groups to integer form
# test['Sex'][test['Sex'] == 'male'] = 0
# test['Sex'][test['Sex'] == 'female'] = 1
test.loc[test['Sex'] == 'male', 'Sex'] = 0
test.loc[test['Sex'] == 'female', 'Sex'] = 1
# substitute each missing Age with the median
test['Age'] = test['Age'].fillna(test['Age'].median())
# Impute the Embarked variable
test["Embarked"] = test["Embarked"].fillna('S')
# Convert the Embarked classes to integer form
# test["Embarked"][test["Embarked"] == "S"] = 0
# test["Embarked"][test["Embarked"] == "C"] = 1
# test["Embarked"][test["Embarked"] == "Q"] = 2
test.loc[test['Embarked'] == 'S', 'Embarked'] = 0
test.loc[test['Embarked'] == 'C', 'Embarked'] = 1
test.loc[test['Embarked'] == 'Q', 'Embarked'] = 2
In [9]:
# Impute the missing value with the median
# test.Fare[152] = test.Fare.median()
test.loc[152, 'Fare'] = test['Fare'].median()
# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values
# Make your prediction using the test set and print them.
my_prediction = my_tree_one.predict(test_features)
print(my_prediction)
# Create a data frame with two columns: PassengerId & Survived.
# Survived contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)
# Check that your data frame has 418 entries
print(my_solution.shape)
# Write your solution to a csv file with the name my_solution.csv
csv_name = '02_predicting-with-decision-trees.csv'
my_solution.to_csv(csv_name, index_label = ["PassengerId"])
When you created your first decision tree, the default arguments for max_depth and min_samples_split were set to None. This means that no limit on the depth of your tree was set. That's a good thing, right? Not so fast. We are likely overfitting. This means that while your model describes the training data extremely well, it doesn't generalize to new data, which is frankly the point of prediction. Just look at the Kaggle submission results for the simple model based on Gender and the complex decision tree. Which one does better?
Maybe we can improve the overfit model by making it less complex? In DecisionTreeClassifier, the complexity of our model is controlled by two parameters:
the max_depth parameter determines when the splitting up of the decision tree stops.
the min_samples_split parameter monitors the number of observations in a bucket. If a certain threshold is not reached (e.g. a minimum of 10 passengers), no further splitting can be done.
By limiting the complexity of your decision tree you will increase its generality and thus its usefulness for prediction!
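A quick way to see the overfitting for yourself (not part of the original exercise; just a sketch using scikit-learn's train_test_split) is to hold out part of the training data and compare the scores on seen versus unseen rows:
# Sketch: compare accuracy on the rows the tree saw versus held-out rows
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(
    features_one, target, test_size=0.25, random_state=1)
deep_tree = tree.DecisionTreeClassifier(random_state=1)
deep_tree = deep_tree.fit(X_train, y_train)
print(deep_tree.score(X_train, y_train))  # near-perfect on the data it saw
print(deep_tree.score(X_valid, y_valid))  # typically noticeably lower on unseen data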
Include the Siblings/Spouses Aboard, Parents/Children Aboard, and Embarked features in a new set of features.
Fit your second tree my_tree_two with the new features, and control the model complexity by toggling the max_depth and min_samples_split arguments.
In [10]:
# Create a new array with the added features: features_two
features_two = train[["Pclass","Age","Sex","Fare",
'SibSp', 'Parch', 'Embarked']].values
# Control overfitting by setting "max_depth" to 10
# and "min_samples_split" to 5 : my_tree_two
max_depth = 10
min_samples_split = 5
my_tree_two = tree.DecisionTreeClassifier(max_depth = max_depth,
min_samples_split = min_samples_split,
random_state = 1)
my_tree_two = my_tree_two.fit(features_two, target)
# Print the score of the new decision tree
print(my_tree_two.score(features_two, target))
Data Science is an art that benefits from a human element. Enter feature engineering: creatively engineering your own features by combining the different existing variables.
While feature engineering is a discipline in itself, too broad to be covered here in detail, you will have a look at a simple example by creating your own new predictive attribute: family_size.
A valid assumption is that larger families need more time to get together on a sinking ship, and hence have a lower probability of surviving. Family size is determined by the variables SibSp and Parch, which indicate the number of family members a certain passenger is traveling with. So when doing feature engineering, you add a new variable family_size, which is the sum of SibSp and Parch plus one (the observation itself), to the test and train set.
Create a new train set train_two that differs from train only by having an extra column with your feature engineered variable family_size.
Add your feature engineered variable family_size in addition to Pclass, Sex, Age, Fare, SibSp and Parch to features_three.
Create a new decision tree as my_tree_three and fit the decision tree with your new feature set features_three. Then check out the score of the decision tree.
In [11]:
# Create train_two with the newly defined feature
train_two = train.copy()
train_two["family_size"] = train_two['SibSp'] + train_two['Parch'] + 1
# Create a new feature set and add the new feature
features_three = train_two[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", 'family_size']].values
# Define the tree classifier, then fit the model
my_tree_three = tree.DecisionTreeClassifier()
my_tree_three = my_tree_three.fit(features_three, target)
# Print the score of this decision tree
print(my_tree_three.score(features_three, target))