Titanic: Machine Learning from Disaster

Get the Data with Pandas

Import the Pandas library


In [1]:
import pandas as pd

Load the train and test datasets to create two DataFrames


In [2]:
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

Print the head of the train and test DataFrames


In [3]:
print(train.head())
print(test.head())


   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
   PassengerId  Pclass                                          Name     Sex  \
0          892       3                              Kelly, Mr. James    male   
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female   
2          894       2                     Myles, Mr. Thomas Francis    male   
3          895       3                              Wirz, Mr. Albert    male   
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female   

    Age  SibSp  Parch   Ticket     Fare Cabin Embarked  
0  34.5      0      0   330911   7.8292   NaN        Q  
1  47.0      1      0   363272   7.0000   NaN        S  
2  62.0      0      0   240276   9.6875   NaN        Q  
3  27.0      0      0   315154   8.6625   NaN        S  
4  22.0      1      1  3101298  12.2875   NaN        S  

Understanding your data


In [4]:
print(train.shape)
print(test.shape)


(891, 12)
(418, 11)

In [5]:
print(train.describe())
print(test.describe())


       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
       PassengerId      Pclass         Age       SibSp       Parch        Fare
count   418.000000  418.000000  332.000000  418.000000  418.000000  417.000000
mean   1100.500000    2.265550   30.272590    0.447368    0.392344   35.627188
std     120.810458    0.841838   14.181209    0.896760    0.981429   55.907576
min     892.000000    1.000000    0.170000    0.000000    0.000000    0.000000
25%     996.250000    1.000000   21.000000    0.000000    0.000000    7.895800
50%    1100.500000    3.000000   27.000000    0.000000    0.000000   14.454200
75%    1204.750000    3.000000   39.000000    1.000000    0.000000   31.500000
max    1309.000000    3.000000   76.000000    8.000000    9.000000  512.329200
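
The describe() output already hints at missing data: Age has only 714 non-null values in train and 332 in test, and test is missing one Fare value. One quick way to list every column's missing-value count is the sketch below:

# count missing values per column (Age, Cabin and Embarked are the main gaps in train)
print(train.isnull().sum())
print(test.isnull().sum())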

Rose vs Jack, or Female vs Male

Passengers that survived vs passengers that passed away


In [6]:
print(train["Survived"].value_counts())


0    549
1    342
Name: Survived, dtype: int64

As proportions


In [7]:
print(train["Survived"].value_counts(normalize=True))


0    0.616162
1    0.383838
Name: Survived, dtype: float64

Males that survived vs males that passed away


In [8]:
print(train["Survived"][train["Sex"] == 'male'].value_counts())


0    468
1    109
Name: Survived, dtype: int64

Females that survived vs females that passed away


In [9]:
print(train["Survived"][train["Sex"] == 'female'].value_counts())


1    233
0     81
Name: Survived, dtype: int64

Normalized male survival


In [10]:
print(train["Survived"][train["Sex"] == 'male'].value_counts(normalize=True))


0    0.811092
1    0.188908
Name: Survived, dtype: float64

Normalized female survival


In [11]:
print(train["Survived"][train["Sex"] == 'female'].value_counts(normalize=True))


1    0.742038
0    0.257962
Name: Survived, dtype: float64
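
The same male/female breakdown can be computed in a single call with groupby; an equivalent sketch:

# survival rate per sex, matching the normalized value_counts above
print(train.groupby("Sex")["Survived"].mean())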

Does age play a role?

Create the column Child and initialize it to NaN


In [12]:
train["Child"] = float('NaN')

Assign 1 to passengers under 18, 0 to those 18 or older. Print the new column.


In [13]:
train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0
print(train['Child'])


0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
5      NaN
6      0.0
7      1.0
8      0.0
9      1.0
10     1.0
11     0.0
12     0.0
13     0.0
14     1.0
15     0.0
16     1.0
17     NaN
18     0.0
19     NaN
20     0.0
21     0.0
22     1.0
23     0.0
24     1.0
25     0.0
26     NaN
27     0.0
28     NaN
29     NaN
      ... 
861    0.0
862    0.0
863    NaN
864    0.0
865    0.0
866    0.0
867    0.0
868    NaN
869    1.0
870    0.0
871    0.0
872    0.0
873    0.0
874    0.0
875    1.0
876    0.0
877    0.0
878    NaN
879    0.0
880    0.0
881    0.0
882    0.0
883    0.0
884    0.0
885    0.0
886    0.0
887    0.0
888    NaN
889    0.0
890    0.0
Name: Child, Length: 891, dtype: float64

Print the normalized survival rates for passengers under 18


In [14]:
print(train["Survived"][train["Child"] == 1].value_counts(normalize = True))


1    0.539823
0    0.460177
Name: Survived, dtype: float64

Print the normalized survival rates for passengers 18 or older


In [15]:
print(train["Survived"][train["Child"] == 0].value_counts(normalize = True))


0    0.618968
1    0.381032
Name: Survived, dtype: float64
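
A groupby gives the same child/adult comparison at a glance; a minimal sketch:

# survival rate for children (Child == 1) vs adults (Child == 0); rows with unknown age are ignored
print(train.groupby("Child")["Survived"].mean())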

First prediction

Create a copy of test: test_one


In [16]:
test_one = test.copy()

Initialize a Survived column to 0


In [17]:
test_one['Survived'] = 0

Set Survived to 1 if Sex equals "female" and print the Survived column from test_one


In [18]:
test_one.loc[test_one['Sex'] == "female", 'Survived'] = 1
print(test_one['Survived'])


0      0
1      1
2      0
3      0
4      1
5      0
6      1
7      0
8      1
9      0
10     0
11     0
12     1
13     0
14     1
15     1
16     0
17     0
18     1
19     1
20     0
21     0
22     1
23     0
24     1
25     0
26     1
27     0
28     0
29     0
      ..
388    0
389    0
390    0
391    1
392    0
393    0
394    0
395    1
396    0
397    1
398    0
399    0
400    1
401    0
402    1
403    0
404    0
405    0
406    0
407    0
408    1
409    1
410    1
411    1
412    1
413    0
414    1
415    0
416    0
417    0
Name: Survived, Length: 418, dtype: int64
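
Before submitting, it is worth checking how this "all females survive" rule scores on the training data; a minimal sketch (it should land around 0.79, consistent with the survival rates by sex above):

# training accuracy of the gender-only rule
print(((train["Sex"] == "female").astype(int) == train["Survived"]).mean())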

Cleaning and Formatting your Data

Convert the male and female groups to integer form


In [19]:
train.loc[train["Sex"] == "male", "Sex"] = 0
train.loc[train["Sex"] == "female", "Sex"] = 1
test.loc[test["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "female", "Sex"] = 1


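
An equivalent alternative is to map the categories in a single step rather than with two boolean assignments per column; a sketch that would replace the cell above:

# map the string categories to integers in one pass
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})
test["Sex"] = test["Sex"].map({"male": 0, "female": 1})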

Impute the missing Embarked values with 'S'


In [20]:
train["Embarked"] = train["Embarked"].fillna('S')
test["Embarked"] = test["Embarked"].fillna('S')

Convert the Embarked classes to integer form


In [21]:
train.loc[train["Embarked"] == "S", "Embarked"] = 0
train.loc[train["Embarked"] == "C", "Embarked"] = 1
train.loc[train["Embarked"] == "Q", "Embarked"] = 2
test.loc[test["Embarked"] == "S", "Embarked"] = 0
test.loc[test["Embarked"] == "C", "Embarked"] = 1
test.loc[test["Embarked"] == "Q", "Embarked"] = 2



Print the Sex and Embarked columns


In [22]:
print(train["Embarked"])
print(train["Sex"])


0      0
1      1
2      0
3      0
4      0
5      2
6      0
7      0
8      0
9      1
10     0
11     0
12     0
13     0
14     0
15     0
16     2
17     0
18     0
19     1
20     0
21     0
22     2
23     0
24     0
25     0
26     1
27     0
28     2
29     0
      ..
861    0
862    0
863    0
864    0
865    0
866    1
867    0
868    0
869    0
870    0
871    0
872    0
873    0
874    1
875    1
876    0
877    0
878    0
879    1
880    0
881    0
882    0
883    0
884    0
885    2
886    0
887    0
888    0
889    1
890    2
Name: Embarked, Length: 891, dtype: object
0      0
1      1
2      1
3      1
4      0
5      0
6      0
7      0
8      1
9      1
10     1
11     1
12     0
13     0
14     1
15     1
16     0
17     0
18     1
19     1
20     0
21     0
22     1
23     0
24     1
25     1
26     0
27     0
28     1
29     0
      ..
861    0
862    1
863    1
864    0
865    1
866    1
867    0
868    0
869    0
870    0
871    1
872    0
873    0
874    1
875    1
876    0
877    0
878    0
879    1
880    1
881    0
882    1
883    0
884    0
885    1
886    0
887    1
888    1
889    0
890    0
Name: Sex, Length: 891, dtype: object

In [23]:
print(test["Embarked"])
print(test["Sex"])


0      2
1      0
2      2
3      0
4      0
5      0
6      2
7      0
8      1
9      0
10     0
11     0
12     0
13     0
14     0
15     1
16     2
17     1
18     0
19     1
20     1
21     0
22     0
23     1
24     1
25     0
26     1
27     1
28     0
29     1
      ..
388    2
389    0
390    0
391    0
392    0
393    0
394    0
395    0
396    2
397    1
398    0
399    2
400    0
401    0
402    1
403    0
404    1
405    1
406    0
407    1
408    2
409    0
410    2
411    2
412    0
413    0
414    1
415    0
416    0
417    1
Name: Embarked, Length: 418, dtype: object
0      0
1      1
2      0
3      0
4      1
5      0
6      1
7      0
8      1
9      0
10     0
11     0
12     1
13     0
14     1
15     1
16     0
17     0
18     1
19     1
20     0
21     0
22     1
23     0
24     1
25     0
26     1
27     0
28     0
29     0
      ..
388    0
389    0
390    0
391    1
392    0
393    0
394    0
395    1
396    0
397    1
398    0
399    0
400    1
401    0
402    1
403    0
404    0
405    0
406    0
407    0
408    1
409    1
410    1
411    1
412    1
413    0
414    1
415    0
416    0
417    0
Name: Sex, Length: 418, dtype: object

Creating your first decision tree

Import the NumPy library


In [24]:
import numpy as np

Import tree from the scikit-learn library


In [25]:
from sklearn import tree

Print the train data to see the available features


In [26]:
print(train)


     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
5              6         0       3   
6              7         0       1   
7              8         0       3   
8              9         1       3   
9             10         1       2   
10            11         1       3   
11            12         1       1   
12            13         0       3   
13            14         0       3   
14            15         0       3   
15            16         1       2   
16            17         0       3   
17            18         1       2   
18            19         0       3   
19            20         1       3   
20            21         0       2   
21            22         1       2   
22            23         1       3   
23            24         1       1   
24            25         0       3   
25            26         1       3   
26            27         0       3   
27            28         0       1   
28            29         1       3   
29            30         0       3   
..           ...       ...     ...   
861          862         0       2   
862          863         1       1   
863          864         0       3   
864          865         0       2   
865          866         1       2   
866          867         1       2   
867          868         0       1   
868          869         0       3   
869          870         1       3   
870          871         0       3   
871          872         1       1   
872          873         0       1   
873          874         0       3   
874          875         1       2   
875          876         1       3   
876          877         0       3   
877          878         0       3   
878          879         0       3   
879          880         1       1   
880          881         1       2   
881          882         0       3   
882          883         0       3   
883          884         0       2   
884          885         0       3   
885          886         0       3   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris   0  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...   1  38.0      1   
2                               Heikkinen, Miss. Laina   1  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)   1  35.0      1   
4                             Allen, Mr. William Henry   0  35.0      0   
5                                     Moran, Mr. James   0   NaN      0   
6                              McCarthy, Mr. Timothy J   0  54.0      0   
7                       Palsson, Master. Gosta Leonard   0   2.0      3   
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   1  27.0      0   
9                  Nasser, Mrs. Nicholas (Adele Achem)   1  14.0      1   
10                     Sandstrom, Miss. Marguerite Rut   1   4.0      1   
11                            Bonnell, Miss. Elizabeth   1  58.0      0   
12                      Saundercock, Mr. William Henry   0  20.0      0   
13                         Andersson, Mr. Anders Johan   0  39.0      1   
14                Vestrom, Miss. Hulda Amanda Adolfina   1  14.0      0   
15                    Hewlett, Mrs. (Mary D Kingcome)    1  55.0      0   
16                                Rice, Master. Eugene   0   2.0      4   
17                        Williams, Mr. Charles Eugene   0   NaN      0   
18   Vander Planke, Mrs. Julius (Emelia Maria Vande...   1  31.0      1   
19                             Masselmani, Mrs. Fatima   1   NaN      0   
20                                Fynney, Mr. Joseph J   0  35.0      0   
21                               Beesley, Mr. Lawrence   0  34.0      0   
22                         McGowan, Miss. Anna "Annie"   1  15.0      0   
23                        Sloper, Mr. William Thompson   0  28.0      0   
24                       Palsson, Miss. Torborg Danira   1   8.0      3   
25   Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...   1  38.0      1   
26                             Emir, Mr. Farred Chehab   0   NaN      0   
27                      Fortune, Mr. Charles Alexander   0  19.0      3   
28                       O'Dwyer, Miss. Ellen "Nellie"   1   NaN      0   
29                                 Todoroff, Mr. Lalio   0   NaN      0   
..                                                 ...  ..   ...    ...   
861                        Giles, Mr. Frederick Edward   0  21.0      1   
862  Swift, Mrs. Frederick Joel (Margaret Welles Ba...   1  48.0      0   
863                  Sage, Miss. Dorothy Edith "Dolly"   1   NaN      8   
864                             Gill, Mr. John William   0  24.0      0   
865                           Bystrom, Mrs. (Karolina)   1  42.0      0   
866                       Duran y More, Miss. Asuncion   1  27.0      1   
867               Roebling, Mr. Washington Augustus II   0  31.0      0   
868                        van Melkebeke, Mr. Philemon   0   NaN      0   
869                    Johnson, Master. Harold Theodor   0   4.0      1   
870                                  Balkic, Mr. Cerin   0  26.0      0   
871   Beckwith, Mrs. Richard Leonard (Sallie Monypeny)   1  47.0      1   
872                           Carlsson, Mr. Frans Olof   0  33.0      0   
873                        Vander Cruyssen, Mr. Victor   0  47.0      0   
874              Abelson, Mrs. Samuel (Hannah Wizosky)   1  28.0      1   
875                   Najib, Miss. Adele Kiamie "Jane"   1  15.0      0   
876                      Gustafsson, Mr. Alfred Ossian   0  20.0      0   
877                               Petroff, Mr. Nedelio   0  19.0      0   
878                                 Laleff, Mr. Kristo   0   NaN      0   
879      Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)   1  56.0      0   
880       Shelley, Mrs. William (Imanita Parrish Hall)   1  25.0      0   
881                                 Markun, Mr. Johann   0  33.0      0   
882                       Dahlberg, Miss. Gerda Ulrika   1  22.0      0   
883                      Banfield, Mr. Frederick James   0  28.0      0   
884                             Sutehall, Mr. Henry Jr   0  25.0      0   
885               Rice, Mrs. William (Margaret Norton)   1  39.0      0   
886                              Montvila, Rev. Juozas   0  27.0      0   
887                       Graham, Miss. Margaret Edith   1  19.0      0   
888           Johnston, Miss. Catherine Helen "Carrie"   1   NaN      1   
889                              Behr, Mr. Karl Howell   0  26.0      0   
890                                Dooley, Mr. Patrick   0  32.0      0   

     Parch            Ticket      Fare        Cabin Embarked  Child  
0        0         A/5 21171    7.2500          NaN        0    0.0  
1        0          PC 17599   71.2833          C85        1    0.0  
2        0  STON/O2. 3101282    7.9250          NaN        0    0.0  
3        0            113803   53.1000         C123        0    0.0  
4        0            373450    8.0500          NaN        0    0.0  
5        0            330877    8.4583          NaN        2    NaN  
6        0             17463   51.8625          E46        0    0.0  
7        1            349909   21.0750          NaN        0    1.0  
8        2            347742   11.1333          NaN        0    0.0  
9        0            237736   30.0708          NaN        1    1.0  
10       1           PP 9549   16.7000           G6        0    1.0  
11       0            113783   26.5500         C103        0    0.0  
12       0         A/5. 2151    8.0500          NaN        0    0.0  
13       5            347082   31.2750          NaN        0    0.0  
14       0            350406    7.8542          NaN        0    1.0  
15       0            248706   16.0000          NaN        0    0.0  
16       1            382652   29.1250          NaN        2    1.0  
17       0            244373   13.0000          NaN        0    NaN  
18       0            345763   18.0000          NaN        0    0.0  
19       0              2649    7.2250          NaN        1    NaN  
20       0            239865   26.0000          NaN        0    0.0  
21       0            248698   13.0000          D56        0    0.0  
22       0            330923    8.0292          NaN        2    1.0  
23       0            113788   35.5000           A6        0    0.0  
24       1            349909   21.0750          NaN        0    1.0  
25       5            347077   31.3875          NaN        0    0.0  
26       0              2631    7.2250          NaN        1    NaN  
27       2             19950  263.0000  C23 C25 C27        0    0.0  
28       0            330959    7.8792          NaN        2    NaN  
29       0            349216    7.8958          NaN        0    NaN  
..     ...               ...       ...          ...      ...    ...  
861      0             28134   11.5000          NaN        0    0.0  
862      0             17466   25.9292          D17        0    0.0  
863      2          CA. 2343   69.5500          NaN        0    NaN  
864      0            233866   13.0000          NaN        0    0.0  
865      0            236852   13.0000          NaN        0    0.0  
866      0     SC/PARIS 2149   13.8583          NaN        1    0.0  
867      0          PC 17590   50.4958          A24        0    0.0  
868      0            345777    9.5000          NaN        0    NaN  
869      1            347742   11.1333          NaN        0    1.0  
870      0            349248    7.8958          NaN        0    0.0  
871      1             11751   52.5542          D35        0    0.0  
872      0               695    5.0000  B51 B53 B55        0    0.0  
873      0            345765    9.0000          NaN        0    0.0  
874      0         P/PP 3381   24.0000          NaN        1    0.0  
875      0              2667    7.2250          NaN        1    1.0  
876      0              7534    9.8458          NaN        0    0.0  
877      0            349212    7.8958          NaN        0    0.0  
878      0            349217    7.8958          NaN        0    NaN  
879      1             11767   83.1583          C50        1    0.0  
880      1            230433   26.0000          NaN        0    0.0  
881      0            349257    7.8958          NaN        0    0.0  
882      0              7552   10.5167          NaN        0    0.0  
883      0  C.A./SOTON 34068   10.5000          NaN        0    0.0  
884      0   SOTON/OQ 392076    7.0500          NaN        0    0.0  
885      5            382652   29.1250          NaN        2    0.0  
886      0            211536   13.0000          NaN        0    0.0  
887      0            112053   30.0000          B42        0    0.0  
888      2        W./C. 6607   23.4500          NaN        0    NaN  
889      0            111369   30.0000         C148        1    0.0  
890      0            370376    7.7500          NaN        2    0.0  

[891 rows x 13 columns]

Fill the NaN values in the Pclass, Sex, Age, and Fare columns with each column's median


In [27]:
train[["Pclass", "Sex", "Age", "Fare"]] = train[["Pclass", "Sex", "Age", "Fare"]].fillna(train[["Pclass", "Sex", "Age", "Fare"]].median())
print(train)


     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
5              6         0       3   
6              7         0       1   
7              8         0       3   
8              9         1       3   
9             10         1       2   
10            11         1       3   
11            12         1       1   
12            13         0       3   
13            14         0       3   
14            15         0       3   
15            16         1       2   
16            17         0       3   
17            18         1       2   
18            19         0       3   
19            20         1       3   
20            21         0       2   
21            22         1       2   
22            23         1       3   
23            24         1       1   
24            25         0       3   
25            26         1       3   
26            27         0       3   
27            28         0       1   
28            29         1       3   
29            30         0       3   
..           ...       ...     ...   
861          862         0       2   
862          863         1       1   
863          864         0       3   
864          865         0       2   
865          866         1       2   
866          867         1       2   
867          868         0       1   
868          869         0       3   
869          870         1       3   
870          871         0       3   
871          872         1       1   
872          873         0       1   
873          874         0       3   
874          875         1       2   
875          876         1       3   
876          877         0       3   
877          878         0       3   
878          879         0       3   
879          880         1       1   
880          881         1       2   
881          882         0       3   
882          883         0       3   
883          884         0       2   
884          885         0       3   
885          886         0       3   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name  Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    0  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...    1  38.0      1   
2                               Heikkinen, Miss. Laina    1  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)    1  35.0      1   
4                             Allen, Mr. William Henry    0  35.0      0   
5                                     Moran, Mr. James    0  28.0      0   
6                              McCarthy, Mr. Timothy J    0  54.0      0   
7                       Palsson, Master. Gosta Leonard    0   2.0      3   
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)    1  27.0      0   
9                  Nasser, Mrs. Nicholas (Adele Achem)    1  14.0      1   
10                     Sandstrom, Miss. Marguerite Rut    1   4.0      1   
11                            Bonnell, Miss. Elizabeth    1  58.0      0   
12                      Saundercock, Mr. William Henry    0  20.0      0   
13                         Andersson, Mr. Anders Johan    0  39.0      1   
14                Vestrom, Miss. Hulda Amanda Adolfina    1  14.0      0   
15                    Hewlett, Mrs. (Mary D Kingcome)     1  55.0      0   
16                                Rice, Master. Eugene    0   2.0      4   
17                        Williams, Mr. Charles Eugene    0  28.0      0   
18   Vander Planke, Mrs. Julius (Emelia Maria Vande...    1  31.0      1   
19                             Masselmani, Mrs. Fatima    1  28.0      0   
20                                Fynney, Mr. Joseph J    0  35.0      0   
21                               Beesley, Mr. Lawrence    0  34.0      0   
22                         McGowan, Miss. Anna "Annie"    1  15.0      0   
23                        Sloper, Mr. William Thompson    0  28.0      0   
24                       Palsson, Miss. Torborg Danira    1   8.0      3   
25   Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...    1  38.0      1   
26                             Emir, Mr. Farred Chehab    0  28.0      0   
27                      Fortune, Mr. Charles Alexander    0  19.0      3   
28                       O'Dwyer, Miss. Ellen "Nellie"    1  28.0      0   
29                                 Todoroff, Mr. Lalio    0  28.0      0   
..                                                 ...  ...   ...    ...   
861                        Giles, Mr. Frederick Edward    0  21.0      1   
862  Swift, Mrs. Frederick Joel (Margaret Welles Ba...    1  48.0      0   
863                  Sage, Miss. Dorothy Edith "Dolly"    1  28.0      8   
864                             Gill, Mr. John William    0  24.0      0   
865                           Bystrom, Mrs. (Karolina)    1  42.0      0   
866                       Duran y More, Miss. Asuncion    1  27.0      1   
867               Roebling, Mr. Washington Augustus II    0  31.0      0   
868                        van Melkebeke, Mr. Philemon    0  28.0      0   
869                    Johnson, Master. Harold Theodor    0   4.0      1   
870                                  Balkic, Mr. Cerin    0  26.0      0   
871   Beckwith, Mrs. Richard Leonard (Sallie Monypeny)    1  47.0      1   
872                           Carlsson, Mr. Frans Olof    0  33.0      0   
873                        Vander Cruyssen, Mr. Victor    0  47.0      0   
874              Abelson, Mrs. Samuel (Hannah Wizosky)    1  28.0      1   
875                   Najib, Miss. Adele Kiamie "Jane"    1  15.0      0   
876                      Gustafsson, Mr. Alfred Ossian    0  20.0      0   
877                               Petroff, Mr. Nedelio    0  19.0      0   
878                                 Laleff, Mr. Kristo    0  28.0      0   
879      Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)    1  56.0      0   
880       Shelley, Mrs. William (Imanita Parrish Hall)    1  25.0      0   
881                                 Markun, Mr. Johann    0  33.0      0   
882                       Dahlberg, Miss. Gerda Ulrika    1  22.0      0   
883                      Banfield, Mr. Frederick James    0  28.0      0   
884                             Sutehall, Mr. Henry Jr    0  25.0      0   
885               Rice, Mrs. William (Margaret Norton)    1  39.0      0   
886                              Montvila, Rev. Juozas    0  27.0      0   
887                       Graham, Miss. Margaret Edith    1  19.0      0   
888           Johnston, Miss. Catherine Helen "Carrie"    1  28.0      1   
889                              Behr, Mr. Karl Howell    0  26.0      0   
890                                Dooley, Mr. Patrick    0  32.0      0   

     Parch            Ticket      Fare        Cabin Embarked  Child  
0        0         A/5 21171    7.2500          NaN        0    0.0  
1        0          PC 17599   71.2833          C85        1    0.0  
2        0  STON/O2. 3101282    7.9250          NaN        0    0.0  
3        0            113803   53.1000         C123        0    0.0  
4        0            373450    8.0500          NaN        0    0.0  
5        0            330877    8.4583          NaN        2    NaN  
6        0             17463   51.8625          E46        0    0.0  
7        1            349909   21.0750          NaN        0    1.0  
8        2            347742   11.1333          NaN        0    0.0  
9        0            237736   30.0708          NaN        1    1.0  
10       1           PP 9549   16.7000           G6        0    1.0  
11       0            113783   26.5500         C103        0    0.0  
12       0         A/5. 2151    8.0500          NaN        0    0.0  
13       5            347082   31.2750          NaN        0    0.0  
14       0            350406    7.8542          NaN        0    1.0  
15       0            248706   16.0000          NaN        0    0.0  
16       1            382652   29.1250          NaN        2    1.0  
17       0            244373   13.0000          NaN        0    NaN  
18       0            345763   18.0000          NaN        0    0.0  
19       0              2649    7.2250          NaN        1    NaN  
20       0            239865   26.0000          NaN        0    0.0  
21       0            248698   13.0000          D56        0    0.0  
22       0            330923    8.0292          NaN        2    1.0  
23       0            113788   35.5000           A6        0    0.0  
24       1            349909   21.0750          NaN        0    1.0  
25       5            347077   31.3875          NaN        0    0.0  
26       0              2631    7.2250          NaN        1    NaN  
27       2             19950  263.0000  C23 C25 C27        0    0.0  
28       0            330959    7.8792          NaN        2    NaN  
29       0            349216    7.8958          NaN        0    NaN  
..     ...               ...       ...          ...      ...    ...  
861      0             28134   11.5000          NaN        0    0.0  
862      0             17466   25.9292          D17        0    0.0  
863      2          CA. 2343   69.5500          NaN        0    NaN  
864      0            233866   13.0000          NaN        0    0.0  
865      0            236852   13.0000          NaN        0    0.0  
866      0     SC/PARIS 2149   13.8583          NaN        1    0.0  
867      0          PC 17590   50.4958          A24        0    0.0  
868      0            345777    9.5000          NaN        0    NaN  
869      1            347742   11.1333          NaN        0    1.0  
870      0            349248    7.8958          NaN        0    0.0  
871      1             11751   52.5542          D35        0    0.0  
872      0               695    5.0000  B51 B53 B55        0    0.0  
873      0            345765    9.0000          NaN        0    0.0  
874      0         P/PP 3381   24.0000          NaN        1    0.0  
875      0              2667    7.2250          NaN        1    1.0  
876      0              7534    9.8458          NaN        0    0.0  
877      0            349212    7.8958          NaN        0    0.0  
878      0            349217    7.8958          NaN        0    NaN  
879      1             11767   83.1583          C50        1    0.0  
880      1            230433   26.0000          NaN        0    0.0  
881      0            349257    7.8958          NaN        0    0.0  
882      0              7552   10.5167          NaN        0    0.0  
883      0  C.A./SOTON 34068   10.5000          NaN        0    0.0  
884      0   SOTON/OQ 392076    7.0500          NaN        0    0.0  
885      5            382652   29.1250          NaN        2    0.0  
886      0            211536   13.0000          NaN        0    0.0  
887      0            112053   30.0000          B42        0    0.0  
888      2        W./C. 6607   23.4500          NaN        0    NaN  
889      0            111369   30.0000         C148        1    0.0  
890      0            370376    7.7500          NaN        2    0.0  

[891 rows x 13 columns]

Create the target and features numpy arrays: target, features_one


In [28]:
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values

Fit your first decision tree: my_tree_one


In [29]:
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

Look at the importance and score of the included features


In [30]:
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))


[ 0.12231561  0.31274009  0.22020536  0.34473895]
0.977553310887

Predict and submit to Kaggle

Impute the missing Age and Fare values in the test set with the median


In [31]:
#test.Fare[152] = test.Fare.median()
test[["Pclass", "Sex", "Age", "Fare"]] = test[["Pclass", "Sex", "Age", "Fare"]].fillna(test[["Pclass", "Sex", "Age", "Fare"]].median())

Extract the features from the test set: Pclass, Sex, Age, and Fare.


In [32]:
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values

Make your prediction using the test set


In [33]:
first_prediction = my_tree_one.predict(test_features)
print(first_prediction)


[0 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 0 0
 0 0 1 0 1 0 1 1 0 0 0 1 1 1 0 1 1 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 1 1 0 1 0
 1 0 0 1 0 1 1 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 0 0 0 1 0 0 0 1 0 0
 0 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0
 1 0 1 0 0 1 0 0 1 1 0 1 1 1 1 1 0 1 1 0 0 0 0 1 0 1 0 1 1 0 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 0 1 0
 1 1 1 0 0 1 0 0 0 1 0 0 1 0 0 1 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0
 1 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 1
 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0
 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0
 0 1 1 1 1 0 0 1 0 0 0]

Create a DataFrame with PassengerId as the index and a Survived column that contains your predictions


In [34]:
PassengerId = np.array(test["PassengerId"]).astype(int)
print(PassengerId.shape)
first_solution = pd.DataFrame(first_prediction, PassengerId, columns = ["Survived"])
print(first_solution)


(418,)
      Survived
892          0
893          0
894          1
895          1
896          1
897          0
898          0
899          0
900          1
901          0
902          0
903          0
904          1
905          1
906          1
907          1
908          0
909          1
910          1
911          0
912          0
913          1
914          1
915          0
916          1
917          0
918          1
919          1
920          1
921          0
...        ...
1280         0
1281         0
1282         0
1283         1
1284         0
1285         0
1286         0
1287         1
1288         0
1289         1
1290         0
1291         0
1292         1
1293         0
1294         1
1295         0
1296         0
1297         0
1298         0
1299         0
1300         1
1301         1
1302         1
1303         1
1304         0
1305         0
1306         1
1307         0
1308         0
1309         0

[418 rows x 1 columns]

Check that your data frame has 418 entries


In [35]:
print(first_solution.shape)


(418, 1)

Write your solution to a CSV file named first_solution.csv


In [36]:
first_solution.to_csv("../submissions/first_solution.csv", index_label = ["PassengerId"])
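
One possible sanity check is to read the file back and confirm it has the expected PassengerId and Survived columns:

# re-read the submission file written above
print(pd.read_csv("../submissions/first_solution.csv").head())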

Overfitting and how to control it

Create a new array with the added features: features_two


In [37]:
features_two = train[["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]].values

Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5: my_tree_two


In [38]:
max_depth = 10
min_samples_split = 5
my_tree_two = tree.DecisionTreeClassifier(max_depth=max_depth, min_samples_split=min_samples_split, random_state=1)
my_tree_two = my_tree_two.fit(features_two, target)

Print the score of the new decision tree


In [39]:
print(my_tree_two.score(features_two, target))


0.905723905724
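
This score is still computed on the training data, so it is an optimistic estimate. A minimal cross-validation sketch (assuming scikit-learn's model_selection module is available) gives a fairer picture:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the depth-limited tree
cv_scores = cross_val_score(my_tree_two, features_two, target, cv=5)
print(cv_scores.mean())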

In [40]:
test_features_two = test[["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]].values

In [41]:
second_prediction = my_tree_two.predict(test_features_two)
print(second_prediction)
print(second_prediction.shape)


[0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 1 0 1 1 1 1 1 0 1 0 0 0 0 0 1 0 0 0 0
 0 0 1 0 1 0 1 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 0 0
 1 0 0 1 0 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0
 1 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 1
 0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 0 1 0 1 0
 1 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 1 1 0 1 1 0 1 1 0
 0 1 0 0 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 1 0 0 1 0 1 0 0 1 0 1 1 0 0 0
 0 1 1 1 1 0 0 1 0 0 1]
(418,)

In [42]:
#PassengerId =np.array(test["PassengerId"]).astype(int)
second_solution = pd.DataFrame(second_prediction, PassengerId, columns = ["Survived"])
print(second_solution)


      Survived
892          0
893          0
894          0
895          0
896          1
897          0
898          0
899          0
900          1
901          0
902          0
903          0
904          1
905          0
906          1
907          1
908          0
909          0
910          1
911          0
912          1
913          1
914          1
915          1
916          1
917          0
918          1
919          0
920          0
921          0
...        ...
1280         0
1281         0
1282         1
1283         1
1284         1
1285         0
1286         0
1287         1
1288         0
1289         1
1290         0
1291         0
1292         1
1293         0
1294         1
1295         1
1296         0
1297         0
1298         0
1299         0
1300         1
1301         1
1302         1
1303         1
1304         0
1305         0
1306         1
1307         0
1308         0
1309         1

[418 rows x 1 columns]

In [43]:
print(second_solution.shape)


(418, 1)

In [44]:
second_solution.to_csv("../submissions/second_solution.csv", index_label = ["PassengerId"])

Feature engineering for our Titanic data set


In [45]:
# Create train_two with the newly defined feature

In [46]:
train_two = train.copy()
train_two["family_size"] = train_two["SibSp"] + train_two["Parch"] + 1

In [47]:
# Create a new feature set and add the new feature

In [48]:
features_three = train_two[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "family_size"]].values

In [49]:
# Define the tree classifier, then fit the model

In [50]:
my_tree_three = tree.DecisionTreeClassifier()
my_tree_three = my_tree_three.fit(features_three, target)

In [51]:
# Print the score of this decision tree

In [52]:
print(my_tree_three.score(features_three, target))


0.979797979798
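
To see whether the new feature carries signal on its own, you could look at the average survival rate per family size; a minimal sketch:

# mean survival rate for each family_size value
print(train_two.groupby("family_size")["Survived"].mean())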

A Random Forest analysis in Python

Import the RandomForestClassifier


In [53]:
from sklearn.ensemble import RandomForestClassifier

We want the Pclass, Age, Sex, Fare, SibSp, Parch, and Embarked variables


In [54]:
features_forest = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
target = train["Survived"]

Building and fitting my_forest


In [55]:
forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, n_estimators = 100, random_state = 1)
my_forest = forest.fit(features_forest, target)

Print the score of the fitted random forest


In [56]:
print(my_forest.score(features_forest, target))


0.939393939394
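
As with the single tree, this is a training-set score. A random forest can also report an out-of-bag estimate, which behaves like a built-in validation score; a sketch, assuming you refit with oob_score enabled (the forest_oob name is just illustrative):

# refit the same forest with out-of-bag scoring turned on
forest_oob = RandomForestClassifier(max_depth=10, min_samples_split=2,
                                    n_estimators=100, random_state=1, oob_score=True)
forest_oob = forest_oob.fit(features_forest, target)
print(forest_oob.oob_score_)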

Compute predictions on our test set features, then print the length of the prediction vector


In [57]:
test_features = test[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
pred_forest = my_forest.predict(test_features)
print(len(pred_forest))


418

In [58]:
PassengerId = np.array(test["PassengerId"]).astype(int)
third_solution = pd.DataFrame(pred_forest, PassengerId, columns = ["Survived"])
print(third_solution)


      Survived
892          0
893          0
894          0
895          0
896          0
897          0
898          0
899          0
900          1
901          0
902          0
903          0
904          1
905          0
906          1
907          1
908          0
909          0
910          0
911          0
912          1
913          0
914          1
915          1
916          1
917          0
918          1
919          0
920          0
921          0
...        ...
1280         0
1281         0
1282         0
1283         1
1284         0
1285         0
1286         0
1287         1
1288         0
1289         1
1290         0
1291         0
1292         1
1293         0
1294         1
1295         0
1296         0
1297         0
1298         0
1299         0
1300         1
1301         1
1302         1
1303         1
1304         0
1305         0
1306         1
1307         0
1308         0
1309         1

[418 rows x 1 columns]

In [59]:
print(third_solution.shape)


(418, 1)

In [60]:
third_solution.to_csv("../submissions/third_solution.csv", index_label = ["PassengerId"])

Interpreting and Comparing

Request and print the .feature_importances_ attribute


In [61]:
print(my_tree_two.feature_importances_)
print(my_forest.feature_importances_)


[ 0.14130255  0.17906027  0.41616727  0.17938711  0.05039699  0.01923751
  0.0144483 ]
[ 0.10384741  0.20139027  0.31989322  0.24602858  0.05272693  0.04159232
  0.03452128]
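
The raw arrays are easier to read when paired with the feature names used to build features_two and features_forest; a minimal sketch:

# pair each feature name with its importance in the tree and the forest
feature_names = ["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]
for name, tree_imp, forest_imp in zip(feature_names, my_tree_two.feature_importances_, my_forest.feature_importances_):
    print(name, tree_imp, forest_imp)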

Compute and print the mean accuracy score for both models


In [62]:
print(my_tree_two.score(features_two, target))
print(my_forest.score(features_forest, target))


0.905723905724
0.939393939394

Conclude and Submit

Contributors

Florent Amato