Section 1-2 - Creating Dummy Variables

In previous sections, we replaced the categorical values {C, S, Q} in the column Embarked with the numerical values {1, 2, 3}. The numerical values, however, carry a notion of ordering that the ports of embarkation do not have. To get around this problem, we introduce the concept of dummy variables.

Pandas - Extracting data



In [1]:

    
import pandas as pd
import numpy as np

df = pd.read_csv('../data/train.csv')

Pandas - Cleaning data



In [2]:

    
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)

age_mean = df['Age'].mean()
df['Age'] = df['Age'].fillna(age_mean)

# Most frequent port of embarkation; Series.mode() returns a Series, so take the first entry
mode_embarked = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(mode_embarked)
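
The two fills above can be tried on a small made-up frame. This is only a sketch: the frame toy and its values are invented for illustration, and it uses pandas' own Series.mode() to find the most frequent value, which is one of several ways to compute it.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Embarked': ['S', 'S', np.nan],
})

# Numeric column: replace missing entries with the column mean (30.0 here).
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())

# Categorical column: replace missing entries with the most frequent value.
# Series.mode() returns a Series because ties are possible; take the first entry.
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

print(toy)
```

After both fills, the frame contains no missing values.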

As the column Sex has only two unique values, a single 0/1 column encodes it without implying any ordering.



In [3]:

    
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)

For the column Embarked, however, replacing {C, S, Q} by {1, 2, 3} would imply the ordering C < S < Q, when in fact the three ports have no natural order.

To avoid this problem, we create dummy variables. This involves creating one new column per category: for example, a column that takes the value 1 if the passenger embarked at C and 0 otherwise. Pandas has a built-in function to create these columns automatically.



In [4]:

    
pd.get_dummies(df['Embarked'], prefix='Embarked').head(10)
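
To see what get_dummies produces without loading the full data set, here is a minimal sketch on a hand-made Series. The Series ports is hypothetical. The indicators are cast to int so that they print as 0/1, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Four made-up passengers and their ports of embarkation.
ports = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# One indicator column per distinct port, named <prefix>_<value>.
dummies = pd.get_dummies(ports, prefix='Embarked').astype(int)

# Columns are ordered by category label: Embarked_C, Embarked_Q, Embarked_S.
print(dummies)
```

Each row has a single 1, in the column of the port where that passenger embarked.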

We now concatenate the columns containing the dummy variables to our main dataframe.



In [5]:

    
df = pd.concat([df, pd.get_dummies(df['Embarked'], prefix='Embarked')], axis=1)
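
A quick sanity check after concatenating is that every row has exactly one indicator set. The sketch below uses a made-up frame (toy is hypothetical) and deliberately leaves one value missing to show why the earlier fill matters: by default get_dummies gives a missing value an all-zero row.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing port of embarkation.
toy = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'Q']})

dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked').astype(int)
toy = pd.concat([toy, dummies], axis=1)

# Number of indicators set per row; a fully encoded row sums to 1.
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())
```

Here the third row sums to 0 because its port is missing; on the cleaned training data every row should sum to 1.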

Exercise

Write the code to create dummy variables for the column Sex.



In [6]:

    
df = df.drop(['Sex', 'Embarked'], axis=1)

cols = df.columns.tolist()
cols = [cols[1]] + cols[0:1] + cols[2:]
df = df[cols]
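
The list arithmetic above swaps the first two column names so that Survived comes first. A minimal sketch on a shortened, illustrative list of names:

```python
# Shortened list of column names, for illustration only.
cols = ['PassengerId', 'Survived', 'Pclass', 'Age']

# Second name first, then the first name, then everything from the third onwards.
reordered = [cols[1]] + cols[0:1] + cols[2:]

print(reordered)
```

Indexing the dataframe with the reordered list, as in df[cols], then returns the columns in that order.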

We review our processed training data.



In [7]:

    
df.head(10)









    Out[7]:

       Survived  PassengerId  Pclass        Age  SibSp  Parch     Fare  Gender  Embarked_C  Embarked_Q  Embarked_S
    0         0            1       3  22.000000      1      0   7.2500       1           0           0           1
    1         1            2       1  38.000000      1      0  71.2833       0           1           0           0
    2         1            3       3  26.000000      0      0   7.9250       0           0           0           1
    3         1            4       1  35.000000      1      0  53.1000       0           0           0           1
    4         0            5       3  35.000000      0      0   8.0500       1           0           0           1
    5         0            6       3  29.699118      0      0   8.4583       1           0           1           0
    6         0            7       1  54.000000      0      0  51.8625       1           0           0           1
    7         0            8       3   2.000000      3      1  21.0750       1           0           0           1
    8         1            9       3  27.000000      0      2  11.1333       0           0           0           1
    9         1           10       2  14.000000      1      0  30.0708       0           1           0           0



In [8]:

    
train_data = df.values

Scikit-learn - Training the model



In [9]:

    
from sklearn.ensemble import RandomForestClassifier

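model = RandomForestClassifier(n_estimators=100)
# Features are columns 2 onwards (PassengerId in column 1 is skipped); the label Survived is column 0
model = model.fit(train_data[:, 2:], train_data[:, 0])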
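
The slices passed to fit separate the label from the features. The sketch below shows the same slicing on a small made-up array. The array toy and the names X and y are illustrative, and only four of the columns are included.

```python
import numpy as np

# Made-up rows with columns: Survived, PassengerId, Pclass, Age.
toy = np.array([
    [0, 1, 3, 22.0],
    [1, 2, 1, 38.0],
    [1, 3, 3, 26.0],
])

X = toy[:, 2:]  # features: every column from index 2 onwards
y = toy[:, 0]   # label: the Survived column

print(X.shape, y.shape)
```

PassengerId sits in column 1 and is left out of X, since it is an identifier rather than a property of the passenger.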

Scikit-learn - Making predictions



In [10]:

    
df_test = pd.read_csv('../data/test.csv')

df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)

df_test['Age'] = df_test['Age'].fillna(age_mean)

# Mean fare per passenger class, computed on the training data
fare_means = df.groupby('Pclass')['Fare'].mean()
# Fill each missing test fare with the mean fare of that passenger's class
df_test['Fare'] = df_test['Fare'].fillna(df_test['Pclass'].map(fare_means))

df_test['Gender'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)
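
The fare fill above replaces a missing test fare with the mean training fare of the passenger's class. A minimal sketch of that idea on made-up numbers, using groupby and map as one way to express it (the frames train and test and the name fare_by_class are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up training and test frames.
train = pd.DataFrame({'Pclass': [1, 1, 3, 3], 'Fare': [50.0, 70.0, 8.0, 10.0]})
test = pd.DataFrame({'Pclass': [3, 1, 3], 'Fare': [np.nan, 60.0, 7.5]})

# Mean fare per class from the training data: class 1 -> 60.0, class 3 -> 9.0.
fare_by_class = train.groupby('Pclass')['Fare'].mean()

# Look up each test passenger's class mean and use it only where Fare is missing.
test['Fare'] = test['Fare'].fillna(test['Pclass'].map(fare_by_class))

print(test)
```

Only the first test row changes: its missing fare becomes 9.0, the class 3 mean.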









    




Similarly, we create dummy variables for the test data.



In [11]:

    
df_test = pd.concat([df_test, pd.get_dummies(df_test['Embarked'], prefix='Embarked')],
                axis=1)
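
Creating dummies separately for the training and test data works only if both contain the same set of categories. If a port were absent from the test data, its column would be missing and the model would receive the wrong number of features. A hedged sketch of one way to guard against this, using reindex on made-up data (all variable names here are hypothetical):

```python
import pandas as pd

# Made-up data in which the test set contains no passenger from port 'Q'.
train_ports = pd.Series(['S', 'C', 'Q'], name='Embarked')
test_ports = pd.Series(['S', 'C'], name='Embarked')

train_dummies = pd.get_dummies(train_ports, prefix='Embarked')
test_dummies = pd.get_dummies(test_ports, prefix='Embarked')

# Give the test dummies the same columns, in the same order, as the training
# dummies; any column absent from the test data is filled with 0.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_dummies.columns.tolist())
```

With this step, the test columns match the training columns even when a category is missing from the test data.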



In [12]:

    
df_test = df_test.drop(['Sex', 'Embarked'], axis=1)

test_data = df_test.values

# Skip PassengerId (column 0) so the features match those used for training
output = model.predict(test_data[:, 1:])

Pandas - Preparing for submission



In [13]:

    
result = np.c_[test_data[:,0].astype(int), output.astype(int)]

df_result = pd.DataFrame(result[:,0:2], columns=['PassengerId', 'Survived'])
df_result.to_csv('../results/titanic_1-2.csv', index=False)
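
The submission step pairs each PassengerId with its prediction. The sketch below repeats it on made-up values and returns the CSV as text instead of writing a file, so the layout can be inspected directly. The arrays ids and preds are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up passenger ids and predictions, stored as floats as they would be
# after extraction from a mixed-type array.
ids = np.array([892.0, 893.0, 894.0])
preds = np.array([0.0, 1.0, 0.0])

# np.c_ stacks the two 1-D arrays as the columns of a 2-D array.
result = np.c_[ids.astype(int), preds.astype(int)]

df_result = pd.DataFrame(result, columns=['PassengerId', 'Survived'])

# With no path argument, to_csv returns the CSV content as a string.
csv_text = df_result.to_csv(index=False)
print(csv_text)
```

The output has a header row followed by one id,prediction line per passenger, with both values written as integers.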
