Example Usage: Titanic Dataset

An example of training a model on the titanic dataset.

The name of the package is pandas-learn, a mixing pandas into scikit-learn. Therefore, you should always use pandas to handle your data if you are using the package(!):


In [1]:
import pandas as pd
%matplotlib inline

pandas-learn has an identical module structure to scikit-learn, so you already know where to find all the models you already use:


In [2]:
from pdlearn.ensemble import RandomForestClassifier

You can use pandas to manipulate your data with ease:


In [3]:
data = pd.read_csv('titanic-train.csv')                   \
                 .append(pd.read_csv('titanic-test.csv')) \
                 .set_index('name')
        
data['sex'] = data.sex == 'male'
data['child'] = data.age.fillna(20) < 15
X = data[['sex', 'p_class', 'child']].astype(int)
y = data['survived']

train = y.notnull()

In [4]:
X.head(10)


Out[4]:
sex p_class child
name
Braund, Mr. Owen Harris 1 3 0
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 0 1 0
Heikkinen, Miss. Laina 0 3 0
Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 1 0
Allen, Mr. William Henry 1 3 0
Moran, Mr. James 1 3 0
McCarthy, Mr. Timothy J 1 1 0
Palsson, Master. Gosta Leonard 1 3 1
Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 0 3 0
Nasser, Mrs. Nicholas (Adele Achem) 0 2 1

(Just in case you are wondering, Mrs. Nasser was apparently 14, so was in fact a child despite being married!).


In [5]:
y.head(10)


Out[5]:
name
Braund, Mr. Owen Harris                                0
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    1
Heikkinen, Miss. Laina                                 1
Futrelle, Mrs. Jacques Heath (Lily May Peel)           1
Allen, Mr. William Henry                               0
Moran, Mr. James                                       0
McCarthy, Mr. Timothy J                                0
Palsson, Master. Gosta Leonard                         0
Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)      1
Nasser, Mrs. Nicholas (Adele Achem)                    1
Name: survived, dtype: float64

pandas-learn modules inherit directly from scikit learn models. They have basically the same interface:


In [6]:
rf = RandomForestClassifier(n_estimators=500, criterion='gini')

When you fit to pandas data, it saves the feature and target names automatically:


In [7]:
rf.fit(X[train], y[train]);
print('Feature names: ', rf.feature_names_)
print('Target names:  ', rf.target_names_)


Feature names:  Index(['sex', 'p_class', 'child'], dtype='object')
Target names:   Index(['survived'], dtype='object')

In [8]:
rf.predict(X[~train]).head(10)


Out[8]:
name
Kelly, Mr. James                                0
Wilkes, Mrs. James (Ellen Needs)                1
Myles, Mr. Thomas Francis                       0
Wirz, Mr. Albert                                0
Hirvonen, Mrs. Alexander (Helga E Lindqvist)    1
Svensson, Mr. Johan Cervin                      0
Connolly, Miss. Kate                            1
Caldwell, Mr. Albert Francis                    0
Abrahim, Mrs. Joseph (Sophie Halaut Easu)       1
Davies, Mr. John Samuel                         0
Name: survived, dtype: float64

In [9]:
rf.predict_proba(X[~train]).head(10)


Out[9]:
survived
0 1
name
Kelly, Mr. James 0.881033 0.118967
Wilkes, Mrs. James (Ellen Needs) 0.490276 0.509724
Myles, Mr. Thomas Francis 0.918252 0.081748
Wirz, Mr. Albert 0.881033 0.118967
Hirvonen, Mrs. Alexander (Helga E Lindqvist) 0.490276 0.509724
Svensson, Mr. Johan Cervin 0.659940 0.340060
Connolly, Miss. Kate 0.490276 0.509724
Caldwell, Mr. Albert Francis 0.918252 0.081748
Abrahim, Mrs. Joseph (Sophie Halaut Easu) 0.490276 0.509724
Davies, Mr. John Samuel 0.881033 0.118967

In [10]:
rf.feature_importances_.plot.bar()


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x10ea158d0>