An example of training a model on the Titanic dataset.
The package is called pandas-learn: it mixes pandas into scikit-learn. Accordingly, you should always use pandas to handle your data when you use the package(!):
In [1]:
import pandas as pd
%matplotlib inline
pandas-learn mirrors scikit-learn's module structure exactly, so you already know where to find the models you use:
In [2]:
from pdlearn.ensemble import RandomForestClassifier
You can use pandas to manipulate your data with ease:
In [3]:
data = pd.concat([pd.read_csv('titanic-train.csv'),
                  pd.read_csv('titanic-test.csv')]).set_index('name')
data['sex'] = data.sex == 'male'
data['child'] = data.age.fillna(20) < 15
X = data[['sex', 'p_class', 'child']].astype(int)
y = data['survived']
train = y.notnull()
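The split above relies on a plain boolean mask: rows whose `survived` label is present form the training set, and the rows with a missing label are the ones to predict. Since the Titanic CSVs aren't included here, this minimal sketch uses a tiny synthetic frame with made-up values to show the same pattern:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Titanic frame (hypothetical values).
data = pd.DataFrame({
    'sex':      [True, False, True, False],
    'p_class':  [3, 1, 2, 3],
    'child':    [False, True, False, False],
    'survived': [0.0, 1.0, np.nan, np.nan],  # NaN marks unlabeled test rows
})

X = data[['sex', 'p_class', 'child']].astype(int)
y = data['survived']

# Labeled rows train the model; the remainder (~train) get predictions.
train = y.notnull()
print(train.sum())     # labeled rows
print((~train).sum())  # unlabeled rows
```

Because `train` is an ordinary pandas boolean Series, `X[train]` and `X[~train]` stay aligned with `y` by index, with no manual bookkeeping.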
In [4]:
X.head(10)
Out[4]:
(Just in case you are wondering, Mrs. Nasser was apparently 14, so was in fact a child despite being married!).
In [5]:
y.head(10)
Out[5]:
pandas-learn models inherit directly from their scikit-learn counterparts, so they expose essentially the same interface:
In [6]:
rf = RandomForestClassifier(n_estimators=500, criterion='gini')
When you fit on pandas data, the model saves the feature and target names automatically:
In [7]:
rf.fit(X[train], y[train]);
print('Feature names: ', rf.feature_names_)
print('Target names: ', rf.target_names_)
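Capturing those names doesn't require anything exotic: a subclass can record the column and Series names before delegating to scikit-learn's own `fit`. This is only a sketch of how such a wrapper could work (the class name and attribute storage here are illustrative assumptions, not pdlearn's actual implementation):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as _SKRandomForest

class NamedRandomForest(_SKRandomForest):
    """Sketch: remember pandas column/target names at fit time."""

    def fit(self, X, y, **kwargs):
        # Record the names before handing the data to scikit-learn.
        self.feature_names_ = list(X.columns)
        self.target_names_ = [y.name]
        return super().fit(X, y, **kwargs)

# Hypothetical toy data, just to exercise the wrapper.
X = pd.DataFrame({'sex': [1, 0, 1, 0], 'child': [0, 1, 0, 0]})
y = pd.Series([0, 1, 1, 0], name='survived')

rf = NamedRandomForest(n_estimators=10, random_state=0).fit(X, y)
print(rf.feature_names_)  # ['sex', 'child']
print(rf.target_names_)   # ['survived']
```

Because the subclass defers everything else to scikit-learn, all the estimator's hyperparameters and methods keep working unchanged.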
In [8]:
rf.predict(X[~train]).head(10)
Out[8]:
In [9]:
rf.predict_proba(X[~train]).head(10)
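Getting pandas objects back from `predict` and `predict_proba` (so that `.head(10)` works above) is mostly a matter of re-attaching the row index and class labels to scikit-learn's raw arrays. A sketch with made-up passenger names and probabilities:

```python
import numpy as np
import pandas as pd

# Hypothetical raw outputs, as a plain scikit-learn model would return them.
index = pd.Index(['Kelly, Mr. James', 'Wilkes, Mrs. James'], name='name')
raw_pred = np.array([0, 1])
raw_proba = np.array([[0.8, 0.2],
                      [0.3, 0.7]])
classes = [0, 1]  # would come from the estimator's classes_ attribute

# Re-attach the passengers' index and the class labels.
pred = pd.Series(raw_pred, index=index, name='survived')
proba = pd.DataFrame(raw_proba, index=index, columns=classes)
print(pred)
print(proba)
```

With the index restored, predictions can be joined straight back onto the original frame with ordinary pandas alignment.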
Out[9]:
In [10]:
rf.feature_importances_.plot.bar()
Out[10]:
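That final plot works because labeled importances are a pandas Series rather than a bare array, so the Series plotting accessor is available. A minimal sketch of the idea, with hypothetical importance values and the feature names printed earlier:

```python
import pandas as pd

# Raw importances from scikit-learn are a bare array; labeling them
# with the saved feature names is what makes .plot.bar() meaningful.
raw_importances = [0.55, 0.30, 0.15]  # hypothetical values
feature_names = ['sex', 'p_class', 'child']

importances = pd.Series(raw_importances, index=feature_names)
print(importances.idxmax())  # 'sex'
# importances.plot.bar()  # draws the labeled bar chart in a notebook
```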