[4-1] Import pandas in addition to NumPy and matplotlib.



In [1]:

    
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

[4-2] Load the Titanic passenger data and store it in the DataFrame data.



In [2]:

    
data = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv')

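[4-3] Import the scikit-learn machine learning modules.



In [3]:

    
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

[4-4] Build a DataFrame with the columns age, gender, and survived. Rows with missing values are dropped, and sex is encoded as gender (female = 0, male = 1).



In [4]:

    
# Keep only the needed columns and drop rows with missing values.
tmp = data[['age', 'sex', 'survived']].dropna()
# Encode sex as an integer: female -> 0, male -> 1.
tmp['gender'] = tmp['sex'].map({'female': 0, 'male': 1}).astype(int)
features = tmp.drop(['sex'], axis=1)
features.head()

[4-5] Store the feature variables (age, gender) and the target labels (survived) in separate variables.



In [5]:

    
X = features[['age', 'gender']]
y = features['survived']

[4-6] Define a function that runs K-fold cross-validation.



In [6]:

    
def cross_val(clf, X, y, K, random_state=0):
    # Shuffled K-fold split; fixing random_state makes the folds reproducible.
    cv = KFold(n_splits=K, shuffle=True, random_state=random_state)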
    scores = cross_val_score(clf, X, y, cv=cv)
    return scores
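
As a rough illustration of what the KFold splitter used in [4-6] does, the following sketch splits ten sample indices into five shuffled folds. It assumes scikit-learn 0.18 or later (the sklearn.model_selection module). The helper show_folds and the array toy_X are made up for this illustration and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

def show_folds(n_samples=10, K=5, random_state=0):
    """Return the (train, test) index arrays of a shuffled K-fold split."""
    toy_X = np.arange(n_samples).reshape(-1, 1)  # hypothetical toy data
    cv = KFold(n_splits=K, shuffle=True, random_state=random_state)
    return [(train, test) for train, test in cv.split(toy_X)]

folds = show_folds()
for i, (train, test) in enumerate(folds):
    print('fold %d: train=%s test=%s' % (i, train, test))
```

Every sample index appears in exactly one test fold, so each row is used for evaluation once and for training K - 1 times.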

[4-7] クロスバリデーションを実施して、結果を表示します。



In [7]:

    
clf = LogisticRegression()
scores = cross_val(clf, X, y, 5)
print('Scores:', scores)
print('Mean Score: %f ± %.3f' % (scores.mean(), scores.std()))

Scores: [ 0.75238095  0.79425837  0.784689    0.77990431  0.784689  ]
Mean Score: 0.779184 ± 0.014
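
The summary line can be checked by hand from the five fold scores printed above. The sketch below copies those scores into an array (the name fold_scores is chosen here for illustration) and recomputes both numbers. NumPy's std() defaults to the population standard deviation (ddof=0), which is the value shown after the ± sign; the sample standard deviation (ddof=1) comes out slightly larger.

```python
import numpy as np

# Fold scores copied from the printed output above.
fold_scores = np.array([0.75238095, 0.79425837, 0.784689, 0.77990431, 0.784689])

mean_score = fold_scores.mean()
pop_std = fold_scores.std()           # ddof=0, NumPy's default
sample_std = fold_scores.std(ddof=1)  # shown for comparison only

print('Mean Score: %f ± %.3f' % (mean_score, pop_std))
print('Sample std (ddof=1): %.3f' % sample_std)
```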

Output of features.head() from [4-4]:

     age  survived  gender
0  29.00         1       0
1   0.92         1       1
2   2.00         0       0
3  30.00         0       1
4  25.00         0       0
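
The preprocessing in [4-4] can be tried without downloading the dataset. The sketch below applies the same dropna and map steps to a small hand-made frame. The rows in toy are invented for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (not from the Titanic data); one age is missing.
toy = pd.DataFrame({
    'age': [29.0, np.nan, 2.0, 30.0],
    'sex': ['female', 'male', 'female', 'male'],
    'survived': [1, 1, 0, 0],
})

# Same steps as in [4-4]: drop incomplete rows, encode sex, drop the text column.
tmp = toy[['age', 'sex', 'survived']].dropna()
tmp['gender'] = tmp['sex'].map({'female': 0, 'male': 1}).astype(int)
toy_features = tmp.drop(['sex'], axis=1)
print(toy_features)
```

The row with the missing age is removed before encoding, so three of the four toy rows remain.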