Trying Out Machine Learning

What this chapter covers

  • Reusing the data prepared in the previous chapter
  • Running machine learning
  • Evaluating the learned models
  • Ways to improve the results

Steps covered in this chapter

  • Splitting into training and test data
  • Training with sklearn
  • Evaluation
  • Cross-validation
  • Grid search
  • Trying several models
  • Re-checking the evaluation results

In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_pickle("df.db")

Splitting into training and test data

http://qiita.com/terapyon/items/8f8d3518ee8eeb4f96b2


In [4]:
df.head()


Out[4]:
報告数 流行 増加 平均気温(℃) 最高気温(℃) 最低気温(℃) 平均湿度(%) 最小相対湿度(%) 平均現地気圧(hPa) 降水量の合計(mm) 日照時間(時間) 平均風速(m/s)
2014-01-01 0.178571 0 0 9.8 13.7 3.9 54.0 37.0 1005.3 0.0 9.2 5.3
2014-01-02 0.178571 0 0 8.0 12.9 4.4 41.0 26.0 1011.3 0.0 9.1 3.0
2014-01-03 0.178571 0 0 5.9 9.9 2.7 43.0 32.0 1014.9 0.0 4.1 1.6
2014-01-04 0.178571 0 0 6.7 11.5 2.1 47.0 29.0 1009.5 0.0 5.9 2.4
2014-01-05 0.178571 0 0 4.4 6.9 2.3 40.0 28.0 1016.6 0.0 1.1 2.5

In [5]:
X = df.iloc[:, 3:]

In [6]:
X.head()


Out[6]:
平均気温(℃) 最高気温(℃) 最低気温(℃) 平均湿度(%) 最小相対湿度(%) 平均現地気圧(hPa) 降水量の合計(mm) 日照時間(時間) 平均風速(m/s)
2014-01-01 9.8 13.7 3.9 54.0 37.0 1005.3 0.0 9.2 5.3
2014-01-02 8.0 12.9 4.4 41.0 26.0 1011.3 0.0 9.1 3.0
2014-01-03 5.9 9.9 2.7 43.0 32.0 1014.9 0.0 4.1 1.6
2014-01-04 6.7 11.5 2.1 47.0 29.0 1009.5 0.0 5.9 2.4
2014-01-05 4.4 6.9 2.3 40.0 28.0 1016.6 0.0 1.1 2.5

In [7]:
y = df['流行']

In [8]:
y.head()


Out[8]:
2014-01-01    0
2014-01-02    0
2014-01-03    0
2014-01-04    0
2014-01-05    0
Name: 流行, dtype: int32

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8)
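
Note that train_test_split shuffles before splitting, so the split (and every score below) changes from run to run. A minimal sketch that pins random_state for reproducibility (the seed value 0 is an arbitrary choice, not from the original):

In [ ]:
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=0)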

Logistic regression


In [11]:
from sklearn.linear_model import LogisticRegression

In [12]:
clf = LogisticRegression()

In [13]:
clf.fit(X_train, y_train)


Out[13]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [14]:
y_train_pred = clf.predict(X_train)

Checking the accuracy


In [15]:
from sklearn.metrics import accuracy_score

In [16]:
accuracy_score(y_train, y_train_pred)


Out[16]:
0.87671232876712324

In [17]:
y_val_pred = clf.predict(X_val)

In [18]:
accuracy_score(y_val, y_val_pred)


Out[18]:
0.87727272727272732

Confusion matrix


In [19]:
from sklearn.metrics import confusion_matrix

In [20]:
cm = confusion_matrix(y_val, y_val_pred)

In [21]:
cm


Out[21]:
array([[168,   9],
       [ 18,  25]])

In [22]:
cm_t = confusion_matrix(y_train, y_train_pred)
cm_t


Out[22]:
array([[682,  48],
       [ 60,  86]])

Evaluating the confusion matrix

In scikit-learn's confusion_matrix, rows are the true labels and columns are the predicted labels. Treating class 1 (流行, i.e. an epidemic) as the positive class, the layout for this binary problem is:

[[TN, FP],
 [FN, TP]]
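
For a binary problem the four cells can also be unpacked directly; a small sketch using ravel (this pattern is an addition, not in the original notebook):

In [ ]:
tn, fp, fn, tp = confusion_matrix(y_val, y_val_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)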

Precision, recall, and F1 score (f1-score)

Precision

  • Of the samples predicted as class 0 (not epidemic), the fraction that truly are class 0: 168 / (168 + 18) ≈ 0.90
  • Of the samples predicted as class 1 (epidemic), the fraction that truly are class 1: 25 / (25 + 9) ≈ 0.74

Recall

  • Of the samples that truly are class 0, the fraction predicted as class 0: 168 / (168 + 9) ≈ 0.95
  • Of the samples that truly are class 1, the fraction predicted as class 1: 25 / (25 + 18) ≈ 0.58

F1 score

2 / (1/precision + 1/recall) = 2 * precision * recall / (precision + recall)

  • F1 for class 0: 2 * 0.90 * 0.95 / (0.90 + 0.95) ≈ 0.93
  • F1 for class 1: 2 * 0.74 * 0.58 / (0.74 + 0.58) ≈ 0.65

(All numbers are read off the confusion matrix cm above and match the classification_report output below.)
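
These hand computations can be checked against scikit-learn's per-class metric functions; a minimal sketch (pos_label selects which class is treated as positive):

In [ ]:
from sklearn.metrics import precision_score, recall_score, f1_score

for label in (0, 1):
    print(label,
          precision_score(y_val, y_val_pred, pos_label=label),
          recall_score(y_val, y_val_pred, pos_label=label),
          f1_score(y_val, y_val_pred, pos_label=label))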

In [24]:
from sklearn.metrics import classification_report

In [25]:
print(classification_report(y_val, y_val_pred))


             precision    recall  f1-score   support

          0       0.90      0.95      0.93       177
          1       0.74      0.58      0.65        43

avg / total       0.87      0.88      0.87       220

Wrapping the reporting in a reusable function


In [26]:
def report(y, pred):
    print(accuracy_score(y, pred))
    cm = confusion_matrix(y, pred)
    print(cm)
    cr = classification_report(y, pred)
    print(cr)

In [27]:
report(y_train, y_train_pred)


0.876712328767
[[682  48]
 [ 60  86]]
             precision    recall  f1-score   support

          0       0.92      0.93      0.93       730
          1       0.64      0.59      0.61       146

avg / total       0.87      0.88      0.87       876

Wrapping training through evaluation in a function


In [28]:
def fit_to_pred(clf, X_train, X_val, y_train, y_val):
    # train
    clf.fit(X_train, y_train)
    
    # evaluate on the training data
    y_train_pred = clf.predict(X_train)
    print("y_train_pred: ")
    report(y_train, y_train_pred)
    
    # evaluate on the validation data
    y_val_pred = clf.predict(X_val)
    print("y_val_pred: ")
    report(y_val, y_val_pred)
    
    # return the trained model
    return clf

In [29]:
clf = LogisticRegression()
fit_to_pred(clf, X_train, X_val, y_train, y_val)


y_train_pred: 
0.876712328767
[[682  48]
 [ 60  86]]
             precision    recall  f1-score   support

          0       0.92      0.93      0.93       730
          1       0.64      0.59      0.61       146

avg / total       0.87      0.88      0.87       876

y_val_pred: 
0.877272727273
[[168   9]
 [ 18  25]]
             precision    recall  f1-score   support

          0       0.90      0.95      0.93       177
          1       0.74      0.58      0.65        43

avg / total       0.87      0.88      0.87       220

Out[29]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Other machine learning algorithms

  • Support vector machine: SVC (linear kernel)
  • Kernel SVM (RBF kernel)
  • Decision tree: DecisionTreeClassifier
  • Random forest: RandomForestClassifier
  • k-nearest neighbors: KNeighborsClassifier

In [30]:
from sklearn.svm import SVC

In [31]:
svc = SVC(kernel="linear")
fit_to_pred(svc, X_train, X_val, y_train, y_val)


y_train_pred: 
0.881278538813
[[681  49]
 [ 55  91]]
             precision    recall  f1-score   support

          0       0.93      0.93      0.93       730
          1       0.65      0.62      0.64       146

avg / total       0.88      0.88      0.88       876

y_val_pred: 
0.859090909091
[[167  10]
 [ 21  22]]
             precision    recall  f1-score   support

          0       0.89      0.94      0.92       177
          1       0.69      0.51      0.59        43

avg / total       0.85      0.86      0.85       220

Out[31]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [32]:
k_svc = SVC(kernel="rbf")
fit_to_pred(k_svc, X_train, X_val, y_train, y_val)


y_train_pred: 
0.996575342466
[[730   0]
 [  3 143]]
             precision    recall  f1-score   support

          0       1.00      1.00      1.00       730
          1       1.00      0.98      0.99       146

avg / total       1.00      1.00      1.00       876

y_val_pred: 
0.804545454545
[[177   0]
 [ 43   0]]
             precision    recall  f1-score   support

          0       0.80      1.00      0.89       177
          1       0.00      0.00      0.00        43

avg / total       0.65      0.80      0.72       220

/Users/terapyon/dev/scsk/env/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
Out[32]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
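
The RBF SVM fits the training data almost perfectly yet predicts class 0 for every validation sample, which is what triggers the UndefinedMetricWarning above. RBF kernels are sensitive to feature scale, and these features range from a few ℃ to around 1000 hPa. One plausible remedy (an assumption, not tried in the original) is to standardize the features inside a Pipeline:

In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# scale each feature to zero mean / unit variance before the SVM
scaled_svc = Pipeline([("scale", StandardScaler()),
                       ("svc", SVC(kernel="rbf"))])
fit_to_pred(scaled_svc, X_train, X_val, y_train, y_val)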

In [33]:
from sklearn.tree import DecisionTreeClassifier

In [34]:
tree = DecisionTreeClassifier(max_depth=2)
fit_to_pred(tree, X_train, X_val, y_train, y_val)


y_train_pred: 
0.867579908676
[[709  21]
 [ 95  51]]
             precision    recall  f1-score   support

          0       0.88      0.97      0.92       730
          1       0.71      0.35      0.47       146

avg / total       0.85      0.87      0.85       876

y_val_pred: 
0.840909090909
[[171   6]
 [ 29  14]]
             precision    recall  f1-score   support

          0       0.85      0.97      0.91       177
          1       0.70      0.33      0.44        43

avg / total       0.82      0.84      0.82       220

Out[34]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [35]:
from sklearn.ensemble import RandomForestClassifier

In [36]:
rf = RandomForestClassifier()
fit_to_pred(rf, X_train, X_val, y_train, y_val)


y_train_pred: 
0.996575342466
[[729   1]
 [  2 144]]
             precision    recall  f1-score   support

          0       1.00      1.00      1.00       730
          1       0.99      0.99      0.99       146

avg / total       1.00      1.00      1.00       876

y_val_pred: 
0.845454545455
[[170   7]
 [ 27  16]]
             precision    recall  f1-score   support

          0       0.86      0.96      0.91       177
          1       0.70      0.37      0.48        43

avg / total       0.83      0.85      0.83       220

Out[36]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [37]:
from sklearn.neighbors import KNeighborsClassifier

In [38]:
knn = KNeighborsClassifier()
fit_to_pred(knn, X_train, X_val, y_train, y_val)


y_train_pred: 
0.897260273973
[[693  37]
 [ 53  93]]
             precision    recall  f1-score   support

          0       0.93      0.95      0.94       730
          1       0.72      0.64      0.67       146

avg / total       0.89      0.90      0.89       876

y_val_pred: 
0.827272727273
[[166  11]
 [ 27  16]]
             precision    recall  f1-score   support

          0       0.86      0.94      0.90       177
          1       0.59      0.37      0.46        43

avg / total       0.81      0.83      0.81       220

Out[38]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
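
Each model above was fit in its own cell; the same comparison can also be written as one loop over the classifiers (a compact sketch reusing fit_to_pred with the settings already shown):

In [ ]:
models = [("logistic", LogisticRegression()),
          ("linear_svc", SVC(kernel="linear")),
          ("rbf_svc", SVC(kernel="rbf")),
          ("tree", DecisionTreeClassifier(max_depth=2)),
          ("forest", RandomForestClassifier()),
          ("knn", KNeighborsClassifier())]
for name, model in models:
    print("===", name, "===")
    fit_to_pred(model, X_train, X_val, y_train, y_val)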

Cross-validation


In [39]:
from sklearn.model_selection import cross_val_score

In [40]:
from sklearn.model_selection import KFold

In [41]:
cv = KFold(5, shuffle=True)

In [42]:
clf = LogisticRegression()
cross_val_score(clf, X, y, cv=cv)


Out[42]:
array([ 0.83636364,  0.87214612,  0.88584475,  0.89954338,  0.85388128])

In [43]:
k_svc = SVC(kernel="rbf")
cross_val_score(k_svc, X, y, cv=cv)


Out[43]:
array([ 0.79545455,  0.85844749,  0.80821918,  0.85388128,  0.83561644])

In [44]:
rf = RandomForestClassifier()
cross_val_score(rf, X, y, cv=cv)


Out[44]:
array([ 0.84090909,  0.87671233,  0.84474886,  0.85388128,  0.81278539])
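
A single number per model makes comparison easier; a small sketch that averages the five fold scores (mean and standard deviation):

In [ ]:
for name, model in [("logistic", LogisticRegression()),
                    ("rbf_svc", SVC(kernel="rbf")),
                    ("forest", RandomForestClassifier())]:
    scores = cross_val_score(model, X, y, cv=cv)
    print(name, scores.mean(), scores.std())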

Evaluating with the F1 score


In [45]:
clf = LogisticRegression()
cross_val_score(clf, X, y, cv=cv, scoring="f1")


Out[45]:
array([ 0.58823529,  0.55384615,  0.5       ,  0.66666667,  0.63736264])

In [46]:
k_svc = SVC(kernel="rbf")
cross_val_score(k_svc, X, y, cv=cv, scoring="f1")


/Users/terapyon/dev/scsk/env/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/terapyon/dev/scsk/env/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
Out[46]:
array([ 0.        ,  0.        ,  0.08      ,  0.12121212,  0.05714286])

In [47]:
rf = RandomForestClassifier()
cross_val_score(rf, X, y, cv=cv, scoring="f1")


Out[47]:
array([ 0.23255814,  0.45070423,  0.49230769,  0.48717949,  0.50666667])

Grid search


In [48]:
from sklearn.model_selection import GridSearchCV

In [49]:
param_grid = {'max_depth': [2, 3, 4, 5, 10, 15, 20, 30], 'n_estimators': [2, 3, 4, 5, 10, 20, 30, 40]}

In [50]:
rf = RandomForestClassifier(max_depth=2, n_estimators=2)

In [51]:
grid_search = GridSearchCV(rf, param_grid, cv=cv, n_jobs=-1, verbose=1)

In [52]:
grid_search.fit(X, y)


Fitting 5 folds for each of 64 candidates, totalling 320 fits
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed:    4.6s finished
Out[52]:
GridSearchCV(cv=KFold(n_splits=5, random_state=None, shuffle=True),
       error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=2, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_depth': [2, 3, 4, 5, 10, 15, 20, 30], 'n_estimators': [2, 3, 4, 5, 10, 20, 30, 40]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [53]:
grid_search.best_score_


Out[53]:
0.87317518248175185

In [54]:
grid_search.best_params_


Out[54]:
{'max_depth': 2, 'n_estimators': 40}

In [55]:
rf = RandomForestClassifier(max_depth=2, n_estimators=2)
grid_search = GridSearchCV(rf, param_grid, cv=cv, n_jobs=-1, verbose=1, scoring='f1')
grid_search.fit(X, y)


Fitting 5 folds for each of 64 candidates, totalling 320 fits
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed:    4.9s finished
Out[55]:
GridSearchCV(cv=KFold(n_splits=5, random_state=None, shuffle=True),
       error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=2, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_depth': [2, 3, 4, 5, 10, 15, 20, 30], 'n_estimators': [2, 3, 4, 5, 10, 20, 30, 40]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='f1', verbose=1)

In [56]:
print(grid_search.best_score_)
print(grid_search.best_params_)


0.628228780268
{'max_depth': 2, 'n_estimators': 3}
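
The best parameters fluctuate between runs because both the folds and the forests are randomized. GridSearchCV keeps the per-candidate scores in cv_results_, which helps show how close the runners-up are; a minimal sketch loading them into a DataFrame:

In [ ]:
results = pd.DataFrame(grid_search.cv_results_)
cols = ["param_max_depth", "param_n_estimators", "mean_test_score"]
results[cols].sort_values("mean_test_score", ascending=False).head()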

In [57]:
rf = RandomForestClassifier(max_depth=3, n_estimators=40)
cross_val_score(rf, X, y, cv=cv, scoring="f1")


Out[57]:
array([ 0.47368421,  0.63636364,  0.63636364,  0.66666667,  0.68493151])

Final check


In [58]:
rf = RandomForestClassifier(max_depth=3, n_estimators=40)
fit_to_pred(rf, X_train, X_val, y_train, y_val)


y_train_pred: 
0.883561643836
[[681  49]
 [ 53  93]]
             precision    recall  f1-score   support

          0       0.93      0.93      0.93       730
          1       0.65      0.64      0.65       146

avg / total       0.88      0.88      0.88       876

y_val_pred: 
0.877272727273
[[170   7]
 [ 20  23]]
             precision    recall  f1-score   support

          0       0.89      0.96      0.93       177
          1       0.77      0.53      0.63        43

avg / total       0.87      0.88      0.87       220

Out[58]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=40, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [59]:
rf.predict(X_val)


Out[59]:
array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0], dtype=int32)
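
A random forest also exposes per-feature importances, which indicate which weather variables drive the prediction; a short sketch pairing them with the column names:

In [ ]:
for name, imp in sorted(zip(X.columns, rf.feature_importances_),
                        key=lambda t: t[1], reverse=True):
    print(name, imp)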

In [60]:
from sklearn.externals import joblib

In [61]:
joblib.dump(rf, "clf_rf.db")


Out[61]:
['clf_rf.db']
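
The saved model can be restored later with joblib.load; a minimal round-trip sketch:

In [ ]:
clf = joblib.load("clf_rf.db")
clf.predict(X_val[:5])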

In [ ]: