Trying Out Machine Learning

What this chapter covers

  • Reusing the data prepared in the previous chapter
  • Running machine learning
  • Evaluating the learned models
  • Ways to improve the results

Steps covered in this chapter

  • Splitting into training and test data
  • Training with sklearn
  • Evaluation
  • Cross-validation
  • Grid search
  • Trying several models
  • Re-checking the evaluation results

In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_pickle("df.db")

Splitting into training and test data

http://qiita.com/terapyon/items/8f8d3518ee8eeb4f96b2


In [4]:
df.head()


Out[4]:
報告数 流行 増加 平均気温(℃) 最高気温(℃) 最低気温(℃) 平均湿度(%) 最小相対湿度(%) 平均現地気圧(hPa) 降水量の合計(mm) 日照時間(時間) 平均風速(m/s)
2014-01-01 0.178571 0 0 9.8 13.7 3.9 54.0 37.0 1005.3 0.0 9.2 5.3
2014-01-02 0.178571 0 0 8.0 12.9 4.4 41.0 26.0 1011.3 0.0 9.1 3.0
2014-01-03 0.178571 0 0 5.9 9.9 2.7 43.0 32.0 1014.9 0.0 4.1 1.6
2014-01-04 0.178571 0 0 6.7 11.5 2.1 47.0 29.0 1009.5 0.0 5.9 2.4
2014-01-05 0.178571 0 0 4.4 6.9 2.3 40.0 28.0 1016.6 0.0 1.1 2.5

In [5]:
X = df.iloc[:, 3:]

In [6]:
X.head()


Out[6]:
平均気温(℃) 最高気温(℃) 最低気温(℃) 平均湿度(%) 最小相対湿度(%) 平均現地気圧(hPa) 降水量の合計(mm) 日照時間(時間) 平均風速(m/s)
2014-01-01 9.8 13.7 3.9 54.0 37.0 1005.3 0.0 9.2 5.3
2014-01-02 8.0 12.9 4.4 41.0 26.0 1011.3 0.0 9.1 3.0
2014-01-03 5.9 9.9 2.7 43.0 32.0 1014.9 0.0 4.1 1.6
2014-01-04 6.7 11.5 2.1 47.0 29.0 1009.5 0.0 5.9 2.4
2014-01-05 4.4 6.9 2.3 40.0 28.0 1016.6 0.0 1.1 2.5

In [7]:
y = df['流行']

In [8]:
y.head()


Out[8]:
2014-01-01    0
2014-01-02    0
2014-01-03    0
2014-01-04    0
2014-01-05    0
Name: 流行, dtype: int32

In [9]:
from sklearn.model_selection import train_test_split

In [10]:
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8)
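
Note that train_test_split shuffles before splitting, so the split (and every score below) changes from run to run. A minimal sketch that pins random_state for reproducibility (the seed value 0 is an arbitrary choice, not from the original):

In [ ]:
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=0)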

Logistic regression


In [11]:
from sklearn.linear_model import LogisticRegression

In [12]:
clf = LogisticRegression()

In [13]:
clf.fit(X_train, y_train)


Out[13]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [14]:
y_train_pred = clf.predict(X_train)

Checking the accuracy


In [15]:
from sklearn.metrics import accuracy_score

In [16]:
accuracy_score(y_train, y_train_pred)


Out[16]:
0.87671232876712324

In [17]:
y_val_pred = clf.predict(X_val)

In [18]:
accuracy_score(y_val, y_val_pred)


Out[18]:
0.87727272727272732

Confusion matrix


In [19]:
from sklearn.metrics import confusion_matrix

In [20]:
cm = confusion_matrix(y_val, y_val_pred)

In [21]:
cm


Out[21]:
array([[168,   9],
       [ 18,  25]])

In [22]:
cm_t = confusion_matrix(y_train, y_train_pred)
cm_t


Out[22]:
array([[682,  48],
       [ 60,  86]])

Evaluating the confusion matrix

In scikit-learn's confusion_matrix, rows are the true labels and columns are the predicted labels. Treating class 1 (流行, i.e. an epidemic) as the positive class, the layout for this binary problem is:

[[TN, FP],
 [FN, TP]]
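
For a binary problem the four cells can also be unpacked directly; a small sketch using ravel (this pattern is an addition, not in the original notebook):

In [ ]:
tn, fp, fn, tp = confusion_matrix(y_val, y_val_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)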

Precision, recall, and F1 score (f1-score)

Precision

  • Of the samples predicted as class 0 (not epidemic), the fraction that truly are class 0: 168 / (168 + 18) ≈ 0.90
  • Of the samples predicted as class 1 (epidemic), the fraction that truly are class 1: 25 / (25 + 9) ≈ 0.74

Recall

  • Of the samples that truly are class 0, the fraction predicted as class 0: 168 / (168 + 9) ≈ 0.95
  • Of the samples that truly are class 1, the fraction predicted as class 1: 25 / (25 + 18) ≈ 0.58

F1 score

2 / (1/precision + 1/recall) = 2 * precision * recall / (precision + recall)

  • F1 for class 0: 2 * 0.90 * 0.95 / (0.90 + 0.95) ≈ 0.93
  • F1 for class 1: 2 * 0.74 * 0.58 / (0.74 + 0.58) ≈ 0.65

(All numbers are read off the confusion matrix cm above and match the classification_report output below.)
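
These hand computations can be checked against scikit-learn's per-class metric functions; a minimal sketch (pos_label selects which class is treated as positive):

In [ ]:
from sklearn.metrics import precision_score, recall_score, f1_score

for label in (0, 1):
    print(label,
          precision_score(y_val, y_val_pred, pos_label=label),
          recall_score(y_val, y_val_pred, pos_label=label),
          f1_score(y_val, y_val_pred, pos_label=label))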

In [24]:
from sklearn.metrics import classification_report

In [25]:
print(classification_report(y_val, y_val_pred))


             precision    recall  f1-score   support

          0       0.90      0.95      0.93       177
          1       0.74      0.58      0.65        43

avg / total       0.87      0.88      0.87       220

Wrapping the reporting in a reusable function


In [26]:
def report(y, pred):
    print(accuracy_score(y, pred))
    cm = confusion_matrix(y, pred)
    print(cm)
    cr = classification_report(y, pred)
    print(cr)

In [27]:
report(y_train, y_train_pred)


0.876712328767
[[682  48]
 [ 60  86]]
             precision    recall  f1-score   support

          0       0.92      0.93      0.93       730
          1       0.64      0.59      0.61       146

avg / total       0.87      0.88      0.87       876

Wrapping training through evaluation in a function


In [28]:
def fit_to_pred(clf, X_train, X_val, y_train, y_val):
    # train
    clf.fit(X_train, y_train)
    
    # evaluate on the training data
    y_train_pred = clf.predict(X_train)
    print("y_train_pred: ")
    report(y_train, y_train_pred)
    
    # evaluate on the validation data
    y_val_pred = clf.predict(X_val)
    print("y_val_pred: ")
    report(y_val, y_val_pred)
    
    # return the trained model
    return clf

In [29]:
clf = LogisticRegression()
fit_to_pred(clf, X_train, X_val, y_train, y_val)


y_train_pred: 
0.876712328767
[[682  48]
 [ 60  86]]
             precision    recall  f1-score   support

          0       0.92      0.93      0.93       730
          1       0.64      0.59      0.61       146

avg / total       0.87      0.88      0.87       876

y_val_pred: 
0.877272727273
[[168   9]
 [ 18  25]]
             precision    recall  f1-score   support

          0       0.90      0.95      0.93       177
          1       0.74      0.58      0.65        43

avg / total       0.87      0.88      0.87       220

Out[29]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Other machine learning algorithms

  • Support vector machine: SVC (linear kernel)
  • Kernel SVM (RBF kernel)
  • Decision tree: DecisionTreeClassifier
  • Random forest: RandomForestClassifier
  • k-nearest neighbors: KNeighborsClassifier

In [30]:
from sklearn.svm import SVC

In [31]:
svc = SVC(kernel="linear")
fit_to_pred(svc, X_train, X_val, y_train, y_val)


y_train_pred: 
0.881278538813
[[681  49]
 [ 55  91]]
             precision    recall  f1-score   support

          0       0.93      0.93      0.93       730
          1       0.65      0.62      0.64       146

avg / total       0.88      0.88      0.88       876

y_val_pred: 
0.859090909091
[[167  10]
 [ 21  22]]
             precision    recall  f1-score   support

          0       0.89      0.94      0.92       177
          1       0.69      0.51      0.59        43

avg / total       0.85      0.86      0.85       220

Out[31]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [32]:
k_svc = SVC(kernel="rbf")
fit_to_pred(k_svc, X_train, X_val, y_train, y_val)


y_train_pred: 
0.996575342466
[[730   0]
 [  3 143]]
             precision    recall  f1-score   support

          0       1.00      1.00      1.00       730
          1       1.00      0.98      0.99       146

avg / total       1.00      1.00      1.00       876

y_val_pred: 
0.804545454545
[[177   0]
 [ 43   0]]
             precision    recall  f1-score   support

          0       0.80      1.00      0.89       177
          1       0.00      0.00      0.00        43

avg / total       0.65      0.80      0.72       220

/Users/terapyon/dev/scsk/env/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
Out[32]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
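
The RBF SVM fits the training data almost perfectly yet predicts class 0 for every validation sample, which is what triggers the UndefinedMetricWarning above. RBF kernels are sensitive to feature scale, and these features range from a few ℃ to around 1000 hPa. One plausible remedy (an assumption, not tried in the original) is to standardize the features inside a Pipeline:

In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# scale each feature to zero mean / unit variance before the SVM
scaled_svc = Pipeline([("scale", StandardScaler()),
                       ("svc", SVC(kernel="rbf"))])
fit_to_pred(scaled_svc, X_train, X_val, y_train, y_val)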

In [33]:
from sklearn.tree import DecisionTreeClassifier

In [34]:
tree = DecisionTreeClassifier(max_depth=2)
fit_to_pred(tree, X_train, X_val, y_train, y_val)


y_train_pred: 
0.867579908676
[[709  21]
 [ 95  51]]
             precision    recall  f1-score   support

          0       0.88      0.97      0.92       730
          1       0.71      0.35      0.47       146

avg / total       0.85      0.87      0.85       876

y_val_pred: 
0.840909090909
[[171   6]
 [ 29  14]]
             precision    recall  f1-score   support

          0       0.85      0.97      0.91       177
          1       0.70      0.33      0.44        43

avg / total       0.82      0.84      0.82       220

Out[34]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [35]:
from sklearn.ensemble import RandomForestClassifier

In [36]:
rf = RandomForestClassifier()
fit_to_pred(rf, X_train, X_val, y_train, y_val)


y_train_pred: 
0.996575342466
[[729   1]
 [  2 144]]
             precision    recall  f1-score   support

          0       1.00      1.00      1.00       730
          1       0.99      0.99      0.99       146

avg / total       1.00      1.00      1.00       876

y_val_pred: 
0.845454545455
[[170   7]
 [ 27  16]]
             precision    recall  f1-score   support

          0       0.86      0.96      0.91       177
          1       0.70      0.37      0.48        43

avg / total       0.83      0.85      0.83       220

Out[36]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [37]:
from sklearn.neighbors import KNeighborsClassifier

In [38]:
knn = KNeighborsClassifier()
fit_to_pred(knn, X_train, X_val, y_train, y_val)


y_train_pred: 
0.897260273973
[[693  37]
 [ 53  93]]
             precision    recall  f1-score   support

          0       0.93      0.95      0.94       730
          1       0.72      0.64      0.67       146

avg / total       0.89      0.90      0.89       876

y_val_pred: 
0.827272727273
[[166  11]
 [ 27  16]]
             precision    recall  f1-score   support

          0       0.86      0.94      0.90       177
          1       0.59      0.37      0.46        43

avg / total       0.81      0.83      0.81       220

Out[38]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
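
Each model above was fit in its own cell; the same comparison can also be written as one loop over the classifiers (a compact sketch reusing fit_to_pred with the settings already shown):

In [ ]:
models = [("logistic", LogisticRegression()),
          ("linear_svc", SVC(kernel="linear")),
          ("rbf_svc", SVC(kernel="rbf")),
          ("tree", DecisionTreeClassifier(max_depth=2)),
          ("forest", RandomForestClassifier()),
          ("knn", KNeighborsClassifier())]
for name, model in models:
    print("===", name, "===")
    fit_to_pred(model, X_train, X_val, y_train, y_val)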

Cross-validation


In [39]:
from sklearn.model_selection import cross_val_score

In [40]:
from sklearn.model_selection import KFold

In [41]:
cv = KFold(5, shuffle=True)

In [42]:
clf = LogisticRegression()
cross_val_score(clf, X, y, cv=cv)


Out[42]:
array([ 0.83636364,  0.87214612,  0.88584475,  0.89954338,  0.85388128])

In [43]:
k_svc = SVC(kernel="rbf")
cross_val_score(k_svc, X, y, cv=cv)


Out[43]:
array([ 0.79545455,  0.85844749,  0.80821918,  0.85388128,  0.83561644])

In [44]:
rf = RandomForestClassifier()
cross_val_score(rf, X, y, cv=cv)


Out[44]:
array([ 0.84090909,  0.87671233,  0.84474886,  0.85388128,  0.81278539])
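
A single number per model makes comparison easier; a small sketch that averages the five fold scores (mean and standard deviation):

In [ ]:
for name, model in [("logistic", LogisticRegression()),
                    ("rbf_svc", SVC(kernel="rbf")),
                    ("forest", RandomForestClassifier())]:
    scores = cross_val_score(model, X, y, cv=cv)
    print(name, scores.mean(), scores.std())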

Evaluating with the F1 score


In [45]:
clf = LogisticRegression()
cross_val_score(clf, X, y, cv=cv, scoring="f1")


Out[45]:
array([ 0.58823529,  0.55384615,  0.5       ,  0.66666667,  0.63736264])

In [46]:
k_svc = SVC(kernel="rbf")
cross_val_score(k_svc, X, y, cv=cv, scoring="f1")


/Users/terapyon/dev/scsk/env/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
/Users/terapyon/dev/scsk/env/lib/python3.6/site-packages/sklearn/metrics/classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
Out[46]:
array([ 0.        ,  0.        ,  0.08      ,  0.12121212,  0.05714286])

In [47]:
rf = RandomForestClassifier()
cross_val_score(rf, X, y, cv=cv, scoring="f1")


Out[47]:
array([ 0.23255814,  0.45070423,  0.49230769,  0.48717949,  0.50666667])

Grid search


In [48]:
from sklearn.model_selection import GridSearchCV

In [49]:
param_grid = {'max_depth': [2, 3, 4, 5, 10, 15, 20, 30], 'n_estimators': [2, 3, 4, 5, 10, 20, 30, 40]}

In [50]:
rf = RandomForestClassifier(max_depth=2, n_estimators=2)

In [51]:
grid_search = GridSearchCV(rf, param_grid, cv=cv, n_jobs=-1, verbose=1)

In [52]:
grid_search.fit(X, y)


Fitting 5 folds for each of 64 candidates, totalling 320 fits
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed:    4.6s finished
Out[52]:
GridSearchCV(cv=KFold(n_splits=5, random_state=None, shuffle=True),
       error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=2, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_depth': [2, 3, 4, 5, 10, 15, 20, 30], 'n_estimators': [2, 3, 4, 5, 10, 20, 30, 40]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=1)

In [53]:
grid_search.best_score_


Out[53]:
0.87317518248175185

In [54]:
grid_search.best_params_


Out[54]:
{'max_depth': 2, 'n_estimators': 40}

In [55]:
rf = RandomForestClassifier(max_depth=2, n_estimators=2)
grid_search = GridSearchCV(rf, param_grid, cv=cv, n_jobs=-1, verbose=1, scoring='f1')
grid_search.fit(X, y)


Fitting 5 folds for each of 64 candidates, totalling 320 fits
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed:    4.9s finished
Out[55]:
GridSearchCV(cv=KFold(n_splits=5, random_state=None, shuffle=True),
       error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=2, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_depth': [2, 3, 4, 5, 10, 15, 20, 30], 'n_estimators': [2, 3, 4, 5, 10, 20, 30, 40]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='f1', verbose=1)

In [56]:
print(grid_search.best_score_)
print(grid_search.best_params_)


0.628228780268
{'max_depth': 2, 'n_estimators': 3}
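
The best parameters fluctuate between runs because both the folds and the forests are randomized. GridSearchCV keeps the per-candidate scores in cv_results_, which helps show how close the runners-up are; a minimal sketch loading them into a DataFrame:

In [ ]:
results = pd.DataFrame(grid_search.cv_results_)
cols = ["param_max_depth", "param_n_estimators", "mean_test_score"]
results[cols].sort_values("mean_test_score", ascending=False).head()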

In [57]:
rf = RandomForestClassifier(max_depth=3, n_estimators=40)
cross_val_score(rf, X, y, cv=cv, scoring="f1")


Out[57]:
array([ 0.47368421,  0.63636364,  0.63636364,  0.66666667,  0.68493151])

Final check


In [58]:
rf = RandomForestClassifier(max_depth=3, n_estimators=40)
fit_to_pred(rf, X_train, X_val, y_train, y_val)


y_train_pred: 
0.883561643836
[[681  49]
 [ 53  93]]
             precision    recall  f1-score   support

          0       0.93      0.93      0.93       730
          1       0.65      0.64      0.65       146

avg / total       0.88      0.88      0.88       876

y_val_pred: 
0.877272727273
[[170   7]
 [ 20  23]]
             precision    recall  f1-score   support

          0       0.89      0.96      0.93       177
          1       0.77      0.53      0.63        43

avg / total       0.87      0.88      0.87       220

Out[58]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=40, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [59]:
rf.predict(X_val)


Out[59]:
array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0], dtype=int32)
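
A random forest also exposes per-feature importances, which indicate which weather variables drive the prediction; a short sketch pairing them with the column names:

In [ ]:
for name, imp in sorted(zip(X.columns, rf.feature_importances_),
                        key=lambda t: t[1], reverse=True):
    print(name, imp)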

In [60]:
from sklearn.externals import joblib

In [61]:
joblib.dump(rf, "clf_rf.db")


Out[61]:
['clf_rf.db']
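
The saved model can be restored later with joblib.load; a minimal round-trip sketch:

In [ ]:
clf = joblib.load("clf_rf.db")
clf.predict(X_val[:5])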

In [ ]: