Background

In this notebook, we show how to feed the embeddings from the language model into an MLP classifier. We take the GitHub repo kubernetes/kubernetes as an example, do transfer learning, and show the results.

Data

combined_sig_df.pkl https://storage.googleapis.com/issue_label_bot/notebook_files/combined_sig_df.pkl This file contains the GitHub issue contents: titles, bodies, and labels.

feat_df.csv https://storage.googleapis.com/issue_label_bot/notebook_files/feat_df.csv This file contains 1600-dimensional embeddings of 14,390 issues from kubernetes/kubernetes.
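If the files are not already local, they can be fetched from the URLs above. A minimal sketch using the Python standard library (assumes network access):

import urllib.request

base_url = 'https://storage.googleapis.com/issue_label_bot/notebook_files/'
for name in ['combined_sig_df.pkl', 'feat_df.csv']:
    # Download each data file into the working directory.
    urllib.request.urlretrieve(base_url + name, name)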


In [1]:
import pandas as pd

combined_sig_df = pd.read_pickle('combined_sig_df.pkl')
feat_df = pd.read_csv('feat_df.csv')

In [2]:
# github issue contents
combined_sig_df.head(3)


Out[2]:
index updated_at last_time labels repo url title body len_labels index ... sig/aws sig/cluster-ops sig/multicluster sig/instrumentation sig/openstack sig/contributor-experience sig/architecture sig/vmware sig/service-catalog part
0 0 2018-02-24 15:09:51 UTC 2018-02-24 15:09:51 UTC [lifecycle/rotten, priority/backlog, sig/clust... kubernetes/kubernetes "https://api.github.com/repos/kubernetes/kuber... minions ip does not follow --hostname-override... according to 9267, the kubelet should registe... 4 0 ... 0 0 0 0 0 0 0 0 0 6
1 10 2018-08-09 13:48:32 UTC 2018-08-09 13:48:32 UTC [lifecycle/stale, sig/cluster-lifecycle] kubernetes/kubernetes "https://api.github.com/repos/kubernetes/kuber... node-pool upgrade reverts reserved ip to ephem... <!-- thanks for filing an issue! before hittin... 2 1 ... 0 0 0 0 0 0 0 0 0 6
2 12 2018-05-24 19:39:07 UTC 2018-05-24 19:39:07 UTC [help wanted, kind/feature, lifecycle/rotten, ... kubernetes/kubernetes "https://api.github.com/repos/kubernetes/kuber... allow defining different timeouts for differen... or at least different for steaming and non-str... 5 2 ... 0 0 0 0 0 0 0 0 0 6

3 rows × 39 columns


In [3]:
# embeddings of github issues [mean, max]
feat_df.head(3)


Out[3]:
f_0 f_1 f_2 f_3 f_4 f_5 f_6 f_7 f_8 f_9 ... f_1590 f_1591 f_1592 f_1593 f_1594 f_1595 f_1596 f_1597 f_1598 f_1599
0 -0.059059 -0.024261 -0.014145 -0.047195 -0.104533 0.072735 -0.024336 0.040534 0.037922 -0.185270 ... 0.031312 0.133795 0.384244 0.147940 0.404347 0.355213 0.269353 0.227154 0.178882 0.309761
1 -0.040260 0.046072 -0.023587 -0.024443 0.013061 -0.051123 0.036322 0.035255 0.003225 -0.072199 ... 0.280282 0.399076 0.573803 0.797381 0.377034 0.601118 0.275607 0.765224 0.337764 0.480431
2 -0.005449 -0.030515 -0.019329 0.018486 0.001672 0.008174 -0.009112 -0.053917 0.073985 0.034680 ... 0.293691 0.019976 0.516499 0.465600 0.233908 0.161443 0.155191 0.228200 0.252346 0.304970

3 rows × 1600 columns


In [4]:
# count the labels in the holdout set
from collections import Counter
c = Counter()

for row in combined_sig_df[combined_sig_df.part == 6].labels:
    c.update(row)

Split data

Split the data into two sets according to the part column; issues with part == 6 form the holdout set. There are 28 labels in total because 28 sig labels have at least 30 issues each; this preprocessing is done in the EvaluateEmbeddings notebook.
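The label count can be verified directly from the one-hot sig/ columns. A quick check (the same label_columns list is defined again below for the split):

label_columns = [col for col in combined_sig_df.columns if 'sig/' in col]
print(len(label_columns))                                  # 28
print((combined_sig_df[label_columns].sum() >= 30).all())  # should print True, given the preprocessing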


In [5]:
train_mask = combined_sig_df.part != 6
holdout_mask = ~train_mask

In [6]:
X = feat_df[train_mask].values
label_columns = [x for x in combined_sig_df.columns if 'sig/' in x]
y = combined_sig_df[label_columns][train_mask].values

print(X.shape)
print(y.shape)


(7236, 1600)
(7236, 28)

In [7]:
X_holdout = feat_df[holdout_mask].values
y_holdout = combined_sig_df[label_columns][holdout_mask].values

print(X_holdout.shape)
print(y_holdout.shape)


(7154, 1600)
(7154, 28)

In [8]:
from sklearn.metrics import roc_auc_score

def calculate_auc(predictions):
    auc_scores = []
    counts = []

    for i, l in enumerate(label_columns):
        y_hat = predictions[:, i]
        y = y_holdout[:, i]
        auc = roc_auc_score(y_true=y, y_score=y_hat)
        auc_scores.append(auc)
        counts.append(c[l])
    
    df = pd.DataFrame({'label': label_columns, 'auc': auc_scores, 'count': counts})    
    display(df)
    weightedavg_auc = df.apply(lambda x: x.auc * x['count'], axis=1).sum() / df['count'].sum()
    print(f'Weighted Average AUC: {weightedavg_auc}')
    return df, weightedavg_auc
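The last two lines compute a count-weighted mean of the per-label AUCs. The same value can be obtained more directly with numpy; an equivalent sketch, assuming the df returned by calculate_auc:

import numpy as np

# Count-weighted mean of per-label AUCs, equivalent to the apply/sum above.
weightedavg_auc = np.average(df['auc'], weights=df['count'])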

Sklearn MLP

Feed the embeddings from the language model to the MLP classifier.


In [9]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

In [10]:
mlp = MLPClassifier(early_stopping=True, n_iter_no_change=5, max_iter=500, solver='adam', 
                   random_state=1234)

In [11]:
mlp.fit(X, y)


Out[11]:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=True, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=500, momentum=0.9,
       n_iter_no_change=5, nesterovs_momentum=True, power_t=0.5,
       random_state=1234, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [12]:
mlp_predictions = mlp.predict_proba(X_holdout)

In [13]:
mlp_df, mlp_auc = calculate_auc(mlp_predictions)


label auc count
0 sig/cluster-lifecycle 0.863932 498
1 sig/node 0.884496 1311
2 sig/api-machinery 0.892453 1090
3 sig/scalability 0.907244 258
4 sig/cli 0.935913 544
5 sig/autoscaling 0.949778 100
6 sig/network 0.945694 923
7 sig/cloud-provider 0.934848 29
8 sig/storage 0.965592 824
9 sig/scheduling 0.926638 397
10 sig/apps 0.893835 440
11 sig/windows 0.973496 84
12 sig/auth 0.952659 292
13 sig/docs 0.967213 110
14 sig/testing 0.908711 361
15 sig/federation 0.952171 85
16 sig/gcp 0.808062 98
17 sig/release 0.947304 158
18 sig/azure 0.966767 157
19 sig/aws 0.942163 217
20 sig/cluster-ops 0.733281 31
21 sig/multicluster 0.953977 81
22 sig/instrumentation 0.938337 124
23 sig/openstack 0.946715 70
24 sig/contributor-experience 0.917553 72
25 sig/architecture 0.848095 52
26 sig/vmware 0.923500 20
27 sig/service-catalog 0.736944 21
Weighted Average AUC: 0.9168608333252417

Precision & Recall

Calculate precision and recall to help us choose an appropriate probability threshold.


In [14]:
import numpy as np

In [15]:
def calculate_max_range_count(x):
    max_range_count = [0] * 11 # [0,0.1), [0.1,0.2), ... , [0.9,1), [1,1]
    for i in x:
        max_range_count[int(max(i) // 0.1)] += 1
    
    thresholds_lower = [0.1 * i for i in range(11)]
    thresholds_upper = [0.1 * (i+1) for i in range(10)] + [1]
    df = pd.DataFrame({'l': thresholds_lower, 'u': thresholds_upper, 'count': max_range_count})
    display(df)
    return df, max_range_count

In [16]:
_, _ = calculate_max_range_count(mlp_predictions)


l u count
0 0.0 0.1 18
1 0.1 0.2 234
2 0.2 0.3 479
3 0.3 0.4 640
4 0.4 0.5 744
5 0.5 0.6 699
6 0.6 0.7 797
7 0.7 0.8 790
8 0.8 0.9 984
9 0.9 1.0 1769
10 1.0 1.0 0

In [17]:
def calculate_result(y_true, y_pred, threshold=0.0):
    total_true = np.array([0] * len(y_pred[0]))
    total_pred_true = np.array([0] * len(y_pred[0]))
    pred_correct = np.array([0] * len(y_pred[0]))
    for i in range(len(y_pred)):
        y_true_label = np.where(y_true[i] == 1)[0]
        total_true[y_true_label] += 1

        y_pred_true = np.where(y_pred[i] >= threshold)[0]
        total_pred_true[y_pred_true] += 1
        
        for j in y_true_label:
            if j in y_pred_true:
                pred_correct[j] += 1

    # Labels never predicted above the threshold yield NaN precision (0/0).
    df = pd.DataFrame({'label': label_columns, 'precision': (pred_correct / total_pred_true), 'recall': (pred_correct / total_true)})
    print(f'Threshold: {threshold}')
    display(df)
    return df, (pred_correct / total_pred_true), (pred_correct / total_true)
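These per-label numbers can be cross-checked with sklearn.metrics, which supports multilabel indicator targets. A sketch, assuming the variables above (where the manual helper yields NaN, sklearn reports 0.0 with a warning):

from sklearn.metrics import precision_score, recall_score

threshold = 0.5  # example value
y_bin = (mlp_predictions >= threshold).astype(int)
# average=None returns one score per label column.
precision = precision_score(y_holdout, y_bin, average=None)
recall = recall_score(y_holdout, y_bin, average=None)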

In [18]:
_, _, _ = calculate_result(y_holdout, mlp_predictions, threshold=0.0)


Threshold: 0.0
label precision recall
0 sig/cluster-lifecycle 0.069611 1.0
1 sig/node 0.183114 1.0
2 sig/api-machinery 0.152362 1.0
3 sig/scalability 0.036064 1.0
4 sig/cli 0.076041 1.0
5 sig/autoscaling 0.013978 1.0
6 sig/network 0.128879 1.0
7 sig/cloud-provider 0.004054 1.0
8 sig/storage 0.115041 1.0
9 sig/scheduling 0.055493 1.0
10 sig/apps 0.061504 1.0
11 sig/windows 0.011742 1.0
12 sig/auth 0.040816 1.0
13 sig/docs 0.015376 1.0
14 sig/testing 0.050461 1.0
15 sig/federation 0.011881 1.0
16 sig/gcp 0.013699 1.0
17 sig/release 0.022086 1.0
18 sig/azure 0.021946 1.0
19 sig/aws 0.030333 1.0
20 sig/cluster-ops 0.004333 1.0
21 sig/multicluster 0.011322 1.0
22 sig/instrumentation 0.017333 1.0
23 sig/openstack 0.009785 1.0
24 sig/contributor-experience 0.010064 1.0
25 sig/architecture 0.007269 1.0
26 sig/vmware 0.002796 1.0
27 sig/service-catalog 0.002935 1.0

In [19]:
_, _, _ = calculate_result(y_holdout, mlp_predictions, threshold=0.3)


Threshold: 0.3
/home/chunhsiang/Documents/Issue-Label-Bot/Issue-Label-Bot-v2/env/lib/python3.6/site-packages/ipykernel_launcher.py:16: RuntimeWarning: invalid value encountered in true_divide
  app.launch_new_instance()
label precision recall
0 sig/cluster-lifecycle 0.567686 0.261044
1 sig/node 0.548829 0.733588
2 sig/api-machinery 0.488928 0.769725
3 sig/scalability 0.623596 0.430233
4 sig/cli 0.729730 0.694853
5 sig/autoscaling 0.618421 0.470000
6 sig/network 0.603730 0.842733
7 sig/cloud-provider NaN 0.000000
8 sig/storage 0.796651 0.809235
9 sig/scheduling 0.676282 0.531486
10 sig/apps 0.654378 0.322727
11 sig/windows 0.750000 0.607143
12 sig/auth 0.638060 0.585616
13 sig/docs 0.525000 0.381818
14 sig/testing 0.450000 0.398892
15 sig/federation 0.532468 0.482353
16 sig/gcp 0.428571 0.153061
17 sig/release 0.439490 0.436709
18 sig/azure 0.838095 0.560510
19 sig/aws 0.741627 0.714286
20 sig/cluster-ops NaN 0.000000
21 sig/multicluster 0.467742 0.358025
22 sig/instrumentation 0.568182 0.403226
23 sig/openstack 0.447368 0.242857
24 sig/contributor-experience 0.369231 0.333333
25 sig/architecture NaN 0.000000
26 sig/vmware 1.000000 0.050000
27 sig/service-catalog NaN 0.000000
/home/chunhsiang/Documents/Issue-Label-Bot/Issue-Label-Bot-v2/env/lib/python3.6/site-packages/ipykernel_launcher.py:19: RuntimeWarning: invalid value encountered in true_divide

In [20]:
_, _, _ = calculate_result(y_holdout, mlp_predictions, threshold=0.5)


Threshold: 0.5
/home/chunhsiang/Documents/Issue-Label-Bot/Issue-Label-Bot-v2/env/lib/python3.6/site-packages/ipykernel_launcher.py:16: RuntimeWarning: invalid value encountered in true_divide
  app.launch_new_instance()
label precision recall
0 sig/cluster-lifecycle 0.704082 0.138554
1 sig/node 0.687279 0.593893
2 sig/api-machinery 0.631783 0.598165
3 sig/scalability 0.717391 0.255814
4 sig/cli 0.795455 0.579044
5 sig/autoscaling 0.829268 0.340000
6 sig/network 0.714136 0.739696
7 sig/cloud-provider NaN 0.000000
8 sig/storage 0.860399 0.733900
9 sig/scheduling 0.781095 0.395466
10 sig/apps 0.679612 0.159091
11 sig/windows 0.804348 0.440476
12 sig/auth 0.747312 0.476027
13 sig/docs 0.564103 0.200000
14 sig/testing 0.592857 0.229917
15 sig/federation 0.638889 0.270588
16 sig/gcp 0.777778 0.071429
17 sig/release 0.486842 0.234177
18 sig/azure 0.922078 0.452229
19 sig/aws 0.826087 0.612903
20 sig/cluster-ops NaN 0.000000
21 sig/multicluster 0.450000 0.222222
22 sig/instrumentation 0.642857 0.217742
23 sig/openstack 0.555556 0.071429
24 sig/contributor-experience 0.560000 0.194444
25 sig/architecture NaN 0.000000
26 sig/vmware NaN 0.000000
27 sig/service-catalog NaN 0.000000
/home/chunhsiang/Documents/Issue-Label-Bot/Issue-Label-Bot-v2/env/lib/python3.6/site-packages/ipykernel_launcher.py:19: RuntimeWarning: invalid value encountered in true_divide

In [21]:
_, _, _ = calculate_result(y_holdout, mlp_predictions, threshold=0.7)


Threshold: 0.7
/home/chunhsiang/Documents/Issue-Label-Bot/Issue-Label-Bot-v2/env/lib/python3.6/site-packages/ipykernel_launcher.py:16: RuntimeWarning: invalid value encountered in true_divide
  app.launch_new_instance()
label precision recall
0 sig/cluster-lifecycle 0.724138 0.042169
1 sig/node 0.798875 0.433588
2 sig/api-machinery 0.766831 0.428440
3 sig/scalability 0.900000 0.139535
4 sig/cli 0.865979 0.463235
5 sig/autoscaling 0.952381 0.200000
6 sig/network 0.790569 0.618221
7 sig/cloud-provider NaN 0.000000
8 sig/storage 0.884746 0.634265
9 sig/scheduling 0.900000 0.294710
10 sig/apps 0.714286 0.068182
11 sig/windows 0.896552 0.309524
12 sig/auth 0.808696 0.318493
13 sig/docs 0.687500 0.100000
14 sig/testing 0.637931 0.102493
15 sig/federation 0.642857 0.105882
16 sig/gcp 1.000000 0.010204
17 sig/release 0.461538 0.113924
18 sig/azure 0.941176 0.305732
19 sig/aws 0.893204 0.423963
20 sig/cluster-ops NaN 0.000000
21 sig/multicluster 0.625000 0.185185
22 sig/instrumentation 0.800000 0.129032
23 sig/openstack NaN 0.000000
24 sig/contributor-experience 0.700000 0.097222
25 sig/architecture NaN 0.000000
26 sig/vmware NaN 0.000000
27 sig/service-catalog NaN 0.000000
/home/chunhsiang/Documents/Issue-Label-Bot/Issue-Label-Bot-v2/env/lib/python3.6/site-packages/ipykernel_launcher.py:19: RuntimeWarning: invalid value encountered in true_divide

Tune the MLP with a grid search. The full parameter grid is commented out below; a reduced grid is used here to keep the run time manageable.


In [22]:
# params = {'hidden_layer_sizes': [(100,), (200,), (400, ), (50, 50), (100, 100), (200, 200)],
#               'alpha': [.001, .01, .1, 1, 10],
#               'learning_rate': ['constant', 'adaptive'],
#               'learning_rate_init': [.001, .01, .1]}

params = {'hidden_layer_sizes': [(100,), (200,), (400, ), (50, 50), (100, 100), (200, 200)],
              'alpha': [.001],
              'learning_rate': ['adaptive'],
              'learning_rate_init': [.001]}

mlp_clf = MLPClassifier(early_stopping=True, validation_fraction=.2, n_iter_no_change=4, max_iter=500)

gscvmlp = GridSearchCV(mlp_clf, params, cv=5, n_jobs=-1)

gscvmlp.fit(X, y)


Out[22]:
GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=True, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=500, momentum=0.9,
       n_iter_no_change=4, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.2, verbose=False, warm_start=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'hidden_layer_sizes': [(100,), (200,), (400,), (50, 50), (100, 100), (200, 200)], 'alpha': [0.001], 'learning_rate': ['adaptive'], 'learning_rate_init': [0.001]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [23]:
print(f'The best model from grid search is:\n=====================================\n{gscvmlp.best_estimator_}')


The best model from grid search is:
=====================================
MLPClassifier(activation='relu', alpha=0.001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=True, epsilon=1e-08,
       hidden_layer_sizes=(200, 200), learning_rate='adaptive',
       learning_rate_init=0.001, max_iter=500, momentum=0.9,
       n_iter_no_change=4, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.2, verbose=False, warm_start=False)

In [24]:
mlp_tuned_predictions = gscvmlp.predict_proba(X_holdout)

In [25]:
mlp_tuned_df, mlp_tuned_auc = calculate_auc(mlp_tuned_predictions)


label auc count
0 sig/cluster-lifecycle 0.869233 498
1 sig/node 0.892056 1311
2 sig/api-machinery 0.898887 1090
3 sig/scalability 0.915412 258
4 sig/cli 0.939360 544
5 sig/autoscaling 0.947720 100
6 sig/network 0.943835 923
7 sig/cloud-provider 0.907514 29
8 sig/storage 0.967008 824
9 sig/scheduling 0.930041 397
10 sig/apps 0.906505 440
11 sig/windows 0.976928 84
12 sig/auth 0.948900 292
13 sig/docs 0.951038 110
14 sig/testing 0.914745 361
15 sig/federation 0.950783 85
16 sig/gcp 0.851503 98
17 sig/release 0.943631 158
18 sig/azure 0.965039 157
19 sig/aws 0.947626 217
20 sig/cluster-ops 0.783699 31
21 sig/multicluster 0.951687 81
22 sig/instrumentation 0.936635 124
23 sig/openstack 0.956072 70
24 sig/contributor-experience 0.910175 72
25 sig/architecture 0.801532 52
26 sig/vmware 0.934833 20
27 sig/service-catalog 0.817835 21
Weighted Average AUC: 0.9208602252797448

In [26]:
_, _ = calculate_max_range_count(mlp_tuned_predictions)


l u count
0 0.0 0.1 3
1 0.1 0.2 139
2 0.2 0.3 376
3 0.3 0.4 591
4 0.4 0.5 622
5 0.5 0.6 725
6 0.6 0.7 801
7 0.7 0.8 813
8 0.8 0.9 1013
9 0.9 1.0 2071
10 1.0 1.0 0

In [27]:
_, _, _ = calculate_result(y_holdout, mlp_tuned_predictions, threshold=0.0)


Threshold: 0.0
label precision recall
0 sig/cluster-lifecycle 0.069611 1.0
1 sig/node 0.183114 1.0
2 sig/api-machinery 0.152362 1.0
3 sig/scalability 0.036064 1.0
4 sig/cli 0.076041 1.0
5 sig/autoscaling 0.013978 1.0
6 sig/network 0.128879 1.0
7 sig/cloud-provider 0.004054 1.0
8 sig/storage 0.115041 1.0
9 sig/scheduling 0.055493 1.0
10 sig/apps 0.061504 1.0
11 sig/windows 0.011742 1.0
12 sig/auth 0.040816 1.0
13 sig/docs 0.015376 1.0
14 sig/testing 0.050461 1.0
15 sig/federation 0.011881 1.0
16 sig/gcp 0.013699 1.0
17 sig/release 0.022086 1.0
18 sig/azure 0.021946 1.0
19 sig/aws 0.030333 1.0
20 sig/cluster-ops 0.004333 1.0
21 sig/multicluster 0.011322 1.0
22 sig/instrumentation 0.017333 1.0
23 sig/openstack 0.009785 1.0
24 sig/contributor-experience 0.010064 1.0
25 sig/architecture 0.007269 1.0
26 sig/vmware 0.002796 1.0
27 sig/service-catalog 0.002935 1.0

In [28]:
_, _, _ = calculate_result(y_holdout, mlp_tuned_predictions, threshold=0.7)


Threshold: 0.7
/home/chunhsiang/Documents/Issue-Label-Bot/Issue-Label-Bot-v2/env/lib/python3.6/site-packages/ipykernel_launcher.py:16: RuntimeWarning: invalid value encountered in true_divide
  app.launch_new_instance()
label precision recall
0 sig/cluster-lifecycle 0.700000 0.126506
1 sig/node 0.799247 0.486260
2 sig/api-machinery 0.784790 0.444954
3 sig/scalability 0.744186 0.248062
4 sig/cli 0.828125 0.487132
5 sig/autoscaling 0.843750 0.270000
6 sig/network 0.805121 0.613883
7 sig/cloud-provider NaN 0.000000
8 sig/storage 0.905694 0.618469
9 sig/scheduling 0.870056 0.387909
10 sig/apps 0.792208 0.138636
11 sig/windows 0.848485 0.333333
12 sig/auth 0.855932 0.345890
13 sig/docs 0.608696 0.127273
14 sig/testing 0.782609 0.049861
15 sig/federation 0.750000 0.105882
16 sig/gcp NaN 0.000000
17 sig/release 0.573770 0.221519
18 sig/azure 0.916667 0.420382
19 sig/aws 0.871560 0.437788
20 sig/cluster-ops NaN 0.000000
21 sig/multicluster 0.592593 0.197531
22 sig/instrumentation 0.694444 0.201613
23 sig/openstack 0.714286 0.071429
24 sig/contributor-experience 0.583333 0.097222
25 sig/architecture NaN 0.000000
26 sig/vmware NaN 0.000000
27 sig/service-catalog NaN 0.000000
/home/chunhsiang/Documents/Issue-Label-Bot/Issue-Label-Bot-v2/env/lib/python3.6/site-packages/ipykernel_launcher.py:19: RuntimeWarning: invalid value encountered in true_divide

Save model


In [29]:
import dill as dpickle

In [30]:
with open('mlp_k8s.dpkl', 'wb') as f:
    dpickle.dump(gscvmlp, f)

Load model

Load the model from the pickle file and re-evaluate it.


In [31]:
import dill as dpickle

In [32]:
with open('mlp_k8s.dpkl', 'rb') as f:
    gscvmlp = dpickle.load(f)

In [33]:
mlp_tuned_predictions = gscvmlp.predict_proba(X_holdout)

In [34]:
mlp_tuned_df, mlp_tuned_auc = calculate_auc(mlp_tuned_predictions)


label auc count
0 sig/cluster-lifecycle 0.869233 498
1 sig/node 0.892056 1311
2 sig/api-machinery 0.898887 1090
3 sig/scalability 0.915412 258
4 sig/cli 0.939360 544
5 sig/autoscaling 0.947720 100
6 sig/network 0.943835 923
7 sig/cloud-provider 0.907514 29
8 sig/storage 0.967008 824
9 sig/scheduling 0.930041 397
10 sig/apps 0.906505 440
11 sig/windows 0.976928 84
12 sig/auth 0.948900 292
13 sig/docs 0.951038 110
14 sig/testing 0.914745 361
15 sig/federation 0.950783 85
16 sig/gcp 0.851503 98
17 sig/release 0.943631 158
18 sig/azure 0.965039 157
19 sig/aws 0.947626 217
20 sig/cluster-ops 0.783699 31
21 sig/multicluster 0.951687 81
22 sig/instrumentation 0.936635 124
23 sig/openstack 0.956072 70
24 sig/contributor-experience 0.910175 72
25 sig/architecture 0.801532 52
26 sig/vmware 0.934833 20
27 sig/service-catalog 0.817835 21
Weighted Average AUC: 0.9208602252797448

Write it as a class


In [35]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from collections import Counter
import dill as dpickle
import numpy as np
import pandas as pd

In [41]:
class MLP:
    def __init__(self,
                 counter, # label counter, used for calculating weighted AUC
                 label_columns,
                 activation='relu',
                 alpha=0.0001,
                 early_stopping=True,
                 epsilon=1e-08,
                 hidden_layer_sizes=(100,),
                 learning_rate='constant',
                 learning_rate_init=0.001,
                 max_iter=500,
                 model_file="model.dpkl",
                 momentum=0.9,
                 n_iter_no_change=5,
                 precision_thre=0.7,
                 prob_thre=0.0,
                 random_state=1234,
                 recall_thre=0.5,
                 solver='adam',
                 validation_fraction=0.1):
        self.clf = MLPClassifier(activation=activation,
                                 alpha=alpha,
                                 early_stopping=early_stopping,
                                 epsilon=epsilon,
                                 hidden_layer_sizes=hidden_layer_sizes,
                                 learning_rate=learning_rate,
                                 learning_rate_init=learning_rate_init,
                                 max_iter=max_iter,
                                 momentum=momentum,
                                 n_iter_no_change=n_iter_no_change,
                                 random_state=random_state,
                                 solver=solver,
                                 validation_fraction=validation_fraction)
        self.model_file = model_file
        self.precision_thre = precision_thre
        self.prob_thre = prob_thre
        self.recall_thre = recall_thre
        self.counter = counter
        self.label_columns = label_columns
        self.precision = None
        self.recall = None
        self.exclusion_list = None

    def fit(self, X, y):
        self.clf.fit(X, y)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)

    def calculate_auc(self, y_holdout, predictions):
        auc_scores = []
        counts = []

        for i, l in enumerate(self.label_columns):
            y_hat = predictions[:, i]
            y = y_holdout[:, i]
            auc = roc_auc_score(y_true=y, y_score=y_hat)
            auc_scores.append(auc)
            counts.append(self.counter[l])

        df = pd.DataFrame({'label': self.label_columns, 'auc': auc_scores, 'count': counts})    
        display(df)
        weightedavg_auc = df.apply(lambda x: x.auc * x['count'], axis=1).sum() / df['count'].sum()
        print(f'Weighted Average AUC: {weightedavg_auc}')
        return df, weightedavg_auc
    
    def calculate_max_range_count(self, prob):
        thresholds_lower = [0.1 * i for i in range(11)]
        thresholds_upper = [0.1 * (i+1) for i in range(10)] + [1]
        max_range_count = [0] * 11 # [0,0.1), [0.1,0.2), ... , [0.9,1), [1,1]
        for i in prob:
            max_range_count[int(max(i) // 0.1)] += 1

        df = pd.DataFrame({'l': thresholds_lower, 'u': thresholds_upper, 'count': max_range_count})
        display(df)
        return df, max_range_count
    
    def calculate_result(self, y_true, y_pred, display_table=True, prob_thre=0.0):
        # Record the threshold so the printed value always matches the one used below.
        self.prob_thre = prob_thre

        total_true = np.array([0] * len(y_pred[0]))
        total_pred_true = np.array([0] * len(y_pred[0]))
        pred_correct = np.array([0] * len(y_pred[0]))
        for i in range(len(y_pred)):
            y_true_label = np.where(y_true[i] == 1)[0]
            total_true[y_true_label] += 1

            y_pred_true = np.where(y_pred[i] >= prob_thre)[0]
            total_pred_true[y_pred_true] += 1

            for j in y_true_label:
                if j in y_pred_true:
                    pred_correct[j] += 1

        # Labels never predicted above the threshold yield NaN precision (0/0).
        self.precision = pred_correct / total_pred_true
        self.recall = pred_correct / total_true

        df = pd.DataFrame({'label': self.label_columns, 'precision': self.precision, 'recall': self.recall})
        if display_table:
            print(f'Threshold: {self.prob_thre}')
            display(df)
        return df, self.precision, self.recall

    def find_best_prob_thre(self, y_true, y_pred):
        best_prob_thre = 0
        prec_count = 0
        reca_count = 0
        
        print(f'Precision threshold: {self.precision_thre}\nRecall threshold: {self.recall_thre}')
        thre = 0.0
        while thre < 1:
            _, prec, reca = self.calculate_result(y_true, y_pred, display_table=False, prob_thre=thre)

            pc = 0
            for p in prec:
                if p >= self.precision_thre:
                    pc += 1
            rc = 0
            for r in reca:
                if r >= self.recall_thre:
                    rc += 1

            if pc > prec_count or (pc == prec_count and rc >= reca_count):
                best_prob_thre = thre
                prec_count = pc
                reca_count = rc
            thre += 0.1

        self.best_prob_thre = best_prob_thre
        print(f'Best probability threshold: {best_prob_thre},\n{min(prec_count, reca_count)} labels meet both the precision and recall thresholds')

    def get_exclusion_list(self):
        assert len(self.precision) == len(self.recall)
        self.exclusion_list = []

        for p, r, label in zip(self.precision, self.recall, self.label_columns):
            if p < self.precision_thre or r < self.recall_thre:
                self.exclusion_list.append(label)
        return self.exclusion_list

    def grid_search(self, params, cv=5, n_jobs=-1):
        self.clf = GridSearchCV(self.clf, params, cv=cv, n_jobs=n_jobs)
        
    def save_model(self):
        with open(self.model_file, 'wb') as f:
            dpickle.dump(self.clf, f)

    def load_model(self):
        with open(self.model_file, 'rb') as f:
            self.clf = dpickle.load(f)

In [42]:
c = Counter()

for row in combined_sig_df[combined_sig_df.part == 6].labels:
    c.update(row)

In [43]:
clf = MLP(c, label_columns, early_stopping=True, n_iter_no_change=5, max_iter=500,
          solver='adam', random_state=1234, precision_thre=0.7, recall_thre=0.3)
clf.fit(X, y)
mlp_predictions = clf.predict_proba(X_holdout)
mlp_df, mlp_auc = clf.calculate_auc(y_holdout, mlp_predictions)


label auc count
0 sig/cluster-lifecycle 0.863932 498
1 sig/node 0.884496 1311
2 sig/api-machinery 0.892453 1090
3 sig/scalability 0.907244 258
4 sig/cli 0.935913 544
5 sig/autoscaling 0.949778 100
6 sig/network 0.945694 923
7 sig/cloud-provider 0.934848 29
8 sig/storage 0.965592 824
9 sig/scheduling 0.926638 397
10 sig/apps 0.893835 440
11 sig/windows 0.973496 84
12 sig/auth 0.952659 292
13 sig/docs 0.967213 110
14 sig/testing 0.908711 361
15 sig/federation 0.952171 85
16 sig/gcp 0.808062 98
17 sig/release 0.947304 158
18 sig/azure 0.966767 157
19 sig/aws 0.942163 217
20 sig/cluster-ops 0.733281 31
21 sig/multicluster 0.953977 81
22 sig/instrumentation 0.938337 124
23 sig/openstack 0.946715 70
24 sig/contributor-experience 0.917553 72
25 sig/architecture 0.848095 52
26 sig/vmware 0.923500 20
27 sig/service-catalog 0.736944 21
Weighted Average AUC: 0.9168608333252417

In [44]:
_, _ = clf.calculate_max_range_count(mlp_predictions)


l u count
0 0.0 0.1 18
1 0.1 0.2 234
2 0.2 0.3 479
3 0.3 0.4 640
4 0.4 0.5 744
5 0.5 0.6 699
6 0.6 0.7 797
7 0.7 0.8 790
8 0.8 0.9 984
9 0.9 1.0 1769
10 1.0 1.0 0

In [45]:
_, _, _ = clf.calculate_result(y_holdout, mlp_predictions)


Threshold: 0.0
label precision recall
0 sig/cluster-lifecycle 0.069611 1.0
1 sig/node 0.183114 1.0
2 sig/api-machinery 0.152362 1.0
3 sig/scalability 0.036064 1.0
4 sig/cli 0.076041 1.0
5 sig/autoscaling 0.013978 1.0
6 sig/network 0.128879 1.0
7 sig/cloud-provider 0.004054 1.0
8 sig/storage 0.115041 1.0
9 sig/scheduling 0.055493 1.0
10 sig/apps 0.061504 1.0
11 sig/windows 0.011742 1.0
12 sig/auth 0.040816 1.0
13 sig/docs 0.015376 1.0
14 sig/testing 0.050461 1.0
15 sig/federation 0.011881 1.0
16 sig/gcp 0.013699 1.0
17 sig/release 0.022086 1.0
18 sig/azure 0.021946 1.0
19 sig/aws 0.030333 1.0
20 sig/cluster-ops 0.004333 1.0
21 sig/multicluster 0.011322 1.0
22 sig/instrumentation 0.017333 1.0
23 sig/openstack 0.009785 1.0
24 sig/contributor-experience 0.010064 1.0
25 sig/architecture 0.007269 1.0
26 sig/vmware 0.002796 1.0
27 sig/service-catalog 0.002935 1.0

Find best probability threshold


In [46]:
clf.find_best_prob_thre(y_holdout, mlp_predictions)


Precision threshold: 0.7
Recall threshold: 0.3
/home/chunhsiang/Documents/Issue-Label-Bot/Issue-Label-Bot-v2/env/lib/python3.6/site-packages/ipykernel_launcher.py:97: RuntimeWarning: invalid value encountered in true_divide
Best probability threshold: 0.7,
9 labels meet both the precision and recall thresholds

In [47]:
_, _, _ = clf.calculate_result(y_holdout, mlp_predictions, prob_thre=0.7)


Threshold: 0.7
/home/chunhsiang/Documents/Issue-Label-Bot/Issue-Label-Bot-v2/env/lib/python3.6/site-packages/ipykernel_launcher.py:97: RuntimeWarning: invalid value encountered in true_divide
label precision recall
0 sig/cluster-lifecycle 0.724138 0.042169
1 sig/node 0.798875 0.433588
2 sig/api-machinery 0.766831 0.428440
3 sig/scalability 0.900000 0.139535
4 sig/cli 0.865979 0.463235
5 sig/autoscaling 0.952381 0.200000
6 sig/network 0.790569 0.618221
7 sig/cloud-provider NaN 0.000000
8 sig/storage 0.884746 0.634265
9 sig/scheduling 0.900000 0.294710
10 sig/apps 0.714286 0.068182
11 sig/windows 0.896552 0.309524
12 sig/auth 0.808696 0.318493
13 sig/docs 0.687500 0.100000
14 sig/testing 0.637931 0.102493
15 sig/federation 0.642857 0.105882
16 sig/gcp 1.000000 0.010204
17 sig/release 0.461538 0.113924
18 sig/azure 0.941176 0.305732
19 sig/aws 0.893204 0.423963
20 sig/cluster-ops NaN 0.000000
21 sig/multicluster 0.625000 0.185185
22 sig/instrumentation 0.800000 0.129032
23 sig/openstack NaN 0.000000
24 sig/contributor-experience 0.700000 0.097222
25 sig/architecture NaN 0.000000
26 sig/vmware NaN 0.000000
27 sig/service-catalog NaN 0.000000

In [48]:
clf.get_exclusion_list()


Out[48]:
['sig/cluster-lifecycle',
 'sig/scalability',
 'sig/autoscaling',
 'sig/cloud-provider',
 'sig/scheduling',
 'sig/apps',
 'sig/docs',
 'sig/testing',
 'sig/federation',
 'sig/gcp',
 'sig/release',
 'sig/cluster-ops',
 'sig/multicluster',
 'sig/instrumentation',
 'sig/openstack',
 'sig/contributor-experience',
 'sig/architecture',
 'sig/vmware',
 'sig/service-catalog']
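A hypothetical use of this list at prediction time is to suppress any label that failed either threshold; a sketch (nothing below is part of the MLP class):

excluded = set(clf.get_exclusion_list())

# Keep only labels above the tuned threshold and not in the exclusion list.
for row in mlp_predictions[:3]:
    predicted = [label for label, p in zip(label_columns, row)
                 if p >= clf.best_prob_thre and label not in excluded]
    print(predicted)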

Grid search for MLP class


In [42]:
params = {'hidden_layer_sizes': [(100,), (200,), (400, ), (50, 50), (100, 100), (200, 200)],
              'alpha': [.001],
              'learning_rate': ['adaptive'],
              'learning_rate_init': [.001]}

clf.grid_search(params, cv=5, n_jobs=-1)
clf.fit(X, y)
mlp_predictions = clf.predict_proba(X_holdout)
mlp_df, mlp_auc = clf.calculate_auc(y_holdout, mlp_predictions)


label auc count
0 sig/cluster-lifecycle 0.870295 498
1 sig/node 0.890448 1311
2 sig/api-machinery 0.895079 1090
3 sig/scalability 0.904370 258
4 sig/cli 0.939470 544
5 sig/autoscaling 0.959497 100
6 sig/network 0.947110 923
7 sig/cloud-provider 0.892811 29
8 sig/storage 0.967395 824
9 sig/scheduling 0.927299 397
10 sig/apps 0.903220 440
11 sig/windows 0.973476 84
12 sig/auth 0.952733 292
13 sig/docs 0.956812 110
14 sig/testing 0.908038 361
15 sig/federation 0.938684 85
16 sig/gcp 0.818113 98
17 sig/release 0.949306 158
18 sig/azure 0.964417 157
19 sig/aws 0.942871 217
20 sig/cluster-ops 0.761540 31
21 sig/multicluster 0.948151 81
22 sig/instrumentation 0.927359 124
23 sig/openstack 0.931752 70
24 sig/contributor-experience 0.912050 72
25 sig/architecture 0.747146 52
26 sig/vmware 0.913849 20
27 sig/service-catalog 0.764275 21
Weighted Average AUC: 0.9184314089291179

In [43]:
clf.save_model()
clf.load_model()
mlp_predictions = clf.predict_proba(X_holdout)
mlp_df, mlp_auc = clf.calculate_auc(y_holdout, mlp_predictions)


label auc count
0 sig/cluster-lifecycle 0.870295 498
1 sig/node 0.890448 1311
2 sig/api-machinery 0.895079 1090
3 sig/scalability 0.904370 258
4 sig/cli 0.939470 544
5 sig/autoscaling 0.959497 100
6 sig/network 0.947110 923
7 sig/cloud-provider 0.892811 29
8 sig/storage 0.967395 824
9 sig/scheduling 0.927299 397
10 sig/apps 0.903220 440
11 sig/windows 0.973476 84
12 sig/auth 0.952733 292
13 sig/docs 0.956812 110
14 sig/testing 0.908038 361
15 sig/federation 0.938684 85
16 sig/gcp 0.818113 98
17 sig/release 0.949306 158
18 sig/azure 0.964417 157
19 sig/aws 0.942871 217
20 sig/cluster-ops 0.761540 31
21 sig/multicluster 0.948151 81
22 sig/instrumentation 0.927359 124
23 sig/openstack 0.931752 70
24 sig/contributor-experience 0.912050 72
25 sig/architecture 0.747146 52
26 sig/vmware 0.913849 20
27 sig/service-catalog 0.764275 21
Weighted Average AUC: 0.9184314089291179

In [51]:
new_clf = MLP(c, label_columns)
new_clf.load_model()
mlp_predictions = new_clf.predict_proba(X_holdout)
mlp_df, mlp_auc = new_clf.calculate_auc(y_holdout, mlp_predictions)


label auc count
0 sig/cluster-lifecycle 0.863932 498
1 sig/node 0.884496 1311
2 sig/api-machinery 0.892453 1090
3 sig/scalability 0.907244 258
4 sig/cli 0.935913 544
5 sig/autoscaling 0.949778 100
6 sig/network 0.945694 923
7 sig/cloud-provider 0.934848 29
8 sig/storage 0.965592 824
9 sig/scheduling 0.926638 397
10 sig/apps 0.893835 440
11 sig/windows 0.973496 84
12 sig/auth 0.952659 292
13 sig/docs 0.967213 110
14 sig/testing 0.908711 361
15 sig/federation 0.952171 85
16 sig/gcp 0.808062 98
17 sig/release 0.947304 158
18 sig/azure 0.966767 157
19 sig/aws 0.942163 217
20 sig/cluster-ops 0.733281 31
21 sig/multicluster 0.953977 81
22 sig/instrumentation 0.938337 124
23 sig/openstack 0.946715 70
24 sig/contributor-experience 0.917553 72
25 sig/architecture 0.848095 52
26 sig/vmware 0.923500 20
27 sig/service-catalog 0.736944 21
Weighted Average AUC: 0.9168608333252417

In [52]:
_, _ = new_clf.calculate_max_range_count(mlp_predictions)


l u count
0 0.0 0.1 18
1 0.1 0.2 234
2 0.2 0.3 479
3 0.3 0.4 640
4 0.4 0.5 744
5 0.5 0.6 699
6 0.6 0.7 797
7 0.7 0.8 790
8 0.8 0.9 984
9 0.9 1.0 1769
10 1.0 1.0 0

In [53]:
_, _, _ = new_clf.calculate_result(y_holdout, mlp_predictions)


Threshold: 0.0
label precision recall
0 sig/cluster-lifecycle 0.069611 1.0
1 sig/node 0.183114 1.0
2 sig/api-machinery 0.152362 1.0
3 sig/scalability 0.036064 1.0
4 sig/cli 0.076041 1.0
5 sig/autoscaling 0.013978 1.0
6 sig/network 0.128879 1.0
7 sig/cloud-provider 0.004054 1.0
8 sig/storage 0.115041 1.0
9 sig/scheduling 0.055493 1.0
10 sig/apps 0.061504 1.0
11 sig/windows 0.011742 1.0
12 sig/auth 0.040816 1.0
13 sig/docs 0.015376 1.0
14 sig/testing 0.050461 1.0
15 sig/federation 0.011881 1.0
16 sig/gcp 0.013699 1.0
17 sig/release 0.022086 1.0
18 sig/azure 0.021946 1.0
19 sig/aws 0.030333 1.0
20 sig/cluster-ops 0.004333 1.0
21 sig/multicluster 0.011322 1.0
22 sig/instrumentation 0.017333 1.0
23 sig/openstack 0.009785 1.0
24 sig/contributor-experience 0.010064 1.0
25 sig/architecture 0.007269 1.0
26 sig/vmware 0.002796 1.0
27 sig/service-catalog 0.002935 1.0

In [ ]: