Modeling

ML Tasks



In [1]:

    
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

Input



In [2]:

    
file_path = "../data/df_model01.pkl"
df = pd.read_pickle(file_path)

print(df.shape)
df.head()









    



(75, 36)






    Out[2]:






  
    
      
      buzzwordy_title
      buzzwordy_organizer
      days
      distance
      ticket_prize
      rating
      main_topic_Daten
      main_topic_Geo
      main_topic_IT
      main_topic_Innovation
      ...
      city_Passau
      city_Salzburg
      city_Steyr
      city_Wels
      city_Wien
      city_Zürich
      country_Austria
      country_Germany
      country_Slovakia
      country_Switzerland
    
  
  
    
      0
      0
      0
      1
      29
      0
      9
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      1
      0
      0
      0
    
    
      1
      1
      1
      1
      3
      0
      3
      0
      0
      0
      0
      ...
      0
      0
      1
      0
      0
      0
      1
      0
      0
      0
    
    
      2
      1
      0
      1
      1
      0
      6
      0
      0
      0
      0
      ...
      0
      0
      1
      0
      0
      0
      1
      0
      0
      0
    
    
      3
      1
      1
      1
      1
      0
      2
      0
      0
      0
      1
      ...
      0
      0
      1
      0
      0
      0
      1
      0
      0
      0
    
    
      4
      1
      0
      1
      1
      0
      2
      0
      0
      0
      1
      ...
      0
      0
      1
      0
      0
      0
      1
      0
      0
      0
    
  

5 rows × 36 columns

Binning



In [3]:

    
bins = [10, 40, 120, 180] # SR, LNZ, SZG, VIE, >VIE

df["binned_distance"] = np.digitize(df.distance.values, bins=bins)

Conversion for Scikit-learn

Feature selection based on expert knowledge. Model-based hardly interpretable, at least confirmed "binned_distance" as relevant feature.



In [4]:

    
feature_names = ["buzzwordy_title", "main_topic_Daten", "binned_distance"]
X = df[feature_names].values
y = df.rating.map(lambda x: 1 if x>5 else 0).values # binary target: >5 (better as all the same) was worth attending

print("X:", X.shape, "y:", y.shape)









    



X: (75, 3) y: (75,)

Train-Test Split



In [5]:

    
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23, 
                                                    test_size=0.5) # 50% split, small dataset size

Modeling

Model



In [6]:

    
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

linreg = LinearRegression() # Benchmark model
dec_tree = DecisionTreeClassifier() # Actual model

Benchmark

Linear Regression



In [7]:

    
# Benchmark
linreg.fit(X_train, y_train)
print("Score (r^2): {:.3f}".format(linreg.score(X_test, y_test)))
print("Coef: {}".format(linreg.coef_))









    



Score (r^2): -0.220
Coef: [ 0.02485585  0.04313925  0.08000478]

=> Really bad perfomance

Model

Decision Tree CV

<= Ability to visualize, hasn't performed much worse than other applied models

Parameter Tuning



In [8]:

    
from sklearn.model_selection import GridSearchCV

parameter_grid = {"criterion": ["gini", "entropy"], 
                  "max_depth": [None, 1, 2, 3, 4, 5, 6], 
                  "min_samples_leaf": list(range(1, 14)), 
                  "max_leaf_nodes": list(range(3, 25))}

grid_search = GridSearchCV(DecisionTreeClassifier(presort=True), parameter_grid, cv=5) # 5 fold cross-val
grid_search.fit(X_train, y_train)

print("Score (Accuracy): {:.3f}".format(grid_search.score(X_test, y_test)))
print("Best Estimator: {}".format(grid_search.best_estimator_))
print("Best Parameters: {}".format(grid_search.best_params_))









    



Score (Accuracy): 0.632
Best Estimator: DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=5, min_impurity_split=1e-07,
            min_samples_leaf=2, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=True, random_state=None,
            splitter='best')
Best Parameters: {'criterion': 'gini', 'max_depth': None, 'max_leaf_nodes': 5, 'min_samples_leaf': 2}

=> Not that good accuracy, but at least better than random draw

Build Model



In [9]:

    
model = DecisionTreeClassifier(presort=True, criterion="gini", max_depth=None, 
                               min_samples_leaf=2, max_leaf_nodes=5)
model.fit(X_train, y_train)

print("Score (Accuracy): {:.3f}".format(model.score(X_test, y_test)))









    



Score (Accuracy): 0.632

Print Decision Tree



In [10]:

    
from sklearn.tree import export_graphviz

export_graphviz(model, class_names=True, feature_names=feature_names,
                rounded=True, filled=True, label="root", impurity=False, proportion=True,
                out_file="plots/dectree_Model_best.dot")

Model Evaluation

Evaluation Scores



In [11]:

    
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))









    



             precision    recall  f1-score   support

          0       0.71      0.77      0.74        26
          1       0.40      0.33      0.36        12

avg / total       0.62      0.63      0.62        38

=> Not so good in predicting class 1 ("worth attending), better in predicting class 0. Weighted average of precision and recall 0.62 (F1 Score)

ROC Curve



In [12]:

    
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, y_pred)

plt.plot(fpr, tpr, label="ROC")
plt.plot([0, 1], c="r", label="Chance level")
plt.title("ROC Curve")
plt.xlabel("False Positive Rate")
plt.ylabel("Recall")
plt.legend(loc=4)

plt.savefig("plots/ROC_Curve_Model.png", dpi=180)
plt.show()

print("AUC: {:.3f}".format(roc_auc_score(y_test, y_pred)))

=> Indeed only slightly better than random guess, ~ 0.05 percentage points (AUC)



In [13]:

    
from sklearn.model_selection import cross_val_score

scores_benchmark = cross_val_score(model, X, y, cv=5) # 5 folds

print("Cross-Val Scores (Accuracy): {}".format(scores_benchmark))
print("Cross-Val Mean (Accuracy): {}".format(scores_benchmark.mean()))









    



Cross-Val Scores (Accuracy): [ 0.5625      0.625       0.46666667  0.71428571  0.64285714]
Cross-Val Mean (Accuracy): 0.6022619047619048

=> Generalization acceptable, but not good (as expected from a Decision Tree)

Persist Model



In [14]:

    
from sklearn.externals import joblib

file_path = "../data/model_trained.pkl"

joblib.dump(model, file_path)









    Out[14]:





['../data/model_trained.pkl']



In [ ]:

	buzzwordy_title	buzzwordy_organizer	days	distance	rating	main_topic_Innovation	...	city_Steyr	country_Austria
0	0	0	1	29	9	0	...	0	1
1	1	1	1	3	3	0	...	1	1
2	1	0	1	1	6	0	...	1	1
3	1	1	1	1	2	1	...	1	1
4	1	0	1	1	2	1	...	1	1

	buzzwordy_title	buzzwordy_organizer	days	distance	rating	main_topic_Innovation	...	city_Steyr	country_Austria
0	0	0	1	29	9	0	...	0	1
1	1	1	1	3	3	0	...	1	1
2	1	0	1	1	6	0	...	1	1
3	1	1	1	1	2	1	...	1	1
4	1	0	1	1	2	1	...	1	1

	buzzwordy_title	buzzwordy_organizer	days	distance	rating	main_topic_Innovation	...	city_Steyr	country_Austria
0	0	0	1	29	9	0	...	0	1
1	1	1	1	3	3	0	...	1	1
2	1	0	1	1	6	0	...	1	1
3	1	1	1	1	2	1	...	1	1
4	1	0	1	1	2	1	...	1	1