Simple Tickets prediction with DNN

Predicting the number of tickets requested by different clients

Supervised Learning. Regression

Data taken from Udacity's problem solving with advanced analytics

Here a neural network is effectively appllied in a simple problem usually solved with linear models



In [1]:

    
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import helper
import keras

helper.info_gpu()
helper.reproducible(seed=9) # setup reproducible results from run to run using Keras

%matplotlib inline









    



Using TensorFlow backend.






    



/device:GPU:0
Keras		v2.1.3
TensorFlow	v1.4.1

1. Data Processing and Exploratory Data Analysis



In [2]:

    
data_path = 'data/simple_tickets_data.csv'
target = ['Average Number of Tickets']

df = pd.read_csv(data_path)
print("rows: {} \ncolumns: {} \ntarget: {}".format(*df.shape, target))









    



rows: 300 
columns: 5 
target: ['Average Number of Tickets']

Show the data



In [3]:

    
df.head()









    Out[3]:







  
    
      
      Client ID
      Average Number of Tickets
      Number of Employees
      Value of Contract
      Industry
    
  
  
    
      0
      TK669
      90
      561
      200000
      Retail
    
    
      1
      EY859
      85
      530
      200000
      Retail
    
    
      2
      TN571
      5
      82
      25000
      Manufacturing
    
    
      3
      LJ675
      82
      491
      225000
      Manufacturing
    
    
      4
      KH713
      12
      230
      75000
      Services

Numerical data



In [4]:

    
df.describe(percentiles=[0.5])









    Out[4]:







  
    
      
      Average Number of Tickets
      Number of Employees
      Value of Contract
    
  
  
    
      count
      300.000000
      300.000000
      300.000000
    
    
      mean
      44.596667
      421.120000
      142083.333333
    
    
      std
      42.582055
      245.465265
      72128.947071
    
    
      min
      1.000000
      50.000000
      25000.000000
    
    
      50%
      28.000000
      428.000000
      150000.000000
    
    
      max
      207.000000
      1116.000000
      250000.000000

Missing values



In [5]:

    
helper.missing(df);









    



No missing values found

Transform the data

Remove irrelevant features



In [6]:

    
df = df.drop('Client ID', axis='columns')

Classify variables



In [7]:

    
numerical = ['Number of Employees', 'Value of Contract', 'Average Number of Tickets']

df = helper.classify_data(df, target, numerical)

pd.DataFrame(dict(df.dtypes), index=["Type"])[df.columns].head()  # show data types









    



numerical features:   2
categorical features: 1
target 'Average Number of Tickets': float32






    Out[7]:







  
    
      
      Number of Employees
      Value of Contract
      Industry
      Average Number of Tickets
    
  
  
    
      Type
      float32
      float32
      category
      float32

Visualize the data

Categorical features



In [8]:

    
helper.show_categorical(df)

Target vs Categorical features



In [9]:

    
helper.show_target_vs_categorical(df, target)

Numerical features



In [10]:

    
helper.show_numerical(df, kde=True)

Target vs Numerical features



In [11]:

    
helper.show_target_vs_numerical(df, target, point_size=20)

Target vs All features



In [12]:

    
g = sns.PairGrid(df, y_vars=target, x_vars=['Number of Employees', 'Value of Contract'], 
                 size=7, hue='Industry', aspect=1.5)
g.map(sns.regplot).add_legend();

#sns.pairplot(df, hue = 'Industry', vars=['Number of Employees', 'Value of Contract'] + 
#             targets, size = 4)

These figures suggest that a simple linear model could be used to make accurate predictions

Correlation between numerical features and target



In [13]:

    
helper.show_correlation(df, target, figsize=(7,4))

2. Neural Network model

Select the features



In [14]:

    
droplist = []  # features to drop

# For the model 'data' instead of 'df'
data = df.copy()
data.drop(droplist, axis='columns', inplace=True)
data.head(3)









    Out[14]:







  
    
      
      Number of Employees
      Value of Contract
      Industry
      Average Number of Tickets
    
  
  
    
      0
      561.0
      200000.0
      Retail
      90.0
    
    
      1
      530.0
      200000.0
      Retail
      85.0
    
    
      2
      82.0
      25000.0
      Manufacturing
      5.0

Scale numerical variables

Shift and scale numerical variables to a standard normal distribution. The scaling factors are saved to be used for predictions.



In [15]:

    
data, scale_param = helper.scale(data)

Create dummy features

Replace categorical features (no target) with dummy features



In [16]:

    
data, dict_dummies = helper.replace_by_dummies(data, target)

model_features = [f for f in data if f not in target] # sorted neural network inputs

data.head(3)









    Out[16]:







  
    
      
      Number of Employees
      Value of Contract
      Average Number of Tickets
      Industry_Manufacturing
      Industry_Retail
      Industry_Services
    
  
  
    
      0
      0.569857
      0.80296
      1.066255
      0
      1
      0
    
    
      1
      0.443566
      0.80296
      0.948835
      0
      1
      0
    
    
      2
      -1.381540
      -1.62325
      -0.929891
      1
      0
      0

Split the data into training and test set

Data leakage: Test set hidden when training the model, but seen when preprocessing the dataset

No validation set will be used here (300 samples)



In [17]:

    
test_size = 0.2
random_state = 0

from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=test_size, random_state=random_state)

# Separate the data into features and target (x=features, y=target)
x_train, y_train = train.drop(target, axis=1).values, train[target].values
x_test, y_test = test.drop(target, axis=1).values, test[target].values

One-hot encode the output not needed for regression



In [18]:

    
print("train size \t X:{} \t Y:{}".format(x_train.shape, y_train.shape))
print("test size  \t X:{} \t Y:{} ".format(x_test.shape, y_test.shape))









    



train size 	 X:(240, 5) 	 Y:(240, 1)
test size  	 X:(60, 5) 	 Y:(60, 1)

Build the Neural Network for Regression



In [19]:

    
from keras.models import Sequential
from keras.layers.core import Dense, Dropout


def build_nn(input_size, output_size, summary=False):

    input_nodes = input_size
    weights = keras.initializers.RandomNormal(stddev=0.001)

    model = Sequential()
    model.add(
        Dense(
            input_nodes,
            input_dim=input_size,
            activation='tanh',
            kernel_initializer=weights,
            bias_initializer=weights))

    model.add(Dense(1, activation=None, kernel_initializer=weights, bias_initializer=weights))

    model.compile(loss='mean_squared_error', optimizer='adam')

    if summary:
        model.summary()

    return model

Train the Neural Network



In [20]:

    
from time import time
model_path = os.path.join("models", "simple_tickets.h5")


def train_nn(model, x_train, y_train, validation_data=None, path=False, show=True):
    """ 
    Train the neural network model. If no validation_datais provided, a split for validation
    will be used
    """

    if show:
        print('Training ....')

    #callbacks = [keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, verbose=0)]
    t0 = time()

    history = model.fit(
        x_train,
        y_train,
        epochs=30,
        batch_size=16,
        validation_split=0,
        validation_data=validation_data,
        callbacks=None,
        verbose=0)

    if show:
        print("time: \t {:.1f} s".format(time() - t0))
        helper.show_training(history)

    if path:
        model.save(path)
        print("\nModel saved at", path)

    return history


model = None
model = build_nn(x_train.shape[1], y_train.shape[1], summary=False)
train_nn(model, x_train, y_train, validation_data=None, path=model_path);

from sklearn.metrics import r2_score

ypred_train = model.predict(x_train)
#ypred_val = model.predict(x_val)
print('Training   R2-score: \t{:.3f}'.format(r2_score(y_train, ypred_train)))
#print('Validation R2-score: \t{:.3f}'.format(r2_score(y_val, ypred_val)))









    



Training ....
time: 	 1.4 s






    












    



Training loss:  	0.3762

Model saved at models/simple_tickets.h5
Training   R2-score: 	0.603

Evaluate the model



In [21]:

    
# model = keras.models.load_model(model_path)
# print("Model loaded:", model_path)


def evaluate_nn(model, x_test, y_test):
    
    score = model.evaluate(x_test, y_test, verbose=0)
    print("\nTest loss:\t\t{:.4f}".format(score))

    ypred_test = model.predict(x_test)
    print('\nTest R2-score: \t\t{:.3f}'.format(r2_score(y_test, ypred_test)))

evaluate_nn(model, x_test, y_test)









    



Test loss:		0.6103

Test R2-score: 		0.493

Make predictions



In [22]:

    
def predict_nn(model, x_test, target):
    """ Return a dataframe with actual and predicted targets in original scale"""

    for t in target:
        pred = model.predict(x_test, verbose=0)
        restore_pred = pred * scale_param[t][1] + scale_param[t][0]
        restore_pred = restore_pred.round()

        restore_y = y_test * scale_param[t][1] + scale_param[t][0]
        restore_y = restore_y.round()

        pred_label = 'Predicted_' + t
        error_label = t + ' error (%)'

        pred_df = pd.DataFrame({
            t: np.squeeze(restore_y),
            pred_label: np.squeeze(restore_pred)
        })

        pred_df[error_label] = ((pred_df[pred_label] - pred_df[t]) * 100 / pred_df[t]).round(1)

        print(t, ". Prediction error:")
        print("Mean: \t {:.2f}%".format(pred_df[error_label].mean()))
        print("Stddev:  {:.2f}%".format(pred_df[error_label].std()))
        sns.distplot(pred_df[error_label])
        plt.xlim(xmin=-600, xmax=600)

    return pred_df


pred_df = predict_nn(model, x_test, target)









    



Average Number of Tickets . Prediction error:
Mean: 	 76.35%
Stddev:  200.50%



In [23]:

    
pred_df.head()









    Out[23]:







  
    
      
      Average Number of Tickets
      Predicted_Average Number of Tickets
      Average Number of Tickets error (%)
    
  
  
    
      0
      56.0
      90.0
      60.700001
    
    
      1
      9.0
      29.0
      222.199997
    
    
      2
      32.0
      35.0
      9.400000
    
    
      3
      34.0
      27.0
      -20.600000
    
    
      4
      50.0
      64.0
      28.000000

The prediction error (%) can be especially high when the number of tickets is low. The absolute error could be a better indicator here.

Compare with linear regression



In [24]:

    
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(x_train, y_train)
pred = reg.predict(x_test)

t=target[0]
restore_pred = pred * scale_param[t][1] + scale_param[t][0]
restore_pred = restore_pred.round()

restore_y = y_test * scale_param[t][1] + scale_param[t][0]
restore_y = restore_y.round()

pred_label = 'Predicted_' + t
error_label = t + ' error (%)'

pred_df = pd.DataFrame({
    t: np.squeeze(restore_y),
    pred_label: np.squeeze(restore_pred)
})

pred_df[error_label] = ((pred_df[pred_label] - pred_df[t]) * 100 / pred_df[t]).round(1)

print(t, ". Prediction error:")
print("Mean: \t {:.2f}%".format(pred_df[error_label].mean()))
print("Stddev:  {:.2f}%".format(pred_df[error_label].std()))
sns.distplot(pred_df[error_label])
plt.xlim(xmin=-600, xmax=600)









    



Average Number of Tickets . Prediction error:
Mean: 	 85.51%
Stddev:  214.45%






    Out[24]:





(-600, 600)

The mean and standard deviation of the error is higher with the linear model.

Compare with classical ML



In [25]:

    
helper.ml_regression(x_train, y_train[:,0], x_test, y_test[:,0])









    



Linear
Bayesian Ridge
Decision Tree
KNeighbors
AdaBoost
Random Forest






    Out[25]:







  
    
      
      Time (s)
      Test loss
      Test R2 score
    
  
  
    
      Bayesian Ridge
      0.0
      0.622982
      0.482503
    
    
      Linear
      0.0
      0.624954
      0.480864
    
    
      AdaBoost
      0.0
      0.693624
      0.423822
    
    
      Random Forest
      0.1
      0.710892
      0.409478
    
    
      KNeighbors
      0.0
      0.764439
      0.364997
    
    
      Decision Tree
      0.0
      0.894821
      0.256692

	Client ID	Average Number of Tickets	Number of Employees	Value of Contract	Industry
0	TK669	90	561	200000	Retail
1	EY859	85	530	200000	Retail
2	TN571	5	82	25000	Manufacturing
3	LJ675	82	491	225000	Manufacturing
4	KH713	12	230	75000	Services

	Average Number of Tickets	Number of Employees	Value of Contract
count	300.000000	300.000000	300.000000
mean	44.596667	421.120000	142083.333333
std	42.582055	245.465265	72128.947071
min	1.000000	50.000000	25000.000000
50%	28.000000	428.000000	150000.000000
max	207.000000	1116.000000	250000.000000

	Number of Employees	Value of Contract	Industry	Average Number of Tickets
0	561.0	200000.0	Retail	90.0
1	530.0	200000.0	Retail	85.0
2	82.0	25000.0	Manufacturing	5.0

	Number of Employees	Value of Contract	Average Number of Tickets	Industry_Manufacturing	Industry_Retail
0	0.569857	0.80296	1.066255	0	1
1	0.443566	0.80296	0.948835	0	1
2	-1.381540	-1.62325	-0.929891	1	0

	Average Number of Tickets	Predicted_Average Number of Tickets	Average Number of Tickets error (%)
0	56.0	90.0	60.700001
1	9.0	29.0	222.199997
2	32.0	35.0	9.400000
3	34.0	27.0	-20.600000
4	50.0	64.0	28.000000

	Time (s)	Test loss	Test R2 score
Bayesian Ridge	0.0	0.622982	0.482503
Linear	0.0	0.624954	0.480864
AdaBoost	0.0	0.693624	0.423822
Random Forest	0.1	0.710892	0.409478
KNeighbors	0.0	0.764439	0.364997
Decision Tree	0.0	0.894821	0.256692