The aim of this RFC is to predict in which area of an Oil & Gas company a candidate will fit best (Corporate or Business) based on her/his preferred ways of working. The question it tries to answer is:
Can we accurately predict, in more than 2/3 of the cases, where a candidate will have the best cultural fit within the Company (Corporate or Business) as the first step of the hiring process?
There are only two options that can be broadly described (without entering into the specific background of the candidate) based on the preferred ways of working. The outcome variable (called Section in the survey) has only two possible outcomes:
This is a real experiment for which a survey has been designed and sent out to more than 500 staff members based only in Spain. In this survey, 30 questions have been asked about their preferred ways of working (WoW or wow). The staff was given one month to complete the survey, after which the response window was closed. The threshold to start the analysis was whichever happened first: receiving 90% of the surveys or the end of the one-month period.
The wow that will act as predictors for the classification are:
To answer the survey, a Likert scale from 1 to 5 (low to high) has been used.
The 30 questions have been grouped into 10 different aspects of the personality by adding the results of the 3 questions under each. Additionally, respondents have been asked to state whether they are part of the corporate or the business side of the company (labelled data).
From the results obtained, two analyses have been carried out:
For the purpose of this exercise, only the first option (10 umbrella wow) will be included in this notebook.
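As an illustration of this grouping, each umbrella wow score is simply the sum of its three Likert items, so it ranges from 3 to 15. A minimal sketch, using the three Benefactor items named later in this notebook and a small hypothetical answers dataframe, would be:
import pandas as pd
#Hypothetical answers to the three Benefactor Likert items (each rated 1-5)
answers = pd.DataFrame({'Benefactor-Defending': [4, 2],
                        'Benefactor-Empathizing': [5, 3],
                        'Benefactor-Developing': [3, 3]})
#The umbrella score is the sum of the three items, so it ranges from 3 to 15
answers['Benefactor_Score'] = answers[['Benefactor-Defending',
                                       'Benefactor-Empathizing',
                                       'Benefactor-Developing']].sum(axis=1)
print(answers['Benefactor_Score'])  # 12 and 8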
In [110]:
#Import Python libraries that will be used in this model
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split,cross_val_score, KFold, cross_val_predict
from sklearn.decomposition import PCA as sklearn_pca
from sklearn.decomposition import PCA
from sklearn.utils import resample
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn import preprocessing, decomposition
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
import time
In [2]:
# Read and import data
wow_raw = pd.read_csv('WOWanalysis.csv', encoding='latin-1')
wow_raw.head()
Out[2]:
The dataset that contains the results from the surveys has additional information that will not be required for the analysis. Let's visualize the 56 columns of the dataset.
In [3]:
#Identify all the columns in the file
wow_raw.columns
Out[3]:
Columns that do not end in _Score hold raw data that will not be used in this analysis. Before cleansing the dataset, let's analyse the type of data it contains and the number of NaN or null values.
In [4]:
#Analyse types of variables, length of the dataset and number of null datapoints
wow_raw.info()
There are only two null objects in the dataset and the information is stored as objects although it is numerical. Columns that are not relevant for the analysis will be dropped. For the purpose of the experiment, information related to the date on which the test was answered, progress, duration and id of the respondents has been sanitized (in the case of the id) and dropped. Let's also inspect the unique values contained in the output variable, since additional information (such as the year respondents were hired) was asked in the survey.
In [5]:
#Drop columns that are not relevant for the model
#Drop columns that have null-object (identification only)
wow_useful = wow_raw.drop(['StartDate', 'Progress', 'Duration (in seconds)', 'Identification'], axis = 1)
In [6]:
#Identify unique values in the Section Area
wow_useful.Section.unique()
Out[6]:
Although it is a binary classification and only two options are given, the information regarding the year respondents were hired has been introduced in different formats, making it useless for the analysis. Additionally, the first row of each column contains the real name of the column. The first row will be used to name the columns and then dropped to avoid repetition. Afterwards, the columns will be listed to understand the real information they contain.
In [7]:
#Clean data and rename columns with row 0 from the original data set
wow_useful = wow_useful.rename(columns=wow_raw.iloc[0])
#Drop the first rows (header information) to avoid repetition and reset the index
wow_useful = wow_useful.drop(wow_raw.index[0:3]).reset_index(drop=True)
In [8]:
#Identify all the columns in the file
wow_useful.columns
Out[8]:
From an initial inspection of the columns, only the ones that store the scores of each way of working are of interest as predictors. Hence, the rest of the columns will be dropped. Only Section, Gender and all columns ending in _Score will be kept in the new dataframe.
In [9]:
#Keep only the columns that end in _Score
#Columns that end with "_Score" are the sum of the three questions under the same wow. For example:
#Benefactor_Score = Benefactor-Defending + Benefactor-Empathizing + Benefactor-Developing
wow_scores = wow_useful[['Section','Gender','Catalyst_Score', 'Orderer_Score', 'Influencer_Score',
'Benefactor_Score', 'Harmonizer_Score', 'Investigator_Score',
'Quantifier_Score', 'Distiller_Score', 'Innovator_Score',
'Creator_Score']]
#Show new dataframe
wow_scores.head()
Out[9]:
Information regarding Gender and Section (company's area) has to be cleaned. Gender is stored in English and Spanish, and Section stores the first four letters of the company's area followed by the year.
In [10]:
#Clean the "Gender" column
#Assign values to Gender: Male = 0, Hombre = 0, Mujer = 1, Female = 1
wow_scores['Gender'] = wow_scores.loc[:, 'Gender'].map({'Female': 1,'Mujer': 1, 'Male': 0,'Hombre': 0 })
In [11]:
#Section column contains the cohort of people and the year they were interviewed.
#Not all of them keep the same format so map Sections to areas (i.e. & MBD = MBD) to two unique values:
#CORP = Corporate Functions, BUS = Business
wow_scores['Section'] = wow_scores.loc[:, 'Section'].map({'CORP-01': 'CORP',
'CORP-2015': 'CORP',
'CORP-02': 'CORP',
'CORP-2016': 'CORP',
'CORP-2017': 'CORP',
'CORP-2014': 'CORP',
'BUSI-01': 'BUS',
'BUSI-02': 'BUS',
'BUSI-2017': 'BUS',
'BUSI-2016': 'BUS'})
To have a better understanding of the answers received, a basic analysis of the responses by gender and company's area will be carried out. This will help to see whether the outcome variable "Section" is imbalanced and to understand the demographics of the company.
In [12]:
#Visualize the number of answers by Gender and by Category
#Check the outcome variable and see if there is any imbalance
plt.figure(figsize=(20, 5))
sns.set_style("whitegrid")
plt.subplot(1, 2, 1)
ax = sns.countplot(x="Gender", data=wow_scores, palette="Set2")
ax.set_xlabel('Responses by Gender')
ax.set_ylabel('Number of Occurrences')
ax.set_xticklabels(['Male','Female'], fontsize=10)
plt.ylim(0, 300)
plt.subplot(1, 2, 2)
ax = sns.countplot(x="Section", data=wow_scores, palette="Set1")
ax.set_xlabel('Outcome Variable: Section (BUS & CORP)')
ax.set_ylabel('Number of Occurrences')
plt.ylim(0, 300)
plt.tight_layout()
plt.show()
In a process of international expansion, the corporate side of the company is naturally slightly smaller than the business side. In this specific case, 43% of the respondents belong to the corporate side of the company while 57% belong to the business side. This is aligned with the activity this company is carrying out in Spain at the moment and with its future plans.
Furthermore, the exploratory analysis shows a gender imbalance: 62% women against 38% men.
To avoid the outcome variable (Section) imbalance when creating the classifier, the dataset has been resampled and the minority class has been up-sampled. Before doing so, datapoints need to be converted from objects to floats. To analyze the scores, a new dataframe is built containing only the scores (predictors) of each wow.
In [13]:
#Build a new dataframe that contains the ratings obtained per category, dropping section and gender
#Drop Section & Gender
wow_scores_only = wow_scores.drop(['Section','Gender'],axis = 1)
#Create new dataframe only with ratings
wow_scores_only = wow_scores_only.astype(np.float64)
#Check the new dataframe
wow_scores_only.info()
An exploratory analysis of the data is carried out to understand the distribution of the different wow scores (predictors) and the relationship between them, plotting joint relationships and histograms for the univariate distributions. Furthermore, regression lines are included to show the correlation that might exist between scores.
In [111]:
#Plot the distribution of the different ways of working (wow)
g = sns.pairplot(wow_scores_only, kind="reg")
None of the variables follows a normal distribution. This could be due to the small number of responses (only 466 datapoints). In all cases except "Benefactor_Score" and "Innovator_Score", the values obtained for the wow gravitate around 10; hence respondents lean towards the upper part of the Likert scale (3 or more), as three questions are added for each of the wow represented in this project.
From a joint relationship standpoint, we can see that several variables show a high positive correlation, such as "Investigator_Score" and "Innovator_Score", "Quantifier_Score" and "Investigator_Score", and "Distiller_Score" and "Investigator_Score". Further analysis through a correlation matrix is needed to confirm the high positive correlation initially seen between these variables. As a first step, a summary of the main statistics of the sample is obtained:
In [15]:
#Describe the data using statistics
wow_scores_only.describe()
Out[15]:
Distributions are in all cases centered around the mean, with the mean and the median being very similar. The first quartile appears to be between 7 and 9, which confirms that respondents are centered around 3 in each individual question (3 questions per wow). The third quartile is between 11 and 13, the maximum value that can be obtained being 15. There are no outliers for any of the variables (under the criterion of 1.5 times the interquartile range).
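The absence of outliers can be verified with the same 1.5 times interquartile range rule; a minimal sketch operating on the wow_scores_only dataframe built above would be:
#Count datapoints outside the 1.5*IQR fences for each wow score
q1 = wow_scores_only.quantile(0.25)
q3 = wow_scores_only.quantile(0.75)
iqr = q3 - q1
outliers = ((wow_scores_only < (q1 - 1.5 * iqr)) | (wow_scores_only > (q3 + 1.5 * iqr))).sum()
print(outliers)  # expected to be zero for every column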
To remove the 3-15 scale, the data is standardized to follow a standard normal distribution N(0,1):
In [16]:
#Preprocess the ratings
names = wow_scores_only.columns
scaled_scores_only = pd.DataFrame(preprocessing.scale(wow_scores_only), columns = names)
To have a better understanding of the correlation between variables, a heatmap has been produced. This allows the variables that were earlier found to have a high positive correlation to be investigated further.
In [17]:
#Prepare heatmap to visually inspect the correlation between scores
#Build the correlation matrix between scores
correlation_mat = scaled_scores_only.corr()
#Plot heatmap
plt.figure(figsize=(10, 10))
ax = sns.heatmap(correlation_mat, annot=True)
plt.tight_layout()
plt.show()
The correlations between the scores of the 10 ways of working have been analyzed; they lie in the range 0.2 to 0.68, with two exceptions that are close to zero. As discovered when the joint relationships were plotted, the following variables show a high correlation (all of them higher than 50%):
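The highly correlated pairs can also be listed programmatically from the correlation matrix built above; a minimal sketch using the correlation_mat dataframe would be:
#List the pairs of wow scores whose correlation is higher than 0.5
import itertools
high_corr = [(a, b, round(correlation_mat.loc[a, b], 2))
             for a, b in itertools.combinations(correlation_mat.columns, 2)
             if correlation_mat.loc[a, b] > 0.5]
for pair in high_corr:
    print(pair)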
Once the scores are pre-processed, Gender and Section are added back to the dataframe containing the wow scores.
In [19]:
#Incorporate the Section and Gender column to the dataset
#Build the secgen (section/gender) dataframe to be merged with the ratings dataframe
section_gender = wow_scores[['Section','Gender']]
#Build the new dataframe that incorporates the processed scores, section and gender
scaled_wow_scores = pd.concat([section_gender, scaled_scores_only],axis=1)
#Substitute the categorical output variable Section with numerical values: BUS = 0 & CORP = 1
scaled_wow_scores['Section'] = scaled_wow_scores['Section'].map({'BUS' :0,'CORP':1})
As the two classes are imbalanced (BUS: 264 and CORP: 202), they are resampled to be balanced. The minority class is upsampled in this case:
In [20]:
#Balance the sections by upsampling the minority class
# Separate majority and minority classes
Section_majority = scaled_wow_scores[scaled_wow_scores.Section==0]
Section_minority = scaled_wow_scores[scaled_wow_scores.Section==1]
# Upsample the section minority class "CORP"
Section_minority_upsampled = resample(Section_minority, replace=True, n_samples=264, random_state=123)
# Combine in a new dataframe 'data' the majority class with the upsampled minority class
wow_data = pd.concat([Section_majority,Section_minority_upsampled])
# Display new class counts
wow_data.Section.value_counts()
Out[20]:
Once the classes have been resampled, the predictors and the outcome variable are built so that the features for the model can be selected.
In [21]:
#Define the predictors and the outcome variable (y = Section )
#From the dataframe 'wow_data' drop the outcome variable Section
X = wow_data.drop('Section', axis = 1)
#Build the outcome variable Section
Y = wow_data['Section']
The feature selection process will start with a PCA analysis to understand the number of features required to explain more than 90% of the variance in the dataset.
Constraints
The features built using PCA cannot be used, as the company wants to use the selected features to build a refined survey for potential candidates.
Features cannot be engineered, as the features that the company wants to use must be in the set of features used in the initial survey.
Once the number of features is identified, the following methods will be used to select the features that will be used to build the binary classifier:
The features obtained by each of the methods will be compared to see if the selection is stable, and the final number of features will be determined by the number of features obtained from PCA.
The number of features is determined using PCA analysis:
In [22]:
#PCA Analysis
# Build the correlation matrix
correlation_matrix = X.corr()
#Calculate the eigenvectors & eigenvalues
eig_vals, eig_vecs = np.linalg.eig(correlation_matrix)
sklearn_pca = PCA(n_components=len(X.columns))
Y_sklearn = sklearn_pca.fit_transform(correlation_matrix)
#Plot the scree plot for visual analysis of the PCA features
plt.title('Scree Plot')
plt.plot(eig_vals)
plt.show()
#For additional aid, print the total variance explained by each of the eigenvalues
print('The percentage of total variance in the dataset explained \n', sklearn_pca.explained_variance_ratio_)
From the scree plot and the percentage of total variance explained, six features explain 93.8% of the variance. Hence, the number of features initially tested is between five and six. In an initial test with a logistic regression model, reducing the number of features from 6 to 5 lowers the overall accuracy of the model (on the whole dataset) by 1%. Any change in the number of features outside this set significantly reduces the accuracy of the model.
Hence, six features are used when recursive feature elimination is run to determine the final set of features.
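The same cut-off can be read directly from the cumulative explained variance of the fitted PCA object above; a minimal sketch, assuming a 90% target, would be:
#Smallest number of components whose cumulative explained variance reaches 90%
cum_var = np.cumsum(sklearn_pca.explained_variance_ratio_)
n_components_90 = int(np.argmax(cum_var >= 0.90)) + 1
print('Components needed to explain 90% of the variance:', n_components_90)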
In [23]:
#From the scree plot, the number of features that will maximize the explanation of the variance in the dataset is:
num_features=6
Although the PCA features cannot be used due to the constraints established by the Company regarding the nature of the features, PCA features have been built to test the number of features in a basic, untuned logistic regression model.
In [120]:
#Build PCA features
# Create a scaler object
sc = StandardScaler()
# Fit the scaler to the features and transform
X_std = sc.fit_transform(X)
# Create a PCA object; from the scree plot, the number of components is 6
pca = decomposition.PCA(n_components=num_features)
# Fit the PCA and transform the data
X_std_pca = pca.fit_transform(X_std)
# View the new feature data's shape
X_std_pca.shape
# Create a new dataframe with the new features
XPCA = pd.DataFrame(X_std_pca)
# Create a second PCA object with 5 components (one fewer than the scree plot choice)
pca = decomposition.PCA(n_components=num_features-1)
# Fit the PCA and transform the data
X_std_pca_five = pca.fit_transform(X_std)
# View the new feature data's shape
X_std_pca_five.shape
# Create a new dataframe with the new features
XPCAfive = pd.DataFrame(X_std_pca_five)
#Define the 5-fold cross validation generator used for scoring (the same generator is defined again later for the train/test evaluation)
kf = KFold(5)
#Calculate the percentage change in cross-validated accuracy when moving from five to six PCA features
print('% change in accuracy due to the number of features (%):',
      (cross_val_score(LogisticRegression(), XPCA, Y, cv=kf).mean() /
       cross_val_score(LogisticRegression(), XPCAfive, Y, cv=kf).mean() - 1) * 100)
To select the features, Feature Importance using Random Forest is used:
In [25]:
#Calculate Feature Importance using Random Forest
#Start and fit the Random Forest Classifier
rf = RandomForestClassifier()
rf.fit(X, Y)
#Define feature importance
feature_importance = rf.feature_importances_
# Make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5
#Plot the relative importance of each feature
plt.figure(figsize=(7, 7))
plt.subplot(1, 1, 1)
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Features Selection (Random Forest)')
plt.show()
From the Random Forest feature importance analysis, gender, investigator and benefactor are the least meaningful features. The results are aligned with the exploratory correlation analysis, in which the investigator was correlated with the quantifier and the distiller, and the harmonizer with the benefactor.
To check the stability of the feature selection, the same analysis is carried out using KBest:
In [26]:
#Feature Selection using KBest
#Score the features; the most relevant one (highest explanatory power) comes first
# Initialize and fit the model for features extraction
test = SelectKBest()
fit = test.fit(X, Y)
#Identify features with highest score from a predictive perspective
#Create dataframe with the features ordered by their explanatory power
features_names = X.columns
Bestfeatures = pd.DataFrame(fit.scores_, index = features_names)
Bestfeatures.columns = ['Best Features']
Bestfeatures.sort_values(by=['Best Features'], ascending=False)
Out[26]:
In this case, the results differ from the Random Forest relative importance analysis: the explanatory power of gender is higher, ranking in second place. This reinforces the correlation analysis previously done, in which gender did not present a high correlation with any of the other features.
Investigator appears as the fourth worst after innovator, influencer and catalyst. This is aligned with the exploratory analysis previously done regarding the correlation between features. Quantifier appears in the first position, harmonizer in the third.
Recursive Feature Elimination is carried out considering the maximum number of features given by the PCA.
In [27]:
# Features selection with Recursive Feature Elimination RFE model
#Set up the max number of features as indicated by PCA Analysis: number of features = 6
n_features = num_features
#Initialize the model and fit
lr = LogisticRegression()
rfe = RFE(lr,n_features)
fit = rfe.fit(X,Y)
# Summarize the features selection. Based on the number of features from the PCA analysis
#show all the features selected (true) and left out (false)
result_RFE = pd.DataFrame(list(zip(X.head(0), rfe.ranking_, rfe.support_)),columns=['Features','Ranking','Support'] )
result_RFE.sort_values('Ranking')
Out[27]:
In the RFE analysis, taking into consideration the number of meaningful features given by PCA, it appears that the features that were highly correlated with each other should not be considered. This is aligned with the exploratory analysis and with the Random Forest relative importance analysis. Hence, this will be the set of features chosen to build the classifier.
The sets of features from each of the analyses are displayed below for comparison purposes. As can be seen, all of them include the same features in a different order:
In [71]:
#Features selected using each of the methodologies
#Feature Selection using Random Forest
X_randomforest = X[['Gender', 'Catalyst_Score', 'Orderer_Score', 'Influencer_Score',
'Benefactor_Score', 'Harmonizer_Score', 'Investigator_Score',
'Quantifier_Score', 'Distiller_Score', 'Innovator_Score',
'Creator_Score']]
#Feature Selection using KBest
X_kbest = X[['Gender', 'Orderer_Score',
'Benefactor_Score', 'Harmonizer_Score', 'Investigator_Score',
'Quantifier_Score', 'Distiller_Score',
'Creator_Score']]
#Feature Selection using RFE & PCA
X_rfe = X[['Gender','Influencer_Score',
'Harmonizer_Score','Quantifier_Score',
'Distiller_Score', 'Creator_Score']]
From a feature selection standpoint, the PCA analysis reveals that 6 features are enough to explain the variance of the dataset. A feature importance selection (Random Forest), an explanatory power selection (KBest) and an RFE analysis have been carried out. The feature importance and RFE results are aligned with the correlations initially observed between the features.
The features that have been selected are the six (from the PCA analysis) given by the RFE analysis. These features are:
From the initial set of wow proposed to the company in the initial survey, and considering the given constraints, only 5 are required, in addition to Gender, which was not initially considered due to legal reasons. Further analysis of the regulations in place showed that gender can be asked in a survey, therefore the variable is included.
The dataset has been split 70/30 into train and test sets, and several models have been tuned on the training set and run on the test set, calculating their accuracy using cross validation. The purpose of this is to train and test the binary classification models while avoiding overfitting.
All models' hyperparameters were tuned on the training set. As a result, the confusion matrix, type I and II errors and the classification report have been obtained as part of the model selection.
As the purpose of the analysis is to accurately predict which of the two categories a new candidate will fit better, accuracy and type I & II errors have been considered.
The models to be tested for the lowest misclassification errors and best accuracy are:
Both classes (business and corporate), represented by 0 and 1, are balanced, and misclassification is equally important in all cases. There is no "negative or positive" case, as a candidate falsely predicted to perform better in a corporate role has the same effect as a candidate falsely predicted to perform better in a business role.
The null accuracy is initially calculated to check that the sample is balanced and to obtain the prediction of a "dumb" model for comparative purposes, as it is the minimum accuracy that a model should achieve.
As misclassifications in both the corporate and the business areas have the same cost, the overall accuracy will be used as the main score in the model selection process.
Further inspection of the false "positives" and "negatives" will be done only when the overall accuracy is similar between models.
The time required to fit the model and run the cross validation using five folds will be used to indicate the computational effort of each model (the sketch after this list summarises the evaluation routine repeated for each classifier).
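Since every classifier below repeats the same evaluation steps, they can be bundled into a small helper. This is only an illustrative sketch, not code used by the notebook (the cells below keep the original step-by-step code); it assumes the split data and the kf generator defined in the next cell, together with the metrics imported at the top, are available:
#Illustrative helper that bundles the repeated evaluation steps (timer, fit, predict, cross-validated accuracy, errors)
def evaluate_classifier(model, X_eval, y_eval, kf, name):
    start_time = time.time()
    #Fit and predict on the evaluation set, as done for each classifier below
    model.fit(X_eval, y_eval)
    pred_y = model.predict(X_eval)
    #Cross validate the accuracy of the predictions
    accuracy = cross_val_score(model, X_eval, y_eval, cv=kf).mean()
    print("--- %s seconds ---" % (time.time() - start_time))
    #Type I and II errors from the accuracy table
    table = pd.crosstab(y_eval, pred_y, margins=True)
    tI = table.loc[0.0, 1.0] / table.loc['All', 'All']
    tII = table.loc[1.0, 0.0] / table.loc['All', 'All']
    print(classification_report(y_eval, pred_y))
    print(confusion_matrix(y_eval, pred_y))
    print('{} accuracy: {}\nPercent Type I errors: {}\nPercent Type II errors: {}\n'.format(name, accuracy, tI, tII))
    return accuracy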
In [72]:
#Split the data into training and testing datasets. Split: 70/30; train/test
X_train, X_test, y_train, y_test = train_test_split(X_rfe,Y, test_size=0.3, random_state=111)
#Initialize the cross validation generator, N splits = 5
kf = KFold(5)
In [73]:
# Calculate null accuracy
max(y_test.mean(), 1 - y_test.mean())
Out[73]:
The null accuracy (similar to the one obtained by a "random monkey"), which due to the resampling corresponds to always predicting 0 (business), is 51.6%. This will be used as the minimum accuracy that any model should achieve.
The first model to be run is the Logistic Regression model. The following hyperparameters of the model have been tuned using GridSearchCV, with the overall accuracy as the selection strategy:
The model is initialized and several values are tested. Lower values of the parameter "C" give a stronger regularization of the model.
In [74]:
# Initialize and fit the model.
log_reg = LogisticRegression()
#Tune parameters
#C parameter
c_param = [0.01,0.1,0.5,1]
#Tune the type of penalty used between l1 and l2
penalty_type = ['l1','l2']
parameters = {'C': c_param, 'penalty': penalty_type}
#Fit parameters
log_reg_tuned = GridSearchCV(log_reg, param_grid=parameters, cv=kf)
#Fit the tuned classifier on the training set
log_reg_tuned.fit(X_train, y_train)
#Print the best parameters
print(log_reg_tuned.best_params_)
The tuned model is fit and run on the test set and the timer is started to check the computational effort:
In [75]:
#Start the timer as a measure of the computing effort
start_time = time.time()
#Fit the model on the test set
log_reg_tuned.fit(X_test, y_test)
pred_test_y = log_reg_tuned.predict(X_test)
In [76]:
#Evaluate model (test set)
#Cross validate the accuracy of the predictions
accuracy_log_reg_tuned = cross_val_score(log_reg_tuned,X_test,y_test,cv=kf).mean()
#Print the time required to fit and evaluate the model
print("--- %s seconds ---" % (time.time() - start_time))
#Define the target names to be evaluated in the classification report
target_names = ['0.0', '1.0']
#Build the confusion matrix
confusion = confusion_matrix(y_test, pred_test_y)
# Build Accuracy tables
table_test = pd.crosstab(y_test, pred_test_y, margins=True)
#Extract type 1 and 2 errors
test_tI_errors = table_test.loc[0.0,1.0] / table_test.loc['All','All']
test_tII_errors = table_test.loc[1.0,0.0] / table_test.loc['All','All']
#Print results: classification report, confusion matrix, accuracy and % of errors
print(classification_report(y_test, pred_test_y, target_names=target_names))
print(confusion)
print((
'Logistic Regression accuracy: {}\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}\n\n'
).format(accuracy_log_reg_tuned,test_tI_errors, test_tII_errors))
The overall accuracy of the Logistic Regression model is 64.1%. The obtained result is lower than the 2/3 requested by the company. Although the classification threshold could be lowered from 50%, other alternatives to the Logistic Regression model will be investigated. As a starting point, the overall accuracy is 24.5% higher than the one obtained by a "random monkey".
Delving into the results, the precision of the model is approximately the same for both classes, although the recall (completeness) of the model is lower for class 1. The F1 score shows that precision and recall are more or less balanced, at around 70%.
From the confusion matrix, false positives and negatives are fairly balanced, at 14.5% and 15.7% respectively. The computational effort is low, so if we were able to keep a similar computational effort (1.78 s) while improving the overall accuracy, we would have a potential candidate to move into production.
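For reference, the precision and recall reported for class 1 can be derived directly from the confusion matrix built in the previous cell; a minimal sketch using the confusion variable would be:
#Derive precision, recall and F1 for class 1 from the 2x2 confusion matrix [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion.ravel()
precision_1 = tp / (tp + fp)  # of the candidates predicted as 1, the fraction that really are 1
recall_1 = tp / (tp + fn)     # of the real 1s, the fraction that is recovered
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
print(precision_1, recall_1, f1_1)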
In [77]:
# Initialize and fit the model.
naive_bayes = BernoulliNB()
#Tune hyperparameters
#Create range of values to fit parameters
alpha = [0.01,0.1,0.5,1]
parameters = {'alpha': alpha}
#Fit parameters using gridsearch
naive_bayes_tuned = GridSearchCV(naive_bayes, param_grid=parameters, cv=kf)
#Fit the tuned classifier on the training set
naive_bayes_tuned.fit(X_train, y_train)
#Print the best hyperparameters set
print(naive_bayes_tuned.best_params_)
The tuned model is fit and run on the test set and the timer is started to check the computational effort:
In [78]:
#Start the timer as a measure of the computing effort
start_time = time.time()
# Predict on the test data set
#Fit the model with the new hyperparameters
naive_bayes_tuned.fit(X_test, y_test)
# Predict on training set
pred_test_y = naive_bayes_tuned.predict(X_test)
In [79]:
#Evaluate model on the test set
#Cross validate the accuracy of the predictions
accuracy_naive_bayes_tuned = cross_val_score(naive_bayes_tuned,X_test,y_test,cv=kf).mean()
#Print the time required to fit and evaluate the model
print("--- %s seconds ---" % (time.time() - start_time))
#Define the target names to be evaluated in the classification report
target_names = ['0.0', '1.0']
#Build the confusion matrix
confusion = confusion_matrix(y_test, pred_test_y)
# Build Accuracy tables
table_test = pd.crosstab(y_test, pred_test_y, margins=True)
#Extract type 1 and 2 errors
test_tI_errors = table_test.loc[0.0,1.0] / table_test.loc['All','All']
test_tII_errors = table_test.loc[1.0,0.0] / table_test.loc['All','All']
#Print results: classification report, confusion matrix, accuracy and % of errors
print(classification_report(y_test, pred_test_y, target_names=target_names))
print(confusion)
print((
'Naive Bayes accuracy: {}\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}\n\n'
).format(accuracy_naive_bayes_tuned,test_tI_errors, test_tII_errors))
The Naïve Bayes model scores an overall accuracy of 62.8%, lower than the Logistic Regression model and lower than the 2/3 threshold imposed by the company. From a computational effort standpoint, the time required (1.44 s) is very similar to, and slightly lower than, that of the Logistic Regression model, so in that respect it is a better candidate for production.
In this case, precision is again very similar for both classes, although recall is lower for the class tagged as 1. The model performs worse than the Logistic Regression, as both type I and II errors are higher, ranging between 13.8% and 17.6%.
As the minimum threshold set by the company has not been achieved, further models will be tested:
A KNeighbors model has been implemented and tuned on the training set. The parameters tuned are:
When tuning the model, the number of neighbors has been capped at 13, staying below 10% of the datapoints in the test set (roughly 15), to reduce overfitting.
In this case, some overfitting is expected, as the number of datapoints in the test set (159) and in the whole dataset is small, and this type of model performs poorly with small datasets.
In [104]:
# Initialize and fit the model
KNN = KNeighborsClassifier()
#Create range of values to fit parameters
neighbors = [3,5,7,9,11,13]
weight_sys = ['distance','uniform']
parameters = {'n_neighbors': neighbors, 'weights': weight_sys}
#Fit parameters using gridsearch
clf = GridSearchCV(KNN, param_grid=parameters, cv=kf)
#Fit the tunned model on the training set
clf.fit(X_train, y_train)
#Print the best hyperparameters set
print("Best Hyper Parameters:", clf.best_params_)
The tuned model is fit and run on the test set and the timer is started to check the computational effort:
In [105]:
#Start the timer as a measure of the computing effort
start_time = time.time()
#Initialize the model on test dataset
clf.fit(X_test, y_test)
# Predict on test dataset
pred_test_y = clf.predict(X_test)
In [106]:
#Evaluate model on the test set
#Cross validate the accuracy of the predictions
accuracy_clf = cross_val_score(clf,X_test,y_test,cv=kf).mean()
#Print the time required to fit and evaluate the model
print("--- %s seconds ---" % (time.time() - start_time))
#Define the target names to be evaluated in the classification report
target_names = ['0', '1']
#Build the confusion matrix
confusion = confusion_matrix(y_test, pred_test_y)
# Build Accuracy tables
table_test = pd.crosstab(y_test, pred_test_y, margins=True)
#Extract type 1 and 2 errors
test_tI_errors = table_test.loc[0.0,1.0] / table_test.loc['All','All']
test_tII_errors = table_test.loc[1.0,0.0] / table_test.loc['All','All']
#Print results: classification report, confusion matrix, accuracy and % of errors
print(classification_report(y_test, pred_test_y, target_names=target_names))
print(confusion)
print((
'KNN accuracy: {}\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}\n\n'
).format(accuracy_clf,test_tI_errors, test_tII_errors))
The overall accuracy of the KNeighbors model is 66%. In this case, the KNeighbors model performs better as a binary classifier and is the first model that achieves an overall accuracy equal to the threshold set by the company. From a computational effort standpoint, this model requires a higher computational effort (2.37 s), roughly 64.6% higher than that required by the Naïve Bayes classifier.
The model presents overfitting as expected (type I and II errors are zero, with precision and recall equal to 1) due to the small number of datapoints in the dataset. For this model to be more accurate and to reduce overfitting, a bigger dataset is required. Although the overall accuracy has significantly improved, further models will be tested:
A support vector classifier has been set up and tuned on the training data and run on the test set. The hyperparameters that have been tuned are:
In this case, as with the KNeighbors classifier, some overfitting and a sub-optimal performance are expected due to the size of the dataset, which is too small for this type of model.
In [107]:
# Initialize and fit the model
svc = SVC()
#Create range of values to fit parameters
c_param = np.arange(20)+1
kernel_type = ['linear','rbf']
parameters = {'C': c_param, 'kernel': kernel_type}
#Fit parameters using gridsearch
svc_parameters_tune = GridSearchCV(svc, param_grid=parameters, cv=kf)
#Fit the model on the training set
svc_parameters_tune.fit(X_train, y_train)
#Print the best hyperparameters set
print("Best Hyper Parameters:", svc_parameters_tune.best_params_)
The tuned model is fit and run on the test set and the timer is started to check the computational effort:
In [108]:
#Start the timer as a measure of the computing effort
start_time = time.time()
#Initialize the model on test dataset
svc_parameters_tune.fit(X_test, y_test)
# Predict on test dataset
pred_test_y = svc_parameters_tune.predict(X_test)
In [109]:
#Evaluate model on the test set
#Cross validate the accuracy of the predictions
accuracy_svc_parameters_tune = cross_val_score(svc_parameters_tune,X_test,y_test,cv=kf).mean()
#Print the time required to fit and evaluate the model
print("--- %s seconds ---" % (time.time() - start_time))
#Define the target names to be evaluated in the classification report
target_names = ['0.0', '1.0']
#Build the confusion matrix
cnf = confusion_matrix(y_test, pred_test_y)
# Build Accuracy tables
table_test = pd.crosstab(y_test, pred_test_y, margins=True)
#Extract type 1 and 2 errors
test_tI_errors = table_test.loc[0.0,1.0] / table_test.loc['All','All']
test_tII_errors = table_test.loc[1.0,0.0] / table_test.loc['All','All']
#Print results: classification report, confusion matrix, accuracy and % of errors
print(classification_report(y_test, pred_test_y, target_names=target_names))
print(cnf)
print((
'SVC accuracy:{}\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}\n\n'
).format(accuracy_svc_parameters_tune,test_tI_errors, test_tII_errors))
The overall accuracy is 67.8%, which is very similar to the one obtained by the KNeighbors classifier. In this case, the computational effort is significantly higher, 4.5 times that required by the KNeighbors classifier. This model behaves better in terms of overfitting and also in terms of the type I and II errors, which equate to approximately 4.4% and 10.1%.
The accuracy is still low considering the requirements of the RFC, and the computational effort is high compared to the previous classifiers that achieve similar accuracy results. Another model will be tested:
In [43]:
# Initialize and fit the model
dtc = DecisionTreeClassifier()
#Tune hyperparameters
#Create range of values to fit parameters
max_leaf_nodes_options = [5, 10, 15, 20]
param_max_leaf_nodes = {'max_leaf_nodes': max_leaf_nodes_options}
#Tune parameters using gridsearch
max_leaf_nodes = GridSearchCV(dtc, param_grid=param_max_leaf_nodes, cv=kf)
#Fit the classifier on the training set
max_leaf_nodes.fit(X_train, y_train)
#The best hyperparameters set
print("Best Hyper Parameters:", max_leaf_nodes.best_params_)
The tuned model is fit and run on the test set and the timer is started to check the computational effort:
In [44]:
#Start the timer as a measure of the computing effort
start_time = time.time()
# Predict on the test data set
#Fit the model with the new hyperparameters
max_leaf_nodes.fit(X_test, y_test)
# Predict on test set
pred_test_y = max_leaf_nodes.predict(X_test)
In [45]:
#Evaluate model on the test set
#Cross validate the accuracy of the predictions
accuracy_dtc = cross_val_score(max_leaf_nodes,X_test,y_test,cv=kf).mean()
#Print the time required to fit and evaluate the model
print("--- %s seconds ---" % (time.time() - start_time))
#Define the target names to be evaluated in the classification report
target_names = ['0.0', '1.0']
#Build the confusion matrix
cnf = confusion_matrix(y_test, pred_test_y)
# Build Accuracy tables
table_test = pd.crosstab(y_test, pred_test_y, margins=True)
#Extract type 1 and 2 errors
test_tI_errors = table_test.loc[0.0,1.0]/table_test.loc['All','All']
test_tII_errors = table_test.loc[1.0,0.0]/table_test.loc['All','All']
#Print results: classification report, confusion matrix, accuracy and % of errors
print(classification_report(y_test, pred_test_y, target_names=target_names))
print(cnf)
print((
'Decision Tree accuracy:{}\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}'
).format(accuracy_dtc,test_tI_errors, test_tII_errors))
The accuracy of the decision tree (67.9%) is nearly equal to the one obtained by the Support Vector Classifier. In this case, the overall accuracy is within the range of values that a decision tree typically obtains in similar circumstances. Type I and II errors are smaller than in the previous models that did not present overfitting, equating to 5% in both cases.
In this case, the computational effort to achieve a similar overall accuracy to the previous models is significantly lower, this being the first classifier that requires less than 1 s. Compared to the Support Vector Classifier, the Decision Tree requires only 2.8% of the computational effort used by the former.
Hence, comparing the accuracy obtained by all the classifiers so far, the Decision Tree classifier would be the one chosen, as it requires a lower computational effort for a similar accuracy. As the overall accuracy is still low, further models will be tried:
The hyperparameters of the random forest model have been tuned one by one. Compared to tuning them all at once, a significant increase in the overall performance of the classifier was obtained with the proposed method (one by one). The parameters to be tuned are (in the same order as the hyperparameter tuning has been performed):
In [46]:
#For the Random Forest hyperparameter tuning, due to computational restrictions,
#grid search will be applied to one parameter at a time on the train set,
#updating the value as we move along the hyperparameter tuning
# Initialize the model and tune the hyperparameters
#Create range of values to fit parameters for the number of estimators
param_n_estim = {'n_estimators':range(1,201,10)}
#Fit parameters using gridsearch
n_estimator = GridSearchCV(estimator = RandomForestClassifier(),
param_grid = param_n_estim, scoring='roc_auc',n_jobs=4,iid=False,
cv=kf)
#Fit the model
n_estimator.fit(X_train, y_train)
#The best hyper parameters set
n_estimator.grid_scores_, n_estimator.best_params_, n_estimator.best_score_
Out[46]:
In [51]:
#With the number of estimators set, determine max depth and min sample split
#Create range of values to fit parameters
param_depth = {'max_depth':range(12,20,1)}
#Tune hyperparameters with gridsearch
depth = GridSearchCV(estimator = RandomForestClassifier(n_estimators=181),
param_grid = param_depth, scoring='roc_auc',n_jobs=4,iid=False, cv=kf)
#Fit the model
depth.fit(X_train, y_train)
#The best hyper parameters set
depth.grid_scores_, depth.best_params_, depth.best_score_
Out[51]:
The tuned model is fit and run on the test set and the timer is started to check the computational effort:
In [54]:
#Start the timer as a measure of the computing effort
start_time = time.time()
#Fit the model using the test dataset
depth.fit(X_test, y_test)
#Predict on test dataset
pred_test_y = depth.predict(X_test)
In [55]:
#Evaluate model on the test set
#Cross validate the accuracy of the predictions
accuracy_depth = cross_val_score(depth,X_test,y_test,cv=kf).mean()
#Print the time required to fit and evaluate the model
print("--- %s seconds ---" % (time.time() - start_time))
#Define the target names to be evaluated in the classification report
target_names = ['0', '1']
#Build the confusion matrix
cnf = confusion_matrix(y_test, pred_test_y)
# Build Accuracy tables
table_test = pd.crosstab(y_test, pred_test_y, margins=True)
#Extract type 1 and 2 errors
test_tI_errors = table_test.loc[0.0,1.0]/table_test.loc['All','All']
test_tII_errors = table_test.loc[1.0,0.0]/table_test.loc['All','All']
#Print results: classification report, confusion matrix, accuracy and % of errors
print(classification_report(y_test, pred_test_y, target_names=target_names))
print(cnf)
print((
'Random Forest accuracy:{}\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}'
).format(accuracy_depth,test_tI_errors, test_tII_errors))
The overall accuracy of the model has significantly increased compared to the previous classifiers, reaching 82.3%. This result is aligned with the type of classifier used, being a reasonable accuracy for this kind of model. From the classification report and the type I and II errors, the model presents overfitting due to the size of the dataset. The increase in accuracy comes at a computational cost: compared to the Decision Tree Classifier, the accuracy has improved by roughly 21% while the computational effort has increased significantly (from less than 1 s to 14.5 s), reaching values similar to the Support Vector Classifier. Given the requirement, this is the first classifier that performs better than the threshold set by the company. Further testing will be done on another model to see if the computational effort can be reduced while maintaining or improving the overall accuracy.
The gradient boosting model has been trained and parameters tuned. The parameters that have been tuned are:
In [57]:
#For the Gradient Boosting hyperparameters tuning
#grid search will be applied to one parameter at a time on the train set
#updating the value as we move along the hyperparameters tuning
# Initialize the model and tune the hyperparameters
#Create range of values to fit parameters for the number of estimators
parameters_basic_gdc = {'n_estimators':range(80,200,10),'max_depth':range(10,20,2), 'min_samples_split':range(2,10,2)}
#Fit parameters using gridsearch
basic_gdc = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1,
max_features='sqrt',
subsample=0.8,
random_state=10),
param_grid = parameters_basic_gdc, scoring='roc_auc',n_jobs=-1,iid=False, cv=kf)
#Fit the model
basic_gdc.fit(X_train, y_train)
#Print the hyperparameters set
basic_gdc.grid_scores_, basic_gdc.best_params_, basic_gdc.best_score_
Out[57]:
In [58]:
#Re run the min_sample split with the min_sample leaf
parameters_tuning_gdc = {'min_samples_split':range(2,11,1),'min_samples_leaf':range(1,71,10)}
tuning_gdc = GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=190,
max_depth=16,
min_samples_split=4,
max_features='sqrt',
subsample=0.8,
random_state=10),
param_grid = parameters_tuning_gdc, scoring='roc_auc',n_jobs=4,iid=False, cv=kf)
#Fit the model
tuning_gdc.fit(X_train, y_train)
#Print the hyperparameters set
tuning_gdc.grid_scores_, tuning_gdc.best_params_, tuning_gdc.best_score_
Out[58]:
In [59]:
#With the number of estimators min_sammples_leaf and min sample split, tune the max_features
#Create range of values to fit parameters
param_max_features = {'max_features':range(1,6,1)}
#Tune hyperparameters with gridsearch
max_features_gbc= GridSearchCV(estimator = GradientBoostingClassifier(learning_rate=0.1, n_estimators=101,
max_depth=13,
min_samples_split=3,
min_samples_leaf = 1,
max_features='sqrt',
subsample=0.8,
random_state=10),
param_grid = param_max_features, scoring='roc_auc',n_jobs=4,iid=False, cv=kf)
#Fit the model
max_features_gbc.fit(X_train, y_train)
#The best hyper parameters set
max_features_gbc.grid_scores_, max_features_gbc.best_params_, max_features_gbc.best_score_
Out[59]:
The tuned model is fit and run on the test set and the timer is started to check the computational effort:
In [60]:
#Start the timer as a measure of the computing effort
start_time = time.time()
# Predict on the test data set
#Fit the model with the new hyperparameters
max_features_gbc.fit(X_test, y_test)
# Predict on the test set
pred_test_y = max_features_gbc.predict(X_test)
In [61]:
#Evaluate model on the test set
#Cross validate the accuracy of the predictions
accuracy_features_gdc = cross_val_score(max_features_gbc,X_test,y_test,cv=kf).mean()
#Print the time required to fit and evaluate the model
print("--- %s seconds ---" % (time.time() - start_time))
#Define the target names to be evaluated in the classification report
target_names = ['0.0', '1.0']
#Build the confusion matrix
cnf = confusion_matrix(y_test, pred_test_y)
#Build Accuracy tables
table_test = pd.crosstab(y_test, pred_test_y, margins=True)
#Extract type 1 and 2 errors
test_tI_errors = table_test.loc[0.0,1.0]/table_test.loc['All','All']
test_tII_errors = table_test.loc[1.0,0.0]/table_test.loc['All','All']
#Print results: classification report, confusion matrix, accuracy and % of errors
print(classification_report(y_test, pred_test_y, target_names=target_names))
print(cnf)
print((
'Gradient Boosting accuracy:{}\n'
'Percent Type I errors: {}\n'
'Percent Type II errors: {}'
).format(accuracy_features_gdc,test_tI_errors, test_tII_errors))
The Gradient Boosting Classifier achieves an accuracy (82.5%) similar to the one obtained by the Random Forest (82.3%). In this case, the computational effort is nearly half of that required by the Random Forest for a similar overall accuracy. This model also presents overfitting (errors are equal to zero), most probably due to the strategy used to tune the hyperparameters (based on roc_auc) and the size of the dataset.
As in the case of the Random Forest Classifier, the accuracy is within the region requested by the company.
This is the best candidate of all the models to be moved into production, as it fulfils the company's requirements in terms of accuracy and requires half the computational effort of the Random Forest Classifier.
The purpose of this project was to predict the best match for candidates between two broad areas within an Oil & Gas company based in Spain. To do so, a survey was designed with 30 questions to analyze ten preferred ways of working of employees from different areas of the company.
The 30 questions were grouped into the 10 ways of working, analyzing three different aspects of each. The company has been divided into two major areas: corporate and business. The former includes all support functions while the latter includes the frontline business. A time window of one month or receiving more than 90% of the completed surveys was given, closing the survey once one of the two thresholds was reached. In this case, 93% of the surveys were received before the time ended.
The model required to solve the problem is a binary classification model (only two outcomes). The initial set of predictors was 11: the 10 ways of working plus gender. According to the surveys received, there is a gender imbalance, with men/women at 38%/62%.
An exploratory data analysis was carried out to see whether there are any patterns in the data and what kind of distribution each of the features follows. From this analysis, none of the features follows a normal distribution, but nearly all (except two predictors, "Benefactor_Score" and "Innovator_Score") have distributions centered around 10. The mean and median of all the distributions are more or less the same, so all of them are centered on the mean.
The predictors were analyzed and high correlation (over 50%) was found in the following cases:
A PCA analysis was run to determine the number of features required to describe the data. In this case, six features seem to be enough and that number was used for the recursive feature elimination. Feature importance using Random Forest and feature selection using KBest were also run.
In all cases, the same features were selected and the ones presenting a high correlation were excluded. The final features that have been selected are:
Although PCA features were built, the company's requirement to use as features only a subset of the initial features designed makes the use of PCA features or any kind of feature engineering unfeasible. The main reason for this is that the best features will be used to design the second version of the survey for potential candidates.
The outcome variable is “Section”, with two possible outcomes, 0 or 1 (business or corporate).
The models used to build the binary classifier have been:
For the model selection process, the overall accuracy has been used as the main selection criterion. As misclassifications in both directions (corporate and business) have the same cost, the overall accuracy is an acceptable indicator.
In all cases, the main hyperparameters were tuned on the train set (70%) and run on the test set (30%) after the data was normalized. A classification report, the total accuracy and type I and II errors were calculated on the test set.
The computational effort, used to see which classifier could be taken into production, has been computed considering the time required to fit and complete the cross validation on the test set with five folds.
The Logistic Regression (64%), Naïve Bayes (63%), KNeighbors (66%), Support Vector (67%) and Decision Tree (68%) classifiers reach an accuracy close to the 66% threshold imposed by the company. Although they all achieve similar accuracies, from a computational effort standpoint the Support Vector Classifier is the one that requires the most computational power (10.85 s) for a similar accuracy, compared to the Decision Tree that only requires 0.30 s.
The only two classifiers that present a higher accuracy are the Random Forest Classifier and the Gradient Boosting Classifier. In both cases, the accuracy achieved is approximately 82%, although the computational power required by the Gradient Boosting Classifier (7.84 s) is half of that required by the Random Forest Classifier (14.48 s).
Regarding misclassification errors, the KNeighbors Classifier, Random Forest and Gradient Boosting show zero values, which might indicate overfitting due to the small size of the dataset (159 datapoints in the test set). Classifiers such as KNeighbors and Support Vector Machines perform worse with smaller datasets (as in this case).
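To make the comparison easier to read, the cross-validated accuracies stored throughout the notebook can be gathered into a single table; a minimal sketch, assuming the accuracy variables from the earlier cells are still in scope and typing in the timings quoted earlier in the notebook by hand, would be:
#Assemble the model comparison summary from the accuracies computed above
#(timings are the values quoted in the text, entered manually)
summary = pd.DataFrame({
    'Accuracy': [accuracy_log_reg_tuned, accuracy_naive_bayes_tuned, accuracy_clf,
                 accuracy_svc_parameters_tune, accuracy_dtc, accuracy_depth, accuracy_features_gdc],
    'Time (s)': [1.78, 1.44, 2.37, 10.85, 0.30, 14.48, 7.84]},
    index=['Logistic Regression', 'Naive Bayes', 'KNeighbors', 'SVC',
           'Decision Tree', 'Random Forest', 'Gradient Boosting'])
print(summary.sort_values('Accuracy', ascending=False))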
From the overall accuracy standpoint (the strategy used for model selection due to the constraints of the project) and the computational power required, the classifier that would be implemented in production is the Gradient Boosting Classifier, as it reaches 82% overall accuracy with a computational power between that required by the KNeighbors Classifier and the Support Vector Classifier.