MSc Machine Learning Assignment - Classification task. Private Kaggle Competition: "Are you sure Brighton's seagull is not a man-made object?" The aim of the assignment was to build a classifier able to distinguish between man-made and non-man-made objects. Each data instance was represented by a 4608-dimensional feature vector: a concatenation of 4096-dimensional deep Convolutional Neural Network (CNN) features extracted from the fc7 activation layer of CaffeNet and 512-dimensional GIST features.
Three additional pieces of information were provided: a confidence label for each training instance, the test data class proportions, and additional training data containing missing values.
This Notebook contains the final workflow employed to produce the model used to make the final predictions. The original Notebook contained a lot of trial and error, such as fine-tuning the ranges of parameters fitted in the model. Some of these details from the original, rather messy Notebook have been excluded here; this Notebook only intends to show the code for the key processes leading up to the final model. In addition, it contains a report/commentary documenting the theory behind each step of the workflow. The theory is adopted from the literature and referenced appropriately.
The report is also available as a PDF; contact me to request it.
1.1) Introduction of SVM
The approach of choice here was the Support Vector Machine (SVM). SVMs were pioneered in the late seventies [1]. SVMs are supervised learning models, which are extensively used for classification [2] and regression tasks [3]. In this context, SVM was employed for a binary classification task.
In layman’s terms, the basic premise of SVM for classification is to find the optimal separating hyperplane (also called the decision boundary) between classes by maximizing the margin between the decision boundary and the data points closest to it. The points closest to the decision boundary are termed support vectors. The margin is maximized to improve the generalisation of the decision boundary: many separating boundaries may exist, but the one that maximizes the margin increases the likelihood that future, slightly outlying points will still be correctly classified [4]. This intuition seems relatively simple, but is complicated by the distinction between ‘hard’ and ‘soft’ margins. A hard margin is only applicable when the data set is linearly separable: it tolerates no misclassification of the training data. A soft margin is applicable when the data set is not linearly separable: it anticipates that some points cannot be separated correctly and tolerates misclassifications through a penalty term. These concepts are formalized below.
Assume the problem of binary classification on a dataset $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$, where $x_i \in R^d$, i.e. $x_i$ is a data point represented as a d-dimensional vector, and $y_i \in \{-1, 1\}$ is the class label of that data point, for $i = 1, 2, ..., n$. A better separation can often be found by first transforming the data into a higher-dimensional feature space via a non-linear mapping $\phi$ [2]; the kernel function is the inner product of these mappings, $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$. A candidate decision boundary can then be represented by $w \cdot \phi(x) + b = 0$, where $w$ is the weight vector orthogonal to the decision boundary and $b$ is an intercept term. If the data set is linearly separable in the feature space, the decision boundary that maximizes the margin is found by solving the optimization $\min (\frac{1}{2} w \cdot w)$ subject to $y_i (w \cdot \phi(x_i) + b) \ge 1$ for $i = 1, 2, ..., n$. This encapsulates the concept of a ‘hard’ margin. In the case of non-linearly separable data, the constraint is relaxed by introducing slack variables $\varepsilon_i$. The optimization problem becomes $\min(\frac{1}{2} w \cdot w + C \sum_{i=1}^n \varepsilon_i)$ subject to $y_i (w \cdot \phi(x_i) + b) \ge 1 - \varepsilon_i$ and $\varepsilon_i \ge 0$ for $i = 1, 2, ..., n$. The term $\sum_{i=1}^n \varepsilon_i$ can be interpreted as the misclassification cost. This objective comprises two aims: the first is still to maximize the margin, and the second is to reduce the number and extent of margin violations. The trade-off between these two aims is controlled by the parameter $C$. This encapsulates the concept of a ‘soft’ margin.
$C$ is termed the regularization parameter. A high value of $C$ increases the penalty for misclassifications and thus places more emphasis on the second aim: a large misclassification penalty forces the model to reduce the number of misclassifications, so a high enough value of $C$ can induce over-fitting. A small value of $C$ decreases the penalty for misclassifications and thus places more emphasis on the first aim: the model tolerates misclassifications more readily, so a small enough value of $C$ can induce under-fitting.
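To make the role of $C$ concrete, here is a minimal, purely illustrative sketch (synthetic data and hypothetical values, not part of the assignment workflow) showing how $C$ is passed to scikit-learn's SVC and how cross-validated accuracy can be compared across settings:
#Illustration only: comparing small, moderate and large C on synthetic data
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
for C_value in (0.01, 1.0, 100.0):
    demo_svm = SVC(C=C_value, kernel='rbf')
    scores = cross_val_score(demo_svm, X_demo, y_demo, cv=5, scoring='accuracy')
    #a very small C tolerates many margin violations (risk of under-fitting);
    #a very large C penalizes violations heavily (risk of over-fitting)
    print(C_value, scores.mean())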
The SVM classifier is trained using the hinge-loss as the loss function [5].
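Explicitly, for a labelled point $(x_i, y_i)$ and decision function $f(x) = w \cdot \phi(x) + b$, the hinge loss is $\ell(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))$: it is zero when the point lies on the correct side of the margin and grows linearly with the size of the margin violation, which corresponds exactly to the slack term above, $\varepsilon_i = \max(0, 1 - y_i(w \cdot \phi(x_i) + b))$.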
1.2) Suitability of SVM
SVM is a popular technique because of its solid mathematical foundations, high generalisation capability, ability to find global solutions and ability to find non-linear decision boundaries [6]. However, SVMs can be adversely affected by data sets with unequal class balances. Methods for dealing with this are discussed in Section 2.3 of this report; with such methods, SVMs remain applicable to imbalanced data sets, such as the one provided here. It has also been argued that SVMs show superior performance to other techniques on high-dimensional data [7]. The dataset here, even after pre-processing, has many dimensions, so the use of SVM in this context is justified. Another drawback of SVM is its dependency on feature scaling: the performance of an SVM can be strongly affected by the choice of scaling method. Feature scaling is nevertheless an important pre-processing technique; one encouraging reason for employing it is that optimisation algorithms such as gradient descent converge much faster on scaled features than on unscaled ones. In particular, feature scaling reduces the time it takes for the SVM to find its support vectors [8].
This section will cover how the training data for the final model was prepared. Several additional pieces of information were provided in the assignment outline; this section will demonstrate how these strands of information were incorporated, if they were incorporated at all.
In [1]:
#Import Relevant Modules and Packages
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from scipy import stats
from sklearn.feature_selection import VarianceThreshold
#see all rows of dataframe
#pd.set_option('display.max_rows', 500)
In [2]:
#Load the complete training data set
training_data = pd.read_csv("/Users/Max/Desktop/Max's Folder/Uni Work/Data Science MSc/Machine Learning/ML Kaggle Competition /Data Sets/Training Data Set.csv", header=0, index_col=0)
In [3]:
#Observe the original training data
training_data.head()
Out[3]:
In [4]:
#quantify class counts of original training data
training_data.prediction.value_counts()
Out[4]:
2.1) Dealing with Missing Values – Imputation
Imputation is the act of replacing missing values in a data set with meaningful values. Simply removing rows with missing feature values is bad practice when data are scarce, as a lot of information can be lost; in addition, deletion methods can introduce bias [9]. The incomplete additional training data was combined with the complete original training data because the original data was scarce in number. The additional training data contained missing values, therefore imputation was appropriate, if not required. Two methods of imputation were tried. The first was imputation via feature means. However, this method has been heavily criticized; in particular, it has been argued that mean imputation introduces bias and underestimates variability [10]. The second was k-Nearest-Neighbours (kNN) imputation [11], one of the family of hot-deck imputation techniques [12], in which missing feature values are filled in from data points that are similar, or geometrically speaking, closest in distance. Given the flaws of mean imputation, this method is more appropriate, and kNN was therefore the imputation method used to build the final model. The kNN implementation was taken from the ‘fancyimpute’ package [13]. The k of kNN is a parameter that needs to be chosen carefully; fortunately, the literature provides some direction. The work of [14] suggests that kNN with 3 nearest neighbours gives the best trade-off between imputation error and preservation of data structure. In summary, kNN was employed for imputation, and k was set to 3.
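For reference, an equivalent kNN imputation can be sketched with scikit-learn's KNNImputer (the workflow below uses fancyimpute instead); full_training_data_inc refers to the concatenated training frame with NaN entries created in the cells that follow:
#Sketch only: kNN imputation via scikit-learn's KNNImputer (the final model used fancyimpute)
import pandas as pd
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=3)   #k = 3, following [14]
imputed_array = imputer.fit_transform(full_training_data_inc)
imputed_df = pd.DataFrame(imputed_array,
                          index=full_training_data_inc.index,
                          columns=full_training_data_inc.columns)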
This section will cover how the incomplete additional training data set was incorporated to develop a larger training data set. In particular, the additional training data was combined with the original training data. The additional training data was incomplete, with several NaN entries; thus, imputation was performed to replace the NaN entries with meaningful values.
In [5]:
#Load additional training data
add_training_data = pd.read_csv("/Users/Max/Desktop/Max's Folder/Uni Work/Data Science MSc/Machine Learning/ML Kaggle Competition /Data Sets/Additional Training Data Set .csv", header=0, index_col=0)
In [6]:
#observe additional training data
add_training_data
Out[6]:
In [7]:
#quantify class counts of additional training data
add_training_data.prediction.value_counts()
Out[7]:
In [8]:
#find number of NAs for each column for additional training data
add_training_data.isnull().sum()
Out[8]:
In [9]:
#concatenate original training data with additional training data
full_training_data_inc = pd.concat([training_data, add_training_data])
#observe concatenated training data
full_training_data_inc
Out[9]:
A couple of imputation methods were tried in the original Notebook: imputation via feature means and imputation via K-Nearest Neighbours (kNN).
The most effective, and theoretically best-supported, method was the second: imputation using K-Nearest Neighbours. Note: the fancyimpute package may need installing before running. K was set to 3 here; see the report above for justification.
In [10]:
#imputation via KNN
from fancyimpute import KNN
knn_trial = full_training_data_inc
knn_trial
complete_knn = KNN(k=3).complete(knn_trial)
In [11]:
#convert imputed matrix back to dataframe for visualisation and convert 'prediction' dtype to int
complete_knn_df = pd.DataFrame(complete_knn, index=full_training_data_inc.index, columns=full_training_data_inc.columns)
full_training_data = complete_knn_df
full_training_data.prediction = full_training_data.prediction.astype('int')
full_training_data
Out[11]:
In [12]:
#quantify class counts for full training data
full_training_data.prediction.value_counts()
Out[12]:
2.2) Dealing with Confidence Labels
One approach for incorporating the confidence labels was to use the confidence label of each instance as the corresponding sample weight. Theoretically, a confidence label smaller than 1 scales down the effective C parameter for that instance, which results in a lower penalty for misclassifying an instance whose label is not known with certainty. However, in practice this did not follow the theory; introducing the sample weights reduced the overall accuracy of the model. The matter was further complicated by the fact that samples generated from over-sampling via SMOTE would also have to be assigned a confidence label, which is difficult to determine objectively. Thus, it was decided that only data instances with a confidence label of 1 should be retained in the training data. This obviously leads to a considerable loss of information; however, after removing instances without a confidence label of 1, 1922 training instances remained, which can be assumed to be a reasonable training data size. After truncating the data set, the class-balancing procedure described in Section 2.3 was applied to the truncated training data. In summary, the training data was truncated to include only instances with a confidence label of 1, the minority class was over-sampled using SMOTE to balance the class split, and class weights were then applied during the training of the SVM to make the model more sensitive to correctly classifying the majority class of the test data.
This section will cover how the confidence labels, one of the additional pieces of information provided in the assignment outline, were incorporated into the final training data set.
In [13]:
#Load confidence annotations
confidence_labels = pd.read_csv("/Users/Max/Desktop/Max's Folder/Uni Work/Data Science MSc/Machine Learning/ML Kaggle Competition /Data Sets/Annotation Confidence .csv", header=0, index_col=0)
In [14]:
#quantify confidence labels (how many are 1, how many are 0.66)
print(confidence_labels.confidence.value_counts())
#observe confidence annotations
confidence_labels
Out[14]:
In [15]:
#adding confidence of label column to imputed full training data set
full_train_wcl = pd.merge(full_training_data, confidence_labels, left_index=True, right_index=True)
full_train_wcl
Out[15]:
The original Notebook tried a couple of methods of incorporating the confidence labels into the model: (1) using the confidence labels as per-instance sample weights, and (2) retaining only the instances with a confidence label of 1.
The best model was based on Method 2; thus, only Method 2 is carried through in the remainder of this section. A rough sketch of the abandoned Method 1 is included below for reference.
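This is a minimal sketch only, assuming the merged frame full_train_wcl created above; Method 1 passes the confidence values as per-sample weights to scikit-learn's SVC, which scales the misclassification penalty C per instance. It did not make it into the final workflow.
#Sketch only: Method 1 - confidence labels as per-sample weights (not used in the final model)
from sklearn.svm import SVC

#split the merged frame into features, labels and weights
X_wcl = full_train_wcl.drop(['prediction', 'confidence'], axis=1).values
y_wcl = full_train_wcl['prediction'].values
weights = full_train_wcl['confidence'].values   #1 or 0.66 per instance

#sample_weight scales C per instance, so uncertain labels are penalized less when misclassified
weighted_svm = SVC(C=1.0, kernel='rbf')
weighted_svm.fit(X_wcl, y_wcl, sample_weight=weights)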
In [16]:
#only keep data instance with confidence label = 1
conf_full_train = full_train_wcl.loc[full_train_wcl['confidence'] == 1]
conf_full_train
Out[16]:
In [17]:
#quantify class counts
conf_full_train.prediction.value_counts()
Out[17]:
In [18]:
#convert full training data dataframe with confidence instances only to matrix
conf_ft_matrix = conf_full_train.as_matrix(columns=None)
conf_ft_matrix
conf_ft_matrix.shape
Out[18]:
In [19]:
#splitting full training data with confidence into inputs and outputs
conf_ft_inputs = conf_ft_matrix[:,0:4608]
print(conf_ft_inputs.shape)
conf_ft_outputs = conf_ft_matrix[:,4608]
print(conf_ft_outputs.shape)
2.3) Dealing with Class Imbalance
Binary classification tasks often suffer from imbalanced class splits. Training a model on a data set with more instances of one class than the other can result in a bias towards the majority class, as sensitivity is lost in detecting the minority class [17]. This is pertinent because the training data (additional and original data combined) has an unbalanced class split, with more instances of Class 1 than Class 0; training on this data would therefore result in a model biased towards Class 1 predictions. To exacerbate the issue, the test data is also unbalanced, but its majority class is Class 0. There are two primary methods of dealing with class imbalance: re-balancing (or deliberately further unbalancing) the data set as needed, or introducing class weights, where the underlying algorithm applies different misclassification penalties to different classes [15]. Both approaches were combined here: first the data set was balanced, and then the model was trained to be biased towards Class 0, as Class 0 is the majority class in the test data. The ‘imbalanced-learn’ API [16] has implementations of class-balancing strategies from the literature, such as SMOTE [17]. The premise of SMOTE is to over-sample the minority class, synthesising new minority instances from their k nearest neighbours, until the data set is balanced; unlike kNN for imputation, the suggested k here is 5. Once the data set was balanced through SMOTE, class weights were introduced. Considering the test data has more instances belonging to Class 0, the class weights were adjusted so that misclassification of Class 0 is penalized more heavily than misclassification of Class 1. The ratio of class weights used for training was set to match the class proportions of the test data, i.e. Class 0 weight = 1.33 and Class 1 weight = 1. Over-sampling of the minority class was preferred over under-sampling of the majority class because the data was already scarce (evident from the preceding sections). Furthermore, over-sampling to a class balance permits the use of plain accuracy as the evaluation metric, as opposed to AUC, which is more complex. In summary, as well as balancing the training data class split, the model itself was adjusted to place more emphasis on correct Class 0 classifications.
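As a quick sanity check on the chosen weights, the arithmetic below derives the 1.33 ratio from the stated test-set proportions (57.14% Class 0 and 42.86% Class 1 over 4200 test instances):
#Deriving the class-weight ratio from the test class proportions
n_test = 4200
n_class0 = int(round(0.5714 * n_test))      #2400 expected Class 0 instances
n_class1 = n_test - n_class0                #1800 expected Class 1 instances
weight_class0 = n_class0 / float(n_class1)  #= 1.33..., applied to Class 0
weight_class1 = 1.0
print(weight_class0, weight_class1)         #passed to SVC as class_weight={0:1.33, 1:1}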
This section will cover how the class imbalance of the training data was addressed. The best approach was over-sampling using SMOTE, which over-samples the minority class until the data set is completely balanced. Note: the imblearn package may need to be installed first.
In [20]:
from imblearn.over_sampling import SMOTE
from collections import Counter
In [21]:
#fit over-sampling to training data inputs and outputs
over_sampler = SMOTE(ratio='auto', k_neighbors=5, kind='regular', random_state=0)
over_sampler.fit(conf_ft_inputs, conf_ft_outputs)
Out[21]:
In [22]:
#create new inputs and outputs with correct class proportions
resampled_x, resampled_y = over_sampler.fit_sample(conf_ft_inputs, conf_ft_outputs)
In [23]:
#quantify original class proportions prior to over-sampling
Counter(conf_ft_outputs)
Out[23]:
In [24]:
#quantify class proportions after over-sampling
Counter(resampled_y)
Out[24]:
In [25]:
#assign newly sampled input and outputs to old variable name used for inputs and outputs before
#over-sampling
conf_ft_inputs = resampled_x
conf_ft_outputs = resampled_y
print(Counter(conf_ft_outputs))
The pre-processing of the data consisted of several steps. First, the features were rescaled appropriately. Second, feature extraction was performed to reduce the unwieldy dimensionality of the training data, concomitantly increasing the signal-to-noise ratio and decreasing time complexity.
This section will cover the pre-processing that produced the model capable of the best predictions. Feature scaling was attempted via several methods, of which standardisation proved best. Feature extraction was achieved via PCA.
3.1) Feature Scaling
Feature scaling is important because it ensures that all features are measured on the same scale, irrespective of the units used to describe the original features. Feature scaling can take the form of standardization, normalization or rescaling. The correct choice of feature scaling method is somewhat arbitrary and highly dependent on context; thus, all three approaches were tried, and the best results were obtained with standardization.
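A minimal sketch of the three scaling options that were compared, assuming they map onto scikit-learn's StandardScaler (standardization), MinMaxScaler (rescaling) and Normalizer (normalization); only standardization is carried through below:
#Sketch: the three feature-scaling options compared in the original Notebook
from sklearn import preprocessing

standardizer = preprocessing.StandardScaler()   #zero mean, unit variance per feature (used in the final model)
rescaler = preprocessing.MinMaxScaler()         #rescale each feature to the [0, 1] range
normalizer = preprocessing.Normalizer()         #scale each sample (row) to unit norm

X_std = standardizer.fit_transform(conf_ft_inputs)
X_minmax = rescaler.fit_transform(conf_ft_inputs)
X_norm = normalizer.fit_transform(conf_ft_inputs)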
In [26]:
#standardise the full training data with confidence labels 1 only
scaler_2 = preprocessing.StandardScaler().fit(conf_ft_inputs)
std_conf_ft_in = scaler_2.transform(conf_ft_inputs)
std_conf_ft_in
Out[26]:
3.2) Principal Component Analysis (PCA)
High dimensionality should be reduced because the data are likely to contain noisy features and because high dimensionality increases computational time complexity [18]. Dimensionality reduction can be achieved via feature selection methods, such as filters and wrappers [19], or via feature extraction methods, such as PCA [20]. Here, dimensionality reduction was conducted via feature extraction, specifically PCA. The rationale is that the relative importance of the GIST and CNN features is undetermined, and feature selection methods may require some domain expertise to be effective. PCA uses the eigenvectors and eigenvalues of the covariance matrix to construct principal components: uncorrelated directions that each explain some proportion of the variance found in the dataset. The optimal number of principal components to retain is not known a priori, so it was configured experimentally, by plotting the variance explained as a function of the number of principal components included and by calculating the cross-validation test score for data transformed with different numbers of principal components.
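To make the covariance/eigenvector description concrete, the sketch below computes principal components directly with NumPy on the standardized inputs; it is for intuition only (and slow for 4608 features), since the workflow uses sklearn.decomposition.PCA, which yields the same components up to sign:
#Sketch: PCA via eigendecomposition of the covariance matrix (intuition only; the workflow uses sklearn's PCA)
import numpy as np

X_centered = std_conf_ft_in - std_conf_ft_in.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)              #4608 x 4608 feature covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)              #eigh, since the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]                   #sort components by variance explained
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

n_components = 230
pc_scores = X_centered.dot(eigvecs[:, :n_components])           #project data onto the top components
explained_ratio = eigvals[:n_components].sum() / eigvals.sum()  #proportion of variance retained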
In [27]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#preprocessing: PCA (feature construction). High number of pcs chosen to plot a graph
#showing how much more variance is explained as pc number increases
pca_2 = PCA(n_components=700, random_state=0)
std_conf_ft_in_pca = pca_2.fit_transform(std_conf_ft_in)
#quantify amount of variance explained by principal components
print("Total Variance Explained by PCs (%): ", np.sum(pca_2.explained_variance_ratio_))
The cell below will plot how much more of the variance in the data set is explained as the number of principal components included is increased.
In [28]:
#calculate a list of cumulative sums for amount of variance explained
cumulative_variance = np.cumsum(pca_2.explained_variance_ratio_)
len(cumulative_variance)
#add 0 to the beginning of the list, otherwise list starts with variance explained by 1 pc
cumulative_variance = np.insert(cumulative_variance, 0, 0)
#define range of pcs
pcs_4_var_exp = np.arange(0,701,1)
len(pcs_4_var_exp)
fig_1 = plt.figure(figsize=(7,4))
plt.title('Number of PCs and Change In Variance Explained')
plt.xlabel('Number of PCs')
plt.ylabel('Variance Explained (%)')
plt.plot(pcs_4_var_exp, cumulative_variance, 'x-', color="r")
plt.show()
The graph above suggests that the number of principal components should not exceed 300, as the additional variance explained diminishes rapidly beyond that point. For the optimisation, the number of principal components was initially set to 230.
In [29]:
#preprocessing: PCA (feature construction)
pca_2 = PCA(n_components=230, random_state=0)
std_conf_ft_in_pca = pca_2.fit_transform(std_conf_ft_in)
#quantify ratio of variance explain by principal components
print("Total Variance Explained by PCs (%): ", np.sum(pca_2.explained_variance_ratio_))
The optimization was conducted using an exhaustive grid search, for two kernels: the polynomial kernel and the RBF kernel. The initial search was conducted on a logarithmic scale to explore as much of the parameter space as possible. From the results, the parameter ranges were refined and pruned to include only the most promising candidates. The choice of parameters was based purely on accuracy, not on practical factors such as memory consumption or prediction time. The best model was selected on the merits of the cross-validated grid-search accuracy, the validation curve and the learning curve shown below.
4.1) Parameter Optimisation
In [30]:
#this cell takes around 7 minutes to run
#parameter optimisation with Exhaustive Grid Search, with class weight
original_c_range = np.arange(0.85, 1.01, 0.01)
gamma_range = np.arange(0.00001, 0.00023, 0.00002)
#define parameter ranges to test
param_grid = [{'C': original_c_range, 'gamma': gamma_range, 'kernel': ['rbf'],
'class_weight':[{0:1.33, 1:1}]}]
#define model to do parameter search on
svr = SVC()
#return_train_score is needed for the training-accuracy heatmap below
clf = GridSearchCV(svr, param_grid, scoring='accuracy', cv=5, return_train_score=True)
clf.fit(std_conf_ft_in_pca, conf_ft_outputs)
#create dictionary of results
results_dict = clf.cv_results_
#convert the results into a dataframe
df_results = pd.DataFrame.from_dict(results_dict)
df_results
Out[30]:
The cell below will plot two heat-maps side by side: one showing the mean cross-validation (test) accuracy and one showing the mean training accuracy, for each combination of C and gamma.
In [31]:
#Draw heatmap of the validation accuracy as a function of gamma and C
fig = plt.figure(figsize=(10, 10))
ix=fig.add_subplot(1,2,1)
val_scores = clf.cv_results_['mean_test_score'].reshape(len(original_c_range),len(gamma_range))
val_scores
ax = sns.heatmap(val_scores, linewidths=0.5, square=True, cmap='PuBuGn',
xticklabels=gamma_range, yticklabels=original_c_range, cbar_kws={'shrink':0.5})
ax.invert_yaxis()
plt.yticks(rotation=0, fontsize=10)
plt.xticks(rotation= 70,fontsize=10)
plt.xlabel('Gamma', fontsize=15)
plt.ylabel('C', fontsize=15)
plt.title('Validation Accuracy', fontsize=15)
#Draw heatmap of the training accuracy as a function of gamma and C
ix=fig.add_subplot(1,2,2)
train_scores = clf.cv_results_['mean_train_score'].reshape(len(original_c_range),len(gamma_range))
train_scores
#plt.figure(figsize=(6, 6))
ax_1 = sns.heatmap(train_scores, linewidths=0.5, square=True, cmap='PuBuGn',
xticklabels=gamma_range, yticklabels=original_c_range, cbar_kws={'shrink':0.5})
ax_1.invert_yaxis()
plt.yticks(rotation=0, fontsize=10)
plt.xticks(rotation= 70,fontsize=10)
plt.xlabel('Gamma', fontsize=15)
plt.ylabel('C', fontsize=15)
plt.title('Training Accuracy', fontsize=15)
plt.show()
The cells below will plot a validation curve for gamma.
In [32]:
#import module/library
from sklearn.model_selection import validation_curve
import matplotlib.pyplot as plt
%matplotlib inline
In [33]:
#specifying gamma parameter range to plot for validation curve
param_range = gamma_range
param_range
Out[33]:
In [34]:
#calculating train and validation scores
train_scores, valid_scores = validation_curve(SVC(C=0.92, kernel='rbf', class_weight={0:1.33, 1:1}), std_conf_ft_in_pca, conf_ft_outputs, param_name='gamma',param_range=param_range,scoring='accuracy')
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
valid_scores_mean = np.mean(valid_scores, axis=1)
valid_scores_std = np.std(valid_scores, axis=1)
In [35]:
#plotting validation curve
plt.title('Gamma Validation Curve for SVM With RBF Kernel | C=0.92')
plt.xlabel('Gamma')
plt.ylabel('Score')
plt.xticks(rotation=70)
plt.ylim(0.8,1.0)
plt.xlim(0.0001,0.00021)
plt.xticks(param_range)
lw=2
plt.plot(param_range, train_scores_mean, 'o-',label="Training Score", color='darkorange', lw=lw)
plt.fill_between(param_range, train_scores_mean-train_scores_std, train_scores_mean+train_scores_std, alpha=0.2, color='darkorange', lw=lw)
plt.plot(param_range, valid_scores_mean, 'o-',label="Testing Score", color='navy', lw=lw)
plt.fill_between(param_range, valid_scores_mean-valid_scores_std, valid_scores_mean+valid_scores_std, alpha=0.2, color='navy', lw=lw)
plt.legend(loc='best')
plt.show()
The cells below will plot the Learning Curve.
In [36]:
#import module/library
from sklearn.model_selection import learning_curve
In [37]:
#define training data size increments
td_size = np.arange(0.1, 1.1, 0.1)
#calculating train and validation scores
train_sizes, train_scores, valid_scores = learning_curve(SVC(C=0.92, kernel='rbf', gamma=0.00011, class_weight={0:1.33, 1:1}), std_conf_ft_in_pca, conf_ft_outputs, train_sizes=td_size ,scoring='accuracy')
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
valid_scores_mean = np.mean(valid_scores, axis=1)
valid_scores_std = np.std(valid_scores, axis=1)
In [38]:
#plotting learning curve
fig = plt.figure(figsize=(5,5))
plt.title('Learning Curve with SVM with RBF Kernel| C=0.92 & Gamma = 0.00011', fontsize=9)
plt.xlabel('Train Data Size')
plt.ylabel('Score')
plt.ylim(0.8,1)
lw=2
plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training Score")
plt.fill_between(train_sizes, train_scores_mean-train_scores_std, train_scores_mean+train_scores_std, alpha=0.2, color='red', lw=lw)
plt.plot(train_sizes, valid_scores_mean, 'o-', color="g",label="Testing Score")
plt.fill_between(train_sizes, valid_scores_mean-valid_scores_std, valid_scores_mean+valid_scores_std, alpha=0.2, color='green', lw=lw)
plt.legend(loc='best')
plt.show()
The cells below show the optimisation of the number of principal components to include. This is done by specifying a range of principal-component counts, conducting PCA for each count in the range, and calculating the average test score over 3-fold cross-validation. The procedure is repeated 5 times to account for the randomness of the PCA solver, and the average test accuracy over the 5 runs is plotted against the number of principal components included.
In [39]:
#this cell may take several minutes to run
#plot how the number of PC's changes the test accuracy
no_pcs = np.arange(20, 310, 10)
compute_average_of_5 = []
for t in range(0,5):
    pcs_accuracy_change = []
    for i in no_pcs:
        dummy_inputs = std_conf_ft_in
        dummy_outputs = conf_ft_outputs
        pca_dummy = PCA(n_components=i)
        pca_dummy.fit(dummy_inputs)
        dummy_inputs_pca = pca_dummy.transform(dummy_inputs)
        dummy_model = SVC(C=0.92, kernel='rbf', gamma=0.00011, class_weight={0:1.33, 1:1})
        dummy_model.fit(dummy_inputs_pca, dummy_outputs)
        dummy_scores = cross_val_score(dummy_model, dummy_inputs_pca, dummy_outputs, cv=3, scoring='accuracy')
        mean_cv = dummy_scores.mean()
        pcs_accuracy_change.append(mean_cv)
    print(len(pcs_accuracy_change))
    compute_average_of_5.append(pcs_accuracy_change)
In [40]:
#calculate position specific average for the five trials
from __future__ import division
average_acc_4_pcs = [sum(e)/len(e) for e in zip(*compute_average_of_5)]
In [41]:
plt.title('Number of PCs and Change In Accuracy')
plt.xlabel('Number of PCs')
plt.ylabel('Accuracy (%)')
plt.plot(no_pcs, average_acc_4_pcs, 'o-', color="r")
plt.show()
The following cells will prepare the test data by getting it into the right format.
In [43]:
#Load the test data set
test_data = pd.read_csv("/Users/Max/Desktop/Max's Folder/Uni Work/Data Science MSc/Machine Learning/ML Kaggle Competition /Data Sets/Testing Data Set.csv", header=0, index_col=0)
In [44]:
##Observe the test data
test_data
Out[44]:
In [45]:
#turn test dataframe into matrix
test_data_matrix = test_data.as_matrix(columns=None)
test_data_matrix.shape
Out[45]:
The following cell will apply the same pre-processing applied to the training data to the test data.
In [46]:
#pre-process test data in same way as train data
scaled_test = scaler_2.transform(test_data_matrix)
transformed_test = pca_2.transform(scaled_test)
transformed_test.shape
Out[46]:
The following cells will produce predictions on the test data using the final model.
In [47]:
#define and fit final model with best parameters from grid search
final_model = SVC(C=0.92, cache_size=1000, kernel='rbf', gamma=0.00011, class_weight={0:1.33, 1:1})
final_model.fit(std_conf_ft_in_pca, conf_ft_outputs)
Out[47]:
In [48]:
#make test data predictions
predictions = final_model.predict(transformed_test)
#create dictionary for outputs matched with ID
to_export = {'ID': np.arange(1, 4201, 1), 'prediction': predictions}
to_export
#convert to dataframe
final_predictions = pd.DataFrame.from_dict(to_export)
final_predictions
Out[48]:
In [49]:
#convert prediction column float type entries to integers
final_predictions = final_predictions.astype('int')
final_predictions
Out[49]:
In [50]:
#check properties of predictions: class balance should be 42.86(1):57.14(0)
#i.e. should predict 2400 Class 0 instances, and 1800 Class 1 instances
final_predictions.prediction.value_counts()
Out[50]:
[1] Vapnik V. (1982) Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York. (Original Russian edition, 1979.)
[2] Cortes C, Vapnik V. (1995) Support Vector Networks. Machine Learning. Vol. 20: pages 273-297.
[3] Drucker H, Burges CJC, Kaufman L, Smola A, Vapnik V. (1997) Support vector regression machines. Advances in Neural Information Processing Systems. Vol. 9: pages 155-161.
[4] Vapnik VN. (1982) Estimation of Dependences Based on Empirical Data. Addendum 1, New York: Springer-Verlag.
[5] Rosasco L, De Vito E, Caponnetto A, Piana M, Verri A. (2004) Are Loss Functions All the Same? Neural Computation. Vol. 16: pages 1063-1076.
[6] Batuwita R, Palade V. (2012) Class Imbalance learning methods for Support Vector Machines. In: Imbalanced Learning: Foundations, Algorithms and Applications, by He H, Ma Y. John Wiley & Sons: Chapter 6.
[7] Lian H. (2012) On feature selection with principal component analysis for one-class SVM. Pattern Recognition Letters. Vol. 33: pages 1027-1031.
[8] Juszczak P, Tax DMJ, Duin RPW. (2002) Feature scaling in support vector data descriptions. Proc. 8th Annual Conf. Adv. School Comput. Imaging: pages 1-8. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.100.2524&rep=rep1&type=pdf
[9] Greenland S, Finkle WD. (1995) A critical look at methods for handling missing covariates in epidemiologic regression analyses. Am J Epidemiol. Vol. 142: pages 1255-1264.
[10] Horton NJ, Kleinman KP. (2007) Much ado about nothing: A comparison of missing data methods and software to fit incomplete data regression models. Am Stat. Vol. 61: pages 79-90.
[11] Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics. Vol. 17: pages 520-525.
[12] Andridge RR, Little RJ. (2010) A Review of Hot Deck Imputation for Survey Non-response. Int Stat Review. Vol. 78: pages 40-64.
[13] Rubinsteyn A, Feldman S, O’Donnell T, Beaulieu-Jones B. (2015) fancyimpute 0.2.0. Package found on: https://github.com/hammerlab/fancyimpute.
[14] Beretta L, Santaniello A. (2016) Nearest neighbour imputation algorithms: a critical evaluation. BMC Medical Informatics and Decision Making. Vol. 16: pages 197-208.
[15] Barandela R, Valdovinos RM, Sanchez JS, Ferri FJ. (2004) The Imbalanced Training Sample Problem: Under or Over Sampling? Springer-Verlag, Berlin: pages 806-814.
[16] Lemaître G, Nogueira F, Aridas CK. (2017) Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research. Vol. 18: pages 1-5.
[17] Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. (2002) SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research. Vol. 16: pages 321-357.
[18] Strong DM, Lee YW, Wang RY. (1997) Data Quality in context. Communications of the ACM. Vol. 40: pages 103-110.
[19] Blum AL, Langley P. (1997) Selection of relevant features and examples in Machine Learning. Artificial Intelligence. Vol. 97: pages 245-271.
[20] Hira ZM, Gillies DF. (2015) A Review of Feature Selection and Feature Extraction Methods Applied on Microarray data. Advances in Bioinformatics. Vol. 2015: pages 1-13.