In [2]:
import numpy as np
import pandas as pd
Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. Each example is a pair consisting of an input object (typically a vector) and a desired output value. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.
This dataset contains 13910 measurements from 16 chemical sensors used in simulations for drift compensation in a task of discriminating 6 gases at various concentration levels. It comprises recordings of six distinct pure gaseous substances, namely Ammonia, Acetaldehyde, Acetone, Ethylene, Ethanol, and Toluene, each dosed at a wide variety of concentration values ranging from 5 to 1000 ppmv.
In [3]:
data = pd.read_csv("./formatted_data.csv",header=0, index_col=False)
data.head()
Out[3]:
In [4]:
# Build the list of columns to drop: Sensor_A1 .. Sensor_P1
# (chr maps 65-80 to 'A'-'P'), plus the batch number
drop_cols = ['Sensor_'+x+'1' for x in map(chr,range(65,81))]
drop_cols.append('Batch_No')
data = data.drop(drop_cols, axis=1)
data.head()
Out[4]:
In [5]:
data.describe()
Out[5]:
Dealing with missing values - Real-world datasets often contain missing values, represented by blanks, NaNs, etc.
Encoding categorical features - A label encoder transforms non-numerical labels into numerical labels. Another approach is dummy encoding, where an attribute is converted by creating one indicator variable per level of the categorical variable: presence of a level is represented by 1 and absence by 0. For every level present, one dummy variable is created. Both approaches are sketched below.
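A minimal sketch of both encodings on a toy frame (the gas column below is fabricated purely for illustration):

```python
import pandas as pd
from sklearn import preprocessing

# Toy frame for illustration only; not part of this dataset
df = pd.DataFrame({'gas': ['Ammonia', 'Acetone', 'Ammonia', 'Toluene']})

# Label encoding: map each level to an integer
le = preprocessing.LabelEncoder()
df['gas_label'] = le.fit_transform(df['gas'])

# Dummy (one-hot) encoding: one 0/1 indicator column per level
dummies = pd.get_dummies(df['gas'], prefix='gas')
print(pd.concat([df, dummies], axis=1))
```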
In this dataset, our target labels are categorical values that have already been encoded as numbers:
[1: Ethanol; 2: Ethylene; 3: Ammonia; 4: Acetaldehyde; 5: Acetone; 6: Toluene]
In [6]:
from sklearn import preprocessing
target = data['Label']
data = data.drop('Label', axis=1)
# Scale each feature to the range [-1, 1]
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(-1,1))
data_scaled = min_max_scaler.fit_transform(data)
If we use the entire dataset to train our model, it will end up modeling random error/noise present in the data, and will have poor predictive performance on unseen future data. This situation is known as Overfitting. To avoid this we hold out part of the available data as a test set and use the remaining for training. Some common splits are 90/10, 80/20, 75/25.
There is a risk of overfitting on the test set as you try to optimize the hyperparameters of parametric models to achieve optimal performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”. Thus training is carried out on the training set, evaluation is done on the validation set and once the parameters have been tuned, final evaluation is carried out on the "unseen" test set.
The drawback of this approach is that we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets. To solve this we use a procedure called Cross-validation which is discussed later.
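Since train_test_split only produces two sets, one simple way to carve out a validation set is to split twice. A sketch (the 60/20/20 proportions here are an arbitrary choice):

```python
from sklearn.model_selection import train_test_split

# First hold out 20% as the final, untouched test set...
X_rest, X_test_, y_rest, y_test_ = train_test_split(
    data_scaled, target, test_size=0.2, random_state=0)
# ...then split the remainder 75/25 into train and validation sets,
# giving roughly 60/20/20 of the original data overall
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)
```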
In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_scaled, target, test_size=0.25, random_state=0)
A binomial (binary) classification problem is one where the dataset has 2 target classes. We are dealing with a multinomial classification problem, as we have more than 2 target classes in our dataset. To leverage binary classifiers for multinomial classification we can use one of the following strategies, both sketched in code after the list:
One-vs-All: train a single classifier per class, with the samples of that class as positives and all other samples as negatives.
One-vs-One: train K(K-1)/2 binary classifiers for a K-class problem; each receives the samples of a pair of classes from the original training set and must learn to distinguish those two classes. At prediction time, a voting scheme is applied: all K(K-1)/2 classifiers are run on the unseen sample, and the class that collects the highest number of votes is predicted by the combined classifier.
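scikit-learn implements both strategies in sklearn.multiclass; a minimal sketch wrapping a linear SVM (the choice of base estimator here is arbitrary):

```python
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

# One-vs-All: one binary classifier per class (6 here)
ovr = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X_train, y_train)
# One-vs-One: one classifier per pair of classes, K(K-1)/2 = 15 for K=6
ovo = OneVsOneClassifier(LinearSVC(random_state=0)).fit(X_train, y_train)
print(len(ovr.estimators_), len(ovo.estimators_))  # 6 and 15
```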
Decision Trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable based on several input variables. It is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node.
Non-parametric models (can) become more and more complex with an increasing amount of data.
In [8]:
from sklearn import tree
dt_classifier = tree.DecisionTreeClassifier()
dt_classifier = dt_classifier.fit(X_train, y_train)   # learn the tree from the training set
y_pred = dt_classifier.predict(X_test)                # predict labels for the held-out test set
print("Accuracy: %0.2f" % dt_classifier.score(X_test, y_test))
In the above code snippet we fit a decision tree to our dataset, used it to make predictions on the test set, and calculated its accuracy as the fraction of correct predictions out of all predictions made. Accuracy is a starting point but is not a sufficient measure of a model's predictive power, due to a phenomenon known as the Accuracy Paradox: it yields misleading results if the dataset is imbalanced, as the snippet below illustrates.
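To make the paradox concrete, compare against a baseline that ignores the inputs entirely and always predicts the most frequent training class; on an imbalanced dataset such a "model" can still post a high accuracy. A sketch using scikit-learn's DummyClassifier:

```python
from sklearn.dummy import DummyClassifier

# A no-skill baseline: always predict the most frequent training class
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print("Baseline accuracy: %0.2f" % dummy.score(X_test, y_test))
```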
A clean and unambiguous way to visualize the performance of a classifier is to use a confusion matrix:
| | Predicted class - Positive | Predicted class - Negative |
|---|---|---|
| Actual class - Positive | True Positive (TP) | False Negative (FN) |
| Actual class - Negative | False Positive (FP) | True Negative (TN) |
We use these values to calculate Precision and Recall-
Precision answers the following question: out of all the examples the classifier labeled as positive, what fraction were actually positive? $$Precision = \frac{TP}{TP + FP}$$
Recall answers: out of all the positive examples there were, what fraction did the classifier pick up? It is calculated as $$Recall = \frac{TP}{TP + FN}$$
The harmonic mean of Precision and Recall is known as the F1 score. It conveys the balance between precision and recall. $$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
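A quick worked example with hypothetical binary counts, chosen purely for illustration:

```python
# Hypothetical counts for a binary problem
TP, FP, FN = 40, 10, 5

precision = TP / float(TP + FP)   # 40/50 = 0.80
recall = TP / float(TP + FN)      # 40/45 ~ 0.89
f1 = 2 * precision * recall / (precision + recall)
print("Precision %.2f, Recall %.2f, F1 %.2f" % (precision, recall, f1))
```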
Let's visualize the confusion matrix for the decision tree. The scikit-learn function just returns a nested array without any labels, so we plot it for easier interpretation.
In [9]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import itertools
%matplotlib inline
def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    # Print each count in its cell, in a colour that contrasts with the background
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
#plt.figure()
plot_confusion_matrix(confusion_matrix(y_test, y_pred), classes=["Ethanol", "Ethylene", "Ammonia", "Acetaldehyde", "Acetone", "Toluene"],
title='Confusion matrix')
In [10]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=["Ethanol", "Ethylene", "Ammonia", "Acetaldehyde", "Acetone", "Toluene"]))
Cross-validation is a method for estimating the prediction accuracy of a model on an unseen dataset without using a dedicated validation set. Instead of holding out just one part of the data, you hold out different parts in turn: for each part, you train on the rest and evaluate on the held-out part. You have then effectively used all of your data for both training and testing, without ever testing on data you trained on.
The most common variants are K-Fold, Stratified K-Fold, Leave-One-Out, and ShuffleSplit; the first two are sketched below.
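A brief sketch of the two most common iterators (the stratified variant preserves class proportions in every fold):

```python
from sklearn.model_selection import KFold, StratifiedKFold

# K-Fold: k equal parts, each used once as the held-out fold
kf = KFold(n_splits=5, shuffle=True, random_state=0)
# Stratified K-Fold: folds preserve the class proportions of the full set
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(data_scaled, target):
    print(len(train_idx), len(test_idx))
```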
In [11]:
from sklearn.model_selection import cross_val_score
def cv_score(clf, k):
    # k-fold cross-validation, scored with the macro-averaged F1
    f1_scores = cross_val_score(clf, data_scaled, target, cv=k, scoring='f1_macro')
    print(f1_scores)
    print("F1 score: %0.2f (+/- %0.2f)" % (f1_scores.mean(), f1_scores.std() * 2))
By default, the score computed at each CV iteration is the score method of the estimator; it is possible to change this with the scoring parameter, as cv_score does with 'f1_macro'.
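For example, scoring can name a built-in metric or wrap any metric function with make_scorer (the F-beta choice below is purely for illustration):

```python
from sklearn.model_selection import cross_val_score
from sklearn.metrics import fbeta_score, make_scorer

# A built-in metric can be named directly...
acc = cross_val_score(dt_classifier, data_scaled, target, cv=10, scoring='accuracy')
# ...or any metric can be wrapped with make_scorer; F-beta with beta=2
# weighs recall more heavily than precision
f2_scorer = make_scorer(fbeta_score, beta=2, average='macro')
f2 = cross_val_score(dt_classifier, data_scaled, target, cv=10, scoring=f2_scorer)
print("Accuracy %0.2f, F2 %0.2f" % (acc.mean(), f2.mean()))
```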
In [12]:
cv_score(dt_classifier,10)
Ensemble methods are a divide-and-conquer approach used to improve performance. The main principle behind ensemble methods is that a group of "weak learners" can come together to form a "strong learner". Ensemble learning methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent algorithms alone. One such method is Random Forests.
Decision trees are a popular and easy-to-interpret method, but trees that are grown very deep tend to learn highly irregular patterns (noise). They tend to overfit their training sets, i.e. they have low bias but very high variance.
Random Forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in bias and some loss of interpretability, but generally boosts the performance of the final model because the individual trees are less correlated. So how are the trees different? Each tree is trained on a bootstrap sample of the training set (drawn with replacement), and at each split only a random subset of the features is considered.
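The bootstrap half of that idea can be sketched by hand; a toy version of bagging reusing X_train/y_train from above (the per-split feature subsampling is omitted for brevity):

```python
import numpy as np
from sklearn import tree

# Hand-rolled bagging: each tree sees a different bootstrap sample,
# drawn with replacement, of the training set
rng = np.random.RandomState(0)
y_arr = np.asarray(y_train)
trees = []
for _ in range(5):
    idx = rng.randint(0, len(X_train), len(X_train))  # bootstrap indices
    trees.append(tree.DecisionTreeClassifier().fit(X_train[idx], y_arr[idx]))

# Majority vote across the ensemble
votes = np.array([t.predict(X_test) for t in trees])
y_vote = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```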
In [13]:
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators=5)  # an ensemble of 5 trees
#rf_classifier = rf_classifier.fit(X_train, y_train)
#y_pred_rf = rf_classifier.predict(X_test)
cv_score(rf_classifier,10)
Uncomment the fit/predict lines above and the following snippets of code to view the confusion matrix and classification report for the Random Forest model.
In [14]:
#plot_confusion_matrix(confusion_matrix(y_test, y_pred_rf), classes=["Ethanol", "Ethylene", "Ammonia", "Acetaldehyde", "Acetone", "Toluene"],
# title='Confusion matrix')
In [15]:
#print(classification_report(y_test, y_pred_rf, target_names=["Ethanol", "Ethylene", "Ammonia", "Acetaldehyde", "Acetone", "Toluene"]))
The bias–variance tradeoff is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set.
Bias: error from erroneous assumptions in the learning algorithm. High bias can cause underfitting: the algorithm misses relevant relations between features and target outputs.
Variance: error from sensitivity to small fluctuations in the training set. High variance can cause overfitting: modeling the random noise in the training data rather than the intended outputs.
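One way to see the tradeoff is to sweep a complexity knob, such as the decision tree's max_depth, and watch training and cross-validation scores diverge. A sketch using validation_curve (the depth grid is an arbitrary choice):

```python
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Sweep max_depth and compare training vs cross-validated scores
depths = [2, 4, 6, 8, 10, 12]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), data_scaled, target,
    param_name='max_depth', param_range=depths, cv=5)
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # Low depth: both scores low (high bias / underfitting).
    # High depth: train score near 1 while CV score stalls (high variance)
    print("max_depth=%2d  train=%.2f  cv=%.2f" % (d, tr, va))
```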
An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall. SVMs are parametric learners, so the model is described by a finite number of parameters.
In [16]:
from sklearn import svm
In [17]:
# RBF-kernel SVM; decision_function_shape='ovr' returns a one-vs-rest shaped decision function
svm_classifier = svm.SVC(C=1.0, kernel='rbf', gamma='auto', cache_size=9000, decision_function_shape='ovr')
cv_score(svm_classifier,10)
Hyperparameter optimization is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance.
Grid search, or a parameter sweep, is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search must be guided by some performance metric, typically measured by cross-validation on the training set or by evaluation on a held-out validation set.
In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedShuffleSplit
param_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
grid = GridSearchCV(svm.SVC(), param_grid=param_grid, cv=cv)
grid.fit(data_scaled, target)  # fit on the scaled features, consistent with the SVM above
print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))
StratifiedShuffleSplit, used above, is a variation of ShuffleSplit that returns stratified splits, i.e. it creates splits that preserve the percentage of each target class found in the complete set.
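A quick sanity check of the stratification: the class proportions in each test split should mirror those of the full dataset (a sketch reusing the target labels 1-6):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)
for train_idx, test_idx in sss.split(data_scaled, target):
    # Per-class proportions in this test split
    counts = np.bincount(np.asarray(target)[test_idx])[1:]  # labels are 1..6
    print(np.round(counts / float(counts.sum()), 3))
```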