Classification example

In this example we will explore binary classification using logistic regression: estimating whether a room is occupied or not, based on physical parameters measured from it using sensors. The implementation of logistic regression with the gradient descent algorithm shares many similarities with that of linear regression, explained in the last unit. In this unit we will rely on the implementation offered by sklearn.

1) Reading and inspecting the data

For this example we will use the Occupancy Detection Dataset obtained here: https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+

The dataset is described here: Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Luis M. Candanedo, Véronique Feldheim. Energy and Buildings. Volume 112, 15 January 2016, Pages 28-39


In [1]:
%matplotlib inline

import pandas as pd #used for reading/writing data 
import numpy as np #numerical library
from matplotlib import pyplot as plt #used for plotting
import sklearn #machine learning library

occupancyData = pd.read_csv('data/occupancy_data/datatraining.txt')

We can visualize its contents:

We first look at the first 10 records. Then we compute some general statistics for all records, and finally we look at the mean and std for the two classes we want to classify into (occupied and not occupied).


In [5]:
occupancyData.head(10)


Out[5]:
date Temperature Humidity Light CO2 HumidityRatio Occupancy
1 2015-02-04 17:51:00 23.180 27.2720 426.0 721.250000 0.004793 1
2 2015-02-04 17:51:59 23.150 27.2675 429.5 714.000000 0.004783 1
3 2015-02-04 17:53:00 23.150 27.2450 426.0 713.500000 0.004779 1
4 2015-02-04 17:54:00 23.150 27.2000 426.0 708.250000 0.004772 1
5 2015-02-04 17:55:00 23.100 27.2000 426.0 704.500000 0.004757 1
6 2015-02-04 17:55:59 23.100 27.2000 419.0 701.000000 0.004757 1
7 2015-02-04 17:57:00 23.100 27.2000 419.0 701.666667 0.004757 1
8 2015-02-04 17:57:59 23.100 27.2000 419.0 699.000000 0.004757 1
9 2015-02-04 17:58:59 23.100 27.2000 419.0 689.333333 0.004757 1
10 2015-02-04 18:00:00 23.075 27.1750 419.0 688.000000 0.004745 1

In [6]:
occupancyData.describe()


Out[6]:
Temperature Humidity Light CO2 HumidityRatio Occupancy
count 8143.000000 8143.000000 8143.000000 8143.000000 8143.000000 8143.000000
mean 20.619084 25.731507 119.519375 606.546243 0.003863 0.212330
std 1.016916 5.531211 194.755805 314.320877 0.000852 0.408982
min 19.000000 16.745000 0.000000 412.750000 0.002674 0.000000
25% 19.700000 20.200000 0.000000 439.000000 0.003078 0.000000
50% 20.390000 26.222500 0.000000 453.500000 0.003801 0.000000
75% 21.390000 30.533333 256.375000 638.833333 0.004352 0.000000
max 23.180000 39.117500 1546.333333 2028.500000 0.006476 1.000000

In [7]:
occupancyData.groupby('Occupancy').mean()


Out[7]:
Temperature Humidity Light CO2 HumidityRatio
Occupancy
0 20.334931 25.349685 27.776442 490.320312 0.003730
1 21.673192 27.147938 459.854347 1037.704786 0.004355

In [8]:
occupancyData.groupby('Occupancy').std()


Out[8]:
Temperature Humidity Light CO2 HumidityRatio
Occupancy
0 0.909973 5.294887 89.598692 152.919609 0.000753
1 0.622891 6.128497 42.286862 377.603278 0.001006

A priori we can see a big difference in Light and CO2 between the occupied and non-occupied states. We will see whether these parameters play an important role in the classification.
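Before fitting anything, we can quantify this separation per feature. The helper below is just a sketch (class_separation is our own hypothetical function, and the miniature frame is synthetic, only there to illustrate the idea): it divides the absolute difference of per-class means by the overall standard deviation, so larger values suggest a more discriminative feature.

```python
import pandas as pd

def class_separation(df, label_col):
    """Absolute difference of per-class means divided by the overall std,
    for each numeric feature: a rough measure of how well each feature
    separates the two classes."""
    grouped = df.groupby(label_col).mean(numeric_only=True)
    diff = (grouped.loc[1] - grouped.loc[0]).abs()
    return diff / df.drop(columns=label_col).std(numeric_only=True)

# Hypothetical miniature frame with the same column layout as the dataset
toy = pd.DataFrame({
    'Light': [0, 5, 10, 400, 420, 440],
    'CO2':   [440, 450, 460, 900, 1000, 1100],
    'Occupancy': [0, 0, 0, 1, 1, 1],
})
print(class_separation(toy, 'Occupancy').sort_values(ascending=False))
```

Applied to occupancyData, the same call would rank Light and CO2 at the top, matching what the groupby tables suggest.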

To continue, we split the data into input and output parameters:


In [9]:
occupancyDataInput = occupancyData.drop(['Occupancy', 'date'], axis=1)
occupancyDataOutput = occupancyData['Occupancy']

As we saw in the last unit, in order to improve convergence speed and accuracy we usually normalize the input parameters to zero mean and unit variance.


In [10]:
occupancyDataInput = (occupancyDataInput - occupancyDataInput.mean())/ occupancyDataInput.std()
occupancyDataInput.describe()


Out[10]:
Temperature Humidity Light CO2 HumidityRatio
count 8.143000e+03 8.143000e+03 8.143000e+03 8.143000e+03 8.143000e+03
mean -7.083962e-13 -1.177775e-13 -8.376778e-16 1.961562e-15 -3.568507e-14
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
min -1.592150e+00 -1.624691e+00 -6.136884e-01 -6.165554e-01 -1.394270e+00
25% -9.037947e-01 -1.000054e+00 -6.136884e-01 -5.330420e-01 -9.200910e-01
50% -2.252728e-01 8.876767e-02 -6.136884e-01 -4.869108e-01 -7.243307e-02
75% 7.580921e-01 8.681329e-01 7.027037e-01 1.027202e-01 5.742176e-01
max 2.518315e+00 2.420084e+00 7.326169e+00 4.523892e+00 3.066304e+00
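The same normalization can also be done with sklearn's StandardScaler, which remembers the training statistics so they can later be applied to unseen data. A minimal sketch on a toy array (note that StandardScaler divides by the population standard deviation, while pandas' .std() uses the sample one, so the results differ slightly):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: two features on very different scales
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
Xs = scaler.fit_transform(X)  # fit on training data, reuse via scaler.transform

print(Xs.mean(axis=0))  # ~[0, 0]
print(Xs.std(axis=0))   # [1, 1]
```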

2) Applying logistic regression on the whole data (don't do it at home...)

We are now ready to instantiate the logistic regression from sklearn and to learn parameters $\Theta$ to optimally map input parameters to output class.


In [11]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

In [12]:
lr.fit(occupancyDataInput, occupancyDataOutput)


Out[12]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

We can see how this system performs on the whole data either by implementing the comparison ourselves or by using the model's built-in scoring function. We will see both give the same value (up to numerical resolution differences).


In [15]:
predictedOccupancy = lr.predict(occupancyDataInput)
comparison = np.logical_xor(occupancyDataOutput, predictedOccupancy)
(occupancyDataOutput.shape[0] - np.sum(comparison))/occupancyDataOutput.shape[0]


Out[15]:
0.9860002456097261

In [16]:
lr.score(occupancyDataInput, occupancyDataOutput)


Out[16]:
0.9860002456097261

Is this a good score? We check what percentage of the output data is 1 vs 0:


In [17]:
occupancyDataOutput.mean()


Out[17]:
0.21232960825248681

This means that by always answering "not occupied" we would get about 79% accuracy. Not bad to obtain approximately 20% absolute above that baseline.
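We can make this baseline explicit with sklearn's DummyClassifier. The data below is synthetic, generated only to mimic a ~21% positive rate like our dataset:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in data with roughly 21% positives
rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (rng.rand(1000) < 0.21).astype(int)

# A baseline that always predicts the majority class ("not occupied")
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
print(baseline.score(X, y))  # ~0.79: the accuracy of always answering "no"
```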

Now, which features are most important in the classification? We can see this by looking at the estimated values of the $\Theta$ parameters:


In [19]:
pd.DataFrame(list(zip(occupancyDataInput.columns, np.transpose(lr.coef_))))


Out[19]:
0 1
0 Temperature [-1.24840035829]
1 Humidity [0.170937181512]
2 Light [3.83643435106]
3 CO2 [1.88258341568]
4 HumidityRatio [-0.382370112499]

As expected, Light and CO2 are the most relevant variables, and Temperature follows. Note that we can compare these values only because we normalized the input features, else the individual $\theta$ variables would not be comparable.
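We can also rank the features explicitly by the magnitude of their weights. The sketch below rebuilds the coefficient table above as a pandas Series and sorts it by absolute value:

```python
import pandas as pd

# Coefficients as reported in the table above (rounded)
coefs = pd.Series(
    [-1.2484, 0.1709, 3.8364, 1.8826, -0.3824],
    index=['Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio'])

# Sort features by the magnitude of their weight, keeping the sign visible
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

With a fitted model, `coefs` could be built directly as `pd.Series(lr.coef_[0], index=occupancyDataInput.columns)`.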

3) Train-test sets

Applying any machine learning method to a dataset as a whole is always a bad idea, because we are evaluating predictions on the same data that was used for training. This carries a big danger of overfitting and can give us misleading information.

To solve this, let's do a proper train/test split of our data. We will train on one set and test on the other. If we ever need to set metaparameters after training the model, we usually define a third set (usually called the cross-validation or development set) which is independent from both training and test.

In this case we will split 70% to 30%. You will learn more about train/test sets in future units.
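As a sketch of how a third development set could be carved out, we can chain two calls to train_test_split (the arrays here are toy placeholders, and the 30%/20% proportions are just an example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 50 samples, 2 features
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# First carve out 30% for test, then 20% of the remainder for development
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=0)

print(len(X_train), len(X_dev), len(X_test))  # 28 7 15
```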


In [20]:
from sklearn.model_selection import train_test_split

occupancyDataInput_train, occupancyDataInput_test, occupancyDataOutput_train, occupancyDataOutput_test = train_test_split(occupancyDataInput, occupancyDataOutput, test_size=0.3, random_state=0)
lr2 = LogisticRegression()
lr2.fit(occupancyDataInput_train, occupancyDataOutput_train)


Out[20]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

We now need to predict class labels for the test set. We will also generate the class probabilities, just to take a look.


In [21]:
predicted = lr2.predict(occupancyDataInput_test)
print(predicted)
probs = lr2.predict_proba(occupancyDataInput_test)
print(probs)


[1 1 0 ..., 1 0 1]
[[  5.37686998e-02   9.46231300e-01]
 [  1.48170513e-01   8.51829487e-01]
 [  9.98808887e-01   1.19111304e-03]
 ..., 
 [  4.25577666e-02   9.57442233e-01]
 [  9.99838352e-01   1.61648431e-04]
 [  1.48112490e-01   8.51887510e-01]]

The model assigns "occupied" whenever the value in the second column (the probability of class 1) is > 0.5.
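We can reproduce this decision rule by hand, and also see what a stricter threshold would do (the probabilities below are made up for illustration):

```python
import numpy as np

# Hypothetical probabilities of the positive class for four samples
p_occupied = np.array([0.95, 0.45, 0.60, 0.10])

# Default decision rule: predict 1 when P(occupied) > 0.5
print((p_occupied > 0.5).astype(int))  # [1 0 1 0]

# A stricter threshold flips borderline cases to 0 (more precision, less recall)
print((p_occupied > 0.7).astype(int))  # [1 0 0 0]
```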

Let us now see some evaluation metrics:


In [24]:
# generate evaluation metrics
from sklearn import metrics

print("Accuracy: %f", metrics.accuracy_score(occupancyDataOutput_test, predicted))
print("AUC: %f", metrics.roc_auc_score(occupancyDataOutput_test, probs[:, 1]))
print("Classification confusion matrix:")
print(metrics.confusion_matrix(occupancyDataOutput_test, predicted))
print("Classification report:")
print(metrics.classification_report(occupancyDataOutput_test, predicted))


Accuracy: %f 0.984445354073
AUC: %f 0.993319289501
Classification confusion matrix:
[[1887   28]
 [  10  518]]
Classification report:
             precision    recall  f1-score   support

          0       0.99      0.99      0.99      1915
          1       0.95      0.98      0.96       528

avg / total       0.98      0.98      0.98      2443
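These numbers can be checked by hand from the confusion matrix above: with rows as the true class and columns as the predicted class, for class 1 precision is TP/(TP+FP) and recall is TP/(TP+FN):

```python
# Counts taken from the confusion matrix printed above
tn, fp, fn, tp = 1887, 28, 10, 518

precision = tp / (tp + fp)  # 518 / 546
recall    = tp / (tp + fn)  # 518 / 528
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.95 0.98 0.96
```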

4) Cross-validation datasets

Not to be confused with the subset we may use to set metaparameters: the cross-validation technique (also called the jackknife technique) is useful when we do not have much data overall and losing some of it for testing is not a good idea. We normally split the data into 10 parts and perform train/test on each 9/1 grouping.
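This folding is what cross_val_score does under the hood. The sketch below applies KFold directly to a tiny placeholder array, just to show the 9/1 groupings:

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: 10 samples, so each of the 10 folds holds out exactly one
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=10)
sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X)]
print(sizes[0])  # (9, 1): train on 9 parts, test on the remaining 1
```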


In [27]:
# evaluate the model using 10-fold cross-validation
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(), occupancyDataInput, occupancyDataOutput, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())


[ 0.9791411   0.92147239  0.9791411   0.99877301  0.99385749  1.          0.9987715
  1.          0.97542998  0.96555966]
0.981214623102

We see that the average result over all folds is very similar to the one obtained above. All good to go.

