Classification example

In this example we will explore binary classification using logistic regression: estimating whether a room is occupied or not, based on physical parameters measured from it using sensors. The implementation of logistic regression with the gradient descent algorithm shares many similarities with that of linear regression, explained in the last unit. In this unit we will rely on the implementation offered by sklearn.

1) Reading and inspecting the data

For this example we will use the Occupancy Detection Dataset obtained here: https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+

The dataset is described here: Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Luis M. Candanedo, Véronique Feldheim. Energy and Buildings. Volume 112, 15 January 2016, Pages 28-39


In [1]:
%matplotlib inline

import pandas as pd #used for reading/writing data 
import numpy as np #numerical library
from matplotlib import pyplot as plt #used for plotting
import sklearn #machine learning library

occupancyData = pd.read_csv('data/occupancy_data/datatraining.txt')

We can visualize its contents:

We first look at the first 10 records. Then we compute some general statistics for all records, and finally we look at the mean and std for the two classes we want to classify into (occupied and not occupied).


In [5]:
occupancyData.head(10)


Out[5]:
date Temperature Humidity Light CO2 HumidityRatio Occupancy
1 2015-02-04 17:51:00 23.180 27.2720 426.0 721.250000 0.004793 1
2 2015-02-04 17:51:59 23.150 27.2675 429.5 714.000000 0.004783 1
3 2015-02-04 17:53:00 23.150 27.2450 426.0 713.500000 0.004779 1
4 2015-02-04 17:54:00 23.150 27.2000 426.0 708.250000 0.004772 1
5 2015-02-04 17:55:00 23.100 27.2000 426.0 704.500000 0.004757 1
6 2015-02-04 17:55:59 23.100 27.2000 419.0 701.000000 0.004757 1
7 2015-02-04 17:57:00 23.100 27.2000 419.0 701.666667 0.004757 1
8 2015-02-04 17:57:59 23.100 27.2000 419.0 699.000000 0.004757 1
9 2015-02-04 17:58:59 23.100 27.2000 419.0 689.333333 0.004757 1
10 2015-02-04 18:00:00 23.075 27.1750 419.0 688.000000 0.004745 1

In [6]:
occupancyData.describe()


Out[6]:
Temperature Humidity Light CO2 HumidityRatio Occupancy
count 8143.000000 8143.000000 8143.000000 8143.000000 8143.000000 8143.000000
mean 20.619084 25.731507 119.519375 606.546243 0.003863 0.212330
std 1.016916 5.531211 194.755805 314.320877 0.000852 0.408982
min 19.000000 16.745000 0.000000 412.750000 0.002674 0.000000
25% 19.700000 20.200000 0.000000 439.000000 0.003078 0.000000
50% 20.390000 26.222500 0.000000 453.500000 0.003801 0.000000
75% 21.390000 30.533333 256.375000 638.833333 0.004352 0.000000
max 23.180000 39.117500 1546.333333 2028.500000 0.006476 1.000000

In [7]:
occupancyData.groupby('Occupancy').mean()


Out[7]:
Temperature Humidity Light CO2 HumidityRatio
Occupancy
0 20.334931 25.349685 27.776442 490.320312 0.003730
1 21.673192 27.147938 459.854347 1037.704786 0.004355

In [8]:
occupancyData.groupby('Occupancy').std()


Out[8]:
Temperature Humidity Light CO2 HumidityRatio
Occupancy
0 0.909973 5.294887 89.598692 152.919609 0.000753
1 0.622891 6.128497 42.286862 377.603278 0.001006

A priori we can see a big difference in Light and CO2 between the occupied and non-occupied states. We will see whether these parameters play an important role in the classification.
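Before fitting anything, we can quantify this separation per feature. The helper below is just a sketch (class_separation is our own hypothetical function, and the miniature frame is synthetic, only there to illustrate the idea): it divides the absolute difference of per-class means by the overall standard deviation, so larger values suggest a more discriminative feature.

```python
import pandas as pd

def class_separation(df, label_col):
    """Absolute difference of per-class means divided by the overall std,
    for each numeric feature: a rough measure of how well each feature
    separates the two classes."""
    grouped = df.groupby(label_col).mean(numeric_only=True)
    diff = (grouped.loc[1] - grouped.loc[0]).abs()
    return diff / df.drop(columns=label_col).std(numeric_only=True)

# Hypothetical miniature frame with the same column layout as the dataset
toy = pd.DataFrame({
    'Light': [0, 5, 10, 400, 420, 440],
    'CO2':   [440, 450, 460, 900, 1000, 1100],
    'Occupancy': [0, 0, 0, 1, 1, 1],
})
print(class_separation(toy, 'Occupancy').sort_values(ascending=False))
```

Applied to occupancyData, the same call would rank Light and CO2 at the top, matching what the groupby tables suggest.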

To continue, we split the data into input and output parameters:


In [9]:
occupancyDataInput = occupancyData.drop(['Occupancy', 'date'], axis=1)
occupancyDataOutput = occupancyData['Occupancy']

As we saw in the last unit, in order to improve convergence speed and accuracy we usually normalize the input parameters to zero mean and unit variance.


In [10]:
occupancyDataInput = (occupancyDataInput - occupancyDataInput.mean())/ occupancyDataInput.std()
occupancyDataInput.describe()


Out[10]:
Temperature Humidity Light CO2 HumidityRatio
count 8.143000e+03 8.143000e+03 8.143000e+03 8.143000e+03 8.143000e+03
mean -7.083962e-13 -1.177775e-13 -8.376778e-16 1.961562e-15 -3.568507e-14
std 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
min -1.592150e+00 -1.624691e+00 -6.136884e-01 -6.165554e-01 -1.394270e+00
25% -9.037947e-01 -1.000054e+00 -6.136884e-01 -5.330420e-01 -9.200910e-01
50% -2.252728e-01 8.876767e-02 -6.136884e-01 -4.869108e-01 -7.243307e-02
75% 7.580921e-01 8.681329e-01 7.027037e-01 1.027202e-01 5.742176e-01
max 2.518315e+00 2.420084e+00 7.326169e+00 4.523892e+00 3.066304e+00
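The same normalization can also be done with sklearn's StandardScaler, which remembers the training statistics so they can later be applied to unseen data. A minimal sketch on a toy array (note that StandardScaler divides by the population standard deviation, while pandas' .std() uses the sample one, so the results differ slightly):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: two features on very different scales
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
Xs = scaler.fit_transform(X)  # fit on training data, reuse via scaler.transform

print(Xs.mean(axis=0))  # ~[0, 0]
print(Xs.std(axis=0))   # [1, 1]
```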

2) Applying logistic regression on the whole data (don't do it at home...)

We are now ready to instantiate the logistic regression from sklearn and to learn parameters $\Theta$ to optimally map input parameters to output class.


In [11]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

In [12]:
lr.fit(occupancyDataInput, occupancyDataOutput)


Out[12]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

We can see how this system performs on the whole data either by implementing the comparison ourselves or by using the model's built-in scoring function. We will see both give the same value (up to numerical resolution differences).


In [15]:
predictedOccupancy = lr.predict(occupancyDataInput)
comparison = np.logical_xor(occupancyDataOutput, predictedOccupancy)
(occupancyDataOutput.shape[0] - np.sum(comparison))/occupancyDataOutput.shape[0]


Out[15]:
0.9860002456097261

In [16]:
lr.score(occupancyDataInput, occupancyDataOutput)


Out[16]:
0.9860002456097261

Is this a good score? We check what percentage of the output data is 1 vs 0:


In [17]:
occupancyDataOutput.mean()


Out[17]:
0.21232960825248681

This means that by always answering "not occupied" we would get about 79% accuracy. Not bad to obtain approximately 20% absolute above that baseline.
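We can make this baseline explicit with sklearn's DummyClassifier. The data below is synthetic, generated only to mimic a ~21% positive rate like our dataset:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in data with roughly 21% positives
rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (rng.rand(1000) < 0.21).astype(int)

# A baseline that always predicts the majority class ("not occupied")
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
print(baseline.score(X, y))  # ~0.79: the accuracy of always answering "no"
```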

Now, which features are most important in the classification? We can see this by looking at the estimated values of the $\Theta$ parameters:


In [19]:
pd.DataFrame(list(zip(occupancyDataInput.columns, np.transpose(lr.coef_))))


Out[19]:
0 1
0 Temperature [-1.24840035829]
1 Humidity [0.170937181512]
2 Light [3.83643435106]
3 CO2 [1.88258341568]
4 HumidityRatio [-0.382370112499]

As expected, Light and CO2 are the most relevant variables, and Temperature follows. Note that we can compare these values only because we normalized the input features, else the individual $\theta$ variables would not be comparable.
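We can also rank the features explicitly by the magnitude of their weights. The sketch below rebuilds the coefficient table above as a pandas Series and sorts it by absolute value:

```python
import pandas as pd

# Coefficients as reported in the table above (rounded)
coefs = pd.Series(
    [-1.2484, 0.1709, 3.8364, 1.8826, -0.3824],
    index=['Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio'])

# Sort features by the magnitude of their weight, keeping the sign visible
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

With a fitted model, `coefs` could be built directly as `pd.Series(lr.coef_[0], index=occupancyDataInput.columns)`.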

3) Train-test sets

Applying any machine learning method to a dataset as a whole is always a bad idea, because we are evaluating predictions on the same data that was used for training. This carries a big danger of overfitting and can give us misleading information.

To solve this, let's do a proper train/test split of our data. We will train on one set and test on the other. If we ever need to set metaparameters after training the model, we usually define a third set (usually called the cross-validation or development set) which is independent from both training and test.

In this case we will split 70% to 30%. You will learn more about train/test sets in future units.
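As a sketch of how a third development set could be carved out, we can chain two calls to train_test_split (the arrays here are toy placeholders, and the 30%/20% proportions are just an example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 50 samples, 2 features
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# First carve out 30% for test, then 20% of the remainder for development
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=0)

print(len(X_train), len(X_dev), len(X_test))  # 28 7 15
```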


In [20]:
from sklearn.model_selection import train_test_split

occupancyDataInput_train, occupancyDataInput_test, occupancyDataOutput_train, occupancyDataOutput_test = train_test_split(occupancyDataInput, occupancyDataOutput, test_size=0.3, random_state=0)
lr2 = LogisticRegression()
lr2.fit(occupancyDataInput_train, occupancyDataOutput_train)


Out[20]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

We now need to predict class labels for the test set. We will also generate the class probabilities, just to take a look.


In [21]:
predicted = lr2.predict(occupancyDataInput_test)
print(predicted)
probs = lr2.predict_proba(occupancyDataInput_test)
print(probs)


[1 1 0 ..., 1 0 1]
[[  5.37686998e-02   9.46231300e-01]
 [  1.48170513e-01   8.51829487e-01]
 [  9.98808887e-01   1.19111304e-03]
 ..., 
 [  4.25577666e-02   9.57442233e-01]
 [  9.99838352e-01   1.61648431e-04]
 [  1.48112490e-01   8.51887510e-01]]

The model assigns "occupied" whenever the value in the second column (the probability of class 1) is > 0.5.
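We can reproduce this decision rule by hand, and also see what a stricter threshold would do (the probabilities below are made up for illustration):

```python
import numpy as np

# Hypothetical probabilities of the positive class for four samples
p_occupied = np.array([0.95, 0.45, 0.60, 0.10])

# Default decision rule: predict 1 when P(occupied) > 0.5
print((p_occupied > 0.5).astype(int))  # [1 0 1 0]

# A stricter threshold flips borderline cases to 0 (more precision, less recall)
print((p_occupied > 0.7).astype(int))  # [1 0 0 0]
```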

Let us now see some evaluation metrics:


In [24]:
# generate evaluation metrics
from sklearn import metrics

print("Accuracy: %f", metrics.accuracy_score(occupancyDataOutput_test, predicted))
print("AUC: %f", metrics.roc_auc_score(occupancyDataOutput_test, probs[:, 1]))
print("Classification confusion matrix:")
print(metrics.confusion_matrix(occupancyDataOutput_test, predicted))
print("Classification report:")
print(metrics.classification_report(occupancyDataOutput_test, predicted))


Accuracy: %f 0.984445354073
AUC: %f 0.993319289501
Classification confusion matrix:
[[1887   28]
 [  10  518]]
Classification report:
             precision    recall  f1-score   support

          0       0.99      0.99      0.99      1915
          1       0.95      0.98      0.96       528

avg / total       0.98      0.98      0.98      2443
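These numbers can be checked by hand from the confusion matrix above: with rows as the true class and columns as the predicted class, for class 1 precision is TP/(TP+FP) and recall is TP/(TP+FN):

```python
# Counts taken from the confusion matrix printed above
tn, fp, fn, tp = 1887, 28, 10, 518

precision = tp / (tp + fp)  # 518 / 546
recall    = tp / (tp + fn)  # 518 / 528
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.95 0.98 0.96
```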

4) Cross-validation datasets

Not to be confused with the subset we may use to set metaparameters: the cross-validation technique (also called the jackknife technique) is useful when we do not have much data overall and losing some of it for testing is not a good idea. We normally split the data into 10 parts and perform train/test on each 9/1 grouping.
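This folding is what cross_val_score does under the hood. The sketch below applies KFold directly to a tiny placeholder array, just to show the 9/1 groupings:

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: 10 samples, so each of the 10 folds holds out exactly one
X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=10)
sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X)]
print(sizes[0])  # (9, 1): train on 9 parts, test on the remaining 1
```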


In [27]:
# evaluate the model using 10-fold cross-validation
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(), occupancyDataInput, occupancyDataOutput, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())


[ 0.9791411   0.92147239  0.9791411   0.99877301  0.99385749  1.          0.9987715
  1.          0.97542998  0.96555966]
0.981214623102

We see that the average result over all folds is very similar to the one obtained above. All good to go.

