In this example we will explore a binary classification exercise using logistic regression to estimate whether a room is occupied, based on physical parameters measured by sensors. The implementation of logistic regression using the gradient descent algorithm shares many similarities with that of linear regression explained in the last unit. In this unit we will rely on the implementation offered by sklearn.
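To recall that connection, here is a minimal sketch of logistic regression trained with batch gradient descent on a tiny synthetic problem (the data, learning rate, and iteration count are illustrative, not taken from the occupancy dataset):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# tiny synthetic problem: one feature, labels follow the sign of the feature
rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=100)]  # bias column + feature
y = (X[:, 1] > 0).astype(float)

theta = np.zeros(2)
alpha = 0.5  # learning rate (illustrative)
for _ in range(200):
    h = sigmoid(X @ theta)          # hypothesis, now passed through the sigmoid
    grad = X.T @ (h - y) / len(y)   # same gradient form as in linear regression
    theta -= alpha * grad

accuracy = np.mean((sigmoid(X @ theta) > 0.5) == y)
```

The only change from the linear-regression loop is the sigmoid applied to the hypothesis; the gradient expression keeps the same shape.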
For this example we will use the Occupancy Detection Dataset obtained here: https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+
The dataset is described here: Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Luis M. Candanedo, Véronique Feldheim. Energy and Buildings. Volume 112, 15 January 2016, Pages 28-39
In [1]:
%matplotlib inline
import pandas as pd #used for reading/writing data
import numpy as np #numeric library
from matplotlib import pyplot as plt #used for plotting
import sklearn #machine learning library
occupancyData = pd.read_csv('data/occupancy_data/datatraining.txt')
We can visualize its contents:
We first look at the first 10 records. Then we compute some general statistics over all records, and finally we look at the mean and standard deviation for the two classes we want to classify into (occupied and not occupied).
In [5]:
occupancyData.head(10)
Out[5]:
In [6]:
occupancyData.describe()
Out[6]:
In [7]:
occupancyData.groupby('Occupancy').mean()
Out[7]:
In [8]:
occupancyData.groupby('Occupancy').std()
Out[8]:
A priori we can see that there is a big difference in Light and CO2 between the occupied and non-occupied states. We will see whether these parameters play an important role in the classification.
To continue, we split the data into the input and output parameters
In [9]:
occupancyDataInput = occupancyData.drop(['Occupancy', 'date'], axis=1)
occupancyDataOutput = occupancyData['Occupancy']
As we saw in the last unit, in order to improve convergence speed and accuracy we usually normalize the input parameters to zero mean and unit variance.
In [10]:
occupancyDataInput = (occupancyDataInput - occupancyDataInput.mean())/ occupancyDataInput.std()
occupancyDataInput.describe()
Out[10]:
In [11]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
In [12]:
lr.fit(occupancyDataInput, occupancyDataOutput)
Out[12]:
We can see how this system performs on the whole dataset either by implementing the comparison ourselves or by using the built-in scoring function. We will see both give the same value (up to numerical resolution differences).
In [15]:
predictedOccupancy = lr.predict(occupancyDataInput)
comparison = np.logical_xor(occupancyDataOutput, predictedOccupancy)
(occupancyDataOutput.shape[0] - np.sum(comparison))/occupancyDataOutput.shape[0]
Out[15]:
In [16]:
lr.score(occupancyDataInput, occupancyDataOutput)
Out[16]:
Is this a good score? We can check the percentage of 1/0 labels in the output data:
In [17]:
occupancyDataOutput.mean()
Out[17]:
This means that by always predicting the majority class we would get about 79% accuracy. Obtaining roughly 20% absolute above that baseline is not bad.
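That majority-class baseline can be computed explicitly; a sketch with synthetic stand-in labels (sklearn's `DummyClassifier` with `strategy='most_frequent'` implements exactly this rule):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# synthetic stand-in labels: the majority class covers 79% of the records
y = np.array([0] * 79 + [1] * 21)
X = np.zeros((100, 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
print(baseline.score(X, y))  # → 0.79, the majority-class rate
```

Any real classifier should be judged against this number rather than against 50%.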
Now, which features are most important in the classification? We can see this by looking at the estimated values of the $\Theta$ parameters:
In [19]:
pd.DataFrame(list(zip(occupancyDataInput.columns, np.transpose(lr.coef_))))
Out[19]:
As expected, Light and CO2 are the most relevant variables, and Temperature follows. Note that we can compare these values only because we normalized the input features, else the individual $\theta$ variables would not be comparable.
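Ranking features by coefficient magnitude can be done directly with pandas; a sketch using the dataset's column names but illustrative coefficient values (not the fitted ones):

```python
import numpy as np
import pandas as pd

# illustrative coefficients for normalized features (not the fitted values)
columns = ['Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio']
coefs = np.array([0.9, -0.1, 2.5, 1.4, 0.05])

# with normalized inputs, |theta| magnitudes are directly comparable
ranking = pd.Series(coefs, index=columns).abs().sort_values(ascending=False)
print(ranking.index.tolist())  # most to least influential
```

The absolute value matters here: a large negative coefficient is just as influential as a large positive one.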
Applying any machine learning to datasets as a whole is always a bad idea as we are looking into predicted results over data that has been used for the training. This has a big danger of overfitting and giving us the wrong information.
To solve this, let's do a proper train/test set split on our data. We will train on one set and test on the other. If we ever need to set metaparameters after training the model we will usually define a third set (usually called cross validation or development) which is independent from training and test.
In this case we will split 70% to 30%. You will learn more about train/test sets in future units.
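If a separate development set were also needed, the same helper can simply be applied twice; a sketch on synthetic arrays (the 70/15/15 proportions are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# first peel off 30% for dev+test, then split that part half-and-half
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.3, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_dev), len(X_test))  # → 70 15 15
```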
In [20]:
from sklearn.model_selection import train_test_split
occupancyDataInput_train, occupancyDataInput_test, occupancyDataOutput_train, occupancyDataOutput_test = train_test_split(occupancyDataInput, occupancyDataOutput, test_size=0.3, random_state=0)
lr2 = LogisticRegression()
lr2.fit(occupancyDataInput_train, occupancyDataOutput_train)
Out[20]:
We now need to predict class labels for the test set. We will also generate the class probabilities, just to take a look.
In [21]:
predicted = lr2.predict(occupancyDataInput_test)
print(predicted)
probs = lr2.predict_proba(occupancyDataInput_test)
print(probs)
The model assigns class 1 (occupied) whenever the value in the second column (the probability of class 1) is > 0.5.
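That thresholding rule can be verified on any fitted model; a self-contained sketch on synthetic data (the variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# predict() is equivalent to thresholding the class-1 probability at 0.5
manual = (model.predict_proba(X)[:, 1] > 0.5).astype(int)
print(np.array_equal(manual, model.predict(X)))  # → True
```

Keeping the probabilities around (rather than only the hard labels) is what allows threshold-free metrics such as AUC in the next cell.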
Let us now see some evaluation metrics:
In [24]:
# generate evaluation metrics
from sklearn import metrics
print("Accuracy: %f" % metrics.accuracy_score(occupancyDataOutput_test, predicted))
print("AUC: %f" % metrics.roc_auc_score(occupancyDataOutput_test, probs[:, 1]))
print("Classification confusion matrix:")
print(metrics.confusion_matrix(occupancyDataOutput_test, predicted))
print("Classification report:")
print(metrics.classification_report(occupancyDataOutput_test, predicted))
Not to confuse these with the subset we can use to set some metaparameters, we can use the cross-validation technique (sometimes called jackknifing) when we do not have much data overall and losing some of it for testing would be a problem. We normally split the data into 10 parts and perform train/test on each 9/1 grouping.
In [27]:
# evaluate the model using 10-fold cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(LogisticRegression(), occupancyDataInput, occupancyDataOutput, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())
We see that the average result over all folds matches the score obtained above. All good to go.
In [ ]: