https://github.com/georgetown-analytics/classroom-occupancy
The data consist of temperature, humidity, CO2 levels, light, number of Bluetooth devices, noise levels, and a count of people in the room.
In [19]:
import pandas as pd
%matplotlib inline
In [21]:
dataset = pd.read_csv('dataset.csv')
In [23]:
dataset.head(5)
Out[23]:
In [24]:
dataset.count_total.describe()
Out[24]:
In [25]:
# Add a new column with a binary class for room occupancy
countmed = dataset.count_total.median()  # median head count, kept for reference; the threshold below is hard-coded to 4
dataset['room_occupancy'] = dataset['count_total'].apply(lambda x: 'occupied' if x > 4 else 'empty')
In [26]:
# map room occupancy to a number
dataset['room_occupancy_num'] = dataset.room_occupancy.map({'empty':0, 'occupied':1})
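As a sketch of an alternative to the two cells above, the labeling and the numeric mapping can be done with one vectorized comparison. The small frame below is made-up illustrative data, not the classroom dataset, and the threshold of 4 simply mirrors the hard-coded value used above.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: only the count column matters here
toy = pd.DataFrame({'count_total': [0, 3, 4, 5, 12]})

THRESHOLD = 4  # same cut-off as the lambda above; more people than this counts as occupied

# Boolean comparison cast to 0/1 gives the numeric class directly
toy['room_occupancy_num'] = (toy['count_total'] > THRESHOLD).astype(int)

# Human-readable label derived from the numeric class
toy['room_occupancy'] = toy['room_occupancy_num'].map({0: 'empty', 1: 'occupied'})
```

Deriving the label from the numeric class (rather than the other way round) keeps the two columns from drifting apart if the threshold changes.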
In [27]:
dataset.head(5)
Out[27]:
In [28]:
dataset.room_occupancy.describe()
Out[28]:
In [29]:
import os
import sys
# Modify the path
sys.path.append("..")
import pandas as pd
import yellowbrick as yb
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (12, 8)
In [30]:
g = yb.anscombe()
Feature analysis visualizers are designed to visualize instances in data space in order to detect features or targets that might impact downstream fitting. Because ML operates on high-dimensional data sets (usually at least 35 dimensions), the visualizers focus on aggregation, optimization, and other techniques to give overviews of the data. It is our intent that the steering process will allow the data scientist to zoom, filter, and explore the relationships between instances and between dimensions.
At the moment we have three feature analysis visualizers implemented:
Rank2D: rank pairs of features to detect covariance
RadViz: plot data points along axes ordered around a circle to detect separability
Parallel Coordinates: plot instances as lines along vertical axes to detect clusters
Feature analysis visualizers implement the Transformer API from Scikit-Learn, meaning they can be used as intermediate transform steps in a Pipeline (particularly a VisualPipeline). They are instantiated in the same way as other transformers; fit and transform are then called on them, which draws the instances. Finally, show is called to display the image.
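To illustrate what implementing the Transformer API buys, here is a minimal sketch of a pass-through step inside a Scikit-Learn Pipeline. `PassThroughVisualizer` is a hypothetical stand-in written for this example, not a Yellowbrick class, and the data are random; a real visualizer would draw during fit or transform instead of only recording the shape it saw.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline


class PassThroughVisualizer(BaseEstimator, TransformerMixin):
    """Hypothetical visualizer-like step: observes the data, returns it unchanged."""

    def fit(self, X, y=None):
        # A real visualizer would draw here; this sketch only records the shape
        self.seen_shape_ = np.asarray(X).shape
        return self

    def transform(self, X):
        # Intermediate steps must hand the data on to the next step
        return X


# Made-up regression data: 20 instances, 3 features
rng = np.random.RandomState(0)
X_demo = rng.rand(20, 3)
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])

pipe = Pipeline([
    ('viz', PassThroughVisualizer()),  # intermediate transform step
    ('model', Ridge()),                # final estimator
])
pipe.fit(X_demo, y_demo)
predictions = pipe.predict(X_demo)
```

Because the step returns its input untouched, it can be dropped into or removed from the pipeline without changing what the final model sees.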
In [63]:
from yellowbrick.features.rankd import Rank2D
from yellowbrick.features.radviz import RadViz
from yellowbrick.features.pcoords import ParallelCoordinates
Rank1D and Rank2D evaluate single features or pairs of features using a variety of metrics that score the features on the scale [-1, 1] or [0, 1] allowing them to be ranked. A similar concept to SPLOMs, the scores are visualized on a lower-left triangle heatmap so that patterns between pairs of features can be easily discerned for downstream analysis.
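As a rough illustration of the idea, assuming the Pearson metric and random data in place of the sensor readings, the pairwise scores and the lower-left triangle can be computed with NumPy alone. This is a sketch of the concept, not Yellowbrick's implementation.

```python
import numpy as np

rng = np.random.RandomState(42)
X_demo = rng.rand(100, 4)                  # 100 instances, 4 made-up features
X_demo[:, 1] = 2.0 * X_demo[:, 0] + 1.0    # feature 1 is a linear function of feature 0

# Pearson correlation between every pair of columns; each score lies in [-1, 1]
scores = np.corrcoef(X_demo, rowvar=False)

# Keep only the strictly lower-left triangle; the rest is redundant by symmetry
mask = np.tril(np.ones_like(scores, dtype=bool), k=-1)
lower_left = np.where(mask, scores, np.nan)
```

The entry for the pair (feature 1, feature 0) is 1 up to rounding, which is the kind of strongly related pair the heatmap is meant to surface.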
In [32]:
# Load the classification data set
data = dataset
# Specify the features of interest
features = ['temperature','humidity','co2','light','noise','bluetooth_devices']
# Extract the numpy arrays from the data frame
X = data[features].to_numpy()  # as_matrix() was removed in pandas 1.0
y = data['count_total'].to_numpy()
In [33]:
# Instantiate the visualizer with the Covariance ranking algorithm
visualizer = Rank2D(features=features, algorithm='covariance')
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.show() # Draw/show the data
In [34]:
# Instantiate the visualizer with the Pearson ranking algorithm
visualizer = Rank2D(features=features, algorithm='pearson')
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.show() # Draw/show the data
RadViz is a multivariate data visualization algorithm that plots each feature dimension uniformly around the circumference of a circle, then plots points on the interior of the circle such that each point normalizes its values on the axes from the center to each arc. This mechanism allows as many dimensions as will easily fit on a circle, greatly expanding the dimensionality of the visualization. Data scientists use this method to detect separability between classes: is there an opportunity to learn from the feature set, or is there just too much noise?
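The placement rule can be sketched in a few lines of NumPy. The function below follows the commonly described RadViz formula (min-max scaling, evenly spaced anchors, weighted average of anchors); `radviz_points` is a name made up for this example, and Yellowbrick's own implementation may differ in details.

```python
import numpy as np


def radviz_points(X):
    """Project the rows of X into the unit circle using the usual RadViz formula."""
    X = np.asarray(X, dtype=float)
    # Min-max scale each feature to [0, 1] so every axis pulls with comparable strength
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                  # avoid dividing by zero for constant columns
    scaled = (X - X.min(axis=0)) / span
    # One anchor per feature, evenly spaced around the circumference
    angles = 2.0 * np.pi * np.arange(X.shape[1]) / X.shape[1]
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    # Each point is the average of the anchors, weighted by its scaled feature values
    weights = scaled.sum(axis=1, keepdims=True)
    weights[weights == 0] = 1.0            # an all-zero row stays at the center
    return scaled @ anchors / weights


# Made-up instances with four features
X_demo = np.array([
    [0.0, 0.0, 0.0, 0.0],   # nothing pulls: stays at the center
    [1.0, 0.0, 0.0, 0.0],   # only the first feature pulls: lands on its anchor
    [1.0, 1.0, 1.0, 1.0],   # all features pull equally: back at the center
])
points = radviz_points(X_demo)
```

Instances dominated by one feature drift toward that feature's anchor, which is why well-separated classes show up as distinct clouds inside the circle.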
In [35]:
# Specify the features of interest and the classes of the target
features = ['temperature','humidity','co2','light','noise','bluetooth_devices']
classes = ['empty', 'occupied']
# Extract the numpy arrays from the data frame
X = data[features].to_numpy()  # as_matrix() was removed in pandas 1.0
y = data.room_occupancy_num.to_numpy()
In [36]:
# Instantiate the visualizer
visualizer = RadViz(classes=classes, features=features)
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.show() # Draw/show the data
For regression, the RadViz visualizer should use a color sequence to display the target information, as opposed to discrete colors.
In [ ]:
# Specify the features of interest and the classes of the target
#features = ['temperature','humidity','co2','light','noise','bluetooth_devices']
#classes = ['empty', 'occupied']
# Extract the numpy arrays from the data frame
#X = data[features].to_numpy()
#y = data.room_occupancy_num.to_numpy()
In [ ]:
# Instantiate the visualizer
#visualizer = ParallelCoordinates(classes=classes, features=features)
#visualizer.fit(X, y) # Fit the data to the visualizer
#visualizer.transform(X) # Transform the data
#visualizer.show() # Draw/show the data
Regression models attempt to predict a target in a continuous space. Regressor score visualizers display the instances in model space to better understand how the model is making predictions. We currently have implemented two regressor evaluations:
Residuals Plot: plot the difference between the expected and actual values
Prediction Error: plot expected vs. the actual values in model space
Estimator score visualizers wrap Scikit-Learn estimators and expose the Estimator API such that they have fit(), predict(), and score() methods that call the appropriate estimator methods under the hood. Score visualizers can wrap an estimator and be passed in as the final step in a Pipeline or VisualPipeline.
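The wrapping pattern described above can be sketched without any library. Both classes below are hypothetical and written only for illustration: `MeanRegressor` is a toy estimator and `ScoreWrapper` stands in for a score visualizer, delegating every call to the estimator it wraps.

```python
class MeanRegressor:
    """Hypothetical toy estimator: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

    def score(self, X, y):
        # Negative mean absolute error, so that higher is better
        preds = self.predict(X)
        return -sum(abs(p - t) for p, t in zip(preds, y)) / len(y)


class ScoreWrapper:
    """Hypothetical wrapper exposing fit/predict/score by delegating to the estimator."""

    def __init__(self, estimator):
        self.estimator = estimator
        self.score_ = None

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        return self.estimator.predict(X)

    def score(self, X, y):
        # A real score visualizer would also draw its plot at this point
        self.score_ = self.estimator.score(X, y)
        return self.score_


wrapped = ScoreWrapper(MeanRegressor()).fit([[0], [1], [2]], [1.0, 2.0, 3.0])
test_score = wrapped.score([[3], [4]], [2.0, 4.0])
```

Because the wrapper has the same three methods as the estimator, code that expects an estimator, such as the last step of a pipeline, can use it unchanged.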
In [40]:
# Regression Evaluation Imports
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from yellowbrick.regressor import PredictionError, ResidualsPlot
In [46]:
# Load the data
df = data
feature_names = ['temperature','humidity','co2','light','noise','bluetooth_devices']
target_name = 'count_total'
# Get the X and y data from the DataFrame
X = df[feature_names].to_numpy()  # as_matrix() was removed in pandas 1.0
y = df[target_name].to_numpy()
# Create the train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
In [47]:
# Instantiate the linear model and visualizer
ridge = Ridge()
visualizer = ResidualsPlot(ridge)
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
g = visualizer.show() # Draw/show the data
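The quantity behind a residuals plot is simple to compute by hand. The sketch below uses made-up numbers and assumes the convention residual = observed minus predicted; a library may use the opposite sign, which flips the plot vertically but not its interpretation.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # made-up observed counts
y_pred = np.array([2.5, 5.5, 7.0, 8.0])   # made-up model predictions

# Convention assumed here: residual = observed - predicted
residuals = y_true - y_pred

# A well-specified model should leave residuals scattered around zero
mean_residual = residuals.mean()
```

A mean residual far from zero, or residuals that fan out as the predicted value grows, suggests the linear model is missing structure in the data.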
In [54]:
# Load the data
df = data
feature_names = ['temperature','humidity','co2','light','noise','bluetooth_devices']
target_name = 'count_total'
# Get the X and y data from the DataFrame
X = df[feature_names].to_numpy()  # as_matrix() was removed in pandas 1.0
y = df[target_name].to_numpy()
# Create the train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
In [55]:
# Instantiate the linear model and visualizer
lasso = Lasso()
visualizer = PredictionError(lasso)
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
g = visualizer.show() # Draw/show the data
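Scikit-Learn regressors return the coefficient of determination, R², from score(), so that is the number a wrapped regressor reports for the test split. A minimal sketch of the computation on made-up values:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # made-up observed values
y_pred = np.array([1.0, 2.0, 3.0, 2.0])   # made-up predictions; the last one is off by 2

ss_res = np.sum((y_true - y_pred) ** 2)          # variation the model fails to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot
```

Here a single bad prediction out of four pulls R² down to about 0.2, which shows how sensitive the score is to large errors on a small test set.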
Classification models attempt to predict a target in a discrete space, that is, to assign an instance one or more categories. Classification score visualizers display the differences between classes as well as a number of classifier-specific visual evaluations. We currently have implemented three classifier evaluations:
ClassificationReport: Presents the per-class precision, recall, and F1 scores of the classifier as a heatmap
ROCAUC: Presents the graph of receiver operating characteristics along with area under the curve
ClassBalance: Displays the difference between the class balances and support
In [56]:
# Classifier Evaluation Imports
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ClassificationReport, ROCAUC, ClassBalance
In [57]:
# Load the classification data set
data = dataset
# Specify the features of interest and the classes of the target
features = ['temperature','humidity','co2','light','noise','bluetooth_devices']
classes = ['empty', 'occupied']
# Extract the numpy arrays from the data frame
X = data[features].to_numpy()  # as_matrix() was removed in pandas 1.0
y = data.room_occupancy_num.to_numpy()
# Create the train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
In [58]:
# Instantiate the classification model and visualizer
bayes = GaussianNB()
visualizer = ClassificationReport(bayes, classes=classes)
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
g = visualizer.show() # Draw/show the data
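A classification report in the Scikit-Learn sense consists of per-class precision, recall, and F1. The sketch below computes those three numbers for one class from made-up labels; `precision_recall_f1` is a helper written for this example, not a library function.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # correctly flagged
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false alarms
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up targets and predictions: 0 = empty, 1 = occupied
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
```

Calling the helper once per class yields the rows of the report; the heatmap then colors each cell by its value.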
In [59]:
# Instantiate the classification model and visualizer
logistic = LogisticRegression()
visualizer = ROCAUC(logistic)
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
g = visualizer.show() # Draw/show the data
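The area under the ROC curve has a useful reading: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counting half. The sketch below computes it that way on made-up scores; `roc_auc` here is a helper defined for this example and is quadratic in the number of instances, so it suits illustration only.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]                 # made-up targets
scores = [0.1, 0.4, 0.35, 0.8]        # made-up predicted probabilities of class 1
auc = roc_auc(y_true, scores)
```

Three of the four positive/negative pairs are ordered correctly, so the result is 0.75; a value of 0.5 would mean the scores rank no better than chance.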
In [60]:
# Instantiate the classification model and visualizer
forest = RandomForestClassifier()
visualizer = ClassBalance(forest, classes=classes)
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
g = visualizer.show() # Draw/show the data
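Class support, the number of instances per class, can be checked directly before trusting any of the scores above. A minimal sketch on made-up labels:

```python
from collections import Counter

labels = [0, 0, 0, 1, 0, 1, 0, 0]   # made-up targets: 0 = empty, 1 = occupied

support = Counter(labels)            # instances per class
total = sum(support.values())
proportions = {cls: n / total for cls, n in support.items()}
```

With three quarters of the instances in one class, a model that always predicts "empty" would already reach 75% accuracy, which is why the balance is worth plotting alongside the scores.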
In [ ]: