In this lecture we will learn about two things: Exploratory Data Analysis (EDA) and clustering. The first is a set of techniques used to better understand datasets and to postulate useful hypotheses that can later be tested more formally. The second is a set of techniques used to discover patterns and structure in a dataset in an unsupervised manner.
To do this, we will be using a dataset of the energy demand for several campus buildings, as well as the temperature around campus. This document is based on a similar assignment that used this dataset in 2014, and has been extended to reflect the specific tools and techniques used in Paper # for our 2016 course.
We begin by importing all of the necessary libraries and setting up the plotting environment:
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime as dt
import itertools
%matplotlib inline
Then, we import a csv file containing the campus electricity demand into a variable called data. This variable will be a Pandas DataFrame. We use the parse_dates argument so that the importer converts the timestamp strings into datetime objects:
In [2]:
data = pd.read_csv('data/campusDemand.csv', sep=',', header='infer', parse_dates=[1])
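If the timestamp strings were in a format that pandas could not infer, we could pass an explicit parser ourselves, e.g. as a lambda. A minimal sketch (the format string below is hypothetical, not taken from the actual file; note that newer pandas versions prefer the date_format argument instead):
In [ ]:
# Hypothetical variant: convert the timestamp strings with an explicit lambda parser
data = pd.read_csv('data/campusDemand.csv', sep=',', header='infer', parse_dates=[1],
                   date_parser=lambda s: pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S'))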
Let's look at the first few rows:
In [3]:
data.head()
Out[3]:
Now let's look at the last few rows:
In [4]:
data.tail()
Out[4]:
At this point, I can see that the last two columns are not very interesting/important, so let's delete them.
In [5]:
data = data.drop(data.columns[[3,4]],axis=1)
Let's now make sure the columns that remain are the right ones:
In [6]:
data.columns
Out[6]:
Let's show the first few samples in the "Point name" column:
In [7]:
data['Point name'].head()
Out[7]:
One of the very useful methods that come with the DataFrame object is describe(). It provides us with basic statistics for the numeric columns in the data frame:
In [8]:
data.describe()
Out[8]:
To find the number of unique point names, we can use the unique function from NumPy, or call the unique method on the 'Point name' column of data:
In [9]:
# Both of these are equivalent, but the second one is the Pandas way to do it.
#pointNames = np.unique(data['Point name'])
pointNames = data['Point name'].unique()
print('There are {} unique meters.'.format(pointNames.shape[0]))
We now print the contents of the pointNames array:
In [10]:
print(pointNames)
Now let's inspect data from one of the meters (e.g., the second one in the list above). Specifically, let's create a scatter plot of the Time column against the row number. Ideally, this should be a straight line; if there are discontinuities in the sampling process, they will appear as discontinuities in this line.
In [11]:
# There are many ways to do this, but here are two:
#plt.plot(data[(data['Point name'].isin([pointNames[1]]))]['Time'],'rd')
plt.plot(data[data['Point name'] == pointNames[1]]['Time'],'rd')
Out[11]:
There are many ways to count the number of samples recorded by each power meter. For instance, we can loop over all pointNames and build a list of tuples in the process (this is formally called a list comprehension). Each tuple will contain two elements: the meter name, and the number of samples recorded for it:
In [12]:
pd.DataFrame(
    [(pn, len(data[data['Point name'] == pn])) for pn in pointNames],
    columns=['Point name', 'Number of samples']
)
Out[12]:
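As an aside, pandas offers a built-in shortcut for this kind of tally: the value_counts method returns the same counts as a Series indexed by meter name:
In [ ]:
# Equivalent one-liner using the built-in value_counts method
data['Point name'].value_counts()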
Before going any further, let's confirm that the data types for our columns are what we expect (I expect the time column to be a datetime object, and the value to be a float).
In [13]:
data.dtypes
Out[13]:
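If the Time column had not been parsed into datetimes (e.g., if it showed up as a generic object dtype), we could convert it after the fact; a minimal sketch:
In [ ]:
# Only needed if the timestamps were not parsed during import
data['Time'] = pd.to_datetime(data['Time'])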
First, we can use another list comprehension to iterate over the point names and create a new list whose elements are, in turn, tuples containing the indices of the samples corresponding to each meter. Then we can iterate over them and print the time span covered by each meter by taking a simple difference (since the values in the 'Time' column are datetime objects, subtracting them generates a timedelta object that Python can interpret).
In [14]:
idx = [np.where(data['Point name']==meter) for meter in pointNames]
print('idx is now a {0} of {1} items.'.format(type(idx),len(idx)))
print('Each item in idx is of {0}.\n'.format(type(idx[0])))
for meterNo, meterRows in enumerate(idx):
    print("{} recorded for {}.".format(pointNames[meterNo],
                                       data['Time'].iloc[meterRows[0][-1]] - data['Time'].iloc[meterRows[0][0]]))
For this task, we are going to directly take the difference between any two consecutive datetime objects and display the result in terms of, say, number of seconds elapsed between these timestamps.
Before we do this, though, it is useful to plot the timestamps to figure out if there are discontinuities that we can visually see:
In [15]:
fig = plt.figure(figsize=(20,30)) # A 20 inch x 30 inch figure box
for i, meter in enumerate(pointNames):
    plt.subplot(4,2,i+1) # 4 rows and 2 columns of subplots
    plt.plot(data[data['Point name']==meter]['Time'])
    plt.title(meter)
    plt.xlabel('Row number')
    plt.ylabel('Timestamp')
As you can see, gaps are easily identifiable as discontinuities in the plotted lines. If no gaps existed, each plot would be a straight line.
But now let's get back to solving this using exact numbers...
First, you need to know that applying the difference operator (-) on two datetime objects results in a timedelta object. These objects (timedelta) describe time differences in terms of number of days, seconds and microseconds (see the link above for more details).
Because of this, we could quickly convert any timedelta object (say td) into a number of seconds by doing:
td.days*3600*24 + td.seconds + td.microseconds/1000000
In this case, however, our timestamps do not contain information about the microseconds, so we will skip that part of the conversion.
All that said, what we are dealing with are not timedelta objects per se, but rather timedelta64 objects (a NumPy type that pandas builds on), and these spare us much of the trouble: we no longer need to do the conversion this way, since much easier ways are available if you look through the documentation here.
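For instance, dividing a timedelta64 by a unit-sized timedelta64 yields a plain number, which is the approach we will use below (td here is just a toy value for illustration):
In [ ]:
# Converting a timedelta64 to a number by dividing by a unit-sized timedelta64
td = np.timedelta64(90, 's')        # a toy 90-second time difference
print(td / np.timedelta64(1, 's'))  # 90.0 (seconds)
print(td / np.timedelta64(1, 'm'))  # 1.5  (minutes)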
Using this knowledge, we can create a list of lists (a nested list) in a similar manner as we've done before (i.e. using list comprehensions), and in it store the timedeltas in seconds for each meter. In other words, the outer list is a list of the same length as pointNames, and each element is a list of timedeltas for the corresponding meter.
One more thing comes in handy for this task: the np.diff function, which takes an array (or a list) and returns the difference between any two consecutive items of the list.
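As a quick toy illustration of np.diff (unrelated to our dataset):
In [ ]:
# np.diff returns the differences between consecutive elements
print(np.diff([1, 4, 9, 16]))  # prints [3 5 7]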
Now, in a single line of code we can get the nested list we talked about:
In [16]:
delta_t = [[d/np.timedelta64(1, 's') for d in np.diff(data['Time'].iloc[i[0]])] for i in idx] # timedeltas in seconds
Now we need to be able to print out the exact times during which there are gaps. We will define gaps to be any timedelta that is longer than the median timedelta for a meter.
We will achieve this as follows:
In [17]:
np.set_printoptions(threshold=np.inf) # print full arrays
for i in range(len(delta_t)):
    resolution = np.median(delta_t[i]) # use the median timedelta as the resolution
    gap_idx = np.where(np.array(delta_t[i]) > resolution) # define a gap to be any delta_t > resolution
    # print out the gaps for each power meter
    print("-=-=-=-=-=-=-=-=-\n{}\nresolution: {}\ngaps:".format(pointNames[i], resolution))
    # This is pretty slow, so uncomment and run only if you want to see the output.
    #for the_idx in gap_idx[0]:
    #    print(data['Time'].iloc[idx[i][0]].iloc[the_idx], ' to ', data['Time'].iloc[idx[i][0]].iloc[the_idx+1])
np.set_printoptions(threshold=8)
First, we will define a new variable containing the weekday for each of the timestamps.
In [18]:
week_days = np.array([x.weekday() for x in data['Time']]) # 0=Monday ... 6=Sunday
Monday = data.loc[week_days == 0]
Tuesday = data.loc[week_days == 1]
Wednesday = data.loc[week_days == 2]
Thursday = data.loc[week_days == 3]
Friday = data.loc[week_days == 4]
Saturday = data.loc[week_days == 5]
Sunday = data.loc[week_days == 6]
All = [Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday]
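As an aside, a vectorized alternative for the weekday computation, assuming your pandas version provides the .dt accessor, would be:
In [ ]:
# Vectorized weekday extraction (0=Monday ... 6=Sunday) via the .dt accessor
week_days = data['Time'].dt.weekday.values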
In this task we basically use two for loops and the subplot functionality of pyplot to visualize the data contained in the variables we declared above.
The main trick is that we need to create a time index that only contains information about the hours, minutes and seconds (i.e. it completely disregards the exact day of the measurement) so that all of the measurements can be displayed within a single 24-hour period.
In [19]:
Days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
fig = plt.figure(figsize=(20,20))
for i in range(len(pointNames)): # iterate over meters
    for j in range(7): # iterate over days of the week
        plt.subplot(7,7,i*7+j+1)
        # Data from the day being plotted = All[j]
        # Data from the meter being plotted = All[j][All[j]['Point name']==pointNames[i]]
        time = All[j]['Time'].loc[All[j]['Point name']==pointNames[i]]
        # keep only the time of day (in fractional hours), discarding the date,
        # so that all measurements fall within a single 24-hour axis
        hours = time.dt.hour + time.dt.minute/60 + time.dt.second/3600
        # plot the power vs the hours in a day
        plt.plot(hours, All[j]['Value'].loc[All[j]['Point name']==pointNames[i]], '.')
        if i==6:
            plt.xlabel('hours in a day')
        if j==0:
            plt.ylabel(pointNames[i].split('-')[0]+'\n'+pointNames[i].split('-')[1])
        if i==0:
            plt.title(Days[j])
fig.tight_layout()
plt.show()
Several findings: (more to be added)
Finally, let's save the Pandas DataFrame to disk by pickling it (see the Python reference for pickle):
In [22]:
import pickle
pickle_file = open('data/campusDemand.pkl','wb')
pickle.dump([data,pointNames,All,idx],pickle_file)
pickle_file.close()
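To recover these objects in a later session, we would unpickle the same file; a minimal sketch:
In [ ]:
# Load the pickled objects back, in the same order they were dumped
with open('data/campusDemand.pkl','rb') as pickle_file:
    data, pointNames, All, idx = pickle.load(pickle_file)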