In this lecture we will learn about two things: Exploratory Data Analysis (EDA) and clustering. The first is a set of techniques used to better understand datasets and to postulate useful hypotheses that can later be tested more formally. The second is a set of techniques used to discover patterns and structure in a dataset in an unsupervised manner.
To do this, we will be using a dataset of the energy demand for several campus buildings, as well as the temperature around campus. This document is based on a similar assignment that used this dataset in 2014, and has been extended to reflect the specific tools and techniques used in Paper # for our 2016 course.
We begin by importing all of the necessary libraries and setting up the plotting environment:
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime as dt
import itertools
%matplotlib inline
Then, we import a csv file containing the campus electricity demand into a variable called data. This variable will be a Pandas DataFrame. We use the parse_dates argument so that the importer converts the timestamp strings into datetime objects:
In [2]:
data = pd.read_csv('data/campusDemand.csv', sep=',', header='infer', parse_dates=[1])
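If the timestamp strings were in a format that pandas could not infer, we could pass an explicit parser ourselves, e.g. as a lambda. A minimal sketch (the format string below is hypothetical, not taken from the actual file; note that newer pandas versions prefer the date_format argument instead):
In [ ]:
# Hypothetical variant: convert the timestamp strings with an explicit lambda parser
data = pd.read_csv('data/campusDemand.csv', sep=',', header='infer', parse_dates=[1],
                   date_parser=lambda s: pd.to_datetime(s, format='%Y-%m-%d %H:%M:%S'))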
Let's look at the first few rows:
In [3]:
data.head()
Out[3]:
Now let's look at the last few rows:
In [4]:
data.tail()
Out[4]:
At this point, I can see that the last two columns are not very interesting/important, so let's delete them.
In [5]:
data = data.drop(data.columns[[3,4]],axis=1)
Let's now make sure the columns that remain are the right ones:
In [6]:
data.columns
Out[6]:
Let's show the first few samples in the "Point name" column:
In [7]:
data['Point name'].head()
Out[7]:
One of the very useful methods that come with the DataFrame object is describe(). It provides us with basic statistics for the numeric columns in the data frame:
In [8]:
data.describe()
Out[8]:
To find the number of unique point names, we can use the unique function from NumPy, or call the unique method on the 'Point name' column of data:
In [9]:
# Both of these are equivalent, but the second one is the Pandas way to do it.
#pointNames = np.unique(data['Point name'])
pointNames = data['Point name'].unique()
print('There are {} unique meters.'.format(pointNames.shape[0]))
We now print the contents of the pointNames array:
In [10]:
print(pointNames)
Now let's inspect data from one of the meters (e.g., the second one in the list above). Specifically, let's create a scatter plot of the Time column against the row number. Ideally, this should be a straight line; if there are discontinuities in the sampling process, they will appear as discontinuities in this line.
In [11]:
# There are many ways to do this, but here are two:
#plt.plot(data[(data['Point name'].isin([pointNames[1]]))]['Time'],'rd')
plt.plot(data[data['Point name'] == pointNames[1]]['Time'],'rd')
Out[11]:
There are many ways to count the number of samples recorded by each power meter. For instance, we can loop over all pointNames and build a list of tuples in the process (this is formally called a list comprehension). Each tuple will contain two elements: the meter name, and the number of samples recorded for it:
In [12]:
pd.DataFrame(
    [(pn, len(data[data['Point name'] == pn])) for pn in pointNames],
    columns=['Point name', 'Number of samples']
)
Out[12]:
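As an aside, pandas offers a built-in shortcut for this kind of tally: the value_counts method returns the same counts as a Series indexed by meter name:
In [ ]:
# Equivalent one-liner using the built-in value_counts method
data['Point name'].value_counts()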
Before going any further, let's confirm that the data types for our columns are what we expect (I expect the time column to be a datetime object, and the value to be a float).
In [13]:
data.dtypes
Out[13]:
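If the Time column had not been parsed into datetimes (e.g., if it showed up as a generic object dtype), we could convert it after the fact; a minimal sketch:
In [ ]:
# Only needed if the timestamps were not parsed during import
data['Time'] = pd.to_datetime(data['Time'])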
First, we can use another list comprehension to iterate over the point names and create a new list whose elements are, in turn, tuples containing the indices of the samples corresponding to each meter. Then we can iterate over them and print the time span covered by each meter by taking a simple difference (since the values in the 'Time' column are datetime objects, subtracting them generates a timedelta object that Python can interpret).
In [14]:
idx = [np.where(data['Point name']==meter) for meter in pointNames]
print('idx is now a {0} of {1} items.'.format(type(idx),len(idx)))
print('Each item in idx is of {0}.\n'.format(type(idx[0])))
for meterNo, meterRows in enumerate(idx):
    print("{} recorded for {}.".format(pointNames[meterNo],
                                       data['Time'].iloc[meterRows[0][-1]] - data['Time'].iloc[meterRows[0][0]]))
For this task, we are going to directly take the difference between any two consecutive datetime objects and display the result in terms of, say, number of seconds elapsed between these timestamps.
Before we do this, though, it is useful to plot the timestamps to figure out if there are discontinuities that we can visually see:
In [15]:
fig = plt.figure(figsize=(20,30)) # A 20 inch x 30 inch figure box
for i, meter in enumerate(pointNames):
    plt.subplot(4,2,i+1) # 4 rows and 2 columns of subplots
    plt.plot(data[data['Point name']==meter]['Time'])
    plt.title(meter)
    plt.xlabel('Row number')
    plt.ylabel('Timestamp')
As you can see, gaps are easily identifiable as discontinuities in the plotted lines. If no gaps existed, each plot would be a straight line.
But now let's get back to solving this using exact numbers...
First, you need to know that applying the difference operator (-) on two datetime objects results in a timedelta object. These objects (timedelta) describe time differences in terms of number of days, seconds and microseconds (see the link above for more details).
Because of this, we could quickly convert any timedelta object (say td) into a number of seconds by doing:
td.days*3600*24 + td.seconds + td.microseconds/1000000
In this case, however, our timestamps do not contain information about the microseconds, so we will skip that part of the conversion.
All that said, what we are dealing with are not timedelta objects per se, but rather timedelta64 objects (a NumPy type that pandas builds on), and these spare us much of the trouble: we no longer need to do the conversion this way, since much easier ways are available if you look through the documentation here.
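For instance, dividing a timedelta64 by a unit-sized timedelta64 yields a plain number, which is the approach we will use below (td here is just a toy value for illustration):
In [ ]:
# Converting a timedelta64 to a number by dividing by a unit-sized timedelta64
td = np.timedelta64(90, 's')        # a toy 90-second time difference
print(td / np.timedelta64(1, 's'))  # 90.0 (seconds)
print(td / np.timedelta64(1, 'm'))  # 1.5  (minutes)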
Using this knowledge, we can create a list of lists (a nested list) in a similar manner as we've done before (i.e. using list comprehensions), and in it store the timedeltas in seconds for each meter. In other words, the outer list is a list of the same length as pointNames, and each element is a list of timedeltas for the corresponding meter.
One more thing comes in handy for this task: the np.diff function, which takes an array (or a list) and returns the difference between any two consecutive items of the list.
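As a quick toy illustration of np.diff (unrelated to our dataset):
In [ ]:
# np.diff returns the differences between consecutive elements
print(np.diff([1, 4, 9, 16]))  # prints [3 5 7]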
Now, in a single line of code we can get the nested list we talked about:
In [16]:
delta_t = [[d/np.timedelta64(1, 's') for d in np.diff(data['Time'].iloc[i[0]])] for i in idx] # timedeltas in seconds
Now we need to be able to print out the exact times during which there are gaps. We will define gaps to be any timedelta that is longer than the median timedelta for a meter.
We will achieve this as follows:
In [17]:
np.set_printoptions(threshold=np.inf) # print full arrays
for i in range(len(delta_t)):
    resolution = np.median(delta_t[i]) # use the median timedelta as the resolution
    gap_idx = np.where(np.array(delta_t[i]) > resolution) # define a gap to be any delta_t > resolution
    # print out the gaps for each power meter
    print("-=-=-=-=-=-=-=-=-\n{}\nresolution: {}\ngaps:".format(pointNames[i], resolution))
    # This is pretty slow, so uncomment and run only if you want to see the output.
    #for the_idx in gap_idx[0]:
    #    print(data['Time'].iloc[idx[i][0]].iloc[the_idx], ' to ', data['Time'].iloc[idx[i][0]].iloc[the_idx+1])
np.set_printoptions(threshold=8)
First, we will define a new variable containing the weekday for each of the timestamps.
In [18]:
week_days = np.array([x.weekday() for x in data['Time']]) # 0=Monday ... 6=Sunday
Monday = data.loc[week_days == 0]
Tuesday = data.loc[week_days == 1]
Wednesday = data.loc[week_days == 2]
Thursday = data.loc[week_days == 3]
Friday = data.loc[week_days == 4]
Saturday = data.loc[week_days == 5]
Sunday = data.loc[week_days == 6]
All = [Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday]
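As an aside, a vectorized alternative for the weekday computation, assuming your pandas version provides the .dt accessor, would be:
In [ ]:
# Vectorized weekday extraction (0=Monday ... 6=Sunday) via the .dt accessor
week_days = data['Time'].dt.weekday.values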
In this task we basically use two for loops and the subplot functionality of pyplot to visualize the data contained in the variables we declared above.
The main trick is that we need to create a time index that only contains information about the hours, minutes and seconds (i.e. it completely disregards the exact day of the measurement) so that all of the measurements can be displayed within a single 24-hour period.
In [19]:
Days = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
fig = plt.figure(figsize=(20,20))
for i in range(len(pointNames)): # iterate over meters
    for j in range(7): # iterate over days of the week
        plt.subplot(7,7,i*7+j+1)
        # Data from the day being plotted = All[j]
        # Data from the meter being plotted = All[j][All[j]['Point name']==pointNames[i]]
        time = All[j]['Time'].loc[All[j]['Point name']==pointNames[i]]
        # keep only the time of day (in fractional hours), discarding the date,
        # so that all measurements fall within a single 24-hour axis
        hours = time.dt.hour + time.dt.minute/60 + time.dt.second/3600
        # plot the power vs the hours in a day
        plt.plot(hours, All[j]['Value'].loc[All[j]['Point name']==pointNames[i]], '.')
        if i==6:
            plt.xlabel('hours in a day')
        if j==0:
            plt.ylabel(pointNames[i].split('-')[0]+'\n'+pointNames[i].split('-')[1])
        if i==0:
            plt.title(Days[j])
fig.tight_layout()
plt.show()
Several findings: (more to be added)
Finally, let's save the Pandas DataFrame to disk by pickling it (see the Python reference for pickle):
In [22]:
import pickle
pickle_file = open('data/campusDemand.pkl','wb')
pickle.dump([data,pointNames,All,idx],pickle_file)
pickle_file.close()
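To recover these objects in a later session, we would unpickle the same file; a minimal sketch:
In [ ]:
# Load the pickled objects back, in the same order they were dumped
with open('data/campusDemand.pkl','rb') as pickle_file:
    data, pointNames, All, idx = pickle.load(pickle_file)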