CaaR package examples

Release 4.0 (Beta)


In [1]:
%cd '/home/nick/Documents/Research/thermo'


/home/nick/Documents/Research/thermo

In [2]:
import datetime as dt

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

import caar as cr

%matplotlib inline

In [3]:
import seaborn as sns; sns.set(color_codes=True)

Overview

This introduction to CaaR shows how delimited text files are converted into indexed time series.

The functionality applies to any text file with time-stamped data observations (scalars or vectors).

A plot of a multi-dimensional time series (operating status of an air conditioner and indoor temperature) is shown, based on the output of a CaaR time series function.

With CaaR, text files need to be read only once in order to be converted into Python variables. The pickle_from_file() function creates a pickle file (binary) from a text file, and CaaR can convert pickle files created this way into a pandas DataFrame at any time. This is significantly faster than reading the raw text files into pandas directly.
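As a rough illustration of the speed difference, the two load paths could be timed once a pickle file exists. This is a hypothetical sketch; the file names are taken from the examples later in this notebook:

import time

t0 = time.perf_counter()
df_from_text = pd.read_csv('cycles.csv')                  # parse the raw text file
t1 = time.perf_counter()
df_from_pickle = cr.create_cycles_df('TX_cycles.pickle')  # load the pre-pickled data
t2 = time.perf_counter()

print('CSV parse: {:.3f} s, pickle load: {:.3f} s'.format(t1 - t0, t2 - t1))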

CaaR offers functions for summarizing and analyzing the data, and can convert DataFrames into NumPy time series index arrays and data arrays.

To explore the capabilities in full, see the documentation at http://caar.readthedocs.org

Read original text files into key-value pairs (dicts)

The pickle_from_file() and dict_from_file() functions convert delimited text files (comma- or tab-delimited) into key-value pairs (Python dicts). If the data covers multiple states, the output may be restricted to a particular state or states (using a comma-separated set of state abbreviations).

See example .csv data files in the data folder at https://github.com/nickpowersys/caar

The use of the states parameter is optional, and depends on sensor and postal metadata files (examples are also in the data folder in the Github repository).

Raw data files

Each data file is handled independently.

This example shows each of the three types of data files CaaR can handle. If only one or two of these types of data are available, the others are not required.

The types are

  • cycle device operations
  • sensor data from a sensor or sensors
  • geospatial data identified by location instead of by sensor (such as weather from an external/API source)

The following are specific examples of the general categories described:


In [4]:
cycles = 'cycles.csv'         # Cycling device operations (in this example, air conditioners)
inside_temps = 'inside.csv'   # Sensor data from sensors (in this case, building thermostats)
outside_temps = 'outside.csv' # Temperatures associated with locations instead of sensors

Optional parameters


In [5]:
cycle_mode = 'Cool' # for selecting only particular operation modes

states = 'TX' # state or states, such as 'TX,CA[,ST]'

thermostats_file = 'thermostats.csv' # A metadata file is needed when 'states' parameter is used.
postal_file = 'us_postal_codes_clean.csv' # Needed when 'states' parameter is used.

# These files are both in the caar repository in the data folder to serve as examples.

This dict is used to avoid having to re-state optional arguments for each type of data.


In [6]:
kwargs = {'cycle':cycle_mode, 'states':states, 'sensors_file':thermostats_file, 'postal_file':postal_file}

Function detect_columns() - summarize file content automatically (optional)

The output summarizes file contents, without requiring the user to open the files themselves.

Required: before calling detect_columns(), pickle_from_file(), or dict_from_file(), specify the type of data with the auto keyword parameter. The examples below use the arguments 'cycles', 'sensors', 'geospatial', and 'outside'.


In [8]:
kwargs['auto'] = 'cycles'

In [9]:
cr.detect_columns(cycles, **kwargs)


Out[9]:
OrderedDict([('id',
              {'heading': 'ThermostatId', 'position': 0, 'type': 'ints'}),
             ('cycle',
              {'heading': 'CycleType', 'position': 1, 'type': 'alpha_only'}),
             ('start_time',
              {'heading': 'StartTime', 'position': 2, 'type': 'time'}),
             ('end_time',
              {'heading': 'EndTime', 'position': 3, 'type': 'time'}),
             ('Minutes',
              {'heading': 'Minutes', 'position': 4, 'type': 'floats'}),
             ('kwH', {'heading': 'kwH', 'position': 5, 'type': 'floats'}),
             ('BTUs', {'heading': 'BTUs', 'position': 6, 'type': 'floats'})])
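Because the return value is an OrderedDict, the detected columns can also be inspected programmatically, for example to collect the positions of columns to exclude. A minimal sketch, based on the structure shown above:

detected = cr.detect_columns(cycles, **kwargs)

# Positions of the columns not needed for the cycles index
ignore = [col['position'] for key, col in detected.items()
          if key in ('Minutes', 'kwH', 'BTUs')]
print(ignore)   # [4, 5, 6]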

Ignore columns (optional) when creating pandas DataFrames and NumPy arrays (0-based positions or headings)


In [10]:
kwargs['cols_to_ignore'] = [4, 5, 6]

In [11]:
cr.detect_columns(cycles, **kwargs)


Out[11]:
OrderedDict([('id',
              {'heading': 'ThermostatId', 'position': 0, 'type': 'ints'}),
             ('cycle',
              {'heading': 'CycleType', 'position': 1, 'type': 'alpha_only'}),
             ('start_time',
              {'heading': 'StartTime', 'position': 2, 'type': 'time'}),
             ('end_time',
              {'heading': 'EndTime', 'position': 3, 'type': 'time'})])

Function pickle_from_file() - create file once to allow quick conversion to pandas and NumPy


In [12]:
cycles_pickle = cr.pickle_from_file(cycles, **kwargs)

The return value is the name of the pickle file, as a string.


In [13]:
print(cycles_pickle)


TX_cycles.pickle

In [14]:
kwargs['auto'] = 'sensors'

In [15]:
kwargs['cols_to_ignore'] = None

In [16]:
cr.detect_columns(inside_temps, **kwargs)


Out[16]:
OrderedDict([('id',
              {'heading': 'ThermostatId', 'position': 0, 'type': 'ints'}),
             ('time', {'heading': 'LogDate', 'position': 1, 'type': 'time'}),
             ('Degrees',
              {'heading': 'Degrees', 'position': 2, 'type': 'ints'})])

In [17]:
inside_pickle = cr.pickle_from_file(inside_temps, **kwargs)

In [18]:
kwargs['auto'] = 'geospatial'

In [19]:
cr.detect_columns(outside_temps, **kwargs)


Out[19]:
OrderedDict([('id', {'heading': 'LocationId', 'position': 0, 'type': 'ints'}),
             ('time', {'heading': 'LogDate', 'position': 1, 'type': 'time'}),
             ('Degrees',
              {'heading': 'Degrees', 'position': 2, 'type': 'ints'})])

In [20]:
outside_pickle = cr.pickle_from_file(outside_temps, **kwargs)

Names of binary files from cr.pickle_from_file() are created automatically unless specified using the picklepath keyword argument.

For example: kwargs['picklepath'] = 'my_cycle_file.pickle' can be specified before executing.
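A sketch of this usage (the custom file name is the hypothetical one from the example above):

kwargs['picklepath'] = 'my_cycle_file.pickle'          # hypothetical custom name
cycles_pickle = cr.pickle_from_file(cycles, **kwargs)  # returns 'my_cycle_file.pickle'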


In [21]:
for f in [cycles_pickle, inside_pickle, outside_pickle]:
    print(f)


TX_cycles.pickle
TX_sensors.pickle
TX_geospatial.pickle

Function random_record() - selects a random data observation for inspection


In [22]:
random_inside = cr.random_record('TX_sensors.pickle', value_only=False)

The index for records containing sensor-type observations is based on an ID and a time stamp.


In [23]:
random_inside # The value of the data observation is 75.


Out[23]:
(Sensor(sensor_id=240, timestamp=datetime.datetime(2012, 9, 21, 18, 38, 8)),
 75)
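The record is a (key, value) pair, and the key is a namedtuple, so its fields can be read by name. A minimal sketch using the fields shown above:

key, degrees = random_inside
print(key.sensor_id, key.timestamp, degrees)   # 240 2012-09-21 18:38:08 75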

Function dict_from_file() - alternative to creating pickle file

An example of selecting a random data observation from a dict (instead of a pickle file holding a dict) follows.

Note the use of column headings to ignore some columns (above, integers indicating column positions were used).


In [24]:
kwargs['auto'] = 'cycles'
kwargs['cols_to_ignore'] = ['Minutes', 'kwH', 'BTUs']
cycles_dict = cr.dict_from_file(cycles, **kwargs)
random_cycles = cr.random_record(cycles_dict, value_only=False)

If the data contain two time stamps per record and the argument auto='cycles' is used, record keys are based on

  1. a device or sensor ID
  2. a cycle mode if one has been specified, otherwise this part of the index is left as 'None'
  3. the first time stamp in the records (from left to right): the start time

The value is the second time stamp, or ending time of the cycle.


In [25]:
random_cycles


Out[25]:
(Cycle(device_id=520, cycle_mode='Cool', start_time=datetime.datetime(2012, 7, 9, 3, 20, 30)),
 datetime.datetime(2012, 7, 9, 3, 23, 31))
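Since the key holds the start time and the value is the end time, a cycle's duration follows directly. A sketch using the fields shown above:

key, end_time = random_cycles
duration = end_time - key.start_time   # a datetime.timedelta
print(duration)                        # 0:03:01 for the record above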

In [26]:
kwargs['auto'] = 'outside'
kwargs['cols_to_ignore'] = None
random_outside = cr.random_record(outside_pickle, value_only=False)

In this example, the value of 55 is an outdoor temperature.


In [27]:
random_outside


Out[27]:
(Geospatial(location_id=520, timestamp=datetime.datetime(2013, 4, 4, 22, 30, 49)),
 55)

The only difference between the dict_from_file() and pickle_from_file() functions is that pickle_from_file() creates a pickle file, which allows faster read access in later sessions than re-reading the raw text files.

With both functions, the output can be an input to the DataFrame creation functions, which are described next.
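For example, the cycles_dict created above could be passed in place of a pickle file name. A sketch; create_cycles_df() itself is demonstrated in the next section:

cycles_df_from_dict = cr.create_cycles_df(cycles_dict)
cycles_df_from_dict.head()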

Reading dicts into pandas DataFrames

Using the pickle file or dict (like cycles_dict above) as input, there is one function for each kind of data (for cycles, indoor temperatures, or outdoor temperatures) that creates a pandas DataFrame.


In [28]:
cycles_pickle = 'TX_cycles.pickle'
inside_pickle = 'TX_sensors.pickle'
outside_pickle = 'TX_geospatial.pickle'

In [29]:
cycles_df = cr.create_cycles_df(cycles_pickle)

In [30]:
cycles_df.head()


Out[30]:
                                                       EndTime
ThermostatId CycleType StartTime
24           Cool      2011-07-03 20:55:20 2011-07-03 21:06:20
                       2011-07-05 20:53:45 2011-07-05 21:03:51
                       2011-07-05 21:36:45 2011-07-05 21:46:54
                       2011-07-05 22:25:46 2011-07-05 22:34:46
                       2011-07-05 23:29:47 2011-07-05 23:37:47

In [31]:
inside_df = cr.create_sensors_df(inside_pickle)

All column headings come directly from the original raw text files.


In [32]:
inside_df.head()


Out[32]:
                                  Degrees
ThermostatId LogDate
24           2011-07-03 20:39:02       84
             2011-07-03 20:55:20       85
             2011-07-03 21:00:20       84
             2011-07-05 20:54:45       85
             2011-07-05 21:00:45       84

In [33]:
outside_df = cr.create_geospatial_df(outside_pickle)

In [34]:
outside_df.head()


Out[34]:
                               Degrees
LocationId LogDate
12         2011-06-27 00:06:08      91
           2011-08-23 09:45:15      80
           2011-08-23 09:57:25      80
           2011-08-23 10:17:13      84
           2011-08-23 10:22:13      80

Summarizing DataFrames

Additional functions summarize the extent of data observations across any date range, and within days. These functions can analyze the data for groups of thermostats or filter the data by thermostat ID or location ID. This supports the creation of a data pipeline for further analysis.

The functions included in the example as well as all of the other functions in the public API are described in the documentation.

Function df_select_ids() can take either a scalar or a list of IDs as an argument

View individual thermostat data (cycles or inside temperatures) or outside temperatures:


In [35]:
cycles_df_240 = cr.df_select_ids(cycles_df, 240) # ID for thermostat is 240

In [36]:
cycles_df_240.head()


Out[36]:
                                                       EndTime
ThermostatId CycleType StartTime
240          Cool      2012-03-15 17:17:22 2012-03-15 17:34:22
                       2012-03-15 17:51:22 2012-03-15 18:06:22
                       2012-03-15 18:21:22 2012-03-15 18:27:37
                       2012-03-15 18:37:22 2012-03-15 18:54:22
                       2012-03-15 19:12:22 2012-03-15 19:26:22

Function df_select_datetime_range() selects a range based on either string-formatted datetimes or datetime.datetime values.

It can be applied whether or not df_select_ids() has been applied.
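For example, the same kind of range could be selected with datetime.datetime values instead of strings. A sketch mirroring the string-based call below:

start = dt.datetime(2012, 6, 1, 15, 0)
end = dt.datetime(2012, 6, 1, 20, 0)
cr.df_select_datetime_range(cycles_df_240, start, end)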


In [37]:
cr.df_select_datetime_range(cycles_df_240, '2012-06-01 15:00:00','2012-06-01 20:00:00')


Out[37]:
                                                       EndTime
ThermostatId CycleType StartTime
240          Cool      2012-06-01 15:46:27 2012-06-01 16:19:26
                       2012-06-01 17:01:42 2012-06-01 17:56:27
                       2012-06-01 18:34:27 2012-06-01 19:12:28

In [38]:
inside_df_240 = cr.df_select_ids(inside_df, 240)

In [39]:
inside_df_240 = cr.df_select_datetime_range(inside_df_240, '2012-07-11 14:00:00', '2012-07-11 16:00:00')

In [40]:
inside_df_240.head()


Out[40]:
                                  Degrees
ThermostatId LogDate
240          2012-07-11 14:00:28       77
             2012-07-11 14:04:26       76
             2012-07-11 14:15:23       75
             2012-07-11 14:28:19       76
             2012-07-11 14:44:14       77

Selecting ID and slicing on date ranges (pure pandas approach)


In [41]:
idx = pd.IndexSlice

Sensor data in a DataFrame can be selected based on a sensor ID or IDs (through slicing) and/or a date range.


In [42]:
inside_df.loc[idx[240,'2012-07-11 11:45:00':'2012-07-11 16:30:00'],:].head()


Out[42]:
                                  Degrees
ThermostatId LogDate
240          2012-07-11 11:55:47       77
             2012-07-11 12:02:45       78
             2012-07-11 12:16:41       77
             2012-07-11 12:23:39       76
             2012-07-11 12:34:35       75
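Multiple IDs can also be selected at once by slicing on the first index level. A sketch, assuming the thermostat IDs seen in earlier examples:

# All thermostats with IDs from 24 through 240, over the same date range
inside_df.loc[idx[24:240, '2012-07-11 11:45:00':'2012-07-11 16:30:00'], :].head()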

Note that because the cycles DataFrame has one more index column for the cycle mode, the slice needs an additional ':' between the sensor ID of 240 (in this case, the sensor is a thermostat) and the date range.


In [43]:
cycles_df.loc[idx[240,:,'2012-07-11 11:45:00':'2012-07-11 16:30:00'],:].head()


Out[43]:
                                                       EndTime
ThermostatId CycleType StartTime
240          Cool      2012-07-11 11:55:47 2012-07-11 12:28:37
                       2012-07-11 13:35:35 2012-07-11 14:08:25
                       2012-07-11 14:40:15 2012-07-11 15:21:02

Function location_id_of_sensor() - location of a thermostat, if available

See the example data file assigned to the variable thermostats_file in the GitHub repository.


In [44]:
location_id = cr.location_id_of_sensor(240, thermostats_file)

Outside temperatures at thermostat location


In [45]:
outside_temps_240 = cr.df_select_ids(outside_df, location_id)

In [46]:
outside_temps_240 = cr.df_select_datetime_range(outside_temps_240, '2012-07-11 14:30:00', '2012-07-11 16:30:00')

In [47]:
outside_temps_240.head()


Out[47]:
                               Degrees
LocationId LogDate
70         2012-07-11 14:52:11      84
           2012-07-11 15:12:05      88
           2012-07-11 16:11:47      75

Pure pandas approach


In [48]:
outside_df.loc[idx[location_id,'2012-07-11 14:30:00':'2012-07-11 16:30:00'],:].head()


Out[48]:
                               Degrees
LocationId LogDate
70         2012-07-11 14:52:11      84
           2012-07-11 15:12:05      88
           2012-07-11 16:11:47      75

Function days_of_data_by_id()

Number of days with data per ID (in cycles data, IDs are assigned to thermostats)


In [49]:
days_cycle_data_by_id = cr.days_of_data_by_id(cycles_df)

In [50]:
days_cycle_data_by_id[0:2]


Out[50]:
              Days
ThermostatId
24              18
31              11

Function df_select_ids()

In this example, records are selected for a list of IDs instead of a single ID.


In [51]:
cr.df_select_ids(days_cycle_data_by_id, [92,93])


Out[51]:
              Days
ThermostatId
92             215
93             234

Function daily_data_points_by_id()

Shows the count of daily observations in a DataFrame for a given ID


In [52]:
daily_data_240 = cr.daily_data_points_by_id(inside_df, id=240)

In [53]:
daily_data_240.head()


Out[53]:
                         Degrees
ThermostatId LogDate
240          2011-11-17       30
             2011-11-18       55
             2011-11-19       19
             2011-11-20       29
             2011-11-21        3

In [54]:
daily_data_240.columns = ['Temperature Readings (Count)']
daily_data_240.head()


Out[54]:
                          Temperature Readings (Count)
ThermostatId LogDate
240          2011-11-17                              30
             2011-11-18                              55
             2011-11-19                              19
             2011-11-20                              29
             2011-11-21                               3

Thermostat operations and temperature - plot as time series

cycling_and_obs_arrays(), plot_cycles_xy() and plot_sensor_geo_xy()

The output of cycling_and_obs_arrays() is used as input to the plot_cycles_xy() and plot_sensor_geo_xy() functions.


In [55]:
start = dt.datetime(2012, 7, 11, 11, 0)
end = dt.datetime(2012, 7, 11, 19, 0)
cycles_and_temps_240 = cr.cycling_and_obs_arrays(cycles_df=cycles_df, cycling_id=240, start=start, end=end,
                                                 sensors_df=inside_df, sensor_id=240, freq='1min')

In [56]:
cycles_x, cycles_y = cr.plot_cycles_xy(cycles_and_temps_240)
temps_x, temps_y = cr.plot_sensor_geo_xy(cycles_and_temps_240)

In [57]:
fig, ax1 = plt.subplots()

# Left axis: indoor temperature readings over time
ax1.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M:%S'))
ax1.set_ylim(60.0, 80.0)
ax1.set_xlabel('Time')
ax1.set_ylabel('Degrees (F)')
ax1.plot(temps_x, temps_y, marker='o', color='r')

# Right axis: ON/OFF cycling status, sharing the same time axis
ax2 = ax1.twinx()
ax2.set_ylim(0.0, 1.0)
ax2.set_ylabel('ON/OFF (1/0)')
ax2.set_yticks([1.0])
ax2.yaxis.grid(False)
ax2.plot(cycles_x, cycles_y)

fig.set_size_inches(8, 4.5)
plt.title('Thermostat Operation (ON/OFF) and Indoor Temperature')
plt.show()  # Note: timestamps in the underlying data are from polling processes with up to a minute of delay.