In [1]:
%cd '/home/nick/Documents/Research/thermo'
In [2]:
import datetime as dt
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import caar as cr
%matplotlib inline
In [3]:
import seaborn as sns; sns.set(color_codes=True)
This introduction to CaaR shows how delimited text files are converted into indexed time series.
The functionality applies to any text file with time-stamped data observations (scalars or vectors).
A plot of a multi-dimensional time series (operating status for an air conditioner and indoor temperature) is shown, based on the output of a CaaR time series function.
With CaaR, text files only need to be read once in order to be converted into Python variables. The pickle_from_file() function creates a binary pickle file from a text file, and CaaR can convert pickle files created this way into a pandas DataFrame at any time. This is significantly faster than reading raw text files into pandas directly.
CaaR offers functions for summarizing and analyzing the data, and can convert DataFrames into NumPy time series index arrays and data arrays.
To explore the capabilities in full, see the documentation at http://caar.readthedocs.org
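As a minimal sketch of the overall workflow (assuming the example cycles.csv from the repository's data folder is in the working directory, and omitting the optional arguments introduced below):

import caar as cr

# One-time conversion: parse the raw text file and write a binary pickle
pickle_path = cr.pickle_from_file('cycles.csv', auto='cycles')

# Any later session: build a DataFrame from the pickle, avoiding a
# re-parse of the raw text file
cycles_df = cr.create_cycles_df(pickle_path)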
The pickle_from_file() and dict_from_file() functions convert delimited text files (comma- or tab-delimited) into key-value pairs (Python dicts). If the data covers multiple states, the output may be restricted to a particular state or states (using a comma-separated set of state abbreviations).
See example .csv data files in the data folder at https://github.com/nickpowersys/caar
The use of the states parameter is optional, and depends on sensor and postal metadata files (examples are also in the data folder in the Github repository).
Each data file is handled independently.
This example shows each of the three types of data files CaaR can handle. If only one or two of these types of data are available, the others are not required.
The types are 1) cycling device operations, 2) sensor data, and 3) data associated with locations rather than sensors.
The following are specific examples of these general categories:
In [4]:
cycles = 'cycles.csv' # Cycling device operations (in this example, air conditioners)
inside_temps = 'inside.csv' # Sensor data from sensors (in this case, building thermostats)
outside_temps = 'outside.csv' # Temperatures associated with locations instead of sensors
In [5]:
cycle_mode = 'Cool' # for selecting only particular operation modes
states = 'TX' # state or states, such as 'TX,CA[,ST]'
thermostats_file = 'thermostats.csv' # A metadata file is needed when the 'states' parameter is used.
postal_file = 'us_postal_codes_clean.csv' # Needed when 'states' parameter is used.
# These files are both in the caar repository in the data folder to serve as examples.
This dict avoids having to restate the optional arguments for each type of data.
In [6]:
kwargs = {'cycle': cycle_mode, 'states': states, 'sensors_file': thermostats_file, 'postal_file': postal_file}
The output summarizes file contents, without requiring the user to open the files themselves.
Required: before calling detect_columns(), pickle_from_file(), or dict_from_file(), specify the type of data with the auto keyword parameter, using one of the arguments 'cycles', 'sensors', or 'geospatial' (as in the cells below).
In [8]:
kwargs['auto'] = 'cycles'
In [9]:
cr.detect_columns(cycles, **kwargs)
Out[9]:
In [10]:
kwargs['cols_to_ignore'] = [4, 5, 6]
In [11]:
cr.detect_columns(cycles, **kwargs)
Out[11]:
In [12]:
cycles_pickle = cr.pickle_from_file(cycles, **kwargs)
The return value is the name of the pickle file, as a string.
In [13]:
print(cycles_pickle)
In [14]:
kwargs['auto'] = 'sensors'
In [15]:
kwargs['cols_to_ignore'] = None
In [16]:
cr.detect_columns(inside_temps, **kwargs)
Out[16]:
In [17]:
inside_pickle = cr.pickle_from_file(inside_temps, **kwargs)
In [18]:
kwargs['auto'] = 'geospatial'
In [19]:
cr.detect_columns(outside_temps, **kwargs)
Out[19]:
In [20]:
outside_pickle = cr.pickle_from_file(outside_temps, **kwargs)
Names of binary files from cr.pickle_from_file() are automatically created unless specified using the picklepath keyword argument.
For example: kwargs['picklepath'] = 'my_cycle_file.pickle' can be specified before executing.
In [21]:
for f in [cycles_pickle, inside_pickle, outside_pickle]:
print(f)
In [22]:
random_inside = cr.random_record('TX_sensors.pickle', value_only=False)
The index for records containing sensor-type observations is based on an ID and a time stamp.
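Schematically, a record looks like the following (illustrative only; the exact representation depends on the caar version):

# key:   (sensor ID, time stamp)
# value: the observation
# e.g., ((240, Timestamp('2012-07-11 14:00:00')), 75)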
In [23]:
random_inside # The value of the data observation is 75.
Out[23]:
An example of selecting a random data observation from a dict (instead of a pickle file holding a dict) follows.
Note the use of column headings to ignore some columns (above, integers indicating column positions were used).
In [24]:
kwargs['auto'] = 'cycles'
kwargs['cols_to_ignore'] = ['Minutes', 'kwH', 'BTUs']
cycles_dict = cr.dict_from_file(cycles, **kwargs)
random_cycles = cr.random_record(cycles_dict, value_only=False)
If the data contain two time stamps per record and the argument auto='cycles' is used, record keys are based on the device ID, the cycle mode, and the first time stamp (the starting time of the cycle).
The value is the second time stamp, or ending time of the cycle.
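Schematically (again illustrative only):

# key:   (device ID, cycle mode, starting time stamp)
# value: ending time stamp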
In [25]:
random_cycles
Out[25]:
In [26]:
kwargs['auto'] = 'geospatial'
kwargs['cols_to_ignore'] = None
random_outside = cr.random_record(outside_pickle, value_only=False)
In this example, the value of 55 is an outdoor temperature.
In [27]:
random_outside
Out[27]:
The only difference between the dict_from_file() and pickle_from_file() functions is that pickle_from_file() creates a pickle file, which allows faster read access in later sessions than raw text files provide.
With both functions, the output can be an input to the DataFrame creation functions, which are described next.
Using the pickle file or dict (like cycles_dict above) as input, there is one function for each kind of data (for cycles, indoor temperatures, or outdoor temperatures) that creates a pandas DataFrame.
In [28]:
cycles_pickle = 'TX_cycles.pickle'
inside_pickle = 'TX_sensors.pickle'
outside_pickle = 'TX_geospatial.pickle'
In [29]:
cycles_df = cr.create_cycles_df(cycles_pickle)
In [30]:
cycles_df.head()
Out[30]:
In [31]:
inside_df = cr.create_sensors_df(inside_pickle)
All column headings come directly from the original raw text files.
In [32]:
inside_df.head()
Out[32]:
In [33]:
outside_df = cr.create_geospatial_df(outside_pickle)
In [34]:
outside_df.head()
Out[34]:
Additional functions summarize the extent of data observations across any date range, and within days. These functions can analyze the data for groups of thermostats or filter the data by thermostat ID or location ID. This supports the creation of a data pipeline for further analysis.
The functions included in the example as well as all of the other functions in the public API are described in the documentation.
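For instance, the two selection helpers demonstrated below compose into a single pipeline step (a sketch using only functions shown in this notebook; the helper name is hypothetical):

def select_id_and_range(df, device_id, start, end):
    # Filter a CaaR DataFrame to one ID, then to a datetime range
    subset = cr.df_select_ids(df, device_id)
    return cr.df_select_datetime_range(subset, start, end)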
In [35]:
cycles_df_240 = cr.df_select_ids(cycles_df, 240) # ID for thermostat is 240
In [36]:
cycles_df_240.head()
Out[36]:
In [37]:
cr.df_select_datetime_range(cycles_df_240, '2012-06-01 15:00:00', '2012-06-01 20:00:00')
Out[37]:
In [38]:
inside_df_240 = cr.df_select_ids(inside_df, 240)
In [39]:
inside_df_240 = cr.df_select_datetime_range(inside_df_240, '2012-07-11 14:00:00', '2012-07-11 16:00:00')
In [40]:
inside_df_240.head()
Out[40]:
In [41]:
idx = pd.IndexSlice
Sensor data in a DataFrame can be selected based on a sensor ID or IDs (through slicing) and/or a date range.
In [42]:
inside_df.loc[idx[240,'2012-07-11 11:45:00':'2012-07-11 16:30:00'],:].head()
Out[42]:
Note that because the cycles DataFrame has one more index level (the cycle mode), the slice has an additional ':' between the sensor ID of 240 (in this case, the sensor is a thermostat) and the date range.
In [43]:
cycles_df.loc[idx[240,:,'2012-07-11 11:45:00':'2012-07-11 16:30:00'],:].head()
Out[43]:
See the example data file assigned to the variable thermostats_file in the GitHub repository.
In [44]:
location_id = cr.location_id_of_sensor(240, thermostats_file)
Outside temperatures at the thermostat's location
In [45]:
outside_temps_240 = cr.df_select_ids(outside_df, location_id)
In [46]:
outside_temps_240 = cr.df_select_datetime_range(outside_temps_240, '2012-07-11 14:30:00', '2012-07-11 16:30:00')
In [47]:
outside_temps_240.head()
Out[47]:
Pure pandas approach
In [48]:
outside_df.loc[idx[location_id,'2012-07-11 14:30:00':'2012-07-11 16:30:00'],:].head()
Out[48]:
In [49]:
days_cycle_data_by_id = cr.days_of_data_by_id(cycles_df)
In [50]:
days_cycle_data_by_id[0:2]
Out[50]:
In [51]:
cr.df_select_ids(days_cycle_data_by_id, [92,93])
Out[51]:
In [52]:
daily_data_240 = cr.daily_data_points_by_id(inside_df, id=240)
In [53]:
daily_data_240.head()
Out[53]:
In [54]:
daily_data_240.columns = ['Temperature Readings (Count)']
daily_data_240.head()
Out[54]:
cycling_and_obs_arrays(), plot_cycles_xy() and plot_sensor_geo_xy()
The output of cycling_and_obs_arrays() is used as input to the plot_cycles_xy() and plot_sensor_geo_xy() functions.
In [55]:
start = dt.datetime(2012, 7, 11, 11, 0)
end = dt.datetime(2012, 7, 11, 19, 0)
cycles_and_temps_240 = cr.cycling_and_obs_arrays(cycles_df=cycles_df, cycling_id=240, start=start, end=end,
sensors_df=inside_df, sensor_id=240, freq='1min')
In [56]:
cycles_x, cycles_y = cr.plot_cycles_xy(cycles_and_temps_240)
temps_x, temps_y = cr.plot_sensor_geo_xy(cycles_and_temps_240)
In [57]:
fig, ax1 = plt.subplots()
ax1.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M:%S'))
ax1.set_ylim(60.0, 80.0)
ax1.set_xlabel('Time')
ax1.set_ylabel('Degrees (F)')
ax1.plot(temps_x, temps_y, marker='o', color='r')
ax2 = ax1.twinx()
ax2.set_ylim(0.0, 1.)
ax2.set_ylabel('ON/OFF (1/0)')
ax2.set_yticks([1.0])
ax2.yaxis.grid(False)
ax2.plot(cycles_x, cycles_y)
fig.set_size_inches(8, 4.5)
plt.title('Thermostat Operation (ON/OFF) and Indoor Temperature')
plt.show() # Note: timestamps in underlying data are from polling processes that have up to a minute delay.
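To keep a copy of the figure, the standard matplotlib call can be added before plt.show() (the file name here is arbitrary):

fig.savefig('cycles_and_temps_240.png', dpi=150, bbox_inches='tight')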