This notebook shows how to read and work with DDC trend file data. It uses a Siemens CSV reader function, found in the ddc_readers module, to read the data into a Pandas DataFrame. It then shows how to do some quick exploration and summarization of the data. Next, some plotting techniques are shown. Finally, calculation of new values is shown, with specific emphasis on how to deal with missing data and non-overlapping timestamps.
In [1]:
# Import the needed libraries
import pandas as pd
import numpy as np
import ddc_readers # the module that has DDC trend file readers
# import matplotlib pyplot commands
import matplotlib.pyplot as plt
# Show Plots in the Notebook
%matplotlib inline
# Increase the size of plots and their fonts
plt.rcParams['figure.figsize'] = (10, 8)  # set chart size
plt.rcParams['font.size'] = 14            # set font size in charts
# Style the plots using the 'bmh' style
plt.style.use('bmh')
In [2]:
# The parameter in the function below controls whether the location of each
# point is included in its point name.
df = ddc_readers.siemens_reader('data/siemens_sample.csv', include_location_in_point_name=False)
# Show the first 5 rows of the DataFrame
df.head()
Out[2]:
You can see that there is a separate column for each point. The index of the DataFrame holds the timestamps of the readings. Some of the points don't have data for every possible timestamp, as indicated by the NaN values (these are numpy.nan values).
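If you are curious how much data is missing for each point, here is a quick sketch that counts the NaN values in each column (this is standard Pandas, nothing specific to ddc_readers):

# Count the NaN values in each column to see how sparse each point is
df.isnull().sum()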
The head method can show however many rows you want:
In [3]:
df.head(3)
Out[3]:
The iloc index locator approach also allows you to limit the number of columns shown.
In [4]:
# Here we will look at the first 5 rows and only the first 3 columns
# Note the use of square brackets, not parentheses!
df.iloc[:5, :3]
Out[4]:
In [5]:
# Look at the last 5 rows
df.tail()
Out[5]:
In [6]:
# Look at the last 5 rows, but only 3 columns.
# the -5 means start with the 5th row back from the end
df.iloc[-5:, :3]
Out[6]:
You can export the entire DataFrame to an Excel spreadsheet if you want. When you look at the spreadsheet, you see that the bh_100e_tec_room_temp point spans a greater range of date/times than the other points. That is why you see so many NaN's at the end of this DataFrame.
In [7]:
df.to_excel('siemens.xlsx')
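If you want to check the date/time span of each point without opening the spreadsheet, here is a small sketch that prints the first and last timestamp with actual (non-NaN) data for each column:

# Print the first and last timestamp holding real data for each point
for col in df.columns:
    print(col, df[col].first_valid_index(), df[col].last_valid_index())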
You can see the names of all the points present by looking at the columns attribute of the DataFrame.
In [8]:
df.columns
Out[8]:
If you want to know how many different timestamps there are, you can use the len function on the DataFrame. Remember that most DDC points do not have a value for every timestamp.
In [9]:
len(df)
Out[9]:
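You can compare that number to the count of actual (non-NaN) values in each column; the difference shows how sparse each point is:

# Number of non-NaN values in each column, for comparison with len(df)
df.count()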
The describe() method will give you statistical summaries for each column. The count value shown below only counts the rows where there is actual data, not a NaN. describe() only includes the numeric columns, which is why there are fewer columns in this summary than in the whole DataFrame. Some of these DDC points have text values like 'On' and 'Off' or 'Heat' and 'Cool'.
In [10]:
df.describe()
Out[10]:
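describe() skipped the text-valued points. If you want to look at those, one approach (a sketch) is to select columns by data type; text values are stored with the 'object' dtype:

# Show the first few rows of just the text-valued columns
df.select_dtypes(include=['object']).head()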
Pandas provides a number of plotting functions, which call Matplotlib plotting functions under the hood. Here is how you would plot a couple of columns using the Pandas plotting functions:
In [11]:
df[['bh_101_tec_room_temp', 'bh_uhw_tec_room_temp']].plot()
Out[11]:
I'll address the gap problems in a moment. Notice that you need to use double square brackets to select particular columns from the DataFrame. Here is how you would create the same graph using Matplotlib function calls directly. The values to use for the X axis are the date/times in the index of the DataFrame:
In [12]:
plt.plot(df.index, df.bh_101_tec_room_temp, label='bh_101_tec_room_temp')
plt.plot(df.index, df.bh_uhw_tec_room_temp, label='bh_uhw_tec_room_temp')
plt.xticks(rotation=35)
plt.legend()
Out[12]:
This is more complicated, but sometimes you need the flexibility of Matplotlib to accomplish what you need. So, it is good to know both methods. Also, you can mix Pandas plotting functions with Matplotlib functions in the same plot. I'll stick to Pandas plotting calls for the rest of the tutorial.
The gaps in the lines are caused by the NaN values scattered throughout the data. These occur because the timestamps for one sensor point are not exactly the same as the timestamps for the other sensors. In the graph above, some sensor has values on October 10 near 8 am, but the two sensors we are plotting do not have values then. To have these gaps spanned by a connecting line, you can drop out the NaN values before plotting using the dropna() method:
In [13]:
df.bh_101_tec_room_temp.dropna().plot(legend=True)
df.bh_uhw_tec_room_temp.dropna().plot(legend=True)
Out[13]:
Note that I had to use the legend=True parameter in the plot call, because when Pandas plots a single series, it defaults to not showing a legend. I had to plot the series separately because the NaN values for each series may not occur at the same set of timestamps, so separate dropna() calls are needed on each series.
If you have a number of columns you want to plot, you can write a loop to simplify the code. The code below first creates a list of columns to plot and then loops through to plot each one individually. The list is generated by finding every sensor point that has "temp" in its name using a Python "list comprehension".
In [14]:
temp_pts = [col for col in df.columns if 'temp' in col]
temp_pts
Out[14]:
In [15]:
for pt in temp_pts:
    df[pt].dropna().plot(legend=True)
plt.legend(loc='upper right')  # mixed in a Matplotlib command
This section will address how to perform calculations with the point data while dealing with the fact that the timestamps of the sensors are not always aligned. For purposes of instruction, I'll create a DataFrame with two sensors that have obvious misalignment of the timestamps. The first sensor will be the actual sensor bh_100w_tec_room_temp, which generally has 15 minute spacing between points, and the second sensor will be fabricated with random values at 5 minute spacing.
In [16]:
# Select out the bh_100w_tec_room_temp sensor, drop the NaNs,
# and give the column a shorter name.
df_t1 = df[['bh_100w_tec_room_temp']].dropna()
df_t1.columns = ['temp1'] # rename column
df_t1.head()
Out[16]:
In [17]:
# Create an index for the fabricated sensor that spans the time range
# of temp1 but has 5 minute spacing.
start_ts = df_t1.index[0] # first timestamp in df_t1
end_ts = df_t1.index[-1] # last timestamp in df_t1
fab_index = pd.date_range(start_ts, end_ts, freq='5Min')
fab_index
Out[17]:
In [18]:
# Create random temperature values, normally distributed, with a mean
# of 71.0 and a standard deviation of 3.0
vals = np.random.normal(71.0, 3.0, size=len(fab_index))
# make a new DataFrame with these values and the prior index
df_fab = pd.DataFrame(vals, index=fab_index, columns=['temp2'])
df_fab.head()
Out[18]:
In [19]:
# Combine the two DataFrames horizontally to make the final DataFrame
# that we will work with in this section
df_final = pd.concat([df_t1, df_fab], axis=1)
df_final.head(6)
Out[19]:
So, let's say we want to calculate the temperature difference between temp1 and temp2 and create a new column with that value. We can do the calculation directly, like this:
In [20]:
df1 = df_final.copy() # make a copy of the DataFrame to work with in this example
df1['temp_diff'] = df1.temp1 - df1.temp2
df1.head(6)
Out[20]:
But the temperature difference is only calculated at timestamps that are common to both sensors. The temp2 data at the intervening timestamps has no influence on the calculation, so that information is completely ignored. This problem is especially severe with wireless sensors, which are not coordinated at all in when they report data; two wireless sensors may have no overlapping timestamps at all.
There are at least two reasonable ways to deal with the problem. First, you can average the data points into intervals long enough that every sensor has at least one data point in each interval. For this example, we can average the data into 15 minute intervals using the Pandas resample function:
In [21]:
df2 = df_final.copy() # make a copy of the DataFrame to work with in this example
df2 = df2.resample('15min').mean()
df2.head()
Out[21]:
Each column is averaged into 15 minute intervals, and those intervals have a common timestamp (you can affect where that timestamp is located in the interval using the label and loffset parameters of resample; a sketch of label appears after the next cell). All of the temp2 values are used to come up with the average values for that sensor, so no data points are being thrown away. You can now subtract the two columns in the same way as before:
In [22]:
df2['temp_diff'] = df2.temp1 - df2.temp2
df2.head()
Out[22]:
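Here is the promised sketch of the label parameter (leaving loffset aside). It stamps each 15 minute average with the end of its interval instead of the beginning:

# label='right' puts each interval's timestamp at its right (ending) edge
df2b = df_final.resample('15min', label='right').mean()
df2b.head()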
That's a pretty good solution in many situations. Sometimes, though, you would like to keep a higher time-resolution for your final DataFrame. To do this, instead of averaging the sensors, you can interpolate the values to fill in those timestamps that contain NaN's. In this example, no NaN's appear for temp2, so we don't need to interpolate that sensor, but we do need to interpolate temp1:
In [23]:
df3 = df_final.copy() # make a copy of the DataFrame to work with in this example
df3.temp1.interpolate(inplace=True)
# inplace=True saves me from having to do: df3.temp1 = df3.temp1.interpolate()
df3.head()
Out[23]:
A linear interpolation was used to fill in the missing temp1 values. Other types of interpolation are available with the Pandas interpolate function; one is sketched after the next cell. Now we can do the subtraction:
In [24]:
df3['temp_diff'] = df3.temp1 - df3.temp2
df3.head()
Out[24]:
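As an example of another interpolation type, here is a sketch using method='time', which weights the interpolation by the actual spacing of the timestamps; that can matter when the gaps between readings are uneven. It requires a datetime index, which this DataFrame has:

df3b = df_final.copy()
# Use time-weighted rather than linear interpolation to fill the NaN's
df3b['temp1'] = df3b['temp1'].interpolate(method='time')
df3b['temp_diff'] = df3b.temp1 - df3b.temp2
df3b.head()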
You can compare the results of these approaches in the following graph. The first hour of data is shown.
In [25]:
df1.loc[:'2017-10-10 1:00:00', 'temp_diff'].dropna().plot(legend=True, marker='o', label='no processing')
df2.loc[:'2017-10-10 1:00:00', 'temp_diff'].plot(legend=True, marker='o', label='15 min averages')
df3.loc[:'2017-10-10 1:00:00', 'temp_diff'].plot(legend=True, marker='o', label='interpolation')
plt.title('Calculated Temperature Difference');