In Part 1, the data compression algorithm was discussed, and in Part 2, the consequences of improperly preprocessing compressed data were illustrated with a Markov simulation. Part 3 explains the preprocessing algorithm that was developed to resolve those problems.
The following code will prepare a sample dataset of three variables using the Markov simulation from Part 2. However, two adjustments were made to the Markov simulation to better approximate the actual data retrieval procedure. First, it was assumed that the "true" signal values were unknown, and only "raw" data was available. Second, each value was assigned a timestamp, and the "None" values were removed, making the sampling frequency of the "raw" data appear to be irregular.
In [1]:
import numpy as np
import pandas as pd
from datetime import datetime
import random
mC = np.array([[0.03,0.97,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00],
               [0.17,0.17,0.66,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00],
               [0.00,0.11,0.42,0.47,0.00,0.00,0.00,0.00,0.00,0.00,0.00],
               [0.00,0.00,0.06,0.70,0.24,0.00,0.00,0.00,0.00,0.00,0.00],
               [0.00,0.00,0.00,0.03,0.87,0.10,0.00,0.00,0.00,0.00,0.00],
               [0.00,0.00,0.00,0.00,0.03,0.94,0.03,0.00,0.00,0.00,0.00],
               [0.00,0.00,0.00,0.00,0.00,0.22,0.73,0.05,0.00,0.00,0.00],
               [0.00,0.00,0.00,0.00,0.00,0.00,0.38,0.52,0.10,0.00,0.00],
               [0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.53,0.33,0.14,0.00],
               [0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.66,0.18,0.16],
               [0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.91,0.09]])
mCS = np.cumsum(mC,axis=1)
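#Sanity check (added here; not part of the original simulation): each row
#of the transition matrix should sum to 1, so each cumulative row ends at 1.0
assert np.allclose(mC.sum(axis=1), 1.0)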
#rawArray will contain the final output for three iterations (variables)
rawArray = []
for nn in range(3):
    #The code from Part 2 is repeated. However, we will not be saving the
    #"true" signal values, only the "raw" signal values.
    curState = 5
    trueStates = [5]
    for ii in range(100000):
        rn = random.random()
        for jj in range(11):
            #Check if the random number is less than the cumulative
            #probability at each state
            if (rn < mCS[curState,jj]):
                curState = jj
                break
        trueStates.append(curState)
    prevState = 5
    #rawStates is now a dict with the key being the timestamp
    rawStates = {datetime.fromtimestamp(0) : 5}
    #The difference in the following code is that "None" values are not
    #saved anymore, and each value is assigned a timestamp within the dict
    for ii in range(1,len(trueStates)):
        ts = trueStates[ii]
        if (prevState != ts):
            rawStates[datetime.fromtimestamp(ii-1)] = prevState
            rawStates[datetime.fromtimestamp(ii)] = ts
        prevState = ts
    #The results are three pandas Series with timestamps as the index
    rawArray.append(pd.Series(data=rawStates,name="Var" + str(nn)))
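As a quick check (a sketch added here, not part of the original simulation), the gaps between consecutive stored timestamps can be inspected. If the simulated "raw" data behaves like compressed historian data, these gaps should be irregular, since a value was only stored when the signal changed.
In []:
#Inspect the time deltas between consecutive raw timestamps; they should
#vary, since only changed values were stored
deltas = rawArray[0].index.to_series().diff().dropna()
print(deltas.dt.total_seconds().describe())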
An outer join must be selected when merging the three variables into a pandas DataFrame. An outer join preserves every row of data, even when a given timestamp does not exist in the index of one of the variables. Since data retrieved from the historian is compressed (i.e. values at non-stored timestamps do actually exist), an inner join (which requires timestamps to match exactly) would inappropriately discard data. The following code merges the three variables into a single DataFrame, where the index is the timestamp of the data point.
In [2]:
#Create a pandas DataFrame from the first variable
df = pd.DataFrame(rawArray[0])
for ii in range(1,3):
    df = pd.merge(left=df,right=pd.DataFrame(rawArray[ii]),
                  left_index=True,right_index=True,how="outer")
df.head(15)
Out[2]:
Notice that there are many NaN values, that few of the timestamps match across variables exactly, and that some indices are still missing entirely, even though the data was originally sampled at a 1 Hz rate. To illustrate the danger of preprocessing by matching timestamps exactly, the following code calculates the percentage of data that would be lost if an inner join were used, i.e. if every row containing a NaN in any one of the variables were removed.
In [3]:
#Percentage decrease formula is (Original - New)/Original
perc_dec = (len(df) - len(df.dropna())) / len(df)
print("The percentage decrease is: {}%".format(round(perc_dec*100,1)))
Following the outer join, the data will be resampled so that it matches the original data acquisition rate (1 Hz in this case). This can be done with the pandas "resample" method, using "1S" (1 second) as the resampling rate.
In [4]:
#Resample to a regular 1 Hz grid. The resampler's forward fill only fills
#the gaps introduced by upsampling; NaN values that already existed after
#the outer join are left in place.
df_rs = df.resample("1S").ffill()
df_rs.head(15)
Out[4]:
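Note that upsampling only fills the timestamps it creates. A minimal toy example (hypothetical data, independent of the simulation) shows that NaN values that were already present before resampling survive the resampler's forward fill:
In []:
#Toy series with a pre-existing NaN at t=5
toy = pd.Series([1.0, np.nan, 3.0],
                index=pd.to_datetime([0, 5, 10], unit="s"))
#The gaps created by upsampling are forward-filled, but the NaN at t=5
#remains and propagates into the gap behind it
print(toy.resample("1S").ffill())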
Finally, the remaining missing data should be forward-filled, replacing each NaN entry with the most recent stored value, i.e. the uncompressed signal value.
In [5]:
df_ff = df_rs.ffill()
df_ff.head(15)
Out[5]:
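The resampled-and-filled DataFrame can also be checked programmatically; the following sketch (added here, not part of the original post) confirms that no NaN values remain and that consecutive timestamps are exactly one second apart.
In []:
#Verify that no NaN values remain and the index is a complete 1 Hz range
print(df_ff.isna().sum().sum() == 0)
print((df_ff.index.to_series().diff().dropna() == pd.Timedelta(seconds=1)).all())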
As can be seen above, the data is now sampled at the correct rate (1 Hz), and all of the compressed (non-stored) values have been filled in. In summary, the correct preprocessing algorithm was determined to be: