In Part 1, the data compression algorithm was discussed, and in Part 2, the consequences of improperly preprocessing compressed data were illustrated with a Markov simulation. Part 3 explains the preprocessing algorithm that was developed to resolve those problems.
The following code will prepare a sample dataset of three variables using the Markov simulation from Part 2. However, two adjustments were made to the Markov simulation to better approximate the actual data retrieval procedure. First, it was assumed that the "true" signal values were unknown, and only "raw" data was available. Second, each value was assigned a timestamp, and the "None" values were removed, making the sampling frequency of the "raw" data appear to be irregular.
In [1]:
import numpy as np
import pandas as pd
from datetime import datetime
import random
mC = np.array([[0.03,0.97,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00],
               [0.17,0.17,0.66,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00],
               [0.00,0.11,0.42,0.47,0.00,0.00,0.00,0.00,0.00,0.00,0.00],
               [0.00,0.00,0.06,0.70,0.24,0.00,0.00,0.00,0.00,0.00,0.00],
               [0.00,0.00,0.00,0.03,0.87,0.10,0.00,0.00,0.00,0.00,0.00],
               [0.00,0.00,0.00,0.00,0.03,0.94,0.03,0.00,0.00,0.00,0.00],
               [0.00,0.00,0.00,0.00,0.00,0.22,0.73,0.05,0.00,0.00,0.00],
               [0.00,0.00,0.00,0.00,0.00,0.00,0.38,0.52,0.10,0.00,0.00],
               [0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.53,0.33,0.14,0.00],
               [0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.66,0.18,0.16],
               [0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.91,0.09]])
mCS = np.cumsum(mC,axis=1)
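#Sanity check (added here; not part of the original simulation): each row
#of the transition matrix should sum to 1, so each cumulative row ends at 1.0
assert np.allclose(mC.sum(axis=1), 1.0)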
#rawArray will contain the final output for three iterations (variables)
rawArray = []
for nn in range(3):
    #The code from Part 2 is repeated. However, we will not be saving the
    #"true" signal values, only the "raw" signal values.
    curState = 5
    trueStates = [5]
    for ii in range(100000):
        rn = random.random()
        for jj in range(11):
            #Check if the random number is less than the cumulative
            #probability at each state
            if (rn < mCS[curState,jj]):
                curState = jj
                break
        trueStates.append(curState)
    prevState = 5
    #rawStates is now a dict with the key being the timestamp
    rawStates = {datetime.fromtimestamp(0) : 5}
    #The difference in the following code is that "None" values are not
    #saved anymore, and each value is assigned a timestamp within the dict
    for ii in range(1,len(trueStates)):
        ts = trueStates[ii]
        if (prevState != ts):
            rawStates[datetime.fromtimestamp(ii-1)] = prevState
            rawStates[datetime.fromtimestamp(ii)] = ts
        prevState = ts
    #The results are three pandas Series with timestamps as the index
    rawArray.append(pd.Series(data=rawStates,name="Var" + str(nn)))
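As a quick check (a sketch added here, not part of the original simulation), the gaps between consecutive stored timestamps can be inspected. If the simulated "raw" data behaves like compressed historian data, these gaps should be irregular, since a value was only stored when the signal changed.
In []:
#Inspect the time deltas between consecutive raw timestamps; they should
#vary, since only changed values were stored
deltas = rawArray[0].index.to_series().diff().dropna()
print(deltas.dt.total_seconds().describe())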
An outer join must be selected when merging the three variables into a pandas DataFrame. An outer join preserves every row of data, even when a given timestamp does not exist in the index of one of the variables. Since data retrieved from the historian is compressed (i.e. values at non-stored timestamps do actually exist), an inner join (which requires timestamps to match exactly) would inappropriately discard data. The following code merges the three variables into a single DataFrame, where the index is the timestamp of the data point.
In [2]:
#Create a pandas DataFrame from the first variable
df = pd.DataFrame(rawArray[0])
for ii in range(1,3):
    df = pd.merge(left=df,right=pd.DataFrame(rawArray[ii]),
                  left_index=True,right_index=True,how="outer")
df.head(15)
Out[2]:
Notice that there are many NaN values, that few of the timestamps match across variables exactly, and that some indices are still missing entirely, even though the data was originally sampled at a 1 Hz rate. To illustrate the danger of preprocessing by matching timestamps exactly, the following code calculates the percentage of data that would be lost if an inner join were used, i.e. if every row containing a NaN in any one of the variables were removed.
In [3]:
#Percentage decrease formula is (Original - New)/Original
perc_dec = (len(df) - len(df.dropna())) / len(df)
print("The percentage decrease is: {}%".format(round(perc_dec*100,1)))
Following the outer join, the data will be resampled so that it matches the original data acquisition rate (1 Hz in this case). This can be done with the pandas "resample" method, using "1S" (1 second) as the resampling rate.
In [4]:
#Resample to a regular 1 Hz grid. The resampler's forward fill only fills
#the gaps introduced by upsampling; NaN values that already existed after
#the outer join are left in place.
df_rs = df.resample("1S").ffill()
df_rs.head(15)
Out[4]:
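Note that upsampling only fills the timestamps it creates. A minimal toy example (hypothetical data, independent of the simulation) shows that NaN values that were already present before resampling survive the resampler's forward fill:
In []:
#Toy series with a pre-existing NaN at t=5
toy = pd.Series([1.0, np.nan, 3.0],
                index=pd.to_datetime([0, 5, 10], unit="s"))
#The gaps created by upsampling are forward-filled, but the NaN at t=5
#remains and propagates into the gap behind it
print(toy.resample("1S").ffill())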
Finally, the remaining missing data should be forward-filled, replacing each NaN entry with the most recent stored value, i.e. the uncompressed signal value.
In [5]:
df_ff = df_rs.ffill()
df_ff.head(15)
Out[5]:
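The resampled-and-filled DataFrame can also be checked programmatically; the following sketch (added here, not part of the original post) confirms that no NaN values remain and that consecutive timestamps are exactly one second apart.
In []:
#Verify that no NaN values remain and the index is a complete 1 Hz range
print(df_ff.isna().sum().sum() == 0)
print((df_ff.index.to_series().diff().dropna() == pd.Timedelta(seconds=1)).all())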
As can be seen above, the data is now sampled at the correct rate (1 Hz), and all of the compressed (non-stored) values have been filled in. In summary, the correct preprocessing algorithm was determined to be: