This notebook provides a means to convert hourly gate or head count data for upload into LibInsight by 'flattening' a multi-column table into two columns.
It is recommended that the source file be moved or copied to the same directory as this notebook.
Two means of checking the correctness of the output have been included: a comparison of the expected and actual row counts, and a comparison of the sum totals of the source data and the output.
Note that Pandas can import and export Excel files as well as CSV. If that is preferable, reading Excel and exporting CSV can be accomplished with only minor changes to the code.
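For example, if the source were an Excel workbook instead of a CSV file, only the read step would need to change. The sketch below assumes a hypothetical workbook named 'fadl.xlsx' and that an Excel engine such as openpyxl is installed:

# Hypothetical Excel variant of the read step; a drop-in replacement for pd.read_csv below.
df = pd.read_excel('fadl.xlsx', index_col=0)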
NOTE: If confidence in this process is high, it is possible to run all of the cells with one command. Before doing so, please make sure to edit the source and output file names as needed.
In [58]:
import numpy as np
import pandas as pd
In [59]:
source_file = 'fadl.csv'
In [60]:
df = pd.read_csv(source_file, index_col=0)
df.head()
Out[60]:
In [61]:
expected_rows = len(df.index)*len(df.columns)
print("Expected number of rows in the output: ", str(expected_rows))
In [62]:
source_sum = df.sum().sum()
print("Sum total of source data: ", str(source_sum))
Reshaping is quickly done with the Pandas 'stack' method, which pivots the hour columns into the row index, producing a Series indexed by date/time pairs. The stacked result can't be exported as-is because we still need a single column with the date and time combined in the correct format.
Following stacking, double-check the row count: the number of rows in the stacked data should equal the expected number of rows in the output. The sum total could be checked here as well, but that second check is left to the end of the process.
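As a small illustration of what 'stack' does (made-up numbers, not the actual gate counts): a frame with dates as the index and hours as columns becomes a Series indexed by (date, hour) pairs.

# Illustrative only -- tiny made-up example, not the actual gate counts.
demo = pd.DataFrame({'08:00': [5, 7], '09:00': [12, 9]},
                    index=['2019-01-02', '2019-01-03'])
demo.stack()
# 2019-01-02  08:00     5
#             09:00    12
# 2019-01-03  08:00     7
#             09:00     9
# dtype: int64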
In [63]:
stacked = df.stack()
stacked_rows = len(stacked)
if stacked_rows == expected_rows:
    print("So far so good. Expecting", expected_rows, "rows and transposed data frame has", stacked_rows, "rows.")
else:
    print("Error. Expecting", expected_rows, "rows and transposed data frame has", stacked_rows, "rows. Please check.")
The "stacked" data frame has two indices - one for the date, a second for the time. These need to be "combined" into a single column. The approach taken here, which is simple though perhaps not most efficient, is to create a new, empty data frame to hold the final output. Then, for each row in the "stacked" data frame, append concatenated date/time info and count info as a new row into the output data frame.
Note that this step may take some time.
As a quick check, view the first five rows of the output. Ignore the leftmost "column" of zeros or integers. This is an index, which is required by Pandas data frames. The index will be excluded from the final CSV output.
In [64]:
cols = ['date_time', 'count']
output = pd.DataFrame(columns=cols)
for i in range(len(stacked)):
    tocat = stacked.index[i]            # (date, time) pair from the MultiIndex
    dt = tocat[0] + ":" + tocat[1]      # combine into a single date_time string
    v = stacked.iloc[i]
    tmpdf = pd.DataFrame([[dt, v]], columns=cols)
    # DataFrame.append was removed in pandas 2.0; concatenate instead.
    output = pd.concat([output, tmpdf], ignore_index=True)
output.head()
Out[64]:
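For reference, the same result can be produced without the row-by-row loop. This is only a sketch under the same assumptions as above (both index levels of 'stacked' are strings, joined with a colon); it is not part of the notebook's workflow:

# Loop-free sketch: turn the MultiIndex levels into ordinary columns, then combine.
alt = stacked.reset_index()
alt.columns = ['date', 'time', 'count']
alt['date_time'] = alt['date'] + ':' + alt['time']
output_alt = alt[['date_time', 'count']]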
In [65]:
output_rows = len(output)
if output_rows == expected_rows:
    print("First check successful. Expecting", expected_rows, "rows and final data frame has", output_rows, "rows.")
else:
    print("Error. Expecting", expected_rows, "rows and final data frame has", output_rows, "rows. Please check.")
In [66]:
output_sum = output['count'].sum()
if output_sum == source_sum:
    print("Second check successful. Expecting sum total of", source_sum, "and final total is", str(output_sum) + ".")
else:
    print("Error. Expecting sum total of", source_sum, "and final total is", str(output_sum) + ".")
In [67]:
output_file = 'fadl_data.csv'
output.to_csv(output_file, index=False)