LibInsight Transposition Utility

This notebook provides a means to convert hourly gate or head count data for upload into LibInsight by 'flattening' a multi-column table into two columns.

It is recommended that the source file be moved or copied to the same directory as this notebook.

Two means of checking the correctness of the output have been included:

  1. The expected number of rows in the output is calculated as the number of rows in the source data multiplied by the number of columns in the source data. Because of the way the table is flattened, the output of this process should have exactly that many rows.
  2. Because every data point measures the same thing, it is also possible to compare the sum of all counts in the source data against the sum of counts following transposition. These totals should be identical.

Note that Pandas can import and export Excel files as well as CSV. If that is preferable, reading Excel and exporting CSV can be accomplished with only minor changes to the code.
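For example, here is a minimal sketch of the Excel-reading variant; the file name is only a placeholder, not one used elsewhere in this notebook:

import pandas as pd

# pd.read_excel takes the place of pd.read_csv; .xlsx files typically require
# the openpyxl package to be installed
df = pd.read_excel('counts.xlsx', index_col=0)

The rest of the notebook, including the CSV export at the end, would be unchanged.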

NOTE: If confidence in this process is high, it is possible to run all of the cells with one command. Before doing so, please make sure to edit the source and output file names as needed.


Import modules


In [58]:
import numpy as np
import pandas as pd

Provide the name of the source file

Edit the next line as needed. It is recommended to remove spaces and special characters from the source file name before proceeding.


In [59]:
source_file = 'fadl.csv'

Read/Open the source file and print the first five rows

NOTE: To read Excel files instead, simply change the pd.read_csv to pd.read_excel.


In [60]:
df = pd.read_csv(source_file, index_col=0)   # the first column (the dates) becomes the row index
df.head()


Out[60]:
8:30 9:30 10:30 11:30 12:30 13:30 14:30 15:30 16:30 17:30 18:30 19:30 20:30 21:30
1/17/2011 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1/18/2011 10 19 39 50 0 40 35 36 28 30 27 0 15 6
1/19/2011 7 12 48 64 52 53 52 57 0 0 22 18 19 15
1/20/2011 9 14 37 55 0 55 62 43 47 27 22 22 14 10
1/21/2011 3 16 0 65 69 44 0 47 0 31 0 0 0 0

Get the number of expected rows in the transposed data


In [61]:
expected_rows = len(df.index)*len(df.columns)
print("Expected number of rows in the output: ", str(expected_rows))


Expected number of rows in the output:  30324

Get the sum total of all the data in the source table


In [62]:
source_sum = df.sum().sum()
print("Sum total of source data: ", str(source_sum))


Sum total of source data:  696893

Transpose the columns into rows

Transposition is quickly done using the Pandas 'stack' method, which moves the column labels (the times) into a second level of the row index, leaving a single column of counts. The stacked result can't be exported as-is because we still need to add a column with the correct date/time format.

Following stacking, double check the row count. The number of rows in the stacked result should be equal to the expected number of rows calculated above. The sum total could be checked at this point as well, but that second check is left to the end of the process.


In [63]:
stacked = df.stack()

stacked_rows = len(stacked)
if stacked_rows == expected_rows:
    print("So far so good. Expecting", str(expected_rows), "rows and transposed data frame has", str(stacked_rows), "rows.")
else:
    print("Error. Expecting", str(expected_rows), "rows and transposed data frame has", str(stacked_rows), "rows. Please check.")


So far so good. Expecting 30324 rows and transposed data frame has 30324 rows.
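As an optional aside, not part of the original workflow, the structure of the stacked result can be inspected before moving on:

# The stacked result is indexed by (date, time) pairs, with one count per pair
print(stacked.head())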

Add the date/time column

The "stacked" data frame has two indices - one for the date, a second for the time. These need to be "combined" into a single column. The approach taken here, which is simple though perhaps not most efficient, is to create a new, empty data frame to hold the final output. Then, for each row in the "stacked" data frame, append concatenated date/time info and count info as a new row into the output data frame.

Note that this step may take some time.

As a quick check, view the first five rows of the output. Ignore the leftmost "column" of zeros or integers. This is an index, which is required by Pandas data frames. The index will be excluded from the final CSV output.


In [64]:
cols = ['date_time', 'count']
output = pd.DataFrame(columns=cols)

for i in range(len(stacked)):
    # Each index entry is a (date, time) pair; join the two with a colon
    tocat = stacked.index[i]
    dt = tocat[0] + ":" + tocat[1]
    v = stacked.iloc[i]
    tmpdf = pd.DataFrame([[dt, v]], columns=cols)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job here
    output = pd.concat([output, tmpdf])

output.head()


Out[64]:
date_time count
0 1/17/2011:8:30 0
0 1/17/2011:9:30 0
0 1/17/2011:10:30 0
0 1/17/2011:11:30 0
0 1/17/2011:12:30 0
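As an aside, the same result can be produced without the explicit loop. This is only a sketch, assuming both index levels were read as strings (as they are from this CSV); the intermediate names 'alt', 'date', 'time', and 'output_alt' are introduced purely for illustration:

# Turn the two index levels back into ordinary columns
alt = stacked.reset_index()
alt.columns = ['date', 'time', 'count']
# Join date and time with a colon, matching the format built in the loop above
alt['date_time'] = alt['date'] + ':' + alt['time']
output_alt = alt[['date_time', 'count']]

The counts and date/time strings in output_alt should match the loop-built output row for row; only the integer index differs.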

Check for errors

Before outputting the results to CSV, check to verify that the data frame has the correct number of expected rows AND the same sum total as the source data.


In [65]:
output_rows = len(output)

if output_rows == expected_rows:
    print("First check successful. Expecting", str(expected_rows), "rows and final data frame has", str(output_rows), "rows.")
else:
    print("Error. Expecting", str(expected_rows), " rows and final data frame has", str(output_rows), "rows. Please check.")


First check successful. Expecting 30324 rows and final data frame has 30324 rows.

In [66]:
output_sum = output['count'].sum()   # sum only the 'count' column

if output_sum == source_sum:
    print("Second check successful. Expecting sum total of", str(source_sum), "and final total is", str(output_sum), ".")
else:
    print("Error. Expecting sum total of", str(source_sum), "and final total is", str(output_sum), ".")


Second check successful. Expecting sum total of 696893 and final total is 696893 .

Write results to file

If both checks are successful, the final results can be written out to a CSV file. Provide a different name for the output file as needed.


In [67]:
output_file = 'fadl_data.csv'
output.to_csv(output_file, index=False)
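If an Excel copy of the output is wanted instead of, or in addition to, the CSV, pandas can write that as well. A brief sketch, with a placeholder file name (writing .xlsx files typically requires the openpyxl package):

# Optional: write the same two-column output to an Excel workbook
output.to_excel('fadl_data.xlsx', index=False)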