Testing Pandas: changing a trip in stop_times.txt

To do:

  • Open / load stop_times.txt
  • Retrieve a trip's entries
  • Do some changes or bring in another array
  • Delete those entries
  • Append the changed / new entries to the file
  • Measure how much time all this took.
  • Do the same in a pure Python program later.

Where we're doing it: a 42 MB stop_times.txt file with 1,176,283 entries.
Why: to benchmark how long these edit operations take.


In [1]:
import pandas as pd
import time
stop_times = 'GTFSbmtc/test/stop_times.txt'

In [2]:
start = time.time()
df = pd.read_csv(stop_times, na_filter=False)
tripEntries = df.query("trip_id == '994_21_d'")
print(tripEntries)
end = time.time()
print("took {} seconds.".format(round(end-start,2)))


          trip_id  route_id  stop_id  stop_sequence arrival_time  \
1175075  994_21_d       994       24              1     15:25:00   
1175076  994_21_d       994     2034              2     15:29:14   
1175077  994_21_d       994      624              3     15:33:28   
1175078  994_21_d       994     1802              4     15:37:42   
1175079  994_21_d       994     1803              5     15:41:56   
1175080  994_21_d       994     1804              6     15:46:10   
1175081  994_21_d       994     1876              7     15:50:24   
1175082  994_21_d       994     1958              8     15:54:38   
1175083  994_21_d       994     1959              9     15:58:52   
1175084  994_21_d       994     1960             10     16:03:06   
1175085  994_21_d       994     1961             11     16:07:20   
1175086  994_21_d       994     1962             12     16:11:34   
1175087  994_21_d       994     1963             13     16:15:48   
1175088  994_21_d       994     1951             14     16:20:02   

        departure_time  
1175075       15:25:00  
1175076       15:29:14  
1175077       15:33:28  
1175078       15:37:42  
1175079       15:41:56  
1175080       15:46:10  
1175081       15:50:24  
1175082       15:54:38  
1175083       15:58:52  
1175084       16:03:06  
1175085       16:07:20  
1175086       16:11:34  
1175087       16:15:48  
1175088       16:20:02  
took 3.38 seconds.

Now testing the same on TinyDB.. but first, let's load stop_times.txt into TinyDB:


In [ ]:
from tinydb import TinyDB, Query
from tinydb.operations import delete
start = time.time()
dbfile = 'GTFSbmtc/test/stop_times.json'
db = TinyDB(dbfile, sort_keys=True, indent=2)
db.purge_tables()  # clear out any previously imported tables

dfInsert = pd.read_csv(stop_times, na_filter=False).to_dict(orient='records')
feedDb = db.table('stop_times')
# the bulk insert below was run once to do the import, then commented out
'''
feedDb.insert_multiple(dfInsert)
del dfInsert
'''
db.close()
end = time.time()
print("Time taken: {} seconds.".format(round(end-start,2)))

It took 238 seconds to import stop_times.txt into the TinyDB JSON file.
Purging the same took 428 seconds.
WTF. Forget editing.. this will make the import and processing times themselves bloody painful.

back to pandas


In [ ]:
%timeit df = pd.read_csv(stop_times, na_filter=False)
trip_id = '1991_1_d'
%timeit tripEntries = df.query('trip_id == "' + trip_id + '"')
df = pd.read_csv(stop_times, na_filter=False)
tripEntries = df.query('trip_id == "' + trip_id + '"')
print(tripEntries)

Inference: It seemed to take around 10 seconds the very first time I ran it, but in subsequent runs (including with other values of trip_id) it got faster.
Using %%timeit as advised here, the analysis:

The slowest run took 4.23 times longer than the fastest. This could mean that an intermediate result is being cached.
5.86 s ± 2.52 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Edit: Using %timeit on individual lines instead. That shows that the file reading is taking up most of the time, whereas the query takes much less.


In [ ]:
# %%timeit
start = time.time()
%timeit df = pd.read_csv(stop_times, na_filter=False)
trip_id = '1991_1_d'
%timeit df.query('trip_id != "' + trip_id + '"',inplace=True)
# this command edits the dataframe in place, to include only rows where trip_id is not '1991_1_d'

end = time.time()
print("took {} seconds.".format(round(end-start,2)))

In [ ]:
# checking df to see if that trip_id entries are there any longer
df.query('trip_id == "' + trip_id + '"')
# yep, it's gone.

In [ ]:
# try loc :: tips = tips.loc[tips['tip'] <= 9]
start = time.time()
%timeit df = pd.read_csv(stop_times, na_filter=False)
trip_id = '1991_1_d'
%timeit df2 = df.loc[df['trip_id'] != trip_id]
end = time.time()
print("took {} seconds.".format(round(end-start,2)))

Inference from delete trials:

  • df.query seems to be marginally faster than df.loc, so let's go with that. Anyway, there's the possibility of queries getting more complicated later.
  • Most of the time seems to be going into reading the CSV.

How about writing to CSV after deleting:


In [ ]:
%timeit df = pd.read_csv(stop_times, na_filter=False)
trip_id = '1991_1_d'
%timeit df.query('trip_id != "' + trip_id + '"',inplace=True)
outfile = 'GTFSbmtc/test/stop_times_edited.txt'
%timeit df.to_csv(outfile, index=False)

Full edit: remove a trip's entries and insert a modified set of entries

Anyway, we'll do the deleting only when we actually receive the modified set from the frontend.
Ref:

- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html


In [ ]:
tripEntries

In [ ]:
# since our incoming stop_times data will be in the form of a dict, let's simulate that
dict2Add = tripEntries.to_dict('records')
dict2Add[0]

In [ ]:
# reading csv.. this might already be done but let's time it anyways
%timeit df = pd.read_csv(stop_times, na_filter=False)
trip_id = '1991_1_d'
# removing entries for trip = 1991_1_d
%timeit df.query('trip_id != "' + trip_id + '"',inplace=True)

# now we append..
%timeit df2add = pd.DataFrame(dict2Add)
df2add = pd.DataFrame(dict2Add)  # re-do the assignment: variables set inside %timeit don't persist
%timeit df2 = df.append(df2add, ignore_index=True)
df2 = df.append(df2add, ignore_index=True)

# check that the appended trip's entries are present again
%timeit df2.query('trip_id == "' + trip_id + '"')

outfile = 'GTFSbmtc/test/stop_times_edited.txt'

%timeit df.to_csv(outfile, index=False)

Inference:

  • Deleting entries took more time than appending, but both are roughly 10x quicker than reading the file.
  • Writing the new file takes ~10x more time than reading, a little over 10 seconds on my machine. So this is the most expensive part of it all.
  • But compared to TinyDB's ~300 seconds, this ~10 second write time for stop_times.csv is still small.

Possible workflows

Proposal 1:

  • web_wrapper.py loads all the CSV files as dataframes at the start of the session, before any API is called.
  • All changes are made to the dataframes only.
  • Write out to the CSVs on commit.
  • Obvious drawback: Ctrl+C the main program and everything is GONE! Scratch it..

Proposal 2:

  • Load all the CSVs as dataframes from the beginning.
  • At every "Save changes.." step, change the dataframe accordingly AND write out that changed dataframe's CSV (see the sketch after this list).
  • Other tables will be fine, but this entails waiting about 10 seconds when saving any changes to stop_times.
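
A minimal sketch of what that per-save step could look like under Proposal 2. The function name, its arguments and the list-of-dicts input format are illustrative assumptions, not actual code from web_wrapper.py:

import pandas as pd

def save_trip_changes(df, trip_id, new_entries, csv_path):
    """Hypothetical 'Save changes..' handler for Proposal 2: edit the
    in-memory stop_times dataframe, then write the whole table back out."""
    # drop the existing rows for this trip
    df = df.loc[df['trip_id'] != trip_id]
    # add the modified rows coming from the frontend (a list of dicts)
    df = pd.concat([df, pd.DataFrame(new_entries)], ignore_index=True)
    # write the changed table back to its CSV; for stop_times this is the ~10 s step
    df.to_csv(csv_path, index=False)
    return df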

In [4]:
df = pd.read_csv(stop_times, na_filter=False)
print(len(df))
trip_id = '1991_1_d'

# extract trip entries for adding later
tripEntries = df.query('trip_id == "' + trip_id + '"')
print(len(tripEntries))

# removing entries for trip = 1991_1_d
df.query('trip_id != "' + trip_id + '"',inplace=True)
print(len(df))

# append.
%timeit df2 = df.append(tripEntries, ignore_index=True)
df2 = df.append(tripEntries, ignore_index=True)
print(len(df2))

# concat
%timeit df3 = pd.concat([df,tripEntries], ignore_index=True)
df3 = pd.concat([df,tripEntries], ignore_index=True)
print(len(df3))


1176283
36
1176247
131 ms ± 4.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1176283
141 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1176283

... concat wins by a hair, or.. WAIT, now concat is coming in a few ms late! Darn!
Nope: when tried in an actual Python script, concat wins.

Check out HDF5

Tried running it once.. it errored out saying:
ImportError: HDFStore requires PyTables, "No module named 'tables'" problem importing

Tried directly installing tables via pip; that errored out too.

Looking it up at https://github.com/PyTables/PyTables : for installing on Ubuntu they give an apt-get install command.

After that, pip install tables worked. (Did it with sudo.. should put --user at the end next time!)

testing command:

python3 -m tables.tests.test_all
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version:    3.4.3
HDF5 version:        1.8.18
NumPy version:       1.13.3
Numexpr version:     2.6.4 (not using Intel's VML/MKL)
Zlib version:        1.2.8 (in Python interpreter)
LZO version:         2.09 (Feb 04 2015)
BZIP2 version:       1.0.6 (6-Sept-2010)
Blosc version:       1.14.3 (2018-04-06)
Blosc compressors:   blosclz (1.1.0), lz4 (1.8.1), lz4hc (1.8.1), snappy (1.1.1), zlib (1.2.8), zstd (1.3.4)
Blosc filters:       shuffle, bitshuffle
Cython version:      0.27.3
Python version:      3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]
Platform:            Linux-4.13.0-38-generic-i686-with-Ubuntu-16.04-xenial
Byte-ordering:       little
Detected cores:      2
Default encoding:    utf-8
Default FS encoding: utf-8
Default locale:      (en_IN, ISO8859-1)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

In [ ]:
df = pd.read_csv(stop_times, na_filter=False)

hdf5 = 'GTFSbmtc/test/stop_times.h5'
%timeit df.to_hdf(hdf5,'df',mode='w')
# have to put ,mode='w' so that it over-writes instead of appending

outfile = 'GTFSbmtc/test/stop_times_edited.txt'
%timeit df.to_csv(outfile, index=False)

Inference: HDF5 rocks.
Also, apparently we can store multiple dataframes inside one .h5 file, just like tables in TinyDB.
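
For instance, something along these lines ought to put two tables into one .h5 file under different keys (not timed or run here; the stops.txt path and the gtfs.h5 filename are assumptions):

# store two dataframes in one .h5 file under different keys
stops_df = pd.read_csv('GTFSbmtc/test/stops.txt', na_filter=False)

combined_h5 = 'GTFSbmtc/test/gtfs.h5'
df.to_hdf(combined_h5, 'stop_times', mode='w')  # mode='w' starts the file afresh
stops_df.to_hdf(combined_h5, 'stops')           # default mode='a' adds a second key to the same file

# read either table back by its key
stop_times_df = pd.read_hdf(combined_h5, 'stop_times')
stops_df2 = pd.read_hdf(combined_h5, 'stops')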


In [ ]:
# testing all forms of HDF5 : fixed and table
df = pd.read_csv(stop_times, na_filter=False)

# Fixed HDF5:
hdf5 = 'GTFSbmtc/test/stop_times.h5'
%timeit df.to_hdf(hdf5,'df',mode='w')
%timeit hdFixed = pd.read_hdf(hdf5,'df')

# Table HDF5:
hdf5Table = 'GTFSbmtc/test/stop_times-table.h5'
%timeit df.to_hdf(hdf5Table,'df',format='table',mode='w')
# %timeit hdTable = pd.read_hdf(hdf5Table,'df') # it's crashing the laptop here! Abort!

My laptop HANGS on the command that reads the HDF5 file in table format, so let's leave that out. Anyway, writing in table format is also showing up as too slow. So, let's go with the default FIXED format of HDF5.

Terminal run

Ran an independent Python script to benchmark the times. Output of two runs:

$ python3 edit-table.py
Load csv took 2.143 seconds.
36
Querying out a trip's entries took 0.5772 seconds.
1176247
Removing a trip's entries from full df took 0.1984 seconds.
1176283
Appending trip entires to full df took 0.1177 seconds.
1176283
Concatenating trip entires to full df took 0.1142 seconds.
Storing in .h5 file took 1.8359 seconds.
Storing in .csv file took 9.7207 seconds.
Reading .h5 file took 0.5111 seconds.



$ python3 edit-table.py
Load csv took 1.7913 seconds.
36
Querying out a trip's entries took 0.2839 seconds.
1176247
Removing a trip's entries from full df took 0.1998 seconds.
1176283
Appending trip entires to full df took 0.1195 seconds.
1176283
Concatenating trip entires to full df took 0.1145 seconds.
Storing in .h5 file took 1.779 seconds.
Storing in .csv file took 9.3677 seconds.
Reading .h5 file took 0.6003 seconds.
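
The actual edit-table.py isn't reproduced in these notes, but a rough reconstruction from the notebook cells above would look something like this (the paths, the 'stop_times' HDF key and the exact print wording are assumptions; df.append was the current API in the pandas version used here):

import time
import pandas as pd

stop_times = 'GTFSbmtc/test/stop_times.txt'
hdf5 = 'GTFSbmtc/test/stop_times.h5'
outfile = 'GTFSbmtc/test/stop_times_edited.txt'
trip_id = '1991_1_d'

# load the full stop_times table
start = time.time()
df = pd.read_csv(stop_times, na_filter=False)
print("Load csv took {} seconds.".format(round(time.time() - start, 4)))

# pull out one trip's entries
start = time.time()
tripEntries = df.query('trip_id == "' + trip_id + '"')
print(len(tripEntries))
print("Querying out a trip's entries took {} seconds.".format(round(time.time() - start, 4)))

# drop that trip from the full table
start = time.time()
df.query('trip_id != "' + trip_id + '"', inplace=True)
print(len(df))
print("Removing a trip's entries from full df took {} seconds.".format(round(time.time() - start, 4)))

# add the entries back, once with append and once with concat
start = time.time()
df2 = df.append(tripEntries, ignore_index=True)
print(len(df2))
print("Appending trip entries to full df took {} seconds.".format(round(time.time() - start, 4)))

start = time.time()
df3 = pd.concat([df, tripEntries], ignore_index=True)
print(len(df3))
print("Concatenating trip entries to full df took {} seconds.".format(round(time.time() - start, 4)))

# write the result out both ways, then read the .h5 back
start = time.time()
df3.to_hdf(hdf5, 'stop_times', mode='w')
print("Storing in .h5 file took {} seconds.".format(round(time.time() - start, 4)))

start = time.time()
df3.to_csv(outfile, index=False)
print("Storing in .csv file took {} seconds.".format(round(time.time() - start, 4)))

start = time.time()
hdFixed = pd.read_hdf(hdf5, 'stop_times')
print("Reading .h5 file took {} seconds.".format(round(time.time() - start, 4)))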

Dropping a stop


In [9]:
stop_id = '640'
hdf5 = 'GTFSbmtc/test/stop_times.h5'
hdFixed = pd.read_hdf(hdf5,'stop_times')
print('Loaded stop_times dataframe from .h5 file')

print('Querying stop_times dataframe for stop ' + stop_id)
start = time.time()
stopEntries = hdFixed.query('stop_id == "' + stop_id + '"')
end = time.time()
print(str(len(stopEntries)) + ' entries found.')
print("Query for stop thru full stop_times db took {} seconds.".format(round(end-start,4)))

print('Dropping the stop from stop_times table')
start = time.time()
hdFixed.query('stop_id != "' + stop_id + '"',inplace=True)
end = time.time()
print(str(len(hdFixed)) + ' entries remain in full dataframe.')
print("Removing a stop from full stop_times db took {} seconds.".format(round(end-start,4)))


Loaded stop_times dataframe from .h5 file
Querying stop_times dataframe for stop 640
380 entries found.
Query for stop thru full stop_times db took 0.1108 seconds.
Dropping the stop from stop_times table
1175903 entries remain in full dataframe.
Removing a stop from full stop_times db took 0.3179 seconds.

Revised Workflow

Proposal 3

  • All tables (the CSVs in .txt files) get stored as .h5 versions at the time of GTFS feed import.
  • Keep them in separate .h5 files of their own, even though we apparently can store multiple tables in the same file.. it will help do things quickly.
  • Blank slate: create empty dataframes with the prescribed columns and store them in .h5 files.
  • Reading the DB: load from the .h5 file into a pandas dataframe, query the dataframe.
  • Writing the DB: in case of a full table over-write, simply write to .h5 with mode='w' so it doesn't append.
  • Writing the DB: in case of a partial over-write, do the routine worked out above: load from the .h5 file, remove entries, concat the new entries, write back to the .h5 file (see the sketch below).
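
A minimal sketch of that partial over-write routine. The function name and arguments are illustrative assumptions; it just strings together the steps benchmarked above:

import pandas as pd

def replace_rows(h5_path, key, id_column, id_value, new_rows):
    """Hypothetical partial over-write: drop all rows whose id_column matches
    id_value, concat the replacement rows, and write the table back to .h5."""
    df = pd.read_hdf(h5_path, key)              # Reading DB
    df = df.loc[df[id_column] != id_value]      # remove the old entries
    df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)  # concat the new entries
    df.to_hdf(h5_path, key, mode='w')           # full over-write of the .h5 file
    return df

# e.g. replacing one trip's entries in stop_times:
# replace_rows('GTFSbmtc/test/stop_times.h5', 'stop_times', 'trip_id', '1991_1_d', dict2Add)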