To do:
Where we're doing it: a 42 MB stop_times.txt file with 1,176,283 entries.
Why: need to benchmark how long loading, querying, and editing this data takes with different storage options.
In [1]:
import pandas as pd
import time
stop_times = 'GTFSbmtc/test/stop_times.txt'
In [2]:
start = time.time()
df = pd.read_csv(stop_times, na_filter=False)
tripEntries = df.query("trip_id == '994_21_d'")
print(tripEntries)
end = time.time()
print("took {} seconds.".format(round(end-start,2)))
In [ ]:
from tinydb import TinyDB, Query
from tinydb.operations import delete
start = time.time()
dbfile = 'GTFSbmtc/test/stop_times.json'
db = TinyDB(dbfile, sort_keys=True, indent=2)
db.purge_tables()
dfInsert = pd.read_csv(stop_times, na_filter=False).to_dict(orient='records')
feedDb = db.table('stop_times')
# actual insert commented out after the first run, since it took so long (see below)
'''
feedDb.insert_multiple(dfInsert)
del dfInsert
'''
db.close()
end = time.time()
print("Time taken: {} seconds.".format(round(end-start,2)))
It took 238 seconds to import stop_times.txt into the TinyDB JSON file.
Purging the same table took 428 seconds.
WTF. Forget editing; the import and processing times alone are bloody painful.
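For completeness, querying one trip's entries out of the TinyDB table would look something like the sketch below (not benchmarked here, given the import time above):
In [ ]:
from tinydb import TinyDB, Query
# re-open the JSON store and the stop_times table created above
db = TinyDB('GTFSbmtc/test/stop_times.json')
feedDb = db.table('stop_times')
# fetch all entries belonging to one trip
Trip = Query()
tripEntries = feedDb.search(Trip.trip_id == '994_21_d')
print(str(len(tripEntries)) + ' entries found.')
db.close()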
In [ ]:
%timeit df = pd.read_csv(stop_times, na_filter=False)
trip_id = '1991_1_d'
%timeit tripEntries = df.query('trip_id == "' + trip_id + '"')
df = pd.read_csv(stop_times, na_filter=False)
tripEntries = df.query('trip_id == "' + trip_id + '"')
print(tripEntries)
Inference: It seemed to take around 10 seconds the very first time I ran it, but subsequent runs (including with other values of trip_id) got faster.
Using %%timeit as advised here, the analysis:
The slowest run took 4.23 times longer than the fastest. This could mean that an intermediate result is being cached.
5.86 s ± 2.52 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Edit: Using %timeit on individual lines instead shows that reading the file takes up most of the time, whereas the query takes much less.
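Since the CSV read dominates, the obvious pattern is to load the dataframe once and reuse it across queries. A minimal sketch of that idea (the trip_ids here are just examples):
In [ ]:
import pandas as pd
# load once: this is the expensive step
df = pd.read_csv('GTFSbmtc/test/stop_times.txt', na_filter=False)
# query many times: each lookup is comparatively cheap
for trip_id in ['994_21_d', '1991_1_d']:
    tripEntries = df.query('trip_id == @trip_id')
    print(trip_id, len(tripEntries), 'entries')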
Ref Links:
https://stackoverflow.com/questions/18172851/deleting-dataframe-row-in-pandas-based-on-column-value
df.query('line_race != 0')
try with inplace=True
In [ ]:
# %%timeit
start = time.time()
%timeit df = pd.read_csv(stop_times, na_filter=False)
trip_id = '1991_1_d'
%timeit df.query('trip_id != "' + trip_id + '"',inplace=True)
# this command edits the dataframe in place, to include only rows where trip_id is not '1991_1_d'
end = time.time()
print("took {} seconds.".format(round(end-start,2)))
In [ ]:
# checking df to see if that trip_id's entries are still there
df.query('trip_id == "' + trip_id + '"')
# yep, it's gone.
In [ ]:
# try loc :: tips = tips.loc[tips['tip'] <= 9]
start = time.time()
%timeit df = pd.read_csv(stop_times, na_filter=False)
trip_id = '1991_1_d'
%timeit df2 = df.loc[df['trip_id'] != trip_id]
end = time.time()
print("took {} seconds.".format(round(end-start,2)))
In [ ]:
%timeit df = pd.read_csv(stop_times, na_filter=False)
trip_id = '1991_1_d'
%timeit df.query('trip_id != "' + trip_id + '"',inplace=True)
outfile = 'GTFSbmtc/test/stop_times_edited.txt'
%timeit df.to_csv(outfile, index=False)
Anyway, we'll only do the deletion when we actually receive the modified set from the frontend.
Ref:
In [ ]:
tripEntries
In [ ]:
# since our incoming stop_times data will be in the form of a dict, let's simulate that
dict2Add = tripEntries.to_dict('records')
dict2Add[0]
In [ ]:
# reading csv.. this might already be done but let's time it anyways
%timeit df = pd.read_csv(stop_times, na_filter=False)
trip_id = '1991_1_d'
# removing entries for trip = 1991_1_d
%timeit df.query('trip_id != "' + trip_id + '"',inplace=True)
# now we append..
%timeit df2add = pd.DataFrame(dict2Add)
%timeit df2 = df.append(df2add, ignore_index=True)
%timeit df2.query('trip_id == "' + trip_id + '"')
outfile = 'GTFSbmtc/test/stop_times_edited.txt'
%timeit df.to_csv(outfile, index=False)
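Putting those steps together, the whole edit cycle could be wrapped in one function. A rough sketch (the function name and signature are my own, not from any library; concat is used for the append step since the two are interchangeable here):
In [ ]:
import pandas as pd
def replace_trip(csv_in, csv_out, trip_id, new_records):
    # load the full table (the dominant cost, per the timings above)
    df = pd.read_csv(csv_in, na_filter=False)
    # drop the trip's existing entries in place
    df.query('trip_id != @trip_id', inplace=True)
    # add the replacement entries (received as a list of dicts)
    df2 = pd.concat([df, pd.DataFrame(new_records)], ignore_index=True)
    # write the edited table back out
    df2.to_csv(csv_out, index=False)
    return df2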
Inference:
In [4]:
df = pd.read_csv(stop_times, na_filter=False)
print(len(df))
trip_id = '1991_1_d'
# extract trip entries for adding later
tripEntries = df.query('trip_id == "' + trip_id + '"')
print(len(tripEntries))
# removing entries for trip = 1991_1_d
df.query('trip_id != "' + trip_id + '"',inplace=True)
print(len(df))
# append.
%timeit df2 = df.append(tripEntries, ignore_index=True)
df2 = df.append(tripEntries, ignore_index=True)
print(len(df2))
# concat
%timeit df3 = pd.concat([df,tripEntries], ignore_index=True)
df3 = pd.concat([df,tripEntries], ignore_index=True)
print(len(df3))
... concat wins by a hair.
Or.. WAIT, now concat is coming in a few ms late! Darn!
Nope, when trying it in an actual Python script, concat wins.
Tried running it once.. errored out saying:
ImportError: HDFStore requires PyTables, "No module named 'tables'" problem importing
Tried directly installing tables via pip; that errored out too.
Looking it up: https://github.com/PyTables/PyTables
For installing on Ubuntu they give an apt-get install command for the dependencies.
After that, pip install tables worked. (Did it with sudo.. should put --user at the end next time!)
testing command:
python3 -m tables.tests.test_all
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version: 3.4.3
HDF5 version: 1.8.18
NumPy version: 1.13.3
Numexpr version: 2.6.4 (not using Intel's VML/MKL)
Zlib version: 1.2.8 (in Python interpreter)
LZO version: 2.09 (Feb 04 2015)
BZIP2 version: 1.0.6 (6-Sept-2010)
Blosc version: 1.14.3 (2018-04-06)
Blosc compressors: blosclz (1.1.0), lz4 (1.8.1), lz4hc (1.8.1), snappy (1.1.1), zlib (1.2.8), zstd (1.3.4)
Blosc filters: shuffle, bitshuffle
Cython version: 0.27.3
Python version: 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
Platform: Linux-4.13.0-38-generic-i686-with-Ubuntu-16.04-xenial
Byte-ordering: little
Detected cores: 2
Default encoding: utf-8
Default FS encoding: utf-8
Default locale: (en_IN, ISO8859-1)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
In [ ]:
df = pd.read_csv(stop_times, na_filter=False)
hdf5 = 'GTFSbmtc/test/stop_times.h5'
%timeit df.to_hdf(hdf5,'df',mode='w')
# have to pass mode='w' so that it overwrites instead of appending
outfile = 'GTFSbmtc/test/stop_times_edited.txt'
%timeit df.to_csv(outfile, index=False)
Inference: HDF5 rocks.
Also, apparently we can store multiple dataframes inside one file, just like tables in TinyDB.
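A quick sketch of that, putting two dataframes into one .h5 file via pandas's HDFStore (the trips.txt path here is an assumption):
In [ ]:
import pandas as pd
stopTimesDf = pd.read_csv('GTFSbmtc/test/stop_times.txt', na_filter=False)
tripsDf = pd.read_csv('GTFSbmtc/test/trips.txt', na_filter=False)  # assumed path
# one .h5 file can hold several dataframes under different keys
with pd.HDFStore('GTFSbmtc/test/gtfs.h5', mode='w') as store:
    store.put('stop_times', stopTimesDf)
    store.put('trips', tripsDf)
    print(store.keys())  # ['/stop_times', '/trips']
# read one table back by its key
df = pd.read_hdf('GTFSbmtc/test/gtfs.h5', 'stop_times')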
In [ ]:
# testing all forms of HDF5 : fixed and table
df = pd.read_csv(stop_times, na_filter=False)
# Fixed HDF5:
hdf5 = 'GTFSbmtc/test/stop_times.h5'
%timeit df.to_hdf(hdf5,'df',mode='w')
%timeit hdFixed = pd.read_hdf(hdf5,'df')
# Table HDF5:
hdf5Table = 'GTFSbmtc/test/stop_times-table.h5'
%timeit df.to_hdf(hdf5Table,'df',format='table',mode='w')
# %timeit hdTable = pd.read_hdf(hdf5Table,'df') # it's crashing the laptop here! Abort!
My laptop HANGS on the command for reading HDF5 in table format, so let's leave it. Anyway, writing in table format is also showing as too slow. Let's go with the default FIXED format of HDF5.
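For the record, the table format's main selling point is querying on disk without loading the whole file into memory. A sketch of that, untested here since the table format hung this machine (data_columns must name the column being queried):
In [ ]:
import pandas as pd
df = pd.read_csv('GTFSbmtc/test/stop_times.txt', na_filter=False)
# table format, with trip_id made a queryable data column
hdf5Table = 'GTFSbmtc/test/stop_times-table.h5'
df.to_hdf(hdf5Table, 'df', format='table', mode='w', data_columns=['trip_id'])
# select only one trip's rows straight from disk
tripEntries = pd.read_hdf(hdf5Table, 'df', where='trip_id == "1991_1_d"')
print(len(tripEntries))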
Ran an independent Python script to benchmark the times. Output of two runs:
$ python3 edit-table.py
Load csv took 2.143 seconds.
36
Querying out a trip's entries took 0.5772 seconds.
1176247
Removing a trip's entries from full df took 0.1984 seconds.
1176283
Appending trip entries to full df took 0.1177 seconds.
1176283
Concatenating trip entries to full df took 0.1142 seconds.
Storing in .h5 file took 1.8359 seconds.
Storing in .csv file took 9.7207 seconds.
Reading .h5 file took 0.5111 seconds.
$ python3 edit-table.py
Load csv took 1.7913 seconds.
36
Querying out a trip's entries took 0.2839 seconds.
1176247
Removing a trip's entries from full df took 0.1998 seconds.
1176283
Appending trip entries to full df took 0.1195 seconds.
1176283
Concatenating trip entries to full df took 0.1145 seconds.
Storing in .h5 file took 1.779 seconds.
Storing in .csv file took 9.3677 seconds.
Reading .h5 file took 0.6003 seconds.
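For reference, a minimal sketch of what edit-table.py presumably does, reconstructed from the output above (the actual script may differ):
In [ ]:
import time
import pandas as pd

stop_times = 'GTFSbmtc/test/stop_times.txt'
trip_id = '1991_1_d'

def timed(label, start):
    print('{} took {} seconds.'.format(label, round(time.time() - start, 4)))

start = time.time()
df = pd.read_csv(stop_times, na_filter=False)
timed('Load csv', start)

start = time.time()
tripEntries = df.query('trip_id == @trip_id')
print(len(tripEntries))
timed("Querying out a trip's entries", start)

start = time.time()
df.query('trip_id != @trip_id', inplace=True)
print(len(df))
timed("Removing a trip's entries from full df", start)

start = time.time()
# df.append was removed in pandas 2.0; pd.concat is the modern equivalent
df2 = df.append(tripEntries, ignore_index=True)
print(len(df2))
timed('Appending trip entries to full df', start)

start = time.time()
df3 = pd.concat([df, tripEntries], ignore_index=True)
print(len(df3))
timed('Concatenating trip entries to full df', start)

start = time.time()
df2.to_hdf('GTFSbmtc/test/stop_times.h5', 'stop_times', mode='w')
timed('Storing in .h5 file', start)

start = time.time()
df2.to_csv('GTFSbmtc/test/stop_times_edited.txt', index=False)
timed('Storing in .csv file', start)

start = time.time()
hdFixed = pd.read_hdf('GTFSbmtc/test/stop_times.h5', 'stop_times')
timed('Reading .h5 file', start)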
In [9]:
stop_id = '640'
hdf5 = 'GTFSbmtc/test/stop_times.h5'
hdFixed = pd.read_hdf(hdf5,'stop_times')
print('Loaded stop_times dataframe from .h5 file')
print('Querying stop_times dataframe for stop ' + stop_id)
start = time.time()
stopEntries = hdFixed.query('stop_id == "' + stop_id + '"')
end = time.time()
print(str(len(stopEntries)) + ' entries found.')
print("Query for stop thru full stop_times db took {} seconds.".format(round(end-start,4)))
print('Dropping the stop from stop_times table')
start = time.time()
hdFixed.query('stop_id != "' + stop_id + '"',inplace=True)
end = time.time()
print(str(len(hdFixed)) + ' entries remain in full dataframe.')
print("Removing a stop from full stop_times db took {} seconds.".format(round(end-start,4)))
Note: pass mode='w' when writing the .h5 back so it overwrites instead of appending.
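Continuing from the cell above, the write-back would then be a one-liner sketch like:
In [ ]:
# overwrite the .h5 file with the edited dataframe; mode='w' prevents appending
hdFixed.to_hdf('GTFSbmtc/test/stop_times.h5', 'stop_times', mode='w')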