This is the second round of sendit, and we want to look at files (or MB) processed per second and per minute. The log was produced N days after the pipeline started (9/5/2017), so for the date 9-08 N is 3 days. It includes a couple of manual tester runs (you will see them below in the data). This dataset differs from the first in that there are very few files per compressed image package.
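As a quick check on the day count described above, the elapsed days can be computed from the start date. This is only a sketch: the helper name `days_since_start` is hypothetical (it is not part of sendit), and the year of the log date is assumed to be 2017, matching the start date.

```python
from datetime import date

# Pipeline start date given above (9/5/2017)
START = date(2017, 9, 5)

def days_since_start(log_date, start=START):
    """Whole days elapsed between the pipeline start and the log date.

    Hypothetical helper for illustration only.
    """
    return (log_date - start).days

# Log dated 9-08 (year assumed to be 2017)
print(days_since_start(date(2017, 9, 8)))  # 3
```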
In [1]:
import pandas
from glob import glob
glob('*.tsv')
Out[1]:
In [2]:
files = glob('*.tsv')
df = pandas.read_csv(files[0],sep="\t",index_col=0)
done = df[df.status=="DONE"]
print("Folders that are done: %s" %done.shape[0])
In [3]:
df.size_mb.describe()
Out[3]:
In [4]:
%matplotlib inline
df.size_mb.hist()
Out[4]:
In [5]:
# Repeat the histogram without outliers (keep only folders under 100 MB)
%matplotlib inline
df[df.size_mb < 100].size_mb.hist()
Out[5]:
In [6]:
%matplotlib inline
import seaborn as sns
sns.set()
mb_min = done.size_mb / done.total_time_min
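# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest replacement
ax = sns.histplot(mb_min, kde=True, stat="density")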
ax.set_xlabel("MB per minute, N=%s folders" %done.shape[0])
Out[6]:
In [10]:
# How many MB per minute, hour, day, are we moving, on average?
timings = pandas.DataFrame({'mb_min': mb_min})
timings['mb_hour'] = timings.mb_min * 60
timings['mb_day'] = timings.mb_hour * 24
timings['gb_day'] = timings.mb_day / 1000  # decimal GB (1 GB = 1000 MB)
timings.describe()
Out[10]:
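To make the unit scaling in the cell above concrete, here is a small worked example in plain Python. The function name `scale_throughput` and the 30 MB / 2 minute folder are made up for illustration; they do not come from the log.

```python
def scale_throughput(mb_per_min):
    """Scale a MB/minute rate to MB/hour, MB/day and decimal GB/day."""
    mb_hour = mb_per_min * 60   # 60 minutes per hour
    mb_day = mb_hour * 24       # 24 hours per day
    return {
        "mb_min": mb_per_min,
        "mb_hour": mb_hour,
        "mb_day": mb_day,
        "gb_day": mb_day / 1000,  # 1 GB taken as 1000 MB, as in the cell above
    }

# Made-up folder: 30 MB moved in 2 minutes, i.e. 15 MB per minute
rates = scale_throughput(30 / 2)
print(rates)
```

Note that these figures extrapolate a per-folder rate to a full day, so they assume the pipeline runs continuously at that rate.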
In [4]:
mb_sec = done.size_mb / done.total_time_sec
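# distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest replacement
ax = sns.histplot(mb_sec, kde=True, stat="density")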
ax.set_xlabel("MB per second, N=%s folders" %done.shape[0])
Out[4]:
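One caveat when reading "average" rates: the mean of the per-folder rates (what the cells above summarize) is not the same number as total MB divided by total time. The sketch below shows the difference on made-up values; the sizes and times are invented, and the pooled figure assumes folders are processed one after another.

```python
# Made-up (size_mb, total_time_min) pairs, not taken from the log
folders = [(10.0, 1.0), (40.0, 2.0), (90.0, 3.0)]

# Mean of the per-folder rates, the quantity summarized in this notebook
per_folder = [size / minutes for size, minutes in folders]
mean_of_rates = sum(per_folder) / len(per_folder)

# Pooled rate: total MB moved over total minutes spent
total_mb = sum(size for size, _ in folders)
total_min = sum(minutes for _, minutes in folders)
pooled_rate = total_mb / total_min

print(mean_of_rates, pooled_rate)
```

Here the larger folders happen to move faster, so the pooled rate comes out higher than the mean of the per-folder rates.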
In [5]:
import seaborn as sns; sns.set(color_codes=True)
ax = sns.regplot(x=done.total_time_min, y=done.size_mb)
ax.set_xlabel("total time minutes")
ax.set_ylabel("total size (MB)")
Out[5]:
Ha, see the two losers on the right? Those are the test cases that I was running manually. Let's remove them by keeping only folders that finished in under 5 minutes.
In [6]:
filtered = done[done.total_time_min<5]
ax = sns.regplot(x=filtered.total_time_min, y=filtered.size_mb)
ax.set_xlabel("total time minutes")
ax.set_ylabel("total size (MB)")
Out[6]:
In [7]:
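# numeric_only=True skips text columns such as status (pandas >= 2.0 raises on them otherwise)
filtered.mean(numeric_only=True)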
Out[7]:
In [8]:
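# numeric_only=True skips text columns such as status (pandas >= 2.0 raises on them otherwise)
filtered.std(numeric_only=True)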
Out[8]:
Again, each batch has anywhere between 1 and 10 images. Just glancing at the logs, I would say it's usually 2-3, and sometimes 1 or 4.