Sendit Google Deep Learning Lungren Metrics

This is the second round of sendit, and we want to look at files (or MB) processed per second and per minute. This log was produced after running the pipeline for N days after the start date (9/5/2017), so the 9-08 log corresponds to 3 days. The log includes a couple of manual tester runs (you will see them in the data below). This dataset differs from the first in that there are very few files per compressed image package.
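
As a quick aside, the elapsed days can be recovered from the log filenames themselves. Here is a minimal sketch (not part of the original run), assuming the sendit-process-time-YYYY-MM-DD.tsv naming shown in the next cell:


In [ ]:
# Sketch: recover elapsed days from log filenames, given a 9/5/2017 start
from datetime import datetime
from glob import glob

start = datetime(2017, 9, 5)
for path in glob('sendit-process-time-*.tsv'):
    stamp = path.replace('sendit-process-time-', '').replace('.tsv', '')
    logged = datetime.strptime(stamp, '%Y-%m-%d')
    print(path, (logged - start).days, 'days after start')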


In [1]:
import pandas
from glob import glob
glob('*.tsv')


Out[1]:
['sendit-process-time-2017-09-08.tsv', 'sendit-process-time-2017-09-06.tsv']

In [2]:
files = glob('*.tsv')
df = pandas.read_csv(files[0],sep="\t",index_col=0)
done = df[df.status=="DONE"]
print("Folders that are done: %s" %done.shape[0])


Folders that are done: 10079

Above we filter the collected data frame to rows with a status of DONE, so we can be confident that both a start and a finish time are present.
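
Since the frame also carries start_time, finish_time, and total_time_sec columns (they appear in the summary statistics further down), a quick sanity check is possible. A sketch, assuming the timestamps are epoch seconds (which the later means around 1.5e9 suggest):


In [ ]:
# Sketch: total_time_sec should match finish_time - start_time for DONE rows
recomputed = done.finish_time - done.start_time
print("Largest disagreement (seconds): %s" % (recomputed - done.total_time_sec).abs().max())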

What is the size of a batch?

The first plot shows all the data, and the one below it is filtered to exclude outliers.


In [3]:
df.size_mb.describe()


Out[3]:
count    10079.000000
mean        16.373695
std         10.014129
min          0.006363
25%         12.520584
50%         12.521494
75%         17.849934
max        255.815571
Name: size_mb, dtype: float64

In [4]:
%matplotlib inline
df.size_mb.hist()


Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7c55d440f0>

In [5]:
# And remove outliers greater than 100
%matplotlib inline
df[df.size_mb < 100].size_mb.hist()


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7c55ca0438>
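
A note on the cutoff: the 100 MB threshold above is hardcoded. A quantile-based cutoff would adapt to the data instead; a sketch (not part of the original analysis):


In [ ]:
# Sketch: trim at the 99th percentile rather than a fixed 100 MB
cutoff = df.size_mb.quantile(0.99)
df[df.size_mb < cutoff].size_mb.hist()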

Plotting Distributions of MB per Time Unit

Below we plot MB per minute first, and then MB per second. These numbers come from a run with 16 cores (each core corresponding to one worker).


In [6]:
%matplotlib inline
import seaborn as sns
sns.set()
mb_min = done.size_mb / done.total_time_min
ax = sns.distplot(mb_min)
ax.set_xlabel("MB per minute, N=%s folders" %done.shape[0])


Out[6]:
<matplotlib.text.Text at 0x7f7c49f79f60>

In [10]:
# How many MB per minute, hour, day, are we moving, on average?
timings = pandas.DataFrame(columns=['mb_min','mb_hour','mb_day','gb_day'])
timings.mb_min = mb_min
timings.mb_hour = mb_min * 60
timings.mb_day = mb_min * 60 * 24
timings.gb_day = timings.mb_day / 1000
timings.describe()


Out[10]:
             mb_min       mb_hour        mb_day        gb_day
count  10079.000000  10079.000000  1.007900e+04  10079.000000
mean      75.567487   4534.049194  1.088172e+05    108.817181
std       35.864674   2151.880466  5.164513e+04     51.645131
min        0.068423      4.105354  9.852851e+01      0.098529
25%       61.441615   3686.496882  8.847593e+04     88.475925
50%       71.379841   4282.790459  1.027870e+05    102.786971
75%       78.699772   4721.986349  1.133277e+05    113.327672
max     1267.820241  76069.214461  1.825661e+06   1825.661147
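
The rates above are per folder, and each of the 16 workers handles one folder at a time. Assuming all workers stay busy (an assumption, not something this log verifies), a rough pipeline-wide estimate is 16x the per-folder median:


In [ ]:
# Sketch: back-of-the-envelope aggregate throughput, assuming 16 fully busy workers
workers = 16
print("~%.0f MB/min across the pipeline" % (timings.mb_min.median() * workers))
print("~%.0f GB/day across the pipeline" % (timings.gb_day.median() * workers))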

In [4]:
mb_sec = done.size_mb / done.total_time_sec
ax = sns.distplot(mb_sec)
ax.set_xlabel("MB per second, N=%s folders" %done.shape[0])


Out[4]:
<matplotlib.text.Text at 0x7f64bc5aa5f8>

In [5]:
import seaborn as sns; sns.set(color_codes=True)
ax = sns.regplot(x=done.total_time_min, y=done.size_mb)
ax.set_xlabel("total time minutes")
ax.set_ylabel("total size (MB)")


Out[5]:
<matplotlib.text.Text at 0x7f64bc47ea20>

Ha, see the two losers on the right? Those are the manual test runs mentioned earlier. Let's remove them.


In [6]:
filtered = done[done.total_time_min<5]
ax = sns.regplot(x=filtered.total_time_min, y=filtered.size_mb)
ax.set_xlabel("total time minutes")
ax.set_ylabel("total size (MB)")


Out[6]:
<matplotlib.text.Text at 0x7f64bc4c3c50>

In [7]:
filtered.mean()


Out[7]:
batch_id          5.306808e+03
size_mb           1.637213e+01
start_time        1.504766e+09
finish_time       1.504766e+09
total_time_sec    1.395025e+01
total_time_min    2.325041e-01
dtype: float64

In [8]:
filtered.std()


Out[8]:
batch_id           2950.178347
size_mb              10.012416
start_time        66391.320586
finish_time       66391.159786
total_time_sec        7.400434
total_time_min        0.123341
dtype: float64

As a reminder, each batch contains anywhere between 1 and 10 images. Just glancing at the logs, it's usually 2-3, and occasionally 1 or 4.
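
Taking that eyeballed 2-3 figure at face value (an assumption from glancing at logs, not a column in this frame), the implied per-image size is:


In [ ]:
# Sketch: implied MB per image, using the eyeballed 2-3 files-per-batch estimate
mean_size = filtered.size_mb.mean()  # ~16.4 MB per batch
for n in (2, 3):
    print("~%.1f MB per image at %d images/batch" % (mean_size / n, n))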

