Blaze - Fast on-disk queries with Pandas + BColz

Yesterday I showed how Blaze uses Pandas to query collections of large CSV files (notebook here). If we drive Pandas intelligently we can query a 16GB dataset in around three minutes. That's about as fast as Postgres once the CSV files have been loaded into the database (a process that takes about 20 minutes if done right).

The fact that Pandas+CSV is competitive with Postgres is startling, because CSV is a low-tech storage format that is expensive to parse. If we replace CSV with a faster format then Pandas really gets a chance to shine.

BColz

BColz is an on-disk binary format (like HDF5) that stores columns separately for faster access. It has a straightforward data model and interacts well with NumPy.
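To give a feel for that data model, here's a minimal sketch (not from the original notebook; the array names and sizes are made up) of writing a compressed on-disk ctable from NumPy arrays and reading a column back:

import numpy as np
import bcolz

# Made-up example data
x = np.arange(1000000)
y = np.random.random(1000000)

# Write a compressed, columnar table to disk
ct = bcolz.ctable([x, y], names=['x', 'y'],
                  rootdir='example.bcolz', mode='w')

# Each column reads back as an ordinary NumPy array
print(ct['x'][:5])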

Times

We compare the performance of the same query against a stack of CSV files and against a BColz array.

CSV - 3 minutes


In [1]:
from blaze import *

In [2]:
d = Data('trip_data_*.csv')

expr = by(d.passenger_count, avg_distance=d.trip_distance.mean(),
                             count=d.passenger_count.count())

%time _ = compute(expr)


CPU times: user 2min 12s, sys: 14 s, total: 2min 26s
Wall time: 3min 10s

BColz - 30 seconds


In [3]:
d = Data('trip_data.bcolz')

expr = by(d.passenger_count, avg_distance=d.trip_distance.mean(),
                             count=d.passenger_count.count())

%time _ = compute(expr)


CPU times: user 16.7 s, sys: 3.91 s, total: 20.6 s
Wall time: 20.7 s

Disclaimer: this run may have benefited from OS-level caching. Repeated timings range from 20s to 40s.

BColz + Multiprocessing - 10 seconds


In [4]:
import multiprocessing
pool = multiprocessing.Pool(4)

%time _ = compute(expr, map=pool.map)


CPU times: user 169 ms, sys: 13.2 ms, total: 182 ms
Wall time: 7.26 s

How did we migrate data from CSV to BColz?

The time difference above strongly motivates the use of efficient binary storage systems. Unfortunately, migrating data between formats is usually hard enough that everyday users don't bother.

The into project eases this process. In principle into lets us write something as naive as the following:

into('trip.bcolz', 'trip_data_*.csv')

into works for a variety of formats, not just CSV and BColz.
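For example, calls along the following lines should work (the file and database names here are made up for illustration):

import pandas as pd
from into import into

# Made-up targets and sources, for illustration only
df = into(pd.DataFrame, 'trip_data_1.csv')            # CSV file -> in-memory DataFrame
into('trip_sample.json', df)                           # DataFrame -> JSON file
into('sqlite:///trips.db::trip', 'trip_data_1.csv')    # CSV file -> SQLite table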

Better performance with fixed length strings

We can often improve performance by specifying more information. In our case the text columns in our dataset happen to have fixed lengths (they are just hash values). By specifying these lengths explicitly we help BColz store and retrieve the data more effectively. We supply this additional information as a datashape that specifies the column types exactly.

In just a moment we'll talk about how to automatically detect datashapes.


In [5]:
ds = dshape("""var * {
  medallion: string[32, 'ascii'],
  hack_license: string[32, 'ascii'],
  vendor_id: string[3, 'ascii'],
  rate_code: int32,
  store_and_fwd_flag: string[1, 'ascii'],
  pickup_datetime: datetime,
  dropoff_datetime: datetime,
  passenger_count: int32,
  trip_time_in_secs: int32,
  trip_distance: float64,
  pickup_longitude: float64,
  pickup_latitude: float64,
  dropoff_longitude: float64,
  dropoff_latitude: float64
  }""")

# Drop old version
drop('trip_data.bcolz')

# Migrate data
%time into('trip_data.bcolz', 'trip_data_*.csv', dshape=ds)


CPU times: user 16min 16s, sys: 1min 54s, total: 18min 10s
Wall time: 19min 15s
Out[5]:
ctable((173179759,), [('medallion', 'S32'), ('hack_license', 'S32'), ('vendor_id', 'S3'), ('rate_code', '<i4'), ('store_and_fwd_flag', 'S1'), ('pickup_datetime', '<M8[us]'), ('dropoff_datetime', '<M8[us]'), ('passenger_count', '<i4'), ('trip_time_in_secs', '<i4'), ('trip_distance', '<f8'), ('pickup_longitude', '<f8'), ('pickup_latitude', '<f8'), ('dropoff_longitude', '<f8'), ('dropoff_latitude', '<f8')])
  nbytes: 21.93 GB; cbytes: 16.43 GB; ratio: 1.34
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
  rootdir := 'trip_data.bcolz'
[ ('89D227B655E5C82AECF13C3F540D4CF4', 'BA96DE419E711691B9445D6A6307C170', 'CMT', 1, 'N', datetime.datetime(2013, 1, 1, 15, 11, 48), datetime.datetime(2013, 1, 1, 15, 18, 10), 4, 382, 1.0, -73.978165, 40.757977000000004, -73.989838, 40.751171)
 ('0BD7C8F5BA12B88E0B67BED28BEA73D8', '9FD8F69F0804BDB5549F40E9DA1BE472', 'CMT', 1, 'N', datetime.datetime(2013, 1, 6, 0, 18, 35), datetime.datetime(2013, 1, 6, 0, 22, 54), 1, 259, 1.5, -74.006683, 40.731781, -73.994499, 40.750659999999996)
 ('0BD7C8F5BA12B88E0B67BED28BEA73D8', '9FD8F69F0804BDB5549F40E9DA1BE472', 'CMT', 1, 'N', datetime.datetime(2013, 1, 5, 18, 49, 41), datetime.datetime(2013, 1, 5, 18, 54, 23), 1, 282, 1.1, -74.004707, 40.737770000000005, -74.009834, 40.726002)
 ...,
 ('B9751FE71B59FA04240FE5979963185E', 'DD35F09BF4591EA47D56A56D014B12FD', 'CMT', 1, 'N', datetime.datetime(2013, 9, 6, 16, 17, 12), datetime.datetime(2013, 9, 6, 16, 26, 30), 1, 558, 1.4, -73.954323, 40.781231, -73.967674, 40.762943)
 ('7C7EA825806A49F10411C1F6656E4B6B', '4B12AC33E74C663162B0850FCE5413E7', 'CMT', 1, 'N', datetime.datetime(2013, 9, 7, 8, 40, 49), datetime.datetime(2013, 9, 7, 9, 0, 3), 1, 1154, 10.8, -74.00057199999999, 40.740673, -73.861877, 40.768436)
 ('BCC1B5B7D9596FEC872C9DAD725310B3', '246735D51D474F0B3A93F431CE5A3127', 'CMT', 1, 'N', datetime.datetime(2013, 9, 6, 9, 47, 50), datetime.datetime(2013, 9, 6, 10, 2, 24), 1, 874, 2.1, -74.014252, 40.717148, -73.993431, 40.733189)]

How to create a datashape

Don't worry, we didn't write that by hand. The discover function gives you the datashape of almost anything, including our original CSV files:


In [6]:
d = Data('trip_data_*.csv')
discover(d)


Out[6]:
dshape("""var * {
  medallion: string,
  hack_license: string,
  vendor_id: string,
  rate_code: int64,
  store_and_fwd_flag: ?string,
  pickup_datetime: datetime,
  dropoff_datetime: datetime,
  passenger_count: int64,
  trip_time_in_secs: int64,
  trip_distance: float64,
  pickup_longitude: float64,
  pickup_latitude: float64,
  dropoff_longitude: float64,
  dropoff_latitude: float64
  }""")

We just copy-pasted that output, replacing string with string[32, 'ascii'] (and similar) in a few places.
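If you prefer not to edit the text by hand, a minimal sketch of the same substitution (assuming the discovered datashape prints its fields exactly as shown above) could look like:

# A sketch only: assumes the field text matches the printed output above
text = str(discover(d))
text = text.replace("medallion: string,", "medallion: string[32, 'ascii'],")
text = text.replace("hack_license: string,", "hack_license: string[32, 'ascii'],")
text = text.replace("vendor_id: string,", "vendor_id: string[3, 'ascii'],")
text = text.replace("store_and_fwd_flag: ?string,", "store_and_fwd_flag: string[1, 'ascii'],")
ds = dshape(text)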

Conclusion

Last time we showed that your laptop can easily process 10+GB files using Pandas+ElbowGrease or Pandas+Blaze.

Here we show that with EfficientStorage+Pandas+Blaze a single laptop can process 10+GB files in seconds. Data in this range is quite convenient to work with if you or your tools manage storage well.