Yesterday I showed how Blaze uses Pandas to query collections of large CSV files (notebook here). If you drive Pandas intelligently you can query a 16GB dataset in around three minutes. That is about as fast as Postgres once the CSV files have been loaded in (a process that takes about 20 minutes if done right).
The fact that Pandas+CSV is competitive with Postgres is startling, because CSV is a low-tech storage format that is expensive to parse. If we replace CSV with a faster storage solution then Pandas really shines.
BColz is an on-disk binary format (like HDF5) that stores columns separately for faster access. It has a straightforward data model and interacts well with NumPy.
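To give a flavor of that data model, here is a minimal sketch of storing and reading columns with bcolz. The directory name example.bcolz and the two toy columns are made up for illustration; only the basic ctable/carray usage is the point.

import numpy as np
import bcolz

# Two toy columns; a ctable stores each one as its own compressed, chunked carray on disk
a = np.arange(1000000, dtype='int32')
b = np.random.random(1000000)
ct = bcolz.ctable(columns=[a, b], names=['passenger_count', 'trip_distance'],
                  rootdir='example.bcolz', mode='w')

# Reading one column only touches that column's chunks
# and comes back as an ordinary NumPy array
trip_distance = ct['trip_distance'][:]
print(trip_distance.mean())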
In [1]:
from blaze import *
In [2]:
d = Data('trip_data_*.csv')
expr = by(d.passenger_count, avg_distance=d.trip_distance.mean(),
          count=d.passenger_count.count())
%time _ = compute(expr)
In [3]:
d = Data('trip_data.bcolz')
expr = by(d.passenger_count, avg_distance=d.trip_distance.mean(),
          count=d.passenger_count.count())
%time _ = compute(expr)
Disclaimer: This may have benefitted from OS-level caching. These times range from 20s to 40s.
In [4]:
import multiprocessing
pool = multiprocessing.Pool(4)
%time _ = compute(expr, map=pool.map)
The time difference above strongly motivates the use of efficient binary storage systems. Unfortunately migrating data between formats is usually hard enough that everyday users don't bother.
The into project eases this process. In principle into lets us write something as naive as the following:
into('trip.bcolz', 'trip_data_*.csv')
Into works for a variety of formats, not just CSV and BColz.
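As a rough sketch of that flexibility (the file and table names below are hypothetical; into is also re-exported by blaze, so the earlier from blaze import * would do), the same data can move between a DataFrame, a JSON file, and a SQL table:

import pandas as pd
from into import into

df = into(pd.DataFrame, 'accounts.csv')      # CSV file -> in-memory DataFrame
into('accounts.json', df)                    # DataFrame -> JSON file
into('sqlite:///accounts.db::accounts', df)  # DataFrame -> SQLite table (needs SQLAlchemy)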
We can often improve performance by specifying more information. In our case the text columns in our dataset happen to have fixed lengths (they are just hash values). By specifying these lengths explicitly we help BColz store and retrieve the data more effectively. We supply this extra information with a datashape that spells out the column types exactly.
In just a moment we'll talk about how to automatically detect datashapes.
In [5]:
ds = dshape("""var * {
    medallion: string[32, 'ascii'],
    hack_license: string[32, 'ascii'],
    vendor_id: string[3, 'ascii'],
    rate_code: int32,
    store_and_fwd_flag: string[1, 'ascii'],
    pickup_datetime: datetime,
    dropoff_datetime: datetime,
    passenger_count: int32,
    trip_time_in_secs: int32,
    trip_distance: float64,
    pickup_longitude: float64,
    pickup_latitude: float64,
    dropoff_longitude: float64,
    dropoff_latitude: float64
}""")
# Drop old version
drop('trip_data.bcolz')
# Migrate data
%time into('trip_data.bcolz', 'trip_data_*.csv', dshape=ds)
Out[5]:
In [6]:
d = Data('trip_data_*.csv')
discover(d)
Out[6]:
We just copy-pasted this, rewriting string with string[32, 'ascii'] in a few places.
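If you would rather do that edit programmatically than by hand, here is a hypothetical sketch. It assumes the inferred schema prints these columns as plain string entries (e.g. medallion: string), which is what we saw above.

from blaze import Data, discover
from datashape import dshape

d = Data('trip_data_*.csv')
text = str(discover(d))            # the inferred datashape as text

# Tighten the three fixed-width hash/id columns; everything else stays as inferred
for name, width in [('medallion', 32), ('hack_license', 32), ('vendor_id', 3)]:
    text = text.replace('%s: string' % name,
                        "%s: string[%d, 'ascii']" % (name, width))

ds = dshape(text)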
Last time we showed that your laptop can easily process 10+GB files using Pandas+ElbowGrease or Pandas+Blaze. Here we show that with EfficientStorage+Pandas+Blaze that same laptop can process 10+GB files in seconds. Data in this range is quite manageable if you or your tools handle it well.