Vaex supports Arrow. We will demonstrate vaex+arrow by giving a quick look at a large dataset that does not fit into memory. The NYC taxi dataset for the year 2015 contains about 150 million rows containing information about taxi trips in New York, and is about 23GB in size. You can download it here:
In case you want to convert it to the arrow format, use the code below:
ds_hdf5 = vaex.open('/Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.hdf5')
# this may take a while to export
ds_hdf5.export('./nyc_taxi2015.arrow')
Also make sure you install vaex-arrow:
$ pip install vaex-arrow
In [1]:
!ls -alh /Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.arrow
-rw-r--r-- 1 maartenbreddels staff 23G Oct 31 18:56 /Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.arrow
In [3]:
import vaex
In [4]:
%time
df = vaex.open('/Users/maartenbreddels/datasets/nytaxi/nyc_taxi2015.arrow')
CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 6.91 µs
In [5]:
df
Out[5]:
# VendorID dropoff_dayofweek dropoff_hour dropoff_latitude dropoff_longitude extra fare_amount improvement_surcharge mta_tax passenger_count payment_type pickup_dayofweek pickup_hour pickup_latitude pickup_longitude tip_amount tolls_amount total_amount tpep_dropoff_datetime tpep_pickup_datetime trip_distance 0 2 3.0 19.0 40.75061798095703 -73.97478485107422 1.0 12.0 0.3 0.5 1 1 3.0 19.0 40.7501106262207 -73.993896484375 3.25 0.0 17.05 numpy.datetime64('2015-01-15T19:23:42.000000000') numpy.datetime64('2015-01-15T19:05:39.000000000') 1.59 1 1 5.0 20.0 40.75910949707031 -73.99441528320312 0.5 14.5 0.3 0.5 1 1 5.0 20.0 40.7242431640625 -74.00164794921875 2.0 0.0 17.8 numpy.datetime64('2015-01-10T20:53:28.000000000') numpy.datetime64('2015-01-10T20:33:38.000000000') 3.3 2 1 5.0 20.0 40.82441329956055 -73.95182037353516 0.5 9.5 0.3 0.5 1 2 5.0 20.0 40.80278778076172 -73.96334075927734 0.0 0.0 10.8 numpy.datetime64('2015-01-10T20:43:41.000000000') numpy.datetime64('2015-01-10T20:33:38.000000000') 1.8 3 1 5.0 20.0 40.71998596191406 -74.00432586669923 0.5 3.5 0.3 0.5 1 2 5.0 20.0 40.71381759643555 -74.00908660888672 0.0 0.0 4.8 numpy.datetime64('2015-01-10T20:35:31.000000000') numpy.datetime64('2015-01-10T20:33:39.000000000') 0.5 4 1 5.0 20.0 40.742652893066406 -74.00418090820312 0.5 15.0 0.3 0.5 1 2 5.0 20.0 40.762428283691406 -73.97117614746094 0.0 0.0 16.3 numpy.datetime64('2015-01-10T20:52:58.000000000') numpy.datetime64('2015-01-10T20:33:39.000000000') 3.0 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 146,112,984 2 4.0 0.0 40.722469329833984 -73.98621368408203 0.5 7.5 0.3 0.5 5 1 3.0 23.0 40.72087097167969 -73.99381256103516 1.76 0.0 10.56 numpy.datetime64('2016-01-01T00:08:18.000000000') numpy.datetime64('2015-12-31T23:59:56.000000000') 1.2 146,112,985 1 4.0 0.0 40.75238800048828 -73.93951416015625 0.5 7.5 0.3 0.5 2 2 3.0 23.0 40.76028060913085 -73.96527099609375 0.0 0.0 8.8 numpy.datetime64('2016-01-01T00:05:19.000000000') numpy.datetime64('2015-12-31T23:59:58.000000000') 2.0 146,112,986 1 4.0 0.0 40.69329833984375 -73.9886703491211 0.5 13.5 0.3 0.5 2 2 3.0 23.0 40.73907852172852 -73.98729705810547 0.0 0.0 14.8 numpy.datetime64('2016-01-01T00:12:55.000000000') numpy.datetime64('2015-12-31T23:59:59.000000000') 3.8 146,112,987 2 4.0 0.0 40.705322265625 -74.01712036132812 0.5 8.5 0.3 0.5 1 2 3.0 23.0 40.72569274902344 -73.99755859375 0.0 0.0 9.8 numpy.datetime64('2016-01-01T00:10:26.000000000') numpy.datetime64('2015-12-31T23:59:59.000000000') 1.96 146,112,988 2 4.0 0.0 40.76057052612305 -73.99098205566406 0.5 13.5 0.3 0.5 1 1 3.0 23.0 40.76725769042969 -73.98439788818358 2.96 0.0 17.76 numpy.datetime64('2016-01-01T00:21:30.000000000') numpy.datetime64('2015-12-31T23:59:59.000000000') 1.06
In [6]:
df.plot(df.pickup_longitude, df.pickup_latitude, f='log');
In [7]:
df.total_amount.minmax()
Out[7]:
array([-4.9630000e+02, 3.9506116e+06])
In [8]:
df.plot1d(df.total_amount, shape=100, limits=[0, 100])
Out[8]:
[<matplotlib.lines.Line2D at 0x121d26320>]
In [9]:
# filter the dataset
dff = df[(df.total_amount >= 0) & (df.total_amount < 100)]
In [10]:
dff['ratio'] = dff.tip_amount/dff.total_amount
In [11]:
dff.ratio.mean()
<string>:1: RuntimeWarning: invalid value encountered in true_divide
Out[11]:
0.09601926650107262
In [12]:
arrow_table = df.to_arrow_table()
arrow_table
Out[12]:
pyarrow.Table
VendorID: int64
dropoff_dayofweek: double
dropoff_hour: double
dropoff_latitude: double
dropoff_longitude: double
extra: double
fare_amount: double
improvement_surcharge: double
mta_tax: double
passenger_count: int64
payment_type: int64
pickup_dayofweek: double
pickup_hour: double
pickup_latitude: double
pickup_longitude: double
tip_amount: double
tolls_amount: double
total_amount: double
tpep_dropoff_datetime: timestamp[ns]
tpep_pickup_datetime: timestamp[ns]
trip_distance: double
In [13]:
# Although you can 'convert' (pass the data) in to pandas,
# some memory will be wasted (at least an index will be created by pandas)
# here we just pass a subset of the data
df_pandas = df[:10000].to_pandas_df()
df_pandas
Out[13]:
VendorID
dropoff_dayofweek
dropoff_hour
dropoff_latitude
dropoff_longitude
extra
fare_amount
improvement_surcharge
mta_tax
passenger_count
...
pickup_dayofweek
pickup_hour
pickup_latitude
pickup_longitude
tip_amount
tolls_amount
total_amount
tpep_dropoff_datetime
tpep_pickup_datetime
trip_distance
0
2
3.0
19.0
40.750618
-73.974785
1.0
12.0
0.3
0.5
1
...
3.0
19.0
40.750111
-73.993896
3.25
0.00
17.05
2015-01-15 19:23:42
2015-01-15 19:05:39
1.59
1
1
5.0
20.0
40.759109
-73.994415
0.5
14.5
0.3
0.5
1
...
5.0
20.0
40.724243
-74.001648
2.00
0.00
17.80
2015-01-10 20:53:28
2015-01-10 20:33:38
3.30
2
1
5.0
20.0
40.824413
-73.951820
0.5
9.5
0.3
0.5
1
...
5.0
20.0
40.802788
-73.963341
0.00
0.00
10.80
2015-01-10 20:43:41
2015-01-10 20:33:38
1.80
3
1
5.0
20.0
40.719986
-74.004326
0.5
3.5
0.3
0.5
1
...
5.0
20.0
40.713818
-74.009087
0.00
0.00
4.80
2015-01-10 20:35:31
2015-01-10 20:33:39
0.50
4
1
5.0
20.0
40.742653
-74.004181
0.5
15.0
0.3
0.5
1
...
5.0
20.0
40.762428
-73.971176
0.00
0.00
16.30
2015-01-10 20:52:58
2015-01-10 20:33:39
3.00
5
1
5.0
20.0
40.758194
-73.986977
0.5
27.0
0.3
0.5
1
...
5.0
20.0
40.774048
-73.874374
6.70
5.33
40.33
2015-01-10 20:53:52
2015-01-10 20:33:39
9.00
6
1
5.0
20.0
40.749634
-73.992470
0.5
14.0
0.3
0.5
1
...
5.0
20.0
40.726009
-73.983276
0.00
0.00
15.30
2015-01-10 20:58:31
2015-01-10 20:33:39
2.20
7
1
5.0
20.0
40.726326
-73.995010
0.5
7.0
0.3
0.5
3
...
5.0
20.0
40.734142
-74.002663
1.66
0.00
9.96
2015-01-10 20:42:20
2015-01-10 20:33:39
0.80
8
1
5.0
21.0
40.759357
-73.987595
0.0
52.0
0.3
0.5
3
...
5.0
20.0
40.644356
-73.783043
0.00
5.33
58.13
2015-01-10 21:11:35
2015-01-10 20:33:39
18.20
9
1
5.0
20.0
40.759365
-73.985916
0.5
6.5
0.3
0.5
2
...
5.0
20.0
40.767948
-73.985588
1.55
0.00
9.35
2015-01-10 20:40:44
2015-01-10 20:33:40
0.90
10
1
5.0
20.0
40.728584
-74.004395
0.5
7.0
0.3
0.5
1
...
5.0
20.0
40.723103
-73.988617
1.66
0.00
9.96
2015-01-10 20:41:39
2015-01-10 20:33:40
0.90
11
1
5.0
20.0
40.757217
-73.967407
0.5
7.5
0.3
0.5
1
...
5.0
20.0
40.751419
-73.993782
1.00
0.00
9.80
2015-01-10 20:43:26
2015-01-10 20:33:41
1.10
12
1
5.0
20.0
40.707726
-74.009773
0.5
3.0
0.3
0.5
1
...
5.0
20.0
40.704376
-74.008362
0.00
0.00
4.30
2015-01-10 20:35:23
2015-01-10 20:33:41
0.30
13
1
5.0
21.0
40.735210
-73.997345
0.5
19.0
0.3
0.5
1
...
5.0
20.0
40.760448
-73.973946
3.00
0.00
23.30
2015-01-10 21:03:04
2015-01-10 20:33:41
3.10
14
1
5.0
20.0
40.739895
-73.995216
0.5
6.0
0.3
0.5
1
...
5.0
20.0
40.731777
-74.006721
0.00
0.00
7.30
2015-01-10 20:39:23
2015-01-10 20:33:41
1.10
15
2
3.0
19.0
40.757889
-73.983978
1.0
16.5
0.3
0.5
1
...
3.0
19.0
40.739811
-73.976425
4.38
0.00
22.68
2015-01-15 19:32:00
2015-01-15 19:05:39
2.38
16
2
3.0
19.0
40.786858
-73.955124
1.0
12.5
0.3
0.5
5
...
3.0
19.0
40.754246
-73.968704
0.00
0.00
14.30
2015-01-15 19:21:00
2015-01-15 19:05:40
2.83
17
2
3.0
19.0
40.785782
-73.952713
1.0
26.0
0.3
0.5
5
...
3.0
19.0
40.769581
-73.863060
8.08
5.33
41.21
2015-01-15 19:28:18
2015-01-15 19:05:40
8.33
18
2
3.0
19.0
40.786083
-73.980850
1.0
11.5
0.3
0.5
1
...
3.0
19.0
40.779423
-73.945541
0.00
0.00
13.30
2015-01-15 19:20:36
2015-01-15 19:05:41
2.37
19
2
3.0
19.0
40.718590
-73.952377
1.0
21.5
0.3
0.5
2
...
3.0
19.0
40.774010
-73.874458
4.50
0.00
27.80
2015-01-15 19:20:22
2015-01-15 19:05:41
7.13
20
2
3.0
19.0
40.714596
-73.998924
1.0
17.5
0.3
0.5
1
...
3.0
19.0
40.751896
-73.976601
0.00
0.00
19.30
2015-01-15 19:31:00
2015-01-15 19:05:41
3.60
21
2
3.0
19.0
40.734650
-73.999939
1.0
5.5
0.3
0.5
1
...
3.0
19.0
40.745079
-73.994957
1.62
0.00
8.92
2015-01-15 19:10:22
2015-01-15 19:05:41
0.89
22
2
3.0
19.0
40.735512
-74.003563
1.0
5.5
0.3
0.5
1
...
3.0
19.0
40.747063
-74.000938
1.30
0.00
8.60
2015-01-15 19:10:55
2015-01-15 19:05:41
0.96
23
2
3.0
19.0
40.704220
-74.007919
1.0
6.5
0.3
0.5
2
...
3.0
19.0
40.717892
-74.002777
1.50
0.00
9.80
2015-01-15 19:12:36
2015-01-15 19:05:41
1.25
24
2
3.0
19.0
40.761856
-73.978172
1.0
11.5
0.3
0.5
5
...
3.0
19.0
40.736362
-73.997459
2.50
0.00
15.80
2015-01-15 19:22:11
2015-01-15 19:05:41
2.11
25
2
3.0
19.0
40.811089
-73.953339
1.0
7.5
0.3
0.5
5
...
3.0
19.0
40.823994
-73.952278
1.70
0.00
11.00
2015-01-15 19:14:05
2015-01-15 19:05:41
1.15
26
2
3.0
19.0
40.734890
-73.988609
1.0
9.0
0.3
0.5
1
...
3.0
19.0
40.750080
-73.991127
0.00
0.00
10.80
2015-01-15 19:16:18
2015-01-15 19:05:42
1.53
27
2
3.0
19.0
40.743530
-73.985603
0.0
52.0
0.3
0.5
1
...
3.0
19.0
40.644127
-73.786575
6.00
5.33
64.13
2015-01-15 19:49:07
2015-01-15 19:05:42
18.06
28
2
3.0
19.0
40.757721
-73.994514
1.0
10.0
0.3
0.5
1
...
3.0
19.0
40.741447
-73.993668
2.36
0.00
14.16
2015-01-15 19:18:33
2015-01-15 19:05:42
1.76
29
2
3.0
19.0
40.704689
-74.009079
1.0
17.5
0.3
0.5
6
...
3.0
19.0
40.744083
-73.985291
3.70
0.00
23.00
2015-01-15 19:21:40
2015-01-15 19:05:42
5.19
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
9970
1
4.0
11.0
40.719917
-73.955521
0.0
20.0
0.3
0.5
1
...
4.0
10.0
40.725979
-74.009071
4.00
0.00
24.80
2015-01-30 11:20:08
2015-01-30 10:51:40
3.70
9971
1
4.0
10.0
40.720398
-73.984940
1.0
6.5
0.3
0.5
1
...
4.0
10.0
40.732452
-73.985001
1.65
0.00
9.95
2015-01-30 10:58:58
2015-01-30 10:51:40
1.10
9972
1
4.0
11.0
40.755405
-74.002457
0.0
8.5
0.3
0.5
2
...
4.0
10.0
40.751358
-73.990479
1.00
0.00
10.30
2015-01-30 11:03:41
2015-01-30 10:51:41
0.70
9973
2
1.0
19.0
40.763626
-73.969666
1.0
24.5
0.3
0.5
1
...
1.0
18.0
40.708790
-74.017281
5.10
0.00
31.40
2015-01-13 19:22:18
2015-01-13 18:55:41
7.08
9974
2
1.0
19.0
40.772366
-73.960800
1.0
5.5
0.3
0.5
5
...
1.0
18.0
40.780003
-73.954681
1.00
0.00
8.30
2015-01-13 19:02:03
2015-01-13 18:55:41
0.64
9975
2
1.0
19.0
40.733429
-73.984154
1.0
9.0
0.3
0.5
1
...
1.0
18.0
40.749680
-73.991531
0.00
0.00
10.80
2015-01-13 19:06:56
2015-01-13 18:55:41
1.67
9976
2
1.0
19.0
40.774780
-73.957779
1.0
20.0
0.3
0.5
3
...
1.0
18.0
40.751801
-74.002327
2.00
0.00
23.80
2015-01-13 19:18:39
2015-01-13 18:55:42
5.28
9977
2
1.0
19.0
40.751698
-73.989746
1.0
8.5
0.3
0.5
2
...
1.0
18.0
40.768433
-73.986137
0.00
0.00
10.30
2015-01-13 19:06:38
2015-01-13 18:55:42
1.38
9978
2
1.0
19.0
40.752941
-73.977470
1.0
7.5
0.3
0.5
1
...
1.0
18.0
40.745071
-73.987068
1.00
0.00
10.30
2015-01-13 19:05:34
2015-01-13 18:55:42
0.88
9979
2
1.0
19.0
40.735130
-73.976120
1.0
8.5
0.3
0.5
1
...
1.0
18.0
40.751259
-73.977814
0.00
0.00
10.30
2015-01-13 19:05:41
2015-01-13 18:55:42
1.58
9980
2
1.0
19.0
40.745541
-73.984383
1.0
8.5
0.3
0.5
1
...
1.0
18.0
40.731110
-74.001350
0.00
0.00
10.30
2015-01-13 19:05:32
2015-01-13 18:55:42
1.58
9981
2
1.0
19.0
40.793671
-73.974327
1.0
5.0
0.3
0.5
2
...
1.0
18.0
40.791222
-73.965118
0.00
0.00
6.80
2015-01-13 19:00:05
2015-01-13 18:55:42
0.63
9982
2
1.0
19.0
40.754639
-73.986343
1.0
11.0
0.3
0.5
1
...
1.0
18.0
40.764175
-73.968994
1.00
0.00
13.80
2015-01-13 19:11:57
2015-01-13 18:55:43
1.63
9983
2
1.0
18.0
40.723721
-73.989494
1.0
4.5
0.3
0.5
1
...
1.0
18.0
40.714985
-73.992409
2.00
0.00
8.30
2015-01-13 18:59:19
2015-01-13 18:55:43
0.70
9984
2
1.0
19.0
40.774590
-73.963249
1.0
5.5
0.3
0.5
5
...
1.0
18.0
40.764881
-73.968529
1.30
0.00
8.60
2015-01-13 19:01:19
2015-01-13 18:55:44
0.94
9985
2
1.0
19.0
40.774872
-73.982613
1.0
7.0
0.3
0.5
1
...
1.0
18.0
40.762344
-73.985695
1.60
0.00
10.40
2015-01-13 19:03:54
2015-01-13 18:55:44
1.04
9986
2
1.0
19.0
40.787998
-73.953888
1.0
5.0
0.3
0.5
2
...
1.0
18.0
40.779526
-73.957619
1.20
0.00
8.00
2015-01-13 19:00:06
2015-01-13 18:55:44
0.74
9987
2
1.0
19.0
40.790218
-73.975128
1.0
11.5
0.3
0.5
1
...
1.0
18.0
40.762226
-73.985916
2.50
0.00
15.80
2015-01-13 19:10:46
2015-01-13 18:55:44
2.19
9988
2
1.0
19.0
40.739487
-73.989059
1.0
9.5
0.3
0.5
1
...
1.0
18.0
40.725056
-73.984329
2.10
0.00
13.40
2015-01-13 19:08:40
2015-01-13 18:55:44
1.48
9989
2
1.0
19.0
40.780548
-73.959030
1.0
8.5
0.3
0.5
1
...
1.0
18.0
40.778542
-73.981949
1.00
0.00
11.30
2015-01-13 19:04:44
2015-01-13 18:55:45
1.83
9990
2
1.0
19.0
40.761524
-73.960602
1.0
15.0
0.3
0.5
1
...
1.0
18.0
40.746319
-74.001114
0.00
0.00
16.80
2015-01-13 19:14:59
2015-01-13 18:55:45
3.27
9991
2
1.0
19.0
40.720646
-73.989716
1.0
8.0
0.3
0.5
1
...
1.0
18.0
40.738167
-73.987434
1.00
0.00
10.80
2015-01-13 19:04:58
2015-01-13 18:55:45
1.56
9992
2
1.0
19.0
40.795898
-73.972610
1.0
20.5
0.3
0.5
1
...
1.0
18.0
40.740582
-73.989738
4.30
0.00
26.60
2015-01-13 19:18:18
2015-01-13 18:55:45
5.40
9993
2
1.0
18.0
40.769939
-73.981316
1.0
4.5
0.3
0.5
1
...
1.0
18.0
40.772015
-73.979416
1.10
0.00
7.40
2015-01-13 18:59:40
2015-01-13 18:55:45
0.34
9994
2
4.0
18.0
40.773521
-73.955353
1.0
31.0
0.3
0.5
1
...
4.0
18.0
40.713215
-74.013542
5.00
0.00
37.80
2015-01-23 18:59:52
2015-01-23 18:22:55
9.05
9995
2
4.0
18.0
40.774670
-73.947845
1.0
11.5
0.3
0.5
1
...
4.0
18.0
40.773186
-73.978043
0.00
0.00
13.30
2015-01-23 18:37:44
2015-01-23 18:22:55
2.32
9996
2
4.0
18.0
40.758148
-73.985626
1.0
8.5
0.3
0.5
2
...
4.0
18.0
40.752003
-73.973198
0.00
0.00
10.30
2015-01-23 18:34:48
2015-01-23 18:22:56
0.92
9997
2
4.0
18.0
40.768131
-73.964516
1.0
10.5
0.3
0.5
1
...
4.0
18.0
40.740456
-73.986252
2.46
0.00
14.76
2015-01-23 18:33:58
2015-01-23 18:22:56
2.36
9998
2
4.0
18.0
40.759171
-73.975189
1.0
6.5
0.3
0.5
3
...
4.0
18.0
40.770500
-73.981323
2.08
0.00
10.38
2015-01-23 18:29:22
2015-01-23 18:22:56
1.05
9999
2
4.0
18.0
40.752113
-73.975189
1.0
5.0
0.3
0.5
1
...
4.0
18.0
40.761505
-73.968452
0.00
0.00
6.80
2015-01-23 18:27:58
2015-01-23 18:22:57
0.75
10000 rows × 21 columns
If you want to learn more on vaex, take a look at the tutorials to see what is possible.
Content source: maartenbreddels/vaex
Similar notebooks: