In [1]:
import numpy as np
import pandas as pd

In [2]:
from numpy.testing import assert_almost_equal

Build a numpy.ndarray, an equivalent pandas.DataFrame, and a numpy.rec.array


In [3]:
rows = 10000000
# The base numpy array: 10 million rows, 3 columns of uniform random floats
arr = np.random.uniform(size=rows*3).reshape(rows, 3)
# An equivalent pandas DataFrame with named columns
df = pd.DataFrame(arr, columns=['x', 'y', 'z'])
# An equivalent numpy record array (keeps the column names as fields)
rec = df.to_records()

In [4]:
df.head()


Out[4]:
          x         y         z
0  0.398052  0.649871  0.268314
1  0.914418  0.709858  0.873499
2  0.870026  0.587125  0.419933
3  0.306768  0.511699  0.102610
4  0.388316  0.989567  0.265423

In [5]:
df.dtypes


Out[5]:
x    float64
y    float64
z    float64
dtype: object

Simple Array Operation: Sum

numpy.ndarray


In [6]:
%timeit arr[:, 2].sum()
arrsum = arr[:, 2].sum()


10 loops, best of 3: 17.3 ms per loop

pandas.DataFrame (attribute access: df.z)


In [13]:
%timeit df.z.sum()
pdattsum = df.z.sum()


10 loops, best of 3: 79.1 ms per loop

In [14]:
%timeit df.z.values.sum()
pdattsum = df.z.values.sum()


100 loops, best of 3: 15.6 ms per loop

In [15]:
%timeit df.z.values.sum()
pdattsum = df.values['z'].sum()


100 loops, best of 3: 15.8 ms per loop
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-15-da773faafd40> in <module>()
      1 get_ipython().magic(u'timeit df.z.values.sum()')
----> 2 pdattsum = df.values['z'].sum()

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
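
The failure is expected: df.values strips the column labels and returns a bare 2-D float64 ndarray, so it only supports positional indexing. Name-based access works on the DataFrame itself or on the record array built earlier. A quick check (not part of the original run):

In [ ]:
# df.values is a plain float64 ndarray: no field names, positional indexing only
print(df.values[:, 2].sum())
# the record array keeps the column names as fields
print(rec['z'].sum())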

In [ ]:
assert_almost_equal(arrsum, pdattsum)

pandas.DataFrame (column access by name: df['z'])


In [ ]:
%timeit df['z'].sum()
pdnstyle = df['z'].sum()

numpy.rec.array


In [ ]:
%timeit rec['z'].sum()
reccolnames = rec['z'].sum()

pandas.DataFrame column cast to object dtype, expected to be slow


In [ ]:
df['z'] = df['z'].astype('object')

In [ ]:
df.dtypes

In [ ]:
%timeit df['z'].sum()
objectSum = df['z'].sum()

I would have expected pandas.DataFrame.sum to be more competitive with numpy.ndarray.sum when the column has a concrete numeric dtype (float64) rather than object.
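
Part of the gap likely comes from pandas' NaN-aware reduction path (Series.sum defaults to skipna=True) plus per-call Series overhead, rather than the reduction loop itself. A rough sketch (not part of the original timings) that isolates the NaN handling, using the float64 column of arr:

In [ ]:
z = arr[:, 2]
%timeit z.sum()       # plain ndarray reduction
%timeit np.nansum(z)  # NaN-aware reduction, closer to what Series.sum() does by default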

Generator-expression (list-comprehension style) iteration over a numpy.ndarray


In [ ]:
%timeit sum(i for i in arr[:, 2])
itersumnumpy = sum(i for i in arr[:, 2])

In [ ]:
assert_almost_equal(itersumnumpy, arrsum, decimal=5)

Generator-expression sum over a plain Python list: again expected to be slow


In [ ]:
l = arr[:, 2].tolist()  # note the call: .tolist without () would bind the method, not build the list

In [ ]:
%timeit sum(i for i in l)
listsum = sum(i for i in l)

In [ ]:
assert_almost_equal(listsum, arrsum, 5)

In [ ]:
%timeit sum(i for i in df['z'])
pandasitersum = sum(i for i in df['z'])

In [ ]:
t = tuple(l)

In [ ]:
%timeit sum(i for i in t)
tuplesum = sum(i for i in t)

In [ ]:
assert_almost_equal(pandasitersum, arrsum, 5)

In [ ]:
assert_almost_equal(tuplesum, arrsum, 5)

So for a DataFrame column with object dtype, an array operation like sum (admittedly a silly thing to do on an object column) is about as fast as summing a plain list with a generator expression. But iterating over the DataFrame column itself with a generator expression is much worse, presumably because each step goes through the Series' Python-level iteration machinery rather than a plain list.
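
Casting the column back to float64 restores the vectorized reduction. A sketch (not timed in the original run):

In [ ]:
# undo the object cast; the fast float64 code path comes back
df['z'] = df['z'].astype('float64')
%timeit df['z'].sum()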


In [17]:
%timeit df.values


The slowest run took 6.87 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 6.53 µs per loop
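
The microsecond-scale timing suggests that, for an all-float64 frame, .values can hand back the underlying block without copying. With mixed dtypes it has to allocate a new array in a common (object) dtype, so the cost grows with the size of the frame. A rough sketch on a smaller, hypothetical frame (homog and mixed are not part of the original run):

In [ ]:
homog = pd.DataFrame(np.random.uniform(size=(1000000, 3)), columns=['x', 'y', 'z'])
mixed = homog.copy()
mixed['z'] = mixed['z'].astype('object')
%timeit homog.values  # single float64 block: typically no copy needed
%timeit mixed.values  # must consolidate into one object array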
