In [1]:
import numpy as np
import pandas as pd

In [2]:
from numpy.testing import assert_almost_equal

Build a numpy.ndarray, an equivalent pandas.DataFrame, and a numpy.rec.array


In [3]:
rows = 10000000
# The base numpy array: 10 million rows, 3 columns of uniform random floats
arr = np.random.uniform(size=rows*3).reshape(rows, 3)
# An equivalent pandas DataFrame with named columns
df = pd.DataFrame(arr, columns=['x', 'y', 'z'])
# An equivalent numpy record array (keeps the column names as fields)
rec = df.to_records()

In [4]:
df.head()


Out[4]:
          x         y         z
0  0.398052  0.649871  0.268314
1  0.914418  0.709858  0.873499
2  0.870026  0.587125  0.419933
3  0.306768  0.511699  0.102610
4  0.388316  0.989567  0.265423

In [5]:
df.dtypes


Out[5]:
x    float64
y    float64
z    float64
dtype: object

Simple Array Operation: Sum

numpy.ndarray


In [6]:
%timeit arr[:, 2].sum()
arrsum = arr[:, 2].sum()


10 loops, best of 3: 17.3 ms per loop

pandas.DataFrame (attribute access: df.z)


In [13]:
%timeit df.z.sum()
pdattsum = df.z.sum()


10 loops, best of 3: 79.1 ms per loop

In [14]:
%timeit df.z.values.sum()
pdattsum = df.z.values.sum()


100 loops, best of 3: 15.6 ms per loop

In [15]:
%timeit df.z.values.sum()
pdattsum = df.values['z'].sum()


100 loops, best of 3: 15.8 ms per loop
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-15-da773faafd40> in <module>()
      1 get_ipython().magic(u'timeit df.z.values.sum()')
----> 2 pdattsum = df.values['z'].sum()

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
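
The failure is expected: df.values strips the column labels and returns a bare 2-D float64 ndarray, so it only supports positional indexing. Name-based access works on the DataFrame itself or on the record array built earlier. A quick check (not part of the original run):

In [ ]:
# df.values is a plain float64 ndarray: no field names, positional indexing only
print(df.values[:, 2].sum())
# the record array keeps the column names as fields
print(rec['z'].sum())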

In [ ]:
assert_almost_equal(arrsum, pdattsum)

pandas.DataFrame (column access by name: df['z'])


In [ ]:
%timeit df['z'].sum()
pdnstyle = df['z'].sum()

numpy.rec.array


In [ ]:
%timeit rec['z'].sum()
reccolnames = rec['z'].sum()

pandas.DataFrame column cast to object dtype, expected to be slow


In [ ]:
df['z'] = df['z'].astype('object')

In [ ]:
df.dtypes

In [ ]:
%timeit df['z'].sum()
objectSum = df['z'].sum()

I would have expected pandas.DataFrame.sum to be more competitive with numpy.ndarray.sum when the column has a concrete numeric dtype (float64) rather than object.
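
Part of the gap likely comes from pandas' NaN-aware reduction path (Series.sum defaults to skipna=True) plus per-call Series overhead, rather than the reduction loop itself. A rough sketch (not part of the original timings) that isolates the NaN handling, using the float64 column of arr:

In [ ]:
z = arr[:, 2]
%timeit z.sum()       # plain ndarray reduction
%timeit np.nansum(z)  # NaN-aware reduction, closer to what Series.sum() does by default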

Generator-expression (list-comprehension style) iteration over a numpy.ndarray


In [ ]:
%timeit sum(i for i in arr[:, 2])
itersumnumpy = sum(i for i in arr[:, 2])

In [ ]:
assert_almost_equal(itersumnumpy, arrsum, decimal=5)

Generator-expression sum over a plain Python list: again expected to be slow


In [ ]:
l = arr[:, 2].tolist()  # note the call: .tolist without () would bind the method, not build the list

In [ ]:
%timeit sum(i for i in l)
listsum = sum(i for i in l)

In [ ]:
assert_almost_equal(listsum, arrsum, 5)

In [ ]:
%timeit sum(i for i in df['z'])
pandasitersum = sum(i for i in df['z'])

In [ ]:
t = tuple(l)

In [ ]:
%timeit sum(i for i in t)
tuplesum = sum(i for i in t)

In [ ]:
assert_almost_equal(pandasitersum, arrsum, 5)

In [ ]:
assert_almost_equal(tuplesum, arrsum, 5)

So for a DataFrame column with object dtype, an array operation like sum (admittedly a silly thing to do on an object column) is about as fast as summing a plain list with a generator expression. But iterating over the DataFrame column itself with a generator expression is much worse, presumably because each step goes through the Series' Python-level iteration machinery rather than a plain list.
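
Casting the column back to float64 restores the vectorized reduction. A sketch (not timed in the original run):

In [ ]:
# undo the object cast; the fast float64 code path comes back
df['z'] = df['z'].astype('float64')
%timeit df['z'].sum()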


In [17]:
%timeit df.values


The slowest run took 6.87 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 6.53 µs per loop
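
The microsecond-scale timing suggests that, for an all-float64 frame, .values can hand back the underlying block without copying. With mixed dtypes it has to allocate a new array in a common (object) dtype, so the cost grows with the size of the frame. A rough sketch on a smaller, hypothetical frame (homog and mixed are not part of the original run):

In [ ]:
homog = pd.DataFrame(np.random.uniform(size=(1000000, 3)), columns=['x', 'y', 'z'])
mixed = homog.copy()
mixed['z'] = mixed['z'].astype('object')
%timeit homog.values  # single float64 block: typically no copy needed
%timeit mixed.values  # must consolidate into one object array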
