Vectorizing Operations with Pandas: Speed Galore!

Often, when you have your data in a Pandas DataFrame or a Numpy ndarray, you want to perform some sort of operation on every cell in your data frame. You could loop over it, and sometimes you may have to. But you should always try to vectorize your operations since this takes advantage of the Pandas/Numpy system of (arrayed) vectors.

Let's jump right in.


In [1]:
import pandas as pd
import numpy as np
#First, we create an 8x8 array of random integers.
df = pd.DataFrame(data = np.random.randint(0, 100, size = (8,8)), index = ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'))
print(df) #since we are using random numbers, your array will always be different.


    0   1   2   3   4   5   6   7
a  92   7  61  79  18  26  93  91
b  47  73  92  65   0  62  28  42
c  40  31  66  46   5  98  80   0
d  60  69   5  10   2  34  23  20
e  27  67  96  19  38  71  14  40
f  30  49  55   9  65  95  64  40
g  77  87  11  37  65  44  86  48
h  43   2  90  22  87  40  12  89

[8 rows x 8 columns]

Now, as mentioned in the introduction, if we wanted to multiply every cell by 2, we could loop through each row and then each column of the DataFrame, multiply everything by two, and then return the result. Something like this.


In [2]:
def times_2(df):
    val_dict = {}
    for index, values in df.iterrows():
        val_dict[index] = []
        for value in values:
            val_dict[index].append(value * 2)
    df2 = pd.DataFrame(val_dict).T # transposes the df since dictionary elements are read as columns instead of rows.
    df2.index = df.index
    df2.columns = df.columns
    return df2

df2 = times_2(df)
print(df2)


     0    1    2    3    4    5    6    7
a  184   14  122  158   36   52  186  182
b   94  146  184  130    0  124   56   84
c   80   62  132   92   10  196  160    0
d  120  138   10   20    4   68   46   40
e   54  134  192   38   76  142   28   80
f   60   98  110   18  130  190  128   80
g  154  174   22   74  130   88  172   96
h   86    4  180   44  174   80   24  178

[8 rows x 8 columns]

But this is not a very efficient way to perform this operation. Let's take a look at an easier and faster way to do this: vectorizing.


In [3]:
df3 = df * 2
print(df3)
print(df3 == df2) # element-wise comparison
print(df3.all() == df2.all()) # column-wise comparison
print(df3.all(axis = 1) == df2.all(axis = 1)) # row-wise comparison


     0    1    2    3    4    5    6    7
a  184   14  122  158   36   52  186  182
b   94  146  184  130    0  124   56   84
c   80   62  132   92   10  196  160    0
d  120  138   10   20    4   68   46   40
e   54  134  192   38   76  142   28   80
f   60   98  110   18  130  190  128   80
g  154  174   22   74  130   88  172   96
h   86    4  180   44  174   80   24  178

[8 rows x 8 columns]
      0     1     2     3     4     5     6     7
a  True  True  True  True  True  True  True  True
b  True  True  True  True  True  True  True  True
c  True  True  True  True  True  True  True  True
d  True  True  True  True  True  True  True  True
e  True  True  True  True  True  True  True  True
f  True  True  True  True  True  True  True  True
g  True  True  True  True  True  True  True  True
h  True  True  True  True  True  True  True  True

[8 rows x 8 columns]
0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
dtype: bool
a    True
b    True
c    True
d    True
e    True
f    True
g    True
h    True
dtype: bool

Not only does the code take much less time to write (only one line), it also takes much less time to run. Check it out.


In [4]:
%timeit times_2(df)
%timeit df * 2


1000 loops, best of 3: 763 µs per loop
10000 loops, best of 3: 45.5 µs per loop

Look at the difference! On my computer, vectorizing the operation (the second one) is about 19x faster than the loop. This is why you should always try to vectorize your operations if you have a large data set.
You can also perform arithmetic operations between objects in a vectorized manner.


In [5]:
df + df3


Out[5]:
0 1 2 3 4 5 6 7
a 276 21 183 237 54 78 279 273
b 141 219 276 195 0 186 84 126
c 120 93 198 138 15 294 240 0
d 180 207 15 30 6 102 69 60
e 81 201 288 57 114 213 42 120
f 90 147 165 27 195 285 192 120
g 231 261 33 111 195 132 258 144
h 129 6 270 66 261 120 36 267

8 rows × 8 columns


In [6]:
df - df3


Out[6]:
0 1 2 3 4 5 6 7
a -92 -7 -61 -79 -18 -26 -93 -91
b -47 -73 -92 -65 0 -62 -28 -42
c -40 -31 -66 -46 -5 -98 -80 0
d -60 -69 -5 -10 -2 -34 -23 -20
e -27 -67 -96 -19 -38 -71 -14 -40
f -30 -49 -55 -9 -65 -95 -64 -40
g -77 -87 -11 -37 -65 -44 -86 -48
h -43 -2 -90 -22 -87 -40 -12 -89

8 rows × 8 columns


In [7]:
-df


Out[7]:
0 1 2 3 4 5 6 7
a -92 -7 -61 -79 -18 -26 -93 -91
b -47 -73 -92 -65 0 -62 -28 -42
c -40 -31 -66 -46 -5 -98 -80 0
d -60 -69 -5 -10 -2 -34 -23 -20
e -27 -67 -96 -19 -38 -71 -14 -40
f -30 -49 -55 -9 -65 -95 -64 -40
g -77 -87 -11 -37 -65 -44 -86 -48
h -43 -2 -90 -22 -87 -40 -12 -89

8 rows × 8 columns


In [8]:
df * df3


Out[8]:
0 1 2 3 4 5 6 7
a 16928 98 7442 12482 648 1352 17298 16562
b 4418 10658 16928 8450 0 7688 1568 3528
c 3200 1922 8712 4232 50 19208 12800 0
d 7200 9522 50 200 8 2312 1058 800
e 1458 8978 18432 722 2888 10082 392 3200
f 1800 4802 6050 162 8450 18050 8192 3200
g 11858 15138 242 2738 8450 3872 14792 4608
h 3698 8 16200 968 15138 3200 288 15842

8 rows × 8 columns


In [9]:
df / df3


Out[9]:
0 1 2 3 4 5 6 7
a 0.5 0.5 0.5 0.5 0.500000 0.5 0.5 0.500000
b 0.5 0.5 0.5 0.5 inf 0.5 0.5 0.500000
c 0.5 0.5 0.5 0.5 0.500000 0.5 0.5 inf
d 0.5 0.5 0.5 0.5 0.500000 0.5 0.5 0.500000
e 0.5 0.5 0.5 0.5 0.500000 0.5 0.5 0.500000
f 0.5 0.5 0.5 0.5 0.500000 0.5 0.5 0.500000
g 0.5 0.5 0.5 0.5 0.500000 0.5 0.5 0.500000
h 0.5 0.5 0.5 0.5 0.500000 0.5 0.5 0.500000

8 rows × 8 columns

If we want to apply a function to each column of a DataFrame, we can simply call the method.
There is also the apply method that applies a function row-wise.


In [10]:
from timeit import timeit

print(df.mean(axis = 1)) #This applies column-wise.
print('Vectorized function takes %s seconds' % timeit('np.mean(df)', 'from __main__ import df, np', number = 1000))
print(df.apply(np.mean, axis = 1)) #This applies the function row-wise.
print('Apply method takes %s seconds' % timeit('df.apply(np.mean, axis = 1)', 'from __main__ import df, np', number = 1000))


a    58.375
b    51.125
c    45.750
d    27.875
e    46.500
f    50.875
g    56.875
h    48.125
dtype: float64
Vectorized function takes 0.039750566000293475 seconds
a    58.375
b    51.125
c    45.750
d    27.875
e    46.500
f    50.875
g    56.875
h    48.125
dtype: float64
Apply method takes 0.5598446589992818 seconds

But notice how much longer the apply method takes. So while the apply method can be useful for functions you implement yourself, you should first check whether there is a function in Pandas that does the same thing because it will almost certainly be faster.

Note: we will see this later in the week when we implement our own cosine similarity function.

You can also apply a Pandas Series (or even scalar values) to a DataFrame element-wise.


In [11]:
s = pd.Series(np.random.randint(0, 8, size = 8))
s


Out[11]:
0    7
1    4
2    7
3    5
4    3
5    2
6    0
7    0
dtype: int64

In [12]:
df


Out[12]:
0 1 2 3 4 5 6 7
a 92 7 61 79 18 26 93 91
b 47 73 92 65 0 62 28 42
c 40 31 66 46 5 98 80 0
d 60 69 5 10 2 34 23 20
e 27 67 96 19 38 71 14 40
f 30 49 55 9 65 95 64 40
g 77 87 11 37 65 44 86 48
h 43 2 90 22 87 40 12 89

8 rows × 8 columns


In [13]:
df * s


Out[13]:
0 1 2 3 4 5 6 7
a 644 28 427 395 54 52 0 0
b 329 292 644 325 0 124 0 0
c 280 124 462 230 15 196 0 0
d 420 276 35 50 6 68 0 0
e 189 268 672 95 114 142 0 0
f 210 196 385 45 195 190 0 0
g 539 348 77 185 195 88 0 0
h 301 8 630 110 261 80 0 0

8 rows × 8 columns


In [14]:
df.multiply(s) #this is slightly faster than the above


Out[14]:
0 1 2 3 4 5 6 7
a 644 28 427 395 54 52 0 0
b 329 292 644 325 0 124 0 0
c 280 124 462 230 15 196 0 0
d 420 276 35 50 6 68 0 0
e 189 268 672 95 114 142 0 0
f 210 196 385 45 195 190 0 0
g 539 348 77 185 195 88 0 0
h 301 8 630 110 261 80 0 0

8 rows × 8 columns


In [15]:
np.multiply(df, s) #the numpy function is about 33% bit faster than either of the previous ones


Out[15]:
0 1 2 3 4 5 6 7
a 644 28 427 395 54 52 0 0
b 329 292 644 325 0 124 0 0
c 280 124 462 230 15 196 0 0
d 420 276 35 50 6 68 0 0
e 189 268 672 95 114 142 0 0
f 210 196 385 45 195 190 0 0
g 539 348 77 185 195 88 0 0
h 301 8 630 110 261 80 0 0

8 rows × 8 columns


In [16]:
df * 5


Out[16]:
0 1 2 3 4 5 6 7
a 460 35 305 395 90 130 465 455
b 235 365 460 325 0 310 140 210
c 200 155 330 230 25 490 400 0
d 300 345 25 50 10 170 115 100
e 135 335 480 95 190 355 70 200
f 150 245 275 45 325 475 320 200
g 385 435 55 185 325 220 430 240
h 215 10 450 110 435 200 60 445

8 rows × 8 columns

And, finally, notice what happens if your Series (or DataFrame) is a different size than your original.


In [17]:
s2 = pd.Series(np.random.randint(0, 8, size = 3))
s2


Out[17]:
0    7
1    5
2    4
dtype: int64

In [18]:
df


Out[18]:
0 1 2 3 4 5 6 7
a 92 7 61 79 18 26 93 91
b 47 73 92 65 0 62 28 42
c 40 31 66 46 5 98 80 0
d 60 69 5 10 2 34 23 20
e 27 67 96 19 38 71 14 40
f 30 49 55 9 65 95 64 40
g 77 87 11 37 65 44 86 48
h 43 2 90 22 87 40 12 89

8 rows × 8 columns


In [19]:
df * s2


Out[19]:
0 1 2 3 4 5 6 7
a 644 35 244 NaN NaN NaN NaN NaN
b 329 365 368 NaN NaN NaN NaN NaN
c 280 155 264 NaN NaN NaN NaN NaN
d 420 345 20 NaN NaN NaN NaN NaN
e 189 335 384 NaN NaN NaN NaN NaN
f 210 245 220 NaN NaN NaN NaN NaN
g 539 435 44 NaN NaN NaN NaN NaN
h 301 10 360 NaN NaN NaN NaN NaN

8 rows × 8 columns

Final Activity:

Below write a small script that performs the following operations on the two DataFrames:
1. Raises every cell in one DataFrame to the power of the corresponding cell in the other, e.g.,

$\text{df.ix[0,0]}^{\textrm{df2.ix[0,0]}}$

2. Takes the xth root of every cell in one DataFrame, where x is the number in the corresponding cell in the other DF, e.g.,

$\sqrt[\text{df2.ix[0,0]}]{\text{df.ix[0,0]}}$

3. Takes the square root of the cell-wise sums of the DFs, e.g.,

$\sqrt{\text{df.ix[0,0]} + \text{df2.ix[0,0]}}$

4. Takes the natural log of the result of the each cell in one DF raised to the power of the corresponding cell in the other, e.g.,

$\ln(\text{df.ix[0,0]}^{\text{df2.ix[0,0]}})$

Hint: If any of these take more than one line of code or you are using apply, you are not vectorizing. Try Again!


In [23]:
# Insert your code for number 1 here.
np.power(np.float64(df), df2)


Out[23]:
0 1 2 3 4 5 6 7
a inf 6.782231e+11 6.460109e+217 6.684677e+299 1.548140e+45 3.789781e+73 inf inf
b 1.503830e+157 1.109526e+272 inf 4.772395e+235 1.000000e+00 1.805384e+222 1.098626e+81 2.253930e+136
c 1.461502e+128 2.913567e+92 1.512863e+240 9.412836e+152 9.765625e+06 inf 3.121749e+304 1.000000e+00
d 2.388636e+213 5.769983e+253 9.765625e+06 1.000000e+20 1.600000e+01 1.382186e+104 4.359939e+62 1.099512e+52
e 1.966271e+77 4.943375e+244 inf 3.914144e+48 1.157586e+120 7.562851e+262 1.234767e+32 1.461502e+128
f 4.239116e+88 4.357284e+165 2.753568e+191 1.500946e+17 4.772395e+235 inf 1.552518e+231 1.461502e+128
g 3.308047e+290 inf 8.140275e+22 1.114109e+116 4.772395e+235 4.205673e+144 inf 2.507026e+161
h 3.008065e+140 1.600000e+01 inf 1.165730e+59 inf 1.461502e+128 7.949685e+25 inf

8 rows × 8 columns


In [29]:
# Insert your code for number 2 here.
np.power(np.float64(df), 1/df2)


Out[29]:
0 1 2 3 4 5 6 7
a 1.024879 1.149117 1.034270 1.028041 1.083599 1.064660 1.024668 1.025095
b 1.041809 1.029823 1.024879 1.032632 0.000000 1.033843 1.061310 1.045501
c 1.047191 1.056949 1.032249 1.042494 1.174619 1.023668 1.027766 0.000000
d 1.034708 1.031157 1.174619 1.122018 1.189207 1.053226 1.070540 1.077769
e 1.062935 1.031876 1.024057 1.080566 1.049027 1.030474 1.098837 1.047191
f 1.058324 1.040512 1.037102 1.129831 1.032632 1.024257 1.033025 1.047191
g 1.028608 1.025998 1.115157 1.050006 1.032632 1.043940 1.026236 1.041149
h 1.044705 1.189207 1.025314 1.072777 1.025998 1.047191 1.109088 1.025538

8 rows × 8 columns

HINT:

$\sqrt{4} = 4^{\frac{1}{2}}$


In [32]:
# Insert your code for number 3 here.
from math import sqrt
np.sqrt(np.float64(df) + df2)


Out[32]:
0 1 2 3 4 5 6 7
a 16.613248 4.582576 13.527749 15.394804 7.348469 8.831761 16.703293 16.522712
b 11.874342 14.798649 16.613248 13.964240 0.000000 13.638182 9.165151 11.224972
c 10.954451 9.643651 14.071247 11.747340 3.872983 17.146428 15.491933 0.000000
d 13.416408 14.387495 3.872983 5.477226 2.449490 10.099505 8.306624 7.745967
e 9.000000 14.177447 16.970563 7.549834 10.677078 14.594520 6.480741 10.954451
f 9.486833 12.124356 12.845233 5.196152 13.964240 16.881943 13.856406 10.954451
g 15.198684 16.155494 5.744563 10.535654 13.964240 11.489125 16.062378 12.000000
h 11.357817 2.449490 16.431677 8.124038 16.155494 10.954451 6.000000 16.340135

8 rows × 8 columns


In [33]:
# Insert your code for number 4 here.
from math import log
np.log(np.power(np.float64(df), df2))


Out[33]:
0 1 2 3 4 5 6 7
a inf 27.242742 501.526611 690.372761 104.053383 169.421020 inf inf
b 361.913875 626.407078 inf 542.670345 0.000000 511.764664 186.603453 313.964248
c 295.110356 212.907207 553.034426 352.235008 16.094379 inf 701.124262 0.000000
d 491.321347 584.306698 16.094379 46.051702 2.772589 239.792516 144.232734 119.829291
e 177.975191 563.428811 inf 111.888681 276.456548 605.300543 73.893605 295.110356
f 204.071843 381.398389 440.806650 39.550042 542.670345 inf 532.337035 295.110356
g 668.946035 inf 52.753696 267.207926 542.670345 333.008688 inf 371.635297
h 323.463210 2.772589 inf 136.005868 inf 295.110356 59.637760 inf

8 rows × 8 columns

Take a close look at your answers. If they don't make sense or there are problems with them, figure out why and correct them in your code.