Often, when your data is in a Pandas DataFrame or a NumPy ndarray, you want to perform the same operation on every cell. You could loop over the cells, and sometimes you may have to. But you should try to vectorize your operations whenever you can, since a vectorized operation runs in the optimized array routines that Pandas and NumPy are built on instead of in a Python-level loop.
Let's jump right in.
In [1]:
import pandas as pd
import numpy as np
#First, we create an 8x8 DataFrame of random integers.
df = pd.DataFrame(data = np.random.randint(0, 100, size = (8,8)), index = ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'))
print(df) #since we are using random numbers, your values will differ from run to run.
Now, as mentioned in the introduction, if we wanted to multiply every cell by 2, we could loop through each row and then each column of the DataFrame, multiply everything by two, and then return the result. Something like this.
In [2]:
def times_2(df):
    val_dict = {}
    for index, values in df.iterrows():
        val_dict[index] = []
        for value in values:
            val_dict[index].append(value * 2)
    df2 = pd.DataFrame(val_dict).T # transposes the df since dictionary elements are read as columns instead of rows.
    df2.index = df.index
    df2.columns = df.columns
    return df2
df2 = times_2(df)
print(df2)
But this is not a very efficient way to perform this operation. Let's take a look at an easier and faster way to do this: vectorizing.
In [3]:
df3 = df * 2
print(df3)
print(df3 == df2) # element-wise comparison
print((df3 == df2).all()) # True for each column whose cells all match
print((df3 == df2).all(axis = 1)) # True for each row whose cells all match
Not only does the code take much less time to write (only one line), it also takes much less time to run. Check it out.
In [4]:
%timeit times_2(df)
%timeit df * 2
Look at the difference! On my computer, vectorizing the operation (the second one) is about 19x faster than the loop. This is why you should always try to vectorize your operations if you have a large data set.
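The %timeit magic only works inside IPython or a notebook. Here is a minimal sketch of the same comparison using the standard timeit module, so you can run it as a plain script. The names frame and double_with_loop, the seed, and the 200x8 size are made up for this example, and the timings will differ on your machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical example data: a fixed seed keeps the frame reproducible.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.integers(0, 100, size=(200, 8)))


def double_with_loop(data):
    """Double every cell one row at a time (the slow way)."""
    rows = {}
    for label, row in data.iterrows():
        rows[label] = [value * 2 for value in row]
    return pd.DataFrame.from_dict(rows, orient="index", columns=data.columns)


looped = double_with_loop(frame)
vectorized = frame * 2

# Both approaches must agree before their speed is worth comparing.
same_result = bool(np.array_equal(looped.to_numpy(), vectorized.to_numpy()))

# Time five calls of each version.
loop_seconds = timeit.timeit(lambda: double_with_loop(frame), number=5)
vector_seconds = timeit.timeit(lambda: frame * 2, number=5)

print(f"same result: {same_result}")
print(f"loop: {loop_seconds:.4f} s, vectorized: {vector_seconds:.4f} s")
```

Checking that the two results match first is a good habit: a fast answer is only useful if it is the same answer.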
You can also perform arithmetic operations between objects in a vectorized manner.
In [5]:
df + df3
Out[5]:
In [6]:
df - df3
Out[6]:
In [7]:
-df
Out[7]:
In [8]:
df * df3
Out[8]:
In [9]:
df / df3
Out[9]:
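One detail that is easy to miss in the cells above: Pandas pairs cells by their row and column labels, not by their position. The sketch below uses two tiny made-up frames (left and right are hypothetical names) whose rows are stored in a different order.

```python
import pandas as pd

# Hypothetical frames with the same labels stored in a different row order.
left = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["a", "b"])
right = pd.DataFrame({"x": [10, 20], "y": [30, 40]}, index=["b", "a"])

# Cells are paired by label: row "a" of left meets row "a" of right.
total = left + right
print(total)

a_x = total.loc["a", "x"]  # 1 + 20
b_y = total.loc["b", "y"]  # 4 + 30
```

In the cells above, df and df3 share the same labels, so alignment and position happen to give the same result.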
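If we want to apply a function to every row or every column of a DataFrame, we can often just call the built-in method for it, such as mean, and choose the direction with the axis argument.
There is also the apply method, which calls a function of your choice on each column (axis = 0) or on each row (axis = 1).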
In [10]:
from timeit import timeit
print(df.mean(axis = 1)) #axis = 1 averages across the columns, giving one mean per row.
print('Vectorized function takes %s seconds' % timeit('df.mean(axis = 1)', 'from __main__ import df, np', number = 1000))
print(df.apply(np.mean, axis = 1)) #This calls np.mean once per row and gives the same result.
print('Apply method takes %s seconds' % timeit('df.apply(np.mean, axis = 1)', 'from __main__ import df, np', number = 1000))
But notice how much longer the apply method takes. So while the apply method can be useful for functions you implement yourself, you should first check whether there is a function in Pandas that does the same thing because it will almost certainly be faster.
Note: we will see this later in the week when we implement our own cosine similarity function.
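The axis argument is easy to mix up, so here is a small sketch on a frame you can check by hand. The frame scores and its labels are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame, small enough to verify by hand.
scores = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=["r1", "r2"],
    columns=["a", "b", "c"],
)

column_means = scores.mean(axis=0)  # one value per column: a=2.5, b=3.5, c=4.5
row_means = scores.mean(axis=1)     # one value per row: r1=2.0, r2=5.0

# apply calls np.mean once per row from Python, which is why it is slower.
row_means_apply = scores.apply(np.mean, axis=1)

print(column_means)
print(row_means)
print(row_means_apply)
```

The built-in method and apply give the same row means here; the difference is only in how the work gets done.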
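You can also combine a DataFrame with a Pandas Series (or even a scalar value) element-wise. By default, Pandas matches the index of the Series against the column labels of the DataFrame and repeats the operation for every row.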
In [11]:
s = pd.Series(np.random.randint(0, 8, size = 8))
s
Out[11]:
In [12]:
df
Out[12]:
In [13]:
df * s
Out[13]:
In [14]:
df.multiply(s) #this is slightly faster than the above
Out[14]:
In [15]:
np.multiply(df, s) #the numpy function is about 33% faster than either of the previous ones
Out[15]:
In [16]:
df * 5
Out[16]:
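With random numbers it is hard to see which cell got multiplied by which value. Here is a minimal sketch with made-up values (table and weights are hypothetical names) that shows how the Series lines up with the DataFrame.

```python
import pandas as pd

# Hypothetical frame whose column labels are 0, 1, 2.
table = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=[0, 1, 2])

# The Series index (0, 1, 2) matches the column labels of the frame.
weights = pd.Series([10, 100, 1000])

# Default behavior: weights[j] multiplies every cell in column j.
by_column = table * weights

# With axis=0 the Series index is matched against the row labels instead.
by_row = table.mul(pd.Series([2, 3]), axis=0)

print(by_column)
print(by_row)
```

If you want the Series applied down the rows, the method form (mul, add, and so on) with axis=0 is the way to say so; the * operator always matches against the columns.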
And, finally, notice what happens if your Series (or DataFrame) is a different size than your original.
In [17]:
s2 = pd.Series(np.random.randint(0, 8, size = 3))
s2
Out[17]:
In [18]:
df
Out[18]:
In [19]:
df * s2
Out[19]:
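Here is a small sketch of what is going on, using made-up values. Labels that have no partner in the Series produce NaN. Padding the Series with 1 through reindex is just one possible remedy, and it is an assumption of this example; whether 1 is the right filler depends on what your data means.

```python
import pandas as pd

# Hypothetical frame with four columns labeled 0 to 3.
table = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=[0, 1, 2, 3])

# This Series only has labels 0 and 1.
short = pd.Series([10, 20])

# Columns 2 and 3 have no matching label in the Series, so they become NaN.
product = table * short
missing = int(product.isna().sum().sum())

# One possible remedy (an assumption): treat a missing multiplier as 1.
padded = short.reindex(table.columns, fill_value=1)
filled = table * padded

print(product)
print(filled)
```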
[…] xth root of every cell in one DataFrame, where x is the number in the corresponding cell in the other DataFrame, e.g., […]
[…] apply, you are not vectorizing. Try again!
In [23]:
# Insert your code for number 1 here.
np.power(np.float64(df), df2)
Out[23]:
In [29]:
# Insert your code for number 2 here.
np.power(np.float64(df), 1/df2)
Out[29]:
HINT:
$\sqrt{4} = 4^{\frac{1}{2}}$
In [32]:
# Insert your code for number 3 here.
np.sqrt(np.float64(df) + df2) #np.sqrt works element-wise, so math.sqrt is not needed.
Out[32]:
In [33]:
# Insert your code for number 4 here.
np.log(np.power(np.float64(df), df2)) #np.log works element-wise, so math.log is not needed.
Out[33]:
Take a close look at your answers. If they don't make sense or there are problems with them, figure out why and correct them in your code.
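One thing worth checking in answers like these is whether any cell came out as inf. The sketch below is not the notebook's official solution, just an illustration with made-up numbers: a large base raised to a large power no longer fits in a float64, while the mathematically equivalent form b * log(a) stays finite.

```python
import numpy as np

# Hypothetical values: the second pair is large enough to overflow float64.
base = np.array([2.0, 99.0])
exponent = np.array([3.0, 198.0])

# Silence the overflow warning so the inf shows up in the result instead.
with np.errstate(over="ignore"):
    direct = np.log(np.power(base, exponent))

# log(a ** b) equals b * log(a), and this form never builds the huge number.
stable = exponent * np.log(base)

print(direct)
print(stable)
```

Both forms agree where the power fits in a float64; they only differ where the intermediate result overflows.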