Python's pandas package is one of the most powerful tools for data analysis in the Python ecosystem. Built on top of NumPy, it makes working with tabular data quite effective and adds an astounding amount of functionality to your toolkit. Despite its strengths, there are some very useful functions that are challenging to grasp based on the pandas docs. apply and transform are two such examples.
One quick note before we dive in: this series assumes basic working knowledge of pandas. There are several resources, like Dataquest, DataCamp and pandas cheat sheets, to get you up to speed if this is hard to follow.
What are apply and transform?
In short, these two functions are used to operate on data structures, similarly to Python's built-in map function. We will get into the differences, but typically they are used in combination with groupby to perform aggregate functions on various groups of a dataset. This is a direct analogy to GROUP BY in SQL, and I am going to assume familiarity with how it works (if you aren't familiar, here is a decent intro). The major difference is that we can leverage the flexibility of Python and pandas DataFrames to do basically whatever we want.
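To make the GROUP BY analogy concrete, here is a minimal sketch on a tiny hand-made table. The events DataFrame, its values and the 'login' event type are invented for illustration and are not the article's dataset; the SQL in the comment is only a rough equivalent.

```python
import pandas as pd

# Hypothetical stand-in for the event data (all values are invented).
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_type": ["login", "buy_coins", "login"],
})

# Roughly: SELECT user_id, COUNT(*) FROM events GROUP BY user_id
events_per_user = events.groupby("user_id").size()
print(events_per_user)
```

The result is a Series indexed by user_id, with one row per group, much like the rows a GROUP BY query would return.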
To keep things practical, let's start with event data from a hypothetical mobile game. I created some randomly generated, but logical data for us to analyze.
In [1]:
import pandas as pd
data = pd.read_csv('test_user_data.csv')
print(data.head(10))
The data contains one event per row and has 5 variables:
In [2]:
apply_ex = data.groupby('user_id').apply(len)
print(apply_ex.head())
The output here is a pandas Series with each user_id as the index and the count of the number of events as values. Now to try the same thing with transform.
In [3]:
transform_ex = data.groupby('user_id').transform(len)
print(transform_ex.head())
What the heck happened here? This odd DataFrame highlights a key difference: apply by default returns an object with one element per group, while transform returns an object of the exact same size as the input object. Unless told otherwise, transform operates column by column, in order.
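The size difference is easier to see on a frame small enough to check by eye. This is a sketch on invented data (the events frame and the 'login' event type are assumptions, not the article's CSV), and it selects a single column first so that both calls return a Series.

```python
import pandas as pd

# Hypothetical toy data standing in for the event log.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "event_type": ["login", "buy_coins", "login", "login", "megapack", "login"],
})

grouped = events.groupby("user_id")["event_type"]

per_group = grouped.apply(len)    # one value per user_id
per_row = grouped.transform(len)  # one value per original row

print(len(per_group), len(per_row), len(events))
```

Here per_group has three entries, one for each user, while per_row has six, one for each row of events, with each user's count repeated on every one of that user's rows.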
Let's clean this up a bit and create a new column in our original DataFrame that contains the total event count for each user.
In [4]:
data['event_count'] = data.groupby('user_id')['user_id'].transform(len)
print(data.head(7))
Much better. All we had to do was assign to the new event_count column and then specify the ['user_id'] column after the groupby statement. Whether you would prefer to have this additional column of repeating values depends on what you intend to do with the data afterwards. Let's assume this is acceptable. Now for something a bit more involved.
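Assigning the result straight back works because transform returns values aligned to the original row index, even when a user's rows are not adjacent. The sketch below uses invented data and passes the built-in aggregation name "size" instead of len; that is an alternative spelling of the same count, not what the cell above does.

```python
import pandas as pd

# Hypothetical events with each user's rows deliberately interleaved.
events = pd.DataFrame({
    "user_id": [10, 20, 10, 30, 20, 10],
    "event_type": ["login", "login", "buy_coins", "login", "megapack", "login"],
})

# "size" counts rows per group; transform broadcasts that count to every row.
events["event_count"] = events.groupby("user_id")["user_id"].transform("size")
print(events)
```

Every row for user 10 receives 3, every row for user 20 receives 2, and the single row for user 30 receives 1, regardless of row order.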
In [5]:
def add_value(x):
    if x == 'buy_coins':
        y = 1.00
    elif x == 'megapack':
        y = 10.00
    else:
        y = 0.0
    return y
Here we've defined a very simple custom function that assigns values to each of the four event types. Now to apply it to our data.
In [6]:
data['event_value'] = data['event_type'].apply(add_value)
print(data.head(7))
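That worked out nicely. Since we didn't care about event values per user, groupby wasn't necessary. If we were to run this through a grouped transform, we'd get an error: transform hands the function a whole column for each group rather than a single value, so the comparisons inside add_value no longer make sense. And since it runs column by column, there isn't a practical way to reference other columns like with apply.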
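For a pure lookup like add_value, a dictionary passed to Series.map is a common alternative to writing a function. This is a sketch rather than the approach used above: the event_values dictionary simply restates the two prices from add_value, and the sample rows (including the 'login' event type) are invented.

```python
import pandas as pd

# Prices copied from add_value; any other event type is worth 0.0.
event_values = {"buy_coins": 1.00, "megapack": 10.00}

# Hypothetical sample rows, not the article's dataset.
events = pd.DataFrame({
    "event_type": ["login", "buy_coins", "megapack", "login"],
})

# map() looks each event type up in the dict; unmatched types become NaN,
# which fillna() then replaces with the default value.
events["event_value"] = events["event_type"].map(event_values).fillna(0.0)
print(events)
```

The trade-off is that the default value moves out of an else branch and into fillna, which keeps the lookup table easy to extend.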
In the next post of the series, we'll continue using pandas to answer more interesting product questions like: