00-masks

This notebook is part of the Lisbon Data Science meetup (class 2) and explains how to create masks and use them to assign values to dataframes.


In [ ]:
import pandas as pd
import numpy as np

What are masks and why are they useful?

You have almost certainly used masks already. They are boolean arrays that let us access, in our case, parts of a DataFrame. These parts can be defined with inequalities, for instance. The best way to see this is to go through some examples.


In [ ]:
# Example dataframe
df = pd.DataFrame({'Age': [20, 18, 25, 55, 125, 30],
                   'Height': [165, 189, 359, 149, 175, 163]})
df

Masks are useful to get the parts of our dataframe with specific characteristics, for instance, people younger than 30:


In [ ]:
my_mask = df['Age'] < 30
my_mask

... People with an exact age:


In [ ]:
my_mask = df['Age'] == 55
my_mask

Or, if we want people with age 0 or above and below 115:


In [ ]:
my_mask = (df['Age'] >= 0) & (df['Age'] < 115)
my_mask

This is our mask! When dealing with DataFrames, you get a boolean Series in return, with True for the rows that fulfill your conditions and False for the others. Let us see our last mask in practice, where one of the rows is filtered out:


In [ ]:
df[my_mask]

In [ ]:
df.loc[my_mask]
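
Both cells above return the same rows. As a minimal sketch (the `people` dataframe and the other variable names are illustrative, not part of the original notebook), the block below checks that equivalence, counts the matching rows by summing the mask, and uses `.loc` to pick a single column in the same step:

```python
import pandas as pd

# Illustrative copy of the example data (names are assumptions)
people = pd.DataFrame({'Age': [20, 18, 25, 55, 125, 30],
                       'Height': [165, 189, 359, 149, 175, 163]})

# Boolean Series: True where 0 <= Age < 115
age_ok = (people['Age'] >= 0) & (people['Age'] < 115)

# True counts as 1, so summing the mask counts the matching rows
n_matching = int(age_ok.sum())

# Plain indexing and .loc select the same rows
same_rows = people[age_ok].equals(people.loc[age_ok])

# .loc can also restrict the columns in the same step
heights = people.loc[age_ok, 'Height']
```

The `.loc[rows, columns]` form is the one that matters later on, when values are assigned through a mask.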

Using masks to assign values to dataframe

Well, our initial dataframe df is still...


In [ ]:
df

... since we didn't change it yet! We just looked at filtered selections of the dataframe. Let us drop row 4, the one with Age=125:


In [ ]:
# .copy() makes df an independent dataframe rather than a slice of the original
df = df[my_mask].copy()
df

But we still have a person that looks too tall to be true. Let's do something about it, let's trim her to 155!


In [ ]:
mask = df['Height'] == 359
df[mask]['Height'] = 155
df

Oh no!

We got a warning! Maybe we shouldn't have trimmed that person down!!

Actually, it's not that... The problem is that we are (or might be) trying to assign a value (155) to a copy of a slice of the dataframe instead of the actual dataframe! This can be a hidden problem if we disregard the warning. Explaining it in full would require more time than we have, but I recommend that you take a look at the warning's link. Always pay attention to warnings - if you don't know what they mean, Google them.
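
To make the problem concrete, here is a small sketch on a fresh dataframe (`people` and `tall` are illustrative names, not from the notebook). Indexing with a boolean mask builds a new object, so the chained assignment changes that temporary object and the original stays untouched. The warning is silenced here only to keep the output clean; its exact class depends on the pandas version.

```python
import warnings

import pandas as pd

# Illustrative data (names are assumptions)
people = pd.DataFrame({'Age': [20, 18, 25],
                       'Height': [165, 189, 359]})
tall = people['Height'] == 359

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    # Chained assignment: people[tall] is a temporary copy,
    # so this write never reaches `people`
    people[tall]['Height'] = 155

# The original value is still there
unchanged = people.loc[2, 'Height']
```

Checking `unchanged` shows 359, not 155, which is why the warning should not be ignored.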

The solution is to use .loc[], which is primarily label based (e.g., using 'Age', 'Height') but may also be used with a boolean array (which is what we want). I would also recommend taking a look at this post.


In [ ]:
df.loc[df['Height'] == 359, 'Height'] = 155
df

And here we have our dataframe without extreme heights and our ages within a specified range. By the way, if you want to invert your mask in a pythonic way you just need to do this:


In [ ]:
~my_mask
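
As a short sketch of what the inverted mask is good for (again with an illustrative `people` dataframe, not part of the notebook), `~` flips every True/False value, so indexing with it returns exactly the rows the original mask filtered out. `~`, `&` and `|` work element by element, and each comparison needs its own parentheses because these operators bind more tightly than `<` or `>=`.

```python
import pandas as pd

# Illustrative copy of the example data (names are assumptions)
people = pd.DataFrame({'Age': [20, 18, 25, 55, 125, 30],
                       'Height': [165, 189, 359, 149, 175, 163]})

# True where 0 <= Age < 115
age_ok = (people['Age'] >= 0) & (people['Age'] < 115)

# Rows rejected by the mask
rejected = people[~age_ok]

# The same rejection rule written directly with "or"
age_bad = (people['Age'] < 0) | (people['Age'] >= 115)
```

Here `rejected` holds only the row with Age=125, and `age_bad` is equal to `~age_ok`.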
