Pandas Data Munging: Avoiding that 'SettingWithCopyWarning'

If you use Python for data analysis, you probably use Pandas for data munging. And if you use Pandas, you've probably come across the warning below:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

The Pandas documentation is great in general, but it's easy to read through the link above and still be confused. Or if you're like me, you'll read the documentation page, think "Oh, I get it," and then get the same warning again.

A Simple Reproducible Example of The Warning(tm)

Here's where this issue pops up. Say you have some data:


In [3]:
import pandas as pd
df = pd.DataFrame({'Number': [100, 200, 300, 400, 500], 'Letter': ['a', 'b', 'c', 'd', 'e']})
df


Out[3]:
Letter Number
0 a 100
1 b 200
2 c 300
3 d 400
4 e 500

...and you want to filter it on some criteria. Pandas makes that easy with boolean indexing:


In [4]:
criteria = df['Number']>300
criteria


Out[4]:
0    False
1    False
2    False
3     True
4     True
Name: Number, dtype: bool

In [28]:
#Keep only rows which correspond to 'Number'>300 ('True' in the 'criteria' vector above)
df[criteria]


Out[28]:
Letter Number
3 d 400
4 e 500

This works great, right? Unfortunately not, because once we:

  1. Use that filtering code to create a new Pandas DataFrame, and
  2. Assign a new column or change an existing column in that DataFrame

like so...


In [29]:
#Create a new DataFrame based on filtering criteria
df_2 = df[criteria]

#Assign a new column and print output
df_2['new column'] = 'new value'
df_2


/home/max/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Out[29]:
Letter Number new column
3 d 400 new value
4 e 500 new value

There's the warning.

So what should we have done differently? The warning suggests using ".loc[row_indexer,col_indexer]". So let's try subsetting the DataFrame the same way as before, but this time using the df.loc[] indexer.

Re-Creating Our New Dataframe Using .loc[]


In [30]:
df.loc[criteria, :]


Out[30]:
Letter Number
3 d 400
4 e 500

In [31]:
#Create New DataFrame Based on Filtering Criteria
df_2 = df.loc[criteria, :]

#Add a New Column to the DataFrame
df_2.loc[:, 'new column'] = 'new value'
df_2


/home/max/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:296: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
/home/max/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:476: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
Out[31]:
Letter Number new column
3 d 400 new value
4 e 500 new value

Two warnings this time!

OK, So What's Going On?

Recall that our "criteria" variable is a Pandas Series of Boolean True/False values, corresponding to whether a row of 'df' meets our Number>300 criteria.


In [14]:
criteria


Out[14]:
0    False
1    False
2    False
3     True
4     True
Name: Number, dtype: bool

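The Pandas docs say a "common operation is the use of boolean vectors to filter the data", as we've done here. But apparently a boolean vector is not the "row_indexer" the warning advises us to use with .loc[] when creating new dataframes. Instead, Pandas wants us to use .loc[] with a vector of row numbers (technically "row labels", which here are numbers). Strictly speaking, the warning fires when we assign to a dataframe that Pandas has flagged as a possible copy of 'df', and in the version used here it is the boolean filter that sets that flag.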
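To make the "row labels" point concrete, here is a minimal sketch. The frame df_labeled and its string labels are made up for illustration and are not part of the example above. It only shows that a boolean vector and an explicit list of row labels select the same rows through .loc[], and that labels need not be numbers.

```python
import pandas as pd

# Hypothetical frame with string row labels instead of the default 0..4.
df_labeled = pd.DataFrame(
    {"Number": [100, 200, 300, 400, 500]},
    index=["v", "w", "x", "y", "z"],
)

# Boolean vector: one True/False value per row.
mask = df_labeled["Number"] > 300

# Select the same rows two ways: by boolean vector and by row label.
by_mask = df_labeled.loc[mask, :]
by_labels = df_labeled.loc[["y", "z"], :]

print(list(by_mask.index))        # ['y', 'z']
print(by_mask.equals(by_labels))  # True
```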

Solution

We can get to that "row_indexer" with one extra line of code, building on what we had before. Instead of creating our new dataframe by filtering rows with a vector of True/False values, like below...


In [15]:
df_2 = df[criteria]

We first grab the indices of that filtered dataframe using .index...


In [32]:
criteria_row_indices = df[criteria].index
criteria_row_indices


Out[32]:
Int64Index([3, 4], dtype='int64')

...and pass that Index of row labels to .loc[] to create our new dataframe:


In [27]:
new_df = df.loc[criteria_row_indices, :]
new_df


Out[27]:
Letter Number
3 d 400
4 e 500

Now we can add a new column without throwing The Warning (tm):


In [24]:
new_df['New Column'] = 'New Value'
new_df


Out[24]:
Letter Number New Column
3 d 400 New Value
4 e 500 New Value
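A commonly recommended alternative, which is not the approach used above, is to make the copy explicit with .copy() right after filtering. The sketch below assumes the same 'df' and 'criteria' as before; the name df_copy is made up for illustration. Whether a given Pandas version warns at all varies, so the checks here look only at the data: the new frame gets the column and 'df' is left alone.

```python
import pandas as pd

df = pd.DataFrame({"Number": [100, 200, 300, 400, 500],
                   "Letter": ["a", "b", "c", "d", "e"]})
criteria = df["Number"] > 300

# Explicitly copy the filtered rows so the new frame is independent of df.
df_copy = df[criteria].copy()

# Adding a column now touches only df_copy.
df_copy["New Column"] = "New Value"

print(df_copy)
print("New Column" in df.columns)  # False: the original frame is unchanged
```

The appeal of this form is that the intent is visible in the code itself: anyone reading it can see that df_copy is meant to be a separate object.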

Final Note - Did That Warning Even Mean Our Results Were Wrong?

In each of the instances above where we got a warning, you may have noticed that we also got the results we expected. Maybe the warning isn't such a big deal? It's not an error, right?

The Pandas documentation page linked in the warning states that the results may be correct, but are not reliably correct, because of the unpredictable nature of when an underlying __getitem__ call returns a view vs. a copy. After reading some StackOverflow discussions, I found at least one dev who is confident that "if you know what you are doing", you can ignore these warnings (or suppress them) and rest assured your results are reliable.

I'm sure that works for him, but even if I managed to convince myself when it's safe to ignore this warning, what happens in a year when I forget whether some old code which throws the warning is reliable or not? Was this written before I figured it out? What happens when someone else is using my code, asks about the warning, and I say "don't worry it's fine, but I forget why" and wave my hands a lot?

Plus, doesn't that warning just bother you? Either out of prudence or neuroticism, I'm not interested in peppering my logs with warnings from the Pandas devs, and I'm not cavalier enough to suppress the warning messages.

To me, the clean code solution requires using code that provides reliably correct results without these warnings.
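For a concrete case of "may be correct, but not reliably correct", here is a minimal sketch of the chained-assignment pattern the warning is meant to catch. It is not taken from the examples above. The warnings are recorded rather than suppressed so they can be inspected, and which warning class shows up (if any) depends on the installed Pandas version.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"Number": [100, 200, 300, 400, 500],
                   "Letter": ["a", "b", "c", "d", "e"]})

# Record warnings instead of printing them, so we can look at them afterwards.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Chained indexing: the boolean filter returns a new object, and the
    # assignment lands on that temporary object rather than on df.
    df[df["Number"] > 300]["Number"] = 0

print(df["Number"].tolist())  # [100, 200, 300, 400, 500]: df is unchanged
print([w.category.__name__ for w in caught])  # names vary by Pandas version

# One .loc call with a row indexer and a column indexer writes to df itself.
df.loc[df["Number"] > 300, "Number"] = 0
print(df["Number"].tolist())  # [100, 200, 300, 0, 0]
```

The first assignment runs without an error and quietly changes nothing in 'df', which is exactly the kind of result that is easy to miss if the warning is ignored.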