Pandas Data Munging: Avoiding that 'SettingWithCopyWarning'

If you use Python for data analysis, you probably use Pandas for data munging. And if you use Pandas, you've probably come across the warning below:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

The Pandas documentation is great in general, but it's easy to read through the link above and still be confused. Or if you're like me, you'll read the documentation page, think "Oh, I get it," and then get the same warning again.

A Simple Reproducible Example of The Warning™

Here's where this issue pops up. Say you have some data:



In [3]:

    
import pandas as pd
df = pd.DataFrame({'Number': [100, 200, 300, 400, 500], 'Letter': ['a', 'b', 'c', 'd', 'e']})
df

...and you want to filter it on some criteria. Pandas makes that easy with boolean indexing:



In [4]:

    
criteria = df['Number']>300
criteria

Out[4]:

0    False
1    False
2    False
3     True
4     True
Name: Number, dtype: bool



In [28]:

    
#Keep only rows which correspond to 'Number'>300 ('True' in the 'criteria' vector above)
df[criteria]

This works great, right? Unfortunately not, because the trouble starts once we:

1. Use that filtering code to create a new Pandas DataFrame, and
2. Assign a new column or change an existing column in that DataFrame

like so...



In [29]:

    
#Create a new DataFrame based on filtering criteria
df_2 = df[criteria]

#Assign a new column and print output
df_2['new column'] = 'new value'
df_2

/home/max/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Out[29]:

  Letter  Number  new column
3      d     400   new value
4      e     500   new value

There's the warning.

So what should we have done differently? The warning suggests using ".loc[row_indexer,col_indexer]". So let's try subsetting the DataFrame the same way as before, but this time with the df.loc[] indexer.
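
As an aside, the form quoted in the warning is normally used to assign into the original DataFrame in a single step. Here is a minimal sketch of that, assuming we wanted to change df itself rather than build a new DataFrame from it:

```python
import pandas as pd

df = pd.DataFrame({'Number': [100, 200, 300, 400, 500],
                   'Letter': ['a', 'b', 'c', 'd', 'e']})
criteria = df['Number'] > 300

# One .loc call on the original DataFrame: rows picked by the boolean mask,
# column picked by name. This changes df itself.
df.loc[criteria, 'Number'] = 0

print(df['Number'].tolist())  # [100, 200, 300, 0, 0]
```

That is a different goal from the one in this post, where df should stay as it is and only the filtered DataFrame should gain a column.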

Re-Creating Our New Dataframe Using .loc[]



In [30]:

    
df.loc[criteria, :]



In [31]:

    
#Create New DataFrame Based on Filtering Criteria
df_2 = df.loc[criteria, :]

#Add a New Column to the DataFrame
df_2.loc[:, 'new column'] = 'new value'
df_2

/home/max/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:296: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
/home/max/anaconda3/lib/python3.5/site-packages/pandas/core/indexing.py:476: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

Out[31]:

  Letter  Number  new column
3      d     400   new value
4      e     500   new value

Two warnings this time!

OK, So What's Going On?

Recall that our "criteria" variable is a Pandas Series of Boolean True/False values, corresponding to whether a row of 'df' meets our Number>300 criteria.



In [14]:

    
criteria

Out[14]:

0    False
1    False
2    False
3     True
4     True
Name: Number, dtype: bool

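The Pandas docs say a "common operation is the use of boolean vectors to filter the data", as we've done here, and a boolean vector is in fact a perfectly valid "row_indexer" for .loc[]. So the indexer is not the real problem. The warning fires at the assignment step: df_2 was produced by slicing df, Pandas has flagged it as possibly being a copy of that slice, and it can't tell whether we meant to change df as well. That explains why swapping df[criteria] for df.loc[criteria, :] changed nothing. What did avoid the warning in my session was selecting the rows with a vector of row numbers (technically, "row labels", which here are numbers) instead of the boolean vector.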

Solution

We can get that vector of row labels with one extra line of code, building on what we had before. Instead of creating our new dataframe by filtering rows with a vector of True/False values, like below...



In [15]:

    
df_2 = df[criteria]

We first grab the indices of that filtered dataframe using .index...



In [32]:

    
criteria_row_indices = df[criteria].index
criteria_row_indices

Out[32]:

Int64Index([3, 4], dtype='int64')

And pass that list of indices to .loc[] to create our new dataframe:



In [27]:

    
new_df = df.loc[criteria_row_indices, :]
new_df
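
As a quick sanity check, the label-based selection should pick out exactly the same rows as the boolean mask did. A minimal sketch reusing the names from the cells above (by_mask and by_label are names introduced only for this comparison):

```python
import pandas as pd

df = pd.DataFrame({'Number': [100, 200, 300, 400, 500],
                   'Letter': ['a', 'b', 'c', 'd', 'e']})
criteria = df['Number'] > 300

# Row labels of the rows where the mask is True, obtained as in the post.
criteria_row_indices = df[criteria].index

by_mask = df.loc[criteria, :]               # rows selected with the boolean mask
by_label = df.loc[criteria_row_indices, :]  # rows selected with the row labels

print(criteria_row_indices.tolist())  # [3, 4]
print(by_mask.equals(by_label))       # True
```

Both selections hold the same data; what differs is only how the rows were asked for.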

Now we can add a new column without throwing The Warning™:



In [24]:

    
new_df['New Column'] = 'New Value'
new_df

Out[24]:

  Letter  Number  New Column
3      d     400   New Value
4      e     500   New Value
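
Another commonly used way to get an independent DataFrame is to call .copy() on the filtered result before adding columns. This is not the approach used in the walkthrough above; it is a minimal sketch, assuming the intent is to leave df unchanged:

```python
import pandas as pd

df = pd.DataFrame({'Number': [100, 200, 300, 400, 500],
                   'Letter': ['a', 'b', 'c', 'd', 'e']})
criteria = df['Number'] > 300

# .copy() makes df_2 an independent DataFrame, so the assignment below
# cannot be mistaken for an attempt to modify df.
df_2 = df[criteria].copy()
df_2['new column'] = 'new value'

print(df_2)
print('new column' in df.columns)  # False: the original df is untouched
```

The explicit copy states the intent in the code itself, which may help when rereading it later.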

Final Note - Did That Warning Even Mean Our Results Were Wrong?

In each of the instances above where we got a warning, you may have noticed that we also got the results we expected. Maybe the warning isn't such a big deal? It's not an error, right?

The Pandas documentation page linked in the warning states that the results may be correct, but are not reliably correct, because of the unpredictable nature of when an underlying __getitem__ call returns a view vs a copy. After reading some StackOverflow discussions, at least one dev is confident that "if you know what you are doing", you can ignore these warnings (or suppress them) and rest assured your results are reliable.

I'm sure that works for him, but even if I managed to convince myself when it's safe to ignore this warning, what happens in a year when I forget whether some old code that throws the warning is reliable or not? Was this written before I figured it out? What happens when someone else is using my code, asks about the warning, and I say "don't worry, it's fine, but I forget why" and wave my hands a lot?

Plus, doesn't that warning just bother you? Either out of prudence or neuroticism, I'm not interested in peppering my logs with warnings from the Pandas devs, and I'm not cavalier enough to suppress the warning messages.
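
If you share the worry about forgetting which code is safe, one option is to let a test watch for the warning instead of relying on memory. Below is a minimal sketch using Python's standard warnings module. The helper build_filtered is hypothetical, and the .copy() call inside it is an assumption chosen for the example, not part of the walkthrough above:

```python
import warnings

import pandas as pd


def build_filtered(df):
    # Hypothetical helper: filter rows, take an explicit copy, add a column.
    out = df[df['Number'] > 300].copy()
    out['new column'] = 'new value'
    return out


df = pd.DataFrame({'Number': [100, 200, 300, 400, 500],
                   'Letter': ['a', 'b', 'c', 'd', 'e']})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')  # record every warning, even repeats
    result = build_filtered(df)

# Match on the class name so the check does not depend on where a given
# Pandas version exposes the warning class.
copy_warnings = [w for w in caught
                 if w.category.__name__ == 'SettingWithCopyWarning']

print(len(copy_warnings))
```

A check like this could sit in a test suite, so the warning shows up as a failing test rather than a line in a log nobody reads.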