If you use Python for data analysis, you probably use Pandas for data munging. And if you use Pandas, you've probably come across the warning below:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The Pandas documentation is great in general, but it's easy to read through the link above and still be confused. Or if you're like me, you'll read the documentation page, think "Oh, I get it," and then get the same warning again.
Here's where this issue pops up. Say you have some data:
In [3]:
import pandas as pd
df = pd.DataFrame({'Number': [100, 200, 300, 400, 500], 'Letter': ['a', 'b', 'c', 'd', 'e']})
df
Out[3]:
...and you want to filter it on some criteria. Pandas makes that easy with boolean indexing:
In [4]:
criteria = df['Number']>300
criteria
Out[4]:
In [28]:
# Keep only rows where 'Number' > 300 ('True' in the 'criteria' vector above)
df[criteria]
Out[28]:
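This works great, right? Unfortunately not, because once we create a new DataFrame from the filtered rows and then assign a new column to it, we get a warning, like so...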
In [29]:
# Create a new DataFrame based on filtering criteria
df_2 = df[criteria]
# Assign a new column and print output
df_2['new column'] = 'new value'
df_2
Out[29]:
There's the warning.
So what should we have done differently? The warning suggests using ".loc[row_indexer, col_indexer]". So let's try subsetting the DataFrame the same way as before, but this time with the df.loc[] indexer.
In [30]:
df.loc[criteria, :]
Out[30]:
In [31]:
# Create a new DataFrame based on filtering criteria
df_2 = df.loc[criteria, :]
# Add a new column to the DataFrame
df_2.loc[:, 'new column'] = 'new value'
df_2
Out[31]:
Two warnings this time!
Recall that our "criteria" variable is a Pandas Series of Boolean True/False values, corresponding to whether a row of 'df' meets our Number>300 criteria.
In [14]:
criteria
Out[14]:
The Pandas docs say a "common operation is the use of boolean vectors to filter the data" as we've done here. A boolean vector is a valid row indexer for .loc[], so the filter itself isn't what's wrong; the warning appears because Pandas can't be sure whether df_2 is a view or a copy of df when we later assign to it. One way to sidestep that is to build the new DataFrame with .loc[] and a vector of row numbers (technically, "row labels", which here are numbers) as the "row_indexer".
We can get that "row_indexer" with one extra line of code, building on what we had before. Instead of creating our new DataFrame by filtering rows with a vector of True/False values like below...
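To make the "row labels" point concrete, here's a small sketch using a DataFrame whose index is letters rather than numbers. The names labeled, mask, row_labels, by_labels, and by_mask are made up for this illustration and aren't part of the walkthrough. It only shows that the .index of a filtered frame holds labels, and that selecting by those labels picks the same rows as the boolean mask.

```python
import pandas as pd

# Hypothetical example frame with a non-numeric index, so labels and
# positions can't be confused with each other.
labeled = pd.DataFrame({'Number': [100, 200, 300, 400, 500]},
                       index=['v', 'w', 'x', 'y', 'z'])
mask = labeled['Number'] > 300

# .index on the filtered frame gives the row labels, not row positions.
row_labels = labeled[mask].index

# Selecting by label and selecting by boolean mask return the same rows.
by_labels = labeled.loc[row_labels, :]
by_mask = labeled.loc[mask, :]
same_rows = by_labels.equals(by_mask)

print(row_labels.tolist())
print(same_rows)
```

With the default 0, 1, 2, ... index used in this post, the labels just happen to look like row numbers.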
In [15]:
df_2 = df[criteria]
We first grab the indices of that filtered dataframe using .index...
In [32]:
criteria_row_indices = df[criteria].index
criteria_row_indices
Out[32]:
...and pass those index labels to .loc[] to create our new DataFrame:
In [27]:
new_df = df.loc[criteria_row_indices, :]
new_df
Out[27]:
Now we can add a new column without throwing The Warning (tm):
In [24]:
new_df['New Column'] = 'New Value'
new_df
Out[24]:
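For comparison, here's a minimal sketch of another common way to end up with an independent DataFrame: calling .copy() on the filtered result before adding the column. This is not the approach used above, just an alternative worth knowing about. The name df_copy is made up for illustration, and whether any warning appears at all can depend on your Pandas version.

```python
import pandas as pd

df = pd.DataFrame({'Number': [100, 200, 300, 400, 500],
                   'Letter': ['a', 'b', 'c', 'd', 'e']})
criteria = df['Number'] > 300

# .copy() makes it explicit that df_copy is its own DataFrame,
# so assigning to it can't be mistaken for an attempt to modify df.
df_copy = df.loc[criteria, :].copy()
df_copy['new column'] = 'new value'

print(df_copy)
print(df.columns.tolist())  # the original df has no 'new column'
```

The explicit copy also documents intent for whoever reads the code later: the new column is meant for the subset only.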
In each of the instances above where we got a warning, you may have noticed that we also got the results we expected. Maybe the warning isn't such a big deal? It's not an error, right?
The Pandas documentation page linked in the warning states that the results may be correct, but are not reliably correct, because of the unpredictable nature of when an underlying __getitem__ call returns a view vs a copy. Judging by some Stack Overflow discussions, at least one dev is confident that "if you know what you are doing", you can ignore these warnings (or suppress them) and rest assured your results are reliable.
I'm sure that works for him, but even if I managed to convince myself when it's safe to ignore this warning, what happens in a year when I forget whether some old code which throws the warning is reliable or not? Was this written before I figured it out? What happens when someone else is using my code, asks about the warning, and I say "don't worry, it's fine, but I forget why" and wave my hands a lot?
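Here's a small sketch of the kind of situation the warning is guarding against, where an assignment quietly lands on a temporary copy instead of the original DataFrame. The variable names letters_after_chained and letters_after_loc are made up for this illustration, and the warning is silenced here only so the demo output stays readable.

```python
import warnings

import pandas as pd

df = pd.DataFrame({'Number': [100, 200, 300, 400, 500],
                   'Letter': ['a', 'b', 'c', 'd', 'e']})
criteria = df['Number'] > 300

# Chained assignment: df[criteria] builds a new object first, and the
# value is written to that temporary object, so df is left untouched.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silenced for this demo only
    df[criteria]['Letter'] = 'z'
letters_after_chained = df['Letter'].tolist()

# A single .loc call selects the rows and the column together,
# so the assignment is applied to df itself.
df.loc[criteria, 'Letter'] = 'z'
letters_after_loc = df['Letter'].tolist()

print(letters_after_chained)  # ['a', 'b', 'c', 'd', 'e']
print(letters_after_loc)      # ['a', 'b', 'c', 'z', 'z']
```

In this case the code "ran fine" both times, but only the second assignment did what was intended.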
Plus, doesn't that warning just bother you? Either out of prudence or neuroticism, I'm not interested in peppering my logs with warnings from the Pandas devs, and I'm not cavalier enough to suppress the warning messages.
To me, the clean code solution requires using code that provides reliably correct results without these warnings.