Views, Copies, and that annoying SettingWithCopyWarning

If you’ve spent any time in pandas at all, you’ve seen SettingWithCopyWarning. If not, you will soon!

Like any warning, it’s wise not to ignore it, since you get it for a reason: it’s a sign that you’re probably doing something wrong. In my case, I usually get this warning when I’m knee-deep in some analysis and don’t want to spend too much time figuring out how to fix it.

I’m going to cover a few typical examples of when this warning shows up, why it shows up, and how to quickly fix the underlying issue.

First, let’s make an example DataFrame. I’m using a handy Python package called Faker to create some test data. You may need to install it first, with pip.

%pip install Faker  # notebook
pip install Faker   # command line

As a quick aside, Faker is a great way to build test data for unit tests, test databases, or examples. It generates real-looking data that is not personally identifiable, since it’s all fake, but it’s based on rules that generate data combinations you’ll likely encounter in real life.

>>> import datetime
>>> import pandas as pd
>>> import numpy as np
>>> from faker import Faker
>>> fake = Faker()
>>> df = pd.DataFrame([
            [fake.first_name(),
             fake.last_name(),
             fake.date_of_birth(),
             fake.date_this_year(),
             fake.city(),
             fake.state_abbr(),
             fake.postalcode()]
                for _ in range(20)],
            columns = ['first_name', 'last_name', 'dob', 'lastupdate', 'city', 'state', 'zip'])
          
>>> df.head(3)
  first_name last_name         dob  lastupdate          city state    zip
0       Evan   Daniels  1943-05-27  2021-01-11    North Erin    AZ  27597
1  Christine   Herrera  2019-04-11  2021-01-29     Ellenview    AL  28989
2   Michelle    Warren  2015-05-29  2021-01-11  Mcknighttown    VA  55551

How do we set data again?

First, let’s review the ways we can set data in a DataFrame using the loc or iloc indexers. These are for label-based and integer-offset-based indexing, respectively. (See this article for more detail on the two methods.)

The first argument in the indexer is for the row, the second is for the column (or columns), and if we assign to this expression, we will update the underlying DataFrame.

Note that the index here is just a RangeIndex, so the labels are numbers. Because of that, even though I’m passing int values to loc, it is looking rows up by label, not by position.

>>> df.head(1)['zip']
0    27597
Name: zip, dtype: object
>>> df.loc[0, 'zip'] = '60601'
>>> df.head(1)['zip']
0    60601
Name: zip, dtype: object
>>> df.loc[0, ['city', 'state']] = ['Chicago', 'IL']
>>> df.head(1)
  first_name last_name         dob  lastupdate     city state    zip
0       Evan   Daniels  1943-05-27  2021-01-11  Chicago    IL  60601
>>> # Here's an example of an iloc update.
>>> df.iloc[0, 0] = 'Josh'
>>> df.head(1)
  first_name last_name         dob  lastupdate     city state    zip
0       Josh   Daniels  1943-05-27  2021-01-11  Chicago    IL  60601
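
The label-versus-position distinction is easier to see when the labels are not 0, 1, 2. Here is a minimal sketch, assuming a small hypothetical frame called people (not the Faker data above) whose index labels are 10, 20, and 30.

```python
import pandas as pd

# Hypothetical three-row frame whose index labels do not match row positions.
people = pd.DataFrame(
    {
        "first_name": ["Evan", "Christine", "Michelle"],
        "zip": ["27597", "28989", "55551"],
    },
    index=[10, 20, 30],
)

# loc looks up by label: label 20 is the second row.
people.loc[20, "zip"] = "60601"

# iloc looks up by position: row 0, column 1 is the first row's zip.
people.iloc[0, 1] = "73301"

print(people)
```

With a RangeIndex the two happen to line up, which is why the int passed to loc above looks like a position even though it is a label.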

You can also do updates with the array indexing operator ([]), but this can look confusing because on a DataFrame, [] selects columns first, so the column comes before the row. I’d recommend avoiding it for that reason alone, but as you’ll soon see, other issues can arise.

>>> df["first_name"][0] = 'Joshy'
>>> df.head(1)
  first_name last_name         dob  lastupdate     city state    zip
0      Joshy   Daniels  1943-05-27  2021-01-11  Chicago    IL  60601
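
For comparison, the same single-cell update can be written as one indexing step with loc. This is a minimal sketch on a hypothetical two-row frame called names, not the Faker data above.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the example data.
names = pd.DataFrame(
    {
        "first_name": ["Josh", "Christine"],
        "last_name": ["Daniels", "Herrera"],
    }
)

# One indexing call: row label 0 first, then the column name.
names.loc[0, "first_name"] = "Joshy"

print(names.loc[0, "first_name"])
```

The loc form states the row and the column together, so there is no intermediate Series between the selection and the assignment. As far as I know, newer pandas versions that use Copy-on-Write don’t apply the two-step [] form to the original at all, which is one more reason to prefer loc.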

When do we see this warning?

OK, now that we have updated our DataFrame successfully, it’s time to see an example of where things can go wrong. For me, it’s very typical to select a subset of the original data to work with. For example, let’s say that we decide to only work with data where the person was born before 2000.

>>> dob_limit = datetime.date(2000, 1, 1)
>>> sub = df[df['dob'] < dob_limit]
>>> sub.shape
(16, 7)
>>> idx = sub.head(1).index[0]  # save the location for update attempts below
>>> sub.head(1)
  first_name last_name         dob  lastupdate     city state    zip
0      Joshy   Daniels  1943-05-27  2021-01-11  Chicago    IL  60601

Let’s try to update the lastupdate column.

>>> sub.loc[idx, 'lastupdate'] = datetime.date.today()
/Users/mcw/.pyenv/versions/3.8.6/envs/pandas/lib/python3.8/site-packages/pandas/core/indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
<ipython-input-14-5f1769c87aaf>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sub.loc[idx, 'lastupdate'] = datetime.date.today()

Boom! There it is: we’re told that we’re trying to set values on a copy of a slice from a DataFrame. What actually happened here? Well, sub was updated, despite the warning, but df wasn’t.

>>> sub.loc[idx, 'lastupdate']
datetime.date(2021, 2, 4)
>>> df.loc[idx, 'lastupdate']
datetime.date(2021, 1, 11)

Pandas is warning you that you might not have done what you expected. When you created sub, you ended up with a copy of the data in df. When you updated the value, you were warned that you updated only the copy, not the original.
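
If you want to check for yourself that a filtered subset holds its own data, one option is NumPy’s shares_memory function. This is only a sketch with a hypothetical frame (people, with a made-up birth_year column), and it is my own way of checking, not something pandas asks you to do.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; birth_year is an integer column used only for this check.
people = pd.DataFrame(
    {
        "first_name": ["Evan", "Christine", "Michelle"],
        "birth_year": [1943, 2019, 2015],
    }
)

# Boolean-mask filtering, like df[df['dob'] < dob_limit] above.
older = people[people["birth_year"] < 2000]

# Ask NumPy whether the two columns are backed by overlapping memory.
shares = np.shares_memory(
    people["birth_year"].to_numpy(),
    older["birth_year"].to_numpy(),
)
print(shares)
```

When the check reports no shared memory, a write to the subset has no way of reaching the original, which is exactly the situation the warning is about.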

So how should you fix it?

There are two primary ways to address this, and which one you choose depends on what you are trying to accomplish in your code. The warning is telling you that you chose a path that could cause confusion or error down the road, and is pointing you toward using the best practices for updating data.

Update the original

If your intention is to update your original data, you just need to update it directly. So instead of doing your update on sub, do it on df instead.

>>> df.loc[idx, 'lastupdate'] = datetime.date.today()
>>> df.loc[idx, 'lastupdate']
datetime.date(2021, 2, 4)

Note that when you do this, sub isn’t updated, because it’s a copy. If you want both sub and df to match, you need to either update both or recreate sub after the update. Because of this, it’s important to pause and think any time you update a DataFrame: have you created subsets of this data that now need to be refreshed?
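
Here is a minimal sketch of the "recreate it after the update" option, assuming a small hypothetical frame (people) with hard-coded dates in place of the Faker data.

```python
import datetime

import pandas as pd

# Hypothetical stand-in for df, with fixed dates so the result is repeatable.
people = pd.DataFrame(
    {
        "first_name": ["Evan", "Christine", "Vernon"],
        "dob": [
            datetime.date(1943, 5, 27),
            datetime.date(2019, 4, 11),
            datetime.date(1989, 4, 10),
        ],
        "lastupdate": [datetime.date(2021, 1, 11)] * 3,
    }
)
dob_limit = datetime.date(2000, 1, 1)

older = people[people["dob"] < dob_limit]

# Update the original directly.
people.loc[0, "lastupdate"] = datetime.date(2021, 2, 4)

# The earlier subset still holds the old value...
stale = older.loc[0, "lastupdate"]

# ...so rebuild it from the updated original.
older = people[people["dob"] < dob_limit]
fresh = older.loc[0, "lastupdate"]

print(stale, fresh)
```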

Update the copy

If your goal is to update only the copy, you can eliminate the warning by telling pandas explicitly that you want a copy, using the copy method.

>>> sub2 = df[df['dob'] < dob_limit].copy()
>>> sub2.loc[idx, 'lastupdate'] = datetime.date.today()
>>> sub2.loc[idx, 'lastupdate']
datetime.date(2021, 2, 4)
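
To confirm that an explicit copy really is independent, you can update it and then look at the original. This is a sketch with a hypothetical two-row frame (original and working are names I made up).

```python
import pandas as pd

# Hypothetical two-row frame.
original = pd.DataFrame(
    {
        "city": ["North Erin", "Ellenview"],
        "state": ["AZ", "AL"],
    }
)

# Filter, then ask for an explicit copy to work on.
working = original[original["state"] == "AZ"].copy()
working.loc[0, "city"] = "Chicago"

print(working.loc[0, "city"])
print(original.loc[0, "city"])
```

Only working changes here. The original keeps its value, which is the whole point of asking for a copy.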

In between

A common situation is that an initial full-sized DataFrame is narrowed down to a much smaller one by filtering the data. Maybe new columns are added as part of some calculations, and then, as a final step, the original DataFrame should be updated with the results. One way to do that is to use the index to help you out.

>>> sub3 = df[df['dob'] < dob_limit].copy()                                          # we'll be updating this DataFrame
>>> sub3['manualupdate'] = datetime.date.today() - datetime.timedelta(days=10)       # you can modify this DataFrame
>>> sub3 = sub3.head(3)                                                              # or even make it smaller
>>> sub3['manualupdate']
0    2021-01-25
3    2021-01-25
4    2021-01-25
Name: manualupdate, dtype: object

Now we’ll use the fact that sub3 shares an index with the original df to update the data. For example, we can update the lastupdate column for all matching rows.

>>> df.loc[sub3.index, 'lastupdate'] = sub3['manualupdate']
>>> df.loc[sub3.index]
  first_name  last_name         dob  lastupdate          city state    zip
0      Joshy    Daniels  1943-05-27  2021-01-25       Chicago    IL  60601
3     Vernon  Hernandez  1989-04-10  2021-01-25    South Mark    NE  05048
4       Mary      Munoz  1933-03-16  2021-01-25  Ewingborough    OK  31127

Now, you can see that those rows were updated from our smaller subset of data.
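
The same write-back pattern, as a self-contained sketch with hard-coded dates. The frame people and the subset older are hypothetical stand-ins for df and sub3.

```python
import datetime

import pandas as pd

old_date = datetime.date(2021, 1, 11)
new_date = datetime.date(2021, 1, 25)

# Hypothetical stand-in for df.
people = pd.DataFrame(
    {
        "first_name": ["Evan", "Christine", "Vernon", "Mary"],
        "dob": [
            datetime.date(1943, 5, 27),
            datetime.date(2019, 4, 11),
            datetime.date(1989, 4, 10),
            datetime.date(1933, 3, 16),
        ],
        "lastupdate": [old_date] * 4,
    }
)

# Work on an explicit copy of the filtered rows.
older = people[people["dob"] < datetime.date(2000, 1, 1)].copy()
older["manualupdate"] = new_date
older = older.head(2)  # keeps index labels 0 and 2

# Write back using the index labels the subset shares with the original.
people.loc[older.index, "lastupdate"] = older["manualupdate"]

print(people[["first_name", "lastupdate"]])
```

Because the assignment is matched on index labels, rows that are not in the subset are left alone.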

Subsets of columns

You also may encounter this warning when working with subsets of columns in a DataFrame.

>>> df_d = df[['zip']]
>>> df_d.loc[idx, 'zip'] = "00313" # SettingWithCopyWarning

A good way to avoid the warning here is to do a full slice with loc in your initial selection. You can also use copy. Either way, df_d is a separate object, so updating it does not change df.

>>> df_d = df.loc[:, ['zip']]
>>> df_d.loc[idx, 'zip'] = "00313"
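
The copy alternative looks like this. It is a minimal sketch on a hypothetical frame called addresses.

```python
import pandas as pd

# Hypothetical two-row frame.
addresses = pd.DataFrame(
    {
        "city": ["North Erin", "Ellenview"],
        "zip": ["27597", "28989"],
    }
)

# Select one column as a DataFrame and make the copy explicit.
zips = addresses[["zip"]].copy()
zips.loc[0, "zip"] = "00313"

print(zips.loc[0, "zip"])
print(addresses.loc[0, "zip"])
```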

For completeness, some more details

Now you can read about this warning in many other places, and if you’ve come here through a search engine, maybe you’ve already found them either confusing or not directly applicable to your situation. I took a slightly different approach above to show the situation where I usually see this warning. However, a more common reason new pandas users encounter it is trying to update their DataFrame using the array index operator ([]).

>>> df[df['dob'] < dob_limit]['lastupdate'] = datetime.date.today()
file.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df['dob'] < dob_limit]['lastupdate'] = datetime.date.today()

The fix here is pretty straightforward: use loc. Let’s give that a try.

>>> df.loc[df['dob'] < dob_limit, 'lastupdate'] = datetime.date.today() - datetime.timedelta(days=1)
>>> df.loc[df['dob'] < dob_limit].head(1)
  first_name last_name         dob  lastupdate     city state    zip
0      Joshy   Daniels  1943-05-27  2021-02-03  Chicago    IL  60601

That works. The warning here was telling us that our first update was (potentially) operating on a copy of our original data. I don’t think this is quite as obvious as our opening case, because pandas has some complicated reasons for sometimes returning a copy and sometimes returning a view into the original data, and that is easy to miss when the update is on one line. When pandas can detect that this is happening, it issues this warning.

This is called chained assignment. The assignment above with the warning is really doing this:

df.__getitem__(df.__getitem__('dob') < dob_limit).__setitem__('lastupdate', datetime.date.today())

When you use the array index operator, the __getitem__ and __setitem__ methods are invoked for getting and setting respectively. That first function call to __getitem__ returns a copy of the data, and __setitem__ then sets data on that copy, triggering the warning.

If we use loc, though, the whole update is a single __setitem__ call on df.loc, with no intermediate object in between.

df.loc.__setitem__((df.__getitem__('dob') < dob_limit, 'lastupdate'), datetime.date.today())
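
To see those two call sequences without pandas in the way, here is a toy sketch. Tracer is a hypothetical class I made up for illustration. It is not part of pandas, and all it does is record which special methods Python invokes.

```python
class Tracer:
    """Hypothetical stand-in for a DataFrame that logs dunder-method calls."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getitem__(self, key):
        self.log.append(f"{self.name}.__getitem__({key!r})")
        # Hand back a new temporary object, as a filtered selection would.
        return Tracer("temp", self.log)

    def __setitem__(self, key, value):
        self.log.append(f"{self.name}.__setitem__({key!r}, {value!r})")


# Chained form: one __getitem__, then __setitem__ on the temporary object.
chained_log = []
frame = Tracer("frame", chained_log)
frame["mask"]["lastupdate"] = "today"

# Single-step form: one __setitem__ on the original, with a tuple key.
single_log = []
frame = Tracer("frame", single_log)
frame["mask", "lastupdate"] = "today"

print(chained_log)
print(single_log)
```

In the chained form the assignment lands on the temporary object, while in the single-step form it lands on the object you started with. That mirrors the difference between the [] update and the loc update above.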

So whenever you see this warning, just look at your code and check two things. Did you try to update the data using []? If so, switch to loc (or iloc). If you’re doing that and it’s still complaining, it’s because your DataFrame was created from another DataFrame. Either make a full copy if you plan to update it, or update your original DataFrame instead.
