This is the fifth post in a series on indexing and selecting in pandas. If you are jumping in the middle and want to get caught up, here’s what has been discussed so far:
- Basic indexing, selecting by label and location
- Slicing in pandas
- Selecting by boolean indexing
- Selecting by callable
Once the basics were covered in the first three posts, we were able to move on to more detailed topics on how to select and update data. In this post, we'll look at selecting using `where` and `mask`.
In the third post of this series, we covered the concept of boolean indexing. If you remember, boolean indexing allows us to essentially query our data (either a `DataFrame` or a `Series`) and return only the data that matches the boolean vector we use as our indexer. For example, we can select the odd values in a `Series` like this:
```python
>>> import pandas as pd
>>> import numpy as np
>>>
>>> s = pd.Series(np.arange(10))
>>> s[s % 2 != 0]
1    1
3    3
5    5
7    7
9    9
dtype: int64
```
Where?
You'll notice that our result here is only 5 elements even though the original `Series` contains 10. This is the whole point of indexing: selecting only the values you want. But what happens if you want the shape of your result to match your original data? In this case, you use `where`. The values selected by the condition are returned; the values that are not selected are set to `NaN`. This way, the result has the same shape as your original data.
```python
>>> s.where(s % 2 != 0)
0    NaN
1    1.0
2    NaN
3    3.0
4    NaN
5    5.0
6    NaN
7    7.0
8    NaN
9    9.0
dtype: float64
```
`where` also accepts an optional argument for what you want the other values to be, if `NaN` is not what you want, with some flexibility. For starters, it can be a scalar or a `Series`/`DataFrame`.
```python
>>> s.where(s % 2 != 0, -1)  # set the non-matching elements to one value
0   -1
1    1
2   -1
3    3
4   -1
5    5
6   -1
7    7
8   -1
9    9
dtype: int64
>>> s.where(s % 2 != 0, -s)  # set the non-matching elements to the negative value of the original series
0    0
1    1
2   -2
3    3
4   -4
5    5
6   -6
7    7
8   -8
9    9
dtype: int64
```
Both the condition (the first argument, `cond`) and `other` can be callables that accept the `Series` or `DataFrame` and return either a scalar or a `Series`/`DataFrame`. The condition callable should return booleans; the `other` callable can return whatever value you want for non-selected elements. So the above could be expressed (more verbosely) as
```python
s.where(lambda x: x % 2 != 0, lambda x: x * -1)
```
Using `where` will always return a copy of the existing data. If you want to modify the original instead, you can use the `inplace` argument, as with many other pandas methods (like `fillna` or `ffill`).
```python
>>> s.where(s % 2 != 0, -s, inplace=True)
>>> s
0    0
1    1
2   -2
3    3
4   -4
5    5
6   -6
7    7
8   -8
9    9
dtype: int64
```
The `mask` method is just the inverse of `where`. Instead of selecting values based on the condition, it selects values where the condition is `False`. Everything else works the same as above.
```python
>>> s.mask(s % 2 != 0, 99)
0     0
1    99
2    -2
3    99
4    -4
5    99
6    -6
7    99
8    -8
9    99
dtype: int64
```
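One way to convince yourself of this relationship is that `mask(cond)` and `where(~cond)` produce identical results. A minimal sketch (using a fresh `Series`, not the one modified in place above):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))
cond = s % 2 != 0

# mask replaces values where cond is True; where replaces values where cond is False
masked = s.mask(cond, 99)
inverted = s.where(~cond, 99)

assert masked.equals(inverted)
print(masked.tolist())  # [0, 99, 2, 99, 4, 99, 6, 99, 8, 99]
```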
Updating data
One thing I noticed while writing this, which I had missed before, is that `where` is the underlying implementation for boolean indexing with the array indexing operator (`[]`) on a `DataFrame`. So you've already been using `where` even if you didn't know it.
This has implications for updating data. So with a `DataFrame` of random floats centered around 0,
```python
>>> df = pd.DataFrame(np.random.random_sample((5, 5)) - .5)
>>> df
          0         1         2         3         4
0 -0.326058 -0.205408 -0.394306  0.365862  0.141009
1  0.394965  0.283149 -0.014750  0.279396 -0.172909
2 -0.141023 -0.297178 -0.247611 -0.170736  0.229474
3 -0.276158 -0.438667 -0.290731  0.317484 -0.378233
4 -0.018927  0.354160 -0.254558 -0.056842 -0.245184
>>> df.where(df < 0)  # these two are equivalent
>>> df[df < 0]
          0         1         2         3         4
0 -0.326058 -0.205408 -0.394306       NaN       NaN
1       NaN       NaN -0.014750       NaN -0.172909
2 -0.141023 -0.297178 -0.247611 -0.170736       NaN
3 -0.276158 -0.438667 -0.290731       NaN -0.378233
4 -0.018927       NaN -0.254558 -0.056842 -0.245184
```
You can also do updates this way. This is not necessarily that practical for most `DataFrame`s I work with, though, because I rarely have a `DataFrame` where I want to update across all the columns like this. But in some instances it might be useful, so here's an example. We could force all the values to be positive by inverting only the negative values.
```python
>>> df[df < 0] = -df
>>> df
          0         1         2         3         4
0  0.326058  0.205408  0.394306  0.365862  0.141009
1  0.394965  0.283149  0.014750  0.279396  0.172909
2  0.141023  0.297178  0.247611  0.170736  0.229474
3  0.276158  0.438667  0.290731  0.317484  0.378233
4  0.018927  0.354160  0.254558  0.056842  0.245184
```
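Since boolean-indexed assignment is built on `where`, the same update can be written as a single `where` expression instead of an in-place assignment. A sketch of the equivalence:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random_sample((5, 5)) - .5)

# keep non-negative values, replace negative values with their negation
flipped = df.where(df >= 0, -df)

# this is equivalent to the in-place `df[df < 0] = -df`, and to df.abs()
assert flipped.equals(df.abs())
```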
NumPy `where`
If you're a NumPy user, you are probably familiar with `np.where`. To use it, you supply a condition and optional `x` and `y` values for the `True` and `False` results of the condition. This is a bit different from `where` in pandas, where the object itself provides the data for the `True` result, with an optional `False` result (which defaults to `NaN` if not supplied). So here's how you'd use it to select odd values in our `Series`, and set the even values to `99`.
```python
>>> np.where(s % 2 != 0, s, 99)
array([99,  1, 99,  3, 99,  5, 99,  7, 99,  9])
```
Another way to think about this is that the pandas implementation can be used like the NumPy version: just think of the `self` argument of the `Series` or `DataFrame` as the `x` argument in NumPy.
```python
>>> pd.Series.where(cond=s % 2 != 0, self=s, other=99)
0    99
1     1
2    99
3     3
4    99
5     5
6    99
7     7
8    99
9     9
dtype: int64
```
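To see the correspondence concretely, the two calls produce the same values; the difference is that pandas preserves the index and returns a `Series`, while NumPy returns a plain array. A quick sketch (again on a fresh `Series`):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10))

from_numpy = np.where(s % 2 != 0, s, 99)  # plain ndarray, no index
from_pandas = s.where(s % 2 != 0, 99)     # Series with the original index

assert isinstance(from_numpy, np.ndarray)
assert isinstance(from_pandas, pd.Series)
assert (from_pandas.to_numpy() == from_numpy).all()
```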
More examples
Let’s go back to a data set from a previous post. This is salary info for City of Chicago employees for both hourly and salaried employees.
One thing I noticed about this data set last time was that there were a lot of `NaN` values because of the different treatment of salaried and hourly employees. As a result, there's a column for annual salary, and separate columns for typical hours and hourly rate. What if we just want to know what a typical full salary would be for any employee, regardless of their category?
Using `where` is one way we could address this if we wanted a uniform data set. This time we won't apply it to the entire `DataFrame` as in the update example above; we'll use it to create a single column.
```python
>>> # you should be able to grab this dataset as an unauthenticated user, but you can be rate limited
>>> # it also only returns 1000 rows (or at least it did for me without an API key)
>>> sal = pd.read_json("https://data.cityofchicago.org/resource/xzkq-xp2w.json")
>>> sal = sal.drop('name', axis=1)  # remove personal info
>>> sal['total_pay'] = sal['annual_salary'].where(sal['salary_or_hourly'] == 'Salary',
...                                               sal['typical_hours'] * sal['hourly_rate'] * 52)
```
Another way to do this would be to selectively update only the rows for hourly workers. But to do this, you end up needing to apply a mask multiple times.
```python
>>> sal['total_pay2'] = sal['annual_salary']
>>> mask = sal['salary_or_hourly'] != 'Salary'
>>> sal.loc[mask, 'total_pay2'] = sal.loc[mask, 'typical_hours'] * sal.loc[mask, 'hourly_rate'] * 52
```
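To see that the two approaches agree, here's a sketch on a small made-up frame with the same column names as the salary data (the values are invented for illustration, not taken from the real data set):

```python
import numpy as np
import pandas as pd

# tiny stand-in for the Chicago salary data (invented values)
sal = pd.DataFrame({
    'salary_or_hourly': ['Salary', 'Hourly', 'Salary', 'Hourly'],
    'annual_salary':    [90000.0, np.nan, 72000.0, np.nan],
    'typical_hours':    [np.nan, 40.0, np.nan, 20.0],
    'hourly_rate':      [np.nan, 25.0, np.nan, 30.0],
})

# approach 1: a single where expression
sal['total_pay'] = sal['annual_salary'].where(
    sal['salary_or_hourly'] == 'Salary',
    sal['typical_hours'] * sal['hourly_rate'] * 52)

# approach 2: masked .loc assignment, applying the mask multiple times
sal['total_pay2'] = sal['annual_salary']
mask = sal['salary_or_hourly'] != 'Salary'
sal.loc[mask, 'total_pay2'] = (sal.loc[mask, 'typical_hours']
                               * sal.loc[mask, 'hourly_rate'] * 52)

assert sal['total_pay'].equals(sal['total_pay2'])
print(sal['total_pay'].tolist())  # [90000.0, 52000.0, 72000.0, 31200.0]
```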
So using `where` can result in a slightly simpler expression, even if it's a little long.
NumPy `where` and `select` for more complicated updates
There are times when you want to create new columns using some sort of complicated condition on a dataframe, possibly applied across multiple columns. Using NumPy's `where` can be helpful in these situations. For example, we can create an hourly rate column that calculates an hourly equivalent for the salaried employees, but uses the existing hourly rate for everyone else.
```python
>>> sal['hourly_rate_all'] = np.where(sal['salary_or_hourly'] == 'Salary',
...                                   sal['annual_salary'] / (52 * 40),
...                                   sal['hourly_rate'])
```
If you have a much more complex scenario, you can use `np.select`. Think of `np.select` as a `where` with multiple conditions and multiple choices, as opposed to just one condition with two choices. For example, let's say that the hourly rate for employees in the police and fire departments was slightly different because of their shift schedule, so their calculation was different. We could do the calculation in one pass. Note that I chose to use the hourly rate as the default (using the last parameter), but could have just as easily made it a third condition.
```python
>>> conditions = [
...     (sal['salary_or_hourly'] == 'Salary') & (sal['department'].isin(['POLICE', 'FIRE'])),
...     (sal['salary_or_hourly'] == 'Salary') & (~sal['department'].isin(['POLICE', 'FIRE'])),
... ]
>>> choices = [
...     sal['annual_salary'] / (26 * 75),
...     sal['annual_salary'] / (52 * 40)
... ]
>>> sal['hourly_rate_all2'] = np.select(conditions, choices, sal['hourly_rate'])
>>> sal.head()[['department', 'hourly_rate_all', 'hourly_rate_all2']]
    department  hourly_rate_all  hourly_rate_all2
0       POLICE        53.578846         57.150769
1       POLICE        45.250962         48.267692
2         DAIS        57.023077         57.023077
3  WATER MGMNT        56.284615         56.284615
4   TRANSPORTN        44.400000         44.400000
```
In summary, pandas' `where` and `mask` methods are useful ways to select and update data in the same shape as your original data. NumPy's `where` and `select` can also be very useful for more complicated scenarios.
Stay tuned for the next post in this series, where I'll look at the `query` method.