This is the third post in the series on indexing and selecting data in pandas. If you haven’t read the others yet, see the first post, which covers the basics of selecting based on index labels or relative numerical position, and the second post, which covers slicing. In this post, I’m going to talk about boolean indexing, which is the way I usually select subsets of data when I work with pandas.
What is boolean indexing?
For those familiar with NumPy, this may already be second nature, but for beginners it is not so obvious. Boolean indexing works for a given array by passing a boolean vector into the indexing operator ([]), returning all values that are True.
One thing to note: this vector needs to be the same length as the array dimension being indexed.
Let’s look at an example.
>>> import pandas as pd
>>> import numpy as np
>>>
>>> a = np.arange(5)
>>> a
array([0, 1, 2, 3, 4])
Now we can select the first, second and last elements of our array using a list of array indices.
>>> a[[0, 1, 4]]
array([0, 1, 4])
Boolean indexing can do the same, by creating a boolean array of the same size as the entire array, with elements 0, 1 and 4 set to True, all others False.
>>> mask = np.array([True, True, False, False, True])
>>> a[mask]
array([0, 1, 4])
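As a quick check of the length requirement mentioned above, NumPy refuses a boolean mask that doesn't match the size of the dimension being indexed. A minimal sketch:

```python
import numpy as np

a = np.arange(5)
short_mask = np.array([True, False, True])  # only 3 elements for a 5-element array

try:
    a[short_mask]
except IndexError as e:
    # NumPy rejects the mismatched mask rather than guessing what you meant
    print(type(e).__name__)
```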
Boolean operators
So now we know how to index our array with a single boolean array. But building that array by hand is a pain, so what you will usually end up doing is applying operations to the original array that themselves return a boolean array.
For example, to select all elements less than 3:
>>> a[a < 3]
array([0, 1, 2])
or all even elements:
>>> a[a % 2 == 0]
array([0, 2, 4])
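It's worth seeing what the inner expression in these examples evaluates to on its own: the comparison returns a boolean array, which is then used as the mask.

```python
import numpy as np

a = np.arange(5)
mask = a < 3          # the comparison itself produces a boolean array
print(mask.tolist())  # [True, True, True, False, False]
print(a[mask])        # same result as a[a < 3]
```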
And we can combine these using expressions, to AND them (&) or OR them (|). With these operators, we can select the same elements as in our first example.
>>> a[(a < 2) | (a >= 4)]
array([0, 1, 4])
Another very helpful operator is the inverse or NOT operator (~). Remember to watch your parentheses.
>>> a[~((a < 2) | (a >= 4))]
array([2, 3])
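One way to sanity-check an inverted expression is De Morgan's law: negating an OR of conditions gives the same mask as ANDing the negated conditions. A small sketch:

```python
import numpy as np

a = np.arange(5)
# ~(x | y) is equivalent to (~x) & (~y)
inverted = ~((a < 2) | (a >= 4))
direct = (a >= 2) & (a < 4)
print(np.array_equal(inverted, direct))  # True
```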
On to pandas
In pandas, boolean indexing works pretty much like in NumPy, especially with a Series. You pass in a vector the same length as the Series. Note that this vector doesn’t have to have an index, but if you use a Series as the argument, it does have an index, so you need to be aware of how your index aligns. A common method of using boolean indexing is to apply functions to the original Series so your index will always match.
Series
>>> s = pd.Series(np.arange(5), index=list("abcde"))
>>> s
a    0
b    1
c    2
d    3
e    4
dtype: int64
>>> s[[True, True, False, False, True]]  # this vector is just a list of boolean values
a    0
b    1
e    4
dtype: int64
>>> s[np.array([True, True, False, False, True])]  # this vector is a NumPy array of boolean values
a    0
b    1
e    4
dtype: int64
But, since our index is not the default (i.e. not a RangeIndex), if we use another Series of the same length, it will not work. It needs a matching index, and the default index created below doesn’t match our character index. So we have to specify the index to match.
>>> try:
...     s[pd.Series([True, True, False, False, True])]
... except Exception as ie:
...     print(ie)
Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
>>> s[pd.Series([True, True, False, False, True], index=list("abcde"))]
a    0
b    1
e    4
dtype: int64
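Alignment also means the order of labels in a boolean Series indexer doesn't matter; pandas matches on labels, not positions. A sketch (the reversed index here is my own example):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=list("abcde"))
# same labels as s, but in reverse order
mask = pd.Series([True, False, True, False, True], index=list("edcba"))

# pandas aligns the mask to s's labels before indexing:
# a=True, b=False, c=True, d=False, e=True
print(s[mask].index.tolist())  # ['a', 'c', 'e']
```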
But instead of making a new Series, we’ll just base all of our expressions on our source data Series or DataFrame; then they’ll share an index.
>>> # just like before with NumPy
>>> s[(s < 2) | (s > 3)]
a    0
b    1
e    4
dtype: int64
Note that you need to surround each comparison with parentheses because | and & bind more tightly than the comparison operators. Without them, the example above parses as s < (2 | s) > 3, a chained comparison that Python evaluates with an implicit and between two Series. You’ll realize you’ve forgotten the parentheses when you get a complaint that the truth value of a Series is ambiguous. See?
>>> s[s < 2 | s > 3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mcw/.pyenv/versions/pandas/lib/python3.8/site-packages/pandas/core/generic.py", line 1329, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
DataFrame
We can also do boolean indexing on DataFrames. A popular way to create the boolean vector is to use one or more of the columns of the DataFrame.
>>> df = pd.DataFrame({'x': np.arange(5), 'y': np.arange(5, 10)})
>>> df[df['x'] < 3]
   x  y
0  0  5
1  1  6
2  2  7
You can also supply multiple conditions, just like before with Series. (Remember those parentheses!)
>>> df[(df['x'] < 3) & (df['y'] > 5)]
   x  y
1  1  6
2  2  7
.loc, .iloc, and []
If you remember from previous posts, pandas has three primary ways to index its containers. The indexing operator ([]) is sort of a hybrid of label-based and position-based indexing. .loc is meant for index labels; .iloc is for integer-based indexing. The good news is that all of them accept boolean arrays and return subsets of the underlying container.
>>> mask = (df['x'] < 3) & (df['y'] > 5)
>>> mask
0    False
1     True
2     True
3    False
4    False
dtype: bool
>>> df[mask]
   x  y
1  1  6
2  2  7
>>> df.loc[mask]
   x  y
1  1  6
2  2  7
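One handy consequence of .loc accepting boolean arrays is that you can filter rows and select columns in a single step, passing the mask as the row indexer and a label (or list of labels) as the column indexer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(5), 'y': np.arange(5, 10)})
mask = (df['x'] < 3) & (df['y'] > 5)

# rows where the mask is True, restricted to column 'y'
print(df.loc[mask, 'y'].tolist())  # [6, 7]
```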
Note that .iloc is a little different from the others; if you pass in this mask, you’ll get an exception.
>>> try:
...     df.iloc[mask]
... except NotImplementedError as nie:
...     print(nie)
...
iLocation based boolean indexing on an integer type is not available
This is by design: .iloc is only intended to take positional arguments. Our mask, however, is a Series with an index, so it is rejected. You can still pass in a boolean vector, just pass in the underlying values without the index.
>>> df.iloc[mask.to_numpy()]
   x  y
1  1  6
2  2  7
>>> # or
>>> df.iloc[mask.values]
   x  y
1  1  6
2  2  7
Examples!
I think one of the most helpful things when thinking about boolean indexing is to see some examples. You are limited only by what you can express as a combination of boolean expressions on your data, grouped carefully with parentheses. It can also help to break the problem into pieces as you work on it.
In this series I’ve been grabbing data from the Chicago Data Portal. This time, I thought the list of lobbyists might be interesting. Due to the need for lobbyists to re-register every year, there’s some repeating data. Let’s take a look.
>>> # you should be able to grab this dataset as an unauthenticated user, but you can be rate limited
>>> lbys = pd.read_json("https://data.cityofchicago.org/resource/tq3e-t5yq.json")
>>> lbys.dtypes
year               int64
lobbyist_id        int64
salutation        object
first_name        object
last_name         object
address_1         object
city              object
state             object
zip               object
country           object
email             object
phone             object
fax               object
employer_id        int64
employer_name     object
created_date      object
middle_initial    object
address_2         object
suffix            object
dtype: object
>>> lbys['created_date'] = pd.to_datetime(lbys['created_date'])
>>> # I'll drop the personally identifiable data, just to be nice
>>> lbys = lbys.drop(['email', 'phone', 'fax', 'last_name'], axis=1)
>>>
>>> lbys.head(3)
   year  lobbyist_id salutation first_name            address_1  ...              employer_name created_date middle_initial address_2 suffix
0  2020        18883        MR.   PERICLES  6969 W WABANSIA AVE  ...  THE PERICLES ORGANIZATION   2020-08-07            NaN       NaN    NaN
1  2019        18883        NaN   PERICLES  6969 W WABANSIA AVE  ...  THE PERICLES ORGANIZATION   2020-01-21            NaN       NaN    NaN
2  2018        18883        NaN   PERICLES  6969 W WABANSIA AVE  ...  THE PERICLES ORGANIZATION   2018-12-12            NaN       NaN    NaN

[3 rows x 15 columns]
In terms of examples, there’s not really too much complexity to deal with, but here are a few to give you an idea of what boolean indexing looks like.
>>> lbys[lbys['year'] == 2020]  # all lobbyists registered in 2020
     year  lobbyist_id salutation first_name  ... created_date middle_initial   address_2 suffix
0    2020        18883        MR.   PERICLES  ...   2020-08-07            NaN         NaN    NaN
4    2020        17521        MR.      STEVE  ...   2020-01-15            NaN         NaN    NaN
..    ...          ...        ...        ...  ...          ...            ...         ...    ...
998  2020        24740       MRS.  CHRISONIA  ...   2020-01-16             D.  SUITE 1700    NaN
999  2020        15081        MS.        LIZ  ...   2020-05-29            NaN    STE. 900    NaN

[125 rows x 15 columns]
>>> lbys[(lbys['year'] == 2020) & (lbys['city'] == 'CHICAGO')]  # lobbyists registered in 2020 from Chicago
     year  lobbyist_id salutation first_name  ... created_date middle_initial   address_2 suffix
0    2020        18883        MR.   PERICLES  ...   2020-08-07            NaN         NaN    NaN
4    2020        17521        MR.      STEVE  ...   2020-01-15            NaN         NaN    NaN
..    ...          ...        ...        ...  ...          ...            ...         ...    ...
998  2020        24740       MRS.  CHRISONIA  ...   2020-01-16             D.  SUITE 1700    NaN
999  2020        15081        MS.        LIZ  ...   2020-05-29            NaN    STE. 900    NaN

[76 rows x 15 columns]
>>> # let's get the most popular employer for 2020
>>> pop_emp_id = lbys[lbys['year'] == 2020].groupby('employer_id').count().sort_values(by='lobbyist_id', ascending=False).index[0]
>>> # who works for them?
>>> lbys[(lbys['employer_id'] == pop_emp_id) & (lbys['year'] == 2020)]
     year  lobbyist_id salutation first_name          address_1  ...                         employer_name created_date middle_initial  address_2 suffix
619  2020        24484        NaN      BRIAN   1330 W FULTON ST  ...  STERLING BAY, LLC AND ITS AFFILIATES   2020-01-15            NaN    STE 800    NaN
654  2020        24106        NaN     HOWARD  1330 W. FULTON ST  ...  STERLING BAY, LLC AND ITS AFFILIATES   2020-01-17            NaN  SUITE 800    NaN
982  2020        23828        MS.     SHELLY  1330 W. FULTON ST  ...  STERLING BAY, LLC AND ITS AFFILIATES   2020-01-15            NaN  SUITE 800    NaN

[3 rows x 15 columns]
If we only want to deal with 2020 data, we can just make a new, smaller DataFrame with that data.
>>> lbys = lbys[lbys['year'] == 2020]
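One caveat: if you plan to modify the filtered subset later, it's a good idea to chain .copy() so pandas gives you an independent DataFrame rather than a potential view; otherwise later assignments may trigger a SettingWithCopyWarning. A sketch with toy data of my own (not the real dataset):

```python
import pandas as pd

# hypothetical miniature version of the lobbyist data
lbys = pd.DataFrame({'year': [2019, 2020, 2020],
                     'city': ['CHICAGO', 'CHICAGO', 'SPRINGFIELD']})

# .copy() makes the subset independent of the original frame
lbys_2020 = lbys[lbys['year'] == 2020].copy()
lbys_2020['city'] = lbys_2020['city'].str.title()

print(lbys_2020['city'].tolist())  # ['Chicago', 'Springfield']
print(lbys['city'].tolist())       # original frame is untouched
```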
Boolean indexing with isin
A helpful method that is often paired with boolean indexing is Series.isin. It returns a boolean vector that is True for each row whose value matches one of the elements passed in as the argument.
>>> lbys['state'].tail()
995    MO
996    MO
997    IL
998    IL
999    IL
Name: state, dtype: object
>>> lbys['state'].isin(['IL']).tail()
995    False
996    False
997     True
998     True
999     True
Name: state, dtype: bool
>>> lbys[lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN'])]  # lobbyists from bordering states
     year  lobbyist_id salutation first_name  ... created_date middle_initial      address_2 suffix
164  2020        24879        NaN        DAN  ...   2020-06-15            NaN            NaN    NaN
786  2012         4644        MR.      BRIAN  ...   2012-05-25            NaN            NaN    NaN
788  2019        15543        MR.    LORENZO  ...   2019-01-10            NaN            NaN    NaN
789  2018        15543        MR.    LORENZO  ...   2018-01-17            NaN            NaN    NaN
790  2017        15543        MR.    LORENZO  ...   2017-01-19            NaN            NaN    NaN
791  2020        15543        MR.    LORENZO  ...   2020-01-10            NaN            NaN    NaN
792  2016        15543        MR.    LORENZO  ...   2016-10-13            NaN            NaN    NaN
994  2016        10222        MS.    MARILYN  ...   2016-01-13            NaN  MO1-800-14-40    NaN
995  2015        10222        MS.    MARILYN  ...   2015-01-14            NaN  MO1-800-14-40    NaN
996  2014        10222        MS.    MARILYN  ...   2014-05-06            NaN  MO1-800-14-40    NaN

[10 rows x 15 columns]
I’ll wrap it up with a slightly more complicated expression.
>>> lbys[
...     ~(lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN'])) &  # lobbyists NOT from bordering states
...     (lbys['state'] != 'IL') &                                # and NOT from IL
...     (lbys['created_date'] >= '2020-07-01')                   # created in the last half of the year
... ]
     year  lobbyist_id salutation first_name  ... created_date middle_initial       address_2 suffix
26   2021        24967        NaN        JON  ...   2020-12-27            NaN       SUITE 104    NaN
27   2020        24967        NaN        JON  ...   2020-10-21            NaN       SUITE 104    NaN
34   2021        24901        NaN     JOSEPH  ...   2020-12-20            NaN             NaN    NaN
160  2021        23782        MR.       ALAN  ...   2021-01-01              P   SUITE 404-352    NaN
227  2020        24909        MS.   LAKEITHA  ...   2020-07-14            NaN             NaN    NaN
332  2021        22341        NaN   CAROLINE  ...   2020-12-23            NaN             NaN    NaN
511  2021         5361        NaN        JAY  ...   2020-12-27            NaN       SUITE 450    NaN
609  2021         4561        NaN       BRET  ...   2020-12-18            NaN             NaN    NaN
660  2021        13801        MR.       JOHN  ...   2020-12-27            NaN  MC 482-C30-C76    NaN
700  2020        24925        NaN       TAMI  ...   2020-08-06            NaN             NaN    NaN
862  2020        24969        MR.       ALEX  ...   2020-08-31            NaN             NaN    NaN
951  2021         6164        MS.  GABRIELLE  ...   2020-12-21            NaN      73RD FLOOR    NaN

[12 rows x 15 columns]
I also find it sometimes helpful to create a variable for storing the mask. For the example above, instead of having to parse the entire expression when reading the code, expressive variable names for the parts of the indexing expression can make the intent clearer.
>>> non_bordering = ~(lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN']))
>>> non_illinois = (lbys['state'] != 'IL')
>>>
>>> # more readable maybe?
>>> lbys[non_bordering & non_illinois & (lbys['created_date'] >= '2020-07-01')]
Often when building a complex expression, it can be helpful to build it in pieces, so assigning parts of the mask to variables can make a much more complicated expression easier to read, at the cost of extra variables to deal with. In general, I use variables in the mask if I have to reuse them multiple times, but if only used once, I do the entire expression in place.
In summary, boolean indexing is really quite simple, but powerful. I’ll be looking at a few other ways to select data in pandas in upcoming posts.