Boolean Indexing in Pandas

This is the third post in the series on indexing and selecting data in pandas. If you haven’t read the others yet, see the first post that covers the basics of selecting based on index or relative numerical indexing, and the second post, that talks about slicing. In this post, I’m going to talk about boolean indexing which is the way that I usually select subsets of data when I work with pandas.

What is boolean indexing?

For those familiar with NumPy, this may already be second nature, but for beginners it is not so obvious. Boolean indexing works for a given array by passing a boolean vector into the indexing operator ([]), returning all values that are True.

One thing to note, this array needs to be the same length as the array dimension being indexed.

Let’s look at an example.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> import pandas as pd
>>> import numpy as np
>>>
>>> a = np.arange(5)
>>> a
array([0, 1, 2, 3, 4])
>>> import pandas as pd >>> import numpy as np >>> >>> a = np.arange(5) >>> a array([0, 1, 2, 3, 4])
>>> import pandas as pd
>>> import numpy as np
>>>
>>> a = np.arange(5)
>>> a
array([0, 1, 2, 3, 4])

Now we can select the first, second and last elements of our array using a list of array indices.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> a[[0, 1, 4]]
array([0, 1, 4])
>>> a[[0, 1, 4]] array([0, 1, 4])
>>> a[[0, 1, 4]]
array([0, 1, 4])

Boolean indexing can do the same, by creating a boolean array of the same size as the entire array, with elements 0, 1 and 4 set to True, all others False.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> mask = np.array([True, True, False, False, True])
>>> a[mask]
array([0, 1, 4])
>>> mask = np.array([True, True, False, False, True]) >>> a[mask] array([0, 1, 4])
>>> mask = np.array([True, True, False, False, True])
>>> a[mask]
array([0, 1, 4])

Boolean operators

So now we know how to index our array with a single boolean array. But building that array by hand is a pain, so what you will usually end up doing is applying operations to the original array that return a boolean array themselves.

For example, to select all elements less than 3:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> a[a < 3]
array([0, 1, 2])
>>> a[a < 3] array([0, 1, 2])
>>> a[a < 3]
array([0, 1, 2])

or all even elements:

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> a[a % 2 == 0]
array([0, 2, 4])
>>> a[a % 2 == 0] array([0, 2, 4])
>>> a[a % 2 == 0]
array([0, 2, 4])

And we can combine these using expressions, to AND them (&) or OR them (|). With these operators, we can select the same elements from our first example.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> a[(a < 2) | (a >= 4)]
array([0, 1, 4])
>>> a[(a < 2) | (a >= 4)] array([0, 1, 4])
>>> a[(a < 2) | (a >= 4)]
array([0, 1, 4])

Another very helpful operators is the inverse or not operator, (~). Remember to watch your parentheses.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> a[~((a < 2) | (a >= 4))]
array([2, 3])
>>> a[~((a < 2) | (a >= 4))] array([2, 3])
>>> a[~((a < 2) | (a >= 4))]
array([2, 3])

On to pandas

In pandas, boolean indexing works pretty much like in NumPy, especially in a Series. You pass in a vector the same length as the Series. Note that this vector doesn’t have to have an index, but if you use a Series as the argument, it does have an index so you need to be aware of how your index aligns. A common method of using boolean indexing is to apply functions to the original Series so your index will always match.

Series

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> s = pd.Series(np.arange(5), index=list("abcde"))
>>> s
a 0
b 1
c 2
d 3
e 4
dtype: int64
>>> s[[True, True, False, False, True]] # this vector is just a list of boolean values
a 0
b 1
e 4
dtype: int64
>>> s[np.array([True, True, False, False, True])] # this vector is a NumPy array of boolean values
a 0
b 1
e 4
dtype: int64
>>> s = pd.Series(np.arange(5), index=list("abcde")) >>> s a 0 b 1 c 2 d 3 e 4 dtype: int64 >>> s[[True, True, False, False, True]] # this vector is just a list of boolean values a 0 b 1 e 4 dtype: int64 >>> s[np.array([True, True, False, False, True])] # this vector is a NumPy array of boolean values a 0 b 1 e 4 dtype: int64
>>> s = pd.Series(np.arange(5), index=list("abcde"))
>>> s
a    0
b    1
c    2
d    3
e    4
dtype: int64
>>> s[[True, True, False, False, True]]            # this vector is just a list of boolean values
a    0
b    1
e    4
dtype: int64
>>> s[np.array([True, True, False, False, True])]  # this vector is a NumPy array of boolean values
a    0
b    1
e    4
dtype: int64

But, since our index is not the default (i.e. not a RangeIndex), if we use another Series of the same length, it will not work. It needs a matching index, and the default index created below doesn’t match our character index. So we have to specify the index to match.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> try:
... s[pd.Series([True, True, False, False, True])]
... except Exception as ie:
... print(ie)
Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
>>> s[pd.Series([True, True, False, False, True], index=list("abcde"))]
a 0
b 1
e 4
dtype: int64
>>> try: ... s[pd.Series([True, True, False, False, True])] ... except Exception as ie: ... print(ie) Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match). >>> s[pd.Series([True, True, False, False, True], index=list("abcde"))] a 0 b 1 e 4 dtype: int64
>>> try:
...     s[pd.Series([True, True, False, False, True])]
... except Exception as ie:
...     print(ie)

Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
>>> s[pd.Series([True, True, False, False, True], index=list("abcde"))]
a    0
b    1
e    4
dtype: int64

But instead of making a new Series, we’ll just base all of our expressions on our source data Series or DataFrame, then they’ll share an index.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> # just like before with NumPy
>>> s[(s < 2) | (s > 3)]
a 0
b 1
e 4
dtype: int64
>>> # just like before with NumPy >>> s[(s < 2) | (s > 3)] a 0 b 1 e 4 dtype: int64
>>> # just like before with NumPy
>>> s[(s < 2) | (s > 3)]
a    0
b    1
e    4
dtype: int64

Make note that you need to surround each expression with parentheses because the Python parser will apply the boolean operators incorrectly. For the example above, it would apply it as s < (2 | s ) < 3. You’ll realize you’re forgetting parentheses when you get complaints about the boolean operators being applied to a series. See?

Indexing in pandas can be so confusing

There are so many ways to do the same thing! What is the difference between .loc, .iloc, .ix, and []?  You can read the official documentation but there's so much of it and it seems so confusing. You can ask a question on Stack Overflow, but you're just as likely to get too many different and confusing answers as no answer at all. And existing answers don't fit your scenario.

You just need to get started with the basics.

What if you could quickly learn the basics of indexing and selecting data in pandas with clear examples and instructions on why and when you should use each one? What if the examples were all consistent, used realistic data, and included extra relevant background information?

Master the basics of pandas indexing with my free ebook. You'll learn what you need to get comfortable with pandas indexing. Covered topics include:

  • what an index is and why it is needed
  • how to select data in both a Series and DataFrame.
  • the difference between .loc, .iloc, .ix, and [] and when (and if) you should use them.
  • slicing, and how pandas slicing compares to regular Python slicing
  • boolean indexing
  • selecting via callable
  • how to use where and mask.
  • how to use query, and how it can help performance
  • time series indexing

Because it's highly focused, you'll learn the basics of indexing and be able to fall back on this knowledge time and again as you use other features in pandas.

Just give me your email and you'll get the free 57 page e-book, along with helpful articles about Python, pandas, and related technologies once or twice a month. Unsubscribe at any time.

Invalid email address
Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> s[s < 2 | s > 3]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mcw/.pyenv/versions/pandas/lib/python3.8/site-packages/pandas/core/generic.py", line 1329, in __nonzero__
raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> s[s < 2 | s > 3] Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/mcw/.pyenv/versions/pandas/lib/python3.8/site-packages/pandas/core/generic.py", line 1329, in __nonzero__ raise ValueError( ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> s[s < 2 | s > 3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mcw/.pyenv/versions/pandas/lib/python3.8/site-packages/pandas/core/generic.py", line 1329, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

DataFrame

We can also do boolean indexing on DataFrames. A popular way to create the boolean vector is to use one or more of the columns of the DataFrame.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> df = pd.DataFrame({'x': np.arange(5), 'y': np.arange(5, 10)})
>>> df[df['x'] < 3]
x y
0 0 5
1 1 6
2 2 7
>>> df = pd.DataFrame({'x': np.arange(5), 'y': np.arange(5, 10)}) >>> df[df['x'] < 3] x y 0 0 5 1 1 6 2 2 7
>>> df = pd.DataFrame({'x': np.arange(5), 'y': np.arange(5, 10)})
>>> df[df['x'] < 3]
   x  y
0  0  5
1  1  6
2  2  7

You can also supply multiple conditions, just like before with Series. (Remember those parentheses!)

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> df[(df['x'] < 3) & (df['y'] > 5)]
x y
1 1 6
2 2 7
>>> df[(df['x'] < 3) & (df['y'] > 5)] x y 1 1 6 2 2 7
>>> df[(df['x'] < 3) & (df['y'] > 5)]
   x  y
1  1  6
2  2  7

.loc.iloc, and []

If you remember from previous posts, pandas has three primary ways to index the containers. The indexing operator ([]) is sort of a hybrid of using the index labels or location based offsets. .loc is meant for using the index labels, .iloc is for integer based indexing. The good news is that all of them accept boolean arrays, and return subsets of the underlying container.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> mask = (df['x'] < 3) & (df['y'] > 5)
>>> mask
0 False
1 True
2 True
3 False
4 False
dtype: bool
>>> df[mask]
x y
1 1 6
2 2 7
>>> df.loc[mask]
x y
1 1 6
2 2 7
>>> mask = (df['x'] < 3) & (df['y'] > 5) >>> mask 0 False 1 True 2 True 3 False 4 False dtype: bool >>> df[mask] x y 1 1 6 2 2 7 >>> df.loc[mask] x y 1 1 6 2 2 7
>>> mask = (df['x'] < 3) & (df['y'] > 5)
>>> mask
0    False
1     True
2     True
3    False
4    False
dtype: bool
>>> df[mask]
   x  y
1  1  6
2  2  7
>>> df.loc[mask]
   x  y
1  1  6
2  2  7

Note that .iloc is a little different than the others, if you pass in this mask, you’ll get an exception.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> try:
... df.iloc[mask]
... except NotImplementedError as nie:
... print(nie)
...
iLocation based boolean indexing on an integer type is not available
>>> try: ... df.iloc[mask] ... except NotImplementedError as nie: ... print(nie) ... iLocation based boolean indexing on an integer type is not available
>>> try:
...     df.iloc[mask]
... except NotImplementedError as nie:
...     print(nie)
...
iLocation based boolean indexing on an integer type is not available

This is by design, .iloc is only intended to take positional arguments. However, our mask is a Series with an index, so it is rejected. You can still pass in a boolean vector, but just pass in the vector itself without the index.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> df.iloc[mask.to_numpy()]
x y
1 1 6
2 2 7
>>> # or
>>> df.iloc[mask.values]
x y
1 1 6
2 2 7
>>> df.iloc[mask.to_numpy()] x y 1 1 6 2 2 7 >>> # or >>> df.iloc[mask.values] x y 1 1 6 2 2 7
>>> df.iloc[mask.to_numpy()]
   x  y
1  1  6
2  2  7
>>> # or
>>> df.iloc[mask.values]
   x  y
1  1  6
2  2  7

Examples!

I think one of the most helpful things when thinking about boolean indexing is to see some examples. You are only limited by what you can express by grouping together any combination of expressions on your data. You do this by carefully grouping your boolean expressions and using parentheses wisely. It can also help to break the problem into pieces as you work on it.

In this series I’ve been grabbing data from the Chicago Data Portal. This time, I thought the list of lobbyists might be interesting. Due to the need for lobbyists to re-register every year, there’s some repeating data. Let’s take a look.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> # you should be able to grab this dataset as an unauthenticated user, but you can be rate limited
>>> lbys = pd.read_json("https://data.cityofchicago.org/resource/tq3e-t5yq.json")
>>> lbys.dtypes
year int64
lobbyist_id int64
salutation object
first_name object
last_name object
address_1 object
city object
state object
zip object
country object
email object
phone object
fax object
employer_id int64
employer_name object
created_date object
middle_initial object
address_2 object
suffix object
dtype: object
>>> lbys['created_date'] = pd.to_datetime(lbys['created_date'])
>>> # I'll drop the personally identifiable data, just to be nice
>>> lbys = lbys.drop(['email', 'phone', 'fax', 'last_name'], axis=1)
>>>
>>> lbys.head(3)
year lobbyist_id salutation first_name address_1 ... employer_name created_date middle_initial address_2 suffix
0 2020 18883 MR. PERICLES 6969 W WABANSIA AVE ... THE PERICLES ORGANIZATION 2020-08-07 NaN NaN NaN
1 2019 18883 NaN PERICLES 6969 W WABANSIA AVE ... THE PERICLES ORGANIZATION 2020-01-21 NaN NaN NaN
2 2018 18883 NaN PERICLES 6969 W WABANSIA AVE ... THE PERICLES ORGANIZATION 2018-12-12 NaN NaN NaN
[3 rows x 15 columns]
>>> # you should be able to grab this dataset as an unauthenticated user, but you can be rate limited >>> lbys = pd.read_json("https://data.cityofchicago.org/resource/tq3e-t5yq.json") >>> lbys.dtypes year int64 lobbyist_id int64 salutation object first_name object last_name object address_1 object city object state object zip object country object email object phone object fax object employer_id int64 employer_name object created_date object middle_initial object address_2 object suffix object dtype: object >>> lbys['created_date'] = pd.to_datetime(lbys['created_date']) >>> # I'll drop the personally identifiable data, just to be nice >>> lbys = lbys.drop(['email', 'phone', 'fax', 'last_name'], axis=1) >>> >>> lbys.head(3) year lobbyist_id salutation first_name address_1 ... employer_name created_date middle_initial address_2 suffix 0 2020 18883 MR. PERICLES 6969 W WABANSIA AVE ... THE PERICLES ORGANIZATION 2020-08-07 NaN NaN NaN 1 2019 18883 NaN PERICLES 6969 W WABANSIA AVE ... THE PERICLES ORGANIZATION 2020-01-21 NaN NaN NaN 2 2018 18883 NaN PERICLES 6969 W WABANSIA AVE ... THE PERICLES ORGANIZATION 2018-12-12 NaN NaN NaN [3 rows x 15 columns]
>>> # you should be able to grab this dataset as an unauthenticated user, but you can be rate limited
>>> lbys = pd.read_json("https://data.cityofchicago.org/resource/tq3e-t5yq.json")
>>> lbys.dtypes
year               int64
lobbyist_id        int64
salutation        object
first_name        object
last_name         object
address_1         object
city              object
state             object
zip               object
country           object
email             object
phone             object
fax               object
employer_id        int64
employer_name     object
created_date      object
middle_initial    object
address_2         object
suffix            object
dtype: object
>>> lbys['created_date'] = pd.to_datetime(lbys['created_date'])
>>> # I'll drop the personally identifiable data, just to be nice
>>> lbys = lbys.drop(['email', 'phone', 'fax', 'last_name'], axis=1)
>>>
>>> lbys.head(3)
   year  lobbyist_id salutation first_name            address_1  ...              employer_name created_date middle_initial address_2  suffix
0  2020        18883        MR.   PERICLES  6969 W WABANSIA AVE  ...  THE PERICLES ORGANIZATION   2020-08-07            NaN       NaN     NaN
1  2019        18883        NaN   PERICLES  6969 W WABANSIA AVE  ...  THE PERICLES ORGANIZATION   2020-01-21            NaN       NaN     NaN
2  2018        18883        NaN   PERICLES  6969 W WABANSIA AVE  ...  THE PERICLES ORGANIZATION   2018-12-12            NaN       NaN     NaN

[3 rows x 15 columns]

In terms of examples, there’s not really too much complexity to deal with, but here’s a few to give you an idea what boolean indexing looks like.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> lbys[lbys['year'] == 2020] # all lobbyists registered in 2020
year lobbyist_id salutation first_name ... created_date middle_initial address_2 suffix
0 2020 18883 MR. PERICLES ... 2020-08-07 NaN NaN NaN
4 2020 17521 MR. STEVE ... 2020-01-15 NaN NaN NaN
.. ... ... ... ... ... ... ... ... ...
998 2020 24740 MRS. CHRISONIA ... 2020-01-16 D. SUITE 1700 NaN
999 2020 15081 MS. LIZ ... 2020-05-29 NaN STE. 900 NaN
[125 rows x 15 columns]
>>> lbys[(lbys['year'] == 2020) & (lbys['city'] == 'CHICAGO')] # lobbyists registered in 2020 from Chicago
year lobbyist_id salutation first_name ... created_date middle_initial address_2 suffix
0 2020 18883 MR. PERICLES ... 2020-08-07 NaN NaN NaN
4 2020 17521 MR. STEVE ... 2020-01-15 NaN NaN NaN
.. ... ... ... ... ... ... ... ... ...
998 2020 24740 MRS. CHRISONIA ... 2020-01-16 D. SUITE 1700 NaN
999 2020 15081 MS. LIZ ... 2020-05-29 NaN STE. 900 NaN
[76 rows x 15 columns]
>>> # let's get the most popular employer for 2020
>>> pop_emp_id = lbys[lbys['year'] == 2020].groupby('employer_id').count().sort_values(by='lobbyist_id', ascending=False).index[0]
>>> # who works for them?
>>> lbys[(lbys['employer_id'] == pop_emp_id) & (lbys['year'] == 2020)]
year lobbyist_id salutation first_name address_1 ... employer_name created_date middle_initial address_2 suffix
619 2020 24484 NaN BRIAN 1330 W FULTON ST ... STERLING BAY, LLC AND ITS AFFILIATES 2020-01-15 NaN STE 800 NaN
654 2020 24106 NaN HOWARD 1330 W. FULTON ST ... STERLING BAY, LLC AND ITS AFFILIATES 2020-01-17 NaN SUITE 800 NaN
982 2020 23828 MS. SHELLY 1330 W. FULTON ST ... STERLING BAY, LLC AND ITS AFFILIATES 2020-01-15 NaN SUITE 800 NaN
[3 rows x 15 columns]
>>> lbys[lbys['year'] == 2020] # all lobbyists registered in 2020 year lobbyist_id salutation first_name ... created_date middle_initial address_2 suffix 0 2020 18883 MR. PERICLES ... 2020-08-07 NaN NaN NaN 4 2020 17521 MR. STEVE ... 2020-01-15 NaN NaN NaN .. ... ... ... ... ... ... ... ... ... 998 2020 24740 MRS. CHRISONIA ... 2020-01-16 D. SUITE 1700 NaN 999 2020 15081 MS. LIZ ... 2020-05-29 NaN STE. 900 NaN [125 rows x 15 columns] >>> lbys[(lbys['year'] == 2020) & (lbys['city'] == 'CHICAGO')] # lobbyists registered in 2020 from Chicago year lobbyist_id salutation first_name ... created_date middle_initial address_2 suffix 0 2020 18883 MR. PERICLES ... 2020-08-07 NaN NaN NaN 4 2020 17521 MR. STEVE ... 2020-01-15 NaN NaN NaN .. ... ... ... ... ... ... ... ... ... 998 2020 24740 MRS. CHRISONIA ... 2020-01-16 D. SUITE 1700 NaN 999 2020 15081 MS. LIZ ... 2020-05-29 NaN STE. 900 NaN [76 rows x 15 columns] >>> # let's get the most popular employer for 2020 >>> pop_emp_id = lbys[lbys['year'] == 2020].groupby('employer_id').count().sort_values(by='lobbyist_id', ascending=False).index[0] >>> # who works for them? >>> lbys[(lbys['employer_id'] == pop_emp_id) & (lbys['year'] == 2020)] year lobbyist_id salutation first_name address_1 ... employer_name created_date middle_initial address_2 suffix 619 2020 24484 NaN BRIAN 1330 W FULTON ST ... STERLING BAY, LLC AND ITS AFFILIATES 2020-01-15 NaN STE 800 NaN 654 2020 24106 NaN HOWARD 1330 W. FULTON ST ... STERLING BAY, LLC AND ITS AFFILIATES 2020-01-17 NaN SUITE 800 NaN 982 2020 23828 MS. SHELLY 1330 W. FULTON ST ... STERLING BAY, LLC AND ITS AFFILIATES 2020-01-15 NaN SUITE 800 NaN [3 rows x 15 columns]
>>> lbys[lbys['year'] == 2020]  # all lobbyists registered in 2020
     year  lobbyist_id salutation first_name  ... created_date middle_initial   address_2 suffix
0    2020        18883        MR.   PERICLES  ...   2020-08-07            NaN         NaN    NaN
4    2020        17521        MR.      STEVE  ...   2020-01-15            NaN         NaN    NaN
..    ...          ...        ...        ...  ...          ...            ...         ...    ...
998  2020        24740       MRS.  CHRISONIA  ...   2020-01-16             D.  SUITE 1700    NaN
999  2020        15081        MS.        LIZ  ...   2020-05-29            NaN    STE. 900    NaN

[125 rows x 15 columns]
>>> lbys[(lbys['year'] == 2020) & (lbys['city'] == 'CHICAGO')] # lobbyists registered in 2020 from Chicago
     year  lobbyist_id salutation first_name  ... created_date middle_initial   address_2 suffix
0    2020        18883        MR.   PERICLES  ...   2020-08-07            NaN         NaN    NaN
4    2020        17521        MR.      STEVE  ...   2020-01-15            NaN         NaN    NaN
..    ...          ...        ...        ...  ...          ...            ...         ...    ...
998  2020        24740       MRS.  CHRISONIA  ...   2020-01-16             D.  SUITE 1700    NaN
999  2020        15081        MS.        LIZ  ...   2020-05-29            NaN    STE. 900    NaN

[76 rows x 15 columns]
>>> # let's get the most popular employer for 2020
>>> pop_emp_id = lbys[lbys['year'] == 2020].groupby('employer_id').count().sort_values(by='lobbyist_id', ascending=False).index[0]
>>> # who works for them?
>>> lbys[(lbys['employer_id'] == pop_emp_id) & (lbys['year'] == 2020)]
     year  lobbyist_id salutation first_name          address_1  ...                         employer_name created_date middle_initial  address_2  suffix
619  2020        24484        NaN      BRIAN   1330 W FULTON ST  ...  STERLING BAY, LLC AND ITS AFFILIATES   2020-01-15            NaN    STE 800     NaN
654  2020        24106        NaN     HOWARD  1330 W. FULTON ST  ...  STERLING BAY, LLC AND ITS AFFILIATES   2020-01-17            NaN  SUITE 800     NaN
982  2020        23828        MS.     SHELLY  1330 W. FULTON ST  ...  STERLING BAY, LLC AND ITS AFFILIATES   2020-01-15            NaN  SUITE 800     NaN

[3 rows x 15 columns]

If we only want to deal with 2020 data, we can just make a new smaller DataFrame with that data.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
lbys = lbys[lbys['year'] == 2020]
lbys = lbys[lbys['year'] == 2020]
lbys = lbys[lbys['year'] == 2020]

Boolean indexing with isin

A helpful method that is often paired with boolean indexing is Series.isin. It returns a boolean vector with all rows that match one of the elements in the arguments.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> lbys['state'].tail()
995 MO
996 MO
997 IL
998 IL
999 IL
Name: state, dtype: object
>>> lbys['state'].isin(['IL']).tail()
995 False
996 False
997 True
998 True
999 True
Name: state, dtype: bool
>>> lbys[lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN'])] # lobbyists from bordering states
year lobbyist_id salutation first_name ... created_date middle_initial address_2 suffix
164 2020 24879 NaN DAN ... 2020-06-15 NaN NaN NaN
786 2012 4644 MR. BRIAN ... 2012-05-25 NaN NaN NaN
788 2019 15543 MR. LORENZO ... 2019-01-10 NaN NaN NaN
789 2018 15543 MR. LORENZO ... 2018-01-17 NaN NaN NaN
790 2017 15543 MR. LORENZO ... 2017-01-19 NaN NaN NaN
791 2020 15543 MR. LORENZO ... 2020-01-10 NaN NaN NaN
792 2016 15543 MR. LORENZO ... 2016-10-13 NaN NaN NaN
994 2016 10222 MS. MARILYN ... 2016-01-13 NaN MO1-800-14-40 NaN
995 2015 10222 MS. MARILYN ... 2015-01-14 NaN MO1-800-14-40 NaN
996 2014 10222 MS. MARILYN ... 2014-05-06 NaN MO1-800-14-40 NaN
[10 rows x 15 columns]
>>> lbys['state'].tail() 995 MO 996 MO 997 IL 998 IL 999 IL Name: state, dtype: object >>> lbys['state'].isin(['IL']).tail() 995 False 996 False 997 True 998 True 999 True Name: state, dtype: bool >>> lbys[lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN'])] # lobbyists from bordering states year lobbyist_id salutation first_name ... created_date middle_initial address_2 suffix 164 2020 24879 NaN DAN ... 2020-06-15 NaN NaN NaN 786 2012 4644 MR. BRIAN ... 2012-05-25 NaN NaN NaN 788 2019 15543 MR. LORENZO ... 2019-01-10 NaN NaN NaN 789 2018 15543 MR. LORENZO ... 2018-01-17 NaN NaN NaN 790 2017 15543 MR. LORENZO ... 2017-01-19 NaN NaN NaN 791 2020 15543 MR. LORENZO ... 2020-01-10 NaN NaN NaN 792 2016 15543 MR. LORENZO ... 2016-10-13 NaN NaN NaN 994 2016 10222 MS. MARILYN ... 2016-01-13 NaN MO1-800-14-40 NaN 995 2015 10222 MS. MARILYN ... 2015-01-14 NaN MO1-800-14-40 NaN 996 2014 10222 MS. MARILYN ... 2014-05-06 NaN MO1-800-14-40 NaN [10 rows x 15 columns]
>>> lbys['state'].tail()
995    MO
996    MO
997    IL
998    IL
999    IL
Name: state, dtype: object
>>> lbys['state'].isin(['IL']).tail()
995    False
996    False
997     True
998     True
999     True
Name: state, dtype: bool
>>> lbys[lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN'])] # lobbyists from bordering states
     year  lobbyist_id salutation first_name  ... created_date middle_initial      address_2 suffix
164  2020        24879        NaN        DAN  ...   2020-06-15            NaN            NaN    NaN
786  2012         4644        MR.      BRIAN  ...   2012-05-25            NaN            NaN    NaN
788  2019        15543        MR.    LORENZO  ...   2019-01-10            NaN            NaN    NaN
789  2018        15543        MR.    LORENZO  ...   2018-01-17            NaN            NaN    NaN
790  2017        15543        MR.    LORENZO  ...   2017-01-19            NaN            NaN    NaN
791  2020        15543        MR.    LORENZO  ...   2020-01-10            NaN            NaN    NaN
792  2016        15543        MR.    LORENZO  ...   2016-10-13            NaN            NaN    NaN
994  2016        10222        MS.    MARILYN  ...   2016-01-13            NaN  MO1-800-14-40    NaN
995  2015        10222        MS.    MARILYN  ...   2015-01-14            NaN  MO1-800-14-40    NaN
996  2014        10222        MS.    MARILYN  ...   2014-05-06            NaN  MO1-800-14-40    NaN

[10 rows x 15 columns]

I’ll wrap it up with a slightly more complicated expression.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> lbys[
... ~(lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN'])) & # lobbyists NOT from bordering states
... (lbys['state'] != 'IL') & # and NOT from IL
... (lbys['created_date'] >= '2020-07-01') # created in the last half of the year
... ]
year lobbyist_id salutation first_name ... created_date middle_initial address_2 suffix
26 2021 24967 NaN JON ... 2020-12-27 NaN SUITE 104 NaN
27 2020 24967 NaN JON ... 2020-10-21 NaN SUITE 104 NaN
34 2021 24901 NaN JOSEPH ... 2020-12-20 NaN NaN NaN
160 2021 23782 MR. ALAN ... 2021-01-01 P SUITE 404-352 NaN
227 2020 24909 MS. LAKEITHA ... 2020-07-14 NaN NaN NaN
332 2021 22341 NaN CAROLINE ... 2020-12-23 NaN NaN NaN
511 2021 5361 NaN JAY ... 2020-12-27 NaN SUITE 450 NaN
609 2021 4561 NaN BRET ... 2020-12-18 NaN NaN NaN
660 2021 13801 MR. JOHN ... 2020-12-27 NaN MC 482-C30-C76 NaN
700 2020 24925 NaN TAMI ... 2020-08-06 NaN NaN NaN
862 2020 24969 MR. ALEX ... 2020-08-31 NaN NaN NaN
951 2021 6164 MS. GABRIELLE ... 2020-12-21 NaN 73RD FLOOR NaN
[12 rows x 15 columns]
>>> lbys[ ... ~(lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN'])) & # lobbyists NOT from bordering states ... (lbys['state'] != 'IL') & # and NOT from IL ... (lbys['created_date'] >= '2020-07-01') # created in the last half of the year ... ] year lobbyist_id salutation first_name ... created_date middle_initial address_2 suffix 26 2021 24967 NaN JON ... 2020-12-27 NaN SUITE 104 NaN 27 2020 24967 NaN JON ... 2020-10-21 NaN SUITE 104 NaN 34 2021 24901 NaN JOSEPH ... 2020-12-20 NaN NaN NaN 160 2021 23782 MR. ALAN ... 2021-01-01 P SUITE 404-352 NaN 227 2020 24909 MS. LAKEITHA ... 2020-07-14 NaN NaN NaN 332 2021 22341 NaN CAROLINE ... 2020-12-23 NaN NaN NaN 511 2021 5361 NaN JAY ... 2020-12-27 NaN SUITE 450 NaN 609 2021 4561 NaN BRET ... 2020-12-18 NaN NaN NaN 660 2021 13801 MR. JOHN ... 2020-12-27 NaN MC 482-C30-C76 NaN 700 2020 24925 NaN TAMI ... 2020-08-06 NaN NaN NaN 862 2020 24969 MR. ALEX ... 2020-08-31 NaN NaN NaN 951 2021 6164 MS. GABRIELLE ... 2020-12-21 NaN 73RD FLOOR NaN [12 rows x 15 columns]
>>> lbys[
...     ~(lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN'])) & # lobbyists NOT from bordering states
...     (lbys['state'] != 'IL') &                               # and NOT from IL
...     (lbys['created_date'] >= '2020-07-01')                  # created in the last half of the year
...     ]
     year  lobbyist_id salutation first_name  ... created_date middle_initial       address_2 suffix
26   2021        24967        NaN        JON  ...   2020-12-27            NaN       SUITE 104    NaN
27   2020        24967        NaN        JON  ...   2020-10-21            NaN       SUITE 104    NaN
34   2021        24901        NaN     JOSEPH  ...   2020-12-20            NaN             NaN    NaN
160  2021        23782        MR.       ALAN  ...   2021-01-01              P   SUITE 404-352    NaN
227  2020        24909        MS.   LAKEITHA  ...   2020-07-14            NaN             NaN    NaN
332  2021        22341        NaN   CAROLINE  ...   2020-12-23            NaN             NaN    NaN
511  2021         5361        NaN        JAY  ...   2020-12-27            NaN       SUITE 450    NaN
609  2021         4561        NaN       BRET  ...   2020-12-18            NaN             NaN    NaN
660  2021        13801        MR.       JOHN  ...   2020-12-27            NaN  MC 482-C30-C76    NaN
700  2020        24925        NaN       TAMI  ...   2020-08-06            NaN             NaN    NaN
862  2020        24969        MR.       ALEX  ...   2020-08-31            NaN             NaN    NaN
951  2021         6164        MS.  GABRIELLE  ...   2020-12-21            NaN      73RD FLOOR    NaN

[12 rows x 15 columns]

I also find it helpful to sometimes create a variable for storing the mask. So for the above example, instead of having to parse the entire expression when reading the code, it can be helpful to have expressive variable names for the parts of the indexing expression.

Plain text
Copy to clipboard
Open code in new window
EnlighterJS 3 Syntax Highlighter
>>> non_bordering = ~(lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN']))
>>> non_illinois = (lbys['state'] != 'IL')
>>>
>>> # more readable maybe?
>>> lbys[non_bordering & non_illinois & (lbys['created_date'] >= '2020-07-01')]
>>> non_bordering = ~(lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN'])) >>> non_illinois = (lbys['state'] != 'IL') >>> >>> # more readable maybe? >>> lbys[non_bordering & non_illinois & (lbys['created_date'] >= '2020-07-01')]
>>> non_bordering = ~(lbys['state'].isin(['WI', 'IA', 'MO', 'KY', 'IN']))
>>> non_illinois = (lbys['state'] != 'IL')
>>>
>>> # more readable maybe?
>>> lbys[non_bordering & non_illinois & (lbys['created_date'] >= '2020-07-01')]

Often when building a complex expression, it can be helpful to build it in pieces, so assigning parts of the mask to variables can make a much more complicated expression easier to read, at the cost of extra variables to deal with. In general, I use variables in the mask if I have to reuse them multiple times, but if only used once, I do the entire expression in place.

In summary, boolean indexing is really quite simple, but powerful. I’ll be looking at a few other ways to select data in pandas in upcoming posts.

Have anything to say about this topic?