The topic of indexing and selecting data in pandas is core to using pandas, but it can be quite confusing. One reason for that is because over the years pandas has grown organically based on user requests so there are multiple way to select data out of a pandas DataFrame
or Series
. Reading through the documentation can be a real challenge, especially for beginners. For more advanced users, it is easy to learn a few techniques and just fall back on your favorite method to access data and not realize that there might be simpler, faster, or more reliable ways to both select and modify your data.
Since this can be a complicated and confusing topic, I’m choosing to break it into several smaller posts so that it doesn’t become overwhelming, working up from the basics to the more complex scenarios. The methods used for selecting and indexing are some of the most confusing methods to work with in pandas due to their different behavior with different argument types.
Indexing, starting with the basics
This will be the first of a series of posts covering indexing and selecting data. This post is just going to cover selecting data by index label or integer offsets. In the future, I’ll discuss slicing, boolean indexing, the query
method, using isin
, cross sections, and much more.
Our test data
As we get started, we want some data. But instead of making fake data to work with, let’s grab some real data online that is a little more interesting. I live near Chicago, so I looked at the Chicago Data Portal for some datasets. A smaller one I found was the Individual Landmarks dataset that gives an inventory of our famous Chicago landmarks. It has text, numeric, and date fields. I’m going to manipulate it a bit to make it useful for explaining more concepts.
These examples were executed using Python 3.8.6 and pandas 1.1.4.
>>> import pandas as pd >>> import numpy as np >>> >>> # you should be able to grab this dataset as an unauthenticated user, but you can be rate limited >>> df = pd.read_json("https://data.cityofchicago.org/resource/tdab-kixi.json") >>> df.head(3) landmark_name id address ... :@computed_region_awaf_s7ux date_built architect 0 Vassar Swiss Underwear Company Building L-265 2543 - 2545 W Diversey Av ... 24.0 NaN NaN 1 Mathilde Eliel House L- 89 4122 S Ellis Av ... 1.0 1886 Adler & Sullivan 2 Manhattan Building L-139 431 S Dearborn St ... 48.0 1891 William LeBaron Jenney [3 rows x 15 columns] >>> df.dtypes landmark_name object id object address object landmark_designation_date object latitude float64 longitude float64 location object :@computed_region_rpca_8um6 float64 :@computed_region_vrxf_vc4k float64 :@computed_region_6mkv_f3dw float64 :@computed_region_bdys_3d7i float64 :@computed_region_43wa_7qmu float64 :@computed_region_awaf_s7ux float64 date_built object architect object dtype: object >>> # let's transform that date from object to a datetime >>> df['landmark_designation_date'] = pd.to_datetime(df['landmark_designation_date']) >>> # also trimming down the columns >>> df = df[['landmark_name', 'id', 'address', 'landmark_designation_date', ... 'latitude', 'longitude', 'location', 'date_built', 'architect']] >>> df.columns Index(['landmark_name', 'id', 'address', 'landmark_designation_date', 'latitude', 'longitude', 'location', 'date_built', 'architect'], dtype='object')
Axes
The two main data structures in pandas both have at least one axis. A Series
has one axis, the index. A DataFrame
has two axes, the index and the columns. It’s useful to note here that in all the DataFrame
functions that can be applied to either rows or columns, an axis of 0 refers to the index, an axis of 1 refers to the columns.
We can inspect these in our sample DataFrame
. We’ll pick the landmark_name
column as a sample Series
to demonstrate the basics for a Series
. You can see the column (which is a Series
) and the entire DataFrame
share the same index.
>>> s = df['landmark_name'] >>> print("Series index:", s.index) Series index: RangeIndex(start=0, stop=317, step=1) >>> print("DataFrame index:", df.index) DataFrame index: RangeIndex(start=0, stop=317, step=1)
Index¶
In pandas, an Index
(or a subclass) allows for the data structures that use it to support lookups (or selection), data alignment (think of time-series data especially, where all the observations needs to be aligned with their observation time), and reindexing (changing the underlying index to have different values, but keeping the data aligned). There are a number of types of indices, but for now, we’ll just look at the simple RangeIndex
that our current DataFrame
is using, which will have integer values.
Basic selecting with []
We’re going to start with the basic form of selecting, using the []
operator, which in Python maps to a class’s __getitem__
function (if you’re familiar with objects in Python, if not, don’t worry about that for now). Depending on whether the pandas object is a Series
or a DataFrame
and the arguments you pass into this function, you will get very different results. Let’s start with the basics, invoking with a single argument.
Series
With a Series
, the call will return a single scalar value that matches the value at that label in the index. If you pass in a value for a label that doesn’t exist, you will get a KeyError
raised. Also, if you pass in an integer and your index has that value, it will return it. But if you don’t have an integer value in your index, it will return the value by position. This is convenient, but can be confusing.
Now I’m going to give this DataFrame
(and Series
) a new index because the RangeIndex
could make much of the following examples very confusing. It’s important for us to differentiate between accessing elements by label and by position. If our index labels are integers, you will not be able to see the difference! Since this dataset already has a unique id
column, we’ll use that instead.
>>> df.index = df['id'].str.replace(' ', '') # remove spaces for this example >>> s = df['landmark_name'] >>> df.index Index(['L-265', 'L-89', 'L-139', 'L-12', 'L-88', 'L-318', 'L-85', 'L-149', 'L-286', 'L-71', ... 'L-241', 'L-133', 'L-169', 'L-277', 'L-164', 'L-310', 'L-103', 'L-236', 'L-65', 'L-224'], dtype='object', name='id', length=317)
Now that our index doesn’t contain int
values, when we call it with int
s they will be evaluated as positional arguments. If you index had int
values, they would be found first rather than posititional values. Confusing, isn’t it? This is one reason why you want to read on and see why there are better ways to do this.
>>> print("The value for L-265:", s['L-265']) The value for L-265: Vassar Swiss Underwear Company Building >>> print("The first value:", s[0]) The first value: Vassar Swiss Underwear Company Building >>> print("The value for L-139:", s['L-139']) The value for L-139: Manhattan Building >>> print("The third value:", s[2]) The third value: Manhattan Building >>> try: ... s['L-900'] ... except KeyError as ke: ... print("Exception: ", ke) ... Exception: 'L-900'
While I rarely use it, there is a get
method available, which will return None
if the argument is not in the index instead of raising a KeyError
.
>>> print("The first value:", s.get(0)) The first value: Vassar Swiss Underwear Company Building >>> print("Is there a value at 'L-900'?: ", s.get('L-900')) Is there a value at 'L-900'?: None
DataFrame
Now with a DataFrame
, calls to []
are used for selecting from the column index, not the row index. This can be confusing since it’s different from a Series
when passing in integer values. Instead, we pass in column names.
>>> try: ... print("First element in a Series:", s[0]) ... print("First row in a DataFrame?:", df[0]) ... except KeyError as ke: ... print("Nope, that's not how you select rows in a DataFrame") ... First element in a Series: Vassar Swiss Underwear Company Building Nope, that's not how you select rows in a DataFrame >>> df['landmark_name'] id L-265 Vassar Swiss Underwear Company Building L-89 Mathilde Eliel House L-139 Manhattan Building L-12 Machinery Hall at Illinois Institute of Techno... L-88 Melissa Ann Elam House ... L-310 (Former) Schlitz BreweryTied-House@1944 N.Oakley L-103 Getty Tomb L-236 Engine Company 129, Truck 50 L-65 James Charnley House L-224 Beeson House and Coach House Name: landmark_name, Length: 317, dtype: object
We can also select a list of columns, in any order (even repeated).
>>> df[['landmark_name', 'address', 'landmark_name']]
Attribute access for columns on a DataFrame
or values in a Series
.
Another way to select a DataFrame
column or an element in a Series
is using the attribute operator, .
. Pandas will automagically create accessors for all DataFrame
columns or for all labels in a Series
, provided their name translates to valid Python. This can make life easier for you, especially in an environment with tab-completion like IPython or a Jupyter notebook, but in general, it’s best not to use these attributes in production code. Why not?
- Your column names may collide with method names on a
DataFrame
itself, and in that case you will be accessing something you weren’t intending to. - Column names often may not be valid Python identifiers since they may contain spaces or start with (or just be) numbers, so you have to use the longer form anyway.
- Using
[]
with quoted strings makes your code very clear for others (and future self) to read. - Assigning to a non-existing attribute won’t create a new column, it will just create an attribute on the object. (We’ll talk about modifying data in subsequent posts, you can read this article for details on adding columns to a
DataFrame
)
So hopefully this list reinforces why it’s just a bad habit to rely on attribute access. In our data example, we can use attribute access for some of our column names, but because the primary index doesn’t make valid Python identifiers, we can’t use it on our Series
(L-265 is not a valid Python identifier, for example). But it does work for some situations and may save you a few keystrokes when doing exploratory analysis.
>>> df.landmark_name
Selecting with .loc
We also have the .loc
attribute which is intended for selection and indexing by label, similar to []
on a Series
. However, using .loc
and .iloc
will make it more clear in your code what your intentions are, and they behave differently. Note that .loc
will raise KeyError
when an element doesn’t exist at that label.
Also note that this is not a method, so we don’t call it, but rather we use the indexing operator ([]
) on it.
Series
Let’s start with a Series
. First, note that we don’t select by location, we can select by index label.
>>> try: ... print("Selecting by location?", s.loc[0]) ... except KeyError as ke: ... print("Nope, not with .loc: ", ke) Nope, not with .loc: 0 >>> print("Yes, do it by label: ", s.loc['L-139']) Yes, do it by label: Manhattan Building
We can also pass in a list of labels, and they will be returned as a Series
.
>>> s.loc[['L-12', 'L-265', 'L-224']] id L-12 Machinery Hall at Illinois Institute of Techno... L-265 Vassar Swiss Underwear Company Building L-224 Beeson House and Coach House Name: landmark_name, dtype: object
Note that this can be a list with a single element, but passing in a list returns a Series
(with one element), not a scalar.
>>> s.loc[['L-12']] id L-12 Machinery Hall at Illinois Institute of Techno... Name: landmark_name, dtype: object
DataFrame
On a DataFrame
, a single argument to .loc
will return a Series
for the row matching the label. I’ll select one of my favorite Chicago buildings, the Rookery. If you ever are in Chicago, check out the lobby of this beautiful building, remodeled by Frank Lloyd Wright, and note that one of the columns has its marble cladding missing so you can see the original post underneath. Wright left it open so the original design could still be seen.
>>> df.loc['L-156'] landmark_name Rookery Building id L-156 address 209 S LaSalle St landmark_designation_date 1972-07-05 07:00:00+00:00 latitude 41.8791 longitude -87.6318 location {'latitude': '41.8790761299', 'longitude': '-8... date_built 1885-88 architect Burnham & Root Name: L-156, dtype: object
If passed a list of labels, .loc
will return a DataFrame
of the matching rows. This is just selecting rows by index, but selecting multiple rows. Note that all of the elements in the list have to be in the index or KeyError
is raised.
>>> df.loc[['L-12', 'L-11', 'L-13']] landmark_name id ... date_built architect id ... L-12 Machinery Hall at Illinois Institute of Techno... L- 12 ... 1901 Patton, Fisher & Miller L-11 South Shore Cultural Center L- 11 ... 1906,09,16 Marshall & Fox L-13 Main Building at Illinois Institute of Technology L- 13 ... 1891-93 Patton & Fisher [3 rows x 9 columns]
But, DataFrame
‘s .loc
can take multiple arguments. The first argument is an indexer for rows, the second argument is an indexer for the columns. So if we wanted a row and the landmark_name column, we can get back a scalar.
>>> df.loc['L-12','landmark_name'] 'Machinery Hall at Illinois Institute of Technology'
Now I’m going to move on from .loc
for now, but will circle back to talk about some more advanced ways of selecting data with it.
Selecting with .iloc
.
.iloc
is a separate method for use with purely integer based indexing, starting from 0. It’s very similar to the behavior of .loc
, as we’ll see, but raises IndexError
when the indexer is out of bounds, or the wrong type.
Series
For a Series
, .iloc
returns a scalar, just like .loc
.
'Vassar Swiss Underwear Company Building'
Relative indexing is allowed, and if you try to access a non-existent element, IndexError
is raised.
>>> s.iloc[-1] 'Beeson House and Coach House' >>> try: ... s.iloc[9999] ... except IndexError as ix: ... print("You ran over the end of your Series: ", ix) You ran over the end of your Series: single positional indexer is out-of-bounds
Passing in a list returns a Series
. Again, all elements in the list need to be in the index, or IndexError
is raised.
>>> s.iloc[[0,1,2]] id L-265 Vassar Swiss Underwear Company Building L-89 Mathilde Eliel House L-139 Manhattan Building Name: landmark_name, dtype: object
I’m going to stop here for this post, and I know this seems quite unsatisfying to only pull data out by index, but understanding how to use the index will make future types of selections much easier to understand. For now, make sure you remember that .loc
selects items by label, .iloc
by integer position, and []
is sort of a hybrid.
To learn more about indexing, take a look on my articles on slicing as well as boolean indexing which will give us much more power to get more out of the three methods that were covered in this post: []
, .loc
, and .iloc
.
I found the text highly informative. I suggest correcting the information on using .iloc for Series. The statement (above the example s.iloc[[0,1,2]]) “Passing in a list returns a DataFrame.” should be “Passing in a list returns a Series.”
Thank you, good catch. I’ve corrected the error.
Thanks for this. I’m an absolute hack, but I’ve used Pandas for a few years. And these distinctions were never clear to me before. This is very well explained.
Thank you. I’m glad it helped clear things up. I hope you’ll read the followup posts as well, these methods are loaded with a lot of capability, but can also be totally confusing, as I’m realizing as I dig in more.
This distills the essence very nicely, thanks for the post. As a next topic, looking at performance implications of various ways of access data from existing padas object might be of interest – when the copies are made or not is not quite transparent.