It’s time for a different way to do selection in pandas. The query method can provide a few benefits to your code, and offers potential speed improvements. Read on to find out more.
This is the sixth post in a series on indexing and selecting in pandas. So far we have covered indexing and selecting data in both pandas Series and DataFrames using slicing, boolean indexing, and familiar Python concepts like callables. This post covers the details behind the query method, which requires a little more background and explanation. In the end, you’ll see how it’s useful and how it can help speed up your calculations.
Before we dive in, if you are jumping in the middle and want to get caught up, here is the rest of the series:
- Basic indexing, selecting by label and location
- Slicing in pandas
- Selecting by boolean indexing
- Selecting by callable
- Selecting using where and mask
Using query
In pandas, DataFrames have a query method that supports selection using an expression written as a string. One way to think about this (and a natural way, given its name and basic functionality) is that it is sort of like SQL. But I’m not sure that’s a great way to think about it, for a couple of reasons. First, the language support is limited (i.e. don’t count on writing queries with complex joins or group by expressions). Second, the purpose of this expression syntax is not just to give you a simple query language for selecting data, but also to speed up expressions.
That being said, using query does offer a number of advantages, along with a few complications, which we will cover.
Let’s start with a few examples of what an expression in query might look like.
>>> import pandas as pd
>>> import numpy as np

>>> # create a sample DataFrame of floating point values around 0
>>> df = pd.DataFrame(np.random.random((5,5)) - .5, columns=list("abcde"))
>>> df
          a         b         c         d         e
0 -0.152062  0.320043  0.445840 -0.434280  0.404650
1  0.112074 -0.449177  0.190247 -0.471130  0.304824
2 -0.090032 -0.414783  0.388579  0.363154 -0.291303
3  0.223888 -0.202376  0.053828 -0.070127  0.005833
4  0.417009 -0.251136  0.268446  0.020803  0.265924
query is still boolean indexing
If you look back at the post on boolean indexing, you’ll see that to select rows in a DataFrame using any of the indexing methods, you can pass in a boolean vector the same size as the DataFrame index. So, if we want to select all rows in our DataFrame where the value in column b is negative, we pass in a boolean array created by applying the less than (<) operation to the column.
>>> df[df["b"] < 0]
          a         b         c         d         e
1  0.112074 -0.449177  0.190247 -0.471130  0.304824
2 -0.090032 -0.414783  0.388579  0.363154 -0.291303
3  0.223888 -0.202376  0.053828 -0.070127  0.005833
4  0.417009 -0.251136  0.268446  0.020803  0.265924
The query method is similar. It takes an expression, which is evaluated in the context of the DataFrame, and that expression returns a boolean vector used for selection. So we can write the same selection like this:
>>> df.query("b < 0")
          a         b         c         d         e
1  0.112074 -0.449177  0.190247 -0.471130  0.304824
2 -0.090032 -0.414783  0.388579  0.363154 -0.291303
3  0.223888 -0.202376  0.053828 -0.070127  0.005833
4  0.417009 -0.251136  0.268446  0.020803  0.265924
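To see that this really is boolean indexing under the hood, note that DataFrame.eval will evaluate the same expression and hand back the boolean vector itself, which you can then use for ordinary selection. A quick sketch using the df from above:

>>> mask = df.eval("b < 0")              # a boolean Series, one value per row
>>> df[mask].equals(df.query("b < 0"))   # same rows either way
True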
More complex expressions can be written with a simpler syntax. For example, to select rows where column b is less than column c and column c is less than column e, we need a fair amount of parentheses and have to repeat our df variable in order for the expression to parse properly.
>>> df[(df["b"] < df["c"]) & (df["c"] < df["e"])]
          a         b         c        d         e
1  0.112074 -0.449177  0.190247 -0.47113  0.304824
Using query, this can be written in a much more elegant way. For combining and modifying expressions, you can use &, |, and ~, or their spelled-out equivalents and, or, and not.
>>> df.query("(b < c) & (c < e)") # or >>> df.query("(b < c) and (c < e)") a b c d e 1 0.112074 -0.449177 0.190247 -0.47113 0.304824
Or even better, you can use a chained comparison:
>>> df.query("b < c < e") a b c d e 1 0.112074 -0.449177 0.190247 -0.47113 0.304824
You can also use your index in the query by referring to index, or you can name your index and use that name instead.
>>> df.query("index > 2") >>> df.index.name = 'idx' >>> df.query("idx > 2") a b c d e idx 3 0.223888 -0.202376 0.053828 -0.070127 0.005833 4 0.417009 -0.251136 0.268446 0.020803 0.265924
Using variables in your query expression
You can also use variables that are in scope inside your expression by prefixing them with an @.
>>> limit = 0.1
>>> df.query("b > @limit")
            a         b        c        d        e
idx
0   -0.152062  0.320043  0.44584 -0.43428  0.40465
If your column names are not valid Python identifiers, you’ll have to surround them with backticks. (Note that you’ll need pandas 0.25 or higher for this functionality, and 1.0+ for expanded support.)
>>> df['mean value'] = df.mean(axis=1)
>>> df.query('`mean value` > d')
            a         b         c         d         e  mean value
idx
0   -0.152062  0.320043  0.445840 -0.434280  0.404650    0.116838
1    0.112074 -0.449177  0.190247 -0.471130  0.304824   -0.062632
3    0.223888 -0.202376  0.053828 -0.070127  0.005833    0.002209
4    0.417009 -0.251136  0.268446  0.020803  0.265924    0.144209
The in operator is supported in query, so you can use that as well, either with a variable, with existing columns, or with literals. The == and != operators work like in and not in when comparing against lists.
>>> df['name'] = ['x', 'y', 'z', 'p', 'q']
>>> search = ['x', 'y']
>>> df.query("name in ['x', 'y']")
>>> # or
>>> df.query('name in @search')
>>> # or
>>> df.query('name == @search')
>>> # or
>>> df.query("name == ['x', 'y']")
            a         b         c        d         e  mean value name
idx
0   -0.152062  0.320043  0.445840 -0.43428  0.404650    0.116838    x
1    0.112074 -0.449177  0.190247 -0.47113  0.304824   -0.062632    y
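The negated forms are also available, so the inverse of the selection above can be written either way (again using the search variable):

>>> df.query("name not in @search")   # rows whose name is not x or y
>>> df.query("name != @search")       # same selection with the != operator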
What about updates?
This expression syntax is handy, but what about updating your DataFrame? I know we’ve mostly talked about selecting data, but it turns out that you can also update or add columns using expressions, via the eval method. Note that eval returns a copy of the modified data, so you need to assign the result back to your variable or use the inplace argument to retain the changes.
>>> df.eval('x = a * b')
            a         b         c         d         e  mean value name         x
idx
0   -0.152062  0.320043  0.445840 -0.434280  0.404650    0.116838    x -0.048666
1    0.112074 -0.449177  0.190247 -0.471130  0.304824   -0.062632    y -0.050341
2   -0.090032 -0.414783  0.388579  0.363154 -0.291303   -0.008877    z  0.037344
3    0.223888 -0.202376  0.053828 -0.070127  0.005833    0.002209    p -0.045310
4    0.417009 -0.251136  0.268446  0.020803  0.265924    0.144209    q -0.104726
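Since eval returns a copy by default, a typical pattern to retain the new column looks like one of these (shown for illustration only, not run against the df above):

>>> df = df.eval('x = a * b')            # assign the copy back
>>> df.eval('x = a * b', inplace=True)   # or modify df in place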
Note you can pass in multiple expressions to be evaluated in one go.
df.eval('''
x = a * b
y = c * d
''')
            a         b         c         d         e  mean value name         x         y
idx
0   -0.152062  0.320043  0.445840 -0.434280  0.404650    0.116838    x -0.048666 -0.193619
1    0.112074 -0.449177  0.190247 -0.471130  0.304824   -0.062632    y -0.050341 -0.089631
2   -0.090032 -0.414783  0.388579  0.363154 -0.291303   -0.008877    z  0.037344  0.141114
3    0.223888 -0.202376  0.053828 -0.070127  0.005833    0.002209    p -0.045310 -0.003775
4    0.417009 -0.251136  0.268446  0.020803  0.265924    0.144209    q -0.104726  0.005584
What about updates with multiple values?
It turns out there is also a higher-level pandas eval function available for more complicated updates involving multiple DataFrames or Series. I won’t go into detail here, but you can use it for the same sorts of expressions as above.
>>> s = pd.Series(np.random.random(5))
>>> pd.eval('g = df.a + s', target=df)
            a         b         c         d         e  mean value name         g
idx
0   -0.152062  0.320043  0.445840 -0.434280  0.404650    0.116838    x  0.661418
1    0.112074 -0.449177  0.190247 -0.471130  0.304824   -0.062632    y  0.814406
2   -0.090032 -0.414783  0.388579  0.363154 -0.291303   -0.008877    z  0.663946
3    0.223888 -0.202376  0.053828 -0.070127  0.005833    0.002209    p  0.435111
4    0.417009 -0.251136  0.268446  0.020803  0.265924    0.144209    q  1.012334
One other nice use case
One more nice advantage of using query (and eval) is for situations where you have multiple DataFrames in your code that share column names, and you’d like to apply the same expression to each of them. Without query, you’d have to pass each DataFrame into a function in order to refer to it in local Python expressions. So as these expressions get more complex, query may make sense for code clarity.
df2 = pd.DataFrame(np.random.random((5,5)) - .5, columns=list("abcde"))

# without query, we need to build the expression in python
def find_d_gt_e(d):
    return d[d['d'] > d['e']]

find_d_gt_e(df)
find_d_gt_e(df2)

# with query, the same expression is re-usable anywhere these columns exist
expr = "d > e"
df.query(expr)
df2.query(expr)
But what’s the point?
So there are a few advantages to the mini-language used by query and eval, and it results in cleaner looking code. But there are still some disadvantages to using a different query language. First, when passing in a string, there’s no syntax checking done by your editor, which you may find annoying or which may make you slightly less efficient. Second, you will need to learn a new set of rules for this mini-language.
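For example, a typo in a column name only surfaces when the query runs, with an error along these lines (the exact exception location and text vary across pandas versions):

>>> df.query("bee < 0")   # there is no column named bee
Traceback (most recent call last):
    ...
pandas.errors.UndefinedVariableError: name 'bee' is not defined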
It’s mainly about speed
What is not entirely obvious here is that under the hood you can install a nice library called numexpr (docs, src) that exists to make calculations on large NumPy (and pandas) objects potentially much faster. When you use query or eval, the expression is passed into numexpr and optimized using its bag of tricks. The expected performance improvement ranges from 0.95x (a slight slowdown) up to 20x, with average improvements around 3-4x for typical use cases. You can read the details in the docs, but essentially numexpr takes vectorized operations and performs them in chunks that make better use of the CPU cache and branch prediction. If your arrays are really large, the cache will not be hit as often; if you break your large arrays into pieces that are too small, your CPU won’t work efficiently. numexpr chooses chunk sizes that balance the two.
Make sure you install it in your environment first if it’s not there.
%pip install numexpr   # in IPython or Jupyter notebooks
pip install numexpr    # in your shell
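If you want to confirm that pandas will actually pick it up, you can check the import and the compute.use_numexpr option, which defaults to True when the library is installed:

>>> import numexpr
>>> numexpr.__version__            # any recent version will do
>>> pd.get_option('compute.use_numexpr')
True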
Here’s a simple NumPy example from the numexpr docs that shows what the improvement can look like.
>>> import numexpr as ne

>>> a = np.random.rand(int(1e6))
>>> b = np.random.rand(int(1e6))

>>> # assumes you're in IPython or a Jupyter notebook
>>> r1 = %timeit -o 2 * a + 3 * b
5.57 ms ± 313 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> r2 = %timeit -o ne.evaluate("2 * a + 3 * b")
1.65 ms ± 31.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> print(f"{r1.average / r2.average:.2f}x faster")
3.38x faster
You can see much more detail in the numexpr docs if you’re interested, but for a pandas-specific example, let’s see what the difference is with and without numexpr for DataFrames in the size range where we can start to expect an improvement (over 200k rows). In terms of what is possible to query, the numexpr docs have a concise summary, but pandas adds some more complexity you’ll need to consider.
>>> sd = pd.DataFrame(np.random.random(int(3e7)).reshape(int(1e7), 3))
>>> sd2 = pd.DataFrame(np.random.random(int(3e7)).reshape(int(1e7), 3))
>>> r1 = %timeit -o 3 * sd + 4 * sd2
174 ms ± 23.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> r2 = %timeit -o pd.eval('3 * sd + 4 * sd2')
104 ms ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> print(f"{r1.average / r2.average:.2f}x faster")
1.66x faster
Depending on what you are doing, some operations will be sped up even more.
>>> r1 = %timeit -o 3 * sd + sd2 ** 4
286 ms ± 56.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> r2 = %timeit -o pd.eval('3 * sd + sd2 ** 4')
101 ms ± 6.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> print(f"{r1.average / r2.average:.2f}x faster")
2.84x faster
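If you want to measure the numexpr effect on your own data, both query and eval accept an engine argument, so you can time the two engines on the same expression. A sketch, with a hypothetical big DataFrame; results will vary with data size and hardware:

>>> big = pd.DataFrame(np.random.random((int(1e7), 3)), columns=list('abc'))
>>> %timeit big.query('a < b < c', engine='python')
>>> %timeit big.query('a < b < c', engine='numexpr')   # the default when installed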
Wrapping up
When you’re doing initial exploration of data, or working with smaller data sets, query and eval will probably not be your go-to options for data selection. But you should consider them for speeding up code with large data sets or in speed-critical areas, where they can make a big difference. They’re also useful for cleaning up code with complicated selection criteria.
I also think this is a good place to wrap up the series on selecting and indexing in pandas. Now that we’ve covered the basics, there are many more topics worth looking at.