If you have a Python program that is currently running you may want to understand what the real-world performance profile of the code is. This program could be in a production environment or just on your local machine. You will want to understand where the running program spends its time and if any “hot spots” exist that should be investigated further for improvement. You may be dealing with a production system that is misbehaving and you may want to profile it in an unobtrusive way that doesn’t further impact production performance or require code modifications. What’s a good way to do this? This article will talk about py-spy, a tool that allows you to profile Python programs that are already running.
Deterministic vs. Sampling profilers
In earlier articles, I wrote about two deterministic profilers, cProfile and line_profiler. These profilers are useful when you are developing code and want to profile either sections of code or an entire process. Since they are deterministic, they will tell you exactly how many times a function (or in the case of line_profiler
, a line) is executed and how much time it relatively takes to execute compared to the rest of your code. Because these profilers run within the observed process, they slow it down somewhat because they have to do bookkeeping and calculating in the midst of the program execution. For production code, modifying the code or restarting it with a profiler enabled is often not an option.
This is where sampling profilers can be helpful. A sampling profiler looks at an existing process and uses various tricks to determine what the running process is doing. You can manually try this yourself. For example, on linux you can use the pstack <pid>
(or gstack <pid>
) command to see what your process is doing. On a Mac, you can execute echo "thread backtrace all" | lldb -p <pid>
to see something similar. The output will be the stack of all the threads in your process. This works for any process, not just Python programs. For your Python programs, you’ll see the underlying C functions, not your Python functions. In some cases, checking the stack a few times this way may tell you if your process is stuck or where it is slow, provided you can do the translation to your own code. But doing this provides only a single sample in time. Since the process is continually executing, your sample may change each time you run the command (unless it’s blocked or you just happened to be very lucky).
A sampling profiler and surrounding tools take multiple snapshots of the system over time and then provide you with the ability to look over this data and understand where your code is slow.
py-spy
Py-spy uses system calls (process_vm_readv
on Linux, vm_read
on OSX, ReadProcessMemory
on Windows) to obtain the call stack, then translates that information into the Python function calls that you see in your source code. It samples multiple times per second so it has a good chance of seeing your program in the various states that it will be in over time. It is written in Rust for speed.
Getting py-spy into your project is very simple, it’s installable via pip
. To show you how to use it, I’ve created some sample code to profile and observe how py-spy can tell us about a running Python process. If you want to follow along, you can easily reproduce these steps.
First, I setup a new virtual environment using py-env and the pyenv-virtualenv plugin for this project. You can do this or setup a virtual environment using your preferred tool.
# whichever Python version you prefer pyenv install 3.8.7 # make our virtualenv (with above version) pyenv virtualenv 3.8.7 py-spy # activate it pyenv activate py-spy # install py-spy pip install py-spy # make sure we pick up the commands in our path pyenv rehash
That’s all there is to it, we now have the tools available. If you run py-spy, you can see the common usage.
$ py-spy py-spy 0.3.4 Sampling profiler for Python programs USAGE: py-spy <SUBCOMMAND> OPTIONS: -h, --help Prints help information -V, --version Prints version information SUBCOMMANDS: record Records stack trace information to a flamegraph, speedscope or raw file top Displays a top like view of functions consuming CPU dump Dumps stack traces for a target program to stdout help Prints this message or the help of the given subcommand(s)
An example
In order to demonstrate py-spy, I’ve written a simple long-running process what will consume streaming prices from a cryptocurrency exchange and generate a record every minute (this is also known as a bar). The bar contains various information from the past minute. The bar includes the high, low, and last price, the volume, and the Volume Weighted Average Price (VWAP). Right now, the code only logs these values, but could be extended to update a database. While it’s simple, it is a useful example to use since cryptocurrencies trade around the clock and will give us real world data to work with.
I’m using a Coinbase Pro API for Python to access data from the WebSocket feed. Here’s a first cut that has some debugging code left in place (along with two ways to generate the VWAP, one inefficient (the _vwap
method) and one more efficient). Let’s see if py-spy reveals how much time this code uses.
This code will end up generating a thread for the WebSocket client. The asyncio loop will set a timer for the next minute boundary to tell the client to log the bar data. It will run until you kill it (with Ctrl-C, for example).
#!/usr/bin/env python import argparse import functools import datetime import asyncio import logging import arrow import cbpro class BarClient(cbpro.WebsocketClient): def __init__(self, **kwargs): super().__init__(**kwargs) self._bar_volume = 0 self._weighted_price = 0.0 self._ticks = 0 self._bar_high = None self._bar_low = None self.last_msg = {} self._pxs = [] self._volumes = [] def next_minute_delay(self): delay = (arrow.now().shift(minutes=1).floor('minutes') - arrow.now()) return (delay.seconds + delay.microseconds/1e6) def _vwap(self): if len(self._pxs): wp = sum([x*y for x,y in zip(self._pxs, self._volumes)]) v = sum(self._volumes) return wp/v def on_message(self, msg): if 'last_size' in msg and 'price' in msg: last_size = float(msg['last_size']) price = float(msg['price']) self._bar_volume += last_size self._weighted_price += last_size * price self._ticks += 1 if self._bar_high is None or price > self._bar_high: self._bar_high = price if self._bar_low is None or price < self._bar_low: self._bar_low = price self._pxs.append(price) self._volumes.append(last_size) logging.debug("VWAP: %s", self._vwap()) self.last_msg = msg logging.debug("Message: %s", msg) def on_bar(self, loop): if self.last_msg is not None: if self._bar_volume == 0: self.last_msg['vwap'] = None else: self.last_msg['vwap'] = self._weighted_price/self._bar_volume self.last_msg['bar_bar_volume'] = self._bar_volume self.last_msg['bar_ticks'] = self._ticks self.last_msg['bar_high'] = self._bar_high self.last_msg['bar_low'] = self._bar_low last = self.last_msg.get('price') if last: last = float(last) self._bar_high = last self._bar_low = last logging.info("Bar: %s", self.last_msg) self._bar_volume = 0 self._weighted_price = 0.0 self._ticks = 0 self._pxs.clear() self._volumes.clear() // reschedule loop.call_at(loop.time() + self.next_minute_delay(), functools.partial(self.on_bar, loop)) def main(): argparser = argparse.ArgumentParser() argparser.add_argument("--product", default="BTC-USD", help="coinbase product") argparser.add_argument('-d', "--debug", action='store_true', help="debug logging") args = argparser.parse_args() cfg = {"format": "%(asctime)s - %(levelname)s - %(message)s"} if args.debug: cfg["level"] = logging.DEBUG else: cfg["level"] = logging.INFO logging.basicConfig(**cfg) client = BarClient(url="wss://ws-feed.pro.coinbase.com", products=args.product, channels=["ticker"]) loop = asyncio.get_event_loop() loop.call_at(loop.time() + client.next_minute_delay(), functools.partial(client.on_bar, loop)) loop.call_soon(client.start) try: loop.run_forever() finally: loop.close() if __name__ == '__main__': main()
Running the example
To run this code, you’ll need to install a few extra modules. The cbpro module is a simple Python wrapper of the Coinbase APIs. Arrow is a nice library for doing datetime logic.
pip install arrow cbpro
Now, you can run the code with debug logging and hopefully see some messages, depending on how busy the exchange is.
./coinbase_client.py -d 2021-03-14 17:20:12,828 - DEBUG - Using selector: KqueueSelector -- Subscribed! -- 2021-03-14 17:20:13,059 - DEBUG - Message: {'type': 'subscriptions', 'channels': [{'name': 'ticker', 'product_ids': ['BTC-USD']}]} 2021-03-14 17:20:13,060 - DEBUG - VWAP: 60132.57
Profiling the example
Now, let’s review the py-spy commands. First, using the dump command will give us a quick view of the stack, translated to Python functions.
A quick side note here, if you’re using a Mac you will need to run py-spy as sudo. On Linux, it depends on your security settings. Also, since I was using pyenv I needed to pass on my environment to sudo using the -E
flag so it picks up the right Python interpreter and the py-spy script in the path. I obtained the process id for my process using the ps
command in my shell (in my case it was 97520).
py-spy dump
sudo -E py-spy dump -p 97520 Process 97520: /Users/mcw/.pyenv/versions/py-spy/bin/python ./coinbase_client.py -d Python v3.8.7 (/Users/mcw/.pyenv/versions/3.8.7/bin/python3.8) Thread 0x113206DC0 (idle): "MainThread" select (selectors.py:558) _run_once (asyncio/base_events.py:1823) run_forever (asyncio/base_events.py:570) main (coinbase_client.py:107) <module> (coinbase_client.py:113) Thread 0x700009CAA000 (idle): "Thread-1" read (ssl.py:1101) recv (ssl.py:1226) recv (websocket/_socket.py:80) _recv (websocket/_core.py:427) recv_strict (websocket/_abnf.py:371) recv_header (websocket/_abnf.py:286) recv_frame (websocket/_abnf.py:336) recv_frame (websocket/_core.py:357) recv_data_frame (websocket/_core.py:323) recv_data (websocket/_core.py:310) recv (websocket/_core.py:293) _listen (cbpro/websocket_client.py:84) _go (cbpro/websocket_client.py:41) run (threading.py:870) _bootstrap_inner (threading.py:932) _bootstrap (threading.py:890)
You can see there’s two threads running. One is reading data, the other is in select
in the run loop. This is only useful for profiling if our program is stuck. One really nice feature though is if you give it the --locals
option, it will show you any local variables, which can be really helpful for debugging!
py-spy top
The next command to try is top
.
sudo -E py-spy top -p 97520
This will bring up an interface that looks very similar to the unix top
command. As your program runs and py-spy gathers samples, it will show you where it is spending time. Here is a screenshot of what that looked like for me after about 30 seconds.
py-spy record
Finally, you can record data using py-spy for later analysis or output. There is a raw format, speedscope format, and a flamegraph output. You can specify the amount of time you want to collect data (in seconds), or just let it collect data until you exit the program using Ctrl-C. For example, this command will generate a useful little SVG file flamegraph that you can interact with in a web browser.
sudo -E py-spy record -p 97520 --output py-spy.svg
You can also export the data in speedscope format and then upload it to the speedscope web tool for further analysis. This is a great tool for interactively seeing how your code executes.
I’d encourage you to run this code on your own and play with both the speedscope tool and the SVG output, but here’s two screen shots that help explain how it works. This first view is the overall SVG output. If you hover over the cells, it will show you the function details. You can see that most of the time is spent in the WebSocket client _listen
method. But the on_message
method shows up to the right of that (designated by the arrow)
If we click on it, we get a detailed view.
In my case, I see that my list comprehension and logging in the unneeded _vwap
method show up fairly high in the profile. I can easily remove this method (and the extra prices and volumes that I was tracking) since the VWAP can be calculated with just a running product and total volume (as I’m doing already in the code). It’s also interesting to see when the script is run in debug mode how much time logging takes
Summary
In summary, I’d encourage you to try out py-spy on some of your code. If you try to predict where your code will spend its time, how correct are you? Are there any findings that surprise you? Maybe compare the output of py-spy to a deterministic profiler like line_profiler.
I hope this overview of py-spy has been helpful and that you can deploy this tool in diagnosing performance issues in your own code.