Profiling 101 for Python Developers: Existing Profilers 3/6

By Sümer Cip, on Feb 05, 2020

Blog post series index:

  1. What is a Profiler?
  2. The Many Types of Profilers
  3. List of Existing Python Profilers (you are here)
  4. Picking the Right Profiler
  5. Profiles visualizations
  6. Using Blackfire

Let’s have a look at some key profilers in the Python world, and how they work under the hood. The list includes various call tracing and memory profilers.

Note: This list is written as of January 2020, so there is a possibility that the information is outdated when you read this.

cProfile

cProfile is probably the go-to solution for most Python developers.

It is a deterministic profiler and is included in the Python standard library. It measures wall time per function call and is useful for profiling simple function calls or whole scripts (via python -m cProfile).

This is the output you will get with cProfile:

import time
import cProfile

def foo():
    time.sleep(1.0)

cProfile.run('foo()')

'''
   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.001    1.001 <string>:1(<module>)
        1    0.000    0.000    1.001    1.001 cprofile.py:5(foo)
        1    1.001    1.001    1.001    1.001 {built-in method time.sleep}
'''

The percall fields are quotients of tottime and cumtime divided by ncalls.

  • tottime is the time spent in the given function, excluding subcalls. That is why we have 0 for the function foo.
  • cumtime is the cumulative time spent, including subcalls.
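The output can also be sorted, or saved and inspected later with the pstats module; a minimal sketch (the file name here is just an example):

import cProfile
import pstats
import time

def foo():
    time.sleep(1.0)

# sort directly while printing ...
cProfile.run('foo()', sort='cumulative')

# ... or save the raw stats and inspect them later with pstats
cProfile.run('foo()', 'foo.prof')
pstats.Stats('foo.prof').sort_stats('cumulative').print_stats(10)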

Yappi

Yappi is another deterministic profiler, and it aims to tackle some real-world needs which cProfile does not cover, like:

  • profiling of multi-threaded applications
  • CPU time profiling

Yappi can be started from any thread with a simple yappi.start() call. Both wall and CPU time can be measured via the yappi.set_clock_type() API. You can also examine per-thread function traces by filtering on a thread's id via yappi.get_func_stats(filter=...). See the API documentation for more details.

Let’s see the same example above:

import time
import yappi

def foo():
    time.sleep(1.0)

yappi.set_clock_type("wall")
yappi.start()
foo()
yappi.stop()
yappi.get_func_stats().print_all()

'''
Clock type: WALL
Ordered by: totaltime, desc

name                                  ncall  tsub      ttot      tavg
yappi_example.py:5 foo                1      0.000009  1.001098  1.001098
'''

You can see that we set the clock type we want via the set_clock_type() API before starting the profiler. The output is quite similar to cProfile's: tsub is the time spent in the function excluding subcalls, and ttot is the time spent in the function including subcalls.
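For the multi-threaded and CPU-time cases mentioned above, usage looks roughly like this. This is a sketch; the filter key and thread-stat attributes below are as documented by yappi at the time of writing, so double check them against your version:

import threading
import yappi

def burn_cpu():
    sum(i * i for i in range(10**6))

yappi.set_clock_type("cpu")  # measure CPU time instead of wall time
yappi.start()
threads = [threading.Thread(target=burn_cpu) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
yappi.stop()

# per-thread function traces, filtered by the thread's context id
for thread in yappi.get_thread_stats():
    print("Thread id:", thread.id)
    yappi.get_func_stats(filter={"ctx_id": thread.id}).print_all()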

Yappi can output profile information to pstats or callgrind formats.
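For example, after a run like the one above, the collected stats can be saved in either format (the file names are just examples):

# assumes yappi.start()/yappi.stop() have already been run, as above
stats = yappi.get_func_stats()
stats.save("callgrind.out", type="callgrind")  # open with KCachegrind / QCacheGrind
stats.save("profile.pstat", type="pstat")      # load with the pstats module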

pyinstrument

pyinstrument is a statistical profiler which can output its results as text, JSON, or HTML. It also comes with embedded Django and Flask middleware.
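It can be used either from the command line (pyinstrument example.py) or programmatically; a minimal sketch of the latter, assuming the pyinstrument 3.x API:

import time

from pyinstrument import Profiler

def foo():
    time.sleep(1.0)

profiler = Profiler()
profiler.start()
foo()
profiler.stop()
print(profiler.output_text(unicode=True, color=True))

The example that follows is the command-line flavor.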

Let’s see the output it generates:

import time

def bar():
    time.sleep(1.0)

def foo():
    bar()

foo()

'''
  _     ._   __/__   _ _  _  _ _/_   Recorded: 14:29:58  Samples:  1
 /_//_/// /_\ / //_// / //_'/ //     Duration: 1.001     CPU time: 0.000
/   _/                      v3.0.3

Program: example.py

1.001 <module>  example.py:1
└─ 1.001 foo  example.py:8
   └─ 1.001 bar  example.py:4

To view this report with different options, run:
    pyinstrument --load-prev 2020-01-03T14-29-58 [options]
'''

You can see that all call stack information is saved. There are options to save the output as a timeline without merging the sample results into a single result.

pyinstrument uses internal sampling, but instead of interrupting the process every X milliseconds, it takes a very different approach. It calls the sys.setprofile() API just like a deterministic profiler. When a call event happens, it uses the event to save the call stack information, saves a timestamp to remember the last sample time, and disables the setprofile callback.

This avoids most of the complexities that come with dealing with Unix signals, and there is no additional overhead involved (as far as I measured). See here for more details.
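Here is a toy illustration of that general idea, not pyinstrument's actual implementation: a sys.setprofile() callback that records a stack sample only when enough time has passed since the previous one.

import sys
import time
import traceback

INTERVAL = 0.001  # sample at most once per millisecond
samples = []
last_sample = 0.0

def sampling_callback(frame, event, arg):
    global last_sample
    now = time.time()
    if now - last_sample >= INTERVAL:
        samples.append(traceback.extract_stack(frame))  # record the current call stack
        last_sample = now

def leaf():
    return sum(range(50))

def busy():
    for _ in range(100000):
        leaf()

sys.setprofile(sampling_callback)
busy()
sys.setprofile(None)
print(len(samples), "samples collected")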

py-spy

py-spy uses an external sampling approach, which means you neither need to install anything in the profiled process nor instrument anything in the application. You can attach to any process by its PID and output the profile results in speedscope or flamegraph formats.

There are also two nifty commands in the CLI interface: py-spy top and py-spy dump. top simply shows a top-like output of the executing functions in the attached application, whereas dump dumps the call stacks of all threads running in the application. These can help a lot in debugging deadlocks or intermittent performance issues. The overhead is no more than ~10%.
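For reference, the basic invocations look like this (12345 is a placeholder PID; record writes a flamegraph by default):

py-spy top --pid 12345
py-spy dump --pid 12345
py-spy record -o profile.svg --pid 12345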

Let’s see an example output:

import time

def bar():
    time.sleep(1.0)

def foo():
    bar()

foo()
'''
GIL: 0.00%, Active: 4.44%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
  4.44%   4.44%   0.040s    0.040s   bar (example.py:5)
  0.00%   4.44%   0.000s    0.040s   foo (example.py:9)
  0.00%   4.44%   0.000s    0.040s   <module> (example.py:12)
'''

Note that the numbers here are biased by too few samples, which won’t happen when used in production.

tracemalloc

tracemalloc is included in the Python standard library as of Python 3.4.

The tracemalloc module is a tool to trace allocated memory blocks. It provides the call traceback showing where an object was allocated, and it can aggregate this data to show statistics on allocated memory blocks per filename and per line number. More details can be found in its API documentation.

It uses a new C API, PyMem_SetAllocator, to hook the malloc/realloc/free functions. Whenever an allocation event happens, the hook is called and tracemalloc saves a traceback along with the allocation. You can later inspect the traceback to see where the allocation happened and debug memory issues easily.

It also has a backport for Python 2: pytracemalloc. It requires manually patching and compiling the Python source to support the PyMem_SetAllocator/PyMem_GetAllocator APIs defined in PEP 445. Backporting tracemalloc is explained here.

Let’s see an example:

import tracemalloc

class A:

    def __init__(self):
        self._d = []

    def alloc(self):
        self._d += [1] * 512

tracemalloc.start(25)
a = A()
a.alloc()
stats = tracemalloc.take_snapshot().statistics('traceback')
for t in stats:
    print(t)
'''
tm.py:15: size=4656 B, count=1, average=4656 B
tm.py:14: size=64 B, count=1, average=64 B
tm.py:14: size=40 B, count=1, average=40 B
'''

You can see that there are three allocations happening. For line 14 (a = A()), tracemalloc shows both the allocation of a and of self._d. For line 15 (a.alloc()), tracemalloc shows the growth of the self._d list.

As tracemalloc operates for every malloc/free, it can add some overhead to the application. In my tests, I have seen nearly 1.5-2x overhead at worst. So, it might not be something you would like to add to production and forget but it might be very helpful in situations where you have identified a memory issue and want to make further analysis.
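A common workflow for that kind of analysis is to take two snapshots and compare them, so only the allocations made in between show up; a minimal sketch:

import tracemalloc

tracemalloc.start()

snapshot_before = tracemalloc.take_snapshot()
data = [{"index": i} for i in range(10000)]  # allocate something in between
snapshot_after = tracemalloc.take_snapshot()

# show the top 5 sources of new allocations, grouped by line number
for stat in snapshot_after.compare_to(snapshot_before, 'lineno')[:5]:
    print(stat)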

objgraph

objgraph supports both Python2.7/3.x and can be installed easily via pip.

objgraph takes a different approach than tracemalloc. It shows reference chains to allocated objects and lets you visualize this kind of information as image files. Instead of tracing malloc/free, objgraph looks at objects that are in memory and keeps asking each object “who is pointing to you?” until it reaches a root object.

Let’s see an example:

import objgraph

x = [1]
y = [x]
objgraph.show_backrefs([x], filename='sample-graph.png')

And the output:

You can see the reference chain for x: x is referenced by a module’s __dict__ as it is a global variable, and it is also referenced by another global variable, y. This example does not have very deep reference chains, but in the real world you could see long chains of references pointing to the allocated object.

Here is an example taken from a real case (from here):

You can also get objects which are not referenced by any other object but still reside in memory (a possible reference leak) via the objgraph.get_leaking_objects() API.
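A couple of these helpers in action (the function names below are from the objgraph documentation):

import objgraph

objgraph.show_most_common_types(limit=5)  # most numerous object types currently in memory
objgraph.show_growth()                    # object-count deltas since the previous call

leaking = objgraph.get_leaking_objects()
print(len(leaking), "objects are not referenced by any other object")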

objgraph does not add any overhead to the application as it does not need to hold additional data. However, the objects and their references can form a huge graph in themselves, and traversing it might take a while.

The API interface of objgraph is quite unique and comprehensive. You can do lots of introspection related to objects and references. You can find the API documentation here.

Blackfire.io

Blackfire is a deterministic profiler, offered as a SaaS.

It works with a language extension, the “probe”, which collects profiling data, and an “agent” which computes the profiling data before sending it to the Blackfire.io servers (removing any sensitive data, and making it more lightweight).

Profiling is triggered on demand, either with a browser extension or with a command line tool, for HTTP requests as well as CLI scripts.

Profiles can then be displayed via Blackfire’s web UI, shared with teammates, and compared with other profiles, without any overhead for the end user, as we’ll see in our next article!

Next article: Picking the Right Profiler

Sümer Cip

Sümer Cip is currently a Senior Software Engineer at Blackfire. He has been coding for nearly 20 years and contributed to many open source projects in his career. He has a deep love for Python and obsession with code performance and low-level stuff.