Profiling 101 for Python Developers: Existing Profilers 3/6
This article is the third in a series of six on profilers in the Python world, and how Blackfire is the best-of-breed tool to introspect code behavior and optimize its performance.
Blog post series index:
- What is a Profiler?
- The Many Types of Profilers
- List of Existing Python Profilers (you are here)
- Picking the Right Profiler
- Profile visualizations
- Using Blackfire
Let’s have a look at some key profilers in the Python world, and how they work under the hood. The list includes various call tracing and memory profilers.
Note: This list was written as of January 2020, so some of the information may be outdated by the time you read it.
cProfile
cProfile is probably the go-to solution for most Python developers.
It is a deterministic profiler included in the Python standard library. It measures wall time per function call and is useful for profiling simple function calls as well as whole scripts (via python -m cProfile).
This is the output you will get with cProfile:
```python
import time
import cProfile


def foo():
    time.sleep(1.0)


cProfile.run('foo()')

'''
Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.001    1.001 <string>:1(<module>)
        1    0.000    0.000    1.001    1.001 cprofile.py:5(foo)
        1    1.001    1.001    1.001    1.001 {time.sleep}
'''
```
The percall fields are quotients of tottime and cumtime divided by ncalls. tottime is the time spent in the given function, excluding subcalls; that is why we see 0 for foo. cumtime is the cumulative time spent, including subcalls.
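If you need more control over the report, the collected stats can also be written to a file and re-ordered with the standard pstats module. Here is a minimal sketch of that workflow (the foo.prof filename is just an example):

```python
import cProfile
import pstats
import time


def foo():
    time.sleep(1.0)


# Write the raw profile data to a file ('foo.prof' is just an example name)
cProfile.run('foo()', 'foo.prof')

# Load the stats, sort by cumulative time and print the 10 costliest entries
stats = pstats.Stats('foo.prof')
stats.sort_stats('cumulative').print_stats(10)
```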
Yappi
Yappi is another deterministic profiler that aims to tackle some real-world needs which cProfile does not cover, such as:
- profiling of multi-threaded applications
- CPU time profiling
Yappi can be started from any thread with a simple call like yappi.start(). Both wall and CPU time can be measured via the yappi.set_clock_type() API call. You can also examine per-thread function traces by filtering on a thread's id; see yappi.get_func_stats(filter=...) and the API documentation for more details.
Let’s see the same example as above:
```python
import time
import yappi


def foo():
    time.sleep(1.0)


yappi.set_clock_type("wall")
yappi.start()
foo()
yappi.stop()
yappi.get_func_stats().print_all()

'''
Clock type: WALL
Ordered by: totaltime, desc

name                      ncall  tsub      ttot      tavg
yappi_example.py:5 foo    1      0.000009  1.001098  1.001098
'''
```
You can see that we set the clock type we want via the set_clock_type() API before starting the profiler. The output is pretty similar to cProfile's: tsub is the time spent in the function excluding subcalls and ttot is the time spent in the function including subcalls.
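To give an idea of the multi-threaded profiling mentioned above, here is a rough sketch. The filter dictionary with a ctx_id key is one way the filter=... argument can be used; check Yappi's documentation for the exact keys supported by your version:

```python
import threading
import time
import yappi


def worker():
    time.sleep(0.5)


yappi.set_clock_type("cpu")
yappi.start()

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

yappi.stop()

# One line per thread: its id, name and scheduled time
yappi.get_thread_stats().print_all()

# Restrict the function stats to a single thread's context id
# (assumed filter key; see Yappi's docs for your version)
for tstat in yappi.get_thread_stats():
    yappi.get_func_stats(filter={"ctx_id": tstat.id}).print_all()
```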
Yappi can output profile information to pstats or callgrind formats.
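Continuing the example above, saving to a callgrind-compatible file (which tools like KCachegrind can open) would look roughly like this; the filenames are just examples:

```python
# Save the collected stats in callgrind format...
yappi.get_func_stats().save('callgrind.out', type='callgrind')

# ...or in pstats format, readable with the standard pstats module
yappi.get_func_stats().save('foo.pstat', type='pstat')
```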
pyinstrument
pyinstrument is a statistical profiler which can output its results as text, json or html. It also has an embedded Django and Flask middleware.
Let’s see the output it generates:
```python
import time


def bar():
    time.sleep(1.0)


def foo():
    bar()


foo()

'''
  _     ._   __/__   _ _  _  _ _/_   Recorded: 14:29:58  Samples:  1
 /_//_/// /_\ / //_// / //_'/ //     Duration: 1.001     CPU time: 0.000
/   _/                      v3.0.3

Program: example.py

1.001 <module>  example.py:1
└─ 1.001 foo  example.py:8
   └─ 1.001 bar  example.py:4

To view this report with different options, run:
    pyinstrument --load-prev 2020-01-03T14-29-58 [options]
'''
```
You can see that all call stack information is saved. There are options to save the output as a timeline without merging the sample results into a single result.
pyinstrument uses internal sampling, but instead of interrupting the process every X milliseconds it takes a very different approach: it calls the sys.setprofile() API just like a deterministic profiler. When a call event happens, it uses the event to save the call stack information, saves a timestamp to remember the last sample time, and disables the setprofile callback. This overcomes most of the complexities that come with dealing with Unix signals, and there is no additional overhead involved (as far as I measured). See here for more details.
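Besides the command line, pyinstrument can also be driven from Python code. A minimal sketch of that usage might look like this:

```python
import time

from pyinstrument import Profiler


def bar():
    time.sleep(1.0)


def foo():
    bar()


profiler = Profiler()
profiler.start()
foo()
profiler.stop()

# Print the same tree-like report to the console
print(profiler.output_text(unicode=True, color=False))
```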
py-spy
py-spy uses an external sampling approach, which means you neither need to install anything in the system nor instrument anything in the application. You can attach to any process by its PID and output the profile results in speedscope or flamegraph formats.
There are also two nifty commands in the CLI interface: py-spy top and py-spy dump. top shows a top-like view of the executing functions in the attached application, whereas dump dumps the call stacks of all the threads running in the application. These can help a lot in debugging deadlocks or intermittent performance issues. The overhead is no more than ~10%.
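Typical invocations look something like this (12345 is a placeholder PID):

```bash
# Show a live, top-like view of the attached process
py-spy top --pid 12345

# Dump the current call stack of every thread
py-spy dump --pid 12345

# Record samples and write a flamegraph
py-spy record -o profile.svg --pid 12345
```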
Let’s see an example output:
```python
import time


def bar():
    time.sleep(1.0)


def foo():
    bar()


foo()

'''
GIL: 0.00%, Active: 4.44%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
  4.44%   4.44%   0.040s    0.040s   bar (example.py:5)
  0.00%   4.44%   0.000s    0.040s   foo (example.py:9)
  0.00%   4.44%   0.000s    0.040s   <module> (example.py:12)
'''
```
Note that the numbers here are biased by too few samples; this won’t happen when profiling in production.
tracemalloc
tracemalloc is included in the Python standard library as of Python 3.4.
The tracemalloc module is a tool to trace allocated memory blocks. It provides the call traceback where an object was allocated, and it can aggregate this data to show statistics on allocated memory blocks per filename and per line number. More details on its API can be found in its documentation.
It uses a new C API, PyMem_SetAllocator, to hook the malloc/realloc/free functions. Whenever an allocation event happens, the hook function is called and tracemalloc saves a traceback along with the allocation. You can later inspect the traceback and see where the allocation happened, which makes debugging memory issues much easier.
It also has a backport for Python 2: pytracemalloc. It requires manually patching and compiling the Python source to support the PyMem_SetAllocator/PyMem_GetAllocator APIs defined in PEP 445. Backporting tracemalloc is explained here.
Let’s see an example:
```python
import tracemalloc


class A:
    def __init__(self):
        self._d = []

    def alloc(self):
        self._d += [1] * 512


tracemalloc.start(25)

a = A()
a.alloc()

stats = tracemalloc.take_snapshot().statistics('traceback')
for t in stats:
    print(t)

'''
tm.py:15: size=4656 B, count=1, average=4656 B
tm.py:14: size=64 B, count=1, average=64 B
tm.py:14: size=40 B, count=1, average=40 B
'''
```
You can see that there are three allocations happening. For line 14 (a = A()), tracemalloc shows both the allocation of a and of self._d. For line 15 (a.alloc()), tracemalloc shows the growth of the self._d list.
As tracemalloc operates on every malloc/free, it can add some overhead to the application. In my tests, I have seen nearly 1.5-2x overhead at worst. So it might not be something you want to add to production and forget, but it can be very helpful in situations where you have identified a memory issue and want to analyze it further.
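For that kind of analysis, comparing two snapshots is often the quickest way to see what grew in between. Here is a small sketch using tracemalloc's snapshot comparison (the bytearray allocation simply stands in for code suspected of leaking):

```python
import tracemalloc

tracemalloc.start()

snapshot_before = tracemalloc.take_snapshot()

# ... the code you suspect of leaking; here a stand-in allocation
leaked = [bytearray(1024) for _ in range(1000)]

snapshot_after = tracemalloc.take_snapshot()

# Show the lines that allocated the most new memory between the snapshots
for stat in snapshot_after.compare_to(snapshot_before, 'lineno')[:5]:
    print(stat)
```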
objgraph
objgraph supports both Python 2.7 and 3.x and can be installed easily via pip.
objgraph takes a different approach than tracemalloc. It shows reference chains to allocated objects and lets you visualize this information as image files. Instead of tracing malloc/free, objgraph looks at the objects that are in memory and asks each one the question "who is pointing to you?" until it reaches a root object.
Let’s see an example:
```python
import objgraph

x = [1]
y = [x]

objgraph.show_backrefs([x], filename='sample-graph.png')
```
And the output:
You can see the reference chain for x: x is referenced by a module’s __dict__ as it is a global variable, and it is also referenced by another global variable, y. This example does not have very deep reference chains, but in the real world you could see long chains of references pointing to the allocated object.
Here is an example taken from a real case (from here):
You can also get objects which are not referenced by any other object but still reside in memory (a possible reference leak) via the objgraph.get_leaking_objects() API.
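A rough sketch of that workflow, combined with objgraph's type summary helper, could look like this:

```python
import objgraph

# Objects that are alive but have no referrers pointing at them
# (often an artifact of C extensions, and a hint at reference leaks)
roots = objgraph.get_leaking_objects()

# Summarize them by type to spot what is accumulating
objgraph.show_most_common_types(objects=roots)
```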
objgraph does not add any overhead to the application as it does not need to hold additional data. However, the objects and their references can form a huge graph by themselves, and traversing it might take a while.
The API interface of objgraph is quite unique and comprehensive. You can do lots of introspection related to objects and references. You can find the API documentation here.
Blackfire.io
Blackfire is a deterministic profiler, offered as a SaaS.
It works with a language extension, the “probe”, which collects profiling data, and an “agent” which processes the profiling data before sending it to the Blackfire.io servers (removing any sensitive data and making it more lightweight).
Profiling is triggered on demand, either with a browser extension or a command-line tool, for HTTP requests as well as CLI scripts.
Profiles can then be displayed via Blackfire’s web UI, shared with teammates, and compared with other profiles. All of this without any overhead for the end user, as we’ll see in our next article!
Next article: Picking the Right Profiler