Have you ever wondered how to speed up your code in Python? This presentation will show you how to start. I will begin with a guide on how to locate performance bottlenecks and then give you some tips on how to speed up your code. I would also like to discuss how to avoid premature optimisation, as it may be ‘the root of all evil’ (at least according to D. Knuth).
2. about me
• Piotr Przymus PhD
• work @ Nicolaus Copernicus University in Toruń
• Interests: data mining and machine learning, databases,
GPGPU computing, high-performance computing.
• 9 years of Python experience.
4. introduction
Programmers waste enormous amounts of time thinking about, or
worrying about the speed of noncritical parts of their programs,
and these attempts at efficiency actually have a strong negative
impact when debugging and maintenance are considered. We
should forget about small efficiencies, say about 97% of the time:
premature optimisation is the root of all evil.
Donald Knuth, “Structured Programming With Go To
Statements”, 1974.
Yet we should not pass up our opportunities in that critical 3%.
5. premature optimisation
Premature optimisation may be stated as optimising code before
knowing whether we need to.
This may be bad as it impacts:
• your productivity,
• readability of the code,
• ease of maintenance and debugging,
• and it may contradict The Zen of Python ;).
Learn how to do proper assessment of your code in terms of
optimisation needs!
Remember that a strong feeling that your code falls into the
remaining 3% does not count!
6. think before doing (think before coding)
Going for higher performance without a deeper reason may be just
a waste of your time. So start with:
• stating your reasons (Why do you need higher performance?),
• defining your goals (What would be an acceptable speed of
your code?),
• estimating time and resources you are willing to spend to
achieve these goals.
Re-evaluate all the pros and cons.
7. why do you need higher performance?
Good reasons:
• Computation cost reduction
• Significantly better user experience
• Significantly faster results
8. what would be an acceptable speed of your code?
This is an important and difficult question to answer!
• Computation cost reduction
• Large projects with lots of computations
• They may benefit just from few percent improvements.
• Significantly better user experience of web/desktop
application.
• Note that user experience is subjective; the user may:
• not notice the difference,
• or may not care about the change.
• The User is Always Right
• Significantly faster results
• Scientific computing, Data mining, Machine learning
• Large data sets processing
• Example: going from weeks to one day makes a huge
difference.
9. amdahl’s law
Amdahl’s law is used to find the maximum expected improvement
to an overall system when only part of the system is improved.
(wiki)
• Often used in parallel computing to predict the theoretical
maximum speedup.
• Assumes that the problem size remains the same!
Maximum expected improvement of a system, when only part of
the computation is improved:

improvement = 1 / ((1 − P) + P/S)

where:
• P is the proportion of the computation that is improved,
• S is the speedup of the improved part.
11. amdahl’s law – example
If we improve:
• 30% of computations,
• so that they run twice as fast,
then P = 0.3 and S = 2, and the overall system improvement is
only

1 / ((1 − 0.3) + 0.3/2) ≈ 1.1765.
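The formula is easy to sanity-check in code; a minimal sketch (the function name is mine):

```python
def amdahl_speedup(p, s):
    """Overall system speedup when a fraction p of the work
    is accelerated by a factor of s (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + p / s)

# Improving 30% of the computations so they run twice as fast:
print(round(amdahl_speedup(0.3, 2), 4))  # → 1.1765
```

Note how little the overall system gains even from doubling the speed of a third of the work.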
13. test, measure, track down bottlenecks
A starting point for optimisation is a running code that gives
correct results.
• Prepare a regression test suite!
Then the rest of the optimisation process may be summarised as:
1. Test that the code works correctly.
2. Measure execution time:
• if the code is not fast enough, use a profiler to identify the
bottlenecks,
• else you're done!
3. Fix the performance problems.
4. Start from the beginning.
14. regression test suite
Before you start, prepare a regression test suite that:
• will guard the correctness of your code during the
optimisation.
• is comprehensive yet quick to run.
Tests will be run very often – a reasonable execution time is a must!
15. measuring execution time
Measure execution time of your code. This is important because:
• it shows if you are getting any progress,
• it shows how far it is from the desired execution time (a.k.a.
acceptable speed),
• it allows you to compare various versions of optimisations.
16. measuring execution time
There are various tools to do that, among them:
• a custom-made timer,
• Python's timeit module,
• unix time (use /usr/bin/time, as time is also a common shell
built-in).
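A custom-made timer can be as small as a context manager around time.perf_counter (a sketch of mine, not from the talk; requires Python 3.3+):

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    """Print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print("%s: %.6f s" % (label, elapsed))

with timer("join 100 numbers"):
    "-".join(str(n) for n in range(100))
```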
17. timeit
This module provides a simple way to time small bits of Python
code. It has:
• a command-line interface

$ python -m timeit '"-".join([str(n) for n in range(100)])'
10000 loops, best of 3: 33.4 usec per loop
$ python -m timeit '"-".join(map(str, range(100)))'
10000 loops, best of 3: 25.2 usec per loop

• a Python interface

>>> timeit.timeit('"-".join([str(n) for n in range(100)])', number=10000)
0.7288308143615723
>>> timeit.timeit('"-".join(map(str, range(100)))', number=10000)
18. /usr/bin/time -v – simple but useful
Command being timed: "python universe-new.py"
User time (seconds): 0.38
System time (seconds): 1.61
Percent of CPU this job got: 26%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.46
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 22900
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 64
Minor (reclaiming a frame) page faults: 6370
Voluntary context switches: 3398
Involuntary context switches: 123
Swaps: 0
File system inputs: 25656
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
19. measuring execution time
Notes on measuring:
• Try to measure multiple independent repetitions of your code.
• Establish the lower bound of your execution time!
• Prepare a testing environment that will allow you to get
comparable results.
• Consider writing a micro-benchmark to check various
alternative implementations of an algorithm.
• Be careful when measuring speed using artificial data.
• Re-validate using real data.
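The lower bound mentioned above can be estimated with timeit.repeat: each repetition is an independent timing, and the minimum is the least-disturbed run. A sketch:

```python
import timeit

# Five independent repetitions of 1000 loops each; the minimum
# is a reasonable estimate of the lower bound of the execution time.
timings = timeit.repeat(
    '"-".join(map(str, range(100)))',
    repeat=5, number=1000)
print("best of 5: %f s per 1000 loops" % min(timings))
```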
20. tracking down the bottlenecks
Profiling tools will give you a more in-depth view of your code's
performance.
Take a view of your program's internals in terms of:
• execution time
• and used memory.
21. tracking down the bottlenecks
There are various possible tools, like:
• vmprof – see next talk for details!
• cProfile – a profiling module available in Python standard
library,
• line_profiler – an external line-by-line profiler,
• tools for visualizing profiling results, such as runsnakerun.
22. output of cprofile
cProfile provides deterministic profiling of Python programs.
• command-line interface

python -m cProfile [-o output_file] [-s sort_order] myscript.py

• Python interface

import cProfile
import re
cProfile.run('re.compile("foo|bar")')
23. output of cprofile
197 function calls (192 primitive calls) in 0.002 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    0.000    0.000    0.001    0.001  <string>:1(<module>)
     1    0.000    0.000    0.001    0.001  re.py:212(compile)
     1    0.000    0.000    0.001    0.001  re.py:268(_compile)
     1    0.000    0.000    0.000    0.000  sre_compile.py:172(_compile_charset)
     1    0.000    0.000    0.000    0.000  sre_compile.py:201(_optimize_charset)
     4    0.000    0.000    0.000    0.000  sre_compile.py:25(_identityfunction)
   3/1    0.000    0.000    0.000    0.000  sre_compile.py:33(_compile)
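Profiles saved with -o output_file (or collected in-process, as below) can be sorted and filtered with the standard-library pstats module; a small sketch:

```python
import cProfile
import pstats
import re

# Collect a profile in-process instead of via the command line.
pr = cProfile.Profile()
pr.enable()
re.compile("foo|bar")
pr.disable()

# Sort by cumulative time and show the top 5 entries.
stats = pstats.Stats(pr)
stats.sort_stats("cumulative").print_stats(5)
```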
25. output of line_profiler
> python "C:\Python27\Scripts\kernprof.py" -l -v example.py
Wrote profile results to example.py.lprof
Timer unit: 3.2079e-07 s

File: example.py
Function: do_stuff at line 2
Total time: 0.00185256 s

Line #      Hits      Time   Per Hit   % Time  Line Contents
==============================================================
     1                                         @profile
     2                                         def do_stuff(numbers):
     3         1      5775    5775.0    100.0      print numbers
27. io bound vs compute bound
Learn how to classify types of performance bounds.
• The compute bound – a large number of instructions is
making your code slow,
• the I/O bound – your code is slow because of various I/O
operations, like:
• disk access, network delays, other I/O.
Depending on the type of the bound, different optimisation
strategies will apply.
29. algorithms and data structures
Improving your algorithm's time complexity is probably the best
thing you can do to optimise your code!
• Micro optimisation tricks will not bring you anywhere near to
the speed boost you could get from improving time complexity
of algorithm.
The big O notation matters!
• Check data structures used in your algorithms!
• Check out Time complexity @ Python’s Wiki
29
30. algorithms and data structures – example
Innocent lookup code placed in a large loop may create a
performance issue.
def sanitize_1(user_input, stop_words):
    """Sanitize using standard lists: build new_list, iterate
    over user_input, check membership in the stop_words list."""
    new_list = []
    for w in user_input:         # longer list
        if w not in stop_words:  # shorter list
            new_list.append(w)
    return new_list

• Real data (Project Gutenberg, extended English stop list)
• Execution time: 'pg11.txt': 2.446 s, 'pg1342.txt': 9.896 s,
'pg76.txt': 9.086 s
31. algorithms and data structures – example
Innocent lookup code placed in a large loop may create a
performance issue.
def sanitize_1d(user_input, stop_words):
    """Sanitize using a list comprehension: iterate over
    user_input, check membership in the stop_words list."""
    return [w for w in user_input if w not in stop_words]

• Real data (Project Gutenberg, extended English stop list)
• Execution time: 'pg11.txt': 2.418 s, 'pg1342.txt': 9.796 s,
'pg76.txt': 8.984 s
32. algorithms and data structures – example
Often a trivial change, like changing a list to a set, may be the key
to solving the problem.
def sanitize_2d(user_input, stop_words):
    """Sanitize using a list comprehension and a set."""
    # even better if stop_words is already a set
    stop_words = set(stop_words)
    return [w for w in user_input if w not in stop_words]

• Real data (Project Gutenberg, extended English stop list)
• Execution time: 'pg11.txt': 0.028 s, 'pg1342.txt': 0.134 s,
'pg76.txt': 0.123 s
Nearly two orders of magnitude faster!
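The effect is easy to reproduce on synthetic data (the word lists below are made up; the slides used the Project Gutenberg files):

```python
import random
import timeit

random.seed(0)
words = ["w%d" % random.randrange(5000) for _ in range(10000)]
stop_list = ["w%d" % i for i in range(500)]  # list: O(n) membership test
stop_set = set(stop_list)                    # set: O(1) average membership test

t_list = timeit.timeit(lambda: [w for w in words if w not in stop_list], number=3)
t_set = timeit.timeit(lambda: [w for w in words if w not in stop_set], number=3)
print("list: %.3f s, set: %.3f s" % (t_list, t_set))
```

Both versions return the same result; only the membership-test cost differs.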
33. algorithms and data structures – in the wild
See excellent “A Python Optimization Anecdote” written by Pavel
Panchekha from Dropbox.
34. memory and i/o bounds
Some performance issues may be memory-related, so check
memory utilization! Typical symptoms that indicate your code
may have memory problems:
• your program never releases memory,
• or your program allocates way too much memory.
Also check if your code uses memory efficiently.
See my previous talk and the references included therein.
• “Everything You Always Wanted to Know About Memory in
Python But Were Afraid to Ask”
35. memory and i/o bounds
I/O bounds may require more effort to deal with. Depending on
the problem there may be various solutions; consider using:
• asynchronous I/O with Python
• probabilistic and heuristic data structures instead of real data
• like Bloom filters,
• which are used to test whether an element is a member of a
set,
• false positive matches are possible, but false negatives are not.
• compressed data structures and lightweight compression
algorithms
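A Bloom filter can be sketched in a few lines with hashlib (a toy illustration of the idea, not a production implementation; the parameters m and k are arbitrary here):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0  # a big integer used as the bit array

    def _positions(self, item):
        # Derive k positions by salting a cryptographic hash.
        for i in range(self.k):
            digest = hashlib.sha256(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May report a false positive, never a false negative.
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("spam")
print("spam" in bf)  # → True
```

The whole set membership test costs k hash lookups and m bits of memory, regardless of how large the real data set is.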
36. lightweight compression
Lightweight compression algorithms – a family of algorithms
primarily intended for real-time applications.
They favour compression and decompression speed over
compression ratio.
• Improved data transfer
• Lower memory footprint
• In some cases – improved internal memory access
Figure: the lightweight compression idea – total time (data
transfer + processing time) with and without compression.
37. lightweight compression
Lightweight compression algorithms in Python:
• bindings to Snappy, lz4, others.
• write your own compression scheme.
Cassandra example:
Depending on the data characteristics of the table, compressing its
data can result in:
• 2x-4x reduction in data size
• 25-35% performance improvement on reads
• 5-10% performance improvement on writes
Cassandra supports both Snappy and lz4.
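The standard library ships neither Snappy nor lz4 bindings, but zlib at its lowest level illustrates the same speed-versus-ratio trade-off (a sketch; the payload is made up):

```python
import zlib

data = b"some highly repetitive payload " * 10000

fast = zlib.compress(data, 1)   # level 1: favours speed
small = zlib.compress(data, 9)  # level 9: favours compression ratio

print("raw: %d, level 1: %d, level 9: %d bytes"
      % (len(data), len(fast), len(small)))
```

Dedicated lightweight codecs such as Snappy or lz4 push this trade-off much further toward speed than zlib level 1 does.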
38. iteration independent calculations
Bring iteration-independent calculations outside of the loop.
This is common sense and good practice.
• fix loops with code that performs computations that do not
change within the loop.
Beware that such operations may be hidden in a class method or in
a free function.
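As a sketch of the idea (the names are mine): the scaling factor below does not depend on the loop variable, so it can be computed once, outside the loop.

```python
import math

# Before: math.sqrt(2) / len(values) is recomputed on every iteration.
def scale_naive(values):
    out = []
    for v in values:
        out.append(v * (math.sqrt(2) / len(values)))
    return out

# After: the loop-invariant factor is hoisted out of the loop.
def scale_hoisted(values):
    factor = math.sqrt(2) / len(values)  # computed once
    return [v * factor for v in values]

values = list(range(10000))
```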
39. branching in large loops.
Try to avoid conditional branching in large loops.
Check whether, instead of having if/else statements in the loop
body:
• it is possible to do the conditional check outside the loop,
• unroll the branch in the loop,
• or have separate loops for the different branches.
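As a sketch (the example is mine): partitioning numbers by parity with a per-iteration branch versus two separate passes. Whether the split wins depends on the workload, so measure both.

```python
# One loop, branching on every iteration:
def split_branchy(numbers):
    evens, odds = [], []
    for n in numbers:
        if n % 2 == 0:
            evens.append(n)
        else:
            odds.append(n)
    return evens, odds

# Separate loops for the two branches (here as list comprehensions):
def split_two_passes(numbers):
    evens = [n for n in numbers if n % 2 == 0]
    odds = [n for n in numbers if n % 2 != 0]
    return evens, odds

print(split_two_passes(range(10)))  # → ([0, 2, 4, 6, 8], [1, 3, 5, 7, 9])
```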
40. function inlining
Python introduces relatively high overhead for function/method
calls.
In some cases it may be worth considering inlining code to avoid
the overhead,
• but this comes at the cost of code maintenance and readability.
41. function inlining
def sigmoid(x):
    return math.tanh(x)

class BPNN:
    def update(self, inputs):
        ...
        for i in range(self.ni-1):
            self.ai[i] = sigmoid(inputs[i])
        ...
42. function inlining
class BPNN:
    def update(self, inputs):
        ...
        for i in range(self.ni-1):
            self.ai[i] = math.tanh(inputs[i])
        ...
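The call overhead can be measured directly; a sketch along the lines of the slides (the data and loop counts are mine):

```python
import math
import timeit

def sigmoid(x):
    # Thin wrapper, as in the slide: one extra call per element.
    return math.tanh(x)

data = [0.01 * i for i in range(1000)]

wrapped = timeit.timeit(lambda: [sigmoid(x) for x in data], number=1000)
inlined = timeit.timeit(lambda: [math.tanh(x) for x in data], number=1000)
print("wrapped: %.3f s, inlined: %.3f s" % (wrapped, inlined))
```

The results are identical; only the per-element call overhead differs.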
43. other
• Use high-performance datatypes – the collections module
• Loop unrolling
• Preallocation
• string.intern
• using locals instead of globals
• improving lookup time of functions/methods/variables/attributes
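A few of the collections types in action (the examples are mine):

```python
from collections import Counter, defaultdict, deque

# Counter: counting at C speed instead of a manual dict loop.
print(Counter("abracadabra").most_common(1))  # → [('a', 5)]

# deque: O(1) appends and pops at both ends (list.pop(0) is O(n)).
queue = deque([1, 2, 3])
queue.appendleft(0)
print(queue.popleft())  # → 0

# defaultdict: no per-key existence checks needed.
groups = defaultdict(list)
for word in ["apple", "avocado", "banana"]:
    groups[word[0]].append(word)
print(dict(groups))
```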
44. notes on the special cases
Use the right tools:
• When your code involves numerics – use numpy, scipy and
other specialized scientific libraries.
• These are highly optimised routines (usually based on external
scientific libraries).
• Consider pushing performance-critical code into C.
Remember to check your code with PyPy, you may be pleasantly
surprised.
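A minimal illustration of why numpy helps, assuming numpy is installed (the example itself is mine): a sum of squares written as a Python loop versus one vectorized call.

```python
import numpy as np

def py_sum_squares(values):
    # Pure Python: one interpreted iteration per element.
    total = 0.0
    for v in values:
        total += v * v
    return total

x = np.arange(100000, dtype=np.float64)

# The same computation in a single call; the loop runs in optimised C.
vectorized = float(np.dot(x, x))
print(vectorized == py_sum_squares(x))  # → True (and much faster)
```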
45. notes on the special cases
Some problems may just need more computing power, so it may be
a good idea to:
• write code that utilizes multi-core architectures
(multiprocessing),
• or scale your code to multiple machines (task queues, spark,
grid like environment),
• or use hardware accelerators (pyOpenCL, pyCuda, pyMIC,
etc.).
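A minimal multiprocessing sketch (the workload function is a made-up stand-in for an expensive computation):

```python
from multiprocessing import Pool

def cpu_heavy(n):
    """Stand-in for an expensive, independent computation."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Distribute independent tasks over 4 worker processes.
    with Pool(processes=4) as pool:
        results = pool.map(cpu_heavy, [100000] * 8)
    print(len(results))  # → 8
```

This only pays off when the per-task work outweighs the process start-up and data-transfer overhead.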
46. final notes
• Optimize only when it is justified.
• Measure, profile and test.
• Optimization takes experimenting.
• Knowledge of what is going on behind the scenes may help.
• Value your time. Performance tuning takes time, and your
time is expensive.
• judging by the conference hotel - our time is expensive ;)
47. references
1. A Python Optimization Anecdote, Pavel Panchekha, Dropbox, 2011.
2. Code optimization and its effects on Python, Karl-Oskar Masing, 2013.
3. PythonSpeed, https://wiki.python.org
4. PythonSpeed / Performance Tips, https://wiki.python.org
5. Time complexity, https://wiki.python.org
6. PythonSpeed / Profiling Python Programs, https://wiki.python.org
7. Performance, http://pypy.org
8. Everything You Always Wanted to Know About Memory in Python But Were Afraid to Ask, http://przymus.org