Serge Guelton and Pierrick Brunet (Namek): “Pythran: Static Compilation of Parallel Scientific Kernels”
Abstract: As the use of Python coupled to Numpy/SciPy for numerical computation increases, many tools to optimize performance have emerged. Indeed, this duo has relatively poor performance when compared to scientific codes written in legacy languages like C or Fortran. Cython, Numba, numexpr and parakeet belongs to this new compiler ecosystem. And so does Pythran, a Python to C++11 translator for scientific Python.
Pythran uses a static compilation approach a la Cython, but with full backward compatibility with Python. It does not only turns Python code into C++ code, it also performs Python/Numpy specific optimizations, generates calls to a parallel, vectorized runtime and makes it possible to write OpenMP annotation in the original Python code. It supports a large range of Numpy functions and can combine them in efficient ways: it can optimize highlevel modern Python/Numpy codes and not only Fortran with a Python flavor ones.
This talk presents the existing compilation approach and optimization opportunities for numerical Python, their strengths and weaknesses, then focus on the specificities of the Pythran compiler.
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
PyData Paris 2015 - Track 3.2 Serge Guelton et Pierrick Brunet
1. Pythran: Static Compilation of
Parallel Scientific Kernels
a.k.a. Python/Numpy compiler for the
mass
Proudly made in Namek by &serge-sans-paille pbrunet
2. /us
Serge « sans paille » Guelton
$ w h o a m i
s g u e l t o n
R&D engineer at on compilation for security
Associate researcher at Télécom Bretagne
QuarksLab
Pierrick Brunet
$ w h o a m i
p b r u n e t
R&D engineer at on parallelismINRIAlpes/MOAIS
3. Pythran in a snake shell
A Numpy-centric Python-to-C++ translator
A Python code optimizer
A Pythonic C++ library
4. Core concepts
Focus on high-level constructs
Generate clean high level code
Optimize Python code before generated code
Vectorization and Paralllelism
Test, test, test
Bench, bench, bench
5. Ask StackOverflow
when you're looking for test cases
http://stackoverflow.com/[...]numba-or-cython-acceleration-in-
reaction-diffusion-algorithm
i m p o r t n u m p y a s n p
d e f G r a y S c o t t ( c o u n t s , D u , D v , F , k ) :
n = 3 0 0
U = n p . z e r o s ( ( n + 2 , n + 2 ) , d t y p e = n p . f l o a t 3 2 )
V = n p . z e r o s ( ( n + 2 , n + 2 ) , d t y p e = n p . f l o a t 3 2 )
u , v = U [ 1 : - 1 , 1 : - 1 ] , V [ 1 : - 1 , 1 : - 1 ]
r = 2 0
u [ : ] = 1 . 0
U [ n / 2 - r : n / 2 + r , n / 2 - r : n / 2 + r ] = 0 . 5 0
V [ n / 2 - r : n / 2 + r , n / 2 - r : n / 2 + r ] = 0 . 2 5
u + = 0 . 1 5 * n p . r a n d o m . r a n d o m ( ( n , n ) )
v + = 0 . 1 5 * n p . r a n d o m . r a n d o m ( ( n , n ) )
f o r i i n r a n g e ( c o u n t s ) :
L u = ( U [ 0 : - 2 , 1 : - 1 ] +
U [ 1 : - 1 , 0 : - 2 ] - 4 * U [ 1 : - 1 , 1 : - 1 ] + U [ 1 : - 1 , 2 : ] +
U [ 2 : , 1 : - 1 ] )
L v = ( V [ 0 : - 2 , 1 : - 1 ] +
6. Thread Summary
OP
My code is slow with Cython and Numba
Best Answer
You need to make all loops explicit
7. Cython Version
c i m p o r t c y t h o n
i m p o r t n u m p y a s n p
c i m p o r t n u m p y a s n p
c p d e f c y t h o n G r a y S c o t t ( i n t c o u n t s , d o u b l e D u , d o u b l e D v , d o u b l e F , d o u b l e k ) :
c d e f i n t n = 3 0 0
c d e f n p . n d a r r a y U = n p . z e r o s ( ( n + 2 , n + 2 ) , d t y p e = n p . f l o a t _ )
c d e f n p . n d a r r a y V = n p . z e r o s ( ( n + 2 , n + 2 ) , d t y p e = n p . f l o a t _ )
c d e f n p . n d a r r a y u = U [ 1 : - 1 , 1 : - 1 ]
c d e f n p . n d a r r a y v = V [ 1 : - 1 , 1 : - 1 ]
c d e f i n t r = 2 0
u [ : ] = 1 . 0
U [ n / 2 - r : n / 2 + r , n / 2 - r : n / 2 + r ] = 0 . 5 0
V [ n / 2 - r : n / 2 + r , n / 2 - r : n / 2 + r ] = 0 . 2 5
u + = 0 . 1 5 * n p . r a n d o m . r a n d o m ( ( n , n ) )
v + = 0 . 1 5 * n p . r a n d o m . r a n d o m ( ( n , n ) )
c d e f n p . n d a r r a y L u = n p . z e r o s _ l i k e ( u )
c d e f n p . n d a r r a y L v = n p . z e r o s _ l i k e ( v )
c d e f i n t i , c , r 1 , c 1 , r 2 , c 2
c d e f d o u b l e u v v
8. Pythran version
Add this line to the original kernel:
# p y t h r a n e x p o r t G r a y S c o t t ( i n t , f l o a t , f l o a t , f l o a t , f l o a t )
Timings
$ p y t h o n - m t i m e i t - s ' f r o m g r a y s c o t t i m p o r t G r a y S c o t t ' ' G r a y S c o t t ( 4 0 , 0 . 1 6 , 0 . 0
1 0 l o o p s , b e s t o f 3 : 5 2 . 9 m s e c p e r l o o p
$ c y t h o n g r a y s c o t t . p y x
$ g c c g r a y s c o t t . c ` ` - s h a r e d - f P I C - o g r a y s c o t t . s o
$ p y t h o n - m t i m e i t - s ' f r o m g r a y s c o t t i m p o r t G r a y S c o t t ' ' G r a y S c o t t ( 4 0 , 0 . 1 6 , 0 . 0
1 0 l o o p s , b e s t o f 3 : 3 6 . 4 m s e c p e r l o o p
$ p y t h r a n g r a y s c o t t . p y - O 3 - m a r c h = n a t i v e
$ p y t h o n - m t i m e i t - s ' f r o m g r a y s c o t t i m p o r t G r a y S c o t t ' ' G r a y S c o t t ( 4 0 , 0 . 1 6 , 0 . 0
1 0 l o o p s , b e s t o f 3 : 2 0 . 3 m s e c p e r l o o p
p y t h o n - c o n f i g - - c f l a g s - - l i b s
9. Lessons learnt
Explicit is not always better than implicit
Many ``optimization hints'' can be deduced by the compiler
High level constructs carry valuable informations
I am not saying Cython is bad. Cython does a great job. It is just
pragmatic where Pythran is idealist
10. Compilation Challenges
u = U [ 1 : - 1 , 1 : - 1 ]
U [ n / 2 - r : n / 2 + r , n / 2 - r : n / 2 + r ] = 0 . 5 0
u + = 0 . 1 5 * n p . r a n d o m . r a n d o m ( ( n , n ) )
Array views
Value broadcasting
Temporary arrays creation
Extended slices composition
Numpy API calls
11. Optimization Opportunities
L u = ( U [ 0 : - 2 , 1 : - 1 ] + U [ 1 : - 1 , 0 : - 2 ]
- 4 * U [ 1 : - 1 , 1 : - 1 ] + U [ 1 : - 1 , 2 : ] + U [ 2 : , 1 : - 1 ] )
Many useless temporaries
Lucould be forward-substituted
SIMD instruction generation opportunities
Parallel loop opportunities
12. Pythran Usage
$ p y t h r a n - - h e l p
u s a g e : p y t h r a n [ - h ] [ - o O U T P U T _ F I L E ] [ - E ] [ - e ] [ - f f l a g ] [ - v ] [ - p p a s s ]
[ - m m a c h i n e ] [ - I i n c l u d e _ d i r ] [ - L l d f l a g s ]
[ - D m a c r o _ d e f i n i t i o n ] [ - O l e v e l ] [ - g ]
i n p u t _ f i l e
p y t h r a n : a p y t h o n t o C + + c o m p i l e r
p o s i t i o n a l a r g u m e n t s :
i n p u t _ f i l e t h e p y t h r a n m o d u l e t o c o m p i l e , e i t h e r a . p y o r a . c p p
f i l e
o p t i o n a l a r g u m e n t s :
- h , - - h e l p s h o w t h i s h e l p m e s s a g e a n d e x i t
- o O U T P U T _ F I L E p a t h t o g e n e r a t e d f i l e
- E o n l y r u n t h e t r a n s l a t o r , d o n o t c o m p i l e
- e s i m i l a r t o - E , b u t d o e s n o t g e n e r a t e p y t h o n g l u e
- f f l a g a n y c o m p i l e r s w i t c h r e l e v a n t t o t h e u n d e r l y i n g C + +
c o m p i l e r
- v b e v e r b o s e
- p p a s s a n y p y t h r a n o p t i m i z a t i o n t o a p p l y b e f o r e c o d e
g e n e r a t i o n
- m m a c h i n e a n y m a c h i n e f l a g r e l e v a n t t o t h e u n d e r l y i n g C + +
13. Sample Usage
$ p y t h r a n i n p u t . p y # g e n e r a t e s i n p u t . s o
$ p y t h r a n i n p u t . p y - E # g e n e r a t e s i n p u t . c p p
$ p y t h r a n i n p u t . p y - O 3 - f o p e n m p # p a r a l l e l !
$ p y t h r a n i n p u t . p y - m a r c h = n a t i v e - O f a s t # E s o d M u m i x a m !
14. Type Annotations
Only for exported functions
# p y t h r a n e x p o r t f o o 0 ( )
# p y t h r a n e x p o r t f o o 1 ( i n t )
# p y t h r a n e x p o r t f o o 2 ( f l o a t 3 2 [ ] [ ] )
# p y t h r a n e x p o r t f o o 2 ( f l o a t 6 4 [ ] [ ] )
# p y t h r a n e x p o r t f o o 2 ( i n t 8 [ ] [ ] [ ] )
# p y t h r a n e x p o r t f o o 3 ( ( i n t , f l o a t ) , i n t l i s t , s t r : s t r d i c t )
16. Front End
100% based on the astmodule
Supports
Several standard module (incl. partial Numpy)
Polymorphic functions
ndarray, list, tuple, dict, str, int, long, float
Named parameters, default arguments
Generators…
Does not Support
Non-implicitely typed code
Global variable
Most Python modules (no CPython mode!)
User-defined classes…
17. Middle End
Iteratively applies high level, Python-aware optimizations:
Interprocedural Constant Folding
For-Based-Loop Unrolling
Forward Substitution
Instruction Selection
Deforestation
Scalar Renaming
dead Code Elimination
Fun Facts: can evaluate pure functions at compile time ☺
18. Back Ends
Python Back End
Useful for debugging!
C++11 Back End
C++11 implementation of __builtin__ numpy
itertools…
Lazy evaluation through Expression Templates
Relies on OpenMP, nt2and boost::simdfor the
parallelization / vectorization
19. C++11: Typing
the W.T.F. slide
Pythran translates Python implicitly statically typed polymorphic
code into C++ meta-programs that are instanciated for the user-
given types, and specialize them for the target architecture
20. C++11: Parallelism
Implicit
Array operations and several numpy functions are written using
OpenMP and Boost.simd
Explicit
OpenMP 3 support, ported to Python
# o m p p a r a l l e l f o r r e d u c t i o n ( + : r )
f o r i , v i n e n u m e r a t e ( l ) :
r + = i * v
21. Benchmarks
A collection of high-level benchmarks
https://github.com/serge-sans-paille/numpy-benchmarks
Code gathered from StackOverflow + other compiler code base
Mostly high-level code
Generate results for CPython, PyPy, Numba, Parakeet, Hope
and Pythran
Most kernels are too high level for Numba and Hope…
24. (Num)Focus: growcut
From the Numba codebase!
# p y t h r a n e x p o r t g r o w c u t ( f l o a t [ ] [ ] [ ] , f l o a t [ ] [ ] [ ] , f l o a t [ ] [ ] [ ] , i n t )
i m p o r t m a t h
i m p o r t n u m p y a s n p
d e f w i n d o w _ f l o o r ( i d x , r a d i u s ) :
i f r a d i u s > i d x :
r e t u r n 0
e l s e :
r e t u r n i d x - r a d i u s
d e f w i n d o w _ c e i l ( i d x , c e i l , r a d i u s ) :
i f i d x + r a d i u s > c e i l :
r e t u r n c e i l
e l s e :
r e t u r n i d x + r a d i u s
d e f g r o w c u t ( i m a g e , s t a t e , s t a t e _ n e x t , w i n d o w _ r a d i u s ) :
c h a n g e s = 0
s q r t _ 3 = m a t h . s q r t ( 3 . 0 )
h e i g h t = i m a g e . s h a p e [ 0 ]
26. Academic Results
Pythran: Enabling Static Optimization of Scientific Python
Programs, S. Guelton, P. Brunet et al. in CSD, 2015
Exploring the Vectorization of Python Constructs Using
Pythran and Boost SIMD, S. Guelton, J. Falcou and P. Brunet,
in WPMVP, 2014
Compiling Python modules to native parallel modules using
Pythran and OpenMP Annotations, S. Guelton, P. Brunet
and M. Amini, in PyHPC, 2013
Pythran: Enabling Static Optimization of Scientific Python
Programs, S. Guelton, P. Brunet et al. in SciPy, 2013
27. Powered by Strong Engineering
Preprequisite for reproductible science
2773 test cases, incl. unit testing, doctest, Continuous
integration (thx !)
Peer-reviewed code
Python2.7 and C++11
Linux, OSX (almost okay), Windows (on going)
User and Developer doc:
Hosted on
Releases on PyPi: $ pip install pythran
: $ apt-get install pythran
Travis
http://pythonhosted.org/pythran/
https://github.com/serge-sans-paille/pythran
Custom Debian repo
28. We need more peons
Pythonic needs a serious cleanup
Typing module needs better error reporting
OSX support is partial and Windows support is on-going
numpy.randomand numpy.linalg
29. THE END
AUTHORS
Serge Guelton, Pierrick Brunet, Mehdi Amini, Adrien
Merlini, Alan Raynaud, Eliott Coyac…
INDUSTRIAL SUPPORT
, , →You←
CONTRIBUTE
#pythran on FreeNode, pythran@freelists.org, GitHub repo
Silkan NumScale