PyCUDA:
Harnessing the power of GPU with Python
Talk Structure

1. Why a GPU?
2. How does it work?
3. How do I program it?
4. Can I use Python?

PyCon 4 – Florence 2010 – Fabrizio Milo
WHY A GPU?

APPLICATIONS & DEMOS

How does it work?

[Figure: CPU with a Control unit, four ALUs, a Cache, and DRAM, shown next to a GPU with its own DRAM]

CUDA

Compute Unified Device Architecture

A parallel computing architecture for NVIDIA GPUs

[Figure: DirectX Compute]

CUDA: Execution Model, Device Model

EXECUTION MODEL

Thread: smallest unit of logic
Block: a group of threads
Grid: a group of blocks

One block can have many threads.
One grid can have many blocks.

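To make the hierarchy concrete, here is a minimal sketch in plain Python (no GPU needed) of how threads, blocks, and a grid relate in the one-dimensional case. The function names `total_threads` and `blocks_needed` are illustrative and are not part of CUDA or PyCUDA.

```python
def total_threads(blocks_per_grid, threads_per_block):
    """Threads in a 1-D grid: every block holds the same number of threads."""
    return blocks_per_grid * threads_per_block


def blocks_needed(n_items, threads_per_block):
    """Smallest number of blocks that gives at least one thread per item."""
    # Ceiling division: a partially filled last block still counts as a block.
    return (n_items + threads_per_block - 1) // threads_per_block


if __name__ == "__main__":
    print(total_threads(4, 256))     # 4 blocks of 256 threads each
    print(blocks_needed(1000, 256))  # 1000 items need a partly idle last block
```

The ceiling division matters whenever the number of items is not a multiple of the block size: the last block then contains threads with nothing to do.
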
DEVICE MODEL: the hardware

Scalar Processor
Many Scalar Processors + Register File + Shared Memory: Multiprocessor
Many Multiprocessors: Device

Real Example: 10-Series Architecture

240 Scalar Processor (SP) cores execute kernel threads
30 Streaming Multiprocessors (SMs), each containing:
    8 scalar processors
    1 double-precision unit
    Shared memory

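The two headline numbers on this slide are consistent with each other, which a short Python check makes explicit. The constants are the figures quoted above; the helper name `device_totals` is only illustrative.

```python
# Figures quoted on the slide for the 10-series architecture.
STREAMING_MULTIPROCESSORS = 30
SCALAR_PROCESSORS_PER_SM = 8
DOUBLE_PRECISION_UNITS_PER_SM = 1


def device_totals(sms, sps_per_sm, dp_units_per_sm):
    """Multiply per-multiprocessor counts up to whole-device counts."""
    return {
        "scalar_processors": sms * sps_per_sm,
        "double_precision_units": sms * dp_units_per_sm,
    }


if __name__ == "__main__":
    totals = device_totals(
        STREAMING_MULTIPROCESSORS,
        SCALAR_PROCESSORS_PER_SM,
        DOUBLE_PRECISION_UNITS_PER_SM,
    )
    print(totals["scalar_processors"])  # 30 SMs x 8 SPs, the 240 cores on the slide
```
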
Software         Hardware

Thread           Scalar Processor
Thread Block     Multiprocessor
Grid             Device

Global Memory

[Figure: Host - Device. The CPU with its RAM next to the device's Global Memory]

[Figure: Host - Multi Device. One CPU with its RAM and several devices]

Kernel (Thread, Block)

__global__ void multiply_them( float *dest, float *a, float *b )
{
    // each thread handles the element at its own index within the block
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}

Kernel (Grid)

__global__ void kernel( … )
{
    // global index: block offset plus the thread's position inside the block
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    …
}

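The global index formula can be tried without a GPU. The sketch below is a sequential Python emulation of a 1-D launch; the `launch` helper and the bounds check are assumptions added for illustration (the slide's kernel has no bounds check), and a real GPU would run the threads in parallel rather than in a loop.

```python
def multiply_them(dest, a, b, block_idx, block_dim, thread_idx):
    """Kernel body as one simulated thread would run it."""
    idx = block_idx * block_dim + thread_idx  # same formula as on the slide
    if idx < len(dest):                       # guard: the last block may overhang
        dest[idx] = a[idx] * b[idx]


def launch(kernel, grid_dim, block_dim, *args):
    """Sequential stand-in for a 1-D kernel launch."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(*args, block_idx, block_dim, thread_idx)


if __name__ == "__main__":
    n, block_dim = 10, 4
    grid_dim = (n + block_dim - 1) // block_dim  # 3 blocks cover 10 elements
    a = [float(i) for i in range(n)]
    b = [2.0] * n
    dest = [0.0] * n
    launch(multiply_them, grid_dim, block_dim, dest, a, b)
    print(dest)
```

With 3 blocks of 4 threads there are 12 simulated threads for 10 elements, so the last two threads fall outside the array and are skipped by the guard.
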
How do I program it?

[Figure: build flow. The Main Logic is compiled by GCC into a .bin for the CPU; the Kernel is compiled by NVCC into a .cubin for the GPU. A second version of the diagram places the .bin and the .cubin together on the CPU side.]

Allocate Memory

cudaMalloc( pointer, size )

Copy to device

cudaMemcpy( dest, src, size, direction )

Kernel Launch

Kernel<<< # blocks, # threads >>>( *params )

Get Back the Results

cudaMemcpy( dest, src, size, direction )

Error Handling

if (cudaMalloc( pointer, size ) != cudaSuccess) {
    handle_error();
}

And soon it becomes …

if (cudaMalloc( pointer, size ) != cudaSuccess) {
    handle_error();
}

if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) {
    handle_error();
}

// a kernel launch returns no status; the error is queried afterwards
Kernel<<< # blocks, # threads >>>( *params );
if (cudaGetLastError() != cudaSuccess) {
    handle_error();
}

if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) {
    handle_error();
}

… and the same checks again, for every call in the program.

[Python logo] + [CUDA logo] & ANDREAS KLÖCKNER = PYCUDA

PyCUDA Philosophy

Provide complete access
Automatically manage resources
Check and report errors
Cross-platform
Allow interactive use
NumPy integration

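The "check and report errors" point is what removes the wall of `if` statements from the C version. Below is a minimal Python sketch of the idea, assuming a C-style call that returns a status code: a wrapper turns any non-success status into an exception. The names `CudaError`, `checked`, and `fake_malloc` are hypothetical and do not reflect how PyCUDA is implemented internally.

```python
CUDA_SUCCESS = 0  # cudaSuccess is 0 in the CUDA runtime API


class CudaError(RuntimeError):
    """Hypothetical exception type, for illustration only."""


def checked(call, *args):
    """Run call(*args), which returns (status, result); raise on failure."""
    status, result = call(*args)
    if status != CUDA_SUCCESS:
        raise CudaError(f"{call.__name__} failed with status {status}")
    return result


def fake_malloc(size):
    """Stand-in for a C-style allocation: fails for non-positive sizes."""
    if size <= 0:
        return 1, None
    return CUDA_SUCCESS, bytearray(size)


if __name__ == "__main__":
    buf = checked(fake_malloc, 16)  # succeeds, no explicit status check needed
    print(len(buf))
    try:
        checked(fake_malloc, 0)     # fails, surfaces as an exception
    except CudaError as err:
        print(err)
```

With this pattern the calling code reads as a straight sequence of steps, and a failure in any of them stops the program with a message instead of being silently ignored.
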
NUMPY - ARRAY

import numpy

my_array = numpy.array([1] * 100)   # 100 elements, indices 0 .. 99, all equal to 1

my_array[3] = 0                     # now 1 1 1 0 1 1 ...

PyCUDA: Workflow

Memory Allocation

gpu_mem = cuda.mem_alloc( size_bytes )

Memory Copy

cuda.memcpy_htod( gpu_mem, cpu_mem )

Kernel

mod = SourceModule("""
__global__ void multiply_them( float *dest, float *a, float *b )
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

Kernel Launch

multiply_them = mod.get_function("multiply_them")
# 1-D block: the kernel indexes with threadIdx.x only, and the total number
# of threads in a block must stay within the device limit
multiply_them( *args, block=(400, 1, 1) )

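Put together, the four steps above form a complete program. The following is a sketch under a few assumptions: PyCUDA and NumPy are installed, a CUDA-capable GPU is present, and the vector length `n` (400 here, chosen for illustration) fits in a single block. The function name `multiply_on_gpu` is illustrative; the PyCUDA calls are the ones shown on the slides plus `memcpy_dtoh` for copying the result back.

```python
KERNEL_SOURCE = """
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
"""


def multiply_on_gpu(n=400):
    """Multiply two random float32 vectors of length n on the GPU."""
    # Imported here so that this file can be loaded on a machine without a GPU.
    import numpy
    import pycuda.autoinit  # noqa: F401  (creates a CUDA context on import)
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    a = numpy.random.randn(n).astype(numpy.float32)
    b = numpy.random.randn(n).astype(numpy.float32)
    dest = numpy.empty_like(a)

    # 1. Allocate device memory.
    a_gpu = cuda.mem_alloc(a.nbytes)
    b_gpu = cuda.mem_alloc(b.nbytes)
    dest_gpu = cuda.mem_alloc(dest.nbytes)

    # 2. Copy the inputs from host to device.
    cuda.memcpy_htod(a_gpu, a)
    cuda.memcpy_htod(b_gpu, b)

    # 3. Compile the kernel and launch one block of n threads.
    multiply_them = SourceModule(KERNEL_SOURCE).get_function("multiply_them")
    multiply_them(dest_gpu, a_gpu, b_gpu, block=(n, 1, 1), grid=(1, 1))

    # 4. Copy the result back from device to host.
    cuda.memcpy_dtoh(dest, dest_gpu)
    return a, b, dest
```

On a machine with a GPU, calling `a, b, dest = multiply_on_gpu()` and comparing `dest` with `a * b` is a quick way to confirm that the kernel ran.
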
Hello GPU

DEMO

GPUARRAY

PyCUDA: GpuArray

gpu_array = gpuarray.to_gpu(numpy_array)

numpy_array = gpu_array.get()

+, -, *, /, fill, sin, exp, rand, basic indexing, norm, inner product …

PyCUDA: GpuArray: ElementWise

import numpy.linalg as la
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.curandom import rand as curand
from pycuda.elementwise import ElementwiseKernel

# z = a*x + b*y, computed element by element on the GPU
lin_comb = ElementwiseKernel(
    "float a, float *x, float b, float *y, float *z",
    "z[i] = a*x[i] + b*y[i]"
)

a_gpu = curand((50,))
b_gpu = curand((50,))
c_gpu = gpuarray.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5

Meta-Programming

__kernel_template__ = """
__global__ void kernel( args )
{
    for (int i = 0; i < {{ iterations }}; i++) {
        {{ operations }}
    }
}
"""

See for example jinja2

Generate Source!

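Generating kernel source is ordinary string templating. The sketch below uses `string.Template` from the Python standard library as a stand-in for jinja2, so `$iterations` and `$operations` play the role of the `{{ ... }}` placeholders. The kernel signature, the index line, and the helper name `render_kernel` are assumptions made for illustration.

```python
from string import Template

# $-placeholders stand in for the {{ ... }} placeholders on the slide.
KERNEL_TEMPLATE = Template("""
__global__ void kernel(float *data)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < $iterations; i++) {
        $operations
    }
}
""")


def render_kernel(iterations, operations):
    """Return CUDA C source with the loop bound and loop body filled in."""
    body = "\n        ".join(operations)  # keep the loop body indented
    return KERNEL_TEMPLATE.substitute(iterations=iterations, operations=body)


if __name__ == "__main__":
    source = render_kernel(4, ["data[idx] *= 2.0f;", "data[idx] += 1.0f;"])
    print(source)
```

The rendered string is what would then be handed to `SourceModule`, so loop bounds and operations can be chosen at run time from Python.
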
Performance?

Mandelbrot

DEMO

PyCUDA: Documentation

PyCUDA

Website:
http://mathema.tician.de/software/pycuda

License:
X Consortium License (no warranty, free for all use)

Dependencies:
Python 2.4+, NumPy, Boost

In the Future …

OPENCL

THANK YOU & HAVE FUN!

?

PyCUDA: How to harness the power of video cards in Python applications

  • 1. PyCUDA: Harnessing the power of GPU with Python
  • 2. Talk Structure 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python? PyCon 4 – Florence 2010 – Fabrizio Milo
  • 3. Talk Structure 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  • 4. WHY A GPU?
  • 5. APPLICATIONS & DEMOS
  • 6. Why GPU?
  • 7. Talk Structure 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  • 8. How does it work?
  • 9. CPU diagram: Control, ALU ×4, Cache, DRAM
  • 10. GPU diagram: DRAM
  • 11. CPU and GPU diagrams side by side: CPU (Control, ALU ×4, Cache, DRAM), GPU (DRAM)
  • 12. CUDA
  • 13. Compute Unified Device Architecture
  • 14. CUDA A Parallel Computing Architecture for NVIDIA GPUs Direct X Compute
  • 15. Execution Model CUDA Device Model
  • 16. EXECUTION MODEL
  • 17. Thread Smallest unit of logic
  • 18. A Block A Group of Threads
  • 19. A Grid A Group of Blocks
  • 20. One Block can have many threads
  • 21. One Grid can have many blocks
  • 22. The hardware DEVICE MODEL
  • 23. Scalar Processor
  • 24. Scalar Processor
  • 25. Many Scalar Processors
  • 26. + Register File
  • 27. + Shared Memory
  • 28. Multiprocessor
  • 29. Device
  • 30. Real Example: 10-Series Architecture: 240 Scalar Processor (SP) cores execute kernel threads; 30 Streaming Multiprocessors (SMs) each contain 8 scalar processors, 1 double precision unit, and shared memory
  • 31. Software → Hardware: Thread → Scalar Processor; Thread Block → Multiprocessor; Grid → Device
  • 32. Global Memory
  • 33. Global Memory
  • 34. RAM CPU Global Memory Host - Device
  • 35. RAM CPU Host – Multi Device
  • 36. 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  • 37. Software → Hardware: Thread → Scalar Processor; Thread Block → Multiprocessor; Grid → Device
  • 38. Kernel: __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; } (Thread)
  • 39. Kernel: __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; } (Thread)
  • 40. Kernel: __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; } (Block)
  • 41. Kernel: __global__ void kernel( … ) { const int idx = blockIdx.x * blockDim.x + threadIdx.x; … } (Grid)
  • 42. How do I program it? Main Logic / Kernel; GCC / NVCC; .bin / .cubin; CPU / GPU
  • 43. How do I program it? Main Logic / Kernel; GCC / NVCC; .bin / .cubin; CPU / GPU
  • 44. RAM CPU Global Memory Host - Device
  • 45. RAM CPU Global Memory
  • 46. Allocate Memory: cudaMalloc( pointer, size );
  • 47. Copy to device: cudaMalloc( pointer, size ); cudaMemcpy( dest, src, size, direction );
  • 48. Kernel Launch: cudaMalloc( pointer, size ); cudaMemcpy( dest, src, size, direction ); Kernel<<< # blocks, # threads >>>( *params );
  • 49. Get Back the Results: cudaMalloc( pointer, size ); cudaMemcpy( dest, src, size, direction ); Kernel<<< # blocks, # threads >>>( *params ); cudaMemcpy( dest, src, size, direction );
  • 50. Error Handling: if (cudaMalloc( pointer, size ) != cudaSuccess) { handle_error() }
  • 51. And soon it becomes … if (cudaMalloc( pointer, size ) != cudaSuccess) { handle_error() } if (cudaMemcpy( dest, src, size, direction ) == cudaSuccess) {} if (Kernel<<< # blocks, # threads >>>( *params ) != cudaSuccess) { handle_error() } if (cudaMemcpy( dest, src, size, direction ) != cudaSuccess) { }
  • 52. And soon it becomes … (the error-checking code from slide 51, repeated and overlapping across the whole slide)
  • 53. (no text)
  • 54. 1. Why a GPU? 2. How does it work? 3. How do I program it? 4. Can I use Python?
  • 55. + & ANDREAS KLOCKNER = PYCUDA
  • 56. PyCuda Philosophy: Provide Complete Access
  • 57. PyCuda Philosophy: Automatically Manage Resources
  • 58. PyCuda Philosophy: Check and Report Errors
  • 59. PyCuda Philosophy: Cross Platform
  • 60. PyCuda Philosophy: Allow Interactive Use
  • 61. PyCuda Philosophy: NumPy Integration
  • 62. NUMPY - ARRAY
  • 63. import numpy; my_array = numpy.array([1,] * 100) (figure: an array of ones, indices 0 to 99)
  • 64. import numpy; my_array = numpy.array([1,] * 100); my_array[3] = 0 (figure: the same array with element 3 set to 0)
  • 65. PyCuda: Workflow
  • 66. PyCuda: Workflow
  • 67. PyCuda: Workflow
  • 68. Memory Allocation: cuda.mem_alloc( size_bytes )
  • 69. Memory Copy: gpu_mem = cuda.mem_alloc( size_bytes ); cuda.memcpy_htod( gpu_mem, cpu_mem )
  • 70. Kernel: gpu_mem = cuda.mem_alloc( size_bytes ); cuda.memcpy_htod( gpu_mem, cpu_mem ); SourceModule(""" __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; } """)
  • 71. Kernel Launch: mod = SourceModule(""" __global__ void multiply_them( float *dest, float *a, float *b ) { const int i = threadIdx.x; dest[i] = a[i] * b[i]; } """); multiply_them = mod.get_function("multiply_them"); multiply_them( *args, block=(30, 64, 1) )
  • 72. (no text)
  • 73. (no text)
  • 74. (no text)
  • 75. Hello Gpu DEMO
  • 76. GPUARRAY
  • 77. gpuarray
  • 78. PyCuda: GpuArray gpuarray.to_gpu(numpy array) numpy array = gpuarray.get()
  • 79. PyCuda: GpuArray gpuarray.to_gpu(numpy array) numpy array = gpuarray.get() +, -, *, /, fill, sin, exp, rand, basic indexing, norm, inner product …
  • 80. PyCuda: GpuArray: ElementWise from pycuda.elementwise import ElementwiseKernel
  • 81. PyCuda: GpuArray: ElementWise from pycuda.elementwise import ElementwiseKernel; lincomb = ElementwiseKernel( "float a, float *x, float b, float *y, float *z", "z[i] = a*x[i] + b*y[i]" )
  • 82. PyCuda: GpuArray: ElementWise from pycuda.elementwise import ElementwiseKernel; lincomb = ElementwiseKernel( "float a, float *x, float b, float *y, float *z", "z[i] = a*x[i] + b*y[i]" ); c_gpu = gpuarray.empty_like(a_gpu); lincomb(5, a_gpu, 6, b_gpu, c_gpu); assert la.norm((c_gpu - (5*a_gpu + 6*b_gpu)).get()) < 1e-5
  • 83. Meta-Programming: __kernel_template__ = """ __global__ void kernel( args ) { for (int i=0; i<{{ iterations }}; i++){ {{operations}} } }""" See for example jinja2
  • 84. Meta-Programming
  • 85. Meta-Programming: Generate Source!
  • 86. Performance?
  • 87. mandelbrot DEMO
  • 88. PyCuda: Documentation
  • 89. PyCuda WebSite: http://mathema.tician.de/software/pycuda License: X Consortium License (no warranty, free for all use) Dependencies: Python 2.4+, numpy, Boost
  • 90. In the Future … OPENCL
  • 91. THANK YOU & HAVE FUN!
  • 92. ?
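
Slides 38 to 41 index data first by thread within a block and then by block within a grid. The sketch below emulates that indexing sequentially in plain Python, so it runs without a GPU. The helper names `blocks_needed` and `launch_1d` are hypothetical and not part of CUDA or PyCUDA, and the bounds check is an addition that the slide kernels do not show.

```python
# Pure-Python emulation of CUDA's one-dimensional thread indexing (no GPU needed).
# blocks_needed and launch_1d are hypothetical helpers written for this sketch.


def blocks_needed(n_elements, threads_per_block):
    """Smallest number of blocks whose threads cover n_elements (ceiling division)."""
    return (n_elements + threads_per_block - 1) // threads_per_block


def launch_1d(kernel, grid_dim, block_dim, *args):
    """Call kernel once per (block, thread) pair; a GPU would run these in parallel."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def multiply_them(block_idx, block_dim, thread_idx, dest, a, b):
    # Same formula as slide 41: blockIdx.x * blockDim.x + threadIdx.x
    i = block_idx * block_dim + thread_idx
    if i < len(dest):  # guard: the last block may contain unused threads
        dest[i] = a[i] * b[i]


n = 10
a = [float(k) for k in range(n)]
b = [2.0] * n
dest = [0.0] * n

threads_per_block = 4
grid = blocks_needed(n, threads_per_block)  # 3 blocks of 4 threads cover 10 elements
launch_1d(multiply_them, grid, threads_per_block, dest, a, b)
print(dest)
```

With 10 elements and 4 threads per block, three blocks are launched and the last two threads of the final block do nothing, which is why a real kernel usually carries the same kind of bounds check.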
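
Slides 68 to 71 show the PyCUDA workflow one call at a time. A minimal end-to-end sketch that strings those calls together might look like the following. It assumes PyCUDA, a CUDA toolkit and a CUDA-capable device are available; the wrapper function `multiply_on_gpu` is hypothetical, and the launch uses a one-dimensional block because the kernel reads only `threadIdx.x`.

```python
import numpy as np

# Kernel source from slides 38 and 70.
KERNEL_SOURCE = """
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
"""


def multiply_on_gpu(a, b):
    """Multiply two equal-length 1-D float32 vectors on the GPU.

    Hypothetical wrapper for illustration. It launches a single block, so the
    vector length must not exceed the device's maximum threads per block.
    """
    # Imported lazily so this file can still be loaded on a machine without a GPU.
    import pycuda.autoinit  # noqa: F401  (creates a context on the first device)
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    a = np.ascontiguousarray(a, dtype=np.float32)
    b = np.ascontiguousarray(b, dtype=np.float32)
    dest = np.empty_like(a)

    # Slides 68-69: allocate device memory and copy the inputs host to device.
    a_gpu = cuda.mem_alloc(a.nbytes)
    b_gpu = cuda.mem_alloc(b.nbytes)
    dest_gpu = cuda.mem_alloc(dest.nbytes)
    cuda.memcpy_htod(a_gpu, a)
    cuda.memcpy_htod(b_gpu, b)

    # Slides 70-71: compile the kernel, then launch one thread per element.
    mod = SourceModule(KERNEL_SOURCE)
    multiply_them = mod.get_function("multiply_them")
    multiply_them(dest_gpu, a_gpu, b_gpu, block=(a.size, 1, 1), grid=(1, 1))

    # Copy the result device to host.
    cuda.memcpy_dtoh(dest, dest_gpu)
    return dest
```

On a machine with a GPU, calling `multiply_on_gpu(x, y)` on two short float32 vectors should return their element-wise product. Unlike the C code on slides 50 to 52, there are no return codes to test here: in line with slide 58, PyCUDA is expected to report failures as Python exceptions.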
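
Slides 83 to 85 suggest generating kernel source from a template. The slide points to jinja2; the sketch below swaps in the standard library's `string.Template` so that it runs without extra packages, which means the placeholders are written `$name` rather than `{{ name }}`. The function `render_kernel` and the kernel body are hypothetical examples, not taken from the talk.

```python
from string import Template

# $-placeholders are string.Template syntax, used here as a stand-in for
# the jinja2 {{ ... }} markers shown on slide 83.
KERNEL_TEMPLATE = Template("""
__global__ void kernel(float *data)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < $iterations; i++) {
        $operations
    }
}
""")


def render_kernel(iterations, operations):
    """Return CUDA C source with the loop bound and loop body filled in."""
    return KERNEL_TEMPLATE.substitute(iterations=iterations, operations=operations)


# Generate a kernel whose loop doubles each element four times.
source = render_kernel(4, "data[idx] = data[idx] * 2.0f;")
print(source)
```

The rendered string is ordinary CUDA C, so it could be handed to `SourceModule` in the same way as the literal kernel on slide 71. Baking the loop bound into the source lets the compiler see it as a constant.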