CONFLUX: GPGPU FOR .NET Eugene Burmako, 2010
Video cards: state of the art
Hardware – tens to hundreds of ALUs clocked at ~1 GHz
Peak performance – 1 SP TFLOPS (single precision), > 100 DP GFLOPS (double precision)
API – random memory access, data structures, pointers, subroutines
API maturity – nearly four years, several generations of graphics processors
Video cards: programmer's PoV
Modern GPU programming models (CUDA, AMD Stream, OpenCL, DirectCompute):
A parallel algorithm is defined by a pair: 1) the kernel (one loop iteration), 2) the iteration bounds.
The kernel is compiled by the driver.
The iteration bounds are used to create a grid of threads.
Input data is copied to video memory.
Execution is kicked off.
The result is copied back to main memory.
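The "kernel plus iteration bounds" pair can be illustrated on the CPU. The following is a minimal C++ sketch, not GPU code: `launch` is a hypothetical helper introduced only for this illustration, and it walks the index range sequentially where a GPU would create one thread per index.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper (not part of any GPU API): runs `kernel` once per index
// in [0, bounds), sequentially, standing in for a grid of GPU threads.
void launch(std::size_t bounds, const std::function<void(std::size_t)>& kernel) {
    for (std::size_t i = 0; i < bounds; ++i) {
        kernel(i);
    }
}

// SAXPY expressed as "kernel + iteration bounds": returns a * x + y.
std::vector<float> saxpy(float a, const std::vector<float>& x, std::vector<float> y) {
    assert(x.size() == y.size());
    // The lambda is the kernel (one loop iteration); y.size() is the bound.
    launch(y.size(), [&](std::size_t i) { y[i] = a * x[i] + y[i]; });
    return y;
}
```

Because every iteration touches only its own element `i`, the iterations are independent, which is what lets a GPU run them in any order or all at once.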
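Example: SAXPY via CUDA
__global__ void Saxpy(int n, float a, float* X, float* Y)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) Y[i] = a * X[i] + Y[i];  // threads past the end of the arrays do nothing
}
// X, Y are device pointers; hX, hY are host arrays of N floats
cudaMemcpy(X, hX, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(Y, hY, N * sizeof(float), cudaMemcpyHostToDevice);
Saxpy<<<(N + 255) / 256, 256>>>(N, a, X, Y);  // <<<number of blocks, threads per block>>>
cudaMemcpy(hY, Y, N * sizeof(float), cudaMemcpyDeviceToHost);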
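The expression `(N + 255) / 256` in the launch is a ceiling division: it yields the smallest number of 256-thread blocks that covers N elements. When N is not a multiple of 256, the last block contains surplus threads, so a kernel needs a guard such as `i < n` to keep them from writing past the arrays. The sketch below checks this arithmetic on the CPU; `blocksFor` and `coverage` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ceiling division: smallest block count whose threads cover n elements.
std::size_t blocksFor(std::size_t n, std::size_t threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// CPU emulation of the launch: visits every (block, thread) pair and counts
// how many times each of the n elements is touched.
std::vector<int> coverage(std::size_t n, std::size_t threadsPerBlock) {
    std::vector<int> hits(n, 0);
    const std::size_t blocks = blocksFor(n, threadsPerBlock);
    for (std::size_t block = 0; block < blocks; ++block) {
        for (std::size_t thread = 0; thread < threadsPerBlock; ++thread) {
            // Same index formula as the kernel: blockDim * blockIdx + threadIdx.
            const std::size_t i = threadsPerBlock * block + thread;
            if (i < n) {  // guard: surplus threads in the last block do nothing
                ++hits[i];
            }
        }
    }
    return hits;
}
```

For N = 1000 this gives 4 blocks, that is 1024 threads, of which 24 are idle; every element is visited exactly once.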
Hot question
Official answer
In fact
Brahma: data structures: data-parallel array; computations: C# expressions, LINQ combinators.
Accelerator v2: data structures: data-parallel array; computations: arithmetic operators, a number of predefined functions.
This does the trick for a lot of algorithms. But what if we've got branching or irregular memory access?
Example: CUDA interop
saxpy = @"__global__ void Saxpy(float a, float* X, float* Y)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    Y[i] = a * X[i] + Y[i];
}";
nvcuda.cuModuleLoadDataEx(saxpy);
nvcuda.cuMemcpyHtoD(X, Y);
nvcuda.cuParamSeti(a, X, Y);
nvcuda.cuLaunchGrid(256, (N + 255) / 256);
nvcuda.cuMemcpyDtoH(Y);
Conflux
Kernels are written in C#: data structures, local variables, branching, loops.
float a; float[] x; [Result] float[] y;
var i = GlobalIdx.X;
y[i] = a * x[i] + y[i];
Conflux
Avoids explicit interop with unmanaged code and lets the programmer use native .NET data types.
float[] x, y;
var cfg = new CudaConfig();
var kernel = cfg.Configure<Saxpy>();
y = kernel.Execute(a, x, y);
How does it work?
Front end: decompiles C#.
AST transformer: inlines calls, destructures classes and arrays, maps intrinsics.
Back end: generates PTX (NVIDIA GPU assembler) and/or multicore IL.
Interop: binds to the nvcuda driver, which is capable of executing GPU assembler.
Current progress
http://bitbucket.org/conflux/conflux
Proof of concept. Capable of computing the hello-world of parallel computing: matrix multiplication.
If we don't take into account the [currently] high overhead incurred by JIT compilation, the idea works well even with a naive code generator: 1x CPU < 2x CPU << GPU.
Triple license: AGPL, exception for OSS projects, commercial.
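Matrix multiplication fits the kernel model naturally: the kernel computes one output element, and the iteration bounds are the dimensions of the result. The C++ sketch below illustrates that decomposition on the CPU; it is not Conflux code, and the names `Matrix`, `matMulKernel`, and `matMul` are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n x n matrix stored in a flat vector.
using Matrix = std::vector<float>;

// Kernel body: computes the single output element c[row][col].
void matMulKernel(std::size_t row, std::size_t col, std::size_t n,
                  const Matrix& a, const Matrix& b, Matrix& c) {
    float sum = 0.0f;
    for (std::size_t k = 0; k < n; ++k) {
        sum += a[row * n + k] * b[k * n + col];
    }
    c[row * n + col] = sum;
}

// Iteration bounds: an n x n grid, walked sequentially here.
// On a GPU each (row, col) pair would map to its own thread.
Matrix matMul(std::size_t n, const Matrix& a, const Matrix& b) {
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0f);
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < n; ++col) {
            matMulKernel(row, col, n, a, b, c);
        }
    }
    return c;
}
```

Each kernel invocation writes a distinct element of `c` and only reads `a` and `b`, so the n × n invocations can run in parallel without synchronization.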
Demo
Future work
GPU-specific optimizations (e.g. diagonal stripes for optimizing bandwidth utilization of matrix transposition).
Polyhedral model for loop nest optimization (can be configured to fit specific levels and sizes of the memory hierarchy; there exist GPU-specific linear heuristics that optimize spatial and temporal locality).
Distributed execution (a new level of the memory hierarchy if we use the polyhedral model).
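Loop tiling is one of the loop-nest transformations the polyhedral model can express, and it shows what "fitting a level of the memory hierarchy" means in practice. The C++ sketch below transposes a matrix tile by tile so that each tile's reads and writes stay within a small working set. It is an illustrative assumption, not the diagonal-stripe scheme mentioned above and not Conflux's planned implementation; `transposeTiled` and the `tile` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Transposes a row-major rows x cols matrix, processing tile x tile blocks.
// The tile size would be chosen to fit the targeted memory level.
std::vector<float> transposeTiled(const std::vector<float>& in,
                                  std::size_t rows, std::size_t cols,
                                  std::size_t tile) {
    assert(tile > 0 && in.size() == rows * cols);
    std::vector<float> out(rows * cols);
    for (std::size_t r0 = 0; r0 < rows; r0 += tile) {
        for (std::size_t c0 = 0; c0 < cols; c0 += tile) {
            // Clamp the tile at the matrix edges.
            const std::size_t rEnd = std::min(r0 + tile, rows);
            const std::size_t cEnd = std::min(c0 + tile, cols);
            for (std::size_t r = r0; r < rEnd; ++r) {
                for (std::size_t c = c0; c < cEnd; ++c) {
                    out[c * rows + r] = in[r * cols + c];
                }
            }
        }
    }
    return out;
}
```

The result is identical to an untiled transposition; only the order in which elements are visited changes.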
Conclusion
Conflux: GPGPU for .NET
http://bitbucket.org/conflux/conflux
eugene.burmako@confluxhpc.net
