Conflux: GPGPU for .NET (en)

Conflux provides a parallel programming framework that uses CPUs and GPUs together as components of an integrated computing system. Conflux adopts the established kernel-based architecture, which is compatible with CUDA…

CONFLUX: GPGPU FOR .NET
Eugene Burmako, 2010

Video cards: state of the art
Hardware: tens to hundreds of ALUs clocked at ~1 GHz.
Peak performance: ~1 TFLOPS in single precision, > 100 GFLOPS in double precision.
API: random memory access, data structures, pointers, subroutines.
API maturity: nearly four years, several generations of graphics processors.
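As a rough sanity check of the peak figure: peak throughput is the product of the ALU count, the clock rate and the floating-point operations each ALU completes per cycle. Assuming, purely for illustration, a card with 512 single-precision ALUs at 1 GHz, each finishing one fused multiply-add (2 flops) per cycle:

```latex
P_{\text{peak}} = N_{\text{ALU}} \times f_{\text{clock}} \times \text{flops per cycle}
= 512 \times 10^{9}\,\text{Hz} \times 2
\approx 1.02 \times 10^{12}\ \text{FLOPS} \approx 1\ \text{TFLOPS (SP)}
```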

Video cards: programmer's PoV
Modern GPU programming models (CUDA, AMD Stream, OpenCL, DirectCompute):
A parallel algorithm is defined by a pair: 1) a kernel (one loop iteration), 2) the iteration bounds.
The kernel is compiled by the driver.
The iteration bounds are used to create a grid of threads.
Input data is copied to video memory.
Execution is kicked off.
The result is copied back to main memory.
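The "kernel plus iteration bounds" split can be mimicked on the CPU to make it concrete. The sketch below is a sequential emulation, not a GPU implementation, and `blocksFor` and `launch1D` are hypothetical helper names. It shows that a launch rounds the iteration count up to whole blocks, so the kernel has to be guarded against indices past the bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>  // used by the usage example

// Hypothetical helper: number of thread blocks needed to cover n iterations
// with blocks of blockSize threads, i.e. ceil(n / blockSize) in integer math.
std::size_t blocksFor(std::size_t n, std::size_t blockSize) {
    assert(blockSize > 0);
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical sequential emulation of a 1-D launch: every (block, thread)
// pair of the grid computes its global index and runs the kernel once.
// The padded grid can hold more threads than iterations, so the kernel is
// only invoked for indices inside the iteration bounds.
template <typename Kernel>
void launch1D(std::size_t n, std::size_t blockSize, Kernel kernel) {
    const std::size_t gridDim = blocksFor(n, blockSize);
    for (std::size_t blockIdx = 0; blockIdx < gridDim; ++blockIdx) {
        for (std::size_t threadIdx = 0; threadIdx < blockSize; ++threadIdx) {
            const std::size_t i = blockSize * blockIdx + threadIdx;  // global index
            if (i < n) kernel(i);  // guard for the last, partially filled block
        }
    }
}
```

For example, `launch1D(1000, 256, body)` creates 4 blocks (1024 threads) and calls `body` exactly once for each index 0..999.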

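Example: SAXPY via CUDA
    __global__ void Saxpy(int n, float a, const float* X, float* Y) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) Y[i] = a * X[i] + Y[i];  // skip padding threads of the last block
    }

    // hX, hY: host arrays of N floats; X, Y: device buffers.
    float *X, *Y;
    cudaMalloc(&X, N * sizeof(float));
    cudaMalloc(&Y, N * sizeof(float));
    cudaMemcpy(X, hX, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(Y, hY, N * sizeof(float), cudaMemcpyHostToDevice);
    Saxpy<<<(N + 255) / 256, 256>>>(N, a, X, Y);  // <<<number of blocks, threads per block>>>
    cudaMemcpy(hY, Y, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(X);
    cudaFree(Y);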
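When porting a kernel like SAXPY, a CPU reference is a common way to check the GPU result. The sketch assumes the device output has already been copied back into a `std::vector<float>`; `saxpyReference` and `matchesReference` are hypothetical names and the tolerance is an arbitrary choice. Exact equality is avoided because the CUDA compiler may contract `a * X[i] + Y[i]` into a fused multiply-add, which rounds once instead of twice.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: returns a * x + y element-wise, inputs untouched.
std::vector<float> saxpyReference(float a, const std::vector<float>& x,
                                  const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) out[i] = a * x[i] + y[i];
    return out;
}

// Hypothetical checker: compares a device result with the reference using a
// relative tolerance (treated as absolute for values smaller than 1), since
// fused and unfused multiply-add can differ in the last bits.
bool matchesReference(const std::vector<float>& got,
                      const std::vector<float>& want, float relTol = 1e-5f) {
    if (got.size() != want.size()) return false;
    for (std::size_t i = 0; i < got.size(); ++i) {
        const float scale = std::fmax(1.0f, std::fabs(want[i]));
        if (std::fabs(got[i] - want[i]) > relTol * scale) return false;
    }
    return true;
}
```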

In fact
Brahma: data structures: data-parallel arrays; computations: C# expressions, LINQ combinators.
Accelerator v2: data structures: data-parallel arrays; computations: arithmetic operators and a number of predefined functions.
This does the trick for a lot of algorithms. But what if we have branching or irregular memory access?
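To illustrate the distinction, the C++ sketch below imitates the data-parallel-array style with a toy `DPArray` type that only offers whole-array operators; it is not the Brahma or Accelerator API. SAXPY composes naturally from such operators, whereas a gather with a data-dependent branch (the hypothetical `gatherOrZero`) is most directly written as a per-element loop body, which is what a kernel is.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy data-parallel array: whole-array operators only, no element indexing
// in user code. Imitates the style, not any real library.
struct DPArray {
    std::vector<float> data;
};

DPArray operator+(const DPArray& l, const DPArray& r) {
    assert(l.data.size() == r.data.size());
    DPArray out{std::vector<float>(l.data.size())};
    for (std::size_t i = 0; i < l.data.size(); ++i) out.data[i] = l.data[i] + r.data[i];
    return out;
}

DPArray operator*(float s, const DPArray& r) {
    DPArray out{r.data};
    for (float& v : out.data) v *= s;
    return out;
}

// SAXPY expressed purely with whole-array operators.
DPArray saxpy(float a, const DPArray& x, const DPArray& y) { return a * x + y; }

// Irregular access plus a data-dependent branch, written kernel-style: each
// element reads a position chosen at run time, and negative indices yield 0.
std::vector<float> gatherOrZero(const std::vector<float>& src,
                                const std::vector<int>& idx) {
    std::vector<float> out(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i) {
        if (idx[i] < 0) {                 // branch depends on the data
            out[i] = 0.0f;
        } else {                          // read position depends on the data
            const auto j = static_cast<std::size_t>(idx[i]);
            assert(j < src.size());
            out[i] = src[j];
        }
    }
    return out;
}
```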

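Example: CUDA interop
    saxpy = @"__global__ void Saxpy(float a, float* X, float* Y) {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        Y[i] = a * X[i] + Y[i];
    }";
    // Schematic, argument lists abbreviated. cuModuleLoadDataEx accepts PTX or cubin,
    // so the CUDA C source above must be compiled (e.g. with nvcc) before loading.
    nvcuda.cuModuleLoadDataEx(saxpy);
    nvcuda.cuMemcpyHtoD(X, Y);
    nvcuda.cuParamSeti(a, X, Y);
    nvcuda.cuFuncSetBlockShape(256);
    nvcuda.cuLaunchGrid((N + 255) / 256);
    nvcuda.cuMemcpyDtoH(Y);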
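Part of the interop burden is bookkeeping. With the legacy driver API shown here, the host places each kernel argument at an explicit byte offset (`cuParamSeti`, `cuParamSetf`, `cuParamSetv`, then `cuParamSetSize` for the total), and each offset has to respect the argument's alignment. The C++ sketch below only computes such a layout and never calls the driver; `ParamSpec`, `alignUp` and `layoutParams` are hypothetical names. Because the size of a device pointer depends on the platform and CUDA version, the usage example tries both 8- and 4-byte pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one kernel argument: its size and the
// alignment its offset must respect (a power of two).
struct ParamSpec {
    std::size_t size;
    std::size_t align;
};

// Rounds offset up to the next multiple of align.
std::size_t alignUp(std::size_t offset, std::size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

// Offsets of all arguments in the parameter buffer plus its total size.
struct Layout {
    std::vector<std::size_t> offsets;
    std::size_t totalSize;
};

// Places the arguments one after another, each at its aligned offset.
Layout layoutParams(const std::vector<ParamSpec>& params) {
    Layout out{{}, 0};
    std::size_t offset = 0;
    for (const ParamSpec& p : params) {
        offset = alignUp(offset, p.align);
        out.offsets.push_back(offset);
        offset += p.size;
    }
    out.totalSize = offset;
    return out;
}
```

For `Saxpy(float a, float* X, float* Y)` with 8-byte device pointers, the float sits at offset 0, the pointers at 8 and 16 (4 bytes of padding after `a`), and the buffer is 24 bytes; with 4-byte pointers the offsets are 0, 4, 8 and the size is 12.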

Conflux
Kernels are written in C#: data structures, local variables, branching, loops.
    float a;
    float[] x;
    [Result] float[] y;
    var i = GlobalIdx.X;
    y[i] = a * x[i] + y[i];
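The slide's claim that kernels may contain branching, loops and local variables can be illustrated with naive matrix multiplication, where one kernel invocation computes one output element. The C++ sketch below is a CPU emulation; treating `(gx, gy)` like `GlobalIdx.X` / `GlobalIdx.Y` is an assumption about how 2-D indices would be used, and `MatMulKernel` and `launch2D` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel for C = A * B with row-major n x n matrices. Inputs are
// fields (like the C# kernel's fields); c plays the role of the [Result] array.
struct MatMulKernel {
    std::size_t n;
    const std::vector<float>& a;
    const std::vector<float>& b;
    std::vector<float>& c;

    void operator()(std::size_t gx, std::size_t gy) const {
        if (gx >= n || gy >= n) return;         // branch: padding threads do nothing
        float sum = 0.0f;                       // local variable
        for (std::size_t k = 0; k < n; ++k)     // loop inside the kernel
            sum += a[gy * n + k] * b[k * n + gx];
        c[gy * n + gx] = sum;
    }
};

// Hypothetical sequential stand-in for a 2-D launch over a grid padded to
// whole tiles of tile x tile threads.
template <typename Kernel>
void launch2D(std::size_t width, std::size_t height, std::size_t tile, const Kernel& k) {
    const std::size_t gw = (width + tile - 1) / tile * tile;
    const std::size_t gh = (height + tile - 1) / tile * tile;
    for (std::size_t gy = 0; gy < gh; ++gy)
        for (std::size_t gx = 0; gx < gw; ++gx) k(gx, gy);
}
```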

Conflux
Avoids explicit interop with unmanaged code and lets the programmer use native .NET data types.
    float[] x, y;
    var cfg = new CudaConfig();
    var kernel = cfg.Configure<Saxpy>();
    y = kernel.Execute(a, x, y);
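The appeal of the `Configure<Saxpy>()` / `Execute(a, x, y)` style is that arrays go in and results come out by value, with no explicit copies or parameter offsets. The C++ sketch below mimics only that shape of API; `CpuConfig`, `configure` and `execute` are hypothetical names loosely modelled on the slide, and this executor simply runs the kernel body on the CPU rather than generating GPU code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel class: fields are inputs, y stands in for the
// [Result] field that is handed back to the caller.
struct Saxpy {
    float a;
    std::vector<float> x;
    std::vector<float> y;

    void run(std::size_t i) { y[i] = a * x[i] + y[i]; }
    std::size_t size() const { return y.size(); }
    std::vector<float> result() const { return y; }
};

// Hypothetical configuration object, loosely mirroring CudaConfig: it hands
// out an executor for a kernel type. This sketch only runs on the CPU.
struct CpuConfig {
    template <typename Kernel>
    struct Executor {
        // Binds the arguments to the kernel's fields in declaration order,
        // runs the body for every index and returns the result array.
        template <typename... Args>
        auto execute(Args... args) const {
            Kernel k{args...};
            for (std::size_t i = 0; i < k.size(); ++i) k.run(i);
            return k.result();
        }
    };

    template <typename Kernel>
    Executor<Kernel> configure() const { return {}; }
};
```

Usage then reads much like the slide: `CpuConfig cfg; auto kernel = cfg.configure<Saxpy>(); y = kernel.execute(a, x, y);`.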

How does it work?
Front end: decompiles C#.
AST transformer: inlines calls, destructures classes and arrays, maps intrinsics.
Back end: generates PTX (NVIDIA GPU assembler) and/or multicore IL.
Interop: binds to the nvcuda driver, which is capable of executing GPU assembler.
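To give a feel for the "maps intrinsics" step, here is a guess at how an index intrinsic such as `GlobalIdx.X` could be lowered to PTX. It is a C++ sketch, not Conflux's code generator: `lowerGlobalIdx` and the register numbering are hypothetical, and register declarations are omitted. Only the special registers (`%ctaid`, `%ntid`, `%tid`, PTX's names for blockIdx, blockDim and threadIdx) and the `mov` / `mad.lo` instructions are standard PTX.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical lowering of GlobalIdx.<axis> to PTX text. The global index is
// ctaid * ntid + tid, which fits a single mad.lo instruction. Registers are
// named %r<firstReg>..%r<firstReg + 3>; their .reg declarations are omitted.
std::vector<std::string> lowerGlobalIdx(char axis, int firstReg) {
    assert(axis == 'x' || axis == 'y' || axis == 'z');
    const std::string a(1, axis);
    auto reg = [&](int k) { return "%r" + std::to_string(firstReg + k); };
    return {
        "mov.u32 " + reg(0) + ", %ctaid." + a + ";",  // block index
        "mov.u32 " + reg(1) + ", %ntid." + a + ";",   // block size
        "mov.u32 " + reg(2) + ", %tid." + a + ";",    // thread index in block
        "mad.lo.u32 " + reg(3) + ", " + reg(0) + ", " + reg(1) + ", " + reg(2) + ";",
    };
}
```

For `lowerGlobalIdx('x', 1)` the last line is `mad.lo.u32 %r4, %r1, %r2, %r3;`, leaving the global index in `%r4`.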

Current progress
http://bitbucket.org/conflux/conflux
Proof of concept: capable of computing the hello-world of parallel computations, matrix multiplication.
Setting aside the (currently) high overhead of JIT compilation, the idea works well even with a naïve code generator: 1x CPU < 2x CPU << GPU.
Triple license: AGPL, an exception for OSS projects, commercial.
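The "1x CPU < 2x CPU" comparison presumably refers to running the matrix multiplication on one versus two cores. A minimal way to reproduce that kind of split is to give each thread a contiguous block of rows. The sketch uses plain `std::thread` rather than Conflux's multicore IL back end, `matmulRows` and `matmulParallel` are hypothetical names, and it makes no claim about speedups.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Naive n x n row-major matrix multiplication restricted to rows [rowBegin, rowEnd).
void matmulRows(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& c, std::size_t n,
                std::size_t rowBegin, std::size_t rowEnd) {
    for (std::size_t i = rowBegin; i < rowEnd; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

// Splits the rows across `workers` threads. Each thread writes a disjoint
// range of rows of c, so join() is the only synchronization needed.
std::vector<float> matmulParallel(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  std::size_t n, unsigned workers) {
    assert(a.size() == n * n && b.size() == n * n && workers > 0);
    std::vector<float> c(n * n, 0.0f);
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = n * w / workers;
        const std::size_t end = n * (w + 1) / workers;
        pool.emplace_back(matmulRows, std::cref(a), std::cref(b), std::ref(c),
                          n, begin, end);
    }
    for (std::thread& t : pool) t.join();
    return c;
}
```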

Future work
GPU-specific optimizations (e.g. diagonal stripes to improve bandwidth utilization in matrix transposition).
Polyhedral model for loop nest optimization (it can be configured to fit specific levels and sizes of the memory hierarchy, and there exist GPU-specific linear heuristics that optimize spatial and temporal locality).
Distributed execution (a new level of the memory hierarchy if we use the polyhedral model).
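The "diagonal stripes" idea can be illustrated with the diagonal block reordering known from NVIDIA's CUDA SDK transpose sample: launch blocks are mapped to tiles along diagonals, so that blocks running at the same time spread their memory traffic over different memory partitions. The C++ sketch reproduces only the index arithmetic, sequentially on the CPU, so it shows the mapping and not the bandwidth effect. `diagonalTile` and `transposeDiagonal` are hypothetical names, and it assumes a square matrix whose size is a multiple of the tile size.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Diagonal reordering for a square grid of tiles: launch block (bx, by) is
// assigned tile (tx, ty) = ((bx + by) % grid, bx). As bx advances, both tile
// coordinates advance, so consecutive blocks walk along a diagonal.
std::pair<std::size_t, std::size_t> diagonalTile(std::size_t bx, std::size_t by,
                                                 std::size_t grid) {
    return {(bx + by) % grid, bx};
}

// Tiled transpose out = in^T of an n x n row-major matrix, visiting tiles in
// diagonal order. The result equals a plain transpose; only the order in
// which tiles are processed changes.
std::vector<float> transposeDiagonal(const std::vector<float>& in,
                                     std::size_t n, std::size_t tile) {
    assert(tile > 0 && n % tile == 0 && in.size() == n * n);
    const std::size_t grid = n / tile;
    std::vector<float> out(n * n);
    for (std::size_t by = 0; by < grid; ++by)
        for (std::size_t bx = 0; bx < grid; ++bx) {
            const std::pair<std::size_t, std::size_t> t = diagonalTile(bx, by, grid);
            for (std::size_t y = t.second * tile; y < (t.second + 1) * tile; ++y)
                for (std::size_t x = t.first * tile; x < (t.first + 1) * tile; ++x)
                    out[x * n + y] = in[y * n + x];
        }
    return out;
}
```

The mapping is a permutation of the tiles: given a tile (tx, ty), the block that handles it is bx = ty, by = (tx - ty) mod grid.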

Conclusion
Conflux: GPGPU for .NET
http://bitbucket.org/conflux/conflux
eugene.burmako@confluxhpc.net
