Programming for GPUs
Alcides Fonseca
me@alcidesfonseca.com
Universidade de Coimbra, Portugal
It turns out we had a Ferrari sitting idle in our computer, right next to a Citroën 2CV.
About me
• Web Developer (Django, Ruby, PHP, …)
• Eccentric Programmer (Haskell, Scala)
• Researcher (GPGPU Programming)
• Lecturer (Distributed Systems, Operating Systems and Compilers)
This presentation
• 20 minutes - blah blah blah
• 20 minutes - printf("Code\n");
• 20 minutes - Q&A
Moore's Law
Go multicore!
Parallelism

            Workstation (2010)     Server #1 (2011)            Server #2 (2013)
CPU         Dual Core @ 2.66 GHz   2x6x2 threads @ 2.80 GHz    2x8x2 threads @ 2.00 GHz
RAM         4 GB                   24 GB                       32 GB
GPGPU

[diagram: CPU and GPU, each with its own separate memory]
GPGPU
• Emerged from scientist hackers
• Visual analysis for robots
• UNIX password cracking
• Neural networks
• Nowadays:
• DNA sequencing
• Earthquake prediction
• Generation of chemical compounds
• Financial forecasting and analysis
• WiFi password cracking
• Bitcoin mining
Parallelism

            Workstation (2010)       Server #1 (2011)           Server #2 (2013)
CPU         Dual Core @ 2.66 GHz     2x6x2 threads @ 2.80 GHz   2x8x2 threads @ 2.00 GHz
RAM         4 GB                     24 GB                      32 GB
GPU         NVIDIA GeForce GTX 285   NVIDIA Quadro 4000         AMD FirePro V4900
GPU #Cores  240 (1508 MHz)           256 (950 MHz)              480 (800 MHz)
GPU memory  1 GB                     2 GB                       1 GB
Back of the napkin

                       Workstation (2010)   Server #1 (2011)           Server #2 (2013)
CPU                    2 cores @ 2.66 GHz   2x6x2 threads @ 2.80 GHz   2x8x2 threads @ 2.00 GHz
CPU cores x frequency  5.32 GHz             < 67.2 GHz                 < 64 GHz
GPU #Cores             240 (1508 MHz)       256 (950 MHz)              480 (800 MHz)
GPU cores x frequency  361.92 GHz           243.2 GHz                  384 GHz
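The back-of-the-napkin numbers are nothing more than cores multiplied by clock frequency. A quick sketch to reproduce them (`aggregate_ghz` is a made-up helper, and the metric deliberately ignores IPC, SIMD width and memory bandwidth - it is only meant to show the order-of-magnitude gap):

```python
def aggregate_ghz(cores, clock_mhz):
    """Naive aggregate compute: cores (or hardware threads) times clock, in GHz."""
    return cores * clock_mhz / 1000.0

# CPUs (hardware threads x clock; hyper-threading makes these upper bounds)
print(aggregate_ghz(2, 2660))    # Workstation 2010: 5.32
print(aggregate_ghz(24, 2800))   # Server #1 2011:   67.2 (at most)
print(aggregate_ghz(32, 2000))   # Server #2 2013:   64.0 (at most)

# GPUs (cores x clock)
print(aggregate_ghz(240, 1508))  # GeForce GTX 285:  361.92
print(aggregate_ghz(256, 950))   # Quadro 4000:      243.2
print(aggregate_ghz(480, 800))   # FirePro V4900:    384.0
```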
Benchmarks
But if GPUs are so powerful, why do we still use CPUs?
Problem #1 - Limited memory

            Workstation (2010)   Server #1 (2011)   Server #2 (2013)
RAM         4 GB                 24 GB              32 GB
GPU memory  1 GB                 2 GB               1 GB
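A common way around the limited device memory is to stream the data through the GPU in chunks that fit. A minimal sketch of the idea - `chunk_sum` and `sum_in_chunks` are hypothetical names, with `chunk_sum` standing in for a host-to-device transfer plus kernel launch:

```python
def chunk_sum(chunk):
    # Stand-in for: copy chunk to the GPU, launch a sum kernel, copy result back.
    return sum(chunk)

def sum_in_chunks(data, gpu_capacity):
    """Reduce a data set larger than GPU memory, gpu_capacity elements at a time."""
    total = 0
    for start in range(0, len(data), gpu_capacity):
        total += chunk_sum(data[start:start + gpu_capacity])
    return total

print(sum_in_chunks(list(range(10)), 4))  # 45 (three launches: 4 + 4 + 2 elements)
```

Each chunk pays a transfer cost, which is exactly why Problem #2 below matters so much.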
Problem #2 - Different memories

[diagram: data must be copied between CPU RAM and GPU memory - extremely slow]
Problem #3 - Branching is a bad idea

From the ATI Stream Computing guide: compute units, in turn, contain numerous processing elements, which are the fundamental, programmable computational units that perform integer, single-precision floating-point, double-precision floating-point, and transcendental operations. All stream cores within a compute unit execute the same instruction sequence; different compute units can execute different instructions.

[Figure 1.2 - Simplified Block Diagram of the GPU Compute Device (much of this is transparent to the programmer): general-purpose registers, branch execution unit, processing elements, T-processing element, instruction and control flow, stream core, ultra-threaded dispatch processor, compute units]
if (threadIdx.x % 2 == 0) {
    // do something
} else {
    // do other thing
}

Thread Divergence
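To see why the even/odd branch above hurts, here is a toy lock-step simulation (illustrative Python, not real GPU code): lanes in a SIMD group cannot take different paths in parallel, so the hardware executes both paths and masks out the inactive lanes each time.

```python
def simd_if_else(lane_ids, cond, then_fn, else_fn):
    """Run an if/else across all lanes in lock step; count the passes needed."""
    mask = [cond(i) for i in lane_ids]
    results = [None] * len(lane_ids)
    passes = 0
    if any(mask):        # pass 1: only the 'then' lanes are active
        passes += 1
        for k, i in enumerate(lane_ids):
            if mask[k]:
                results[k] = then_fn(i)
    if not all(mask):    # pass 2: only the 'else' lanes are active
        passes += 1
        for k, i in enumerate(lane_ids):
            if not mask[k]:
                results[k] = else_fn(i)
    return results, passes

# threadIdx.x % 2 == 0 splits every group in half: two passes, half the lanes
# idle in each, so the divergent branch takes roughly twice as long.
_, passes = simd_if_else(range(8), lambda i: i % 2 == 0, lambda i: i * 2, lambda i: i + 1)
print(passes)  # 2

# A condition uniform across the group diverges nowhere: a single pass.
_, passes = simd_if_else(range(8), lambda i: True, lambda i: i * 2, lambda i: i + 1)
print(passes)  # 1
```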
Summing up

CPU               GPU
MIMD              SIMD
task parallel     data parallel
low throughput    high throughput
low latency       high latency
Problem #4 - It's hard
#ifndef GROUP_SIZE
#define GROUP_SIZE (64)
#endif

#ifndef OPERATIONS
#define OPERATIONS (1)
#endif

////////////////////////////////////////////////////////////////////////////////

#define LOAD_GLOBAL_I2(s, i) \
    vload2((size_t)(i), (__global const int*)(s))

#define STORE_GLOBAL_I2(s, i, v) \
    vstore2((v), (size_t)(i), (__global int*)(s))

////////////////////////////////////////////////////////////////////////////////

#define LOAD_LOCAL_I1(s, i) \
    ((__local const int*)(s))[(size_t)(i)]

#define STORE_LOCAL_I1(s, i, v) \
    ((__local int*)(s))[(size_t)(i)] = (v)

#define LOAD_LOCAL_I2(s, i) \
    (int2)( (LOAD_LOCAL_I1(s, i)), \
            (LOAD_LOCAL_I1(s, i + GROUP_SIZE)) )

#define STORE_LOCAL_I2(s, i, v) \
    STORE_LOCAL_I1(s, i, (v)[0]); \
    STORE_LOCAL_I1(s, i + GROUP_SIZE, (v)[1])

#define ACCUM_LOCAL_I2(s, i, j) \
    { \
        int2 x = LOAD_LOCAL_I2(s, i); \
        int2 y = LOAD_LOCAL_I2(s, j); \
        int2 xy = (x + y); \
        STORE_LOCAL_I2(s, i, xy); \
    }

////////////////////////////////////////////////////////////////////////////////

__kernel void
reduce(__global int2 *output,
       __global const int2 *input,
       __local int2 *shared,
       const unsigned int n)
{
    const int2 zero = (int2)(0, 0);
    const unsigned int group_id = get_global_id(0) / get_local_size(0);
    const unsigned int group_size = GROUP_SIZE;
    const unsigned int group_stride = 2 * group_size;
    const size_t local_stride = group_stride * group_size;

    unsigned int op = 0;
    unsigned int last = OPERATIONS - 1;
    for (op = 0; op < OPERATIONS; op++)
    {
        const unsigned int offset = (last - op);
        const size_t local_id = get_local_id(0) + offset;
        STORE_LOCAL_I2(shared, local_id, zero);

        size_t i = group_id * group_stride + local_id;
        while (i < n)
        {
            int2 a = LOAD_GLOBAL_I2(input, i);
            int2 b = LOAD_GLOBAL_I2(input, i + group_size);
            int2 s = LOAD_LOCAL_I2(shared, local_id);
            STORE_LOCAL_I2(shared, local_id, (a + b + s));
            i += local_stride;
        }

        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 512)
        if (local_id < 256) { ACCUM_LOCAL_I2(shared, local_id, local_id + 256); }
#endif
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 256)
        if (local_id < 128) { ACCUM_LOCAL_I2(shared, local_id, local_id + 128); }
#endif
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 128)
        if (local_id < 64) { ACCUM_LOCAL_I2(shared, local_id, local_id + 64); }
#endif
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 64)
        if (local_id < 32) { ACCUM_LOCAL_I2(shared, local_id, local_id + 32); }
#endif
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 32)
        if (local_id < 16) { ACCUM_LOCAL_I2(shared, local_id, local_id + 16); }
#endif
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 16)
        if (local_id < 8) { ACCUM_LOCAL_I2(shared, local_id, local_id + 8); }
#endif
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 8)
        if (local_id < 4) { ACCUM_LOCAL_I2(shared, local_id, local_id + 4); }
#endif
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 4)
        if (local_id < 2) { ACCUM_LOCAL_I2(shared, local_id, local_id + 2); }
#endif
        barrier(CLK_LOCAL_MEM_FENCE);
#if (GROUP_SIZE >= 2)
        if (local_id < 1) { ACCUM_LOCAL_I2(shared, local_id, local_id + 1); }
#endif
    }

    barrier(CLK_LOCAL_MEM_FENCE);
    if (get_local_id(0) == 0)
    {
        int2 v = LOAD_LOCAL_I2(shared, 0);
        STORE_GLOBAL_I2(output, group_id, v);
    }
}
For comparison, the sequential CPU version:

int sum = 0;
for (int i = 0; i < array.length; i++)
    sum += array[i];

CPU sum vs. GPU sum
How do you program for GPUs?
• CUDA (NVIDIA)
• OpenCL (Apple, Intel, NVIDIA, AMD)
• OpenACC (Cray, NVIDIA, PGI, CAPS)
• MATLAB
• Accelerate, MARS, ÆminiumGPU
ÆminiumGPU

map(λx . x², [3,4,5,6]) = [9, 16, 25, 36]
reduce(λxy . x+y, [3,4,5,6]) = 18    (partial sums: 3+4=7, 5+6=11)
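The two ÆminiumGPU primitives above correspond directly to the familiar functional operations. In plain Python, for comparison only (the actual library compiles the lambdas to OpenCL and decides at runtime whether the GPU is worth it):

```python
from functools import reduce

squares = list(map(lambda x: x ** 2, [3, 4, 5, 6]))
print(squares)  # [9, 16, 25, 36]

total = reduce(lambda x, y: x + y, [3, 4, 5, 6])
print(total)    # 18
```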
ÆminiumGPU Decision Mechanism

Name            Size  C/R  Description
OuterAccess     3     C    Global GPU memory read.
InnerAccess     3     C    Local (thread-group) memory read. This area of memory is faster than the global one.
ConstantAccess  3     C    Constant (read-only) memory read. This memory is faster on some GPU models.
OuterWrite      3     C    Write to global memory.
InnerWrite      3     C    Write to local memory, which is also faster than global.
BasicOps        3     C    Simplest and fastest instructions, including arithmetic, logical and binary operators.
TrigFuns        3     C    Trigonometric functions, including sin, cos, tan, asin, acos and atan.
PowFuns         3     C    pow, log and sqrt functions.
CmpFuns         3     C    max and min functions.
Branches        3     C    Number of possible branching instructions, such as for, if and while.
DataTo          1     R    Size of input data transferred to the GPU, in bytes.
DataFrom        1     R    Size of output data transferred from the GPU, in bytes.
ProgType        1     R    One of Map, Reduce, PartialReduce or MapReduce: the different types of operations supported by ÆminiumGPU.

Table I: List of features
Code (CUDA & OpenCL)
Reduction

Input, then reduction step 1, __syncthreads(), reduction step 2, __syncthreads(), and so on.

[diagram: within a thread block, each step sums pairs of elements in parallel; __syncthreads() separates the steps]
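The reduction steps drawn above can be sketched sequentially: each iteration of the outer loop is one parallel step, and the comment marks where __syncthreads() would sit on the GPU (an illustrative sketch in Python, not CUDA).

```python
def tree_reduce(values):
    """Pairwise tree reduction; assumes len(values) is a power of two."""
    data = list(values)
    n = len(data)
    step = 1
    while step < n:
        # Each "thread" i adds its partner's value; on a GPU all additions
        # within one step happen in parallel across the thread block.
        for i in range(0, n, 2 * step):
            data[i] += data[i + step]
        # __syncthreads() would go here: no thread may start the next step
        # before every partial sum from this one is visible.
        step *= 2
    return data[0]

print(tree_reduce([3, 4, 5, 6]))  # 18: step 1 gives 7 and 11, step 2 gives 18
```

This is why the GPU version needs only log2(n) steps instead of the n-1 additions of the sequential loop.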
Recent advances
• Kernel calls from the GPU
• Multi-GPU support
• Unified Memory
• Task parallelism (Hyper-Q)
• Better profilers
• C++ support (auto and lambdas)
me@alcidesfonseca.com
Alcides Fonseca
