This document discusses x86 processor evolution, GPUs as accelerators, accelerated processing units (APUs), and OpenCL. It describes how x86 processors are gaining more cores and memory channels over time. It explains how GPUs can accelerate tasks like video transcoding using massively parallel processing. APUs integrate CPU and GPU cores on a single die to improve performance and efficiency. Finally, it introduces OpenCL as an open standard for programming heterogeneous systems like CPUs and GPUs.
5. 4P/24-core system example
Very good scalability
One memory controller for every processor
Full-duplex HyperTransport links (up to 5.2GHz)
Bus Optimization: HT Assist (Cache Probe Filtering)
Still the only available 4P system with Direct Connect Architecture
[Diagram: four MEMORY blocks]
6. Direct Connect Architecture 1.0
Balanced and Scalable Design to Support up to 6 Cores
[Diagram: four CPUs, each with 2 memory channels and 8 DIMMs per CPU]
No front side bus
HyperTransport™ technology
Integrated memory controller
NUMA memory architecture
7. Direct Connect Architecture 2.0
Balanced and Scalable Design to Support up to 16 Cores* per CPU
[Diagram: four CPUs, each with 4 memory channels and 12 DIMMs per CPU]
• 1-hop between processors
• Four memory channels
• Up to 50% more DIMMs
• Up to 33% increase in CPU to CPU communication speed±
8. What is next for x86 CPUs
• More processor cores to come (12, 16, 16 double cores)
• More memory channels (improves memory bandwidth per core)
• Improved IPC (8 instructions per cycle is a target)
9. Top500 list - beyond the petaflop
Datacenters in the USA will spend more than $3 billion on energy in 2009
12. 2011 GPU Architecture
AMD Radeon™ HD 6900 Series
Dual graphics engines
New VLIW4 core architecture
Up to 24 SIMD engines
Up to 96 Texture Units
Upgraded render back-ends
Improved anti-aliasing performance
Fast 256-bit GDDR5 memory interface
Up to 5.5 Gbps
New GPU compute features
13. Designing very efficient GPUs
Full load: 180W; Idle: 27W
[Chart: GFLOPS/W and GFLOPS/mm2 by GPU generation; data labels range from 0.42 to 14.47]
GPUs shown: ATI Radeon™ X1800 XT (Nov-05), ATI Radeon™ X1900 XTX (Jan-06), ATI Radeon™ HD 2900 PRO (Sep-07), ATI Radeon™ HD 3870 (Nov-07), ATI Radeon™ HD 4870 (Jun-08), ATI Radeon™ HD 5870 (Oct-09)
14. Old and New in High Performance Computing
Old: Power is free, transistors are expensive
New: Power is expensive, transistors are free
(Can put more transistors on a chip than we can afford to turn on)
Old: Multiplies are slow, memory access is fast
New: Multiplies are fast, memory is slow
(up to 200 clocks to DRAM memory, 4 clocks for an FP multiply)
Old: Increasing instruction-level parallelism via compiler innovation
New: Explicit thread and data parallelism must be exploited
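The memory-versus-multiply figures above imply a rough break-even point: how many multiplies fit in the time of one trip to DRAM. The sketch below is only an illustration under the slide's round numbers (200 clocks per DRAM access, 4 clocks per FP multiply); it ignores caches, pipelining and latency hiding, and the function name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: how many FP multiplies fit in the time of one
 * DRAM access, with both costs given in clock cycles. Ignores caches,
 * pipelining and latency hiding, so treat the result as a rough guide. */
static int multiplies_per_dram_access(int dram_clocks, int multiply_clocks)
{
    assert(dram_clocks > 0 && multiply_clocks > 0);
    return dram_clocks / multiply_clocks;
}
```

Calling multiplies_per_dram_access(200, 4) gives 50, which suggests why code that does little arithmetic per memory access tends to be limited by memory rather than by compute.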
15. GPUs: more than just gaming
Processing power – millions of operations per second
Single Core 12
Dual Core 24
Quad Core 48
Hexa Core 72
12 Cores 144
Radeon HD 5970 2700
Both use GPUs: Wii Sports - Golf and an oil exploration platform (2010)
16. DirectX® 11 Multi-Threading
Application, DirectX runtime, and DirectX driver can each run in separate threads
Tasks like loading a texture or compiling a shader can execute in parallel with the main rendering thread
[Diagram: DirectX® 10, DirectX® 11]
22. AMD Balanced Platform
GPU is ideal for data parallel algorithms like image processing, CAE, etc.
Great use for ATI Stream technology
Great use for additional GPUs
CPU is excellent for running some algorithms
Ideal place to process if GPU is fully loaded
Great use for additional CPU cores
Workload types: Serial/Task-Parallel Workloads, Graphics Workloads, Other Highly Parallel Workloads
Delivers optimal performance for a wide range of platform configurations
23. ATI Stream Technology is…
Heterogeneous: Developers leverage AMD GPUs and x86 CPUs for optimal application performance and user experience
High performance: Massively parallel, programmable GPU architecture delivers unprecedented performance and power efficiency
Industry Standards: OpenCL™ and DirectCompute 11 enable cross-platform development
Sciences, Government, Engineering, Gaming, Digital Content Creation, Productivity
24. Improvements already reached consumers
[Chart: processor utilization, 0% to 80%, with an ATI Stream label]
Adobe Flash plugin used by Youtube.com
Better image quality and video smoothness
Lower processor usage
26. Video Transcoding Sample
No GPU Acceleration
Using four CPU Cores
CPU Usage: 100%
GPU Usage: 1%
Time to finish: 1h 52m
Peak power: 145W
Total energy: 0.23 kWh
Energy Price: $0.15
27. Video Transcoding Sample
ATI GPU Acceleration
Using hundreds of Stream Processors
Figures in parentheses are from the run without GPU acceleration.
CPU Usage: 45% (100%)
GPU Usage: 35% (1%)
Time to finish: 26m (1h 52m)
Peak power: 198W (145W)
Total energy: 0.11 kWh (0.23 kWh)
Energy Price: $0.07 ($0.15)
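The two transcoding runs can be compared with a little arithmetic. The sketch below uses only the times and energy totals quoted on the slides, reading the "Total Power" figures as energy in kWh (an assumption); the type and function names are hypothetical.

```c
#include <assert.h>

/* Figures quoted on the two transcoding slides. */
typedef struct {
    double minutes;     /* time to finish */
    double energy_kwh;  /* total energy, read here as kWh (assumption) */
} transcode_run;

static const transcode_run CPU_ONLY  = { 112.0, 0.23 };  /* 1h 52m */
static const transcode_run GPU_ACCEL = {  26.0, 0.11 };

/* Speedup = baseline time / accelerated time. */
static double speedup(transcode_run base, transcode_run accel)
{
    assert(accel.minutes > 0.0);
    return base.minutes / accel.minutes;
}

/* Fraction of energy saved relative to the baseline run. */
static double energy_saved(transcode_run base, transcode_run accel)
{
    assert(base.energy_kwh > 0.0);
    return 1.0 - accel.energy_kwh / base.energy_kwh;
}
```

With these inputs, speedup(CPU_ONLY, GPU_ACCEL) is about 4.3 and energy_saved(CPU_ONLY, GPU_ACCEL) is about 0.52: the accelerated run draws more peak power but finishes early enough to use roughly half the energy.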
29. Today
Multi-core CPU: ~800 million transistors; multi-tasking
TeraFLOPS-class GPU: up to 2 billion transistors; gaming on multiple monitors
Full HD video and audio
30. A new era of performance evolution
Single-Core
Challenge: Power consumption, Complexity
[Chart: Single-thread Performance over Time; "We are here"]
Multi-Core
Challenge: Power consumption, Software
[Chart: Performance over Time x Cores; "We are here"]
Heterogeneous computing
Pros: Performance, Power efficient
Cons: Software availability
[Chart: "?" over Time; "We are here"]
31. A new era of performance evolution
Single-Core Multi-Core
CPU
Core efficiency
Software
Acceleration
Multimedia
Gaming
GPU
32. Putting it all together – The Future is Fusion
AMD “Istanbul” six-core processor; RV500 GPU Core (2006)
[Block diagram labels. CPU: cores 1 to 6, each with an L2 cache; L3 cache; crossbar; memory controller; HyperTransport. GPU: ring stops, client interfaces, memory controller. Interconnect: PCI-e, chipset.]
33. Putting it all together – The Future is Fusion
AMD “Istanbul” six-core processor; RV700 GPU Core (2008-2009)
[Block diagram labels. CPU: cores 1 to 6, each with an L2 cache; L3 cache; crossbar; memory controller; HyperTransport. Interconnect: PCI-e, chipset.]
34. Putting it all together – The Future is Fusion
AMD “Istanbul” six-core processor; RV700 GPU Core
[Block diagram labels: CROSSBAR, CROSSBAR]
35. 2011: welcome to the APU era!
CPU APU GPU
“Supercomputing power in a notebook platform whose battery lasts for a full day”
36. One Design, Fewer Watts, Massive Capability
AMD Dual-Core CPU + Northbridge + Discrete-level DirectX® 11 GPU = Fusion APU (“Zacate”)
Die area, in the order shown: 66 sq. mm, 117 sq. mm, 59 sq. mm, 75 sq. mm
Power, in the order shown: 13 watts, 25 watts, 8 watts, 18 watts
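The area and power figures above can be totalled to see how much the single-die design saves. This is a sketch under a stated assumption: the slide text does not label the four figures, so the code treats the first three as the separate chips and the fourth as the resulting APU. The array and function names are hypothetical.

```c
#include <assert.h>

/* Die area (sq. mm) and power (watts) as listed on the slide.
 * ASSUMPTION: the first three entries are the separate chips and the
 * fourth is the resulting APU; the slide text does not label them. */
static const int AREA_MM2[4] = { 66, 117, 59, 75 };
static const int POWER_W[4]  = { 13, 25, 8, 18 };

/* Sum of the first three entries (the assumed discrete parts). */
static int discrete_total(const int values[4])
{
    return values[0] + values[1] + values[2];
}

/* Percentage reduction from the discrete total to the APU figure. */
static double reduction_percent(const int values[4])
{
    int total = discrete_total(values);
    assert(total > 0);
    return 100.0 * (total - values[3]) / total;
}
```

Under that assumption the discrete parts total 242 sq. mm and 46 watts, so the 75 sq. mm, 18 watt APU is roughly 69% smaller and draws roughly 61% less power.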
37. Graphics and Media Processing Efficiency Improvements
2010 IGP-based Platform
[Diagram labels: CPU chip with CPU cores and UNB / MC; DDR3 DIMM memory; separate chip with GPU, UVD and SB functions; PCIe]
Graphics requires memory bandwidth to bring full capabilities to life
Bandwidth pinch points and latency hold back the GPU capabilities
2011 APU-based Platform
[Diagram labels: APU chip with CPU cores, UNB, MC, GPU and UVD; DDR3 DIMM memory; PCIe]
3X bandwidth between GPU and memory
Even the same sized GPU is substantially more effective in this configuration
Eliminate latency and power associated with the extra chip crossing
Substantially smaller physical footprint
Bandwidth labels in the diagrams: ~17 GB/sec (both platforms), ~7 GB/sec, ~27 GB/sec
38. “Ontario” & “Zacate” Architecture
APU
>2 x86 CPU Cores (40nm “Bobcat” core – 1 MB L2, 64-bit FPU)
>C6 and power gating
>Array of SIMD Engines
• DX11 graphics performance
• Industry leading 3D and graphics processing
>3rd Generation Unified Video Decoder
>H.264, VC1, DivX/Xvid format
>DDR3 800-1066, 2 DIMMs, 64 bit channel
>BGA package
Display and I/O
>Two dedicated digital display interfaces
• Configurable externally as HDMI, DVI, and/or DisplayPort
• Also supports a single link LVDS for internal panels
>Integrated VGA
>5x8 PCIe®
> “Hudson” Fusion Controller Hub
40. ATI Stream SDK:
OpenCL™ For Multicore x86 CPUs and GPUs
http://developer.amd.com/
The Power of Fusion: Developers leverage heterogeneous architecture to deliver superior user experience
• First complete OpenCL™ development platform
• Certified OpenCL 1.0 compliant by the Khronos Group
• Write code that can scale well on multi-core CPUs and GPUs
• AMD delivers on the promise of OpenCL™, with both high-performance CPU and GPU technologies
• Available for download now as part of ATI Stream SDK beta program – includes documentation, samples, and developer support
41. OpenCL™: Game-Changing Development
Enabling Broad Adoption of GP-GPU Capabilities
Industry standard API: Open, multiplatform development platform for heterogeneous architectures
The power of Fusion: Leverages CPUs and GPUs for balanced system approach
Broad industry support: Created by architects from AMD, Apple, IBM, Intel, Nvidia, Sony, etc.
Fast track development: Ratified in December; AMD is the first company to provide a complete OpenCL solution
Momentum: Enormous interest from mainstream developers and application ISVs
More stream-enabled applications across all markets
42. Open Standards:
Maximize Developer Freedom and Addressable Market
Vendor specific (cross-platform limiters)
• Apple Display Connector
• 3dfx Glide
• Nvidia CUDA
• Nvidia Cg
• Rambus
• Unified Display Interface
Vendor neutral (cross-platform enablers)
• Digital Visual Interface
• OpenCL™
• DirectX®
• Certified DP
• JEDEC
• OpenGL®
43. Comparing OpenCL™ and DirectX® 11 DirectCompute
How will developers choose between OpenCL™ and DirectX® 11 DirectCompute?
Feature set is similar in both APIs
DirectX® 11 DirectCompute
• Easiest path to add compute capabilities to existing DirectX applications
• Windows Vista® and Windows® 7 only
OpenCL™
• Ideal path for new applications porting to the GPU for the first time
• True multiplatform: Windows®, Linux®, MacOS
• Natural programming without dealing with a graphics API
44. Anatomy of OpenCL™
Language Specification
• C-based cross-platform programming interface
• Subset of ISO C99 with language extensions - familiar to developers
• Well-defined numerical accuracy - IEEE 754 rounding behavior with defined maximum error
• Online or offline compilation and build of compute kernel executables
• Includes a rich set of built-in functions
Platform Layer API
• A hardware abstraction layer over diverse computational resources
• Query, select and initialize compute devices
• Create compute contexts and work-queues
Runtime API
• Execute compute kernels
• Manage scheduling, compute, and memory resources
45. OpenCL Example
Scalar
void square(int n, const float *a, float *result)
{
    int i;
    for (i = 0; i < n; i++)
        result[i] = a[i] * a[i];
}
Data-Parallel
__kernel void dp_square(__global const float *a, __global float *result)
{
    /* Each work-item handles the single element selected by its global ID. */
    int id = get_global_id(0);
    result[id] = a[id] * a[id];
}
// dp_square executes over “n” work-items
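To make the relationship between the two versions concrete, the kernel's execution over "n" work-items can be emulated in plain C without an OpenCL runtime. This is only a sketch: emu_get_global_id, dp_square_emulated and run_over_work_items are hypothetical names, not part of OpenCL, and the work-items run one after another here, whereas a real device may run them in parallel and in any order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's get_global_id(0): the emulated
 * runtime sets this before each kernel invocation. */
static size_t current_global_id;

static size_t emu_get_global_id(void)
{
    return current_global_id;
}

/* Same body as dp_square, written as plain C. */
static void dp_square_emulated(const float *a, float *result)
{
    size_t id = emu_get_global_id();
    result[id] = a[id] * a[id];
}

/* Run the kernel once per work-item, sequentially. A real OpenCL
 * runtime may run these invocations in parallel and in any order. */
static void run_over_work_items(size_t n, const float *a, float *result)
{
    assert(a != NULL && result != NULL);
    for (size_t i = 0; i < n; i++) {
        current_global_id = i;
        dp_square_emulated(a, result);
    }
}
```

Calling run_over_work_items(n, a, result) fills result with the same values as square(n, a, result). The difference is that the loop now lives in the (emulated) runtime rather than in the kernel, which is what leaves a real runtime free to spread the work-items across many processing elements.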