Compute API – Past & Future

                          Ofer Rosenberg
                Visual Computing Software




1
Intro and acknowledgments

• Who am I?
   – For the past two years, leading Intel's representation in the OpenCL working group @
     Khronos
   – Additional background in Media, Signal Processing, etc.
   – http://il.linkedin.com/in/oferrosenberg


• Acknowledgments:
   – This presentation contains ideas based on discussions with many people (who deserve to be
     mentioned here)
   – Partial list:
       – AMD: Mike Houston, Ben Gaster
       – Apple: Aaftab Munshi
       – DICE: Johan Andersson
       – Intel: Aaron Lefohn, Stephen Junkins, David Blythe, Adam Lake, Yariv Aridor, Larry Seiler and
         more…
       – And others…




                                                                                                         2
Agenda

• The beginning – From Shaders to Compute

• The Past/Present: 1st Generation of Compute APIs
  – Caveats of the 1st generation


• The Future: 2nd Generation of Compute APIs
From Shaders to Compute

• In the beginning, GPU HW was fixed & optimized for Graphics…




               Slide from: GPU Architecture: Implications & Trends, David Luebke, NVIDIA Research, SIGGRAPH 2008
From Shaders to Compute

• Graphics stages became programmable → GPUs evolved…




• This led to the traditional GPGPU approach…
                Slide from: GPU Architecture: Implications & Trends, David Luebke, NVIDIA Research, SIGGRAPH 2008
From Shaders to Compute
    Traditional GPGPU
    • Write in a graphics language and use the GPU
    • Highly effective, but:
       – The developer needs to learn another (non-intuitive) language
       – The developer is limited by the graphics language




    • Then came CUDA & CTM…

                       Slides from “General Purpose Computation on Graphics Processors (GPGPU)”, Mike Houston, Stanford University Graphics Lab
The cradle of GPU Compute APIs




GeForce 8800 GTX (G80) was released in Nov. 2006                           ATI X1900 (R580) was released in Jan. 2006




CUDA 0.8 was released in Feb. 2007 (first official Beta)                        CTM was released in Nov. 2006
                        Slides from “GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU”, Ian Buck,
                        NVIDIA, SC06, & “Close to the Metal”, Justin Hensley, AMD, SIGGRAPH 2007
The 1st generation of Platform Compute API

  • CUDA & CTM led the way to two compute standards: DirectCompute & OpenCL

  • DirectCompute is a Microsoft standard
         – Released as part of Win7/DX11, a.k.a. Compute Shaders
         – Runs only under Windows, and only on GPU devices


  • OpenCL is a cross-OS / cross-Vendor standard
         – Managed by a working group in Khronos
         – Apple is the spec editor & conformance owner
         – Work can be scheduled on both GPUs and CPUs

  Timeline:
   – Nov 2006: CTM released
   – June 2007: CUDA 1.0 released
   – Dec 2007: StreamSDK released
   – Aug 2008: CUDA 2.0 released
   – Dec 2008: OpenCL 1.0 released
   – Oct 2009: DirectX 11 released
   – Mar 2010: CUDA 3.0 released
   – June 2010: OpenCL 1.1 released




           The 1st generation was developed on GPU HW tuned for graphics usage – it merely
                                extended that HW for general-purpose use

                                                                                                8
The 1st generation of Platform Compute API
Execution Model
• The execution model was derived directly from shader programming in graphics (“fragment
  processing”):
   – Shader programming: initiate one instance of the shader per vertex/pixel
   – Compute: initiate one instance of the kernel for each point in an N-dimensional grid


• Fits the GPU's vision of an array of scalar (or stream) processors




                     Drawing from OpenCL 1.1 Specification , Rev36
                                                                                       9
The 1st generation of Platform Compute API
Memory Model
• Distributed Memory system:
   – Abstraction: Application gets a “handle” to the memory object / resource
   – Explicit transactions: API for sync between Host & Device(s) : read/write, map/unmap


                                (Diagram: the App holds a handle (H) to a memory object; the OpenCL RT manages per-device allocations (A) on Dev1/Dev2)
• Three address spaces: Global, Local (Shared) & Private
   – Local/Shared Memory: the non-trivial memory space…




                                                                                            10
Disclaimer




 The next slides present my opinions and thoughts on caveats of,
      and future improvements to, the Platform Compute API.




                                                              11
The 2nd generation of Platform Compute API

• Recap:
  – The 1st generation : CUDA (until 3.0), OpenCL 1.x, DX11 CS
  – Defined on HW optimized for GFX, extended to General Compute


• The “cheese” has moved for GPUs
  – Compute is becoming an important usage scenario
      – Advanced Graphics: Physics, Advanced Lighting Effects, Irregular Shadow Mapping, Screen Space
        Rendering
      – Media: Video Encoding & Processing, Image Processing, Image Segmentation, Face Recognition
      – Throughput: Scientific Simulations, Finance, Oil Exploration
  – Developer feedback on the 1st generation enables creating better HW/APIs


• The 2nd generation of Platform Compute APIs: “OpenCL Next”,
  DirectX 12?

    The 2nd Generation of Compute API will run on HW which is designed with
                              Compute in mind

                                                                                                        12
Caveats of the 1st generation:
        Execution Model
        • Developers' input:
            – Most “real world” compute usages are fine-grained (the grid is small – hundreds of points at best)
            – “Real world” kernels have sequential parts interleaved with the parallel code (reduction, condition
              testing, etc.)
__kernel void foo()
{
    // code here runs for each point in the grid
    barrier(CLK_LOCAL_MEM_FENCE);
    if (get_local_id(0) == 0)
    {
        // this code runs once per work-group
    }
    // code here runs for each point in the grid
    barrier(CLK_GLOBAL_MEM_FENCE);
    if (get_global_id(0) == 0)
    {
        // this code runs only once
    }

    // code here runs for each point in the grid
}

                                                              (Image: Battlefield 2 execution phase DAG, courtesy Johan Andersson, DICE)




               Using “fragment processing” for these usages results in inefficient use of the machine


                                                                                                                    13
Caveats of the 1st generation
  Execution Model
  • The “array of scalar/stream processors” model is not optimal for CPUs & GPUs
  • It works well for large grids (as in traditional graphics), but at finer granularity there is a better
    model…



(Block diagrams: NV Fermi, AMD R600, Intel NHM)




           CPUs and GPUs are better modeled as multi-threaded vector machines

                                                                                                      14
The 2nd generation of Platform Compute API
Ideas for new execution model
• Goals
   – Support fine-grain task parallelism
   – Support complex application execution graphs
   – Better match HW evolution: target multi-threaded vector machines
           – Aligned with CPU evolution and with SoC integration of CPU/GPU

• Solution: Tasking system as execution model foundation
                         (Diagram: a Task Pool feeds per-device task queues; on the device, SW threads dispatch tasks to HW compute units)

   Tasking system:
   • Task Q's mapped to independent HW units (~compute cores)
   • Device load balancing enabled via task stealing
   • OpenCL analogy: tasks execute at the “work group level”
   • OpenCL Task ≠ CPU Task
       • More restricted: no preemption
       • Evolved: Braided Task (sequential parts & fine-grain parallel parts interleaved)

                                                                                                                                             15
The 2nd generation of Platform Compute API
Ideas for new execution model

• There are others who think along the same lines …




                 Slides from “Leading a new Era of Computing”, Chekib Akrout, Senior VP, Technology Group, AMD, 2010 Financial Analyst Day
Caveats of the 1st generation:
Memory Model
• Developers' input:
   –   A growing number of compute workloads use complex data structures (linked lists, trees, etc.)
   –   Performance: the cost of pointer marshaling & reconstruction on the device is high
   –   Porting complexity: need to add explicit transactions, marshaling, etc.
   –   Supporting a shared/unified address space (API & HW) is required

                                  (Diagrams: today, the App holds a handle (H) while the OCL RT manages per-device allocations (A); in the proposed model, Host & Devices share a single address space (S))
                  Shared/Unified Address Space between Host & Devices

                                                                                                        17
The 2nd generation of Platform Compute API
    Ideas for new memory model
Baseline:
Memory objects / resources will have                   Shared Address Space
the same starting address between
Host & Devices



                     Shared Address Space w. relaxed consistency:
       • Extend the existing OCL 1.x / DX11 memory model
       • Use explicit API calls to sync between Host & Device
       • Suitable for disjoint memory architectures (discrete GPUs, for example…)

                     Shared Address Space w. full coherency:
       • New model – memory is coherent between Host & Device
       • Use known “language level” mechanisms for concurrent access: atomics, volatile
       • Suitable for shared-memory architectures


           (Diagrams: left – disjoint Host Memory and Device Memory, each with its own cores; right – Host and Device cores sharing one Coherent/Shared Memory)
                                                                                                                        18
Some more thoughts for the 2nd generation
(and beyond)
• Promote Heterogeneous Processing – not GPU only…
   – Run code on the device that fits the problem domain:
       – A 16x16 matrix multiply should run on the CPU
       – A 1000x1000 matrix multiply should run on the GPU
       (Chart: execution time vs. problem size – the CPU wins on small problems, the GPU on large ones)
       – Where's the decision point? Better to leave it to the Runtime… (requires API)
   – Load Balancing
       – Relevant especially on systems where the CPU & GPU are close in compute power


• One API to rule them all
   – Compute API as the underlying infrastructure to run Media & GFX
   – Extend the API to contain flexible pipeline, fixed-function HW, etc.




                   Slide from “Parallel Future of a Game Engine”, Johan Andersson, DICE
References:
•   “GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU”, Ian Buck, NVIDIA, SC06
     –   http://gpgpu.org/static/sc2006/workshop/presentations/Buck_NVIDIA_Cuda.pdf


•   “GPU Architecture: Implications & Trends”, David Luebke, NVIDIA Research, SIGGRAPH 2008:
     –   http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

•   “General Purpose Computation on Graphics Processors (GPGPU)”, Mike Houston, Stanford University Graphics Lab
     –   http://www-graphics.stanford.edu/~mhouston/public_talks/R520-mhouston.pdf

•   “Close to the Metal”, Justin Hensley, AMD, SIGGRAPH 2007
     –   http://gpgpu.org/static/s2007/slides/07-CTM-overview.pdf


•   “NVIDIA’s Fermi: The First Complete GPU Computing Architecture”, Peter N. Glaskowsky
     –   http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIAFermi-
         TheFirstCompleteGPUComputingArchitecture.pdf


•   “Leading a new Era of Computing”, Chekib Akrout, Senior VP, Technology Group, AMD, 2010 Financial Analyst Day
     –   http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9Njk3NTJ8Q2hpbGRJRD0tMXxUeXBlPTM=&t=1


•   “Parallel Future of a Game Engine”, Johan Andersson, DICE
     –   http://www.slideshare.net/repii/parallel-futures-of-a-game-engine-2478448




                                                                                                                    20

 
Open CL For Haifa Linux Club
Open CL For Haifa Linux ClubOpen CL For Haifa Linux Club
Open CL For Haifa Linux Club
 
Open CL For Speedup Workshop
Open CL For Speedup WorkshopOpen CL For Speedup Workshop
Open CL For Speedup Workshop
 

Compute API – Past & Future

  • 1. Compute API – Past & Future Ofer Rosenberg Visual Computing Software 1
  • 2. Intro and acknowledgments • Who am I ? – For the past two years leading the Intel representation in OpenCL working group @ Khronos – Additional background of Media, Signal Processing, etc. – http://il.linkedin.com/in/oferrosenberg • Acknowledgments: – This presentation contains ideas based on talks with lots of people (who should be mentioned here) – Partial list: – AMD: Mike Houston, Ben Gaster – Apple: Aaftab Munshi – DICE: Johan Andersson – Intel: Aaron Lefohn, Stephen Junkins, David Blythe, Adam Lake, Yariv Aridor, Larry Seiler and more… – And others… 2
  • 3. Agenda • The beginning – From Shaders to Compute • The Past/Present: 1st Generation of Compute APIs – Caveats of the 1st generation • The Future: 2nd Generation of Compute APIs
  • 4. From Shaders to Compute • In the beginning, GPU HW was fixed & optimized for Graphics… Slide from: GPU Architecture: Implications & Trends, David Luebke, NVIDIA Research, SIGGRAPH 2008: 4
  • 5. From Shaders to Compute • Graphics stages became programmable  GPUs evolved … • This led to the traditional GPGPU approach… Slide from: GPU Architecture: Implications & Trends, David Luebke, NVIDIA Research, SIGGRAPH 2008: 5
  • 6. From Shaders to Compute Traditional GPGPU • Write in graphics language and use the GPU • Highly effective, but: – The developer needs to learn another (not intuitive) language – The developer was limited by the graphics language • Then came CUDA & CTM… Slides from “General Purpose Computation on Graphics Processors (GPGPU)”, Mike Houston, Stanford University Graphics Lab 6
  • 7. The cradle of GPU Compute APIs GeForce 8800 GTX (G80) was released in Nov. 2006; ATI x1900 (R580) was released in Jan. 2006. CUDA 0.8 was released in Feb. 2007 (first official Beta); CTM was released in Nov. 2006. Slides from “GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU”, Ian Buck, NVIDIA, SC06, & “Close to the Metal”, Justin Hensley, AMD, SIGGRAPH 2007 7
  • 8. The 1st generation of Platform Compute API • CUDA & CTM led the way to two compute standards: DirectCompute & OpenCL • DirectCompute is a Microsoft standard – Released as part of Win7/DX11, a.k.a. Compute Shaders – Only runs under Windows on a GPU device • OpenCL is a cross-OS / cross-vendor standard – Managed by a working group in Khronos – Apple is the spec editor & conformance owner – Work can be scheduled on both GPUs and CPUs
    Timeline: Nov 2006 – CTM released; June 2007 – CUDA 1.0; Dec 2007 – StreamSDK; Aug 2008 – CUDA 2.0; Dec 2008 – OpenCL 1.0; Oct 2009 – DirectX 11; Mar 2010 – CUDA 3.0; June 2010 – OpenCL 1.1
    The 1st generation was developed on GPU HW which was tuned for graphics usages – it just extended it for general usage 8
  • 9. The 1st generation of Platform Compute API Execution Model • Execution model was driven directly by shader programming in graphics (“fragment processing”): – Shader programming: initiate one instance of the shader per vertex/pixel – Compute: initiate one instance for each point in an N-dimensional grid • Fits the GPU vision of an array of scalar (or stream) processors Drawing from OpenCL 1.1 Specification, Rev. 36 9
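The “one kernel instance per grid point” model from slide 9 can be sketched in a few lines. This is a hypothetical serial emulation, not OpenCL API code; the names `run_ndrange` and `square_kernel` are made up for illustration.

```python
# Hypothetical sketch (not OpenCL API): emulate the NDRange execution model,
# where one kernel instance runs for each point of a 1-D grid, and each
# instance can query its global ID, local ID, and workgroup ID.

def run_ndrange(kernel, global_size, local_size):
    """Invoke `kernel` once per grid point, passing global/local/group IDs."""
    results = {}
    for gid in range(global_size):
        group_id, local_id = divmod(gid, local_size)
        results[gid] = kernel(gid, local_id, group_id)
    return results

# Example kernel: each instance squares its global ID.
def square_kernel(global_id, local_id, group_id):
    return global_id * global_id

out = run_ndrange(square_kernel, global_size=8, local_size=4)
print(out[3])  # 9
```

A real runtime would execute these instances in parallel across compute units; the point here is only the indexing scheme the programmer sees.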
  • 10. The 1st generation of Platform Compute API Memory Model • Distributed memory system: – Abstraction: the application gets a “handle” to the memory object / resource – Explicit transactions: API for sync between Host & Device(s): read/write, map/unmap (diagram: App ↔ OpenCL RT ↔ Dev1/Dev2, with host- and device-side copies of the object) • Three address spaces: Global, Local (Shared) & Private – Local/Shared memory: the non-trivial memory space… 10
  • 11. Disclaimer Next slides provide my opinion and thoughts on caveats and future improvements to the Platform Compute API. 11
  • 12. The 2nd generation of Platform Compute API • Recap: – The 1st generation : CUDA (until 3.0), OpenCL 1.x, DX11 CS – Defined on HW optimized for GFX, extended to General Compute • The “cheese” has moved for GPUs – Compute becomes an important usage scenario – Advanced Graphics: Physics, Advanced Lighting Effects, Irregular Shadow Mapping, Screen Space Rendering – Media: Video Encoding & Processing, Image Processing, Image Segmentation, Face Recognition – Throughput: Scientific Simulations, Finance, Oil Searches – Developers feedback based on the 1st generation enables creating better HW/API • The Second generation of Platform Compute API: “OpenCL Next”, DirectX12 ? The 2nd Generation of Compute API will run on HW which is designed with Compute in mind 12
  • 13. Caveats of the 1st generation: Execution Model • Developers’ input: – Most “real world” usages for compute use fine-grain granularity (the grid is small – 100’s at best) – “Real world” kernels have sequential parts interleaved with the parallel code (reduction, condition testing, etc.)

    __kernel void foo() {
      // code here runs for each point in the grid
      barrier(CLK_LOCAL_MEM_FENCE);
      if (local_id == 0) {
        // this code runs once per workgroup
      }
      // code here runs for each point in the grid
      barrier(CLK_GLOBAL_MEM_FENCE);
      if (global_id == 0) {
        // this code runs only once
      }
      // code here runs for each point in the grid
    }

    (Battlefield 2 execution-phase DAG; image courtesy Johan Andersson, DICE)
    Using “fragment processing” for these usages results in inefficient use of the machine 13
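The braided structure of the kernel above – per-item parallel phases interleaved with once-per-workgroup and once-globally sequential phases – can be emulated serially. This is an illustrative sketch, not OpenCL code; the reduction shown (summing 1..N) is an assumed example.

```python
# Hypothetical serial emulation of the "braided" kernel pattern from the
# slide: parallel phases separated by barriers, with some work gated to
# local_id == 0 (per workgroup) or global_id == 0 (once overall).

def braided(global_size, local_size):
    data = [0] * global_size
    for gid in range(global_size):      # runs for each point in the grid
        data[gid] = gid + 1
    # barrier(CLK_LOCAL_MEM_FENCE) equivalent: all per-item work is done
    group_sums = []
    for group in range(global_size // local_size):
        # local_id == 0 section: runs once per workgroup
        start = group * local_size
        group_sums.append(sum(data[start:start + local_size]))
    # barrier(CLK_GLOBAL_MEM_FENCE) equivalent
    total = sum(group_sums)             # global_id == 0 section: runs once
    return total

print(braided(global_size=8, local_size=4))  # 36
```

On fragment-style hardware the sequential sections leave most lanes idle, which is precisely the inefficiency the slide calls out.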
  • 14. Caveats of the 1st generation Execution Model • The “array of scalar/stream processors” model is not optimal for CPUs & GPUs • Works well for large grids (like in traditional graphics), but on finer grain there is a better model… (block diagrams: NV Fermi, AMD R600, Intel NHM) CPUs and GPUs are better modeled as multi-threaded vector machines 14
  • 15. The 2nd generation of Platform Compute API Ideas for new execution model • Goals – Support fine-grain task parallelism – Support complex application execution graphs – Better match HW evolution: target multi-threaded vector machines – Aligned with CPU evolution, and SoC integration of CPU/GPU • Solution: tasking system as execution model foundation – Task queues mapped to independent HW units (~compute cores) – Device load balancing enabled via task stealing – OpenCL analogy: tasks execute at the “work group level” – OpenCL Task ≠ CPU Task – More restricted: no preemption – Evolved: braided tasks (sequential parts & fine-grain parallel parts interleaved) (diagram: SW task pool feeding per-compute-unit HW task queues) 15
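The per-unit queues with stealing described above can be sketched in a few lines. This is a hypothetical single-threaded emulation; the names `ComputeUnit` and `run_one` are illustrative, not from any proposed API, and a real scheduler would of course run the units concurrently.

```python
# Hypothetical sketch of the proposed tasking model: each compute unit owns
# a task queue; an idle unit steals from the back of the busiest peer's
# queue, giving device-level load balancing without a central scheduler.
from collections import deque

class ComputeUnit:
    def __init__(self, name):
        self.name = name
        self.queue = deque()
        self.completed = []

    def run_one(self, all_units):
        task = None
        if self.queue:
            task = self.queue.popleft()      # run own work first
        else:
            victim = max(all_units, key=lambda u: len(u.queue))
            if victim.queue:
                task = victim.queue.pop()    # steal from the busiest peer
        if task is not None:
            self.completed.append(task())
            return True
        return False

units = [ComputeUnit("cu0"), ComputeUnit("cu1")]
units[0].queue.extend(lambda i=i: i * i for i in range(4))  # all work on cu0

progress = True
while progress:
    progress = False
    for u in units:
        progress |= u.run_one(units)

print(sorted(units[0].completed + units[1].completed))  # [0, 1, 4, 9]
```

Even though every task started on cu0, cu1 ends up with work via stealing, which is the load-balancing property the slide argues for.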
  • 16. The 2nd generation of Platform Compute API Ideas for new execution model • There are others who think along the same lines … Slides from “Leading a new Era of Computing”, Chekib Akrout, Senior VP, Technology Group, AMD, 2010 Financial Analyst Day 16
  • 17. Caveats of the 1st generation: Memory Model • Developers’ input: – A growing number of compute workloads use complex data structures (linked lists, trees, etc.) – Performance: the cost of pointer marshaling & reconstruction on the device is high – Porting complexity: need to add explicit transactions, marshaling, etc. – Supporting a shared/unified address space (API & HW) is required (diagrams: today’s split host/device address spaces vs. a shared address space) Shared/Unified Address Space between Host & Devices 17
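The pointer-marshaling cost mentioned above is concrete: without a shared address space, a host-side linked structure must be flattened into indices before transfer and rebuilt on the device. This is a hypothetical sketch; `marshal`/`unmarshal` are illustrative names.

```python
# Hypothetical sketch of pointer marshaling: a host linked list is flattened
# into (value, next_index) pairs for transfer, then reconstructed on the
# "device". With a shared address space, this entire step disappears.

class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt

def marshal(head):
    """Flatten host pointers into index-based records for transfer."""
    nodes, index = [], {}
    n = head
    while n is not None:
        index[id(n)] = len(nodes)
        nodes.append(n)
        n = n.next
    return [(n.value, index[id(n.next)] if n.next else -1) for n in nodes]

def unmarshal(flat):
    """Rebuild the linked structure on the 'device' from indices."""
    nodes = [Node(v) for v, _ in flat]
    for node, (_, nxt) in zip(nodes, flat):
        node.next = nodes[nxt] if nxt != -1 else None
    return nodes[0] if nodes else None

head = Node(1, Node(2, Node(3)))
copy = unmarshal(marshal(head))
print(copy.next.next.value)  # 3
```

Both the traversal to flatten and the reconstruction scale with the data-structure size, which is why the slide calls the cost high for pointer-rich workloads.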
  • 18. The 2nd generation of Platform Compute API Ideas for new memory model Baseline: memory objects / resources will have the same starting address between Host & Devices (Shared Address Space)
    – Shared Address Space w. relaxed consistency: extends the existing OCL 1.x / DX11 memory model; uses explicit API calls to sync between Host & Device; suitable for disjoint memory architectures (discrete GPUs, for example…)
    – Shared Address Space w. full coherency: new model – memory is coherent between Host & Device; uses known “language level” mechanisms for concurrent access: atomics, volatile; suitable for shared memory architectures
    (diagrams: separate host memory + device memory vs. a single coherent/shared memory) 18
  • 19. Some more thoughts for the 2nd generation (and beyond) • Promote heterogeneous processing – not GPU only… – Running code depending on the problem size (chart: execution time vs. problem size, with a CPU/GPU crossover point): – Matrix multiply of 16x16 should run on the CPU – Matrix multiply of 1000x1000 should run on the GPU – Where’s the decision point? Better leave it to the Runtime… (requires API) – Load balancing – relevant especially on systems where the CPU & GPU are close in compute power • One API to rule them all – Compute API as the underlying infrastructure to run Media & GFX – Extend the API to contain flexible pipeline, fixed-function HW, etc. Slide from “Parallel Future of a Game Engine”, Johan Andersson, DICE 19
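The runtime device-selection idea above reduces to a cost-model crossover. This is a deliberately trivial sketch; the function name, the threshold value, and the assumption that problem size alone decides are all made up for illustration.

```python
# Hypothetical sketch of runtime device selection: small problems stay on
# the CPU (GPU launch/transfer overhead dominates), large problems go to
# the GPU (throughput wins). The crossover value is an invented number;
# a real runtime would calibrate it per platform.

def pick_device(n, crossover=256):
    """Return which device a problem of size n should run on."""
    return "GPU" if n >= crossover else "CPU"

print(pick_device(16))    # CPU  (e.g. a 16x16 matrix multiply)
print(pick_device(1000))  # GPU  (e.g. a 1000x1000 matrix multiply)
```

Exposing this decision through an API, as the slide suggests, lets the runtime move the crossover point as CPU and GPU compute power converge.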
  • 20. References: • “GeForce 8800 & NVIDIA CUDA: A New Architecture for Computing on the GPU”, Ian Buck, NVIDIA, SC06 – http://gpgpu.org/static/sc2006/workshop/presentations/Buck_NVIDIA_Cuda.pdf • “GPU Architecture: Implications & Trends”, David Luebke, NVIDIA Research, SIGGRAPH 2008 – http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf • “General Purpose Computation on Graphics Processors (GPGPU)”, Mike Houston, Stanford University Graphics Lab – http://www-graphics.stanford.edu/~mhouston/public_talks/R520-mhouston.pdf • “Close to the Metal”, Justin Hensley, AMD, SIGGRAPH 2007 – http://gpgpu.org/static/s2007/slides/07-CTM-overview.pdf • “NVIDIA’s Fermi: The First Complete GPU Computing Architecture”, Peter N. Glaskowsky – http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIAFermi-TheFirstCompleteGPUComputingArchitecture.pdf • “Leading a new Era of Computing”, Chekib Akrout, Senior VP, Technology Group, AMD, 2010 Financial Analyst Day – http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9Njk3NTJ8Q2hpbGRJRD0tMXxUeXBlPTM=&t=1 • “Parallel Future of a Game Engine”, Johan Andersson, DICE – http://www.slideshare.net/repii/parallel-futures-of-a-game-engine-2478448 20