CS 354
Performance Analysis

Mark Kilgard
University of Texas
April 26, 2012



         Today’s material
        In-class quiz
            On acceleration structures lecture
        Lecture topic
            Graphics Performance Analysis



         My Office Hours
        Tuesday, before class
            Painter (PAI) 5.35
            8:45 a.m. to 9:15
        Thursday, after class
            ACE 6.302
            11:00 a.m. to 12


        Randy’s office hours
            Monday & Wednesday
            11 a.m. to 12:00
            Painter (PAI) 5.33



         Last time, this time
        Last lecture, we discussed
            Acceleration structures
        This lecture
            Graphics Performance Analysis
        Projects
            Project 4 on ray tracing on Piazza
               Due May 2, 2012
               Get started!

         Daily Quiz
        On a sheet of paper
            Write your EID, name, and date
            Write #1, #2, #3 followed by its answer
        Multiple choice: Which is NOT a bounding volume representation?
            a) sphere
            b) axis-aligned bounding box
            c) object-aligned bounding box
            d) bounding graph point
            e) convex polyhedron
        True or False: Placing objects within a uniform grid is easier
         than placing objects within a KD tree.
        True or False: Volume rendering can be accelerated by the GPU
         by drawing blended slices of the volume.



         Graphics Performance Analysis

        Generating synthetic images by computer
         is computationally—and bandwidth—
         intensive
            Achieving interactive rates is key
                 60 frames/second ≈ real-time interactivity
            Worth optimizing
                 Entertainment and intuition tied to interactivity
        How do we think about graphics
         performance analysis?



         Framing Amdahl’s Law
        Assume a workload with two parts
          First part is A%
          Second part is B%
          Such that A% + B% = 100%
        If we have a technique to speed up the
         second part by N times
          But have no speedup for the first part
          What overall speed up can we expect?



         Amdahl’s Equation
        Assume A% + B% = 100%
        If the un-optimized effort is 100%, the optimized
         effort should be smaller
              $$\text{OptimizedEffort} = A\% + \frac{B\%}{N}$$
        Speedup is ratio of UnoptimizedEffort to
         OptimizedEffort
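              $$\text{Speedup} = \frac{100\%}{A\% + \frac{B\%}{N}} = \frac{1}{(1 - B) + \frac{B}{N}}$$
              (in the last form, B is written as a fraction between 0 and 1)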



         Who was Amdahl?
        Gene Amdahl
            CPU architect for IBM in 1960s
                 Helped design IBM’s System/360 mainframe
                  architecture
            Left IBM to found Amdahl Corporation
                 Building IBM compatible mainframes
        Why?
            Evaluating whether to invest in parallel
             processing or not



         Parallelization
        Broadly speaking, computer tasks can be broken
         into two portions
            Sequential sub-tasks
                 Naturally requires steps to be done in a particular order
                 Examples: text layout, entropy decoding
            Parallel sub-tasks
                 Problem splits into lots of independent chunks of work
                 Chunks of work can be done by separate processing units
                  simultaneously: parallelization
                 Examples: tracing rays, shading pixels, transforming
                  vertices

         Serial Work Sandwiching
         Parallel Work



         Example of Amdahl’s Law
        Say a task is 50% serial and 50% parallel
        Consider using 4 parallel processors on the
         parallel portion
            Speedup: 1.6x
        Consider using 40 parallel processors on the
         parallel portion
            Speedup: 1.951x
        Consider limit:
              $$\lim_{n \to \infty} \frac{1}{0.5 + \frac{0.5}{n}} = 2$$
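The numbers on this slide can be checked with a few lines of Python. This is a minimal sketch; the function name `amdahl_speedup` and its parameter names are illustrative choices, not something from the lecture.

```python
def amdahl_speedup(parallel_fraction: float, n: float) -> float:
    """Overall speedup when only the parallel fraction runs n times faster.

    parallel_fraction is B as a fraction (0..1); the serial part is A = 1 - B.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n)


if __name__ == "__main__":
    # 50% serial, 50% parallel, as in the slide's example.
    for n in (4, 40, 1_000_000):
        print(f"n = {n:>9}: speedup = {amdahl_speedup(0.5, n):.3f}x")
```

With a 50% serial portion, the result approaches but never reaches 2x, no matter how many processors are added.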



         Graph of Amdahl’s Law



         Pessimism about Parallelism?
      Amdahl’s Law can instill pessimism about
       parallel processing
      If the serial work percent is high, adding
       parallel units has low benefit
          Assumes fixed “problem” size
          So workload stays same size even as parallel
           execution resources are added
        So why do GPUs offer 100’s of cores
         then?



         Gustafson's Law
        Observation
            by John Gustafson
            With N parallel units, bigger problems can be attacked
        Great example
            Increasing GPU resolution
            Was 640x480 pixels, now 1920x1200
            More parallel units means more pixels can be
             processed simultaneously
                 Supporting rendering resolutions previously unattainable
        Problem size improvement
                   problemScale = N − A( N − 1)



         Example
        Say a task is 50% serial and 50% parallel
        Consider using 4 parallel processors on the
         parallel portion
            Problem scales up: 2.5x
        Consider 100 parallel processors
            Problem scales up: 50.5x


        Also consider the heterogeneous nature of graphics
         processing units
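The scaled-problem figures in this example follow directly from problemScale = N − A(N − 1), where A is the serial fraction. A small sketch, with `gustafson_scale` as an illustrative name rather than anything from the lecture:

```python
def gustafson_scale(serial_fraction: float, n: int) -> float:
    """Problem-size scale-up with n parallel units (Gustafson's Law).

    serial_fraction is A as a fraction (0..1).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return n - serial_fraction * (n - 1)


if __name__ == "__main__":
    # 50% serial, 50% parallel, as in the slide's example.
    for n in (4, 100):
        print(f"n = {n:>3}: problem scales up {gustafson_scale(0.5, n)}x")
```

Unlike Amdahl's fixed-size speedup, this quantity keeps growing linearly with the number of parallel units.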

         Coherent Work vs.
         Incoherent Work
        Not all parallel work is created equal
        Coherent work = “adjacent” chunks of work
         performing similar operations and memory
         accesses
            Example: camera rays, pixel shading
            Allows sharing control of instruction execution
            Good for caches
        Incoherent work = “adjacent” chunks of work
         performing dissimilar operations and memory
         accesses
            Examples: reflection, shadow, and refraction rays
            Bad for caches



         Coherent vs. Incoherent Rays




          [Figure labels: coherent = camera rays; coherent = light rays;
           incoherent = reflected rays]



         Keeping Work Coherent?
        How do we keep work coherent?
        Pipelines
            Careful because they can introduce latency
        Data structures
        SPMD (or SIMD) execution
            Single Program, Multiple Data
            To exploit Single Instruction, Multiple Data (SIMD)
             units
            Bundling “adjacent” work elements helps cache and
             memory access efficiency



         Pipeline Processing
        Parallel and naturally coherent
A Simplified Graphics Pipeline
        Stages, in order
            Application
                 (Application-OpenGL API boundary)
            Vertex batching & assembly
            Triangle assembly
            Triangle clipping
            NDC to window space
            Triangle rasterization
            Fragment shading
            Depth testing (against the depth buffer)
            Color update (to the framebuffer)



         Another View of the Graphics Pipeline

        [Diagram: OpenGL 3.3 pipeline]
        3D Application or Game
        OpenGL API
            (CPU – GPU boundary)
        Fixed-function stages
            GPU Front End
            Vertex Assembly
            Primitive Assembly
            Clipping, Setup, and Rasterization
            Raster Operations
        Programmable stages
            Vertex Shader
            Geometry Program
            Fragment Shader
        Memory access, all through the Memory Interface
            Attribute Fetch
            Parameter Buffer Read
            Texture Fetch
            Framebuffer Access



         Modeling Pipeline Efficiency
        Rate of processing for sequential tasks
          Assume three tasks
          Run time is sum of each operation’s time
                A+B+C
        Rate of processing in a pipeline
          Assume three tasks, treated as stages
          Performance gated by slowest operation
              Three operations in pipeline: A, B, C
              Run time = max(A,B,C)
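To make the two models concrete, here is a small sketch comparing sequential and pipelined processing of many work items. One assumption goes beyond the slide: the pipelined total includes the fill latency of the first item, so the slide's max(A,B,C) shows up as the steady-state time per additional item. Function names are illustrative.

```python
def sequential_time(stage_times, items):
    """Total time when every item runs all stages before the next item starts."""
    if items < 1:
        raise ValueError("items must be at least 1")
    return items * sum(stage_times)


def pipelined_time(stage_times, items):
    """Total time when the stages overlap across items.

    Assumption: the first item takes sum(stage_times) to fill the pipeline;
    each later item completes one slowest-stage interval after the previous one.
    """
    if items < 1:
        raise ValueError("items must be at least 1")
    return sum(stage_times) + (items - 1) * max(stage_times)


if __name__ == "__main__":
    stages = [2.0, 5.0, 3.0]  # hypothetical times for operations A, B, C
    n = 100
    print("sequential:", sequential_time(stages, n))
    print("pipelined: ", pipelined_time(stages, n))
```

For large item counts the pipelined time is dominated by the slowest stage, which is why finding and relieving that bottleneck stage matters most.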



         Hardware Clocks
        Heart beat of hardware
            Measured in frequency
              Hertz (Hz) = cycles per second
              Megahertz, gigahertz = million, billion Hz

      Faster clocks = faster computation and
       data transfer
      So why not simply raise clocks?
          High clocks consume more power
          Circuits are only rated to a maximum clock
           speed before becoming unreliable



         Clock Domains
        Given chip may have multiple clocks running
        Three key domains (GPU-centric)
            Graphics clock—for fixed-function units
                 Example uses: rasterization, texture filtering, blending
                 Optimize for throughput, not latency
                      Can often instance more units instead of raising clocks
            Processor clock—for programmable shader units
                 Example: shader instruction execution
                 Generally higher than graphics clock
                      Because optimized for latency rather than throughput
            Memory clock—for talking to external memory
                 Depends on speed rating of external memory
            Other domains too
                 Display clock, PCI-Express bus clock
                 Generally not crucial to rendering performance

          3D Pipeline Programmable
          Domains run on Unified Hardware
            Unified Streaming Processor Array (SPA) architecture
             means same capabilities for all domains
                Plus tessellation + compute (not shown below)


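        [Diagram: pipeline with programmable domains on unified hardware]
            Pipeline stages: GPU Front End, Vertex Assembly, Primitive Assembly,
             Clipping, Setup, and Rasterization, Raster Operations
            Programmable domains (can be unified hardware!): Vertex Program,
             Primitive Program, Fragment Program
            Memory access through the Memory Interface: Attribute Fetch,
             Parameter Buffer Read, Texture Fetch, Framebuffer Access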



         Memory Bandwidth
        Raw memory bandwidth
            Physical clock rate
                 Example: 3 GHz
            Memory bus width
                 64-bit, 128-bit, 192-bit, 256-bit, 384-bit
                 Wider buses are faster but more expensive to route all those wires
            Signaling rate
                 Double data rate (DDR) means signals are sent on the rising and
                  falling clock edges
                 Often logical memory clock rate includes signaling rate
        Computing raw memory bandwidth
  bandwidth = physicalClock × signalPerClock × busWidth



         Latency vs. Throughput
        Effective bandwidth is raw bandwidth reduced by
         memory utilization
          Unrealistic to expect 100% utilization
          GPUs are much better than CPUs generally
        Trade-off
          Maximizing throughput (utilization) increases
           latency
          Minimizing latency reduces utilization



         Computing Bandwidth
        Example: GeForce GTX 680
          Latest NVIDIA generation
          3.54 billion transistors in 28 nm process
        Memory characteristics
          6 GHz memory clock (includes signaling rate)
          256-bit memory interface
          6 billion clocks/second × 256 bits/clock × 1 byte/8 bits
           = 192 gigabytes/second
        [Figures: GeForce GTX 680 board; GK104 die]
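The raw bandwidth formula is easy to evaluate in code. The sketch below uses illustrative names and converts bits to bytes at the end; because the GTX 680's 6 GHz figure already includes the signaling rate, signals per clock is passed as 1.

```python
def raw_bandwidth_bytes_per_sec(clock_hz, signals_per_clock, bus_width_bits):
    """Raw memory bandwidth = clock x signals per clock x bus width, in bytes/s."""
    bits_per_second = clock_hz * signals_per_clock * bus_width_bits
    return bits_per_second / 8  # 8 bits per byte


if __name__ == "__main__":
    # GeForce GTX 680 figures from the slide: 6 GHz effective clock, 256-bit bus.
    gtx680 = raw_bandwidth_bytes_per_sec(6e9, 1, 256)
    print(f"GTX 680 raw bandwidth: {gtx680 / 1e9:.0f} GB/s")
```

This is a peak figure only; the bandwidth an application actually sees is lower because memory utilization is never 100%.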

                           GeForce Peak
                           Memory Bandwidth Trends
        [Chart: gigabytes per second (0 to 200) for GeForce2 GTS, GeForce3,
         GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX.
         Series: raw bandwidth and effective raw bandwidth with compression,
         each with an exponential trend line. Annotations: 128-bit interface,
         256-bit interface.]

         Effective GPU
         Memory Bandwidth
        Compression schemes
          Lossless depth and color (when multisampling)
           compression
          Lossy texture compression (S3TC / DXTC)
          Typically assumes 4:1 compression
        Avoiding useless work
          Early killing of fragments (Z cull)
          Avoiding useless blending and texture fetches
        Very clever memory controller designs
          Combining memory accesses for improved coherency
          Caches for texture fetches



         Other Metrics
      Host bandwidth
      Vertex pulling
      Vertex transformation
      Triangle rasterization and setup
      Fragment shading rate
      Shader instruction rate
      Raster (blending) operation rate
      Early Z reject rate

         Kepler GeForce GTX 680
         High-level Block Diagram
        8 Streaming
         Multiprocessors
         (SMX)
        1536 CUDA Cores
        8 Geometry Units
        4 Raster Units
        128 Texture units
        32 Raster operations
        256-bit GDDR5
         memory



         Kepler Streaming Multiprocessor




                              8 more copies of this

         Prior Generation Streaming
         Multiprocessor (SM)
        Multi-processor
         execution unit (Fermi)
          32 scalar processor
           cores
          Warp is a unit of
           thread execution of up
           to 32 threads
        Two workloads
            Graphics
                 Vertex shader
                 Tessellation
                 Geometry shader
                 Fragment shader
            Compute



         Power Gating
        Computer architecture has hit the “power wall”
        Low-power operation is at a premium
            Battery-powered devices
            Thermal constraints
            Economic constraints
        Power Management (PM) works to reduce
         power by
            Lowering clocks when performance isn’t required
            Disabling hardware units
                 Avoids leakage



         Scene Graph Labor
        High-level division of scene graph labor
        Four pipeline stages
            App (application)
                 Code that manipulates/modifies the scene graph in response to
                  user input or other events
            Isect (intersection)
                 Geometric queries such as collision detection or picking
            Cull
                 Traverse the scene graph to find the nodes to be rendered
                       Best example: eliminate objects out of view
                 Optimize the ordering of nodes
                       Sort objects to minimize graphics hardware state changes
            Draw
                 Communicating drawing commands to the hardware
                 Generally through graphics API (OpenGL or Direct3D)
        Can map well to multi-processor CPU systems
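The cull stage's "sort objects to minimize graphics hardware state changes" step can be sketched as follows. Everything here is hypothetical for illustration: the `DrawItem` record, its `shader`, `texture`, and `mesh` fields, and the choice of shader and texture as the states that are expensive to change.

```python
from typing import NamedTuple


class DrawItem(NamedTuple):
    shader: str   # hypothetical shader name
    texture: str  # hypothetical texture name
    mesh: str     # hypothetical mesh name


def count_state_changes(items):
    """Count shader and texture binds needed to draw the items in the given order."""
    changes = 0
    prev = None
    for item in items:
        if prev is None or item.shader != prev.shader:
            changes += 1
        if prev is None or item.texture != prev.texture:
            changes += 1
        prev = item
    return changes


def sort_by_state(items):
    """Group items sharing a shader, then a texture, so binds are reused."""
    return sorted(items, key=lambda it: (it.shader, it.texture))


if __name__ == "__main__":
    scene = [
        DrawItem("phong", "brick", "wall_a"),
        DrawItem("toon", "wood", "crate_a"),
        DrawItem("phong", "brick", "wall_b"),
        DrawItem("toon", "wood", "crate_b"),
    ]
    print("unsorted binds:", count_state_changes(scene))
    print("sorted binds:  ", count_state_changes(sort_by_state(scene)))
```

Sorting by the most expensive state first is one common ordering choice; a real scene graph would also have to respect constraints such as back-to-front order for blended objects.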



         App-cull-draw Threading
        App-cull-draw processing on one CPU core




        App-cull-draw processing on multiple CPUs



         Scene Graph Profiling
      Scene graph should help provide insight
       into performance
      Process statistics
          What’s going on?
          Time stamps
        Database statistics
            How complex is the scene in any frame?

         Example:
         Depth Complexity Visualization
        How many pixels are being rendered?
            Pixels can be rasterized by multiple objects
            Depth complexity is the average number of times a
             pixel or color sample is updated per frame




          yellow and black indicate higher depth complexity
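Given a per-pixel count of how many times each pixel was updated in a frame, depth complexity is just the mean of those counts. A minimal sketch, assuming the counts are already available as a 2D list (how they are gathered, for example with a stencil or additive-blend pass, is outside this example); the function name is illustrative.

```python
def depth_complexity(update_counts):
    """Average number of updates per pixel for one frame.

    update_counts is a list of rows, each a list of per-pixel update counts.
    """
    total_pixels = sum(len(row) for row in update_counts)
    if total_pixels == 0:
        raise ValueError("update_counts must contain at least one pixel")
    total_updates = sum(sum(row) for row in update_counts)
    return total_updates / total_pixels


if __name__ == "__main__":
    # Hypothetical 2x2 frame: one pixel drawn once, two drawn twice, one three times.
    frame = [[1, 2],
             [3, 2]]
    print("depth complexity:", depth_complexity(frame))
```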

         Example:
         Heads-up Display of Statistics
        Process statistics
            How long is
             everything taking?
        Database statistics
            What is being
             rendered?
        Overlaying on the
         active scene is often
         valuable
            Dynamic update



         Benchmarking
        Synthetic benchmarks focus on rendering
         particular operations in isolation
            What is the blended pixel performance?
        Application benchmarks
            Try to reflect what a real application would do

         Tips for Interactive
         Performance Analysis
        Vary things you can control
            Change window resolution
                 Making it smaller and seeing better performance
        Null driver analysis
            Skip the actual rendering calls
            What if the driver was “infinitely” fast?
        Use occlusion queries to monitor how many
         samples (pixels) actually pass the depth test
        Keep data on the GPU
            Let GPU do Direct Memory Access (DMA)
            Keep from swapping textures and buffers
                 Easy when multi-gigabyte graphics cards are available



         Next Class
        Next lecture
            Surfaces
            Programmable tessellation

        Reading
            None

        Project 4
            Project 4 is a simple ray tracer
            Due Wednesday, May 2, 2012

 
Large-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PC
 
Umbra Ignite 2015: Alex Evans – Learning from failure – prototypes, R&D, iter...
Umbra Ignite 2015: Alex Evans – Learning from failure – prototypes, R&D, iter...Umbra Ignite 2015: Alex Evans – Learning from failure – prototypes, R&D, iter...
Umbra Ignite 2015: Alex Evans – Learning from failure – prototypes, R&D, iter...
 
Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defense
Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defenseLarge-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defense
Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defense
 
GraphChi big graph processing
GraphChi big graph processingGraphChi big graph processing
GraphChi big graph processing
 
Why computer programming
Why computer programmingWhy computer programming
Why computer programming
 
NYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKeeNYAI - Scaling Machine Learning Applications by Braxton McKee
NYAI - Scaling Machine Learning Applications by Braxton McKee
 
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
 
Advance analysis of algo
Advance analysis of algoAdvance analysis of algo
Advance analysis of algo
 
Adding more visuals without affecting performance
Adding more visuals without affecting performanceAdding more visuals without affecting performance
Adding more visuals without affecting performance
 
Reproducible Linear Algebra from Application to Architecture
Reproducible Linear Algebra from Application to ArchitectureReproducible Linear Algebra from Application to Architecture
Reproducible Linear Algebra from Application to Architecture
 
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Advanced Data Science on Spark-(Reza Zadeh, Stanford)
Advanced Data Science on Spark-(Reza Zadeh, Stanford)
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
 
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 
Bootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B TestsBootstrapping of PySpark Models for Factorial A/B Tests
Bootstrapping of PySpark Models for Factorial A/B Tests
 
Simon Peyton Jones: Managing parallelism
Simon Peyton Jones: Managing parallelismSimon Peyton Jones: Managing parallelism
Simon Peyton Jones: Managing parallelism
 
Peyton jones-2011-parallel haskell-the_future
Peyton jones-2011-parallel haskell-the_futurePeyton jones-2011-parallel haskell-the_future
Peyton jones-2011-parallel haskell-the_future
 
Two numerical graph algorithms
Two numerical graph algorithmsTwo numerical graph algorithms
Two numerical graph algorithms
 

Más de Mark Kilgard

D11: a high-performance, protocol-optional, transport-optional, window system...
D11: a high-performance, protocol-optional, transport-optional, window system...D11: a high-performance, protocol-optional, transport-optional, window system...
D11: a high-performance, protocol-optional, transport-optional, window system...Mark Kilgard
 
Computers, Graphics, Engineering, Math, and Video Games for High School Students
Computers, Graphics, Engineering, Math, and Video Games for High School StudentsComputers, Graphics, Engineering, Math, and Video Games for High School Students
Computers, Graphics, Engineering, Math, and Video Games for High School StudentsMark Kilgard
 
NVIDIA OpenGL and Vulkan Support for 2017
NVIDIA OpenGL and Vulkan Support for 2017NVIDIA OpenGL and Vulkan Support for 2017
NVIDIA OpenGL and Vulkan Support for 2017Mark Kilgard
 
NVIDIA OpenGL 4.6 in 2017
NVIDIA OpenGL 4.6 in 2017NVIDIA OpenGL 4.6 in 2017
NVIDIA OpenGL 4.6 in 2017Mark Kilgard
 
NVIDIA OpenGL in 2016
NVIDIA OpenGL in 2016NVIDIA OpenGL in 2016
NVIDIA OpenGL in 2016Mark Kilgard
 
Virtual Reality Features of NVIDIA GPUs
Virtual Reality Features of NVIDIA GPUsVirtual Reality Features of NVIDIA GPUs
Virtual Reality Features of NVIDIA GPUsMark Kilgard
 
Migrating from OpenGL to Vulkan
Migrating from OpenGL to VulkanMigrating from OpenGL to Vulkan
Migrating from OpenGL to VulkanMark Kilgard
 
EXT_window_rectangles
EXT_window_rectanglesEXT_window_rectangles
EXT_window_rectanglesMark Kilgard
 
Slides: Accelerating Vector Graphics Rendering using the Graphics Hardware Pi...
Slides: Accelerating Vector Graphics Rendering using the Graphics Hardware Pi...Slides: Accelerating Vector Graphics Rendering using the Graphics Hardware Pi...
Slides: Accelerating Vector Graphics Rendering using the Graphics Hardware Pi...Mark Kilgard
 
Accelerating Vector Graphics Rendering using the Graphics Hardware Pipeline
Accelerating Vector Graphics Rendering using the Graphics Hardware PipelineAccelerating Vector Graphics Rendering using the Graphics Hardware Pipeline
Accelerating Vector Graphics Rendering using the Graphics Hardware PipelineMark Kilgard
 
NV_path rendering Functional Improvements
NV_path rendering Functional ImprovementsNV_path rendering Functional Improvements
NV_path rendering Functional ImprovementsMark Kilgard
 
OpenGL 4.5 Update for NVIDIA GPUs
OpenGL 4.5 Update for NVIDIA GPUsOpenGL 4.5 Update for NVIDIA GPUs
OpenGL 4.5 Update for NVIDIA GPUsMark Kilgard
 
SIGGRAPH Asia 2012: GPU-accelerated Path Rendering
SIGGRAPH Asia 2012: GPU-accelerated Path RenderingSIGGRAPH Asia 2012: GPU-accelerated Path Rendering
SIGGRAPH Asia 2012: GPU-accelerated Path RenderingMark Kilgard
 
SIGGRAPH Asia 2012 Exhibitor Talk: OpenGL 4.3 and Beyond
SIGGRAPH Asia 2012 Exhibitor Talk: OpenGL 4.3 and BeyondSIGGRAPH Asia 2012 Exhibitor Talk: OpenGL 4.3 and Beyond
SIGGRAPH Asia 2012 Exhibitor Talk: OpenGL 4.3 and BeyondMark Kilgard
 
Programming with NV_path_rendering: An Annex to the SIGGRAPH Asia 2012 paper...
Programming with NV_path_rendering:  An Annex to the SIGGRAPH Asia 2012 paper...Programming with NV_path_rendering:  An Annex to the SIGGRAPH Asia 2012 paper...
Programming with NV_path_rendering: An Annex to the SIGGRAPH Asia 2012 paper...Mark Kilgard
 
GPU accelerated path rendering fastforward
GPU accelerated path rendering fastforwardGPU accelerated path rendering fastforward
GPU accelerated path rendering fastforwardMark Kilgard
 
GPU-accelerated Path Rendering
GPU-accelerated Path RenderingGPU-accelerated Path Rendering
GPU-accelerated Path RenderingMark Kilgard
 
SIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
SIGGRAPH 2012: GPU-Accelerated 2D and Web RenderingSIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
SIGGRAPH 2012: GPU-Accelerated 2D and Web RenderingMark Kilgard
 
SIGGRAPH 2012: NVIDIA OpenGL for 2012
SIGGRAPH 2012: NVIDIA OpenGL for 2012SIGGRAPH 2012: NVIDIA OpenGL for 2012
SIGGRAPH 2012: NVIDIA OpenGL for 2012Mark Kilgard
 

Más de Mark Kilgard (20)

D11: a high-performance, protocol-optional, transport-optional, window system...
D11: a high-performance, protocol-optional, transport-optional, window system...D11: a high-performance, protocol-optional, transport-optional, window system...
D11: a high-performance, protocol-optional, transport-optional, window system...
 
Computers, Graphics, Engineering, Math, and Video Games for High School Students
Computers, Graphics, Engineering, Math, and Video Games for High School StudentsComputers, Graphics, Engineering, Math, and Video Games for High School Students
Computers, Graphics, Engineering, Math, and Video Games for High School Students
 
NVIDIA OpenGL and Vulkan Support for 2017
NVIDIA OpenGL and Vulkan Support for 2017NVIDIA OpenGL and Vulkan Support for 2017
NVIDIA OpenGL and Vulkan Support for 2017
 
NVIDIA OpenGL 4.6 in 2017
NVIDIA OpenGL 4.6 in 2017NVIDIA OpenGL 4.6 in 2017
NVIDIA OpenGL 4.6 in 2017
 
NVIDIA OpenGL in 2016
NVIDIA OpenGL in 2016NVIDIA OpenGL in 2016
NVIDIA OpenGL in 2016
 
Virtual Reality Features of NVIDIA GPUs
Virtual Reality Features of NVIDIA GPUsVirtual Reality Features of NVIDIA GPUs
Virtual Reality Features of NVIDIA GPUs
 
Migrating from OpenGL to Vulkan
Migrating from OpenGL to VulkanMigrating from OpenGL to Vulkan
Migrating from OpenGL to Vulkan
 
EXT_window_rectangles
EXT_window_rectanglesEXT_window_rectangles
EXT_window_rectangles
 
OpenGL for 2015
OpenGL for 2015OpenGL for 2015
OpenGL for 2015
 
Slides: Accelerating Vector Graphics Rendering using the Graphics Hardware Pi...
Slides: Accelerating Vector Graphics Rendering using the Graphics Hardware Pi...Slides: Accelerating Vector Graphics Rendering using the Graphics Hardware Pi...
Slides: Accelerating Vector Graphics Rendering using the Graphics Hardware Pi...
 
Accelerating Vector Graphics Rendering using the Graphics Hardware Pipeline
Accelerating Vector Graphics Rendering using the Graphics Hardware PipelineAccelerating Vector Graphics Rendering using the Graphics Hardware Pipeline
Accelerating Vector Graphics Rendering using the Graphics Hardware Pipeline
 
NV_path rendering Functional Improvements
NV_path rendering Functional ImprovementsNV_path rendering Functional Improvements
NV_path rendering Functional Improvements
 
OpenGL 4.5 Update for NVIDIA GPUs
OpenGL 4.5 Update for NVIDIA GPUsOpenGL 4.5 Update for NVIDIA GPUs
OpenGL 4.5 Update for NVIDIA GPUs
 
SIGGRAPH Asia 2012: GPU-accelerated Path Rendering
SIGGRAPH Asia 2012: GPU-accelerated Path RenderingSIGGRAPH Asia 2012: GPU-accelerated Path Rendering
SIGGRAPH Asia 2012: GPU-accelerated Path Rendering
 
SIGGRAPH Asia 2012 Exhibitor Talk: OpenGL 4.3 and Beyond
SIGGRAPH Asia 2012 Exhibitor Talk: OpenGL 4.3 and BeyondSIGGRAPH Asia 2012 Exhibitor Talk: OpenGL 4.3 and Beyond
SIGGRAPH Asia 2012 Exhibitor Talk: OpenGL 4.3 and Beyond
 
Programming with NV_path_rendering: An Annex to the SIGGRAPH Asia 2012 paper...
Programming with NV_path_rendering:  An Annex to the SIGGRAPH Asia 2012 paper...Programming with NV_path_rendering:  An Annex to the SIGGRAPH Asia 2012 paper...
Programming with NV_path_rendering: An Annex to the SIGGRAPH Asia 2012 paper...
 
GPU accelerated path rendering fastforward
GPU accelerated path rendering fastforwardGPU accelerated path rendering fastforward
GPU accelerated path rendering fastforward
 
GPU-accelerated Path Rendering
GPU-accelerated Path RenderingGPU-accelerated Path Rendering
GPU-accelerated Path Rendering
 
SIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
SIGGRAPH 2012: GPU-Accelerated 2D and Web RenderingSIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
SIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
 
SIGGRAPH 2012: NVIDIA OpenGL for 2012
SIGGRAPH 2012: NVIDIA OpenGL for 2012SIGGRAPH 2012: NVIDIA OpenGL for 2012
SIGGRAPH 2012: NVIDIA OpenGL for 2012
 

Último

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Último (20)

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

CS 354 Performance Analysis

  • 1. CS 354 Performance Analysis Mark Kilgard University of Texas April 26, 2012
  • 2. CS 354 2 Today’s material  In-class quiz  On acceleration structures lecture  Lecture topic  Graphic Performance Analysis
  • 3. CS 354 3 My Office Hours  Tuesday, before class  Painter (PAI) 5.35  8:45 a.m. to 9:15  Thursday, after class  ACE 6.302  11:00 a.m. to 12  Randy’s office hours  Monday & Wednesday  11 a.m. to 12:00  Painter (PAI) 5.33
  • 4. CS 354 4 Last time, this time  Last lecture, we discussed  Acceleration structures  This lecture  Graphics Performance Analysis  Projects  Project 4 on ray tracing on Piazza  Due May 2, 2012  Get started!
  • 5. CS 354 5 Daily Quiz  On a sheet of paper: write your EID, name, and date; write #1, #2, #3, each followed by its answer  Multiple choice: Which is NOT a bounding volume representation? a) sphere b) axis-aligned bounding box c) object-aligned bounding box d) bounding graph point e) convex polyhedron  True or False: Placing objects within a uniform grid is easier than placing objects within a KD tree.  True or False: Volume rendering can be accelerated by the GPU by drawing blended slices of the volume.
  • 6. CS 354 6 Graphics Performance Analysis  Generating synthetic images by computer is both computationally intensive and bandwidth intensive  Achieving interactive rates is key  60 frames/second ≈ real-time interactivity  Worth optimizing  Entertainment and intuition tied to interactivity  How do we think about graphics performance analysis?
  • 7. CS 354 7 Framing Amdahl’s Law  Assume a workload with two parts  First part is A%  Second part is B%  Such that A% + B% = 100%  If we have a technique to speed up the second part by N times  But have no speedup for the first part  What overall speedup can we expect?
  • 8. CS 354 8 Amdahl’s Equation  Assume A% + B% = 100%  If the un-optimized effort is 100%, the optimized effort should be smaller: $\text{OptimizedEffort} = A\% + \frac{B\%}{N}$  Speedup is the ratio of UnoptimizedEffort to OptimizedEffort: $\text{Speedup} = \frac{100\%}{A\% + \frac{B\%}{N}} = \frac{1}{(1 - B) + \frac{B}{N}}$, where B in the last form is a fraction between 0 and 1
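A minimal Python sketch of the speedup formula on this slide. The function name `amdahl_speedup` and its parameter names are illustrative and do not come from the lecture; `parallel_fraction` plays the role of B (as a fraction rather than a percentage) and `n` plays the role of N.

```python
def amdahl_speedup(parallel_fraction: float, n: float) -> float:
    """Overall speedup when a fraction of the work is sped up n times.

    parallel_fraction corresponds to B on the slide (0.0 to 1.0);
    the remaining 1 - parallel_fraction corresponds to A and is not sped up.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    serial_fraction = 1.0 - parallel_fraction
    # Effort after optimization, with the unoptimized effort normalized to 1.
    optimized_effort = serial_fraction + parallel_fraction / n
    return 1.0 / optimized_effort
```

Calling `amdahl_speedup(0.5, 4)` evaluates the case of a half-serial, half-parallel task on four processors.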
  • 9. CS 354 9 Who was Amdahl?  Gene Amdahl  CPU architect for IBM in 1960s  Helped design IBM’s System/360 mainframe architecture  Left IBM to found Amdahl Corporation  Building IBM-compatible mainframes  Why formulate the law?  To evaluate whether to invest in parallel processing or not
  • 10. CS 354 10 Parallelization  Broadly speaking, computer tasks can be broken into two portions  Sequential sub-tasks  Naturally require steps to be done in a particular order  Examples: text layout, entropy decoding  Parallel sub-tasks  Problem splits into lots of independent chunks of work  Chunks of work can be done by separate processing units simultaneously: parallelization  Examples: tracing rays, shading pixels, transforming vertices
  • 11. CS 354 11 Serial Work Sandwiching Parallel Work
  • 12. CS 354 12 Example of Amdahl’s Law  Say a task is 50% serial and 50% parallel  Consider using 4 parallel processors on the parallel portion  Speedup: 1.6x  Consider using 40 parallel processors on the parallel portion  Speedup: 1.951x  Consider the limit: $\lim_{n \to \infty} \frac{1}{0.5 + \frac{0.5}{n}} = 2$
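The quoted speedups follow from Amdahl's equation with a serial fraction of 0.5 and a parallel fraction of 0.5. The arithmetic is written out here as a check:

```latex
\begin{align*}
\text{Speedup}(4)  &= \frac{1}{0.5 + 0.5/4}  = \frac{1}{0.625}  = 1.6 \\
\text{Speedup}(40) &= \frac{1}{0.5 + 0.5/40} = \frac{1}{0.5125} \approx 1.951 \\
\lim_{n \to \infty} \text{Speedup}(n) &= \frac{1}{0.5} = 2
\end{align*}
```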
  • 13. CS 354 13 Graph of Amdahl’s Law
  • 14. CS 354 14 Pessimism about Parallelism?  Amdahl’s Law can instill pessimism about parallel processing  If the serial work percent is high, adding parallel units has low benefit  Assumes fixed “problem” size  So workload stays same size even as parallel execution resources are added  So why do GPUs offer 100’s of cores then?
  • 15. CS 354 15 Gustafson's Law  Observation by John Gustafson  With N parallel units, bigger problems can be attacked  Great example  Increasing GPU resolution  Was 640x480 pixels, now 1920x1200  More parallel units means more pixels can be processed simultaneously  Supporting rendering resolutions previously unattainable  Problem size improvement: $\text{problemScale} = N - A(N - 1)$, where A is the serial fraction of the work
  • 16. CS 354 16 Example  Say a task is 50% serial and 50% parallel  Consider using 4 parallel processors on the parallel portion  Problem scales up: 2.5x  Consider 100 parallel processors  Problem scales up: 50.5x  Also consider heterogeneous nature of graphics processing units
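The two scale factors in this example can be reproduced with a short Python sketch of Gustafson's formula, problemScale = N − A(N − 1). The function name `gustafson_problem_scale` is illustrative; `serial_fraction` stands for A expressed as a fraction.

```python
def gustafson_problem_scale(serial_fraction: float, n: int) -> float:
    """Scaled problem size per Gustafson's law: N - A * (N - 1)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return n - serial_fraction * (n - 1)


# The slide's task: 50% serial, 50% parallel.
for processors in (4, 100):
    print(processors, gustafson_problem_scale(0.5, processors))
```

Unlike the Amdahl speedup, this quantity keeps growing with the processor count, because the parallel part of the problem is assumed to grow with the hardware.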
  • 17. CS 354 17 Coherent Work vs. Incoherent Work  Not all parallel work is created equal  Coherent work = “adjacent” chunks of work performing similar operations and memory accesses  Example: camera rays, pixel shading  Allows sharing control of instruction execution  Good for caches  Incoherent work = “adjacent” chunks of work performing dissimilar operations and memory accesses  Examples: reflection, shadow, and refraction rays  Bad for caches
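One way to see why coherence matters for shared instruction control is a toy model, sketched below in Python. It is an assumption-laden simplification and not a description of any real GPU: adjacent work items are bundled into fixed-size groups, and a group needs one execution pass per distinct operation among its items. The function name `simd_passes`, the operation labels, and the group width of 32 are all hypothetical.

```python
from typing import Sequence


def simd_passes(operations: Sequence[str], group_size: int = 32) -> int:
    """Count execution passes under a simplified SIMD model.

    Adjacent work items are bundled into groups of group_size. A group
    needs one pass per distinct operation among its items, so a group
    whose items all do the same thing needs a single pass.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    total = 0
    for start in range(0, len(operations), group_size):
        group = operations[start:start + group_size]
        total += len(set(group))
    return total


# 64 work items that all perform the same operation (coherent).
coherent = ["shade_diffuse"] * 64
# 64 work items where neighbors perform different operations (incoherent).
incoherent = ["reflect", "refract", "shadow", "shade_diffuse"] * 16

print(simd_passes(coherent))    # 2 groups, 1 pass each
print(simd_passes(incoherent))  # 2 groups, 4 passes each
```

In this model the incoherent workload costs four times as many passes for the same number of work items, which is the intuition behind bundling similar work together.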
  • 18. CS 354 18 Coherent vs. Incoherent Rays  [Figure labels: camera rays = coherent; light rays = coherent; reflected rays = incoherent]
  • 19. CS 354 19 Keeping Work Coherent?  How do we keep work coherent?  Pipelines  Careful because they can introduce latency  Data structures  SPMD (or SIMD) execution  Single Program, Multiple Data  To exploit Single Instruction, Multiple Data (SIMD) units  Bundling “adjacent” work elements helps cache and memory access efficiency
  • 20. CS 354 20 Pipeline Processing  Parallel and naturally coherent
  • 21. CS 354 21 A Simplified Graphics Pipeline  Application → (Application-OpenGL API boundary) → Vertex batching & assembly → Triangle assembly → Triangle clipping → NDC to window space → Triangle rasterization → Fragment shading → Depth testing (Depth buffer) → Color update (Framebuffer)
  • 22. CS 354 22 Another View of the Graphics Pipeline (OpenGL 3.3)  [Diagram: 3D Application or Game, OpenGL API, CPU-GPU boundary; GPU Front End, Vertex Assembly, Primitive Assembly, Clipping/Setup/Rasterization, and Raster Operations blocks; Vertex Shader, Geometry Shader, and Fragment Shader Program blocks; Attribute Fetch, Parameter Buffer Read, Texture Fetch, and Framebuffer Access paths to the Memory Interface; legend distinguishes programmable from fixed-function blocks]
  • 23. CS 354 23 Modeling Pipeline Efficiency  Rate of processing for sequential tasks  Assume three tasks  Run time is sum of each operation’s time  A+B+C  Rate of processing in a pipeline  Assume three tasks, treated as stages  Performance gated by slowest operation  Three operations in pipeline: A, B, C  Run time = max(A,B,C)
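The two run-time models on this slide can be compared with a small Python sketch. The slide's max(A, B, C) is read here as the steady-state cost per item once the pipeline is full; the extra term for filling the pipeline with the first item is an added assumption, as are the function names and the stage times.

```python
from typing import Sequence


def sequential_time(stage_times: Sequence[float], items: int = 1) -> float:
    """Each item passes through every stage before the next item starts."""
    return items * sum(stage_times)


def pipelined_time(stage_times: Sequence[float], items: int = 1) -> float:
    """Stages overlap: the first item pays the full latency, and each
    later item completes one slowest-stage interval after the previous."""
    if items < 1:
        return 0.0
    return sum(stage_times) + (items - 1) * max(stage_times)


stages = [2.0, 5.0, 3.0]  # hypothetical per-item times for A, B, C

print(sequential_time(stages, 100))  # 100 * (2 + 5 + 3)
print(pipelined_time(stages, 100))   # (2 + 5 + 3) + 99 * 5
```

For long runs the pipelined time per item approaches the slowest stage's time, so speeding up any stage other than the slowest one does not improve throughput.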
  • 24. CS 354 24 Hardware Clocks  Heart beat of hardware  Measured in frequency  Hertz (Hz) = cycles per second  Megahertz, gigahertz = million, billion Hz  Faster clocks = faster computation and data transfer  So why not simply raise clocks?  High clocks consume more power  Circuits are only rated to a maximum clock speed before becoming unreliable
  • 25. CS 354 25 Clock Domains  Given chip may have multiple clocks running  Three key domains (GPU-centric)  Graphics clock—for fixed-function units  Example uses: rasterization, texture filtering, blending  Optimize for throughput, not latency  Can often instance more units instead of raising clocks  Processor clock—for programmable shader units  Example: shader instruction execution  Generally higher than graphics clock  Because optimized for latency rather than throughput  Memory clock—for talking to external memory  Depends on speed rating of external memory  Other domains too  Display clock, PCI-Express bus clock  Generally not crucial to rendering performance
  • 26. CS 354 26 3D Pipeline Programmable Domains run on Unified Hardware  Unified Streaming Processor Array (SPA) architecture means same capabilities for all domains  Plus tessellation + compute (not shown below)  [Diagram: GPU Front End, Vertex Assembly, Primitive Assembly, Clipping/Setup/Rasterization, and Raster Operations blocks; Vertex Program, Primitive Program, and Fragment Program blocks annotated “Can be unified hardware!”; Attribute Fetch, Parameter Buffer Read, Texture Fetch, and Framebuffer Access paths to the Memory Interface]
  • 27. CS 354 27 Memory Bandwidth  Raw memory bandwidth  Physical clock rate  Example: 3 GHz  Memory bus width  64-bit, 128-bit, 192-bit, 256-bit, 384-bit  Wider buses are faster, but routing all those wires is more expensive  Signaling rate  Double data rate (DDR) means signals are sent on the rising and falling clock edges  Often the logical memory clock rate includes the signaling rate  Computing raw memory bandwidth: $\text{bandwidth} = \text{physicalClock} \times \text{signalPerClock} \times \text{busWidth}$
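A short Python sketch of the raw-bandwidth formula. With the bus width given in bits the product is in bits per second, so the division by 8 to report bytes per second is an added convenience, not part of the slide's formula. The function name and the example memory configuration are hypothetical.

```python
def raw_bandwidth_bytes_per_second(
    physical_clock_hz: float,
    signals_per_clock: int,
    bus_width_bits: int,
) -> float:
    """bandwidth = physicalClock * signalPerClock * busWidth, in bytes/s."""
    bits_per_second = physical_clock_hz * signals_per_clock * bus_width_bits
    return bits_per_second / 8  # 8 bits per byte


# Hypothetical DDR memory: 1.5 GHz physical clock, 2 signals per clock
# (rising and falling edges), 128-bit bus.
bandwidth = raw_bandwidth_bytes_per_second(1.5e9, 2, 128)
print(f"{bandwidth / 1e9:.0f} GB/s")
```

When a quoted memory clock already includes the signaling rate, pass it as `physical_clock_hz` with `signals_per_clock=1` to avoid counting the signaling rate twice.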
  • 28. CS 354 28 Latency vs. Throughput  Achieved bandwidth is the raw bandwidth reduced by memory utilization  Unrealistic to expect 100% utilization  GPUs are generally much better than CPUs  Trade-off  Maximizing throughput (utilization) increases latency  Minimizing latency reduces utilization
  • 29. CS 354 29 Computing Bandwidth  Example: GeForce GTX 680  Latest NVIDIA generation  3.54 billion transistors in 28 nm process  Memory characteristics  6 GHz memory clock (includes signaling rate)  256-bit memory interface  6 billion clocks/second × 256 bits/clock × 1 byte/8 bits = 192 gigabytes/second  [Images: GeForce GTX 680 board; GK104 die]
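Written out with units, the GeForce GTX 680 figure works out as follows, taking a gigabyte as 10^9 bytes and treating the 6 GHz clock as already including the signaling rate:

```latex
\begin{align*}
\text{bandwidth}
  &= 6 \times 10^{9}\ \frac{\text{clocks}}{\text{s}}
     \times 256\ \frac{\text{bits}}{\text{clock}}
     \times \frac{1\ \text{byte}}{8\ \text{bits}} \\
  &= 192 \times 10^{9}\ \frac{\text{bytes}}{\text{s}}
   = 192\ \text{GB/s}
\end{align*}
```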
  • 30. CS 354 30 GeForce Peak Memory Bandwidth Trends  [Chart: gigabytes per second (0 to 200) for GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX; series: raw bandwidth and effective raw bandwidth with compression, each with an exponential trend line; 128-bit and 256-bit interface generations marked]
  • 31. CS 354 31 Effective GPU Memory Bandwidth  Compression schemes  Lossless depth and color (when multisampling) compression  Lossy texture compression (S3TC / DXTC)  Typically assumes 4:1 compression  Avoiding useless work  Early killing of fragments (Z cull)  Avoiding useless blending and texture fetches  Very clever memory controller designs  Combining memory accesses for improved coherency  Caches for texture fetches
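To make the texture compression ratio concrete, here is a small Python sketch comparing storage sizes. S3TC encodes 4×4 texel blocks, using 8 bytes per block for DXT1 and 16 bytes per block for DXT3/DXT5. Comparing against uncompressed 32-bit RGBA is an assumption about what the slide's 4:1 figure refers to, and the function names and texture size are illustrative.

```python
DXT1_BYTES_PER_BLOCK = 8   # 64 bits per 4x4 block
DXT5_BYTES_PER_BLOCK = 16  # 128 bits per 4x4 block


def uncompressed_bytes(width: int, height: int, bytes_per_texel: int = 4) -> int:
    """Size of an uncompressed texture (default: 32-bit RGBA)."""
    return width * height * bytes_per_texel


def s3tc_bytes(width: int, height: int, bytes_per_block: int) -> int:
    """Size of an S3TC texture; dimensions round up to whole 4x4 blocks."""
    blocks_wide = (width + 3) // 4
    blocks_high = (height + 3) // 4
    return blocks_wide * blocks_high * bytes_per_block


width, height = 1024, 1024  # hypothetical texture size
raw = uncompressed_bytes(width, height)
print(raw / s3tc_bytes(width, height, DXT5_BYTES_PER_BLOCK))  # ratio vs. DXT5
print(raw / s3tc_bytes(width, height, DXT1_BYTES_PER_BLOCK))  # ratio vs. DXT1
```

Fetching fewer bytes per texel means each texture fetch consumes less memory bandwidth, which is how lossy compression raises effective bandwidth.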
  • 32. CS 354 32 Other Metrics  Host bandwidth  Vertex pulling  Vertex transformation  Triangle rasterization and setup  Fragment shading rate  Shader instruction rate  Raster (blending) operation rate  Early Z reject rate
  • 33. CS 354 33 Kepler GeForce GTX 680 High-level Block Diagram  8 Streaming Multiprocessors (SMX)  1536 CUDA Cores  8 Geometry Units  4 Raster Units  128 Texture units  32 Raster operations  256-bit GDDR5 memory
  • 34. CS 354 34 Kepler Streaming Multiprocessor 8 more copies of this
  • 35. CS 354 35 Prior Generation Streaming Multiprocessor (SM)  Multi-processor execution unit (Fermi)  32 scalar processor cores  Warp is a unit of thread execution of up to 32 threads  Two workloads  Graphics  Vertex shader  Tessellation  Geometry shader  Fragment shader  Compute
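Because a warp holds at most 32 threads, a thread count that is not a multiple of 32 leaves part of the last warp idle. The sketch below illustrates that arithmetic only; the function names and the 1000-thread workload are hypothetical.

```python
def warps_needed(thread_count, warp_size=32):
    """Number of warps required to hold thread_count threads (ceiling division)."""
    return -(-thread_count // warp_size)

def lane_utilization(thread_count, warp_size=32):
    """Fraction of warp lanes that actually carry a thread."""
    return thread_count / (warps_needed(thread_count, warp_size) * warp_size)

# Hypothetical workload of 1000 threads: 31 full warps plus one partly filled warp.
print(warps_needed(1000))
print(lane_utilization(1000))
```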
  • 36. CS 354 36 Power Gating  Computer architecture has hit the “power wall”  Low-power operation is at a premium  Battery-powered devices  Thermal constraints  Economic constraints  Power Management (PM) works to reduce power by  Lowering clocks when performance isn’t required  Disabling hardware units  Avoids leakage
  • 37. CS 354 37 Scene Graph Labor  High-level division of scene graph labor  Four pipeline stages  App (application)  Code that manipulates/modifies the scene graph in response to user input or other events  Isect (intersection)  Geometric queries such as collision detection or picking  Cull  Traverse the scene graph to find the nodes to be rendered  Best example: eliminate objects out of view  Optimize the ordering of nodes  Sort objects to minimize graphics hardware state changes  Draw  Communicating drawing commands to the hardware  Generally through graphics API (OpenGL or Direct3D)  Can map well to multi-processor CPU systems
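The cull stage described above can be sketched as two steps: drop nodes outside the view, then sort the survivors so that nodes sharing a render state are drawn together. This is a toy illustration, not a real scene graph API: the `Node` class, the 1D position standing in for a bounding volume, and the state names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    x: float    # hypothetical 1D position standing in for a bounding volume
    state: str  # hypothetical render-state key, such as a texture or shader name

def cull(nodes, view_min, view_max):
    """Keep only the nodes that fall inside the (1D) view range."""
    return [n for n in nodes if view_min <= n.x <= view_max]

def order_for_draw(nodes):
    """Group nodes by render state so the draw stage changes state less often."""
    return sorted(nodes, key=lambda n: n.state)

def count_state_changes(nodes):
    """Count how many times the render state differs from the previous node's."""
    changes, previous = 0, None
    for n in nodes:
        if n.state != previous:
            changes += 1
            previous = n.state
    return changes

scene = [
    Node("a", 1, "brick"), Node("b", 2, "glass"), Node("c", 3, "brick"),
    Node("d", 50, "glass"), Node("e", 4, "glass"),
]
visible = cull(scene, 0, 10)       # node "d" is out of view
ordered = order_for_draw(visible)  # brick nodes first, then glass nodes
print(count_state_changes(visible), count_state_changes(ordered))
```

In this toy scene, sorting halves the number of state changes sent to the draw stage; a real cull stage would test bounding volumes against the view frustum instead of a 1D range.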
  • 38. CS 354 38 App-cull-draw Threading  App-cull-draw processing on one CPU core  App-cull-draw processing on multiple CPUs
  • 39. CS 354 39 Scene Graph Profiling  Scene graph should help provide insight into performance  Process statistics  What’s going on?  Time stamps  Database statistics  How complex is the scene in any frame?
  • 40. CS 354 40 Example: Depth Complexity Visualization  How many pixels are being rendered?  Pixels can be rasterized by multiple objects  Depth complexity is the average number of times a pixel or color sample is updated per frame [Figure caption: yellow and black indicate higher depth complexity]
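Taking the definition above literally, depth complexity is the total number of pixel updates divided by the number of pixels. A minimal sketch, assuming the per-pixel update counts are already available as a 2D grid (the tiny frame here is made up for illustration):

```python
def depth_complexity(update_counts):
    """Average number of updates per pixel, given a 2D grid of per-pixel update counts."""
    total_updates = sum(sum(row) for row in update_counts)
    pixel_count = sum(len(row) for row in update_counts)
    return total_updates / pixel_count

# Hypothetical 2x3 frame: each entry is how many times that pixel was written.
frame = [
    [1, 2, 4],
    [0, 2, 3],
]
print(depth_complexity(frame))  # 12 updates over 6 pixels
```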
  • 41. CS 354 41 Example: Heads-up Display of Statistics  Process statistics  How long is everything taking?  Database statistics  What is being rendered?  Overlaying on the active scene is often valuable  Dynamic update
  • 42. CS 354 42 Benchmarking  Synthetic benchmarks focus on rendering particular operations in isolation  What is the blended pixel performance?  Application benchmarks  Try to reflect what a real application would do
  • 43. CS 354 43 Tips for Interactive Performance Analysis  Vary things you can control  Change window resolution  Make it smaller and see whether performance improves  Null driver analysis  Skip the actual rendering calls  What if the driver were “infinitely” fast?  Use occlusion queries to monitor how many samples (pixels) are actually rendered  Keep data on the GPU  Let the GPU do Direct Memory Access (DMA)  Keep from swapping textures and buffers  Easy when multi-gigabyte graphics cards are available
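The window-resolution tip can be turned into a rough estimate of how much of the frame time depends on pixel count. This sketch assumes a simple linear model (frame time = fixed cost + per-pixel cost × pixels), which is an assumption and not something the slides state; the function name and the timings are hypothetical.

```python
def pixel_bound_fraction(time_full_ms, pixels_full, time_small_ms, pixels_small):
    """Estimate the share of the full-resolution frame time that scales with pixel count.

    Assumes time = fixed + per_pixel * pixels (a simplifying assumption).
    """
    per_pixel_ms = (time_full_ms - time_small_ms) / (pixels_full - pixels_small)
    return per_pixel_ms * pixels_full / time_full_ms

# Hypothetical measurements: 20 ms at 1920x1080, 8 ms at 960x540.
fraction = pixel_bound_fraction(20.0, 1920 * 1080, 8.0, 960 * 540)
print(round(fraction, 3))
```

A fraction near 1 suggests the frame is limited by per-pixel work (fragment shading, blending, memory bandwidth); a fraction near 0 suggests the bottleneck is elsewhere, such as the CPU, the driver, or vertex processing.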
  • 44. CS 354 44 Next Class  Next lecture  Surfaces  Programmable tessellation  Reading  None  Project 4  Project 4 is a simple ray tracer  Due Wednesday, May 2, 2012