Confidential © 2018 Arm Limited
Owen Wu (owen.wu@arm.com)
Developer Relations Engineer
Mali GPU Architecture and Mobile Studio
TGDF
Agenda
• GPU Architecture
• Tile-based Rendering
• Render Passes
• Mobile Studio Introduction
• Performance Analysis Workflow
• Streamline
• Graphics Analyzer
• Offline Shader Compiler
• Performance Advisor
GPU Architecture
Bifrost Shader Core
— Mali-G30, Mali-G50, and Mali-G70 series
— Unified shader core architecture
— Can scale from a single core for low-end devices all the way up to 32 cores
— L2 cache: typically in the range of 64-128KB per shader core
— Able to write one 32-bit pixel per core per clock
— An 8-core design therefore has a total of 256 bits of memory bandwidth (for both read and write) per clock cycle
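The 256-bit figure follows directly from the per-core write rate. A minimal sketch of that arithmetic, assuming every core sustains its peak rate; the function names and the 600 MHz clock are illustrative assumptions, not figures from the slides:

```python
BITS_PER_PIXEL = 32  # one 32-bit pixel per core per clock (from the slide)


def peak_bits_per_clock(num_cores: int) -> int:
    """Peak pixel bits written per clock across all shader cores."""
    return num_cores * BITS_PER_PIXEL


def peak_bytes_per_second(num_cores: int, clock_hz: float) -> float:
    """Peak write rate in bytes per second at a given (hypothetical) GPU clock."""
    return peak_bits_per_clock(num_cores) / 8 * clock_hz


print(peak_bits_per_clock(8))                 # 8 cores x 32 bits
print(peak_bytes_per_second(8, 600e6) / 1e9)  # GB/s at an assumed 600 MHz
```

Real devices rarely sustain this peak, so treat the result as an upper bound rather than an expected figure.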
— Black blocks are fixed-function units
— The execution core consists of 5 units
– Execution Engine : executes shader arithmetic
– Load/Store Unit : handles shader memory accesses
– Varying Unit : fixed-function varying interpolator
– ZS/Blend Unit : handles accesses to the tile memory
– Texture Unit : handles memory accesses to do with textures
Index-Driven Vertex Shading (IDVS)
— Index-Driven Vertex Shading (IDVS) geometry processing pipeline
— Earlier pipelines processed all of the vertex shading before culling primitives, often resulting in wasted computation and bandwidth for vertices which are only used in culled triangles
— IDVS splits the shader into two halves
– Position shading runs before culling
– Varying shading runs after culling, for the visible vertices which survive it
— Partially deinterleave packed vertex buffers
– Place attributes contributing to position in one packed buffer
– Place attributes contributing to non-position varyings in a second packed buffer
– Non-position attributes are then not pulled into the cache for vertices which are culled
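The partial deinterleaving described above can be sketched as a simple split of per-vertex data into two packed streams. This is a minimal illustration only: the attribute names (`position`, `normal`, `uv`) and the `deinterleave` helper are hypothetical, and a real engine would do this at asset build time on binary buffers.

```python
from typing import Dict, List, Sequence, Tuple

Vertex = Dict[str, Tuple[float, ...]]


def deinterleave(vertices: Sequence[Vertex],
                 position_attrs: Sequence[str]) -> Tuple[List[float], List[float]]:
    """Split per-vertex attributes into a position stream and a varying stream."""
    position_stream: List[float] = []
    varying_stream: List[float] = []
    for vertex in vertices:
        for name, values in vertex.items():
            # Attributes that feed the position computation go in one buffer,
            # everything else goes in the second buffer.
            target = position_stream if name in position_attrs else varying_stream
            target.extend(values)
    return position_stream, varying_stream


vertices = [
    {"position": (0.0, 0.0, 0.0), "normal": (0.0, 0.0, 1.0), "uv": (0.0, 0.0)},
    {"position": (1.0, 0.0, 0.0), "normal": (0.0, 0.0, 1.0), "uv": (1.0, 0.0)},
]
pos, var = deinterleave(vertices, position_attrs=("position",))
print(len(pos), len(var))
```

With this layout, position shading only touches the first stream, so the second stream is fetched only for vertices that survive culling.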
Forward Pixel Kill (FPK)
— Early-Z and front-to-back rendering remove most overdraw
— A pixel can be killed in flight if a later pixel will occlude it
— Calculations already in flight can be terminated at any time if the GPU spots that a later thread will write opaque data to the same pixel location
Tile-based Rendering
Immediate Mode GPU
— Traditional desktop GPU architecture
— Vertex shaders and fragment shaders are executed in sequence on each primitive in each draw call
for draw in renderPass:
    for primitive in draw:
        for vertex in primitive:
            execute_vertex_shader(vertex)
        for fragment in primitive:
            execute_fragment_shader(fragment)
Tile-Based GPU
— Designed to minimize the number of external memory accesses needed during rendering
— Tile-based renderers split the screen into small pieces; Mali renders 16x16 pixel tiles
— Fragment shading is processed one small tile at a time
— The result is written out to memory once the tile is complete
— Each render pass is split into two distinct processing passes
– The first executes all of the geometry-related processing and generates the tile lists
– The second executes all of the fragment processing, tile by tile
— Vertex Pass
for draw in renderPass:
    for primitive in draw:
        for vertex in primitive:
            execute_vertex_shader(vertex)
        append_tile_list(primitive)
— Fragment Pass
for tile in renderPass:
    for primitive in tile:
        for fragment in primitive:
            execute_fragment_shader(fragment)
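The two passes above can be made concrete with a small runnable model. This is a sketch under stated assumptions: primitives are 2D triangles in non-negative pixel coordinates, and binning uses each triangle's bounding box, which is a conservative simplification rather than the hardware's exact coverage test. The names `bin_primitives` and `fragment_pass` are hypothetical.

```python
from collections import defaultdict

TILE_SIZE = 16  # Mali renders 16x16 pixel tiles (from the slides)


def bin_primitives(primitives):
    """Vertex pass: add each primitive to the list of every tile its bounding box touches."""
    tile_lists = defaultdict(list)
    for prim_id, verts in enumerate(primitives):
        xs = [x for x, _ in verts]
        ys = [y for _, y in verts]
        for ty in range(int(min(ys)) // TILE_SIZE, int(max(ys)) // TILE_SIZE + 1):
            for tx in range(int(min(xs)) // TILE_SIZE, int(max(xs)) // TILE_SIZE + 1):
                tile_lists[(tx, ty)].append(prim_id)
    return tile_lists


def fragment_pass(tile_lists):
    """Fragment pass: visit tiles one at a time and return the processing order."""
    return sorted(tile_lists.items())


triangles = [
    [(2, 2), (10, 2), (2, 10)],   # fits inside tile (0, 0)
    [(8, 8), (40, 8), (8, 20)],   # bounding box spans several tiles
]
tile_lists = bin_primitives(triangles)
for tile, prims in fragment_pass(tile_lists):
    print(tile, prims)
```

Each tile only sees the primitives binned to it, which is what lets the fragment pass keep its working set in on-chip tile memory.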
Render Passes
— Render passes are an essential concept
— A render pass is a single execution of the rendering pipeline
– Its attachments need initializing in tile memory at the start of the render pass
– Its attachments may need writing back out to memory at the end of the render pass
— Minimize the amount of memory traffic into and out of the tile memory
– Avoid reading at the start of a render pass
– Avoid writing at the end of a render pass
Vertex Pass
– Position Shading : do only position shading here, for all vertices
– Facing Test Culling : cull back-facing polygons
– Frustum Test Culling : cull polygons which are outside the frustum
– Sample Test Culling : cull polygons which are smaller than a pixel
– Polygon List : generate the polygon list and write it to external memory
– Varying Shading : do varying shading for surviving vertices
Fragment Pass
Stages: Polygon List of Tile, Rasterizer, Early ZS Test, Fragment Thread Creator, Execution Core (Process Engine 0 to Process Engine N), Late ZS Test, Blender, Tile RAM, Tile Write with Transaction Elimination, Frame Buffer
– Rasterizer : rasterize each polygon into many quads
– Early ZS Test : kill quads which are occluded by other pixels (if possible)
– Fragment Thread Creator : generate fragment threads from each quad
– Execution Core : execute the fragment shader
– Late ZS Test : kill fragments which are occluded by other pixels (if the fragment can't do the early ZS test)
– Tile Write : do compression to save bandwidth
Render Passes in OpenGL
— The OpenGL ES API has no explicit render passes at the API level
— The driver must infer which rendering operations form a single render pass
— Drawing commands are added to the current render pass
— A render pass is submitted for processing when an API call changes the framebuffer or forces a flush of the queued work
— The most common causes for ending a render pass
– Calling glBindFramebuffer() to change the GL_FRAMEBUFFER or GL_DRAW_FRAMEBUFFER target
– Calling glFramebufferTexture*() or glFramebufferRenderbuffer() to change the attachments
– Calling eglSwapBuffers()
– Calling glFlush() or glFinish()
– Creating a fence with glFenceSync() and then calling glClientWaitSync() to wait on it
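To see how these calls fragment a frame, the inference can be modelled on a recorded list of API call names. This is a toy model, not the driver's actual logic: it treats every listed call as a pass boundary (a real driver may, for example, ignore a rebind of the same framebuffer), and the function name `split_render_passes` is hypothetical.

```python
# Calls that end the current render pass, taken from the list above.
PASS_ENDING_CALLS = {
    "glBindFramebuffer", "glFramebufferTexture2D", "glFramebufferRenderbuffer",
    "eglSwapBuffers", "glFlush", "glFinish", "glClientWaitSync",
}
DRAW_CALLS = {"glDrawArrays", "glDrawElements"}


def split_render_passes(calls):
    """Group draw calls into render passes the way a driver might infer them."""
    passes, current = [], []
    for call in calls:
        if call in DRAW_CALLS:
            current.append(call)
        elif call in PASS_ENDING_CALLS and current:
            # A boundary call closes the pass that has queued draws.
            passes.append(current)
            current = []
    if current:
        passes.append(current)
    return passes


frame = [
    "glBindFramebuffer", "glDrawArrays", "glDrawElements",
    "glBindFramebuffer", "glDrawArrays",
    "glFlush",            # splits what could have been a single pass
    "glDrawArrays", "eglSwapBuffers",
]
passes = split_render_passes(frame)
print(len(passes), passes)
```

In this trace the stray glFlush() turns two render passes into three, which on a tile-based GPU means an extra round trip of the attachments through main memory.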
Efficient Render Passes
— Process each render pass once
– Bind each framebuffer object only once
– Make all required draw calls before switching to the next context
– Avoid unnecessary context switches
— Minimizing start-of-tile loads
– The GPU can cheaply initialize the tile memory to a clear color value
– Ensure that you clear or invalidate all of your attachments at the start of each render pass
– You can use any of the following calls: glClear(), glClearBuffer*(), glInvalidateFramebuffer()
— Minimizing end-of-tile stores
– Avoid writing back to main memory whenever possible
– Notify the driver that an attachment is transient by marking its content as invalid with a call to glInvalidateFramebuffer() as the last "draw call"
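A rough estimate shows why these loads and stores matter. The sketch below counts uncompressed bytes moved per render pass; the 1920x1080 resolution, the 4-byte color and depth/stencil formats, and the helper names are assumptions for illustration, and real traffic is usually lower thanks to framebuffer compression and transaction elimination.

```python
def attachment_bytes(width: int, height: int, bytes_per_pixel: int) -> int:
    """Uncompressed size of one attachment in bytes."""
    return width * height * bytes_per_pixel


def pass_traffic(width, height, attachments):
    """Bytes moved per pass; attachments is a list of (bytes_per_pixel, load, store)."""
    total = 0
    for bytes_per_pixel, load, store in attachments:
        size = attachment_bytes(width, height, bytes_per_pixel)
        # Each load at the start and each store at the end moves the full attachment.
        total += size * (int(load) + int(store))
    return total


# Color (4 bytes/pixel) and depth/stencil (4 bytes/pixel) at an assumed 1920x1080.
naive = pass_traffic(1920, 1080, [(4, True, True), (4, True, True)])
# Cleared at the start, depth/stencil invalidated at the end: only color is stored.
tuned = pass_traffic(1920, 1080, [(4, False, True), (4, False, False)])
print(naive / 2**20, tuned / 2**20)  # MiB per pass
```

Under these assumptions, clearing at the start and invalidating the transient depth/stencil attachment cuts the per-pass traffic to a quarter.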
Mobile Studio Introduction
What is in the box?
• Streamline
• Graphics Analyzer
• Mali Offline Compiler (separate download)
• Performance Advisor (closed beta)
Download Arm Mobile Studio: http://developer.arm.com/mobile-studio
Streamline
Performance Analyzer
Mali GPU support
• Analyze and optimize Mali™ GPU graphics and compute workloads
• Accelerate your workflow using built-in analysis templates
Optimize for energy
• Move beyond simple frame time and FPS tracking
• Monitor overall usage of processor cycles and memory bandwidth
Speed up your app
• Find out where the system is spending the most time
• Tune code for cache efficiency
Native code profiling
• Break performance down by function
• View cost alongside disassembly listing
Arm CPU support
• Profile 32-bit and 64-bit apps for ARMv7-A and ARMv8-A cores
• Tune multi-threading for DynamIQ multi-core systems
Application event trace
• Annotate software workloads
• Define logical event channel structure
• Trace cross-channel task dependencies
Tune your rendering
• Identify critical-path GPU shader core resources
• Detect content inefficiency
Graphics Analyzer
GPU API Debugger
Shader analysis
• Capture and view all shaders used
• Optimize shader performance using the integrated Mali Offline Compiler
Cross platform
• Host support for Windows, macOS, and Linux
• Target support for any Android GPU
Rendering API debug
• Graphics debug for content developers
• Support for all versions of OpenGL ES and Vulkan
Visual analysis views
• Native mode
• Overdraw mode
• Shader map mode
• Fragment count mode
State visibility
• Show API state after every API call
• Trace backwards from point-of-use to the API call responsible for setting the state
Android utility app
• Manage on-device connection
• Select and launch user application
Frame analysis
• Diagnose root causes of rendering errors
• Identify sources of rendering inefficiency
Mali Offline Compiler
Shader static analysis
Rapid iteration
• Verify impact of shader changes without needing a whole application rebuild
Profile for any Mali GPU
• Cost shader code for every Mali GPU without needing hardware
Mali GPU aware
• Support for all actively shipping Mali GPUs
• Cycle counts reflect specific microarchitecture
Control flow aware
• Best case control flow
• Worst case control flow
Syntax verification
• Verify correctness of code changes
• Receive clear error diagnostics for invalid shaders
Critical path analysis
• Identify dominant shader resource
• Target this for optimization!
Register usage
• Work registers
• Uniform registers
• Stack spilling
Performance Advisor
Analysis Reports
Fast workflow
• Integrate data capture and analysis into nightly CI test workflow
• Read results over a nice cup of tea
Caveats …
• Still under development
• Currently in closed beta
Region views
• Split by dynamic behavior
• Split by application annotation
Executive dashboard
• Show high-level status summary
• Show status breakdown by regions of interest
Overview charts
• See performance trends over time
• See region splits
Summary reports
• Easy-to-use performance status reports
• Integrated initial root cause analysis
Device requirements
• Not all devices provide the necessary performance data
• CPU data requirements:
  • Kernel 4.4 or higher
  • Kernel perf config includes CPU PMU
• GPU data requirements:
  • Supported Mali GPU and driver
• Supported device list can be found online*
  • Other devices may work!
  • Conformance tests for ODMs under development
* https://developer.arm.com/products/software-development-tools/arm-mobile-studio/support/supported-devices
Devices pictured: Samsung Galaxy S9 (Exynos version), Oppo R15, Huawei Mate 10 Pro
Workflow
Analysis workflow
Measure → Triage → Identify hot spots → Determine probable cause → Optimize
• Optimization is a data-driven science
  • Avoid guesswork!
  • Use data to identify root causes
• A disciplined workflow is required
  • Consistent methodology
  • Reliable underlying data
• Particularly important for GPU analysis
  • Graphics API provides a thick abstraction
  • Many interacting behaviors in the GPU hardware
• Tools are essential to make this efficient
• Analyze … with Streamline
• Improve shaders … with Mali Offline Compiler
• Tune rendering … with Graphics Analyzer
• Optimize code … with Streamline
• Monitor … with Performance Advisor
Streamline
Performance triage nurse
Full system view
• Streamline provides an at-a-glance overview
• Time-correlated views of:
  • CPU activity
  • CPU counters
  • GPU counters
  • Thermal sensors
  • Thread scheduling
• Main bottlenecks are usually clearly visible
• Detailed data is available when needed
Event-based sampling
• Software profiling defaults to time-based sampling
  • PC sampled every millisecond
• Event-based sampling allows triggering on events
  • Select a CPU PMU event to use as a trigger
  • Set a threshold for triggering, e.g. every 10,000 events
  • PC sampled every N events
• Can be used for cache-aware profiling
  • Sample on L1 or L2 cache refill
• Can be used for bandwidth-aware profiling
  • Sample on external bus access
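The difference from time-based sampling is easiest to see in a small simulation. This is a conceptual sketch only: the trace format (program counter plus number of PMU events caused at that address) and the function name are hypothetical, and real sampling is done by the PMU hardware and the kernel, not in application code.

```python
def event_based_samples(trace, threshold):
    """Record the PC each time the running event count reaches the threshold.

    trace: iterable of (pc, events) pairs, where events is how many PMU events
    (for example cache refills) the code at that pc caused.
    """
    samples = []
    counter = 0
    for pc, events in trace:
        counter += events
        while counter >= threshold:
            samples.append(pc)
            counter -= threshold
    return samples


# Hypothetical trace: most cache refills happen at 0x10c and 0x110.
trace = [(0x100, 0), (0x104, 3), (0x108, 0), (0x10C, 7), (0x110, 12)]
samples = event_based_samples(trace, threshold=10)
print([hex(pc) for pc in samples])
```

Code that causes no events is never sampled, so the resulting profile points at the instructions responsible for the cache refills or bus accesses rather than at wherever time happens to be spent.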
Basic Performance Analysis
• Can load template charts
• Custom charts
  • Can use custom expressions
• Work from the big picture down to the details
Triage nurse scenarios
• Vsync bound
• Fragment bound
• CPU bound
• Serialization problems
• Thermally bound
Chart annotation: 16.6 ms (one frame period at 60 FPS)
Custom Counter Samples
• Pixels per Primitive
  • (($MaliTilesTilesRendered * 256) / $MaliTilerCullingVisible)
• Fragments per Pixel
  • (($MaliQuadsFragmentQuadsShaded * 4) / ($MaliTilesTilesRendered * 256))
• Fragment Cycles per Pixel
  • ($MaliCoreCyclesFragmentCycles / ($MaliTilesTilesRendered * 256))
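The constants in these expressions come from the hardware layout: a tile is 16x16 = 256 pixels and a quad is 4 fragments. The same formulas can be checked offline against exported counter values; the sketch below uses made-up counter values for a single 1920x1080 frame, and the Python function names are hypothetical.

```python
PIXELS_PER_TILE = 16 * 16   # the 256 in the expressions above
FRAGMENTS_PER_QUAD = 4      # a quad is a 2x2 block of fragments


def pixels_per_primitive(tiles_rendered, visible_primitives):
    return tiles_rendered * PIXELS_PER_TILE / visible_primitives


def fragments_per_pixel(fragment_quads_shaded, tiles_rendered):
    return fragment_quads_shaded * FRAGMENTS_PER_QUAD / (tiles_rendered * PIXELS_PER_TILE)


def fragment_cycles_per_pixel(fragment_cycles, tiles_rendered):
    return fragment_cycles / (tiles_rendered * PIXELS_PER_TILE)


# Hypothetical counter values: 120 x 68 tiles cover a 1920x1080 frame.
tiles = 120 * 68
print(pixels_per_primitive(tiles, 50_000))
print(fragments_per_pixel(1_044_480, tiles))        # average overdraw
print(fragment_cycles_per_pixel(20_889_600, tiles))
```

Fragments per pixel is effectively the average overdraw, so a value well above 1 is a hint to look at draw order and transparency.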
Streamline Annotations
— Annotations let you instrument your source code
— Unity package
– Make sure that you are using IL2CPP as the Scripting Backend
– Set the C++ Compiler Configuration to Debug
– Set the Target Architecture to ARM64
– Build a development build
Markers are the simplest form of annotation: a single point in time with a label that appears at the top of Streamline's Timeline view.
Channels provide a separate row of information alongside each thread. Annotations can be placed into a channel and, unlike a marker, each annotation spans a range of time.
Custom Activity Maps are the most advanced form of annotation. They are a mechanism for showing global (cross-thread) activities that may have complex dependencies. Each Custom Activity Map appears as its own view in the lower half of the Streamline UI.
Unity Package Link:
https://github.com/ARM-software/Tool-Solutions/blob/master/mobile-application-profiling/mobile-studio-with-unity/ArmMobileStudio.unitypackage?raw=true
Graphics Analyzer
Rendering debug
• API trace and debug tool supporting OpenGL ES 1.x, 2.x, 3.x, Vulkan, and OpenCL 1.1
• Graphics Analyzer (formerly Mali Graphics Debugger, MGD) lets developers trace OpenGL ES, Vulkan, and OpenCL API calls in their application and understand, frame by frame, their effect on the application, to help identify possible issues
• Android and Linux Arm-based target platforms are currently supported
• Graphics API debugger
  • Capture API calls
  • Capture input resources
  • Capture output rendering
• Step through frames
  • … by render pass
  • … by draw call
• Compare behavior with best practices
  • Streamline can show what is going wrong
  • Graphics Analyzer can help show why
[Screenshot: Graphics Analyzer UI with panels for trace outline, frame capture, vertex data, API calls, statistics, target state, shaders, textures, buffers, uniforms, …]
Outline & Trace view
• Quick access to key graphics events
  • Frames
  • Render passes
  • Draw calls
• Investigate at frame level
• Jump between the outline view and the trace view seamlessly
Uniforms and Vertex Attributes
• Uniforms and vertex attributes for each draw call
• When a draw call is selected, all the associated data is available
  • Uniforms: show uniform values, including samplers, matrices, and arrays
  • Vertex attributes: show all the vertex attributes, their names, and their positions
• This can be useful for debugging graphics issues
Shader reports and statistics
• All the shaders used by the application are reported
• Shader statistics
  • Number of instructions for each GPU pipeline
  • Number of work registers and uniform registers
  • How many times each shader has been executed
Frame Capture and Analysis
• Frames can be fully captured to analyze the effect of each draw call
• All the images can be exported and analyzed separately
Alternative Drawing Modes
• Native mode: frames are rendered with the original shaders
• Overdraw mode: highlights where overdraw happens (i.e. objects are drawn on top of each other)
• Shader map mode: native shaders are replaced with different solid colors
Mali Offline Compiler
Shader profiling
• Offline compiler for OpenGL ES shaders and OpenCL kernels
• Helps application engineers optimize their shaders for the Mali platform
• Ensures shaders compile properly and reduces online compile time
• You can download the Mali Offline Compiler from the link below:
https://developer.arm.com/products/software-development-tools/graphics-development-tools/mali-offline-compiler
• Language support for:
  • OpenGL ES ESSL
  • Vulkan SPIR-V
• Shader validation
  • Compilation warnings and errors
• Static performance analysis
  • Register usage and stack spilling
  • Performance breakdown by functional unit
• Use Graphics Analyzer to capture the API calls and shaders to understand the application's behavior
• Use the Mali Offline Compiler to profile instruction counts for ALU, L/S, and TEX
• If a shader needs more registers than are available, the GPU has to perform register spilling
• Register spilling causes significant inefficiency and higher Load/Store utilization
Mali_Offline_Compiler_v4.3.0$ ./malisc --core Mali-T600 --revision 0p0_15dev0 --driver Mali-T600_r4p0-00rel0 --vertex shader-176.vert -V
ARM Mali Offline Shader Compiler v4.3.0 (C) Copyright 2007-2014 ARM Limited. All rights reserved.
Compilation successful. 3 work registers used, 16 uniform registers used, spilling not used.
                A     L/S   T     Total   Bound
Cycles:         9     5     0     14      A
Shortest Path:  4.5   5     0     9.5     L/S
Longest Path:   4.5   5     0     9.5     L/S
Note: The cycles counts do not include possible stalls due to cache misses.
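One way to read the table: in every row the Total column equals the sum of the A (arithmetic), L/S (load/store), and T (texture) columns, and the Bound column names the largest of the three. The sketch below reproduces the report's figures on that reading; it is an interpretation that matches these numbers, not a documented description of how malisc computes them.

```python
def bound_unit(cycles):
    """Return the pipeline with the highest cycle count (first one wins on a tie)."""
    return max(cycles, key=cycles.get)


# Cycle counts copied from the report above.
report = {
    "Cycles":        {"A": 9.0, "L/S": 5.0, "T": 0.0},
    "Shortest Path": {"A": 4.5, "L/S": 5.0, "T": 0.0},
    "Longest Path":  {"A": 4.5, "L/S": 5.0, "T": 0.0},
}
for row, cycles in report.items():
    print(row, sum(cycles.values()), bound_unit(cycles))
```

The bound unit is the one to optimize first: here the shader is load/store bound on both the shortest and longest paths, so trimming arithmetic alone would not shorten either path.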
Performance Advisor
Continuous integration & reporting
Caveat: pre-release tool, all functionality is subject to change …
Workflow today
Measure FPS → Detect Slowdown → Manual Analysis → Art Fix / Game Fix / Engine Fix
Continuous integration: a better way!
Measure SoC Data Sources → Detect Slowdown → Automated Triage → Art Fix / Game Fix / Engine Fix / Difficult Problems
Performance Advisor
• An automated performance triage nurse
  • Move beyond simple FPS-based regression tracking
  • Perform an automated first-pass analysis
  • Generate an easy-to-read performance report
• Route common issues directly to the team to review
  • Free up performance experts to focus on the difficult problems
• Integrate into nightly continuous integration
  • Catch major issues early
  • Detect gradual regressions before they start impacting users
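The kind of check a nightly job can run on top of such reports is sketched below. It is a generic illustration, not how Performance Advisor works internally: the function name, the five-build window, the 5% tolerance, and the frame-time history are all assumptions.

```python
from statistics import mean


def detect_regression(frame_times_ms, window=5, tolerance=0.05):
    """Flag the latest nightly result if it exceeds the mean of the previous
    `window` results by more than `tolerance` (as a fraction)."""
    if len(frame_times_ms) <= window:
        return False  # not enough history to form a baseline
    baseline = mean(frame_times_ms[-window - 1:-1])
    return frame_times_ms[-1] > baseline * (1 + tolerance)


# Hypothetical average frame times (ms) from successive nightly builds.
history = [16.0, 16.1, 15.9, 16.0, 16.2, 16.1, 17.5]
print(detect_regression(history))
```

A rolling baseline like this catches sudden jumps; to catch slow drift, the same comparison can also be made against a fixed reference build.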
[Screenshots of Performance Advisor reports: High-level overview, FPS Analysis, Tailored region labeling, Region-by-region analysis]
Informed decisions
• UI Artist sees there is too much overdraw in
the Main Menu scene and fixes it
• Environment Artists keep an eye on texture
bottlenecks and reduce resolution
• Game Leads make decisions about issues that
will take time to solve
• No Graphics Programmers or Technical Artists
needed
Roadmap
• Ease-of-use improvements
  • Simplified installation and setup
  • One-click connect and data capture
  • Updated user documentation and tutorials
• New device testing
  • Conformance testing for SiPs and ODMs
  • Samsung Galaxy S10 series (Exynos 9820)
  • Huawei Mate 20 series (Kirin 980)
• Performance Advisor beta program
  • Register your interest online
  • http://developer.arm.com/mobile-studio
• Mali Offline Compiler improvements
  • Improved Bifrost reporting
  • Shader pipeline compilation
• Investigate RenderDoc for Vulkan API debug
An even better way?
Measure SoC Data Sources → Detect Slowdown → Automated Triage → Automated Frame Analysis → Art Fix / Game Fix / Engine Fix / Difficult Solutions
Arm Developer Resources
More Resources From Arm
Arm Developer Home:
https://developer.arm.com/
Vulkan Samples + Tutorials :
github.com/ARM-software/vulkan_best_practice_for_mobile_developers
Best-practice Guides :
developer.arm.com/graphics/developer-guides/mali-gpu-best-practices
Mobile Studio :
https://developer.arm.com/tools-and-software/graphics-and-gaming/arm-mobile-studio
Thank You
Danke
Merci
谢谢
ありがとう
Gracias
Kiitos
감사합니다
धन्यवाद
شكرًا
תודה
Confidential © 2019 Arm Limited
The Arm trademarks featured in this presentation are registered
trademarks or trademarks of Arm Limited (or its subsidiaries) in
the US and/or elsewhere. All rights reserved. All other marks
featured may be trademarks of their respective owners.
www.arm.com/company/policies/trademarks

More Related Content

What's hot

Deferred shading
Deferred shadingDeferred shading
Deferred shading
Frank Chao
 
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Johan Andersson
 
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
Johan Andersson
 

What's hot (20)

Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016Optimizing the Graphics Pipeline with Compute, GDC 2016
Optimizing the Graphics Pipeline with Compute, GDC 2016
 
Shiny PC Graphics in Battlefield 3
Shiny PC Graphics in Battlefield 3Shiny PC Graphics in Battlefield 3
Shiny PC Graphics in Battlefield 3
 
Beyond porting
Beyond portingBeyond porting
Beyond porting
 
Dissecting the Rendering of The Surge
Dissecting the Rendering of The SurgeDissecting the Rendering of The Surge
Dissecting the Rendering of The Surge
 
OpenGL 3.2 and More
OpenGL 3.2 and MoreOpenGL 3.2 and More
OpenGL 3.2 and More
 
Deferred shading
Deferred shadingDeferred shading
Deferred shading
 
Physically Based and Unified Volumetric Rendering in Frostbite
Physically Based and Unified Volumetric Rendering in FrostbitePhysically Based and Unified Volumetric Rendering in Frostbite
Physically Based and Unified Volumetric Rendering in Frostbite
 
Stochastic Screen-Space Reflections
Stochastic Screen-Space ReflectionsStochastic Screen-Space Reflections
Stochastic Screen-Space Reflections
 
Unite Berlin 2018 - Book of the Dead Optimizing Performance for High End Cons...
Unite Berlin 2018 - Book of the Dead Optimizing Performance for High End Cons...Unite Berlin 2018 - Book of the Dead Optimizing Performance for High End Cons...
Unite Berlin 2018 - Book of the Dead Optimizing Performance for High End Cons...
 
Lighting the City of Glass
Lighting the City of GlassLighting the City of Glass
Lighting the City of Glass
 
Taking Killzone Shadow Fall Image Quality Into The Next Generation
Taking Killzone Shadow Fall Image Quality Into The Next GenerationTaking Killzone Shadow Fall Image Quality Into The Next Generation
Taking Killzone Shadow Fall Image Quality Into The Next Generation
 
The Rendering Technology of Killzone 2
The Rendering Technology of Killzone 2The Rendering Technology of Killzone 2
The Rendering Technology of Killzone 2
 
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio [Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
[Unite Seoul 2019] Mali GPU Architecture and Mobile Studio
 
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
Frostbite Rendering Architecture and Real-time Procedural Shading & Texturing...
 
DirectX 11 Rendering in Battlefield 3
DirectX 11 Rendering in Battlefield 3DirectX 11 Rendering in Battlefield 3
DirectX 11 Rendering in Battlefield 3
 
NVIDIA OpenGL and Vulkan Support for 2017
NVIDIA OpenGL and Vulkan Support for 2017NVIDIA OpenGL and Vulkan Support for 2017
NVIDIA OpenGL and Vulkan Support for 2017
 
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
The Intersection of Game Engines & GPUs: Current & Future (Graphics Hardware ...
 
Terrain in Battlefield 3: A Modern, Complete and Scalable System
Terrain in Battlefield 3: A Modern, Complete and Scalable SystemTerrain in Battlefield 3: A Modern, Complete and Scalable System
Terrain in Battlefield 3: A Modern, Complete and Scalable System
 
NVIDIA OpenGL in 2016
NVIDIA OpenGL in 2016NVIDIA OpenGL in 2016
NVIDIA OpenGL in 2016
 
Advanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineAdvanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering Pipeline
 

Similar to [TGDF 2019] Mali GPU Architecture and Mobile Studio

Droidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imagination
Droidcon Berlin
 
Smedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphicsSmedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphics
changehee lee
 
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
Linaro
 

Similar to [TGDF 2019] Mali GPU Architecture and Mobile Studio (20)

[Unity Forum 2019] Mobile Graphics Optimization Guides
[Unity Forum 2019] Mobile Graphics Optimization Guides[Unity Forum 2019] Mobile Graphics Optimization Guides
[Unity Forum 2019] Mobile Graphics Optimization Guides
 
JIT Spraying Never Dies - Bypass CFG By Leveraging WARP Shader JIT Spraying.pdf
JIT Spraying Never Dies - Bypass CFG By Leveraging WARP Shader JIT Spraying.pdfJIT Spraying Never Dies - Bypass CFG By Leveraging WARP Shader JIT Spraying.pdf
JIT Spraying Never Dies - Bypass CFG By Leveraging WARP Shader JIT Spraying.pdf
 
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla MahGS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
GS-4106 The AMD GCN Architecture - A Crash Course, by Layla Mah
 
Supermicro’s Universal GPU: Modular, Standards Based and Built for the Future
Supermicro’s Universal GPU: Modular, Standards Based and Built for the FutureSupermicro’s Universal GPU: Modular, Standards Based and Built for the Future
Supermicro’s Universal GPU: Modular, Standards Based and Built for the Future
 
Droidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imaginationDroidcon2013 triangles gangolells_imagination
Droidcon2013 triangles gangolells_imagination
 
Smedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphicsSmedberg niklas bringing_aaa_graphics
Smedberg niklas bringing_aaa_graphics
 
Unity mobile game performance profiling – using arm mobile studio
Unity mobile game performance profiling – using arm mobile studioUnity mobile game performance profiling – using arm mobile studio
Unity mobile game performance profiling – using arm mobile studio
 
Anatomy of ROCgdb presentation at gcc cauldron 2022
Anatomy of ROCgdb presentation at gcc cauldron 2022Anatomy of ROCgdb presentation at gcc cauldron 2022
Anatomy of ROCgdb presentation at gcc cauldron 2022
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP Workshop
 
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018
Airframe: Lightweight Building Blocks for Scala - Scale By The Bay 2018
 
Flowframes
FlowframesFlowframes
Flowframes
 
JerryScript: An ultra-lighteweight JavaScript Engine for the Internet of Things
JerryScript: An ultra-lighteweight JavaScript Engine for the Internet of ThingsJerryScript: An ultra-lighteweight JavaScript Engine for the Internet of Things
JerryScript: An ultra-lighteweight JavaScript Engine for the Internet of Things
 
Gcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodesGcn performance ftw by stephan hodes
Gcn performance ftw by stephan hodes
 
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
HKG18-301 - Dramatically Accelerate 96Board Software via an FPGA with Integra...
 
Memory Management in TIZEN - Samsung SW Platform Team
Memory Management in TIZEN - Samsung SW Platform TeamMemory Management in TIZEN - Samsung SW Platform Team
Memory Management in TIZEN - Samsung SW Platform Team
 
Clang: More than just a C/C++ Compiler
Clang: More than just a C/C++ CompilerClang: More than just a C/C++ Compiler
Clang: More than just a C/C++ Compiler
 
Spi drivers
Spi driversSpi drivers
Spi drivers
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
 
GPU Design on FPGA
GPU Design on FPGAGPU Design on FPGA
GPU Design on FPGA
 
NVIDIA Graphics, Cg, and Transparency
NVIDIA Graphics, Cg, and TransparencyNVIDIA Graphics, Cg, and Transparency
NVIDIA Graphics, Cg, and Transparency
 

More from Owen Wu

More from Owen Wu (6)

Unreal Fest 2023 - Lumen with Immortalis
Unreal Fest 2023 - Lumen with ImmortalisUnreal Fest 2023 - Lumen with Immortalis
Unreal Fest 2023 - Lumen with Immortalis
 
COSCUP 2023 - Make Your Own Ray Tracing GPU with FPGA
COSCUP 2023 - Make Your Own Ray Tracing GPU with FPGACOSCUP 2023 - Make Your Own Ray Tracing GPU with FPGA
COSCUP 2023 - Make Your Own Ray Tracing GPU with FPGA
 
[Unite Seoul 2020] Mobile Graphics Best Practices for Artists
[Unite Seoul 2020] Mobile Graphics Best Practices for Artists[Unite Seoul 2020] Mobile Graphics Best Practices for Artists
[Unite Seoul 2020] Mobile Graphics Best Practices for Artists
 
[TGDF 2020] Mobile Graphics Best Practices for Artist
[TGDF 2020] Mobile Graphics Best Practices for Artist[TGDF 2020] Mobile Graphics Best Practices for Artist
[TGDF 2020] Mobile Graphics Best Practices for Artist
 
[GDC 2012] Enhancing Graphics in Unreal Engine 3 Titles Using AMD Code Submis...
[GDC 2012] Enhancing Graphics in Unreal Engine 3 Titles Using AMD Code Submis...[GDC 2012] Enhancing Graphics in Unreal Engine 3 Titles Using AMD Code Submis...
[GDC 2012] Enhancing Graphics in Unreal Engine 3 Titles Using AMD Code Submis...
 
[TGDF 2014] 進階Shader技術
[TGDF 2014] 進階Shader技術[TGDF 2014] 進階Shader技術
[TGDF 2014] 進階Shader技術
 


[TGDF 2019] Mali GPU Architecture and Mobile Studio

  • 1. Confidential © 2018 Arm Limited Owen Wu (owen.wu@arm.com) Developer Relations Engineer Mali GPU Architecture And Mobile Studio TGDF
  • 2. Agenda
      - GPU Architecture
      - Tile-based Rendering
      - Render Passes
      - Mobile Studio Introduction
      - Performance Analysis Workflow
      - Streamline
      - Graphics Analyzer
      - Offline Shader Compiler
      - Performance Advisor
  • 4. Bifrost Shader Core
      - Mali-G30, Mali-G50, and Mali-G70 series
      - Unified shader core architecture
      - Scales from a single core for low-end devices all the way up to 32 cores
      - L2 cache: typically in the range of 64-128 KB per shader core
      - Able to write one 32-bit pixel per core per clock
      - An 8-core design has a total of 256 bits of memory bandwidth (for both read and write) per clock cycle
  • 5. Bifrost Shader Core
      - Black blocks are fixed-function units
      - The execution core consists of 5 units
          - Execution Engine: executes shader code and arithmetic
          - Load/Store Unit: shader memory accesses
          - Varying Unit: fixed-function varying interpolator
          - ZS/Blend Unit: accesses to the tile memory
          - Texture Unit: memory accesses to do with textures
  • 6. Index-Driven Vertex Shading (IDVS)
      - Index-Driven Vertex Shading (IDVS) geometry processing pipeline
      - Without IDVS, the GPU processes all of the vertex shading before culling primitives, often wasting computation and bandwidth on vertices which are only used in culled triangles
      - IDVS splits the shader into two halves
          - Position shading runs before culling
          - Varying shading runs after it, for the visible vertices which survive culling
      - Partially deinterleave packed vertex buffers
          - Place attributes contributing to position in one packed buffer
          - Place attributes contributing to non-position varyings in a second packed buffer
          - Non-position attributes of culled vertices are then not pulled into the cache
  • 7. Forward Pixel Kill (FPK)
      - Early-Z and front-to-back rendering remove most overdraw
      - Pixels can be killed in flight if a future pixel will occlude them
      - Calculations already in flight can be terminated at any time if we spot that a later thread will write opaque data to the same pixel location
  • 9. Immediate Mode GPU
      - Traditional desktop GPU architecture
      - Vertex shaders and fragment shaders are executed in sequence on each primitive in each draw call

          for draw in renderPass:
              for primitive in draw:
                  for vertex in primitive:
                      execute_vertex_shader(vertex)
                  for fragment in primitive:
                      execute_fragment_shader(fragment)

  • 10. Immediate Mode GPU (diagram)
  • 11. Tile-Based GPU
      - Designed to minimize the amount of external memory accesses which are needed during rendering
      - Tile-based renderers split the screen into small pieces
          - Mali renders 16x16 tiles
      - Process fragment shading on each small tile
      - Write the result out to memory
      - Split each render pass into two distinct processing passes
          - The first executes all of the geometry-related processing and generates the tile lists
          - The second executes all of the fragment processing, tile by tile
  • 12. Tile-Based GPU
      - Vertex pass

          for draw in renderPass:
              for primitive in draw:
                  for vertex in primitive:
                      execute_vertex_shader(vertex)
                  append_tile_list(primitive)

      - Fragment pass

          for tile in renderPass:
              for primitive in tile:
                  for fragment in primitive:
                      execute_fragment_shader(fragment)

  • 13. Tile-Based GPU (diagram)
  • 15. Render Passes
      - Render passes are an essential concept
      - A render pass is a single execution of the rendering pipeline
          - Needs initializing in tile memory at the start of the render pass
          - May need writing back out to memory at the end of the render pass
      - Minimize the amount of memory traffic into and out of the tile memory
      - Avoid reading at the start of a render pass
      - Avoid writing at the end of a render pass
  • 16. Vertex Pass (pipeline diagram, in order)
      - Position Shading: do only position shading here, for all vertices
      - Facing Test Culling: cull back-facing polygons
      - Frustum Test Culling: cull polygons which are outside the frustum
      - Sample Test Culling: cull polygons which are smaller than a pixel
      - Varying Shading: do varying shading for surviving vertices
      - Polygon List: generate the polygon list and write it to external memory
  • 17. Fragment Pass (pipeline diagram, in order)
      - Polygon List of Tile: input to the rasterizer
      - Rasterizer: rasterize each polygon into many quads
      - Early ZS Test: kill the quad if it is occluded by other pixels (if possible)
      - Fragment Thread Creator: generate fragment threads from the quad
      - Execution Core (Process Engine 0 to Process Engine N): execute the fragment shader
      - Late ZS Test: kill the fragment if it is occluded by other pixels (when the fragment can't do the early ZS test)
      - Blender: blend into Tile RAM
      - Tile Write and Transaction Elimination: do compression to save bandwidth when writing to the Frame Buffer
  • 18. Render Passes in OpenGL
      - The OpenGL ES API has no explicit render passes at the API level
      - The driver must infer which rendering operations form a single render pass
      - Drawing commands are added to the current render pass
      - The render pass is submitted for processing when an API call changes the framebuffer or forces a flush of the queued work
  • 19. Render Passes in OpenGL
      - The most common causes for ending a render pass
          - Calling glBindFramebuffer() to change the GL_FRAMEBUFFER or GL_DRAW_FRAMEBUFFER target
          - Calling glFramebufferTexture*() or glFramebufferRenderbuffer() to change the attachments
          - Calling eglSwapBuffers()
          - Calling glFlush() or glFinish()
          - Creating a fence with glFenceSync() and then calling glClientWaitSync() to wait on it
  • 20. Efficient Render Passes
      - Process each render pass once
          - Bind each framebuffer object only once
          - Make all required draw calls before switching to the next context
          - Avoid unnecessary context switches
  • 21. Efficient Render Passes
      - Minimizing start-of-tile loads
          - The GPU can cheaply initialize the tile memory to a clear color value
          - Ensure that you clear or invalidate all of your attachments at the start of each render pass
          - You can use any of the following calls: glClear(), glClearBuffer*(), glInvalidateFramebuffer()
      - Minimizing end-of-tile stores
          - Avoid writing back to main memory whenever possible
          - You can notify the driver that an attachment is transient by marking its content as invalid with a call to glInvalidateFramebuffer() as the last "draw call"
  • 23. What is in the box?
      - Streamline
      - Graphics Analyzer
      - Mali Offline Compiler (separate download)
      - Performance Advisor (closed beta)
      - Download Arm Mobile Studio: http://developer.arm.com/mobile-studio
  • 24. Streamline Performance Analyzer
      - Mali GPU support
          - Analyze and optimize Mali™ GPU graphics and compute workloads
          - Accelerate your workflow using built-in analysis templates
      - Optimize for energy
          - Move beyond simple frame time and FPS tracking
          - Monitor overall usage of processor cycles and memory bandwidth
      - Speed up your app
          - Find out where the system is spending the most time
          - Tune code for cache efficiency
      - Application event trace
          - Annotate software workloads
          - Define logical event channel structure
          - Trace cross-channel task dependencies
      - Native code profiling
          - Break performance down by function
          - View cost alongside disassembly listing
      - Arm CPU support
          - Profile 32-bit and 64-bit apps for ARMv7-A and ARMv8-A cores
          - Tune multi-threading for DynamIQ multi-core systems
      - Tune your rendering
          - Identify critical-path GPU shader core resources
          - Detect content inefficiency
  • 25. Graphics Analyzer: GPU API Debugger
      - Shader analysis
          - Capture and view all shaders used
          - Optimize shader performance using the integrated Mali Offline Compiler
      - Cross platform
          - Host support for Windows, macOS, and Linux
          - Target support for any Android GPU
      - Rendering API debug
          - Graphics debug for content developers
          - Support for all versions of OpenGL ES and Vulkan
      - Android utility app
          - Manage on-device connection
          - Select and launch user application
      - Visual analysis views
          - Native mode
          - Overdraw mode
          - Shader map mode
          - Fragment count mode
      - State visibility
          - Show API state after every API call
          - Trace backwards from point-of-use to the API call responsible for the state set
      - Frame analysis
          - Diagnose root causes of rendering errors
          - Identify sources of rendering inefficiency
  • 26. Mali Offline Compiler: Shader static analysis
      - Rapid iteration
          - Verify the impact of shader changes without needing a whole application rebuild
      - Profile for any Mali GPU
          - Cost shader code for every Mali GPU without needing hardware
      - Mali GPU aware
          - Support for all actively shipping Mali GPUs
          - Cycle counts reflect the specific microarchitecture
      - Critical path analysis
          - Identify the dominant shader resource
          - Target this for optimization!
      - Control flow aware
          - Best case control flow
          - Worst case control flow
      - Syntax verification
          - Verify correctness of code changes
          - Receive clear error diagnostics for invalid shaders
      - Register usage
          - Work registers
          - Uniform registers
          - Stack spilling
  • 27. Performance Advisor: Analysis Reports
      - Fast workflow
          - Integrate data capture and analysis into a nightly CI test workflow
          - Read results over a nice cup of tea
      - Caveats …
          - Still under development
          - Currently in closed beta
      - Overview charts
          - See performance trends over time
          - See region splits
      - Region views
          - Split by dynamic behavior
          - Split by application annotation
      - Executive dashboard
          - Show high-level status summary
          - Show status breakdown by regions of interest
      - Summary reports
          - Easy-to-use performance status reports
          - Integrated initial root cause analysis
  • 28. Device requirements
      - Not all devices provide the necessary performance data
      - CPU data requirements:
          - Kernel 4.4 or higher
          - Kernel perf config includes CPU PMU
      - GPU data requirements:
          - Supported Mali GPU and driver
      - Supported device list can be found online*
          - Other devices may work!
          - Conformance tests for ODMs under development
      - Example devices: Samsung Galaxy S9 (Exynos version), Oppo R15, Huawei Mate 10 Pro
      - * https://developer.arm.com/products/software-development-tools/arm-mobile-studio/support/supported-devices
  • 30. Analysis workflow
      - Workflow steps: Measure, Triage, Identify hot spots, Determine probable cause, Optimize
      - Optimization is data-driven science
          - Avoid guesswork!
          - Use data to identify root causes
      - Disciplined workflow required
          - Consistent methodology
          - Reliable underlying data
      - Particularly important for GPU analysis
          - Graphics API provides a thick abstraction
          - Many interacting behaviors in the GPU hardware
          - Tools are essential to make this efficient
  • 31. Which tool for which task
      - Analyze … with Streamline
      - Improve shaders … with Mali Offline Compiler
      - Tune rendering … with Graphics Analyzer
      - Optimize code … with Streamline
      - Monitor … with Performance Advisor
  • 33. Full system view
      - Streamline provides an at-a-glance overview
      - Time-correlated views of:
          - CPU activity
          - CPU counters
          - GPU counters
          - Thermal sensors
          - Thread scheduling
      - Main bottlenecks are usually clearly visible
      - Detailed data available when needed
  • 34. Streamline (screenshot)
  • 35. Event-based sampling
      - Software profiling defaults to time-based sampling
          - PC sampled every millisecond
      - Event-based sampling allows triggering on events
          - Select a CPU PMU event to use as a trigger
          - Set a threshold for triggering, e.g. every 10,000 events
          - PC sampled every N events
      - Can be used for cache-aware profiling
          - Sample on L1 or L2 cache refill
      - Can be used for bandwidth-aware profiling
          - Sample on external bus access
  • 36. Basic Performance Analysis
      - Can load template charts
      - Custom charts
          - Can use custom expressions
      - From big picture to details
  • 37. Triage nurse scenarios
      - Vsync bound
      - Fragment bound
      - CPU bound
      - Serialization problems
      - Thermally bound
      - Diagram label: 16.6 ms
  • 38. Custom Counter Samples
      - Pixels per Primitive
          - (($MaliTilesTilesRendered * 256) / $MaliTilerCullingVisible)
      - Fragments per Pixel
          - (($MaliQuadsFragmentQuadsShaded * 4) / ($MaliTilesTilesRendered * 256))
      - Fragment Cycles per Pixel
          - ($MaliCoreCyclesFragmentCycles / ($MaliTilesTilesRendered * 256))
  • 39. Streamline Annotations
      - Let you instrument your source code by adding annotations to it
      - Unity package
          - Make sure that you are using IL2CPP as the Scripting Backend
          - Set the C++ Compiler Configuration to Debug
          - Set the Target Architecture to ARM64
          - Build a development build
  • 40. Streamline Annotations
      - Markers are the simplest form of annotation: a single point in time with a label that appears at the top of Streamline's Timeline view.
      - Channels provide a separate row of information alongside each thread. Annotations can be placed into a channel, and unlike a marker, each annotation spans a range of time.
      - Custom Activity Maps are the most advanced form of annotation and are a mechanism for showing global (cross-thread) activities that may have complex dependencies. Each Custom Activity Map appears as its own view in the lower half of the Streamline UI.
  • 41. Streamline Annotations
      - Unity package link: https://github.com/ARM-software/Tool-Solutions/blob/master/mobile-application-profiling/mobile-studio-with-unity/ArmMobileStudio.unitypackage?raw=true
  • 43. Graphics Analyzer
      - Supports OpenGL ES 1.x, 2.x, 3.x, Vulkan and OpenCL 1.1
      - API trace and debug tool
      - MGD (Mali Graphics Debugger, the tool's earlier name) allows developers to trace OpenGL ES, Vulkan and OpenCL API calls in their application and understand, frame by frame, the effect on the application to help identify possible issues
      - Android and Linux Arm-based target platforms are currently supported
  • 44. Graphics Analyzer
      - Graphics API debugger
          - Capture API calls
          - Capture input resources
          - Capture output rendering
      - Step through frames
          - … by render pass
          - … by draw call
      - Compare behavior with best practices
          - Streamline can show what is going wrong
          - Graphics Analyzer can help show why
  • 45. Screenshot labels: Trace outline; Frame capture; Vertex data; API calls; Statistics; Target state; Shaders; Textures, Buffers, Uniforms, …
  • 46. Outline & Trace view
      - Quick access to key graphics events
          - Frames
          - Render passes
          - Draw calls
      - Investigate at frame level
      - Jump between the outline view and the trace view seamlessly
  • 47. Uniforms and Vertex Attributes
      - Uniforms and vertex attributes for each draw call
      - When a draw call is selected, all the associated data is available
      - Uniforms: show uniform values, including samplers, matrices and arrays
      - Vertex attributes: show all the vertex attributes, their names and their positions
      - This can be useful to debug graphics issues
  • 48. Shader reports and statistics
      - All the shaders being used by the application are reported
      - Shader statistics
          - Number of instructions for each GPU pipeline
          - Number of work registers and uniform registers
          - How many times that shader has been executed
  • 49. Frame Capture and Analysis
      - Frames can be fully captured to analyze the effect of each draw call
      - All the images can be exported and analyzed separately
  • 50. Alternative Drawing Modes
      - Native mode: frames are rendered with the original shaders
      - Overdraw mode: highlights where overdraw happens (i.e. objects are drawn on top of each other)
      - Shader map mode: native shaders are replaced with different solid colors
  • 52. Mali Offline Compiler
      - Offline compiler for OpenGL ES shaders and OpenCL kernels
      - Helps application engineers optimize their shaders on the Mali platform
      - Ensures shaders compile properly and reduces online compiling time
      - You can download the Mali Offline Compiler from: https://developer.arm.com/products/software-development-tools/graphics-development-tools/mali-offline-compiler
  • 53. Mali Offline Compiler
      - Language support for:
          - OpenGL ES ESSL
          - Vulkan SPIR-V
      - Shader validation
          - Compilation warnings and errors
      - Static performance analysis
          - Register usage and stack spilling
          - Performance breakdown by functional unit
  • 54. Mali Offline Compiler
      - Use Graphics Analyzer to capture the API calls and shaders to understand the application's behavior
      - Use the Offline Shader Compiler to profile instruction counts for ALU, L/S and TEX
      - If the shader needs more registers than are available, the GPU has to perform register spilling
      - Register spilling causes big inefficiencies and higher Load/Store utilization
      - Example session:

          Mali_Offline_Compiler_v4.3.0$ ./malisc --core Mali-T600 --revision 0p0_15dev0 --driver Mali-T600_r4p0-00rel0 --vertex shader-176.vert -V
          ARM Mali Offline Shader Compiler v4.3.0
          (C) Copyright 2007-2014 ARM Limited. All rights reserved.
          Compilation successful.
          3 work registers used, 16 uniform registers used, spilling not used.

                          A     L/S   T     Total   Bound
          Cycles:         9     5     0     14      A
          Shortest Path:  4.5   5     0     9.5     L/S
          Longest Path:   4.5   5     0     9.5     L/S

          Note: The cycles counts do not include possible stalls due to cache misses.
  • 55. Performance Advisor: Continuous integration & reporting
      - Caveat: pre-release tool, all functionality is subject to change …
  • 56. Workflow today
      - Diagram labels: Measure FPS; Detect Slowdown; Manual Analysis; Art Fix; Game Fix; Engine Fix
  • 57. Continuous integration: a better way!
      - Diagram labels: Measure SoC Data Sources; Detect Slowdown; Automated Triage; Art Fix; Game Fix; Engine Fix; Difficult Problems
  • 58. Performance Advisor
      - An automated performance triage nurse
          - Move beyond simple FPS-based regression tracking
          - Perform an automated first-pass analysis
          - Generate an easy-to-read performance report
      - Route common issues directly to the team to review
          - Free up performance experts to focus on the difficult problems
      - Integrate into nightly continuous integration
          - Catch major issues early
          - Detect gradual regressions before they start impacting users
  • 59. High-level overview (screenshot)
  • 60. FPS Analysis (screenshot)
  • 61. Tailored region labeling (screenshot)
  • 62. Region-by-region analysis (screenshot)
  • 63. Informed decisions
      - UI Artist sees there is too much overdraw in the Main Menu scene and fixes it
      - Environment Artists keep an eye on texture bottlenecks and reduce resolution
      - Game Leads make decisions about issues that will take time to solve
      - No Graphics Programmers or Technical Artists needed
  • 65. Roadmap
      - Ease of use improvements
          - Simplified installation and setup
          - One-click connect and data capture
          - Updated user documentation and tutorials
      - New device testing
          - Conformance testing for SiPs and ODMs
          - Samsung Galaxy S10 series (Exynos 9820)
          - Huawei Mate 20 series (Kirin 980)
      - Performance Advisor beta program
          - Register your interest online
          - http://developer.arm.com/mobile-studio
      - Mali Offline Compiler improvements
          - Improved Bifrost reporting
          - Shader pipeline compilation
      - Investigate RenderDoc for Vulkan API debug
  • 66. An even better way?
      - Diagram labels: Measure SoC Data Sources; Detect Slowdown; Automated Triage; Art Fix; Game Fix; Engine Fix; Difficult Solutions; Automated Frame Analysis
  • 68. More Resources From Arm
      - Arm Developer Home: https://developer.arm.com/
      - Vulkan Samples + Tutorials: github.com/ARM-software/vulkan_best_practice_for_mobile_developers
      - Best-practice Guides: developer.arm.com/graphics/developer-guides/mali-gpu-best-practices
      - Mobile Studio: https://developer.arm.com/tools-and-software/graphics-and-gaming/arm-mobile-studio
  • 70. The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. www.arm.com/company/policies/trademarks
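
The pseudocode on slides 9 and 12 can be turned into a small runnable sketch that makes the difference in work ordering visible. This is only an illustration under stated assumptions: the dictionary layout for primitives, the helper names `tile_of`, `immediate_mode` and `tile_based`, and the sample data are all hypothetical, and real hardware bins primitives by coverage rather than by a precomputed fragment list. The 16x16 tile size is the one given on slide 11.

```python
from collections import defaultdict

TILE_SIZE = 16  # Mali tile edge in pixels, as stated on slide 11


def tile_of(fragment):
    """Map a fragment's (x, y) pixel position to its tile coordinate."""
    x, y = fragment
    return (x // TILE_SIZE, y // TILE_SIZE)


def immediate_mode(render_pass):
    """Slide 9: vertex and fragment work interleaved, primitive by primitive."""
    work = []
    for draw in render_pass:
        for primitive in draw:
            for vertex in primitive["vertices"]:
                work.append(("vertex", vertex))
            for fragment in primitive["fragments"]:
                work.append(("fragment", fragment))
    return work


def tile_based(render_pass):
    """Slide 12: all vertex work first, then fragment work tile by tile."""
    work = []
    tile_lists = defaultdict(list)

    # Vertex pass: shade every vertex and append each primitive to the
    # list of every tile it touches.
    for draw in render_pass:
        for primitive in draw:
            for vertex in primitive["vertices"]:
                work.append(("vertex", vertex))
            for tile in {tile_of(f) for f in primitive["fragments"]}:
                tile_lists[tile].append(primitive)

    # Fragment pass: walk the tiles, shading only the fragments that
    # fall inside the tile currently being processed.
    for tile in sorted(tile_lists):
        for primitive in tile_lists[tile]:
            for fragment in primitive["fragments"]:
                if tile_of(fragment) == tile:
                    work.append(("fragment", fragment))
    return work


# Hypothetical render pass: one draw call with two primitives.
render_pass = [[
    {"vertices": ["a", "b", "c"], "fragments": [(1, 1), (20, 1)]},
    {"vertices": ["d", "e", "f"], "fragments": [(2, 2)]},
]]

if __name__ == "__main__":
    print(immediate_mode(render_pass))
    print(tile_based(render_pass))
```

In the tile-based ordering, the two fragments that land in tile (0, 0) are shaded together even though they come from different primitives, which is what lets the working set stay in tile memory.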

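
The custom counter expressions on slide 38 are plain arithmetic, so they can be checked outside Streamline. The sketch below restates them as Python functions; the function names and the sample counter values are made up for illustration, while the constants come from the slides: 256 is the number of pixels in a 16x16 tile and 4 is the number of fragments in a 2x2 quad.

```python
PIXELS_PER_TILE = 16 * 16   # 256, from the 16x16 tile size
FRAGMENTS_PER_QUAD = 2 * 2  # 4, from the 2x2 quad size


def pixels_per_primitive(tiles_rendered, culling_visible):
    """($MaliTilesTilesRendered * 256) / $MaliTilerCullingVisible"""
    return (tiles_rendered * PIXELS_PER_TILE) / culling_visible


def fragments_per_pixel(fragment_quads_shaded, tiles_rendered):
    """($MaliQuadsFragmentQuadsShaded * 4) / ($MaliTilesTilesRendered * 256)"""
    return (fragment_quads_shaded * FRAGMENTS_PER_QUAD) / (tiles_rendered * PIXELS_PER_TILE)


def fragment_cycles_per_pixel(fragment_cycles, tiles_rendered):
    """$MaliCoreCyclesFragmentCycles / ($MaliTilesTilesRendered * 256)"""
    return fragment_cycles / (tiles_rendered * PIXELS_PER_TILE)


if __name__ == "__main__":
    # Hypothetical counter values for one frame, not measured data.
    tiles = 8160
    visible_primitives = 20000
    quads = 1044480
    cycles = 20889600

    print(pixels_per_primitive(tiles, visible_primitives))
    print(fragments_per_pixel(quads, tiles))
    print(fragment_cycles_per_pixel(cycles, tiles))
```

A fragments-per-pixel value well above 1 suggests overdraw, and a low pixels-per-primitive value suggests geometry that is too dense for its screen coverage.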
Editor's Notes

  1. Our latest GPU architecture is Bifrost, and the Mali G30, G50 and G70 series use it. It's a unified architecture, which means one shader core handles both vertex and fragment shaders. The number of shader cores can scale from a single core all the way up to 32. There is an L2 cache to reduce latency, typically in the range of 64 to 128 KB per shader core. This architecture can write one 32-bit pixel per core per clock, so an 8-core design can write 256 bits of pixel data per clock.
  2. This is the block model of the shader core. In the picture, the black blocks are fixed-function blocks. Every shader core contains an execution core which consists of 5 units. The first one is the execution engine, which is responsible for executing shader code and provides the arithmetic processing power. An execution core may contain multiple execution engines. The load/store unit is responsible for shader memory access. The varying unit is responsible for varying interpolation; it uses the same arithmetic design as the execution engine. The ZS/blend unit is responsible for tile-memory access. The texture unit is responsible for any memory access to do with textures. It's separate from the load/store unit.
  3. There are a few new features in the Bifrost architecture, and IDVS is one of them. Usually, the GPU processes all the vertex shading before culling primitives, which often wastes computation power and bandwidth on vertices that are only used in culled primitives. IDVS splits the shader into two parts: one for position shading before culling, and one for varying shading after culling. This way, we save compute power and bandwidth. It's done by the driver and hardware, so it's basically invisible to the developer. But the developer can do something to make IDVS more efficient: put the position-related attributes in one buffer and the other attributes in another buffer. Then the non-position attributes of culled vertices will not be pulled into the cache, so the cache can store more vertex data.
  4. The other new feature I want to mention is forward pixel kill. Using early-Z and front-to-back rendering can remove most overdraw, and on top of that FPK can kill a pixel thread in flight if we find that the pixel will be occluded. So a calculation thread already in flight can be terminated at any time if we spot that a later thread will write opaque data to the same pixel location.
  5. Traditional desktop GPU architectures usually use immediate rendering mode. That means vertex shaders and fragment shaders are executed in sequence on each primitive in each draw call. Here is the pseudocode for immediate rendering mode.
  6. The picture shows the traditional immediate rendering mode pipeline. You can see there is a lot of external memory access at the fragment stage. For mobile devices, this kind of external memory access is quite bad for energy efficiency.
  7. Most modern mobile GPUs use tile-based rendering mode. It is designed to minimize the amount of external memory accesses which are needed during rendering. Tile-based renderers split the screen into small pieces; the tile size on Mali is 16x16. The GPU processes fragment shading on each small tile, then writes the tile result out to external memory when the tile is finished. One big difference from immediate rendering mode is that the GPU splits each render pass into two distinct processing passes. One is the vertex shading pass, which generates the tile lists. The other is the fragment shading pass, which executes fragment shading tile by tile.
  8. And this is the pseudocode for tile-based rendering. The GPU executes the vertex pass first for all primitives in one render pass and finally generates a tile list. Then the GPU executes the fragment pass for each render pass, tile by tile.
  9. The picture shows the pipeline of tile-based rendering mode. As you can see, tile-based rendering moves the memory access to tile memory instead of external memory, which is much faster and more energy efficient. One thing to mention is that there is an extra external memory read and write for the tile list, which an immediate mode GPU doesn't have. So an extra geometry pass on a mobile device is usually more expensive than on desktop.
  10. Render passes are essential concepts for tile-based renders and it is a single execution of the rendering pipeline, rendering a single output image into a set of framebuffer. Each render pass needs initializing in tile memory at the start of the render pass and writing back to external memory at the end of the render pass. To get the most benefit from the tile-based rendering approach it is critical that applications minimize the amount of memory traffic in to and out of the tile memory. That means avoiding reading in older framebuffer values at the start of a render pass and avoiding writing out values at the end of each render pass.
  11. Here is an animation of how a tile-based GPU works. Rendered primitives go through position shading first. After position shading, the primitives must pass the facing test; any back-facing primitives are culled here. Next is the frustum test, where primitives outside the viewport are culled. Then comes the sample test, where any primitives smaller than one pixel are culled. Finally, all surviving primitives go through varying shading and are added to the tile list.
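The order of the three culling tests can be sketched in Python on screen-space triangles. This is a simplified model, not the hardware algorithm: it assumes that a positive signed area means front-facing, that the frustum test reduces to a bounding-box check against the viewport, and that the sample test rejects a triangle whose bounding box covers no pixel centre. The names `cull_stage` and `covers_sample` are hypothetical.

```python
import math


def signed_area(tri):
    """Signed area of a 2D triangle; the sign encodes the winding order."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return 0.5 * ((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))


def covers_sample(lo, hi):
    """True if the interval [lo, hi] contains a pixel centre (i + 0.5)."""
    return math.ceil(lo - 0.5) + 0.5 <= hi


def cull_stage(tri, width, height):
    """Return the name of the test that culls tri, or None if it survives."""
    if signed_area(tri) <= 0:  # assumption: positive area is front-facing
        return "facing"
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    if max(xs) < 0 or min(xs) > width or max(ys) < 0 or min(ys) > height:
        return "frustum"
    if not (covers_sample(min(xs), max(xs)) and covers_sample(min(ys), max(ys))):
        return "sample"
    return None  # survives: varying shading runs, then tile-list insertion


visible = cull_stage([(0, 0), (10, 0), (0, 10)], 100, 100)
back_facing = cull_stage([(0, 0), (0, 10), (10, 0)], 100, 100)
off_screen = cull_stage([(200, 200), (210, 200), (200, 210)], 100, 100)
tiny = cull_stage([(10.6, 10.6), (10.9, 10.6), (10.6, 10.9)], 100, 100)
```

Only the first triangle reaches varying shading in this model; the other three are rejected after position shading, which is where IDVS saves the varying work.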
  12. After the vertex pass of a render pass has finished, the GPU can run the fragment pass for that render pass. The rasterizer rasterizes the primitives into quads, each 2x2 pixels in size. All quads then go through the early Z test before moving to the next stage. However, if the fragment shader uses discard, uses alpha to coverage, or writes depth, the quad skips the early Z test and goes through the late Z test instead. The early Z test kills any quad that is fully occluded by other pixels. After early Z, the fragment thread creator creates 4 fragment threads per quad, and the fragment threads are dispatched to the execution engines. For fragments that cannot take advantage of early Z, the late Z test is done at this point to kill the fragments that are occluded. Finally, the surviving fragments are blended and written to tile memory. After all primitives in the tile list have been processed, the tile writer writes the tile result back to external memory. Mali has a Transaction Elimination feature that can skip writing back tiles whose content has not changed, which saves bandwidth.
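The cost difference between early and late Z can be illustrated with a rough Python model. It is a sketch under heavy simplifications: each quad is treated as a single position with a single depth value, smaller depth means closer, and the function name `count_shaded_quads` is hypothetical.

```python
def count_shaded_quads(quads, skips_early_z):
    """Count how many quads run the fragment shader.

    quads is a list of (position, depth) in submission order.
    skips_early_z models a shader that uses discard, uses alpha to
    coverage, or writes depth, which forces the late Z path.
    """
    nearest = {}  # closest depth seen so far at each quad position
    shaded = 0
    for pos, depth in quads:
        if not skips_early_z:
            if depth >= nearest.get(pos, float("inf")):
                continue  # killed by early Z before any shading happens
            nearest[pos] = depth
            shaded += 1
        else:
            shaded += 1  # the shader must run before depth is resolved
            if depth < nearest.get(pos, float("inf")):
                nearest[pos] = depth  # late Z keeps the closest result
    return shaded


# Three hypothetical overlapping quads, submitted front to back.
front_to_back = [((0, 0), 0.2), ((0, 0), 0.5), ((0, 0), 0.8)]
early = count_shaded_quads(front_to_back, skips_early_z=False)
late = count_shaded_quads(front_to_back, skips_early_z=True)
```

With front-to-back submission the early Z path shades one quad, while the late Z path shades all three even though only the first one is visible.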
  13. The OpenGL ES API, however, has no explicit render passes at the API level, so the driver must infer which rendering operations form a single render pass. How does the OpenGL ES driver form a render pass? Basically, the driver adds drawing commands to the current render pass, and the render pass is submitted for processing when an API call changes the framebuffer or forces a flush of the queued work.
  14. Here I have listed the most common causes for ending a render pass. Whenever you call one of these APIs, the driver ends the current render pass and submits it.
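The inference rule described above can be modelled with a short Python sketch. This is a simplified illustration, not the actual driver logic: the call stream uses hypothetical tuples ("bind", "draw", "flush") rather than real OpenGL ES entry points, and only a framebuffer change or an explicit flush ends a pass.

```python
def split_render_passes(calls):
    """Group a call stream into render passes as (framebuffer, draw_count)."""
    passes = []
    bound = 0   # 0 stands for the default framebuffer
    draws = 0   # draw calls queued in the current render pass
    for call in calls:
        op = call[0]
        if op == "draw":
            draws += 1
        elif op == "bind" and call[1] != bound:
            if draws:
                passes.append((bound, draws))  # framebuffer change ends the pass
            bound, draws = call[1], 0
        elif op == "flush":
            if draws:
                passes.append((bound, draws))  # flush submits the queued work
            draws = 0
    if draws:
        passes.append((bound, draws))
    return passes


# Switching back and forth between two framebuffer objects.
ping_pong = [("bind", 1), ("draw",), ("bind", 2), ("draw",),
             ("bind", 1), ("draw",), ("bind", 2), ("draw",)]
# The same four draws, grouped so that each framebuffer is bound once.
grouped = [("bind", 1), ("draw",), ("draw",),
           ("bind", 2), ("draw",), ("draw",)]

ping_pong_passes = split_render_passes(ping_pong)
grouped_passes = split_render_passes(grouped)
```

The ping-pong stream produces four render passes and the grouped stream produces two for the same draw calls, so in this model each framebuffer is initialized in tile memory and written back twice as often.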
  15. Now that we know how render passes work, how can we write code that uses them efficiently? There are a few things you can do. First, process each render pass only once, which means binding each framebuffer object only once. Don’t switch framebuffer objects back and forth. Make all the required draw calls before switching to the next context, and avoid unnecessary context switches.
  16. Second, minimize the load from the previous framebuffer contents at the start of a render pass. There are two ways to avoid the tile load. The first is to call glClear() to cheaply initialize the tile memory to a clear color value. The second is to call glInvalidateFramebuffer() to hint to the driver that it does not need to load data from external memory. Third, minimize the store to the framebuffer at the end of a render pass. Avoid writing back to main memory whenever possible. You can notify the driver that an attachment is transient by marking its content as invalid with a call to glInvalidateFramebuffer() as the last "draw call".
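A back-of-the-envelope calculation shows why the tile loads and stores matter. The numbers below are assumptions chosen for illustration only (a 1920x1080 target, 4 bytes per pixel for both the color and the depth/stencil attachment, 60 frames per second), and the estimate ignores framebuffer compression and Transaction Elimination.

```python
WIDTH, HEIGHT = 1920, 1080   # assumed render target size
BYTES_PER_PIXEL = 4          # assumed for both color and depth/stencil
FPS = 60                     # assumed frame rate


def attachment_bytes(width, height, bytes_per_pixel):
    """Bytes moved by one full load or store of one attachment."""
    return width * height * bytes_per_pixel


def traffic_per_frame(transfers):
    """External memory traffic for a given number of loads plus stores."""
    return transfers * attachment_bytes(WIDTH, HEIGHT, BYTES_PER_PIXEL)


# Naive pass: load color, load depth, store color, store depth.
naive = traffic_per_frame(4)
# Tuned pass: both attachments cleared at the start (no loads) and depth
# invalidated at the end, so only the color store remains.
tuned = traffic_per_frame(1)

saved_per_second = (naive - tuned) * FPS
```

Under these assumptions the tuned render pass moves about 8.3 MB per frame instead of about 33.2 MB, which saves roughly 1.49 GB/s of external memory traffic at 60 frames per second.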
  17. Mobile Studio consists of four component tools, although at the moment only two are actually in the public tool bundle: Streamline, a system profiler for CPU and GPU performance, and Graphics Analyzer, an API debugger for the OpenGL ES and Vulkan rendering APIs. In addition we have Mali Offline Compiler, a syntax checker and static analysis tool for GPU shader programs, which is currently available as a separate download, and Performance Advisor, a new tool which places automated performance analysis into a continuous integration workflow. Performance Advisor is still in development in a closed beta, but expect to see it join the Studio release early next year.
  18. Android does not mandate a guaranteed level of data access for tooling, so not every device will provide all desirable data sources off the shelf today. We are publishing a list of supported devices online, which are currently the ones we test internally. Expect this device list to grow over time, and if you are a device manufacturer please come and talk to us about tooling conformance testing and how you can get more devices on to that list.
  19. Streamline annotations let you instrument your source code. The annotation library is written in C++, and we have implemented a C# wrapper so that developers can now import it as a Unity package. To use Streamline annotations, your Unity project needs some setup. First, make sure that you are using IL2CPP as the scripting backend. Second, set the C++ compiler configuration to Debug. Third, set the Target Architecture to ARM64; we currently support ARM64 only. Finally, build a development build APK.
  20. There are 3 types of annotation. The first is the marker, which is the simplest form of annotation. It is just a single point in time with a label that appears at the top of Streamline’s timeline view. As you can see in the picture, the green labels at the top are markers. The second is the channel. Many annotations can be placed into a channel and, unlike a marker, each annotation spans a range of time. You can use channels to mark the total time the game spends on particular operations. In the picture, the yellow and blue labels represent different annotations in the same channel. The last is the Custom Activity Map, which is the most advanced form of annotation. You can think of it as a structured map containing many channels, as the picture shows. You can group as many channels as you want into one map. The map appears as its own view in the lower half of the Streamline UI.
  21. You can download the Unity package here.
  22. This is how the tool looks. All of the views are customizable, so you can show only the data you need per API. API Trace lists every single call that you make to your chosen API; the count can get into the millions quite easily. Dynamic Help is static analysis: our experts have compiled a list of things to watch out for, so it gives you pointers. Textures and Shaders show every single asset in your application, and we run the shaders through the offline compiler, which makes them easily sortable. Frame Outline lets you navigate quickly across the whole trace to find your problem area fast.
  23. The Outline view shows all the frames and draw calls that are rendered. You can select any draw call, or in fact any API call, and see the state of the application at that point. You can investigate at frame level and find out which draw calls have a higher geometry impact.
  24. This window shows the content of the uniforms and vertex attributes for each draw call. When a draw call is selected, all of its associated data is available. The Uniforms tab shows uniform values, including samplers, matrices and arrays. The Vertex Attributes tab shows all the vertex attributes with their names and positions. You can also view the rendered mesh in 3D here, which can be useful for debugging graphics issues.
  25. The shader reports and statistics window is also very useful. All the shaders used by the application are reported. Shaders are compiled with the Mali Offline Compiler, and the number of instructions, work registers and uniform registers is reported here. The number of times each shader has been executed can also be reported here.
  26. A native resolution snapshot of each framebuffer is captured after every draw call. The capture happens on target, so even target dependent bugs or precision issues can be investigated.
  27. Graphics Analyzer also has a few different drawing modes to help you debug the application. Native mode renders with the original shaders. Overdraw mode highlights where overdraw happens. Shader map mode shows each shader in a different color, which helps you identify a particular shader.
  28. This report is based on an example application created for demonstration purposes. The first thing the user sees is the summary at the very top of the report, which gives a high-level overview of what is reported below. It shows a pie chart of what the application was bound by throughout the capture, along with the average FPS for the application. In this particular capture we can see that the average FPS was 42, below the target of 60. Looking at the pie chart, we can see that for around three quarters of this capture the application was either CPU or vertex bound. At first glance, then, the report clearly shows that performance improvements could be made and gives some direction on where the user can start focusing their attention. Vsync bound is the ideal state and represents a well-running application.
  29. The FPS analysis graph gives a clearer view of where in the application the user might be able to make performance improvements, so they can start looking deeper into the issues highlighted in the summary. It clearly highlights the bound areas and plots FPS and overdraw, so the user can see whether there is any correlation between them. In this graph there is an obvious connection: FPS is low and overdraw is high while the application is CPU or vertex bound, and this swaps as it moves to being vsync bound. The graph is also interactive, allowing the user to toggle overdraw and FPS on and off, which helps when the user only wants to focus on the bound areas or when FPS and overdraw are closely plotted.
  30. This screen shows a Space Ape capture with nested regions. We can see they have identified a loading screen, an introduction scene and the first 10s of game play. The bound analysis of the loading screen might not be of much interest here, since the time taken will probably be more important, but we can see that in the first 10s of game play the application goes CPU bound and FPS drops significantly. Add hover on region.
  31. Each defined region has its own analysis section with advice and links to further actions that can be taken. All of this information is packaged into one report, which can be integrated into CI systems or run manually, and which reduces the reliance on technical experts spending long periods of time determining why applications have performance issues. This enables teams to move forward, empowering them with the deeper knowledge needed to understand where the application needs attention. In turn, it frees up the individual expert to concentrate on other areas.
  32. You can find more resources on our website. If you are interested in Vulkan, you should check this link, which has a lot of useful information and samples.