The Arm trademarks featured in this presentation are registered
trademarks or trademarks of Arm Limited (or its subsidiaries) in
the US and/or elsewhere. All rights reserved. All other marks
featured may be trademarks of their respective owners.
www.arm.com/company/policies/trademarks
Editor's Notes
Our latest GPU architecture is Bifrost; the Mali G30, G50, and G70 series all use the Bifrost architecture.
It is a unified architecture, which means a single shader core design handles both vertex and fragment shaders.
The number of shader cores can scale from a single core all the way up to 32 cores.
There is an L2 cache to reduce latency, typically in the range of 64 to 128 KB per shader core.
The architecture can write one 32-bit pixel per core per clock, so an 8-core design can write 8 pixels, or 256 bits, per clock.
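The per-clock write rate above scales linearly with core count; a minimal sketch of that arithmetic (figures taken from the text, function names are illustrative only):

```python
# Tile write-back throughput: each shader core writes one 32-bit pixel per clock.
BITS_PER_PIXEL = 32

def pixels_per_clock(num_cores, pixels_per_core_per_clock=1):
    """Pixels the GPU can write per clock, given the core count."""
    return num_cores * pixels_per_core_per_clock

cores = 8
print(pixels_per_clock(cores))                    # 8 pixels per clock
print(pixels_per_clock(cores) * BITS_PER_PIXEL)   # 256 bits per clock
```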
This is the block model of the shader core. In the picture, the black blocks are fixed-function units.
Every shader core contains an execution core, which consists of five units.
The first is the execution engine, which is responsible for executing shader code and provides the arithmetic processing power. An execution core may contain multiple execution engines.
The load/store unit is responsible for shader memory access.
The varying unit is responsible for varying interpolation. It shares the same arithmetic design as the execution engine.
The ZS/blend unit is responsible for tile-memory access.
The texture unit is responsible for any memory access to do with textures; it is separate from the load/store unit.
There are a few new features in the Bifrost architecture, and IDVS is one of them.
Usually a GPU processes all vertex shading before culling primitives, which often wastes compute power and bandwidth on vertices that are only used by culled primitives.
IDVS splits the vertex shader into two parts: position shading, which runs before culling, and varying shading, which runs after culling. This saves both compute power and bandwidth.
It is done by the driver and hardware, so it is essentially invisible to the developer. However, the developer can make IDVS more efficient: if position-related attributes are placed in one buffer and all other attributes in another, the non-position attributes are never pulled into the cache during position shading, so the cache can hold more vertex data.
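The cache benefit of splitting buffers can be made concrete with a back-of-the-envelope sketch (assumed layout: vec3 position, vec3 normal, vec2 UV, 4-byte floats; the numbers are illustrative, not measured):

```python
# With one interleaved buffer, the position-shading pass drags non-position
# attributes into the cache; with split buffers it touches only positions.
FLOAT = 4
POS = 3 * FLOAT            # vec3 position
NONPOS = (3 + 2) * FLOAT   # vec3 normal + vec2 uv

def bytes_touched_by_position_pass(num_verts, split_buffers):
    # Interleaved: every cache line pulled in carries unused attribute bytes.
    stride = POS if split_buffers else POS + NONPOS
    return num_verts * stride

print(bytes_touched_by_position_pass(10000, split_buffers=False))  # 320000
print(bytes_touched_by_position_pass(10000, split_buffers=True))   # 120000
```

In GL terms this simply means uploading positions and the remaining attributes into two separate buffer objects rather than one interleaved one.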
The other new feature I want to mention is Forward Pixel Kill (FPK). Although early-Z testing and front-to-back rendering remove most overdraw, FPK can kill pixel threads that are already in flight once we find the pixel will be occluded.
A computation thread already in flight can be terminated at any time if we spot that a later thread will write opaque data to the same pixel location.
Traditional desktop GPU architectures usually use immediate-mode rendering. That means vertex shaders and fragment shaders are executed in sequence on each primitive in each draw call.
Here is the pseudocode for immediate-mode rendering.
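The slide's pseudocode is not reproduced here, but the flow can be sketched in runnable form (the stage functions are hypothetical stand-ins, not real GPU code):

```python
# Immediate-mode rendering: for each draw call, every primitive runs vertex
# shading and then fragment shading straight away, with fragment results going
# directly to the framebuffer in external memory.
def render_immediate(draw_calls, framebuffer):
    for draw in draw_calls:
        for prim in draw:
            shaded = [vertex_shade(v) for v in prim]
            for frag in rasterize(shaded):
                framebuffer[frag] = fragment_shade(frag)  # external memory write

# Toy stand-in stages so the sketch runs:
def vertex_shade(v): return v
def rasterize(prim): return prim          # pretend each vertex yields one fragment
def fragment_shade(frag): return frag * 2

fb = {}
render_immediate([[[0, 1]], [[2]]], fb)
print(fb)  # {0: 0, 1: 2, 2: 4}
```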
The picture shows the traditional immediate-mode rendering pipeline. You can see there are a lot of external memory accesses in the fragment stage. On a mobile device, this kind of external memory access is very bad for energy efficiency.
Most modern mobile GPUs use tile-based rendering, which is designed to minimize the amount of external memory access needed during rendering.
Tile-based renderers split the screen into small tiles; the tile size on Mali is 16x16 pixels. The GPU performs fragment shading one tile at a time and writes each tile's result out to external memory when the tile is finished.
One big difference from immediate-mode rendering is that the GPU splits each render pass into two distinct processing passes: a vertex shading pass, which also generates the tile lists, and a fragment shading pass, which executes fragment shading tile by tile.
That is the pseudocode for tile-based rendering. The GPU first executes the vertex pass for all primitives in a render pass, generating a tile list.
The GPU then executes the fragment pass for each render pass, tile by tile.
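The two-pass flow just described can be sketched as runnable code (a toy model: each primitive covers a single pixel, which is an assumption purely for illustration):

```python
# Tile-based rendering: a vertex pass bins primitives into per-tile lists,
# then a fragment pass shades one 16x16 tile at a time in fast tile memory
# and writes each finished tile to external memory exactly once.
TILE = 16

def vertex_pass(primitives):
    tile_lists = {}
    for prim in primitives:
        x, y = prim["pos"]  # toy model: one covered pixel per primitive
        tile_lists.setdefault((x // TILE, y // TILE), []).append(prim)
    return tile_lists

def fragment_pass(tile_lists):
    external_writes = 0
    for tile, prims in tile_lists.items():
        tile_memory = {}
        for prim in prims:
            tile_memory[prim["pos"]] = prim["color"]  # cheap on-chip access
        external_writes += 1  # single write-back per finished tile
    return external_writes

prims = [{"pos": (3, 5), "color": "red"},
         {"pos": (3, 5), "color": "blue"},    # overdraw stays in tile memory
         {"pos": (40, 8), "color": "green"}]
lists = vertex_pass(prims)
print(len(lists), fragment_pass(lists))  # 2 tiles, 2 external write-backs
```

Note how the overdrawn pixel costs only tile-memory traffic; external memory sees one write per tile regardless of overdraw.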
The picture shows the tile-based rendering pipeline. As you can see, tile-based rendering moves most memory accesses to tile memory instead of external memory, which is much faster and more energy efficient.
One thing worth mentioning is that there is an extra external memory read and write for the tile lists, which an immediate-mode GPU does not have. An extra geometry pass is therefore usually more expensive on mobile than on desktop.
Render passes are an essential concept for tile-based renderers. A render pass is a single execution of the rendering pipeline, rendering a single output image into a set of framebuffer attachments.
Each render pass needs its tile memory initialized at the start and the results written back to external memory at the end.
To get the most benefit from the tile-based approach, it is critical that applications minimize the amount of memory traffic into and out of tile memory.
That means avoiding reading in old framebuffer values at the start of a render pass and avoiding writing out unneeded values at the end of each render pass.
Here is an animation of how a tile-based GPU works.
Primitives first go through position shading. After position shading, primitives must pass the facing test; any back-facing primitives are culled here.
Next is the frustum test: primitives outside the viewport are culled here. Then comes the sample test, where any primitive smaller than one pixel is culled.
Finally, all surviving primitives go through varying shading and are added to the tile lists.
After the vertex pass of a render pass has finished, the GPU can run the fragment pass for that render pass.
The rasterizer rasterizes primitives into 2x2 pixel quads.
All quads then take an early-Z test before moving to the next stage, except that if the fragment shader uses discard, alpha-to-coverage, or writes depth, the quad skips early-Z and takes a late-Z test instead. The early-Z test kills any quad that is fully occluded by other pixels.
After early-Z, the fragment thread creator creates four fragment threads per quad, and the threads are dispatched to the execution engines. For fragments that could not take advantage of early-Z, the late-Z test is done here to kill any that turn out to be occluded.
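The early-Z versus late-Z decision above can be sketched as a small simulation (a toy single-sample depth model; field names and the `process_quad` helper are illustrative, not real hardware interfaces):

```python
# Quads that neither discard nor write depth take the early-Z test before
# shading; the rest must shade first and take the late-Z test afterwards.
def depth_test(quad, depth_buffer):
    key = quad["pos"]
    if quad["z"] >= depth_buffer.get(key, float("inf")):
        return False                     # occluded: kill the quad
    depth_buffer[key] = quad["z"]
    return True

def process_quad(quad, depth_buffer, shaded_count):
    early_eligible = not (quad.get("discard") or quad.get("writes_depth"))
    if early_eligible and not depth_test(quad, depth_buffer):
        return shaded_count              # early-Z kill: no shading work spent
    shaded_count += 1                    # fragment threads run here
    if not early_eligible and not depth_test(quad, depth_buffer):
        return shaded_count              # late-Z kill: shading already wasted
    return shaded_count

db, shaded = {}, 0
shaded = process_quad({"pos": (0, 0), "z": 0.2}, db, shaded)  # passes early-Z
shaded = process_quad({"pos": (0, 0), "z": 0.9}, db, shaded)  # early-Z kill
shaded = process_quad({"pos": (0, 0), "z": 0.8, "discard": True}, db, shaded)
print(shaded)  # 2: the early-Z kill saved shading work, the late-Z kill did not
```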
Finally, the surviving fragments are blended and written to tile memory. Once all primitives in the tile list have finished, the tile writer writes the tile's result back to external memory. Mali also has Transaction Elimination, which can skip writing back tiles whose contents have not changed, saving bandwidth.
However, the OpenGL ES API has no explicit render passes at the API level, so the driver must infer which rendering operations form a single render pass.
How does the OpenGL ES driver form a render pass? Basically, the driver keeps adding drawing commands to the current render pass, and the render pass is submitted for processing when an API call changes the framebuffer or forces a flush of the queued work.
Here I have listed the most common causes of ending a render pass. Whenever you call one of these APIs, the driver ends the current render pass and submits it.
Now that we know how render passes work, how can we write efficient render pass code? There are a few things you can follow.
First, process each render pass only once.
That means binding each framebuffer object only once; do not switch framebuffer objects back and forth. Make all the required draw calls before switching to the next framebuffer, and avoid unnecessary switches.
Second, minimize the load from the previous framebuffer contents at the start of the render pass. There are two ways to avoid the tile load: use glClear() to cheaply initialize tile memory to a clear color, or use glInvalidateFramebuffer() to hint to the driver that it does not need to load data from external memory.
Third, minimize the store to the framebuffer at the end of the render pass. Avoid writing back to main memory whenever possible. You can notify the driver that an attachment is transient by marking its contents as invalid with a call to glInvalidateFramebuffer() as the last "draw call".
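The ordering of those three steps is what matters, so here is a runnable sketch using stand-in functions that merely record the call sequence (in real code these would be the actual glClear and glInvalidateFramebuffer calls on the bound framebuffer):

```python
# Stand-ins for the GL entry points, recording call order for illustration.
calls = []

def gl_clear(mask):                          # stand-in for glClear
    calls.append(("clear", mask))

def gl_invalidate_framebuffer(attachments):  # stand-in for glInvalidateFramebuffer
    calls.append(("invalidate", tuple(attachments)))

def draw_scene():                            # stand-in for the pass's draw calls
    calls.append(("draw", "scene"))

# 1. Cheaply initialize tile memory instead of loading the old framebuffer.
gl_clear("COLOR|DEPTH|STENCIL")
# 2. Issue all draws for this pass before touching another framebuffer.
draw_scene()
# 3. Mark transient attachments invalid so they are never written back.
gl_invalidate_framebuffer(["DEPTH_ATTACHMENT", "STENCIL_ATTACHMENT"])

print([c[0] for c in calls])  # ['clear', 'draw', 'invalidate']
```

Clearing before any draw avoids the tile load at the start of the pass, and invalidating depth/stencil as the last operation avoids the tile store at the end.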
Mobile Studio consists of four component tools, although at the moment only two are actually in the public tool bundle.
Streamline, a system profiler for CPU and GPU performance.
Graphics Analyzer, an API debugger for OpenGL ES and Vulkan rendering APIs.
In addition we have:
Mali Offline Compiler, a syntax checker and static analysis tool for GPU shader programs, which is currently available as a separate download.
Performance Advisor, a new tool which places automated performance analysis into a continuous integration workflow. This is currently still in development in a closed beta, but expect to see this joining the Studio release early next year.
Android does not mandate a guaranteed level of data access for tooling, so not every device will provide all desirable data sources off the shelf today.
We are publishing a list of supported devices online, which are currently the ones we test internally. Expect this device list to grow over time, and if you are a device manufacturer please come and talk to us about tooling conformance testing and how you can get more devices on to that list.
Streamline annotations let you instrument your source code. The annotation library is written in C++, and we have implemented a C# wrapper, so developers can now import it as a Unity package.
To use Streamline annotations, you need some setup in your Unity project:
First, make sure you are using IL2CPP as the scripting backend.
Second, set the C++ compiler configuration to debug.
Third, set the target architecture to ARM64; we currently support ARM64 only.
Finally, build a development-build APK.
There are three types of annotations.
The first is the marker, the simplest form of annotation. It is a single point in time with a label that appears at the top of Streamline's timeline view. As you can see in the picture, the green labels at the top are markers.
The second is the channel. Many annotations can be placed into a channel, and unlike a marker, each annotation spans a range of time. You can use this to mark the total time the game spent on some operation. In the picture, the yellow and blue labels represent different annotations in the same channel.
The final type is the Custom Activity Map, the most advanced form of annotation. You can think of it as a structured map containing many channels, as the picture shows; you can group as many channels as you want into one map. The map appears as its own view in the lower half of the Streamline UI.
You can download Unity package here.
This is how the tool looks. All of the views are customizable, so you can show only the data you need for each API.
The API trace shows every single call that you make to your chosen API; traces can easily run into the millions of calls.
Dynamic Help is static analysis: our experts have built up a list of things to watch out for, and it gives you pointers based on that list.
The Textures and Shaders views collect every such asset in your application, and we run the shaders through the offline compiler, which makes them easy to sort.
The Frame Outline lets you quickly navigate the whole trace to find your problem area fast.
The outline view shows all the frames and draw calls that are rendered.
You can select any draw call, or in fact any API call, and see the state of the application at that point.
Investigate at the frame level to find out which draw calls have a higher geometry impact.
This window shows the contents of the uniforms and vertex attributes for each draw call.
When a draw call is selected, all of its associated data is available.
The uniform tab shows uniform values, including samplers, matrices, and arrays.
The vertex attribute tab shows all the vertex attributes, with their names and positions.
You can also view the mesh being rendered in 3D here.
This can be useful for debugging graphics issues.
The shader reports and statistics window is also very useful.
All the shaders used by the application are reported.
Shaders are compiled with the Mali Offline Compiler, which reports the number of instructions, work registers, and uniform registers here.
Additionally, the number of times each shader has been executed can be reported.
A native-resolution snapshot of each framebuffer is captured after every draw call.
The capture happens on the target device, so even target-dependent bugs or precision issues can be investigated.
Graphics Analyzer also has a few different drawing modes to help you debug the application.
Native mode renders with the original shaders.
Overdraw mode highlights where overdraw happens.
Shader map mode shows each shader in a different color, helping you identify a particular shader.
This report is based on an example application created for demonstration purposes.
The first thing the user sees is the summary at the very top of the report, which gives a high-level view of everything reported below. It shows a pie chart of what the application was bound by throughout the capture, together with the application's average FPS.
In this particular capture we can see that the average FPS was below target at 42, against a target of 60. Looking at the pie chart, for around three quarters of the capture the application was either CPU or vertex bound.
At first glance we can clearly see that performance improvements could be made, with some direction on where the user should start focusing their attention.
Being vsync bound is the ideal state, representing a well-running application.
The FPS analysis graph gives a clearer view of where in the application the user might be able to make performance improvements.
Here we start looking deeper into the issues flagged in the summary.
The bound areas are clearly highlighted so the user can see them at a glance.
FPS and overdraw are plotted together to see if there is any correlation between them. From this graph we can see an obvious connection: low FPS coincides with high overdraw and with being CPU or vertex bound, and as the application moves to being vsync bound the pattern reverses.
The graph is also interactive, letting the user toggle overdraw and FPS on and off, which helps when the user wants to focus only on bound areas or when the FPS and overdraw curves are plotted close together.
This screen shows a Space Ape capture with nested regions.
We can see they have identified a loading screen, an introduction scene, and the first 10 seconds of gameplay. The loading screen might not be of much interest here, as the time taken is probably what matters most, but we can see that in the first 10 seconds of gameplay the application goes CPU bound and the FPS drops significantly.
Each region defined has its own analysis section, with advice and links to further actions that can be taken.
All of this information is packaged into one report, which can be integrated into CI systems or run manually. It reduces the reliance on technical experts spending long amounts of time determining why an application has performance issues, enabling teams to move forward with deeper knowledge of where the application needs attention, and in turn freeing up the individual experts to concentrate on other areas.
There are more resources on our website. If you are interested in Vulkan, you should check this link; it has a lot of useful information and samples.