The Arm trademarks featured in this presentation are registered
trademarks or trademarks of Arm Limited (or its subsidiaries) in
the US and/or elsewhere. All rights reserved. All other marks
featured may be trademarks of their respective owners.
www.arm.com/company/policies/trademarks
Editor's Notes
Our latest GPU architecture is Bifrost; the Mali G30, G50, and G70 series all use the Bifrost architecture.
It is a unified architecture, which means a single shader core design handles both vertex and fragment shaders.
The number of shader cores can scale from a single core all the way up to 32 cores.
There is an L2 cache to reduce latency, typically in the range of 64 to 128 KB per shader core.
The architecture can write one 32-bit pixel per core per clock, so an 8-core design can write 8 pixels, or 256 bits, per clock.
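The per-clock write rate above scales linearly with core count; a minimal sketch of that arithmetic (figures taken from the text, function names are illustrative only):

```python
# Tile write-back throughput: each shader core writes one 32-bit pixel per clock.
BITS_PER_PIXEL = 32

def pixels_per_clock(num_cores, pixels_per_core_per_clock=1):
    """Pixels the GPU can write per clock, given the core count."""
    return num_cores * pixels_per_core_per_clock

cores = 8
print(pixels_per_clock(cores))                    # 8 pixels per clock
print(pixels_per_clock(cores) * BITS_PER_PIXEL)   # 256 bits per clock
```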
This is the block model of the shader core. In the picture, the black blocks are fixed-function units.
Every shader core contains an execution core, which consists of five units.
The first is the execution engine, which is responsible for executing shader code and provides the arithmetic processing power. An execution core may contain multiple execution engines.
The load/store unit is responsible for shader memory access.
The varying unit is responsible for varying interpolation. It shares the same arithmetic design as the execution engine.
The ZS/blend unit is responsible for tile-memory access.
The texture unit is responsible for any memory access to do with textures; it is separate from the load/store unit.
There are a few new features in the Bifrost architecture, and IDVS is one of them.
Usually a GPU processes all vertex shading before culling primitives, which often wastes compute power and bandwidth on vertices that are only used by culled primitives.
IDVS splits the vertex shader into two parts: position shading, which runs before culling, and varying shading, which runs after culling. This saves both compute power and bandwidth.
It is done by the driver and hardware, so it is essentially invisible to the developer. However, the developer can make IDVS more efficient: if position-related attributes are placed in one buffer and all other attributes in another, the non-position attributes are never pulled into the cache during position shading, so the cache can hold more vertex data.
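The cache benefit of splitting buffers can be made concrete with a back-of-the-envelope sketch (assumed layout: vec3 position, vec3 normal, vec2 UV, 4-byte floats; the numbers are illustrative, not measured):

```python
# With one interleaved buffer, the position-shading pass drags non-position
# attributes into the cache; with split buffers it touches only positions.
FLOAT = 4
POS = 3 * FLOAT            # vec3 position
NONPOS = (3 + 2) * FLOAT   # vec3 normal + vec2 uv

def bytes_touched_by_position_pass(num_verts, split_buffers):
    # Interleaved: every cache line pulled in carries unused attribute bytes.
    stride = POS if split_buffers else POS + NONPOS
    return num_verts * stride

print(bytes_touched_by_position_pass(10000, split_buffers=False))  # 320000
print(bytes_touched_by_position_pass(10000, split_buffers=True))   # 120000
```

In GL terms this simply means uploading positions and the remaining attributes into two separate buffer objects rather than one interleaved one.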
The other new feature I want to mention is Forward Pixel Kill (FPK). Although early-Z testing and front-to-back rendering remove most overdraw, FPK can kill pixel threads that are already in flight once we find the pixel will be occluded.
A computation thread already in flight can be terminated at any time if we spot that a later thread will write opaque data to the same pixel location.
Traditional desktop GPU architectures usually use immediate-mode rendering. That means vertex shaders and fragment shaders are executed in sequence on each primitive in each draw call.
Here is the pseudocode for immediate-mode rendering.
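The slide's pseudocode is not reproduced here, but the flow can be sketched in runnable form (the stage functions are hypothetical stand-ins, not real GPU code):

```python
# Immediate-mode rendering: for each draw call, every primitive runs vertex
# shading and then fragment shading straight away, with fragment results going
# directly to the framebuffer in external memory.
def render_immediate(draw_calls, framebuffer):
    for draw in draw_calls:
        for prim in draw:
            shaded = [vertex_shade(v) for v in prim]
            for frag in rasterize(shaded):
                framebuffer[frag] = fragment_shade(frag)  # external memory write

# Toy stand-in stages so the sketch runs:
def vertex_shade(v): return v
def rasterize(prim): return prim          # pretend each vertex yields one fragment
def fragment_shade(frag): return frag * 2

fb = {}
render_immediate([[[0, 1]], [[2]]], fb)
print(fb)  # {0: 0, 1: 2, 2: 4}
```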
The picture shows the traditional immediate-mode rendering pipeline. You can see there are a lot of external memory accesses in the fragment stage. On a mobile device, this kind of external memory access is very bad for energy efficiency.
Most modern mobile GPUs use tile-based rendering, which is designed to minimize the amount of external memory access needed during rendering.
Tile-based renderers split the screen into small tiles; the tile size on Mali is 16x16 pixels. The GPU performs fragment shading one tile at a time and writes each tile's result out to external memory when the tile is finished.
One big difference from immediate-mode rendering is that the GPU splits each render pass into two distinct processing passes: a vertex shading pass, which also generates the tile lists, and a fragment shading pass, which executes fragment shading tile by tile.
That is the pseudocode for tile-based rendering. The GPU first executes the vertex pass for all primitives in a render pass, generating a tile list.
The GPU then executes the fragment pass for each render pass, tile by tile.
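The two-pass flow just described can be sketched as runnable code (a toy model: each primitive covers a single pixel, which is an assumption purely for illustration):

```python
# Tile-based rendering: a vertex pass bins primitives into per-tile lists,
# then a fragment pass shades one 16x16 tile at a time in fast tile memory
# and writes each finished tile to external memory exactly once.
TILE = 16

def vertex_pass(primitives):
    tile_lists = {}
    for prim in primitives:
        x, y = prim["pos"]  # toy model: one covered pixel per primitive
        tile_lists.setdefault((x // TILE, y // TILE), []).append(prim)
    return tile_lists

def fragment_pass(tile_lists):
    external_writes = 0
    for tile, prims in tile_lists.items():
        tile_memory = {}
        for prim in prims:
            tile_memory[prim["pos"]] = prim["color"]  # cheap on-chip access
        external_writes += 1  # single write-back per finished tile
    return external_writes

prims = [{"pos": (3, 5), "color": "red"},
         {"pos": (3, 5), "color": "blue"},    # overdraw stays in tile memory
         {"pos": (40, 8), "color": "green"}]
lists = vertex_pass(prims)
print(len(lists), fragment_pass(lists))  # 2 tiles, 2 external write-backs
```

Note how the overdrawn pixel costs only tile-memory traffic; external memory sees one write per tile regardless of overdraw.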
The picture shows the tile-based rendering pipeline. As you can see, tile-based rendering moves most memory accesses to tile memory instead of external memory, which is much faster and more energy efficient.
One thing worth mentioning is that there is an extra external memory read and write for the tile lists, which an immediate-mode GPU does not have. An extra geometry pass is therefore usually more expensive on mobile than on desktop.
Render passes are an essential concept for tile-based renderers. A render pass is a single execution of the rendering pipeline, rendering a single output image into a set of framebuffer attachments.
Each render pass needs its tile memory initialized at the start and the results written back to external memory at the end.
To get the most benefit from the tile-based approach, it is critical that applications minimize the amount of memory traffic into and out of tile memory.
That means avoiding reading in old framebuffer values at the start of a render pass and avoiding writing out unneeded values at the end of each render pass.
Here is an animation of how a tile-based GPU works.
Primitives first go through position shading. After position shading, primitives must pass the facing test; any back-facing primitives are culled here.
Next is the frustum test: primitives outside the viewport are culled here. Then comes the sample test, where any primitive smaller than one pixel is culled.
Finally, all surviving primitives go through varying shading and are added to the tile lists.
After the vertex pass of a render pass has finished, the GPU can run the fragment pass for that render pass.
The rasterizer rasterizes primitives into 2x2 pixel quads.
All quads then take an early-Z test before moving to the next stage, except that if the fragment shader uses discard, alpha-to-coverage, or writes depth, the quad skips early-Z and takes a late-Z test instead. The early-Z test kills any quad that is fully occluded by other pixels.
After early-Z, the fragment thread creator creates four fragment threads per quad, and the threads are dispatched to the execution engines. For fragments that could not take advantage of early-Z, the late-Z test is done here to kill any that turn out to be occluded.
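The early-Z versus late-Z decision above can be sketched as a small simulation (a toy single-sample depth model; field names and the `process_quad` helper are illustrative, not real hardware interfaces):

```python
# Quads that neither discard nor write depth take the early-Z test before
# shading; the rest must shade first and take the late-Z test afterwards.
def depth_test(quad, depth_buffer):
    key = quad["pos"]
    if quad["z"] >= depth_buffer.get(key, float("inf")):
        return False                     # occluded: kill the quad
    depth_buffer[key] = quad["z"]
    return True

def process_quad(quad, depth_buffer, shaded_count):
    early_eligible = not (quad.get("discard") or quad.get("writes_depth"))
    if early_eligible and not depth_test(quad, depth_buffer):
        return shaded_count              # early-Z kill: no shading work spent
    shaded_count += 1                    # fragment threads run here
    if not early_eligible and not depth_test(quad, depth_buffer):
        return shaded_count              # late-Z kill: shading already wasted
    return shaded_count

db, shaded = {}, 0
shaded = process_quad({"pos": (0, 0), "z": 0.2}, db, shaded)  # passes early-Z
shaded = process_quad({"pos": (0, 0), "z": 0.9}, db, shaded)  # early-Z kill
shaded = process_quad({"pos": (0, 0), "z": 0.8, "discard": True}, db, shaded)
print(shaded)  # 2: the early-Z kill saved shading work, the late-Z kill did not
```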
Finally, the surviving fragments are blended and written to tile memory. Once all primitives in the tile list have finished, the tile writer writes the tile's result back to external memory. Mali also has Transaction Elimination, which can skip writing back tiles whose contents have not changed, saving bandwidth.
However, the OpenGL ES API has no explicit render passes at the API level, so the driver must infer which rendering operations form a single render pass.
How does the OpenGL ES driver form a render pass? Basically, the driver keeps adding drawing commands to the current render pass, and the render pass is submitted for processing when an API call changes the framebuffer or forces a flush of the queued work.
Here I have listed the most common causes of ending a render pass. Whenever you call one of these APIs, the driver ends the current render pass and submits it.
Now that we know how render passes work, how can we write efficient render pass code? There are a few things you can follow.
First, process each render pass only once.
That means binding each framebuffer object only once; do not switch framebuffer objects back and forth. Make all the required draw calls before switching to the next framebuffer, and avoid unnecessary switches.
Second, minimize the load from the previous framebuffer contents at the start of the render pass. There are two ways to avoid the tile load: use glClear() to cheaply initialize tile memory to a clear color, or use glInvalidateFramebuffer() to hint to the driver that it does not need to load data from external memory.
Third, minimize the store to the framebuffer at the end of the render pass. Avoid writing back to main memory whenever possible. You can notify the driver that an attachment is transient by marking its contents as invalid with a call to glInvalidateFramebuffer() as the last "draw call".
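The ordering of those three steps is what matters, so here is a runnable sketch using stand-in functions that merely record the call sequence (in real code these would be the actual glClear and glInvalidateFramebuffer calls on the bound framebuffer):

```python
# Stand-ins for the GL entry points, recording call order for illustration.
calls = []

def gl_clear(mask):                          # stand-in for glClear
    calls.append(("clear", mask))

def gl_invalidate_framebuffer(attachments):  # stand-in for glInvalidateFramebuffer
    calls.append(("invalidate", tuple(attachments)))

def draw_scene():                            # stand-in for the pass's draw calls
    calls.append(("draw", "scene"))

# 1. Cheaply initialize tile memory instead of loading the old framebuffer.
gl_clear("COLOR|DEPTH|STENCIL")
# 2. Issue all draws for this pass before touching another framebuffer.
draw_scene()
# 3. Mark transient attachments invalid so they are never written back.
gl_invalidate_framebuffer(["DEPTH_ATTACHMENT", "STENCIL_ATTACHMENT"])

print([c[0] for c in calls])  # ['clear', 'draw', 'invalidate']
```

Clearing before any draw avoids the tile load at the start of the pass, and invalidating depth/stencil as the last operation avoids the tile store at the end.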
Mobile Studio consists of four component tools, although at the moment only two are actually in the public tool bundle.
Streamline, a system profiler for CPU and GPU performance.
Graphics Analyzer, an API debugger for OpenGL ES and Vulkan rendering APIs.
In addition we have:
Mali Offline Compiler, a syntax checker and static analysis tool for GPU shader programs, which is currently available as a separate download.
Performance Advisor, a new tool which places automated performance analysis into a continuous integration workflow. This is currently still in development in a closed beta, but expect to see this joining the Studio release early next year.
Android does not mandate a guaranteed level of data access for tooling, so not every device will provide all desirable data sources off the shelf today.
We are publishing a list of supported devices online, which are currently the ones we test internally. Expect this device list to grow over time, and if you are a device manufacturer please come and talk to us about tooling conformance testing and how you can get more devices on to that list.
Streamline annotations let you instrument your source code. The annotation library is written in C++, and we have implemented a C# wrapper, so developers can now import it as a Unity package.
To use Streamline annotations, you need some setup in your Unity project:
First, make sure you are using IL2CPP as the scripting backend.
Second, set the C++ compiler configuration to debug.
Third, set the target architecture to ARM64; we currently support ARM64 only.
Finally, build a development-build APK.
There are three types of annotations.
The first is the marker, the simplest form of annotation. It is a single point in time with a label that appears at the top of Streamline's timeline view. As you can see in the picture, the green labels at the top are markers.
The second is the channel. Many annotations can be placed into a channel, and unlike a marker, each annotation spans a range of time. You can use this to mark the total time the game spent on some operation. In the picture, the yellow and blue labels represent different annotations in the same channel.
The final type is the Custom Activity Map, the most advanced form of annotation. You can think of it as a structured map containing many channels, as the picture shows; you can group as many channels as you want into one map. The map appears as its own view in the lower half of the Streamline UI.
You can download Unity package here.
This is how the tool looks. All of the views are customizable, so you can show only the data you need for each API.
The API trace shows every single call that you make to your chosen API; traces can easily run into the millions of calls.
Dynamic Help is static analysis: our experts have built up a list of things to watch out for, and it gives you pointers based on that list.
The Textures and Shaders views collect every such asset in your application, and we run the shaders through the offline compiler, which makes them easy to sort.
The Frame Outline lets you quickly navigate the whole trace to find your problem area fast.
The outline view shows all the frames and draw calls that are rendered.
You can select any draw call, or in fact any API call, and see the state of the application at that point.
Investigate at the frame level to find out which draw calls have a higher geometry impact.
This window shows the contents of the uniforms and vertex attributes for each draw call.
When a draw call is selected, all of its associated data is available.
The uniform tab shows uniform values, including samplers, matrices, and arrays.
The vertex attribute tab shows all the vertex attributes, with their names and positions.
You can also view the mesh being rendered in 3D here.
This can be useful for debugging graphics issues.
The shader reports and statistics window is also very useful.
All the shaders used by the application are reported.
Shaders are compiled with the Mali Offline Compiler, which reports the number of instructions, work registers, and uniform registers here.
Additionally, the number of times each shader has been executed can be reported.
A native-resolution snapshot of each framebuffer is captured after every draw call.
The capture happens on the target device, so even target-dependent bugs or precision issues can be investigated.
Graphics Analyzer also has a few different drawing modes to help you debug the application.
Native mode renders with the original shaders.
Overdraw mode highlights where overdraw happens.
Shader map mode shows each shader in a different color, helping you identify a particular shader.
This report is based on an example application created for demonstration purposes.
The first thing the user sees is the summary at the very top of the report, which gives a high-level view of everything reported below. It shows a pie chart of what the application was bound by throughout the capture, together with the application's average FPS.
In this particular capture we can see that the average FPS was below target at 42, against a target of 60. Looking at the pie chart, for around three quarters of the capture the application was either CPU or vertex bound.
At first glance we can clearly see that performance improvements could be made, with some direction on where the user should start focusing their attention.
Being vsync bound is the ideal state, representing a well-running application.
The FPS analysis graph gives a clearer view of where in the application the user might be able to make performance improvements.
Here we start looking deeper into the issues flagged in the summary.
The bound areas are clearly highlighted so the user can see them at a glance.
FPS and overdraw are plotted together to see if there is any correlation between them. From this graph we can see an obvious connection: low FPS coincides with high overdraw and with being CPU or vertex bound, and as the application moves to being vsync bound the pattern reverses.
The graph is also interactive, letting the user toggle overdraw and FPS on and off, which helps when the user wants to focus only on bound areas or when the FPS and overdraw curves are plotted close together.
This screen shows a Space Ape capture with nested regions.
We can see they have identified a loading screen, an introduction scene, and the first 10 seconds of gameplay. The loading screen might not be of much interest here, as the time taken is probably what matters most, but we can see that in the first 10 seconds of gameplay the application goes CPU bound and the FPS drops significantly.
Each region defined has its own analysis section, with advice and links to further actions that can be taken.
All of this information is packaged into one report, which can be integrated into CI systems or run manually. It reduces the reliance on technical experts spending long amounts of time determining why an application has performance issues, enabling teams to move forward with deeper knowledge of where the application needs attention, and in turn freeing up the individual experts to concentrate on other areas.
There are more resources on our website. If you are interested in Vulkan, you should check this link; it has a lot of useful information and samples.