A set of mobile game optimization best practices. This presentation focuses on the PowerVR series of GPUs from Imagination Technologies and on iOS; however, the majority of the recommendations can be applied to other GPUs and mobile operating systems.
2. Mobile GPU architectures
• There are 3 major mobile GPU architectures
on the market:
• IMR (Immediate Mode Renderer)
• TBR (Tile Based Renderer)
• TBDR (Tile Based Deferred Renderer)
3. IMR
• Renders anything sent to the GPU
immediately. It makes no assumption about
what is going to be submitted next.
• Application has to sort opaque geometry front
to back.
• It is basically brute force.
• Nvidia, AMD.
4. TBR
• Improves on IMR, but is still an IMR.
• Bandwidth is a precious resource on mobile
devices, and TBR tries to reduce data transfers
as much as possible.
• Your geometry is split into tiles and then
processed per tile. Each tile has a small amount
of on-chip memory for the colour and
depth/stencil buffers, so there is no need for
transfers from/to system memory.
• Qualcomm Adreno, ARM Mali
5. TBDR
• It is deferred, i.e. all graphics are drawn
at some later point.
• And this is where all the magic happens!
• The GPU is aware of context: it knows what is
going to be drawn in the future, and this allows
it to employ some awesome optimisations that
reduce power consumption, bandwidth and
fillrate.
• Imagination PowerVR.
9. What you might know
• Pixel perfect HSR (Hidden Surface Removal).
Adreno and ARM do not feature this.
• But you still need to sort transparent geometry!
• Avoid alpha test. Use alpha blend
instead.
10. What you might not know
• HSR still requires vertices to be processed!
• …thus don’t forget to cull your geometry on
the CPU!
• Prefer the stencil test over scissor.
– The stencil test is performed in hardware on PowerVR
GPUs.
– The stencil mask is stored in fast on-chip memory.
– A stencil can have any shape, in contrast to the
rectangular scissor.
11. What you might not know
• Why no alpha test?!
o Alpha test/discard requires the fragment shader to run before visibility of the
current fragment can be determined. This removes the benefits of HSR.
o Even more! If shader code contains discard, then any geometry rendered
with this shader will suffer from the alpha test drawbacks. Even if the keyword
is inside a condition, USSE (PVR’s shader engine) assumes that this
condition may be hit.
o Move discard into a separate shader.
o Draw opaque geometry first, then alpha tested geometry, and alpha blended geometry last.
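The draw order above can be sketched as a sort key. This is a minimal illustration, not the author's code: `RenderPass`, `DrawItem` and `sort_draw_items` are hypothetical names, and the depth ordering inside each pass (front to back for opaque and alpha tested items, back to front for blended items) is an assumption based on common practice.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical render passes, numbered in the recommended draw order. */
typedef enum {
    PASS_OPAQUE = 0,
    PASS_ALPHA_TEST = 1,
    PASS_ALPHA_BLEND = 2
} RenderPass;

typedef struct {
    RenderPass pass;
    float depth; /* distance from the camera */
    int id;      /* hypothetical mesh identifier */
} DrawItem;

static int compare_draw_items(const void *a, const void *b)
{
    const DrawItem *x = a;
    const DrawItem *y = b;

    /* Primary key: opaque, then alpha tested, then alpha blended. */
    if (x->pass != y->pass)
        return (x->pass < y->pass) ? -1 : 1;
    if (x->depth == y->depth)
        return 0;
    /* Assumption: blended items go back to front, the rest front to back. */
    if (x->pass == PASS_ALPHA_BLEND)
        return (x->depth > y->depth) ? -1 : 1;
    return (x->depth < y->depth) ? -1 : 1;
}

void sort_draw_items(DrawItem *items, size_t count)
{
    assert(items != NULL || count == 0);
    qsort(items, count, sizeof(DrawItem), compare_draw_items);
}
```

Calling `sort_draw_items` once per frame on the visible set keeps the shader that contains discard out of the opaque pass.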
12. What you might know
• Bandwidth matters
1. Use constant colour per object, instead of per
vertex
2. Simplify your models. Use smaller data types.
3. Use indexed triangles or non-indexed triangle
strips
4. Use VBO instead of client arrays
5. Use VAO
13. What you might not know
• VBO allocations are aligned to the 4KB page size.
That means your small buffer for just a
couple of triangles will occupy 4KB in
memory. A large number of small VBOs can
fragment and waste your memory.
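To get a feel for the cost, the rounding can be computed directly. A small sketch, assuming only the 4KB granularity stated above; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define VBO_PAGE_SIZE 4096u /* allocation granularity stated on the slide */

/* Bytes actually occupied by a VBO of the requested size. */
size_t vbo_committed_bytes(size_t requested)
{
    assert(requested > 0);
    return (requested + VBO_PAGE_SIZE - 1) / VBO_PAGE_SIZE * VBO_PAGE_SIZE;
}

/* Bytes lost to padding when buffer_count buffers of the same size exist. */
size_t vbo_wasted_bytes(size_t requested, size_t buffer_count)
{
    return (vbo_committed_bytes(requested) - requested) * buffer_count;
}
```

For example, two triangles with 12-byte positions need 72 bytes, yet each such buffer occupies a full page, so 100 of them waste roughly 400KB. Packing small meshes into one shared VBO avoids this.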
14. What you might not know
• Updating your VBO data each frame:
1. glBufferSubData. If it is used to update a big part of the
original data, it will harm performance. Try to avoid
updating buffers that are currently in use.
2. glBufferData. It’s OK to completely overwrite the original
data. The old data will be orphaned by the driver and new
data storage will be allocated.
3. glMapBuffer with a triple buffered VBO is the preferred way
to update your data.
• EXT_map_buffer_range (iOS 6+ only), when you need to
update only a subset of a buffer object.
15. What you might not know
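// BUFFER_SIZE is a placeholder for the number of bytes written per frame;
// vertexBuffer[] is assumed to hold 3 names created with glGenBuffers.
int bufferID = 0; // index of the VBO to fill this frame
for (int i = 0; i < 3; ++i) // allocate storage for 3 VBOs only, do not upload data yet
{
    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[i]);
    glBufferData(GL_ARRAY_BUFFER, BUFFER_SIZE, NULL, GL_DYNAMIC_DRAW);
}
//...
// each frame:
glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[bufferID]);
void* ptr = glMapBufferOES(GL_ARRAY_BUFFER, GL_WRITE_ONLY_OES);
// update data here by writing through ptr
glUnmapBufferOES(GL_ARRAY_BUFFER);
// draw from vertexBuffer[bufferID], then move on to the next buffer
++bufferID;
if (bufferID == 3) // cycling through 3 buffers
{
    bufferID = 0;
}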
16. What you might not know
• This scheme gives you the best performance
possible: no blocking of the CPU or GPU, no
redundant memcpy operations and lower CPU load, at
the cost of extra memory (note that you need no
extra temporary buffer to store your data before
sending it to the VBO). This is ideal for dynamic
batching of sprites.
update(1), draw(1), gpuworking(..............)
update(2), draw(2), gpuworking(..............)
update(3), draw(3), gpuworking(..............)
17. What you might not know
• Float is the GPU’s native type
• …that means any other type will be converted
to float by USSE
• …resulting in a few additional cycles
• Thus it is your choice of tradeoff between
bandwidth/storage and additional cycles
18. What you might know
• Use interleaved vertex data
– Align each vertex attribute to a 4-byte boundary
19. What you might not know
• If you don’t align your data, the driver will do it
for you.
• …resulting in slower performance.
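One way to keep every attribute on a 4-byte boundary is to describe the interleaved vertex as a struct and let the compiler verify the layout. This is a sketch under assumptions: the `Vertex` fields and their types are hypothetical, chosen only to show explicit padding after a 6-byte attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interleaved vertex; every attribute starts on a 4-byte boundary. */
typedef struct {
    float    position[3]; /* offset 0, 12 bytes */
    int16_t  normal[3];   /* offset 12, 6 bytes */
    int16_t  normal_pad;  /* offset 18, explicit padding up to offset 20 */
    uint8_t  color[4];    /* offset 20, 4 bytes */
    uint16_t uv[2];       /* offset 24, 4 bytes */
} Vertex;

/* Compile-time checks: the build fails if a later edit breaks the alignment. */
static_assert(offsetof(Vertex, normal) % 4 == 0, "normal must be 4-byte aligned");
static_assert(offsetof(Vertex, color) % 4 == 0, "color must be 4-byte aligned");
static_assert(offsetof(Vertex, uv) % 4 == 0, "uv must be 4-byte aligned");
static_assert(sizeof(Vertex) % 4 == 0, "stride must be a multiple of 4");

/* Stride to pass to glVertexAttribPointer for this layout. */
size_t vertex_stride(void)
{
    return sizeof(Vertex);
}
```

The same `offsetof` values can be passed as the pointer argument of glVertexAttribPointer, so the layout is defined in one place.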
20. What you might not know
• The PowerVR SGX 5XT GPU series has a vertex
cache for the last 12 vertex indices. Optimise your
indexed geometry for this cache size.
• PowerVR Series 6 (XT) has 16k of vertex cache
• Take a look at optimisers that use Tom
Forsyth’s algorithm
http://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html
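To compare index orderings before and after optimisation, a small cache simulation is enough. The sketch below assumes a FIFO replacement policy, which is an assumption made for illustration (the slides only give the size of 12 entries); the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define VERTEX_CACHE_SIZE 12 /* entries, as stated for SGX 5XT */

/* Counts how many indices miss a FIFO cache of VERTEX_CACHE_SIZE entries.
   Every miss means one more vertex has to be transformed. */
size_t count_vertex_cache_misses(const unsigned short *indices, size_t index_count)
{
    unsigned short cache[VERTEX_CACHE_SIZE];
    size_t used = 0;   /* number of valid cache entries */
    size_t next = 0;   /* slot that the next miss overwrites */
    size_t misses = 0;
    size_t i, j;

    assert(indices != NULL || index_count == 0);
    for (i = 0; i < index_count; ++i) {
        int hit = 0;
        for (j = 0; j < used; ++j) {
            if (cache[j] == indices[i]) {
                hit = 1;
                break;
            }
        }
        if (hit)
            continue;
        ++misses;
        cache[next] = indices[i];
        next = (next + 1) % VERTEX_CACHE_SIZE;
        if (used < VERTEX_CACHE_SIZE)
            ++used;
    }
    return misses;
}
```

Dividing the miss count by the triangle count gives a single number to compare two orderings of the same mesh; lower is better.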
21. What you might know
• Split your vertex data into two parts:
1. Static VBO: the one that will never be changed
2. Dynamic VBO: the one that needs to be
updated frequently
• Split your vertex data into several VBOs when several
meshes share the same set of attributes
23. What you might know
• Bandwidth matters
1. Use lower precision formats - RGBA4444,
RGBA5551
2. Use PVRTC compressed textures
3. Use atlases
4. Use mipmaps. They improve texture cache
efficiency and quality.
24. What you might not know
• Avoid the RGB8 format: texture data has to be
aligned, so the driver will pad RGB8 to RGBA8.
• Try to replace it with RGB565
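The saving is easy to quantify. A minimal sketch with a hypothetical helper; it only multiplies texel count by texel size and ignores mip maps and any driver-side padding of rows.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of one texture level for a given texel size in bits. */
size_t texture_bytes(size_t width, size_t height, size_t bits_per_texel)
{
    assert(bits_per_texel % 8 == 0);
    return width * height * (bits_per_texel / 8);
}
```

For a 1024x1024 texture, RGB8 is 3MB on disk but occupies 4MB once padded to RGBA8 (32 bits per texel), while RGB565 (16 bits per texel) takes 2MB.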
25. What you might not know
• Why PVRTC?
1. PVRTC provides great compression, resulting in
smaller texture size, improved cache, saved
bandwidth and decreased power consumption
2. PVRTC stores pixel data in the GPU’s native order, i.e.
BGRA instead of RGBA, in blocks optimised for the
data access pattern.
26. What you might not know
• It doesn’t matter whether your textures are in
RGBA or BGRA format: the driver will still do
internal processing on the texture data to
improve memory access locality and cache
efficiency.
27. What you might not know
• On PVR 6 (XT) the driver will reserve memory for both
the texture and its mip map chain, but it will commit
memory only for mip level 0.
• If you decide to generate mip maps, the driver will
commit the pages reserved for the mip chain.
• That is expected.
28. What you might not know
• On PVR 55MP (tested on iOS 4 – 7.1.1 versions)
the driver will ALWAYS commit memory for mip maps,
regardless of whether you requested them or
not.
• That means you’ll waste 33% of memory!
• In most cases you don’t need mip maps for 2D
games, but you are forced to pay this overhead.
• That’s too bad for 2D games. However, there is one
workaround: make your textures NPOT (non-power
of two).
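The 33% figure follows from the geometry of a mip chain: each level has a quarter of the texels of the previous one, so the extra levels add about 1/4 + 1/16 + ... = 1/3 of the base level. A short sketch with a hypothetical helper that sums the full chain:

```c
#include <assert.h>
#include <stddef.h>

/* Total bytes of a full mip chain, from the base level down to 1x1. */
size_t mip_chain_bytes(size_t width, size_t height, size_t bytes_per_texel)
{
    size_t total = 0;

    assert(width > 0 && height > 0);
    for (;;) {
        total += width * height * bytes_per_texel;
        if (width == 1 && height == 1)
            break;
        if (width > 1)
            width /= 2;
        if (height > 1)
            height /= 2;
    }
    return total;
}
```

For a 1024x1024 RGBA8 texture the base level is 4194304 bytes and the whole chain is 5592404 bytes, so the mip levels cost an extra 1398100 bytes, one third of the base level.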
29. What you might not know
• Luckily, there is one solution to this problem.
• Core OpenGL ES 2.0 doesn’t support mip maps
for NPoT (non power of two) textures, so if
you make your textures NPoT, you will
not pay this memory overhead.
30. What you might not know
• Interesting notes:
• The glTexImage2D driver implementation has a
function CheckFastPath. When you upload a
PoT texture you’ll hit this fast path. NPoT
textures miss it.
• When you upload a lot of textures your
VRAM gets fragmented, so the driver will
remap memory, i.e. it will create one big
buffer for several small textures and move
them to that buffer
31. What you might not know
• Let’s take a look at the texture upload process.
• The usual way to do this:
1. Load the texture into a temporary buffer in RAM
1. Decode the texture if it is stored in a compressed file format
– JPG/PNG
2. Feed this buffer to glTexImage2D
3. Draw!
• Looks simple, but is it the fastest way?
32. What you might not know
• …NO!
void* buf = malloc(TEXTURE_SIZE); // 4MB for an RGBA8 1024x1024 texture
LoadTexture(textureName, buf); // placeholder loader: reads and decodes the file into buf
glBindTexture(GL_TEXTURE_2D, textureID);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, buf);
// buf is copied into an internal buffer created by the driver (that's obvious)
free(buf); // the buffer can be freed immediately after glTexImage2D
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
// the driver does some additional work to fully upload the texture the first time it is actually used!
• A lot of redundant work!
33. What you might not know
• Jedi way to upload textures:
// assumes the file holds raw RGBA8 texels that need no decoding
int fileHandle = open(filename, O_RDONLY);
void* ptr = mmap(NULL, TEXTURE_SIZE, PROT_READ, MAP_PRIVATE, fileHandle, 0); // file mapping
close(fileHandle); // the mapping stays valid after the descriptor is closed
glBindTexture(GL_TEXTURE_2D, textureID);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, ptr);
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
// the driver does some additional work to fully upload the texture the first time it is actually used!
munmap(ptr, TEXTURE_SIZE);
• File mapping does not copy your file data into RAM up front! It
loads file data page by page when it is accessed.
• Thus we eliminate one redundant copy, dramatically
decrease texture upload time and decrease memory
fragmentation
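The mapping step can be wrapped with error handling, which the short snippet above leaves out. This is a sketch using only POSIX calls; `MappedFile`, `map_file_readonly` and `unmap_file` are hypothetical names, and the caller is assumed to pass `data` to glTexImage2D before unmapping.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    void  *data; /* start of the read-only mapping, or NULL */
    size_t size; /* size of the mapping in bytes */
} MappedFile;

/* Maps a whole file read-only. Returns 0 on success, -1 on failure. */
int map_file_readonly(const char *path, MappedFile *out)
{
    struct stat st;
    void *ptr;
    int fd;

    assert(path != NULL && out != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    ptr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (ptr == MAP_FAILED)
        return -1;
    out->data = ptr;
    out->size = (size_t)st.st_size;
    return 0;
}

/* Releases the mapping; call it once the texture has been uploaded and used. */
void unmap_file(MappedFile *file)
{
    assert(file != NULL);
    if (file->data != NULL)
        munmap(file->data, file->size);
    file->data = NULL;
    file->size = 0;
}
```

Taking the size from fstat instead of a fixed TEXTURE_SIZE avoids mapping past the end of a file that is shorter than expected.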
34. What you might not know
• Keep in mind that textures are finally wired only
when they are used for the first time. So draw them
off screen immediately after glTexImage2D,
otherwise it will take too long to render the
first frame and it will be nearly impossible to
track down the cause.
35. What you might not know
• NPOT textures work only with the
GL_CLAMP_TO_EDGE wrap mode
• POT textures are preferable: they give you the best
performance possible
• Use NPOT textures with dimensions that are multiples of
32 pixels for best performance
• The driver will pad the data of your NPOT texture to
match the size of the closest POT values.
36. What you might not know
• Prefer OES_texture_half_float instead of
OES_texture_float
• Texture reads fetch only 32 bits per texel, thus RGBA float
texture will result in 4 texture reads
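The read count follows directly from the 32-bit fetch width quoted above. A minimal sketch with a hypothetical helper; it assumes fetches are simply the texel size rounded up to whole 32-bit units.

```c
#include <assert.h>

#define FETCH_BITS 32u /* bits fetched per texture read, per the slide */

/* Number of texture reads needed to fetch one texel. */
unsigned fetches_per_texel(unsigned channels, unsigned bits_per_channel)
{
    unsigned texel_bits = channels * bits_per_channel;

    assert(texel_bits > 0);
    return (texel_bits + FETCH_BITS - 1) / FETCH_BITS;
}
```

Under that assumption an RGBA float texel (128 bits) costs 4 reads, an RGBA half float texel (64 bits) costs 2, and an RGBA8 texel costs 1.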
37. What you might not know
• Always use glClear at the beginning of the
frame…
• … and EXT_discard_framebuffer at the end.
• PVR GPUs have a fast on-chip
depth/stencil buffer for each tile. If you forget
to clear/discard the depth buffer, it will be
uploaded from HW to SW
38. What you might know
• Prefer multi texturing instead of multiple
passes
• Configure texture parameters before feeding
image data to driver
40. What you might know
• Be wise with precision hints
• Avoid branching
• Eliminate loops
• Avoid discard. If you must use it, place the
discard instruction as early as possible to
avoid useless computations
41. What you might not know
• Code inside a dynamic branch (the condition is
a non-constant value) will be executed anyway,
and then its result will be thrown away if the
condition is false
42. What you might not know
• highp – represents 32 bit floating point value
• mediump – represents 16 bit floating point
value in range of [-65520, 65520]
• lowp – 10 bit fixed point values in range of [-2,
2] with step of 1/256
• Try to give the same precision to all your
operands, because conversion takes some time
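To see what lowp does to a value, the quantisation can be emulated on the CPU. This is only an approximation under assumptions: it takes the range [-2, 2] and the 1/256 step from the list above and rounds to the nearest step, while the rounding mode used by the hardware is not stated in the slides. The function name is hypothetical.

```c
#include <assert.h>

/* Emulates lowp storage: clamp to [-2, 2], then round to a multiple of 1/256. */
float quantize_lowp(float x)
{
    const float step = 1.0f / 256.0f;
    long steps;

    assert(x == x); /* NaN cannot be represented in fixed point */
    if (x > 2.0f)
        x = 2.0f;
    if (x < -2.0f)
        x = -2.0f;
    /* Round half away from zero without depending on the math library. */
    steps = (long)(x / step + (x >= 0.0f ? 0.5f : -0.5f));
    return (float)steps * step;
}
```

Running colour or normalised direction values through such a helper offline is a quick way to judge whether lowp is precise enough before changing the shader.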
43. What you might not know
• On USSE1 (that is PVR 5), highp values are
calculated on a scalar processor only:
highp vec4 v1, v2;
highp float s1, s2;
v2 = (v1 * s1) * s2;
// the scalar processor executes v1 * s1 (4 operations), and then this result is multiplied by s2 on
// the scalar processor again (4 additional operations): 8 operations in total
v2 = v1 * (s1 * s2);
// s1 * s2 is 1 operation on the scalar processor; result * v1 is 4 operations: 5 operations in total
45. What you might know
• Typical CPU found in mobile devices:
1. ARMv7/ARMv8 architecture
2. Cortex AX/Krait/Swift or Cyclone
3. Up to 2300 MHz
4. Up to 8 cores
5. Thumb-2 instruction set
46. What you might not know
• ARMv7 has no hardware support for integer
division
• VFPv3, VFPv4 FPU
• NEON SIMD engine
• Unaligned access is done in software on Cortex
A8. That means it is a hundred times slower
• Cortex A8 is an in-order CPU. Cortex A9+ cores are
out of order
47. What you might not know
• A Cortex A9+ core has a full VFPv3 FPU, while
Cortex A8 has VFPLite. That means float
operations take 1 cycle on A9 and 10 cycles on
A8!
48. What you might not know
• NEON – 16 registers, each 128 bits wide.
Supports operations on 8, 16, 32 and 64-bit
integers and 32-bit float values
• NEON can be used for:
– Software geometry instancing;
– Skinning;
– As a general vertex processor;
– Other, typical, applications for SIMD.
49. What you might not know
• There are 3 ways to use the NEON engine in your
code:
1. Intrinsics
1.1 GLKMath
2. Handwritten NEON assembly
3. Autovectorization. Add -mllvm -vectorize
-mllvm -bb-vectorize-aligned-only to Other C/C++
Flags in project settings and you are ready to go.
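As an illustration of the first option, here is a minimal intrinsics sketch that scales an array of floats four at a time. It is not the benchmark code behind the numbers in this presentation; the function name is hypothetical, and a scalar fallback is included so that the same file also builds on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* dst[i] = src[i] * factor for count floats. */
void scale_floats(float *dst, const float *src, size_t count, float factor)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);
#if HAVE_NEON
    /* Process 4 floats per iteration in one 128-bit NEON register. */
    for (; i + 4 <= count; i += 4) {
        float32x4_t v = vld1q_f32(src + i);
        v = vmulq_n_f32(v, factor);
        vst1q_f32(dst + i, v);
    }
#endif
    /* Scalar tail, and the whole loop on targets without NEON. */
    for (; i < count; ++i)
        dst[i] = src[i] * factor;
}
```

The same pattern (vector loop plus scalar tail) carries over to skinning or sprite batching, where the per-element work is a matrix or colour multiply instead of a single scale.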
53. What you might not know
• Summary:
                     Running time, ms    CPU usage, %
Intrinsics           2764                19
Assembly             3664                20
FPU                  6209                25-28
FPU autovectorized   5028                22-24
• Intrinsics got me a 25% speedup over assembly.
• Note that the speed of code generated from
intrinsics will vary from compiler to compiler.
Modern compilers are really good at this.
54. What you might not know
• Intrinsics advantages over assembly:
– Higher level code;
– Much simpler;
– No need to manage registers;
– You can vectorize basic blocks and build a
solution for every new problem from these
blocks. With assembly, in contrast, you have to
solve each new problem from scratch;
55. What you might not know
• Assembly advantages over intrinsics:
– Code generated from intrinsics varies from
compiler to compiler and can give you a really
big difference in speed. Assembly code will
always be the same.
59. What you might not know
• For a detailed explanation of
intrinsics/assembly see:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0491e/CIHJBEFE.html
Editor's notes
In this presentation I am going to talk mostly about Imagination Technologies GPUs. This is at least 50% of the market. I ran all tests on iOS, but I assume you’ll get the same behaviour on Android.
This presentation consists of several parts, each dedicated to optimisation problems in one area.