Optimising games for mobiles
by Dmytro Vovk
Mobile GPU architectures
• There are 3 major mobile GPU architectures on the market:
• IMR (Immediate Mode Renderer)
• TBR (Tile Based Renderer)
• TBDR (Tile Based Deferred Renderer)
IMR
• Renders anything sent to the GPU
immediately. It makes no assumption about
what is going to be submitted next.
• The application has to sort opaque geometry front to back.
• It’s basically brute force.
• Nvidia, AMD.
TBR
• Improves on IMR, but still is an IMR.
• Bandwidth is a precious resource on mobiles
and TBR tries to reduce data transfers as much
as possible.
• Your geometry is split into tiles and then processed per tile. Tiles have a small amount of on-chip memory for colour and depth/stencil buffers, so there is no need to transfer them from/to system memory.
• Qualcomm Adreno, ARM Mali.
TBDR
• It is deferred, i.e. all the graphics is actually drawn later.
• And this is where all the magic happens!
• The GPU is aware of context - it knows what is going to be drawn next, which allows it to employ some awesome optimisations and reduce power consumption, bandwidth and fillrate.
• Imagination PowerVR.
GENERAL RECOMMENDATIONS
What you might know
• Batch, Batch, Batch!
http://ce.u-sys.org/Veranstaltungen/Interaktive%20Computergraphik%20(Stamminger)/papers/BatchBatchBatch.pdf
• Render from one thread only
• Avoid synchronisations:
1. glFlush/glFinish;
2. Querying GL states;
3. Accessing render targets;
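• A minimal sketch of the synchronisation points listed above (assuming an OpenGL ES 2.0 context); each of these calls forces a CPU/driver/GPU round trip:
glFinish();                                                  // blocks the CPU until the GPU has finished all submitted work
GLint boundTexture;
glGetIntegerv(GL_TEXTURE_BINDING_2D, &boundTexture);         // GL state query - track state on the CPU side instead
GLubyte pixel[4];
glReadPixels(0, 0, 1, 1, GL_RGBA, GL_UNSIGNED_BYTE, pixel);  // reading back a render target forces a full pipeline flush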
VERTEX DATA RECOMMENDATIONS
What you might know
• Pixel-perfect HSR (Hidden Surface Removal) on PowerVR; Adreno and ARM Mali do not feature this.
• But you still need to sort transparent geometry!
• Avoid doing alpha test. Use alpha blend instead.
What you might not know
• HSR still requires vertices to be processed!
• …thus don’t forget to cull your geometry on the CPU!
• Prefer Stencil Test over Scissor.
– Stencil test is performed in hardware on PowerVR
GPUs.
– Stencil mask is stored in fast on-chip memory
– Stencil can be of any shape, in contrast to the rectangular Scissor
What you might not know
• Why no alpha test?!
o Alpha test/discard requires the fragment shader to run before visibility for the current fragment can be determined. This removes the benefits of HSR.
o Even more: if the shader code contains discard, then any geometry rendered with this shader will suffer from the alpha test drawbacks. Even if this keyword is behind a condition, USSE (PVR’s shader engine) assumes that the condition may be hit.
o Move discard into a separate shader.
o Draw opaque geometry first, then alpha tested, and alpha blended at the end.
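• A minimal sketch of that submission order (the Draw* helpers are hypothetical):
glDisable(GL_BLEND);
DrawOpaqueGeometry();        // shaders without discard - HSR works at full efficiency
DrawAlphaTestedGeometry();   // the only shaders that contain discard
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
DrawAlphaBlendedGeometry();  // sorted back to front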
What you might know
• Bandwidth matters
1. Use constant colour per object, instead of per
vertex
2. Simplify your models. Use smaller data types.
3. Use indexed triangles or non-indexed triangle
strips
4. Use VBO instead of client arrays
5. Use VAO
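• A minimal sketch following the list above (assumes the OES_vertex_array_object extension, and that vertices and indices are plain arrays of interleaved vertex data and 16-bit indices):
GLuint vao, vbo, ibo;
glGenVertexArraysOES(1, &vao);
glBindVertexArrayOES(vao);
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);        // VBO instead of client arrays
glGenBuffers(1, &ibo);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices, GL_STATIC_DRAW);  // indexed triangles
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_SHORT, GL_FALSE, 8, 0);                            // positions as 16-bit shorts, 8-byte stride
glBindVertexArrayOES(0);
// per draw call:
glBindVertexArrayOES(vao);
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, 0);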
What you might not know
• VBO allocations are aligned to the 4KB page size. That means your small buffer for just a couple of triangles will still occupy 4KB of memory - a large number of small VBOs can fragment and waste your memory.
What you might not know
• Updating your VBO data each frame:
1. glBufferSubData. If it is used to update a big part of the original data, it will harm performance. Try to avoid updates to buffers that are currently in use.
2. glBufferData. It’s OK to completely overwrite the original data. The old data will be orphaned by the driver and new data storage will be allocated.
3. glMapBuffer with a triple-buffered VBO is the preferred way to update your data.
• EXT_map_buffer_range (iOS 6+ only), when you need to update only a subset of a buffer object.
What you might not know
int bufferID = 0; // initialization
for (int i = 0; i < 3; ++i) // allocate storage for the 3 VBOs only, do not upload any data yet
{
    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[i]);
    glBufferData(GL_ARRAY_BUFFER, BUFFER_SIZE, NULL, GL_DYNAMIC_DRAW); // BUFFER_SIZE = per-frame capacity
}
//...
// each frame: map the next buffer in the ring and fill it
glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[bufferID]);
void* ptr = glMapBufferOES(GL_ARRAY_BUFFER, GL_WRITE_ONLY_OES);
// update data here
glUnmapBufferOES(GL_ARRAY_BUFFER);
++bufferID;
if (bufferID == 3) // cycling through 3 buffers
{
    bufferID = 0;
}
What you might not know
• This scheme gives you the best performance possible - no blocking of the CPU or GPU, no redundant memcpy operations and a lower CPU load, at the cost of some extra memory (note that you need no extra temporary buffer to store your data before sending it to the VBO). This is ideal for dynamic batching of sprites.
update(1), draw(1), gpuworking(..............)
update(2), draw(2), gpuworking(..............)
update(3), draw(3), gpuworking(..............)
What you might not know
• The float type is native to the GPU
• …that means any other type will be converted to float by USSE
• …resulting in a few additional cycles
• Thus it’s your tradeoff between bandwidth/storage and additional cycles
What you might know
• Use interleaved vertex data
– Align each vertex attribute to 4-byte boundaries
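• A minimal sketch of an interleaved, 4-byte-aligned layout (the attribute indices ATTRIB_* and the normalised flags are illustrative):
typedef struct
{
    GLfloat position[3]; // 12 bytes
    GLushort uv[2];      //  4 bytes
    GLubyte colour[4];   //  4 bytes - every attribute starts on a 4-byte boundary
} Vertex;                // 20 bytes per vertex, no padding required

glVertexAttribPointer(ATTRIB_POSITION, 3, GL_FLOAT,          GL_FALSE, sizeof(Vertex), (void*)0);
glVertexAttribPointer(ATTRIB_UV,       2, GL_UNSIGNED_SHORT, GL_TRUE,  sizeof(Vertex), (void*)12);
glVertexAttribPointer(ATTRIB_COLOUR,   4, GL_UNSIGNED_BYTE,  GL_TRUE,  sizeof(Vertex), (void*)16);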
What you might not know
• If you don’t align your data, the driver will do it for you…
• …resulting in slower performance.
What you might not know
• PowerVR SGX 5XT GPU series have a vertex cache for the last 12 vertex indices. Optimise your indexed geometry for this cache size.
• PowerVR Series 6 (XT) has 16k of vertex cache.
• Take a look at optimisers that use Tom Forsyth’s algorithm: http://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html
What you might know
• Split your vertex data into two parts:
1. Static VBO - the one that will never change
2. Dynamic VBO - the one that needs to be updated frequently
• Split your vertex data into a few VBOs when several meshes share the same set of attributes
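• A minimal sketch of that split (staticVBO/dynamicVBO, the sizes and the data pointers are assumed to exist):
glBindBuffer(GL_ARRAY_BUFFER, staticVBO);
glBufferData(GL_ARRAY_BUFFER, staticSize, staticData, GL_STATIC_DRAW);  // uploaded once, never touched again
glBindBuffer(GL_ARRAY_BUFFER, dynamicVBO);
glBufferData(GL_ARRAY_BUFFER, dynamicSize, NULL, GL_DYNAMIC_DRAW);      // storage only; refilled every frame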
TEXTURE DATA RECOMMENDATIONS
What you might know
• Bandwidth matters
1. Use lower precision formats - RGBA4444,
RGBA5551
2. Use PVRTC compressed textures
3. Use atlases
4. Use mipmaps. They improve texture cache
efficiency and quality.
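• A minimal sketch of both options (assumes pvrtcData/pvrtcDataSize come from a pre-compressed .pvr file and rgba4444Data was already converted on the CPU):
// PVRTC 4bpp compressed upload (GL_IMG_texture_compression_pvrtc):
glCompressedTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGBA_PVRTC_4BPPV1_IMG,
                       width, height, 0, pvrtcDataSize, pvrtcData);
// Lower precision uncompressed alternative - RGBA4444:
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
             GL_RGBA, GL_UNSIGNED_SHORT_4_4_4_4, rgba4444Data);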
What you might not know
• Avoid the RGB8 format - texture data has to be aligned, so the driver will pad RGB8 to RGBA8.
• Try to replace it with RGB565
What you might not know
• Why PVRTC?
1. PVRTC provides great compression, resulting in
smaller texture size, improved cache, saved
bandwidth and decreased power consumption
2. PVRTC stores pixel data in the GPU’s native order, i.e. BGRA instead of RGBA, in blocks optimised for the data access pattern.
What you might not know
• It doesn’t matter whether your textures are in RGBA or BGRA format - the driver will still do internal processing on the texture data to improve memory access locality and cache efficiency.
What you might not know
• On PVR Series 6 (XT) the driver will reserve memory for both the texture and its mipmap chain, but it will commit memory only for mip level 0.
• If you decide to generate mipmaps, the driver will commit the pages reserved for the mip chain.
• That’s expected behaviour.
What you might not know
• On PVR 5/5MP (tested on iOS versions 4 – 7.1.1) the driver will ALWAYS commit memory for mipmaps, regardless of whether you requested to create them or not.
• That means you’ll waste 33% of memory!
• In most cases you don’t need mipmaps for 2D games, but you are forced to pay this overhead.
• That’s too bad for 2D games. However there is one workaround - make your textures NPOT (non-power of two).
What you might not know
• Luckily, there is one solution to this problem.
• Core OpenGL ES 2.0 doesn’t support mipmaps for NPoT (non-power-of-two) textures, so if you make your textures NPoT, you will not pay this memory overhead.
What you might not know
• Interesting notes:
• The glTexImage2D driver implementation has a function CheckFastPath. When you upload a PoT texture you’ll hit this fast path; NPoT textures skip it.
• When you upload a lot of textures your VRAM gets fragmented, so the driver will remap memory - i.e. it will create one big buffer for a few small textures and move them into that buffer.
What you might not know
• Let’s take a look at the texture upload process.
• The usual way to do this:
1. Load the texture into a temporary buffer in RAM (decoding it first if it is stored in a compressed file format such as JPG/PNG)
2. Feed this buffer to glTexImage2D
3. Draw!
• Looks simple, but is it the fastest way?
What you might not know
• …NO!
void* buf = malloc(TEXTURE_SIZE); // 4MB for an RGBA8 1024x1024 texture
LoadTexture(textureName); // fill buf with the decoded texture data
glBindTexture(GL_TEXTURE_2D, textureID);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, buf);
// buf is copied into an internal buffer created by the driver (that's obvious)
free(buf); // the buffer can be freed immediately after glTexImage2D
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
// the driver will do additional work to fully upload the texture the first time it is actually used!
• A lot of redundant work!
What you might not know
• Jedi way to upload textures:
int fileHandle = open(filename, O_RDONLY);
void* ptr = mmap(NULL, TEXTURE_SIZE, PROT_READ, MAP_PRIVATE, fileHandle, 0); // file mapping
glBindTexture(GL_TEXTURE_2D, textureID);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, ptr);
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
// the driver will do additional work to fully upload the texture the first time it is actually used!
munmap(ptr, TEXTURE_SIZE);
close(fileHandle);
• File mapping does not copy your file data into RAM! It loads file data page by page as it is accessed.
• Thus we eliminated one redundant copy, dramatically decreased texture upload time and decreased memory fragmentation.
What you might not know
• Keep in mind that textures are finally wired only when they are used for the first time. So draw them off-screen immediately after glTexImage2D, otherwise it will take too long to render the first frame and it will be nearly impossible to track the cause of this.
What you might not know
• NPOT textures work only with the GL_CLAMP_TO_EDGE wrap mode
• POT textures are preferable; they give you the best performance possible
• Use NPOT textures with dimensions that are multiples of 32 pixels for best performance
• The driver will pad your NPOT texture data to match the closest POT size.
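• A minimal sketch of the texture state an NPOT texture needs under core OpenGL ES 2.0:
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR); // no mipmapped filter for NPOT
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);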
What you might not know
• Prefer OES_texture_half_float over OES_texture_float
• Texture reads fetch only 32 bits per texel, so an RGBA float texture will result in 4 texture reads
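• A minimal sketch, assuming OES_texture_half_float is available (halfFloatData is 16-bit-per-channel pixel data):
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width, height, 0,
             GL_RGBA, GL_HALF_FLOAT_OES, halfFloatData); // 64 bits per texel instead of 128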
What you might not know
• Always use glClear at the beginning of the
frame…
• … and EXT_discard_framebuffer at the end.
• PVR GPUs have a fast on-chip depth/stencil buffer for each tile. If you forget to clear/discard the depth buffer, it will be transferred from the fast on-chip (HW) buffer to system memory (SW).
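• A minimal sketch of the begin/end-of-frame pattern (assumes rendering into an FBO and the EXT_discard_framebuffer extension):
// frame start:
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT);
// ... draw the frame ...
// frame end - tell the driver the depth/stencil contents can be thrown away:
const GLenum discards[] = { GL_DEPTH_ATTACHMENT, GL_STENCIL_ATTACHMENT };
glDiscardFramebufferEXT(GL_FRAMEBUFFER, 2, discards);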
What you might know
• Prefer multi-texturing over multiple passes
• Configure texture parameters before feeding image data to the driver
SHADERS BEST PRACTICES
What you might know
• Be wise with precision hints
• Avoid branching
• Eliminate loops
• Avoid discard. If you must use it, place the discard instruction as early as possible to avoid useless computations
What you might not know
• Code inside a dynamic branch (where the condition is a non-constant value) will be executed anyway and then discarded if the condition turns out to be false
What you might not know
• highp - a 32-bit floating point value
• mediump - a 16-bit floating point value in the range [-65520, 65520]
• lowp - a 10-bit fixed point value in the range [-2, 2] with a step of 1/256
• Try to give the same precision to all your operands, because conversion takes some time
What you might not know
• highp values are calculated on a scalar processor only on USSE1 (that’s PVR Series 5):
highp vec4 v1, v2;
highp float s1, s2;
v2 = (v1 * s1) * s2;
// the scalar processor executes v1 * s1 - 4 operations - and then that result is multiplied by s2
// on the scalar processor again - 4 additional operations
v2 = v1 * (s1 * s2);
// s1 * s2 - 1 operation on the scalar processor; result * v1 - 4 operations on the scalar processor
HARDWARE FEATURES
What you might know
• Typical CPU found in mobile devices:
1. ARMv7/ARMv8 architecture
2. Cortex Ax/Krait/Swift or Cyclone
3. Up to 2300 MHz
4. Up to 8 cores
5. Thumb-2 instruction set
What you might not know
• ARMv7 has no hardware support for integer
division
• VFPv3, VFPv4 FPU
• NEON SIMD engine
• Unaligned access is handled in software on Cortex A8. That means it is a hundred times slower
• Cortex A8 is an in-order CPU; Cortex A9 and later are out-of-order
What you might not know
• Cortex A9 and later cores have a full VFPv3 FPU, while Cortex A8 has VFPLite. That means float operations take 1 cycle on A9 and 10 cycles on A8!
What you might not know
• NEON - 16 registers, 128 bits wide each. It supports operations on 8-, 16-, 32- and 64-bit integers and 32-bit float values
• NEON can be used for:
– Software geometry instancing;
– Skinning;
– As a general vertex processor;
– Other, typical, applications for SIMD.
What you might not know
• There are 3 ways to use the NEON engine in your code:
1. Intrinsics
1.1 GLKMath
2. Handwritten NEON assembly
3. Autovectorization. Add -mllvm -vectorize -mllvm -bb-vectorize-aligned-only to Other C/C++ Flags in the project settings and you are ready to go.
What you might not know
• Intrinsics:
What you might not know
• Assembly:
What you might not know
• Summary:
                     Running time, ms   CPU usage, %
Intrinsics                 2764              19
Assembly                   3664              20
FPU                        6209             25-28
FPU autovectorized         5028             22-24
• Intrinsics got me a 25% speedup over assembly.
• Note that the speed of code generated from intrinsics will vary from compiler to compiler. Modern compilers are really good at this.
What you might not know
• Intrinsics advantages over assembly:
– Higher level code;
– Much simpler;
– No need to manage registers;
– You can vectorize basic blocks and build a solution for every new problem from these blocks. In contrast, with assembly you have to solve each new problem from scratch;
What you might not know
• Assembly advantages over intrinsics:
– Code generated from intrinsics varies from compiler to compiler and can give you a really big difference in speed. Assembly code will always be the same.
What you might not know
__attribute__((always_inline)) void Matrix4ByVec4(const float32x4x4_t* __restrict__ mat,
                                                  const float32x4_t* __restrict__ vec,
                                                  float32x4_t* __restrict__ result)
{
    (*result) = vmulq_n_f32((*mat).val[0], (*vec)[0]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[1], (*vec)[1]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[2], (*vec)[2]);
    (*result) = vmlaq_n_f32((*result), (*mat).val[3], (*vec)[3]);
}
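• A hypothetical usage sketch (assumes #include <arm_neon.h>, and that matrix, vector and result point to 16, 4 and 4 floats respectively):
float32x4x4_t m;
m.val[0] = vld1q_f32(matrix + 0);  // load the 4 columns of a column-major matrix
m.val[1] = vld1q_f32(matrix + 4);
m.val[2] = vld1q_f32(matrix + 8);
m.val[3] = vld1q_f32(matrix + 12);
float32x4_t v = vld1q_f32(vector);
float32x4_t out;
Matrix4ByVec4(&m, &v, &out);       // out = m * v
vst1q_f32(result, out);            // store the transformed vector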
What you might not know
__attribute__((always_inline)) void Matrix4ByMatrix4(const float32x4x4_t* __restrict__ m1,
                                                     const float32x4x4_t* __restrict__ m2,
                                                     float32x4x4_t* __restrict__ r)
{
#ifdef INTRINSICS
    (*r).val[0] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[0], 0));
    (*r).val[1] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[1], 0));
    (*r).val[2] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[2], 0));
    (*r).val[3] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[3], 0));
    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[1], vgetq_lane_f32((*m2).val[0], 1));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[1], vgetq_lane_f32((*m2).val[1], 1));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[1], vgetq_lane_f32((*m2).val[2], 1));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[1], vgetq_lane_f32((*m2).val[3], 1));
    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[2], vgetq_lane_f32((*m2).val[0], 2));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[2], vgetq_lane_f32((*m2).val[1], 2));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[2], vgetq_lane_f32((*m2).val[2], 2));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[2], vgetq_lane_f32((*m2).val[3], 2));
    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[3], vgetq_lane_f32((*m2).val[0], 3));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[3], vgetq_lane_f32((*m2).val[1], 3));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[3], vgetq_lane_f32((*m2).val[2], 3));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[3], vgetq_lane_f32((*m2).val[3], 3));
#endif // the non-INTRINSICS (assembly) path is shown on the next slide
}
What you might not know
__asm__ volatile
(
    "vldmia %6, { q0-q3 }\n\t"
    "vldmia %0, { q8-q11 }\n\t"
    "vmul.f32 q12, q8, d0[0]\n\t"
    "vmul.f32 q13, q8, d2[0]\n\t"
    "vmul.f32 q14, q8, d4[0]\n\t"
    "vmul.f32 q15, q8, d6[0]\n\t"
    "vmla.f32 q12, q9, d0[1]\n\t"
    "vmla.f32 q13, q9, d2[1]\n\t"
    "vmla.f32 q14, q9, d4[1]\n\t"
    "vmla.f32 q15, q9, d6[1]\n\t"
    "vmla.f32 q12, q10, d1[0]\n\t"
    "vmla.f32 q13, q10, d3[0]\n\t"
    "vmla.f32 q14, q10, d5[0]\n\t"
    "vmla.f32 q15, q10, d7[0]\n\t"
    "vmla.f32 q12, q11, d1[1]\n\t"
    "vmla.f32 q13, q11, d3[1]\n\t"
    "vmla.f32 q14, q11, d5[1]\n\t"
    "vmla.f32 q15, q11, d7[1]\n\t"
    "vldmia %1, { q0-q3 }\n\t"
    "vmul.f32 q8, q12, d0[0]\n\t"
    "vmul.f32 q9, q12, d2[0]\n\t"
    "vmul.f32 q10, q12, d4[0]\n\t"
    "vmul.f32 q11, q12, d6[0]\n\t"
    "vmla.f32 q8, q13, d0[1]\n\t"
    "vmla.f32 q8, q14, d1[0]\n\t"
    "vmla.f32 q8, q15, d1[1]\n\t"
    "vmla.f32 q9, q13, d2[1]\n\t"
    "vmla.f32 q9, q14, d3[0]\n\t"
    "vmla.f32 q9, q15, d3[1]\n\t"
    "vmla.f32 q10, q13, d4[1]\n\t"
    "vmla.f32 q10, q14, d5[0]\n\t"
    "vmla.f32 q10, q15, d5[1]\n\t"
    "vmla.f32 q11, q13, d6[1]\n\t"
    "vmla.f32 q11, q14, d7[0]\n\t"
    "vmla.f32 q11, q15, d7[1]\n\t"
    "vstmia %2, { q8 }\n\t"
    "vstmia %3, { q9 }\n\t"
    "vstmia %4, { q10 }\n\t"
    "vstmia %5, { q11 }"
    :
    : "r" (proj), "r" (squareVertices), "r" (v1), "r" (v2), "r" (v3), "r" (v4), "r" (modelView)
    : "memory", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15"
);
What you might not know
• For a detailed explanation of intrinsics/assembly see: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0491e/CIHJBEFE.html
Speaker notes

1. In this presentation I am going to talk mostly about Imagination Technologies GPUs - this is at least 50% of the market. All tests were done on iOS, but I assume you'll get the same behaviour on Android. This presentation consists of a few parts, each dedicated to optimisation problems in one area.
2. I'll start with the most common recommendations.