Learn iOS Game Optimization. Ultimate Guide

                 by Dmitriy Vovk
Want to achieve the same level of
technology speed? Welcome!
                                Image is used without any permissions 
What you might know
• Batch, Batch, Batch!

• Render from one thread only
• Avoid synchronizations:
  1. glFlush/glFinish;
  2. Querying GL states;
  3. Accessing render targets;
Vertex Data
What you might know
• Pixel perfect HSR (Hidden Surface Removal)
• But still need to sort opaque
• Avoid doing alpha test. Use alpha
  blend instead
What you might not know
• HSR still requires vertices to be processed!
• …thus don’t forget to cull your geometry on the CPU
• Prefer Stencil Test before Scissor.
  – Stencil test is performed in hardware on
    PowerVR GPUs, thus resulting in dramatically
    increased performance.
  – Stencil can be of any form in contrast to the
    rectangular Scissor
What you might not know
• Why no alpha test?!
o Alpha test/discard requires the fragment shader to run before
  visibility for the current fragment can be determined. This
  removes the benefits of HSR
o Even more! If shader code contains discard, then any
  geometry rendered with this shader will suffer from the alpha
  test drawbacks. Even if this keyword is under a condition,
  USSE assumes that this condition may be hit.
o Move discard into a separate shader
o Draw opaque geometry first, then alpha tested geometry, and
  alpha blended geometry last
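The draw order from the last bullet can be sketched on the CPU side as a sort by pass. This is only a minimal sketch in C: `PassKind`, `DrawItem` and `sort_draw_list` are hypothetical names used for illustration, not part of OpenGL ES or of the original talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pass categories, listed in the order they should be drawn. */
typedef enum {
    PASS_OPAQUE = 0,        /* no discard, no blending: keeps the HSR benefit */
    PASS_ALPHA_TESTED = 1,  /* shader contains discard */
    PASS_ALPHA_BLENDED = 2  /* blended geometry is drawn last */
} PassKind;

typedef struct {
    PassKind pass;
    int mesh_id; /* placeholder for whatever identifies the draw call */
} DrawItem;

static int compare_by_pass(const void *a, const void *b)
{
    const DrawItem *x = (const DrawItem *)a;
    const DrawItem *y = (const DrawItem *)b;
    return (int)x->pass - (int)y->pass;
}

/* Reorders the list: opaque first, then alpha tested, then alpha blended.
   qsort is not stable, so the order inside one pass may change. */
void sort_draw_list(DrawItem *items, size_t count)
{
    if (count == 0)
        return;
    assert(items != NULL);
    qsort(items, count, sizeof(DrawItem), compare_by_pass);
}
```

Blended geometry usually also needs its own ordering inside the pass (for example back to front); the sketch leaves that out.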
What you might know
• Bandwidth matters
 1. Use constant color per object, instead of
    per vertex
 2. Simplify your models. Use smaller data
 3. Use indexed triangles or non-indexed
    triangle strips
 4. Use VBO instead of client arrays
 5. Use VAO
What you might not know
–   The VAO implementation on at least
    iOS 4.0 harmed your performance
–   VBOs are allocated in multiples of the
    4 KB page size. Be aware of that.
    A large number of small VBOs can
    fragment and waste your memory
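To get a feeling for the cost of the 4 KB granularity, the rounding can be worked out by hand. A minimal sketch, assuming the page size quoted above; `vbo_reserved_bytes` and `vbo_wasted_bytes` are hypothetical helper names, not driver API.

```c
#include <assert.h>
#include <stddef.h>

/* Allocation granularity quoted on the slide; other drivers may differ. */
#define VBO_PAGE_SIZE 4096u

/* Bytes actually reserved for a VBO of `requested` bytes, rounded up to whole pages. */
size_t vbo_reserved_bytes(size_t requested)
{
    assert(requested > 0);
    return (requested + VBO_PAGE_SIZE - 1) / VBO_PAGE_SIZE * VBO_PAGE_SIZE;
}

/* Bytes lost to rounding when `count` buffers of `requested` bytes each are created. */
size_t vbo_wasted_bytes(size_t requested, size_t count)
{
    return (vbo_reserved_bytes(requested) - requested) * count;
}
```

Under this model, 100 buffers of 256 bytes each reserve 409,600 bytes to hold 25,600 bytes of vertex data; packing them into one larger VBO avoids most of that.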
What you might not know
• Updating your VBO data each frame:
 1. glBufferSubData that updates a big part of the
    original data harms performance. Try not to
    update a buffer that is currently in use
 2. glBufferData that completely overwrites the original
    data is OK. The old data will be orphaned by the driver
    and storage for the new data will be allocated
 3. glMapBuffer with a triple buffered VBO is the preferred
    way to update your data
 4. EXT_map_buffer_range (iOS 6 only), when you need to
    update only a subset of a buffer object.
What you might not know
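int bufferID = 0; // initialization
for (int i = 0; i < 3; ++i) { // only allocate storage for 3 VBOs, do not upload data
    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[i]);
    glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_DYNAMIC_DRAW); // bufferSize: assumed name
}
// each frame:
glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[bufferID]);
void* ptr = glMapBufferOES(GL_ARRAY_BUFFER, GL_WRITE_ONLY_OES);
// update data here, writing through ptr
glUnmapBufferOES(GL_ARRAY_BUFFER);
if (++bufferID == 3) // cycling through 3 buffers
    bufferID = 0;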
What you might not know
• This scheme will give you the best
  performance possible: the CPU is not blocked by
  the GPU (or vice versa), there are no redundant memcpy
  operations, and CPU load is lower, but extra
  memory is used (note that you will need no
  extra temporary buffer to store your data
  before sending it to the VBO).

  update(1), draw(1), gpuworking(................)
      update(2), draw(2), gpuworking(................)
      update(3), draw(3), gpuworking(................)
What you might not know
• Float type is native to GPU
• …that means any other type will be
  converted to float by USSE
• …resulting in a few additional cycles
• Thus it’s your choice in the tradeoff
  between bandwidth/storage and
  additional cycles
What you might know
• Use interleaved vertex data
  – Align each vertex attribute by 4 bytes
What you might not know
• Why do you have to do this?!
  – You don’t. The driver can do this instead of you
  – …resulting in slower performance.
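One way to keep every attribute on a 4-byte boundary is to let the vertex struct carry explicit padding. A minimal sketch in C; the `Vertex` layout and `is_attribute_aligned` are made up for illustration and are not taken from the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interleaved vertex; every attribute starts on a 4-byte boundary. */
typedef struct {
    float   position[3]; /* offset 0,  12 bytes */
    int8_t  normal[3];   /* offset 12, 3 bytes */
    int8_t  normal_pad;  /* explicit padding keeps the next attribute aligned */
    uint8_t color[4];    /* offset 16, 4 bytes */
    int16_t uv[2];       /* offset 20, 4 bytes */
} Vertex;

/* Fails at compile time if the compiler lays the struct out differently. */
_Static_assert(sizeof(Vertex) == 24, "unexpected Vertex size");

/* Returns 1 when a byte offset satisfies the 4-byte alignment rule. */
int is_attribute_aligned(size_t offset)
{
    return offset % 4 == 0;
}
```

With a layout like this, `sizeof(Vertex)` would serve as the stride and `offsetof(Vertex, color)` and friends as the attribute offsets passed to glVertexAttribPointer.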
What you might know
• Split your vertex data into two parts:
  1. Static VBO – the one that will never be
     updated
  2. Dynamic VBO – the one that needs to
     be updated frequently
• Split your vertex data into a few VBOs
  when a few meshes share the same set
  of attributes
Texture Data
What you might know
• Bandwidth matters
  1. Use lower precision formats i.e.
  2. Use PVRTC compressed textures
  3. Use atlases
  4. Use mipmaps. They improve texture
     cache efficiency and quality.
What you might not know
• iOS OpenGL ES drivers from version 4.0
  prior to 6.0 have a bug that will ALWAYS
  reserve memory for mipmaps, regardless of
  whether you requested to create them or
  not. And you don’t need mipmaps for 2D
• …but there is one workaround: make
  your textures NPOT (non-power of two).
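How much memory an unwanted mip chain costs can be estimated with a few lines of C. This is a rough sketch for uncompressed textures only; `base_level_bytes` and `mip_chain_bytes` are hypothetical helpers, and the real driver allocation may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for the base level of an uncompressed texture. */
size_t base_level_bytes(size_t width, size_t height, size_t bytes_per_texel)
{
    assert(width > 0 && height > 0 && bytes_per_texel > 0);
    return width * height * bytes_per_texel;
}

/* Bytes needed for the full mip chain, from the base level down to 1x1. */
size_t mip_chain_bytes(size_t width, size_t height, size_t bytes_per_texel)
{
    size_t total = 0;
    assert(width > 0 && height > 0 && bytes_per_texel > 0);
    for (;;) {
        total += width * height * bytes_per_texel;
        if (width == 1 && height == 1)
            break;
        if (width > 1)
            width /= 2;
        if (height > 1)
            height /= 2;
    }
    return total;
}
```

For a 1024x1024 RGBA8 texture this gives 5,592,404 bytes for the full chain against 4,194,304 bytes for the base level alone, roughly one third extra.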
What you might not know
• NPOT textures work only with the
  GL_CLAMP_TO_EDGE wrap mode
• POT textures are preferable; they give you the best
  performance possible
• Use NPOT textures with dimensions that are multiples of
  32 pixels for best performance
• The driver will pad the data of your NPOT texture to
  match the size of the closest POT values.
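The padding cost can be estimated by rounding each dimension up to a power of two. A small sketch under the assumption that "closest POT" means the next power of two that is not smaller than the dimension; `next_pow2` and `padded_texels` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of two that is >= value. */
size_t next_pow2(size_t value)
{
    size_t p = 1;
    assert(value > 0);
    while (p < value)
        p *= 2;
    return p;
}

/* Texels reserved if each dimension is padded up to a power of two
   (an assumed model of the driver behaviour described on the slide). */
size_t padded_texels(size_t width, size_t height)
{
    return next_pow2(width) * next_pow2(height);
}
```

Under this model a 480x320 texture holds 153,600 texels but reserves 262,144 after padding to 512x512.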
What you might not know
• Why do I have to use PVRTC? It looks
  1. PVRTC provides great compression,
     resulting in smaller texture size,
     improved cache, saved bandwidth and
     decreased power consumption
  2. PVRTC stores pixel data in GPU’s native
     order, i.e. BGRA instead of RGBA
What you might not know
1. RGBA:
 •   Requires pixel data to be shuffled by driver into
 •   Has options for RGB422, RGB565, RGBA4444,
2. BGRA:
 •   Stores data in GPU’s native order
 •   Has option only for BGRA8888 for upload and
     BGRA888, BGRA5551, BGRA4444 for ReadPixels
What you might not know
• Prefer OES_texture_half_float instead of
• Texture reads read only 32 bits per texel, thus
  RGBA float texture will result in 4 texture reads
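The read count follows directly from the 32 bits per read stated above. A minimal sketch that applies that rule to other texel sizes; `texture_reads_per_texel` is a hypothetical helper, and the half-float result is an inference from the rule rather than a figure from the talk.

```c
#include <assert.h>

/* Bits fetched by one texture read, as stated on the slide. */
#define BITS_PER_TEXTURE_READ 32u

/* Number of reads needed to fetch one texel of the given size, rounded up. */
unsigned texture_reads_per_texel(unsigned bits_per_texel)
{
    assert(bits_per_texel > 0);
    return (bits_per_texel + BITS_PER_TEXTURE_READ - 1) / BITS_PER_TEXTURE_READ;
}
```

An RGBA float texel is 128 bits, giving 4 reads; an RGBA half-float texel is 64 bits, giving 2; an RGBA8 texel fits in a single read.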
What you might know
• Prefer multitexturing instead of
  multiple passes
• Configure texture parameters before
  feeding image data to driver
What you might not know
• Texture uploading to the GPU is a
• Usual way to do this:
  1. Load texture to a temporary buffer in RAM
  2. Feed this buffer to glTexImage2D
  3. Draw!
• Looks simple and fast, right?
What you might not know
• …NO!

void* buf = malloc(TEXTURE_SIZE);        // 4 MB for an RGBA8 1024x1024 texture
// ... read the texture file into buf here ...

glBindTexture(GL_TEXTURE_2D, textureID);

glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, buf); // pass buf, not &buf

// buf is copied into an internal buffer created by the driver (that's obvious)

free(buf); // the buffer can be freed immediately after glTexImage2D

// the driver will do some additional work to fully upload the texture the first time it is actually used!

• Textures are finally uploaded only when they are used
  first time. So draw them off screen immediately after
• A lot of redundant work!
What you might not know
• Jedi way to upload textures:
void* ptr = mmap(NULL, TEXTURE_SIZE, PROT_READ, MAP_PRIVATE, fileHandle, 0); // file mapping

glBindTexture(GL_TEXTURE_2D, textureID);

glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, ptr);

// the mapped file data is copied into an internal buffer created by the driver;
// there is no malloc'ed staging buffer, so there is nothing to free

// the driver will do some additional work to fully upload the texture the first time it is actually used!

munmap(ptr, TEXTURE_SIZE);

• File mapping does not copy your file data into RAM! It
  loads file data page by page, when it’s accessed.
• Thus we eliminated one redundant copy, dramatically
  decreased texture upload time and decreased memory usage
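The snippet above takes `fileHandle` and `TEXTURE_SIZE` as given. One possible way to obtain both with plain POSIX calls is sketched below; `map_file_readonly` and `unmap_file` are hypothetical helper names, and the file is assumed to contain raw texel data in the layout glTexImage2D expects.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Maps a whole file read-only. Returns NULL on failure; on success the
   file size is stored in *out_size. Name and signature are hypothetical. */
void *map_file_readonly(const char *path, size_t *out_size)
{
    struct stat info;
    void *ptr;
    int fd;

    assert(path != NULL && out_size != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &info) != 0 || info.st_size <= 0) {
        close(fd);
        return NULL;
    }
    ptr = mmap(NULL, (size_t)info.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (ptr == MAP_FAILED)
        return NULL;
    *out_size = (size_t)info.st_size;
    return ptr;
}

/* Releases a mapping returned by map_file_readonly. */
void unmap_file(void *ptr, size_t size)
{
    if (ptr != NULL)
        munmap(ptr, size);
}
```

The returned pointer would be handed to glTexImage2D in place of `ptr`, and `unmap_file` called once the upload call has returned.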
What you might not know
• Always use glClear at the beginning
  of the frame…
• … and EXT_discard_framebuffer at
  the end.
• PVR series GPUs have a fast on-chip
  depth buffer for each tile. If you
  forget to clear/discard the depth buffer, it
  will be uploaded from HW to SW
Shaders Best Practices
What you might know
• Be wise with precision hints
• Avoid branching
• Eliminate loops
• Avoid discard. If you have to use it, place the
  discard instruction as early as possible to
  avoid useless computations
What you might not know
• Code inside a dynamic branch (its
  condition is evaluated against a value
  calculated in the shader) will be
  executed anyway, and then its result will be
  thrown away if the condition is false
What you might not know
• highp – represents a 32-bit floating point value
• mediump – represents a 16-bit floating point
  value in the range [-65520, 65520]
• lowp – 10-bit fixed point values in the range [-2,
  2] with a step of 1/256
• Try to give the same precision to all your
  operands, because conversion takes some time
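What a step of 1/256 means for actual values can be tried out on the CPU. The sketch below models lowp exactly as the slide describes it (range [-2, 2], step 1/256); `quantize_lowp` is an illustrative helper and says nothing about how a particular GPU rounds.

```c
#include <assert.h>

/* Models lowp as described on the slide: fixed point, range [-2, 2], step 1/256. */
float quantize_lowp(float value)
{
    const float steps_per_unit = 256.0f;
    float scaled;
    int steps;

    if (value > 2.0f)
        value = 2.0f;
    if (value < -2.0f)
        value = -2.0f;
    scaled = value * steps_per_unit;
    /* round to the nearest step without depending on libm */
    steps = (int)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
    assert(steps >= -512 && steps <= 512);
    return (float)steps / steps_per_unit;
}
```

In this model 0.3 becomes 0.30078125, anything above 2 is clamped, and a value as small as 0.001 collapses to zero, which is why lowp suits colors better than coordinates.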
What you might not know
• highp values are calculated on a scalar
  processor on USSE1 only:
 highp vec4 v1, v2;

 highp float s1, s2;

 // Bad

 v2 = (v1 * s1) * s2;

 // the scalar processor executes v1 * s1 (4 operations), and then this result is
 // multiplied by s2 on the scalar processor again (4 additional operations)

 // Good

 v2 = v1 * (s1 * s2);

 // s1 * s2 is 1 operation on the scalar processor; result * v1 is 4 operations on the scalar processor
Hardware features
What you might know
• Typical CPU found in iOS devices:
 1. ARMv7 architecture
 2. Cortex A8 / Cortex A9 / custom Apple
 3. 600 – 1300 MHz
 4. 1-2 cores
 5. Thumb-2 instruction set
What you might not know
• ARMv7 has no hardware support for
  integer division
• VFPv3 FPU / VFPv4 on Apple A6 (rumored)
• NEON SIMD engine
• Unaligned access is done in software on
  Cortex A8. That means a hundred times slower
• Cortex A8 is an in-order CPU. Cortex A9+ are
  out of order
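Without a hardware divider, integer division goes through a software routine, so it can pay to avoid it in hot loops when the divisor is a power of two. Compilers already do this for compile-time constants; the sketch below covers divisors known only at run time, such as a ring-buffer size. The helper names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 when value is a power of two (1, 2, 4, ...). */
int is_pow2(uint32_t value)
{
    return value != 0 && (value & (value - 1)) == 0;
}

/* value / (1 << log2_divisor), using a shift instead of a divide. */
uint32_t div_pow2(uint32_t value, unsigned log2_divisor)
{
    assert(log2_divisor < 32);
    return value >> log2_divisor;
}

/* value % divisor for a power-of-two divisor, using a mask instead of a divide. */
uint32_t mod_pow2(uint32_t value, uint32_t divisor)
{
    assert(is_pow2(divisor));
    return value & (divisor - 1);
}
```

Both replacements are valid for unsigned values only; signed division rounds differently from an arithmetic shift.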
What you might not know
• Cortex A9 core has full VFPv3 FPU,
  while Cortex A8 has a VFPLite. That
  means, that float operations take 1
  cycle on A9 and 10 cycles on A8!
What you might not know
• NEON – 16 registers, 128 bits wide each.
  Supports operations on 8-, 16-, 32- and
  64-bit integers and 32-bit float values
• NEON can be used for:
  – Software geometry instancing;
  – Skinning on ES 1.1;
  – As a general vertex processor;
  – Other, typical, applications for SIMD.
What you might not know
• USSE1 architecture is scalar, NEON is
  vector by nature. Move your vertex
  processing from the GPU to the CPU to
  speed up calculations*
• ???????
• PROFIT!!!111

• *NOTE. That doesn’t apply to USSE2 hardware
What you might not know
• The weakest side of mobile GPUs is
  fill rate. Fill rate is quickly killed by
  blending. 2D games are heavy on this.
  The PowerVR USSE engine doesn’t care what
  it does: vertex or fragment processing.
  Moving your vertex processing to the CPU
  (NEON) will leave some room for
  fragment processing. It will have more
  effect on USSE1, scalar hardware.
What you might not know
• There are 3 ways to use the NEON engine
  in your code:
  1. Intrinsics
     1.1 GLKMath
  2. Handwritten NEON assembly
  3. Autovectorization. Add -mllvm -vectorize
     -mllvm -bb-vectorize-aligned-only to
     Other C Flags in project settings and you
     are ready to go.
What you might not know
• Intrinsics:
What you might not know
• Assembly:
What you might not know
• Summary:
                           Running time   CPU usage, %
       Intrinsics          2764           19
       Assembly            3664           20
       FPU                 6209           25-28
       FPU autovectorized  5028           22-24
• Intrinsics got me a 25% speedup over
  assembly. Let’s see the code!
• Note that the speed of intrinsics code varies from
  compiler to compiler.
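The 25% figure can be reproduced from the running times in the summary. A tiny sketch of the arithmetic; `time_saved_percent` is a hypothetical helper, and the running times are used as plain numbers because the slide gives no unit.

```c
#include <assert.h>

/* Percentage of running time saved by going from `slow` to `fast`. */
double time_saved_percent(double slow, double fast)
{
    assert(slow > 0.0 && fast >= 0.0);
    return (slow - fast) / slow * 100.0;
}
```

With 3664 for assembly and 2764 for intrinsics this gives about 24.6%, which the slide rounds to 25%; against the plain FPU version (6209) the saving is about 55.5%.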
What you might not know
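#include <arm_neon.h>

static inline __attribute__((always_inline)) void Matrix4ByVec4(const
float32x4x4_t* __restrict__ mat, const float32x4_t* __restrict__ vec,
float32x4_t* __restrict__ result)
{
    // result = column 0 of mat scaled by element 0 of vec
    (*result) = vmulq_n_f32((*mat).val[0], vgetq_lane_f32(*vec, 0));

    // accumulate the remaining columns scaled by elements 1..3 of vec
    (*result) = vmlaq_n_f32((*result), (*mat).val[1], vgetq_lane_f32(*vec, 1));
    (*result) = vmlaq_n_f32((*result), (*mat).val[2], vgetq_lane_f32(*vec, 2));
    (*result) = vmlaq_n_f32((*result), (*mat).val[3], vgetq_lane_f32(*vec, 3));
}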
What you might not know
static inline __attribute__((always_inline)) void Matrix4ByMatrix4(const float32x4x4_t* __restrict__ m1, const float32x4x4_t* __restrict__ m2, float32x4x4_t* __restrict__ r)
{
    // each result column starts as column 0 of m1 scaled by element 0 of the matching m2 column
    (*r).val[0] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[0], 0));
    (*r).val[1] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[1], 0));
    (*r).val[2] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[2], 0));
    (*r).val[3] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[3], 0));

    // accumulate column 1 of m1
    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[1], vgetq_lane_f32((*m2).val[0], 1));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[1], vgetq_lane_f32((*m2).val[1], 1));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[1], vgetq_lane_f32((*m2).val[2], 1));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[1], vgetq_lane_f32((*m2).val[3], 1));

    // accumulate column 2 of m1
    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[2], vgetq_lane_f32((*m2).val[0], 2));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[2], vgetq_lane_f32((*m2).val[1], 2));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[2], vgetq_lane_f32((*m2).val[2], 2));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[2], vgetq_lane_f32((*m2).val[3], 2));

    // accumulate column 3 of m1
    (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[3], vgetq_lane_f32((*m2).val[0], 3));
    (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[3], vgetq_lane_f32((*m2).val[1], 3));
    (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[3], vgetq_lane_f32((*m2).val[2], 3));
    (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[3], vgetq_lane_f32((*m2).val[3], 3));
}
What you might not know
__asm__ volatile
(
  // q0-q3 = modelView columns, q8-q11 = proj columns
  "vldmia %6, { q0-q3 }\n\t"
  "vldmia %0, { q8-q11 }\n\t"
  // q12-q15 = proj * modelView
  "vmul.f32 q12, q8, d0[0]\n\t"
  "vmul.f32 q13, q8, d2[0]\n\t"
  "vmul.f32 q14, q8, d4[0]\n\t"
  "vmul.f32 q15, q8, d6[0]\n\t"
  "vmla.f32 q12, q9, d0[1]\n\t"
  "vmla.f32 q13, q9, d2[1]\n\t"
  "vmla.f32 q14, q9, d4[1]\n\t"
  "vmla.f32 q15, q9, d6[1]\n\t"
  "vmla.f32 q12, q10, d1[0]\n\t"
  "vmla.f32 q13, q10, d3[0]\n\t"
  "vmla.f32 q14, q10, d5[0]\n\t"
  "vmla.f32 q15, q10, d7[0]\n\t"
  "vmla.f32 q12, q11, d1[1]\n\t"
  "vmla.f32 q13, q11, d3[1]\n\t"
  "vmla.f32 q14, q11, d5[1]\n\t"
  "vmla.f32 q15, q11, d7[1]\n\t"
  // q0-q3 = the four vertices, q8-q11 = transformed vertices
  "vldmia %1, { q0-q3 }\n\t"
  "vmul.f32 q8, q12, d0[0]\n\t"
  "vmul.f32 q9, q12, d2[0]\n\t"
  "vmul.f32 q10, q12, d4[0]\n\t"
  "vmul.f32 q11, q12, d6[0]\n\t"
  "vmla.f32 q8, q13, d0[1]\n\t"
  "vmla.f32 q8, q14, d1[0]\n\t"
  "vmla.f32 q8, q15, d1[1]\n\t"
  "vmla.f32 q9, q13, d2[1]\n\t"
  "vmla.f32 q9, q14, d3[0]\n\t"
  "vmla.f32 q9, q15, d3[1]\n\t"
  "vmla.f32 q10, q13, d4[1]\n\t"
  "vmla.f32 q10, q14, d5[0]\n\t"
  "vmla.f32 q10, q15, d5[1]\n\t"
  "vmla.f32 q11, q13, d6[1]\n\t"
  "vmla.f32 q11, q14, d7[0]\n\t"
  "vmla.f32 q11, q15, d7[1]\n\t"
  // store one transformed vertex per output pointer
  "vstmia %2, { q8 }\n\t"
  "vstmia %3, { q9 }\n\t"
  "vstmia %4, { q10 }\n\t"
  "vstmia %5, { q11 }"
  :
  : "r" (proj), "r" (squareVertices), "r" (v1), "r" (v2), "r" (v3), "r" (v4), "r" (modelView)
  : "memory", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15"
);
What you might not know
• For a detailed explanation of
  intrinsics/assembly, see:
Contact me

Павел Юрийчук - Разработка приложений под мобильные браузерыПавел Юрийчук - Разработка приложений под мобильные браузеры
Павел Юрийчук - Разработка приложений под мобильные браузеры
Олег Апостол - Плюсы и минусы различных тач-платформ глазами веб-разработчика
Олег Апостол - Плюсы и минусы различных тач-платформ глазами веб-разработчикаОлег Апостол - Плюсы и минусы различных тач-платформ глазами веб-разработчика
Олег Апостол - Плюсы и минусы различных тач-платформ глазами веб-разработчика
Евгений Галкин-Рекламные возможности Google для продвижения мобильных приложений
Евгений Галкин-Рекламные возможности Google для продвижения мобильных приложенийЕвгений Галкин-Рекламные возможности Google для продвижения мобильных приложений
Евгений Галкин-Рекламные возможности Google для продвижения мобильных приложений
Алексей Лельчук - От аутсорсинга к продуктам: трансформация компании и ментал...
Алексей Лельчук - От аутсорсинга к продуктам: трансформация компании и ментал...Алексей Лельчук - От аутсорсинга к продуктам: трансформация компании и ментал...
Алексей Лельчук - От аутсорсинга к продуктам: трансформация компании и ментал...
Tdd objective c
Tdd objective cTdd objective c
Tdd objective c
Mobile automation uamobile
Mobile automation uamobileMobile automation uamobile
Mobile automation uamobile

Dmitriy Vovk - Learn iOS Game Optimization. Ultimate Guide

  • 1. Learn iOS Game Optimization. Ultimate Guide by Dmitriy Vovk
  • 2. Want to achieve the same level of technology speed? Welcome! Image is used without any permissions 
  • 4. What you might know • Batch, Batch, Batch! http :// papers/BatchBatchBatch.pdf • Render from one thread only • Avoid synchronizations: 1. glFlush/glFinish; 2. Querying GL states; 3. Accessing render targets;
  • 6. What you might know • Pixel-perfect HSR (Hidden Surface Removal) • But you still need to sort opaque geometry! • Avoid alpha test. Use alpha blend instead
  • 7. What you might not know • HSR still requires vertices to be processed! • …thus don’t forget to cull your geometry on the CPU! • Prefer Stencil Test before Scissor. – Stencil test is performed in hardware on PowerVR GPUs, thus resulting in dramatically increased performance. – A stencil can have any shape, in contrast to the rectangular Scissor
  • 8. What you might not know • Why no alpha test?! o Alpha test/discard requires the fragment shader to run before visibility for the current fragment can be determined. This removes the benefits of HSR o Even more! If shader code contains discard, then any geometry rendered with this shader will suffer from the alpha test drawbacks. Even if this keyword is under a condition, the USSE (Universal Scalable Shader Engine) assumes that the condition may be hit. o Move discard into a separate shader o Draw opaque geometry first, then alpha-tested geometry, and alpha-blended geometry at the end
  • 9. What you might know • Bandwidth matters 1. Use constant color per object, instead of per vertex 2. Simplify your models. Use smaller data types. 3. Use indexed triangles or non-indexed triangle strips 4. Use VBO instead of client arrays 5. Use VAO
  • 10. What you might not know – The VAO implementation on at least iOS 4.0 hurt performance – VBOs are allocated in multiples of the 4 KB page size. Be aware of that. A large number of small VBOs can fragment and waste your memory.
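The cost of many small VBOs can be estimated with simple arithmetic. The sketch below assumes the 4 KB granularity quoted on the slide; the function names are hypothetical, and actual driver behaviour may differ between iOS versions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed allocation granularity, taken from the slide's 4 KB figure. */
#define VBO_PAGE_SIZE 4096u

/* Bytes reserved for a buffer of `size` bytes when storage is rounded up
   to a whole number of pages. */
static size_t vbo_reserved_bytes(size_t size)
{
    return (size + VBO_PAGE_SIZE - 1) / VBO_PAGE_SIZE * VBO_PAGE_SIZE;
}

/* Extra bytes reserved by `count` separate buffers of `size` bytes each,
   compared with packing the same data into one shared buffer. */
static size_t vbo_wasted_bytes(size_t count, size_t size)
{
    assert(size > 0);
    size_t separate = count * vbo_reserved_bytes(size);
    size_t shared = vbo_reserved_bytes(count * size);
    return separate - shared;
}
```

Under these assumptions, 100 buffers of 256 bytes each reserve 409600 bytes, while one shared buffer holding the same data reserves 28672 bytes.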
  • 11. What you might not know • Updating your VBO data each frame: 1. glBufferSubData that updates a big part of the original data harms performance. Try not to update a buffer that is currently in use 2. glBufferData, which completely overwrites the original data, is OK. The old data will be orphaned by the driver and storage for the new data will be allocated 3. glMapBuffer with a triple-buffered VBO is the preferred way to update your data 4. EXT_map_buffer_range (iOS 6 only), when you need to update only a subset of a buffer object.
  • 12. What you might not know
        int bufferID = 0;
        // initialization: only allocate storage for 3 VBOs, do not upload data
        for (int i = 0; i < 3; ++i)
        {
            glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[i]);
            // bufferSize is a placeholder for the VBO size in bytes
            glBufferData(GL_ARRAY_BUFFER, bufferSize, NULL, GL_DYNAMIC_DRAW);
        }
        //...
        glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[bufferID]);
        void* ptr = glMapBufferOES(GL_ARRAY_BUFFER, GL_WRITE_ONLY_OES);
        // update data here
        glUnmapBufferOES(GL_ARRAY_BUFFER);
        ++bufferID;
        if (bufferID == 3) // cycling through 3 buffers
        {
            bufferID = 0;
        }
  • 13. What you might not know • This scheme will give you the best performance possible: no blocking of the CPU by the GPU (or vice versa), no redundant memcpy operations, lower CPU load, but extra memory is used (note that you will need no extra temporary buffer to store your data before sending it to the VBO). update(1), draw(1), gpuworking(................) update(2), draw(2), gpuworking(................) update(3), draw(3), gpuworking(................)
  • 14. What you might not know • Float type is native to the GPU • …that means any other type will be converted to float by the USSE • …resulting in a few additional cycles • Thus the tradeoff between bandwidth/storage and additional cycles is your choice
  • 15. What you might know • Use interleaved vertex data – Align each vertex attribute to a 4-byte boundary
  • 16. What you might not know • Why do you have to do this?! – You don’t. The driver can do it for you – …resulting in slower performance.
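As an illustration of the alignment rule, here is a minimal sketch of an interleaved vertex layout in which every attribute starts on a 4-byte boundary. The `Vertex` struct, its attribute set and the helper name are hypothetical; the slides do not prescribe a particular layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interleaved vertex. Every attribute starts on a 4-byte
   boundary, so no realignment of the data is needed. */
typedef struct {
    float   position[3]; /* offset 0,  12 bytes */
    int16_t normal[3];   /* offset 12, 6 bytes of data */
    int16_t normal_pad;  /* explicit padding so the next attribute starts at 20 */
    uint8_t color[4];    /* offset 20, 4 bytes */
    int16_t uv[2];       /* offset 24, 4 bytes */
} Vertex;

/* Fail at compile time if the stride changes unexpectedly. */
static_assert(sizeof(Vertex) == 28, "unexpected vertex stride");

/* Returns 1 when an attribute offset sits on a 4-byte boundary. */
static int attribute_is_aligned(size_t offset)
{
    return offset % 4 == 0;
}
```

The offsets passed to glVertexAttribPointer would then be `offsetof(Vertex, normal)`, `offsetof(Vertex, color)` and so on, with `sizeof(Vertex)` as the stride.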
  • 19. What you might know • Bandwidth matters 1. Use lower precision formats i.e. RGB565 2. Use PVRTC compressed textures 3. Use atlases 4. Use mipmaps. They improve texture cache efficiency and quality.
  • 20. What you might not know • iOS OpenGL ES drivers from version 4.0 up to (but not including) 6.0 have a bug that will ALWAYS reserve memory for mipmaps, regardless of whether you requested them or not. And you don’t need mipmaps for 2D graphics. • …but there is one workaround: make your textures NPOT (non-power of two).
  • 21. What you might not know • NPOT textures work only with the GL_CLAMP_TO_EDGE wrap mode • POT textures are preferable, they give you the best performance possible • Use NPOT textures with dimensions that are multiples of 32 pixels for best performance • The driver will pad the data of your NPOT texture to match the size of the closest POT values.
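The memory cost of that padding can be estimated as follows. This is only arithmetic built on the slide's claim that NPOT data is padded up to the next power of two in each dimension; the function names are hypothetical and the real driver layout is not documented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest power of two that is greater than or equal to v. */
static uint32_t next_pot(uint32_t v)
{
    assert(v > 0 && v <= 0x80000000u);
    uint32_t p = 1;
    while (p < v) {
        p <<= 1;
    }
    return p;
}

/* Bytes occupied by a w x h texture once each dimension is padded
   to the next power of two (assumed driver behaviour). */
static size_t padded_texture_bytes(uint32_t w, uint32_t h, size_t bytes_per_pixel)
{
    return (size_t)next_pot(w) * next_pot(h) * bytes_per_pixel;
}
```

Under this assumption a 480x320 RGBA8888 texture occupies as much memory as a 512x512 one: 1048576 bytes instead of 614400.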
  • 22. What you might not know • Why do I have to use PVRTC? It looks ugly! 1. PVRTC provides great compression, resulting in smaller texture size, improved cache efficiency, saved bandwidth and decreased power consumption 2. PVRTC stores pixel data in the GPU’s native order, i.e. BGRA instead of RGBA
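To put a number on the compression claim, the sketch below compares the storage of one mip level in PVRTC and in RGBA8888. The size formula is the one given in the IMG_texture_compression_pvrtc extension (minimum 8x8 texels for 4 bpp, 16x8 for 2 bpp); the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of one PVRTC mip level. bpp must be 2 or 4. */
static size_t pvrtc_bytes(uint32_t w, uint32_t h, int bpp)
{
    assert(bpp == 2 || bpp == 4);
    uint32_t min_w = (bpp == 2) ? 16 : 8; /* minimum width in texels */
    uint32_t bw = (w < min_w) ? min_w : w;
    uint32_t bh = (h < 8) ? 8 : h;        /* minimum height in texels */
    return ((size_t)bw * bh * (size_t)bpp + 7) / 8;
}

/* Size in bytes of the same level stored uncompressed as RGBA8888. */
static size_t rgba8888_bytes(uint32_t w, uint32_t h)
{
    return (size_t)w * h * 4;
}
```

For a 1024x1024 texture this gives 524288 bytes at 4 bpp and 262144 bytes at 2 bpp, against 4194304 bytes for RGBA8888.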
  • 23. What you might not know • BGRA vs RGBA 1. RGBA: • Requires pixel data to be shuffled by driver into BGRA • Has options for RGB422, RGB565, RGBA4444, RGBA5551 2. BGRA: • Stores data in GPU’s native order • Has option only for BGRA8888 for upload and BGRA888, BGRA5551, BGRA4444 for ReadPixels
  • 24. What you might not know • Prefer OES_texture_half_float instead of OES_texture_float • Texture reads read only 32 bits per texel, thus RGBA float texture will result in 4 texture reads
  • 25. What you might know • Prefer multitexturing instead of multiple passes • Configure texture parameters before feeding image data to driver
  • 26. What you might not know • Texture uploading to the GPU is a mess! • Usual way to do this: 1. Load the texture into a temporary buffer in RAM 2. Feed this buffer to glTexImage2D 3. Draw! • Looks simple and fast, right?
  • 27. What you might not know • …NO!
        void* buf = malloc(TEXTURE_SIZE); // 4 MB for an RGBA8 1024x1024 texture
        LoadTexture(textureName, buf);    // hypothetical loader that fills buf with pixel data
        glBindTexture(GL_TEXTURE_2D, textureID);
        // buf is copied into an internal buffer created by the driver (that's obvious)
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, buf);
        free(buf); // the buffer can be freed immediately after glTexImage2D
        // the driver will do some additional work to fully upload the texture the first time it is actually used!
        glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
    • Textures are finally uploaded only when they are first used. So draw them off screen immediately after glTexImage2D • A lot of redundant work!
  • 28. What you might not know • Jedi way to upload textures:
        void* ptr = mmap(NULL, TEXTURE_SIZE, PROT_READ, MAP_PRIVATE, fileHandle, 0); // file mapping
        glBindTexture(GL_TEXTURE_2D, textureID);
        // the mapped data is copied into an internal buffer created by the driver
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, ptr);
        // the driver will do some additional work to fully upload the texture the first time it is actually used!
        glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
        munmap(ptr, TEXTURE_SIZE);
    • File mapping does not copy your file data into RAM up front! It loads file data page by page, when it’s accessed. • Thus we eliminated one redundant copy, dramatically decreased texture upload time and decreased memory fragmentation
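The file-mapping half of this technique can be sketched without any GL calls. The following is a minimal POSIX example with error handling; `MappedFile`, `mapped_file_open` and `mapped_file_close` are hypothetical names, and the sketch assumes the file holds raw pixel data ready to be passed to glTexImage2D.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    void  *data; /* start of the read-only mapping */
    size_t size; /* length of the mapping in bytes */
} MappedFile;

/* Maps a whole file read-only. Returns 0 on success, -1 on failure. */
static int mapped_file_open(const char *path, MappedFile *out)
{
    assert(path != NULL && out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) {
        return -1;
    }
    out->data = p;
    out->size = (size_t)st.st_size;
    return 0;
}

/* Releases the mapping. Safe to call on an already closed MappedFile. */
static void mapped_file_close(MappedFile *file)
{
    if (file->data != NULL) {
        munmap(file->data, file->size);
        file->data = NULL;
        file->size = 0;
    }
}
```

In the upload path, `file.data` would be passed as the last argument of glTexImage2D, and `mapped_file_close` would be called once the texture has been used for the first time.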
  • 29. What you might not know • Always use glClear at the beginning of the frame… • … and EXT_discard_framebuffer at the end. • PVR series GPUs have a fast on-chip depth buffer for each tile. If you forget to clear/discard the depth buffer, its contents will be copied between the on-chip tile memory (HW) and system memory (SW)
  • 31. What you might know • Be wise with precision hints • Avoid branching • Eliminate loops • Do not use discard. If you must, place the discard instruction as early as possible to avoid useless computations
  • 32. What you might not know • Code inside a dynamic branch (its condition is evaluated against a value calculated in the shader) will be executed anyway, and then its result will be thrown away if the condition is false
  • 33. What you might not know • highp – represents a 32-bit floating point value • mediump – represents a 16-bit floating point value in the range [-65520, 65520] • lowp – 10-bit fixed point values in the range [-2, 2] with a step of 1/256 • Try to give the same precision to all your operands, because conversion takes some time
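To get a feel for what lowp can represent, the values can be emulated on the CPU. This is a rough sketch that only follows the figures on the slide (range [-2, 2], step 1/256); `lowp_quantize` is a hypothetical helper, and the hardware's exact rounding and saturation rules are not specified in the slides.

```c
#include <assert.h>

/* Clamps x to the lowp range and rounds it to the nearest multiple of 1/256. */
static float lowp_quantize(float x)
{
    assert(x == x); /* reject NaN */
    if (x < -2.0f) {
        x = -2.0f;
    }
    if (x > 2.0f) {
        x = 2.0f;
    }
    float scaled = x * 256.0f;
    /* round half away from zero; the int cast truncates toward zero */
    int steps = (int)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
    return (float)steps / 256.0f;
}
```

Colour channels in [0, 1] survive this quantization well, while texture coordinates or positions outside [-2, 2] are clamped, which is why lowp suits colours and not coordinates.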
  • 34. What you might not know • highp values are calculated on a scalar processor on USSE1 only:
        highp vec4 v1, v2;
        highp float s1, s2;
        // Bad
        v2 = (v1 * s1) * s2; // scalar processor executes v1 * s1: 4 operations, and then this result is
                             // multiplied by s2 on a scalar processor again: 4 additional operations
        // Good
        v2 = v1 * (s1 * s2); // s1 * s2: 1 operation on a scalar processor;
                             // result * v1: 4 operations on a scalar processor
  • 36. What you might know • Typical CPU found in iOS devices: 1. ARMv7 architecture 2. Cortex A8 / Cortex A9 / custom Apple cores 3. 600 – 1300 MHz 4. 1-2 cores 5. Thumb-2 instruction set
  • 37. What you might not know • ARMv7 has no hardware support for integer division • VFPv3 FPU / VFPv4 on Apple A6 (rumored) • NEON SIMD engine • Unaligned access is done in software on Cortex A8. That means a hundred times slower • Cortex A8 is an in-order CPU. Cortex A9+ are out of order
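Without a hardware divide instruction, division by a known constant is usually replaced with a shift or a multiplication by a precomputed reciprocal. The sketch below shows both for unsigned 32-bit values; the function names are hypothetical. Optimizing compilers normally apply the same transformation automatically when the divisor is a compile-time constant, so this matters mostly for divisors that are fixed at run time but reused many times.

```c
#include <assert.h>
#include <stdint.h>

/* x / 2^shift for unsigned values is a plain right shift. */
static uint32_t div_pow2(uint32_t x, unsigned shift)
{
    assert(shift < 32);
    return x >> shift;
}

/* x / 3 without a divide instruction: multiply by ceil(2^33 / 3) = 0xAAAAAAAB
   and keep the bits above position 33. Exact for every 32-bit unsigned x. */
static uint32_t div3(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}
```

For example, `div3(100)` yields 33 and `div_pow2(100, 2)` yields 25.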
  • 38. What you might not know • The Cortex A9 core has a full VFPv3 FPU, while the Cortex A8 has a VFPLite. That means that float operations take 1 cycle on A9 and 10 cycles on A8!
  • 39. What you might not know • NEON – 16 registers, each 128 bits wide. Supports operations on 8, 16, 32 and 64-bit integers and 32-bit float values • NEON can be used for: – Software geometry instancing; – Skinning on ES 1.1; – As a general vertex processor; – Other typical applications for SIMD.
  • 40. What you might not know • USSE1 architecture is scalar, NEON is vector by nature. Move your vertex processing to CPU from GPU to speedup calculations* • ??????? • PROFIT!!!111 • *NOTE. That doesn’t apply to USSE2 hardware
  • 41. What you might not know • The weakest side of mobile GPUs is fill rate. Fill rate is quickly killed by blending. 2D games are heavy on this. The PowerVR USSE engine doesn’t care what it does, vertex or fragment processing. Moving your vertex processing to the CPU (NEON) will leave some room for fragment processing. It will have more effect on USSE1, which is scalar hardware.
  • 42. What you might not know • There are 3 ways to use the NEON engine in your code: 1. Intrinsics (1.1 GLKMath) 2. Handwritten NEON assembly 3. Autovectorization. Add -mllvm -vectorize -mllvm -bb-vectorize-aligned-only to Other C Flags in project settings and you are ready to go.
  • 43.
  • 44. What you might not know • Intrinsics:
  • 45. What you might not know • Assembly:
  • 46. What you might not know • Summary:
                              Running time, ms    CPU usage, %
        Intrinsics            2764                19
        Assembly              3664                20
        FPU                   6209                25-28
        FPU autovectorized    5028                22-24
    • Intrinsics got me a 25% speedup over assembly. Let’s see the code! • Note that the speed of intrinsics code varies from compiler to compiler.
  • 47. What you might not know
        __attribute__((always_inline)) void Matrix4ByVec4(const float32x4x4_t* __restrict__ mat, const float32x4_t* __restrict__ vec, float32x4_t* __restrict__ result)
        {
            // result = column 0 * vec[0], then accumulate the remaining columns
            (*result) = vmulq_n_f32((*mat).val[0], (*vec)[0]);
            (*result) = vmlaq_n_f32((*result), (*mat).val[1], (*vec)[1]);
            (*result) = vmlaq_n_f32((*result), (*mat).val[2], (*vec)[2]);
            (*result) = vmlaq_n_f32((*result), (*mat).val[3], (*vec)[3]);
        }
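For readers without a NEON target at hand, the same computation can be written as a portable scalar reference. This sketch assumes the matrix is stored column-major, with `col[i]` playing the role of `mat.val[i]` in the slide; `Mat4`, `Vec4` and `mat4_mul_vec4` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major 4x4 matrix: col[i] corresponds to mat.val[i] in the slide. */
typedef struct { float col[4][4]; } Mat4;
typedef struct { float v[4]; } Vec4;

/* Scalar equivalent of Matrix4ByVec4: the result is the sum of the matrix
   columns, each scaled by the matching vector component. */
static Vec4 mat4_mul_vec4(const Mat4 *m, const Vec4 *x)
{
    assert(m != NULL && x != NULL);
    Vec4 r;
    for (int row = 0; row < 4; ++row) {
        r.v[row] = m->col[0][row] * x->v[0];      /* vmulq_n_f32 step */
        for (int c = 1; c < 4; ++c) {
            r.v[row] += m->col[c][row] * x->v[c]; /* vmlaq_n_f32 steps */
        }
    }
    return r;
}
```

A reference like this is handy for unit-testing the intrinsics version: both should agree up to floating point rounding.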
  • 48. What you might not know
        __attribute__((always_inline)) void Matrix4ByMatrix4(const float32x4x4_t* __restrict__ m1, const float32x4x4_t* __restrict__ m2, float32x4x4_t* __restrict__ r)
        {
        #ifdef INTRINSICS
            // column 0 of m1 scaled by element 0 of each column of m2
            (*r).val[0] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[0], 0));
            (*r).val[1] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[1], 0));
            (*r).val[2] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[2], 0));
            (*r).val[3] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[3], 0));
            // accumulate column 1 of m1
            (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[1], vgetq_lane_f32((*m2).val[0], 1));
            (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[1], vgetq_lane_f32((*m2).val[1], 1));
            (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[1], vgetq_lane_f32((*m2).val[2], 1));
            (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[1], vgetq_lane_f32((*m2).val[3], 1));
            // accumulate column 2 of m1
            (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[2], vgetq_lane_f32((*m2).val[0], 2));
            (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[2], vgetq_lane_f32((*m2).val[1], 2));
            (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[2], vgetq_lane_f32((*m2).val[2], 2));
            (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[2], vgetq_lane_f32((*m2).val[3], 2));
            // accumulate column 3 of m1
            (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[3], vgetq_lane_f32((*m2).val[0], 3));
            (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[3], vgetq_lane_f32((*m2).val[1], 3));
            (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[3], vgetq_lane_f32((*m2).val[2], 3));
            (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[3], vgetq_lane_f32((*m2).val[3], 3));
        #endif
        }
  • 49. What you might not know
        __asm__ volatile
        (
            "vldmia %6, { q0-q3 } \n\t"
            "vldmia %0, { q8-q11 }\n\t"
            "vmul.f32 q12, q8, d0[0]\n\t"
            "vmul.f32 q13, q8, d2[0]\n\t"
            "vmul.f32 q14, q8, d4[0]\n\t"
            "vmul.f32 q15, q8, d6[0]\n\t"
            "vmla.f32 q12, q9, d0[1]\n\t"
            "vmla.f32 q13, q9, d2[1]\n\t"
            "vmla.f32 q14, q9, d4[1]\n\t"
            "vmla.f32 q15, q9, d6[1]\n\t"
            "vmla.f32 q12, q10, d1[0]\n\t"
            "vmla.f32 q13, q10, d3[0]\n\t"
            "vmla.f32 q14, q10, d5[0]\n\t"
            "vmla.f32 q15, q10, d7[0]\n\t"
            "vmla.f32 q12, q11, d1[1]\n\t"
            "vmla.f32 q13, q11, d3[1]\n\t"
            "vmla.f32 q14, q11, d5[1]\n\t"
            "vmla.f32 q15, q11, d7[1]\n\t"
            "vldmia %1, { q0-q3 } \n\t"
            "vmul.f32 q8, q12, d0[0]\n\t"
            "vmul.f32 q9, q12, d2[0]\n\t"
            "vmul.f32 q10, q12, d4[0]\n\t"
            "vmul.f32 q11, q12, d6[0]\n\t"
            "vmla.f32 q8, q13, d0[1]\n\t"
            "vmla.f32 q8, q14, d1[0]\n\t"
            "vmla.f32 q8, q15, d1[1]\n\t"
            "vmla.f32 q9, q13, d2[1]\n\t"
            "vmla.f32 q9, q14, d3[0]\n\t"
            "vmla.f32 q9, q15, d3[1]\n\t"
            "vmla.f32 q10, q13, d4[1]\n\t"
            "vmla.f32 q10, q14, d5[0]\n\t"
            "vmla.f32 q10, q15, d5[1]\n\t"
            "vmla.f32 q11, q13, d6[1]\n\t"
            "vmla.f32 q11, q14, d7[0]\n\t"
            "vmla.f32 q11, q15, d7[1]\n\t"
            "vstmia %2, { q8 }\n\t"
            "vstmia %3, { q9 }\n\t"
            "vstmia %4, { q10 }\n\t"
            "vstmia %5, { q11 }"
            :
            : "r" (proj), "r" (squareVertices), "r" (v1), "r" (v2), "r" (v3), "r" (v4), "r" (modelView)
            : "memory", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15"
        );
  • 50. What you might not know • For a detailed explanation of intrinsics/assembly see: com.arm.doc.dui0491e/CIHJBEFE.html