SlideShare una empresa de Scribd logo
1 de 68
Descargar para leer sin conexión
Caching In
Understand, measure and use your CPU
        Cache more effectively




         Richard Warburton
What are you talking about?

● Why do you care?

● Hardware Fundamentals

● Measurement

● Principles and Examples
Why do you care?
Perhaps why /should/ you care.
The Problem        Very Fast




 Relatively Slow
The Solution: CPU Cache

● Core Demands Data, looks at its cache
  ○ If present (a "hit") then data returned to register
  ○ If absent (a "miss") then data looked up from
    memory and stored in the cache


● Fast memory is expensive, a small amount
  is affordable
Multilevel Cache: Intel Sandybridge
    Physical Core 0                               Physical Core N
   HT: 2 Logical Cores                          HT: 2 Logical Cores


  Level 1      Level 1                          Level 1     Level 1
   Data      Instruction          ....           Data     Instruction
  Cache        Cache                            Cache       Cache



     Level 2 Cache                                Level 2 Cache




                         Shared Level 3 Cache
CPU Stalls
● Want your CPU to execute instructions

● Stall = CPU can't execute because its
  waiting on memory

● Stalls displace potential computation

● Busy-work Strategies
  ○ Out-of-Order Execution
  ○ Simultaneous MultiThreading (SMT) /
    HyperThreading (HT)
How bad is a miss?
          Location        Latency in Clockcycles

          Register        0

          L1 Cache        3

          L2 Cache        9

          L3 Cache        21

          Main Memory     150-400




NB: Sandybridge Numbers
Cache Lines
● Data transferred in cache lines

● Fixed size block of memory

● Usually 64 bytes in current x86 CPUs

● Purely hardware consideration
You don't need to
know everything
Architectural Takeaways



● Modern CPUs have large multilevel Caches

● Cache misses cause Stalls

● Stalls are expensive
Measurement
Barometer, Sun Dial ... and CPU Counters
Do you care?
● DON'T look at CPU Caching behaviour first!

● Not all Performance Problems CPU Bound
  ○   Garbage Collection
  ○   Networking
  ○   Database or External Service
  ○   I/O


● Consider caching behaviour when you're
  execution bound and know your hotspot.
CPU Performance Counters
● CPU Register which can count events
   ○ Eg: the number of Level 3 Cache Misses


● Model Specific Registers
   ○ not instruction set standardised by x86
   ○ differ by CPU (eg Penryn vs Sandybridge)


● Don't worry - leave details to tooling
Measurement: Instructions Retired

● The number of instructions which executed
  until completion
   ○ ignores branch mispredictions


● When stalled you're not retiring instructions

● Aim to maximise instruction retirement when
  reducing cache misses
Cache Misses
● Cache Level
  ○ Level 3
  ○ Level 2
  ○ Level 1 (Data vs Instruction)


● What to measure
  ○   Hits
  ○   Misses
  ○   Reads vs Writes
  ○   Calculate Ratio
Cache Profilers
● Open Source
  ○ perfstat
  ○ linux (rdmsr/wrmsr)
  ○ Linux Kernel (via /proc)


● Proprietary
  ○   jClarity jMSR
  ○   Intel VTune
  ○   AMD Code Analyst
  ○   Visual Studio 2012
Good Benchmarking Practice

● Warmups

● Measure and rule-out other factors

● Specific Caching Ideas ...
Take Care with Working Set Size
Good Benchmark = Low Variance
You can measure Cache behaviour
Principles & Examples
     Sequential Code
Prefetching


●   Prefetch = Eagerly load data
●   Adjacent Cache Line Prefetch
●   Data Cache Unit (Streaming) Prefetch
●   Problem: CPU Prediction isn't perfect
●   Solution: Arrange Data so accesses
    are predictable
Temporal Locality
            Repeatedly referring to same data in a short time span


Spatial Locality
                Referring to data that is close together in memory



Sequential Locality
              Referring to data that is arranged linearly in memory
General Principles

● Use smaller data types (-XX:+UseCompressedOops)

● Avoid 'big holes' in your data

● Make accesses as linear as possible
Primitive Arrays

// Sequential Access = Predictable

for (int i=0; i<someArray.length; i++)
   someArray[i]++;
Primitive Arrays - Skipping Elements

// Holes Hurt

for (int i=0; i<someArray.length; i += SKIP)
   someArray[i]++;
Primitive Arrays - Skipping Elements
Multidimensional Arrays
● Multidimensional Arrays are really Arrays of
  Arrays in Java. (Unlike C)

● Some people realign their accesses:

for (int col=0; col<COLS; col++) {
  for (int row=0; row<ROWS; row++) {
    array[ROWS * col + row]++;
  }
}
Bad Access Alignment

   Strides the wrong way, bad
   locality.

   array[COLS * row + col]++;




   Strides the right way, good
   locality.

   array[ROWS * col + row]++;
Data Locality vs Java Heap Layout
                      class Foo {
          count
                         Integer count;
0               bar      Bar bar;
                         Baz baz;
1         baz
                      }
2
                      // No alignment guarantees
3                     for (Foo foo : foos) {
                         foo.count = 5;
    ...
                         foo.bar.visit();
                      }
Data Locality vs Java Heap Layout
● Serious Java Weakness

● Location of objects in memory hard to
  guarantee.

● Garbage Collection Impact
  ○ Mark-Sweep
  ○ Copying Collector
Data Layout Principles
●   Primitive Collections (GNU Trove, etc.)
●   Arrays > Linked Lists
●   Hashtable > Search Tree
●   Avoid Code bloating (Loop Unrolling)
●   Custom Data Structures
    ○ Judy Arrays
      ■ an associative array/map
    ○ kD-Trees
      ■ generalised Binary Space Partitioning
    ○ Z-Order Curve
      ■ multidimensional data in one dimension
Game Event Server
class PlayerAttack {
  private long playerId;
  private int weapon;
  private int energy;
  ...
  private int getWeapon() {
    return weapon;
  }
  ...
Flyweight Unsafe (1)
static final Unsafe unsafe = ... ;

long space = OBJ_COUNT * ATTACK_OBJ_SIZE;

address = unsafe.allocateMemory(space);

static PlayerAttack get(int index) {
   long loc = address +
                  (ATTACK_OBJ_SIZE * index)
   return new PlayerAttack(loc);
}
Flyweight Unsafe (2)
class PlayerAttack {
   static final long WEAPON_OFFSET = 8
   private long loc;
   public int getWeapon() {
     long address = loc + WEAPON_OFFSET
     return unsafe.getInt(address);
   }
   public void setWeapon(int weapon) {
     long address = loc + WEAPON_OFFSET
     unsafe.setInt(address, weapon);
   }
}
SLAB
● Manually writing this is:
   ○ time consuming
   ○ error prone
   ○ boilerplate-ridden


● Prototype library:
   ○ http://github.com/RichardWarburton/slab
   ○ Maps an interface to a flyweight
Slab (2)
public interface GameEvent extends Cursor {
  public int getId();
  public void setId(int value);
  public long getStrength();
  public void setStrength(long value);
}

Allocator<GameEvent> eventAllocator =
                         Allocator.of(GameEvent.class);

GameEvent event = eventAllocator.allocate(100);

event.move(50);

event.setId(6);
assertEquals(6, event.getId());
Data Alignment

● An address a is n-byte aligned when:
  ○ n is a power of two
  ○ a is divisible by n

● Cache Line Aligned:
  ○ byte aligned to the size of a cache line
Rules and Hacks
● Java (Hotspot) ♥ 8-byte alignment:
  ○ new Java Objects
  ○ Unsafe.allocateMemory
  ○ Bytebuffer.allocateDirect


● Get the object address:
  ○ Unsafe gives it to you directly
  ○ Direct ByteBuffer has a field called 'address'
Cacheline Aligned Access

● Performance Benefits:
    ○    3-6x on Core2Duo
    ○    ~1.5x on Nehalem Class
    ○    Measured using direct ByteBuffers
    ○    Mid Aligned vs Straddling a Cacheline


●   http://psy-lob-saw.blogspot.co.uk/2013/01/direct-memory-alignment-in-java.html
Today I learned ...

Access and Cache Aligned, Contiguous Data
Translation Lookaside
       Buffers
      TLB, not TLAB!
Virtual Memory


● RAM is finite

● RAM is fragmented

● Multitasking is nice

● Programming easier with a contiguous
  address space
Page Tables
● Virtual Address Space split into pages

● Page has two Addresses:
  ○ Virtual Address: Process' view
  ○ Physical Address: Location in RAM/Disk


● Page Table
  ○   Mapping Between these addresses
  ○   OS Managed
  ○   Page Fault when entry is missing
  ○   OS Pages in from disk
Translation Lookaside Buffers
● TLB = Page Table Cache

● Avoids memory bottlenecks in page lookups

● CPU Cache is multi-level => TLB is multi-
  level

● Miss/Hit rates measurable via CPU Counters
  ○ Hit/Miss Rates
  ○ Walk Duration
TLB Flushing
● Process context switch => change address
  space

● Historically caused "flush" =


● Avoided with an Address Space Identifier
  (ASID)
Page Size Tradeoff:
  Bigger Pages
     = less pages
     = quicker page lookups

                     vs

  Bigger Pages
     = wasted memory space
     = more disk paging
Page Sizes
●   Page Size is adjustable
●   Common Sizes: 4KB, 2MB, 1GB
●   Easy Speedup: 10%-30%
●   -XX:+UseLargePages
●   Oracle DBs love Large Pages
Too Long, Didn't Listen

1. Virtual Memory + Paging have an overhead

2. Translation Lookaside Buffers used to
reduce cost

3. Page size changes important
Principles & Examples (2)
       Concurrent Code
Context Switching

● Multitasking systems 'context switch'
  between processes

● Variants:
  ○ Thread
  ○ Process
  ○ Virtualized OS
Context Switching Costs
● Direct Cost
  ○ Save old process state
  ○ Schedule new process
  ○ Load new process state

● Indirect Cost
  ○   TLB Reload (Process only)
  ○   CPU Pipeline Flush
  ○   Cache Interference
  ○   Temporal Locality
Indirect Costs (cont.)
● "Quantifying The Cost of Context Switch" by
  Li et al.
  ○ 5 years old - principle still applies
  ○ Direct = 3.8 microseconds
  ○ Indirect = ~0 - 1500 microseconds
  ○ indirect dominates direct when working set > L2
     Cache size
● Cooperative hits also exist
● Minimise context switching if possible
Locking

● Locks require kernel arbitration
   ○ Not a context switch, but a mode switch
   ○ Lots of lock contention causes context switching


● Has a cache-pollution cost

● Can use try lock-free algorithms
Java Memory Model

● JSR 133 Specs the memory model

● Turtles Example:
  ○ Java Level: volatile
  ○ CPU Level: lock
  ○ Cache Level: MESI


● Has a cost!
False Sharing
● Data can share a cache line

● Not always good
  ○ Some data 'accidentally' next to other data.
  ○ Causes Cache lines to be evicted prematurely


● Serious Concurrency Issue
Concurrent Cache Line Access




                          © Intel
Current Solution: Field Padding
public volatile long value;
public long pad1, pad2, pad3, pad4,
pad5, pad6;

8 bytes(long) * 7 fields + header = 64 bytes =
Cacheline

NB: fields aligned to 8 bytes even if smaller
Real Solution: JEP 142
class UncontendedFoo {

    int someValue;

    @Uncontended volatile long foo;

    Bar bar;
}


http://openjdk.java.net/jeps/142
False Sharing in Your GC
● Card Tables
   ○ Split RAM into 'cards'
   ○ Table records which parts of RAM are written to
   ○ Avoid rescanning large blocks of RAM


● BUT ... optimise by writing to the byte on
  every write

● -XX:+UseCondCardMark
   ○ Use With Care: 15-20% sequential slowdown
Buyer Beware:

● Context Switching

● False Sharing

● Visibility/Atomicity/Locking Costs
Too Long, Didn't Listen

1. Caching Behaviour has a performance effect

2. The effects are measurable

3. There are common problems and solutions
Questions?
           @RichardWarburto


www.akkadia.org/drepper/cpumemory.pdf
www.jclarity.com/friends/
g.oswego.edu/dl/concurrency-interest/
A Brave New World
   (hopefully, sort of)
Arrays 2.0
● Proposed JVM Feature
● Library Definition of Array Types
● Incorporate a notion of 'flatness'
● Semi-reified Generics
● Loads of other goodies (final, volatile,
  resizing, etc.)
● Requires Value Types
Value Types
● Assign or pass the object you copy the value
  and not the reference

● Allow control of the memory layout

● Control of layout = Control of Caching
  Behaviour

● Idea, initial prototypes in mlvm
The future will solve all problems!

Más contenido relacionado

La actualidad más candente

MapDB - taking Java collections to the next level
MapDB - taking Java collections to the next levelMapDB - taking Java collections to the next level
MapDB - taking Java collections to the next levelJavaDayUA
 
Advanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineAdvanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineNarann29
 
Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)RichardWarburton
 
Parallel programming patterns - Олександр Павлишак
Parallel programming patterns - Олександр ПавлишакParallel programming patterns - Олександр Павлишак
Parallel programming patterns - Олександр ПавлишакIgor Bronovskyy
 
Large volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformLarge volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformMartin Zapletal
 
Cassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data ModelingCassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data ModelingMatthew Dennis
 
Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase HBaseCon
 
OpenTSDB 2.0
OpenTSDB 2.0OpenTSDB 2.0
OpenTSDB 2.0HBaseCon
 
Query optimization in Apache Tajo
Query optimization in Apache TajoQuery optimization in Apache Tajo
Query optimization in Apache TajoJihoon Son
 
Layered caching in OpenResty (OpenResty Con 2018)
Layered caching in OpenResty (OpenResty Con 2018)Layered caching in OpenResty (OpenResty Con 2018)
Layered caching in OpenResty (OpenResty Con 2018)Thibault Charbonnier
 
Go and Uber’s time series database m3
Go and Uber’s time series database m3Go and Uber’s time series database m3
Go and Uber’s time series database m3Rob Skillington
 
OpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoNathaniel Braun
 
Optimizing Parallel Reduction in CUDA : NOTES
Optimizing Parallel Reduction in CUDA : NOTESOptimizing Parallel Reduction in CUDA : NOTES
Optimizing Parallel Reduction in CUDA : NOTESSubhajit Sahu
 

La actualidad más candente (18)

MapDB - taking Java collections to the next level
MapDB - taking Java collections to the next levelMapDB - taking Java collections to the next level
MapDB - taking Java collections to the next level
 
Advanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineAdvanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering Pipeline
 
Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)
 
Parallel programming patterns - Олександр Павлишак
Parallel programming patterns - Олександр ПавлишакParallel programming patterns - Олександр Павлишак
Parallel programming patterns - Олександр Павлишак
 
Large volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive PlatformLarge volume data analysis on the Typesafe Reactive Platform
Large volume data analysis on the Typesafe Reactive Platform
 
Map db
Map dbMap db
Map db
 
Cassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data ModelingCassandra NYC 2011 Data Modeling
Cassandra NYC 2011 Data Modeling
 
Vaex pygrunn
Vaex pygrunnVaex pygrunn
Vaex pygrunn
 
Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase Update on OpenTSDB and AsyncHBase
Update on OpenTSDB and AsyncHBase
 
OpenTSDB 2.0
OpenTSDB 2.0OpenTSDB 2.0
OpenTSDB 2.0
 
Query optimization in Apache Tajo
Query optimization in Apache TajoQuery optimization in Apache Tajo
Query optimization in Apache Tajo
 
Layered caching in OpenResty (OpenResty Con 2018)
Layered caching in OpenResty (OpenResty Con 2018)Layered caching in OpenResty (OpenResty Con 2018)
Layered caching in OpenResty (OpenResty Con 2018)
 
Go and Uber’s time series database m3
Go and Uber’s time series database m3Go and Uber’s time series database m3
Go and Uber’s time series database m3
 
Layered Caching in OpenResty
Layered Caching in OpenRestyLayered Caching in OpenResty
Layered Caching in OpenResty
 
Collections forceawakens
Collections forceawakensCollections forceawakens
Collections forceawakens
 
OpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ Criteo
 
Optimizing Parallel Reduction in CUDA : NOTES
Optimizing Parallel Reduction in CUDA : NOTESOptimizing Parallel Reduction in CUDA : NOTES
Optimizing Parallel Reduction in CUDA : NOTES
 
Lecture 25
Lecture 25Lecture 25
Lecture 25
 

Similar a Caching in (DevoxxUK 2013)

Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonJAXLondon2014
 
ScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous SpeedScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous SpeedJ On The Beach
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory Optimizationguest3eed30
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory OptimizationWei Lin
 
Elasticsearch 101 - Cluster setup and tuning
Elasticsearch 101 - Cluster setup and tuningElasticsearch 101 - Cluster setup and tuning
Elasticsearch 101 - Cluster setup and tuningPetar Djekic
 
An Introduction to Apache Cassandra
An Introduction to Apache CassandraAn Introduction to Apache Cassandra
An Introduction to Apache CassandraSaeid Zebardast
 
If the data cannot come to the algorithm...
If the data cannot come to the algorithm...If the data cannot come to the algorithm...
If the data cannot come to the algorithm...Robert Burrell Donkin
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartMukesh Singh
 
Introduction to Memoria
Introduction to MemoriaIntroduction to Memoria
Introduction to MemoriaVictor Smirnov
 
Erasure codes and storage tiers on gluster
Erasure codes and storage tiers on glusterErasure codes and storage tiers on gluster
Erasure codes and storage tiers on glusterRed_Hat_Storage
 
Prerequisite knowledge for shared memory concurrency
Prerequisite knowledge for shared memory concurrencyPrerequisite knowledge for shared memory concurrency
Prerequisite knowledge for shared memory concurrencyViller Hsiao
 
WiredTiger In-Memory vs WiredTiger B-Tree
WiredTiger In-Memory vs WiredTiger B-TreeWiredTiger In-Memory vs WiredTiger B-Tree
WiredTiger In-Memory vs WiredTiger B-TreeSveta Smirnova
 
Retaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate LimitingRetaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate LimitingScyllaDB
 
Top 10 Perl Performance Tips
Top 10 Perl Performance TipsTop 10 Perl Performance Tips
Top 10 Perl Performance TipsPerrin Harkins
 
Etl confessions pg conf us 2017
Etl confessions   pg conf us 2017Etl confessions   pg conf us 2017
Etl confessions pg conf us 2017Corey Huinker
 

Similar a Caching in (DevoxxUK 2013) (20)

Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard Warburton
 
ScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous SpeedScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous Speed
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory Optimization
 
Memory Optimization
Memory OptimizationMemory Optimization
Memory Optimization
 
Elasticsearch 101 - Cluster setup and tuning
Elasticsearch 101 - Cluster setup and tuningElasticsearch 101 - Cluster setup and tuning
Elasticsearch 101 - Cluster setup and tuning
 
An Introduction to Apache Cassandra
An Introduction to Apache CassandraAn Introduction to Apache Cassandra
An Introduction to Apache Cassandra
 
If the data cannot come to the algorithm...
If the data cannot come to the algorithm...If the data cannot come to the algorithm...
If the data cannot come to the algorithm...
 
Mongodb meetup
Mongodb meetupMongodb meetup
Mongodb meetup
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Introduction to Memoria
Introduction to MemoriaIntroduction to Memoria
Introduction to Memoria
 
Erasure codes and storage tiers on gluster
Erasure codes and storage tiers on glusterErasure codes and storage tiers on gluster
Erasure codes and storage tiers on gluster
 
Hibernate caching
Hibernate cachingHibernate caching
Hibernate caching
 
Prerequisite knowledge for shared memory concurrency
Prerequisite knowledge for shared memory concurrencyPrerequisite knowledge for shared memory concurrency
Prerequisite knowledge for shared memory concurrency
 
WiredTiger In-Memory vs WiredTiger B-Tree
WiredTiger In-Memory vs WiredTiger B-TreeWiredTiger In-Memory vs WiredTiger B-Tree
WiredTiger In-Memory vs WiredTiger B-Tree
 
NUMA and Java Databases
NUMA and Java DatabasesNUMA and Java Databases
NUMA and Java Databases
 
Retaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate LimitingRetaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate Limiting
 
Top 10 Perl Performance Tips
Top 10 Perl Performance TipsTop 10 Perl Performance Tips
Top 10 Perl Performance Tips
 
Etl confessions pg conf us 2017
Etl confessions   pg conf us 2017Etl confessions   pg conf us 2017
Etl confessions pg conf us 2017
 
Introduction to Parallelization and performance optimization
Introduction to Parallelization and performance optimizationIntroduction to Parallelization and performance optimization
Introduction to Parallelization and performance optimization
 

Más de RichardWarburton

Fantastic performance and where to find it
Fantastic performance and where to find itFantastic performance and where to find it
Fantastic performance and where to find itRichardWarburton
 
Production profiling what, why and how technical audience (3)
Production profiling  what, why and how   technical audience (3)Production profiling  what, why and how   technical audience (3)
Production profiling what, why and how technical audience (3)RichardWarburton
 
Production profiling: What, Why and How
Production profiling: What, Why and HowProduction profiling: What, Why and How
Production profiling: What, Why and HowRichardWarburton
 
Production profiling what, why and how (JBCN Edition)
Production profiling  what, why and how (JBCN Edition)Production profiling  what, why and how (JBCN Edition)
Production profiling what, why and how (JBCN Edition)RichardWarburton
 
Production Profiling: What, Why and How
Production Profiling: What, Why and HowProduction Profiling: What, Why and How
Production Profiling: What, Why and HowRichardWarburton
 
Java collections the force awakens
Java collections  the force awakensJava collections  the force awakens
Java collections the force awakensRichardWarburton
 
Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)RichardWarburton
 
Generics past, present and future
Generics  past, present and futureGenerics  past, present and future
Generics past, present and futureRichardWarburton
 
Generics Past, Present and Future
Generics Past, Present and FutureGenerics Past, Present and Future
Generics Past, Present and FutureRichardWarburton
 
Pragmatic functional refactoring with java 8 (1)
Pragmatic functional refactoring with java 8 (1)Pragmatic functional refactoring with java 8 (1)
Pragmatic functional refactoring with java 8 (1)RichardWarburton
 
Twins: Object Oriented Programming and Functional Programming
Twins: Object Oriented Programming and Functional ProgrammingTwins: Object Oriented Programming and Functional Programming
Twins: Object Oriented Programming and Functional ProgrammingRichardWarburton
 
Pragmatic functional refactoring with java 8
Pragmatic functional refactoring with java 8Pragmatic functional refactoring with java 8
Pragmatic functional refactoring with java 8RichardWarburton
 
Introduction to lambda behave
Introduction to lambda behaveIntroduction to lambda behave
Introduction to lambda behaveRichardWarburton
 
Introduction to lambda behave
Introduction to lambda behaveIntroduction to lambda behave
Introduction to lambda behaveRichardWarburton
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictabilityRichardWarburton
 
Simplifying java with lambdas (short)
Simplifying java with lambdas (short)Simplifying java with lambdas (short)
Simplifying java with lambdas (short)RichardWarburton
 

Más de RichardWarburton (20)

Fantastic performance and where to find it
Fantastic performance and where to find itFantastic performance and where to find it
Fantastic performance and where to find it
 
Production profiling what, why and how technical audience (3)
Production profiling  what, why and how   technical audience (3)Production profiling  what, why and how   technical audience (3)
Production profiling what, why and how technical audience (3)
 
Production profiling: What, Why and How
Production profiling: What, Why and HowProduction profiling: What, Why and How
Production profiling: What, Why and How
 
Production profiling what, why and how (JBCN Edition)
Production profiling  what, why and how (JBCN Edition)Production profiling  what, why and how (JBCN Edition)
Production profiling what, why and how (JBCN Edition)
 
Production Profiling: What, Why and How
Production Profiling: What, Why and HowProduction Profiling: What, Why and How
Production Profiling: What, Why and How
 
Java collections the force awakens
Java collections  the force awakensJava collections  the force awakens
Java collections the force awakens
 
Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)
 
Generics past, present and future
Generics  past, present and futureGenerics  past, present and future
Generics past, present and future
 
How to run a hackday
How to run a hackdayHow to run a hackday
How to run a hackday
 
Generics Past, Present and Future
Generics Past, Present and FutureGenerics Past, Present and Future
Generics Past, Present and Future
 
Pragmatic functional refactoring with java 8 (1)
Pragmatic functional refactoring with java 8 (1)Pragmatic functional refactoring with java 8 (1)
Pragmatic functional refactoring with java 8 (1)
 
Twins: Object Oriented Programming and Functional Programming
Twins: Object Oriented Programming and Functional ProgrammingTwins: Object Oriented Programming and Functional Programming
Twins: Object Oriented Programming and Functional Programming
 
Pragmatic functional refactoring with java 8
Pragmatic functional refactoring with java 8Pragmatic functional refactoring with java 8
Pragmatic functional refactoring with java 8
 
Introduction to lambda behave
Introduction to lambda behaveIntroduction to lambda behave
Introduction to lambda behave
 
Introduction to lambda behave
Introduction to lambda behaveIntroduction to lambda behave
Introduction to lambda behave
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
Simplifying java with lambdas (short)
Simplifying java with lambdas (short)Simplifying java with lambdas (short)
Simplifying java with lambdas (short)
 
Twins: OOP and FP
Twins: OOP and FPTwins: OOP and FP
Twins: OOP and FP
 
Twins: OOP and FP
Twins: OOP and FPTwins: OOP and FP
Twins: OOP and FP
 
The Bleeding Edge
The Bleeding EdgeThe Bleeding Edge
The Bleeding Edge
 

Último

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 

Último (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 

Caching in (DevoxxUK 2013)

  • 1. Caching In Understand, measure and use your CPU Cache more effectively Richard Warburton
  • 2. What are you talking about? ● Why do you care? ● Hardware Fundamentals ● Measurement ● Principles and Examples
  • 3. Why do you care? Perhaps why /should/ you care.
  • 4. The Problem Very Fast Relatively Slow
  • 5. The Solution: CPU Cache ● Core Demands Data, looks at its cache ○ If present (a "hit") then data returned to register ○ If absent (a "miss") then data looked up from memory and stored in the cache ● Fast memory is expensive, a small amount is affordable
  • 6. Multilevel Cache: Intel Sandybridge Physical Core 0 Physical Core N HT: 2 Logical Cores HT: 2 Logical Cores Level 1 Level 1 Level 1 Level 1 Data Instruction .... Data Instruction Cache Cache Cache Cache Level 2 Cache Level 2 Cache Shared Level 3 Cache
  • 7. CPU Stalls ● Want your CPU to execute instructions ● Stall = CPU can't execute because its waiting on memory ● Stalls displace potential computation ● Busy-work Strategies ○ Out-of-Order Execution ○ Simultaneous MultiThreading (SMT) / HyperThreading (HT)
  • 8. How bad is a miss? Location Latency in Clockcycles Register 0 L1 Cache 3 L2 Cache 9 L3 Cache 21 Main Memory 150-400 NB: Sandybridge Numbers
  • 9. Cache Lines ● Data transferred in cache lines ● Fixed size block of memory ● Usually 64 bytes in current x86 CPUs ● Purely hardware consideration
  • 10. You don't need to know everything
  • 11. Architectural Takeaways ● Modern CPUs have large multilevel Caches ● Cache misses cause Stalls ● Stalls are expensive
  • 12. Measurement Barometer, Sun Dial ... and CPU Counters
  • 13. Do you care? ● DON'T look at CPU Caching behaviour first! ● Not all Performance Problems CPU Bound ○ Garbage Collection ○ Networking ○ Database or External Service ○ I/O ● Consider caching behaviour when you're execution bound and know your hotspot.
  • 14. CPU Performance Counters ● CPU Register which can count events ○ Eg: the number of Level 3 Cache Misses ● Model Specific Registers ○ not instruction set standardised by x86 ○ differ by CPU (eg Penryn vs Sandybridge) ● Don't worry - leave details to tooling
  • 15. Measurement: Instructions Retired ● The number of instructions which executed until completion ○ ignores branch mispredictions ● When stalled you're not retiring instructions ● Aim to maximise instruction retirement when reducing cache misses
  • 16. Cache Misses ● Cache Level ○ Level 3 ○ Level 2 ○ Level 1 (Data vs Instruction) ● What to measure ○ Hits ○ Misses ○ Reads vs Writes ○ Calculate Ratio
  • 17. Cache Profilers ● Open Source ○ perfstat ○ linux (rdmsr/wrmsr) ○ Linux Kernel (via /proc) ● Proprietary ○ jClarity jMSR ○ Intel VTune ○ AMD Code Analyst ○ Visual Studio 2012
  • 18. Good Benchmarking Practice ● Warmups ● Measure and rule-out other factors ● Specific Caching Ideas ...
  • 19. Take Care with Working Set Size
  • 20. Good Benchmark = Low Variance
  • 21. You can measure Cache behaviour
  • 22. Principles & Examples Sequential Code
  • 23. Prefetching ● Prefetch = Eagerly load data ● Adjacent Cache Line Prefetch ● Data Cache Unit (Streaming) Prefetch ● Problem: CPU Prediction isn't perfect ● Solution: Arrange Data so accesses are predictable
  • 24. Temporal Locality Repeatedly referring to same data in a short time span Spatial Locality Referring to data that is close together in memory Sequential Locality Referring to data that is arranged linearly in memory
  • 25. General Principles ● Use smaller data types (-XX:+UseCompressedOops) ● Avoid 'big holes' in your data ● Make accesses as linear as possible
  • 26. Primitive Arrays // Sequential Access = Predictable for (int i=0; i<someArray.length; i++) someArray[i]++;
  • 27. Primitive Arrays - Skipping Elements // Holes Hurt for (int i=0; i<someArray.length; i += SKIP) someArray[i]++;
  • 28. Primitive Arrays - Skipping Elements
  • 29. Multidimensional Arrays ● Multidimensional Arrays are really Arrays of Arrays in Java. (Unlike C) ● Some people realign their accesses: for (int col=0; col<COLS; col++) { for (int row=0; row<ROWS; row++) { array[ROWS * col + row]++; } }
  • 30. Bad Access Alignment Strides the wrong way, bad locality. array[COLS * row + col]++; Strides the right way, good locality. array[ROWS * col + row]++;
  • 31. Data Locality vs Java Heap Layout class Foo { count Integer count; 0 bar Bar bar; Baz baz; 1 baz } 2 // No alignment guarantees 3 for (Foo foo : foos) { foo.count = 5; ... foo.bar.visit(); }
  • 32. Data Locality vs Java Heap Layout ● Serious Java Weakness ● Location of objects in memory hard to guarantee. ● Garbage Collection Impact ○ Mark-Sweep ○ Copying Collector
  • 33. Data Layout Principles ● Primitive Collections (GNU Trove, etc.) ● Arrays > Linked Lists ● Hashtable > Search Tree ● Avoid Code bloating (Loop Unrolling) ● Custom Data Structures ○ Judy Arrays ■ an associative array/map ○ kD-Trees ■ generalised Binary Space Partitioning ○ Z-Order Curve ■ multidimensional data in one dimension
  • 34. Game Event Server class PlayerAttack { private long playerId; private int weapon; private int energy; ... private int getWeapon() { return weapon; } ...
  • 35. Flyweight Unsafe (1) static final Unsafe unsafe = ... ; long space = OBJ_COUNT * ATTACK_OBJ_SIZE; address = unsafe.allocateMemory(space); static PlayerAttack get(int index) { long loc = address + (ATTACK_OBJ_SIZE * index) return new PlayerAttack(loc); }
  • 36. Flyweight Unsafe (2) class PlayerAttack { static final long WEAPON_OFFSET = 8 private long loc; public int getWeapon() { long address = loc + WEAPON_OFFSET return unsafe.getInt(address); } public void setWeapon(int weapon) { long address = loc + WEAPON_OFFSET unsafe.setInt(address, weapon); } }
  • 37. SLAB ● Manually writing this is: ○ time consuming ○ error prone ○ boilerplate-ridden ● Prototype library: ○ http://github.com/RichardWarburton/slab ○ Maps an interface to a flyweight
  • 38. Slab (2) public interface GameEvent extends Cursor { public int getId(); public void setId(int value); public long getStrength(); public void setStrength(long value); } Allocator<GameEvent> eventAllocator = Allocator.of(GameEvent.class); GameEvent event = eventAllocator.allocate(100); event.move(50); event.setId(6); assertEquals(6, event.getId());
  • 39. Data Alignment ● An address a is n-byte aligned when: ○ n is a power of two ○ a is divisible by n ● Cache Line Aligned: ○ byte aligned to the size of a cache line
  • 40. Rules and Hacks ● Java (Hotspot) ♥ 8-byte alignment: ○ new Java Objects ○ Unsafe.allocateMemory ○ Bytebuffer.allocateDirect ● Get the object address: ○ Unsafe gives it to you directly ○ Direct ByteBuffer has a field called 'address'
  • 41. Cacheline Aligned Access ● Performance Benefits: ○ 3-6x on Core2Duo ○ ~1.5x on Nehalem Class ○ Measured using direct ByteBuffers ○ Mid Aligned vs Straddling a Cacheline ● http://psy-lob-saw.blogspot.co.uk/2013/01/direct-memory-alignment-in-java.html
  • 42. Today I learned ... Access and Cache Aligned, Contiguous Data
  • 43. Translation Lookaside Buffers TLB, not TLAB!
  • 44. Virtual Memory ● RAM is finite ● RAM is fragmented ● Multitasking is nice ● Programming easier with a contiguous address space
  • 45. Page Tables ● Virtual Address Space split into pages ● Page has two Addresses: ○ Virtual Address: Process' view ○ Physical Address: Location in RAM/Disk ● Page Table ○ Mapping Between these addresses ○ OS Managed ○ Page Fault when entry is missing ○ OS Pages in from disk
  • 46. Translation Lookaside Buffers ● TLB = Page Table Cache ● Avoids memory bottlenecks in page lookups ● CPU Cache is multi-level => TLB is multi- level ● Miss/Hit rates measurable via CPU Counters ○ Hit/Miss Rates ○ Walk Duration
  • 47. TLB Flushing ● Process context switch => change address space ● Historically caused "flush" = ● Avoided with an Address Space Identifier (ASID)
  • 48. Page Size Tradeoff: Bigger Pages = less pages = quicker page lookups vs Bigger Pages = wasted memory space = more disk paging
  • 49. Page Sizes ● Page Size is adjustable ● Common Sizes: 4KB, 2MB, 1GB ● Easy Speedup: 10%-30% ● -XX:+UseLargePages ● Oracle DBs love Large Pages
  • 50. Too Long, Didn't Listen 1. Virtual Memory + Paging have an overhead 2. Translation Lookaside Buffers used to reduce cost 3. Page size changes important
  • 51. Principles & Examples (2) Concurrent Code
  • 52. Context Switching ● Multitasking systems 'context switch' between processes ● Variants: ○ Thread ○ Process ○ Virtualized OS
  • 53. Context Switching Costs ● Direct Cost ○ Save old process state ○ Schedule new process ○ Load new process state ● Indirect Cost ○ TLB Reload (Process only) ○ CPU Pipeline Flush ○ Cache Interference ○ Temporal Locality
  • 54. Indirect Costs (cont.) ● "Quantifying The Cost of Context Switch" by Li et al. ○ 5 years old - principle still applies ○ Direct = 3.8 microseconds ○ Indirect = ~0 - 1500 microseconds ○ indirect dominates direct when working set > L2 Cache size ● Cooperative hits also exist ● Minimise context switching if possible
  • 55. Locking ● Locks require kernel arbitration ○ Not a context switch, but a mode switch ○ Lots of lock contention causes context switching ● Has a cache-pollution cost ● Can use try lock-free algorithms
  • 56. Java Memory Model ● JSR 133 Specs the memory model ● Turtles Example: ○ Java Level: volatile ○ CPU Level: lock ○ Cache Level: MESI ● Has a cost!
  • 57. False Sharing ● Data can share a cache line ● Not always good ○ Some data 'accidentally' next to other data. ○ Causes Cache lines to be evicted prematurely ● Serious Concurrency Issue
  • 58. Concurrent Cache Line Access © Intel
  • 59. Current Solution: Field Padding public volatile long value; public long pad1, pad2, pad3, pad4, pad5, pad6; 8 bytes(long) * 7 fields + header = 64 bytes = Cacheline NB: fields aligned to 8 bytes even if smaller
  • 60. Real Solution: JEP 142 class UncontendedFoo { int someValue; @Uncontended volatile long foo; Bar bar; } http://openjdk.java.net/jeps/142
  • 61. False Sharing in Your GC ● Card Tables ○ Split RAM into 'cards' ○ Table records which parts of RAM are written to ○ Avoid rescanning large blocks of RAM ● BUT ... optimise by writing to the byte on every write ● -XX:+UseCondCardMark ○ Use With Care: 15-20% sequential slowdown
  • 62. Buyer Beware: ● Context Switching ● False Sharing ● Visibility/Atomicity/Locking Costs
  • 63. Too Long, Didn't Listen 1. Caching Behaviour has a performance effect 2. The effects are measurable 3. There are common problems and solutions
  • 64. Questions? @RichardWarburto www.akkadia.org/drepper/cpumemory.pdf www.jclarity.com/friends/ g.oswego.edu/dl/concurrency-interest/
  • 65. A Brave New World (hopefully, sort of)
  • 66. Arrays 2.0 ● Proposed JVM Feature ● Library Definition of Array Types ● Incorporate a notion of 'flatness' ● Semi-reified Generics ● Loads of other goodies (final, volatile, resizing, etc.) ● Requires Value Types
  • 67. Value Types ● Assign or pass the object you copy the value and not the reference ● Allow control of the memory layout ● Control of layout = Control of Caching Behaviour ● Idea, initial prototypes in mlvm
  • 68. The future will solve all problems!