SlideShare una empresa de Scribd logo
1 de 50
Peddle the Pedal to the Metal
Howard Chu
CTO, Symas Corp. hyc@symas.com
Chief Architect, OpenLDAP hyc@openldap.org
2019-03-05
InfoQ.com: News & Community Site
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
programming-c-efficient-scalable-
software
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon London
www.qconlondon.com
2
Overview
●
Context, philosophy, impact
●
Profiling tools
●
Obvious problems and effective solutions
●
More problems, more tools
●
When incremental improvement isn’t enough
3
Tips, Tricks, Tools & Techniques
●
Real world experience accelerating an existing codebase over
100x
– From 60ms per op to 0.6ms per op
– All in portable C, no asm or other non-portable tricks
4
Search Performance
5
Mechanical Sympathy
●
“By understanding a machine-oriented language, the
programmer will tend to use a much more efficient method; it is
much closer to reality.”
– Donald Knuth The Art of Computer Programming 1967
6
Optimization
●
“We should forget about small efficiencies, say about 97% of
the time: premature optimization is the root of all evil. Yet we
should not pass up our opportunities in that critical 3%.”
– Donald Knuth “Computer Programming as an Art” 1974
7
Optimization
●
The decisions differ greatly between refactoring an existing
codebase, and starting a new project from scratch
– But even with new code, there’s established knowledge that can’t be
ignored.
●
e.g. it’s not premature to choose to avoid BubbleSort
●
Planning ahead will save a lot of actual coding
8
Optimization
●
Eventually you reach a limit, where a time/space tradeoff is
required
– But most existing code is nowhere near that limit
●
Some cases are clear, no tradeoffs to make
– E.g. there’s no clever way to chop up or reorganize an array of
numbers before summing them up
●
Eventually you must visit and add each number in the array
●
Simplicity is best
9
Summing
int i, sum;
for (i=1, sum=A[0]; i<8; sum+=A[i], i++);
A[0] A[1]A[0] + A[2]+ A[3]+ A[4]+ A[5]+ A[6]+ A[7]+
10
Summing
int i, j, sum=0;
for (i=0; i<5; i+= 4) {
for (j=0; j<3; j+=2) a[i+j] += a[i+j+1];
a[i] += a[i+2];
sum += a[i];
}
A[0] A[1]A[0] + A[2] A[3]+ A[4] A[5]+ A[6] A[7]+
A[01] A[23] A[45] A[67]+ +
A[0123] A[4567]+
11
Optimization
●
Correctness first
– It’s easier to make correct code fast, than vice versa
●
Try to get it right the first time around
– If you don’t have time to do it right, when will you ever have time to
come back and fix it?
●
Computers are supposed to be fast
– Even if you get the right answer, if you get it too late, your code is
broken
12
Tools
●
Profile! Always measure first
– Many possible approaches, each has different strengths
●
Linux perf (formerly called oprofile)
– Easiest to use, time-based samples
– Generated call graphs can miss important details
●
FunctionCheck
– Compiler-based instrumentation, requires explicit compile
– Accurate call graphs, noticeable performance impact
●
Valgrind callgrind
– Greatest detail, instruction-level profiles
– Slowest to execute, hundreds of times slower than normal
13
Profiling
●
Using `perf` in a first pass is fairly painless and will show you
the worst offenders
– We found in UMich LDAP 3.3, 55% of execution time was spent in
malloc/free. Another 40% in strlen, strcat, strcpy
– You’ll never know how (bad) things are until you look
14
Profiling
●
As noted, `perf` can miss details and usually doesn’t give very
useful call graphs
– Knowing the call tree is vital to fixing the hot spots
– This is where other tools like FunctionCheck and valgrind/callgrind
are useful
15
Insights
●
“Don’t Repeat Yourself” as a concept applies universally
– Don’t recompute the same thing multiple times in rapid succession
●
Don’t throw away useful information if you’ll need it again soon. If the
information is used frequently and expensive to compute, remember it
●
Corollary: don’t cache static data that’s easy to re-fetch
16
String Mangling
●
The code was doing a lot of redundant string
parsing/reassembling
– 25% of time in strlen() on data received over the wire
●
Totally unnecessary since all LDAP data is BER-encoded, with explicit
lengths
●
Use struct bervals everywhere, which carries a string pointer and an explicit
length value
●
Eliminated strlen() from runtime profiles
17
String Mangling
●
Reassembling string components with strcat()
– Wasteful, Schlemiel the Painter problem
●
https://en.wikipedia.org/wiki/Joel_Spolsky#Schlemiel_the_Painter
%27s_algorithm
●
strcat() always starts from beginning of string, gets slower the more it’s used
– Fixed by using our own strcopy() function, which returns pointer to
end of string.
●
Modern equivalent is stpcpy().
18
String Mangling
●
Safety note – safe strcpy/strcat:
char *stecpy(char *dst, const char *src, const char *end)
{
while (*src && dst < end)
*dst++ = *src++;
if (dst < end)
*dst = '0';
return dst;
}
main() {
char buf[64];
char *ptr, *end = buf+sizeof(buf);
ptr = stecpy(buf, "hello", end);
ptr = stecpy(ptr, " world", end);
}
19
String Mangling
●
stecpy()
– Immune to buffer overflows
– Convenient to use, no repetitive recalculation of remaining buffer
space required
– Returns pointer to end of copy, allows fast concatenation of strings
– You should adopt this everywhere
20
String Mangling
●
Conclusion
– If you’re doing a lot of string handling, you probably need to use
something like struct bervals in your code
struct berval {
size_t len;
char *val;
}
– You should avoid using the standard C string library
21
Malloc Mischief
●
Most people’s first impulse on seeing “we’re spending a lot of
time in malloc” is to switch to an “optimized” library like jemalloc
or tcmalloc
– Don’t do it. Not as a first resort. You’ll only net a 10-20%
improvement at most.
– Examine the profile callgraph; see how it’s actually being used
22
Malloc Mischief
●
Most of the malloc use was in functions looking like
datum *foo(param1, param2, etc…) {
datum *result = malloc(sizeof(datum));
result->bar = blah blah…
return result;
}
23
Malloc Mischief
●
Easily eliminated by having the caller provide the datum
structure, usually on its own stack
void foo(datum *ret, param1, param2, etc…)
{
ret->bar = blah blah...
}
24
Malloc Mischief
●
Avoid C++ style constructor patterns
– Callers should always pass data containers in
– Callees should just fill in necessary fields
●
This eliminated about half of our malloc use
– That brings us to the end of the easy wins
– Our execution time accelerated from 60ms/op to 15ms/op
25
Malloc Mischief
●
More bad usage patterns:
– Building an item incrementally, using realloc
●
Another Schlemiel the Painter problem
– Instead, count the sizes of all elements first, and allocate the
necessary space once
26
Malloc Mischief
●
Parsing incoming requests
– Messages include length in prefix
– Read entire message into a single buffer before parsing
– Parse individual fields into data structures
●
Code was allocating containers for fields as well as memory for
copies of fields
●
Changed to set values to point into original read buffer
●
Avoid unneeded mallocs and memcpys
27
Malloc Mischief
●
If your processing has self-contained units of work, use a per-
unit arena with your own custom allocator instead of the heap
– Advantages:
●
No need to call free() at all
●
Can avoid any global heap mutex contention
– Basically the Mark/Release memory management model of Pascal
28
Malloc Mischief
●
Consider preallocating a number of commonly used structures
during startup, to avoid cost of malloc at runtime
– But be careful to avoid creating a mutex bottleneck around usage of
the preallocated items
●
Using these techniques, we moved malloc from #1 in profile to
… not even the top 100.
29
Malloc Mischief
●
If you make some mistakes along the way you might encounter
memory leaks
●
FunctionCheck and valgrind can trace these but they’re both
quite slow
●
Use github.com/hyc/mleak – fastest memory leak tracer
30
Uncharted Territory
●
After eliminating the worst profile hotspots, you may be left with
a profile that’s fairly flat, with no hotspots
– If your system performance is good enough now, great, you’re done
– If not, you’re going to need to do some deep thinking about how to
move forward
– A lot of overheads won’t show up in any profile
31
Threading Cost
●
Threads, aka Lightweight Processes
– The promise was that they would be cheap, spawn as many as you
like, whenever
– (But then again, the promise of Unix was that processes would be
cheap, etc…)
– In reality: startup and teardown costs add up
●
Don’t repeat yourself: don’t incur the cost of startup and teardown repeatedly
32
Threading Cost
●
Use a threadpool
– Cost of thread API overhead is generally not visible in profiles
– Measured throughput improvement of switching to threadpool was
around 15%
33
Function Cost
●
A common pattern involves a Debug function:
Debug(level, message) {
if (!( level & debug_level ))
return;
…
}
34
Function Cost
●
For functions like this that are called frequently but seldom do
any work, the call overhead is significant
●
Replace with a DEBUG() macro
– Move the debug_level test into the macro, avoid function call if the
message would be skipped
35
Function Cost
●
We also had functions with huge signatures, passing many
parameters around
●
This is both a correctness and efficiency issue
●
“If you have a procedure with 10 parameters, you probably
missed some.”
– Alan Perlis Epigrams on Programming 1982
36
Function Cost
●
Nested calls of functions with long parameter lists use a lot of
time pushing params onto the stack
●
Instead, put all params into a single structure and pass pointer
to this struct as function parameter
●
Resulted in 7-8% performance gain
– https://www.openldap.org/lists/openldap-devel/200304/
msg00004.html
37
Data Access Cost
●
Shared data structures in a multithreaded program
– Cost of mutexes to protect accesses
– Hidden cost of misaligned data within shared structures: “False
sharing”
●
Only occurs in multiprocessor machines
38
Data Access Cost
●
Within a single structure, order elements from largest to
smallest, to minimize padding overhead
●
Within shared tables of structures, align structures with size of
CPU cache line
– Use mmap() or posix_memalign() if necessary
●
Use instruction-level tracing and cache hit counters with perf to
see results
39
Data Access Cost
●
Use mutrace to measure lock contention overhead
●
Where hotspots appear, try to distribute the load across multiple
locks instead of just one
– E.g. in slapd threadpool, work queue used a single mutex
– Splitting into 4 queues with 4 mutexes decreased contention and wait
time by a factor of 6.
40
Stepwise Refinement
●
Writing optimal code is an iterative process
– When you eliminate one bottleneck, others may appear that were
previously overshadowed
– It may seem like an unending task
– Measure often and keep good notes so you can see progress being
made
41
Burn It All Down
●
Sometimes you’ll get stuck, maybe you went down a dead end
●
No amount of incremental improvements will get the desired
result
●
If you can identify the remaining problems in your way, it may
be worthwhile to start over with those problems in mind
42
Burn It All Down
●
In OpenLDAP, we’ve used BerkeleyDB since 2000
– Have spent countless hours building a cache above it because its
own performance was too slow
– Numerous bugs along the way related to lock
management/deadlocks
●
Realization: if your DB engine is so slow you need to build your
own cache above it, you’ve got the wrong DB engine
43
Burn It All Down
●
We started designing LMDB in 2009 specifically to avoid the
caching and locking issues in BerkeleyDB
●
Changing large components like this requires a good modular
internal API to be feasible
– Rewriting the entire world from scratch is usually a horrible idea,
reuse as much as you can that’s worth saving
– Make sure you actually solve the problems you intend, make sure
those are the actual important problems
44
Burn It All Down
●
LMDB uses copy-on-write MVCC, exposes data via read-only
mmap
– Eliminates locks for read operations, readers don’t block writers,
writers don’t block readers
– Eliminates mallocs and memcpy when returning data from the DB
●
There are no blocking calls at all in the read path, reads scale perfectly
linearly across all available CPUs
– DB integrity is 100% crash proof, incorruptible
●
Restart after shutdown or crash is instantaneous
45
Review
●
Correctness first
– But getting the right answer too late is still wrong
●
Fixing inefficiencies is an iterative process
●
Multiple tools available, each with different strengths and
weaknesses
●
Sometimes you may have to throw a lot out and start over
46
Conclusion
●
Ultimately the idea is to do only what is necessary and sufficient
– Do what you need to do, and nothing more
– Do what you need, once
– DRY talks about not repeating yourself in source code; here we mean
don’t repeat yourself in execution
47
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
programming-c-efficient-scalable-
software

Más contenido relacionado

Similar a Peddle the Pedal to the Metal

What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performancePiotr Przymus
 
Lean Model-Driven Development through Model-Interpretation: the CPAL design ...
Lean Model-Driven Development through  Model-Interpretation: the CPAL design ...Lean Model-Driven Development through  Model-Interpretation: the CPAL design ...
Lean Model-Driven Development through Model-Interpretation: the CPAL design ...Nicolas Navet
 
HiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOSHiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOSTulipp. Eu
 
C++ 11 Style : A Touch of Class
C++ 11 Style : A Touch of ClassC++ 11 Style : A Touch of Class
C++ 11 Style : A Touch of ClassYogendra Rampuria
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesMarina Kolpakova
 
Meetup 2020 - Back to the Basics part 101 : IaC
Meetup 2020 - Back to the Basics part 101 : IaCMeetup 2020 - Back to the Basics part 101 : IaC
Meetup 2020 - Back to the Basics part 101 : IaCDamienCarpy
 
Week1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC BeginWeek1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC Begin敬倫 林
 
Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)
Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)
Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)Tim Bunce
 
smalltalk numbercrunching
smalltalk numbercrunchingsmalltalk numbercrunching
smalltalk numbercrunchingDaniel Poon
 
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1Adam Dunkels
 
Asynchronous and parallel programming
Asynchronous and parallel programmingAsynchronous and parallel programming
Asynchronous and parallel programmingAnil Kumar
 

Similar a Peddle the Pedal to the Metal (20)

What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performance
 
Introduction to multicore .ppt
Introduction to multicore .pptIntroduction to multicore .ppt
Introduction to multicore .ppt
 
Lean Model-Driven Development through Model-Interpretation: the CPAL design ...
Lean Model-Driven Development through  Model-Interpretation: the CPAL design ...Lean Model-Driven Development through  Model-Interpretation: the CPAL design ...
Lean Model-Driven Development through Model-Interpretation: the CPAL design ...
 
Introduction to Parallelization ans performance optimization
Introduction to Parallelization ans performance optimizationIntroduction to Parallelization ans performance optimization
Introduction to Parallelization ans performance optimization
 
Software Engineering
Software EngineeringSoftware Engineering
Software Engineering
 
04 performance
04 performance04 performance
04 performance
 
HiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOSHiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOS
 
C++ 11 Style : A Touch of Class
C++ 11 Style : A Touch of ClassC++ 11 Style : A Touch of Class
C++ 11 Style : A Touch of Class
 
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesPragmatic Optimization in Modern Programming - Ordering Optimization Approaches
Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches
 
Meetup 2020 - Back to the Basics part 101 : IaC
Meetup 2020 - Back to the Basics part 101 : IaCMeetup 2020 - Back to the Basics part 101 : IaC
Meetup 2020 - Back to the Basics part 101 : IaC
 
Week1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC BeginWeek1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC Begin
 
Introduction to Parallelization ans performance optimization
Introduction to Parallelization ans performance optimizationIntroduction to Parallelization ans performance optimization
Introduction to Parallelization ans performance optimization
 
Lecture1
Lecture1Lecture1
Lecture1
 
Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)
Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)
Devel::NYTProf v3 - 200908 (OUTDATED, see 201008)
 
Kubernetes 101
Kubernetes 101Kubernetes 101
Kubernetes 101
 
An Introduction to OMNeT++ 6.0
An Introduction to OMNeT++ 6.0An Introduction to OMNeT++ 6.0
An Introduction to OMNeT++ 6.0
 
smalltalk numbercrunching
smalltalk numbercrunchingsmalltalk numbercrunching
smalltalk numbercrunching
 
Onnc intro
Onnc introOnnc intro
Onnc intro
 
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
Building the Internet of Things with Thingsquare and Contiki - day 2 part 1
 
Asynchronous and parallel programming
Asynchronous and parallel programmingAsynchronous and parallel programming
Asynchronous and parallel programming
 

Más de C4Media

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoC4Media
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileC4Media
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020C4Media
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsC4Media
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No KeeperC4Media
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like OwnersC4Media
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaC4Media
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideC4Media
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 

Más de C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Último

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 

Último (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 

Peddle the Pedal to the Metal

  • 1. Peddle the Pedal to the Metal Howard Chu CTO, Symas Corp. hyc@symas.com Chief Architect, OpenLDAP hyc@openldap.org 2019-03-05
  • 2. InfoQ.com: News & Community Site Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ programming-c-efficient-scalable- software • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon London www.qconlondon.com
  • 4. 2 Overview ● Context, philosophy, impact ● Profiling tools ● Obvious problems and effective solutions ● More problems, more tools ● When incremental improvement isn’t enough
  • 5. 3 Tips, Tricks, Tools & Techniques ● Real world experience accelerating an existing codebase over 100x – From 60ms per op to 0.6ms per op – All in portable C, no asm or other non-portable tricks
  • 7. 5 Mechanical Sympathy ● “By understanding a machine-oriented language, the programmer will tend to use a much more efficient method; it is much closer to reality.” – Donald Knuth The Art of Computer Programming 1967
  • 8. 6 Optimization ● “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” – Donald Knuth “Computer Programming as an Art” 1974
  • 9. 7 Optimization ● The decisions differ greatly between refactoring an existing codebase, and starting a new project from scratch – But even with new code, there’s established knowledge that can’t be ignored. ● e.g. it’s not premature to choose to avoid BubbleSort ● Planning ahead will save a lot of actual coding
  • 10. 8 Optimization ● Eventually you reach a limit, where a time/space tradeoff is required – But most existing code is nowhere near that limit ● Some cases are clear, no tradeoffs to make – E.g. there’s no clever way to chop up or reorganize an array of numbers before summing them up ● Eventually you must visit and add each number in the array ● Simplicity is best
  • 11. 9 Summing int i, sum; for (i=1, sum=A[0]; i<8; sum+=A[i], i++); A[0] A[1]A[0] + A[2]+ A[3]+ A[4]+ A[5]+ A[6]+ A[7]+
  • 12. 10 Summing int i, j, sum=0; for (i=0; i<5; i+= 4) { for (j=0; j<3; j+=2) a[i+j] += a[i+j+1]; a[i] += a[i+2]; sum += a[i]; } A[0] A[1]A[0] + A[2] A[3]+ A[4] A[5]+ A[6] A[7]+ A[01] A[23] A[45] A[67]+ + A[0123] A[4567]+
  • 13. 11 Optimization ● Correctness first – It’s easier to make correct code fast, than vice versa ● Try to get it right the first time around – If you don’t have time to do it right, when will you ever have time to come back and fix it? ● Computers are supposed to be fast – Even if you get the right answer, if you get it too late, your code is broken
  • 14. 12 Tools ● Profile! Always measure first – Many possible approaches, each has different strengths ● Linux perf (formerly called oprofile) – Easiest to use, time-based samples – Generated call graphs can miss important details ● FunctionCheck – Compiler-based instrumentation, requires explicit compile – Accurate call graphs, noticeable performance impact ● Valgrind callgrind – Greatest detail, instruction-level profiles – Slowest to execute, hundreds of times slower than normal
  • 15. 13 Profiling ● Using `perf` in a first pass is fairly painless and will show you the worst offenders – We found in UMich LDAP 3.3, 55% of execution time was spent in malloc/free. Another 40% in strlen, strcat, strcpy – You’ll never know how (bad) things are until you look
  • 16. 14 Profiling ● As noted, `perf` can miss details and usually doesn’t give very useful call graphs – Knowing the call tree is vital to fixing the hot spots – This is where other tools like FunctionCheck and valgrind/callgrind are useful
  • 17. 15 Insights ● “Don’t Repeat Yourself” as a concept applies universally – Don’t recompute the same thing multiple times in rapid succession ● Don’t throw away useful information if you’ll need it again soon. If the information is used frequently and expensive to compute, remember it ● Corollary: don’t cache static data that’s easy to re-fetch
  • 18. 16 String Mangling ● The code was doing a lot of redundant string parsing/reassembling – 25% of time in strlen() on data received over the wire ● Totally unnecessary since all LDAP data is BER-encoded, with explicit lengths ● Use struct bervals everywhere, which carries a string pointer and an explicit length value ● Eliminated strlen() from runtime profiles
  • 19. 17 String Mangling ● Reassembling string components with strcat() – Wasteful, Schlemiel the Painter problem ● https://en.wikipedia.org/wiki/Joel_Spolsky#Schlemiel_the_Painter %27s_algorithm ● strcat() always starts from beginning of string, gets slower the more it’s used – Fixed by using our own strcopy() function, which returns pointer to end of string. ● Modern equivalent is stpcpy().
  • 20. 18 String Mangling ● Safety note – safe strcpy/strcat: char *stecpy(char *dst, const char *src, const char *end) { while (*src && dst < end) *dst++ = *src++; if (dst < end) *dst = '0'; return dst; } main() { char buf[64]; char *ptr, *end = buf+sizeof(buf); ptr = stecpy(buf, "hello", end); ptr = stecpy(ptr, " world", end); }
  • 21. 19 String Mangling ● stecpy() – Immune to buffer overflows – Convenient to use, no repetitive recalculation of remaining buffer space required – Returns pointer to end of copy, allows fast concatenation of strings – You should adopt this everywhere
  • 22. 20 String Mangling ● Conclusion – If you’re doing a lot of string handling, you probably need to use something like struct bervals in your code struct berval { size_t len; char *val; } – You should avoid using the standard C string library
  • 23. 21 Malloc Mischief ● Most people’s first impulse on seeing “we’re spending a lot of time in malloc” is to switch to an “optimized” library like jemalloc or tcmalloc – Don’t do it. Not as a first resort. You’ll only net a 10-20% improvement at most. – Examine the profile callgraph; see how it’s actually being used
  • 24. 22 Malloc Mischief ● Most of the malloc use was in functions looking like datum *foo(param1, param2, etc…) { datum *result = malloc(sizeof(datum)); result->bar = blah blah… return result; }
  • 25. 23 Malloc Mischief ● Easily eliminated by having the caller provide the datum structure, usually on its own stack void foo(datum *ret, param1, param2, etc…) { ret->bar = blah blah... }
  • 26. 24 Malloc Mischief ● Avoid C++ style constructor patterns – Callers should always pass data containers in – Callees should just fill in necessary fields ● This eliminated about half of our malloc use – That brings us to the end of the easy wins – Our execution time accelerated from 60ms/op to 15ms/op
  • 27. 25 Malloc Mischief ● More bad usage patterns: – Building an item incrementally, using realloc ● Another Schlemiel the Painter problem – Instead, count the sizes of all elements first, and allocate the necessary space once
  • 28. 26 Malloc Mischief ● Parsing incoming requests – Messages include length in prefix – Read entire message into a single buffer before parsing – Parse individual fields into data structures ● Code was allocating containers for fields as well as memory for copies of fields ● Changed to set values to point into original read buffer ● Avoid unneeded mallocs and memcpys
  • 29. 27 Malloc Mischief ● If your processing has self-contained units of work, use a per- unit arena with your own custom allocator instead of the heap – Advantages: ● No need to call free() at all ● Can avoid any global heap mutex contention – Basically the Mark/Release memory management model of Pascal
  • 30. 28 Malloc Mischief ● Consider preallocating a number of commonly used structures during startup, to avoid cost of malloc at runtime – But be careful to avoid creating a mutex bottleneck around usage of the preallocated items ● Using these techniques, we moved malloc from #1 in profile to … not even the top 100.
  • 31. 29 Malloc Mischief ● If you make some mistakes along the way you might encounter memory leaks ● FunctionCheck and valgrind can trace these but they’re both quite slow ● Use github.com/hyc/mleak – fastest memory leak tracer
  • 32. 30 Uncharted Territory ● After eliminating the worst profile hotspots, you may be left with a profile that’s fairly flat, with no hotspots – If your system performance is good enough now, great, you’re done – If not, you’re going to need to do some deep thinking about how to move forward – A lot of overheads won’t show up in any profile
  • 33. 31 Threading Cost ● Threads, aka Lightweight Processes – The promise was that they would be cheap, spawn as many as you like, whenever – (But then again, the promise of Unix was that processes would be cheap, etc…) – In reality: startup and teardown costs add up ● Don’t repeat yourself: don’t incur the cost of startup and teardown repeatedly
  • 34. 32 Threading Cost ● Use a threadpool – Cost of thread API overhead is generally not visible in profiles – Measured throughput improvement of switching to threadpool was around 15%
  • 35. 33 Function Cost ● A common pattern involves a Debug function: Debug(level, message) { if (!( level & debug_level )) return; … }
  • 36. 34 Function Cost ● For functions like this that are called frequently but seldom do any work, the call overhead is significant ● Replace with a DEBUG() macro – Move the debug_level test into the macro, avoid function call if the message would be skipped
  • 37. 35 Function Cost ● We also had functions with huge signatures, passing many parameters around ● This is both a correctness and efficiency issue ● “If you have a procedure with 10 parameters, you probably missed some.” – Alan Perlis Epigrams on Programming 1982
  • 38. 36 Function Cost ● Nested calls of functions with long parameter lists use a lot of time pushing params onto the stack ● Instead, put all params into a single structure and pass pointer to this struct as function parameter ● Resulted in 7-8% performance gain – https://www.openldap.org/lists/openldap-devel/200304/ msg00004.html
  • 39. 37 Data Access Cost ● Shared data structures in a multithreaded program – Cost of mutexes to protect accesses – Hidden cost of misaligned data within shared structures: “False sharing” ● Only occurs in multiprocessor machines
  • 40. 38 Data Access Cost ● Within a single structure, order elements from largest to smallest, to minimize padding overhead ● Within shared tables of structures, align structures with size of CPU cache line – Use mmap() or posix_memalign() if necessary ● Use instruction-level tracing and cache hit counters with perf to see results
  • 41. 39 Data Access Cost ● Use mutrace to measure lock contention overhead ● Where hotspots appear, try to distribute the load across multiple locks instead of just one – E.g. in slapd threadpool, work queue used a single mutex – Splitting into 4 queues with 4 mutexes decreased contention and wait time by a factor of 6.
  • 42. 40 Stepwise Refinement ● Writing optimal code is an iterative process – When you eliminate one bottleneck, others may appear that were previously overshadowed – It may seem like an unending task – Measure often and keep good notes so you can see progress being made
  • 43. 41 Burn It All Down ● Sometimes you’ll get stuck, maybe you went down a dead end ● No amount of incremental improvements will get the desired result ● If you can identify the remaining problems in your way, it may be worthwhile to start over with those problems in mind
  • 44. 42 Burn It All Down ● In OpenLDAP, we’ve used BerkeleyDB since 2000 – Have spent countless hours building a cache above it because its own performance was too slow – Numerous bugs along the way related to lock management/deadlocks ● Realization: if your DB engine is so slow you need to build your own cache above it, you’ve got the wrong DB engine
  • 45. 43 Burn It All Down ● We started designing LMDB in 2009 specifically to avoid the caching and locking issues in BerkeleyDB ● Changing large components like this requires a good modular internal API to be feasible – Rewriting the entire world from scratch is usually a horrible idea, reuse as much as you can that’s worth saving – Make sure you actually solve the problems you intend, make sure those are the actual important problems
  • 46. 44 Burn It All Down ● LMDB uses copy-on-write MVCC, exposes data via read-only mmap – Eliminates locks for read operations, readers don’t block writers, writers don’t block readers – Eliminates mallocs and memcpy when returning data from the DB ● There are no blocking calls at all in the read path, reads scale perfectly linearly across all available CPUs – DB integrity is 100% crash proof, incorruptible ● Restart after shutdown or crash is instantaneous
  • 47. 45 Review ● Correctness first – But getting the right answer too late is still wrong ● Fixing inefficiencies is an iterative process ● Multiple tools available, each with different strengths and weaknesses ● Sometimes you may have to throw a lot out and start over
  • 48. 46 Conclusion ● Ultimately the idea is to do only what is necessary and sufficient – Do what you need to do, and nothing more – Do what you need, once – DRY talks about not repeating yourself in source code; here we mean don’t repeat yourself in execution
  • 49. 47
  • 50. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ programming-c-efficient-scalable- software