It’s been said that the first 90% of a project consumes 90% of the time, whereas the remaining 10% accounts for the other 90% of the time. One reason might be that elevating software from “mostly works” to robust and supportable requires attention to detail in the parts of a system that are usually mocked out during unit testing. It’s all too easy to focus on testing the happy paths and gloss over the trickier design problems, such as how to handle a full disk or a Cheshire Cat-style network.
This session delves into those less glamorous non-functional requirements that crop up the moment you start talking to hard disks, networks, databases, etc. Unsurprisingly, it will have a fair bit to say about detecting and recovering from errors, starting with ensuring that you generate them correctly in the first place. This will undoubtedly lead on to the aforementioned subject of testing systemic effects. Finally, there will also be diversions into the realms of monitoring and configuration as we look into the operational side of the code once it’s running.
At the end you will hopefully have smiled at the misfortune of others (mostly me) and added a few more items to the ever-growing list of “stuff I might have to think about when developing software”.
What do I mean by Robustness?
Not so much about reliability
Chair – sitting, to standing, stacking, etc. – from specified to unknown
Why is it important? Bedrock for sustainable development of new features.
Not over-engineering, just consideration of failures
What do some runtimes do when an unhandled exceptional failure occurs? Nothing!
See QM #6
The exit code convention is 0 for success
Note, that’s “success == !true” just for extra confusion
The parent can’t react and recover if you don’t give them the chance to
Exceptions only exist within a language; once you cross module boundaries it’s back to return codes
Assume failure by default
Don’t assume the runtime will do the right thing
It’s int main(), not void main() – always return an exit code
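The notes above on exit codes can be sketched as a small wrapper. This is only an illustration: the name runGuarded and the idea of passing the real program logic in as a callable are assumptions, not something from the talk. It starts from "failure" and only reports success if the program logic returns it.

```cpp
#include <cstdlib>
#include <exception>
#include <functional>
#include <iostream>

// Runs the real program logic and converts any escaping exception into a
// non-zero exit code, so the parent process can see that something failed.
// (runGuarded is a hypothetical helper name.)
int runGuarded(const std::function<int()>& program)
{
    int exitCode = EXIT_FAILURE;    // assume failure by default

    try
    {
        exitCode = program();
    }
    catch (const std::exception& e)
    {
        std::cerr << "fatal: " << e.what() << std::endl;
    }
    catch (...)
    {
        std::cerr << "fatal: unknown exception" << std::endl;
    }

    return exitCode;
}
```

A program would then define int main() so that it simply returns runGuarded(...) around its real entry point, which means an exit code is always produced, whatever the runtime would otherwise have done.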
Required at any module boundary, e.g. Win32 callback, COM component, WCF service, etc.
Service recovery – shutdown may be worse – black hole effect
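The point about module boundaries can be illustrated with an exception barrier. In this sketch the result codes, processRequestImpl and ProcessRequest are all hypothetical names; a real Win32 callback, COM method or service operation would use that technology's own error conventions instead.

```cpp
#include <new>
#include <stdexcept>

// Hypothetical error codes returned across the module boundary.
enum ResultCode { RESULT_OK = 0, RESULT_OUT_OF_MEMORY = 1, RESULT_FAILED = 2 };

// Internal C++ implementation; free to throw.
void processRequestImpl(int value)
{
    if (value < 0)
        throw std::invalid_argument("value must not be negative");
}

// Exported C-style entry point: no exception may escape from here, so every
// failure is translated into a return code the caller can act on.
extern "C" int ProcessRequest(int value) noexcept
{
    try
    {
        processRequestImpl(value);
        return RESULT_OK;
    }
    catch (const std::bad_alloc&)
    {
        return RESULT_OUT_OF_MEMORY;
    }
    catch (...)
    {
        return RESULT_FAILED;
    }
}
```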
Recap the Abrahams exception safety guarantees
These apply equally to C#, Java, etc. as well
The basic guarantee can be implemented with RAII in C++ and the Dispose pattern in C#; otherwise use a manual try/catch block
Example of real-world code, caused process to fail all work rapidly
When recovery is not foremost in the method, be exception agnostic
Still hard - more recent example was slowly losing engines due to subtle out-of-memory exception
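As a reminder of what the strong guarantee looks like in practice, here is a minimal sketch using the create + swap idiom. The Settings class and parseLine function are invented for illustration; the only point is that all the work that can throw happens on a side copy, and the commit step cannot throw.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical parser that rejects empty lines by throwing.
std::string parseLine(const std::string& line)
{
    if (line.empty())
        throw std::invalid_argument("empty line");
    return line;
}

class Settings
{
public:
    // Strong guarantee: either every line is applied or m_values is unchanged.
    void reload(const std::vector<std::string>& lines)
    {
        std::vector<std::string> fresh;         // create the new state on the side
        for (const auto& line : lines)
            fresh.push_back(parseLine(line));   // may throw; m_values is untouched

        m_values.swap(fresh);                   // commit; swap does not throw
    }

    const std::vector<std::string>& values() const { return m_values; }

private:
    std::vector<std::string> m_values;
};
```

The method itself contains no try/catch block, which is what being exception agnostic means here: the caller decides how to recover, and the object is left in its previous state either way.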
Two-phase construction is a bad idea anyway; always prefer a constructor or factory method that does it all
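A factory method that avoids two-phase construction might look like the following sketch. The Connection class and its endpoint check are assumptions made up for the example; the point is that callers can never get hold of a half-initialised object.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

class Connection
{
public:
    // Factory method: returns a fully usable object or throws. There is no
    // separate init() step for callers to forget.
    static std::unique_ptr<Connection> open(const std::string& endpoint)
    {
        if (endpoint.empty())
            throw std::invalid_argument("endpoint must not be empty");
        return std::unique_ptr<Connection>(new Connection(endpoint));
    }

    const std::string& endpoint() const { return m_endpoint; }

private:
    // Private, so the factory is the only way to create an instance.
    explicit Connection(const std::string& endpoint) : m_endpoint(endpoint) {}

    std::string m_endpoint;
};
```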
Don’t wait forever, there must be an upper limit on how long a user/system actor will actually wait
Don’t even start work if the user has already got bored
Status message example – received every 60 secs so no point waiting any longer
Infinite waits acceptable when operation can be cancelled through other means
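One way to carry an upper limit through a request is a small deadline object, sketched below. Deadline and handleRequest are hypothetical names; the sketch assumes the limit is fixed when the request arrives and that each blocking call is given whatever time remains.

```cpp
#include <chrono>

class Deadline
{
public:
    using Clock = std::chrono::steady_clock;

    explicit Deadline(std::chrono::milliseconds limit)
        : m_expiry(Clock::now() + std::chrono::duration_cast<Clock::duration>(limit))
    {
    }

    bool expired() const { return Clock::now() >= m_expiry; }

    // Time left to hand to the next blocking call; never negative.
    std::chrono::milliseconds remaining() const
    {
        const auto left = std::chrono::duration_cast<std::chrono::milliseconds>(
            m_expiry - Clock::now());
        return left.count() > 0 ? left : std::chrono::milliseconds(0);
    }

private:
    Clock::time_point m_expiry;
};

// Hypothetical request handler: refuse work the caller no longer wants.
bool handleRequest(const Deadline& deadline)
{
    if (deadline.expired())
        return false;   // the user has already given up, so don't start

    // ... do the work, passing deadline.remaining() to each blocking call ...
    return true;
}
```

In the status message case above, the deadline would be set to 60 seconds, since a newer message will have replaced the result by then.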
Long running operations should be cancellable to allow graceful termination/shutdown
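A minimal sketch of a cancellable long-running operation, assuming a shared flag that a shutdown handler sets from elsewhere. The function name and the per-item "work" are placeholders.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Processes items until finished or until cancellation is requested.
// Returns the number of items actually processed.
std::size_t processAll(const std::vector<int>& items,
                       const std::atomic<bool>& cancelRequested)
{
    std::size_t processed = 0;

    for (int item : items)
    {
        if (cancelRequested.load())
            break;          // stop at a safe point between items

        (void)item;         // placeholder for the real per-item work
        ++processed;
    }

    return processed;
}
```

Checking the flag between units of work keeps each item atomic from the caller's point of view, which is usually what a graceful shutdown wants.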
Fast and slow retries – perhaps retry much later (queued) if there is a specific blockage
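The "fast" half of that idea might be sketched as follows. The name retryFast and the injected sleep function are assumptions; injecting the sleep keeps the helper testable without real delays. A false result is the cue for the caller to fall back to a "slow" retry, for example by re-queuing the work for much later.

```cpp
#include <chrono>
#include <functional>

// Retries an operation a few times in quick succession.
// Returns true as soon as an attempt succeeds, false if all attempts fail.
bool retryFast(const std::function<bool()>& attempt,
               int maxAttempts,
               std::chrono::milliseconds pause,
               const std::function<void(std::chrono::milliseconds)>& sleep)
{
    for (int i = 0; i < maxAttempts; ++i)
    {
        if (attempt())
            return true;

        if (i + 1 < maxAttempts)
            sleep(pause);   // brief pause before trying again
    }

    return false;
}
```

In production the sleep argument would typically wrap std::this_thread::sleep_for; in a test it can just count calls.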
Test more than just the happy path (disks fill up, networks hang, access gets denied)
If expecting automatic retry on a cluster failover, mock the service and simulate one to test recovery
Write + rename is equivalent to create + swap earlier
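Write + rename could be sketched like this with C++17's std::filesystem. The function name and the ".tmp" suffix are assumptions. On POSIX systems a rename over an existing file replaces it atomically; behaviour on other platforms should be checked, and this sketch makes no attempt to force the data to disk before the rename.

```cpp
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>

// Replaces the target file by writing a temporary file alongside it and then
// renaming the temporary file over the target.
void replaceFile(const std::filesystem::path& target, const std::string& contents)
{
    std::filesystem::path temp = target;
    temp += ".tmp";     // hypothetical naming scheme for the temporary file

    {
        std::ofstream out(temp, std::ios::binary | std::ios::trunc);
        out << contents;
        out.flush();
        if (!out)
            throw std::runtime_error("failed to write " + temp.string());
    }   // stream is closed here, before the rename

    // Throws std::filesystem::filesystem_error on failure, in which case the
    // original target is left as it was.
    std::filesystem::rename(temp, target);
}
```

Readers of the target file then see either the old contents or the new contents, never a half-written file.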
Build facades to allow unit testing of I/O operations and for simulating errors, e.g. out of disk space
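A facade for simulating I/O errors might look like the sketch below. FileStore, FullDiskStore and saveReport are invented names; a production implementation of FileStore would wrap the real file system calls.

```cpp
#include <stdexcept>
#include <string>

// Facade over the file system; production code depends only on this.
class FileStore
{
public:
    virtual ~FileStore() = default;
    virtual void write(const std::string& path, const std::string& data) = 0;
};

// Test double that simulates a full disk.
class FullDiskStore : public FileStore
{
public:
    void write(const std::string&, const std::string&) override
    {
        throw std::runtime_error("no space left on device");
    }
};

// Code under test: reports the failure instead of letting it vanish.
bool saveReport(FileStore& store, const std::string& report)
{
    try
    {
        store.write("report.txt", report);
        return true;
    }
    catch (const std::runtime_error&)
    {
        return false;   // a real system would log this and perhaps retry later
    }
}
```

A unit test can then pass a FullDiskStore to saveReport and check that the failure path behaves as designed, without needing to fill a real disk.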
In-house production can be simpler as change is tightly controlled; development is where the action happens
Never hard-code anything, all service endpoints and paths must be configurable (on different levels)
Testing often drives the need for flexibility due to shared resources, e.g. a developer’s workstation
DR also a driver, but can be useful outside DR too (e.g. active/passive failover)
But also default sensibly where possible to avoid bloated configuration files
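Configuration "on different levels" with sensible defaults can be sketched as a layered lookup. The Config class, its two layers and the key names used with it are assumptions for illustration; a real system might have more levels (machine, environment, global) and load them from files.

```cpp
#include <initializer_list>
#include <map>
#include <string>
#include <utility>

// Looks up a setting through layers of decreasing priority, falling back to a
// built-in default so that configuration files only list what differs.
class Config
{
public:
    using Layer = std::map<std::string, std::string>;

    Config(Layer overrides, Layer shared)
        : m_overrides(std::move(overrides)), m_shared(std::move(shared))
    {
    }

    std::string get(const std::string& key, const std::string& fallback) const
    {
        for (const Layer* layer : { &m_overrides, &m_shared })
        {
            const auto it = layer->find(key);
            if (it != layer->end())
                return it->second;
        }
        return fallback;    // sensible default when nothing is configured
    }

private:
    Layer m_overrides;  // e.g. per-machine or per-developer settings
    Layer m_shared;     // e.g. environment-wide settings
};
```

A developer's workstation would then need only a one-line override for, say, a database host, while everything else comes from the shared layer or the defaults.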
Calm and considered – pages of errors and alarm bells make it harder to diagnose
You’ll never dream up every possible failure, but you can design ways to allow for it
An excellent book, probably the best on the subject – good case studies