Presentation about the steps required for verifying and validating safety-critical systems, and about the test approach used. It goes beyond the bare processes, and also covers the required safety culture and the kind of people needed. The presentation contains examples of real-life IEC 61508 SIL 4 systems used on storm surge barriers...
7. Some people live on the edge…
How would you feel if you were getting ready to launch and knew you were sitting on top of two million parts -- all built by the lowest bidder on a government contract?
John Glenn
11. Until it is too late…
• February 1st, 1953
• A spring tide and heavy winds broke the dykes
• Killed 1,836 people and 30,000 animals
12. The battle against flood risk…
• Cost €2,500,000,000
• The largest moving structure on the planet
• Defends
– 500 km² of land
– 80,000 people
• Partially controlled by software
13. Nothing is flawless, by design…
No matter how good the design is:
• Some scenarios will be missed
• Some scenarios are too expensive to prevent:
– Accept the risk
– Communicate it to stakeholders
14. When is software good enough?
• Dutch law on storm surge barriers
• Equalizes the risk of dying from unnatural causes across the Netherlands
15. Risks have to be balanced…
Availability of the service vs. safety of the service
16. Oosterschelde Storm Surge Barrier
• Chance of
– Failure to close: 10⁻⁷ per usage
– Unexpected closure: 10⁻⁴ per year
17. To put things in perspective…
• Having a drunk pilot: 10⁻² per flight
• Hurting yourself when using a chainsaw: 10⁻³ per use
• Dating a supermodel: 10⁻⁵ in a lifetime
• Drowning in a bathtub: 10⁻⁷ in a lifetime
• Being hit by falling airplane parts: 10⁻⁸ in a lifetime
• Being killed by lightning: 10⁻⁹ in a lifetime
• Winning the lottery: 10⁻¹⁰ in a lifetime
• Your house being hit by a meteor: 10⁻¹⁵ in a lifetime
• Winning the lottery twice: 10⁻²⁰ in a lifetime
23. The industry statistics are against us…
• Capers Jones: at least 2 high-severity errors per 10 KLoC
• Industry consensus is that software will never be more reliable than
– 10⁻⁵ per usage
– 10⁻⁹ per operating hour
25. The value of testing
Program testing can be used to show the presence of bugs, but never to show their absence!
Edsger W. Dijkstra
26. Is just testing enough?
• 64 bits of input isn’t that uncommon
• 2⁶⁴ is the global rice production of 1,000 years, measured in individual grains
• Fully testing all binary inputs of a simple 64-bit stimulus-response system once takes 2 centuries
30. IEC 61508: A process for safety critical functions
31. SYSTEM DESIGN
What do safety-critical systems look like and what are their most important drivers?
32. Design Principles
• Keep it simple...
• Risk analysis drives design (decisions)
• Safety first (production later)
• Fail-to-safe
• There shall be no single source of (catastrophic) failure
33. A simple design of a storm surge barrier
Relay (€10.00 apiece)
Water detector (€17.50)
Design documentation (sponsored by Heineken)
38. Risk ≠ system crash
• Understandability of the GUI
• Incorrect functional behaviour
• Data accuracy
• Lack of response speed
• Tolerance towards illogical inputs
• Resistance to hackers
51. Design Validation and Verification
• Peer reviews by
– System architect
– 2nd designer
– Programmers
– Test manager (system testing)
• Fault Tree Analysis / Failure Mode and Effect Analysis
• Performance modeling
• Static verification / dynamic simulation (by Twente University)
52. Programming (in C/C++)
• Coding standard:
– Based on “Safer C”, by Les Hatton
– May only use a safe subset of the language
– Verified by Lint and 5 other tools
• Code is peer reviewed by a 2nd developer
• Certified and calibrated compiler
53. Unit tests
• Focus on conformance to specifications
• Required coverage: 100% with respect to:
– Code paths
– Input equivalence classes
• Boundary value analysis
• Probabilistic testing
• Execution:
– Fully automated scripts, running 24x7
– Creates 100 MB/hour of logs and measurement data
• Upon bug detection
– 3 strikes is out: after 3 implementation errors the unit is rebuilt by another developer
– 2 strikes is out: the need for a 2nd rebuild implies a redesign by another designer
55. Integration testing
• Focus on
– Functional behaviour of chains of components
– Failure scenarios based on risk analysis
• Required coverage
– 100% coverage on input classes
• Probabilistic testing
• Execution:
– Fully automated scripts, running 24x7, at 10x speed
– Creates 250 MB/hour of logs and measurement data
• Upon detection
– Each bug gets a root-cause analysis
56. Redundancy is a nasty beast
• You do get the functional behaviour of your entire system
• It is nearly impossible to see if all components are working correctly
• Is EVERYTHING working OK, or is it the safety net?
57. System testing
• Focus on
– Functional behaviour
– Failure scenarios based on risk analysis
• Required coverage
– 100% complete environment (simulation)
– 100% coverage on input classes
• Execution:
– Fully automated scripts, running 24x7, at 10x speed
– Creates 250 MB/hour of logs and measurement data
• Upon detection
– Each bug gets a root-cause analysis
58. Endurance testing
• Look for the “one in a million times” problem
• Challenge:
– Software is deterministic
– Execution is not (timing, transmission errors, system load)
• Have an automated script run it over and over again
59. Results of Endurance Tests
[Chart: reliability growth of Function M, Project S — chance of failure on a logarithmic scale (1.E+00 down to 1.E-05) against platform versions 4.35, 4.36 and 4.37]
60. Acceptance testing
• Acceptance testing
1. Functional acceptance
2. Failure behaviour, all top-50 (FMECA) risks tested
3. A year of operational verification
• Execution:
– Tests performed on a working storm surge barrier
– Creates 250 MB/hour of logs and measurement data
• Upon detection
– Each bug gets a root-cause analysis
61. A risk limit to testing
• Some things are too dangerous to test
• Some tests introduce more risks than they try to mitigate
• There should always be a safe way out of a test procedure
63. GUI Acceptance testing
• Looking for
– Quality in use for interactive systems
– Understandability of the GUI
• Structured investigation of the performance of the man-machine interactions
• Looking for “abuse” by the users
• Looking at real-life handling of emergency operations
64. Avalanche testing
• Tests the capabilities of alarming and control
• Usually starts with one simple trigger
• Generally followed by millions of alarms
• Generally brings your network and systems to the breaking point
65. Crash and recovery procedure testing
• Validation of system behaviour after a massive crash and restart
• Usually identifies many issues with emergency procedures
• Sometimes identifies issues around power supply
• Usually identifies some (combinations of) systems incapable of unattended recovery...
66. Production has its challenges…
• Are equipment and processes optimally arranged?
• Are the humans up to their task?
• Does everything perform as expected?
74. Requires true commitment to results…
• The Romans put the architect under the arches when removing the scaffolding
• Boeing and Airbus put all lead engineers on the first test flight
• Dijkstra put his “rekenmeisjes” (the women who did his calculations) on the opposite dock when launching ships
75. It is about keeping your back straight…
• Thomas Andrews, Jr.
• Naval architect in charge of RMS Titanic
• He recognized that the regulations were insufficient for a ship the size of Titanic
• Decisions “forced upon him” by the client:
– Limit the extent of the double hull
– Limit the number of lifeboats
• He was on the maiden voyage to spot improvements
• He knowingly went down with the ship, saving as many as he could
76. It requires a specific breed of people
The fates of developers and testers are linked to safety-critical systems for eternity
77. Conclusion
• Stop reading newspapers
• Safety-critical testing is a lot of work, making sure nothing happens
• Technically it isn’t that much different; we’re just more rigorous and use a specific breed of people....
I spent the last 15 years working on highly mission- and safety-critical systems.
Glenn’s advantage was that it only had to work once...... And those were the sixties (you could still get away with that then), and astronauts still had guts. Source: http://www.historicwings.com/features98/mercury/seven-left-bottom.html
Surfers sunbathing in front of one of the most dangerous nuclear power plants in the world: San Onofre.
It is carved in stone: nobody asks if we can make the storm surge barriers less safe when the crime rate in Amsterdam goes up…
We have to explicitly balance the availability of a service and its safety. Otherwise we would just permanently close the barrier and be happy about it!
Please note: the first bullet is the reason we have autopilots (co-pilots are good drinking buddies), and that people are still anxious to fly, but don’t get up in the morning thinking they are going to score a date with a supermodel.
Won the jackpot of 10 million in 2012 and the jackpot of 3 million in 2013. Odds: winning once: 1 in 14 million; winning twice: 1 in 195 trillion.
But we still (unknowingly) live in that world.....
You start with your primary concern. The process is simple: you chop your problem up until you have those 2 million parts, and you know what each component contributes. You take the most important 10, or 100, and take targeted measures.
Three Mile Island nuclear disaster, 28th March 1979. Lack of understanding of the situation led to loss of control of the #2 reactor, which led to a partial meltdown. Some say it killed approx. 330 people.
Tickles security: hard on the outside, butter-soft on the inside.
There is a bug in this one: this code is NOT fail-safe, because it has a potential catastrophic deadlock (when the diesels don’t report Ready).....
Please be reminded: the presented code has a deadlock!
Do you know the difference between validation and verification? Validation = meets external expectations: it does what it is supposed to do. Verification = meets internal expectations: it conforms to specs.
Note: this is NOT related to the Storm Surge Barrier!
You look for safety-critical situations, sometimes without a safety net. Chernobyl was a test of a pump, without a safety net.
Three Mile Island is a good example.
Most beautiful example: UPSes using too much power to charge, killing all fuses.... Current example: we found out that the identity-management server was a single point of failure.... Eurocontrol example: the control unit wasn’t ready for the CWPs, and after that got overloaded.
T5 problems: people couldn’t find their way in the installation…
This is functional nonsense: DirMsgResponse is sent to the output, no matter what.
Reality isn’t nice and clean. Reality is messy, chaotic, stressed, and nobody completely understands it…. Reality is much more fun.
The Agile movement is right: people before processes! A process can’t compensate for the idiots inside it. The aqueduct of Segovia, on the Iberian peninsula, was built by the Romans between 50 AD and 125 AD. The architects over-engineered their structure so heavily that it is still standing 2,000 years later. People have to realize that the commitment of people to get it right the first time is essential. At Eurocontrol, we mentioned a projected death toll on every bug.
Our successes are unknown, our failures make the headlines…. When a system fails in production, it is actual blood on our hands. At Eurocontrol, each bug had a body count attached to it.....