2. Software Disasters
“To err is human, but to really foul things up you need
a computer.” – Paul Ehrlich
• The U.S. economy loses approximately $60 billion every year to rework, lost
productivity and actual damages arising from faulty software.
• The failure rate of software projects is between 50% and 80%.
• For every 100 projects that start, there are 84 restarts.
Software bugs can be annoying, but faulty software can also be
expensive, embarrassing, destructive and deadly.
3. Software Disasters
Software Failures 2016
Published in 2016:
• 1519 stories about software failures
• 548 separate failures
• 363 companies affected
The three most common failure categories were:
• Software bugs – 432
• Usability glitch – 38
• Security vulnerability – 78
6. Ariane 5
The explosion of the Ariane 5 (1996)
• Self-destructed about 36 seconds after launch
• Reason: Overflow when trying to save 64-bit data in a 16-bit format
• Estimated loss: $8 billion
7. Mars Climate Orbiter
Mars Climate Orbiter failure (1998)
• Crashed due to a wrong trajectory
• Reason: Wrong units – imperial (USA) instead of metric (NASA)
• Estimated loss: $200 million
8. Therac-25
The deadly radiation therapy (1985)
• Overdoses of up to 100 times the intended dose
• Reason: Race condition
9. Doomsday for the USA
US power blackout (2003)
[Images: before and after]
• Reason: Race condition
10. Apple Maps
Apple Maps go nowhere…fast (2012)
• One of the most epic fails: missing entire towns, distorted images, clouds…
• Reason: Mapping NOT tested well enough
• Loss: Reputation damage,
Google –
12. Black Monday
Black Monday (1987) – $500 billion
• Systems crashed, leaving investors blind
• Reason: Load testing not done properly
• Estimated loss: $500 billion; the Dow Jones lost 22.6% of its value
13. Knight Capital Group
Knight Capital (2012) – losing it all in 45 minutes
• 4 million trades of 397 million shares in 45 minutes
• Reason: Untested software installed on 1 of 8 servers; bid and ask prices were reversed
• Estimated loss: $460 million, plus fines of $20 million
14. Pepsi Bottle Glitch
Pepsi Bottle Glitch (1992)
Pepsi Philippines launched a promotion dubbed “Number Fever” and offered
prizes of up to P1 million ($37 000) to holders of bottle caps bearing the
number 349.
• 500 000 people claimed the prizes
• The company refused to pay all of them
• Pepsi officials claimed that a computer glitch had picked the number by mistake
• Pepsi faced more than 1000 criminal and civil suits and lost millions
• Angry claimants burnt 37 trucks
• The Pepsi plant in Davao City was attacked
• A grenade killed three people and caused the suspension of the plant’s operation
15. Amadeus
Amadeus – hitting airports across the world in September 2017
• 125 airlines blocked by a check-in failure
• Reason: Bug in a cloud solution
• Estimated loss: Unknown
16. Even more disasters
• Intel Pentium floating point division bug
• Mariner 1 space probe
• Patriot missile timing bug – 28 dead
• AT&T network outage caused by an out-of-service message
• World of Warcraft virus
17. Why get rid of QAs then?
Respondents: 2500 IT professionals; research by Computing.
19. New Trends
Is DevOps going to kill QA?
The answer is No.
In a perfect DevOps world:
• everyone is responsible for QA.
But:
• not everyone has the knowledge and expertise to do QA right;
• only half of the processes are aligned with what DevOps is supposed to be.
Critics say DevOps is good for start-ups but usually turns out to be a
bottleneck for corporations.
20. Future QAs
The shifted role of QAs in Agile and DevOps?
• QA owns continuous improvement and quality tracking
• not only of the product, but also of the process
• Tests are code
• with a release every day or week, there is no place for manual testing
• Mentoring the developers in TDD and in writing better test cases
Software Dysfunction: Why do software fail? Edward Ogheneovo
In 1996, Europe’s newest unmanned satellite-launching rocket, the Ariane 5, was intentionally blown up seconds after lifting off from French Guiana. The European Space Agency estimated that the total development cost of the Ariane 5 was more than $8 billion.
Why was it blown up? The self-destruction was triggered by software trying to stuff a 64-bit number into a 16-bit space.
The shutdown occurred only 36.7 seconds after launch, when the guidance system’s own computer tried to convert one piece of data, the sideways velocity of the rocket, from a 64-bit format to a 16-bit format. The number was too big, and an overflow error resulted.
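The conversion problem described above can be sketched in a few lines. This is only an illustration, not the Ariane flight code: the function names are hypothetical, and the wrap-around in the unchecked version is just one way an out-of-range value can go wrong when it is forced into 16 bits.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def unchecked_to_int16(value: float) -> int:
    """Keep only the low 16 bits of the integer part (silently wraps around)."""
    raw = int(value) & 0xFFFF
    # Reinterpret the 16-bit pattern as a signed number.
    return raw - 0x10000 if raw >= 0x8000 else raw


def checked_to_int16(value: float) -> int:
    """Convert to a 16-bit signed integer, refusing values that do not fit."""
    whole = int(value)
    if not INT16_MIN <= whole <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return whole


if __name__ == "__main__":
    velocity = 40000.0  # hypothetical reading, larger than 32767
    print(unchecked_to_int16(velocity))  # wrapped, meaningless value
    try:
        checked_to_int16(velocity)
    except OverflowError as error:
        print("rejected:", error)
```

A checked conversion turns a silent corruption into an explicit error that the caller has to handle, which is exactly the case a test with out-of-range inputs would exercise.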
Two spacecraft, the Mars Climate Orbiter and the Mars Polar Lander, were part of a space program that, in 1998, was supposed to study the Martian weather, climate, and the water and carbon dioxide content of the atmosphere. But a navigation error caused the orbiter to fly too low in the atmosphere, and it was destroyed.
What caused the error? A subcontractor on the NASA program had used imperial units (as used in the USA) rather than the NASA-specified metric units (as used in Europe).
That error would definitely have been prevented if proper testing against the specification had been conducted.
The result?
The $125 million spacecraft attempted to stabilize its orbit too low in the Martian atmosphere and crashed into the red planet.
Since this disaster, special attention has been paid to the units used.
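One way to pay that attention in code is to carry the unit together with the number, so that a value in the wrong unit cannot be used by accident. The sketch below is an assumption-laden illustration: the `Impulse` type and the unit labels are hypothetical, and the choice of impulse in pound-force seconds versus newton-seconds is taken from the commonly reported account of the mishap, not from the text above.

```python
from dataclasses import dataclass

# One pound-force second expressed in newton-seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """A thruster impulse that always carries its unit (hypothetical type)."""
    value: float
    unit: str  # "N*s" (metric) or "lbf*s" (imperial)

    def to_newton_seconds(self) -> float:
        """Return the impulse in newton-seconds, converting if necessary."""
        if self.unit == "N*s":
            return self.value
        if self.unit == "lbf*s":
            return self.value * LBF_S_TO_N_S
        raise ValueError(f"unknown impulse unit: {self.unit!r}")


if __name__ == "__main__":
    reported = Impulse(10.0, "lbf*s")    # value produced in imperial units
    print(reported.to_newton_seconds())  # value the metric consumer needs
```

Reading the raw number 10.0 as newton-seconds would understate the impulse by a factor of about 4.45, which is the kind of gap a test against the specification is meant to catch.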
The Therac-25 medical radiation therapy device was involved in several cases in 1985–87 where massive overdoses of radiation were administered to patients, a side effect of the buggy software powering the device. A number of patients received up to 100 times the intended dose, and at least three died as a direct result.
The reason was a subtle bug called a race condition. As a result of this bug, a technician could accidentally configure the Therac-25 so that the electron beam would fire in high-power mode without the proper patient shielding.
An interesting observation for software testers: the condition that caused the failure was the result of nonstandard user input on the keyboard. In other words, an average black-box tester running a set of ad hoc cases would likely have found the bug and saved lives.
The failure of the Therac-25 is a standard case study in the importance of testing and of careful handling of safety-critical systems.
The sad news is that something similar happened in Panama in 2000.
On August 14, 2003, the biggest power crisis in American history was actually initiated in the bowels of a Unix-based XA/21 energy-management system.
Deep in the four million lines of C code running the system, there was a race condition bug.
Race conditions occur when two separate threads of one operation rely on the same shared element without coordination. If access isn’t properly synchronized, the threads get themselves into a self-perpetuating tangle and crash the entire system. On this occasion, data feeds from several network monitors created a “perfect storm” for the race condition and, in a matter of seconds, the incoming data overwhelmed a system that was supposed to alert controllers to problems in the electricity grid.
With the alarm system down, the doughnut-munching controllers remained unaware of relatively minor network events that soon spiraled out of control because they weren’t quickly resolved. Unprocessed events queued up and the primary server failed within 30 minutes, switching all operations to the backup server, which itself failed minutes later.
Oblivious to the impending nightmare, observers did nothing when a power line tripped out after making contact with an unkempt tree, which forced more power onto another overhead power line, causing that one to sag and trip out too. Within an hour, power lines and circuit breakers were tripping left, right and center as a power surge cascaded across the north-eastern states.
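The race condition mechanism described above can be illustrated with a small sketch. It has nothing to do with the actual energy-management code; all names are hypothetical. The first function replays, step by step, one unlucky interleaving of two unsynchronized updates, and the class below it shows the usual remedy of guarding the shared value with a lock.

```python
import threading


def lost_update_demo() -> int:
    """Replay one unlucky interleaving of two unsynchronized updates."""
    shared = {"pending_alarms": 0}
    # Both workers read the shared value before either one writes it back.
    seen_by_a = shared["pending_alarms"]
    seen_by_b = shared["pending_alarms"]
    shared["pending_alarms"] = seen_by_a + 1
    shared["pending_alarms"] = seen_by_b + 1  # overwrites worker A's update
    return shared["pending_alarms"]           # one of the two alarms is lost


class AlarmCounter:
    """A counter whose read-modify-write step is protected by a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.value = 0

    def record(self) -> None:
        with self._lock:  # only one thread at a time may update the value
            self.value += 1


def run_locked(workers: int = 4, events_per_worker: int = 1000) -> int:
    """Record events from several threads and return the final count."""
    counter = AlarmCounter()

    def work() -> None:
        for _ in range(events_per_worker):
            counter.record()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


if __name__ == "__main__":
    print(lost_update_demo())  # two updates were made, only one survived
    print(run_locked())        # every event is counted
```

Real race conditions are hard to test precisely because the unlucky interleaving happens only occasionally; replaying it by hand, as in the first function, makes the lost update visible every time.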
With the 2012 Apple iOS 6 update, the company decided to kick the superior Google Maps platform to the curb in favor of its own system.
Unfortunately, it did a poor job of mapping out locations, resulting in one of the most epic fails of the mobile computing movement.
The software was missing entries for entire towns, placed locations incorrectly, returned wrong locations for simple queries, showed satellite imagery obscured by clouds, and more.
Black Monday was the 1987 collapse on Wall Street in which 500 billion US dollars were lost in a single day.
The date was 19 October 1987, when the Dow Jones Industrial Average plummeted 508 points, losing 22.6% of its total value. The S&P 500 dropped by 20.4%.
This was the greatest loss Wall Street had ever suffered in a single day.
The cause?
A long bull market was halted by a rash of SEC (Securities and Exchange Commission, a US government body) investigations into insider trading and by other market forces.
As investors fled stocks in a mass exodus, computer trading programs generated a flood of sell orders, overwhelming the market, crashing the systems and leaving investors effectively blind.
In August 2012, Knight Capital Group, a market-making firm with a stellar reputation in its industry, blew all of that in just 30 minutes.
Between 9:30 a.m. and 10:00 a.m. EST on August 1st, the company’s trading algorithms got a little buggy and decided to buy high and sell low on 150 different stocks.
By the time this was stopped, Knight Capital had lost 440 million US dollars on trades.
By comparison, its market cap was just 296 million US dollars, and the loss was four times its 2011 net income.
Furthermore, the company’s stock price dropped 62% in just one day.
The reason for the fault was inadequate testing of new software.
The trading algorithm of the company was also a bit eccentric. On every stock exchange, there is a “bid” and an “ask” price. The bid price is the highest price a buyer is currently willing to pay for a share. The ask price is the lowest price a seller is currently willing to accept for it. There’s always a spread between the two prices, with the “ask” price being a few cents or more above the “bid”. If the stock is thinly traded, the spread between the ask and the bid is wider than what you’d see for a heavily traded stock.
Knight Capital’s software went out and bought at the “market”, meaning it paid the ask price, and then sold at the bid price, instantly. Over and over again.
One of the stocks the program was trading, electric utility Exelon, had a bid/ask spread of 15 cents. Knight Capital was trading blocks of Exelon common stock at a rate as high as 40 trades per second, and taking a 15-cent-per-share loss on each round-trip transaction.
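The arithmetic behind that loss is easy to sketch. Only the 15-cent spread comes from the text above; the quoted prices, the block size of 100 shares and the number of round trips are assumptions chosen purely for illustration.

```python
from decimal import Decimal


def round_trip_loss(ask: str, bid: str, shares: int) -> Decimal:
    """Loss from buying `shares` at the ask and selling them at once at the bid."""
    return (Decimal(ask) - Decimal(bid)) * shares


def total_loss(ask: str, bid: str, shares: int, round_trips: int) -> Decimal:
    """Loss accumulated over a number of identical round trips."""
    return round_trip_loss(ask, bid, shares) * round_trips


if __name__ == "__main__":
    # Hypothetical quotes with a 15-cent spread and an assumed block of 100 shares.
    print(round_trip_loss("37.15", "37.00", 100))   # loss on one round trip
    print(total_loss("37.15", "37.00", 100, 1000))  # loss on 1000 round trips
```

Each round trip loses exactly the spread multiplied by the number of shares, so a program that repeats the pattern many times per second turns a few cents per share into a very large sum within minutes.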
Thousands of people claimed the prizes, but the company refused to pay all of them, stating that the caps did not contain the proper security codes;
The prize was P1 million, or around 37 000 US dollars, per winner;
Pepsi officials claimed that a computer glitch had picked the number by mistake, but those who held bottle caps with the winning number argued that the mistake was not theirs and that the soft drinks company should be ordered to pay them;
The Supreme Court ruled that “the issues surrounding the 349 incident have been laid to rest and must no longer be disturbed in this decision”;
Pepsi faced more than 1000 criminal and civil suits, but most of these cases were dismissed;
Pepsi lawyer Alexander Poblador said that the company spent more than P200 million paying close to 500 000 non-winning claimants as a goodwill gesture, but still suffered losses when angry claimants burnt 37 company trucks.
A network failure affected the Amadeus online booking platform in September 2017.
Amadeus provides an airline check-in system; its failure led to delays at airports around the world and booking problems for passengers. The problem occurred on Thursday, 28 September 2017.
The airports in Frankfurt, Paris and Zurich reported problems with their computer systems which, happily, were quickly resolved.
London’s Gatwick Airport blamed Altea, a passenger management system that underpins booking systems and is made by the Spanish software company Amadeus.
The most affected airlines were:
British Airways;
Deutsche Lufthansa;
Cathay Pacific Airways Ltd.;
Qantas Airways Ltd.;
KLM, which delayed at least 24 departures.
Some of the airlines said they were affected by the issue for only a minute, although it took Amadeus three and a half hours to fix the problem.
The result?
More than 125 airlines affected by the issue;
Passengers struggling to check in due to the failure;
Delays at airports including Heathrow, Gatwick, Changi (Singapore), Charles de Gaulle (Paris) and Frankfurt;
Problems with some online check-ins as well.
In November 1994, the New York Times reported a rather embarrassing issue with the Intel Pentium chips that had affected a variety of PCs.
A number of chips were flawed in a way that prevented them from accurately handling long divisions: a mistake unnoticeable to the common PC user, but a big deal for scientists and engineers, who required precise calculations in their work.
Intel’s reaction was to refuse to recall the chips, stating that the flaw would not affect that many people.
Those who needed precise division were forced to “prove” why it was important to them.
The problem lay in a faulty math coprocessor, also known as a floating-point unit. It was not really a big deal for the regular user: there was a 1-in-360-billion chance that a miscalculation would reach as high as the fourth decimal place. More likely, with odds of 9 billion to 1 against, any error would appear in the 9th or 10th decimal digit.
One person who needed that accuracy was the Virginia-based professor Thomas Nicely.
He informed Intel in October 1994, then informed others when Intel replied with a response a little less tactful than “Oh, that thing? We noticed in June.”
Thus, the news exploded, and the PR mess turned into a costly mop-up bill.
In January 1995, Intel announced a pretax charge of 475 million US dollars against earnings, most of which apparently stemmed from replacing flawed processors.
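A simple arithmetic check shows how such a division flaw can be detected. The operands and the flawed quotient below are the widely reported test case for this bug, not figures from the text above, and the function names are hypothetical. The idea is that dividing and then multiplying back should return the original number, so any large residual points to a faulty division.

```python
# Widely reported test operands for the Pentium division flaw (assumption).
X, Y = 4195835.0, 3145727.0
# Quotient reportedly returned by a flawed chip for X / Y (assumption).
FLAWED_QUOTIENT = 1.333739068902037589


def division_residual(x: float, y: float) -> float:
    """Divide, multiply back, and return what is left over (ideally zero)."""
    return x - (x / y) * y


def residual_from_quotient(x: float, y: float, quotient: float) -> float:
    """Same check, but for a quotient produced elsewhere."""
    return x - quotient * y


if __name__ == "__main__":
    print(division_residual(X, Y))                        # correct hardware
    print(residual_from_quotient(X, Y, FLAWED_QUOTIENT))  # flawed quotient
```

On correctly working hardware the first residual is negligible, while the reported flawed quotient leaves a residual of roughly 256, an error far too large to blame on ordinary rounding.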
The shifted role of QA in the DevOps organizational structure has several implications:
QA owns continuous improvement and quality tracking across the entire development cycle. QA is primarily responsible for finding problems, not only in the product but also in the process, and for recommending changes and improvements wherever possible.
Tests are code; that is a necessity. If you are to release every day or week, there is no place for manual testing. You must develop automation, through code, that ensures quality standards are maintained.
Automation rules. Anything that can be automated should be automated.
Testers become the quality advocates, influencing both development and operational processes. They don’t just find bugs. They look for any opportunity to improve repeatability and predictability.
There is no way that DevOps will kill QA, but it will definitely transform it. And some of the new roles are:
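As a minimal sketch of what "tests are code" can look like, the example below puts a small function and its automated checks side by side, using Python's standard unittest module. The function `failure_rate` and the test names are hypothetical; in practice a suite like this would be run automatically on every change rather than by hand.

```python
import unittest


def failure_rate(failed: int, total: int) -> float:
    """Return the share of failed runs as a fraction between 0 and 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= failed <= total:
        raise ValueError("failed must be between 0 and total")
    return failed / total


class FailureRateTests(unittest.TestCase):
    def test_typical_value(self):
        self.assertAlmostEqual(failure_rate(1, 4), 0.25)

    def test_rejects_empty_run(self):
        with self.assertRaises(ValueError):
            failure_rate(0, 0)

    def test_rejects_impossible_count(self):
        with self.assertRaises(ValueError):
            failure_rate(5, 4)


def run_suite() -> unittest.TestResult:
    """Load and run the tests the way an automated pipeline step might."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FailureRateTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    result = run_suite()
    print("all tests passed" if result.wasSuccessful() else "tests failed")
```

Because the checks are ordinary code, they can be versioned, reviewed and repeated on every release, which is what makes daily or weekly releases feasible without manual test passes.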