The supporting paper to my tutorial at RAMS 2011 and 2013, looking at the key features that make a great (or poor) reliability program.
Tutorial on Effective Reliability Program Traits and Management
Effective Reliability Program Traits and Management
Fred Schenkelberg
Senior Reliability Consultant
Ops A La Carte, LLC
990 Richard Ave, Suite 101
Santa Clara, CA 95050
fms@opsalacarte.com
Schenkelberg: page i
SUMMARY & PURPOSE
The purpose of this tutorial is to highlight key traits for the effective management of a reliability program.
The basic premise is that no single list of reliability activities will work for every product. Every product
development and production team faces a different history, different constraints, and a different set of
variables and uncertainties, so what worked for the last program may or may not be appropriate for the
current project. There are a handful of key traits that separate the valuable programs from the merely busy
programs. These traits and the underlying structure can provide a framework to create a cost-effective and
efficient reliability program.
Fred Schenkelberg
Fred Schenkelberg is a reliability engineering and management consultant with Ops A La Carte, with areas
of focus including reliability engineering management training and accelerated life testing. Previously, he
co-founded and built the HP corporate reliability program, including consulting on a broad range of HP
products. He is a lecturer with the University of Maryland, teaching a graduate-level course on reliability
engineering management. He earned a master of science degree in statistics at Stanford University in 1996.
He earned his bachelor's degree in physics at the United States Military Academy in 1983. Fred is an active
volunteer: he is the Executive Producer of the American Society for Quality Reliability Division webinar
program, serves on IEEE reliability standards development teams, and was previously a voting member of
the IEC TAG 56 - Durability. He is a Senior Member of ASQ and IEEE. He is an ASQ Certified Quality and
Reliability Engineer.
Table of Contents
1. Introduction
2. Basic Structure
3. Reliability Goals
4. Apportionment
5. Feedback Mechanism
6. Determining Value
7. Maturity Model
8. Conclusions
9. References
10. Tutorial Visuals
1. INTRODUCTION
A product’s design, supply chain and assembly process in large part establish the product’s reliability performance. A product well suited for the use application will meet or exceed the customer’s durability expectations. The myriad decisions made by the entire design and production team create the eventual product reliability performance. The structure for these decisions is the focus of this tutorial.
Considering that each activity of a design team takes resources such as time and money to accomplish, focusing the use of these resources on activities of high value is a common strategy. Including product reliability in the value proposition permits the entire team to weigh the importance of product reliability and the appropriate use of tools to accomplish both the business and product reliability objectives.
The basic premise of this tutorial is that no one set of reliability activities is appropriate for every product development situation. Selecting and integrating the best tools permits the execution of an effective and efficient reliability program.
The traits of very good reliability programs and examples of very poor practices in this tutorial serve to illustrate how to approach establishing an effective reliability program. Highlighting the basic structure along with guidelines on how to tailor a reliability program will permit the repeatable creation of reliable products.

Acronyms and Notation
ALT Accelerated Life Test
CAD Computer Aided Design
FMEA Failure Modes and Effects Analysis
HALT Highly Accelerated Life Test
LED Light Emitting Diode
MTBF Mean Time Between Failure
PoF Physics of Failure
SPICE Simulation Program with Integrated Circuit Emphasis

2. BASIC STRUCTURE
A product reliability program is a process. Like any process it has inputs and outputs, plus generally some form of an objective and feedback. Furthermore, the process may or may not be controlled or even a conscious part of the organization. Reliability may just happen, good or bad. Results may or may not be known or understood.
In some organizations, the reliability program may be highly structured with required activities at each stage along the product lifecycle. In other organizations, reliability is considered as a set of tests (e.g. environmental or safety compliance). And, in some organizations, reliability is effectively a part of everyone’s role.
In each example above, the resulting product reliability may meet the customer’s expectations or not. There isn’t a single process that will always work.
Going back to the basic notion of a simple process, consider the objective for a moment. For a reliability program one may desire a specific outcome of a reliable product. The process then should promote activities leading to the creation of a reliable product. This brings up the question of what is a ‘reliable product’?
The objective or goal provides the direction and guidance for the reliability program. Clearly stating the reliability goal is a key trait of very effective programs. Leaving the goal unstated or vaguely understood may lead to one or more of the following:
• High field failure rate
• Product recall
• Over designed and expensive product
• Design team priority confusion
Another element of a process is feedback. This occurs within the process as part of the creation of the output, and it most certainly exists externally based on the output or process results.
The final result for product reliability is the customer acceptance or rejection of the product. If the product functions longer than expected, like an HP calculator, the product is considered a ‘good value’. If the product fails quickly or often, especially compared to other products providing the same solution, it is considered of ‘poor value’.
In some organizations the feedback is non-existent, in others it is captured within a warranty claims system, in others within service or repair programs. Customers may complain directly with returned products and demands for replacements, or indirectly by simply not purchasing the product in the future.
The feedback within the reliability program attempts to anticipate the customer’s feedback prior to the delivery of the product to the customer. Depending on the product and the organization, this feedback may be very formally determined, highly structured and very accurate. Or, the feedback may be random, haphazard and inaccurate. Both types of feedback may be suitable, again depending on the product and organization.
Establishing the appropriate set of feedback mechanisms within a reliability program is done within the context of the product reliability goals and the value to the organization of the feedback. The process benefits from feedback that is timely and accurate enough to make decisions. It is those decisions that lead to the product’s reliability in the hands of the customers.
Therefore the basic structure for any reliability program is to clearly establish and state the reliability goal, then determine the appropriate set of feedback mechanisms that provide timely information to permit design and production decisions. The ‘how’ to decide the ‘appropriate set’ is the subject of this tutorial.

3. RELIABILITY GOALS
The target, objective, mission or goal is the statement that provides the design team focus and direction. A well stated goal will establish the business connection to the technical decisions related to the product durability expectations. A well stated goal provides clarity across the organization and permits a common language for discussing design, supply chain and manufacturing decisions.
Let’s explore the definition of a ‘well stated reliability goal’. First, it is not simply MTBF, “as good as or better
than…”, or ‘a 5 year product’. These are common ‘goals’ found across many industries, yet none permit a clear technical understanding of the durability expectations for the product.
The common definition for reliability is

Reliability is … the ability or capability of the product to perform the specified function in the designated environment for a minimum length of time or minimum number of cycles or events. (Ireson, Coombs et al. 1995)

Note this definition has four elements:
• Function
• Environment
• Duration
• Probability

3.1 Function
The function is what the product is to do or perform. For example, an emergency room ventilator is to provide assisted breathing for a person. This requires the ventilator to produce breathable air within a range of pressures within a prescribed cycle of respiration. It may include requirements for filtering, temperature, and adjustments to pressure and timing of the cycle, etc. Often, a product development team either develops or is given a detailed set of functional requirements.
Often the functional elements of a product are directly measurable. And, the quality function of most organizations verifies that the design and production units meet the functional requirements. When the product does not meet the functional requirements, it is considered a product failure. Within the function definition, which are the most important functions? Which must not fail? Which are functions that, if they fail, may simply degrade performance, if noticed by the customer at all?

3.2 Environment
The environment could be considered the weather around the product when in use. ‘Weather’ includes temperature, humidity, UV radiation intensity, etc. It should also include environmental factors that provide destructive stresses, such as vibration, moisture, corrosive gases, voltage transients, and many more.
Another element of the environment is the use of the product. What is the use profile? Is it once a day for a few minutes, like a remote control for the stereo system? Or is it a 24/7 operation, such as a server system processing transactions for a major online store? The profile may include details concerning human interactions, operating modes, shipping, storage, and installation. The environmental conditions need to detail how the product responds or degrades to the set of stresses the product encounters. The environmental conditions focus on drivers for the product’s most likely failure mechanisms.

3.3 Duration
The duration is the amount of time or number of cycles the product is expected to function. A computer printer may be expected to print for five years. A washing machine is expected to wash clothes for 10 years. An implanted hearing aid is expected to last the life of the patient; if the patient is a child this expectation may be more than 70 years.
The duration expectations may be defined by contract, market expectations, or by a business decision. The duration or life expectancy most likely is not the warranty period. For example, many personal computers have a 3-month or 1-year warranty period. Yet, the product is expected to last at least two years or more with normal use.
Many products have multiple durations that are of interest.
• Out of box
• Warranty
• Design Life
The initial, out-of-box, or installation period is that duration when the customer is first setting up and using the product. Brand visibility is at the highest and the expectation that a new product will function as expected is very high. The types of failures that may occur include installation or configuration errors, mistaken purchase, shipping or installation damage, or simply buyer error. All of these ‘failures’ cost the company producing the product resources.
The warranty period is the duration associated with the producer’s promise to provide a product free of defects for a stated period of time. For example, a computer may have a 1-year warranty period. During this one year, if the product fails (usually limited to normal use and operating environment) the producer will repair or replace the product. Naturally this will cost the producer resources.
The design life is the business or market expected product duration of functional use. After the warranty period there isn’t an expectation for the producer to replace or repair the product, yet the customer may have a reasonable expectation that the product will function satisfactorily over the design life duration. For example, many cell phones have a 3-month warranty, yet as consumers we have an expectation that the phone will function for two years or more.
Marketing or senior management may set the design life. They may want to establish a market position for the product related to reliability. One way is to design a very robust product with a long design life duration. HP calculators often have only a 3-month or 1-year warranty, yet many have lasted 10 or more years. These calculators are known for their robustness and often cost more to purchase: a reliability premium.
Each of the three durations often involves different risks related to the failure mechanisms. It is rare for bearings to wear out in the first 30 days, yet more likely for a 10-year design life. Establishing three or more durations within the product reliability goal permits the design team to focus on and address the full range of product reliability risks.

3.4 Probability
The probability is the likelihood of the product surviving over a specified period of time. In the formal reliability definition above, the phrase ‘ability or capability’ refers to the probability. This is the statistical part of the reliability goal and without it the goal is fairly meaningless. Furthermore, stating a probability without an associated duration and distribution is also meaningless in most cases.
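One way to keep the four elements together is to record the goal as a small data structure. The following is only a sketch: the class names, the printer description, and every duration and probability value are hypothetical and chosen for illustration, not taken from any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DurationTarget:
    """One duration of interest paired with its survival probability."""
    name: str           # e.g. "warranty"
    years: float        # duration in years (cycles would work equally well)
    reliability: float  # probability of surviving the duration, 0 < R < 1

    def __post_init__(self):
        if self.years <= 0:
            raise ValueError(f"{self.name}: duration must be positive")
        if not 0.0 < self.reliability < 1.0:
            raise ValueError(f"{self.name}: reliability must be between 0 and 1")


@dataclass(frozen=True)
class ReliabilityGoal:
    """Function, environment, and one or more duration/probability targets."""
    function: str
    environment: str
    targets: tuple  # tuple of DurationTarget

    def is_consistent(self) -> bool:
        """A longer duration cannot carry a higher survival probability."""
        ordered = sorted(self.targets, key=lambda t: t.years)
        return all(a.reliability >= b.reliability
                   for a, b in zip(ordered, ordered[1:]))


# Hypothetical goal for an office printer (all values are assumptions).
goal = ReliabilityGoal(
    function="print 20 pages per day at rated print quality",
    environment="indoor office, 10 to 35 degrees C, 8 hours powered per day",
    targets=(
        DurationTarget("out of box", 30 / 365, 0.995),
        DurationTarget("warranty", 1.0, 0.99),
        DurationTarget("design life", 5.0, 0.95),
    ),
)
```

Writing the goal this way makes a missing element obvious: a target cannot be created without both a duration and a probability.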
What is the chance that a particular product will function as expected over the entire expected design life? How many of the installed units will be functional over the warranty period? Since each product and the associated environmental stress vary, the use of statistics is unavoidable in describing product reliability. Even the definition of a product failure may vary by customer.
While there are many common terms to convey the probability of survival, the use of a percentage surviving is the easiest understood and most easily applied across an organization. Stating that 95% of units are expected to survive over the 5-year design life means 95 out of 100 units will function properly over the 5-year period. Or, that a single product has a nineteen in twenty chance (95%) of surviving 5 years. A similar statement is that not more than 5% of products fail over the full five years. This may also be stated as a failure rate of roughly 1% per year.
A common probability statement is the inverse of the failure rate, or MTBF. Assuming a constant failure rate, so that R(t) = exp(-t/θ), the 95% reliability over 5 years (t) becomes approximately 100 years MTBF (θ). This does not mean the product will last 100 years; it does mean that 95% of the products are expected to last 5 years.
Finally, stating a separate failure probability for each duration of interest provides a set of duration/probability couplets that permit different focus for early or out-of-box failure risks versus the longer term failure risks.
Suppose the product has a specific mission time, say an aircraft with an expected 12-hour mission over a 20-year serviceable life. The probability of success for the 12-hour mission may be set relatively high. And, it may have a conditional probability considering the number of missions since the last major service. Some products have availability goals and undergo routine maintenance or repair. These products and many complex systems require additional complexity in their goal setting. For the purpose of this discussion, we are considering simple products that are not normally repaired, or products where the main interest is in the time to the first failure.
The point is that setting the reliability goal for a product is not as simple as stating a ‘five year life’. It requires a clear statement with sufficient detail of each of the four elements: function, environment, duration, and probability. And, it may and often should include at least three duration/probability couplets. The goal establishes the direction or target for the entire design, supply chain and manufacturing team.

4. APPORTIONMENT
The system or product level reliability goal is not sufficient by itself. Ideally, every component or assembly step that has a possible impact on the final product reliability should have an established reliability goal. Each individual element should have goals that are tailored to that specific element. For example, a cooling fan that only operates when the internal temperature reaches a defined value has a different use profile than the entire system. The function and environment are different for the specific fan than for the system. The computer provides a platform for computer programs to operate along with a user interface, whereas the fan provides cooling. Many of the environmental factors for the computer also impact the fan, yet not everything applies and some will require modification. Sophisticated models include apportioned goals addressing many functions, several use profiles and several environments, different durations, and conditional probabilities. Simple models work to get started; as more details become available concerning the design and use, sophisticated models are increasingly useful.
The duration may also require modification. The durations are most often the same as at the system level and may require modification if the various components or subsystems are only employed during specific phases of the product’s use, e.g. an installation and configuration aid.
The probability will require modification unless the product has no component or subsystem elements. This is rare except for raw materials. Even a simple discrete resistor has multiple components that may have different failure mechanisms. For example, the resistive element and the soldering leads have different functional descriptions, are made of different materials and experience different sets of stresses that lead to failure. The probability of failure is not the same as for the system.
Another way to look at the probability differences breaks down the system probability of success to each element within the product. Consider a simple system with two primary means to fail (say the resistor with the resistive and connection elements as an example for discussion), where the system has a 90% probability of successfully functioning over 20 years. If both of the elements also have a 90% probability, and a failure of either the resistive or connection element causes a system failure, then either the system or subsystem goals are misstated, since 0.90 × 0.90 = 0.81. As you already know, for a simple series system the probability of success for the subsystems has to be larger such that when they are multiplied together the result meets or exceeds the system goal.
There are excellent references for basic reliability modeling and many papers and forums to discuss even the most complex systems. The intention is to apportion the system reliability goal, especially the probability value, to all major elements of the product.

4.1 Establishing the probability apportionment
The time to establish the reliability apportionment is early in the project. Depending on the project and the known values from field data, vendors, previous projects, etc., the apportionment may be well founded on data, or simply a guess. Both are valuable.
Consider a simple example of a computer system with five major subsystems: motherboard, disk drive, monitor, power supply and keyboard. Of course there are other elements, yet for this example we are limiting the list to these five.
Suppose this is our first product and little is known about the reliability of any of these components (for example, when designing the first personal computers in the 1980s). Further, let’s assume the system goal is 95% reliable over a 5-year period for the design life. Having no other information, a straight-line apportionment is as good a starting place as any. Therefore, each of the five subsystems receives an apportionment goal of 99% reliable over 5 years. Also, the functional and environmental elements receive attention to adjust to each subsystem’s particular requirements.
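The figures quoted in this section can be checked with a few lines of arithmetic. The sketch below is not part of the original tutorial; the function names are hypothetical, and the MTBF conversion assumes a constant failure rate (exponential model), which the text implies but does not state.

```python
import math


def mtbf_from_reliability(reliability: float, duration: float) -> float:
    """MTBF (theta) implied by R(t) = exp(-t / theta), a constant failure rate."""
    return -duration / math.log(reliability)


def series_reliability(subsystems) -> float:
    """Reliability of a series system: every element must survive."""
    return math.prod(subsystems)


def equal_apportionment(system_goal: float, n: int) -> float:
    """Minimum equal reliability for each of n elements in series."""
    return system_goal ** (1.0 / n)


# 95% reliability over 5 years expressed as an MTBF: about 97.5 years.
theta = mtbf_from_reliability(0.95, 5.0)

# Two series elements at 90% each give 81%, short of a 90% system goal.
two_element = series_reliability([0.90, 0.90])

# Straight-line apportionment of a 95% goal over five subsystems: about 0.9898.
per_subsystem = equal_apportionment(0.95, 5)

# Rounding each subsystem up to 99% keeps the system just above 95%.
rounded_up = series_reliability([0.99] * 5)
```

The equal split is simply the n-th root of the system goal, which is why five subsystems at 99% each are just sufficient for a 95% system.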
At first, this simple method provides a starting point for the team’s discussion concerning reliability. It provides the basis for product design, part procurement, validation and verification testing, and the myriad cost/benefit tradeoff decisions required during the product lifecycle.
Over time, years of field data, vendor data and internal product testing continue to improve the understanding of each subsystem’s reliability. This understanding becomes the base for the initial apportionment estimates for a new product. Consider a new project for a personal computer where only the CPU and associated chipset are new. The overall apportionment model may start with the best available reliability values for all the subsystems and include an adjustment to the motherboard value considering the uncertainty or estimated value change regarding the new CPU chipset. The uncertainty is relatively low and the use within subtle design decisions is possible.

4.2 Adjusting the probability apportionment
Going back to the first personal computer design and the simple straight-line apportionment, a little common sense and feedback from vendors may provide additional information. The keyboard is most likely more reliable than the power supply, for example. Adjusting the goal for the power supply down, say to 98%, then requires an adjustment in one or more of the other subsystems such that the product remains at or above the system goal of 95% (for example, raising the other four subsystems to roughly 99.2% each). The same rule applies for any other series system of apportionment.
Another consideration for the apportionment adjustment is the cost/benefit tradeoff. For nearly any development project there is a limit to product cost, therefore simply purchasing the most expensive components, which may or may not be the most reliable, is not always an option. Back to the power supply example above. Let’s say the vendor of the initially selected power supply considers the use, environment and functional requirements and states that the power supply will have a 95% probability of success over 5 years. That is the same value as the overall system goal, and unless all the other subsystems are perfect (100% reliable over 5 years) the design team will not achieve the reliability objective.
A search reveals three alternative power supplies that will meet the functional requirements. One has a 97% reliability at a cost of $50, the second has a 98% reliability at a cost of $100 and the third has a 99% reliability at a cost of $250.
If product cost is not an issue (rarely the case), spend the $250 and achieve the apportioned objective. If it is possible to improve the reliability of other subsystems, say the monitor, for less cost, to offset the difference between the 99% goal and the 98% or 97% reliability associated with less expensive power supplies, then that would provide the highest reliability for the least cost. This is a simple illustration of the cost/benefit tradeoff; in practice these may become very complicated decisions.
An advanced practice is to establish reliability goals and associated apportionment for the various stage gates during the product lifecycle. With each successive round of design, prototyping, and analysis not only is the product improving, but the uncertainty is also diminishing. Using the lower limit for reliability estimates is one way to reflect the range of reliability uncertainty.
The primary intent of using reliability goals and apportionment is to permit meaningful decisions concerning reliability along with the ability to consider product cost and other important aspects of the design in a meaningful manner.

5. FEEDBACK MECHANISMS
There are two basic questions in reliability engineering. What is going to fail? And, when will the product fail? Both are related to failure mechanisms. The first may require the discovery of the failure mechanism. The second may require the determination of the expected behavior of the failure mechanism over time. Both questions have a wide range of tools available to find the answers. It is the selection of the right tools to provide a good enough answer in an effective and efficient manner that is the subject of this section.
Each engineer tends to design away from failure (Petroski 1994). And each engineer generally knows about the most likely failure mechanisms related to their section of the design, within the realm of their experience. They may gain additional experience as their design fails in unexpected (to them) ways. Part of the design process is to uncover failures and improve the design to avoid or lessen the probability of the same.
Tools such as FMEA and HALT permit the design team to discover failures. Often the FMEA session permits the design team to share the known or expected failure mechanisms. Occasionally, a new possible failure mode appears in these sessions. The real value is in improving the ability of the entire team to identify unknown failures and address the effects of the known expected failures. Each person on the FMEA team brings a set of known or expected failures to the discussion. The combined set increases the entire team’s awareness of the larger set of possible issues.
HALT, in the broadest sense, starts with the first product models or bench top testing. Exploring the reaction of the product to various stimulations is an exploration of where the product works by defining where it doesn’t work. The intention of HALT is to apply stresses relevant to the product’s environment (vibration, voltage, temperature, usage rates, etc.) and determine the boundary between functional and non-functional behavior. Careful root cause analysis then uncovers and explains the failures, enabling the design team to adjust the design to create a more robust product.
Common engineering tools also permit this discovery. Many CAD programs include basic finite element analysis capabilities. Adjusting material properties to reflect the effects of aging (e.g. oxidation of polymers making them more brittle) and performing a simple analysis may find aging weaknesses in the design. The same applies for SPICE models of circuits. Consider the expected drift of capacitor values over time and the continued functionality of the circuit.
If the product is new or contains new technology or assembly processes, the nature of the failures may not be well understood. FMEA, HALT and related discovery tools apply. If the project is to refine an existing product and there is ample internal and field data defining the areas for improvement, then the discovery tools do not add value.
The first question looks for what will fail. If the failures are
known or the various tools help determine what will fail, the
product reliability can be improved by addressing those aspects of the product that lead to the failures. One approach to product design is: build, test, fix, repeat. That is, find and fix the first element of a design to fail and the product improves. Continue to do so until there are no more failures or the design reaches the design limits of the materials (for example, the first failure occurs as the polymer case melts). The primary drawback to this approach is the inability to quantify the product reliability value concerning how many units will last how long. Understanding what will fail is critical to being able to answer the second question: when will it fail?
As the design team addresses the design issues, the second question enables them to know if they have achieved the product reliability goal. As with discovery tools, there are many tools available to determine how long a product will last. Predictions, accelerated life testing, and demonstration tests are all capable of providing an estimate of how long a product may last.
Deterministic models may also provide results. For example, the polymer diffusion rate permits air to accumulate within a tube, which at a critical air volume will block fluid flow. This process can be modeled and the time to failure calculated for different wall thicknesses and air pressures.
Field data is often the most accurate way to estimate actual field performance, although it is usually not available for new products or elements of new products.
To illustrate how to select the appropriate tools to provide feedback to the design team, let’s consider a few cases. Keep in mind that not all tools are appropriate for all situations.

5.1 The existing technology in new environment case
To illustrate the existing technology in new environment case, consider the initial design of an LED brake light. This is new technology with respect to the application of the LED to the car taillight environment. While LED lighting has been available in a range of applications for some time, the car taillight environment is harsher and more demanding than previous application environments. Consider simply the ambient temperature extremes, from overnight outdoors in Fargo, ND (-30°F) to direct sunlight exposure within an unventilated enclosure in the Tucson, AZ summer (180°F). Also, a new assembly process to attach dozens of LED elements to a brake light pattern frame in a high-volume mass-production assembly line will be required.
expected new environment. Tools such as ALT may apply. Thermal cycling for the solder joint attachments and high temperature exposure while illuminated to evaluate the luminosity degradation are two examples of what could be usefully tested.
The results of the discovery evaluations along with engineering judgments concerning the uncertainty of failure mechanism behaviors will prioritize the list of most likely failure mechanisms. This list then can be sorted by appropriate stress to design accelerated life tests. More than one failure mechanism may be accelerated due to the high temperature exposure, for instance.
The reliability program most likely will have goal setting, apportionment, initial reliability predictions based on literature and vendor data, prototype testing of various solder joint attachment methodologies, and product level accelerated life testing focused on 3 to 5 different stresses.
This approach takes advantage of existing knowledge concerning LED technology and previous explorations of failure mechanisms within LED technology and solder attachment methods. The approach also considers whether any new failure modes may appear in the new, harsher environment. The approach also considers the relatively low cost of the individual units and the ability to quickly measure the product performance by the use of ALT. The initial risk is high for the new environment, and if the LEDs actually last longer than twice the expected life of the incandescent systems the product provides a cost savings to the car manufacturer and owner.

5.2 The low volume high cost case
In comparison to the first case, consider a product that has a very limited production volume, say 50 total units. Plus, each unit is very expensive, say $1 million. Running 30 units each in three different ALTs to failure is not viable. Even getting one full unit for destructive HALT testing is not likely. Yet all the same unknowns as above or more may apply.
Consider an oil exploration sensor array unit that attaches to the drill string during drilling and has the function to monitor and report the presence of specific types of hydrocarbons. This is a complex system in a very harsh environment.
The list of what and how the product could fail is quite long. Given the constraint of no system level units for product testing, only a few of the tools from the first case apply: goal
There is no history, no previous products on the market setting, apportionment, prediction and FMEA. The FMEA is
using LED’s in anything like the brake light environment. a discovery tool and will not provide the necessary feedback
What could possibly go wrong? The design team doesn’t on the product’s expected durability. Thus the onus is on
know what could go wrong. Therefore, the appropriate set of performing accurate predictions.
tools should first discover the most likely failure mechanisms. In this case, the use of Physics of Failure (PoF) modeling
FMEA and HALT both apply, for example. Both of these may be the most valuable tool available. Understanding the
tools can build on what is already known about LED relationship between the expected stresses and the component
operations and known failure mechanisms. The new level responses over time, permits the PoF models to predict
environment may accelerate some little known failure the system life. The development of the PoF models related to
mechanisms, or it may simply accelerate already well known the critical component failure mechanisms may take
mechanisms. significant work, yet the option to test multiple units is not
Once the failure mechanisms are known the requirement for viable. Therefore, the analytical and theoretical work permits
the new brake lights is to last twice as long as the current the team to receive feedback on the expected product
incandescent systems, or a 95% probability of lasting 10 weaknesses and expected life limiting failure mechanisms.
years. Simply finding the failures, surprising failure modes or Even determining the critical component failure mechanisms
not, permits the evaluation of how long they will last in the may be difficult.
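The step from component level models to a system life estimate can be sketched in a few lines. This is only an illustration under stated assumptions, not a method taken from the tutorial: it assumes each critical failure mechanism's PoF model has already been reduced to a Weibull life distribution, that the mechanisms are independent, and that the unit fails when the first mechanism fails (a series model). The mechanism names, parameter values, and mission time are all hypothetical.

```python
import math


def weibull_reliability(t_hours, eta_hours, beta):
    """Probability that a single failure mechanism has not occurred by t_hours."""
    return math.exp(-((t_hours / eta_hours) ** beta))


def series_system_reliability(t_hours, mechanisms):
    """Series roll-up: the unit survives only if every mechanism survives.

    Assumes the mechanisms act independently of one another.
    """
    reliability = 1.0
    for eta_hours, beta in mechanisms.values():
        reliability *= weibull_reliability(t_hours, eta_hours, beta)
    return reliability


# Hypothetical (characteristic life in hours, shape) pairs, standing in for
# the output of component level PoF models at the expected stress levels.
mechanisms = {
    "solder joint fatigue": (20000.0, 2.5),
    "seal degradation": (15000.0, 3.0),
    "connector fretting": (40000.0, 1.5),
}

mission_hours = 2000.0  # hypothetical mission duration

for name, (eta_hours, beta) in mechanisms.items():
    r = weibull_reliability(mission_hours, eta_hours, beta)
    print(f"{name}: R({mission_hours:.0f} h) = {r:.4f}")

system_r = series_system_reliability(mission_hours, mechanisms)
print(f"system: R({mission_hours:.0f} h) = {system_r:.4f}")
```

Listing the per-mechanism values next to the system value shows which mechanism limits life, which is the kind of feedback the design team needs when system level testing is not an option.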
This approach takes advantage of the existing literature detailing failure mechanisms for a wide range of components, plus the ability to evaluate individual components at much less expense than the full system. The approach has more risk in the identification of the unknown failure mechanisms related to the full system configuration and use, yet careful use of tools like FMEA and reliability modeling permits the team to mitigate this risk to some extent.

5.3 The moderate volume product family variation case

A common case is the modification of an existing product. There is field data, the previous product testing information is available, and the list of known failure mechanisms is well defined. Furthermore, the product functions, intended environment and use profile remain basically the same.
In some regards, this is more difficult than the previous two cases. One approach would be to only test the new product with respect to the changes, and possibly only evaluate the individual new components, with the justification that nothing else has changed.
The second possible approach is to repeat all of the evaluations and testing as done for the original product. Here the justification may be that the relatively minor changes may adversely impact existing elements of the product. Or, worse, the justification could be a ‘we always do the full set of testing’ mentality.
Both approaches have risks and costs that can and should be mitigated. Using the existing reliability models and best available data, the design team can isolate the changes and assign a range of predicted values to the new component reliability. In conjunction with that, they can perform a very focused Design FMEA on the changes with an emphasis on how the changes impact any other element within the design.
At this point, the design team can decide if the uncertainty concerning either the interaction effects or the expected life warrants further testing. If no value within the range of reliability uncertainty would preclude the product launch, then clearly no further testing is needed and the current prediction is sufficient. If the low end of the range, on the other hand, would require further reliability improvements, or if the impact of the changes on other aspects of the product is unclear, then further analysis and testing will be needed.
The appropriate approach considers what is known and unknown, and the associated risks and decision points. The intent is to provide both guidance and feedback to the design team that permits well informed decisions. Using too few or too many reliability tools may incur undue risks or costs. The well crafted reliability program carefully considers how each reliability activity provides feedback toward answering the two primary questions: what will fail and when will the product fail. The intent is to add value to the product.

6. DETERMINING VALUE

One way to select reliability tools for improving the product’s reliability is to consider the return on investment. If the activity will not reduce risk, increase durability, reduce engineering time, eliminate failure mechanisms, etc., then the activity should not occur.
Consider a simple example. Recall the computer with five subsystems from the apportionment discussion above. The initial goal for each subsystem was 99% over 5 years. Suppose after the first round of predictions we find the keyboard has an expected reliability of 99.9% over 5 years. Furthermore, let’s assume the remaining four subsystems all meet their goal at 99%. And, it is possible to improve the reliability of the keyboard to 99.99% by spending $1 more per keyboard. And, let’s assume it will cost the company $1,000 per field failure for any cause. And, we expect to build and sell 100,000 computers.
For the current keyboard, we expect 100,000 - (0.999 * 100,000) = 100 keyboard failures. These will cost the company 100 * $1,000 = $100,000.
For the new keyboard the added cost will be 100,000 * $1 = $100,000. The savings will be due to reducing the field failures, from 100 failures to 100,000 - (0.9999 * 100,000) = 10 failures. The new keyboard’s ten failures cost 10 * $1,000 = $10,000. This is down from $100,000, for a savings of $90,000.
For a savings of approximately $90,000 we spent $100,000, which may make it difficult to justify the change. The calculations might be more favorable if, for the same cost of change, a difference in reliability from 99% to 99.5% could be made in the power supply. For any proposed change that impacts the reliability apportionment model, the above calculation quickly illustrates the value.
Yet, not all of the reliability tools directly increase or decrease the expected reliability. In some cases, the tool might only shorten the time to detect the failure mechanisms. HALT is an example of this, and it often finds most of the failure mechanisms in a design within a week, which would normally take months of standards based environmental testing to uncover. The savings in time-to-market risk more than justify the necessity of making multiple trips to the HALT chamber.
Another cost saving is the reduction in uncertainty. By simply improving the accuracy of reliability predictions, the range of the estimated reliability diminishes. Once the range no longer crosses a decision threshold to either conduct further analysis or testing, the project resources can focus improvement efforts on other high uncertainty or low reliability elements.

7. MATURITY MODEL

The state of the organization is also important. A design team that has no experience or expertise in statistical methods will probably flounder when trying to use an event-conditional based reliability block diagram that requires advanced statistical modeling. Getting this team to simply use a Weibull cumulative distribution plot may be a stretch, yet provide more value initially.
Each organization has a set of skills, expectations, structures, etc. that defines the culture concerning product reliability. Designing and applying reliability tools that will make an impact within the organization should fit within or be close to the organization’s current capabilities. The tools will only have impact and be useful if they are understood and make the current situation better. For example, a team that is consistently surprised by field failure modes may immediately benefit by conducting HALT testing to discover failure mechanisms before their customers do so for them.
Phil Crosby in his book Quality is Free (Crosby 1979) created a maturity matrix focused on quality. With slight
modification, by substituting reliability for quality, the same basic table is meaningful for the assessment of an organization’s reliability program.
The primary difference that separates an effective reliability program from a non-effective one is the proactive nature of the program. On one occasion I conducted assessments of two organizations located in the same building. Both designed and manufactured telecommunication equipment with similar complexity and volume. The interview schedule had me going up and down stairs almost every hour for two days, and by midday of the first day I enjoyed going upstairs and dreaded heading down. Despite all the product and business similarities, the two reliability programs were dramatically different; as different as their reliability results.
Downstairs the interviews started late and got interrupted by urgent phone calls or in-person requests; firefighting at its best. The team employed a wide range of tools, all that were listed on a checklist, for each project. The reliability goals were not known to the design team, and the few that did know them also understood that they would not be measured, nor would a failure to meet those goals impede getting the product to market. The people I talked to stated reliability was very important and were very busy fixing field failures or issues identified by testing just before product launch. Reliability was done by “the guy that left last year”.
Upstairs the interviews started on time and proceeded without interruption. No one remembered the last time there had been an urgent need to resolve a field issue. The team employed reliability tools that would benefit the project as needed. The specific testing that was done was tailored to the risks identified during the design phase. The goals were widely known and their current status was also known, both during development and after product launch. The people I talked to stated reliability was very important and they knew what to do to meet their reliability objectives. The team’s manager taught reliability thinking and skills, and everyone did reliability.
For both organizations the basic structure and thought process to determine which reliability tools to use would apply, but because of their different stages of development different tools would be needed. The downstairs, stage II, organization would require coaching, training, and resources to break the cycle of letting surprising field failures dominate the engineering day. The upstairs, stage IV, organization might be ready for advanced tools related to product modeling or field data analysis. They would have the time to learn advanced accelerated degradation testing methods, for example.

8. CONCLUSIONS

In summary, the traits of effective reliability programs include the ability to:
• State clear reliability goals;
• Enable tradeoff decision-making;
• Selectively use only value-added reliability activities;
• Promote a proactive reliability culture.
The basic message is that no one list or standard of tasks makes an effective reliability program. The selection of valuable tools and the establishment of a basic structure for decision-making permit an organization to achieve the desired reliability objectives.

9. REFERENCES

Crosby, P. B. (1979). Quality is Free: The Art of Making Quality Certain. New York, Signet.
Ireson, W. G., C. F. Coombs, et al. (1995). Handbook of Reliability Engineering and Management. New York, McGraw Hill.
Petroski, H. (1994). Design Paradigms: Case Histories of Error and Judgment in Engineering. Cambridge, Cambridge University Press.