Blog Archives

Communicating Reliability, Risk and Resiliency to Decision Makers by JD Solomon now available

JD Solomon, PE, CRE, CMRP, a valuable member and volunteer of the ASQ Reliability Division, has released a book drawing on his expert experience and knowledge.


Communicating Reliability, Risk and Resiliency to Decision Makers:

Effective communication of concepts and solutions related to reliability, risk, and resiliency is frequently cited by technical professionals as the most challenging and overlooked aspect of their work. Texts and guidance documents often reference the importance of better communication; however, there are few practical examples and limited practical guidance for this type of communication. This book fills many of these gaps for practitioners. Communicating Reliability, Risk and Resiliency to Decision Makers provides the hands-on approaches and techniques that will make you more effective in getting decision makers to move from discussion to action.

More info on

A more extended review of the book will appear later in the blog section.

Posted in General

Censoring and MTBF and MCBF Calculations


At times, several items are tested simultaneously, and the results are combined to estimate the mean time before failure/mean cycles before failure (MTBF/MCBF). Tests of this type commonly end in one of two ways: type I censoring (also called time-truncated censoring) or type II censoring (also called failure-truncated censoring).

Note that the MTBF/MCBF (θ) is the reciprocal of the failure rate (λ).


Type I Censoring

This is when a total of n items are placed on test. Each test stand operates each test item a given number of hours or cycles. As test items fail, they are replaced. The test time is defined in advance, so the test is said to be truncated after the specified number of hours or cycles.

A time- or cycle-truncated test is called type I censoring.

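The formula for type I censoring is:

$$\hat{\theta} = \frac{n\,t}{r}$$

where n is the number of items (test stands), t is the number of hours or cycles each stand operates, and r is the number of failures observed.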


Type II Censoring

Type II censoring is where n items are placed on test together, and the test is truncated after a total of r failures occurs. As items fail, they are not replaced or repaired.

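$$\hat{\theta} = \frac{\sum_{i=1}^{r} t_i + (n - r)\,t_r}{r}$$

where t_i is the time of the i-th failure and t_r is the time of the r-th failure, at which the test is truncated.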
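As a minimal sketch, the two estimators can be written in Python, with the failure rate obtained as the reciprocal of θ. The function names and the numbers in the usage lines are hypothetical illustrations, not the data from the book's example.

```python
from typing import Sequence


def mtbf_type1(n_items: int, test_time: float, failures: int) -> float:
    """Type I (time-truncated, failed items replaced): theta = n * t / r."""
    if failures <= 0:
        raise ValueError("a point estimate needs at least one failure")
    return n_items * test_time / failures


def mtbf_type2(failure_times: Sequence[float], n_items: int) -> float:
    """Type II (failure-truncated, no replacement):
    theta = (sum of failure times + (n - r) * t_r) / r."""
    r = len(failure_times)
    if r == 0:
        raise ValueError("a point estimate needs at least one failure")
    t_r = max(failure_times)  # the test stops at the r-th failure
    total_time = sum(failure_times) + (n_items - r) * t_r  # survivors ran until t_r
    return total_time / r


# Hypothetical data, not the book's example
theta1 = mtbf_type1(n_items=10, test_time=500.0, failures=4)
theta2 = mtbf_type2([100.0, 250.0, 400.0, 600.0], n_items=20)
print(theta1, 1 / theta1)  # MTBF and failure rate, type I
print(theta2, 1 / theta2)  # MTBF and failure rate, type II
```

In the type II case the 16 unfailed items each contribute the full truncation time t_r, which is why the estimate is much larger than the average of the four failure times.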

Example: A total of 20 items are placed on test. The test is to be truncated when the fourth failure occurs. The failures occur at the following number of hours into the test.


The remaining 16 items were in good operating order when the test was truncated at the fourth failure. What is the estimate of the MTBF?


What is the failure rate?




Practical engineering, process, and reliability statistics / Mark Allen Durivage

ISBN: 978-0-87389-889-8

Picture © B. Poncelet

Posted in General

Tribute to Harry Ascher (1935-2014)


Harold (Harry) E. Ascher was known widely in the reliability engineering community for his many contributions, especially in the area of statistical modeling of repairable systems reliability.

The 1984 book co-authored by Harry Ascher and Harry Feingold, Repairable Systems Reliability: Modeling, Inference, Misconceptions and Their Causes [1] was the first book to thoroughly address statistical modeling of systems that are repaired rather than discarded after the first failure.
Harry Ascher earned degrees in Operations Research from City College of New York and New York University and spent most of his career with the Naval Research Laboratory (NRL) in Washington DC. Following his retirement from NRL, he spent more than 20 years as a private consultant.
He received many distinctions and awards including the Allan Chop Award from ASQC (now ASQ) in 1995.
One of Harry Ascher’s greatest professional visions was to “clean up” notation and terminology, removing what he considered avoidable ambiguities between properties of repairable systems and, in contrast, properties of (non-repairable) parts.
For example, the term “Failure Rate” in the literature is used in multiple conflicting meanings, sometimes even within the same article.
One meaning is the derivative of the expected cumulative number of failures for a repairable system (also known as the “Rate of Occurrence of Failures” or “Failure Intensity Function”).
The other, quite different, meaning is the ratio of the Probability Density Function and the Reliability Function for a part (also known as the “Force of Mortality” or “Hazard Function”).
He taught us that the most important analysis to perform on statistical data for a repairable system is a test for trend.
This may classify the system as a “Happy System” (times between failures tend to get larger and larger), a “Sad System” (times between failures tend to get shorter and shorter) or a “Non-Committal System” (times between failures show no significant sign of trend).
He would repeatedly state that, when it comes to modeling a repairable system, “a set of numbers is not a data set.” Often, the particular pattern in which failures occur provides more useful information about a system than does a set of numbers describing the observed times between failures.
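The tribute does not name a specific trend test. As a hedged sketch, the Laplace test, one widely used test for trend in repairable-system failure data, can sort a system into Ascher's three categories. The function names, the 1.96 cutoff (roughly a 5% two-sided level), and the failure times are illustrative assumptions; this version assumes observation stopped at a fixed time T rather than at a failure.

```python
import math
from typing import Sequence


def laplace_statistic(failure_times: Sequence[float], end_time: float) -> float:
    """Laplace trend statistic for cumulative failure times observed on (0, end_time].

    Approximately standard normal when failures occur with no trend.
    """
    n = len(failure_times)
    mean_time = sum(failure_times) / n
    return (mean_time - end_time / 2) / (end_time * math.sqrt(1 / (12 * n)))


def classify_system(u: float, z_crit: float = 1.96) -> str:
    # Failures bunched late in the window (U > 0) mean shrinking times between failures.
    if u > z_crit:
        return "sad"
    if u < -z_crit:
        return "happy"
    return "non-committal"


# Hypothetical cumulative failure times (hours)
sad_u = laplace_statistic([200, 350, 450, 520, 570, 600, 620], end_time=630)
happy_u = laplace_statistic([10, 30, 60, 110, 190, 320, 520], end_time=700)
print(round(sad_u, 2), classify_system(sad_u))
print(round(happy_u, 2), classify_system(happy_u))
```

The first data set has intervals of 200, 150, 100, 70, 50, 30 and 20 hours, so it comes out “sad”; the second has steadily growing intervals and comes out “happy”. Note that the statistic uses the order and timing of failures, in the spirit of “a set of numbers is not a data set”.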
He taught us that a renewal process is generally an inappropriate model to use for a complex repairable system consisting of many parts that have either “infant mortality” or “wear-out” characteristics.
To demonstrate how absurd the renewal process model is when used to describe a complex system such as a car, he often told his “Honest John” story.
This is about a used-car salesman trying to sell a worn-out car with multiple mechanical issues by telling the customer: “Two days ago the battery was dead, so we charged it. So the car is two days old!”

(excerpted from article by Christian K. Hansen, President, IEEE Reliability Society)

Previously published in the June 2014 Volume 5, Issue 2 ASQ Reliability Division Newsletter

Picture © B. Poncelet

Posted in General

Example using “Bayesian” assumptions


Example using “Bayesian” assumptions in proving a new design is X% better.
Suppose my current part is producing failures in the field with a Weibull β=2, η=95 hours.
I have demonstrated that I can reproduce the failure mode in the lab by producing the same β.
I have redesigned the part and placed 8 of the new designed parts on test in the same lab setup.
After a few weeks, they have test times of 210, 260, 570, 120, 225, 280, 500 and 400 hours with no failures.
How much better is this new part?
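One way to demonstrate this is to calculate the MLE η of the 8 times demonstrated on the new designed parts, “assuming” β = 2 and “assuming” 1 failure is imminent (r = 1):

$$\hat{\eta} = \left(\frac{\sum_{i=1}^{8} t_i^{\beta}}{r}\right)^{1/\beta} = \left(\frac{990{,}025}{1}\right)^{1/2} = 995 \text{ hours}$$

Since 995/95 ≈ 10.5, this shows (conservatively) over 10X better reliability.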
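A minimal Python sketch of the same calculation, assuming the Weibayes form of the Weibull MLE (shape β known, an assumed number of failures r); the function name is hypothetical.

```python
from typing import Sequence


def weibayes_eta(times: Sequence[float], beta: float, failures: int = 1) -> float:
    """Weibull scale estimate with known shape beta and an assumed failure count r:
    eta = (sum(t_i ** beta) / r) ** (1 / beta)."""
    return (sum(t ** beta for t in times) / failures) ** (1.0 / beta)


new_design_times = [210, 260, 570, 120, 225, 280, 500, 400]  # hours, no failures
eta_new = weibayes_eta(new_design_times, beta=2.0, failures=1)
eta_old = 95.0  # current part, from field data
print(f"eta_new = {eta_new:.1f} h, ratio = {eta_new / eta_old:.1f}x")
```

This prints `eta_new = 995.0 h, ratio = 10.5x`. Assuming more than one imminent failure (a larger `failures`) lowers η, which is why r = 1 is the optimistic-but-defensible choice when no failures have actually occurred.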

Previously published in the June 2013 Volume 4, Issue 2 ASQ Reliability Division Newsletter

Picture © B. Poncelet

Posted in General

CRE Experts Needed


ASQ is seeking volunteers to help write practice questions for the Certified Reliability Engineer (CRE) exam.
You must be a member of ASQ and a CRE.
To keep the actual and practice CRE efforts separate, you cannot have participated in the development of actual CRE exam content within the past two years.
You will receive recertification units (RUs) for the hours contributed to the project, as well as recognition of your contributions.

If you are interested in supporting this effort, please contact Donna Grunewald at for more information

Info on ASQ CRE

Posted in General

Bayesian Analysis by Markov Chain Monte Carlo (MCMC)


There are some classic methods for determining the unknown parameters in reliability analysis, including probability plotting, least squares, and maximum likelihood estimation (MLE).
These methods provide a single point value for each parameter based on experimental data.
The Bayesian approach is widely employed in research for estimating and updating parameter values.

The advantages of this method:
1. The Bayesian approach is an updating framework: unlike the classic methods, it does not ignore prior information, but updates earlier estimates with newly obtained knowledge to improve the estimated parameter.
2. The output of a Bayesian analysis is a distribution instead of a single value for the parameter.
3. Classic methods require a large sample size for the estimate to converge. For cases restricted by limited data, the Bayesian approach is a good option for parameter estimation.

By considering X as the unknown parameter and E as the new knowledge of crack length, Bayes’ theorem modifies a prior probability π0(X), yielding a posterior probability π1(X|E), via the expression:

$$\pi_1(X \mid E) = \frac{f(E \mid X)\,\pi_0(X)}{\int f(E \mid X)\,\pi_0(X)\,dX}$$

where f(E|X) is the likelihood function, constructed from the newly available knowledge and evidence.
The factor f(E|X)/∫f(E|X)π0(X)dX represents the impact of the evidence on the belief in the PDF of the parameters.
Multiplying the prior PDF of the parameters by this factor provides a theoretical mechanism to update the prior knowledge of the parameters with the new evidence.
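To make the update concrete, here is a minimal numerical sketch that evaluates the posterior on a grid. It assumes a hypothetical exponential failure rate λ (not the crack-length problem above) with a gamma prior; the Riemann sum stands in for the integral in the denominator. For this conjugate pair the exact posterior is gamma(a + r, b + T), which provides a check.

```python
import math


def gamma_pdf(x: float, shape: float, rate: float) -> float:
    """Gamma prior density pi_0(lambda) with the given shape and rate."""
    return rate ** shape * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)


def exponential_likelihood(lam: float, failures: int, total_time: float) -> float:
    """f(E | lambda) for r failures in total_time hours (up to a constant factor)."""
    return lam ** failures * math.exp(-lam * total_time)


a, b = 2.0, 1000.0  # hypothetical prior: gamma(shape=a, rate=b), prior mean 0.002 per hour
r, T = 3, 2000.0    # hypothetical evidence: 3 failures in 2000 hours

d_lam = 1e-6
grid = [d_lam * (i + 1) for i in range(10_000)]  # lambda from 1e-6 to 0.01 per hour
unnormalized = [exponential_likelihood(x, r, T) * gamma_pdf(x, a, b) for x in grid]
evidence = sum(unnormalized) * d_lam             # approximates the denominator integral
posterior = [u / evidence for u in unnormalized] # posterior density on the grid

posterior_mean = sum(x * p for x, p in zip(grid, posterior)) * d_lam
print(posterior_mean, (a + r) / (b + T))         # grid result vs. exact conjugate mean
```

A grid works here because there is only one parameter; with several parameters the number of grid points grows exponentially, which is what motivates the sampling methods discussed next.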

In practice, Bayesian inference is complicated, and an analytical solution rarely exists.
However, this difficulty can be overcome numerically through Markov chain Monte Carlo (MCMC) methods.
This numerical method is applicable to similar approaches that require integration over the posterior distribution to make inferences about model parameters or to make predictions.
MCMC is Monte Carlo integration that draws samples from the required distribution by running a properly constructed Markov chain for a long time.
Gibbs sampling is commonly used to draw the samples.
BUGS is an acronym for Bayesian inference Using Gibbs Sampling, and WinBUGS is a freely available software package for performing MCMC simulation.
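BUGS/WinBUGS model code is not reproduced here. As a hedged illustration of the MCMC idea, the sketch below runs a random-walk Metropolis sampler, a different MCMC algorithm from Gibbs sampling, chosen because this toy problem has a single parameter. The assumed problem is a hypothetical exponential failure rate λ with a gamma(a, b) prior and r failures observed in T hours, whose exact posterior mean (a + r)/(b + T) serves as a check. The sample size, step size, and seed are arbitrary choices.

```python
import math
import random
from typing import Callable, List


def log_posterior(lam: float, a: float, b: float, r: int, T: float) -> float:
    """Unnormalized log posterior: gamma(a, b) prior times exponential likelihood."""
    if lam <= 0:
        return -math.inf  # zero density outside the support
    return (a + r - 1) * math.log(lam) - (b + T) * lam


def metropolis(log_target: Callable[[float], float], start: float, step: float,
               n_samples: int, burn_in: int, seed: int = 1) -> List[float]:
    rng = random.Random(seed)
    x, lp = start, log_target(start)
    samples = []
    for i in range(burn_in + n_samples):
        proposal = x + rng.gauss(0.0, step)  # symmetric random-walk proposal
        lp_prop = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x))
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            x, lp = proposal, lp_prop
        if i >= burn_in:
            samples.append(x)  # keep post-burn-in states only
    return samples


a, b, r, T = 2.0, 1000.0, 3, 2000.0  # hypothetical prior and evidence
chain = metropolis(lambda lam: log_posterior(lam, a, b, r, T),
                   start=0.001, step=0.0008, n_samples=50_000, burn_in=5_000)
mcmc_mean = sum(chain) / len(chain)
print(mcmc_mean, (a + r) / (b + T))  # MCMC estimate vs. exact gamma posterior mean
```

Only the unnormalized posterior is ever evaluated, so the denominator integral never has to be computed; this is the practical advantage MCMC offers when no analytical solution exists.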

By: Mohammad Pourgol-Mohammad, Ph.D., P.E., CRE

Previously published in the December 2015 Volume 6, Issue 4 ASQ Reliability Division Newsletter

Picture © B. Poncelet

Posted in General

Juran & Deming —The Kings of Quality


Joseph Juran (1904-2008) and W. Edwards Deming (1900-93), the two most influential thinkers behind the total-quality movement, both launched their careers a few years apart at Western Electric, which used statistical quality-control techniques pioneered at Bell Labs to build reliable telephones.
And both gained acclaim while on loan to the government during World War II.
The irony is that Japanese executives heeded the lessons of total quality before American managers did.
In 1969, JUSE (the Union of Japanese Scientists and Engineers) asked Juran to lend his name to Japan’s top quality award, a sort of super-Deming Prize for companies that maintain the highest quality for five years running.
JUSE deemed Juran’s vision of top-to-bottom quality management even more important than Deming’s manufacturing insights.
Juran demurred, a decision he later regretted.
So what could have been the Juran Medal is instead called the Japan Quality Control Medal.
There is a Joseph Juran Medal, though. It’s awarded by the American Society for Quality.
Juran personally presented the first one in 2001 to Robert W. Galvin, then head of Motorola Inc.’s executive committee.


Previously published in the June 2016 Volume 7, Issue 2 ASQ Reliability Division Newsletter

Picture © B. Poncelet

Posted in General

“Cumulative Sums of the Poisson” – affectionately called the “Thorndike Chart”

An old Nomograph that can save Risk Analysts and Reliability Engineers a lot of time: “Cumulative Sums of the Poisson” – affectionately called the “Thorndike Chart”

Suppose you have to explain to a customer that the expected number of “catastrophic” events over the next (say) 10 years is a small fraction less than 1 (say 0.26 events). Using the Thorndike chart with µ = 0.26, you can see that the probability of seeing one or more events over the same time period is about 0.23. That is much easier to explain to a customer than 0.26 “expected” events!
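The chart is a graphical lookup of cumulative Poisson sums, so the same numbers can be computed directly. A minimal Python sketch; the function name is hypothetical.

```python
import math


def prob_at_least(c: int, mu: float) -> float:
    """P(X >= c) for X ~ Poisson(mu): the cumulative sums read from a Thorndike chart."""
    return 1.0 - sum(math.exp(-mu) * mu ** k / math.factorial(k) for k in range(c))


mu = 0.26  # expected number of catastrophic events over the period
for c in (1, 2, 3):
    print(c, round(prob_at_least(c, mu), 4))
```

For c = 1 this reduces to 1 − e^(−µ), about 0.229 for µ = 0.26, which is the “one or more events” probability a customer can relate to.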

Previously published in the June 2013 Volume 4, Issue 2 ASQ Reliability Division Newsletter

Posted in General

The First Reliability Model for Mechanical Situations

A simple reliability model is the stress-load model.
In its simplest form it assumes that the stresses present are normally distributed and that the strength of material is also normally distributed across a number of samples.
With these distributions known, we can calculate the overlap of the two distributions, called interference, and can estimate the reliability of the situation.
The area of overlap is proportional to unreliability. See references [2], [3], [4] and [6] for additional information on this technique.
Mean of the load is μL = 5000 PSI, with standard deviation σL = 700 PSI.
Mean of the strength is μS = 8000 PSI, with standard deviation σS = 800 PSI.
We often desire the Safety Margin to be > 3 to ensure a high reliability.
The Safety Margin unfortunately is a poorly defined term.
You can find at least two definitions, depending upon the book. Note that most books agree that the Margin of Safety is different from the Safety Margin.
The Margin of Safety is the ratio of the average values of the strength and load, used when the standard deviations are unknown.
The following is the most common convention for the Safety Margin, or S.M.:

$$\text{S.M.} = \frac{\mu_S - \mu_L}{\sqrt{\sigma_S^2 + \sigma_L^2}} = \frac{8000 - 5000}{\sqrt{800^2 + 700^2}} \approx 2.82$$

where μ and σ are the means and standard deviations of strength (S) and load (L).
A one-time application of this range of stresses upon the material leads to a population reliability of 0.997614 (the standard normal probability Φ(S.M.)) for the range of loads and strengths present.
Figure 1.12 shows the considerable overlap of the two distributions.
Despite this, the one-time reliability is still high.
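A minimal Python sketch of the normal stress-strength interference calculation, assuming independent, normally distributed strength and load as in the text; the function names are hypothetical. `statistics.NormalDist` (Python 3.8+) supplies the standard normal CDF.

```python
from math import sqrt
from statistics import NormalDist


def safety_margin(mean_s: float, sd_s: float, mean_l: float, sd_l: float) -> float:
    """S.M. = (mean strength - mean load) / sqrt(sd_strength^2 + sd_load^2)."""
    return (mean_s - mean_l) / sqrt(sd_s ** 2 + sd_l ** 2)


def interference_reliability(mean_s: float, sd_s: float, mean_l: float, sd_l: float) -> float:
    """P(strength > load) for independent normal strength and load: Phi(S.M.)."""
    return NormalDist().cdf(safety_margin(mean_s, sd_s, mean_l, sd_l))


sm = safety_margin(8000, 800, 5000, 700)
rel = interference_reliability(8000, 800, 5000, 700)
print(f"S.M. = {sm:.3f}, reliability = {rel:.6f}")
```

The result agrees with the 0.997614 quoted above to about five decimal places, and it makes the trade-off visible: an S.M. just under 3 still gives a one-time reliability above 0.997.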
This simple model can be employed with a variety of situations.
These are mainly mechanical, but can also be electronic.
Wherever one can describe a probability distribution related to a state of a system that has a well-defined failure distribution, this approach can be used.
Time may even be added to the whole approach.

Extensions of the simple static model may be based upon the fact that one can model some quasi-static situations by the interference (overlap) of the strength and load distributions.
The distributions represent a probability of strength and a probability of load (stress) in a population of possibilities.
They do not suggest that a single system is changing values of strength, rather the whole population may be drifting.
Load is assumed to be static or drifting just as is strength.
Both distributions are still assumed to be normally distributed.
The math in this case is easy.
The overlap area of the two distributions is proportional to the probability of failure.
This simple model may be extended by adding time or repetitive activities.
The following examples show ways to extend this simple model.
Degradation may be the description of a slowly declining strength distribution.
The stress changing with time may be associated with a number of common failure mechanisms such as loss of lubrication, wear out or damage.

2. Ireson, W. Grant, Coombs, Clyde F., and Moss, Richard Y., Handbook of Reliability Engineering and Management, 2nd edition, McGraw-Hill, New York, 1996
3. Rao, S.S., Reliability-Based Design, McGraw-Hill, New York, 1992
4. O’Connor, P.D.T., Practical Reliability Engineering, 4th Student edition, Wiley, New York, 2002
6. Carter, A.D.S., Mechanical Reliability and Design, Wiley, New York, 1997

By: James McLinn, CRE, Fellow ASQ

Published in Mechanical Design Reliability Handbook: Simplified Approaches and Techniques ISBN 0277-9633 February 2010 (available as free download for ASQ Reliability Division Members)

Picture © B. Poncelet

Posted in General

Wilks Tolerance Limit for Affordable Monte Carlo Based Uncertainty Propagation


As systems and their models become more complex and costly to run, tolerance-limit uncertainty characterization is gaining popularity.
For example, in very complex models containing several uncertain parameters (each represented by a probability distribution function), classical Bayes’ and bootstrap Monte Carlo simulation may become impractical.
Often, in complex computer-based models where each calculation requires a significant amount of time and effort, traditional Monte Carlo simulation is not feasible.
The Wilks tolerance limit is used in these cases.

A tolerance interval is a random interval (L, U) that contains, with probability (or confidence) β, at least a fraction γ of the population under study.
The probability β and fraction γ are criteria selected by the analyst, depending on the confidence desired.
The pioneering work in this area is attributed to Wilks [1-2] and later to Wald [3-4].
The Wilks tolerance limit is an efficient and simple sampling method that reduces the required sample size from a few thousand to around 100 or so.
The required sample size does not depend on the number of uncertain parameters in the model.

There are two kinds of tolerance limits:

Non-parametric tolerance limits: Nothing is known about distribution of the random variable except that it is continuous
Parametric tolerance limits: The distribution function representing the random variable of interest is known and only some distribution parameters involved are unknown.

The problem in both cases is to calculate a tolerance range (L, U) for a random variable X represented by the observed sample x1, …, xm and the corresponding sample size, such that

$$P\left[\int_{L}^{U} f(x)\,dx \ge \gamma\right] = \beta \qquad (1)$$

where f(x) is the probability density function of the random variable X.

Let us consider a complex system represented by a model (e.g., a risk model).
Such a model may describe the relationship between output variables (e.g., probability of failure or performance value of a system) and some input (random) variables (e.g., geometry, material properties).
Assume several parametric variables are involved in the model.
Further assume that the observed randomness of the output variables results from the randomness of the input variables.
If we take N samples of each input variable, we obtain a sample of N output values {y1, …, yN} for y = f(x).
In applying (1) to this problem, note that the probability β is called the confidence level.
To be on the conservative side, one should also specify the probability content γ, in addition to the confidence level β, as large as possible.
It should be emphasized that γ is not a probability, although it is a non-negative real number less than one [5].
Having fixed β and γ, it becomes possible to determine the number of runs (samples of output) N required to remain consistent with the selected β and γ values.

Let y1, …, yN be N independent output values of y. Suppose that nothing is known about the pdf g(y) except that it is continuous.
Arrange the values y1, …, yN in increasing order and denote them by y(k), hence

$$y_{(1)} \le y_{(2)} \le \cdots \le y_{(N)} \qquad (2)$$

with, by definition, y(0) = −∞ and y(N+1) = +∞. It can be shown [5] that the confidence level β with which the interval (y(r), y(s)), 0 ≤ r < s ≤ N + 1, contains at least the fraction γ of the distribution is obtained from

$$\beta = \sum_{j=0}^{s-r-1} \binom{N}{j}\,\gamma^{j}\,(1-\gamma)^{N-j} \qquad (3)$$

From equation (3), the sample size N can be estimated. Two cases of tolerance limits are considered: one-sided and two-sided.

One-sided Tolerance Limits: This is the more common case, for example when measuring a model output value such as temperature or shear stress at a point on the surface of a structure.
We want to ensure that the upper sample tolerance limit TU, obtained according to (3) from a small sample of model-estimated temperatures, lies above at least the fraction γ of the temperatures in a hypothetical infinite sample of such estimates, with probability β (say 95%).
Table I shows values for sample size N based on values of β and γ. For example, if β = 0.95 and γ = 0.90, then N = 29 samples taken from the model (e.g., by standard Monte Carlo sampling) ensure that the highest temperature TU in this sample represents the 95% upper confidence limit below which 90% of all possible temperatures lie.
Table I. One-sided Wilks sample sizes N for selected values of β and γ

Two-Sided Tolerance Limits: We now consider the two-sided case, which is less common [6].
Table II shows the Wilks sample sizes for this case. With β and γ both equal to 95%, we get N = 93 samples.
For example, with 93 samples taken from the model (e.g., by standard Monte Carlo sampling), we can say that the limits (TL, TH) from this sample represent the 95% confidence interval within which 95% of all possible temperatures lie.
Table II. Two-sided Wilks sample sizes N for selected values of β and γ
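A minimal sketch that solves equation (3) for N by direct search, assuming the one-sided upper limit is the sample maximum (r = 0, s = N) and the two-sided limits are the sample minimum and maximum (r = 1, s = N). The function names are hypothetical; `math.comb` requires Python 3.8+.

```python
from math import comb


def wilks_confidence(n: int, gamma: float, two_sided: bool = False) -> float:
    """Confidence beta = sum_{j=0}^{s-r-1} C(n, j) gamma^j (1 - gamma)^(n - j).

    One-sided (r = 0, s = n): the upper limit is the largest sample, s - r - 1 = n - 1.
    Two-sided (r = 1, s = n): the limits are the smallest and largest samples, s - r - 1 = n - 2.
    """
    k = n - 2 if two_sided else n - 1
    return sum(comb(n, j) * gamma ** j * (1 - gamma) ** (n - j) for j in range(k + 1))


def wilks_sample_size(beta: float, gamma: float, two_sided: bool = False) -> int:
    """Smallest N whose confidence reaches beta for content gamma."""
    n = 2 if two_sided else 1
    while wilks_confidence(n, gamma, two_sided) < beta:
        n += 1
    return n


print(wilks_sample_size(0.95, 0.95))                  # one-sided 95%/95%
print(wilks_sample_size(0.95, 0.95, two_sided=True))  # two-sided 95%/95%
```

For the one-sided case the sum collapses to 1 − γ^N, so N is simply the smallest integer with γ^N ≤ 1 − β. The two-sided search reproduces the N = 93 quoted for β = γ = 95%. Neither result depends on how many uncertain inputs the model has.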

Example 1:
A manufacturer of steel bars wants to order boxes for shipping their bars.
They want to order boxes of appropriate length, with 90% confidence that at least 95% of the bars do not exceed the box’s length.
How many samples, N, should the manufacturer select, and which one should be used as the measure of the box length?

From Table I, with γ = 95% and β = 90% (one-sided, 1 − γ^N ≥ β), the value for N is 45.
The manufacturer should order boxes whose length equals that of the longest sampled bar, x(45), when the 45 samples are ordered by length.
To compare the Wilks tolerance limit with Bayes’ Monte Carlo, consider a complex mathematical routine [7] (called MDFracture) used to calculate the probability of a nuclear reactor pressure vessel fracture due to pressurized thermal shock.
Certain transient scenarios can cause a rapid cooling inside the reactor vessel while it is pressurized.

Example 2:
A 2.828-inch surge line break in a certain design of nuclear reactors may lead to such a condition.
Many input variables contribute to the amount of thermal stress and fracture toughness of the vessel.
Some of them may involve uncertainties.
The temperature, pressure and heat transfer coefficient are examples of such variables, represented by normal distributions.
Also, flaw size, the distance from the inner flaw tip to the interface between the base metal and the cladding of the reactor vessel (C_Dist), and aspect ratio are unknown and can be represented by random variables with the distributions shown in Table III.
To compare the vessel-fracture results for this scenario from the Wilks approach (γ = 95%, β = 95%) with those of standard Monte Carlo simulation, three Wilks runs of 100 samples each (two-sided case, as shown in Table II) and two standard Monte Carlo runs of 1000 and 2000 trials were performed using the MD-Fracture mathematical-based tool.
Results show good agreement between the Wilks tolerance limits and simple Monte Carlo sampling, as shown in Figure I.

1) Wilks, S.S., Determination of Sample Sizes for Setting Tolerance Limits. The Annals of Mathematical Statistics, 12(1), 91, 1941.
2) Wilks, S.S., Statistical Prediction with Special Reference to the Problem of Tolerance Limits. The Annals of Mathematical Statistics, 13(4), 400, 1942.
3) Wald, A., An Extension of Wilks’ Method for Setting Tolerance Limits. The Annals of Mathematical Statistics, 14(1), 45, 1943.
4) Wald, A., Tolerance Limits for a Normal Distribution. The Annals of Mathematical Statistics, 17(2), 208, 1946.
5) Guba, A., Makai, M., and Pal, L., Statistical aspects of best estimate method I., Reliability Engineering & System Safety, 80 (3), 217, 2003.
6) Nutt, W.T., and Wallis, G.B., Evaluation of nuclear safety from the outputs of computer codes in the presence of uncertainties. Reliability Engineering & System Safety, 83(1), 57, 2004.
7) Li, F. and Modarres, M., Characterization of Uncertainty in the Measurement of Nuclear Reactor Vessel Fracture Toughness and Probability of Vessel Failure, Transactions of the American Nuclear Society Annual Meeting, Milwaukee, 2001.

By: Mohammad Pourgol-Mohammad, Ph.D., P.E., CRE

Previously published in the June 2016 Volume 7, Issue 2 ASQ Reliability Division Newsletter

Picture © B. Poncelet

Posted in General
