Blog Archives

Bayesian Analysis by Markov Chain Monte Carlo (MCMC)


Several classic methods exist for determining the unknown parameters in reliability analysis, including probability plotting, least squares, and maximum likelihood estimation (MLE).
These methods provide a single point value for each parameter based on experimental data.
The Bayesian approach, by contrast, is widely used in research for estimating and updating parameter values.

The advantages of this method:
1. The Bayesian approach is an updating scheme: unlike the classic methods, it does not ignore prior information but updates earlier estimates with newly obtained knowledge to improve the estimated parameter.
2. The output of a Bayesian analysis is a distribution instead of a single value for the parameter.
3. Classic methods require a large sample size for the estimate to converge. For cases restricted by limited data, the Bayesian approach is a good choice for parameter estimation.

By considering X as the unknown parameter and E as the new knowledge (here, of crack length), Bayes' theorem modifies a prior probability π0(X), yielding a posterior probability π(X|E), via the expression:
π(X|E) = f(E|X) π0(X) / ∫ f(E|X) π0(X) dX

where f(E|X) is the likelihood function, constructed from the newly available knowledge and evidence.
The factor f(E|X) / ∫ f(E|X) π0(X) dX is the impact of the evidence on the belief in the PDF of the parameters.
Multiplying the prior PDF of the parameters by this factor provides a theoretical mechanism to update the prior knowledge of the parameters with the new evidence.

Bayesian analysis is complicated in practice, and an analytical solution is rarely available.
However, the difficulty can be reduced numerically through a Markov chain Monte Carlo (MCMC) solution.
This numerical method applies to any approach that needs to integrate over the posterior distribution to make inferences about model parameters or to make predictions.
MCMC is Monte Carlo integration that draws samples from the required distribution by running a properly constructed Markov chain for a long time.
Gibbs sampling is commonly used to draw the samples.
BUGS stands for Bayesian inference Using Gibbs Sampling; WinBUGS is a freely available software package for performing MCMC simulation.
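To make the idea concrete, here is a minimal sketch of an MCMC run in Python. It is not the WinBUGS workflow and it does not use Gibbs sampling; it swaps in a plain random-walk Metropolis sampler because that fits in a few lines. The failure times, the Gamma(2, 2000) prior on the failure rate, and the proposal step size are all hypothetical values chosen for illustration. The model (exponential failure times with a gamma prior) was picked because its posterior is known in closed form, so the sampler can be checked.

```python
import math
import random

# Hypothetical failure times in hours (illustrative data, not from the article)
FAILURE_TIMES = [1200.0, 950.0, 1800.0, 400.0, 2300.0, 760.0, 1500.0, 1100.0]
A_PRIOR, B_PRIOR = 2.0, 2000.0  # assumed Gamma(shape, rate) prior on the failure rate


def log_posterior(lam, data, a, b):
    """Unnormalized log posterior: gamma prior times exponential likelihood."""
    if lam <= 0.0:
        return -math.inf
    return (a - 1 + len(data)) * math.log(lam) - (b + sum(data)) * lam


def metropolis(data, a, b, n_samples=20000, burn_in=2000, step=0.0003, seed=1):
    """Random-walk Metropolis sampler for the failure rate."""
    rng = random.Random(seed)
    lam = a / b  # start the chain at the prior mean
    current = log_posterior(lam, data, a, b)
    draws = []
    for i in range(n_samples + burn_in):
        proposal = lam + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data, a, b)
        # Accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            lam, current = proposal, candidate
        if i >= burn_in:
            draws.append(lam)
    return draws


draws = metropolis(FAILURE_TIMES, A_PRIOR, B_PRIOR)
mcmc_mean = sum(draws) / len(draws)
# Conjugate result for comparison: posterior is Gamma(a + n, b + sum of times)
exact_mean = (A_PRIOR + len(FAILURE_TIMES)) / (B_PRIOR + sum(FAILURE_TIMES))
print(f"MCMC posterior mean: {mcmc_mean:.3e}, closed form: {exact_mean:.3e}")
```

The list of draws is the practical output: it approximates the whole posterior distribution, so percentiles and credible intervals can be read directly from it rather than from a single point estimate.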

By: Mohammad Pourgol-Mohammad, Ph.D., P.E., CRE

Previously published in the December 2015 Volume 6, Issue 4 ASQ Reliability Division Newsletter

Picture © B. Poncelet

Posted in General

Juran & Deming —The Kings of Quality


Joseph Juran (1904-2008) and W. Edwards Deming (1900-93), the two most influential thinkers behind the total quality movement, launched their careers a few years apart at Western Electric, which used statistical quality control techniques pioneered at Bell Labs to build reliable telephones.
And both gained acclaim while on loan to the government during World War II.
The irony is, Japanese execs heeded the lessons of total quality ahead of American managers.
In 1969, JUSE asked Juran to lend his name to Japan’s top quality award, a sort of super-Deming Prize for companies that maintain the highest quality for five years running.
JUSE deemed Juran’s vision of top-to-bottom quality management even more important than Deming’s manufacturing insights.
Juran demurred, a decision he later regretted.
So what could have been the Juran Medal is instead called the Japan Quality Control Medal.
There is a Joseph Juran Medal, though. It’s awarded by the American Society for Quality.
Juran personally presented the first one in 2001 to Robert W. Galvin, then head of Motorola Inc.’s executive committee.


Previously published in the June 2016 Volume 7, Issue 2 ASQ Reliability Division Newsletter

Picture © B. Poncelet

Posted in General

“Cumulative Sums of the Poisson” – affectionately called the “Thorndike Chart”

An old nomograph that can save risk analysts and reliability engineers a lot of time.

Suppose you have to explain to a customer that the expected number of “catastrophic” events over the next (say) 10 years is a small fraction less than 1 (say 0.26 events). Using the Thorndike chart with µ = 0.26, you can see that the probability of seeing one or more events over the same time period is about 0.23. That is much easier to explain to a customer than 0.26 “expected” events!
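The chart is a graphical lookup of cumulative Poisson probabilities, so the same reading can be checked numerically. The short sketch below uses only the Python standard library; the function names are illustrative. For µ = 0.26 the exact probability of one or more events is 1 − e^(−0.26), roughly 0.229, which is about what can be read off the chart.

```python
import math


def poisson_cdf(k, mu):
    """P(N <= k) for a Poisson count with mean mu."""
    return sum(math.exp(-mu) * mu**i / math.factorial(i) for i in range(k + 1))


def prob_at_least(c, mu):
    """P(N >= c): the quantity a Thorndike chart is read for."""
    return 1.0 - poisson_cdf(c - 1, mu)


mu = 0.26  # expected number of events over the period, from the example
print(f"P(1 or more events) = {prob_at_least(1, mu):.3f}")
print(f"P(2 or more events) = {prob_at_least(2, mu):.3f}")
```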

Previously published in the June 2013 Volume 4, Issue 2 ASQ Reliability Division Newsletter

Posted in General

The First Reliability Model for Mechanical Situations

A simple reliability model is the stress-strength (load-strength) interference model.
In its simplest form it assumes that the applied stresses are normally distributed and that the strength of the material is also normally distributed across a number of samples.
With these distributions known, we can calculate the overlap of the two distributions, called interference, and estimate the reliability of the situation.
The area of overlap is proportional to unreliability. See references [2], [3], [4] and [6] for additional information on this technique.
[Figure: overlapping load and strength distributions]
Mean of the load is L = 5000 PSI, with standard deviation σL = 700 PSI.
Mean of the strength is S = 8000 PSI, with standard deviation σS = 800 PSI.
We often desire the Safety Margin to be > 3 to ensure high reliability.
Unfortunately, the Safety Margin is a poorly defined term.
You can find at least two definitions, depending upon the book. Note: most books agree that the Margin of Safety is different from the Safety Margin.
The Margin of Safety is the ratio of the average values of the strength and load, used when the standard deviations are unknown.
The following is the most common convention for the Safety Margin, or S.M., where σS and σL are the standard deviations of the strength and load:
S.M. = (S − L) / √(σS² + σL²) = (8000 − 5000) / √(800² + 700²) ≈ 2.82
A one-time application of a range of stress upon a material then leads to a population reliability of Φ(2.82) ≈ 0.997614 for the range of loads and strengths present, where Φ is the standard normal cumulative distribution function.
Figure 1.12 shows the considerable overlap of the two distributions.
Despite this, the one-time reliability is still high.
This simple model can be employed with a variety of situations.
These are mainly mechanical, but can also be electronic.
Wherever one can describe a probability distribution that is related to a state of a system with a well-defined failure distribution, we can use this approach.
Time may even be added to the whole approach.
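The worked numbers can be reproduced in a few lines. The sketch below assumes the common convention in which the Safety Margin is the difference of the means divided by the root-sum-square of the two standard deviations, and the one-time reliability is the standard normal CDF evaluated at that margin; the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def safety_margin(mean_strength, sd_strength, mean_load, sd_load):
    """Difference of means divided by the combined standard deviation."""
    return (mean_strength - mean_load) / sqrt(sd_strength**2 + sd_load**2)


def interference_reliability(mean_strength, sd_strength, mean_load, sd_load):
    """P(strength > load) for independent normal strength and load."""
    margin = safety_margin(mean_strength, sd_strength, mean_load, sd_load)
    return NormalDist().cdf(margin)


# Values from the example: strength 8000 +/- 800 PSI, load 5000 +/- 700 PSI
sm = safety_margin(8000.0, 800.0, 5000.0, 700.0)
rel = interference_reliability(8000.0, 800.0, 5000.0, 700.0)
print(f"Safety Margin = {sm:.3f}, one-time reliability = {rel:.6f}")
```

With these inputs the margin comes out just under 3, which agrees with the reliability of about 0.9976 quoted in the text.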

Extensions of the simple static model may be based upon the fact that one can model some quasi-static situations by the interference (overlap) of the strength and load distributions.
The distributions represent a probability of strength and a probability of load (stress) in a population of possibilities.
They do not suggest that a single system is changing values of strength, rather the whole population may be drifting.
Load is assumed to be static or drifting just as is strength.
Both distributions are still assumed to be normally distributed.
The math in this case is easy.
The overlap area of the two distributions is proportional to the probability of failure.
This simple model may be extended by adding time or repetitive activities.
The following examples show ways to extend this simple model.
Degradation may be the description of a slowly declining strength distribution.
The stress changing with time may be associated with a number of common failure mechanisms such as loss of lubrication, wear out or damage.
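One way to picture the degradation extension is to let the mean strength drift downward while everything else stays fixed. The sketch below is only an illustration under stated assumptions: a linear drift in mean strength, a hypothetical rate of 100 PSI per year, and constant standard deviations. It reports the one-time reliability of the population at each age, not a cumulative survival probability.

```python
from math import sqrt
from statistics import NormalDist

MEAN_LOAD, SD_LOAD = 5000.0, 700.0             # PSI, from the static example
MEAN_STRENGTH_0, SD_STRENGTH = 8000.0, 800.0   # PSI at time zero
DRIFT_PER_YEAR = 100.0                         # PSI/year, hypothetical degradation rate


def reliability_at(years):
    """One-time interference reliability after the mean strength has drifted."""
    mean_strength = MEAN_STRENGTH_0 - DRIFT_PER_YEAR * years
    margin = (mean_strength - MEAN_LOAD) / sqrt(SD_STRENGTH**2 + SD_LOAD**2)
    return NormalDist().cdf(margin)


for year in range(0, 31, 5):
    print(f"year {year:2d}: reliability = {reliability_at(year):.4f}")
```

As the strength distribution slides toward the load distribution the overlap grows, and the reliability falls from about 0.9976 at year 0 to 0.5 when the two means coincide.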

2. Ireson, Grant, Coombs, Clyde and Moss, Richard, Handbook of Reliability Engineering and Management, 2nd edition, McGraw Hill, New York, 1996.
3. Rao, S.S., Reliability Based Design, McGraw Hill, New York, 1992.
4. O’Connor, P. D. T., Practical Reliability Engineering, 4th Student edition, Wiley, New York, 2002.
6. Carter, A.D.S., Mechanical Reliability and Design, Wiley, New York, 1997.

By: James McLinn CRE, Fellow ASQ

Published in Mechanical Design Reliability Handbook: Simplified Approaches and Techniques ISBN 0277-9633 February 2010 (available as free download for ASQ Reliability Division Members)

Picture © B. Poncelet

Posted in General

Wilks Tolerance Limit for Affordable Monte Carlo Based Uncertainty Propagation


As systems and their models become more complex and costly to run, the use of tolerance limits for uncertainty characterization is gaining popularity.
For example, in very complex models containing several uncertain parameters (each represented by a probability distribution function), classical Bayes’ and bootstrap Monte Carlo simulation may become impractical.
In complex computer-based models, where each calculation requires a significant amount of time and effort, traditional Monte Carlo simulation is often not feasible.
The Wilks tolerance limit is used in these cases.

A tolerance interval is a random interval (L, U) that contains, with probability (or confidence) β, at least a fraction γ of the population under study.
The probability β and the fraction γ are criteria selected by the analyst depending on the confidence desired.
The pioneering work in this area is attributed to Wilks [1-2] and later to Wald [3-4].
The Wilks tolerance limit is an efficient and simple sampling method that reduces the sample size from a few thousand to around 100 or so.
The sample size does not depend on the number of uncertain parameters in the model.

There are two kinds of tolerance limits:

Non-parametric tolerance limits: Nothing is known about the distribution of the random variable except that it is continuous.
Parametric tolerance limits: The distribution function representing the random variable of interest is known, and only some of the distribution parameters involved are unknown.

The problem in both cases is to calculate a tolerance range (L, U) for a random variable X represented by the observed sample, x1, …, xm, and the corresponding size of the sample, such that
P{ ∫_L^U f(x) dx ≥ γ } = β    (1)
where f(x) is the probability density function of the random variable X.

Let us consider a complex system represented by a model (e.g., a risk model).
Such a model may describe relationship between the output variables (e.g., probability of failure or performance value of a system) as a function of some input (random) variables (e.g., geometry, material properties, etc.).
Assume several parametric variables are involved in the model.
Further assume that the observed randomness of the output variables is the result of the randomness of the input variables.
If we take N samples of each input variable, then we obtain a sample of N output values {y1, …, yN} for y = f(x).
In using (1) for this problem, note that the probability β bears the name confidence level.
To be on the conservative side, one should also specify the probability content γ, in addition to the confidence level β, as large as possible.
It should be emphasized that γ is not a probability, although it is a non-negative real number less than one [5].
Having fixed β and γ, it becomes possible to determine the number of runs (samples of output) N required to remain consistent with the selected β and γ values.

Let y1, …, yN be N independent output values of y. Suppose that nothing is known about the pdf g(y) except that it is continuous.
Arrange the values y1, …, yN in increasing order and denote them by y(k), hence
y(1) < y(2) < … < y(N)    (2)
and, by definition, y(0) = −∞ and y(N+1) = +∞. Taking L = y(r) and U = y(s) with 0 ≤ r < s ≤ N+1, it can be shown [5] that the confidence level β is obtained from
β = Σ_{j=0}^{s−r−1} C(N, j) γ^j (1 − γ)^(N−j)    (3)
where C(N, j) is the binomial coefficient. From equation (3) the sample size N can be estimated. To apply this approach, consider two cases of tolerance limits, one-sided and two-sided, as follows:

One-sided Tolerance Limits: This is the more common case, for example when measuring a model output value such as temperature or shear stress at a point on the surface of a structure.
We are interested in assuring that a small sample of, for example, estimated temperatures obtained from the model, and the corresponding upper sample tolerance limit TU according to (3), contains with probability β (say 95%) at least the fraction γ of the temperatures in a fictitious sample containing infinite estimates of such temperatures.
When the upper limit is taken as the largest value in the sample, this requirement reduces to 1 − γ^N ≥ β.
Table I shows values for sample size N based on values of β and γ. For example, if β = 0.95 and γ = 0.90, then N = 29 samples taken from the model (e.g., by standard Monte Carlo sampling) assure that the highest temperature TH in this sample represents the 95% upper confidence limit below which 90% of all possible temperatures lie.
[Table I: one-sided Wilks sample size N for selected values of β and γ]

Two-Sided Tolerance Limits: We now consider the two-sided case, which is less common [6].
Table II shows the Wilks sample size. With β and γ both equal to 95%, we get N = 93 samples.
For example, for the 93 samples taken from the model (e.g., by standard Monte Carlo sampling), we can say that the limits (TL, TH) from this sample represent the 95% confidence interval within which 95% of all possible temperatures lie.
[Table II: two-sided Wilks sample size N for selected values of β and γ]
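The sample sizes in such tables can be regenerated with a short search. The sketch below assumes the usual first-order Wilks conditions, with the limits taken as the extreme values of the sample: 1 − γ^N ≥ β for the one-sided case and 1 − γ^N − N(1 − γ)γ^(N−1) ≥ β for the two-sided case, where β is the confidence level and γ the probability content. The function names are illustrative.

```python
def wilks_one_sided(beta, gamma):
    """Smallest N such that the sample maximum bounds a fraction gamma
    of the population with confidence beta: 1 - gamma**N >= beta."""
    n = 1
    while 1.0 - gamma**n < beta:
        n += 1
    return n


def wilks_two_sided(beta, gamma):
    """Smallest N such that (sample min, sample max) contains a fraction
    gamma of the population with confidence beta."""
    n = 2
    while 1.0 - gamma**n - n * (1.0 - gamma) * gamma ** (n - 1) < beta:
        n += 1
    return n


for beta, gamma in [(0.90, 0.95), (0.95, 0.90), (0.95, 0.95)]:
    print(f"beta={beta:.2f} gamma={gamma:.2f}: "
          f"one-sided N={wilks_one_sided(beta, gamma)}, "
          f"two-sided N={wilks_two_sided(beta, gamma)}")
```

Under these conditions the one-sided search gives N = 45 for β = 0.90 and γ = 0.95, N = 29 for β = 0.95 and γ = 0.90, and N = 59 when both are 0.95; the two-sided search gives N = 93 when both are 0.95. None of these depend on how many uncertain inputs the model has.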

Example 1:
A manufacturer of steel bars wants to order boxes for shipping its bars.
The manufacturer wants to order boxes of an appropriate length, with 90% confidence that at least 95% of the bars do not exceed the box length.
How many samples, N, should the manufacturer select, and which one should be used as the measure of the box length?

From Table I, with γ = 95% and β = 90%, the value for N is 45.
The manufacturer should set the box length to that of the x45 sampled bar, that is, the longest of the 45 bars when the samples are ordered.
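A quick simulation shows what the confidence statement means in practice. The sketch below repeatedly draws N bars, takes the longest as the box length, and counts how often that length really covers at least a fraction γ of the population. The normal population of bar lengths (mean 100, standard deviation 2, arbitrary units) is a hypothetical stand-in used only so that the true coverage can be computed; the Wilks argument itself does not need to know the distribution.

```python
import random
from statistics import NormalDist

# Hypothetical population of bar lengths; only used to score the coverage
POPULATION = NormalDist(mu=100.0, sigma=2.0)


def coverage_confidence(n_samples, gamma, trials=4000, seed=7):
    """Fraction of trials in which the longest of n_samples bars is at least
    as long as a fraction gamma of the whole population."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        longest = max(rng.gauss(POPULATION.mean, POPULATION.stdev)
                      for _ in range(n_samples))
        if POPULATION.cdf(longest) >= gamma:
            hits += 1
    return hits / trials


for n in (29, 45):
    estimate = coverage_confidence(n, gamma=0.95)
    print(f"N={n}: estimated confidence {estimate:.3f}, "
          f"1 - gamma^N = {1.0 - 0.95**n:.3f}")
```

Under these assumptions the estimate tracks 1 − γ^N: roughly 0.90 for N = 45 and roughly 0.77 for N = 29.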
To compare the Wilks tolerance limit with Bayes’ Monte Carlo, consider a complex Mathematical-based routine [7] (called MD-Fracture) used to calculate the probability of a nuclear reactor pressure vessel fracture due to pressurized thermal shock.
Certain transient scenarios can cause a rapid cooling inside the reactor vessel while it is pressurized.

Example 2:
A 2.828-inch surge line break in a certain design of nuclear reactors may lead to such a condition.
Many input variables contribute to the amount of thermal stress and fracture toughness of the vessel.
Some of them may involve uncertainties.
The temperature, pressure and heat transfer coefficient are examples of such variables, represented by normal distributions.
Also, the flaw size, the distance from the flaw inner tip to the interface between the base and clad of the reactor vessel (C_Dist), and the aspect ratio are unknown and can be represented by random variables with the distributions shown in Table III.
To compare the Wilks approach with standard Monte Carlo simulation for vessel fracture in this scenario, three Wilks runs with 100 samples each (assuming γ = 95% and β = 95% for the two-sided case, as shown in Table II) and two Monte Carlo runs with 1000 and 2000 trials were performed using the MD-Fracture Mathematical-based tool.
Results show good agreement between the Wilks tolerance limits and simple Monte Carlo sampling, as shown in Figure I.
[Figure I: comparison of Wilks tolerance limits with standard Monte Carlo results]

1) Wilks, S.S., Determination of Sample Sizes for Setting Tolerance Limits. The Annals of Mathematical Statistics, 12(1), 91, 1941.
2) Wilks, S.S., Statistical Prediction with Special Reference to the Problem of Tolerance Limits. The Annals of Mathematical Statistics, 13(4), 400, 1942.
3) Wald, A., An Extension of Wilks’ Method for Setting Tolerance Limits. The Annals of Mathematical Statistics, 14(1), 45, 1943.
4) Wald, A., Tolerance Limits for a Normal Distribution. The Annals of Mathematical Statistics, 17(2), 208, 1946.
5) Guba, A., Makai, M., and Pal, L., Statistical aspects of best estimate method I., Reliability Engineering & System Safety, 80 (3), 217, 2003.
6) Nutt, W.T., and Wallis, G.B., Evaluation of nuclear safety from the outputs of computer codes in the presence of uncertainties. Reliability Engineering & System Safety, 83(1), 57, 2004.
7) Li, F. and Modarres, M., Characterization of Uncertainty in the Measurement of Nuclear Reactor Vessel Fracture Toughness and Probability of Vessel Failure, Transactions of the American Nuclear Society Annual Meeting, Milwaukee, 2001.

By: Mohammad Pourgol-Mohammad, Ph.D., P.E., CRE

Previously published in the June 2016 Volume 7, Issue 2 ASQ Reliability Division Newsletter

Picture © B. Poncelet

Posted in General

Reliability Training Material


Slides from Quanterion Solutions Inc Lunchtime Learning series. Topics include Reliability distributions, Weibull analysis, FMEA, DOE.
Slides available at:

Previously published in the June 2013 Volume 4, Issue 2 ASQ Reliability Division Newsletter

Picture © B. Poncelet

Posted in General

Quote: answer


“An approximate answer to the right question is worth a good deal more than the exact answer to an approximate problem.”

—John W. Tukey (1915–2000)

Picture © B. Poncelet

Posted in General

Top Twelve list of Mechanical Reliability Problems


1. Always assume the worst will eventually happen. This applies especially to critical parts and assemblies. Know what is critical especially to the use and customers. What are the critical parts? How will the customer abuse the assembly?

2. Always check for tolerance stack-up problems. Parts in tolerance today may not be in the future. Don’t assume stability from the suppliers or that wear cannot occur. Most of the time we do not know the relationship between being in specification and the ultimate reliability. A DOE (Design of Experiments) would help here.

3. Metal inserts in plastic parts are hard to mold well. This may lead to problems in use because of residual stress and will eventually cause problems through tool wear and/or part cracking.

4. Always maximize the radii that are present. Small radii lead to high stress concentrations and failure-prone places. Harden these areas or use harder metals where possible when the radii cannot be increased.

5. Use as few connections as possible. This includes connectors, wire connections such as welds or solder joints, crimps and material connections and seals. Remember all connections are potentially weak points that will fail given time and stress.

6. All seals fail given time and stress. You need at least two levels of sealing to ensure the product will last as long as the customer expects. Remember that some materials diffuse through others. Perhaps three levels of seal are required.

7. Threads on bolts and screws shouldn’t carry shear loads. Remember they need preloads and/or stretches to ensure proper loading initially. Metal stretches, fractures and corrodes as well as developing high stress concentrations in use under tension. Be sure to allow for this.

8. Use as few nuts, bolts and screws as possible. While these are convenient temporary connection methods, they are 100-year-old technology. Lock washers and Loctite have been developed to slow down the rate of loosening. All will eventually come loose anyway when stress, temperature or vibration is present.

9. Belts and chains will stretch and/or slip when used to deliver power. Remember these types of parts need constant tension devices to aid their reliability. Again this is old technology that can be made reliable by careful application. (Note: this is one of the biggest field problems with snow throwers.)

10. Avoid set screws as these easily come loose because of their small sizes. Even when used on a flat, set screws are only “temporary connection” mechanisms. Loctite only makes “the temporary” a little longer in the presence of stress.

11. Watch the use of metal arms to carry loads. They often deflect in an imperceptible manner. This is especially true when loads are dynamic.

12. Integrate as many mechanical functions as possible. Use as few separate and distinct mechanical parts that are joined as possible. Joints are usually unreliable.

Each of these common mechanical design practices appears in everyday products where 10% failures per year might be acceptable or near the limit of technology (washing machines, other appliances, many instruments and even some cars). The same standard designs will not work well in high-reliability applications where only 1 or 2% failures per year are desired or acceptable (aerospace, military, medical devices, etc.). Remember the difference between the two applications when designing.

By: James McLinn CRE, Fellow ASQ

Published in Mechanical Design Reliability Handbook: Simplified Approaches and Techniques ISBN 0277-9633 February 2010 (available as free download for ASQ Reliability Division Members)

Picture © B. Poncelet

Posted in General

William Sealy Gosset


William Sealy Gosset, alias “Student,” was an immensely talented scientist of diverse interests, but he will be remembered primarily for his contributions to the development of modern statistics.
Born in Canterbury in 1876, he was educated at Winchester and New College, Oxford, where he studied chemistry and mathematics.
Around the turn of the 20th century, Arthur Guinness, Son & Co. became interested in hiring scientists to analyze data concerned with various aspects of its brewing process.
Gosset was to be one of the first of these scientists, and so it was that in 1899 he moved to Dublin to take up a job as a brewer at St. James’s Gate.
In 1935 he left Dublin to become head brewer at the new Guinness Park Royal brewery in London, but he died soon thereafter at the young age of 61 in 1937.
After initially finding his feet at the brewery in Dublin, Gosset wrote a report for Guinness in 1904 called “The Application of the Law of Error to Work of the Brewery.”
The report emphasized the importance of probability theory in setting an exact value on the results of brewery experiments, many of which were probable but not certain.
Most of the report was the classic theory of errors (Airy and Merriman) being applied to brewery analysis, but it also showed signs of a curious mind at work exploring new statistical horizons.
The report concluded that a mathematician should be consulted about special problems with small samples in the brewery.

Taken from: Philip J. Boland (1984): “A Biographical Glimpse of William Sealy Gosset”, The American Statistician, 38:3, 179-183.

Previously published in the June 2013 Volume 4, Issue 2 ASQ Reliability Division Newsletter

Picture © B. Poncelet

Posted in General

Dynamic Fault Tree Method (Part 3 of 3)


Simulation based methods, especially Monte Carlo simulation based techniques, can solve these problems.
According to the literature, complex systems that are difficult to solve with analytical methods can be solved readily with the Monte Carlo simulation approach [3,4,7,12].
Because they can model real conditions and the stochastic behavior of the system, reliability methods based on Monte Carlo simulation can eliminate uncertainty in reliability modeling [7].
The utilization of this approach is increasing for the calculation and estimation of reliability of dynamic systems.

DFT Versus SFT
Although there are many rational reasons to use dynamic methods in industry, their usage is not yet very common.
Perhaps the main reason is directly related to the owners.
They do not bother to modernize the existing static methods, such as RBD and SFT, which are extensively used in industry.
This comes from two main causes [5]. First, static approaches are simpler and have been extensively tested.
Second, dynamic approaches are still too vague to apply to industrial applications.
Also, from a technical point of view, an SFT can be translated into an RBD, but the conversion to a DFT still has to be worked out.

A Simple Example of DFT and SFT
Fig. 3 presents a simple example of an SFT (left) and a DFT for a similar system.
Let us consider a failure rate of 0.01 (1/hr) for all basic events (BEs).
In this example, the Top Event (TE) of the SFT occurs if all of the BEs occur, that is, if A, B and C have all occurred, regardless of their sequence.
[Fig. 3: SFT (left) and DFT (right) for the example system]
Now let us consider the DFT for this case.
For this DFT, the TE also occurs only if all of the BEs occur, but the order in which this configuration is reached matters.
Due to the presence of the PAND gate, the sequence of events is important.
For the DFT’s TE to occur, A and B must occur before C.
At a mission time of 1000 hours, the unreliability values for the SFT and DFT are 9.99E-16 and 3.33E-16, respectively (the numerical analysis was done with the “PTC Windchill” software).
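The factor of three between the two results has a simple explanation when the three basic events are independent and share the same exponential failure rate: given that all three have failed, each ordering is equally likely, so C is the last to fail in one third of the cases. The sketch below illustrates this under those assumptions. The rate-time product of 1e-5 per basic event used in the analytic part is an assumption, chosen because it reproduces values of the reported size; it is not stated in the text. The simulation uses a larger, hypothetical product of 1 so that failures are frequent enough to count.

```python
import math
import random


def failure_prob(rate, time):
    """P(component fails by time) for an exponential failure time."""
    return -math.expm1(-rate * time)


def sft_and_gate(rate, time):
    """Static AND of three identical, independent basic events."""
    return failure_prob(rate, time) ** 3


def dft_c_last(rate, time):
    """All three fail by time AND C fails after both A and B.
    With identical rates, C is last in 1/3 of the all-failed cases."""
    return sft_and_gate(rate, time) / 3.0


def simulate(rate, time, trials=100000, seed=3):
    """Monte Carlo estimate of the same two top-event probabilities."""
    rng = random.Random(seed)
    sft_hits = dft_hits = 0
    for _ in range(trials):
        a, b, c = (rng.expovariate(rate) for _ in range(3))
        if max(a, b, c) <= time:
            sft_hits += 1
            if max(a, b) < c:
                dft_hits += 1
    return sft_hits / trials, dft_hits / trials


# Analytic values for an assumed rate * time of 1e-5 per basic event
print(f"SFT: {sft_and_gate(1e-8, 1000.0):.3e}, DFT: {dft_c_last(1e-8, 1000.0):.3e}")
# Simulation check with a hypothetical rate * time of 1
print("simulated (SFT, DFT):", simulate(0.01, 100.0))
print("analytic  (SFT, DFT):", (sft_and_gate(0.01, 100.0), dft_c_last(0.01, 100.0)))
```

Direct simulation is only practical when the top event is not too rare; for probabilities near 1e-15 an analytic solution or a variance-reduction technique would be needed.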

[1] Bechta Dugan, J., Bavuso, Salvatore J., Boyd, M.A., 1992, “Dynamic Fault-Tree Models For Fault-Tolerant Computer Systems,” IEEE Transactions on Reliability, Vol. 41, pp. 363 – 377.
[2] Xing, L., Amari, S. V., 2008, “Handbook of Performability Engineering,” Fault Tree Analysis, London, Springer London, pp. 595-620.
[3] DurgaRao, K., et al., 2009, “Dynamic Fault Tree Analysis Using Monte Carlo Simulation In Probabilistic Safety Assessment,” Reliability Engineering & System Safety, Vol. 94, pp. 872–883.
[4] Berg, G.V., “Monte Carlo Sampling of Dynamic Fault Trees for Reliability Prediction,”
[5] Manno, G., et al., 2014, “Conception of Repairable Dynamic Fault Trees and resolution by the use of RAATSS, a Matlab toolbox based on the ATS formalism.” Reliability Engineering & System Safety, Vol. 121, pp.
[6] Chiacchio, F., et al., 2011, “Dynamic Fault Trees Resolution: A Conscious Trade-Off between Analytical and Simulative Approaches,” Reliability Engineering & System Safety, Vol. 96, pp. 1515–1526.
[7] Faulin Fajardo, J., et al., 2010, “Simulation Methods for Reliability and Availability of Complex Systems,” British Library Cataloguing in Publication Data, pp. 41-64.
[8] Amari, S., Dill, G., Howald, E., 2003, “A New Approach To Solve Dynamic Fault Trees,” Annual Reliability and Maintainability Symposium, IEEE Publisher., pp. 374 – 379.
[9] Rausand, M., Hoyland, A., 2003, “System Reliability Theory: Models, Statistical Methods, and Applications,” 2nd Edition, New York, USA, Wiley-Interscience

By: Mohammad Pourgol-Mohammad, Ph.D., P.E., CRE

Previously published in the December 2015 Volume 6, Issue 4 ASQ Reliability Division Newsletter

Picture © B. Poncelet

Posted in General