History of Reliability Engineering - ASQ Risk and Reliability

History of Reliability

by James McLinn


Reliability is a popular concept that has been celebrated for years as a commendable attribute of a person or a product. Its modest beginning was in 1816, far earlier than most would guess. The word “reliability” was coined by the poet Samuel Taylor Coleridge [17]. In statistics, reliability is the consistency of a set of measurements or of a measuring instrument, and is often used to describe a test. Reliability is inversely related to random error [18]. In psychology, reliability refers to the consistency of a measure. A test is considered reliable if we get the same result repeatedly. For example, if a test is designed to measure a trait (such as introversion), then each time the test is administered to a subject, the results should be approximately the same [19]. Thus, before World War II, reliability as a word came to mean dependability or repeatability. The modern use was redefined by the U.S. military in the 1940s and has evolved to the present. It initially came to mean that a product would operate when expected. The current meaning connotes a number of additional attributes that span products, service applications, software packages or human activity. These attributes now pervade every aspect of our present-day, technologically intensive world. Let’s follow the recent journey of the word “reliability” from the early days to the present.

An early application of reliability might relate to the telegraph. It was a battery-powered system with simple transmitters and receivers connected by wire. The main failure mode might have been a broken wire or insufficient voltage. Until the light bulb, the telephone, and AC power generation and distribution, there was not much new in electronic applications for reliability. By 1915, radios with a few vacuum tubes began to appear in public use. Automobiles came into more common use by 1920 and may represent mechanical applications of reliability. In the 1920s, product improvement through the use of statistical quality control was promoted by Dr. Walter A. Shewhart at Bell Labs [20]. On a parallel path with product reliability was the development of statistics in the twentieth century. Statistics as a tool for making measurements would become inseparable from the development of reliability concepts.

At this point, designers were still responsible for reliability and repair people took care of the failures. There wasn’t much planned proactive prevention or economic justification for doing so. Throughout the 1920s and 30s, Taylor worked on ways to make products more consistent and the manufacturing process more efficient. He was the first to separate the engineering from management and control [21]. Charles Lindbergh required that the 9-cylinder air-cooled engine for his 1927 transatlantic flight be capable of 40 continuous hours of operation without maintenance [22]. Much individual progress was made into the 1930s in a few specific industries. Quality and process measures were in their infancy, but growing. Waloddi Weibull was working in Sweden during this period and investigated the fatigue of materials. He created a distribution, which we now call Weibull, during this time [1]. In the 1930s, Rosin and Rammler were also investigating a similar distribution to describe the fineness of powdered coal [16].

By the 1940s, reliability and reliability engineering still did not exist. The demands of WWII introduced many new electronics products into the military. These ranged from electronic switches and vacuum tube portable radios to radar and electronic detonators. Electronic tube computers were started near the end of the war, but were not completed until after the war. At the onset of the war, it was discovered that over 50% of the airborne electronics equipment in storage was unable to meet the requirements of the Air Corps and Navy [1, page 3]. More importantly, much of the reliability work of this period also had to do with testing new materials and fatigue of materials. M.A. Miner published the seminal paper titled “Cumulative Damage in Fatigue” in 1945 in an ASME Journal. B. Epstein published “Statistical Aspects of Fracture Problems” in the Journal of Applied Physics in February 1948 [2]. The main military application for reliability was still the vacuum tube, whether it was in radar systems or other electronics. These systems had proved problematic and costly during the war. For shipboard equipment after the war, it was estimated that half of the electronic equipment was down at any given time [3]. Vacuum tubes in sockets were a natural cause of intermittent system problems. Banging the system or removing the tubes and reinstalling them were the two main ways to fix a failed electronic system. This process was gradually giving way to cost considerations for the military. They couldn’t afford to have half of their essential equipment non-functional all of the time. The operational and logistics costs would become astronomical if this situation wasn’t soon rectified. IEEE formed the Reliability Society in 1948 with Richard Rollman as the first president. Also in 1948, Z.W. Birnbaum founded the Laboratory of Statistical Research at the University of Washington which, through its long association with the Office of Naval Research, served to strengthen and expand the use of statistics [23].

The start of the 1950s found the bigger reliability problem being defined and solutions being proposed in both military and commercial applications. The early large Sperry vacuum tube computers were reported to fill a large room, consume kilowatts of power, have a 1024-bit memory and fail on average about every hour [8]. The Sperry solution was to permit the failed section of the computer to shut off and the tubes to be replaced on the fly. In 1951, Rome Air Development Center (RADC) was established in Rome, New York to study reliability issues with the Air Force [24]. That same year, Waloddi Weibull published his first paper in English for the ASME Journal of Applied Mechanics. It was titled “A Statistical Distribution Function of Wide Applicability” [28]. By 1959, he had produced “Statistical Evaluation of Data from Fatigue and Creep Rupture Tests: Fundamental Concepts and General Methods” as Wright Air Development Center Report 59-400 for the US military.

On the military side, a 1950 study group was initiated. This group was called the Advisory Group on the Reliability of Electronic Equipment, AGREE for short [4, 5]. By 1952, an initial report by this group recommended the following three items for the creation of reliable systems:

There was a need to develop better components and more consistency from suppliers.
The military should establish quality and reliability requirements for component suppliers.
Actual field data should be collected on components in order to establish the root causes of problems.

In 1955, a conference on electrical contacts and connectors was started, emphasizing reliability physics and understanding failure mechanisms. Other conferences began in the 1950s to focus on some of these important reliability topics. That same year, RADC issued “Reliability Factors for Ground Electronic Equipment.” This was authored by Joseph Naresky. By 1956, ASQC was offering papers on reliability as part of their American Quality Congress. The radio engineers, ASME, ASTM and the Journal of Applied Statistics were contributing research papers. The IRE was already holding a conference and publishing proceedings titled “Transaction on Reliability and Quality Control in Electronics”. This began in 1954 and continued until this conference merged with an IEEE Reliability conference and became the Reliability and Maintainability Symposium.

In 1957, a final report was generated by the AGREE committee and it suggested the following [6]. Most vacuum tube radio systems followed a bathtub-type curve. It was easy to develop replaceable electronic modules, later called Standard Electronic Modules (or SEMs), to quickly restore a failed system, and the report emphasized modularity of design. Additional recommendations included running formal demonstration tests with statistical confidence for products. Also recommended was running longer and harsher environmental tests that included temperature extremes and vibration. This came to be known as AGREE testing and eventually turned into Military Standard 781. The last item provided by the AGREE report was the classic definition of reliability. The report stated that the definition is “the probability of a product performing without failure a specified function under given conditions for a specified period of time”. Another major report on “Predicting Reliability” in 1957 was that by Robert Lusser of Redstone Arsenal, where he pointed out that 60% of the failures of one Army missile system were due to components [7]. He showed that current methods for obtaining quality and reliability for electronic components were inadequate and that something more was needed. ARINC set up an improvement process with vacuum tube suppliers and reduced infant mortality removals by a factor of four [25]. Late in the decade, RCA published information in TR1100 on the failure rates of some military components. RADC picked this up and it became the basis for Military Handbook 217. The decade closed with a lot of promise and activity. Papers were being published at conferences showing the growth of this field. Consider the following examples: “Reliability Handbook for Design Engineers” published in Electronic Engineers, number 77, pp. 508-512 in June 1958 by F.E. Dreste and “A Systems Approach to Electronic Reliability” by W.F. Leubbert in the Proceedings of the I.R.E., vol. 44, p. 523 in April, 1956 [7]. Over the next several decades, Birnbaum made significant contributions to probabilistic inequalities (i.e. Chebychev), nonparametric statistics, reliability of complex systems, cumulative damage models, competing risk, survival distributions and mortality rates [23]. C.M. Ryerson also produced a history of reliability to 1959 [26], published in the Proceedings of the IRE.

The 1960s dawned with several significant events. RADC began the Physics of Failure in Electronics Conference sponsored by Illinois Institute of Technology (IIT). A strong commitment to space exploration would turn into NASA, a driving force for improved reliability of components and systems. Richard Nelson of RADC produced the document “Quality and Reliability Assurance Procedures for Monolithic Microcircuits,” which eventually became Mil-Std 883 and Mil-M 38510. Semiconductors came into more common use as small portable transistor radios appeared. Next, the alternator became possible with low cost germanium and later silicon diodes able to meet the under-the-hood requirements. Dr Frank M Gryna published a Reliability Training Text through the Institute of Radio Engineers. The nuclear power industry was growing by leaps and bounds at that point in history. The demands of the military ranging from missiles to airplanes, helicopters and submarine applications drove a variety of technologies. The study of the effects of EMC on systems was initiated at RADC and this produced many developments in the 1960s.

During this decade, a number of people began to use, and contribute to the growth and development of, the Weibull function, the common use of the Weibull graph, and the propagation of Weibull analysis methods and applications. A few of the people who helped develop Weibull analysis are mentioned here. First, Dorian Shainin wrote an early booklet on Weibull in the late 1950s, while Leonard Johnson at General Motors helped improve the plotting methods by suggesting median ranks and beta-binomial confidence bounds in 1964. Professor Gumbel demonstrated that the Weibull distribution is a Type III Smallest Extreme Value distribution [9]. This is the distribution that describes a weakest-link situation. Dr. Robert Abernethy was an early adopter at Pratt and Whitney, and he developed a number of applications, analysis methods and corrections for the Weibull function.
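As a rough illustration of the Weibull plotting described above, the following sketch computes the two-parameter Weibull reliability function and the plotting positions for a set of failure times. The function names, the parameter names `beta` (shape) and `eta` (scale), and the use of Benard's approximation for the median ranks are assumptions made for this example; they are not taken from Johnson's or Abernethy's work.

```python
import math


def weibull_reliability(t, beta, eta):
    """Two-parameter Weibull survival probability R(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / eta) ** beta))


def median_ranks(n):
    """Benard's approximation (assumed here) to the median rank of each of n ordered failures."""
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]


def weibull_plot_points(failure_times):
    """Return (ln t, ln(-ln(1 - F))) pairs, where F is the median rank.

    For Weibull-distributed data these points fall near a straight line
    whose slope estimates the shape parameter beta.
    """
    times = sorted(failure_times)
    ranks = median_ranks(len(times))
    return [(math.log(t), math.log(-math.log(1.0 - f))) for t, f in zip(times, ranks)]
```

At `t = eta` the reliability is `exp(-1)`, about 37%, regardless of the shape parameter, which is why the scale parameter is often called the characteristic life.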

In 1963, Weibull was a visiting professor at Columbia and there worked with professors Gumbel and Freudenthal in the Institute for the Study of Fatigue and Reliability. While he was a consultant for the US Air Force Materials Laboratory, he published a book on materials and fatigue testing in 1961. He later went on to work for the US military, producing reports through 1970 [9].

A few additional key dates and events in this decade should be mentioned. In 1962, G.A. Dodson and B.T. Howard of Bell Labs published “High Stress Aging to Failure of Semiconductor Devices” in the Proceedings of the 7th National Symposium of Reliability and Quality Control [5]. This paper justified the Arrhenius model for semiconductors. Many other papers at this conference looked at other components for improvement. By 1967, this conference was re-titled as the Reliability Physics Symposium, RPS, with “International” being added a few years later. 1962 was also a key year because of the first issue of Military Handbook 217 by the Navy. Already, the two main branches of reliability existed. One branch existed for investigation of failures and the other for predictions. Later in the decade came the first paper on step stress testing, by Shurtleff and Workman, which set limits to this technique when applied to integrated circuits. J.R. Black published his work on the physics of electromigration in 1967. The decade ended with studies of wafer yields as silicon began to dominate reliability activities and a variety of industries. By this time, the ER and TX families of specifications had been defined. The U.S. Army Materiel Command issued a Reliability Handbook (AMCP 702-3) in October of 1968, while Shooman’s Probabilistic Reliability was published by McGraw-Hill the same year to cover statistical approaches. The automotive industry was not to be outdone and issued a simple FMEA handbook for improvement of suppliers. This was based upon work done on failure mode investigation and root cause by the military, but not yet published as a Military Standard. Communications were enhanced by the launch of a series of commercial satellites, INTELSAT. These provided voice communications between the U.S. and Europe. Around the world, in other countries, professionals were beginning to investigate reliability and participate with papers at conferences.
The decade ended with a landing on the moon showing how far reliability had progressed in only 10 years. Human reliability had now been recognized and studied, which resulted in a paper by Swain on the techniques for human error rate prediction (THERP) [11]. In 1969, Birnbaum and Saunders described a life distribution model that could be derived from a physical fatigue process where crack growth causes failure [23]. This was important for new models that described degradation processes.
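The Arrhenius model mentioned above in connection with the Dodson and Howard paper is commonly used to relate failure rates at an elevated test temperature to those at a normal use temperature. The sketch below shows the usual acceleration-factor form. The function name, the argument names, and any activation energy passed in are assumptions for illustration only and are not values from the original paper.

```python
import math

# Boltzmann constant in electron-volts per kelvin.
BOLTZMANN_EV_PER_K = 8.617333262e-5


def arrhenius_acceleration(ea_ev, use_temp_c, stress_temp_c):
    """Acceleration factor between a stress temperature and a use temperature.

    AF = exp((Ea / k) * (1/T_use - 1/T_stress)), with temperatures in kelvin.
    ea_ev is the activation energy in eV (a hypothetical input here).
    """
    t_use = use_temp_c + 273.15
    t_stress = stress_temp_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use - 1.0 / t_stress))
```

An acceleration factor greater than 1 means that one hour at the stress temperature is treated as equivalent to that many hours at the use temperature, for a failure mechanism that actually follows the model.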

During the decade of the 1970s, work progressed across a variety of fronts. In this decade, the use and variety of ICs increased. Bipolar, NMOS and CMOS all developed at an amazing rate. In the middle of the decade, ESD and EOS were covered by several papers and eventually evolved into a conference by the decade end. Likewise, passive components which were once covered by IRPS, moved to a Capacitor and Resistor Technology Symposium (CARTS) for continued advancement on all discrete components. A few highlights of the decade were the first papers on gold-aluminum inter-metallic products, accelerated testing, the use of Scanning Electron Microscopes for analysis and loose particle detection testing (PIND). In mid-decade, Hakim and Reich published a detailed paper on the evaluation of plastic encapsulated transistors and ICs based upon field data. Other areas being studied included gold embrittlement, PROM nichrome link grow back, moisture, out gassing of glass sealed packages and problems with circuit boards. Perhaps the two most memorable reliability papers from this decade were one on soft error rates caused by alpha particles (Woods and May) and on accelerated testing of ICs with activation energies calculated for a variety of failure mechanisms by D.S. Peck. By the end of the decade, commercial field data were being collected by Bellcore as they strived to achieve no more than 2 hours of downtime over 40 years. This data became the basis of the Bellcore reliability prediction methodology [28].

The Navy Material Command brought in Willis Willoughby from NASA to help improve military reliability across a variety of platforms. During the Apollo space program, Willoughby had been responsible for making sure that the spacecraft worked reliably all the way to the moon and back. In coming to the Navy, he was determined to prevent unreliability. He insisted that all contracts contain specifications for reliability and maintainability instead of just performance requirements. Willoughby’s efforts were successful because he attacked the basics and worked upon a broad front. Wayne Tustin credits Willoughby with emphasizing temperature cycling and random vibration, which became ESS testing. This was eventually issued as a Navy document P9492 in 1979. Next, he published a book on Random Vibration with Tustin in 1984. After that, he replaced older quality procedures with the Navy Best Manufacturing Practice program. The microcomputer had been invented and was making changes to electronics while RAM memory size was growing at a rapid rate. Electronic calculators had shrunk in size and cost and now rivaled early vacuum tube computers in capability by 1980. Military Standard 1629 on FMEA was issued in 1974, and human factors engineering and human performance reliability had been recognized by the Navy as important to the operating reliability of complex systems and work continued in this area. They led groundbreaking work with a Human Reliability Prediction System User’s Manual in 1977 [12]. The Air Force contributed with the Askren-Regulinski exponential models for human reliability. NASA made great strides at designing and developing spacecraft such as the space shuttle. Their emphasis was on risk management through the use of statistics, reliability, maintainability, system safety, quality assurance, human factors and software assurance [10]. Reliability had expanded into a number of new areas as technology rapidly advanced.

The 1980s was a decade of great changes. Televisions had become all semiconductor. Automobiles rapidly increased their use of semiconductors with a variety of microcomputers under the hood and in the dash. Large air conditioning systems developed electronic controllers, as had microwave ovens and a variety of other appliances. Communications systems began to adopt electronics to replace older mechanical switching systems. Bellcore issued the first consumer prediction methodology for telecommunications and SAE developed a similar document SAE870050 for automotive applications [29]. The nature of predictions evolved during the decade and it became apparent that die complexity wasn’t the only factor that determined failure rates. Kam Wong published a paper at RAMS questioning the bathtub curve [25]. During this decade, the failure rate of many components dropped by a factor of 10. Software became important to the reliability of systems; this discipline rapidly advanced with work at RADC and the 1984 article “History of Software Reliability” by Martin Shooman [13] and the book Software Reliability – Measurement, Prediction, Application by Musa et al. Complex software-controlled repairable systems began to use availability as a measure of success. Repairs on the fly or quick repairs to keep a system operating would be acceptable. Software reliability developed models such as Musa Basic to predict the number of missed software faults that might remain in code. The Naval Surface Warfare Center issued Statistical Modeling and Estimation of Reliability Functions for Software (S.M.E.R.F.S) in 1983 for evaluating software reliability. Developments in statistics made an impact on reliability. Contributions by William Meeker, Gerald Hahn, Richard Barlow and Frank Proschan developed models for wear, degradation and system reliability. Events of note in the decade were the growing dominance of the CMOS process across most digital IC functions.
Bipolar technologies, PMOS and NMOS gave way in most applications by the end of the decade. CMOS had a number of advantages such as low power and reliability. High speed applications and high power were still dominated by Bipolar. At the University of Arizona, under Dr. Dimitri Kececioglu, the Reliability Program turned out a number of people who later became key players in a variety of industries. The PC came into dominance as a tool for measurement and control. This enhanced the possibility of canned programs for evaluating reliability. Thus, by decade end, programs could be purchased for performing FMEAs, FTAs, reliability predictions, block diagrams and Weibull Analysis. The Challenger disaster caused people to stop and re-evaluate how they estimate risk. This single event spawned a reassessment of probabilistic methods. Pacemakers and implantable infusion devices became common and biomedical companies were quick to adopt the high reliability processes that had been developed by the military. The Air Force issued the R&M 2000 which was aimed at making R&M tasks normal business practice [26]. David Taylor Research Center in Carderock, Maryland commissioned a handbook of reliability prediction procedures for mechanical equipment from Eagle Technology in 1988 [30]. This was typically called the Carderock Handbook and was issued by the Navy in 1992 as NSWC 92/L01 [31]. Altogether, the 1980s demonstrated progress in reliability across a number of fronts from military to automotive and telecommunications to biomedical. RADC published their first Reliability Tool Kit and later updated this in the 1990s for COTS applications. The great quality improvement driven by competition from the Far East had resulted in much better components by decade end.

By the 1990s, the pace of IC development was picking up. New companies built more specialized circuits and Gallium Arsenide emerged as a rival to silicon in some applications. Wider use of stand alone microcomputers was common and the PC market helped keep IC densities following Moore’s Law and doubling about every 18 months. It quickly became clear that high volume commercial components often exceeded the quality and reliability of the small batch specially screened military versions. Early in the decade, the move toward Commercial Off the Shelf (COTS) components gained momentum. With the end of the cold war, the military reliability changed quickly. Military Handbook 217 ended in 1991 at revision F2. New research developed failure rate models based upon intrinsic defects that replaced some of the complexity-driven failure rates that dominated from the 1960s through the 1980s. This effort was led by RAC (the new name for RADC) and resulted in PRISM, a new approach to predictions. Reliability growth was recognized for components in this document. Many of the military specifications became obsolete and best commercial practices were often adopted. The rise of the internet created a variety of new challenges for reliability. Network availability goals became “five 9s or 5 minutes annually” to describe the expected performance in telecommunications. The decade demanded new approaches and two were initiated by the military. Sematech issued Guidelines for Equipment Reliability in 1992 [34]. The SAE issued a handbook on the reliability of manufacturing equipment in 1993 [32]. This was followed in 1994 by the SAE G-11 committee issuing a reliability journal [35]. Richard Sadlon, at RAC, produced a mechanical application handbook the same year [33]. The Army started the Electronic Equipment Physics of Failure Project and engaged the University of Maryland CALCE center, under Dr. Michael Pecht, as part of the process. 
The Air Force initiated a tri-service program that was cancelled later in the decade. On the software side, the Capability Maturity Model (CMM) was generated. Companies at the highest levels of the model were thought to have the fewest residual faults [14]. RAC issued a six set Blueprint for Establishing Effective Reliability Programs in 1996 [35]. The internet showed that one single software model would not work across the wide range of worldwide applications, which now includes wireless. New approaches were required such as software mirroring, rolling upgrades, hot swapping, self-healing and architecture changes [15]. New reliability training opportunities and books became available to the practitioners. ASQ made a major update to its professional certification exam to keep pace with the changes evident. ISO 9000 added reliability measures as part of the design and development portion of the certification.
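The “five 9s or 5 minutes annually” goal quoted above can be checked with simple arithmetic. The sketch below converts an availability figure into expected downtime per year. The function name and the choice of a 365.25-day year are assumptions for this example.

```python
# Minutes in an average year, assuming 365.25 days per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def annual_downtime_minutes(availability):
    """Expected downtime in minutes per year for a given availability (0 to 1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Print downtime for two through five nines of availability.
    for nines in range(2, 6):
        availability = 1.0 - 10.0 ** (-nines)
        print(f"{availability:.5f} -> {annual_downtime_minutes(availability):.2f} min/year")
```

An availability of 0.99999 works out to roughly 5.26 minutes of downtime per year, which is where the "5 minutes annually" shorthand comes from.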

The turn of the century brought Y2K issues for software. The expansion of the World Wide Web created new challenges of security and trust. Web-based information systems became common, but were found not to be secure from hacking. Thus, the expanded use of the web to store and move information could be problematic. The older problem of too little reliability information available had now been replaced by too much information of questionable value. Consumer reliability problems could now have data and be discussed online in real time. Discussion boards became the way to get questions answered or find resources. Training began the move toward webinars, rather than face-to-face classes. Insurance, banking, job hunting, newspapers, music, baseball games and magazines all went online and could be monitored in real time or downloaded. New technologies such as micro-electro-mechanical systems (MEMS), hand-held GPS, and handheld devices that combined cell phones and computers all represented challenges to maintaining reliability. Product development time continued to shorten through this decade and what had been done in three years was now done in 18 months. This meant reliability tools and tasks had to be more closely tied to the development process itself. Consumers became more aware of reliability failures and the cost to them. One cell phone developed a bad reputation (and production soon ceased) when it logged a 14% first-year failure rate. In many ways, reliability became part of everyday life and consumer expectations.

Footnotes

[1] Thomas, Marlin U., Reliability and Warranties: Methods for Product Development and Quality Improvement, CRC, New York, 2006
[2] Kapur, K.C. and Lamberson, L.R., Reliability in Engineering Design, Wiley, New York, 1977
[3] Ralph Evans, “Electronics Reliability: A Personal View”, IEEE Transactions on Reliability, vol. 47, no. 3, September 1998, pp. 329-332, 50th Anniversary special edition
[4] O’Connor, P.D.T., Practical Reliability Engineering, 4th edition, Wiley, New York, 2002, pp. 11-13
[5] George Ebel, “Reliability Physics in Electronics: A Historical View”, IEEE Transactions on Reliability, vol. 47, no. 3, September 1998, pp. 379-389, 50th Anniversary special edition
[6] Reliability of Military Electronic Equipment, Report by the Advisory Group on Reliability of Electronic Equipment, Office of the Assistant Secretary of Defense (R&D), June 1957
[7] Lloyd, David, and Lipow, Myron, Reliability: Management, Methods and Mathematics, Prentice Hall, Englewood Cliffs, 1962
[8] Personal report of Gus Huneke, Failure Analysis Manager, with whom I worked at Control Data Corporation in the late 1970s. Gus had worked on these early computer systems as a young engineer in the early 1950s at Univac.
[9] Abernethy, Robert, The New Weibull Handbook, 4th edition, self published, 2002, ISBN 0-9653062-1-6
[10] Vincent Lalli, “Space-System Reliability: A Historical Perspective”, IEEE Transactions on Reliability, vol. 47, no. 3, September 1998, pp. 355-360, 50th Anniversary special edition
[11] D. Swain, T.H.E.R.P. (Techniques for Human Error Rate Prediction), SC-R-64-1338, Sandia National Labs, 1964
[12] I. Siegel, K.P. LaSala and C. Sontz, Human Reliability Prediction System User’s Manual, Naval Sea Systems Command, 1977
[13] Shooman, “Software Reliability: A Historical Perspective”, IEEE Transactions on Reliability, vol. R-33, 1984, pp. 48-55
[14] Keene, “Modeling Software R&M Characteristics”, Reliability Review, vol. 17, no. 2 & 3, 1997
[15] Henry Malec, “Communications Reliability: A Historical Perspective”, IEEE Transactions on Reliability, vol. 47, no. 3, September 1998, pp. 333-344, 50th Anniversary special edition
[16] Dodson, Bryan, The Weibull Analysis Handbook, second edition, ASQ, Milwaukee, 2006
[17] Saleh, J.H. and Marais, Ken, “Highlights from the Early (and pre-) History of Reliability Engineering”, Reliability Engineering and System Safety, vol. 91, issue 2, February 2006, pp. 249-256
[18] http://en.wikipedia.org/wiki/Reliability (statistics definition)
[19] http://psychology.about.com/od/researchmethods/f/reliabilitydef.htm - Reliability (psychology definition)
[20] Juran, Joseph and Gryna, Frank, Quality Control Handbook, Fourth Edition, McGraw-Hill, New York, 1988, p. 24.3
[21] Juran, Joseph, editor, A History of Managing for Quality, ASQC Press, Milwaukee, 1995, pp. 555-556
[22] http://en.wikipedia.org/wiki/Spirit_of_St._Louis
[23] Birnbaum, W.Z., Obituary, Department of Mathematics, University of Washington, Dec 15, 2000, http://www.math.washington.edu/~sheetz/Obituaries/zwbirnbaum.html
[24] Reliability Analysis Center Journal, 1Q 1998, RAC
[25] Wong, Kam, “Unified Field (Failure) Theory-Demise of the Bathtub Curve”, Proceedings of Annual RAMS, 1981, pp. 402-408
[26] Knight, Raymond, “Four Decades of Reliability Progress”, Proceedings of Annual RAMS, 1991, pp. 156-160
[27] Denson, William, “The History of Reliability Predictions”, IEEE Transactions on Reliability, vol. 47, no. 3, September 1998, pp. 321-328, 50th Anniversary special edition
[28] Wallodi Weibull, “A Statistical Distribution Function of Wide Applicability”, ASME Journal of Applied Mechanics, vol. 18(3), pp. 293-297
[29] Bosch Automotive Handbook, 3rd edition, 1993, p. 159
[30] David Taylor Research Center, Carderock Division, January 1988
[31] This Mechanical Reliability Document is NSWC 92/L01
[32] Reliability and Maintainability Guideline for Manufacturing Machinery and Equipment, M-110.2, issued by the SAE, Warrendale, 1993
[33] Mechanical Applications in Reliability Engineering, RBPR-1 through -6, RAC, Rome, New York, 1993
[34] Vallabh Dhudsia, Guidelines for Equipment Reliability, Sematech document 92031014A, published 1992
[35] This short-lived journal was called “Communications in Reliability, Maintainability, and Supportability”, ISSN 1072-3757, SAE, Warrendale, 1994