Reliability Engineering: Designing for Dependable Performance

Category: Industrial Engineering. Level: Beginner. By the ExcellentWiki Editorial Team.

Every product fails eventually. The question is when. Reliability engineering is the discipline of understanding, predicting, and managing failure. It determines whether a product fails after one year or twenty, whether it fails gradually or suddenly, and whether failure is inconvenient or catastrophic.

The cost of poor reliability is staggering. Product recalls, warranty claims, lost customer loyalty, and liability lawsuits all stem from unreliable products. The automotive industry alone spends over 10 billion dollars annually on warranty repairs. In aerospace, a single in-flight failure can cost hundreds of lives and billions of dollars. Reliability engineering is not an optional add-on — it is fundamental to product design and customer satisfaction.

Reliability Fundamentals

Reliability is defined as the probability that a product performs its intended function under specified conditions for a specified period of time.

The Bathtub Curve

Product failure rates follow a characteristic pattern over time called the bathtub curve. The first region is the infant mortality period — early failures caused by manufacturing defects, material flaws, or design errors. The failure rate starts high and decreases as defective units fail and are removed from the population.

The second region is the useful life period, in which the failure rate is approximately constant. Failures occur randomly due to stress spikes, accidental damage, or environmental extremes. This is the period in which the failure rate is at its lowest.

The third region is the wearout period — the failure rate increases as components age, wear, corrode, or fatigue. Bearings wear, seals leak, and electronics degrade. The wearout period determines the product’s economic life.

Mean Time Between Failures

MTBF is the average time between failures for repairable systems. For non-repairable products, mean time to failure is used instead. MTBF is calculated as total operating time divided by the number of failures.

A common misconception is that MTBF represents the expected life of an individual product. It does not. For an exponential failure distribution, 63 percent of products will fail by the MTBF. The median time to failure is 0.693 times the MTBF.
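The two figures above follow directly from the exponential model and can be checked numerically. The sketch below assumes a constant failure rate; the function names and the 1,000-hour MTBF are illustrative choices, not values from any standard.

```python
import math

def fraction_failed(t, mtbf):
    """Cumulative fraction of units failed by time t (exponential model)."""
    return 1.0 - math.exp(-t / mtbf)

def median_life(mtbf):
    """Time by which half the population has failed (exponential model)."""
    return mtbf * math.log(2)

mtbf = 1000.0  # hours, illustrative value
print(round(fraction_failed(mtbf, mtbf), 3))  # 0.632
print(round(median_life(mtbf), 1))            # 693.1
```

The result does not depend on the MTBF chosen: under this model, about 63 percent of units have failed at t = MTBF, whatever the MTBF is.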

Availability

Availability is the probability that a system is operational when needed. It combines reliability and maintainability. Inherent availability equals MTBF divided by the sum of MTBF and mean time to repair (MTTR). A system with an MTBF of 500 hours and an MTTR of 5 hours has an inherent availability of 500/505, or about 0.99: 99 percent uptime.

Achieving high availability requires both high reliability (long MTBF) and good maintainability (short MTTR). Redundancy — having backup components — increases availability by enabling the system to continue operating when a component fails.
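As a rough illustration, the inherent availability formula and the effect of a redundant unit can be sketched as follows. The parallel calculation assumes the units fail and are repaired independently, which is a simplification; the function names are illustrative.

```python
def inherent_availability(mtbf, mttr):
    """Inherent availability = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

def parallel_availability(availabilities):
    """Availability of redundant units, assuming independent failure and repair.

    The system is down only when every unit is down at the same time.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down

single = inherent_availability(500.0, 5.0)
print(round(single, 4))                                    # 0.9901
print(round(parallel_availability([single, single]), 6))   # 0.999902
```

Under these assumptions, duplicating the 500-hour unit moves availability from roughly two nines to roughly four.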

Reliability Analysis Methods

Engineers use several analytical tools to predict and improve reliability.

Reliability Block Diagrams

An RBD represents the system as a network of functional blocks. Blocks in series must all work for the system to work. Of blocks in parallel, at least one must work. Complex systems are decomposed into series and parallel combinations.

For a series system with independent components, the system reliability equals the product of component reliabilities. Five components each with 0.99 reliability gives system reliability of 0.951. Ten components gives 0.904. This drives the design principle — minimize the number of components in series.

Parallel redundancy dramatically improves reliability. Two parallel components each with 0.90 reliability give system reliability of 0.99. Three give 0.999. The systems engineering article discusses how reliability analysis integrates with overall system design.
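The series and parallel figures quoted above can be reproduced with a few lines of code. This is a minimal sketch that assumes independent components; the helper names are illustrative.

```python
from math import prod

def series_reliability(reliabilities):
    """All blocks must work: multiply the component reliabilities."""
    return prod(reliabilities)

def parallel_reliability(reliabilities):
    """At least one block must work: one minus the chance that all fail."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

print(round(series_reliability([0.99] * 5), 3))     # 0.951
print(round(series_reliability([0.99] * 10), 3))    # 0.904
print(round(parallel_reliability([0.90] * 2), 3))   # 0.99
print(round(parallel_reliability([0.90] * 3), 3))   # 0.999
```

A mixed system can be evaluated by nesting the two functions, for example passing the result of parallel_reliability as one element of the list given to series_reliability.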

Failure Mode and Effects Analysis

FMEA is a systematic method for identifying potential failure modes and their effects. Each component or function is analyzed for how it could fail, what effect the failure would have, and how likely the failure is to occur and be detected.

Risk priority numbers prioritize failure modes for corrective action. Severity, occurrence, and detection are each rated on a 1 to 10 scale. The RPN is the product of the three ratings. Failure modes with RPN above a threshold receive preventive action.
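The RPN ranking described above can be sketched in code. The failure modes, their ratings, and the threshold of 100 below are all hypothetical values chosen for illustration; real thresholds are set by each organization.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (remote) to 10 (very frequent)
    detection: int   # 1 (almost certain to detect) to 10 (undetectable)

    @property
    def rpn(self) -> int:
        """Risk priority number: product of the three ratings."""
        return self.severity * self.occurrence * self.detection

# Hypothetical failure modes and ratings, for illustration only.
modes = [
    FailureMode("Connector corrosion", 5, 3, 6),
    FailureMode("Seal leak", 7, 4, 5),
    FailureMode("Bearing seizure", 9, 2, 3),
]

RPN_THRESHOLD = 100  # assumed threshold, not a standard value

ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for mode in ranked:
    action = "preventive action" if mode.rpn > RPN_THRESHOLD else "monitor"
    print(f"{mode.name}: RPN {mode.rpn} -> {action}")
```

In this made-up example the mode with the highest severity rating ends up with the lowest RPN, which is one reason some teams also review high-severity modes regardless of their RPN.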

Fault Tree Analysis

Fault tree analysis starts with a top-level undesired event — system failure — and works backward to identify combinations of component failures that cause it. AND gates indicate that all input events must occur. OR gates indicate that any input event causes the output.

FTA is valuable for complex systems where multiple failure combinations are possible. It identifies single points of failure — components whose failure alone causes system failure. It also identifies common cause failures — a single event that causes multiple components to fail simultaneously.
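When the basic events are independent, the gate logic above translates into simple probability rules. The sketch below uses a hypothetical two-level tree with made-up probabilities; it does not model common cause failures, which violate the independence assumption.

```python
from math import prod

def and_gate(probabilities):
    """Output occurs only if every input event occurs (independent inputs)."""
    return prod(probabilities)

def or_gate(probabilities):
    """Output occurs if any input event occurs (independent inputs)."""
    return 1.0 - prod(1.0 - p for p in probabilities)

# Hypothetical tree: the system fails if power is lost OR the controller fails.
# Power is lost only if both redundant supplies fail.
p_supply = 0.01        # assumed probability that one supply fails
p_controller = 0.001   # assumed probability that the controller fails

p_power_loss = and_gate([p_supply, p_supply])
p_top = or_gate([p_power_loss, p_controller])
print(f"{p_top:.6f}")  # 0.001100
```

Here the controller feeds the top-level OR gate directly, so it is a single point of failure and dominates the result even though each power supply is ten times more likely to fail.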

Weibull Analysis

The Weibull distribution is the most widely used distribution in reliability engineering. It models the time to failure for many types of components.

Weibull Parameters

The Weibull distribution has three parameters. The shape parameter determines the failure rate behavior. A shape parameter below 1.0 indicates decreasing failure rate — infant mortality. A shape parameter equal to 1.0 indicates constant failure rate — useful life. A shape parameter above 1.0 indicates increasing failure rate — wearout.

The scale parameter determines the characteristic life: the time, measured from the minimum life, at which 63.2 percent of the population has failed. The location parameter defines the minimum life, the time before which no failures occur.
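The role of each parameter is easier to see numerically. The sketch below implements the standard Weibull cumulative distribution and failure rate (hazard) functions; the 1,000-hour scale and the three shape values are illustrative. For simplicity it returns a zero failure rate at or before the location parameter.

```python
import math

def weibull_cdf(t, shape, scale, location=0.0):
    """Fraction of the population failed by time t."""
    if t <= location:
        return 0.0
    return 1.0 - math.exp(-(((t - location) / scale) ** shape))

def weibull_hazard(t, shape, scale, location=0.0):
    """Instantaneous failure rate at time t (simplified to 0 for t <= location)."""
    if t <= location:
        return 0.0
    z = (t - location) / scale
    return (shape / scale) * z ** (shape - 1.0)

scale = 1000.0  # hours, illustrative characteristic life
for shape in (0.5, 1.0, 3.0):
    early = weibull_hazard(100.0, shape, scale)
    late = weibull_hazard(500.0, shape, scale)
    trend = "decreasing" if late < early else "constant" if late == early else "increasing"
    print(f"shape {shape}: failure rate is {trend}")

print(round(weibull_cdf(scale, 3.0, scale), 3))  # 0.632
```

Whatever the shape, the cumulative fraction failed at t equal to the scale parameter is 63.2 percent, which is why that point is called the characteristic life.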

Weibull Probability Plot

Weibull analysis involves plotting failure data on a special graph where the Weibull distribution appears as a straight line. The slope of the line estimates the shape parameter. The scale parameter is read off as the time at which the fitted line crosses 63.2 percent cumulative failures, or equivalently computed from the intercept.

Weibull analysis requires failure data — either from laboratory testing or field returns. With sufficient data, the analysis predicts future failure rates, estimates warranty costs, and identifies the onset of wearout.
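The probability plot can also be fitted numerically rather than by eye. The sketch below is one common approach, median rank regression: it assigns each ordered failure a plotting position using Benard's approximation, transforms both axes so a Weibull distribution becomes a straight line, and fits that line by least squares. It assumes complete data (every unit failed) and a location parameter of zero; the failure times are hypothetical.

```python
import math

def weibull_fit_rank_regression(failure_times):
    """Estimate Weibull shape and scale from complete failure data."""
    times = sorted(failure_times)
    n = len(times)
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        f = (i - 0.3) / (n + 0.4)                 # Benard's median rank
        xs.append(math.log(t))                    # x axis: ln(time)
        ys.append(math.log(-math.log(1.0 - f)))   # y axis: ln(-ln(1 - F))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    shape = slope                           # slope of the line
    scale = math.exp(-intercept / slope)    # time where the line crosses 63.2 percent
    return shape, scale

# Hypothetical failure times in hours, for illustration only.
hours = [410, 620, 790, 950, 1120, 1310, 1580]
shape, scale = weibull_fit_rank_regression(hours)
print(f"shape = {shape:.2f}, scale = {scale:.0f} hours")
```

A fitted shape above 1.0 for data like this would point to wearout. Data sets that include unfailed units need adjusted ranks or maximum likelihood estimation instead.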

Design for Reliability

Reliability must be designed into products, not tested into them.

Design Margin

Design margin is the ratio of the strength of a component to the stress it experiences. A margin of 1.5 means the component can withstand 50 percent more stress than it will typically experience. Conservative design margins are the simplest and most effective reliability improvement strategy.

Derating

Derating operates components below their rated maximums. A resistor rated for 0.25 watts is used at 0.125 watts. A hydraulic component rated for 3,000 psi is used at 2,000 psi. Derating reduces stress and increases reliability. Military and aerospace derating guidelines specify derating factors for each component type.

Redundancy

Active redundancy keeps all components operating simultaneously. Standby redundancy activates backup components only when primary components fail. Redundancy adds cost, weight, and complexity but dramatically increases reliability for critical functions.

Environmental Stress Screening

ESS subjects products to accelerated environmental stress — temperature cycling, vibration, thermal shock — to precipitate latent defects before shipment. The stress levels expose weak components that would otherwise fail in the field. ESS is standard practice for electronics manufacturing.

Reliability Testing

Reliability testing verifies that products meet reliability requirements and identifies failure modes that must be addressed.

Life Testing

Life testing operates products under normal conditions until they fail. The test data is analyzed using Weibull or other failure distributions to estimate MTBF and failure rate. Life testing is time-consuming — testing 100 units for 1,000 hours provides 100,000 unit-hours of data but may reveal few failures if the product is highly reliable.

Censored data — tests where some units have not failed when the test ends — requires special statistical methods. Maximum likelihood estimation handles censored data correctly. The confidence interval for MTBF depends on the number of failures observed.
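For the simplest case, a constant failure rate with units removed unfailed at the end of the test, the maximum likelihood estimate has a closed form: total time on test divided by the number of failures. The sketch below assumes that exponential model; the test data are hypothetical.

```python
def exponential_mtbf_mle(times, failed):
    """Maximum likelihood MTBF for exponential data with right censoring.

    times  -- hours accumulated by each unit (to failure or to end of test)
    failed -- True if the unit failed, False if it was still running (censored)
    """
    failures = sum(1 for f in failed if f)
    if failures == 0:
        raise ValueError("no failures observed; only a lower bound can be stated")
    return sum(times) / failures

# Hypothetical test: five units, test stopped at 1,000 hours, two failures.
times = [320, 710, 1000, 1000, 1000]
failed = [True, True, False, False, False]

print(exponential_mtbf_mle(times, failed))  # 2015.0

# Ignoring the three survivors would badly underestimate the MTBF.
naive = (320 + 710) / 2
print(naive)  # 515.0
```

The survivors contribute operating time but no failures, which is why discarding them biases the estimate toward a shorter life.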

Accelerated Life Testing

ALT applies higher stress levels (temperature, voltage, pressure, vibration) to induce failures faster. The acceleration factor relates the failure rate at accelerated stress to the failure rate at normal stress. The Arrhenius model describes acceleration due to temperature. The inverse power law describes acceleration due to voltage and mechanical stress.

ALT reduces test time from years to weeks. A product with a 10-year design life at 25 degrees Celsius may be tested for 6 weeks at 125 degrees Celsius. The acceleration factor depends on the activation energy of the failure mechanism.
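The Arrhenius acceleration factor can be sketched as follows. The activation energy of 0.46 eV is an assumed value, picked here only because it yields a test time close to the six-week example; real values depend on the failure mechanism and must come from data or references for that mechanism.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV per kelvin

def arrhenius_acceleration_factor(activation_energy_ev, use_temp_c, stress_temp_c):
    """Ratio of life at use temperature to life at stress temperature."""
    use_k = use_temp_c + 273.15
    stress_k = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k)
    return math.exp(exponent)

af = arrhenius_acceleration_factor(0.46, 25.0, 125.0)  # 0.46 eV is an assumption
design_life_hours = 10 * 365 * 24
test_hours = design_life_hours / af
print(f"acceleration factor: about {af:.0f}")
print(f"equivalent test time: about {test_hours / (24 * 7):.1f} weeks")
```

The result is very sensitive to the activation energy: with 0.7 eV instead of 0.46 eV, the same temperatures give an acceleration factor roughly ten times larger, so a wrong assumption can badly over- or understate the field life a test represents.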

Reliability Demonstration Testing

RDT demonstrates that a product meets a specified MTBF with a given confidence level. The test plan specifies the test duration, number of units, and acceptance criteria. A common plan tests until a certain number of failures occur. If the lower confidence bound on the MTBF calculated from the test data meets or exceeds the requirement, the product passes.

The test duration depends on the required demonstration. Demonstrating an MTBF of 10,000 hours with 90 percent confidence requires approximately 23,000 unit-hours of testing without failure. Each failure requires additional test time to demonstrate the same MTBF.
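The 23,000 unit-hour figure, and the extra time needed when failures are allowed, can be reproduced with a short calculation. The sketch below assumes a constant failure rate and a fixed-duration test, so the number of failures follows a Poisson distribution; it solves for the test time numerically instead of using chi-square tables. Function names are illustrative.

```python
import math

def poisson_cdf(max_failures, expected_failures):
    """Probability of seeing max_failures or fewer when expected_failures are expected."""
    term = math.exp(-expected_failures)
    total = term
    for k in range(1, max_failures + 1):
        term *= expected_failures / k
        total += term
    return total

def required_unit_hours(target_mtbf, confidence, allowed_failures=0):
    """Unit-hours needed to demonstrate target_mtbf at the given confidence."""
    risk = 1.0 - confidence
    low, high = 0.0, 1.0
    while poisson_cdf(allowed_failures, high) > risk:
        high *= 2.0
    for _ in range(100):  # bisection on expected failures = time / MTBF
        mid = (low + high) / 2.0
        if poisson_cdf(allowed_failures, mid) > risk:
            low = mid
        else:
            high = mid
    return target_mtbf * (low + high) / 2.0

print(round(required_unit_hours(10_000, 0.90, allowed_failures=0)))  # 23026
print(round(required_unit_hours(10_000, 0.90, allowed_failures=1)))
```

With zero failures the result reduces to the target MTBF times the natural log of 10 at 90 percent confidence. Allowing one failure raises the requirement to roughly 39,000 unit-hours.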

Frequently Asked Questions

What is the difference between reliability and quality? Quality is conformance to specifications at the time of delivery. Reliability is the ability to continue meeting specifications over time. A product can have high quality — passing all inspections — but low reliability if it fails quickly in service. Quality is a snapshot at shipment. Reliability is the full picture over the product life.

How much reliability testing is enough? The amount of testing depends on the reliability target, the criticality of the application, and the maturity of the design. For a critical aerospace application, the test plan may require demonstrating the target MTBF with 90 percent confidence through thousands of hours of testing. For a consumer product, accelerated life testing may be sufficient.

How do you estimate reliability when no failure data exists? Without failure data, reliability is estimated using parts count methods, which sum the predicted failure rates of individual components taken from standard databases. MIL-HDBK-217 provides failure rate data for electronic components. NPRD (Nonelectronic Parts Reliability Data) provides data for mechanical and electromechanical components.
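A parts count estimate can be sketched in a few lines. The part list and failure rates below are made-up illustrative numbers, not values taken from MIL-HDBK-217 or NPRD, and the calculation assumes constant failure rates with every part in series.

```python
# (part type, quantity, failure rate per part in failures per million hours)
# All values are hypothetical and for illustration only.
parts = [
    ("resistor",        20, 0.002),
    ("capacitor",       10, 0.010),
    ("microcontroller",  1, 0.050),
    ("connector",        2, 0.020),
]

# Series assumption: any part failure fails the system, so rates add.
system_fpmh = sum(quantity * rate for _, quantity, rate in parts)
mtbf_hours = 1_000_000 / system_fpmh

print(f"system failure rate: {system_fpmh:.2f} failures per million hours")
print(f"predicted MTBF: {mtbf_hours:,.0f} hours")
```

Estimates of this kind are best used to compare design alternatives and find the dominant contributors, since the absolute numbers depend heavily on how well the database matches the actual parts and environment.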

What is the relationship between reliability and warranty costs? Warranty cost is directly proportional to the failure rate during the warranty period. A product with an annual failure rate of 5 percent and a warranty cost of 200 dollars per failure generates an expected warranty expense of 10 dollars per unit per year. Reducing the failure rate to 2 percent cuts that expense to 4 dollars per unit, a 60 percent reduction.
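The proportional relationship is simple enough to check directly. This sketch treats the expected warranty expense per unit as failure rate times cost per failure, ignoring repeat failures and the time value of money; the function name is illustrative.

```python
def warranty_expense_per_unit(annual_failure_rate, cost_per_failure):
    """Expected warranty expense per unit per year."""
    return annual_failure_rate * cost_per_failure

before = warranty_expense_per_unit(0.05, 200.0)
after = warranty_expense_per_unit(0.02, 200.0)
reduction = (before - after) / before

print(f"at 5 percent: {before:.2f} dollars per unit")
print(f"at 2 percent: {after:.2f} dollars per unit")
print(f"reduction: {reduction:.0%}")
```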

