Mean Time Between Failure


Every engineer should be familiar with the concept of Mean Time Between Failure (MTBF), which is one of the most commonly used terms in reliability engineering. Having said this, different people have varying interpretations of what MTBF actually means, and these interpretations are often incorrect.

For example, suppose I were to tell you that a particular system had an MTBF of 8,766 hours, which equates to one year (MTBF is normally quoted in hours). Does this mean that if I use a single instance of this system in my application, that system is guaranteed to be operational for a year before its first failure?

In fact, you might be surprised to discover that with an MTBF of a year, a single system has a probability of only 0.37 of still being operational after the 8,766 hours. If you’re a manufacturer of systems, then only about 37 percent of the units you produce can be expected to still be operational after this period, which might cause problems for your warranty and post-design services departments. This follows from the equation P(s) = e^(-t/MTBF), where t is the operating time and e is the base of the natural logarithm; at t = MTBF it gives e^(-1), or approximately 0.37. Charting this equation in Excel will produce the following plot, which shows probability of success at 0.5 MTBF and 1.0 MTBF:
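If you would rather check the numbers in code than in Excel, the equation can be sketched in a few lines of Python. The function and constant names here are illustrative only, and the model assumes a constant failure rate:

```python
import math


def probability_of_success(t_hours: float, mtbf_hours: float) -> float:
    """P(s) = e^(-t/MTBF), valid only while the failure rate is constant."""
    return math.exp(-t_hours / mtbf_hours)


MTBF_HOURS = 8766.0  # one year, as in the example above

for fraction in (0.5, 1.0):
    p = probability_of_success(fraction * MTBF_HOURS, MTBF_HOURS)
    print(f"t = {fraction:.1f} MTBF: P(s) = {p:.3f}")
# t = 0.5 MTBF: P(s) = 0.607
# t = 1.0 MTBF: P(s) = 0.368
```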

As engineers, we want (or, in some cases, need) to ensure the probability of success for a particular number of years is very high, as high as 0.95 in many cases (some applications require 0.99 or higher, but this will usually require you to employ redundancy). Rearranging the equation gives MTBF = -t / ln(P(s)), so for the simple one-year product/mission life presented in our example, the MTBF would have to be at least about 19.5 years (roughly 171,000 hours) to obtain the required probability of success. This is a big difference from what you may originally have thought.
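The required MTBF comes from solving P(s) = e^(-t/MTBF) for MTBF, which gives MTBF = -t / ln(P(s)). A minimal sketch of that rearrangement follows; the function name is illustrative, and the result is the minimum MTBF that just meets the target, so any larger MTBF also satisfies it:

```python
import math

HOURS_PER_YEAR = 8766.0


def required_mtbf_hours(mission_hours: float, target_probability: float) -> float:
    """Smallest MTBF (hours) that gives the target probability of success
    over the mission time, assuming a constant failure rate."""
    if not 0.0 < target_probability < 1.0:
        raise ValueError("target_probability must be strictly between 0 and 1")
    return -mission_hours / math.log(target_probability)


for target in (0.95, 0.99):
    mtbf = required_mtbf_hours(HOURS_PER_YEAR, target)
    print(f"P(s) = {target}: MTBF >= {mtbf:,.0f} hours "
          f"({mtbf / HOURS_PER_YEAR:.1f} years)")
```

Substituting the result back into the original equation is a quick way to confirm that the target probability is actually met.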

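The failure rate of a module or system over its life follows the well-known “bathtub curve”: a falling failure rate during early life (infant mortality), a long period of roughly constant failure rate, and a rising failure rate as the product wears out. The curve is shown below: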

The discussions in the remainder of this blog will relate to determining the MTBF and failure rate during the constant-failure-rate period. It is the responsibility of the manufacturer to ensure that infant mortality is screened out, which is one reason for performing post-production burn-in.

One method of determining MTBF is based around the failure rate of the component interconnections (solder joints), the technology (hybrid, IC, PCB etc.), and the operating environment (ground, aircraft, space vehicle, etc.). In fact, two methods for determining the failure rate are commonly used:

  • Parts count: As this technique is based around reference stresses on components, the resulting analysis tends to give a more conservative (pessimistic) failure rate.
  • Stressed reliability: This approach utilizes actual electrical and thermal stress applied to each component to determine a more accurate failure rate for the system.

In many cases, circuit/systems designers may use both of these techniques to determine the MTBF of their system. For example, they may initially perform a parts count analysis on the bill of materials (BOM) to provide an initial (“ball park”) estimate of the reliability.
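A parts count estimate boils down to summing, over every line of the BOM, the quantity multiplied by a generic failure rate and a quality factor. The sketch below shows only that arithmetic. The part names, failure rates, and quality factors are invented placeholders, not values from any standard; in a real analysis they would be looked up in the tables of the chosen handbook for the relevant environment.

```python
from dataclasses import dataclass


@dataclass
class BomLine:
    part: str
    quantity: int
    generic_failure_rate: float  # failures per 1e6 hours (placeholder value)
    quality_factor: float        # dimensionless multiplier (placeholder value)


def parts_count_failure_rate(bom: list[BomLine]) -> float:
    """System failure rate in failures per 1e6 hours: sum of N * lambda_g * pi_Q."""
    return sum(
        line.quantity * line.generic_failure_rate * line.quality_factor
        for line in bom
    )


# Hypothetical BOM; every number here is illustrative only.
EXAMPLE_BOM = [
    BomLine("ceramic capacitor", 40, 0.002, 1.0),
    BomLine("film resistor", 60, 0.001, 1.0),
    BomLine("digital IC", 5, 0.02, 2.0),
]

failure_rate = parts_count_failure_rate(EXAMPLE_BOM)
mtbf_hours = 1e6 / failure_rate
print(f"Failure rate: {failure_rate:.2f} per 1e6 hours, MTBF: {mtbf_hours:,.0f} hours")
```

Summing the failure rates like this treats the design as a series system, in which any single part failure counts as a system failure.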

Later, once the circuit design has been completed, they may follow up with a stressed reliability analysis that takes into account the electrical stresses on the components and the equipment temperatures. This second, stressed analysis tends to lower the predicted failure rate and increase the MTBF, which is generally what engineers and companies want, while also being more accurate.

One common standard for performing both of these analyses is MIL-HDBK-217F Notice 2. This is freely available over the Internet and provides very detailed information on failure rate calculations for different devices and environments. The only downside with this standard is that it was last updated in 1995, so its predictions can be a little pessimistic for modern components. The other commonly used standards are the Bellcore/Telcordia and SAE Reliability Prediction Methods.

Another method for determining the failure rate of a device is via life testing; i.e., operating a sample of devices and recording the total number of device hours accumulated and any failures that occur. This is often achieved using accelerated life testing, which stresses the devices beyond their normal operating conditions to “age” the devices and determine failure rates. This approach is generally performed by device manufacturers to obtain each component’s FIT (failure in time) rate. Typical test standards for this are MIL-STD-883 and the JEDEC Solid State Technology Association’s “JESD22 reliability test methods for packaged devices.”

The FIT rate is the number of failures in a billion (10^9) device hours; for example, a device with a FIT rate of 20 has a failure rate of 20e-9 failures per hour. The relationship between MTBF and FIT rate is very simple: once the FIT rate is expressed in failures per hour, the reciprocal of one results in the other, i.e., MTBF (hours) = 10^9 / FIT. Hence, in our earlier example, in order to have a probability of success of 0.95 for one year, the total design needs a FIT rate no greater than roughly 5,850 FITs (10^9 × -ln(0.95) / 8,766), which is still pretty high.
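The conversion between FIT rate and MTBF can be sketched as follows. The function names are illustrative, and the 10^9 scale factor comes directly from the definition of FIT as failures per billion hours:

```python
HOURS_PER_BILLION = 1e9


def fit_to_mtbf_hours(fit: float) -> float:
    """MTBF in hours for a given FIT rate (failures per 1e9 hours)."""
    return HOURS_PER_BILLION / fit


def mtbf_hours_to_fit(mtbf_hours: float) -> float:
    """FIT rate (failures per 1e9 hours) for a given MTBF in hours."""
    return HOURS_PER_BILLION / mtbf_hours


def fit_to_failures_per_hour(fit: float) -> float:
    """Failure rate in failures per hour for a given FIT rate."""
    return fit / HOURS_PER_BILLION


print(fit_to_mtbf_hours(20))         # 50000000.0 hours
print(fit_to_failures_per_hour(20))  # 2e-08 failures per hour
```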

Typical FIT rates for Xilinx FPGAs (this site’s sponsor) are 24 FITs for the current 7 series devices. Power supplies tend to dominate failure rates, especially in predictions made with MIL-HDBK-217F Notice 2, which can be used to calculate the reliability of a hybrid device.