Tag Archives: reliability

Design Reliability: MTBF Is Just the Beginning Issue 88



When most engineers think about design reliability, their minds turn to a single, central metric: mean time between failures. MTBF is indeed an important parameter in assessing how dependable your design will be. But another factor, probability of success, is just as crucial, and several further considerations deserve attention to ensure an accurate reliability analysis and, ultimately, a reliable solution.

Link here



Using Xilinx’s Power Estimator and Power Analyzer Tools Issue 83



FPGAs are unlike many classes of components in that the power they will require on their core, auxiliary and I/O voltages depends upon the implementation of the design. Determining the power dissipation of the FPGA in your application is thus a little more complicated than just reading the datasheet. It can therefore be challenging to ensure you have the correct power architecture: one that takes into account not only the required quiescent currents, ramp rates and sequencing, but also has the ability to suitably power the end application while remaining within the acceptable junction temperature of the device.

Link here


Nuts and Bolts of Designing an FPGA into Your Hardware Issue 82



To many engineers and project managers, implementing the functionality within an FPGA and achieving timing closure are the main areas of focus. However, actually designing the FPGA onto the printed-circuit board at the hardware level can present a number of interesting challenges that you must surmount for a successful design.

Link here


Using FPGAs in Mission-Critical Systems Issue 73



Dramatic surges in FPGA technology, device size and capabilities have over the last few years increased the number of potential applications that FPGAs can implement. Increasingly, these applications are in areas that demand high reliability, such as aerospace, automotive or medical. Such applications must function within a harsh operating environment, which can also affect the system performance. This demand for high reliability, coupled with use in rugged environments, often means you as the engineer must take additional care in the design and implementation of the state machines (as well as all accompanying logic) inside your FPGA to ensure they can function within the requirements.

One of the major causes of errors within state machines is single-event upsets caused by either a high-energy neutron or an alpha particle striking sensitive sections of the device silicon. SEUs can cause a bit to flip its state (0 -> 1 or 1 -> 0), resulting in an error in device functionality that could potentially lead to the loss of the system or even endanger life if incorrectly handled. Because these SEUs do not result in any permanent damage to the device itself, they are called soft errors.

Link here



Design West 2013



Space: The Final Frontier – FPGAs for Space and Harsh Environments

The last 20 years have seen the explosion of FPGA technology used in many different
end applications, including those within harsh environments. It therefore follows that
system developers wish these devices to operate correctly and safely regardless of
environment. When engineers design for a spaceflight mission, there are three main
environmental factors that will impact performance: radiation; temperature; and
vibration and shock.

Paper available here: ESC-322Paper_Taylor

Slides available here: ESC-322Slides_Taylor


White Paper – Flying High-Performance FPGAs on Satellites: Two Case Studies

When considering flying an FPGA within a satellite mission, ensuring the device and design will work within the radiation environment is the first of a number of parameters to take into account. In this paper I consider the parameters that must be addressed when flying a high-performance FPGA in two very different missions.

  • UKube-1, a CubeSat mission scheduled for launch in late 2013
  • A generic FPGA processing card for use in a number of GEO missions

Of these two missions, UKube-1 has been delivered for launch, while the generic FPGA processing card is currently in development. Both of these missions have their own challenges and unique requirements which need to be addressed. At the same time, however, both missions also share common driving requirements.

Paper available here: STS-401Paper_Taylor

Slides available here: STS-401Slides_Taylor


Mean Time Between Failure


Every engineer should be familiar with the concept of Mean Time Between Failure (MTBF), which is one of the most commonly used terms in reliability engineering. Having said this, different people have varying interpretations of what MTBF actually means, and these interpretations are often incorrect.

For example, suppose I were to tell you that a particular system had an MTBF of 8,766 hours, which equates to one year (MTBF is always quoted in hours). Does this mean that if I use a single instantiation of this system in my application, that system is guaranteed to be operational for a year before its first failure?

In fact, you might be surprised to discover that with an MTBF of a year, a single system has a probability of only 0.37 of still being operational after the 8,766 hours. If you’re a manufacturer of systems, then only 37 percent of the units you produce will still be operational after this period, which might cause problems for your warranty and post-design services departments. This follows from the equation P(s) = e^(-t/MTBF); charting this equation (in Excel, for instance) shows the probability of success falling to roughly 0.61 at 0.5 MTBF and 0.37 at 1.0 MTBF.

As engineers, we want (or, in some cases, need) to ensure the probability of success for a particular number of years is very high; as high as 0.95 in many cases (some applications require 0.99 or higher, but this will require you to employ redundancy). This means that, for the simple one-year product/mission life presented in our example, the MTBF would have to be about 19.5 years (one year divided by -ln(0.95)) to obtain the required probability of success. This is a big difference from what you may originally have thought.
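The arithmetic above is easy to check with a few lines of code. The short Python sketch below evaluates P(s) = e^(-t/MTBF) and inverts it to find the MTBF needed for a target probability of success (the 8,766-hour year matches the figure quoted earlier):

```python
import math

ONE_YEAR = 8766  # hours, as in the example above

def prob_success(t_hours, mtbf_hours):
    """P(s) = e^(-t/MTBF): probability a unit is still working
    after t_hours, assuming a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)

def required_mtbf(t_hours, target_prob):
    """Invert P(s) = e^(-t/MTBF) to find the MTBF needed for a
    target probability of success over t_hours."""
    return -t_hours / math.log(target_prob)

# With an MTBF of one year, only ~37% of units survive the year.
print(round(prob_success(ONE_YEAR, ONE_YEAR), 2))           # 0.37

# MTBF needed for a 0.95 probability of success over one year.
print(round(required_mtbf(ONE_YEAR, 0.95) / ONE_YEAR, 1))   # 19.5
```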

The reliability of a module or system follows the well-known “bathtub curve,” as shown below:

The discussions in the remainder of this blog will relate to determining the MTBF and failure rate during the constant-failure-rate portion of the curve. It is the responsibility of the manufacturer to ensure that infant mortality is screened out, which is one reason for performing post-production burn-in.

One method of determining MTBF is based on the failure rate of the component interconnections (solder joints), the technology (hybrid, IC, PCB, etc.) and the operating environment (ground, aircraft, space vehicle, etc.). In fact, two methods for determining the failure rate are commonly used:

  • Parts count: As this technique is based around reference stresses on components, the resulting analysis tends to give a more conservative (pessimistic) failure rate.
  • Stressed reliability: This approach utilizes actual electrical and thermal stress applied to each component to determine a more accurate failure rate for the system.

In many cases, circuit/systems designers may use both of these techniques to determine the MTBF of their system. For example, they may initially perform a parts count analysis on the bill of materials (BOM) to provide an initial (“ball park”) estimate of the reliability.

Later, once the circuit design has been completed, they may follow up with a stressed reliability analysis that takes into account the electrical stresses on the components and the equipment temperatures. This second, stressed analysis tends to lower the failure rate and increase the MTBF, which is generally what engineers and companies want while still being accurate.
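The parts count approach amounts to summing per-part failure rates across the BOM. The sketch below shows the idea; the part categories and FIT values are invented for illustration and are not taken from any prediction standard:

```python
# Toy parts count estimate: sum per-part failure rates over the BOM.
# The categories and per-part FIT values below are illustrative only;
# real values come from a standard such as MIL-HDBK-217F.
bom = {
    # part category: (failure rate in FITs per part, quantity)
    "ceramic capacitor":   (0.5, 120),
    "thick-film resistor": (0.2, 200),
    "FPGA":                (24.0, 1),
    "DC-DC converter":     (150.0, 2),
    "connector":           (10.0, 4),
}

total_fits = sum(rate * qty for rate, qty in bom.values())
mtbf_hours = 1e9 / total_fits  # 1 FIT = 1 failure per 1e9 hours

print(f"Total failure rate: {total_fits:.0f} FITs")
print(f"Estimated MTBF: {mtbf_hours:,.0f} hours")
```

A stressed analysis would replace the fixed per-part rates with rates derated for the actual electrical and thermal stress on each component.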

One common standard for performing both of these analyses is MIL-HDBK-217F Notice 2. This is freely available over the Internet and provides very detailed information on failure rate calculations for different devices and environments. The only downside of this standard is that it was last updated in 1995, so its predictions can be a little pessimistic for modern components. The other commonly used standards are the Bellcore/Telcordia and SAE reliability prediction methods.

Another method for determining the failure rate of a device is via life testing; i.e., the total number of hours the device operates without a failure. This is often achieved using accelerated life testing, which stresses the devices beyond their normal operating conditions to “age” the devices and determine failure rates. This approach is generally performed by device manufacturers to obtain each component’s FIT (failure in time) rate. Typical examples are the MIL-STD-883 test methods and the JEDEC Solid State Technology Association’s JESD22 reliability test methods for packaged devices.

The FIT rate is the number of failures in a billion (10^9) device-hours; for example, a device with a FIT rate of 20 has a failure rate of 20 x 10^-9 failures per hour. The relationship between MTBF and FIT rate is very simple: the MTBF in hours is 10^9 divided by the FIT rate, and vice versa. Hence, in our earlier example, in order to have a probability of success of 0.95 for one year, the total design needs a FIT rate no greater than about 5,850 FITs, which is still pretty high.

Typical FIT rates for Xilinx FPGAs (this site’s sponsor) are 24 FITs for the current 7 series devices. Typically, power supplies tend to dominate failure rates, especially under MIL-HDBK-217F Notice 2, which can be used to calculate the reliability of a hybrid device.


Increasing FPGA System Reliability


In this column, I will look at what we can do within the FPGA and at the hardware/system level to increase reliability.

Focusing on the FPGA implementation first, there are numerous ways the design can be corrupted, depending on the end environment. This corruption could be the result of a single-event upset (SEU), a single-event functional interrupt (SEFI), or even data corruption from a number of sources.

An SEU occurs when a data bit (register or memory) is hit by radiation and flips from a 0 to a 1, or vice versa. A SEFI is where a control register or other critical register suffers a bit flip that locks up the system. In the world of SRAM-based FPGAs, we tend to consider an SEFI when one of the SRAM cells holding the device’s configuration flips and changes the design’s implementation. Data corruption can occur for a number of reasons, including EMI (electromagnetic interference) affecting the design in an industrial application.

How can we protect these systems and increase a unit’s MTBF? Depending on the end application, it may be acceptable simply to duplicate the logic — create two instantiations of the design within the same device — and to indicate an error if the results do not match. The higher-level system would be in charge of deciding what to do in the event of such an error.

The next thing we can do is to implement triple modular redundancy (TMR) within the device. At the simplest level, this instantiates the same design three times within the FPGA. A majority vote — two out of three — decides the result. (Even though this might sound simple, implementing it can become very complex very quickly.) If one instantiation of the design becomes corrupted, the error will be masked. Depending on the kind of error, the device may clear itself on the next calculation, or it may require reconfiguration.
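The voter at the heart of TMR is simple enough to sketch in a few lines. This Python fragment models a bitwise two-out-of-three majority vote; in a real design this would of course be HDL, and, as noted above, a robust implementation is considerably more involved:

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise two-out-of-three majority voter: each output bit
    takes the value held by at least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)

# Three copies of the same result; copy b suffers a single-bit SEU.
a = c = 0b1011_0010
b = a ^ 0b0000_1000  # bit 3 flipped by the upset

assert majority_vote(a, b, c) == 0b1011_0010  # the upset is masked
```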

Implementing TMR can be performed by hand, which can be time-consuming, or using tools such as the TMRTool from Xilinx (this site’s sponsor) or the BL-TMR from Brigham Young University. If TMR is implemented correctly (and you have to be careful about synthesis optimizations), the design should mask all SEUs, as long as only one is present at any particular time.

Memory blocks inside the FPGA may also use error-correcting code technology to detect and correct SEUs. However, to ensure you really have good data, you need to perform memory scrubbing. This involves accessing the memory when it is not being used for other purposes, reading out the data, checking the error detection and correction code, and (if necessary) writing back the corrected data. Common codes here are Hamming codes, which allow single-error correction and double-error detection.
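To illustrate the principle, here is a minimal Hamming(7,4) encoder and single-error corrector in Python. This is a teaching sketch: real FPGA block-RAM ECC uses wider SECDED codes over whole memory words, but the scrub cycle (read, check the syndrome, write back the corrected word) follows the same pattern:

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword
    with parity bits at positions 1, 2 and 4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(word):
    """Return (corrected data bits, 1-based position of the flipped
    bit, or 0 if no single-bit error was found)."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # binary-encodes the error position
    if syndrome:
        w[syndrome - 1] ^= 1  # flip the upset bit back
    return [w[2], w[4], w[5], w[6]], syndrome

# One scrub cycle: read a stored word, correct it, write it back.
stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1                       # SEU flips bit position 5
data, pos = hamming74_correct(stored)
assert data == [1, 0, 1, 1] and pos == 5
```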

This nicely leads us to the concept of scrubbing the entire FPGA. Depending on your end application, you might be able to simply reconfigure the FPGA each time before it is used. For example, a radar imaging system taking an image could reconfigure the FPGA between images to prevent corruption. If the FPGA’s performance is more mission-critical or uptime-critical, you can monitor the FPGA’s configuration by reading back the configuration data over the configuration interface. If any errors are detected, the entire device may be reconfigured, or partial reconfiguration may be used to target a specific portion of the design. Of course, all this requires a supervising device or system.

Of course, it will take significant analysis to determine how any of the methods that have been mentioned thus far affects MTBF. The complexity of this analysis will depend on the environment in which the system is intended to operate.

Working at the module level, we as engineers can take a number of steps to increase reliability. The first is to introduce redundancy, either within the module itself (e.g., extra processing chains) or by duplicating the module in its entirety.

If you are implementing redundancy, you have two options: hot and cold. Each has advantages and disadvantages, and implementing either option will be a system-level decision.

In the case of hot redundancy, both the prime and redundant devices (to keep things simple, I am assuming one-for-two redundancy) are powered up, with the redundant module configured ready to replace the prime should it fail. This has the advantage of a more or less seamless transition. However, since the redundant unit is operating alongside the prime, it is also aging and might fail.

In the case of cold redundancy, the prime unit is powered and operating while the redundant unit is powered down. This means the redundant module is not subject to as many aging stresses and, to a large extent, is essentially new when it is turned on. However, this comes at the expense of having some amount of down time if the prime module fails and the redundant module must be switched in.

With careful analysis of your system, you can identify the key drivers; i.e., which components have a high failure rate and are hurting system reliability. Power supply is often a key driver. Therefore, it is often advisable to implement a redundant power supply architecture that can power the same electronics, often in a one-out-of-two setup.

If you are implementing redundancy at the data path, module, or system level, the number of data paths, modules, or systems you employ will impact the new failure rate. For example, a 12-for-8 arrangement will give you a lower failure rate than a 10-for-8 arrangement. Of course, redundancy comes at the expense of cost, size, weight, power consumption, and so forth. A very good interactive Website for this analysis can be found by clicking here.
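The benefit of spare modules can be quantified with the standard k-out-of-n reliability formula, assuming independent, identical modules; the per-module reliability of 0.9 below is purely illustrative:

```python
from math import comb

def k_of_n_reliability(k, n, r):
    """Probability that at least k of n identical, independent
    modules, each with reliability r, are still working."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i)
               for i in range(k, n + 1))

R_MODULE = 0.9  # illustrative per-module reliability over the mission

print(round(k_of_n_reliability(8, 10, R_MODULE), 4))  # 0.9298
print(round(k_of_n_reliability(8, 12, R_MODULE), 4))  # 0.9957
```

With two extra spares, the probability of keeping eight working modules rises from about 0.93 to about 0.996, which is the effect described above.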

When implementing redundancy at either the system or module level, it is crucial that a fault in the prime module cannot prevent the redundant module from working, and vice versa. Fault propagation has to be considered, and prime and redundant modules must be isolated from each other.