Slides presented at ESC Minn 2015 on aspects to consider when creating a embedded system.
Slides presented at ESC Minn 2015 on aspects to consider when creating a embedded system.
When designing any electronic system the modules connectors will have a significant impact upon the system reliability.
The system could be designed for a traditional high reliability application like railway, aerospace, medical, military or maybe an emerging one like high frequency trading.
Perhaps the first and simplest approach is to group the connectors in the functional types Power, Control, Data and Clocks etc. as each of these will be addressed in a different manner.
For instance it is possible to have prime and redundant power connectors, but if your system has a large number of data interfaces then it is not possible to have prime and redundant connectors for each input. This may lead to the need for system level redundancy in the worse case.
Regardless of the connector function we need to consider the following aspects
• Pin Derating, maximum reliability of a component is achieved by reducing the electrical stress placed upon it. There are many different standards for this (ESA, NASA, US Military etc) however; the basic idea is to reduce the voltage and current applied to the pins.
• Connector pin out, can a power pin short to a ground pin which will effect the overall power distribution system. It is therefore a good idea to ensure separation of power and ground separating them correctly. If necessary you can use unused pins to add isolation.
• Use of different connector types, styles and keying to prevent incorrect mating of connectors when the system is assembled. An incorrect assembly and power application could result in many hours of design analysis to prove no parts have been subject to electrical overstress.
• Number of mating cycles, the number of times the modules are mated / de-mated from the system has to be recorded. For this reason many designers use connector savers which can be connected to the system and reduce the number of mating cycles.
• Suitability for the job at hand for example if your system uses high speed serial links to communicate then your connector requirements will be very different from a power interface or low speed interface.
• Environmental and Dynamic considerations, many high reliability systems see extremes of temperature, vibration and shock. Can the connector system survive the demands and still stay connected.
Once you have determined your connector philosophy the next stage is ensuring you have a reliable system is in ensuring you cannot propagate a fault outside of your unit should a failure within occur.
Designing a PCB for current devices is a very complex and often over looked area instead focus falls upon the more interesting FPGA or Processors. However, the fact remains that without getting the board correct in the first place you may find you have issues either sooner or later so what are the main aspects of a modern PCB should we be concerned about.
Of course the list above is by no means complete however, it provides a good starting point
Safety-critical systems nowadays include more and
more embedded computer systems, based on different hardware
platforms. These hardware platforms, reaching from microcontrollers
to programmable logic devices, lead to fundamental
differences in design. Major differences result from different
hardware architectures and their robustness and reliability as
well as from differences in the corresponding software design.
This paper gives an overview on how these hardware platforms
differ with respect to fault handling possibilities as fault avoidance
and fault tolerance and the resulting influence on the safety
of the overall system.
When most engineers think about design reliability, their minds turn to a single, central metric: mean time between failures. MTBF is, in fact, an important parameter in assessing how dependable your design will be. But another factor, probability of success, is just as crucial, and you would do well to take note of other considerations as well to ensure an accurate reliability analysis and, ultimately, a reliable solution.
Design Reliability 101 -Tips, Techniques and How to
Slides available here :- Design Reliability 101 – Tips, Techniques and How to Taylor
Every engineer should be familiar with the concept of Mean Time Between Failure (MTBF), which is one of the most commonly used terms in reliability engineering. Having said this, different people have varying interpretations of what MTBF actually means, and these interpretations are often incorrect.
For example, suppose I were to tell you that a particular system had a MTBF of 8,766 hours, which equates to one year (MTBF is always quoted in hours). Does this mean that if I use a single instantiation of this system in my application, that system is guaranteed to be operational for a year before its first failure?
In fact, you might be surprised to discover that with a MTBF of a year, a single system has a probability of only 0.37 of still being operational after the 8,766 hours. If you’re a manufacturer of systems, then only 37 percent of the units you produce will still be operational after this period, which might cause problems for your warranty and post-design services departments. Using the equation P(s) = E^(-t/MTBF), and charting this equation in Excel will produce the following plot, which shows probability of success at 0.5 MTBF and 1.0 MTBF:
As engineers, we want (or, in some cases, need) to ensure the probability of success for a particular number of years is very high — as high as 0.95 in many cases (some applications require 0.99 or higher, but this will require you to employ redundancy). This means that, for the simple one year product/mission life presented in our example, the MTBF would have to be 20.74 years to obtain the required probability of success. This is a big difference from what you may originally have thought.
The reliability of a module or system follows the well know “bath tub curve” as shown below:
The discussions in the remainder of this blog will relate to determining the MTBF and failure rate during the constant failure rate duration. It is the responsibility of the manufacturer to ensure that infant mortality is screened out, which is one reason for performing post-production burn-in.
One method of determining MTBF is based around the failure rate of the component interconnections (solder joints), the technology (hybrid, IC, PCB etc.), and the operating environment (ground, aircraft, space vehicle, etc.). In fact, two methods for determining the failure rate are commonly used:
In many cases, circuit/systems designers may use both of these techniques to determine the MTBF of their system. For example, they may initially perform a parts count analysis on the bill of materials (BOM) to provide an initial (“ball park”) estimate of the reliability.
Later, once the circuit design has been completed, they may follow up with a stressed reliability analysis that takes into account the electrical stresses on the components and the equipment temperatures. This second, stressed analysis tends to lower the failure rate and increase the MTBF, which is generally what engineers and companies want while still being accurate.
One common standard for performing both of these analyses is Mil Handbook 217F Notice 2. This is freely available over the Internet and provides very detailed information on reliability rate calculations for different devices and environments. The only downside with this standard is that it was last updated in 1995, so its predictions can be a little pessimistic for modern components. The other commonly used standards are the Bellcore/Telcordia and SAE Reliability Prediction Methods.
Another method for determining the failure rate of a device is via life testing; i.e., the total number of hours the device operates without a failure. This is often achieved using accelerated life testing, which stresses the devices beyond their normal operating conditions to “age” the devices and determine failure rates. This approach is generally performed by device manufacturers to obtain each component’s FIT (failure in time) rate. Typical examples of this are the Mil Std 883 and the JEDEC Solid State Technology Association’s “JEDS22 reliability test methods for packaged devices.”
The FIT rate is the number of failures in a billion hours, for example, a device with a FIT rate of 20 is said to have 20e-9 FITs. The relationship between MTBF and FIT rate is very simple — the reciprocal of one results in the other. Hence, in our earlier example, in order to have a probability of success of 0.95 for one year, the total design needs a FIT rate no greater than 5,500 FITs, which is still pretty high.
Typical FIT rates for Xilinx FPGAs (this site’s sponsor) are 24 FITs for the current 7 series devices. Typically, power supplies tend to dominate failure rates, especially in Mil Handbook 217F Notice 2, which can be used to calculate the reliability of a hybrid device.
In this column, I will look at what we can do within the FPGA and at the hardware/system level to increase reliability.
Focusing on the FPGA implementation first, there are numerous ways the design can be corrupted, depending on the end environment. This corruption could be the result of a single-event upset (SEU), a single-event functional interrupt (SEFI), or even data corruption from a number of sources.
An SEU occurs when a data bit (register or memory) is hit by radiation and flips from a 0 to a 1, or vice versa. A SEFI is where a control register or other critical register suffers a bit flip that locks up the system. In the world of SRAM-based FPGAs, we tend to consider an SEFI when one of the SRAM cells holding the device’s configuration flips and changes the design’s implementation. Data corruption can occur for a number of reasons, including EMI (electromagnetic interference) affecting the design in an industrial application.
How can we protect these systems and increase a unit’s MTBF? Depending on the end application, it may be acceptable simply to duplicate the logic — create two instantiations of the design within the same device — and to indicate an error if the results do not match. The higher-level system would be in charge of deciding what to do in the event of such an error.
The next thing we can do is to implement triple modular redundancy (TMR) within the device. At the simplest level, this instantiates the same design three times within the FPGA. A majority vote — two out of three — decides the result. (Even though this might sound simple, implementing it can become very complex very quickly.) If one instantiation of the design becomes corrupted, the error will be masked. Depending on the kind of error, the device may clear itself on the next calculation, or it may require reconfiguration.
Implementing TMR can be performed by hand, which can be time-consuming, or using tools such as the TMRTool from Xilinx (this site’s sponsor) or the BL-TMR from Brigham Young University. If TMR is implemented correctly (and you have to be careful about synthesis optimizations), the design should mask all SEUs, as long as only one is present at any particular time.
Memory blocks inside the FPGA may also use error-correcting code technology to detect and correct SEUs. However, to ensure you really have good data, you need to perform memory scrubbing. This involves accessing the memory when it is not being used for other purposes, reading out the data, checking the error detection and correction code, and (if necessary) writing back the corrected data. Common tools here include Hamming codes that allow double-error detection and single-error correction.
This nicely leads us to the concept of scrubbing the entire FPGA. Depending on your end application, you might be able to simply reconfigure the FPGA each time before it is used. For example, a radar imaging system taking an image could reconfigure the FPGA between images to prevent corruption. If the FPGA’s performance is more mission-critical or uptime-critical, you can monitor the FPGA’s configuration by reading back the configuration data over the configuration interface. If any errors are detected, the entire device may be reconfigured, or partial reconfiguration may be used to target a specific portion of the design. Of course, all this requires a supervising device or system.
Of course, it will take significant analysis to determine how any of the methods that have been mentioned thus far affects MTBF. The complexity of this analysis will depend on the environment in which the system is intended to operate.
Working at the module level, we as engineers can take a number of steps to increase reliability. The first is to introduce redundancy, either within the module itself (e.g., extra processing chains) or by duplicating the module in its entirity.
If you are implementing redundancy, you have two options: hot and cold. Each has advantages and disadvantages, and implementing either option will be a system-level decision.
In the case of hot redundancy, both the prime and redundant devices (to keep things simple, I am assuming one-for-two redundancy) are powered up, with the redundant module configured ready to replace the prime should it fail. This has the advantage of a more or less seamless transition. However, since the redundant unit is operating alongside the prime, it is also aging and might fail.
In the case of cold redundancy, the prime unit is powered and operating while the redundant unit is powered down. This means the redundant module is not subject to as many aging stresses and, to a large extent, is essentially new when it is turned on. However, this comes at the expense of having some amount of down time if the prime module fails and the redundant module must be switched in.
With careful analysis of your system, you can identify the key drivers; i.e., which components have a high failure rate and are hurting system reliability. Power supply is often a key driver. Therefore, it is often advisable to implement a redundant power supply architecture that can power the same electronics, often in a one-out-of-two setup.
If you are implementing redundancy at the data path, module, or system level, the number of data paths, modules, or systems you employ will impact the new failure rate. For example, 12-for-8 systems will give you a lower failure rate than 10-for-8 systems. Of course, redundancy comes the expense of cost, size, weight, power consumption, and so forth. A very good interactive Website for this analysis can be found by clicking here.
When implementing redundancy at either the system or module level, it is crucial that both prime and redundant modules cannot have faults that can keep each other from working. Fault propagation has to be considered, and prime and redundant modules must be isolated from each other.