# The Art of Decoupling

Like the perfect temperature at which beer should be served, the design and location of a decoupling capacitor network will return different answers depending upon who is being asked. The funny thing is that although the answers may be very different, each respondent will be sure that he or she is the only one who is correct.

Before I discuss my preferred beer-serving temperature and I explain how I design and locate my decoupling capacitors, I think it is important we all understand why we have decoupling capacitor networks in the first place. These networks are intended to perform two functions as follows:

1. To provide a low impedance path to ground for AC signals and noise signals that are superimposed on the DC supply voltage.
2. To act as a local energy store close to the device being decoupled such that high frequency demands for current due to logic gates switching, for example, can be supplied without the voltage rail being affected. (Remember that a power supply has a much slower response time to transient demands than the operational speed of the devices it powers. Indeed, at higher frequencies, on-chip decoupling is required, but that’s a story for another day.)

Both of these requirements will have bearing on the design of the decoupling capacitor network. We must also understand the parasitic elements and construction of a real-world capacitor, which — along with its capacitive element — will also have resistive and inductive elements as illustrated below:

Real structure of a capacitor (for decoupling purposes, RP is normally discounted).

Equivalent Series Resistance (ESR) is defined by the resistance of the leads or pads and losses in the dielectric; this is typically in the range of 0.01 to 0.1Ω for a ceramic capacitor.

Equivalent Series Inductance (ESL) is defined by internal connections or leads and pads. This is very important in the case of decoupling because it will dominate over the capacitance above certain frequencies.

From the model above, it is clear that the capacitor C and the ESL will form a series resonance creating a near short (it is not a dead short due to the ESR). You can calculate the Self Resonant Frequency (SRF) of a capacitor using the following equation:
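
For reference, the usual series-resonance relation, written with the symbols used above (C for the capacitance and ESL for the equivalent series inductance), is:

$$
f_{SRF} = \frac{1}{2\pi\sqrt{ESL \cdot C}}
$$

As a rough illustration with assumed values, a 100nF capacitor with an ESL of 1nH would resonate at roughly 16MHz.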

What this means is that if you have a specific AC frequency you wish to remove, then you should ideally select a capacitor that has an SRF at the relevant frequency. Another consideration is to ensure a low impedance profile over a wide frequency band, which will require a range of capacitor values connected in parallel. For example, the network illustrated below employs two different capacitor values; note that there are more lower-value capacitors than higher-value capacitors.

An example decoupling capacitor network.

When you are calculating this, do not forget the contribution of PCB inter-plane capacitance, which will dominate at high frequency. Inter-plane capacitance is achieved by careful design of the PCB stack to ensure that the power and ground planes are closely coupled within the stack, thereby creating capacitance.

It’s important to remember that the combined decoupling impedance is a function of all the different types and quantities of decoupling capacitors. The example below shows a combined decoupling impedance (dark blue) formed by using 100nF capacitors (pink), 10nF capacitors (yellow), and 11µF capacitors (cyan/turquoise). In this case the combined decoupling impedance is required to be below 0.1Ω across a wide frequency range.

Decoupling impedance, which is required to be below 0.1Ω across a wide frequency range.
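
A combined impedance curve of this kind can be approximated numerically. The sketch below is a minimal model, not the method used to produce the plot: it treats each capacitor as a series R-L-C branch (ignoring RP, mounting inductance, and plane capacitance) and sums the admittances of all branches. The function names and every component value in it are assumptions chosen purely for illustration.

```python
import math

def capacitor_impedance(freq_hz, c_farads, esr_ohms, esl_henries):
    """Complex impedance of one capacitor modeled as a series R-L-C branch."""
    omega = 2 * math.pi * freq_hz
    reactance = omega * esl_henries - 1.0 / (omega * c_farads)
    return complex(esr_ohms, reactance)

def network_impedance(freq_hz, capacitors):
    """Impedance magnitude of several capacitor groups connected in parallel.

    `capacitors` is a list of (capacitance, ESR, ESL, quantity) tuples.
    """
    admittance = 0j
    for c, esr, esl, qty in capacitors:
        # Identical parts in parallel simply multiply the branch admittance.
        admittance += qty / capacitor_impedance(freq_hz, c, esr, esl)
    return abs(1 / admittance)

# Hypothetical network: values are assumptions, not taken from the article.
network = [
    (100e-9, 0.05, 1e-9, 4),   # four 100nF parts
    (10e-9, 0.05, 1e-9, 8),    # eight 10nF parts
]

for f in (1e6, 10e6, 100e6):
    z = network_impedance(f, network)
    print(f"{f / 1e6:6.0f} MHz: {z:.3f} ohm")
```

Sweeping the frequency over a logarithmic range and comparing each result against the target impedance gives a quick first check before moving to a full power integrity tool.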

Your target impedance will be defined by the parameters of the voltage supply being decoupled, the maximum transient current, and the allowable ripple on the rail as described by the following equation:
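
A commonly used form of this relation, with the allowable ripple expressed as a fraction of the supply voltage (the symbol names here are assumptions), is:

$$
Z_{target} = \frac{V_{supply} \times \text{ripple}}{I_{transient}}
$$

As a hypothetical example, a 3.3V rail with 5% allowable ripple and a maximum transient current of 1.65A gives a target of (3.3 × 0.05) / 1.65 = 0.1Ω.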

Having defined the target impedance, you can then use the capacitors available to you and their supplied information — capacitance, ESL, ESR, tolerance, and drift — to design a network that meets your impedance profile.

Your selection of decoupling capacitor will generally involve a ceramic device (commonly a multi-layer component), although polymer capacitors may be used for some applications. When it comes to selecting the most appropriate device, obviously you will start by looking for a low ESR and an acceptable SRF. You will also need to understand how the capacitor will operate across the desired temperature range and, more importantly, how the capacitance will change with temperature. For example, an X5R capacitor will work between -55 and +85°C with a change in capacitance of +/- 15% across the temperature range, while a Y7V capacitor will operate between -30 and +125°C while exhibiting a variation of +22 to -82% of capacitance value. Selecting the correct type is crucial.

Please also remember to follow any recommendations made by the chip manufacturers, because some devices have on-chip decoupling, which reduces the board-level decoupling requirements. The reasons for this will become clear in my next column in this miniseries.

Based on the discussions above, your decoupling network should now acknowledge the parasitic elements and component tolerances of the various capacitors you’ve selected. Sad to relate, however, this does not guarantee the final performance of the network. This is because we have not yet taken into account any parasitic parameters associated with the component mounting; nor have we considered the effects of component placement.

Whenever I talk to people about decoupling, they all say that the capacitor should be placed as close to the device as possible. However, very few people can actually tell me why this is and at what point “close enough” becomes “too far away.” As engineers, we need to understand what drives the placement of these components. Using this knowledge, we can define a series of rules regarding placement and layout such that the layout engineer is not simply told to “put these as close as possible.” The lack of clear guidance can negatively impact the complexity of the design, the complexity of the manufacturing, and the cost of the circuit board.

A key aspect of decoupling is controlling the inductance associated with both the tracking and the mounting of the capacitor. Although the capacitor stores the charge, it is the inductance that determines the speed at which this charge can be delivered from the capacitor. Therefore, reducing the inductance loop is the most important aspect to consider when placing a capacitor.

This starts with the very design of the SMT (surface mount technology) capacitor mounting pads within your PCB library. Ideally the mounting via should be located as close as possible to the pad (though not within the pad unless you are using micro-via technology). If space permits, it is better to use multiple vias per pad as this reduces the overall inductance. You definitely do not want long thin tracks from the solder land to the via. Also, do not be tempted to share vias between capacitors.

The inductance loop is defined as being the loop created between the mounting via and the connections to the voltage planes. For this reason, when you define the stack of your board and assign layers to power and ground, you need to assign higher priority power planes (those with higher current demands from the device being decoupled) to be higher in the stack because this reduces the vertical distance the current needs to travel before reaching the plane.

When implemented correctly, the mounting inductance will be similar in value to the equivalent series inductance. This will have an impact on the resonant frequency (RF) of the capacitor, and hence should be included in the resonant frequency calculation. As the inductance increases, the resonant frequency — it is not the self-resonant frequency (SRF) as the mounting inductance is included — will be reduced:
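
Adding the mounting inductance to the ESL in the series-resonance relation gives the mounted resonant frequency (the symbol L_mount is an assumed name for the mounting inductance):

$$
f_{res} = \frac{1}{2\pi\sqrt{(ESL + L_{mount}) \cdot C}}
$$

If the mounting inductance is roughly equal to the ESL, the total inductance doubles and the resonant frequency falls by a factor of about 1.4 relative to the SRF.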

Once calculated, the RF for the mounted component tells us the frequency at which the capacitor is most effective. Thus, we can use this to determine how close the capacitor needs to be located to the device it is decoupling so as to be most effective.

As the device being decoupled demands more current, it will cause a disturbance in the local power plane, and the decoupling capacitor will attempt to counteract this. There is a finite time between the device demanding current and the capacitor sensing and acting upon this demand. The time delay is calculated as follows:
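
Assuming the disturbance simply travels across the plane at the board's signal propagation speed, the one-way delay is the distance divided by that speed (d and v_p are assumed symbol names):

$$
t_{d} = \frac{d}{v_{p}}
$$

Here d is the distance between the capacitor and the device being decoupled.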

You can determine the signal propagation speed in your circuit board by means of the following equation (where εr is the dielectric constant of the PCB material):
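
The standard relation, with c denoting the speed of light in a vacuum, is:

$$
v_{p} = \frac{c}{\sqrt{\varepsilon_{r}}}
$$

For a typical FR-4 material with an assumed εr of about 4.5, this works out to roughly 1.4 × 10^8 m/s, or a little under half the speed of light.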

It obviously takes the same time delay for the current supplied from the capacitor to reach the device; hence, there is a “round trip delay.” We can therefore use the propagation speed “Vp” to determine the effective wavelength of the capacitor at its mounted resonant frequency. This wavelength can then be used to determine how close to the device being decoupled the capacitor needs to be placed using the following rules:

• When the capacitor is located more than a quarter of a wavelength away, the capacitor has no effect on the device being decoupled.
• The energy transfer will increase the closer the capacitor is located to the device being decoupled.
• An ideal target is to place capacitors within 1/40th of a wavelength. This means that smaller value capacitors have to be placed closer than do larger ones.

You can calculate the wavelength of the capacitor using the following equation:
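
The wavelength is the propagation speed divided by the mounted resonant frequency (λ = Vp / f). The sketch below chains this with the resonant frequency and propagation speed calculations to estimate placement distances. It is a minimal illustration under assumed values: the function names, the 100nF capacitor, the 1nH ESL and mounting inductance, and the εr of 4.5 are all hypothetical.

```python
import math

C0 = 299_792_458.0  # speed of light in a vacuum, m/s

def mounted_resonant_frequency(c_farads, esl_henries, l_mount_henries):
    """Resonant frequency including the mounting inductance, in Hz."""
    total_l = esl_henries + l_mount_henries
    return 1.0 / (2 * math.pi * math.sqrt(total_l * c_farads))

def propagation_speed(epsilon_r):
    """Signal propagation speed in the PCB material, in m/s."""
    return C0 / math.sqrt(epsilon_r)

def wavelength(freq_hz, epsilon_r):
    """Wavelength in the PCB material at the given frequency, in metres."""
    return propagation_speed(epsilon_r) / freq_hz

# Hypothetical part and board: every value here is an assumption.
f_res = mounted_resonant_frequency(100e-9, 1e-9, 1e-9)
lam = wavelength(f_res, 4.5)

print(f"Mounted resonant frequency: {f_res / 1e6:.2f} MHz")
print(f"Quarter wavelength limit:   {lam / 4 * 1000:.0f} mm")
print(f"1/40th wavelength target:   {lam / 40 * 1000:.0f} mm")
```

Running the same calculation for a smaller capacitor value shows the higher resonant frequency and the correspondingly shorter placement distance.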

As was noted above, it’s good practice to locate the decoupling capacitors within 1/40th of the wavelength, which means you will have zones of decoupling as shown below:

Priority should be given to termination resistors and discrete filtering capacitors for things like high-speed serial link power supplies over decoupling capacitors close to the device.

So now we understand the reasons why we decouple and how we go about doing this on the final design, including the rules outlining the placement of our decoupling capacitors. It is possible to verify the final layout using tools like HyperLynx Power Integrity from Mentor Graphics, which will look not only at DC drops across planes (this is just as important as decoupling) but also at the AC performance.

# Securing your FPGA Design

Let’s start by considering the high-level issues we face as engineers attempting to secure our designs. These include the following:

1. Competitors reverse engineering our design
2. Unauthorized production runs
3. Unauthorized modification of the design
4. Unauthorized access to the data within the design
5. Unauthorized control of the end system

The severity and impact of each of these will vary depending upon the end function of the design. In the case of an industrial control system, for example, someone being able to take unauthorized control could be critical and cause untold damage and loss of life. A secure data processing system will place the emphasis on the integrity of the data. By comparison, in the case of a commercial product, preventing reverse engineering, unauthorized production runs, or even modification might be the driving factors.

Luckily, as engineers, we can use a number of approaches to prevent this sort of thing from happening.

The first, and most critical, is taking control of your design data — source code, schematics, mechanical assemblies, etc. — and ensuring it’s secure. This information is the lifeblood of your company and must be protected all the way through the project life cycle, and beyond, to keep your competitive edge. Sadly, in this age of cyberattacks by anything from individuals to organized groups to nation states, this means having very good firewalls — maybe even an “air gap” — between your design network and one connected to the external world.

There are also efforts that can be undertaken to secure your design within the design process itself. These efforts can be split into the following approaches, which are in no way mutually exclusive:

No. 1: Restrict physical access to the FPGA
One of the first methods that can be undertaken is to limit physical access to the unit — especially the circuit card and the FPGA(s). This involves using methods to detect someone tampering with the unit and taking action suitable for the system upon detection of any threat.

Examples of suitable action would be to safely power down the unit or to erase functional parameters preventing further use of the unit. This is often the case in many industrial control systems or military systems to prevent unauthorized access attempts. Depending upon the end application, other physical methods can be undertaken, such as conformal coating or potting to prevent identification of key components. The use of soldered — as opposed to socketed — components also goes without saying.

No. 2: Encryption of configuration streams
Many applications use SRAM-based FPGAs due to the ability to update the design in the field. Typically, these designs require a configuration device that loads the FPGA configuration at power-up and other times. This configuration data stream may be accessed by a third party (depending upon what physical precautions you have taken).

Many devices these days allow for encryption (normally AES) of the data stream, or even require an encryption key before the device can be programmed further or data read back. Physically, the designer of the PCB can also limit people’s ability to probe these points by using a multi-layer PCB and by not routing tracks on the top of the board, but instead using internal layers. This is especially efficacious if external termination resistors are not required or can be embedded in the PCB itself (this does add cost).

No. 3: Disable read back or even reconfiguration
Many devices provide the option to prevent the reading back of data over the JTAG interface. Some devices even provide the option to prevent upgrading the device if a certain flag is set, thereby turning a re-programmable device into a one-time programmable (OTP) component. Of course, if you take this course of action, you need to be certain that you will not need to change the design and that you are programming the correct file. (I am sure we have all, at one point, programmed the wrong file into a device. Or is that just me?)

No. 4: Protect that JTAG port
Most access attempts to reverse engineer, modify, or change the functionality of your design are going to be made initially via your JTAG chain. There is a very interesting paper on this topic that you can access by clicking here. It is therefore imperative that you protect your JTAG interface, which should never appear on an external connector, but instead require that the unit be disassembled in order to access the connector.

Your physical security measures in the field should also provide protection for this interface. It’s also a good idea to provide several small chains that can be joined together via numerous tap controllers or external cabling, instead of creating large JTAG chains. Obviously, your design should not indicate on the silk screen where or what the JTAG connectors are. Some more secure designs do not include physical JTAG connectors, but rather just pads on the PCB to which a “bed of nails” type approach can be used to program the devices.

If the device TAP controller contains the optional TRST pin, then it is possible to fit a zero-ohm link to ground after programming to hold the TAP in reset, thereby preventing the TAP controller from working. You can do the same with the TCK pin if the TRST pin is not available. This means your attacker has to find and remove this resistor before the port will work.

No. 5: Differential power analysis
This is a technique that hackers can use to determine when the unit is processing data or when it is idling. As the power profile changes, it is possible to determine a significant amount about the design and the data passing through the system. One solution to this is to ensure the module / system draws the same power regardless of whether it is processing data full-out or sitting idle, thereby preventing this information from being collected. This requires more complicated power management and thermal management systems, but can be achieved by means of a shunt regulator, which becomes a constant current load on the main power supply.

No. 6: Design in the ability to detect counterfeits
There is always the possibility that, no matter how many precautions you have taken, your design, or portions thereof, can be copied and reused. However, there are systems you can implement within your code that will enable you to detect if your design has been copied. One potential method is the DesignTag approach from Algotronix, which uses a unique and innovative method of identifying your design.

The discussions above present just some of the possible threats that are out there, along with a selection of techniques that can be undertaken to secure your design.

# Using ChipScope ILA

If you are new to FPGAs, one aspect of the development flow you may not have considered is how you will go about debugging your design once it has been loaded into the FPGA.

In order to set the scene, let’s first cast our minds back to the days before FPGAs and consider how we would debug a digital circuit board or system in the lab. One of the tools we would have employed would be a logic analyzer. (See: Turn Your iPad Into a Logic Analyzer!) First we would connect the analyzer’s probe leads to the signals of interest on the board. We might also specify certain trigger conditions upon which we desired the tool to commence storing data for subsequent display and analysis. Then we would run the system and try to work out what the heck was happening.

Logic analyzers are, of course, still employed today. When it comes to using one to debug an FPGA design, we typically start by creating a dedicated test header that will connect to the FPGA’s inputs/outputs (I/Os). One problem with this scheme is that there can be hundreds of thousands of signals inside the FPGA, a much greater number than there are I/Os on the device and signals you can break out to the test header. This means that you may have to keep on rebuilding your design to access the signals of interest and route them out to the test header.

In some cases, the physical construction of the unit in question means that test headers are of use only at the board level and not during system integration. Indeed, I am working on one such project at the time of this writing. Another problem is that many FPGA designs are I/O limited from the start, so dedicating a bunch of pins to observe what’s happening on internal signals may simply not be a feasible option.

And one further problem is that, inevitably, the logic analyzer you are using will also be required by one or more other project teams, which means you all have to agree on how you will allocate the analyzer resources. I cannot tell you how frustrating it is to be homing in on a problem when… suddenly… it’s time to disconnect one’s intricate probe setup and allow the analyzer to be wheeled away to someone else’s project.

One solution to this problem — a solution that has seen great advances over the last few years — has been the development of in-chip logic analyzers for use with FPGAs. The idea is to employ any unused programmable resources and on-chip memory blocks to implement one or more “virtual” logic analyzers.

As with their physical counterparts, these virtual logic analyzers — like ChipScope from Xilinx, Identify RTL Debugger from Synopsys, Reveal from Lattice Semiconductor, and SignalTap from Altera — can be set up so that they will only start collecting data after certain trigger conditions have been met. Engineers can use these analyzers to “peer” into the design as it operates, storing the resulting data in on-chip RAM, extracting the results over the JTAG port, and then displaying the results — more or less in real-time — on their screens.

Using virtual logic analyzers may remove the need for test headers. Sadly, however, in many cases they do not remove the need to rebuild the code. One big advantage of these in-chip logic analyzers is that they offer the ability to capture the values on wide internal busses and store these values in internal RAM. The big downside with this approach comes in designs that are already utilizing most of the device’s programmable resources, because this will limit any logic analyzer implementations.

Implementing ChipScope can be very quickly achieved within the ISE design flow. The simplest method is to first implement your design, but not to generate the *.bit file. Instead, open up Core Inserter under your Xilinx installation (in Windows, use Start > Xilinx > ChipScope [pro] > Core Inserter). Select the target technology and identify the output file of the synthesis (either *.ngc or *.edf depending upon the tool you used) and add an ICON controller and then the ILA block.

This is where you will connect the signals you wish to analyze. It is possible to have several ILA blocks per ICON if you wish to use different triggers or monitor different signals, etc. Once you’re happy with the connections you can insert the core, although — depending on the speed of your machine — this may take a little time. After the core has been inserted, you need to rerun the implementation stages and generate a *.bit file (ISE should show the stages needing to be re-run). Having configured the target device, you can then connect to the target over JTAG using the ChipScope Analyzer tool and trigger on the waveform of interest as illustrated in the screenshot below.

If you are interested in playing with this yourself, an example of the project referenced in this column (along with all the files needed to run it on the Avnet LX9 development board) can be found here.

# Generating a VGA Test Pattern

In my original article, we discussed how we could use two counters — the pixel counter and the line counter — to generate the “H_Sync” (horizontal sync) and “V_Sync” (vertical sync) signals that are used to synchronize the VGA display. Now, in this article, we will consider how to also generate some RGB (red, green, and blue) signals to create an image on the display.

My Spartan 3A development board.

The first step was for me to retrieve my trusty Spartan 3A development board, which I had loaned to a friend at work. Once I had this board back in my hands, I started to ponder my implementation. Sadly my development board does not contain proper digital-to-analog converters (DACs) that can be driven by 8-bit wide red, green, and blue signals generated by the FPGA. Instead, it uses only four bits to represent each color, and it employs a simple resistor network to convert these digital outputs into corresponding analog voltages.

This means the color palette of my Spartan board is limited to four bits for the red channel, four bits for the blue channel, and four bits for the green channel, which equates to 2^4 x 2^4 x 2^4 = 4,096 colors. Although this 12-bit color scheme is admittedly somewhat limited, as we shall see it can still provide excellent results.

The next problem is the amount of memory required to hold the image. Once again, I had originally planned on storing an 800 x 600 pixel image in a frame buffer on the FPGA as described in Max’s article. Even with my limited color palette, however, just one frame would require 800 x 600 x 12-bits, which equals 5.76 megabits of RAM. This is more memory than is available in the FPGA on my development board.
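
The palette and frame buffer arithmetic can be checked with a few lines of code. This is only a sketch of the sums described above; the function names are hypothetical.

```python
def palette_size(bits_per_channel, channels=3):
    """Number of distinct colors for the given bits per color channel."""
    return (2 ** bits_per_channel) ** channels

def frame_buffer_bits(width, height, bits_per_channel, channels=3):
    """Bits of memory needed to hold one full frame."""
    return width * height * bits_per_channel * channels

colors = palette_size(4)
bits = frame_buffer_bits(800, 600, 4)

print(f"Colors available: {colors}")
print(f"Frame buffer:     {bits / 1e6:.2f} megabits")
```

Changing the bits per channel to 8 shows how quickly the memory requirement grows for a full 24-bit color frame.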

As a “cheap-and-cheerful” alternative, I decided to generate a series of simple test patterns algorithmically. A high-level block diagram of my VGA test pattern generator is illustrated below:

High-level block diagram of my VGA test pattern generator.

First we have a “System Clock,” which is used to synchronize all of the activities inside the FPGA. The “VGA Timing” module comprises the pixel and line counters we discussed in my original article. In addition to generating the “H_Sync” and “V_Sync” signals that are used to synchronize the VGA display itself, this module also generates a number of other signals that are used to control the “VGA Video” module.

The “Algorithmic Test Pattern Generator” module is used to generate a series of simple test patterns. The “VGA Video” module takes these test patterns and presents them to the outside world in the form of the three 4-bit RGB signals that are presented to the DACs (or resistor networks, in the case of my development board).

Actually, I should note that in my real-world implementation, the “Algorithmic Test Pattern Generator” and “VGA Video” modules are one and the same thing, but it’s easier to think of them as being separate entities for the purposes of these discussions.

My implementation of this test pattern generator consumes only a small portion of the resources available on my Spartan FPGA. In fact, it requires just 96 slices out of the 5,888 slices that are available, which means it utilizes less than 2 percent of the chip’s total resources.

To be honest, I’m glad that the limitations of my development board forced me to take this intermediate step — that is, to create a test pattern generator. This is because a test pattern provides the simplest way to output images to prove that the backend display drivers are working correctly. Generating a test pattern (or a series of test patterns, in this case) is a good idea for a variety of reasons:

• It allows the RGB color outputs to be verified to prove that they are functioning correctly. This can be achieved by displaying incremental bars where the color is gradually increased from 0 to its maximum value.
• It allows the timing to be checked. Is the frame updating correctly? Are the borders correct? And so forth.
• More advanced test patterns can be used to align the image with a camera viewfinder on systems that are used to capture real-world images.
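
The incremental bar idea from the first point can be modeled in software before committing it to hardware. The sketch below is a hypothetical Python model, not the VHDL from my project: it assumes an 800 x 600 active area split into three horizontal bands (red, green, and blue), with each band stepping through the sixteen 4-bit intensity levels from left to right.

```python
def gradient_bar_pixel(x, y, width=800, height=600):
    """Return the 4-bit (red, green, blue) value for pixel (x, y).

    x and y are the pixel and line counter values within the active area.
    """
    level = (x * 16) // width   # intensity step, 0 to 15, left to right
    band = (y * 3) // height    # 0 = red band, 1 = green band, 2 = blue band
    rgb = [0, 0, 0]
    rgb[band] = level
    return tuple(rgb)

# Sample a few positions across the screen.
for x, y in ((0, 0), (400, 100), (799, 100), (400, 300), (799, 599)):
    print((x, y), gradient_bar_pixel(x, y))
```

A model like this also provides expected values against which the simulation results of the hardware implementation can be compared.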

As an aside, a famous television test pattern many people will recognize is the Indian Head Test Card. This was common in America until the early 1970s, at which time it was replaced by the SMPTE Color Bars.

If you wish to probe deeper into my design, click here to download a ZIP (compressed) version of my project file. As you will see, this design consists of one structural unit tying together two modules: the “VGA Timing” module and the “VGA Video” module (which includes the algorithmic test pattern generation code as noted above).

The “VGA Video” module outputs the RGB video signals during the active periods of the video display period, as can be seen in the results of the simulation shown in the following screenshot:

The results from my initial simulations.

Again, the values in the line and pixel counters in the “VGA Timing” module are used by the “VGA Video” module to determine positions on the screen and to decide when the RGB outputs need to be manipulated to achieve the desired result.

# Generating VGA from an FPGA

Thanks to their nature, FPGAs are well suited to the intense levels of signal processing required by many imaging systems. Of course, one of the most rewarding aspects of image processing is seeing the resultant image on a display, and a very common form of display uses the VGA (video graphics array) standard.

The first VGA display was introduced with the IBM PS/2 line of computers in 1987. One thing most people associate with this form of display is the 15-pin D-subminiature VGA connector you tend to find on the back of a tower computer or the side of your notebook computer.

The original VGA standard supported a resolution of only 640×480 (which means 640 pixels in the horizontal plane and 480 lines in the vertical plane). Over the years, however, the standard has evolved to support a wide variety of resolutions, all the way up to widescreen resolutions as high as 1920×1080.

The act of driving a VGA is surprisingly simple, being based on the use of two counters as follows:

• Pixel counter: Counts the number of pixels in a line at the required clock frequency (40MHz in this example); this is used to generate the horizontal timing.
• Line counter: Also known as the Frame Counter, this repeats at the refresh rate of the desired VESA specification (60Hz, 75Hz, 85Hz, and so on). This also identifies when the counter is within a valid region for outputting display data. The line counter is incremented each time the pixel counter reaches its terminal count.

These counters are used to generate two synchronization (sync) markers: the “V_Sync” (vertical sync) and “H_Sync” (horizontal sync) signals. In conjunction with the RGB (red, green, and blue) analog signals, “V_Sync” and “H_Sync” form the basic signals required to display video on a monitor.

Actually, this may be a good time to take a step back to remind ourselves as to the origin of terms like “V_Sync” and “H_Sync.” The main thing to remember is that, at the time the original VGA standard was introduced, the predominant form of computer display was based on the cathode ray tube (CRT), in which an electron beam is used to “write” on a phosphorescent screen.

There are several ways in which an electron beam can be manipulated to create images on a CRT screen, but by far the most common technique is the raster scan. Using this approach, the electron beam commences in the upper-left corner of the screen and is guided across the screen to the right. The path the beam follows as it crosses the screen is referred to as a line. When the beam reaches the right-hand side of the screen it undergoes a process known as horizontal flyback, in which its intensity is reduced and it is caused to “fly back” across the screen. While the beam is flying back it is also pulled a little way down the screen as shown in the following illustration:

The beam is now used to form a second line, then a third, and so on until it reaches the bottom of the screen. The number of lines affects the resolution of the resulting picture (that is, the amount of detail that can be displayed). When the beam reaches the bottom right-hand corner of the screen it undergoes vertical flyback, in which its intensity is reduced, it “flies back” up the screen to return to its original position in the upper left-hand corner, and the whole process starts again.

The “V_Sync” and “H_Sync” signals are used to synchronize all of these activities. Thus, returning to our pixel and line counters, the values on these counters can be decoded so as to generate the required waveforms on the “V_Sync” and “H_Sync” outputs from an FPGA (that is, on the FPGA’s pins that are being used to drive the display’s “V_Sync” and “H_Sync” signals). Meanwhile, generating the RGB signals will require the FPGA to drive three digital-to-analog converters (DACs), one for each signal. As the design engineer, you must ensure that the latency through the DACs is accounted for to ensure that their outputs are correctly aligned with respect to the “V_Sync” and “H_Sync” signals.

The line and pixel counters both have portions of their count sequences when no data is being output to the display. In the case of an 800×600 resolution display refreshing at 60Hz, for example, the vertical (line) counter will actually count 628 lines while the horizontal (pixel) counter will count 1,056 pixels.

Why should this be so? Well, returning to our raster scan, it takes a certain amount of time for the electron beam to undergo its horizontal and vertical flyback activities. One way to think about these times is that we have an actual display area that we see, and that this actual display area “lives” in a larger (virtual) display space that contains a border zone that we don’t see:

Of course, in the case of today’s flat-screen, liquid crystal displays (LCDs) and similar technologies, we don’t actually need to worry about things like horizontal and vertical flyback times. At least, we wouldn’t have to worry if it were not for the fact that we don’t actually know what type of screen our FPGA is driving. Thus, anything driving a VGA output generates the timing signals required to drive a CRT display, and other forms of display simply make allowances for any of the historical peculiarities associated with these VGA signals.

But we digress… Each of our counters has a collection of associated timing parameters. Vertical timings are referenced in terms of lines, while horizontal timings are referenced in terms of pixels. The following values are those associated with a display resolution of 800×600:
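For reference, the timing commonly published for 800×600 at 60 Hz (with a 40 MHz pixel clock) can be captured in a short Python model of the two counters. This is only an illustrative sketch, not the VHDL mentioned in this column. The constant names, the `decode` and `step` helpers, and the choice to start each count at the first visible pixel are assumptions made for the example.

```python
# Commonly published VESA timing for 800x600 at 60 Hz.
# Assumption: each counter starts at the first visible pixel/line, followed
# by front porch, sync pulse, and back porch.
H_VISIBLE, H_FRONT_PORCH, H_SYNC, H_BACK_PORCH = 800, 40, 128, 88  # pixels
V_VISIBLE, V_FRONT_PORCH, V_SYNC, V_BACK_PORCH = 600, 1, 4, 23     # lines

H_TOTAL = H_VISIBLE + H_FRONT_PORCH + H_SYNC + H_BACK_PORCH  # 1,056 pixels
V_TOTAL = V_VISIBLE + V_FRONT_PORCH + V_SYNC + V_BACK_PORCH  # 628 lines
PIXEL_CLOCK_HZ = 40_000_000


def decode(pixel, line):
    """Decode counter values into (h_sync_active, v_sync_active, video_active)."""
    h_start = H_VISIBLE + H_FRONT_PORCH
    v_start = V_VISIBLE + V_FRONT_PORCH
    h_sync = h_start <= pixel < h_start + H_SYNC
    v_sync = v_start <= line < v_start + V_SYNC
    video = pixel < H_VISIBLE and line < V_VISIBLE
    return h_sync, v_sync, video


def step(pixel, line):
    """Advance the counters by one pixel clock, wrapping at line and frame end."""
    pixel += 1
    if pixel == H_TOTAL:
        pixel = 0
        line = (line + 1) % V_TOTAL
    return pixel, line


# Frame rate implied by the pixel clock and the total counts.
refresh_hz = PIXEL_CLOCK_HZ / (H_TOTAL * V_TOTAL)
print(f"{H_TOTAL} x {V_TOTAL} clocks per frame, {refresh_hz:.2f} Hz refresh")
```

The totals match the 1,056-pixel and 628-line counts quoted earlier. Only 800 × 600 of those counter states carry picture data.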

Using this approach, it is very easy to generate a simple VGA interface and see the results of our image processing algorithms on a monitor. If you are interested, you can download a ZIP file containing the VHDL code for these counters along with a VHDL testbench by clicking here.

# Mean Time Between Failure

Every engineer should be familiar with the concept of Mean Time Between Failure (MTBF), which is one of the most commonly used terms in reliability engineering. Having said this, different people have varying interpretations of what MTBF actually means, and these interpretations are often incorrect.

For example, suppose I were to tell you that a particular system had an MTBF of 8,766 hours, which equates to one year (MTBF is always quoted in hours). Does this mean that if I use a single instantiation of this system in my application, that system is guaranteed to be operational for a year before its first failure?

In fact, you might be surprised to discover that with an MTBF of a year, a single system has a probability of only about 0.37 of still being operational after the 8,766 hours. If you’re a manufacturer of systems, then only about 37 percent of the units you produce will still be operational after this period, which might cause problems for your warranty and post-design services departments. The probability of success is given by P(s) = e^(-t/MTBF), where t is the operating time in the same units as the MTBF. Charting this equation in Excel will produce the following plot, which shows probability of success at 0.5 MTBF and 1.0 MTBF:
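If you would rather not open a spreadsheet, the same equation can be evaluated in a few lines of Python. This is a minimal sketch that assumes a constant failure rate. The function name `probability_of_success` is just a label chosen for this example.

```python
import math

HOURS_PER_YEAR = 8766  # the one-year figure used in the text


def probability_of_success(t_hours, mtbf_hours):
    """P(s) = e^(-t/MTBF), valid for a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)


mtbf = HOURS_PER_YEAR
for fraction in (0.5, 1.0):
    p = probability_of_success(fraction * mtbf, mtbf)
    print(f"t = {fraction:.1f} MTBF -> P(s) = {p:.3f}")
# t = 0.5 MTBF -> P(s) = 0.607
# t = 1.0 MTBF -> P(s) = 0.368
```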

As engineers, we want (or, in some cases, need) to ensure the probability of success for a particular number of years is very high, as high as 0.95 in many cases (some applications require 0.99 or higher, but this will require you to employ redundancy). This means that, for the simple one year product/mission life presented in our example, solving P(s) = e^(-t/MTBF) for the MTBF gives about 19.5 years to obtain the required probability of success. This is a big difference from what you may originally have thought.
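Rearranging P(s) = e^(-t/MTBF) gives MTBF = -t / ln(P(s)), which makes it easy to try other targets. The sketch below assumes the same constant failure rate model. `required_mtbf` is a hypothetical helper name, and the mission time and result share whatever unit you pass in.

```python
import math


def required_mtbf(mission_time, target_probability):
    """MTBF needed so that P(s) at mission_time equals target_probability."""
    if not 0.0 < target_probability < 1.0:
        raise ValueError("target_probability must be between 0 and 1")
    return -mission_time / math.log(target_probability)


for target in (0.95, 0.99):
    years = required_mtbf(1.0, target)
    print(f"P(s) = {target}: MTBF of about {years:.1f} years for a 1-year mission")
# P(s) = 0.95: MTBF of about 19.5 years for a 1-year mission
# P(s) = 0.99: MTBF of about 99.5 years for a 1-year mission
```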

The reliability of a module or system follows the well-known “bathtub curve” as shown below:

The discussions in the remainder of this blog will relate to determining the MTBF and failure rate during the constant failure rate duration. It is the responsibility of the manufacturer to ensure that infant mortality is screened out, which is one reason for performing post-production burn-in.

One method of determining MTBF is based around the failure rate of the component interconnections (solder joints), the technology (hybrid, IC, PCB etc.), and the operating environment (ground, aircraft, space vehicle, etc.). In fact, two methods for determining the failure rate are commonly used:

• Parts count: As this technique is based around reference stresses on components, the resulting analysis tends to give a more conservative (pessimistic) failure rate.
• Stressed reliability: This approach utilizes actual electrical and thermal stress applied to each component to determine a more accurate failure rate for the system.

In many cases, circuit/systems designers may use both of these techniques to determine the MTBF of their system. For example, they may initially perform a parts count analysis on the bill of materials (BOM) to provide an initial (“ball park”) estimate of the reliability.

Later, once the circuit design has been completed, they may follow up with a stressed reliability analysis that takes into account the electrical stresses on the components and the equipment temperatures. This second, stressed analysis tends to lower the failure rate and increase the MTBF, which is generally what engineers and companies want while still being accurate.
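The initial “ball park” estimate can be sketched as a simple series model, in which every part must work and the system failure rate is the sum of quantity times part failure rate. This is a deliberately simplified illustration, not the full handbook procedure. It ignores quality and environment factors, and every failure rate in the example BOM is a made-up placeholder except the 24 FIT FPGA figure quoted later in this column.

```python
# Hypothetical bill of materials: part -> (quantity, failure rate in FITs).
# All values are placeholders for illustration only.
BOM = {
    "FPGA": (1, 24.0),
    "DC/DC converter": (2, 150.0),
    "ceramic capacitor": (100, 0.5),
    "connector": (4, 10.0),
}


def system_fit(bom):
    """Series model: the system fails if any part fails, so rates add up."""
    return sum(quantity * fit for quantity, fit in bom.values())


def fit_to_mtbf_hours(fit):
    """One FIT is one failure per 1e9 hours."""
    return 1e9 / fit


total_fit = system_fit(BOM)
print(f"System failure rate: {total_fit:.0f} FITs")
print(f"System MTBF: {fit_to_mtbf_hours(total_fit):,.0f} hours")

# Rank the contributors to see which parts drive the total.
for part, (quantity, fit) in sorted(BOM.items(), key=lambda item: -item[1][0] * item[1][1]):
    print(f"{part}: {quantity * fit:.0f} FITs")
```

A stressed analysis would replace each placeholder rate with one derived from the actual electrical and thermal stress on that part.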

One common standard for performing both of these analyses is Mil Handbook 217F Notice 2. This is freely available over the Internet and provides very detailed information on reliability rate calculations for different devices and environments. The only downside with this standard is that it was last updated in 1995, so its predictions can be a little pessimistic for modern components. The other commonly used standards are the Bellcore/Telcordia and SAE Reliability Prediction Methods.

Another method for determining the failure rate of a device is via life testing; i.e., the total number of hours the device operates without a failure. This is often achieved using accelerated life testing, which stresses the devices beyond their normal operating conditions to “age” the devices and determine failure rates. This approach is generally performed by device manufacturers to obtain each component’s FIT (failures in time) rate. Typical examples of this are the Mil Std 883 and the JEDEC Solid State Technology Association’s “JESD22 reliability test methods for packaged devices.”

The FIT rate is the number of failures in a billion (10^9) device hours; for example, a device with a FIT rate of 20 has a failure rate of 20e-9 failures per hour. The relationship between MTBF and FIT rate is very simple: the failure rate is the reciprocal of the MTBF, so the MTBF in hours is 10^9 divided by the FIT rate. Hence, in our earlier example, in order to have a probability of success of 0.95 for one year, the total design needs a FIT rate no greater than about 5,850 FITs, which is still pretty high.
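The conversion between FIT rate and MTBF, and the FIT budget for a given mission, can be checked with a short Python sketch. It assumes a constant failure rate and uses the 8,766-hour year from earlier in this column. The function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8766
FIT_HOURS = 1e9  # one FIT is one failure per billion device hours


def fit_to_mtbf_hours(fit):
    """MTBF in hours for a given FIT rate."""
    return FIT_HOURS / fit


def mtbf_hours_to_fit(mtbf_hours):
    """FIT rate for a given MTBF in hours."""
    return FIT_HOURS / mtbf_hours


def max_fit_for_mission(mission_hours, target_probability):
    """Largest FIT rate that still meets the target P(s) = e^(-t/MTBF)."""
    failures_per_hour = -math.log(target_probability) / mission_hours
    return failures_per_hour * FIT_HOURS


print(f"20 FITs -> MTBF of {fit_to_mtbf_hours(20):,.0f} hours")
budget = max_fit_for_mission(HOURS_PER_YEAR, 0.95)
print(f"FIT budget for P(s) = 0.95 over one year: {budget:,.0f}")
```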

A typical FIT rate for the current 7 series FPGAs from Xilinx (this site’s sponsor) is 24 FITs. Power supplies tend to dominate failure rates, especially under Mil Handbook 217F Notice 2, which can be used to calculate the reliability of a hybrid device.

# Increasing FPGA System Reliability

In this column, I will look at what we can do within the FPGA and at the hardware/system level to increase reliability.

Focusing on the FPGA implementation first, there are numerous ways the design can be corrupted, depending on the end environment. This corruption could be the result of a single-event upset (SEU), a single-event functional interrupt (SEFI), or even data corruption from a number of sources.

An SEU occurs when a data bit (register or memory) is hit by radiation and flips from a 0 to a 1, or vice versa. A SEFI is where a control register or other critical register suffers a bit flip that locks up the system. In the world of SRAM-based FPGAs, we tend to consider a SEFI to have occurred when one of the SRAM cells holding the device’s configuration flips and changes the design’s implementation. Data corruption can occur for a number of reasons, including EMI (electromagnetic interference) affecting the design in an industrial application.

How can we protect these systems and increase a unit’s MTBF? Depending on the end application, it may be acceptable simply to duplicate the logic — create two instantiations of the design within the same device — and to indicate an error if the results do not match. The higher-level system would be in charge of deciding what to do in the event of such an error.

The next thing we can do is to implement triple modular redundancy (TMR) within the device. At the simplest level, this instantiates the same design three times within the FPGA. A majority vote — two out of three — decides the result. (Even though this might sound simple, implementing it can become very complex very quickly.) If one instantiation of the design becomes corrupted, the error will be masked. Depending on the kind of error, the device may clear itself on the next calculation, or it may require reconfiguration.
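The two-out-of-three vote itself is simple to express. The Python sketch below is a behavioral model only, not FPGA code, and the names `majority_vote` and `copies_disagree` are chosen for this example. It votes bit by bit across three copies of a value and flags any disagreement, so a supervisor can decide whether a repair is needed.

```python
def majority_vote(a, b, c):
    """Bitwise two-out-of-three vote across three copies of a value."""
    return (a & b) | (a & c) | (b & c)


def copies_disagree(a, b, c):
    """True if at least one copy differs from the voted result."""
    voted = majority_vote(a, b, c)
    return not (a == voted and b == voted and c == voted)


good = 0b1010
upset = good ^ 0b1000  # one copy suffers a single bit flip

print(bin(majority_vote(good, good, upset)))  # 0b1010, the error is masked
print(copies_disagree(good, good, upset))     # True, a copy needs attention
```

Note that the vote masks an upset in one copy only. If two copies are corrupted in the same bit, the wrong value wins.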

Implementing TMR can be performed by hand, which can be time-consuming, or using tools such as the TMRTool from Xilinx (this site’s sponsor) or the BL-TMR from Brigham Young University. If TMR is implemented correctly (and you have to be careful about synthesis optimizations), the design should mask all SEUs, as long as only one is present at any particular time.

Memory blocks inside the FPGA may also use error-correcting code technology to detect and correct SEUs. However, to ensure you really have good data, you need to perform memory scrubbing. This involves accessing the memory when it is not being used for other purposes, reading out the data, checking the error detection and correction code, and (if necessary) writing back the corrected data. Common tools here include Hamming codes that allow double-error detection and single-error correction.
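To make the scrubbing idea concrete, here is a small Python model of an extended Hamming code (4 data bits, 3 Hamming parity bits, and 1 overall parity bit) together with a scrub pass over a list of stored words. It is an illustrative sketch, not how any particular FPGA block RAM implements ECC. The bit layout and the `encode`, `decode`, and `scrub` names are assumptions made for the example.

```python
def encode(nibble):
    """Encode 4 data bits as 8 bits: index 0 is overall parity, 1..7 is Hamming(7,4)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions with bit 2 set
    code[0] = sum(code[1:]) % 2            # even parity over the whole word
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    parity_ok = sum(code) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        # Odd overall parity: one bit flipped, and the syndrome is its position
        # (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, detect only.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


def scrub(memory):
    """Read every word, write back corrected data, and return the error counts."""
    corrected = uncorrectable = 0
    for address, word in enumerate(memory):
        nibble, status = decode(word)
        if status == "corrected":
            memory[address] = encode(nibble)
            corrected += 1
        elif status == "uncorrectable":
            uncorrectable += 1
    return corrected, uncorrectable


memory = [encode(value) for value in range(16)]
memory[3][5] ^= 1        # single upset: correctable
memory[12][2] ^= 1       # two upsets in one word: detectable only
memory[12][6] ^= 1
print(scrub(memory))     # (1, 1)
```

Running the scrub often enough keeps single upsets from accumulating into double errors in the same word, which is the point of scrubbing.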

This nicely leads us to the concept of scrubbing the entire FPGA. Depending on your end application, you might be able to simply reconfigure the FPGA each time before it is used. For example, a radar imaging system taking an image could reconfigure the FPGA between images to prevent corruption. If the FPGA’s performance is more mission-critical or uptime-critical, you can monitor the FPGA’s configuration by reading back the configuration data over the configuration interface. If any errors are detected, the entire device may be reconfigured, or partial reconfiguration may be used to target a specific portion of the design. Of course, all this requires a supervising device or system.

Of course, it will take significant analysis to determine how any of the methods that have been mentioned thus far affects MTBF. The complexity of this analysis will depend on the environment in which the system is intended to operate.

Working at the module level, we as engineers can take a number of steps to increase reliability. The first is to introduce redundancy, either within the module itself (e.g., extra processing chains) or by duplicating the module in its entirety.

If you are implementing redundancy, you have two options: hot and cold. Each has advantages and disadvantages, and implementing either option will be a system-level decision.

In the case of hot redundancy, both the prime and redundant devices (to keep things simple, I am assuming one-for-two redundancy) are powered up, with the redundant module configured ready to replace the prime should it fail. This has the advantage of a more or less seamless transition. However, since the redundant unit is operating alongside the prime, it is also aging and might fail.

In the case of cold redundancy, the prime unit is powered and operating while the redundant unit is powered down. This means the redundant module is not subject to as many aging stresses and, to a large extent, is essentially new when it is turned on. However, this comes at the expense of having some amount of down time if the prime module fails and the redundant module must be switched in.

With careful analysis of your system, you can identify the key drivers; i.e., which components have a high failure rate and are hurting system reliability. Power supply is often a key driver. Therefore, it is often advisable to implement a redundant power supply architecture that can power the same electronics, often in a one-out-of-two setup.

If you are implementing redundancy at the data path, module, or system level, the number of data paths, modules, or systems you employ will impact the new failure rate. For example, 12-for-8 systems will give you a lower failure rate than 10-for-8 systems. Of course, redundancy comes at the expense of cost, size, weight, power consumption, and so forth. A very good interactive Website for this analysis can be found by clicking here.
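The effect of adding spares can be estimated with a standard k-out-of-n calculation. The sketch below assumes identical, independent units that are all powered (hot redundancy) and perfect switching, which is optimistic for a real system. The per-unit reliability of 0.9 is an arbitrary illustrative value, and `k_of_n_reliability` is a hypothetical helper name.

```python
import math


def k_of_n_reliability(k, n, unit_reliability):
    """Probability that at least k of n independent, identical units still work."""
    p = unit_reliability
    return sum(
        math.comb(n, working) * p**working * (1 - p) ** (n - working)
        for working in range(k, n + 1)
    )


UNIT_RELIABILITY = 0.9  # illustrative value only
for installed in (8, 10, 12):
    r = k_of_n_reliability(8, installed, UNIT_RELIABILITY)
    print(f"{installed}-for-8: probability of success = {r:.4f}")
```

With these assumptions the 12-for-8 arrangement comes out ahead of 10-for-8, which in turn is far better than having no spares at all.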

When implementing redundancy at either the system or module level, it is crucial that a fault in the prime module cannot prevent the redundant module from working, and vice versa. Fault propagation has to be considered, and prime and redundant modules must be isolated from each other.