In a previous column, we discussed the difference between registers and latches, so I decided to dedicate this column to explaining what metastability is, what causes it, and how we can learn to live with it, since its occurrence cannot be totally prevented.
As illustrated below, metastability can happen to registers when their setup or hold times are violated; that is, if the data input changes within the capture window around the active clock edge. As a result, the output of the register may enter a metastable state, in which it may hover at an intermediate level or oscillate between logic 0 and 1 values. If left untreated, this metastable condition may propagate through the system, causing issues and errors. The register will eventually recover from its metastable state and “settle” on a logic 0 or 1 value; the time it takes for this to occur is called the recovery time.
Metastability within an FPGA design will typically occur in one of two ways:
- When an incoming signal is asynchronous with regard to the clock domain. This may be an external input signal or a signal crossing between clock domains. In this case, the design engineer is expected to resynchronize the signal to address metastability, which is certain to occur eventually. This is where a multi-stage synchronizer is typically employed, as discussed below.
- When multiple register elements in a synchronous design are using the same clock, but phase alignment or clock skew issues mean that the output from one register violates another register’s setup or hold time. This may be addressed by modifying the place-and-route constraints or by changing the logic design itself.
Let’s consider the case of an incoming signal that is asynchronous with respect to the system clock. It is the engineer’s responsibility to create the design in such a way as to mitigate any resultant metastability issues. Many engineers will be familiar with the concept of a two-stage synchronizer, but I wonder how many really understand just how it performs its magic?
In fact, the two-stage synchronizer works by permitting the first register to go metastable. The idea is that the system clock is running — and therefore “sampling” the external signal — significantly faster than the external signal is changing from one state to another. If it should happen that a transition on the asynchronous signal causes the first register to become metastable, then ideally this register will have recovered by the time the next clock edge arrives and loads this value into the second register.
Now, this is where some people become confused. Let’s assume that the original value on the asynchronous signal was a logic 0, and that this value has already been loaded into both of the synchronizer registers. At some stage the asynchronous signal transitions to a logic 1. Let’s explore the possibilities as follows:
The first possibility is that the transition on the asynchronous signal doesn’t violate the first register’s setup or hold times. In this case, the first active clock edge (shown as “Edge #1” in the illustration below) following the transition on the asynchronous signal loads the new value into the first register, and the second active clock edge will copy this new value from the first register into the second as shown below:
The second possibility is that the transition on the asynchronous signal does violate the first register’s setup or hold times, which means the first active clock edge causes the first register to enter a metastable state. At some stage — hopefully before the next clock edge — the first register will recover, by which we mean it will settle into either a logic 0 or a 1 value. Let’s assume that the first register ends up settling into a logic 1 as shown below:
This is, of course, what we wanted in the first place. In this case, the second active clock edge will load this 1 into the second register (which originally contained a logic 0). Thus, the end result — as seen at the output from the second register — is exactly the same as if the first register had not gone metastable at all.
The final possibility (at least, the last one we will consider in this column) is that, following a period of metastability, the first register settles into a logic 0 as shown below:
In this case, the second active clock edge will load this 0 into the second register (which already contained a 0). At the same time, this second active clock edge will load the logic 1 on the asynchronous signal into the first register. Thus, it is the third active clock edge that eventually causes the second register to be loaded with a logic 1.
The end result of using our two-stage synchronizer is that — in a worst-case scenario — the desired output from the synchronizer is delayed by a single clock cycle. Having said this, there is a slight chance that the first register will not recover in time, which might cause the second stage of the synchronizer to enter its own metastable condition.
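To make the three cases above concrete, here is a minimal behavioural sketch in Python (not the VHDL used for the simulations in this column). It assumes an idealised model in which a metastable first register always resolves to a clean 0 or 1 before the next clock edge; the function name `two_stage_synchronizer` and its `resolved` argument are hypothetical names introduced purely for illustration.

```python
def two_stage_synchronizer(async_at_edge, resolved=None):
    """Return the second register's output after each active clock edge.

    async_at_edge: value (0 or 1) on the asynchronous input at each edge.
    resolved: optional dict mapping an edge index to the value the first
              register settles to when that edge violates setup/hold
              (assumption: it always settles before the next edge).
    """
    resolved = resolved or {}
    reg1 = reg2 = 0  # both registers start out holding logic 0
    outputs = []
    for edge, value in enumerate(async_at_edge):
        # Both registers are clocked together, so the second register
        # captures the value the first register held before this edge.
        reg2 = reg1
        # The first register captures the input, or whatever value it
        # settles to if this edge caused a metastable event.
        reg1 = resolved.get(edge, value)
        outputs.append(reg2)
    return outputs


# The input goes to 1 just before the first edge (index 0).
stimulus = [1, 1, 1, 1]

print(two_stage_synchronizer(stimulus))          # no violation -> [0, 1, 1, 1]
print(two_stage_synchronizer(stimulus, {0: 1}))  # settles to 1 -> [0, 1, 1, 1]
print(two_stage_synchronizer(stimulus, {0: 0}))  # settles to 0 -> [0, 0, 1, 1]
```

In the last case, the 1 appears at the output on the third edge rather than the second, which is the single clock cycle of delay described above. The model deliberately ignores the case in which the first register fails to recover in time.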
The alternative would be to have three or even more stages, so how do we determine if two stages are acceptable… or not?
Well, as engineers we can actually calculate this. Sadly, this does involve some math, but I will try to keep the painful parts to a minimum. The mean time between failures (MTBF) for a flip-flop (register) depends upon the manufacturing process. Let’s start with the equation for a single flip-flop as follows:
Based on this, we can calculate the MTBF for a multi-stage synchronizer using the following equation:
For both equations:
I really am sorry about this, and I will do my best to keep math out of future columns. Having said this, by means of this equation, it is possible to determine the mean time between metastability events for your chosen synchronizer structure (two or more flip-flops). If the resulting MTBF for a two-stage synchronizer shows that the time between metastable events is not acceptable (that is, they will occur too often), then you can introduce a third flip-flop.
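As a rough illustration of this kind of calculation, the sketch below assumes the widely quoted textbook form MTBF = e^(t_r/τ) / (T0 · f_clk · f_data), with the settling time available at each stage added together in the exponent for a longer chain. The notation may differ from that used above, how many settling intervals you count depends on the convention you follow, and every number in the example is a made-up placeholder rather than data for any real device; real values for τ and T0 have to come from the FPGA vendor.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def synchronizer_mtbf(settling_times, tau, t0, f_clk, f_data):
    """Estimate the MTBF (in seconds) of a synchronizer chain.

    settling_times: time in seconds available for each counted stage
                    to resolve before its output is sampled.
    tau, t0:        device-dependent metastability constants (seconds).
    f_clk, f_data:  clock frequency and data toggle rate (Hz).
    """
    total_settling = sum(settling_times)
    return math.exp(total_settling / tau) / (t0 * f_clk * f_data)


# Hypothetical placeholder values, NOT taken from any datasheet.
TAU = 50e-12      # resolution time constant
T0 = 100e-12      # metastability window constant
F_CLK = 100e6     # 100 MHz system clock
F_DATA = 1e6      # asynchronous input toggling at 1 MHz
T_SETTLE = 2e-9   # settling time available per stage

one_interval = synchronizer_mtbf([T_SETTLE], TAU, T0, F_CLK, F_DATA)
two_intervals = synchronizer_mtbf([T_SETTLE, T_SETTLE], TAU, T0, F_CLK, F_DATA)

print(f"One settling interval:  {one_interval / SECONDS_PER_YEAR:.3g} years")
print(f"Two settling intervals: {two_intervals / SECONDS_PER_YEAR:.3g} years")
```

Because the settling time sits in the exponent, each extra stage multiplies the MTBF by e^(t_r/τ) rather than adding to it, which is why one additional flip-flop is usually enough to move an unacceptable figure into comfortable territory.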
The use of a three-stage synchronizer is often required in the case of high-speed or high-reliability systems. If you are designing for a high-reliability application, then you will need to demonstrate that the MTBF comfortably exceeds the operating life of the equipment (at a minimum), so that a metastability failure is acceptably unlikely during that life. This MTBF (or, more correctly, its reciprocal, which is the failure rate) can also be fed into the system-level reliability calculations to determine the overall reliability of the entire system.
When it comes time to simulate these synchronizers, it quickly becomes obvious that the tools are limited with regard to the way in which they can model metastable events. For example, consider the following results generated by simulating an RTL version of a two-stage synchronizer:
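One simple way to turn an MTBF into numbers a reliability engineer can use is sketched below. It assumes a constant failure rate (the usual exponential reliability model), which is an assumption of this sketch rather than something established above, and the MTBF and operating life in the example are hypothetical placeholders.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def failure_rate(mtbf_seconds):
    """Failure rate in failures per second: the reciprocal of the MTBF."""
    return 1.0 / mtbf_seconds


def probability_of_failure(mtbf_seconds, life_seconds):
    """Chance of at least one failure during the operating life.

    Assumes a constant failure rate, so P = 1 - exp(-life / MTBF).
    expm1 keeps precision when the ratio life / MTBF is very small.
    """
    return -math.expm1(-life_seconds / mtbf_seconds)


# Hypothetical figures: a one-million-year MTBF and a 20-year mission.
mtbf = 1e6 * SECONDS_PER_YEAR
life = 20 * SECONDS_PER_YEAR

print(f"Failure rate: {failure_rate(mtbf):.3g} failures per second")
print(f"P(failure during life): {probability_of_failure(mtbf, life):.3g}")
```

Note that an MTBF equal to the operating life is nowhere near good enough under this model, since it still leaves roughly a 63% chance of a failure during that life.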
Even though there is, in fact, a problem with this design, no errors are detected or displayed, due to the fact that the RTL does not — in this case — contain any timing information.
For a simulation to exhibit metastability, you have to simulate at the gate level using a Standard Delay Format (SDF) file that contains the appropriate timing information. The synthesis tool extracts this timing information from the library associated with the target FPGA component. For example, consider the following gate-level simulation results for the same two-stage synchronizer:
Also, the following warning messages were generated as part of this gate-level simulation:
If you wish, you can replicate these results for yourself by downloading this ZIP file, which contains the following files:
- meta_testbench.vhd — The VHDL testbench
- meta_rtl.vhd — The RTL version of the design
- meta_gate.vhd — The synthesized gate-level version of the design
- meta_gate.sdf — The delays associated with the gate-level version of the design
You can replicate the RTL simulation using the “meta_testbench.vhd” and “meta_rtl.vhd” files. Similarly, you can replicate the gate-level simulation using the “meta_testbench.vhd” and “meta_gate.vhd” files with the delays in the “meta_gate.sdf” file being applied to the “/uut/” region.
The RTL simulation of our two-stage synchronizer indicated that there weren’t any problems. However, the gate-level simulation did reveal a timing error (where the traces go red). It’s also interesting to note that even the gate-level simulation will not fully behave like the real-world synchronizer, because the flip-flop models do not contain the required information on recovery time. This means that the unknown “X” state will be propagated through the design and will affect any downstream elements in the design. It’s up to us as the engineers to handle this correctly.
Of course, saying things like “It’s up to us to handle this correctly” sounds good if you say it quickly and wave your arms around a lot, but how do we actually do this? Well, since the flip-flops in an FPGA are already fabricated, this means that — as FPGA designers — we have only two options (ASIC/SoC designers have a third option, because they can implement a special synchroniser flip-flop which has an acceptably high MTBF with regard to these metastable events).
The first approach is to disable all timing checks within the simulator, but this will hide other timing issues that really need to be investigated. The alternative, and often preferred, technique is to find the first register (“synch_reg(0)” in this case) in the SDF file and set its setup and hold times to 0. This is shown in the example below, where the red highlighted text is changed from the original settings to the updated values required for simulation.
Doing this will prevent the register from being able to experience the metastable event in simulation. This is only acceptable, however, if you have already analyzed your design and you are confident that your synchronizer has the required MTBF.
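If you would rather not edit the SDF file by hand, the change can be scripted. The Python sketch below is one possible approach under some loud assumptions: each timing check sits on its own line, the cell's `(INSTANCE ...)` line appears before its checks, and the instance name contains no parentheses. The function name `zero_setup_hold`, the instance name `synch_reg_0`, and the sample SDF fragment are all hypothetical; check how your own tool names the first synchronizer register, and work on a copy of the file.

```python
import re

INSTANCE_RE = re.compile(r"\(\s*INSTANCE\s+([^\s()]*)\s*\)")
CHECK_RE = re.compile(r"^\s*\(\s*(SETUP|HOLD)\b")
# The last bracketed group of numbers on the line, e.g. (0.12:0.15:0.18)
VALUE_RE = re.compile(r"\([-+\d.eE:\s]+\)(?=\s*\)\s*$)")


def zero_setup_hold(sdf_text, instance):
    """Set the SETUP and HOLD values of one instance to zero."""
    current = None
    result = []
    for line in sdf_text.splitlines():
        match = INSTANCE_RE.search(line)
        if match:
            current = match.group(1)  # remember which cell we are inside
        elif current == instance and CHECK_RE.match(line):
            line = VALUE_RE.sub("(0:0:0)", line)
        result.append(line)
    return "\n".join(result)


# Hypothetical SDF fragment, made up for illustration only.
SAMPLE = """(CELL
  (CELLTYPE "DFF")
  (INSTANCE synch_reg_0)
  (TIMINGCHECK
    (SETUP D (posedge CLK) (0.120:0.150:0.180))
    (HOLD D (posedge CLK) (0.050:0.060:0.070))
  )
)
(CELL
  (CELLTYPE "DFF")
  (INSTANCE synch_reg_1)
  (TIMINGCHECK
    (SETUP D (posedge CLK) (0.120:0.150:0.180))
  )
)"""

patched = zero_setup_hold(SAMPLE, "synch_reg_0")
print(patched)
```

Only the checks belonging to the named instance are touched, so the timing checks on every other register (including the second stage of the synchronizer) remain active and can still flag genuine problems.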