MicroZed Chronicles: Writing RTL for Timing Closure

May 29, 2024 · 5 min read

Last week, we looked at how to create a timing baseline for a design, which should help us achieve timing closure. Of course, one of the key elements in achieving timing closure is writing good code that exploits both the device's architectural features and the implementation tools' capabilities.

Let's take a look at several aspects that can help us provide a better quality of code.

Architecture – Plan the architecture from day one, and try to leverage IP cores wherever possible to reduce the amount of development needed. When it comes to hierarchy, there are several considerations:

Keep Input and Output Structures at the top level. This allows us to make changes easily and port the design to new architectures more easily, e.g., from 7 Series to UltraScale.
Keep Clocking structures at the top level. I like to implement them within a standalone module.
Register the inputs and outputs of all modules. It is easier to address timing issues on paths contained within a module than on paths that span modules. Another benefit of registering I/O is that it prevents logic from moving between modules during optimization, which can significantly aid debugging.
Wherever possible, use standard interfaces such as AXI, AXI Lite, and AXI Stream.

Resets – The question of whether or not to use a reset is important in the AMD ecosystem. Whenever possible, we should leverage the Global Set Reset, which is applied at the end of configuration. This means that every register is in a known state at the end of startup; unless a default (initial) value is specified in the HDL, this will mostly be logic zero. If we use resets in our design, we might limit the ability of the synthesis engine to map to certain structures, and hence impact our design's timing performance. Whether a reset is required depends on each design; typically, control path logic may require a reset while data path logic does not.

If a reset is required, avoid asynchronous resets, as they can have a significant impact on the timing performance and utilization of the design. We have examined this subject in detail before, using the DSP48E2 as an example. The difference between the synchronous and asynchronous implementations is significant, as the linked blog shows.

Clock Enables – Clock enables can be very useful in our designs; we can use them to reduce power across the design or as part of our functionality, for example, generating an I2C or SPI output which runs at a lower clock rate.
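As a rough behavioural illustration of the second use case, the Python sketch below models a clock enable that is asserted for one system clock cycle out of every N, so downstream logic stays on the main clock but only advances at the slower rate. The function name and the divide ratio are hypothetical and chosen only for this example; it is a software model, not RTL.

```python
def enable_ticks(divide_by, cycles):
    """Yield one clock-enable value per system clock cycle.

    The enable is True for a single cycle once every `divide_by` cycles,
    mimicking a counter-based clock-enable generator.
    """
    count = 0
    for _ in range(cycles):
        tick = count == divide_by - 1      # terminal count reached
        count = 0 if tick else count + 1   # wrap or keep counting
        yield tick


# Hypothetical example: enable logic once every 4 system clock cycles.
ticks = list(enable_ticks(divide_by=4, cycles=8))
print(ticks)
```

Logic gated by this enable updates at a quarter of the system clock rate without introducing a second clock domain.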

However, a large number of clock enables can contribute significantly to the number of control sets in a design. If you are not familiar with control sets, a control set is the unique combination of set/reset, clock enable, and clock signals that drives an SRL, LUTRAM, or register.

One method that helps with this is control set optimization. When the synthesis tool identifies a synchronous reset/set or enable, it examines the number of loads that signal drives.

If the number of loads is below the threshold set by the -control_set_opt_threshold option, the synthesis engine implements the reset/set or enable using logic gates on the data path instead of using the dedicated register inputs.

The default setting for this threshold is 4 for 7 series devices and 2 for UltraScale devices.
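To make the idea concrete, here is a minimal Python sketch that counts control sets for a made-up list of registers and then folds low-fanout enables into the data path. The register list, signal names, and the choice to treat a fanout equal to the threshold as "low" are all assumptions for illustration; this is a toy model and not how the synthesis engine is implemented.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Reg(NamedTuple):
    """One register and the control signals it uses (None means unused)."""
    clk: str
    ce: Optional[str] = None
    sr: Optional[str] = None


def count_control_sets(regs):
    """A control set is one unique (clock, enable, set/reset) combination."""
    return len({(r.clk, r.ce, r.sr) for r in regs})


def fold_low_fanout_enables(regs, threshold):
    """Remove enables whose fanout is <= threshold from the CE pin.

    In hardware the enable would instead be merged into the D-input logic;
    here we only model its effect on the control set count.
    """
    fanout = Counter(r.ce for r in regs if r.ce is not None)
    return [
        r._replace(ce=None) if r.ce is not None and fanout[r.ce] <= threshold else r
        for r in regs
    ]


# Hypothetical design: one wide enable plus eight enables driving two registers each.
regs = [Reg("clk", ce="en_wide") for _ in range(32)]
regs += [Reg("clk", ce=f"en_{i}") for i in range(8) for _ in range(2)]

before = count_control_sets(regs)
after = count_control_sets(fold_low_fanout_enables(regs, threshold=4))
print(before, after)
```

In this toy case the eight small enables collapse into the "no enable" control set, while the wide enable keeps its dedicated pin because its fanout is well above the threshold.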

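If we want explicit control over whether a signal uses the dedicated register pins or is implemented in the data path, we can use the attributes below.

Direct Enable / Direct Reset (DIRECT_ENABLE, DIRECT_RESET) force the synthesis tool to use the dedicated register enable or reset pin for the marked signal.
Extract Enable / Extract Reset (EXTRACT_ENABLE, EXTRACT_RESET) control whether an enable or reset is extracted onto the register pins; setting them to "no" forces the synthesis tool to implement the enable or reset in the data path.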

Pipelining – Pipelining allows us to restructure data paths that have several levels of logic. The idea is that we introduce registers between logic levels, which enables a faster clock frequency at the cost of increased latency. If we consider pipelining from day one of the design, we can leverage the synthesis tool's ability to move registers across logic to balance the path delays; this is called retiming.

One approach is to add several additional registers within a module, either before or after the logic where we think retiming will be needed. The synthesis engine is then able to implement retiming as it sees fit. If we want more direct control, we can use forward retiming attributes if the registers are placed at the front, or backward retiming attributes if the registers are placed at the end. This saves us from having to break up the logic levels ourselves and gives the synthesis tool maximum flexibility in where it places the registers.
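The trade-off between clock frequency and latency can be sketched with some simple arithmetic. The Python example below takes a chain of logic levels with made-up delays and finds the best achievable worst-stage delay when registers cut the chain into a given number of stages, which is loosely what retiming aims for on a linear path. The delay values are hypothetical, and register clock-to-out, setup, and routing delays are deliberately ignored.

```python
def worst_stage_delay(level_delays_ps, stages):
    """Smallest achievable worst-stage delay when a chain of logic levels is
    cut into at most `stages` contiguous groups by registers.

    Brute-force search, fine for the short chains used in this sketch.
    """
    n = len(level_delays_ps)
    if stages <= 1 or n <= 1:
        return sum(level_delays_ps)
    best = float("inf")
    # Try every position for the first register, then recurse on the remainder.
    for cut in range(1, n):
        head = sum(level_delays_ps[:cut])
        tail = worst_stage_delay(level_delays_ps[cut:], stages - 1)
        best = min(best, max(head, tail))
    return best


# Hypothetical per-level delays in picoseconds for a six-level path.
level_delays_ps = [400, 900, 300, 700, 600, 500]

for stages in (1, 2, 3):
    period_ps = worst_stage_delay(level_delays_ps, stages)
    print(f"{stages} stage(s): min period {period_ps} ps, "
          f"about {1e6 / period_ps:.0f} MHz, "
          f"{stages - 1} extra cycle(s) of latency")
```

Each added stage shortens the longest register-to-register path, but the gain depends on how evenly the logic can be divided, which is why leaving register placement to the tool is often worthwhile.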

If desired, pipelining can be automated and left to the discretion of the synthesis tool, by the use of the auto-pipeline attribute. This is an interesting approach, and I will write an in-depth blog on this alone in the coming weeks.

Block RAMs – BRAMs are crucial within our designs; whenever possible, I prefer to infer BRAMs rather than instantiate them. When it comes to registers, the BRAM output should be followed by at least two output registers. This enables the synthesis tool to implement one register inside the BRAM and the other in the fabric, at the cost of increasing the overall read latency to three clock cycles. The fabric register has the advantage of providing faster clock-to-out timing than the BRAM's internal output register. When it comes to input registers, it can be beneficial to drive the BRAM inputs from a register stage, especially if several BRAM resources spread across the die are combined to create a larger memory; the extra register can be very beneficial for timing closure.
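The three-cycle read latency can be checked with a small behavioural model. The Python sketch below treats the memory as a synchronous read followed by two output registers; the class name and memory contents are hypothetical, and the model says nothing about which register ends up inside the BRAM and which in the fabric.

```python
class PipelinedBram:
    """Behavioural model: synchronous read followed by `out_regs` output registers."""

    def __init__(self, contents, out_regs=2):
        self.mem = list(contents)
        # Stage 0 models the BRAM's own synchronous read; the rest are output registers.
        self.pipe = [None] * (1 + out_regs)

    def clock(self, addr):
        """Apply one rising clock edge with `addr` on the address input.

        Returns the value visible on the final output after the edge.
        """
        self.pipe = [self.mem[addr]] + self.pipe[:-1]
        return self.pipe[-1]


# Hypothetical memory contents; read addresses 0..3, then hold the last address.
bram = PipelinedBram(contents=[10, 11, 12, 13], out_regs=2)
outputs = [bram.clock(addr) for addr in (0, 1, 2, 3, 3, 3)]
print(outputs)
```

The data for address 0 only appears after the third clock edge, so any control logic that consumes the read data needs to account for that delay.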

We will look in-depth at some of these points in the coming weeks in individual blogs, but for now, I hope they help you with your RTL and timing closure.

