Slides from my recent tutorial at FPGA Kongress
SDSoC Session 4
Using Vivado HLS we can, of course, accelerate the development of our data path. There are times, however, when using HLS that we want to interact with external memories such as DDR, either to store data or to retrieve data already written by another function.
Using HLS to interact with an external DDR at first sounds like it might be complicated. Nothing could be further from the truth, as I am about to demonstrate.
To do this I am going to use my Arty board, which contains both an Artix 35T FPGA and 256MB of DDR3L, ideal for the demonstration. What this demonstration will create is an HLS IP block which can be included within our Vivado design and interface with DDR. The functionality of this block will be simple: writing a pattern of numbers into the DDR memory.
To successfully use this block we will therefore need a Vivado design which contains the following
The idea behind this demonstration is that the HLS block will write the data to the DDR, and I will then be able to read back the values written into the DDR using the JTAG to AXI bridge.
To create the HLS block we use Vivado HLS and create a new project targeting the same Artix device as on the Arty. All we need to do then is create our source file and test bench, both of which will be very simple.
First the source file. To read or write from DDR memory we use the C function memcpy. When used in software it copies data from a source to a destination. When we use it in HLS it does the same; however, the movement of data is based around the AXI4 memory mapped interface.
Of course, this interface is ideal for use with the MIG and other memory interfaces. As such the code we use to create the example HLS block is very simple.
Looking in detail at the code, the function prototype includes an int pointer, ddr. This will be the interface we use to write out the data to the DDR memory. If necessary, we could also use this interface to read from the DDR as well.
To define this interface as an AXI4 master interface we use the pragma HLS INTERFACE with the type m_axi. The option port=ddr associates the port ddr with the AXI interface, while the depth option is used for co-simulation when pointers are used in place of arrays and should be set to the number of values written or read.
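The offset option controls the address space and the address at which the AXI interface begins its accesses. This can be provided by one of three methods: off, where no offset address is used; direct, where the offset is supplied on an input port of the block; or slave, where the offset is held in a register accessed over an AXI4-Lite interface.
For this example, we will be using the direct approach, as this allows us to show a more logic-based design as opposed to a solution containing an embedded processing system using AXI Lite.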
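As a rough illustration of the source file described above, a minimal sketch might look like the following. Only the int pointer ddr, the use of memcpy and the m_axi interface pragma with its port, depth and offset options come from the text. The function name ddr_write_pattern, the size of 256 values and the incrementing pattern are assumptions made for this example.

```c
#include <string.h>

#define N_WORDS 256  /* assumed number of values written; matches depth below */

/* Hypothetical top-level function for the HLS block.
 * ddr is the pointer that Vivado HLS maps onto an AXI4 master interface. */
void ddr_write_pattern(int *ddr)
{
#pragma HLS INTERFACE m_axi port=ddr depth=256 offset=direct

    int buff[N_WORDS];
    int i;

    /* Build the pattern locally (an incrementing count is assumed here). */
    for (i = 0; i < N_WORDS; i++) {
        buff[i] = i;
    }

    /* Copy the local buffer out through the m_axi pointer. */
    memcpy(ddr, buff, N_WORDS * sizeof(int));
}
```

A plain C compiler ignores the HLS pragma, so the same file can be compiled and run on the desktop during C simulation.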
When it comes to verifying the design, we need to create a simple test bench which allows us to perform C based simulation and Co-Simulation. The test bench for this example simply has a 256-bit array passed to the HLS function which it fills.
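A test bench along these lines could be sketched as below. The names ddr_write_pattern and run_testbench, the 256 values and the incrementing pattern are all assumptions for illustration. The small function body is only a software stand-in so the sketch compiles on its own; in a real project the test bench would call the HLS top-level function instead.

```c
#include <stdio.h>

#define N_WORDS 256  /* assumed number of values the block writes */

/* Software stand-in for the block under test (hypothetical name). */
void ddr_write_pattern(int *ddr)
{
    int i;
    for (i = 0; i < N_WORDS; i++) {
        ddr[i] = i;
    }
}

/* Fills an array via the block and checks the result.
 * Returns 0 on pass and 1 on fail. */
int run_testbench(void)
{
    int mem[N_WORDS];
    int i;
    int errors = 0;

    /* Pre-load with a value the block never writes. */
    for (i = 0; i < N_WORDS; i++) {
        mem[i] = -1;
    }

    ddr_write_pattern(mem);

    for (i = 0; i < N_WORDS; i++) {
        if (mem[i] != i) {
            printf("Mismatch at %d: got %d\n", i, mem[i]);
            errors++;
        }
    }
    return errors ? 1 : 0;
}
```

In Vivado HLS the test bench main() would return this value, since a non-zero return from main() is what marks C simulation and co-simulation as failed.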
The C Simulation allows us to test out the functionality of the code before we perform the HLS synthesis and co-simulation. Just as in any C debugging, here we can use breakpoints and examine the contents of variables.
With this completed the next step is to perform Synthesis and Co-Simulation. When you run Co-Simulation, ensure you set the dump trace option to port. This allows us to see the input and output waveforms of the HLS core.
The Co-Simulation applies the same C test bench inputs to the generated RTL and reports back on the pass / fail of the simulation.
Being happy with the co-simulation results the final step in Vivado HLS is to generate the IP core, and add it into our Vivado IP repository allowing its use.
Once our HLS core has been added into the Vivado design it is here that we can set the address the HLS core will use for its transactions. To do this I used a constant block.
With this completed we can then generate the bit stream, and once completed open the hardware manager and program the device.
Thanks to the VIO we can control the ap_start signal on the HLS core. This means that before we start we can read the DDR memory using the JTAG to AXI link and check that it is not already set to our test values.
Satisfied the DDR memory is randomly initialised we can then start the HLS block using the VIO and check that it has run correctly writing its data to the DDR memory.
We can verify the successful writing of data in two ways. The first is to use internal ILAs configured to trigger on AXI writes and completion of the HLS function.
The second is to use the JTAG to AXI bridge to read back the written addresses and confirm the values.
This demonstrates that the write to DDR memory is successful, which gives us more flexibility within our HLS designs.
Watch out for!
MicroZed Chronicles on GitHub
Want a Book
Slide deck presented at ESC Minn 2016
Slide deck presented at ESC Boston 2017
Slide Deck from my recent presentation at embedded world EW_Slides
Finite State Machines (FSMs) are one of the basic building blocks every FPGA designer should know and deploy often. However, over the years of implementing state machines in many different solution spaces (defense, aerospace, automotive etc.) I have learnt a few tips and recommendations which I thought I would share. Following these has provided me with a better quality of results and helped me deliver working systems to my customers faster.
Perhaps the most controversial point is always using a single process state machine. Of course, this is different to the state machine architecture we are taught when we first learn about them. Many, if not most universities teach state machine implementation using two processes, one combinatorial and another sequential.
With a two-process state machine implementation the main functionality will be contained in a combinatorial process. If we fail to fully define all the combinatorial conditions within this process we will implement latches during synthesis. The combinatorial process may also generate glitches on its outputs as the state and inputs change.
Debugging can also be harder as the sensitivity list needs to be complete, including all of the signals used in the combinatorial process. Failure to include a signal in the sensitivity list will result in different behaviour between RTL simulation and the implementation, which can take a little time to find. This is alleviated somewhat if we are using VHDL 2008, where we can use the "all" keyword in the process sensitivity list. Of course, to take advantage of this your tool chain needs to support VHDL 2008.
By contrast, a one-process state machine makes conditional definitions much easier to write and removes the possibility of latches being created, while also preventing glitches as all output signals are registered.
Personally I also find them a little easier to debug as all the functionality is within one process.
We are often tempted to take large buses into our state machine and decode these within the main body. This is especially true when counters are used for the timing of events within the state machine, as shown in the code snippet below:
when wait_cnt =>
    if cnt >= std_logic_vector(to_unsigned(128,16)) then
        current_state <= read_fifo;
        cnt <= (others => '0');
    else
        cnt <= cnt + 1;
    end if;
Implementing the state machine in this manner includes additional logic within the state machine for both the comparator and the counter. This will impact the performance of the state machine within the implemented FPGA as it requires more resources and consequently more routing.
A better method is to use an external process for the counter or other external functions which pass in a single control signal. This leaves the bulk of the logic decoupled from the state machine. Taking the counter example, we can use an external counter process to generate a single pulse once the terminal count is achieved as shown below.
when wait_cnt =>
    if cnt_terminal = '1' then
        current_state <= read_fifo;
        cnt_reset <= '1';
    end if;
In the above code a single bit input is provided from the counter process to indicate the terminal count has been reached, while the state machine asserts a single bit output to reset the counter when this occurs. This uses less logic within the state machine, enabling better performance.
As this example shows, the signals leaving the state machine should likewise be single-bit control signals.
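To see how the two processes interact cycle by cycle, the arrangement can be mimicked with a small behavioural model in C. This is only a sketch for reasoning about the handshake and is not synthesizable RTL. The struct layout, the clock_edge function and the exact counter behaviour (holding at the terminal count until reset) are assumptions; the state and signal names follow the snippet above.

```c
#include <stdint.h>

#define TERMINAL_COUNT 128u  /* same terminal value as the earlier snippet */

enum state { WAIT_CNT, READ_FIFO };

/* All registered signals in the model. */
struct regs {
    enum state current_state;
    uint16_t   cnt;
    int        cnt_terminal;  /* counter -> state machine, single bit */
    int        cnt_reset;     /* state machine -> counter, single bit */
};

/* One rising clock edge. Next values are computed only from the current
 * values in *r, mimicking signal assignment in clocked VHDL processes. */
void clock_edge(struct regs *r)
{
    struct regs next = *r;

    /* Counter process: owns the wide comparison and the increment. */
    if (r->cnt_reset) {
        next.cnt = 0;
        next.cnt_terminal = 0;
    } else if (r->cnt == TERMINAL_COUNT) {
        next.cnt_terminal = 1;  /* hold here until the state machine resets us */
    } else {
        next.cnt = (uint16_t)(r->cnt + 1u);
        next.cnt_terminal = 0;
    }

    /* State machine process: only sees and drives single-bit signals. */
    next.cnt_reset = 0;
    switch (r->current_state) {
    case WAIT_CNT:
        if (r->cnt_terminal) {
            next.current_state = READ_FIFO;
            next.cnt_reset = 1;
        }
        break;
    case READ_FIFO:
    default:
        break;  /* further transitions omitted in this sketch */
    }

    *r = next;
}
```

Because every signal is registered, the state machine reacts one clock after the counter raises cnt_terminal, and the counter clears one clock after cnt_reset is asserted. That extra latency is the price of keeping the wide compare out of the state machine.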
Many times, when we develop a state machine, the implementation does not use a power-of-two number of states, leaving several unmapped states. If the state machine enters an unmapped state it will stall and become unresponsive, as there is no recovery mechanism. Depending upon the application this can be merely inconvenient or lead to catastrophic consequences.
Transitions to unmapped states can occur for a variety of reasons, from a single event effect to an electrically noisy environment or even vibration affecting device bond wires (before you ask, yes, I have seen this). It is therefore good practice to ensure there is a path from an unmapped state back into the main flow of the design. One of the simplest mechanisms to do this is to cycle through the unused states following the reset or power up before the state machine enters its idle state. This prevents the unmapped states from being optimised out during synthesis, while providing a simple recovery mechanism.
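The idea of cycling through the unused states can be illustrated with a small next-state function modelled in C. This is a behavioural sketch only. The 3-bit encoding, the five used state names and the three spare encodings are all assumptions invented for the example.

```c
/* 3-bit state encoding: five states used, three encodings spare. */
enum state {
    IDLE   = 0,
    LOAD   = 1,
    RUN    = 2,
    STORE  = 3,
    DONE   = 4,
    SPARE5 = 5,
    SPARE6 = 6,
    SPARE7 = 7
};

/* The state machine leaves reset in the first spare state. */
#define RESET_STATE SPARE5

/* Next-state function. raw is the value held in the state register,
 * which may be any 3-bit pattern if the register has been upset. */
enum state next_state(unsigned raw)
{
    switch (raw & 0x7u) {
    case IDLE:   return LOAD;
    case LOAD:   return RUN;
    case RUN:    return STORE;
    case STORE:  return DONE;
    case DONE:   return IDLE;
    /* Spare encodings walk forward into IDLE, so every encoding has an exit. */
    case SPARE5: return SPARE6;
    case SPARE6: return SPARE7;
    case SPARE7:
    default:     return IDLE;
    }
}
```

Because the reset path genuinely passes through SPARE5, SPARE6 and SPARE7, those states are reachable in normal operation, and any upset that lands in one of them is back in IDLE within three clocks.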
When considering the control flow of a state machine it is important to consider what happens if a signal does not arrive when expected. While in an ideal world we would have clear interface definitions for each module in our design, sadly this is not always the case. As such we need to make sure the state machine does not hang in a state waiting for a signal which has already occurred and has therefore been missed.
Considering this ensures the state machine can handle the worst-case conditions in operation.
This is incredibly important if your state machine contains structures like below, where a late signal can easily lead to the system becoming unresponsive.
when <state> =>
    if input_one = '1' then
        if input_two = '1' then
            current_state <= wait_fifo;
        elsif input_three = '1' then
            current_state <= idle;
        end if;
    end if;
Should input one not be asserted when input two or three occurs, the state machine will hang and become unresponsive. If such a structure is unavoidable, and I am sure you can avoid it with a little thought, a timer or other recovery function should be added to prevent the state machine locking up, allowing a graceful recovery.
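A timer-based recovery of this kind can be sketched as a behavioural C model of the state above. It is not synthesizable RTL, just a way to check the behaviour. The state name WAIT_INPUTS, the struct, the function fsm_clock_edge and the TIMEOUT_CYCLES value of 1000 are assumptions for illustration; the input names and the wait_fifo and idle transitions follow the snippet.

```c
#define TIMEOUT_CYCLES 1000u  /* assumed limit on how long we wait */

enum state { IDLE, WAIT_INPUTS, WAIT_FIFO };

struct fsm {
    enum state current_state;
    unsigned   timeout_cnt;  /* clocks spent waiting in WAIT_INPUTS */
};

/* One rising clock edge of the state machine. */
void fsm_clock_edge(struct fsm *f, int input_one, int input_two, int input_three)
{
    switch (f->current_state) {
    case WAIT_INPUTS:
        if (input_one && input_two) {
            f->current_state = WAIT_FIFO;
            f->timeout_cnt = 0;
        } else if (input_one && input_three) {
            f->current_state = IDLE;
            f->timeout_cnt = 0;
        } else if (f->timeout_cnt >= TIMEOUT_CYCLES) {
            /* Recovery path: the expected inputs never lined up. */
            f->current_state = IDLE;
            f->timeout_cnt = 0;
        } else {
            f->timeout_cnt++;
        }
        break;
    case IDLE:
    case WAIT_FIFO:
    default:
        f->timeout_cnt = 0;  /* other transitions omitted in this sketch */
        break;
    }
}
```

In line with the earlier tip on decoupling, the timeout counter in a real design could equally live in its own process and pass a single timeout bit into the state machine.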
For more on the basics of state machines read How to implement a state machine in your FPGA
It has been a while since I last wrote about the Pynq, covering it in chronicles 155 to 161 just after its release.
Starting with the computer vision overlay.
One of the things I developed for my Pynq quickly after I received it was a simple object tracking application using OpenCV. This used a USB web camera and a simple OpenCV algorithm that detected the difference between a reference frame and frames received from the web camera.
Differences between the captured frame and reference frame above a certain threshold would then be identified on the captured frame as it was output over HDMI.
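The heart of this comparison, flagging pixels whose difference from the reference exceeds a threshold, can be sketched in plain C as below. This is not the OpenCV code from the original application. The function name diff_threshold, the 8-bit grayscale buffers and the 0/255 mask convention are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Compares a captured grayscale frame against a reference frame.
 * mask[i] is set to 255 where the absolute difference exceeds threshold,
 * otherwise 0. Returns the number of pixels flagged. */
size_t diff_threshold(const uint8_t *ref, const uint8_t *frame,
                      uint8_t *mask, size_t n_pixels, uint8_t threshold)
{
    size_t i;
    size_t flagged = 0;

    for (i = 0; i < n_pixels; i++) {
        int d = (int)frame[i] - (int)ref[i];
        if (d < 0) {
            d = -d;
        }
        if (d > threshold) {
            mask[i] = 255;
            flagged++;
        } else {
            mask[i] = 0;
        }
    }
    return flagged;
}
```

Every pixel is visited once per operation, so running several such passes over each frame in software quickly becomes expensive. The same per-pixel structure is what makes these operations a good fit for programmable logic.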
The original webcam-based algorithm itself is simple, performing the following steps
While the algorithm has less than 10 steps, each step requires several operations on an image array. As a result, the frame rate when running just in SW was very low. Knowing that programmable logic is ideal for implementing these functions and would provide significant acceleration, I intended to go back and accelerate the algorithm.
Sadly, I never got the time to do this. However, looking at the new computer vision overlay which uses elements of the reVision stack I realised that I could probably very quickly accelerate this algorithm.
The new computer vision overlay provides the following image processing functions accelerated within the programmable logic.
Within the Pynq PL these are implemented as shown in the diagram below.
To install the computer vision overlay we use a PuTTY terminal connected to the Pynq to download and install the packages from GitHub.
In the PuTTY terminal use the following commands
$ sudo -H pip3.6 install --upgrade git+https://github.com/Xilinx/PYNQ-ComputerVision.git
$ sudo reboot now
Once installed we can proceed to updating the algorithm.
The computer vision overlay ideally uses the HDMI input and output for the best performance. To provide an accurate comparison against the previous OpenCV based example, my first step was to update that design to capture images using the HDMI input in place of the web camera.
I also modified the code to run for 200 frames such that I could time the execution and calculate the frames per second of both solutions.
This updated OpenCV design resulted in a frame rate of 4 frames per second when I ran it on my Pynq.
The next step was to update the algorithm to use the computer vision overlay. To do this I used the 2D filter to perform the Gaussian Blurring and the Dilate operations.
Switching in these functions resulted in a significant increase in the frame rate, making the application usable.
Hints & Tips
When it comes to creating our FPGA or SoC designs, it is inefficient and poor practice not to leverage IP cores and reuse other design elements where they are available.
Before we start developing our own IP core, we should of course first check whether such functions are available in the Vivado library or via open source repositories. Only then should we consider creating our own or purchasing one from a third-party supplier.
Of course, creating our own IP core establishes a library we can use to reduce the design time and hence cost of future projects.
However, often in our designs we find ourselves implementing functions which have several IP cores connected in the same manner, for example an image processing pipeline.
Each time we wish to implement this function within our design we need to add in and connect all the necessary IP cores. Again, this is inefficient. What is needed is a method of reusing these functions each time we want to instantiate them, in this project and the next.
We can do this using hierarchical blocks.
Working with a block diagram design in Vivado we can create a reusable hierarchical block using the write_bd_tcl command. This command is one you may have used previously to output a TCL description of the block diagram so that it can be stored in a version control tool like Git.
Rather helpfully we can also use the same command to write out a description of a single hierarchical block within a block diagram. We can then use this TCL description to create multiple instances of the block at will across several projects.
So how do we do it?
Creating a hierarchical block within the block diagram is straightforward: simply right click on the block diagram canvas and select create hierarchy. If the IP cores we wish to include in this new block already exist at the higher level, all we then have to do is drag and drop them into the new hierarchical block. If not, we can double click on the new hierarchical block, which will open the block in a new window, allowing us to add IP cores and connect them as desired.
Creation of the Hierarchical Block with the image processing chain
To create a TCL file description of this block we use the following command in the Vivado TCL console.
write_bd_tcl -force -hier_blks [get_bd_cells <hier block name>] <filename.tcl>
This command will write out a file which describes the hierarchical block contents and their connections. If we wish to add the block to an existing or new design, we do this by loading the file into Vivado with the source command, again from the TCL Console.
Once the TCL file has been loaded, in the TCL Console window you will see notification of a new procedure which can be called to create a new instance of the block in the project.
Calling this procedure results in a new hierarchical block being added to your design. For this example using the image processing chain above, I used the following command to add a second image processing block.
create_hier_cell_IPC / NEW_IPC
Examining the new block against the initial block demonstrates the contents are identical as we would expect.
We can now use this created TCL file across several designs where we want to create an image processing chain, saving time. If we want to ensure it has the maximum reuse potential, we can use the -no_ip_version option in the write_bd_tcl command to prevent the IP version from being included within the file. This makes the script more portable across different versions of Vivado.
One final point: having created the TCL file, it is a good idea to enter it into a version control tool as we would with any other design element.
One thing that is always important for engineers is the need to deliver our projects on quality, schedule and budget. When it comes to developing embedded systems there are a number of lessons, learnt by embedded system developers over the years, which can be used to ensure your embedded system achieves these. Let us explore some of the most important of these lessons.