- Slide deck presented at ESC Minn 2016
- Slide deck presented at ESC Boston 2017
- Slide deck from my recent presentation at embedded world (EW_Slides)
The Finite State Machine (FSM) is one of the basic building blocks every FPGA designer should know and deploy often. However, over the years of implementing state machines in many different solution spaces (defense, aerospace, automotive, etc.) I have learnt a few tips and recommendations which I thought I would share. Following these has given me a better quality of results and helped me deliver working systems to my customers faster.
Develop the state machine as a single clocked process.
Perhaps the most controversial point is always using a single-process state machine. Of course, this is different from the state machine architecture we are taught when we first learn about them. Many, if not most, universities teach state machine implementation using two processes, one combinatorial and one sequential.
With a two-process state machine implementation, the main functionality is contained in a combinatorial process. If we fail to fully define all the combinatorial conditions within this process, synthesis will infer latches. The combinatorial process may also generate glitches on its outputs as the state and inputs change.
Debugging can also be harder, as the sensitivity list needs to be complete, including all of the signals used in the combinatorial process. Failure to include a signal in the sensitivity list will result in different behaviour between RTL simulation and the implementation, which can take a little time to find. This is alleviated somewhat if we are using VHDL-2008, where we can use the "all" keyword in the process declaration. Of course, to take advantage of this your tool chain needs to support VHDL-2008.
By contrast, a one-process state machine makes conditional definition much easier, makes it impossible to create latches, and prevents glitches because all output signals are registered.
Personally, I also find them a little easier to debug, as all the functionality is within one process.
Decouple functionality: only allow single-bit inputs and outputs to your state machine.
We are often tempted to bring large buses into our state machine and decode them within the main body. This is especially true when counters are used for the timing of events within the state machine, as shown in the code snippet below.
when wait_cnt =>
    if cnt >= std_logic_vector(to_unsigned(128, 16)) then
        current_state <= read_fifo;
        cnt <= (others => '0');
    else
        cnt <= cnt + 1;
    end if;
Implementing the state machine in this manner places additional logic within the state machine for both the comparator and the counter. This impacts the performance of the state machine within the implemented FPGA, as it requires more resources and consequently more routing.
A better method is to use an external process for the counter or other external functions, which passes a single control signal into the state machine. This leaves the bulk of the logic decoupled from the state machine. Taking the counter example, we can use an external counter process to generate a single pulse once the terminal count is reached, as shown below.
when wait_cnt =>
    if cnt_terminal = '1' then
        current_state <= read_fifo;
        cnt_reset <= '1';
    end if;
In the above code, a single-bit input from the counter process indicates that the terminal count has been reached, while the state machine asserts a single-bit output to reset the counter when this occurs. This uses less logic within the state machine, enabling better performance.
As this example shows, only single-bit signals should leave the state machine as well.
Address any unmapped states
Many times when we develop a state machine, the implementation does not use a power-of-two number of states, leaving several unmapped states. If the state machine enters an unmapped state, it will stall and become unresponsive, as there is no recovery mechanism. Depending upon the application, this can be merely inconvenient or lead to catastrophic consequences.
Transitions to unmapped states can occur for a variety of reasons, from a single event effect to an electrically noisy environment, or even vibration affecting device bond wires (before you ask: yes, I have seen this). It is therefore good practice to ensure there is a path from an unmapped state back into the main flow of the design. One of the simplest mechanisms is to cycle through the unused states following reset or power-up, before the state machine enters its idle state. This prevents the unmapped states from being optimised out during synthesis, while providing a simple recovery mechanism.
Consider the unexpected: what happens if control signals arrive late or early?
When considering the control flow of a state machine, it is important to consider what happens if a signal does not arrive when expected. While in an ideal world we would have well-defined interfaces for each module in our design, sadly this is not always the case. As such, we need to make sure the state machine does not hang in a state waiting for a signal which has already occurred and has therefore been missed.
Considering this ensures the state machine can handle worst-case conditions in operation.
This is incredibly important if your state machine contains structures like the one below, where a late signal can easily lead to the system becoming unresponsive.
when <state> =>
    if input_one = '1' then
        if input_two = '1' then
            current_state <= wait_fifo;
        elsif input_three = '1' then
            current_state <= idle;
        end if;
    end if;
Should input_one not be asserted when input_two or input_three occurs, the state machine will hang and become unresponsive. If such a structure is unavoidable (and I am sure you can avoid it with a little thought), a timer or other recovery function should be added to prevent the state machine from locking up, allowing a graceful recovery.
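Such a recovery function is easiest to reason about before writing the RTL. Below is a minimal behavioural sketch in plain Python (not RTL; the state names, inputs, and TIMEOUT value are all illustrative, not from the original design) showing a watchdog counter returning the machine to idle when the expected input never arrives:

```python
# Behavioural sketch (not RTL) of a watchdog guarding a wait state.
# State names and the TIMEOUT value are illustrative only.

TIMEOUT = 100  # cycles to wait before giving up


def step(state, watchdog, input_one, input_two):
    """Advance the state machine by one clock cycle.

    Returns the next (state, watchdog) pair.
    """
    if state == "wait_inputs":
        if input_one and input_two:
            return "wait_fifo", 0           # normal progression
        if watchdog >= TIMEOUT:
            return "idle", 0                # recovery: signal never arrived
        return "wait_inputs", watchdog + 1  # keep waiting, count cycles
    return state, 0
```

In the RTL equivalent, the watchdog is simply another counter, cleared on every state transition and checked alongside the normal exit conditions so the machine can always make its way back to idle.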
- For more on high-reliability state machines, read USING FPGA'S IN MISSION CRITICAL SYSTEMS
- For more on the basics of state machines, read How to implement a state machine in your FPGA
It has been a while since I last wrote about the Pynq, covering it in chronicles 155 to 161 just after its release.
Let us start with the computer vision overlay.
One of the first things I developed for my Pynq after receiving it was a simple object tracking application using OpenCV. This used a USB web camera and a simple OpenCV algorithm that detected the difference between a reference frame and frames received from the web camera.
Differences between the captured frame and reference frame above a certain threshold would then be identified on the captured frame as it was output over HDMI.
The original webcam-based algorithm itself is simple, performing the following steps:
- Capture a frame from the web cam and convert it to grey scale (cv2.cvtColor)
- Perform a Gaussian Blur on the frame (cv2.GaussianBlur)
- Calculate the difference between the blurred frame and the reference frame (cv2.absdiff)
- Create a binary image using a threshold operation (cv2.threshold)
- Dilate the differences on the binary image to make them more noticeable (cv2.dilate)
- Find the contours within the binary image (cv2.findContours)
- Ignore any contours with a small area (cv2.contourArea)
- Draw a box around each contour that is large enough (cv2.boundingRect & cv2.rectangle)
- Show captured frame with the boxes drawn (cv2.imshow)
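The core of the flow above can be sketched without OpenCV at all, using plain Python lists in place of image arrays. This is a simplified stand-in, not the notebook code: a 3×3 mean blur replaces cv2.GaussianBlur, and a single bounding box over all changed pixels replaces the contour steps.

```python
# Library-free sketch of the frame-differencing flow. Frames are 2D
# lists of grayscale values 0-255; in the real design these steps map
# to cv2.GaussianBlur, cv2.absdiff, cv2.threshold and the contour calls.

def mean_blur_3x3(frame):
    """Crude stand-in for the Gaussian blur step."""
    h, w = len(frame), len(frame[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [frame[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) // len(vals)
    return out


def detect_motion(reference, frame, threshold=25):
    """Return one bounding box (x, y, w, h) around changed pixels, or None."""
    blurred = mean_blur_3x3(frame)
    changed = [(x, y)
               for y, row in enumerate(blurred)
               for x, v in enumerate(row)
               if abs(v - reference[y][x]) > threshold]
    if not changed:
        return None
    xs = [x for x, _ in changed]
    ys = [y for _, y in changed]
    return (min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)
```

In the real application each of these loops is a single OpenCV call on a NumPy array, and the resulting boxes are drawn onto the output frame with cv2.rectangle.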
While the algorithm has fewer than 10 steps, each step requires several operations on an image array. As a result, the frame rate when running purely in software was very low. Knowing that programmable logic is ideal for implementing these functions and would provide significant acceleration, I intended to go back and accelerate the algorithm.
Sadly, I never got the time to do this. However, looking at the new computer vision overlay, which uses elements of the reVISION stack, I realised that I could probably accelerate this algorithm very quickly.
The new computer vision overlay provides the following image processing functions accelerated within the programmable logic.
- 2D filter (3×3) with a configurable kernel, allowing Gaussian blurs, Sobel (V+H), etc.
- Remapping
Within the Pynq PL these are implemented as shown in the diagram below.
To install the computer vision overlay, we use a PuTTY terminal connected to the Pynq to download and install the packages from GitHub.
In the PuTTY terminal use the following commands
$ sudo -H pip3.6 install --upgrade git+https://github.com/Xilinx/PYNQ-ComputerVision.git
$ sudo reboot now
Once installed we can proceed to updating the algorithm.
The computer vision overlay ideally uses the HDMI input and output for the best performance. To provide an accurate comparison against the previous OpenCV based example, my first step was to update that design to capture images using the HDMI input in place of the web camera.
I also modified the code to run for 200 frames such that I could time the execution and calculate the frames per second of both solutions.
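The measurement loop can be sketched as follows; measure_fps and the process_frame callback are illustrative names, not code from the original notebook:

```python
import time


def measure_fps(process_frame, n_frames=200):
    """Call the frame-processing function n_frames times and return
    the achieved frames per second."""
    start = time.perf_counter()
    for _ in range(n_frames):
        process_frame()
    elapsed = time.perf_counter() - start
    return n_frames / elapsed
```

In the notebook, process_frame would grab a frame from the HDMI input, run the detection steps, and write the annotated frame to the HDMI output, so the figure reflects the whole pipeline rather than a single step.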
This updated OpenCV design resulted in a frame rate of 4 frames per second when I ran it on my Pynq.
The next step was to update the algorithm to use the computer vision overlay. To do this, I used the 2D filter to perform the Gaussian blurring and the dilation operations.
Switching in these functions resulted in a significant increase in the frame rate, making the application usable.
Hints & Tips
- You can increase performance further by raising the minimum contour area that gets processed.
- Previous blogs on the Pynq are available here: P155, P156, P157, P158, P159, P160 & P161
- The Jupyter notebook is available on my GitHub
When it comes to creating our FPGA or SoC designs, it is inefficient and poor practice not to leverage IP cores and reuse other design elements when they are available.
Before we start developing our own IP core, we should of course first check whether such functions are available in the Vivado library or via open-source repositories, before we consider the need to create our own or purchase one from a third-party supplier.
Of course, creating our own IP core establishes a library we can use to reduce the design time and hence cost of future projects.
However, we often find ourselves implementing functions which consist of several IP cores connected in the same manner, for example an image processing pipeline.
Each time we wish to implement this function within our design, we need to add and connect all the necessary IP cores. Again, this is inefficient; what is needed is a method of reusing these functions each time we want to instantiate them, in this project and the next.
We can do this using hierarchical blocks.
Working with a block diagram design in Vivado we can create a reusable hierarchical block using the write_bd_tcl command. This command is one you may have used previously to output a TCL description of the block diagram so that it can be stored in a version control tool like Git.
Rather helpfully we can also use the same command to write out a description of a single hierarchical block within a block diagram. We can then use this TCL description to create multiple instances of the block at will across several projects.
So how do we do it?
Creating a hierarchical block within the block diagram is straightforward: simply right-click on the block diagram canvas and select Create Hierarchy. If the IP cores we wish to include in this new block already exist at the higher level, all we then have to do is drag and drop them into the new hierarchical block. If not, we can double-click on the new hierarchical block, which will open it in a new window, allowing us to add IP cores and connect them as desired.
Creation of the Hierarchical Block with the image processing chain
To create a TCL file description of this block we use the following command in the Vivado TCL console.
write_bd_tcl -force -hier_blks [get_bd_cells <hier block name>] <filename.tcl>
This command will write out a file which describes the hierarchical block's contents and their connections. If we wish to add the block to an existing or new design, we do this by loading the file into Vivado, again using the TCL console.
Once the TCL file has been loaded, in the TCL Console window you will see notification of a new procedure which can be called to create a new instance of the block in the project.
Calling this procedure results in a new hierarchical block being added to your design. For this example using the image processing chain above, I used the following command to add a second image processing block.
create_hier_cell_IPC / NEW_IPC
Examining the new block against the initial block demonstrates the contents are identical as we would expect.
We can now use this TCL file across several designs where we want to create an image processing chain, saving time. If we want to ensure it has the maximum reuse potential, we can use the -no_ip_version option of the write_bd_tcl command to prevent the IP version from being included within the file. This makes the script more portable between different versions of Vivado.
One final point: having created the TCL file, it is a good idea to enter it into a version control tool, as we would with any other design element.
- Ensure the project you are loading the script into can see all of the IP cores used in the script. Make sure you have all the repositories added in the Vivado project.
- If you change the module and overwrite the generated TCL description, you must reload it into Vivado and re-instantiate the block for the changes to take effect in your project.
- When you create the TCL description, make sure you know where the file will be created by running a pwd command first and, if necessary, setting the working path to a more convenient location.
- Simplify the interfacing of the block by using custom interface definitions within Vivado and the IP Packager.
One thing that is always important for engineers is the need to deliver our projects to quality, on schedule, and within budget. When it comes to developing embedded systems, there are a number of lessons learnt by embedded system developers over the years which can help ensure your embedded system achieves these goals. Let us explore some of the most important of these lessons.