Tag Archives: fpga

MicroZed Chronicles – Vivado HLS & DDR Access


Using Vivado HLS we can, of course, accelerate the development of our data path. There are times, however, when using HLS that we want to interact with external memories such as DDR, either to store data or to retrieve data already written by another function.

Using HLS to interact with an external DDR at first sounds like it might be complicated. Nothing could be further from the truth, as I am about to demonstrate.

To do this I am going to use my Arty board, which contains both an Artix 35T FPGA and 256MB of DDR3L, making it ideal for the demonstration. What this demonstration will create is an HLS IP block which can be included within our Vivado design and interface with DDR. The functionality of this block will be simple: writing a pattern of numbers into the DDR memory.

To successfully use this block we will therefore need a Vivado design which contains the following:

  • Memory Interface Generator configured for the DDR3L
  • JTAG to AXI to allow debug verification access to the DDR3L
  • AXI interconnect to allow the HLS block and the JTAG to AXI to access the MIG
  • VIO to start the HLS block
  • ILA to verify / debug internally the design
  • The HLS test block itself

The idea behind this demonstration is that the HLS block will write the data to the DDR and I will then be able to read the values written into the DDR using the JTAG to AXI.

To create the HLS block we use Vivado HLS and create a new project targeting the same Artix device as on the Arty. All we need to do then is create our source file and test bench, both of which will be very simple.

First, the source file. To read from or write to DDR memory we use the C function memcpy. When used in software it copies data from a source to a destination. When we use it in HLS it does the same; however, the movement of data is based around the AXI4 memory mapped interface.

Of course, this interface is ideal for use with the MIG and other memory interfaces. As such the code we use to create the example HLS block is very simple.

Looking in detail at the code, the function prototype includes an int pointer ddr. This will be the interface we use to write out the data to the DDR memory. If necessary, we could also use this interface to read from the DDR as well.

To define this interface as an AXI4 master interface we use the pragma HLS INTERFACE with the type m_axi, where port=ddr associates the port ddr with the AXI interface. The depth option is used for co-simulation when pointers are used in place of arrays, and should be set to the number of values written or read.

The offset is how we control the address space and the addresses the AXI interface begins to access. This can be provided by one of three methods.

  1. Off – the default, it will start accesses from 0x00000000.
  2. Direct – this will create a port on the HLS module which allows the offset definition to be applied from within the Vivado block design.
  3. Slave – this will create an AXI4-Lite interface with a register which defines the offset.

For this example, we will be using the direct approach, as this allows us to show a more logic-based design as opposed to a solution containing an embedded processing system using AXI Lite.
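As a minimal sketch of what such a source file might look like, the function below fills a local buffer with a counting pattern and copies it out with memcpy. The function name ddr_writer, the 256-word length and the counting pattern are assumptions for illustration and are not taken from the original project. The pragma uses the m_axi, depth and offset=direct options described above.

```c
#include <string.h>

#define N_WORDS 256  /* assumed transfer length, not taken from the original post */

/* Hypothetical top-level function; the name ddr_writer is illustrative only. */
void ddr_writer(int *ddr)
{
/* Map the pointer to an AXI4 master port; depth is only used by co-simulation. */
#pragma HLS INTERFACE m_axi port=ddr depth=256 offset=direct
    int buffer[N_WORDS];
    int i;

    /* Build the test pattern in a local buffer first. */
    for (i = 0; i < N_WORDS; i++) {
        buffer[i] = i;
    }

    /* Copy the pattern out through the ddr pointer in one call. */
    memcpy(ddr, buffer, N_WORDS * sizeof(int));
}
```

A standard C compiler ignores the HLS pragma, which is what allows the same file to be compiled for C simulation on the development machine.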

When it comes to verifying the design, we need to create a simple test bench which allows us to perform C based simulation and Co-Simulation. The test bench for this example simply has a 256-bit array passed to the HLS function which it fills.
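A test bench along these lines could be sketched as follows. It assumes the HLS function writes a counting pattern of 256 words; the names ddr_writer and check_ddr_writer, the length and the pattern are all hypothetical. A small stand-in for the HLS function is included only so the sketch compiles on its own; in a real project it would live in the HLS source file.

```c
#include <stdio.h>
#include <string.h>

#define N_WORDS 256  /* assumed number of words written by the block */

/* Stand-in for the HLS function so this sketch compiles alone (assumption). */
static void ddr_writer(int *ddr)
{
    int i;
    for (i = 0; i < N_WORDS; i++) {
        ddr[i] = i;
    }
}

/* Runs the function against an array standing in for the DDR.
   Returns 0 on pass, otherwise the number of mismatching words. */
int check_ddr_writer(void)
{
    int ddr_model[N_WORDS];
    int errors = 0;
    int i;

    /* Pre-fill with a non-pattern value so stale data cannot pass the check. */
    memset(ddr_model, 0xFF, sizeof(ddr_model));

    ddr_writer(ddr_model);

    for (i = 0; i < N_WORDS; i++) {
        if (ddr_model[i] != i) {
            printf("Mismatch at word %d: got %d\n", i, ddr_model[i]);
            errors++;
        }
    }
    return errors;
}
```

In the test bench, main() would simply return the result of check_ddr_writer(), since a non-zero return value from main() is how C simulation and co-simulation report a failure.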

The C Simulation allows us to test out the functionality of the code before we perform the HLS synthesis and co-simulation. Just like in any C debugging here we can use break points and examine the contents of variables.

With this completed the next step is to perform Synthesis and Co-Simulation. When you run Co-Simulation ensure you set the dump trace to port. This allows us to see the input and output waveforms of the HLS core.

The Co-Simulation applies the same C test bench inputs to the generated RTL and reports back on the pass / fail of the simulation.

Being happy with the co-simulation results the final step in Vivado HLS is to generate the IP core, and add it into our Vivado IP repository allowing its use.

Once our HLS core has been added into the Vivado design it is here that we can set the address the HLS core will use for its transactions. To do this I used a constant block.

With this completed we can then generate the bit stream, and once completed open the hardware manager and program the device.

Thanks to the VIO we can control the ap_start signal on the HLS core. This means that before we start we can read the DDR memory using the JTAG to AXI link and check that it is not set to our test values.

Satisfied the DDR memory is randomly initialised we can then start the HLS block using the VIO and check that it has run correctly writing its data to the DDR memory.

We can verify the successful writing of data in two ways. The first is to use internal ILAs configured to trigger on AXI writes and completion of the HLS function.

The second is to use the JTAG to AXI bridge to read back the written addresses and confirm the values.

This demonstrates the write to DDR memory is successful, which gives us more flexibility within our HLS designs.

Watch out for!

  1. Ensure the HLS block is reset correctly.
  2. Ensure the ap_start signal is held high until ap_ready is asserted.
  3. Holding ap_start high will result in the core running again.
  4. Ensure you have the address range set correctly for your memory access in the hardware.

Example code

MicroZed Chronicles on GitHub 

Want a Book 

Year One  

Year Two

Image Processing with Xilinx Devices 


MicroZed Chronicles – Finite State Machines Tips


Finite State Machines (FSMs) are one of the basic building blocks every FPGA designer should know and deploy often. However, over the years of implementing state machines in many different solution spaces (defense, aerospace, automotive etc.) I have learnt a few tips and recommendations which I thought I would share. Following these has provided me with a better quality of results and helped me deliver working systems to my customers faster.


Develop the state machine as a single clocked process.

Perhaps the most controversial point is always using a single process state machine. Of course, this is different to the state machine architecture we are taught when we first learn about them. Many, if not most universities teach state machine implementation using two processes, one combinatorial and another sequential.

With a two-process state machine implementation the main functionality will be contained in a combinatorial process. If we fail to fully define all the combinatorial conditions within this process we will implement latches during synthesis. The combinatorial process may also generate glitches on its outputs as the state and inputs change.

Debugging can also be harder as the sensitivity list needs to be complete, including all of the signals used in the combinatorial process. Failure to include a signal in the sensitivity list will result in different behaviour between RTL simulation and the implementation, which can take a little time to find. This is alleviated somewhat if we are using VHDL 2008, where we can use the “all” keyword in the process sensitivity list. Of course, to take advantage of this your tool chain needs to support VHDL 2008.

By contrast, a one process state machine makes conditional definitions much easier to use and removes the ability for latches to be created. It also prevents glitches, as all output signals are registered.

Personally I also find them a little easier to debug as all the functionality is within one process.

Decouple functionality, only allow single bit input and outputs to your state machine.

We are often tempted to take large buses into our state machine and decode these within the main body. This is especially true when counters are used for the timing of events within the state machine, for example, as shown in the code snippet below.

when wait_cnt =>
    if cnt >= std_logic_vector(to_unsigned(128,16)) then
       current_state <= read_fifo;
       cnt <= (others => '0');
    else
       cnt <= cnt + 1;
    end if;

Implementing the state machine in this manner includes additional logic within the state machine for both the comparator and the counter. This will impact the performance of the state machine within the implemented FPGA as it requires more resources and consequently more routing.

A better method is to use an external process for the counter or other external functions which pass in a single control signal. This leaves the bulk of the logic decoupled from the state machine. Taking the counter example, we can use an external counter process to generate a single pulse once the terminal count is achieved as shown below.

when wait_cnt =>
    if cnt_terminal = '1' then
       current_state <= read_fifo;
       cnt_reset <= '1';
    end if;

In the above code a single bit input is provided from the counter process to indicate the terminal count has been reached, while the state machine asserts a single bit output to reset the counter when this occurs. This uses less logic within the state machine, enabling better performance.

As this example shows, outputs leaving the state machine should also be single bit signals.

Address any unmapped states

Many times, when we develop a state machine, the implementation does not use a power of two states leaving several unmapped states. If the state machine enters an unmapped state the state machine will stall and become unresponsive as there is no recovery mechanism. Depending upon the application this can be merely inconvenient or lead to catastrophic consequences.

Transitions to unmapped states can occur for a variety of reasons, from a single event effect to an electrically noisy environment or even vibration affecting device bond wires (before you ask, yes, I have seen this). It is therefore good practice to ensure there is a path from an unmapped state back into the main flow of the design. One of the simplest mechanisms to do this is to cycle through the unused states following the reset or power up before the state machine enters its idle state. This prevents the unmapped states from being optimised out during synthesis, while providing a simple recovery mechanism.

Consider the unexpected, what happens if control signals arrive late or early

When considering the control flow of a state machine it is important to consider what happens if an expected signal does not arrive when expected.  While in an ideal world we would have defined interface definitions for each module in our design sadly this is not always the case. As such we need to make sure the state machine does not hang in a state waiting for a signal which has already occurred and as such has been missed.

Considering this ensures the state machine can handle the worst case conditions in operation.

This is incredibly important if your state machine contains structures like below, where a late signal can easily lead to the system becoming unresponsive.

when <state> =>
    if input_one = '1' then
        if input_two = '1' then
            current_state <= wait_fifo;
        elsif input_three = '1' then
            current_state <= idle;
        end if;
    end if;

Should input one not be asserted when inputs two or three occur, the state machine will hang and become unresponsive. If such a structure is unavoidable (and I am sure you can avoid it with a little thought), a timer or other recovery function should be added to prevent the state machine locking up, allowing a graceful recovery.
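To illustrate the recovery timer idea, here is a behavioural model written in C rather than synthesisable VHDL. It is only a sketch: the state names, the start input and the TIMEOUT_CYCLES value are hypothetical. Each call to fsm_step models one clock edge of a single process state machine, the waiting state returns to idle if the expected inputs never arrive, and the default branch gives any unmapped state a path back to idle.

```c
#include <stdbool.h>

#define TIMEOUT_CYCLES 1000u  /* hypothetical recovery limit in clock cycles */

typedef enum { ST_IDLE, ST_WAIT_INPUT, ST_WAIT_FIFO } state_t;

typedef struct {
    state_t  state;
    unsigned timer;
} fsm_t;

/* One call models one clock edge of a single process state machine. */
void fsm_step(fsm_t *f, bool start, bool input_one, bool input_two, bool input_three)
{
    switch (f->state) {
    case ST_IDLE:
        f->timer = 0;
        if (start) {
            f->state = ST_WAIT_INPUT;
        }
        break;

    case ST_WAIT_INPUT:
        if (input_one && input_two) {
            f->state = ST_WAIT_FIFO;
        } else if (input_one && input_three) {
            f->state = ST_IDLE;
        } else if (++f->timer >= TIMEOUT_CYCLES) {
            /* Expected inputs never arrived: recover instead of hanging. */
            f->state = ST_IDLE;
        }
        break;

    case ST_WAIT_FIFO:
        /* Placeholder for the real work; return to idle on the next edge. */
        f->state = ST_IDLE;
        break;

    default:
        /* Unmapped state: always provide a path back to idle. */
        f->state = ST_IDLE;
        f->timer = 0;
        break;
    }
}
```

In VHDL the same structure would map onto an extra timeout branch in the waiting state and a "when others" branch in the case statement.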

Further Reading!

  1. For more on high reliability state machines read USING FPGA’S IN MISSION CRITICAL SYSTEMS
  2. For more on the basics of state machines read How to implement a state machine in your FPGA



MicroZed Chronicles – Pynq Computer Vision Overlay


It has been a while since I last wrote about the Pynq, covering it in chronicles 155 to 161  just after its release.

With the release of the Pynq Version 2.1 image and  the associated new overlays for computer vision and quantised neural networks, I thought I should take a look at these new capabilities.

Starting with the computer vision overlay.

One of the things I developed for my Pynq quickly after I received it was a simple object tracking application using OpenCV. This used a USB web camera and a simple OpenCV algorithm that detected the difference between a reference frame and frames received from the web camera.

Differences between the captured frame and reference frame above a certain threshold would then be identified on the captured frame as it was output over HDMI.

The original webcam-based algorithm itself is simple, performing the following steps:

  1. Capture a frame from the web cam and convert it to grey scale (cv2.cvtColor)
  2. Perform a Gaussian Blur on the frame (cv2.GaussianBlur)
  3. Calculate the difference between the blurred frame and the reference frame (cv2.absdiff)
  4. Create a binary image using a threshold operation (cv2.threshold)
  5. Dilate the differences on the binary image to make them more noticeable (cv2.dilate)
  6. Find the contours within the binary image (cv2.findContours)
  7. Ignore any contours with a small area (cv2.contourArea)
  8. Draw a box around each contour that is large enough (cv2.boundingRect & cv2.rectangle)
  9. Show captured frame with the boxes drawn (cv2.imshow)
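To make the per-pixel cost of these steps concrete, the sketch below models steps 3 and 4 in plain C on 8-bit grey scale frames. It is not the OpenCV implementation and not the code from the original notebook; the function name diff_threshold and its parameters are hypothetical, and a binary threshold with a maximum value of 255 is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of steps 3 and 4: absolute difference followed by a binary threshold.
   mask[i] is set to 255 where |frame[i] - reference[i]| > threshold, else 0.
   Returns the number of pixels flagged as changed. */
size_t diff_threshold(const uint8_t *frame, const uint8_t *reference,
                      uint8_t *mask, size_t n_pixels, uint8_t threshold)
{
    size_t changed = 0;
    size_t i;

    for (i = 0; i < n_pixels; i++) {
        /* Absolute difference without going negative in unsigned arithmetic. */
        uint8_t diff = (frame[i] > reference[i])
                           ? (uint8_t)(frame[i] - reference[i])
                           : (uint8_t)(reference[i] - frame[i]);

        mask[i] = (diff > threshold) ? 255 : 0;
        if (mask[i] != 0) {
            changed++;
        }
    }
    return changed;
}
```

Even this simple pair of steps touches every pixel of every frame, which is why running the whole chain in software alone gives such a low frame rate.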

While the algorithm has fewer than 10 steps, each step requires several operations on an image array. As a result, the frame rate when running just in SW was very low. Knowing that programmable logic is ideal for implementing these functions and would provide significant acceleration, I intended to go back and accelerate the algorithm.

Sadly, I never got the time to do this. However, looking at the new computer vision overlay, which uses elements of the reVision stack, I realised that I could probably accelerate this algorithm very quickly.

The new computer vision overlay provides the following image processing functions accelerated within the programmable logic.

  • 2D filter 3×3 with a configurable kernel allowing Gaussian Blurs, Sobel (V+H) etc
  • Dilation
  • Remapping

Within the Pynq PL these are implemented as shown in the diagram below.

To install the computer vision overlay we use a PuTTY terminal connected to the Pynq to download and install the packages from GitHub.

In the PuTTY terminal use the following commands

$ sudo -H pip3.6 install --upgrade git+https://github.com/Xilinx/PYNQ-ComputerVision.git

$ sudo reboot now

Once installed we can proceed to updating the algorithm.

The computer vision overlay ideally uses the  HDMI input and output  for the best performance. To provide an accurate comparison against the previous OpenCV based example, my first step was to update that design to capture images using the HDMI input in place of the web camera.

I also modified the code to run for 200 frames such that I could time the execution and calculate the frames per second of both solutions.

This updated OpenCV design resulted in a frame rate of 4 frames per second when I ran it on my Pynq.

The next step was to update the algorithm to use the computer vision overlay. To do this I used the overlay's 2D filter to perform the Gaussian Blur and its dilation function to perform the Dilate operation.

Switching in these functions resulted in a significant increase in the frame rate, making the application usable.

Hints & Tips

  1. You can increase the performance by increasing the minimum size of contour that is processed further.
  2. Previous blogs on the Pynq are available here: P155, P156, P157, P158, P159, P160 & P161
  3. The Jupyter notebook is available on my GitHub



MicroZed Chronicles – Maximising Reuse in your Vivado Design


When it comes to creating our FPGA or SoC designs, it is inefficient and poor practice to not leverage IP cores and reuse of other design elements if they are available.

Before we start developing our own IP core, we should of course first check if such functions are available in the Vivado library or via Open Source repositories before we consider the need to create our own or purchase one from a third party supplier.

Of course, creating our own IP core establishes a library we can use to reduce the design time and hence cost of future projects.

However, often in our design we find ourselves implementing functions which have several IP cores connected in the same manner, for example an image processing pipeline.

Each time we wish to implement this function within our design we need to add in and connect all the necessary IP cores. Again, this is inefficient. What is needed is a method of reusing these functions each time we want to instantiate them, in this project and the next.

We can do this using hierarchical blocks.

Example Input Image Processing Chain

Working with a block diagram design in Vivado we can create a reusable hierarchical block using the write_bd_tcl command. This command is one you may have used previously to output a TCL description of the block diagram so that it can be stored in a version control tool like Git.

Rather helpfully we can also use the same command to write out a description of a single hierarchical block within a block diagram. We can then use this TCL description to create multiple instances of the block at will across several projects.

So how do we do it?

Creating a hierarchical block within the block diagram is straightforward: simply right click on the block diagram canvas and select create hierarchy. If the IP cores we wish to include in this new block already exist at the higher level, all we then have to do is drag and drop them into the new hierarchical block. If not, we can double click on the new hierarchical block, which will open the block in a new window, allowing us to add IP cores and connect them as desired.

Creation of the Hierarchical Block with the image processing chain

Image processing core within the hierarchical block.

To create a TCL file description of this block we use the following command in the Vivado TCL console.

write_bd_tcl -force -hier_blks [get_bd_cells <hier block name>] <filename.tcl>

Command within the TCL window

This command will write out a file which describes the hierarchical block contents and their connections. If we wish to add the block to an existing or new design, we do this by loading the file into Vivado, again using the TCL Console.

source <filename>.tcl 

Once the TCL file has been loaded, in the TCL Console window you will see notification of a new procedure which can be called to create a new instance of the block in the project.

New TCL function following loading of the file into Vivado

Calling this procedure results in a new hierarchical block being added to your design. For this example using the image processing chain above, I used the following command to add a second image processing block.

create_hier_cell_IPC / NEW_IPC

Examining the new block against the initial block demonstrates the contents are identical as we would expect.

Comparison between the original and new IPC block created from the TCL file.

We can now use this created TCL file across several designs where we want to create an image processing chain, saving time. If we want to ensure it has the maximum reuse potential, we can use the -no_ip_version option in the write_bd_tcl command to prevent the IP version from being included within the file. This makes the script more versatile with different versions of Vivado.

One final point: having created the TCL file, it is a good idea to enter it into a version control tool as we would with any other design element.


  1. Ensure the project you are loading the script into can see all of the IP cores used in the script. Make sure you have all the repositories added in the Vivado project.
  2. If you change the module and overwrite the generated TCL description, you must reload it into Vivado and then re-instantiate it for the changes to take effect in your project.
  3. When you create the TCL description make sure you know where the file will be created by running a pwd command first and if necessary setting the working path to a more friendly location.
  4. Simplify the interfacing of the block by using custom interface definitions within Vivado and the IP Packager.



A Recipe for Embedded Systems



One thing that is always important for engineers, is the need for us to deliver our projects on quality, schedule and budget. When it comes to developing embedded systems there are a number of lessons, learnt by embedded system developers over the years which can be used to ensure your embedded system achieves these. Let us explore some of the most important lessons learned in developing these.

Link – Page34


Making XDC Timing Constraints Work for You



Completing the RTL design is one part of getting your FPGA design production-ready. The next challenge is to ensure the design meets its timing and performance requirements in the silicon. To do this, you will often need to define both timing and placement constraints. Let’s take a look at how to create and use both of these types of constraints when designing systems around Xilinx® FPGAs and SoCs.



SDSoC Accelerate your AES Encryption



The Advanced Encryption Standard (AES) has become an increasingly popular cryptographic specification in many applications, including those within embedded systems. Since the National Institute of Standards and Technology (NIST) selected the specification as a standard in 2002, developers of processor, microcontroller, FPGA and SoC applications have turned to AES to secure data entering, leaving and residing within their systems. The algorithm is described very efficiently at a higher abstraction level, as is used in traditional software development; but because of the operations involved, it is most efficiently implemented in an FPGA. Indeed, developers can even get some operations “for free” in the routing. For those reasons, AES is an excellent example of how developers can benefit from the Xilinx® SDSoC™ development environment by describing the algorithm in C and then accelerating the implementation in hardware. In this article we will do just that, first gaining familiarity with the AES algorithm and then implementing AES256 (256-bit key length) on the processing system (PS) side of a Xilinx Zynq®-7000 All Programmable SoC to establish a baseline of software performance before accelerating it in the on-chip programmable logic (PL). To gain a thorough understanding of the benefits to be gained, we will perform the steps in all three operating systems the SDSoC environment supports: Linux, FreeRTOS and BareMetal.