Tag Archives: Zynq

MicroZed Chronicles – Vivado HLS & DDR Access


Using Vivado HLS we can of course, accelerate the development of our data path. There are times however, when using HLS that we want to interact with external memories such as DDR.  Either to store data or to retrieve data already written by another function.

Using HLS to interact with an external DDR at first sounds a like it might be complicated. Nothing could be further from the truth as I am about to demonstrate.

To do this I am going to use my Arty board which contains both a Artix 35T FPGA and 256MB of DDR3L ideal for the demonstration. What this demonstration will create is a HLS IP block which can be included within our Vivado design and interface with DDR. The functionality of this block will be simple writing a pattern of numbers into the DDR memory.

To successfully use this block we will therefore, need a Vivado design which contains the following

  • Memory Interface Generator configured for the DDR3L
  • JTAG to AXI to allow debug verification access to the DDR3L
  • AXI interconnect to allow the HLS block and the JTAG to AXI to access the MIG
  • VIO to start the HLS block
  • ILA to verify / debug internally the design
  • The HLS test block itself

The idea behind this demonstration is the HLS block will write the data to the DDR and I will then be able to read the values written into the DDR using the JTAG to AXIS.

To crate the HLS block we use Vivado HLS and create a new project targeting the same Artix devices as on the Arty. All we need to do then is create our source files and test bench both of which will be very simple.

First the source file, to read or write from DDR memory we use the C function memcpy when used in software it allows the copying of data from a source to a destination. When we use it in HLS it also does the same however, the movement of data is based around the AXI4 memory mapped interface.

Of course, this interface is ideal for use with the MIG and other memory interfaces. As such the code we use to create the example HLS block is very simple

Looking in detail at the code, the function prototype includes a int pointer DDR, this will be the interface we use to write out the data to the DDR memory. If necessary, we could also use this interface to read from the DDR as well.

We to define this interface as a AXI4 master interface we use the pragma HLS INTERFACE with the type m_axi port =ddr associates the port ddr with the AXI interface. While the depth is used for co-simulation when pointers are used in place of arrays and should be set to the number values written /read.

The offset is how we control the address space and the addresses the AXI interface begins to access. This can be provided by one of three methods.

  1. Off – the default, it will start accesses from 0x00000000.
  2. Direct – this will create a port on the HLS module which allows the offset definition to be applied from within the Vivado block design.
  3. Slave – this will create a AXI 4 Lite interface with a register which defines the offset.

For this example, we will be using the direct approach as this allows us to show how a more logic-based design as opposed to a solution containing an embedded processing system using AXI Lite.

When it comes to verifying the design, we need to create a simple test bench which allows us to perform C based simulation and Co-Simulation. The test bench for this example simply has a 256-bit array passed to the HLS function which it fills.

The C Simulation allows us to test out the functionality of the code before we perform the HLS synthesis and co-simulation. Just like in any C debugging here we can use break points and examine the contents of variables.

With this completed the next step is to perform Synthesis and Co-Simulation, when you run Co-Simulation ensure you set the dump trace to port. This allows us to see the inputs and output waveforms of the HLS core.

The Co-Simulation applies the same C test bench inputs to the generated RTL and reports back on the pass / fail of the simulation.

Being happy with the co-simulation results the final step in Vivado HLS is to generate the IP core, and add it into our Vivado IP repository allowing its use.

Once our HLS core has been added into the Vivado design it is here that we can set the address the HLS core will use for its transactions. To do this I used a constant block.

With this completed we can then generate the bit stream, and once completed open the hardware manager and program the device.

Thanks to the VIO we can control the ap_start signal on the HLS core, this means before we start we can read the DDR memory using the JTAG to AXIS link and check that it is not set to our test values.

Satisfied the DDR memory is randomly initialised we can then start the HLS block using the VIO and check that it has run correctly writing its data to the DDR memory.

To verify the succesful writing of data, we can do this in two ways the first is to use internal ILA’s configured to trigger on AXI writes and completion of the HLS function.

The second is to use the JTAG to AXI bridge to read back the written addresses and confirm the values.

This demonstrates the write to DDR memory is successful, which enables us more flexibility within our HLS designs.

Watch out for!

  1. Ensure the HLS block is reset correctly.
  2. Ensure the ap_start signal is held high until ap_ready is asserted.
  3. Holding ap_start high will result in the core running again.
  4. Ensure you have the address range set correctly for your memory access in the hardware.

Example code

MicroZed Chronicles on GitHub 

Want a Book 

Year One  

Year Two

Image Processing with Xilinx Devices 


MicroZed Chronicles – Finite State Machines Tips


The Finite State Machine (FSM) are one of the basic building blocks every FPGA designer should know and deploy often. However, over the years when implementing state machines in many different solution spaces (defense, aerospace, automotive etc) I have learnt a few tips and recommendations which I thought I would share. Following these have provided me with a better quality of results and helped me deliver working systems to my customers faster.


Develop the state machine as a single clocked process.

Perhaps the most controversial point is always using a single process state machine. Of course, this is different to the state machine architecture we are taught when we first learn about them. Many, if not most universities teach state machine implementation using two processes, one combinatorial and another sequential.

With a two-process state machine implementation the main functionality will be contained in a combinatorial process. If we fail to fully define all the combinatorial conditions within this process we will implement latches during synthesis. The combinatorial process may also generate glitches on its outputs as the state and inputs change.

Debugging can also be harder as the sensitivity list needs to be complete, including all of the signals used in the combinatoral process. Failure to include a signal in the sensitivity list will result in different behaviour between RTL simulation and the implementation which can take a little time to find. This is alleviated somewhat if we are using VHDL 2008, where we can use the “all” statement in the process declaration. Of course to take advantage of this your tool chain needs to support VHDL 2008.

By contrast a one process state machine enables much easier use of conditional definition and removes the ability for latches to be created. While preventing glitches as all outputs signals are registered.

Personally I also find them a little easier to debug as all the functionality is within one process.

Decouple functionality, only allow single bit input and outputs to your state machine.

We are often tempted to take into our state machine large buses and decode these within the main body.  This is especially true when counters are used for the timing of events within the state machine. For example, as shown in the code snippet below

when wait_cnt =>   

    if cnt >= std_logic_vector(to_unsigned(128,16)) then 

       current_state <= read_fifo;

       cnt <= (others =>’0’);


       cnt <= cnt + 1;

    end if;

Implementing the state machine in this manner includes additional logic within the state machine for both the comparator and the counter. This will impact the performance of the state machine within the implemented FPGA as it requires more resources and consequently more routing.

A better method is to use an external process for the counter or other external functions which pass in a single control signal. This leaves the bulk of the logic decoupled from the state machine. Taking the counter example, we can use an external counter process to generate a single pulse once the terminal count is achieved as shown below.

when wait_cnt =>   

    if cnt_terminal = ‘1’ then 

       current_state <= read_fifo;

       cnt_reset <= ‘1’;

    end if;

In the above code the single bit input provided from the counter process to indicate the terminal count has been reached. While the state machine asserts a single bit output to reset the counter when this occurs. This uses less logic within the state machine, enabling better performance.

As shown in this example only a single bit should also only leave the state machine as well.

Address any unmapped states

Many times, when we develop a state machine, the implementation does not use a power of two states leaving several unmapped states. If the state machine enters an unmapped state the state machine will stall and become unresponsive as there is no recovery mechanism. Depending upon the application this can be merely inconvenient or lead to catastrophic consequences.

Transitions to unmapped states can occur for a variety of reasons, from a single event effect to an electrically noisy environment or even vibration effecting device bond wires (before you ask yes, I have seen this). It is therefore good practice to ensure there is a path back from an unmapped state back into the main flow of the design. One of the simplest mechanisms to do this is to cycle through the unused states following the reset or power up before the state machine enters its idle state. This prevents the unmapped states from being optimised out during synthesis, while providing a simple recovery mechanism.

Consider the unexpected, what happens if control signals arrive late or early

When considering the control flow of a state machine it is important to consider what happens if an expected signal does not arrive when expected.  While in an ideal world we would have defined interface definitions for each module in our design sadly this is not always the case. As such we need to make sure the state machine does not hang in a state waiting for a signal which has already occurred and as such has been missed.

Considering this ensures the state machine can handle the worse case conditions in operation.

This is incredibly important if your state machine contains structures like below, where a late signal can easily lead to the system becoming unresponsive.

when <state> =>

 if input_one  = ‘1’ then

   if input_two = ‘1’ then

     current_state <= wait_fifo;

  elsif input_three = ‘1’ then

   current_state <= idle;

  end if;

end if;

Should input one not be asserted when inputs two or three occur then the state machine will hang and become un-responsive. If such a structure is unavoidable and I am sure you can avoid it with a little thought, a timer or other recovery function should be added to prevent the state machine locking up, allowing a graceful recovery.

Further Reading!

  1. For more on high reliability state machines read USING FPGA’S IN MISSION CRITICAL SYSTEMS

For more on the basics of state machines read How to implement a state machine in your FPGA

Example code

MicroZed Chronicles on GitHub 

Want a Book 

Year One  

Year Two

Image Processing with Xilinx Devices 


MicroZed Chronicles – Pynq Computer Vision Overlay


It has been a while since I last wrote about the Pynq, covering it in chronicles 155 to 161  just after its release.

With the release of the Pynq Version 2.1 image and  the associated new overlays for computer vision and quantised neural networks, I thought I should take a look at these new capabilities.

Starting with the computer vision overlay.

One of the things I developed for my Pynq quickly after I received it was a simple object tracking application using OpenCV. This used a USB web camera and a simple OpenCV algorithm that detected the difference between a reference frame and frames received from the web camera.

Differences between the captured frame and reference frame above a certain threshold would then be identified on the captured frame as it was output over HDMI.

The original webcam-based algorithm itself is simple doing the following steps

  1. Capture a frame from the web cam and convert it to grey scale (cv2.cvtColor)
  2. Perform a Gaussian Blur on the frame (cv2.GaussianBlur)
  3. Calculate the difference between the frame blurred frame and the reference frame (cv2.absdiff)
  4. Create a binary image using a threshold operation (threshold)
  5. Dilate the differences on the binary image to make them more noticeable (cv2.dilate)
  6. Find the contours within the binary image (cv2.findContours)
  7. Ignore any contours with a small area (cv2.contourArea)
  8. Draw a box around each contour that is large enough (cv2.boundingRect & cv2.rectangle)
  9. Show captured frame with the boxes drawn (cv2.imshow)

While the algorithm has less than 10 steps, each step requires several operations on an image array. As a result, the frame rate was when running just on SW was very low. Knowing that programmable logic is ideal for implementing these functions and would provide significant acceleration I intended to go back and accelerate the algorithm.

Sadly, I never  got the time to  do this.  However, looking at the new computer vision overlay which uses elements of the reVision stack I realised that I could probably very quickly accelerate this algorithm.

The new computer vision overlay provides the following image processing functions accelerated within the programmable logic.

  • 2D filter 3×3 with a configurable kernel allowing Gaussian Blurs, Sobel (V+H) etc
  • Dilation
  • Re mapping

Within the Pynq PL these are implemented as shown in the diagram below.

To install the computer vision overlay we use a PuTTY terminal connected to the Pynq to download and install the packages from Github.

In the PuTTY terminal use the following commands

$ sudo -H pip3.6 install –upgrade git+https://github.com/Xilinx/PYNQ-ComputerVision.git

$ sudo reboot now

Once installed we can proceed to updating the algorithm.

The computer vision overlay ideally uses the  HDMI input and output  for the best performance. To provide an accurate comparison against the previous OpenCV based example, my first step was to update that design to capture images using the HDMI input in place of the web camera.

I also modified the code to run for 200 frames such that I could time the execution and calculate the frames per second of both solutions.

This updated OpenCV design resulted in a frame rate of 4 frames per second when I ran it on my Pynq.

The next step was to update the algorithm to use the computer vision overlay to do this I used the 2D filter to perform the Gaussian Blurring and the Dilate operations.

Switching in these functions resulted in significant increase in the frame rate making the application usable.

Hints & Tips

  1. You can increase the performance by increasing the size of the contour further processed.
  2. Previous blogs on the Pynq are available here  P155, P156, P157, P158, P159,P160 & P161
  3. Jupyter Note book is available on my GitHub 

Example code

MicroZed Chronicles on GitHub 

Want a Book 

Year One  

Year Two

Image Processing with Xilinx Devices 


MicroZed Chronicles – Maximising Reuse in your Vivado Design


When it comes to creating our FPGA or SoC designs, it is inefficient and poor practice to not leverage IP cores and reuse of other design elements if they are available.

Before we start developing our own IP core, we should of course first check if such functions are available in the Vivado library or via Open Source repositories before we consider the need to create our own or purchase one from a third party supplier.

Of course, creating our own IP core establishes a library we can use to reduce the design time and hence cost of future projects.

However, often in our design we find ourselves implementing functions which have several IP cores connected in the same manner for example an image processing pipeline.

Each time we wish to implement this function within our design we need to add in and connect all the necessary IP cores. Again, this is inefficient, what is needed is a method of reusing these functions each time we want to instantiate it in this project and the next.

We can do this using hierarchical blocks.

Example Input Image Processing Chain

Working with a block diagram design in Vivado we can create a reusable hierarchical block using the write_bd_tcl command. This command is one you may have used previously to output a TCL description of the block diagram so that it can be stored in a version control tool like Git.

Rather helpfully we can also use the same command to write out a description of a single hierarchical block within a block diagram. We can then use this TCL description to create multiple instances of the block at will across several projects.

So how do we do it?

Creating a hierarchical block within the block diagram is straight forward, simply right click on the block diagram canvas and select create hierarchy. If the IP cores we wish to include in this new block, already exist at the higher level all we then have to do is drag and drop them into the new hierarchical block. If not we can double click on the new hierarchical block which will open the block in a new window allowing us to add IP cores and connect them as desired.

Creation of the Hierarchical Block with the image processing chain

Image processing core within the hierarchical block.

To create a TCL file description of this block we use the following command in the Vivado TCL console.

write_bd_tcl -force -hier_blks [get_bd_cells <hier block name>] <filename.tcl>

Command within the TCL window

This command will write out a file which describes the hierarchical block contents and their connections. If we wish to add the block to an existing or new design, we do this by loading the file into Vivado, again we use the TCL Console.

source <filename>.tcl 

Once the TCL file has been loaded, in the TCL Console window you will see notification of a new procedure which can be called to create a new instance of the block in the project.

New TCL function following loading of the file into Vivado

Calling this procedure results in a new hierarchical block being added to your design. For this example using the image processing chain above, I used the following command to add a second image processing block.

create_hier_cell_IPC / NEW_IPC

Examining the new block against the initial block demonstrates the contents are identical as we would expect.

Comparison between the new original and new IPC block created from the TCL file.

We can now use this created TCL file across several designs where we want to create a image processing chain saving time. If we want to ensure it is has the maximum reuse potential, we can use the -no_ip_version option in the write_bd_tcl command to prevent the IP version from being included within the file. This makes the script more versatile with different versions of Vivado.

One final point having created the TCL file it is a good idea to enter it into a version control tool as we would with any other design element.


  1. Ensure the project your loading the script into can see all of the IP cores used in the script. Make sure you have all the repositories added in the Vivado project.
  2. If you change the module and over write the generated TCL description. For the changes to take effect in your project you must reload it in to Vivado and then re instantiate it.
  3. When you create the TCL description make sure you know where the file will be created by running a pwd command first and if necessary setting the working path to a more friendly location.
  4. Simplify the interfacing of the block by using custom interface definitions within Vivado and the IP Packager.

Example code

MicroZed Chronicles on GitHub 

Want a Book 

Year One  

Year Two

Image Processing with Xilinx Devices 


A Recipe for Embedded Systems



One thing that is always important for engineers, is the need for us to deliver our projects on quality, schedule and budget. When it comes to developing embedded systems there are a number of lessons, learnt by embedded system developers over the years which can be used to ensure your embedded system achieves these. Let us explore some of the most important lessons learned in developing these.

Link – Page34


Making XDC Timing Constraints Work for You



Completing the RTL design is one part of getting
your FPGA design production-ready.
The next challenge is to ensure the design
meets its timing and performance requirements
in the silicon. To do this, you will often need to
define both timing and placement constraints.
Let’s take a look at how to create and use both
of these types of constraints when designing systems
around Xilinx® FPGAs and SoCs



Arty – FreeRTOS & XADC


Happy New Year! For the first blog of the year I thought we would combine the FreeRTOS and the XADC examples we had bbeen looking at previously.

This will enable me to show how we can do the following

  1. Configure the XADC
  2. Create a task to read from the XADC
  3. Create a second task to take action on the results from the first task

The intended functionality is the one task, once a second reads the XADC internal parameters and stores them within a array. This array is then communicated via a queue to the recieivng task which processes the results, for this example it just outputs them over the UART. If we wanted to this task could perform more detailed analysis or calcualtion on the results being provided to it.

We can easily modify the hello world example to perform this. If we wish to make it more complex and introduce more tasks which share resources, we must ensure they are properly managed and do not become deadlocked.

The first thing we need to do is include the proper header files such that we can use the API for the XADC we do this by including the “XSYSMON.H”.

Within the main function we are going to configure and initialise the XADC before we start the two tasks.

As we are going to be using the printf function and we are going to transfering data we need to ensure the stack size is correctly allocated. To prevent problems I decided to increase this to ensure there was sufficent for that requred for both tasks, when we create the tasks in the main() we are required to define the stack allocated to each task. For both tasks I decided to allocate 1000 bytes, I left the task pirorites as the receiving task being a higher priortiy than the transmitting task ensuring the data is transmitted as soon as it is received.


The next step was to create the queue, as I mentioned above due to the priorities of the RX and TX tasks there will only be one element in the queue, which will be the size of the size element XADC_Buf.

The final element is to start the scheduler and let the tasks run for ever well once we have written them.

We write each task as a we would any function in C, being careful to ensure we include the 1 second dealy within the transmitt task.

When I ran the code (which is available here) I got the following results.



ZedBoard Part 5 Configuration 2


As I explained in my previous column, the configuration of the Zynq is a little different from that of traditional FPGAs. At the end of the previous column, we were poised to begin using SDK to generate the boot image. So far in this series of blogs, I have created the PS (programmable system) and PL (programmable logic) sides of the device, created a simple C program, and demonstrated everything working using the JTAG interface.

The next step in creating a “boot image” is to create a first-stage boot loader (FSBL). Thankfully, the folks at Xilinx are kind enough to provide us with an FSBL that will load our application and configure our FPGA (you can, of course, customize the code provided to alter the boot sequence). Within the current SDK workspace (the one containing your C project), use “New > Application Project” to create a new project as illustrated below



Select whether you want to use C or C++, the board support package defined for your system, and the name of the project, which is “zynq_fsbl_0” in this case. On the next tab select the “Zynq FSBL” option, as illustrated below and your FSBL project will be created, at which point we are nearly ready to create the boot image



f you have “Compile Automatically” selected then the FSBL will be compiled; if not, it will be compiled on-demand later on.

However, we actually need to make a small change to the linker script provided with the FSBL, as it has no idea where the DDR memory is located in the processor address space. Therefore, we need to open the “lscrip.ld” file and add the DRR memory location into this file. This can be found in the linker script we created for the C application in Part 3 of this mini-series (it can also be found in the “system.xml” file under the hardware platform definition).

The following screenshot shows the FSBL “lscript.ld” file with the DDR address for the system added in. If you forget to do this, you will find that the boot loader will run and the FPGA will, but the application will not run because in Part 3 we configured the application to run from the DDR



Looking under the “Project Explorer,” you should hopefully now have the following modules (each should have a slightly different symbol identifying the type of module):

  1. proc_subsystem_hw_platform — named after the processing subsystem you created in PlanAhead, this is the hardware definition of your file
  2. zed_blog_0 — This is the board support package created in Part 3 of this mini-series
  3. zed_blog_C — The C application created earlier in Part 3 (a simple “Hello World” and LED flashing application)
  4. zynq_fsbl_0 — The first stage boot loader we have just created and modified

As we are creating a “bare metal” application (i.e., no operating system), we need the following files — in this specific order — to create a boot image:

  1. First-stage boot loader (as created in this column)
  2. FPGA programming bit file (created in Part 2)
  3. C application (created in Part 3)

We are now ready to generate the boot image, which can be created using the “Create Zynq Boot Image” option under the “Xilinx Tools” menu as illustrated below



This is where we need to select the ELF files (resulting from the compilation of the software projects) and the bit file for the FPGA as illustrated below



It is important to stress that the FPGA bit file must always follow the FSBL. Clicking on “Create Image” will create *.bin and *.mcs files, which can be programmed into the target device.

On the ZedBoard we have the option of booting from the QSPI (Queued Serial Peripheral Interface) or the SD (Secure Digital) card. I think QPSI is more likely to be used for robust applications than an SD card, so we will store our application there.

The QSPI interface was defined all the way back in Part 2 of this mini-series in the “System Assembly” view of Xilinx Platform Studio as illustrated below (click here to see a larger version of this image); hence, the hardware definition will contain the QSPI IP core and the location in the processor address map at which the QSPI resides.atzb-0016-06-lg


Connect the ZedBoard to the PC via the JTAG interface and ensure the mode pins on jumpers 7 to 11 are set correctly. Select the “Program Flash” option from the Xilinx tools menu as illustrated below



Navigate and select the MCS file you generated and enter an offset of 0x0 in the dialog box that appears in the middle of the screen as illustrated below


It may take a few minutes for the device to be programmed (and verified if you ticked the “Verify” option). Once the programming is completed, power-down the ZedBoard and disconnect the JTAG cable such that the only connection the board has is the RS232 connection and the power cable. Set Jumpers 7 to 11 to configure from QSPI and power the board back on again. Congratulations! You have now created your very first, standalone, bare metal Zynq application.



ZedBoard Part 4 Configuration


If you read this mini-series, you know that I have reached the stage where I have a simple Hello World program running on my Zynq when it is connected to the software development kit. In reality, however, this is not much good. For a real-world use model, we will want to store our software program and configuration bitstream in nonvolatile memory and configure the device following power-on.

But how do we do this? For those used to a traditional FPGA development and configuration flow, this is a little different from what one might initially imagine. In fact, it will take me a couple of blogs to explain it all, but I promise this will be worthwhile, so please bear with me.

Before I jump into the tools and processes you need to follow to create the boot file, I had better explain the basics of the Zynq configuration. In a Zynq system, the processing system is the master and always configures the programmable logic side of things, unless the JTAG interface is used. This means that the processing system can be powered and operating while the programmable logic remains unpowered (unless secure configuration is used).

The processing system therefore follows a typical processor boot sequence, initially running from an internal nonmodifiable boot PROM. This boot PROM contains drivers for the supported nonvolatile memories, along with drivers for the programmable logic configuration. The boot PROM also loads in the next stage of the boot loader, the first stage boot loader (FSBL), which is user-provided. The FSBL can then configure the DDR memory and other peripherals on the processor before loading the software application and configuring the programmable logic (if desired) using the processor configuration access port. The PCAP allows both partial and full configuration of the programmable logic.  This means the programmable logic can be programed at any time once the processing system is up and running. Also, the configuration can be read back and checked for errors.

The Zynq supports both secure and nonsecure methods of configuration. In the case of secure configuration, the programmable logic section of the device must be powered up as the hard macro. The advanced encryption standard and secure hash algorithm needed for decryption are located within the programmable logic side of the device.

The Zynq processing system can be configured via a number of different nonvolatile memories (Quad SPI Flash, NAND Flash, NOR Flash, or SD). The system can also be configured via JTAG, just like any other device. Like all other FPGAs from Xilinx (this site’s sponsor), the Zynq uses a number of mode pins to determine crucial system settings like the type of memory where the program is stored. These mode pins share the multiuse input/output pins on the processor system side of the device. In all, there are seven mode pins mapped to MIO[8:2]. The first four define the boot mode. The fifth defines whether the PLL is used, and the other two define the voltages on MIO bank 0 and bank 1 during power-up.

The FSBL can change the voltage standard defined on MIO bank 0 and 1 to the correct standard. However, if you are designing a system from scratch, you must ensure that the voltage used during power-up cannot damage the device connected to these pins.

On the ZedBoards, these mode pins can be changed via jumpers to facilitate configuration from the SD, JTAG, or Quad SPI memories that are available.

In my next blog, I will explain how we use the tools to generate the first stage boot loader and then tie everything together to generate the file required to program the Zynq.