Tag Archives: fpga

MicroZed Chronicles – Pynq Computer Vision Overlay


It has been a while since I last wrote about the Pynq, which I covered in Chronicles 155 to 161 just after its release.

With the release of the Pynq Version 2.1 image and  the associated new overlays for computer vision and quantised neural networks, I thought I should take a look at these new capabilities.

Starting with the computer vision overlay.

One of the things I developed for my Pynq quickly after I received it was a simple object tracking application using OpenCV. This used a USB web camera and a simple OpenCV algorithm that detected the difference between a reference frame and frames received from the web camera.

Differences between the captured frame and reference frame above a certain threshold would then be identified on the captured frame as it was output over HDMI.

The original webcam-based algorithm itself is simple, performing the following steps:

  1. Capture a frame from the web cam and convert it to grey scale (cv2.cvtColor)
  2. Perform a Gaussian Blur on the frame (cv2.GaussianBlur)
  3. Calculate the difference between the blurred frame and the reference frame (cv2.absdiff)
  4. Create a binary image using a threshold operation (cv2.threshold)
  5. Dilate the differences on the binary image to make them more noticeable (cv2.dilate)
  6. Find the contours within the binary image (cv2.findContours)
  7. Ignore any contours with a small area (cv2.contourArea)
  8. Draw a box around each contour that is large enough (cv2.boundingRect & cv2.rectangle)
  9. Show captured frame with the boxes drawn (cv2.imshow)
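To make the data flow concrete, here is a minimal pure-Python sketch of the core differencing logic (steps 3, 4 and a simplified step 7); the real application performs these operations with the cv2 calls named above on NumPy image arrays, so the function names here are illustrative stand-ins only.

```python
# A pure-Python sketch of the core differencing logic (steps 3, 4 and 7);
# the real application uses cv2 calls on NumPy image arrays instead.

def abs_diff(frame, reference):
    """Step 3: per-pixel absolute difference between two greyscale frames."""
    return [[abs(f - r) for f, r in zip(fr, rr)]
            for fr, rr in zip(frame, reference)]

def threshold(image, level):
    """Step 4: binarise the image - 255 where the difference exceeds the level."""
    return [[255 if p > level else 0 for p in row] for row in image]

def region_area(binary):
    """Step 7 (simplified): count changed pixels, standing in for
    cv2.contourArea - small areas would be ignored as noise."""
    return sum(p == 255 for row in binary for p in row)

reference = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
frame     = [[10, 10, 10], [10, 200, 10], [10, 10, 10]]

diff = abs_diff(frame, reference)
binary = threshold(diff, 25)
print(region_area(binary))  # one pixel changed significantly
```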

While the algorithm has fewer than 10 steps, each step requires several operations on an image array. As a result, the frame rate when running purely in software was very low. Knowing that programmable logic is ideal for implementing these functions and would provide significant acceleration, I intended to go back and accelerate the algorithm.

Sadly, I never got the time to do this. However, looking at the new computer vision overlay, which uses elements of the reVISION stack, I realised that I could probably accelerate this algorithm very quickly.

The new computer vision overlay provides the following image processing functions accelerated within the programmable logic.

  • 2D filter (3×3) with a configurable kernel, allowing Gaussian blurs, Sobel (vertical and horizontal) filtering, etc.
  • Dilation
  • Remapping
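To illustrate what the configurable-kernel 2D filter computes, here is a pure-Python sketch of a 3×3 convolution with a Gaussian kernel; in the overlay this work is done at pixel rate in the programmable logic, so this software model is purely illustrative.

```python
# A software model of the configurable 3x3 2D filter; in the overlay this
# convolution runs in the programmable logic at pixel rate.

GAUSSIAN_3x3 = [[1, 2, 1],
                [2, 4, 2],
                [1, 2, 1]]  # divide by 16 to normalise

def filter2d_3x3(image, kernel, divisor):
    """Convolve a greyscale image with a 3x3 kernel (border left untouched)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += kernel[ky][kx] * image[y + ky - 1][x + kx - 1]
            out[y][x] = acc // divisor
    return out

flat = [[100] * 3 for _ in range(3)]
print(filter2d_3x3(flat, GAUSSIAN_3x3, 16))  # a flat image is unchanged
```

Swapping the kernel for a Sobel kernel turns the same hardware into an edge detector, which is why a configurable-kernel filter is such a versatile building block.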

Within the Pynq PL these are implemented as shown in the diagram below.

To install the computer vision overlay, we use a PuTTY terminal connected to the Pynq to download and install the packages from GitHub.

In the PuTTY terminal use the following commands

$ sudo -H pip3.6 install --upgrade git+https://github.com/Xilinx/PYNQ-ComputerVision.git

$ sudo reboot now

Once installed, we can proceed to update the algorithm.

The computer vision overlay ideally uses the HDMI input and output for the best performance. To provide an accurate comparison against the previous OpenCV-based example, my first step was to update that design to capture images using the HDMI input in place of the web camera.

I also modified the code to run for 200 frames such that I could time the execution and calculate the frames per second of both solutions.

This updated OpenCV design resulted in a frame rate of 4 frames per second when I ran it on my Pynq.
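The timing approach can be sketched as follows; `process_frame` is a hypothetical stand-in for the capture-and-process pipeline, not the actual notebook code.

```python
# A sketch of how the frame rate was measured: process a fixed number of
# frames and divide by the elapsed wall-clock time. process_frame is a
# stand-in for the capture/blur/diff/contour pipeline.
import time

def measure_fps(process_frame, n_frames=200):
    start = time.perf_counter()
    for _ in range(n_frames):
        process_frame()
    elapsed = time.perf_counter() - start
    return n_frames / elapsed

fps = measure_fps(lambda: sum(range(1000)))
print(f"{fps:.1f} frames per second")
```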

The next step was to update the algorithm to use the computer vision overlay. To do this, I used the 2D filter to perform the Gaussian blur and the dilate operations.

Switching in these functions resulted in a significant increase in the frame rate, making the application usable.

Hints & Tips

  1. You can increase performance further by raising the minimum contour area that is processed.
  2. Previous blogs on the Pynq are available here: P155, P156, P157, P158, P159, P160 & P161
  3. The Jupyter notebook is available on my GitHub

MicroZed Chronicles – Maximising Reuse in your Vivado Design


When it comes to creating our FPGA or SoC designs, it is inefficient and poor practice not to leverage IP cores and reuse other design elements when they are available.

Before we start developing our own IP core, we should of course first check if such functions are available in the Vivado library or via Open Source repositories before we consider the need to create our own or purchase one from a third party supplier.

Of course, creating our own IP core establishes a library we can use to reduce the design time and hence cost of future projects.

However, we often find ourselves implementing functions which have several IP cores connected in the same manner, for example an image processing pipeline.

Each time we wish to implement this function within our design, we need to add and connect all the necessary IP cores. Again, this is inefficient; what is needed is a method of reusing these functions each time we want to instantiate them, both in this project and the next.

We can do this using hierarchical blocks.

Example Input Image Processing Chain

Working with a block diagram design in Vivado we can create a reusable hierarchical block using the write_bd_tcl command. This command is one you may have used previously to output a TCL description of the block diagram so that it can be stored in a version control tool like Git.

Rather helpfully we can also use the same command to write out a description of a single hierarchical block within a block diagram. We can then use this TCL description to create multiple instances of the block at will across several projects.

So how do we do it?

Creating a hierarchical block within the block diagram is straightforward: simply right-click on the block diagram canvas and select Create Hierarchy. If the IP cores we wish to include in this new block already exist at the higher level, all we then have to do is drag and drop them into the new hierarchical block. If not, we can double-click on the new hierarchical block, which will open the block in a new window, allowing us to add IP cores and connect them as desired.

Creation of the Hierarchical Block with the image processing chain

Image processing core within the hierarchical block.

To create a TCL file description of this block we use the following command in the Vivado TCL console.

write_bd_tcl -force -hier_blks [get_bd_cells <hier block name>] <filename.tcl>

Command within the TCL window

This command will write out a file which describes the hierarchical block's contents and their connections. If we wish to add the block to an existing or new design, we do this by loading the file into Vivado, again using the TCL Console.

source <filename>.tcl 

Once the TCL file has been loaded, in the TCL Console window you will see notification of a new procedure which can be called to create a new instance of the block in the project.

New TCL function following loading of the file into Vivado

Calling this procedure results in a new hierarchical block being added to your design. For this example using the image processing chain above, I used the following command to add a second image processing block.

create_hier_cell_IPC / NEW_IPC

Examining the new block against the initial block demonstrates that the contents are identical, as we would expect.

Comparison between the original IPC block and the new one created from the TCL file.

We can now use this TCL file across several designs where we want to create an image processing chain, saving time. If we want to ensure it has the maximum reuse potential, we can use the -no_ip_version option of the write_bd_tcl command to prevent the IP version from being included within the file. This makes the script more portable across different versions of Vivado.

One final point: having created the TCL file, it is a good idea to enter it into a version control tool, as we would with any other design element.


Hints & Tips

  1. Ensure the project you're loading the script into can see all of the IP cores used in the script; make sure you have all the necessary repositories added in the Vivado project.
  2. If you change the module and overwrite the generated TCL description, you must reload it into Vivado and re-instantiate the block for the changes to take effect in your project.
  3. When you create the TCL description, make sure you know where the file will be created by running a pwd command first and, if necessary, setting the working path to a friendlier location.
  4. Simplify the interfacing of the block by using custom interface definitions within Vivado and the IP Packager.

A Recipe for Embedded Systems



One thing that is always important for engineers is the need to deliver our projects on quality, schedule and budget. When it comes to developing embedded systems, there are a number of lessons, learned by embedded system developers over the years, which can be used to ensure your embedded system achieves all three. Let us explore some of the most important of these lessons.

Link – Page 34


Making XDC Timing Constraints Work for You



Completing the RTL design is one part of getting your FPGA design production-ready. The next challenge is to ensure the design meets its timing and performance requirements in the silicon. To do this, you will often need to define both timing and placement constraints. Let's take a look at how to create and use both of these types of constraints when designing systems around Xilinx® FPGAs and SoCs.



SDSoC Accelerate your AES Encryption



The Advanced Encryption Standard (AES) has become an increasingly popular cryptographic specification in many applications, including those within embedded systems. Since the National Institute of Standards and Technology (NIST) selected the specification as a standard in 2002, developers of processor, microcontroller, FPGA and SoC applications have turned to AES to secure data entering, leaving and residing within their systems. The algorithm is described very efficiently at a higher abstraction level, as is used in traditional software development; but because of the operations involved, it is most efficiently implemented in an FPGA. Indeed, developers can even get some operations "for free" in the routing. For those reasons, AES is an excellent example of how developers can benefit from the Xilinx® SDSoC™ development environment by describing the algorithm in C and then accelerating the implementation in hardware. In this article we will do just that, first gaining familiarity with the AES algorithm and then implementing AES256 (256-bit key length) on the processing system (PS) side of a Xilinx Zynq®-7000 All Programmable SoC to establish a baseline of software performance before accelerating it in the on-chip programmable logic (PL). To gain a thorough understanding of the benefits to be gained, we will perform the steps in all three operating systems the SDSoC environment supports: Linux, FreeRTOS and bare metal.





Pretty much every embedded system / FPGA design has to interface to the real world through sensors or external interfaces. Some systems require large volumes of data to be moved around very quickly, in which case high-speed communications interfaces like PCI-X, Gigabit Ethernet, USB, FireWire/SpaceWire, or those featuring multi-gigabit transceivers may be employed.

However, many embedded systems also need to interface to slower sensors, memories and other peripherals; for these, systems can employ one or more of the simpler communications protocols. The four simplest, and therefore most commonly used, protocols are as follows.

  • UART (Universal Asynchronous Receiver Transmitter): This comprises a number of standards defined by the Electronic Industry Association (EIA), the most popular being the RS-232, RS-422, and RS-485 interfaces. These standards are often used for inter-module communication (that is, the transfer of data and supervisory control between different modules forming the system) as opposed to between the FPGA and peripherals on the same board, although I am sure there are plenty of applications that do this also. These standards define a mixture of point-to-point and multi-drop buses.
  • SPI (Serial Peripheral Interface): This is a full-duplex, serial, four-wire interface that was originally developed by Motorola, but which has developed into a de facto standard. This standard is commonly used for intra-module communication (that is, transferring data between peripherals and the FPGA within the same system module). Often used for memory devices, analog-to-digital converters (ADCs), CODECs, and MultiMediaCard (MMC) and Secure Digital (SD) memory cards, the system architecture of this interface consists of a single master device and multiple slave devices.
  • I2C (Inter-Integrated Circuit): This is a multi-master, two-wire serial bus that was developed by Philips in the early 1980s with a similar purpose as SPI. Due to the two-wire nature of this interface, communications are only possible in half-duplex mode.
  • Parallel: Perhaps the simplest method of transferring data between an FPGA and an on-board peripheral, this supports half-duplex communications between the master and the slave. Depending upon the width of data to be transferred, coupled with the addressable range, a parallel interface may be small and simple or large and complex.
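As a concrete example of the simplest of these protocols, here is a sketch of how a UART frames a single byte on the wire in the common 8N1 format (one start bit, eight data bits LSB first, no parity, one stop bit); the helper function is mine, not from any library.

```python
# A sketch of how a UART frames a byte on the wire: the line idles high,
# drops low for one start bit, sends eight data bits LSB first, then
# returns high for one stop bit (the common 8N1 format).

def uart_frame(byte):
    """Return the line levels for one 8N1 UART character, start bit first."""
    bits = [0]                                   # start bit (line driven low)
    bits += [(byte >> i) & 1 for i in range(8)]  # data bits, LSB first
    bits += [1]                                  # stop bit (line back high)
    return bits

print(uart_frame(0x55))  # alternating data bits
```

The receiver recovers the data by sampling the line at the agreed baud rate after detecting the falling edge of the start bit, which is why both ends must be configured with the same rate and format.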

The Arty board provides four PMOD connectors (JA through JD) along with the Arduino/ChipKit shield connector, through which we can interface with peripherals using these protocols. Over the next few weeks I am going to interface the PModDA4 (SPI) and the PModAD2 (I2C) to the MicroBlaze system that we have created.

The first step is to generate the hardware build within Vivado such that we can interface to the PMODs. We can add an AXI SPI controller from the IP library and configure it to be a standard SPI driver, not a dual or quad one (more on those in future blogs). We can also add the AXI IIC controller module (remember to search for iic, not i2c) and connect it up to the AXI bus; do not forget to add both interrupts to the interrupt controller.



Once both controllers are within the design, the next step is to customise them for the application at hand. With this complete, we can assign the I2C and SPI pins to the correct external ports for the desired PMOD.

All that remains then is to build the hardware and write the software; it sounds so easy when we say it quickly.


Arty – RTOS Overview


Over the last few blogs we have written software which runs on the MicroBlaze and reads the XADC. To do this, we have not used an operating system, instead taking a bare metal approach. However, as we wish to develop more complex systems, we need to introduce an operating system, so over the next few blogs we will be looking at how we can do just that.


First, however, I think it is a good idea to talk a little about operating systems, and real-time operating systems in particular. What differentiates an RTOS from a generic operating system? An RTOS is deterministic, meaning the response of the system will meet a defined deadline.

But does the system have to always meet these deadlines to be classed a real-time system?

Actually, no it does not.

There are three RTOS categories that address deadlines differently:

Hard RTOS – Missing a deadline is classified as a system failure.

Firm RTOS – Occasionally missing a deadline is acceptable and is not classified as a failure.

Soft RTOS – Missing a deadline simply reduces the usefulness of the results.

An RTOS operates around the concept of running tasks (sometimes called processes). Each of these tasks performs a required system function. For example, a task might read data from an interface or perform a calculation. A very simple real-time system may use just one task, but it is more likely that multiple tasks will be running on the processor at any one time. Switching between these tasks is referred to as “context switching” and requires that the RTOS saves the processor state for each task before the context switch starts the next task. The RTOS saves this processor state on a task stack.

Determining which task to run next is controlled by the RTOS kernel and this decision can be complicated—especially if we want to avoid deadlock where tasks lock each other out—but the two basic decision methods are:

Time sharing – Each task gets a dedicated time slot on the processor. Tasks with higher priority can have multiple time slots. Time slicing is controlled via a regular interrupt or timer. This method is often called Round Robin scheduling.

Event Driven – Tasks are only switched when a task finishes or when a higher priority task must be run. This method is often called pre-emptive scheduling
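The time-sharing approach can be sketched in a few lines; this cooperative model with generator-based tasks is purely illustrative and glosses over real context switching, interrupts and priorities.

```python
# A sketch of round-robin (time-sharing) scheduling: each task is a
# generator that yields at the end of its time slot, and the scheduler
# simply cycles through the ready tasks in order.
from collections import deque

def task(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"          # one unit of work per time slot

def round_robin(tasks):
    ready = deque(tasks)
    trace = []
    while ready:
        current = ready.popleft()
        try:
            trace.append(next(current))  # run one time slot
            ready.append(current)        # context switch: back of the queue
        except StopIteration:
            pass                         # task finished, drop it
    return trace

print(round_robin([task("A", 2), task("B", 2)]))
```

Each pass through the loop is one "time slot", and moving a task to the back of the queue is the (vastly simplified) context switch.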

When two or more tasks want to share a resource— the XADC for example—it is possible that the tasks might request the resource at the same time. Resource access needs to be controlled to prevent contention and this is one of the operating system’s most important duties. Without the correct resource management, deadlock or starvation might occur.

Here are the definitions we’ll use for deadlock and starvation:

Deadlock – Occurs when a task holds a resource, cannot release it until the task completes, and is currently unable to complete because it requires another resource currently held by another task. If that second task requires a resource held by the first task, the system will never exit this deadlocked state. Deadlock is a bad situation for an RTOS to find itself in.

Starvation – Occurs when a task cannot run because the resources it needs are always allocated to another task. The task starves because of a lack of resources.

As you can imagine, much has been written on the subjects of deadlock and starvation over the years and there are many proposed solutions. For example, there's Dekker's algorithm, which was the first known correct solution to mutual exclusion. It is a shared-memory mechanism that does not require a special "test and set" instruction (but is therefore limited to managing two competing tasks) and is attributed to the Dutch mathematician Theodorus Dekker. The most commonly used method to handle deadlock is the use of semaphores, which come in two types: binary semaphores and counting semaphores. A binary semaphore controls access to one resource—for example a hardware resource. Counting semaphores control access to a pool of identical, interchangeable resources such as memory buffers.

Typically each resource has a binary semaphore allocated to it. A requesting task will wait for the resource to become available before executing and once the task completes, it releases the resource. Binary semaphores commonly use WAIT and SIGNAL operations. A task will WAIT on a semaphore. When the resource is free, which could be immediately (or not), the operating system will give control of the resource to the task. When the task completes, it will SIGNAL completion and free the resource. However, if the resource is occupied when the task WAITs on the semaphore, the operating system suspends that task until the resource is free. The WAITing task might have to wait until the currently executing task is finished with the resource, or the WAIT might take longer if it is pre-empted by a higher priority task.
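The WAIT/SIGNAL pattern can be sketched with Python's threading module; `Semaphore.acquire` and `release` stand in for WAIT and SIGNAL, and the worker function is illustrative.

```python
# WAIT and SIGNAL on a binary semaphore, sketched with Python threads:
# each worker WAITs for the single resource, uses it, then SIGNALs.
import threading

resource = threading.Semaphore(1)   # binary semaphore: one resource
log = []

def worker(name):
    resource.acquire()              # WAIT - suspends if the resource is held
    log.append(f"{name} using resource")
    resource.release()              # SIGNAL - frees it for the next waiter

threads = [threading.Thread(target=worker, args=(f"task{i}",))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(log))  # all three tasks got the resource in turn
```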

Introducing the concept of task priority also brings up the problem of priority inversion. There is a more flexible class of binary semaphores called mutexes (the word "mutex" is an abbreviation of "mutual exclusion"), and these are often used by modern operating systems to prevent priority inversion.

Counting semaphores work in the same way as binary semaphores however they are used when more than one resource is available—data stores for instance. As each of the resources is allocated to requesting tasks, the count is reduced to show the number of free resources remaining. When the semaphore count reaches zero, there are no more resources available and any processes requesting one or more of these resources after the count reaches zero will be suspended until the requisite number of resources is released.
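A counting semaphore guarding a resource pool can be sketched in the same way; the pool size and bookkeeping here are illustrative, not tied to any particular RTOS API.

```python
# A counting semaphore guarding a pool of interchangeable buffers: the
# count tracks how many buffers remain free, and a task asking for a
# buffer when the count is zero is suspended until one is released.
import threading

POOL_SIZE = 2                          # two interchangeable buffers
pool = threading.Semaphore(POOL_SIZE)  # count = free buffers remaining
state = {"in_use": 0, "peak": 0}
state_lock = threading.Lock()

def use_buffer():
    pool.acquire()                     # count decremented; blocks at zero
    with state_lock:
        state["in_use"] += 1
        state["peak"] = max(state["peak"], state["in_use"])
    # ... the task would work on its buffer here ...
    with state_lock:
        state["in_use"] -= 1
    pool.release()                     # count incremented; a waiter resumes

threads = [threading.Thread(target=use_buffer) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(state["peak"] <= POOL_SIZE)  # never more buffers in use than exist
```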

Tasks often need to communicate with each other and there are a number of methods to accomplish this. The simplest method is to use a data store managed with semaphores as described above. More complex communication methods include message queues.

When using message queues, a task that wishes to send information to another task POSTs a message to the queue. When a task wishes to receive a message from a queue, it PENDs on the queue. Message queues therefore work like FIFOs.
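The POST/PEND pattern maps naturally onto Python's thread-safe `queue.Queue`, sketched below; the task names and message contents are illustrative.

```python
# POST and PEND on a message queue, sketched with queue.Queue (a
# thread-safe FIFO): the sender POSTs, the receiver PENDs, and messages
# come out in the order they went in.
import queue
import threading

mailbox = queue.Queue()

def sender():
    for i in range(3):
        mailbox.put(f"reading {i}")     # POST a message to the queue

def receiver(received):
    for _ in range(3):
        received.append(mailbox.get())  # PEND - blocks until a message arrives

received = []
rx = threading.Thread(target=receiver, args=(received,))
tx = threading.Thread(target=sender)
rx.start()
tx.start()
tx.join()
rx.join()
print(received)  # FIFO order: reading 0, reading 1, reading 2
```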

Over the next few blogs we will look at using FreeRTOS and Micrium µC/OS-III.

This is the last Arty blog of 2015, so have a Merry Christmas and a Happy New Year.


Arty – XADC Alarms & References


One of the most common uses of the XADC is health monitoring of the FPGA and the wider system; with that in mind, the XADC has a number of useful features which can be used.

  1. Trigger and reset threshold alarm registers for the internal voltages and temperature parameters.
  2. Over-temperature monitoring – the maximum junction operating temperature allowable for the device – with automatic shutdown possible.
  3. Maximum and minimum values – registers which contain the lowest and highest sampled values for each of the voltages and the device temperature.

These three elements make for a very useful health monitoring system; of course, the alarms and the over-temperature limit must be correctly configured and enabled first. This can be achieved in one of two ways: either via the XADC wizard within Vivado or via our software application.


There are seven possible alarms we can configure on the XADC; however, we cannot use all of them on the Artix silicon, as some alarms are dedicated to the Zynq (Alarm 6 Vccddro, Alarm 5 Vccpaux and Alarm 4 Vccpint). There is also an eighth alarm bit, which is the logical OR of the seven alarm bits and acts as an overall alarm.

We can see if an alarm has occurred via the alarm status output register, or we can configure an interrupt to occur on an alarm condition such that it can be dealt with immediately.


These trigger and reset values allow us to define values which align with our worst case analysis of supply voltages and junction temperature, ensuring we can protect the system properly.

Each alarm also has an associated output which can be used in the wider system, either to indicate via an LED that an issue has occurred or to take further action, e.g. graceful system degradation.

The over-temperature alarm is slightly different in that it can be configured to trigger an automatic shutdown of the device. To enable automatic shutdown, the four LSBs of the over-temperature register must be set high. Doing this means that 10 ms after the trigger level is reached, a shutdown occurs. This prevents re-configuration of the device until the reset level is reached.
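As a software-side illustration, the sketch below converts between degrees Celsius and the XADC's 12-bit temperature code using the 7 series transfer function documented in Xilinx UG480; the threshold value here is illustrative, not the device default.

```python
# A sketch of how the XADC's 12-bit temperature reading maps to degrees C
# and how an over-temperature trigger could be checked in software. The
# transfer function (code * 503.975 / 4096 - 273.15) is the 7 series one
# from Xilinx UG480; the trigger level below is illustrative only.

def xadc_code_to_celsius(code):
    """Convert a 12-bit XADC temperature code to degrees Celsius."""
    return code * 503.975 / 4096.0 - 273.15

def celsius_to_xadc_code(temp_c):
    """Inverse conversion, e.g. for programming a threshold register."""
    return round((temp_c + 273.15) * 4096.0 / 503.975)

OT_TRIGGER_C = 125.0                       # illustrative trigger level
ot_code = celsius_to_xadc_code(OT_TRIGGER_C)

reading = celsius_to_xadc_code(40.0)       # a typical junction temperature
print(reading >= ot_code)                  # False: no over-temperature alarm
```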

It is intended that the temperature alarm acts as a pre-warning that the temperature has exceeded what the design has calculated as its maximum. This way the system can take action to prevent the shutdown, e.g. turning on fans, reducing processing etc.


Meanwhile, the maximum and minimum registers provide the system with a simple method of quickly and easily checking the worst-case values observed during its period of operation. These registers can also be of good use in system commissioning, for example to record supply voltages and temperatures across worst-case environmental conditions.

There is also one last issue which must be addressed when using the XADC on the Arty board: we need to correctly set up the board to use the internal VRefp. Failure to do this results in inaccurate conversions. The Arty board is configured such that you can use either the internal or external reference.




We can ensure the internal reference is used by grounding the XADCVREF input. This can be confirmed by reading the flag register, which also helpfully contains information on any alarms.


With the XADC configured to correctly monitor the internal signals with suitable alarm levels, we can look at how we can use the XADC to receive analogue signals from the real world.