Nov 20 · 4 min read

MicroZed Chronicles: DSP58 and FP32 Mode

Last week we examined how we could use the IEEE VHDL 2008 fixed-point and floating-point packages to implement a simple polynomial approximation.

As discussed previously, the DSP58 is capable of operating in an FP32 mode; however, this mode cannot be inferred from the RTL. As such, we have several ways to implement the DSP58 in FP32 mode:

Instantiate the DSP58 in FP32 mode directly in our RTL.
Leverage the AMD Floating point IP provided within the Vivado IP library.
Use MathWorks Simulink HDL coder to generate the RTL leveraging the AMD Floating point libraries.
Use Vitis HLS to implement the algorithm and allow it to map to the DSP58 in FP32 mode.

We will look at the first three implementation options in future blogs over the coming weeks and months. This blog looks at the fourth option, Vitis HLS.

I must admit I was curious how using Vitis Unified IDE in the HLS flow would work with the DSP58 in floating point mode. I am a big fan of using higher levels of abstraction for algorithm development and this one seemed perfect to implement using HLS tools.

So with that in mind I created an example design which uses Vitis Unified IDE in the HLS flow.

The main objective of this experiment was to see the DSP58 implemented in its FP32 mode. When I set out to do this I was not sure what the exact results would be, but I did expect that, implemented correctly, it would use far fewer resources than the direct float VHDL example, which used 11735 LUTs, 328 registers and 13 DSPs.

Implementing the algorithm in C++ is surprisingly simple: we can leverage the Vitis hls_math.h library and use floating-point types directly for synthesis to VHDL / Verilog.

I created a project which targets the Versal AI Edge 2302 device.

I set the clock speed to 400 MHz, along with the flow for Vivado IP creation.

The C++ code for the IP core is very simple and can be seen below:

#include <hls_math.h> // For mathematical functions if needed

// Define the function to implement the equation
void calculate_polynomial(float &x, float &y) {
#pragma HLS PERFORMANCE target_ti=1 target_tl=1 unit=cycle
#pragma HLS PIPELINE II=1
#pragma HLS INTERFACE mode=axis register_mode=both port=y
#pragma HLS INTERFACE mode=axis register_mode=both port=x
    // Polynomial coefficients
    const float c4 = 2e-9;   // Coefficient for x^4
    const float c3 = -3e-8;  // Coefficient for x^3
    const float c2 = 0.001;  // Coefficient for x^2
    const float c1 = 2.4146; // Coefficient for x
    const float c0 = -251.71; // Constant term

    float x2, x3, x4;
    float term1, term2, term3, term4; 

    x2 =  x * x;
    x3 = x2 * x;
    x4 = x3 * x;

    term1 = x * c1;
    term2 = x2 * c2;
    term3 = x3 * c3;
    term4 = x4 * c4; 

    // Calculate the polynomial
    y = c0 + term1 + term2 + term3 + term4; 
}

This code is exercised by the test bench below, which is used for both C Simulation and Co Simulation.

#include <iostream>
#include "cvd.h"

using namespace std;

// Function declaration (must match the definition of the IP core above)
void calculate_polynomial(float &x, float &y);

int main() {
    float x = 109.0; // Input value for x
    float y;         // Variable to store the result

    // Call the function
    calculate_polynomial(x, y);

    // Display the result
    cout << "For x = " << x << ", y = " << y << endl;

    return 0;
}
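
The test bench above prints the result for inspection. One possible extension, sketched below, is to make it self-checking by comparing the kernel output against a double-precision model. The helper names `golden` and `check_polynomial`, the input range of 0 to 200, and the tolerance of 1e-3 are all assumptions chosen for illustration. The `calculate_polynomial` shown here is a software stand-in with the pragmas omitted so that the sketch compiles on its own.

```cpp
#include <cmath>
#include <cstdio>

// Software stand-in for the HLS kernel (pragmas omitted) so this sketch
// compiles on its own; a real test bench would link against the kernel source.
void calculate_polynomial(float &x, float &y) {
    const float c4 = 2e-9;
    const float c3 = -3e-8;
    const float c2 = 0.001;
    const float c1 = 2.4146;
    const float c0 = -251.71;

    float x2 = x * x;
    float x3 = x2 * x;
    float x4 = x3 * x;

    y = c0 + x * c1 + x2 * c2 + x3 * c3 + x4 * c4;
}

// Double-precision golden model (hypothetical helper)
double golden(double x) {
    return -251.71 + 2.4146 * x + 0.001 * x * x
           - 3e-8 * x * x * x + 2e-9 * x * x * x * x;
}

// Sweeps integer inputs from 0 to 200 and counts results that differ from
// the golden model by more than the tolerance. Range and tolerance are
// illustrative assumptions, not values taken from the original project.
int check_polynomial() {
    const double tolerance = 1e-3;
    int failures = 0;
    for (int i = 0; i <= 200; ++i) {
        float x = static_cast<float>(i);
        float y = 0.0f;
        calculate_polynomial(x, y);
        const double expected = golden(static_cast<double>(x));
        if (std::fabs(static_cast<double>(y) - expected) > tolerance) {
            std::printf("Mismatch at x = %g: got %g, expected %g\n", x, y, expected);
            ++failures;
        }
    }
    return failures;
}
```

In a Vitis HLS test bench, main() could simply return the value of check_polynomial(), since C Simulation and Co Simulation treat a non-zero return value from main() as a failed run. An absolute tolerance is used rather than a relative one because the polynomial passes close to zero near x = 100.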

I set the interfaces to be implemented as AXI Streaming.

Running the simulation shows we get the same results for the C Simulation and Co Simulation, which is what we would expect.

The really interesting aspect, however, is the implementation results. As the algorithm is basically multiplication and addition, Vitis HLS is able to map the operations directly to the DSP58 in its FP32 mode. This makes for a very efficient implementation of the algorithm, with only 7 DSPs, 253 flip-flops and 80 LUTs. The expected operating frequency in hardware is greater than 560 MHz, with an initiation interval of one and a latency of 20.

To verify this I ran the implementation from within Vitis Unified IDE and observed the implementation results.

This runs Vivado with the implementation mode set to out of context. The resource utilisation was a little higher in hardware (510 FFs, 372 LUTs), and the results show that timing closure can be achieved at the desired operating frequency.

As expected this is a much lower resource implementation than just using the VHDL floating point library.

It is also much easier to implement the algorithm using an HLS approach than it is to start writing lines of RTL to instantiate and configure the DSP58 in FP32 mode.

As I mentioned previously we will come back and look at the other approaches in blogs over the coming weeks and months.
