Agilex 7 FPGAs and SoCs M-Series – Examining the HBM2E Memory
- Adam Taylor
Over the last few years, Altera has released several new FPGA devices. We have previously examined the Agilex™ 5 E-Series and the Agilex 7 F-Series devices. In this article, we will review the Agilex 7 M-Series devices, which are intended for compute- and memory-intensive applications.
To support these applications, the Agilex 7 M-Series provides developers with support for LPDDR5, DDR5, and DDR4 external memories, with selected devices also offering in-package High Bandwidth Memory 2E (HBM2E).

For those unfamiliar with High Bandwidth Memory (HBM), HBM consists of SDRAM dies stacked on top of an HBM controller die, included within the same package as the FPGA. This approach provides a high-bandwidth memory option while using less power than a traditional discrete memory implementation. One of the keys to achieving the increased bandwidth is the interfacing: HBM implements several channels, each of which has a significantly wider data bus than traditional discrete memories.
HBM was first introduced in 2013, and the standard has gone through several evolutions since then. The Agilex 7 M-Series devices provide developers with two HBM2E stacks in configurations of either 4H/8GB or 8H/16GB. Depending on the FPGA package selected, this provides either 16GB or 32GB of HBM for the developer to use.
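As a quick sanity check on those capacity figures, the stack heights and package totals are consistent with 2 GB per DRAM die and two stacks per package (the per-die capacity here is inferred from the 4H/8GB figure, not taken from a datasheet):

```python
def total_hbm_gb(stack_height, die_gb=2, stacks_per_package=2):
    """Total in-package HBM capacity in GB.

    die_gb=2 is inferred from the 4H/8GB configuration (8 GB / 4 dies);
    Agilex 7 M-Series packages carry two HBM2E stacks.
    """
    return stack_height * die_gb * stacks_per_package

print(total_hbm_gb(4))  # 4H/8GB stacks -> 16 GB per package
print(total_hbm_gb(8))  # 8H/16GB stacks -> 32 GB per package
```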
Using HBM2E as an in-package memory inside an FPGA can enhance applications across various domains. In AI and ML, it enables fast data loading, low latency, and high throughput, leading to efficient and accurate training and inference of deep learning models. Large datasets are rapidly ingested, and matrix operations and gradient computations benefit from HBM2E’s multiple channels and low latency. In network processing, it can facilitate rapid packet ingestion, real-time data parsing, and efficient encryption/decryption, enhancing the security and efficiency of network traffic handling. Deep packet inspection and routing are expedited by HBM2E’s high bandwidth and multiple channels. For video processing, especially high-resolution content like 4K and 8K, HBM2E ensures rapid data transfer, real-time encoding/decoding, and smooth rendering/streaming, leveraging its high bandwidth and low latency. Overall, HBM2E significantly boosts the performance and efficiency of FPGA applications in AI, network processing, and video processing.
The physical connection between the Agilex 7 logic fabric and the HBM2E leverages Intel’s Embedded Multi-Die Interconnect Bridge (EMIB), which connects the HBM2E die to the Universal Interface Block Sub System (UIBSS) within the FPGA die. The UIBSS contains the HBM2E PHY, which drives the HBM2E die; the developer interfaces with the HBM using the hard memory Network on Chip (NoC) provided as part of the Agilex 7 M-Series fabric die. User logic connects to the NoC, and hence to the HBM2E, using an AXI4 interface.
User logic interfaces with the HBM2E through sixteen 64-bit pseudo channels, which are physically implemented as eight 128-bit channels. Splitting each channel into two pseudo channels provides significant advantages for the many applications that do not require the full 128-bit bus. Each pseudo channel operates independently, with its own unique address space. Pseudo channels, therefore, provide better efficiency and reduced latency for transactions.
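The channel arrangement can be sketched as simple bookkeeping of the figures above (eight 128-bit channels, each split into two independent 64-bit pseudo channels); this is just arithmetic, not a hardware model:

```python
CHANNELS = 8
CHANNEL_BITS = 128

# Enumerate (channel, pseudo-channel) pairs; each pseudo channel gets
# half of its parent channel's data bus and its own address space.
pseudo_channels = [
    {"channel": ch, "pseudo": pc, "bits": CHANNEL_BITS // 2}
    for ch in range(CHANNELS)
    for pc in range(2)
]

print(len(pseudo_channels))                     # 16 pseudo channels
print(sum(p["bits"] for p in pseudo_channels))  # 1024-bit aggregate data bus
```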

To ensure high performance, the NoC operates at a 1400 MHz clock frequency, bridging between the user application and the HBM2E PHY. Users can only access the HBM2E via the NoC.
Of course, if you are thinking of developing an HBM application, it makes sense to start de-risking your application using a development board. For the Agilex 7 M-Series, the Agilex™ 7 FPGA M-Series development kit is a good starting point for most projects.
It provides users with a range of interfaces, including QSFP, PCIe, FMC, and DDR5 (both on board and via DIMM), plus onboard LPDDR5. On the block diagram, we can clearly see both HBM2E stacks, identified as U51 and U50.

We are using this development kit board to examine a reference design to demonstrate the performance of the NoC and HBM2E.
This application targets the HBM (U50) and will allow us to determine the real-world throughput against the theoretical throughput.
The theoretical throughput for an HBM2E instance can be calculated as:

Bandwidth (GB/s) = Memory Clock (MHz) × 2 (DDR) × Bus Width (bits) ÷ 8 ÷ 1000
For the target device, with a memory interface clock of 1400 MHz, each 64-bit pseudo channel has a theoretical bandwidth of 22.4 GB/s. This is for -2V devices. For increased performance, the -1V devices support the fastest fabric, with HBM memory interfaces that can run as fast as 1600 MHz.
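That calculation is easy to reproduce. The helper below simply applies the formula (clock rate, doubled for DDR, times the bus width in bytes):

```python
def hbm_theoretical_gbps(clock_mhz, bus_bits):
    """Theoretical bandwidth in GB/s for one memory interface.

    clock_mhz : memory interface clock in MHz (DDR moves data on both edges)
    bus_bits  : data bus width in bits
    """
    transfers_per_sec = clock_mhz * 1e6 * 2      # DDR: two transfers per clock
    bytes_per_transfer = bus_bits / 8
    return transfers_per_sec * bytes_per_transfer / 1e9

print(hbm_theoretical_gbps(1400, 64))  # 22.4 GB/s per -2V pseudo channel
print(hbm_theoretical_gbps(1600, 64))  # 25.6 GB/s per -1V pseudo channel
```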
The example design provided with the Agilex 7 development board to test the HBM bandwidth offers several configurations for the traffic generator. The traffic generator is a new, advanced, stand-alone IP called the Test Engine IP: a synthesizable traffic generator capable of producing a wide variety of traffic patterns. Thanks to its instruction RAM and the software API through which patterns are specified, it can be reprogrammed without recompiling the programming file. The main difference between the supplied traffic patterns is the burst length, so running the experiment shows the impact of different burst lengths on the achieved channel bandwidth.
To get started, the first thing we need to do is program the board with the example HBM2E design.

Once this has been achieved, we can use Signal Tap, the internal logic analyzer, to double-check that the design initialization completed without any errors.

With the device programmed and the initialization checked, we can run the example TCL scripts to measure the bandwidth achieved in reading and writing to the HBM2E memory.
To run each test, several function calls are required: reset the counters, load the test for the desired burst length, execute the test, and then read back the results. We interact with the design using the System Console within Quartus itself.

What can be seen is that larger burst lengths provide higher bandwidths. This is to be expected, as the AXI link is more efficient when transferring larger data blocks.
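A toy model illustrates why: if every AXI burst carries a fixed overhead of a few beats' worth of command and turnaround time, the fraction of link time spent moving data rises with burst length. The four-beat overhead below is an illustrative assumption, not a measured figure for this design:

```python
def link_efficiency(burst_len_beats, overhead_beats=4):
    """Fraction of link time spent transferring data, assuming a fixed
    per-burst overhead (overhead_beats=4 is an illustrative guess)."""
    return burst_len_beats / (burst_len_beats + overhead_beats)

# Efficiency climbs as the fixed overhead is amortized over longer bursts.
for bl in (1, 4, 16, 64):
    print(bl, round(link_efficiency(bl), 3))
```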
Each run presents the user with not only the actual bandwidth but also the latency, which varies depending on the size of the transaction.
The write efficiency for all of the burst lengths is presented in the following graph.


This demo design has been optimized to present the best possible case. Its performance is very close to the theoretical limit, which shows that, with careful thought and planning, developers can achieve similar levels of performance in their own FPGA designs. The high-speed NoC contributes significantly to these results.
UK FPGA Conference
FPGA Horizons - October 7th 2025 - THE FPGA Conference, find out more here
Embedded System Book
Do you want to know more about designing embedded systems from scratch? Check out our book on creating embedded systems, which walks you through all the stages of requirements, architecture, component selection, schematics, layout, and FPGA/software design. We designed and manufactured the board at the heart of the book! The schematics and layout are available in Altium here. Learn more about the board (see previous blogs on bring-up, DDR validation, USB, and sensors) and view the schematics here.
Sponsored by Altera