FPGA Prototyping in HLS C++ (Part 2)


Preface
In the previous blog, we discussed what HLS is and how it works in general. In this blog, I will delve into the core features and fundamentals of HLS C++. Once again, I will use Microchip’s SmartHLS as an example throughout this post.
HLS C++ Differentiating Features
High-Level Synthesis (HLS) C++ is a specialized subset of C++ used to develop hardware designs. HLS tools allow engineers to write algorithms at a high level of abstraction in C++; the written program is automatically converted into a hardware description language (HDL) such as Verilog or VHDL, which can then be synthesized into actual hardware.
Despite the fact that HLS C++ allows designers to use standard C++ syntax to describe hardware behaviors, there are several key differences between HLS C++ and standard C++ to be discussed below.
1. Directive-Based Optimization
HLS C++ allows users to specify high-level directives, such as #pragma, to guide the synthesis tool on how to optimize the hardware, for example through pipelining and loop unrolling. These directives differ from anything in standard C++ because they influence the generated hardware structure, not just software execution.
For example, in SmartHLS we can maximize the throughput of a given loop by using #pragma HLS loop pipeline II(1):
const int SIZE = 100;
int sum = 0;
HLS_ADD_LOOP:
#pragma HLS loop pipeline II(1)
for (int i = 0; i < SIZE; ++i) {
    sum += i * 6;
}
From a pure software standpoint, this HLS_ADD_LOOP simply adds a number to the variable sum 100 times; there is not much room to improve the loop's performance. From a hardware standpoint, however, we can increase the throughput of HLS_ADD_LOOP by making the following operations complete in one clock cycle:
operation 1: the multiply-add (i.e., MAC) operation
operation 2: the memory read and memory write operations
Since modern FPGAs have dual-port embedded RAM blocks and math blocks, operations 1 and 2 can easily be done in one clock cycle, as there is no memory-access contention or chained multiplication. Therefore, by specifying #pragma HLS loop pipeline II(1), the synthesis tool will schedule operations 1 and 2 so that the line sum += i * 6 can be initiated every clock cycle, which is exactly what II(1) means.
II, the initiation interval, is the number of cycles between starting successive iterations of the loop. The best performance and hardware utilization are achieved when II=1, meaning that a new loop iteration can begin every clock cycle.
2. Hardware-Specific Constructs
HLS tools introduce specialized constructs to help C++ code be synthesized into hardware. For example, hls::stream is a FIFO data structure designed specifically for data streaming, as in video processing applications. Such a class is certainly not part of standard C++. Conversely, some built-in constructs from standard C++, such as std::vector, are unavailable in HLS C++: dynamic allocation is typically unsupported because there is no way to map a variable-length container onto fixed-size RAM blocks.
3. Synchronization and Concurrency
In HLS, we can define how data flows among components, as well as how parallelism should be implemented (e.g., pipelining, parallel loops). This differs from standard C++, where concurrency is often managed using multi-threading libraries and system-specific concurrency models.
4. Functional Simulation and Hardware Simulation
HLS tools can simulate the C++ code both in functional software simulation and in RTL-level simulation such as CoSim, allowing algorithms to be verified before they are mapped to hardware. In SmartHLS particularly, software simulation evaluates the top-level function (the function to be synthesized into a hardware module) by running it as a regular C++ function, whereas CoSim evaluates the top-level function by first synthesizing it into a hardware module and then launching ModelSim to perform the RTL simulation.
Unique HLS C++ Features
In addition to its differences from standard C++, HLS C++ offers a unique set of features specifically designed to streamline the development of HLS applications.
1. SW/HW Co-simulation
SW/HW co-simulation is an extremely useful tool for verifying that the HLS-generated hardware produces the same outputs as the software for the same inputs. It is often the best tool developers can rely on for functional verification before running the associated bitstream on the FPGA fabric. With SW/HW co-simulation, users do not have to write their own RTL testbench, as one is generated automatically. Users who already have a custom RTL testbench can optionally use it instead of SW/HW co-simulation.
To use SW/HW co-simulation, the input software program will be composed of two parts:
A top-level function (and its descendant functions) to be synthesized to hardware by SmartHLS,
A C/C++ testbench (the parent functions of the top-level function, typically main()) that invokes the top-level function with test inputs and verifies outputs.
For example, the following project shows a top-level function hlsSum, which is to be synthesized into hardware, and main(), which acts as the testbench:
// simple_add.cpp
#include <cstdio>

void hlsSum(const int a, const int b, int &sum) {
#pragma HLS function top
    sum = a + b;
}

int main() {
    int hls_sum, sw_sum;
    int err = 0;
    for (int i = 0; i < 5; i++) {
        for (int j = 0; j < 5; j++) {
            hlsSum((const int)i, (const int)j, hls_sum);
            sw_sum = i + j;
            err += sw_sum - hls_sum;
        }
    }
    if (err == 0) {
        printf("PASS\n");
        return 0;
    } else {
        printf("FAIL\n");
        return 1;
    }
}
It is clear that designers can write a testbench just as they would a regular C++ main function. To run CoSim for RTL verification with Microchip's SmartHLS, we can use the following command in a Linux terminal (it also works in Windows PowerShell, but I will only demonstrate its use on Linux here):
shls -a cosim
SmartHLS does the following steps when the above command is executed:
Runs the software program and saves all the inputs passed to the top-level function.
Creates an RTL testbench that reads in the inputs from step 1 and passes them into the SmartHLS-generated hardware module.
Launches ModelSim that simulates the testbench and saves the SmartHLS-generated module outputs.
Runs the software program again but uses the simulation outputs as the top-level function output.
The following screenshot shows the CoSim execution result of this simple sum project:
As seen above, CoSim reports not only the simulation result but also a set of metrics:

| Metric Name | Description |
| --- | --- |
| Number of calls | The number of times the top-level function is called by the testbench. |
| Simulation time (cycles) | The total clock cycles spent in RTL simulation. |
| Call Latency (min/max/avg) | The minimum, maximum, and average call latency of the hlsSum function. |
| Call II (min/max/avg) | The minimum, maximum, and average initiation interval of the hlsSum function. This metric can be improved by pipelining the entire function via #pragma HLS function pipeline (more on this in later sections). |
From the above report, we know that hlsSum can be invoked every 4 clock cycles, with a call latency of 4 clock cycles. The testbench calls hlsSum 25 times in total, which matches the test vectors used.
2. Custom C++ Libraries
SmartHLS includes a number of C/C++ libraries that enable the creation of efficient hardware; two key libraries worth mentioning are:
Arbitrary Precision Integer Library
Arbitrary Fixed Point Library
In this section, I will only provide a brief overview of each of these libraries due to space constraints. For those interested in learning more, please refer to the SmartHLS user guide [1].
2.1 Arbitrary Precision Integer Library
The C++ ap_[u]int type allows specifying signed and unsigned data types of any bit width. These types can be used for arithmetic, concatenation, and bit-level operations. Consider the following example:
#include "hls/ap_int.hpp"
#include <iostream>
using namespace hls;

int main() {
    ap_uint<128> data("0123456789ABCDEF0123456789ABCDEF");
    ap_int<4> res(0);

    // print data as a decimal number
    std::cout << "data.to_string(10,true) = " << data.to_string(10, true)
              << std::endl;

    for (ap_uint<8> i = 0; i < data.length(); i += 4) {
        // If this four-bit range of data is <= 7
        if (data(i + 3, i) <= 7) {
            res -= 1;
        } else {
            res += 1;
        }
    }

// iostream doesn't synthesize to hardware, so only include this
// line in software compilation. Any block surrounded by this ifdef
// will be ignored when compiling to hardware.
#ifndef __SYNTHESIS__
    std::cout << res << std::endl;
#endif
    return 0;
}
In the code above, we first print the variable data as a decimal number, then iterate through the 128-bit unsigned integer in four-bit segments, tracking the difference between how many segments are above 7 and how many are at or below 7. All variables have been reduced to their specified minimum widths.
2.2 Arbitrary Fixed Point Library
The Arbitrary Precision Fixed Point library provides fast, bit-accurate software simulation and efficient, equivalent hardware generation. The C++ ap_[u]fixpt types allow specifying signed and unsigned fixed-point numbers of arbitrary width and arbitrary fixed position relative to the decimal point. They can be used for arithmetic, concatenation, and bit-level operations.
The ap_[u]fixpt template allows specifying the width of the type, how far the most significant bit is above the decimal point, as well as several quantization and overflow modes. The template ap_[u]fixpt<W, I_W, Q_M, O_M> is described in the following table:
| Parameter | Description |
| --- | --- |
| W | The width of the word in bits. |
| I_W | How far the most significant bit is above the decimal point. I_W can be negative: I_W > 0 implies the MSB is above the decimal point, while I_W <= 0 implies it is below. |
| Q_M | The quantization (rounding) mode used when a result has precision below the least significant bit. Defaults to truncating bits below the LSB, bringing the result closer to -∞. |
| O_M | The overflow mode used when a result exceeds the maximum or minimum representable value. Defaults to wrapping around between the minimum and maximum representable values in the range. |
Generally, a fixed-point number can be thought of as a signed or unsigned integer word multiplied by 2^(I_W - W). The range of values that an ap_[u]fixpt can take on, as well as the quantum separating those values, is determined by the W and I_W template parameters.
The Arbitrary Precision Fixed Point library supports all standard arithmetic, bitwise logical, shift, and comparison operations. During arithmetic, intermediate results are kept in a type wide enough to hold all possible resulting values. The following example presents a few arithmetic operations performed on fixed-point variables.
#include "hls/ap_fixpt.hpp"
#include <iostream>
#include <stdio.h>
using namespace hls;
//...
ap_ufixpt<65, 14> a = 32.5714285713620483875274658203125;
ap_ufixpt<15, 15> b = 7;
ap_fixpt<8, 4> c = -3.125;

// The resulting type is wide enough to hold all
// 51 fractional bits of a and 15 integer bits of b.
// The width and integer width are increased by 1 to hold
// all possible results of the addition.
ap_ufixpt<67, 16> d = a + b; // 39.5714285713620483875274658203125
std::cout << "d = " << d << std::endl;

// The resulting type is a signed fixed point whose width
// and integer width are the sums of the two operands' widths.
ap_fixpt<23, 19> e = b * c; // -21.875
std::cout << "e = " << e << std::endl;

// Assignment triggers the AP_TRN_ZERO quantization mode.
// AP_TRN_ZERO truncates bits below the LSB, bringing
// the result closer to zero.
ap_fixpt<8, 7, AP_TRN_ZERO> f = e; // -21.5
std::cout << "f = " << f << std::endl;
3. Basic SmartHLS Pragmas
HLS pragmas can be applied to the software code to apply HLS optimization techniques and/or guide the compiler's hardware generation. They are placed directly on the applicable software construct (i.e., function, loop, argument, array) to specify a certain optimization for it. In this section, we cover the most common ones:
Loop pipelining
Function pipelining
Loop unrolling
3.1 Loop Pipelining
Loop pipelining is an optimization that can automatically extract loop-level parallelism to create an efficient hardware pipeline. It allows executing multiple loop iterations concurrently on the same pipelined hardware.
To use loop pipelining, the user needs to specify the loop pipeline pragma above the applicable loop. In the following code, we apply #pragma HLS loop pipeline to pipeline the for loop and achieve the minimum initiation interval (II). As noted earlier, the best performance and hardware utilization are achieved when II=1, meaning that successive iterations of the loop can begin every clock cycle.
#pragma HLS loop pipeline
for (i = 1; i < N; i++) {
    a[i] = a[i-1] + 2;
}
You may wonder whether SmartHLS can optimize any loop to achieve an initiation interval (II) of 1. Unfortunately, the answer is no. There are several constraints on how loops must be coded for SmartHLS to achieve an II of 1; a few key restrictions include:
Use constant loop bounds
Avoid conditionals in the loop body
Minimize function calls within the loop, since the function calls will be inlined into the loop
3.2 Function Pipelining
When a function is marked to be pipelined using #pragma HLS function pipeline, SmartHLS will implement the function as a pipelined circuit that can start a new invocation every II cycles. That is, the circuit can execute again while its previous invocation is still executing, allowing it to continuously process incoming data in a pipelined fashion. This can significantly reduce the simulation time and potentially the call latency as well.
Taking simple_add.cpp as an example again: if we call the hlsSum function without the function pipelining pragma, we can only invoke it every 4 clock cycles, whereas function pipelining allows us to invoke the same function every clock cycle.
Specifically, when we add pipeline to the function pragma, the SW/HW co-simulation shows an average II of around 1 and a simulation time of 29 cycles. Without pipeline, the average II is 4 and the simulation time is more than three times longer than that of the pipelined version.
3.3 Loop Unrolling
SmartHLS allows the user to unroll a loop through the use of a pragma, #pragma HLS loop unroll:
#pragma HLS loop unroll
for (i = 0; i < N; i++) {
    ...
}
This unrolls the loop completely. Unrolling a loop can improve performance as the hardware units for the loop body are replicated, but it also increases area. We can also specify a loop to be partially unrolled, to prevent the area from increasing too much.
#pragma HLS loop unroll factor(2)
for (i = 0; i < N; i++) {
    ...
}
This unrolls the loop by a factor of 2: the number of loop iterations is halved and the loop body is duplicated.
Summary
In this blog, we explored the key features of SmartHLS C++, custom C++ libraries, and essential SmartHLS pragmas. These serve as the foundational elements for programmatically building an FPGA IP core in HLS C++. In the next blog, I will demonstrate how to use these fundamentals to build an HLS core for a real-world project, while also comparing the HLS C++ approach with the traditional RTL approach.
Reference
[1] https://microchiptech.github.io/fpga-hls-docs/2023.2/userguide.html
Written by

Shuran Xu
I am currently a senior software engineer at Microchip working on embedded applications in HLS C++. In my free time I love building side projects in a wide variety, ranging from RTL -level modular design to simple games in C++. Additionally, I also enjoy writing technical blogs to share my knowledge and insights with everyone interested in the tech world.