Preface

In the previous post, we explored the key features of HLS C++ and the fundamentals of SmartHLS. In this post, I discuss the HLS application development life cycle, using the RGB2YCbCr project as an example.

HLS Application Development Life Cycle

In my view, developing an HLS C++ application differs somewhat from the traditional HDL approach, although both RTL and HLS designers need a solid grasp of FPGA concepts. The table below highlights a few key differences:

HLS C++ development	RTL development
Designers express I/O ports for the module via HLS pragmas	Designers explicitly define I/O ports in the module definition
Designers implement the module as a C++ function at the algorithmic level	Designers implement every detail in an HDL (e.g. Verilog or VHDL)
Designers must have knowledge of both HLS C++ and FPGA concepts to efficiently implement hardware at the algorithmic level.	Designers must have knowledge of both HDL coding and FPGA concepts to efficiently implement hardware at the RTL level.
Designers can optionally perform functional verification at the software level via SW/HW Co-Simulation. In other words, designers can write the testbench in C++.	Designers must write an RTL testbench (e.g. in SystemVerilog) to verify the module functionality.

Developing an FPGA core in HLS C++ can shorten time to market, but it sacrifices some design flexibility. Nevertheless, HLS C++ is becoming more widely used in industries that require high-performance computing and hardware acceleration, such as telecommunications and automotive.

Before diving into a real HLS design, it's essential to follow a structured design flow to ensure the feasibility and determinism of the HLS IP implementation. A well-defined development approach also helps systematically improve the quality of the HLS IP. The following figure shows the general design flow of an HLS IP design as I understand it:

Developing an HLS core is also an iterative process involving verification and optimization, just like developing a core in HDL. The QoR (quality of results) testing step checks that the final design meets the specified requirements, such as Fmax, resource usage, and other design parameters. The Instrument step is optional and is needed only when runtime behavior diverges from simulation behavior, or when undefined behavior is observed.

RGB2YCbCr FPGA Prototyping Example

Now that we are familiar with the structured HLS development methodology, let's write a simple RGB2YCbCr module in both SmartHLS C++ and Verilog to experience the differences between the two approaches. Please note that RGB2YCbCr is my open-source project, and readers are free to check out the full source code here.

RGB and YCbCr Overview

RGB2YCbCr refers to the process of converting an image or video from the RGB color space to the YCbCr color space. RGB is the most common color model used for displays (like monitors and TVs) and digital cameras. In this model, colors are created by mixing different intensities of red, green, and blue light. YCbCr is a color space commonly used in video compression and broadcasting. It separates the image into three components:

Component	Description
Y (Luma)	Represents the brightness of the color. It can be derived from the RGB components and carries most of the image's detail.
Cb (Chroma Blue)	Represents the difference between the blue color and the brightness.
Cr (Chroma Red)	Represents the difference between the red color and the brightness.

To convert from RGB to YCbCr, the following formula is commonly used:

Y = 0.299 * R + 0.587 * G + 0.114 * B
Cb = -0.1687 * R - 0.3313 * G + 0.5 * B + 128
Cr = 0.5 * R - 0.4187 * G - 0.0813 * B + 128

Note: the +128 in Cb and Cr ensures that these components are shifted to be in the range [0, 255], as they may go negative otherwise.

Now that we know what RGB and YCbCr are, we can start designing the RGB2YCbCr module in Verilog and SmartHLS C++.

HDL Approach

The HDL approach consists of the module design and the associated testbench implementation.

Module Design

We define a module called rgb2ycrcb with the following declaration:

module rgb2ycrcb(clk, rst, r, g, b, y, cr, cb);
    input        clk;
    input        rst;
    input  [9:0] r, g, b;

    output reg [9:0] y, cr, cb;
//...
endmodule

To design the circuit efficiently, we need to apply a few FPGA-specific optimization techniques. The first is coefficient quantization, which converts floating-point calculations into integer calculations. Since floating-point arithmetic is expensive in FPGA fabric, we quantize the floating-point coefficients to fixed-point integers (written below in hexadecimal). The typical way is to scale up first and scale down afterwards.

Taking the Y equation as an example:

Y = 0.299 x R + 0.587 x G + 0.114 x B

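We scale the floating-point coefficients into integer format by multiplying each coefficient by 1024 (2^10) and dropping the fractional part:

0.299 x 1024 = 306.176 ≈ 0x132 in hexadecimal
0.587 x 1024 = 601.088 ≈ 0x259 in hexadecimal
0.114 x 1024 = 116.736 ≈ 0x074 in hexadecimal

These hexadecimal values are then used as multipliers. Note that the result is still scaled by 1024; the scale-down happens later, when bits [19:10] of the sum are selected for the output: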

Y = 0x132 x R + 0x259 x G + 0x074 x B

Similarly, we can approximate the assignment for Cr and Cb:

Cr = 0x200 x R - 0x1AD x G - 0x053 x B → Cr = (1 << 9) x R - 0x1AD x G - 0x053 x B

Cb = B x 0x200 - 0x0AD x R - 0x153 x G → Cb = (1 << 9) x B - 0x0AD x R - 0x153 x G
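The scale-up and scale-down idea for Y can be sketched in plain C++. This is only an illustration, not part of the project: the names kShift, luma_fixed and luma_float are my own, and the inputs are assumed to be unsigned 10-bit values as in the module declaration.

```cpp
#include <cstdint>

// Scale factor used in the text: coefficients are multiplied by 2^10 = 1024.
constexpr int kShift = 10;

// Integer coefficients from the text (0.299, 0.587, 0.114 scaled by 1024, fraction dropped).
constexpr std::uint32_t kYr = 0x132;  // 306
constexpr std::uint32_t kYg = 0x259;  // 601
constexpr std::uint32_t kYb = 0x074;  // 116

// Fixed-point luma: scale up with integer multiplies, then scale down with a right shift.
constexpr std::uint32_t luma_fixed(std::uint32_t r, std::uint32_t g, std::uint32_t b) {
    return (kYr * r + kYg * g + kYb * b) >> kShift;
}

// Floating-point reference, for comparison only.
constexpr double luma_float(double r, double g, double b) {
    return 0.299 * r + 0.587 * g + 0.114 * b;
}
```

For the pixel (100, 200, 50), luma_fixed returns 152 while the floating-point formula gives about 153.0. Dropping the fractional part of the coefficients and of the final shift both round down, which is why a small error margin is tolerated when the outputs are checked later.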

The other technique is pipelining, which partitions a large combinational or sequential circuit into multiple clock stages separated by registers, so that the resulting circuit achieves a higher Fmax and higher throughput. In this case, instead of calculating Y, Cr and Cb in a single clock cycle, we calculate each term and store it in a pipeline register. For Y, we compute and store each of the three terms in its own pipeline register, then sum them in the next stage to produce Y:

// We can pipeline the calculation for Cr and Cb in the same way.
always@(posedge clk)
        if (rst) begin
            yr <= 0;
            yb <= 0;
            yg <= 0;
            y1 <= 0;
        end else begin
            yr <= 10'h132 * r;
            yg <= 10'h259 * g;        
            yb <= 10'h074 * b;
            y1 <= yr + yg + yb;
        end
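To make the timing of the two register stages easier to see, here is a small software model of the Verilog snippet above. It is a sketch under my own naming (LumaPipeline and clock are hypothetical), and it only mimics how non-blocking assignments read the old register values on each clock edge.

```cpp
#include <cstdint>

// Software model of the two register stages in the Verilog snippet above.
struct LumaPipeline {
    std::uint32_t yr = 0, yg = 0, yb = 0;  // stage 1: product registers
    std::uint32_t y1 = 0;                  // stage 2: sum register

    // One rising clock edge. Non-blocking assignments read the old register
    // values, so stage 2 is updated from the previous products first.
    void clock(std::uint32_t r, std::uint32_t g, std::uint32_t b) {
        y1 = yr + yg + yb;  // uses the products latched on the previous edge
        yr = 0x132 * r;
        yg = 0x259 * g;
        yb = 0x074 * b;
    }
};
```

Feeding the pixel (100, 200, 50) on one edge leaves y1 at 0; the scaled sum 156600 only appears in y1 after the next edge. That extra cycle of latency is the price of the shorter combinational path per stage.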

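Before assigning y1 to the output port y, we check its bounds so that only valid values reach the output. In the expression below, bit 21 acts as the sign bit (negative results are clamped to 0) and bit 20 flags overflow (results saturate to all ones):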

// We can check boundary for Cr and Cb in the same way.
always@(posedge clk)
        if (rst) begin
            y <= 0;
        end else begin
            // check Y
            y <= (y1[19:10] & {10{!y1[21]}}) | {10{(!y1[21] && y1[20])}};
        end
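The bit manipulation in the boundary check can be hard to read at first, so here is the same logic written as a plain C++ function. This is a sketch of my reading of the expression above; saturate10 is a hypothetical name, and y1 is assumed to be a 22-bit value held in a 32-bit integer.

```cpp
#include <cstdint>

// y1 is treated as a 22-bit value:
//   bit 21     = sign (negative result)
//   bit 20     = overflow past the 10-bit output range
//   bits 19:10 = the scaled-down result
std::uint16_t saturate10(std::uint32_t y1) {
    const bool negative = ((y1 >> 21) & 1u) != 0;
    const bool overflow = !negative && (((y1 >> 20) & 1u) != 0);
    if (negative) return 0;      // clamp negative results to 0
    if (overflow) return 0x3FF;  // saturate to the 10-bit maximum
    return static_cast<std::uint16_t>((y1 >> 10) & 0x3FFu);
}
```

For example, the scaled sum 156600 from the pixel (100, 200, 50) maps to 152, a value with only bit 20 set maps to 1023, and any value with bit 21 set maps to 0.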

Since small deviations in the outputs are acceptable after quantization, the testbench uses a parameter emargin to define the acceptable error margin. This parameter is used when comparing each output value against the golden values:

// Check results
    always@(my or mcr or mcb)
    begin
        iy = y;
        if ( ( iy < my - emargin)  || (iy > my + emargin) )
            $display("Y-value error. Received %d, expected %d. R = %d, G = %d, B = %d", y, my, r[3], g[3], b[3]);

        icr = cr;
        if ( ( icr < mcr - emargin)  || (icr > mcr + emargin) )
            $display("Cr-value error. Received %d, expected %d. R = %d, G = %d, B = %d", cr, mcr, r[3], g[3], b[3]);

        icb = cb;
        if ( ( icb < mcb - emargin)  || (icb > mcb + emargin) )
            $display("Cb-value error. Received %d, expected %d. R = %d, G = %d, B = %d", cb, mcb, r[3], g[3], b[3]);
    end
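The three comparisons above all follow the same pattern, which can be expressed as a single predicate. The C++ helper below is a hypothetical equivalent (within_margin is my own name), shown only to make the pass condition explicit.

```cpp
#include <cstdlib>

// True when `actual` lies within +/- margin of `expected`.
// This is the pass condition; the Verilog above reports an error when it is false.
bool within_margin(int actual, int expected, int margin) {
    return std::abs(actual - expected) <= margin;
}
```

With a margin of 2, an output of 152 against a golden value of 153 passes, while 150 against 153 would be reported as an error.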

The full code of the rgb2ycrcb module can be found here.

Testbench Design

To ensure thorough testing of the module, we design a test vector that combines the following:

64 incremental values for R
64 incremental values for G
64 incremental values for B

The test vector can be iterated via nested for loops:

for (r_idx = 0; r_idx <= r_length; r_idx = r_idx + 1) begin
    for (g_idx = 0; g_idx <= g_length; g_idx = g_idx + 1) begin
        for (b_idx = 0; b_idx <= b_length; b_idx = b_idx + 1) begin
            @(posedge clk);
            r[0] <= r_idx;
            g[0] <= g_idx;
            b[0] <= b_idx;
        end
    end
end

In addition, to evaluate the module output we need to implement a golden model inside the testbench. We apply the same scaling technique to the golden model, but this time we scale the coefficients by 1000 and write them in decimal:

always@(r[3] or g[3] or b[3])
    begin
        my  = (299 * r[3]) + (587 * g[3]) + (114 * b[3]);
        if (my < 0)
            my = 0;

        my = my /1000;
        if (my > 1024)
            my = 1024;

        mcr = (500 * r[3]) - (419 * g[3]) - ( 81 * b[3]);
        if (mcr < 0)
            mcr = 0;

        mcr = mcr /1000;
        if (mcr > 1024)
            mcr = 1024;

        mcb = (500 * b[3]) - (169 * r[3]) - (332 * g[3]);
        if (mcb < 0)
            mcb = 0;

        mcb = mcb /1000;
        if (mcb > 1024)
            mcb = 1024;
    end

The full code of the RGB2YCbCr_tb module can be found here.

Now we can run the RTL simulation to perform the evaluation. I provide two ways to launch the simulation:

Use the open-source Icarus Verilog
Use ModelSim Pro, shipped with the Libero SoC suite

Icarus Verilog Simulation

We can run the following commands to simulate the design:

# compile RGB2YCbCr_tb.sv to get the verilog virtual processor file using iverilog
iverilog -g2012 -o RGB2YCbCr_tb.vvp RGB2YCbCr_tb.sv
# run the verilog simulation
vvp -n RGB2YCbCr_tb.vvp

I wrote a script compile_and_run.sh to launch iverilog, so you can run RTL simulation by simply running:

./compile_and_run.sh

The simulation output is shown below; no error messages are reported:

ModelSim Simulation

I wrote tb.tcl, which compiles RGB2YCbCr.v and RGB2YCbCr_tb.sv, and a Makefile that calls tb.tcl internally, so you can launch the simulation by simply running:

make

The tb.tcl script compiles RGB2YCbCr_tb.sv and RGB2YCbCr.v using vlog and launches the simulation using vsim. The full implementation can be found here. Running make produces the same output messages in ModelSim as shown before:

SmartHLS C++ Approach

The SmartHLS C++ approach also consists of the module design and the associated software testbench implementation.

Module Design

The C++ implementation of the module is much more straightforward than the RTL approach, since we no longer need to hand-code FPGA-specific optimizations such as pipelining to improve performance and pass the timing check. Instead, we simply use SmartHLS pragmas:

#pragma HLS function top
#pragma HLS function pipeline

to declare that the function RGB2YCbCr_smarthls is the HLS core to be synthesized and pipelined:

// Fixed point type: Q12.6
// 12 integer bits and 6 fractional bits
typedef ap_fixpt<18, 12> fixpt_t;

void RGB2YCbCr_smarthls(hls::FIFO<RGB>   &input_fifo,
                        hls::FIFO<YCbCr> &output_fifo) {

#pragma HLS function top
#pragma HLS function pipeline

    RGB in = input_fifo.read();

    YCbCr ycbcr;

    // change divide by 256 to right shift by 8, add 0.5 for rounding
    ycbcr.Y = fixpt_t(4) + ((fixpt_t(65.738) * in.R + fixpt_t(129.057) * in.G + fixpt_t(25.064) * in.B ) >> 8) + fixpt_t(0.5);
    ycbcr.Cb = fixpt_t(128) - ((fixpt_t(37.945) * in.R + fixpt_t(74.494) * in.G - fixpt_t(112.439) * in.B) >> 8) + fixpt_t(0.5);
    ycbcr.Cr = fixpt_t(128) + ((fixpt_t(112.439) * in.R - fixpt_t(94.154) * in.G - fixpt_t(18.285) * in.B) >> 8) + fixpt_t(0.5);

    output_fifo.write(ycbcr);
}

As you can see above, the implementation is no different from a regular C++ function, except that we use a fixed-point data type for fast, bit-accurate software simulation and efficient hardware generation. I covered the fixed-point C++ library in Part 2 of this blog series; you can learn about it here, or explore the official SmartHLS User Guide.
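To get a feel for what 6 fractional bits mean for the coefficients, the following plain C++ sketch snaps a real value onto the Q12.6 grid. It is an emulation only and does not use the SmartHLS ap_fixpt type; quantize_q12_6 is a hypothetical helper, and I am assuming the value is truncated toward negative infinity, which should be checked against the quantization mode actually configured for ap_fixpt.

```cpp
#include <cmath>

// Number of fractional bits in the Q12.6 format described above.
constexpr int kFracBits = 6;

// Snap a real value onto the Q12.6 grid (steps of 1/64) by truncating
// toward negative infinity. Plain C++ emulation, not the ap_fixpt type.
double quantize_q12_6(double x) {
    const double scale = static_cast<double>(1 << kFracBits);  // 64
    return std::floor(x * scale) / scale;
}
```

Under that assumption, the coefficient 65.738 is stored as 65.734375, while values such as 0.5 and 128 are represented exactly. Small coefficient errors like this are one reason the hardware output may differ slightly from a floating-point golden model.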

One obvious difference between RGB2YCbCr_smarthls function and rgb2ycrcb module is the I/O port definition. In rgb2ycrcb definition we explicitly defined:

input control signals: clk and rst
input wire ports: r, g, b
output register ports: y, cr, cb

In RGB2YCbCr_smarthls we don’t define I/O ports; instead we define the input and output data structures for the function's data flow. We use FIFOs as the input and the output to streamline the processing, which is very efficient for streaming applications such as real-time image processing. To see the I/O ports of the generated RTL module, we run the SmartHLS compiler to generate the hardware:

shls -a hw

You can find the I/O ports either from the generated HDL module or the output report hls_output/reports/summary.hls.RGB2YCbCr_smarthls.rpt. I present the I/O ports in the output report below. From the RTL interface table, we can observe that the clock and reset signals are generated automatically. The input_fifo and output_fifo are represented as signals for the standard AXI streaming protocol. The control-type signals are used by the on-board processor to manage the HLS core. It is important to note that the automatic generation of control-type signals implies that any FPGA cores developed with SmartHLS are designed with SoC communication support in mind, which is not typically the case in traditional HDL approaches.

Testbench Design

One of the key advantages of using HLS C++ is that we can write a C++ main() function as the testbench to evaluate the auto-generated RTL module using SW/HW Co-Simulation, which we covered in Part 2.

To be consistent with the SystemVerilog testbench, we use the same test vector, implemented as nested for loops:

for(int i = 0; i < 64; i++) {
        for(int j = 0; j < 64; j++) {
            for(int k = 0; k < 64; k++) {
                ...

Instead of directly assigning r, g, b to the DUT as in the HDL approach, we fill the input FIFO with RGB structs and read the output data from the output FIFO as YCbCr structs:

  in.R = i;
  in.G = j;
  in.B = k;
  // HLS call
  input_fifo.write(in);
  RGB2YCbCr_smarthls(input_fifo, output_fifo);
  out = output_fifo.read();

Much like the HDL approach, we also need to create a golden model to evaluate the RGB2YCbCr_smarthls output value by value. To achieve this, we implement the RGB2YCbCr_sw function, which closely mirrors its SmartHLS counterpart.

std::tuple<uint8_t, uint8_t, uint8_t> RGB2YCbCr_sw(uint8_t R, uint8_t G, uint8_t B) {

    float Y  =  0.299f * R + 0.587f * G + 0.114f * B;
    float Cb = -0.169 * R - 0.332 * G + 0.5f * B + 128;
    float Cr =  0.5f * R - 0.419 * G - 0.0813 * B + 128;

    return std::make_tuple(clamp(Y), clamp(Cb), clamp(Cr)); // limit each value inside [0, 255]
}
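The clamp helper used in the golden model is not shown here. A minimal version could look like the sketch below; clamp_u8 is a hypothetical name chosen to avoid clashing with std::clamp, and the project's actual helper may round differently.

```cpp
#include <cstdint>

// Hypothetical helper: limit a float to [0, 255] and convert it to an 8-bit value.
std::uint8_t clamp_u8(float v) {
    if (!(v >= 0.0f)) return 0;           // negatives (and NaN) map to 0
    if (v > 255.0f)   return 255;         // saturate at the 8-bit maximum
    return static_cast<std::uint8_t>(v);  // drops the fractional part
}
```

Because the conversion drops the fractional part, 127.9f becomes 127. If the hardware path rounds to nearest instead, the two results can differ by one, so the comparison should allow for that.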

We can run the testbench in two ways:

Software Simulation: run shls -a sw to compile and run the top-level function as a regular C++ program using gcc on the host PC.
SW/HW Co-Simulation: run shls -a cosim to run an RTL simulation of the generated RTL module via ModelSim, driven by the main function, and check the output messages for the simulation result.

The SmartHLS approach offers more verification options than the HDL approach, and SW/HW Co-Simulation provides greater flexibility in testbench design, since writing a testbench in C++ is typically faster than writing one in Verilog. However, implementing the DUT in Verilog gives us more control over details such as I/O interfacing, which is crucial for large FPGA designs.

Summary

In this post we implemented RGB2YCbCr with both the RTL approach and the HLS C++ approach. The HLS C++ approach took much less time to develop and is much friendlier to software engineers working with FPGAs. However, the HDL approach gives us more control over the design, such as I/O interfacing. I'm not suggesting that HLS C++ is superior to the HDL approach, but this exercise does suggest that HLS C++ could be better suited for standalone, compute-intensive IP core development.

References

[1] SmartHLS User Guide: https://microchiptech.github.io/fpga-hls-docs/2023.2/userguide.html

[2] ShuranXu/RGB2YCbCr: an HLS C++ project to convert RGB to YCbCr

FPGA Prototyping in HLS C++ (Part 3)

Shuran Xu
