FPGA Prototyping in HLS C++ (Part 3)


Preface
In the previous blog, we explored the key features of HLS C++ and the fundamentals of SmartHLS. In this post, I will discuss the HLS application development lifecycle, using the RGB2YCbCr project as an example.
HLS Application Development Life Cycle
In my view, developing an HLS C++ application differs somewhat from the traditional HDL approach, although both RTL designers and HLS designers need a solid grasp of FPGA concepts. The table below highlights a few key differences worth mentioning:
HLS C++ development | RTL development |
Designers express I/O ports for the module via HLS pragmas | Designers explicitly define I/O ports in the module definition |
Designers implement the module as a C++ function at the algorithmic level | Designers implement every detail in an HDL (e.g. Verilog or VHDL) |
Designers must have knowledge of both HLS C++ and FPGA concepts to efficiently implement hardware at the algorithmic level. | Designers must have knowledge of both HDL coding and FPGA concepts to efficiently implement hardware at the RTL level. |
Designers have the option of performing functional verification at the software level via SW/HW Co-Simulation. In other words, designers can write the testbench in C++. | Designers must write an RTL testbench (for example in SystemVerilog) to verify the module functionality. |
Developing an FPGA core in HLS C++ can shorten time to market, at the cost of some design flexibility. Nevertheless, HLS C++ is becoming more widely used in industries that require high-performance computing and hardware acceleration, such as telecommunications and automotive.
Before diving straight into a real HLS design, it's essential to follow a structured design flow to ensure the feasibility and determinism of HLS IP implementation. A well-defined structured development approach also helps systematically enhance the quality of the HLS IP. The following figure shows the general design flow of an HLS IP design from my understanding:
Developing an HLS core is also an iterative process involving verification and optimization, which is no different from developing a core in HDL. The QoR testing step in the flow checks the quality of results (QoR) of the HLS core, ensuring that the final design meets the specified requirements, such as Fmax, resource usage, and other design parameters. The Instrument step is optional and is only necessary when runtime behavior differs from simulation behavior, or when undefined behavior is observed.
RGB2YCbCr FPGA Prototyping Example
Now that we are familiar with the structured HLS development methodology, let's write a simple RGB2YCbCr module using both SmartHLS C++ and Verilog to experience the differences between the two approaches. Please note that RGB2YCbCr is my open-source project, and readers are free to check out the full source code [2].
RGB and YCbCr Overview
RGB2YCbCr refers to the process of converting an image or video from the RGB color space to the YCbCr color space. RGB is the most common color model used for displays (like monitors and TVs) and digital cameras. In this model, colors are created by mixing different intensities of red, green, and blue light. YCbCr is a color space commonly used in video compression and broadcasting. It separates the image into three components:
Component | Description |
Y (Luma) | Represents the brightness of the color. It can be derived from the RGB components and carries most of the image's detail. |
Cb (Chroma Blue) | Represents the difference between the blue color and the brightness. |
Cr (Chroma Red) | Represents the difference between the red color and the brightness. |
To convert from RGB to YCbCr, the following formula is commonly used:
Y = 0.299 * R + 0.587 * G + 0.114 * B
Cb = -0.1687 * R - 0.3313 * G + 0.5 * B + 128
Cr = 0.5 * R - 0.4187 * G - 0.0813 * B + 128
Note: the +128 in Cb and Cr shifts these components into the range [0, 255], as they may go negative otherwise.
Now that we know what RGB and YCbCr are, we can start designing the RGB2YCbCr module in Verilog and SmartHLS C++.
HDL Approach
The HDL approach consists of the module design and the associated testbench implementation.
Module Design
We define a module called rgb2ycrcb with the following declaration:
module rgb2ycrcb(clk, rst, r, g, b, y, cr, cb);
input clk;
input rst;
input [9:0] r, g, b;
output reg [9:0] y, cr, cb;
//...
endmodule
To design the circuit efficiently, we need to apply a few FPGA-specific optimization techniques. The first one is equation quantization, a technique that converts floating-point calculations into integer calculations. Since floating-point computation is very expensive in FPGA fabric, we quantize the floating-point coefficients to fixed-point integers (written in hexadecimal below). The typical way is to first scale up and then scale down.
Taking the Y equation as an example:
Y = 0.299 x R + 0.587 x G + 0.114 x B
We scale the floating-point coefficients into integer format by multiplying each component with 1024:
0.299 x 1024 = 306.176 ≈ 0x132 in hexadecimal
0.587 x 1024 = 601.088 ≈ 0x259 in hexadecimal
0.114 x 1024 = 116.736 ≈ 0x074 in hexadecimal.
These hexadecimal values are then used as multipliers:
Y = 0x132 x R + 0x259 x G + 0x074 x B
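To make the scale-up/scale-down idea concrete, here is a minimal C++ sketch (not taken from the project) that evaluates the quantized Y equation with integers only and then shifts right by 10 bits to undo the multiplication by 1024. The function names and the 32-bit types are assumptions chosen for illustration; inputs are assumed to be 10-bit samples, as in the module declaration.

```cpp
#include <cassert>
#include <cstdint>

// Y coefficients scaled by 1024 (2^10), taken from the text above.
constexpr uint32_t kYR = 0x132;  // 0.299 * 1024, about 306
constexpr uint32_t kYG = 0x259;  // 0.587 * 1024, about 601
constexpr uint32_t kYB = 0x074;  // 0.114 * 1024, about 116

// Integer-only luma: multiply by the scaled coefficients ("scale up"),
// then shift right by 10 to divide by 1024 ("scale down").
// Hypothetical helper; r, g, b are assumed to be 10-bit values (0..1023).
constexpr uint32_t luma_q10(uint32_t r, uint32_t g, uint32_t b) {
    return (kYR * r + kYG * g + kYB * b) >> 10;
}

// Floating-point reference for comparison.
constexpr double luma_ref(uint32_t r, uint32_t g, uint32_t b) {
    return 0.299 * r + 0.587 * g + 0.114 * b;
}

// Full-scale white: the three coefficients sum to 1023 instead of 1024,
// so the integer result lands one code below the 10-bit maximum.
static_assert(luma_q10(1023, 1023, 1023) == 1022, "expected 1022");
static_assert(luma_q10(0, 0, 0) == 0, "black stays black");
```

For a pure red input of 255, luma_q10(255, 0, 0) yields 76, against roughly 76.245 from the floating-point form, which is the kind of small offset the testbench later tolerates.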
Similarly, we can approximate the assignments for Cr and Cb:
Cr = 0x200 x R - 0x1AD x G - 0x053 x B
→ Cr = (1 << 9) x R - 0x1AD x G - 0x053 x B
Cb = B x 0x200 - 0x0AD x R - 0x153 x G
→ Cb = (1 << 9) x B - 0x0AD x R - 0x153 x G
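As a sanity check on the constants, the following sketch derives them from the real-valued coefficients. It assumes the commonly quoted values 0.4187 and 0.0813 for the G and B terms of Cr and 0.1687 and 0.3313 for the R and G terms of Cb; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Scale a real coefficient by 1024 and round to the nearest integer.
inline int32_t quantize_round_q10(double coeff) {
    return static_cast<int32_t>(std::lround(coeff * 1024.0));
}

// Same scaling, but discarding the fractional part instead of rounding.
inline int32_t quantize_trunc_q10(double coeff) {
    return static_cast<int32_t>(std::floor(coeff * 1024.0));
}
```

Under these assumptions, rounding reproduces 0x200, 0x1AD, 0x053, 0x0AD and 0x153 as well as the R and G terms of Y. The one exception is the B term of Y: rounding 0.114 * 1024 gives 117 (0x075), whereas the 0x074 used above corresponds to truncation.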
The other technique is pipelining, which partitions a large combinational or sequential circuit into multiple clock stages so that the resulting circuit achieves a higher Fmax and higher throughput. In this case, instead of calculating Y, Cr and Cb in one clock cycle, we can calculate each term and store it in a pipeline register. For Y, we compute and store each of the 3 terms in its own pipeline register and then sum them up into y1:
// We can pipeline the calculation for Cr and Cb in the same way.
always@(posedge clk)
if (rst) begin
yr <= 0;
yb <= 0;
yg <= 0;
y1 <= 0;
end else begin
yr <= 10'h132 * r;
yg <= 10'h259 * g;
yb <= 10'h074 * b;
y1 <= yr + yg + yb;
end
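For readers coming from software, the timing of this block can be mimicked with a small cycle-level model in C++. This is only a sketch: the struct and helper names are hypothetical, and 32-bit integers are assumed to be wide enough to stand in for the real register widths.

```cpp
#include <cassert>
#include <cstdint>

// Software model of the two-stage Y pipeline (hypothetical, for illustration).
struct YPipeline {
    uint32_t yr = 0, yg = 0, yb = 0;  // stage 1: one product per register
    uint32_t y1 = 0;                  // stage 2: sum of the stage-1 registers

    // Model one rising clock edge. Like Verilog non-blocking assignments,
    // every register is updated from the values held before the edge.
    void clock(uint32_t r, uint32_t g, uint32_t b) {
        const uint32_t next_y1 = yr + yg + yb;  // uses last cycle's products
        yr = 0x132 * r;
        yg = 0x259 * g;
        yb = 0x074 * b;
        y1 = next_y1;
    }
};

// Hold the same pixel on the inputs for `edges` clock edges after reset
// and return the content of y1.
inline uint32_t y1_after(int edges, uint32_t r, uint32_t g, uint32_t b) {
    YPipeline p;
    for (int i = 0; i < edges; ++i) p.clock(r, g, b);
    return p.y1;
}
```

With r = g = b = 1, y1 is still 0 after the first edge and becomes 1023 after the second, which shows the extra cycle of latency that pipelining trades for a shorter critical path.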
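Before assigning the obtained value y1 to the output port y, we need to check boundaries so that only valid values are passed to the output. Bits y1[19:10] hold the result scaled back down by 1024; if the sign bit y1[21] is set, the output is forced to 0, and if y1[20] is set on a non-negative value, the output saturates to all ones: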
// We can check boundary for Cr and Cb in the same way.
always@(posedge clk)
if (rst) begin
y <= 0;
end else begin
// check Y
y <= (y1[19:10] & {10{!y1[21]}}) | {10{(!y1[21] && y1[20])}};
end
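The bit-level expression can be hard to read at first, so here is a C++ sketch that mirrors it term by term. It assumes y1 is a 22-bit two's-complement register, as the y1[21] index suggests; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Mirror of: (y1[19:10] & {10{!y1[21]}}) | {10{(!y1[21] && y1[20])}}
// y1 is passed as the raw 22-bit register content.
inline uint32_t saturate_10bit(uint32_t y1) {
    const bool negative = ((y1 >> 21) & 1u) != 0;               // y1[21]
    const bool overflow = !negative && ((y1 >> 20) & 1u) != 0;  // !y1[21] && y1[20]
    const uint32_t field = (y1 >> 10) & 0x3FFu;                 // y1[19:10]
    const uint32_t keep  = negative ? 0u : 0x3FFu;              // {10{!y1[21]}}
    const uint32_t ones  = overflow ? 0x3FFu : 0u;              // {10{...}}
    return (field & keep) | ones;
}
```

In effect this clamps the scaled-down value to the range [0, 1023] without a comparator: negative sums become 0, sums of 2^20 or more become 0x3FF, and everything else passes through unchanged.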
Since a tiny offset in the output values is acceptable, the testbench uses a parameter emargin to define the acceptable error margin. This parameter is used when evaluating each output value against the golden values:
// Check results
always@(my or mcr or mcb)
begin
iy = y;
if ( ( iy < my - emargin) || (iy > my + emargin) )
$display("Y-value error. Received %d, expected %d. R = %d, G = %d, B = %d", y, my, r[3], g[3], b[3]);
icr = cr;
if ( ( icr < mcr - emargin) || (icr > mcr + emargin) )
$display("Cr-value error. Received %d, expected %d. R = %d, G = %d, B = %d", cr, mcr, r[3], g[3], b[3]);
icb = cb;
if ( ( icb < mcb - emargin) || (icb > mcb + emargin) )
$display("Cb-value error. Received %d, expected %d. R = %d, G = %d, B = %d", cb, mcb, r[3], g[3], b[3]);
end
The full code of the rgb2ycrcb module can be found in the project repository [2].
Testbench Design
To ensure thorough testing of the module, we design a test vector that combines the following:
64 incremental values for R
64 incremental values for G
64 incremental values for B
The test vector can be iterated via nested for loops:
for (r_idx = 0; r_idx <= r_length; r_idx = r_idx + 1) begin
for (g_idx = 0; g_idx <= g_length; g_idx = g_idx + 1) begin
for (b_idx = 0; b_idx <= b_length; b_idx = b_idx + 1) begin
@(posedge clk);
r[0] <= r_idx;
g[0] <= g_idx;
b[0] <= b_idx;
end
end
end
In addition, to evaluate the module output we need to implement a golden model inside the testbench. The golden model also uses integer arithmetic, but this time the coefficients are simply scaled by 1000 and written in decimal, and the result is divided by 1000:
always@(r[3] or g[3] or b[3])
begin
my = (299 * r[3]) + (587 * g[3]) + (114 * b[3]);
if (my < 0)
my = 0;
my = my /1000;
if (my > 1024)
my = 1024;
mcr = (500 * r[3]) - (419 * g[3]) - ( 81 * b[3]);
if (mcr < 0)
mcr = 0;
mcr = mcr /1000;
if (mcr > 1024)
mcr = 1024;
mcb = (500 * b[3]) - (169 * r[3]) - (332 * g[3]);
if (mcb < 0)
mcb = 0;
mcb = mcb /1000;
if (mcb > 1024)
mcb = 1024;
end
The full code of the RGB2YCbCr_tb module can be found in the project repository [2].
Now we can run RTL simulation to perform the evaluation. I provide two ways to launch the simulation:
Use the open-source Icarus Verilog
Use ModelSimPro shipped with the LiberoSoC Suite
Icarus Verilog Simulation
We can run the following commands to simulate the design:
# compile RGB2YCbCr_tb.sv to get the verilog virtual processor file using iverilog
iverilog -g2012 -o RGB2YCbCr_tb.vvp RGB2YCbCr_tb.sv
# run the verilog simulation
vvp -n RGB2YCbCr_tb.vvp
I wrote a script, compile_and_run.sh, that launches iverilog, so you can run the RTL simulation by simply running:
./compile_and_run.sh
The simulation result is shown below, and you can see that no error message is reported:
ModelSim Simulation
I wrote tb.tcl, which compiles RGB2YCbCr.v and RGB2YCbCr_tb.sv, and a Makefile that calls tb.tcl internally, so you can launch the simulation by simply running:
make
The tb.tcl script compiles RGB2YCbCr_tb.sv and RGB2YCbCr.v using vlog and launches the simulation using vsim. The full implementation can be found in the project repository [2]. Running the Makefile produces the same output messages in ModelSim as shown before:
SmartHLS C++ Approach
The SmartHLS C++ approach also consists of the module design and the associated software testbench implementation.
Module Design
The C++ implementation of the module is much more straightforward than the RTL approach, since we no longer need to hand-code FPGA-specific optimizations such as pipeline registers to improve performance and pass the timing check. Instead, we simply use SmartHLS pragmas:
#pragma HLS function top
#pragma HLS function pipeline
to declare that the function RGB2YCbCr_smarthls is the HLS core to be pipelined and synthesized:
// Fixed point type: Q12.6
// 12 integer bits and 6 fractional bits
typedef ap_fixpt<18, 12> fixpt_t;
void RGB2YCbCr_smarthls(hls::FIFO<RGB> &input_fifo,
hls::FIFO<YCbCr> &output_fifo) {
#pragma HLS function top
#pragma HLS function pipeline
RGB in = input_fifo.read();
YCbCr ycbcr;
// change divide by 256 to right shift by 8, add 0.5 for rounding
ycbcr.Y = fixpt_t(4) + ((fixpt_t(65.738) * in.R + fixpt_t(129.057) * in.G + fixpt_t(25.064) * in.B ) >> 8) + fixpt_t(0.5);
ycbcr.Cb = fixpt_t(128) - ((fixpt_t(37.945) * in.R + fixpt_t(74.494) * in.G - fixpt_t(112.439) * in.B) >> 8) + fixpt_t(0.5);
ycbcr.Cr = fixpt_t(128) + ((fixpt_t(112.439) * in.R - fixpt_t(94.154) * in.G - fixpt_t(18.285) * in.B) >> 8) + fixpt_t(0.5);
output_fifo.write(ycbcr);
}
As you can see above, the function implementation is no different from a regular C++ function, except that we use a fixed-point data type for fast, bit-accurate software simulation and efficient equivalent hardware generation. I covered the fixed-point C++ library in Part 2 of this blog series; you can learn about it there, or explore the official SmartHLS User Guide [1].
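The comment in the function above replaces a division by 256 with a right shift by 8 and adds 0.5 for rounding. The plain-integer sketch below illustrates that idea on its own; it deliberately does not use ap_fixpt, so it says nothing about the library's exact quantization behavior, and the helper names are hypothetical. For integers, adding 0.5 after the division is the same as adding 128 before the shift.

```cpp
#include <cassert>
#include <cstdint>

// Truncating division by 256: the low 8 bits are simply dropped.
inline uint32_t div256_trunc(uint32_t x) { return x >> 8; }

// Round-to-nearest division by 256 for non-negative integers:
// adding half the divisor (128) before the shift equals floor(x / 256 + 0.5).
inline uint32_t div256_round(uint32_t x) { return (x + 128u) >> 8; }
```

For example, 384 / 256 = 1.5, so div256_trunc(384) returns 1 while div256_round(384) returns 2.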
One obvious difference between the RGB2YCbCr_smarthls function and the rgb2ycrcb module is the I/O port definition. In the rgb2ycrcb definition we explicitly defined:
input control signals: clk and rst
input wire ports: r, g, b
output register ports: y, cr, cb
In RGB2YCbCr_smarthls we don't define I/O ports but instead define the input and output data structures for the function data flow. We use FIFOs as the input and the output to stream data through the core, which is very efficient for streaming applications such as real-time image processing. To obtain the I/O ports of the generated RTL module, we use the SmartHLS compiler to generate the corresponding RTL by running:
shls -a hw
You can find the I/O ports either from the generated HDL module or the output report hls_output/reports/summary.hls.RGB2YCbCr_smarthls.rpt.
I present the I/O ports in the output report below. From the RTL interface table, we can observe that the clock and reset signals are generated automatically. The input_fifo and output_fifo are represented as signals for the standard AXI streaming protocol. The control-type signals are used by the on-board processor to manage the HLS core. It is important to note that the automatic generation of control-type signals implies that any FPGA cores developed with SmartHLS are designed with SoC communication support in mind, which is not typically the case in traditional HDL approaches.
Testbench Design
One of the key advantages of using HLS C++ is that we can write a C++ main() function as the testbench to evaluate the auto-generated RTL module using SW/HW Co-Simulation, which we covered in Part 2 of this blog series.
To be consistent with the SystemVerilog testbench, we use the same test vector, implemented with nested for loops:
for(int i = 0; i < 64; i++) {
for(int j = 0; j < 64; j++) {
for(int k = 0; k < 64; k++) {
...
Instead of directly assigning r, g, b to the DUT as in the HDL approach, we fill the input FIFO with RGB structs and read output data from the output FIFO as YCbCr structs:
in.R = i;
in.G = j;
in.B = k;
// HLS call
input_fifo.write(in);
RGB2YCbCr_smarthls(input_fifo, output_fifo);
out = output_fifo.read();
Much like the HDL approach, we also need to create a golden model to evaluate the RGB2YCbCr_smarthls output value by value. To achieve this, we implement the RGB2YCbCr_sw function, which serves as the floating-point software counterpart of the SmartHLS function:
std::tuple<uint8_t, uint8_t, uint8_t> RGB2YCbCr_sw(uint8_t R, uint8_t G, uint8_t B) {
float Y = 0.299f * R + 0.587f * G + 0.114f * B;
float Cb = -0.169 * R - 0.332 * G + 0.5f * B + 128;
float Cr = 0.5f * R - 0.419 * G - 0.0813 * B + 128;
return std::make_tuple(clamp(Y), clamp(Cb), clamp(Cr)); // limit each value inside [0, 255]
}
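Putting the pieces together, the testbench flow (write to the input FIFO, call the DUT, read the output FIFO, compare against the golden model within a margin) can be sketched in plain C++ without the SmartHLS headers. Everything below is an assumption made for illustration: std::queue stands in for hls::FIFO, dut_standin is an integer approximation and not the SmartHLS function, and clamp_u8 is a hypothetical version of the clamp helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <queue>

struct RGB   { uint8_t R, G, B; };
struct YCbCr { uint8_t Y, Cb, Cr; };

// Hypothetical clamp helper: round to nearest, then limit to [0, 255].
inline uint8_t clamp_u8(float v) {
    return static_cast<uint8_t>(std::min(255.0f, std::max(0.0f, std::round(v))));
}

// Floating-point golden model, same equations as RGB2YCbCr_sw above.
inline YCbCr golden(RGB in) {
    const float Y  = 0.299f * in.R + 0.587f * in.G + 0.114f * in.B;
    const float Cb = -0.169f * in.R - 0.332f * in.G + 0.5f * in.B + 128.0f;
    const float Cr = 0.5f * in.R - 0.419f * in.G - 0.0813f * in.B + 128.0f;
    return {clamp_u8(Y), clamp_u8(Cb), clamp_u8(Cr)};
}

// Stand-in DUT (NOT the SmartHLS function): integer math with coefficients
// scaled by 1024. The +512 term rounds, and the 128 offset is added before
// the shift so the shifted value is never negative.
inline void dut_standin(std::queue<RGB>& in_fifo, std::queue<YCbCr>& out_fifo) {
    const RGB p = in_fifo.front();
    in_fifo.pop();
    const int y  = (306 * p.R + 601 * p.G + 117 * p.B + 512) >> 10;
    const int cb = (-173 * p.R - 339 * p.G + 512 * p.B + (128 << 10) + 512) >> 10;
    const int cr = (512 * p.R - 429 * p.G - 83 * p.B + (128 << 10) + 512) >> 10;
    const YCbCr out{static_cast<uint8_t>(std::min(y, 255)),
                    static_cast<uint8_t>(std::min(cb, 255)),
                    static_cast<uint8_t>(std::min(cr, 255))};
    out_fifo.push(out);
}

// Sweep 64 x 64 x 64 inputs and count outputs that differ from the golden
// model by more than `margin` in any component.
inline int count_mismatches(int margin) {
    std::queue<RGB> in_fifo;
    std::queue<YCbCr> out_fifo;
    int mismatches = 0;
    for (int i = 0; i < 64; ++i)
        for (int j = 0; j < 64; ++j)
            for (int k = 0; k < 64; ++k) {
                const RGB in{static_cast<uint8_t>(i), static_cast<uint8_t>(j),
                             static_cast<uint8_t>(k)};
                in_fifo.push(in);
                dut_standin(in_fifo, out_fifo);
                const YCbCr out = out_fifo.front();
                out_fifo.pop();
                const YCbCr ref = golden(in);
                if (std::abs(out.Y - ref.Y) > margin ||
                    std::abs(out.Cb - ref.Cb) > margin ||
                    std::abs(out.Cr - ref.Cr) > margin)
                    ++mismatches;
            }
    return mismatches;
}
```

Calling count_mismatches(1) plays the same role as emargin in the SystemVerilog testbench: the integer stand-in and the floating-point model round slightly differently, so a margin of one code is allowed before a value counts as an error.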
To run the simulation, we have two options:
Software Simulation: run the top-level functions as regular C++ functions compiled with gcc on the host PC. We run shls -a sw to execute the program as if it were a regular C++ program.
SW/HW Co-Simulation: run RTL simulation of the generated RTL module via ModelSim and obtain the test output from the main function. We run shls -a cosim and check the output messages for the simulation result.
The SmartHLS approach offers more verification options than the HDL approach, and SW/HW Co-Simulation provides greater flexibility in testbench design, as writing C++ code is typically much faster than writing Verilog. However, it's important to note that implementing the DUT in Verilog gives us more control over aspects such as I/O interfacing, which is crucial for large FPGA designs.
Summary
In this blog we tried both the RTL approach and the HLS C++ approach to implement RGB2YCbCr, and we saw that the HLS C++ approach takes much less time to develop and is much friendlier to software engineers working with FPGAs. However, the HDL approach gives us more control over the design, such as I/O interfacing. I'm not suggesting that HLS C++ is superior to the HDL approach, but this exercise does suggest that HLS C++ could be better suited for standalone, compute-intensive IP core development.
Reference
[1]. https://microchiptech.github.io/fpga-hls-docs/2023.2/userguide.html
[2]. ShuranXu/RGB2YCbCr: A HLS C++ project to convert RGB to YCbCr
Written by

Shuran Xu
I am currently a senior software engineer at Microchip working on embedded applications in HLS C++. In my free time I love building a wide variety of side projects, ranging from RTL-level modular design to simple games in C++. I also enjoy writing technical blogs to share my knowledge and insights with everyone interested in the tech world.