How to optimize FPGA HLS design?


Here’s a compact, battle-tested playbook for optimizing FPGA HLS (Xilinx Vitis HLS / Intel HLS). The themes are: feed the pipeline, remove memory bottlenecks, and right-size arithmetic.
1) Start from a clean, "hardware-friendly" C/C++
- No recursion, dynamic allocation, or virtual functions.
- Make loop bounds static (or give the tool maximum bounds, e.g. via `LOOP_TRIPCOUNT` for latency reporting).
- Use `restrict` pointers (or `const &`) to let the tool remove false dependencies.
- Prefer structs of arrays (SoA) over arrays of structs (AoS) for parallel memory access.
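As a minimal sketch of the SoA point, in plain C++ (the struct names and the pragma placement are illustrative, not from any particular design):

```cpp
#include <cstdint>

#define N 64

// AoS: the fields of one element share a memory word, so reading the same
// field across many elements contends for the same array/bank.
struct PointAoS { int16_t x; int16_t y; };

// SoA: one array per field. Each field can map to its own BRAM bank, so an
// unrolled or pipelined loop can fetch several x-values in the same cycle.
struct PointsSoA { int16_t x[N]; int16_t y[N]; };

// Hypothetical kernel: sum of x-coordinates. With SoA, p.x can be
// ARRAY_PARTITIONed independently of p.y.
int32_t sum_x(const PointsSoA &p) {
    int32_t acc = 0;
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        acc += p.x[i];
    }
    return acc;
}
```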
2) Right-size your numerics
- Replace `float`/`double` with fixed-point (e.g., `ap_fixed`, `ac_fixed`) where accuracy allows.
- Narrow integer widths (e.g., `ap_uint<11>`).
➜ Cuts LUT/DSP/BRAM use and raises Fmax.
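A rough sketch of what fixed-point buys you, written with plain integers in Q16.16 format rather than `ap_fixed` so the arithmetic is visible (the type and helper names are illustrative):

```cpp
#include <cstdint>

// Q16.16 fixed-point modeled on int32_t -- roughly what ap_fixed<32,16>
// gives you, with 16 integer bits and 16 fractional bits.
typedef int32_t q16_16;
const int FRAC_BITS = 16;

q16_16 to_fixed(double v)   { return (q16_16)(v * (1 << FRAC_BITS)); }
double to_double(q16_16 v)  { return (double)v / (1 << FRAC_BITS); }

// Multiply: widen to 64 bits, then shift the fraction back. This maps to a
// single DSP multiply on most FPGAs, versus a multi-cycle floating-point core.
q16_16 fx_mul(q16_16 a, q16_16 b) {
    return (q16_16)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}
```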
3) Design the memory architecture first
The most common reason for II > 1 is a single BRAM port trying to serve several reads/writes per cycle.
Partition or reshape arrays to get parallel banks:
- `#pragma HLS ARRAY_PARTITION variable=A complete dim=2` (fully parallel rows/cols)
- `#pragma HLS ARRAY_RESHAPE variable=A factor=4 dim=1` (wider data per access)
Use line buffers/tiling for 2-D kernels so you hit on-chip RAM instead of DDR every cycle.
For external memory:
- Burst transfers (AXI4); align and coalesce accesses.
- Stream tiles into the kernel, compute, stream results out.
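To make the banking idea concrete, here is a sketch in plain C++ (array size and partition factor are illustrative): two reads per iteration from one array need two banks to reach II=1.

```cpp
#include <cstdint>

#define N 128

// Summing two elements per iteration needs two reads/cycle from `a`.
// A single BRAM port forces II=2; a cyclic partition by 2 gives each
// parity its own bank so both reads can land in the same cycle.
int32_t sum_pairs(const int16_t a[N]) {
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=2 dim=1
    int32_t acc = 0;
    for (int i = 0; i < N; i += 2) {
#pragma HLS PIPELINE II=1
        acc += a[i] + a[i + 1];  // even bank + odd bank, no port conflict
    }
    return acc;
}
```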
4) Pipeline the critical loops
Target II=1 on the innermost loop doing the MAC/compute:
- `#pragma HLS PIPELINE II=1` (place it at the deepest feasible loop).
If loop-carried dependencies block pipelining:
- Reorder loops, accumulate in local variables, or use reduction trees.
- Use `#pragma HLS DEPENDENCE variable=… inter false` only if you're sure it's a false dependence.
5) Unroll only where the memory can keep up
- `#pragma HLS UNROLL factor=…` duplicates hardware lanes.
- Match the unroll factor to the available memory bandwidth (ports × banks × bytes/cycle).
➜ If memory can't feed it, unrolling just increases stalls and area.
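A sketch of matching the unroll factor to the banking (all sizes illustrative; in a real design the factors come from your port budget):

```cpp
#include <cstdint>

#define N 64
#define LANES 4  // unroll factor, chosen to match the partition factor below

// Four lanes need four reads/cycle from each input, so both arrays are split
// into four cyclic banks. Unrolling further without adding banks would only
// add stalls and area, not throughput.
int32_t dot4(const int16_t x[N], const int16_t h[N]) {
#pragma HLS ARRAY_PARTITION variable=x cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=h cyclic factor=4 dim=1
    int32_t acc = 0;
    for (int i = 0; i < N; i += LANES) {
#pragma HLS PIPELINE II=1
        for (int l = 0; l < LANES; l++) {
#pragma HLS UNROLL
            acc += x[i + l] * h[i + l];  // one MAC per lane per cycle
        }
    }
    return acc;
}
```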
6) Dataflow across functions/stages
Split the kernel into load → compute → store and connect the stages with streams/FIFOs:
- `#pragma HLS DATAFLOW` in the top wrapper; use `hls::stream<T>` between stages.
- Size stream depths (e.g., `#pragma HLS STREAM variable=… depth=64`) to decouple producer and consumer and absorb backpressure.
7) Bind resources deliberately
- Map MACs to DSPs: `#pragma HLS BIND_OP variable=… op=mul impl=dsp`
- Steer small ROMs into LUTRAM and big buffers into BRAM/URAM with `#pragma HLS BIND_STORAGE` (the legacy `RESOURCE` pragma in older Vivado HLS).
- Control sharing/duplication: `#pragma HLS ALLOCATION operation instances=mul limit=…` (share to save area; duplicate to hit II=1).
8) Interfaces & I/O throughput
Use AXI4-Stream for high-rate pipelines; AXI4 master for memory bursts; AXI-Lite for control.
Align bursts to 64/128/512-bit widths; pack data types to meet the bus width.
For video/DSP, keep everything streaming; avoid writing intermediate arrays to DDR.
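A sketch of the data-packing idea for a 64-bit bus (plain C++; the 4×16-bit layout is just an example of filling the bus width):

```cpp
#include <cstdint>

// Pack four 16-bit samples into one 64-bit bus word so each AXI beat
// carries a full payload; sample i lands in bits [16*i+15 : 16*i].
uint64_t pack4(const uint16_t s[4]) {
    uint64_t w = 0;
    for (int i = 0; i < 4; i++)
        w |= (uint64_t)s[i] << (16 * i);
    return w;
}

// Inverse: split a 64-bit word back into four samples on the other side.
void unpack4(uint64_t w, uint16_t s[4]) {
    for (int i = 0; i < 4; i++)
        s[i] = (uint16_t)(w >> (16 * i));
}
```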
9) Typical optimization patterns
(a) FIR / dot-product inner loop

```cpp
for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
    acc += x[i] * h[i];
}
```

Put `h[]` in ROM; partition `x[]` if reading multiple taps per cycle. Accumulate in a tree if unrolled.
(b) 2-D stencil (e.g., 3×3 conv)
Use line buffers + window buffer; pipeline inner x-loop, unroll the 3×3 MAC, partition the window.
Read one pixel/cycle → update window → compute → write.
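That read-one-pixel-per-cycle loop can be sketched in plain C++ as follows (a 3×3 window sum stands in for the convolution; sizes and names are illustrative, and the pragma comment marks where PIPELINE would go):

```cpp
#include <cstdint>

#define W 8
#define H 8

// Line-buffer pattern for a 3x3 kernel: two line buffers hold the previous
// rows, a 3x3 register window slides across, and each pixel is read from
// external memory exactly once.
void sum3x3(const uint8_t in[H][W], uint16_t out[H][W]) {
    uint8_t line[2][W] = {{0}};  // previous two rows (BRAM in HLS)
    uint8_t win[3][3] = {{0}};   // 3x3 window (registers; fully partitioned)

    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            // #pragma HLS PIPELINE II=1  (this inner loop is the one to pipeline)
            uint8_t px = in[y][x];            // one new pixel per cycle
            for (int r = 0; r < 3; r++)       // shift window one column left
                for (int c = 0; c < 2; c++)
                    win[r][c] = win[r][c + 1];
            win[0][2] = line[0][x];           // new right column: two buffered
            win[1][2] = line[1][x];           // rows plus the fresh pixel
            win[2][2] = px;
            line[0][x] = line[1][x];          // rotate line buffers for next row
            line[1][x] = px;
            if (y >= 2 && x >= 2) {           // emit once the window is valid
                uint16_t acc = 0;
                for (int r = 0; r < 3; r++)
                    for (int c = 0; c < 3; c++)
                        acc += win[r][c];
                out[y - 1][x - 1] = acc;      // result at the window center
            }
        }
    }
}
```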
(c) Matrix multiply (tiled)
- Tile A/B into on-chip BRAM; partition tile dimensions to match unroll; pipeline the k-loop; dataflow between load/compute/store.
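A plain-C++ sketch of that tiling pattern (matrix and tile sizes are illustrative; the pragma comment marks where pipelining would go):

```cpp
#include <cstdint>

#define N 8   // full matrix size
#define T 4   // tile size -- small enough to fit on-chip BRAM

// Copy a T x T tile of A and B on chip, run the MAC loops entirely from the
// local buffers, accumulate into C. In HLS the tile buffers would be
// ARRAY_PARTITIONed to match the k-loop unroll.
void matmul_tiled(const int16_t A[N][N], const int16_t B[N][N], int32_t C[N][N]) {
    int16_t Atile[T][T], Btile[T][T];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) C[i][j] = 0;

    for (int i0 = 0; i0 < N; i0 += T)
        for (int j0 = 0; j0 < N; j0 += T)
            for (int k0 = 0; k0 < N; k0 += T) {
                // load: burst the two tiles into local buffers
                for (int i = 0; i < T; i++)
                    for (int k = 0; k < T; k++) Atile[i][k] = A[i0 + i][k0 + k];
                for (int k = 0; k < T; k++)
                    for (int j = 0; j < T; j++) Btile[k][j] = B[k0 + k][j0 + j];
                // compute: every read now hits on-chip memory
                for (int i = 0; i < T; i++)
                    for (int j = 0; j < T; j++) {
                        // #pragma HLS PIPELINE II=1 (unroll k to match partitioning)
                        int32_t acc = C[i0 + i][j0 + j];
                        for (int k = 0; k < T; k++)
                            acc += Atile[i][k] * Btile[k][j];
                        C[i0 + i][j0 + j] = acc;
                    }
            }
}
```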
10) Why your II isn't 1 (quick fixes)
- BRAM port conflict → partition/reshape, or duplicate buffers.
- Write-after-read dependence on the same array → use double buffering or separate read/write buffers.
- Function-call boundary blocks scheduling → `#pragma HLS INLINE`.
- Streams stall (empty/full) → increase FIFO depth, balance stage latencies, or reduce unroll on the slower stage.
- Division/modulo in a loop → replace with a reciprocal multiply, lookup tables, or hoist invariant ops.
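For the division point, one concrete sketch: dividing by a constant via a precomputed fixed-point reciprocal multiply (the constant below is the standard magic number for dividing a 32-bit value by 10):

```cpp
#include <cstdint>

// A divider core is large and deeply pipelined; a multiply maps to one DSP.
// 0xCCCCCCCD = ceil(2^35 / 10); multiply then shift by 35 reproduces x / 10
// exactly for every 32-bit x.
const uint64_t RECIP_10 = 0xCCCCCCCDULL;

uint32_t div10(uint32_t x) {
    return (uint32_t)(((uint64_t)x * RECIP_10) >> 35);
}
```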
11) Close timing (Fmax) early
Check post-synthesis timing as you add pragmas; long add chains from unrolls kill Fmax → add pipeline registers or reduce factor.
Prefer balanced adder trees for reductions (many tools infer them from unrolled loops).
Constrain the target clock realistically (e.g., 200–300 MHz for mid-range devices).
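A sketch of the balanced-reduction shape (plain C++; HLS tools usually infer this from an unrolled accumulation, but writing it out makes the log-depth structure visible):

```cpp
#include <cstdint>

#define N 8  // number of lanes (power of two for a clean tree)

// Balanced adder tree: log2(N) levels of pairwise adds instead of a serial
// N-deep chain, so the critical path grows logarithmically with the unroll.
int32_t tree_sum(const int16_t v[N]) {
    int32_t t[N];
    for (int i = 0; i < N; i++) t[i] = v[i];
    for (int stride = N / 2; stride > 0; stride /= 2)  // one tree level per pass
        for (int i = 0; i < stride; i++)
            t[i] = t[i] + t[i + stride];               // pairwise adds in parallel
    return t[0];
}
```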
12) Verification & iteration loop
C sim (golden ref) →
HLS C/RTL cosim with real vectors →
Inspect HLS report: II, latency, resource (DSP/BRAM/URAM/LUT) →
Adjust pragmas/types/buffering →
Integrate in IP, run implementation to confirm timing/area.
Pragmas/C++ knobs (cheat sheet)
- Throughput: `PIPELINE`, `UNROLL`, `DATAFLOW`, `INLINE`
- Memory BW: `ARRAY_PARTITION`, `ARRAY_RESHAPE`, `RESOURCE core=RAM_…` (`BIND_STORAGE` in Vitis HLS), `STREAM depth=…`
- Numerics: `ap_[u]fixed`, `ap_[u]int<N>`, `BIND_OP impl=dsp`
- Sharing/area: `ALLOCATION limit=…`, `LATENCY max=…`
- Deps: `DEPENDENCE inter|intra false`, `restrict`, `const`
Minimal example: load/compute/store dataflow

```cpp
#include <ap_int.h>
#include <hls_stream.h>

// load/compute/store are the three dataflow stages; compute runs at II=1 inside.
void top(ap_uint<64>* in, ap_uint<64>* out, int n) {
#pragma HLS INTERFACE m_axi     port=in  offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=out offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=in
#pragma HLS INTERFACE s_axilite port=out
#pragma HLS INTERFACE s_axilite port=n
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS DATAFLOW
    hls::stream<ap_uint<64>> s_in, s_out;
#pragma HLS STREAM variable=s_in  depth=64
#pragma HLS STREAM variable=s_out depth=64
    load(in, s_in, n);        // burst reads
    compute(s_in, s_out, n);  // PIPELINE II=1 inside
    store(s_out, out, n);     // burst writes
}
```
TL;DR tuning order
Fixed-point & narrow widths →
Memory banking/tiling →
PIPELINE inner loops (II=1) →
UNROLL to match memory BW →
DATAFLOW across stages →
Bind to DSP/BRAM/URAM →
Iterate with reports until timing/resources meet spec.