How to optimize an FPGA HLS design?

ampheo

Here’s a compact, battle-tested playbook for optimizing FPGA HLS (Xilinx Vitis HLS / Intel HLS). The themes are: feed the pipeline, remove memory bottlenecks, and right-size arithmetic.


1) Start from a clean, “hardware-friendly” C/C++

  • No recursion / dynamic allocation / virtuals.

  • Make loop bounds static (or give max bounds).

  • Use restrict pointers (or const&) to let the tool remove false dependencies.

  • Prefer structs of arrays (SoA) over arrays of structs (AoS) for parallel memory access.
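A minimal plain-C++ sketch of the SoA point (the `energy_soa` kernel and array sizes are made up for illustration; in HLS the types would be `ap_int`/`ap_fixed` and the pragma would be uncommented):

```cpp
#include <cstdint>

// AoS: real/imag interleaved in one array of structs -> both fields
// share one memory, so reading re and im competes for the same port.
struct SampleAoS { int16_t re; int16_t im; };

// SoA: each field in its own array -> each maps to its own BRAM,
// so re[] and im[] can be read in the same cycle.
struct SamplesSoA {
    int16_t re[1024];
    int16_t im[1024];
};

// Sum of squared magnitudes over the SoA layout; in HLS the two reads
// hit independent memories, enabling II=1 without extra partitioning.
int32_t energy_soa(const SamplesSoA& s, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        // #pragma HLS PIPELINE II=1
        acc += (int32_t)s.re[i] * s.re[i] + (int32_t)s.im[i] * s.im[i];
    }
    return acc;
}
```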


2) Right-size your numerics

  • Replace float/double with fixed-point (e.g., ap_fixed, ac_fixed) where accuracy allows.

  • Narrow integer widths (e.g., ap_uint<11>).
    ➜ Cuts LUT/DSP/BRAM use and raises Fmax.
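To make the fixed-point idea concrete, here is a plain-C++ model of a Q4.12 multiply, i.e. roughly what `ap_fixed<16,4>` does (helper names are invented; real HLS code would just use the `ap_fixed` type and get this for free):

```cpp
#include <cstdint>

// Q4.12 fixed point: 16 bits total, 4 integer bits, 12 fractional bits.
// One LSB weighs 2^-12.
const int FRAC_BITS = 12;

int16_t to_fixed(double x)   { return (int16_t)(x * (1 << FRAC_BITS)); }
double  to_double(int16_t f) { return (double)f / (1 << FRAC_BITS); }

// Fixed-point multiply: full-width product, then shift back down.
// In HLS this maps to a single DSP slice instead of a floating-point core.
int16_t fx_mul(int16_t a, int16_t b) {
    int32_t p = (int32_t)a * (int32_t)b;   // Q8.24 intermediate
    return (int16_t)(p >> FRAC_BITS);      // back to Q4.12 (truncating)
}
```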


3) Design the memory architecture first

The most common reason for II>1 is a single BRAM port trying to serve multiple reads/writes per cycle.

  • Partition or reshape arrays to get parallel banks:

    • #pragma HLS ARRAY_PARTITION variable=A complete dim=2 (full parallel rows/cols)

    • #pragma HLS ARRAY_RESHAPE variable=A factor=4 dim=1 (wider data per access)

  • Use line buffers/tiling for 2-D kernels so you hit on-chip RAM instead of DDR every cycle.

  • For external memory:

    • Burst transfers (AXI4), align and coalesce accesses.

    • Stream tiles into the kernel, compute, stream results out.
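The following plain-C++ sketch models what `ARRAY_PARTITION cyclic factor=4` buys you (the `dot4` function is hypothetical; in HLS the banking is done by the pragma, not by hand):

```cpp
#include <cstdint>

// Element i lives in bank[i % 4]; with four independent banks the
// inner loop can fetch four operands per cycle instead of one.
const int N = 64;

int32_t dot4(const int16_t x[N], const int16_t h[N]) {
    int16_t xb[4][N / 4], hb[4][N / 4];   // four banks each
    for (int i = 0; i < N; i++) {         // model the cyclic partition
        xb[i % 4][i / 4] = x[i];
        hb[i % 4][i / 4] = h[i];
    }
    int32_t acc = 0;
    for (int j = 0; j < N / 4; j++) {
        // #pragma HLS PIPELINE II=1  -- four MACs/cycle, one per bank
        for (int b = 0; b < 4; b++)       // fully unrolled in HLS
            acc += (int32_t)xb[b][j] * hb[b][j];
    }
    return acc;
}
```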


4) Pipeline the critical loops

  • Target II=1 on the innermost loop doing the MAC/compute.

  • #pragma HLS PIPELINE II=1 (move to the deepest feasible loop).

  • If loop-carried dependencies block pipelining:

    • Reorder loops, accumulate in local variables, or use reduction trees.

    • Use #pragma HLS DEPENDENCE variable=… inter false only if you’re sure it’s a false dep.
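As a sketch of breaking an accumulation dependency (plain C++; `sum_partials` is a made-up name, and n is assumed to be a multiple of 4):

```cpp
#include <cstdint>

// A single accumulator creates a loop-carried add dependency: with a
// multi-cycle adder the loop can't reach II=1. Splitting into several
// partial sums (here 4) breaks the chain; the tool then pipelines
// freely and the final combine is a small balanced reduction.
int32_t sum_partials(const int16_t* x, int n) {
    int32_t p0 = 0, p1 = 0, p2 = 0, p3 = 0;
    for (int i = 0; i < n; i += 4) {
        // #pragma HLS PIPELINE II=1
        p0 += x[i];
        p1 += x[i + 1];
        p2 += x[i + 2];
        p3 += x[i + 3];
    }
    return (p0 + p1) + (p2 + p3);   // balanced final reduction
}
```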


5) Unroll only where the memory can keep up

  • #pragma HLS UNROLL factor=… duplicates hardware lanes.

  • Match unroll factor to (ports × banks × bytes/cycle).
    ➜ If memory can’t feed it, unrolling just increases stalls and area.


6) Dataflow across functions/stages

  • Split kernel into load → compute → store and connect with streams/FIFOs.

  • #pragma HLS DATAFLOW at the top wrapper; use hls::stream<T> between stages.

  • Size stream depths (e.g., #pragma HLS STREAM variable=… depth=64) to decouple producer and consumer and absorb transient backpressure.


7) Bind resources deliberately

  • Map MACs to DSPs: #pragma HLS BIND_OP variable=… op=mul impl=DSP

  • Force small ROMs into LUTRAM or big buffers into BRAM/URAM with #pragma HLS BIND_STORAGE (the older #pragma HLS RESOURCE in pre-Vitis tool versions).

  • Control sharing/duplication: #pragma HLS ALLOCATION operation instances=mul limit=…
    (share to save area; duplicate to hit II=1).


8) Interfaces & I/O throughput

  • Use AXI4-Stream for high-rate pipelines; AXI4 master for memory bursts; AXI-Lite for control.

  • Align bursts to 64/128/512-bit widths; pack data types to meet the bus width.

  • For video/DSP, keep everything streaming; avoid writing intermediate arrays to DDR.
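A plain-C++ model of packing data to the bus width (function names invented; in HLS the word type would be ap_uint<64>):

```cpp
#include <cstdint>

// Pack four 16-bit samples into one 64-bit beat so a 64-bit AXI master
// moves 4 samples per transfer instead of 1.
uint64_t pack4(const uint16_t s[4]) {
    uint64_t w = 0;
    for (int i = 0; i < 4; i++)
        w |= (uint64_t)s[i] << (16 * i);   // sample i in bits [16i, 16i+15]
    return w;
}

void unpack4(uint64_t w, uint16_t s[4]) {
    for (int i = 0; i < 4; i++)
        s[i] = (uint16_t)(w >> (16 * i));
}
```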


9) Typical optimization patterns

(a) FIR / dot-product inner loop

for (int i=0;i<N;i++) {
  #pragma HLS PIPELINE II=1
  acc += x[i]*h[i];       // one MAC per cycle at II=1
}
  • Put h[] in ROM; partition if reading multiple taps/cycle.

  • Accumulate in tree if unrolled.

(b) 2-D stencil (e.g., 3×3 conv)

  • Use line buffers + window buffer; pipeline inner x-loop, unroll the 3×3 MAC, partition the window.

  • Read one pixel/cycle → update window → compute → write.
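The steps above can be sketched in plain C++ (a 3×3 box sum rather than a real conv, with made-up sizes; in HLS the window/line-buffer arrays would be fully partitioned and the pragma uncommented):

```cpp
#include <cstdint>

// Line buffers hold the previous two rows; a 3x3 window slides across.
// Each iteration reads exactly one new pixel from the input.
const int W = 8, H = 8;

void sum3x3(const uint8_t in[H][W], uint16_t out[H][W]) {
    uint8_t lb[2][W] = {};      // previous two rows
    uint8_t win[3][3] = {};     // sliding 3x3 window
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            // #pragma HLS PIPELINE II=1
            uint8_t px = in[y][x];            // one pixel read per cycle
            for (int r = 0; r < 3; r++)       // shift window left
                for (int c = 0; c < 2; c++) win[r][c] = win[r][c + 1];
            win[0][2] = lb[0][x];             // feed right column from lines
            win[1][2] = lb[1][x];
            win[2][2] = px;
            lb[0][x] = lb[1][x];              // push pixel down the lines
            lb[1][x] = px;
            uint16_t s = 0;                   // 3x3 "MAC", unrolled in HLS
            for (int r = 0; r < 3; r++)
                for (int c = 0; c < 3; c++) s += win[r][c];
            if (y >= 2 && x >= 2)             // window valid: write center
                out[y - 1][x - 1] = s;
        }
    }
}
```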

(c) Matrix multiply (tiled)

  • Tile A/B into on-chip BRAM; partition tile dimensions to match unroll; pipeline the k-loop; dataflow between load/compute/store.
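A runnable plain-C++ sketch of that tiling scheme (sizes and the `matmul_tiled` name are illustrative; in HLS the tile buffers map to BRAM and the tile dims are partitioned to match the unroll):

```cpp
#include <cstdint>

// Copy TxT tiles of A and B into local "on-chip" buffers, run the
// k-loop from those, accumulate per-tile, then write the C tile back.
const int M = 8, T = 4;                     // matrix size, tile size

void matmul_tiled(const int16_t A[M][M], const int16_t B[M][M],
                  int32_t C[M][M]) {
    for (int ti = 0; ti < M; ti += T)
    for (int tj = 0; tj < M; tj += T) {
        int32_t acc[T][T] = {};             // per-tile accumulators
        for (int tk = 0; tk < M; tk += T) {
            int16_t Atile[T][T], Btile[T][T];   // on-chip tile buffers
            for (int i = 0; i < T; i++)         // load stage
                for (int k = 0; k < T; k++) {
                    Atile[i][k] = A[ti + i][tk + k];
                    Btile[i][k] = B[tk + i][tj + k];
                }
            for (int k = 0; k < T; k++)         // compute stage
                for (int i = 0; i < T; i++)
                    // #pragma HLS PIPELINE II=1 on the j-loop in HLS
                    for (int j = 0; j < T; j++)
                        acc[i][j] += (int32_t)Atile[i][k] * Btile[k][j];
        }
        for (int i = 0; i < T; i++)             // store stage
            for (int j = 0; j < T; j++)
                C[ti + i][tj + j] = acc[i][j];
    }
}
```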

10) Why your II isn’t 1 (quick fixes)

  • BRAM port conflict → partition/reshape, or duplicate buffers.

  • Write-after-read dep on same array → use double-buffering or separate read/write buffers.

  • Function call boundary blocks scheduling → #pragma HLS INLINE.

  • Streams stall (empty/full) → increase FIFO depth, balance stage latencies, or reduce unroll on slower stage.

  • Division/mod in loop → replace with reciprocal/mul, LUTs, or hoist invariant ops.
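The reciprocal-multiply trick for a constant divisor, as a concrete sketch (divisor 7 chosen arbitrarily; in HLS the 33-bit product would be an ap_uint):

```cpp
#include <cstdint>

// A true divider in a pipelined loop is big and often multi-cycle.
// For division by a constant, multiply by a scaled reciprocal:
// floor(x/7) == (x * 74899) >> 19, with 74899 = ceil(2^19 / 7).
// This particular choice is exact for every 16-bit x, and HLS maps
// the multiply to a single DSP.
uint16_t div7(uint16_t x) {
    return (uint16_t)(((uint64_t)x * 74899u) >> 19);
}
```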


11) Close timing (Fmax) early

  • Check post-synthesis timing as you add pragmas; long add chains from unrolls kill Fmax → add pipeline registers or reduce factor.

  • Prefer balanced adder trees for reductions (many tools infer them from unrolled loops).

  • Constrain the target clock realistically (e.g., 200–300 MHz for mid-range devices).
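What a balanced adder tree looks like when written out by hand (an 8-input case; HLS usually infers this shape from an unrolled loop, so the explicit version is just for intuition):

```cpp
#include <cstdint>

// Balanced tree: 3 adds on the critical path (log2 of 8) instead of
// 7 for a sequential chain, so the same unrolled reduction closes
// timing at a much higher Fmax.
int32_t tree_sum8(const int16_t v[8]) {
    int32_t l1[4], l2[2];
    for (int i = 0; i < 4; i++) l1[i] = (int32_t)v[2 * i] + v[2 * i + 1]; // level 1
    for (int i = 0; i < 2; i++) l2[i] = l1[2 * i] + l1[2 * i + 1];        // level 2
    return l2[0] + l2[1];                                                 // level 3
}
```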


12) Verification & iteration loop

  1. C sim (golden ref) →

  2. HLS C/RTL cosim with real vectors →

  3. Inspect HLS report: II, latency, resource (DSP/BRAM/URAM/LUT) →

  4. Adjust pragmas/types/buffering →

  5. Integrate in IP, run implementation to confirm timing/area.


Pragmas/C++ knobs (cheat sheet)

  • Throughput: PIPELINE, UNROLL, DATAFLOW, INLINE

  • Memory BW: ARRAY_PARTITION, ARRAY_RESHAPE, BIND_STORAGE (RESOURCE core=RAM_… in older versions), STREAM depth=…

  • Numerics: ap_[u]fixed, ap_[u]int<N>, BIND_OP impl=DSP

  • Sharing/area: ALLOCATION limit=…, LATENCY max=…

  • Deps: DEPENDENCE inter|intra false, restrict, const


Minimal example: load/compute/store dataflow

#include <ap_int.h>
#include <hls_stream.h>

// load/compute/store are defined elsewhere
void load(ap_uint<64>* in, hls::stream<ap_uint<64>>& s, int n);
void compute(hls::stream<ap_uint<64>>& s_in, hls::stream<ap_uint<64>>& s_out, int n);
void store(hls::stream<ap_uint<64>>& s, ap_uint<64>* out, int n);

void top(ap_uint<64>* in, ap_uint<64>* out, int n) {
  #pragma HLS INTERFACE m_axi     port=in  offset=slave bundle=gmem
  #pragma HLS INTERFACE m_axi     port=out offset=slave bundle=gmem
  #pragma HLS INTERFACE s_axilite port=in
  #pragma HLS INTERFACE s_axilite port=out
  #pragma HLS INTERFACE s_axilite port=n
  #pragma HLS INTERFACE s_axilite port=return
  #pragma HLS DATAFLOW

  hls::stream<ap_uint<64>> s_in, s_out;
  #pragma HLS STREAM variable=s_in  depth=64
  #pragma HLS STREAM variable=s_out depth=64

  load(in, s_in, n);      // burst reads
  compute(s_in, s_out, n);// PIPELINE II=1 inside
  store(s_out, out, n);   // burst writes
}

TL;DR tuning order

  1. Fixed-point & narrow widths

  2. Memory banking/tiling

  3. PIPELINE inner loops (II=1)

  4. UNROLL to match memory BW

  5. DATAFLOW across stages

  6. Bind to DSP/BRAM/URAM

  7. Iterate with reports until timing/resources meet spec.
