How to optimize FPGA HLS design?


Here’s a compact, battle-tested playbook for optimizing FPGA HLS (Xilinx Vitis HLS / Intel HLS). The themes are: feed the pipeline, remove memory bottlenecks, and right-size arithmetic.
1) Start from a clean, "hardware-friendly" C/C++
- No recursion, dynamic allocation, or virtual functions.
- Make loop bounds static (or give the tool maximum bounds, e.g. via `LOOP_TRIPCOUNT` for latency reporting).
- Use `restrict` pointers (or `const &`) to let the tool remove false dependencies.
- Prefer structs of arrays (SoA) over arrays of structs (AoS) for parallel memory access.
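As a minimal sketch of the SoA point, in plain C++ (the struct names and the pragma placement are illustrative, not from any particular design):

```cpp
#include <cstdint>

#define N 64

// AoS: the fields of one element share a memory word, so reading the same
// field across many elements contends for the same array/bank.
struct PointAoS { int16_t x; int16_t y; };

// SoA: one array per field. Each field can map to its own BRAM bank, so an
// unrolled or pipelined loop can fetch several x-values in the same cycle.
struct PointsSoA { int16_t x[N]; int16_t y[N]; };

// Hypothetical kernel: sum of x-coordinates. With SoA, p.x can be
// ARRAY_PARTITIONed independently of p.y.
int32_t sum_x(const PointsSoA &p) {
    int32_t acc = 0;
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        acc += p.x[i];
    }
    return acc;
}
```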
2) Right-size your numerics
- Replace `float`/`double` with fixed-point (e.g., `ap_fixed`, `ac_fixed`) where accuracy allows.
- Narrow integer widths (e.g., `ap_uint<11>`).
➜ Cuts LUT/DSP/BRAM use and raises Fmax.
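A rough sketch of what fixed-point buys you, written with plain integers in Q16.16 format rather than `ap_fixed` so the arithmetic is visible (the type and helper names are illustrative):

```cpp
#include <cstdint>

// Q16.16 fixed-point modeled on int32_t -- roughly what ap_fixed<32,16>
// gives you, with 16 integer bits and 16 fractional bits.
typedef int32_t q16_16;
const int FRAC_BITS = 16;

q16_16 to_fixed(double v)   { return (q16_16)(v * (1 << FRAC_BITS)); }
double to_double(q16_16 v)  { return (double)v / (1 << FRAC_BITS); }

// Multiply: widen to 64 bits, then shift the fraction back. This maps to a
// single DSP multiply on most FPGAs, versus a multi-cycle floating-point core.
q16_16 fx_mul(q16_16 a, q16_16 b) {
    return (q16_16)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}
```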
3) Design the memory architecture first
The most common reason for II > 1 is a single BRAM port trying to serve several reads/writes per cycle.
Partition or reshape arrays to get parallel banks:
- `#pragma HLS ARRAY_PARTITION variable=A complete dim=2` (fully parallel rows/cols)
- `#pragma HLS ARRAY_RESHAPE variable=A factor=4 dim=1` (wider data per access)
Use line buffers/tiling for 2-D kernels so you hit on-chip RAM instead of DDR every cycle.
For external memory:
- Burst transfers (AXI4); align and coalesce accesses.
- Stream tiles into the kernel, compute, stream results out.
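To make the banking idea concrete, here is a sketch in plain C++ (array size and partition factor are illustrative): two reads per iteration from one array need two banks to reach II=1.

```cpp
#include <cstdint>

#define N 128

// Summing two elements per iteration needs two reads/cycle from `a`.
// A single BRAM port forces II=2; a cyclic partition by 2 gives each
// parity its own bank so both reads can land in the same cycle.
int32_t sum_pairs(const int16_t a[N]) {
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=2 dim=1
    int32_t acc = 0;
    for (int i = 0; i < N; i += 2) {
#pragma HLS PIPELINE II=1
        acc += a[i] + a[i + 1];  // even bank + odd bank, no port conflict
    }
    return acc;
}
```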
4) Pipeline the critical loops
Target II=1 on the innermost loop doing the MAC/compute:
- `#pragma HLS PIPELINE II=1` (place it at the deepest feasible loop).
If loop-carried dependencies block pipelining:
- Reorder loops, accumulate in local variables, or use reduction trees.
- Use `#pragma HLS DEPENDENCE variable=… inter false` only if you're sure it's a false dependence.
5) Unroll only where the memory can keep up
- `#pragma HLS UNROLL factor=…` duplicates hardware lanes.
- Match the unroll factor to the available memory bandwidth (ports × banks × bytes/cycle).
➜ If memory can't feed it, unrolling just increases stalls and area.
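A sketch of matching the unroll factor to the banking (all sizes illustrative; in a real design the factors come from your port budget):

```cpp
#include <cstdint>

#define N 64
#define LANES 4  // unroll factor, chosen to match the partition factor below

// Four lanes need four reads/cycle from each input, so both arrays are split
// into four cyclic banks. Unrolling further without adding banks would only
// add stalls and area, not throughput.
int32_t dot4(const int16_t x[N], const int16_t h[N]) {
#pragma HLS ARRAY_PARTITION variable=x cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=h cyclic factor=4 dim=1
    int32_t acc = 0;
    for (int i = 0; i < N; i += LANES) {
#pragma HLS PIPELINE II=1
        for (int l = 0; l < LANES; l++) {
#pragma HLS UNROLL
            acc += x[i + l] * h[i + l];  // one MAC per lane per cycle
        }
    }
    return acc;
}
```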
6) Dataflow across functions/stages
Split the kernel into load → compute → store and connect the stages with streams/FIFOs:
- `#pragma HLS DATAFLOW` in the top wrapper; use `hls::stream<T>` between stages.
- Size stream depths (e.g., `#pragma HLS STREAM variable=… depth=64`) to decouple producer and consumer and absorb backpressure.
7) Bind resources deliberately
- Map MACs to DSPs: `#pragma HLS BIND_OP variable=… op=mul impl=dsp`
- Steer small ROMs into LUTRAM and big buffers into BRAM/URAM with `#pragma HLS BIND_STORAGE` (the legacy `RESOURCE` pragma in older Vivado HLS).
- Control sharing/duplication: `#pragma HLS ALLOCATION operation instances=mul limit=…` (share to save area; duplicate to hit II=1).
8) Interfaces & I/O throughput
Use AXI4-Stream for high-rate pipelines; AXI4 master for memory bursts; AXI-Lite for control.
Align bursts to 64/128/512-bit widths; pack data types to meet the bus width.
For video/DSP, keep everything streaming; avoid writing intermediate arrays to DDR.
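A sketch of the data-packing idea for a 64-bit bus (plain C++; the 4×16-bit layout is just an example of filling the bus width):

```cpp
#include <cstdint>

// Pack four 16-bit samples into one 64-bit bus word so each AXI beat
// carries a full payload; sample i lands in bits [16*i+15 : 16*i].
uint64_t pack4(const uint16_t s[4]) {
    uint64_t w = 0;
    for (int i = 0; i < 4; i++)
        w |= (uint64_t)s[i] << (16 * i);
    return w;
}

// Inverse: split a 64-bit word back into four samples on the other side.
void unpack4(uint64_t w, uint16_t s[4]) {
    for (int i = 0; i < 4; i++)
        s[i] = (uint16_t)(w >> (16 * i));
}
```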
9) Typical optimization patterns
(a) FIR / dot-product inner loop

```cpp
for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
    acc += x[i] * h[i];
}
```

Put `h[]` in ROM; partition `x[]` if reading multiple taps per cycle. Accumulate in a tree if unrolled.
(b) 2-D stencil (e.g., 3×3 conv)
Use line buffers + window buffer; pipeline inner x-loop, unroll the 3×3 MAC, partition the window.
Read one pixel/cycle → update window → compute → write.
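That read-one-pixel-per-cycle loop can be sketched in plain C++ as follows (a 3×3 window sum stands in for the convolution; sizes and names are illustrative, and the pragma comment marks where PIPELINE would go):

```cpp
#include <cstdint>

#define W 8
#define H 8

// Line-buffer pattern for a 3x3 kernel: two line buffers hold the previous
// rows, a 3x3 register window slides across, and each pixel is read from
// external memory exactly once.
void sum3x3(const uint8_t in[H][W], uint16_t out[H][W]) {
    uint8_t line[2][W] = {{0}};  // previous two rows (BRAM in HLS)
    uint8_t win[3][3] = {{0}};   // 3x3 window (registers; fully partitioned)

    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            // #pragma HLS PIPELINE II=1  (this inner loop is the one to pipeline)
            uint8_t px = in[y][x];            // one new pixel per cycle
            for (int r = 0; r < 3; r++)       // shift window one column left
                for (int c = 0; c < 2; c++)
                    win[r][c] = win[r][c + 1];
            win[0][2] = line[0][x];           // new right column: two buffered
            win[1][2] = line[1][x];           // rows plus the fresh pixel
            win[2][2] = px;
            line[0][x] = line[1][x];          // rotate line buffers for next row
            line[1][x] = px;
            if (y >= 2 && x >= 2) {           // emit once the window is valid
                uint16_t acc = 0;
                for (int r = 0; r < 3; r++)
                    for (int c = 0; c < 3; c++)
                        acc += win[r][c];
                out[y - 1][x - 1] = acc;      // result at the window center
            }
        }
    }
}
```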
(c) Matrix multiply (tiled)
- Tile A/B into on-chip BRAM; partition tile dimensions to match unroll; pipeline the k-loop; dataflow between load/compute/store.
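A plain-C++ sketch of that tiling pattern (matrix and tile sizes are illustrative; the pragma comment marks where pipelining would go):

```cpp
#include <cstdint>

#define N 8   // full matrix size
#define T 4   // tile size -- small enough to fit on-chip BRAM

// Copy a T x T tile of A and B on chip, run the MAC loops entirely from the
// local buffers, accumulate into C. In HLS the tile buffers would be
// ARRAY_PARTITIONed to match the k-loop unroll.
void matmul_tiled(const int16_t A[N][N], const int16_t B[N][N], int32_t C[N][N]) {
    int16_t Atile[T][T], Btile[T][T];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) C[i][j] = 0;

    for (int i0 = 0; i0 < N; i0 += T)
        for (int j0 = 0; j0 < N; j0 += T)
            for (int k0 = 0; k0 < N; k0 += T) {
                // load: burst the two tiles into local buffers
                for (int i = 0; i < T; i++)
                    for (int k = 0; k < T; k++) Atile[i][k] = A[i0 + i][k0 + k];
                for (int k = 0; k < T; k++)
                    for (int j = 0; j < T; j++) Btile[k][j] = B[k0 + k][j0 + j];
                // compute: every read now hits on-chip memory
                for (int i = 0; i < T; i++)
                    for (int j = 0; j < T; j++) {
                        // #pragma HLS PIPELINE II=1 (unroll k to match partitioning)
                        int32_t acc = C[i0 + i][j0 + j];
                        for (int k = 0; k < T; k++)
                            acc += Atile[i][k] * Btile[k][j];
                        C[i0 + i][j0 + j] = acc;
                    }
            }
}
```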
10) Why your II isn't 1 (quick fixes)
- BRAM port conflict → partition/reshape, or duplicate buffers.
- Write-after-read dependence on the same array → use double buffering or separate read/write buffers.
- Function-call boundary blocks scheduling → `#pragma HLS INLINE`.
- Streams stall (empty/full) → increase FIFO depth, balance stage latencies, or reduce unroll on the slower stage.
- Division/modulo in a loop → replace with a reciprocal multiply, lookup tables, or hoist invariant ops.
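For the division point, one concrete sketch: dividing by a constant via a precomputed fixed-point reciprocal multiply (the constant below is the standard magic number for dividing a 32-bit value by 10):

```cpp
#include <cstdint>

// A divider core is large and deeply pipelined; a multiply maps to one DSP.
// 0xCCCCCCCD = ceil(2^35 / 10); multiply then shift by 35 reproduces x / 10
// exactly for every 32-bit x.
const uint64_t RECIP_10 = 0xCCCCCCCDULL;

uint32_t div10(uint32_t x) {
    return (uint32_t)(((uint64_t)x * RECIP_10) >> 35);
}
```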
11) Close timing (Fmax) early
Check post-synthesis timing as you add pragmas; long add chains from unrolls kill Fmax → add pipeline registers or reduce factor.
Prefer balanced adder trees for reductions (many tools infer them from unrolled loops).
Constrain the target clock realistically (e.g., 200–300 MHz for mid-range devices).
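A sketch of the balanced-reduction shape (plain C++; HLS tools usually infer this from an unrolled accumulation, but writing it out makes the log-depth structure visible):

```cpp
#include <cstdint>

#define N 8  // number of lanes (power of two for a clean tree)

// Balanced adder tree: log2(N) levels of pairwise adds instead of a serial
// N-deep chain, so the critical path grows logarithmically with the unroll.
int32_t tree_sum(const int16_t v[N]) {
    int32_t t[N];
    for (int i = 0; i < N; i++) t[i] = v[i];
    for (int stride = N / 2; stride > 0; stride /= 2)  // one tree level per pass
        for (int i = 0; i < stride; i++)
            t[i] = t[i] + t[i + stride];               // pairwise adds in parallel
    return t[0];
}
```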
12) Verification & iteration loop
C sim (golden ref) →
HLS C/RTL cosim with real vectors →
Inspect HLS report: II, latency, resource (DSP/BRAM/URAM/LUT) →
Adjust pragmas/types/buffering →
Integrate in IP, run implementation to confirm timing/area.
Pragmas/C++ knobs (cheat sheet)
- Throughput: `PIPELINE`, `UNROLL`, `DATAFLOW`, `INLINE`
- Memory BW: `ARRAY_PARTITION`, `ARRAY_RESHAPE`, `RESOURCE core=RAM_…` (`BIND_STORAGE` in Vitis HLS), `STREAM depth=…`
- Numerics: `ap_[u]fixed`, `ap_[u]int<N>`, `BIND_OP impl=dsp`
- Sharing/area: `ALLOCATION limit=…`, `LATENCY max=…`
- Deps: `DEPENDENCE inter|intra false`, `restrict`, `const`
Minimal example: load/compute/store dataflow

```cpp
#include <ap_int.h>
#include <hls_stream.h>

// load/compute/store are the three dataflow stages; compute runs at II=1 inside.
void top(ap_uint<64>* in, ap_uint<64>* out, int n) {
#pragma HLS INTERFACE m_axi     port=in  offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=out offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=in
#pragma HLS INTERFACE s_axilite port=out
#pragma HLS INTERFACE s_axilite port=n
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS DATAFLOW
    hls::stream<ap_uint<64>> s_in, s_out;
#pragma HLS STREAM variable=s_in  depth=64
#pragma HLS STREAM variable=s_out depth=64
    load(in, s_in, n);        // burst reads
    compute(s_in, s_out, n);  // PIPELINE II=1 inside
    store(s_out, out, n);     // burst writes
}
```
TL;DR tuning order
Fixed-point & narrow widths →
Memory banking/tiling →
PIPELINE inner loops (II=1) →
UNROLL to match memory BW →
DATAFLOW across stages →
Bind to DSP/BRAM/URAM →
Iterate with reports until timing/resources meet spec.