SW/HW CoSim That Sticks: Three Reusable, Practical Patterns


SW/HW Co-Simulation—What It’s For and How We Do It
Software/hardware co-simulation links executable software to a hardware model to verify behavior before silicon or FPGA bring-up. Teams use it at two levels: unit-level (e.g., HLS IP correctness and I/O contracts) and system-level (firmware exercising resets, interrupts, backpressure, and peripheral models). Methodologies span a spectrum of control and productivity:
File-based (C/C++ vectors + vendor TB): functional checks with minimal setup; timing/handshakes live in the RTL testbench.
Verilator (C++): cycle-accurate, per-cycle control (`eval()`); the fastest loops for fuzzing, CI, and performance studies.
cocotb + Verilator (Python): cycle-accurate with Python ergonomics (async/await, easy randomization and analysis).
In this blog I will dive deep into the three flows above, using my open-source project `cosim_demo` as the demonstration, so you can pick the right co-sim lane for your project.
CoSim Without Tears: Picking the Right Flow for Your Project
If you build chips (or the stuff that talks to them), you eventually bump into software/hardware co-simulation. It shows up everywhere: pre-silicon bring-up, FPGA prototyping, “does this driver actually poke the right bits?” moments. But how you co-sim depends on what you’re trying to achieve:
HLS/FPGA prototyping: treat co-sim like a unit test harness for IP blocks. You’re proving the math and I/O early, fast, and often.
SoC firmware work: think system level. You’re not only checking the firmware; you’re validating that the modeled peripherals behave like the real ones—resets, interrupts, ready/valid, the whole dance.
The two dials: control and abstraction
Most flows fall along two axes: how much cycle-level control you need, and which language you want to write tests in.
1) File-based co-sim (C/C++ + vendor sim TB)
When you just want to know “does it work?” without micromanaging clocks, file-based is the lowest-friction path. Your C/C++ code pushes vectors and checks results; a prewritten RTL testbench handles timing and handshakes. It’s simple, portable, and great for golden-vector testing. The trade-off: backpressure or dynamic handshakes can feel scripted and clunky.
2) Verilator co-sim (pure C++)
Need to drive the design every cycle, tick clocks yourself, and profile performance? Verilator is your power tool. You link a Verilated model into C++, call `eval()` per cycle, and control ready/valid like you own the bus—because you do. It’s fast (no file I/O) and CI-friendly for big regressions and fuzzing.
3) cocotb + Verilator (Python)
Prefer writing tests in Python, but still want cycle accuracy? cocotb lets you `await` edges and randomize traffic with a couple of lines. You keep Verilator’s speed and tracing while tapping into Python’s ecosystem (hypothesis, numpy, quick data munging). Overhead exists, but for unit/regression scope it’s usually a non-issue.
The following table shows detailed tradeoffs among the three Co-Simulation approaches:
| Aspect | File-based CoSim | Verilator-based CoSim (C++) | cocotb + Verilator (Python) |
| --- | --- | --- | --- |
| Integration boundary | Files between C and Verilog TB | In-process C++ API ⇄ Verilated model | Python coroutines ⇄ Verilator via cocotb FFI |
| Authoring language | C/C++ for vector gen/check; Verilog TB for I/O | C++ testbench + Verilog DUT | Python testbench + Verilog DUT |
| Timing fidelity | Transaction/time-stepped; cycle sync is manual | Cycle-accurate (`eval()` per cycle) | Cycle-accurate (per-cycle drives/awaits) |
| Backpressure / handshakes | Awkward (pre-scripted) | Natural (toggle ready/valid each cycle) | Natural (await on `ready`, randomize `valid`) |
| Performance | Slower (disk I/O, process barriers) | Fast (no file I/O; multithreaded Verilator) | Fast; Python overhead acceptable for unit/regression scope |
| Tooling | Any vendor sim (Questa/ModelSim, etc.) | Verilator (OSS), C++17 toolchain | Verilator (OSS), Python 3.8+, cocotb |
| Waveforms | Vendor formats (WLF/VCD); viewer varies | FST with GTKWave | VCD/FST via Verilator tracing; GTKWave |
| Coverage | Vendor coverage tools | Verilator `--coverage` + `verilator_coverage` | Verilator coverage (enabled via extra args) + same tooling |
| Best for | Golden vectors, portability across simulators | High-iteration fuzzing, CI perf runs, native C/C++ | High-level tests, quick protos, Python ecosystem |
To make this concrete, the rest of the post walks through my open-source cosim_demo project. It drives the same tiny DUT across three harnesses:
(1) file-based C/C++ with a vendor-style RTL testbench
(2) Verilator with an in-process C++ testbench
(3) cocotb + Verilator in Python
That way, the only thing that changes is the methodology, not the design. The repo keeps setup friction low (Make targets, sample vectors, wave dumps), letting you compare timing control, backpressure handling, runtime, waveforms, and coverage side by side. We’ll use it to build each flow, run a few representative tests, peek at traces, and note where each approach shines.
What this repo focuses on
`cosim_demo` distills three patterns that cover 90% of practical needs while staying tiny and copy-pasteable:
File-based CoSim
Verilator-based CoSim (C++)
Verilator + Python cocotb (Make-based)
All three wrap the same simple DUT—a ready/valid adder—and emphasize scoreboard checking, directed + randomized traffic, and artifacts you can keep (waves, coverage).
The Adder DUT
The DUT, a simple ready/valid adder, owns a two-deep output queue (main buffer + spill) that sustains one result per cycle when `out_ready` is high and absorbs one cycle of backpressure without stalling inputs. It computes `in_a + in_b` in a single cycle, prioritizes draining the spill buffer, and asserts `in_ready` whenever the spill slot is free. An active-low reset (`rst_n`) clears both buffers and valid flags.
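To make that contract concrete before we touch any harness, here is a minimal, untimed C++ reference model of the queueing behavior. It is a sketch of my own, not code from the repo; the class and method names are invented, and it assumes the spill slot holds the older entry (which is why the output path drains spill first and `in_ready` tracks the spill slot):

```cpp
#include <cstdint>
#include <optional>

// Untimed reference model of the adder's two-deep output queue.
// Class/method names are illustrative, not taken from the repo.
class AdderRefModel {
public:
    // in_ready mirrors the DUT: accept input whenever the spill slot is free.
    bool in_ready() const { return !spill_.has_value(); }

    // One input handshake: compute in_a + in_b and enqueue the sum.
    // If the main slot is occupied, its (older) entry shifts into spill.
    bool push(uint32_t a, uint32_t b) {
        if (!in_ready()) return false;   // one cycle of backpressure already absorbed
        if (main_) { spill_ = main_; main_.reset(); }
        main_ = a + b;                   // single-cycle add
        return true;
    }

    // One output handshake (out_ready high): drain the spill slot first,
    // since it holds the older result; fall back to the main slot.
    std::optional<uint32_t> pop() {
        if (spill_) { auto v = spill_; spill_.reset(); return v; }
        auto v = main_;
        main_.reset();
        return v;
    }

private:
    std::optional<uint32_t> main_, spill_;  // two-deep output queue
};
```

Any of the three harnesses can use a model like this as the golden side of a scoreboard: push on every accepted input handshake, pop on every completed output handshake, and compare.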
The following flow diagram shows the internal adder architecture:
The examples in this repo
Example #1: file-based/simple_adder_rv
This example demonstrates file-based CoSim with a clean file contract: software emits inputs, the Verilog testbench reads them and drives the DUT, and outputs are captured and compared. It is designed for deterministic goldens and portability across simulators.
The C++ host testbench invokes QuestaSim via `std::system()`, using a fixed command string to run the simulation to completion (`run -all; quit -f`). Afterward, it validates results by comparing `outputs.txt` with the software-computed golden values. The `main()` exit code reports the outcome: `0` for pass, `5` for fail.
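For orientation, the host logic condenses to roughly the sketch below. It is illustrative, not the repo’s code: the file names, the vector count, and the vsim target `work.tb_adder` are placeholder assumptions; only the overall shape (emit vectors, run the simulator in batch mode, diff against goldens, exit 0 or 5) follows the description above.

```cpp
#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <vector>

int main() {
    // 1) Software side: emit stimulus and compute the golden results.
    //    File names and vector count are illustrative.
    std::vector<uint32_t> golden;
    {
        std::ofstream in("inputs.txt");
        for (uint32_t i = 0; i < 100; ++i) {
            const uint32_t a = i, b = 2 * i + 1;
            in << a << ' ' << b << '\n';
            golden.push_back(a + b);
        }
    }

    // 2) Run QuestaSim in batch mode to completion; the RTL testbench reads
    //    inputs.txt, drives the DUT, and writes outputs.txt.
    //    "work.tb_adder" is a placeholder, not the repo's actual unit name.
    if (std::system("vsim -c -do \"run -all; quit -f\" work.tb_adder") != 0)
        return 5;

    // 3) Compare the DUT's outputs against the golden values.
    std::ifstream out("outputs.txt");
    uint32_t got = 0;
    for (const uint32_t expected : golden)
        if (!(out >> got) || got != expected) return 5;  // fail

    return 0;  // pass
}
```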
The example end-to-end CoSim flow is shown below:
You can build and run this example with `make run`; the simulation should produce the following output:
Example #2: verilator-based/rv_adder_example
This example shows a simple yet portable Verilator-based co-simulation that verifies the ready/valid adder (i.e., `adder_rv_simple`) using a C++ testbench. The test drives both directed vectors and random streaming with backpressure, checks results with a scoreboard, dumps an FST waveform, and writes coverage.
The testbench `main()` runs a directed smoke test first with no backpressure (it forces `top->out_ready = 1`), then flushes any leftover outputs to clear the adder’s internal buffers. Next it switches to randomized streaming using `std::mt19937_64 rng(1)`, randomizing `a`, `b`, and `top->in_valid`/`top->out_ready` to exercise backpressure cases. Stimulus is driven on the negedge; after `top->eval()` on the posedge, it performs FST tracing and checks results. A `std::queue` holds golden values, and a mismatch counter accumulates errors—its total determines the `main()` return code.
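The per-cycle loop condenses to roughly the following sketch. Treat it as an approximation of the flow described above rather than the repo’s exact code: the model header name, port widths, iteration count, and reset length are assumptions, and the directed smoke test is omitted for brevity.

```cpp
#include <cstdint>
#include <memory>
#include <queue>
#include <random>

#include <verilated.h>
#include <verilated_cov.h>
#include <verilated_fst_c.h>
#include "Vadder_rv_simple.h"  // header name assumed from the DUT's module name

int main(int argc, char** argv) {
    Verilated::commandArgs(argc, argv);
    Verilated::traceEverOn(true);  // model assumed built with --trace-fst

    auto top = std::make_unique<Vadder_rv_simple>();
    VerilatedFstC tfp;
    top->trace(&tfp, 99);
    tfp.open("wave.fst");

    std::mt19937_64 rng(1);       // fixed seed => reproducible random streams
    std::queue<uint32_t> golden;  // scoreboard of expected sums, in order
    int mismatches = 0;
    uint64_t t = 0;

    // Apply active-low reset for two cycles (directed smoke test omitted).
    top->rst_n = 0;
    for (int i = 0; i < 2; ++i) {
        top->clk = 0; top->eval(); tfp.dump(t++);
        top->clk = 1; top->eval(); tfp.dump(t++);
    }
    top->rst_n = 1;

    for (int cycle = 0; cycle < 1000; ++cycle) {
        // Negedge: drive fresh stimulus so it is stable before the posedge.
        top->clk = 0;
        top->in_valid  = rng() & 1;
        top->out_ready = rng() & 1;
        top->in_a = static_cast<uint32_t>(rng());
        top->in_b = static_cast<uint32_t>(rng());
        top->eval();
        tfp.dump(t++);

        // Handshakes that will complete at the coming posedge.
        const bool out_fire = top->out_valid && top->out_ready;
        const bool in_fire  = top->in_valid  && top->in_ready;
        if (out_fire) {  // check the oldest expected sum
            if (golden.empty() || top->out_sum != golden.front()) ++mismatches;
            if (!golden.empty()) golden.pop();
        }
        if (in_fire) golden.push(top->in_a + top->in_b);

        // Posedge: the DUT commits state; trace the updated signals.
        top->clk = 1;
        top->eval();
        tfp.dump(t++);
    }

    tfp.close();
    VerilatedCov::write("coverage.dat");  // requires --coverage at Verilation time
    top->final();
    return mismatches;  // nonzero on any scoreboard mismatch
}
```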
The following diagram demonstrates the end-to-end simulation flow:
You can build and run this example with `make run`; the simulation should produce the following output:
As shown above, the screenshot confirms the expected artifacts and shows the coverage summary (e.g., `Total coverage ... 82.00%`) and the annotation-directory hint.
Example #3: verilator-based/cocotb_rv_adder
This is a Make-based test that drives the same ready/valid adder on the Verilator backend. It shows how to verify a SystemVerilog ready/valid adder (`adder_rv_simple.sv`) using a cocotb Python testbench on a Verilated C++ simulator. It also includes FST waveform dumping and Verilator coverage with post-run source annotation.
The cocotb test coroutine `test_adder_rv_simple()` mirrors Example #2—running a directed smoke test followed by randomized traffic—but differs in stimulus and clock handling. Specifically, inputs are applied right after the negedge via `await FallingEdge(dut.clk)`, and outputs are sampled in the simulator’s Observed stage using `await ReadOnly()`. This ensures cycle-accurate observation without race conditions. The following code snippet demonstrates the input driving and output sampling:
```python
# drive only on the falling edge
await FallingEdge(dut.clk)
dut.in_valid.value = 1
dut.in_a.value = a
dut.in_b.value = b

# observe only after the rising edge
await RisingEdge(dut.clk)
await ReadOnly()
# sample outputs here; no writes are allowed in the read-only phase
dut_sum = dut.out_sum.value
```
For cocotb–Verilator communication, VPI is the glue. Cocotb registers its callbacks into the Verilated C++ model via VPI, which is the standard way for external code to hook into the simulator’s event schedule and design hierarchy. In fact, cocotb talks to Verilator entirely through VPI.
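If you have only consumed VPI through cocotb, here is what the producer side looks like: a minimal, self-contained sketch of the standard registration pattern from IEEE 1800’s `vpi_user.h`. This illustrates the mechanism cocotb’s GPI layer automates; it is not cocotb’s actual code. The hierarchical path `top.clk` is hypothetical, and with Verilator the design must be Verilated with `--vpi` (and the signal made public) for the lookup to work.

```cpp
#include <cstdio>
#include <vpi_user.h>  // standard VPI header (shipped with Verilator and vendor sims)

// Callback fired once at the start of simulation: look up a signal by its
// hierarchical path and read its value. "top.clk" is a hypothetical path.
static PLI_INT32 on_start_of_sim(p_cb_data /*unused*/) {
    char path[] = "top.clk";
    vpiHandle sig = vpi_handle_by_name(path, nullptr);
    if (sig) {
        s_vpi_value val;
        val.format = vpiIntVal;
        vpi_get_value(sig, &val);
        std::printf("top.clk = %d at start of simulation\n", val.value.integer);
    }
    return 0;
}

// Registration routine: schedule the callback for cbStartOfSimulation.
static void register_callbacks() {
    s_cb_data cb{};
    cb.reason = cbStartOfSimulation;
    cb.cb_rtn = on_start_of_sim;
    vpi_register_cb(&cb);
}

// Simulators scan this null-terminated table at load time to find the
// library's entry points; cocotb hooks itself into the simulator the same way.
void (*vlog_startup_routines[])() = {register_callbacks, nullptr};
```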
And if you’re not familiar with simulation time stages, here’s a quick primer. Basically, at a given simulation time t, the kernel doesn’t execute everything at once. Instead, it runs through a fixed sequence of event regions (queues). Processes schedule work into these queues; the kernel drains them in order until no more work remains (this may take multiple delta cycles at the same time t). Only then can time advance to t+Δ. This ordering lets RTL and testbench code read and write signals without races. The following are the nine regions in order:
Preponed – Snapshot values before updates (e.g., immediate assertion sampling on edges).
Active – Normal execution; blocking assigns update nets/vars; NBAs are scheduled here.
Inactive – Handles `#0` delays and events deferred within the same time slot.
NBA (Non-Blocking Assign) – Commits all scheduled NBAs “simultaneously” (sequential logic).
Re-NBA – A second NBA commit pass (for NBAs scheduled by later regions).
Observed – Read-only; assertions/coverage/sample points see final values for this time.
Reactive – Testbench/program/clocking-block drives; meant to avoid racing with the DUT.
Re-Active – Any re-evaluation triggered by reactive drives.
Postponed – Last read-only look; `$monitor`, VCD/FST dumps, PLI/VPI callbacks.
The following diagram demonstrates the end-to-end simulation flow:
You can build and run this example with `make run`; the simulation should produce the following output:
The above screenshot confirms the expected artifacts and shows the test summary, including the test result and simulation time.
Summary
SW/HW co-simulation meaningfully speeds pre-silicon work—both chip design and FPGA prototyping—by letting you verify at the abstraction and control level that best fits the task. cosim_demo isn’t a heavyweight framework; it’s a set of small, production-ready patterns you can drop into real codebases. Use file-based co-sim for portable, deterministic golden checks; choose Verilator + C++ when you need cycle-accurate control, high speed, and coverage; reach for cocotb + Verilator when Python’s ergonomics make test authoring faster. Same DUT, three viewpoints—pick what fits today, keep the others handy as your verification needs evolve.
Written by Shuran Xu
I am currently a senior software engineer at Microchip working on embedded applications in HLS C++. In my free time I love building a wide variety of side projects, ranging from RTL-level modular design to simple games in C++. I also enjoy writing technical blogs to share my knowledge and insights with everyone interested in the tech world.