This doc keeps track of my experience diagnosing CKB’s unstable tests. While the issue is unlikely to affect most application developers on CKB, it may be of interest to Rust developers working on complex multithreaded systems or those curious about advanced debugging techniques.

A PR has been created in CKB to fix tests per discovery documented in this doc.

Every now and then, when running CKB’s test, you might see errors like:

pthread lock: Invalid argument
terminate called without an active exception

These failures occur randomly (~5% of test runs) and are difficult to reproduce. I suspected the issue might be related to ckb-sync, and this post documents the debugging journey, tools used, and eventual fix.

Reproducing the Issue

Testing with `cargo nextest`

Cargo-nextest is a test runner for Rust projects. Based on the actual command executed by make test, I could piece together the following command:

$ cargo nextest run  --features with_sentry --no-fail-fast \
    --hide-progress-bar --success-output immediate-final \
    --failure-output immediate-final -p ckb-sync

However, this command would fail to build. It seems that running the ckb-sync package alone, ckb-sync is missing a dev-time feature dependency, like the following line:

ckb-tx-pool = { workspace = true, features = ["internal"] }

After adding this line, the above command would succeed in building and proceed to run the tests from ckb-sync package.

On my machine, the test fails after ~20 iterations (5% failure rate) with either pthread lock: Invalid argument or terminate called without an active exception.

Deterministic Debugging with `rr` Multithreaded Failures

rr is a lightweight debugging tool for recording, replaying and debugging execution of applications (trees of processes and threads)

records a program’s execution once and allows deterministic replay of that exact run multiple times. By capturing all sources of nondeterminism (including thread scheduling), rr enables reliable debugging of multithreaded failures, making it especially useful for diagnosing flaky or intermittent test issues in complex systems like CKB.

Installation

Install rr per the instructions.

Note: rr has specific requirements on OS and your CPU, refer to the docs for more details.

To limit the traces generated by rr, I will focus on one particular test tests::sync_shared::test_insert_parent_unknown_block. From my local experiments, this test might throw the above mentioned errors.

Set rr as the cargo runner

$ export CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUNNER="/home/ubuntu/rr-obj/bin/rr record"

The rr binary on my machine is installed to /home/ubuntu/rr-obj/bin/rr, you might want to adjust this accordingly.

Run the test repeatedly

Then we can keep running this command until one of the errors shows up:

$ cargo nextest run  --features with_sentry --no-fail-fast \\
    --hide-progress-bar --success-output immediate-final \\
    --failure-output immediate-final -p ckb-sync \\
    tests::sync_shared::test_insert_parent_unknown_block

As each test run would accumulate rr traces and test data, you might use the following command between test runs to clean up all those data:

$ rm -rf ~/.local/share/rr/ /tmp/ckb-tmp-* /tmp/.tmp*

It’s not hard to write a script that keeps running the test until it fails:

#!/usr/bin/env bash
set -x

# This script simply runs the test repeatedly until faliure happens

unset CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUNNER
while true; do
  cargo nextest run  --features with_sentry --no-fail-fast \\
    --hide-progress-bar --success-output immediate-final \\
    --failure-output immediate-final -p \\
    ckb-sync tests::sync_shared::test_insert_parent_unknown_block
  RETURN_CODE=$?
  rm -rf /tmp/ckb-tmp-* /tmp/.tmp*

  if [[ "$RETURN_CODE" -ne 0 ]]; then
    echo "Failed with $RETURN_CODE"
    exit $RETURN_CODE
  fi
done

Or try this one that takes rr into account:

#!/usr/bin/env bash
set -x

RR="${RR:-/home/ubuntu/rr-obj/bin/rr}"

# This script runs the test wrapped with rr, so when failure happens,
# you can replay the failed test using rr

export CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUNNER="$RR record"
while true; do
  cargo nextest run  --features with_sentry --no-fail-fast \\
    --hide-progress-bar --success-output immediate-final \\
    --failure-output immediate-final -p \\
    ckb-sync tests::sync_shared::test_insert_parent_unknown_block
  RETURN_CODE=$?
  rm -rf /tmp/ckb-tmp-* /tmp/.tmp*

  if [[ "$RETURN_CODE" -ne 0 ]]; then
    echo "Failed with $RETURN_CODE, use $RR replay to rerun the failure!"
    exit $RETURN_CODE
  fi

  rm -rf ~/.local/share/rr
done

For some reason, the test tends to fail more often running under rr on my machine, my guess is that rr alters the behavior of multi-thread scheduling. Or maybe rr picks a different execution path compared to what is normally tested in our daily development.

Failures Observed

Error Case: `pthread lock: Invalid argument`

With enough runs, one of the test outputs might look like following:

$ cargo nextest run  --features with_sentry --no-fail-fast --hide-progress-bar --success-output immediate-final --failure-output immediate-final -p ckb-sync tests::sync_shared::test_insert_parent_unknown_block
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.24s
info: using target runner `/home/ubuntu/rr-obj/bin/rr record` defined by environment variable `CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUNNER`
────────────
 Nextest run ID d8f86415-500a-4b18-9093-cd21a656013a with nextest profile: default
    Starting 1 test across 1 binary (68 tests skipped)
     SIGABRT [   2.098s] ckb-sync tests::sync_shared::test_insert_parent_unknown_block
──── STDOUT:             ckb-sync tests::sync_shared::test_insert_parent_unknown_block

running 1 test
test tests::sync_shared::test_insert_parent_unknown_block ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 68 filtered out; finished in 0.93s

──── STDERR:             ckb-sync tests::sync_shared::test_insert_parent_unknown_block
pthread lock: Invalid argument

────────────
     Summary [   2.099s] 1 test run: 0 passed, 1 failed, 68 skipped
     SIGABRT [   2.098s] ckb-sync tests::sync_shared::test_insert_parent_unknown_block
──── STDOUT:             ckb-sync tests::sync_shared::test_insert_parent_unknown_block

running 1 test
test tests::sync_shared::test_insert_parent_unknown_block ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 68 filtered out; finished in 0.93s

──── STDERR:             ckb-sync tests::sync_shared::test_insert_parent_unknown_block
pthread lock: Invalid argument

error: test run failed

At this case, you can find that rr has generated traces:

$ ls -lh ~/.local/share/rr/
total 12K
drwxrwx--- 2 ubuntu ubuntu 4.0K Apr 27 05:55 ckb_sync-a645bdacec5c2bf3-0
drwxrwx--- 2 ubuntu ubuntu 4.0K Apr 27 05:55 ckb_sync-a645bdacec5c2bf3-1
drwxrwx--- 2 ubuntu ubuntu 4.0K Apr 27 05:55 ckb_sync-a645bdacec5c2bf3-2
-rw------- 1 ubuntu ubuntu    8 Apr 27 05:55 cpu_lock
lrwxrwxrwx 1 ubuntu ubuntu   27 Apr 27 05:55 latest-trace -> ckb_sync-a645bdacec5c2bf3-2

I run the clean command above between each run, so you’ll only see three traces here. The last one corresponds to the actual failing test, while the first two might be bookkeeping runs of nextest. If you didn’t run the clean command, there could be more traces.

Now you can repeat the failed case:

$ ~/rr-obj/bin/rr replay
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/ubuntu/.local/share/rr/ckb_sync-a645bdacec5c2bf3-2/mmap_hardlink_4_ckb_sync-a645bdacec5c2bf3...
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /home/ubuntu/.local/share/rr/ckb_sync-a645bdacec5c2bf3-2/mmap_hardlink_4_ckb_sync-a645bdacec5c2bf3.
Use `info auto-load python-scripts [REGEXP]' to list them.
Remote debugging using 127.0.0.1:7932
Reading symbols from /lib64/ld-linux-x86-64.so.2...
Reading symbols from /usr/lib/debug/.build-id/1c/8db5f83bba514f8fd5f1fb6d7be975be1bb855.debug...
BFD: warning: system-supplied DSO at 0x6fffd000 has a section extending past end of file

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) y
Debuginfod has been enabled.
To make this setting permanent, add 'set debuginfod enabled on' to .gdbinit.
Downloading separate debug info for system-supplied DSO at 0x6fffd000
0x00007ce375295540 in _start () from /lib64/ld-linux-x86-64.so.2
(rr)

rr starts a gdb session for the ckb-sync test. We can use c to continue to run the test:

(rr) c
Continuing.
Downloading separate debug info for /lib/x86_64-linux-gnu/libstdc++.so.6

running 1 test
test tests::sync_shared::test_insert_parent_unknown_block ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 68 filtered out; finished in 0.93s

pthread lock: Invalid argument
[New Thread 72541.73153]
[New Thread 72541.72544]
[New Thread 72541.73028]
[New Thread 72541.73029]
[New Thread 72541.73082]
[New Thread 72541.73109]
[New Thread 72541.73111]
[New Thread 72541.73113]
[New Thread 72541.73114]
[New Thread 72541.73115]
[New Thread 72541.73118]
[New Thread 72541.73119]
[New Thread 72541.73152]
[New Thread 72541.73156]
[New Thread 72541.73157]

Thread 2 received signal SIGABRT, Aborted.
[Switching to Thread 72541.73153]
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/pthread_kill.c.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
warning: 44     ./nptl/pthread_kill.c: No such file or directory
(rr)

And use bt to print stack trace when the error happens:

(rr) bt

As the stack trace is quite long, I’m posting a screenshot instead:

We can track the failure to this line:

ffi::rocksdb_optimistictransactiondb_close(self.inner);

For some reason, closing an optimistic transaction DB triggers the failure. But the cause is unclear. Just a guess: what should a proper shutdown process in RocksDB look like? Is simply closing the DB instance enough?

As long as you don’t run rr record ... or rerun the test, you can use rr replay to rerun the failed test as many times as needed. Or, you can use rr replay ~/.local/share/rr/ckb_sync-a645bdacec5c2bf3-2 to manually pick the trace of the failure test. This way, you can continue running more rr record ....

Error Case: `terminate called without an active exception`

Enough runs of the test also reveal another error:

$ cargo nextest run  --features with_sentry --no-fail-fast --hide-progress-bar --success-output immediate-final --failure-output immediate-final -p ckb-sync tests::sync_shared::test_insert_parent_unknown_block
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.24s
info: using target runner `/home/ubuntu/rr-obj/bin/rr record` defined by environment variable `CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUNNER`
────────────
 Nextest run ID 2aa60494-7933-4b99-9f5f-cdced260411a with nextest profile: default
    Starting 1 test across 1 binary (68 tests skipped)
     SIGABRT [   2.083s] ckb-sync tests::sync_shared::test_insert_parent_unknown_block
──── STDOUT:             ckb-sync tests::sync_shared::test_insert_parent_unknown_block

running 1 test
test tests::sync_shared::test_insert_parent_unknown_block ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 68 filtered out; finished in 0.92s

──── STDERR:             ckb-sync tests::sync_shared::test_insert_parent_unknown_block
terminate called without an active exception

────────────
     Summary [   2.084s] 1 test run: 0 passed, 1 failed, 68 skipped
     SIGABRT [   2.083s] ckb-sync tests::sync_shared::test_insert_parent_unknown_block
──── STDOUT:             ckb-sync tests::sync_shared::test_insert_parent_unknown_block

running 1 test
test tests::sync_shared::test_insert_parent_unknown_block ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 68 filtered out; finished in 0.92s

──── STDERR:             ckb-sync tests::sync_shared::test_insert_parent_unknown_block
terminate called without an active exception

error: test run failed

We can also use rr to jump into a gdb session:

$ ~/rr-obj/bin/rr replay
GNU gdb (Ubuntu 15.0.50.20240403-0ubuntu1) 15.0.50.20240403-git
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/ubuntu/.local/share/rr/ckb_sync-a645bdacec5c2bf3-2/mmap_hardlink_4_ckb_sync-a645bdacec5c2bf3...
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /home/ubuntu/.local/share/rr/ckb_sync-a645bdacec5c2bf3-2/mmap_hardlink_4_ckb_sync-a645bdacec5c2bf3.
Use `info auto-load python-scripts [REGEXP]' to list them.
Remote debugging using 127.0.0.1:9252
Reading symbols from /lib64/ld-linux-x86-64.so.2...
Reading symbols from /usr/lib/debug/.build-id/1c/8db5f83bba514f8fd5f1fb6d7be975be1bb855.debug...
BFD: warning: system-supplied DSO at 0x6fffd000 has a section extending past end of file

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) y
Debuginfod has been enabled.
To make this setting permanent, add 'set debuginfod enabled on' to .gdbinit.
0x000072be440c4540 in _start () from /lib64/ld-linux-x86-64.so.2
(rr)

We can also use c to continue with the test, and use bt to check out stack trace when failure happens:

(rr) c
Continuing.

running 1 test
test tests::sync_shared::test_insert_parent_unknown_block ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 68 filtered out; finished in 0.92s

terminate called without an active exception
[New Thread 74722.74724]
[New Thread 74722.74725]
[New Thread 74722.74741]
[New Thread 74722.74742]
[New Thread 74722.74743]
[New Thread 74722.74744]
[New Thread 74722.74749]
[New Thread 74722.74750]
[New Thread 74722.74751]
[New Thread 74722.74752]
[New Thread 74722.74753]
[New Thread 74722.74754]
[New Thread 74722.74755]
[New Thread 74722.74771]
[New Thread 74722.74773]
[New Thread 74722.74775]
[New Thread 74722.74776]

Thread 1 received signal SIGABRT, Aborted.
Download failed: Invalid argument.  Continuing without source file ./nptl/./nptl/pthread_kill.c.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
warning: 44     ./nptl/pthread_kill.c: No such file or directory
(rr) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x000072be4384527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x000072be438288ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x000072be43ca5ff5 in __gnu_cxx::__verbose_terminate_handler () at ../../../../src/libstdc++-v3/libsupc++/vterminate.cc:95
#6  0x000072be43cbb0da in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:48
#7  0x000072be43ca5a55 in std::terminate () at ../../../../src/libstdc++-v3/libsupc++/eh_terminate.cc:58
#8  0x00005f7ac4eab073 in std::__terminate () at /usr/include/x86_64-linux-gnu/c++/13/bits/c++config.h:322
#9  std::thread::~thread (this=0x72be3c170ea0, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/std_thread.h:173
#10 0x00005f7ac4eefb68 in std::default_delete<std::thread>::operator() (this=0x5f7ac7268318 <rocksdb::PeriodicTaskScheduler::Default()::timer+152>,
    __ptr=0x72be3c170ea0) at /usr/include/c++/13/bits/unique_ptr.h:99
#11 0x00005f7ac4ede608 in std::unique_ptr<std::thread, std::default_delete<std::thread> >::~unique_ptr (
    this=0x5f7ac7268318 <rocksdb::PeriodicTaskScheduler::Default()::timer+152>, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/unique_ptr.h:404
#12 0x00005f7ac505d876 in rocksdb::Timer::~Timer (this=0x5f7ac7268280 <rocksdb::PeriodicTaskScheduler::Default()::timer>, __in_chrg=<optimized out>)
    at rocksdb/util/timer.h:48
#13 0x000072be43847a76 in __run_exit_handlers (status=0, listp=<optimized out>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
    at ./stdlib/exit.c:108
#14 0x000072be43847bbe in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:138
#15 0x000072be4382a1d1 in __libc_start_call_main (main=main@entry=0x5f7ac3eeaf40 <main>, argc=argc@entry=4, argv=argv@entry=0x7ffe436ca588)
    at ../sysdeps/nptl/libc_start_call_main.h:74
#16 0x000072be4382a28b in __libc_start_main_impl (main=0x5f7ac3eeaf40 <main>, argc=4, argv=0x7ffe436ca588, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7ffe436ca578) at ../csu/libc-start.c:360
#17 0x00005f7ac3e1a4c5 in _start ()
(rr)

This is actually less obvious than the previous one, the error happens when C++ code destructs a timer instance.

Just to take a guess: maybe there is a timer left unprocessed, when closing a Rocksdb DB instance?

Diving into the Code

Once we had a reliable way to reproduce the failure and replay it in rr, it is time to figure out why it was happening.

When diving into the code, the first thing I noticed is a redundant delete C++ call:

ffi::rocksdb_optimistictransactiondb_close_base_db(self.base_db);
ffi::rocksdb_optimistictransactiondb_close(self.inner);

If we track down the code, both Line 152 and 153 here will eventually invoke ~StackableDB (Line 34-41):

  ~StackableDB() override {
    if (shared_db_ptr_ == nullptr) {
      delete db_;
    } else {
      assert(shared_db_ptr_.get() == db_);
    }
    db_ = nullptr;
  }

After the first invocation, db_ will become nullptr, and C++’s delete first checks if its operand is nullptr. If so, the delete call will just be a NOOP.

So the extra delete C++ call is redundant, not a double-free. In fact, even if we remove one of the double delete calls, we could still run into failures. Here’s one possible stacktrace:

And now, it’s mostly just repeating the following loop:

Read the code.
Add println! in Rust code or printf lines in C++ code.
Use the above bash script to compile and repeatedly run the tests until a failure occurs.
Use rr to replay the failure in a gdb session, setting breakpoints to peek into memory data. If required, restart rr to begin another gdb session.

Throughout this process, rr has been super reliable and helpful. As long as bash script captures a test failure, rr can deterministically rerun it as many times as you like.

By the end, I had likely started over 100 gdb sessions using rr to replay the failures.

Failure Workflow Analysis

After enough trials, I’ve nailed down to the following workflow (I’m using tests::sync_shared::test_insert_parent_unknown_block as an example, I believe other tests could be fixed similarly):

At the very start, two shared structures and two chain_controllers are created (Line 83-94).

The shared1 variable created at Line 83 and its discarded chain controller work as expected.

The issue stems from the shared and chain variables created at Line 84.
As part of the block starting at 84, start_chain_services is invoked. It creates a bunch of threads, some of which might hold a copy of Shared structure, meaning each of those threads also holds Arc<RocksDB> instance.
A unit test might not use CKB’s full shutdown process like a normal CKB node does. When the test tests::sync_shared::test_insert_parent_unknown_block terminates, this thread (Line 59-73) is still running.
- Note: By terminate, I mean you can see test tests::sync_shared::test_insert_parent_unknown_block ... ok generated by Rust’s test infrastructure; in other words, you can think that control flow for the main test thread has already exited the method test_insert_parent_unknown_block.

When Rust initiates thread termination (or pthread determines to kill all still running threads), PreloadUnverifiedBlocksChannel still holds a copy of Shared structure, which maintains a copy of Arc<RocksDB>.

Then the Drop impl of the underlying OptimisticTransactionDB starts executing, which will invoke RocksDB’s destructor (Line 717):
```
  DBImpl::~DBImpl() {
    ThreadStatus::OperationType cur_op_type =
        ThreadStatusUtil::GetThreadOperation();
    ThreadStatusUtil::SetThreadOperation(ThreadStatus::OperationType::OP_UNKNOWN);
```
which then invokes RocksDB’s global Timer’s Cancel method (Line 86):
```
  InstrumentedMutexLock l(&mutex_);
```
This is when Timer::Cancel decides to acquire a mutex lock, and pthread signals an error, aborting the test program.

This leads to a crash like pthread lock: Invalid argument or

terminate called without an active exception.

What Caused the Crash

My best guess at what happened here is:

When test_insert_parent_unknown_block finishes, Rust decides that all the requested tests (in this case, just one) have finished, and starts terminating the process.
At this point, two threads are involved:
1. The thread containing PreloadUnverifiedBlocksChannel runs Drop trait impl, which then calls RocksDB’s destructor, and eventually invokes RocksDB’s Timer::Cancel on the global Timer object.
2. A cleanup thread (most likely the top running thread) runs cleanup hooks for RocksDB’s global Timer structure, which also target the same global timer.

If Thread 1 finishes before 2, the termination shuts down cleanly and the test passes.

However, if Thread 2 finishes first, the thread containing PreloadUnverifiedBlocksChannel will acquire a lock on a pthread mutex, which has already been cleaned up (in other words, this pthread mutex might not be properly initialized), and will trigger the aborting process, causing the process running the test to fail even when the Rust test itself succeeds.

The Fix

Based on the above assumption, I’ve prepared a fix in this commit, to ensure that all chain service’s threads are properly terminated before the testing function exits.

Now when running the following test command, the test never fails:

$ cargo nextest run  --features with_sentry --no-fail-fast \\
    --hide-progress-bar --success-output immediate-final \\
    --failure-output immediate-final -p \\
    ckb-sync tests::sync_shared::test_insert_parent_unknown_block

You can also run this command including bash script indefinitely until a failure happens.

With this fix, I could no longer reproduce the pthread error—even after running the command for over an hour.

Additional Notes

Googling the Error led to this discussion. Unfortunately, the proposed fix does not work for us.
Different RocksDB Versions: Currently, CKB is running on ckb-rocksdb v0.21.1 using RocksDB 8.5.4. CKB also developed a newer version: ckb-rocksdb v0.22.0, with RocksDB 9.10.0. I’ve tried both and found the newer one has lower failure frequency, but the issue still occurs.

✍🏻 Written by Xuejie Xiao

His previous posts include:

Find more in his personal website Less is More.

Deterministic Debugging of CKB’s Flaky Tests

Table of contents

Reproducing the Issue

Testing with `cargo nextest`

Deterministic Debugging with `rr` Multithreaded Failures

Failures Observed

Error Case: `pthread lock: Invalid argument`

Error Case: `terminate called without an active exception`

Diving into the Code

Failure Workflow Analysis

What Caused the Crash

The Fix

Additional Notes

Subscribe to my newsletter

Cryptape

Cryptape

Deterministic Debugging of CKB’s Flaky Tests

Table of contents

Reproducing the Issue

Testing with cargo nextest

Deterministic Debugging with rr Multithreaded Failures

Failures Observed

Error Case: pthread lock: Invalid argument

Error Case: terminate called without an active exception

Diving into the Code

Failure Workflow Analysis

What Caused the Crash

The Fix

Additional Notes

Subscribe to my newsletter

Cryptape

Cryptape

Testing with `cargo nextest`

Deterministic Debugging with `rr` Multithreaded Failures

Error Case: `pthread lock: Invalid argument`

Error Case: `terminate called without an active exception`