
Fix py_thread_callback_SUITE flake on slow CI hosts (#63)#64

Merged
benoitc merged 6 commits into main from feature/fix-thread-callback-flake
May 3, 2026

Conversation

Owner

@benoitc benoitc commented May 3, 2026

Summary

Resolves issue #63, where py_thread_callback_SUITE cases reported
wrong values (not just timeouts) on FreeBSD/Python 3.12 under load.
The root cause was at the pipe layer: short reads, two-step
length+data writes, and concurrent writers interleaving frames on the
shared async-callback pipe.

The reads now loop with a monotonic deadline, every response goes
out as a single id-prefixed frame written from a dirty-IO NIF on a
non-blocking pipe end, and async writes serialise through one Erlang
process per pipe. Sync workers retire ("poison") on synchronisation
loss; the async path fails loudly on unrecoverable read errors. New
high-concurrency, async-concurrent, and 64 KiB-payload tests cover
the regression, and a scripts/stress_thread_callback.sh harness
runs the suite in a loop locally.

Closes #63.
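The deadline-bounded framed read described above can be sketched as follows. This is a minimal illustration, not the suite's actual code: the frame layout (8-byte callback id, 4-byte payload length) and the helper names are assumptions for the sketch.

```python
import os
import struct
import time

# Hypothetical frame header: 8-byte callback id, then 4-byte payload length.
FRAME_HEADER = struct.Struct("!QI")

def read_exact(fd, n, deadline):
    """Loop os.read against a monotonic deadline so short reads on a
    loaded host never truncate a frame into a 'wrong value'."""
    buf = bytearray()
    while len(buf) < n:
        if time.monotonic() > deadline:
            raise TimeoutError("frame read deadline exceeded")
        chunk = os.read(fd, n - len(buf))
        if not chunk:
            raise EOFError("pipe closed mid-frame")
        buf += chunk
    return bytes(buf)

def read_frame(fd, timeout=5.0):
    """Read one complete id-prefixed frame under a single deadline."""
    deadline = time.monotonic() + timeout
    req_id, length = FRAME_HEADER.unpack(
        read_exact(fd, FRAME_HEADER.size, deadline))
    return req_id, read_exact(fd, length, deadline)
```

Because the deadline is computed once from `time.monotonic()`, wall-clock jumps on the CI host cannot shorten or extend the wait.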

benoitc added 6 commits May 3, 2026 12:14
Pipe traffic between Python threads and the Erlang coordinator was
vulnerable to short reads, two-write length+data races, and frame
interleaving on the shared async pipe. Loop reads with a monotonic
deadline, send each response as a single id-prefixed frame written
under a non-blocking dirty-IO NIF, and serialise async writes through
one Erlang process per pipe so chunked kernel writes can no longer
interleave. Sync workers self-poison and the async pipe fails loudly on
unrecoverable read errors. Adds high-concurrency, async-concurrent and
large-payload regression tests plus a local stress harness.
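The "one writer per pipe" serialisation can be illustrated with a small sketch. The real serialiser is an Erlang process feeding a dirty-IO NIF; this Python analogue (class name and frame layout are assumptions) shows the same idea: many producers enqueue whole frames, a single consumer writes them, so chunked kernel writes cannot interleave.

```python
import os
import queue
import struct
import threading

# Hypothetical frame header: 8-byte callback id, 4-byte payload length.
HEADER = struct.Struct("!QI")

class PipeWriter:
    """One writer thread per pipe: producers enqueue complete frames,
    and only this thread touches the fd, so frames never interleave."""
    def __init__(self, fd):
        self.fd = fd
        self.q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, req_id, payload):
        # Build the whole frame up front; never do a separate
        # length write followed by a data write.
        self.q.put(HEADER.pack(req_id, len(payload)) + payload)

    def _run(self):
        while True:
            view = memoryview(self.q.get())
            while view:                      # handle short kernel writes
                view = view[os.write(self.fd, view):]
```

The inner loop matters: a pipe write larger than PIPE_BUF may be split by the kernel, and only a single-writer design guarantees the split pieces stay contiguous.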

The previous backpressure check looked only at the writer's mailbox
length at submission time, so concurrent executors could complete
after the check and still pile {respond,...} messages into the writer's
mailbox without bound. The counter now tracks executors-plus-queued together:
incremented at submission, decremented after the writer hands the
frame to the NIF.
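The combined counter can be sketched like this. It is a simplified stand-in for the Erlang-side accounting: the class name and limit are assumptions, but the lifecycle matches the commit, incremented when work is accepted, decremented only after the frame is handed off.

```python
import threading

class Backpressure:
    """Tracks executors-plus-queued responses as one number, so work
    that finished executing but is still queued in the writer keeps
    counting against the ceiling."""
    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self.lock = threading.Lock()

    def try_acquire(self):
        """Called at submission; refuse new work past the ceiling."""
        with self.lock:
            if self.count >= self.limit:
                return False
            self.count += 1
            return True

    def release(self):
        """Called only after the writer hands the frame to the sink."""
        with self.lock:
            self.count -= 1
```

Checking only queue length at submission time leaves a window between "executor finished" and "frame written"; folding both states into one counter closes it.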

The reader returned -1 immediately when pipe_broken was set, but
left bytes in the fd, so asyncio kept firing the callback while the
Erlang writer was still draining its mailbox into the pipe. Discard
the bytes and return 0 so the wrapper's while loop exits cleanly and
the fd goes quiet once the writer times out. Also document that
recovery is fail-loud only — there is no event-loop reference stored
and no re-registration path to fail.
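The drain-and-return-0 behaviour can be sketched as a read callback. The function name and buffer size are assumptions; the point is the contract: once the pipe is marked broken, discard buffered bytes and report EOF rather than an error, so the caller's read loop exits and the fd stops waking the event loop.

```python
import os

def on_readable(fd, pipe_broken):
    """Hypothetical read callback for a non-blocking pipe end.
    Broken pipe: drain stale bytes and return 0 (EOF) so the
    wrapper's while-loop exits cleanly. Otherwise: normal read."""
    if pipe_broken:
        try:
            while os.read(fd, 65536):   # discard whatever is buffered
                pass
        except BlockingIOError:
            pass                        # non-blocking fd fully drained
        return 0
    return os.read(fd, 65536)
```

Returning -1 with bytes still in the fd would keep the readiness callback firing; returning 0 after a drain lets the fd go quiet once the writer times out.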

Monitored writers survived a coordinator crash, leaking the async
pipe fd until the OS reclaimed it. spawn_link from the coordinator
(which traps exits) propagates the EXIT signal in both directions:
a writer death surfaces in the existing trap_exit clause for
cleanup, and a coordinator crash now takes the writer with it so
supervision can restart from a clean slate.

All three callsites already snapshot or modify the dict under the
lock and then call set_result / set_exception with the lock
released, but the rule was only spelled out next to one of them.
Pin it on the struct definition so the next contributor cannot put
a future method back under the mutex and reintroduce the
deadlock / re-entry hazard.
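The lock discipline pinned on the struct is easy to show in miniature. This is a Python analogue with hypothetical names, not the C code: pop the future from the shared dict under the mutex, then resolve it only after the mutex is released, because resolving a future runs user callbacks.

```python
import threading
from concurrent.futures import Future

pending = {}                   # hypothetical registry: request id -> Future
pending_lock = threading.Lock()

def complete(req_id, value):
    """Snapshot under the lock, resolve outside it. set_result fires
    user callbacks; firing them while holding pending_lock would
    invite deadlock or re-entry into this registry."""
    with pending_lock:
        fut = pending.pop(req_id, None)
    if fut is not None:         # lock is already released here
        fut.set_result(value)
```

A callback that itself submits new work (and therefore takes `pending_lock` again) is safe under this ordering and deadlocks under the reversed one.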

The previous code marked workers poisoned and skipped them in
acquire_thread_worker but kept the struct on g_thread_pool_head until
NIF unload, so a hot loop of synchronisation failures grew runtime
memory linearly. Unlink under g_thread_pool_mutex and free
immediately; the lifetime counter still bumps so the diagnostic
ceiling and stderr warning fire. Caller now clears tl_thread_worker
and the pthread key BEFORE poisoning so no dangling references can
survive the free.
@benoitc benoitc merged commit dad2ede into main May 3, 2026
14 checks passed
@benoitc benoitc deleted the feature/fix-thread-callback-flake branch May 3, 2026 12:13

Successfully merging this pull request may close these issues.

py_thread_callback_SUITE: concurrent threadpool tests flaky under load