Fix py_thread_callback_SUITE flake on slow CI hosts (#63) #64
Merged
Pipe traffic between Python threads and the Erlang coordinator was vulnerable to short reads, two-write length+data races, and frame interleaving on the shared async pipe. Loop reads with a monotonic deadline, send each response as a single id-prefixed frame written under a non-blocking dirty-IO NIF, and serialise async writes through one Erlang process per pipe so chunked kernel writes can no longer interleave. Sync workers self-poison and the async pipe fails loud on unrecoverable read errors. Adds high-concurrency, async-concurrent and large-payload regression tests plus a local stress harness.
The previous backpressure check looked only at the writer's mailbox
length at submission time, so concurrent executors could complete
after the check and still pile {respond,...} into the writer
unbounded. The counter now tracks executors-plus-queued together:
incremented at submission, decremented after the writer hands the
frame to the NIF.
The reader returned -1 immediately when pipe_broken was set, but left bytes in the fd, so asyncio kept firing the callback while the Erlang writer was still draining its mailbox into the pipe. Discard the bytes and return 0 so the wrapper's while loop exits cleanly and the fd goes quiet once the writer times out. Also document that recovery is fail-loud only — there is no event-loop reference stored and no re-registration path to fail.
Monitored writers survived a coordinator crash, leaking the async pipe fd until the OS reclaimed it. spawn_link from the coordinator (which traps exits) propagates the EXIT signal in both directions: a writer death surfaces in the existing trap_exit clause for cleanup, and a coordinator crash now takes the writer with it so supervision can restart from a clean slate.
All three callsites already snapshot or modify the dict under the lock and then call set_result / set_exception with the lock released, but the rule was only spelled out next to one of them. Pin it on the struct definition so the next contributor cannot put a future method back under the mutex and reintroduce the deadlock / re-entry hazard.
The previous code marked workers poisoned and skipped them in acquire_thread_worker but kept the struct on g_thread_pool_head until NIF unload, so a hot loop of synchronisation failures grew runtime memory linearly. Unlink under g_thread_pool_mutex and free immediately; the lifetime counter still bumps so the diagnostic ceiling and stderr warning fire. Caller now clears tl_thread_worker and the pthread key BEFORE poisoning so no dangling references can survive the free.
Summary
Resolves issue #63, where
py_thread_callback_SUITE cases reported wrong values (not just timeouts) on FreeBSD/Python 3.12 under load.
The root cause was at the pipe layer: short reads, two-step
length+data writes, and concurrent writers interleaving frames on the
shared async-callback pipe.
The reads now loop with a monotonic deadline, every response goes
out as a single id-prefixed frame written from a dirty-IO NIF on a
non-blocking pipe end, and async writes serialise through one Erlang
process per pipe. Sync workers retire ("poison") on synchronisation
loss; the async path fails loud on unrecoverable read errors. New
high-concurrency, async-concurrent, and 64 KiB-payload tests cover
the regression, and a
scripts/stress_thread_callback.sh harness runs the suite in a loop locally.
Closes #63.