Fix py_thread_callback_SUITE flake on slow CI hosts (#63) #64
Merged
Pipe traffic between Python threads and the Erlang coordinator was vulnerable to short reads, two-write length+data races, and frame interleaving on the shared async pipe. Loop reads with a monotonic deadline, send each response as a single id-prefixed frame written under a non-blocking dirty-IO NIF, and serialise async writes through one Erlang process per pipe so chunked kernel writes can no longer interleave. Sync workers self-poison and the async pipe fails loud on unrecoverable read errors. Adds high-concurrency, async-concurrent and large-payload regression tests plus a local stress harness.
The previous backpressure check looked only at the writer's mailbox
length at submission time, so concurrent executors could complete
after the check and still pile {respond,...} into the writer
unbounded. The counter now tracks executors-plus-queued together:
incremented at submission, decremented after the writer hands the
frame to the NIF.
The reader returned -1 immediately when pipe_broken was set, but left bytes in the fd, so asyncio kept firing the callback while the Erlang writer was still draining its mailbox into the pipe. Discard the bytes and return 0 so the wrapper's while loop exits cleanly and the fd goes quiet once the writer times out. Also document that recovery is fail-loud only — there is no event-loop reference stored and no re-registration path to fail.
Monitored writers survived a coordinator crash, leaking the async pipe fd until the OS reclaimed it. spawn_link from the coordinator (which traps exits) propagates the EXIT signal in both directions: a writer death surfaces in the existing trap_exit clause for cleanup, and a coordinator crash now takes the writer with it so supervision can restart from a clean slate.
All three callsites already snapshot or modify the dict under the lock and then call set_result / set_exception with the lock released, but the rule was only spelled out next to one of them. Pin it on the struct definition so the next contributor cannot put a future method back under the mutex and reintroduce the deadlock / re-entry hazard.
The previous code marked workers poisoned and skipped them in acquire_thread_worker but kept the struct on g_thread_pool_head until NIF unload, so a hot loop of synchronisation failures grew runtime memory linearly. Unlink under g_thread_pool_mutex and free immediately; the lifetime counter still bumps so the diagnostic ceiling and stderr warning fire. Caller now clears tl_thread_worker and the pthread key BEFORE poisoning so no dangling references can survive the free.
Summary
Resolves issue #63, where
py_thread_callback_SUITE cases reported wrong values (not just timeouts) on FreeBSD/Python 3.12 under load.
The root cause was at the pipe layer: short reads, two-step
length+data writes, and concurrent writers interleaving frames on the
shared async-callback pipe.
The reads now loop with a monotonic deadline, every response goes
out as a single id-prefixed frame written from a dirty-IO NIF on a
non-blocking pipe end, and async writes serialise through one Erlang
process per pipe. Sync workers retire ("poison") on synchronisation
loss; the async path fails loud on unrecoverable read errors. New
high-concurrency, async-concurrent, and 64 KiB-payload tests cover
the regression, and a
scripts/stress_thread_callback.sh harness runs the suite in a loop locally.
Closes #63.