Reporting the issue already, I will try to create a simple reproducer and more concise explanation.
Full explanation by Claude
JIT trace-buffer assertion: full explanation
The assertion
In _PyJit_translate_single_bytecode_to_trace (Python/optimizer.c), each bytecode appended to a tier-2 trace is gated by a Py_DEBUG assertion:
assert(uop_buffer_remaining_space(trace) > space_needed);
where
space_needed = expansion->nuops + needs_guard_ip + 2 + (!OPCODE_HAS_NO_SAVE_IP(opcode))
expansion->nuops is the number of uops the bytecode's macro expands to. The + 2 covers the _CHECK_VALIDITY uop emitted at the start of every bytecode plus a reserved _DEOPT slot for it. + needs_guard_ip covers the _GUARD_IP_* slot for opcodes that need it. + (!OPCODE_HAS_NO_SAVE_IP) covers the _SET_IP slot.
Just before the assertion, trace->end has already been decremented to reserve the side-exit/error/deopt/guard-ip stubs at the buffer's tail:
trace->end -= 2 + has_exit + has_error + has_deopt + needs_guard_ip
So if R is the buffer's true remaining space at the start of the bytecode, the assertion is checking R - tail > space_needed, i.e.
R > 4 + has_exit + has_error + has_deopt + 2*needs_guard_ip + nuops + has_save_ip
The fitness mechanism
_PyJit_translate_single_bytecode_to_trace also runs a fitness gate before the assertion:
if (fitness < eq) goto done;
eq is compute_exit_quality(opcode). For specializable opcodes (like TO_BOOL, CALL_*) it returns EXIT_QUALITY_SPECIALIZABLE, defined as FITNESS_INITIAL / 80. With FITNESS_INITIAL == UOP_MAX_TRACE_LENGTH == 1000 in debug builds, that's 12.
Per bytecode, fitness is debited by slots_used = remaining_before − remaining_after, i.e. exactly the number of buffer slots the bytecode just consumed (uops emitted + tail reservations made). So fitness and remaining_space are intended to track in lockstep, with branch/frame penalties only making fitness drop faster (which would only ever make the assertion safer).
The intended invariant is therefore fitness <= remaining_space. If that held, then fitness >= eq = 12 would imply remaining_space >= 12 too, and 12 slots is enough for every existing space_needed.
Where the invariant breaks
Two accounting holes push fitness above remaining_space:
-
Trace prelude not charged. _PyJit_StartTracing emits _START_EXECUTOR and _MAKE_WARM into the buffer, costing 2 slots. Fitness is initialized to FITNESS_INITIAL (1000) right after, with no debit. So before the first bytecode: fitness = 1000, remaining = 998. Drift = +2.
-
needs_guard_ip under-reserved. When needs_guard_ip is true (e.g. after _PUSH_FRAME), the code emits up to three uops (_RECORD_CODE, the guard_ip uop itself, and guard_code_version_uop if the called object is a code object). But space_needed only adds + needs_guard_ip (i.e. 1). Each such occurrence under-reserves by 1–2 slots, but those slots are still consumed by next++, so slots_used does include them and fitness drops correctly. The mismatch is between the predicted space_needed and the actual emission, not between fitness and buffer — but it tightens the assertion further.
I confirmed the drift directly. With a release build with UOP_MAX_TRACE_LENGTH hacked down to 1000 (matching debug) and a printf inserted at the trace-finalize site, a long branchless reproducer produced:
TRACE_LEN: length=666 remaining=14 fitness=18 UOP_MAX=1000
fitness = 18, remaining = 14 — fitness sits 4 slots higher than the buffer.
Why this PR exposes it
The PR for gh-143732 changes the TO_BOOL macro from
macro(TO_BOOL) = _SPECIALIZE_TO_BOOL + unused/2 + _TO_BOOL;
to
macro(TO_BOOL) = _SPECIALIZE_TO_BOOL + _RECORD_TOS_TYPE + unused/2 + _TO_BOOL;
That bumps expansion->nuops for TO_BOOL from 1 to 2.
Concrete space_needed for TO_BOOL (no exit/deopt/guard_ip, has_error, has_save_ip):
|
nuops |
needs_guard_ip |
+2 |
has_save_ip |
space_needed |
| main |
1 |
0 |
2 |
1 |
4 |
| PR |
2 |
0 |
2 |
1 |
5 |
After tail reservations (trace->end -= 2 + has_error = 3), the post-decrement remaining must exceed space_needed. So the true-remaining R we need is:
|
tail |
space_needed |
R required |
| main |
3 |
4 |
> 7 |
| PR |
3 |
5 |
> 8 |
Worst case at the fitness gate: fitness == 12 and the drift is fitness − remaining = +4, so R = 8. On main, 8 > 7 ✓. On the PR, 8 > 8 ✗ — assertion fires.
Empirical confirmation
Debug + JIT build, 30 runs of ./python -m test test_strtod --multiprocess 0 --verbose2 --verbose3:
| Build |
TO_BOOL macro |
Pass |
Fail |
| main |
original |
30 |
0 |
| main + PR's macro change |
with _RECORD_TOS_TYPE |
28 |
2 |
Failures are always in test_short_halfway_cases, which uses random.randrange, so the failing trace shape varies and the failure is flaky (~5–10%). The C backtrace lands in _PyJit_translate_single_bytecode_to_trace reached from the test's if e + q.bit_length() > max_exp: line — so the bytecode that actually trips the assertion is plausibly a wide CALL (e.g. CALL_METHOD_DESCRIPTOR_NOARGS for q.bit_length()), not TO_BOOL itself; widening TO_BOOL just shifts trace timing enough to expose it.
Bottom line
It's a thin-margin coupling, not a PR bug. The intended invariant fitness <= remaining_space doesn't actually hold today; in some traces fitness is ~4 slots above remaining. With EXIT_QUALITY_SPECIALIZABLE = 12, the worst-case assertion threshold is right at the boundary for the PR's wider TO_BOOL, so any future macro widening can resurface this. Fixes (in increasing principled-ness): (a) graceful goto done instead of asserting; (b) raise EXIT_QUALITY_SPECIALIZABLE to ~25 to leave room for the widest opcode plus the +4 drift; (c) fix the two accounting holes so the invariant actually holds.
Bug report
Bug description:
The
assert(uop_buffer_remaining_space(trace) > space_needed);in_PyJit_translate_single_bytecode_to_tracecan be triggered if a uop is widened. Found when the CI for #148271 was failing (thereTO_BOOLgets an extra_RECORD_TOS_TYPE). @markshannonReporting the issue already, I will try to create a simple reproducer and more concise explanation.
Full explanation by Claude
JIT trace-buffer assertion: full explanation
The assertion
In
_PyJit_translate_single_bytecode_to_trace(Python/optimizer.c), each bytecode appended to a tier-2 trace is gated by a Py_DEBUG assertion:where
expansion->nuopsis the number of uops the bytecode's macro expands to. The+ 2covers the_CHECK_VALIDITYuop emitted at the start of every bytecode plus a reserved_DEOPTslot for it.+ needs_guard_ipcovers the_GUARD_IP_*slot for opcodes that need it.+ (!OPCODE_HAS_NO_SAVE_IP)covers the_SET_IPslot.Just before the assertion,
trace->endhas already been decremented to reserve the side-exit/error/deopt/guard-ip stubs at the buffer's tail:So if
Ris the buffer's true remaining space at the start of the bytecode, the assertion is checkingR - tail > space_needed, i.e.The fitness mechanism
_PyJit_translate_single_bytecode_to_tracealso runs a fitness gate before the assertion:eqiscompute_exit_quality(opcode). For specializable opcodes (likeTO_BOOL,CALL_*) it returnsEXIT_QUALITY_SPECIALIZABLE, defined asFITNESS_INITIAL / 80. WithFITNESS_INITIAL == UOP_MAX_TRACE_LENGTH == 1000in debug builds, that's 12.Per bytecode, fitness is debited by
slots_used = remaining_before − remaining_after, i.e. exactly the number of buffer slots the bytecode just consumed (uops emitted + tail reservations made). So fitness and remaining_space are intended to track in lockstep, with branch/frame penalties only making fitness drop faster (which would only ever make the assertion safer).The intended invariant is therefore
fitness <= remaining_space. If that held, thenfitness >= eq = 12would implyremaining_space >= 12too, and 12 slots is enough for every existingspace_needed.Where the invariant breaks
Two accounting holes push
fitnessaboveremaining_space:Trace prelude not charged.
_PyJit_StartTracingemits_START_EXECUTORand_MAKE_WARMinto the buffer, costing 2 slots. Fitness is initialized toFITNESS_INITIAL(1000) right after, with no debit. So before the first bytecode:fitness = 1000,remaining = 998. Drift = +2.needs_guard_ipunder-reserved. Whenneeds_guard_ipis true (e.g. after_PUSH_FRAME), the code emits up to three uops (_RECORD_CODE, theguard_ipuop itself, andguard_code_version_uopif the called object is a code object). Butspace_neededonly adds+ needs_guard_ip(i.e. 1). Each such occurrence under-reserves by 1–2 slots, but those slots are still consumed bynext++, soslots_useddoes include them and fitness drops correctly. The mismatch is between the predictedspace_neededand the actual emission, not between fitness and buffer — but it tightens the assertion further.I confirmed the drift directly. With a release build with
UOP_MAX_TRACE_LENGTHhacked down to 1000 (matching debug) and a printf inserted at the trace-finalize site, a long branchless reproducer produced:fitness = 18, remaining = 14— fitness sits 4 slots higher than the buffer.Why this PR exposes it
The PR for gh-143732 changes the
TO_BOOLmacro fromto
That bumps
expansion->nuopsforTO_BOOLfrom 1 to 2.Concrete
space_neededforTO_BOOL(no exit/deopt/guard_ip, has_error, has_save_ip):After tail reservations (
trace->end -= 2 + has_error = 3), the post-decrement remaining must exceedspace_needed. So the true-remainingRwe need is:Worst case at the fitness gate:
fitness == 12and the drift isfitness − remaining = +4, soR = 8. On main,8 > 7✓. On the PR,8 > 8✗ — assertion fires.Empirical confirmation
Debug + JIT build, 30 runs of
./python -m test test_strtod --multiprocess 0 --verbose2 --verbose3:_RECORD_TOS_TYPEFailures are always in
test_short_halfway_cases, which usesrandom.randrange, so the failing trace shape varies and the failure is flaky (~5–10%). The C backtrace lands in_PyJit_translate_single_bytecode_to_tracereached from the test'sif e + q.bit_length() > max_exp:line — so the bytecode that actually trips the assertion is plausibly a wide CALL (e.g.CALL_METHOD_DESCRIPTOR_NOARGSforq.bit_length()), notTO_BOOLitself; wideningTO_BOOLjust shifts trace timing enough to expose it.Bottom line
It's a thin-margin coupling, not a PR bug. The intended invariant
fitness <= remaining_spacedoesn't actually hold today; in some traces fitness is ~4 slots above remaining. WithEXIT_QUALITY_SPECIALIZABLE = 12, the worst-case assertion threshold is right at the boundary for the PR's widerTO_BOOL, so any future macro widening can resurface this. Fixes (in increasing principled-ness): (a) gracefulgoto doneinstead of asserting; (b) raiseEXIT_QUALITY_SPECIALIZABLEto ~25 to leave room for the widest opcode plus the +4 drift; (c) fix the two accounting holes so the invariant actually holds.CPython versions tested on:
CPython main branch
Operating systems tested on:
Linux