Fix GC reporting of uninitialized local after throw #127680
MichalStrehovsky wants to merge 1 commit into dotnet:main from
Keep prolog zero initialization for untracked GC locals when an earlier instruction can throw. A caller exception filter can run managed code and trigger GC before the throwing frame unwinds, so the local stack slot must contain a safe null value.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
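To make the description above concrete, here is a minimal sketch of the hazard in plain C++. It is only a toy model under stated assumptions, not runtime code: `FakeFrame`, `RunBody`, `SlotIsSafeAtThrow`, and `SelfCheck` are hypothetical names, and the "GC" is reduced to checking whether the slot holds null at the moment the exception is observed.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical poison value standing in for stale stack data left by an earlier call.
constexpr std::uintptr_t kPoison = 0xBADBAD;

// Hypothetical model of a frame that owns one untracked GC stack slot.
struct FakeFrame
{
    std::uintptr_t gcSlot;
};

// Models a method whose first explicit store to the slot comes after
// an instruction that can throw (a null dereference in the real test).
void RunBody(FakeFrame& frame, bool prologZeroInit, const int* y, std::uintptr_t x)
{
    if (prologZeroInit)
    {
        frame.gcSlot = 0; // prolog zero-init: the slot is null before anything can throw
    }
    if (y == nullptr)
    {
        throw std::runtime_error("null dereference"); // the throwing load
    }
    frame.gcSlot = x; // first explicit store; never reached when y is null
}

// Returns true if the slot holds a safe (null) value at the point where
// a caller's filter could run and trigger a GC.
bool SlotIsSafeAtThrow(bool prologZeroInit)
{
    FakeFrame frame{kPoison}; // the slot starts out holding stale data
    try
    {
        RunBody(frame, prologZeroInit, nullptr, 0x1000);
    }
    catch (const std::runtime_error&)
    {
        // Stand-in for first-pass EH: the frame's slot is inspected here.
    }
    return frame.gcSlot == 0;
}

// Small usage example for the model above.
void SelfCheck()
{
    assert(!SlotIsSafeAtThrow(false)); // without prolog zero-init the slot is stale
    assert(SlotIsSafeAtThrow(true));   // with prolog zero-init the slot is null
}
```

The model keeps the slot alive in the caller so it can be inspected after the throw; in the real scenario the throwing frame is still on the stack during the first EH pass.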
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
|
This is obviously AI generated, so here's the evidence.
You can pull down the crashing test with:
runfo get-helix-payload -j 8ffbe4f9-be45-445b-8433-40947c70c3bb -w Regressions -o c:\hell\8ffbe4f9-be45-445b-8433-40947c70c3bb\Regressions\
Then run whatever the script tells you to run. I got a crash and a dump on the first try. The dump is not included; for some reason the infra didn't capture it.
I had both Claude 4.7 and GPT-5.5 look at the dump. I had GPT-5.5 come up with a fix, then asked Claude what it thinks of the change. Claude said it's a fix for the issue it was looking at. Here are the details from GPT-5.5, because I can't tell if this fix is good or obvious or...:

Root cause: the JIT suppressed prolog zero-init for an untracked GC local whose first explicit store dominates its uses only in normal control flow. An earlier implicit exception can transfer into first-pass EH; a caller filter can run managed code and trigger GC before the throwing frame unwinds. At that point the untracked GC stack slot is reportable but still contains stale stack data.
Fix is on branch

Repro/test shape
Current test:
public class Test119403
{
    [ActiveIssue("needs triage", typeof(PlatformDetection), nameof(PlatformDetection.IsSimulator))]
    [Fact]
    public static void TestEntryPoint()
    {
        TrashStack();
        Problem();
    }

    static void Problem()
    {
        try
        {
            SubProblem("s", null);
        }
        catch (Exception e) when (ForceGC())
        {
            Console.WriteLine("Caught");
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static bool ForceGC()
    {
        GC.Collect();
        return true;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void TrashStack()
    {
        Span<int> span = stackalloc int[128];
        for (int i = 0; i < span.Length; i++)
        {
            span[i] = 0xBADBAD;
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int SubProblem(string? x, string? y)
    {
        int z = y.Length;
        string s = x;
        Foo(ref s);
        return z;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Foo(ref string s)
    {
    }
}
The important details:

Crash dump evidence
The dump is a NativeAOT checked-runtime fail-fast/assert, not the original null dereference:
The key stack shows GC running from the exception filter, reporting roots, and then validating a bogus object:
The suspicious value is visible in the stack arguments around
And the throwing frame’s stack area contains the poison pattern, including the slot that JIT later identifies as

Native disassembly from the dump
Regressions!test119403_Test119403__Problem:
00007ff6`0ea8e32a lea rcx,[Regressions!_Str_s]
00007ff6`0ea8e331 xor edx,edx
00007ff6`0ea8e333 call Regressions!test119403_Test119403__SubProblem
...
; filter funclet
00007ff6`0ea8e35b call Regressions!test119403_Test119403__ForceGC
00007ff6`0ea8e360 test eax,eax
...
; handler funclet
00007ff6`0ea8e371 lea rcx,[Regressions!_Str_Caught]
00007ff6`0ea8e378 call Regressions!System_Console_System_Console__WriteLine_12

The key Regressions!test119403_Test119403__SubProblem:
00007ff6`0ea8e470 push rbx
00007ff6`0ea8e471 sub rsp,30h
00007ff6`0ea8e475 mov ebx,dword ptr [rdx+8] ; y.Length, faults because rdx == null
00007ff6`0ea8e478 mov qword ptr [rsp+28h],rcx ; initializes local s, never executes
00007ff6`0ea8e47d lea rcx,[rsp+28h]
00007ff6`0ea8e482 call Regressions!test119403_Test119403__Foo
So at the exception point,

JIT dump before the fix
Generated with ILC using
Before the fix,
But it is not JIT codegen order:
Final emitted code before the fix:
G_M10899_IG01:
IN0006: 000000 push rbx
IN0007: 000001 sub rsp, 48
G_M10899_IG02:
IN0001: 000005 mov ebx, dword ptr [rdx+0x08]
IN0002: 000008 mov gword ptr [rsp+0x28], rcx
IN0003: 00000D lea rcx, [rsp+0x28]
IN0004: 000012 call Test119403:Foo(byref)
GC info confirms the untracked stack slot:
The problem is the combination of these facts:

JIT dump after the fix
After the fix, the focused dump shows:
The prolog now initializes the slot before the throwing load:
G_M10899_IG01:
IN0006: 000000 push rbx
IN0007: 000001 sub rsp, 48
IN0008: 000005 xor eax, eax
IN0009: 000007 mov qword ptr [rsp+0x28], rax
G_M10899_IG02:
IN0001: 00000C mov ebx, dword ptr [rdx+0x08]
IN0002: 00000F mov gword ptr [rsp+0x28], rcx
IN0003: 000014 lea rcx, [rsp+0x28]
IN0004: 000019 call Test119403:Foo(byref)
The GC stack slot still exists:
But it now contains null if the method faults before the explicit store.

What was wrong in the JIT
The relevant optimization is
That phase can mark a local as
The fix adds tracking for any prior potentially throwing tree while scanning toward the first store:
bool hasImplicitException = false;
...
const bool treeMayThrow = (tree->gtFlags & GTF_EXCEPT) != 0;
hasImplicitException |= treeMayThrow;
hasImplicitControlFlow |= hasEHSuccs && treeMayThrow;
And prevents
// A caller's exception filter can run arbitrary managed code before this frame is unwound.
// If that triggers a GC, untracked GC locals may be reported even when the throwing method
// has no local EH successors and no explicit GC safe point before this store.
const bool needsInitOnImplicitException =
hasImplicitException && lclDsc->HasGCPtr() && !lclDsc->lvTracked;
if (!removedExplicitZeroInit && isEntire && !needsInitOnImplicitException &&
(!hasImplicitControlFlow || (lclDsc->lvTracked && !lclDsc->lvLiveInOutOfHndlr)))
{
...
lclDsc->lvHasExplicitInit = 1;
}
This is intentionally not NativeAOT-specific. The observed crash was in NativeAOT, but the underlying rule is about first-pass EH semantics and GC reporting of stack slots.

Test history
The test source history is short: PR #119403 added this test and fixed #119363.
Original issue #119363 was a CI failure in
PR #119403 was titled:
It changed
This new failure is related but distinct:
So the appropriate fix is in the JIT’s zero-init suppression logic, not in the GC decoder.

Confirmation recipe
Use the existing ILC response file and focus the dump on the test method:
$ilc = "D:\git\runtime3\artifacts\bin\coreclr\windows.x64.Checked\x64\ilc\ilc.exe"
$rsp = "D:\git\runtime3\artifacts\tests\coreclr\obj\windows.x64.Checked\Managed\Regressions\Regressions\native\Regressions.ilc.rsp"
& $ilc "@$rsp" --parallelism:1 `
--codegenopt:JitStdOutFile=<path>\jit-test119403.log `
--codegenopt:JitDisasm=Test119403:SubProblem `
--codegenopt:JitDisasmWithGC=1 `
--codegenopt:JitGCDump=Test119403:SubProblem `
  --codegenopt:JitDump=Test119403:SubProblem
Before the fix, look for:
After the fix, look for:
Pull request overview
This PR adjusts the JIT’s redundant zero-init removal logic to preserve prolog zero-initialization for untracked GC-pointer locals when there has been any earlier potentially-throwing instruction, ensuring those locals contain a safe null value if a GC occurs during exception dispatch (e.g., via a caller exception filter) before the frame unwinds.
Changes:
- Track whether any prior node in the scanned region may throw (hasImplicitException).
- Prevent setting lvHasExplicitInit for untracked GC-pointer locals when an earlier throw is possible, so prolog zero-init is retained.
- Minor refactor to compute “may throw” once per node and reuse for existing implicit-control-flow tracking.
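The condition summarized in the list above can be sketched as a standalone predicate. This is a simplified illustration, not the JIT source: `LocalFacts`, `ScanState`, `CanSkipPrologZeroInit`, and `SelfCheck` are hypothetical names that stand in for the `LclVarDsc` fields and scan-local flags the real check reads, and the boolean expression is transcribed from the diff quoted in this PR.

```cpp
#include <cassert>

// Hypothetical stand-in for the LclVarDsc facts the check consults.
struct LocalFacts
{
    bool hasGCPtr;           // models lclDsc->HasGCPtr()
    bool tracked;            // models lclDsc->lvTracked
    bool liveInOutOfHandler; // models lclDsc->lvLiveInOutOfHndlr
};

// Hypothetical stand-in for the flags accumulated while scanning toward the first store.
struct ScanState
{
    bool removedExplicitZeroInit;
    bool isEntire;
    bool hasImplicitControlFlow; // a prior node may throw and the block has EH successors
    bool hasImplicitException;   // a prior node may throw at all (new in this PR)
};

// Returns true when the explicit store may be treated as the local's initialization,
// which lets the prolog zero-init be dropped.
bool CanSkipPrologZeroInit(const LocalFacts& lcl, const ScanState& s)
{
    // Untracked GC locals are reported whenever the frame is reported, so they
    // must hold a valid value before any instruction that can throw.
    const bool needsInitOnImplicitException =
        s.hasImplicitException && lcl.hasGCPtr && !lcl.tracked;

    return !s.removedExplicitZeroInit && s.isEntire && !needsInitOnImplicitException &&
           (!s.hasImplicitControlFlow || (lcl.tracked && !lcl.liveInOutOfHandler));
}

// Small usage example covering the case from the failing test.
void SelfCheck()
{
    const LocalFacts untrackedGc{true, false, false};
    const ScanState priorThrowNoLocalEH{false, true, false, true};
    const ScanState noPriorThrow{false, true, false, false};

    assert(!CanSkipPrologZeroInit(untrackedGc, priorThrowNoLocalEH)); // keep prolog zero-init
    assert(CanSkipPrologZeroInit(untrackedGc, noPriorThrow));         // safe to drop it
}
```

Tracked locals and non-GC locals are unaffected by the new term, which matches the PR's intent of narrowing the change to untracked GC-pointer locals.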
// A caller's exception filter can run arbitrary managed code before this frame is unwound.
// If that triggers a GC, untracked GC locals may be reported even when the throwing method
// has no local EH successors and no explicit GC safe point before this store.
const bool needsInitOnImplicitException =
    hasImplicitException && lclDsc->HasGCPtr() && !lclDsc->lvTracked;

if (!removedExplicitZeroInit && isEntire && !needsInitOnImplicitException &&
|
/azp run runtime-nativeaot-outerloop |
|
Azure Pipelines successfully started running 1 pipeline(s). |
How does this work when there is no GC safe point at that location? Is this for fully interruptible cases only? What happens for partially interruptible cases? |
|
GPT-5.5 response follows, so don't take it at face value. There are two separate concepts here:
This case is the second one.

Why GC can see this frame without a call-site safe point
The GC is not interrupting
NativeAOT does this in the code manager:
bool executionAborted = ((CoffNativeMethodInfo *)pMethodInfo)->executionAborted;
if (executionAborted)
    flags = ICodeManagerFlags::ExecutionAborted;
Then

Fully interruptible case
For fully interruptible GC info, the decoder can compute liveness at essentially any instruction offset. So an aborted frame can still be reported at the throw/fault IP even if that IP is not a call-site safe point.
That means all reportable GC locations must contain valid values at any instruction where the method may throw and remain on the stack for first-pass EH. For untracked stack slots, “valid” generally means either a real object reference or null. This is the model that exposes the bug.

Partially interruptible case
For partially interruptible methods, the decoder first tries the exact safepoint table for normal, non-aborted enumeration:
if (!executionAborted && m_SafePointIndex != m_NumSafePoints)
{
// use safe point live state
}
But for aborted frames, it does not rely on call-site safe points. Instead it checks whether the faulting IP is inside an encoded interruptible range:
if (countIntersections == 0 && executionAborted)
{
LOG(("Not reporting this frame because it is aborted and not fully interruptible.\n"));
goto ExitSuccess;
}
So for partially interruptible methods:
if (slotDecoder.GetNumUntracked() && !(inputFlags & (ParentOfFuncletStackFrame | NoReportUntracked)))
{
ReportUntrackedSlots(...);
}
This is why the bug still happens for the current test even though

Why the fix is still needed
The bad slot is untracked:
Untracked GC slots are not protected by precise per-instruction liveness in the same way tracked locals are. Once the frame is considered reportable, untracked slots are reported. Therefore, an untracked GC stack slot must be initialized before any earlier instruction can throw and make the frame visible to first-pass EH GC reporting.
In this case, before the fix:
push rbx
sub rsp, 48
mov ebx, dword ptr [rdx+0x08] ; throws
mov gword ptr [rsp+0x28], rcx ; initializes untracked GC slot too late
After the fix:
push rbx
sub rsp, 48
xor eax, eax
mov qword ptr [rsp+0x28], rax ; safe null before any throwing instruction
mov ebx, dword ptr [rdx+0x08] ; throws
mov gword ptr [rsp+0x28], rcx
So the answer is: this is not limited to fully interruptible code. Fully interruptible frames can be reported at arbitrary aborted IPs; partially interruptible aborted frames are reported only if the IP is in an interruptible range. In either case, if the frame is reported, untracked GC stack slots must already contain valid GC values. |
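The reporting rule described in the response above can be sketched as a small decision function. This is a deliberately simplified model built only from the snippets quoted in this thread, so it inherits their "don't take it at face value" caveat; `AbortedFrameQuery`, `AbortedFrameReportsUntracked`, and `SelfCheck` are hypothetical names, and `suppressUntracked` stands in for the `ParentOfFuncletStackFrame | NoReportUntracked` flag test.

```cpp
#include <cassert>

// Hypothetical summary of what the decoder knows about an aborted frame.
struct AbortedFrameQuery
{
    bool ipInInterruptibleRange; // models countIntersections != 0
    bool hasUntrackedSlots;      // models slotDecoder.GetNumUntracked() != 0
    bool suppressUntracked;      // models the ParentOfFuncletStackFrame | NoReportUntracked check
};

// Returns true if untracked GC slots of an aborted frame would be reported
// under the simplified model described in the thread.
bool AbortedFrameReportsUntracked(const AbortedFrameQuery& q)
{
    // An aborted frame whose IP is outside every interruptible range is skipped entirely.
    if (!q.ipInInterruptibleRange)
    {
        return false;
    }
    // Once the frame is reportable, untracked slots are reported unless suppressed.
    return q.hasUntrackedSlots && !q.suppressUntracked;
}

// Small usage example for the model above.
void SelfCheck()
{
    assert(AbortedFrameReportsUntracked({true, true, false}));   // slot must already be valid
    assert(!AbortedFrameReportsUntracked({false, true, false})); // frame is not reported at all
}
```

Under this model the slot contents matter exactly when the function returns true, which is the situation the prolog zero-init protects.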