Fix GC reporting of uninitialized local after throw #127680

Open
MichalStrehovsky wants to merge 1 commit into dotnet:main from
MichalStrehovsky:fix-jit-gc-local-zero-init

@MichalStrehovsky
Member

Keep prolog zero initialization for untracked GC locals when an earlier instruction can throw. A caller exception filter can run managed code and trigger GC before the throwing frame unwinds, so the local stack slot must contain a safe null value.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 2, 2026 06:19
@MichalStrehovsky MichalStrehovsky added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 2, 2026
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@MichalStrehovsky
Member Author

This is obviously AI generated, so here's the evidence.

You can pull down the crashing test with: runfo get-helix-payload -j 8ffbe4f9-be45-445b-8433-40947c70c3bb -w Regressions -o c:\hell\8ffbe4f9-be45-445b-8433-40947c70c3bb\Regressions\

Then run whatever the script tells you to run. I got a crash and a dump on the first try. The dump is not included because, for some reason, the infra didn't capture it.

I had both Claude 4.7 and GPT-5.5 look at the dump. I had GPT-5.5 come up with a fix, then asked Claude what it thought of the change. Claude said it fixes the issue it was looking at.

Here are the details from GPT-5.5, because I can't tell if this fix is good or obvious or...:

Root cause: the JIT suppressed prolog zero-init for an untracked GC local whose first explicit store is dominated only in normal control flow. An earlier implicit exception can transfer into first-pass EH; a caller filter can run managed code and trigger GC before the throwing frame unwinds. At that point the untracked GC stack slot is reportable but still contains stale stack data.

Fix is on branch fix-jit-gc-local-zero-init, commit c2e4cd5ea6f.

Repro/test shape

Current test:

public class Test119403
{
    [ActiveIssue("needs triage", typeof(PlatformDetection), nameof(PlatformDetection.IsSimulator))]
    [Fact]
    public static void TestEntryPoint()
    {
        TrashStack();
        Problem();
    }

    static void Problem()
    {
        try
        {
            SubProblem("s", null);
        }
        catch (Exception e) when (ForceGC())
        {
            Console.WriteLine("Caught");
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static bool ForceGC()
    {
        GC.Collect();
        return true;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void TrashStack()
    {
        Span<int> span = stackalloc int[128];
        for (int i = 0; i < span.Length; i++)
        {
            span[i] = 0xBADBAD;
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int SubProblem(string? x, string? y)
    {
        int z = y.Length;
        string s = x;
        Foo(ref s);
        return z;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Foo(ref string s)
    {
    }
}

The important details:

  • TrashStack() fills stack memory with 00badbad.
  • SubProblem("s", null) faults on y.Length.
  • The catch when (ForceGC()) filter runs during first-pass EH.
  • ForceGC() triggers GC while the SubProblem frame is still on the stack.
  • s is address-exposed because of Foo(ref s), so JIT uses a stack home for it and reports it as an untracked GC slot.

Crash dump evidence

The dump is a NativeAOT checked-runtime fail-fast/assert, not the original null dereference:

FAILURE_EXCEPTION_CODE:  c0000602
FAILURE_SYMBOL_NAME:  Regressions.exe!Assert
FAULTING_SOURCE_FILE:  src\coreclr\nativeaot\Runtime\rhassert.cpp
FAILURE_BUCKET_ID:  APPLICATION_FAULT_c0000602_Regressions.exe!Assert

The key stack shows GC running from the exception filter, reporting roots, and then validating a bogus object:

KERNELBASE!RaiseFailFastException
Regressions!Assert
Regressions!RhpVectoredExceptionHandler
ntdll!RtlDispatchException
ntdll!KiUserExceptionDispatch

Regressions!WKS::CObjectHeader::Validate+0xe
Regressions!GcEnumObject+0x6c
Regressions!TGcInfoDecoder<AMD64GcInfoEncoding>::ReportUntrackedSlots+0x1fa
Regressions!TGcInfoDecoder<AMD64GcInfoEncoding>::EnumerateLiveSlots+0x18f7
Regressions!CoffNativeCodeManager::EnumGcRefs+0x113
Regressions!EnumGcRefs+0x66
Regressions!Thread::GcScanRootsWorker+0x17f
Regressions!GCToEEInterface::GcScanRoots+0x171
Regressions!WKS::gc_heap::mark_phase+0x346
Regressions!WKS::GCHeap::GarbageCollect+0x28c
Regressions!RhCollect+0xce
Regressions!test119403_Test119403__ForceGC+0x47
Regressions!test119403_Test119403__Problem+0x40
Regressions!RhpCallFilterFunclet2
Regressions!S_P_CoreLib_System_Runtime_EH__FindFirstPassHandler
Regressions!S_P_CoreLib_System_Runtime_EH__DispatchEx
Regressions!S_P_CoreLib_System_Runtime_EH__RhThrowHwEx
Regressions!RhpThrowHwEx2
Regressions!test119403_Test119403__SubProblem+0x5
Regressions!test119403_Test119403__Problem+0x18

The suspicious value is visible in the stack arguments around CObjectHeader::Validate / GcEnumObject:

Regressions!WKS::CObjectHeader::Validate+0xe
Args include: 00badbad`00badbad

And the throwing frame’s stack area contains the poison pattern, including the slot that JIT later identifies as [rsp+0x28]:

00000019`0897f3a0  00badbad`00badbad
00000019`0897f3a8  00007ff6`0ea8e475 Regressions!test119403_Test119403__SubProblem+0x5
...
00000019`0897f3d0  00badbad`00badbad
00000019`0897f3d8  00badbad`00badbad
00000019`0897f3e0  00badbad`00badbad
00000019`0897f3e8  00badbad`00badbad
00000019`0897f3f0  00badbad`00badbad
00000019`0897f3f8  00badbad`00badbad

Native disassembly from the dump

Problem calls SubProblem("s", null), then has a filter funclet that calls ForceGC:

Regressions!test119403_Test119403__Problem:
00007ff6`0ea8e32a  lea  rcx,[Regressions!_Str_s]
00007ff6`0ea8e331  xor  edx,edx
00007ff6`0ea8e333  call Regressions!test119403_Test119403__SubProblem
...
; filter funclet
00007ff6`0ea8e35b  call Regressions!test119403_Test119403__ForceGC
00007ff6`0ea8e360  test eax,eax
...
; handler funclet
00007ff6`0ea8e371  lea  rcx,[Regressions!_Str_Caught]
00007ff6`0ea8e378  call Regressions!System_Console_System_Console__WriteLine_12

The key SubProblem disassembly:

Regressions!test119403_Test119403__SubProblem:
00007ff6`0ea8e470  push rbx
00007ff6`0ea8e471  sub  rsp,30h
00007ff6`0ea8e475  mov  ebx,dword ptr [rdx+8]      ; y.Length, faults because rdx == null
00007ff6`0ea8e478  mov  qword ptr [rsp+28h],rcx    ; initializes local s, never executes
00007ff6`0ea8e47d  lea  rcx,[rsp+28h]
00007ff6`0ea8e482  call Regressions!test119403_Test119403__Foo

So at the exception point, [rsp+0x28] has not been initialized by SubProblem; it contains whatever was on the stack, deliberately 00badbad.

JIT dump before the fix

Generated with ILC using --parallelism:1 and:

--codegenopt:JitStdOutFile=...\jit-test119403.log
--codegenopt:JitDisasm=Test119403:*
--codegenopt:JitDisasmWithGC=1
--codegenopt:JitDisasmWithDebugInfo=1
--codegenopt:JitGCDump=Test119403:*
--codegenopt:JitEHDump=Test119403:*
--codegenopt:JitUnwindDump=Test119403:*
--codegenopt:JitDump=Test119403:*

Before the fix, V02 loc0 is the address-exposed string s stack local:

;  V02 loc0 [V02] ref -> [rsp+0x28] do-not-enreg[X] addr-exposed ld-addr-op class-hnd exact <System.String>

But it is not must-init.

JIT codegen order:

IN0001: mov ebx, dword ptr [rdx+0x08]        ; faulting y.Length
IN0002: mov gword ptr [V02 rsp+0x28], rcx    ; first explicit init of s
IN0003: lea rcx, [V02 rsp+0x28]
IN0004: call Test119403:Foo(byref)

Final emitted code before the fix:

G_M10899_IG01:
IN0006: 000000 push rbx
IN0007: 000001 sub  rsp, 48

G_M10899_IG02:
IN0001: 000005 mov ebx, dword ptr [rdx+0x08]
IN0002: 000008 mov gword ptr [rsp+0x28], rcx
IN0003: 00000D lea rcx, [rsp+0x28]
IN0004: 000012 call Test119403:Foo(byref)

GC info confirms the untracked stack slot:

Stack slot id for offset 40 (0x28) (sp) (untracked) = 0.
Defining 1 call sites:
    Offset 0x12, size 5.

The problem is the combination of these facts:

  1. [rsp+0x28] is a GC stack slot.
  2. It is untracked, so GC reporting does not rely on precise liveness of normal control flow.
  3. It is not zeroed in the prolog.
  4. Its first explicit store comes after a potentially throwing instruction.
  5. First-pass EH can run a caller filter that triggers GC while this frame is still present.
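
The five facts above can be put together in a small standalone C++ model. This is only a sketch: `HeapModel`, `kExampleHeap`, `IsSafeToReport`, and `SlotValueAtFault` are hypothetical names, the heap bounds are made up, and the real check in `CObjectHeader::Validate` is far more involved.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical heap bounds for this sketch only; a real GC heap is not a
// single contiguous range and validates objects very differently.
struct HeapModel
{
    std::uint64_t lo; // inclusive lower bound
    std::uint64_t hi; // exclusive upper bound
};

constexpr HeapModel kExampleHeap{0x0000020000000000ULL, 0x0000030000000000ULL};

// An untracked GC slot is reported whenever its frame is reported, so its
// content must be either null or a pointer into the heap.
bool IsSafeToReport(const HeapModel& heap, std::uint64_t slotValue)
{
    return slotValue == 0 || (slotValue >= heap.lo && slotValue < heap.hi);
}

// Models the content of [rsp+0x28] when the method faults before its first
// explicit store: null if the prolog zeroed the slot, stale data otherwise.
std::uint64_t SlotValueAtFault(std::uint64_t staleStackValue, bool prologZeroInit)
{
    return prologZeroInit ? 0 : staleStackValue;
}
```

With the poison value from the dump, `IsSafeToReport(kExampleHeap, SlotValueAtFault(0x00badbad00badbadULL, false))` is false, while the same call with `prologZeroInit` set to true reports a null slot and is true.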

JIT dump after the fix

After the fix, the focused dump shows:

must init V02 because it has a GC ref
;  V02 loc0 [V02] ref -> [rsp+0x28] do-not-enreg[X] must-init addr-exposed ld-addr-op class-hnd exact <System.String>

The prolog now initializes the slot before the throwing load:

G_M10899_IG01:
IN0006: 000000 push rbx
IN0007: 000001 sub  rsp, 48
IN0008: 000005 xor  eax, eax
IN0009: 000007 mov  qword ptr [rsp+0x28], rax

G_M10899_IG02:
IN0001: 00000C mov ebx, dword ptr [rdx+0x08]
IN0002: 00000F mov gword ptr [rsp+0x28], rcx
IN0003: 000014 lea rcx, [rsp+0x28]
IN0004: 000019 call Test119403:Foo(byref)

The GC stack slot still exists:

Stack slot id for offset 40 (0x28) (sp) (untracked) = 0.

But it now contains null if the method faults before the explicit store.

What was wrong in the JIT

The relevant optimization is Compiler::optRemoveRedundantZeroInits.

That phase can mark a local as lvHasExplicitInit, which tells codegen not to insert prolog zero initialization. The old reasoning was essentially:

  • If there is no local EH successor and no GC safe point before the first explicit store, then a GC local does not need prolog zero-init.
  • For normal control flow, that is true.
  • For first-pass EH, it is not true: the exception can escape to a caller filter, and the caller filter can run arbitrary managed code including GC.Collect(), while the throwing frame remains on the stack.

The fix adds tracking for any prior potentially throwing tree while scanning toward the first store:

bool hasImplicitException = false;
...
const bool treeMayThrow = (tree->gtFlags & GTF_EXCEPT) != 0;
hasImplicitException |= treeMayThrow;
hasImplicitControlFlow |= hasEHSuccs && treeMayThrow;

And prevents lvHasExplicitInit suppression for untracked GC locals in that case:

// A caller's exception filter can run arbitrary managed code before this frame is unwound.
// If that triggers a GC, untracked GC locals may be reported even when the throwing method
// has no local EH successors and no explicit GC safe point before this store.
const bool needsInitOnImplicitException =
    hasImplicitException && lclDsc->HasGCPtr() && !lclDsc->lvTracked;

if (!removedExplicitZeroInit && isEntire && !needsInitOnImplicitException &&
    (!hasImplicitControlFlow || (lclDsc->lvTracked && !lclDsc->lvLiveInOutOfHndlr)))
{
    ...
    lclDsc->lvHasExplicitInit = 1;
}

This is intentionally not NativeAOT-specific. The observed crash was in NativeAOT, but the underlying rule is about first-pass EH semantics and GC reporting of stack slots.
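
The before/after behaviour of that condition can be exercised in isolation with a simplified model. This is a sketch under stated assumptions, not the JIT's code: `LocalModel`, `TreeModel`, and `CanSkipPrologZeroInit` are hypothetical names, the `withFix` switch exists only to compare the two behaviours, and the `removedExplicitZeroInit` and `isEntire` terms are assumed to allow the optimization.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-ins for JIT state; the real LclVarDsc and
// GenTree types carry far more information.
struct LocalModel
{
    bool hasGCPtr;           // models lclDsc->HasGCPtr()
    bool tracked;            // models lclDsc->lvTracked
    bool liveInOutOfHandler; // models lclDsc->lvLiveInOutOfHndlr
};

struct TreeModel
{
    bool mayThrow; // models (tree->gtFlags & GTF_EXCEPT) != 0
};

// Returns true if the model would set lvHasExplicitInit, i.e. drop the prolog
// zero-init, for a local whose first full store follows treesBeforeStore.
bool CanSkipPrologZeroInit(const LocalModel&             lcl,
                           const std::vector<TreeModel>& treesBeforeStore,
                           bool                          hasEHSuccs,
                           bool                          withFix)
{
    bool hasImplicitException   = false;
    bool hasImplicitControlFlow = false;

    // Scan every tree that executes before the first explicit store.
    for (const TreeModel& tree : treesBeforeStore)
    {
        hasImplicitException |= tree.mayThrow;
        hasImplicitControlFlow |= hasEHSuccs && tree.mayThrow;
    }

    // New term: an untracked GC local must stay zero-initialized if anything
    // before its first store can throw.
    const bool needsInitOnImplicitException =
        withFix && hasImplicitException && lcl.hasGCPtr && !lcl.tracked;

    return !needsInitOnImplicitException &&
           (!hasImplicitControlFlow || (lcl.tracked && !lcl.liveInOutOfHandler));
}
```

For the `SubProblem` shape (an untracked GC local, one throwing tree before the store, no local EH successors), the model returns true without the fix and false with it. A tracked local, or a method with no throwing tree before the store, is unaffected by the new term.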

Test history

The test source history is short:

ad0befc6b9a 2025-09-07 Jan Kotas | Unconditionally skip GC reporting for non-interruptible aborted methods (#119403)
02dbfc20c5e 2026-01-31 Copilot   | Migrate test exclusions from issues.targets to inline ActiveIssue attributes (#123248)

PR #119403 added this test and fixed #119363.

Original issue #119363 was a CI failure in System.Security.Cryptography.Tests with:

Detected use of a corrupted OBJECTREF. Possible GC hole.

PR #119403 was titled:

Unconditionally skip GC reporting for non-interruptible aborted methods

It changed gcinfodecoder.cpp and added GitHub_119403\test119403.cs / .csproj. The original fix addressed the non-interruptible aborted-method case by skipping GC reporting when execution is aborted and the current offset is not in an interruptible range.

This new failure is related but distinct:

  • The decoder/runtime is reporting the frame.
  • The frame contains an untracked GC local stack slot.
  • The slot is valid to report from the encoded GC info perspective.
  • The JIT incorrectly allowed it to remain uninitialized until after an earlier throwing instruction.

So the appropriate fix is in the JIT’s zero-init suppression logic, not in the GC decoder.

Confirmation recipe

Use the existing ILC response file and focus the dump on the test method:

$ilc = "D:\git\runtime3\artifacts\bin\coreclr\windows.x64.Checked\x64\ilc\ilc.exe"
$rsp = "D:\git\runtime3\artifacts\tests\coreclr\obj\windows.x64.Checked\Managed\Regressions\Regressions\native\Regressions.ilc.rsp"

& $ilc "@$rsp" --parallelism:1 `
  --codegenopt:JitStdOutFile=<path>\jit-test119403.log `
  --codegenopt:JitDisasm=Test119403:SubProblem `
  --codegenopt:JitDisasmWithGC=1 `
  --codegenopt:JitGCDump=Test119403:SubProblem `
  --codegenopt:JitDump=Test119403:SubProblem

Before the fix, look for:

V02 loc0 ... [rsp+0x28] ... addr-exposed ...
IN0001: mov ebx, dword ptr [rdx+0x08]
IN0002: mov gword ptr [rsp+0x28], rcx
Stack slot id for offset 40 (0x28) (sp) (untracked) = 0.

After the fix, look for:

must init V02 because it has a GC ref
V02 loc0 ... [rsp+0x28] ... must-init ...
IN0008: xor eax, eax
IN0009: mov qword ptr [rsp+0x28], rax
IN0001: mov ebx, dword ptr [rdx+0x08]

Contributor

Copilot AI left a comment


Pull request overview

This PR adjusts the JIT’s redundant zero-init removal logic to preserve prolog zero-initialization for untracked GC-pointer locals when there has been any earlier potentially-throwing instruction, ensuring those locals contain a safe null value if a GC occurs during exception dispatch (e.g., via a caller exception filter) before the frame unwinds.

Changes:

  • Track whether any prior node in the scanned region may throw (hasImplicitException).
  • Prevent setting lvHasExplicitInit for untracked GC-pointer locals when an earlier throw is possible, so prolog zero-init is retained.
  • Minor refactor to compute “may throw” once per node and reuse for existing implicit-control-flow tracking.

Comment on lines +5670 to +5676
// A caller's exception filter can run arbitrary managed code before this frame is unwound.
// If that triggers a GC, untracked GC locals may be reported even when the throwing method
// has no local EH successors and no explicit GC safe point before this store.
const bool needsInitOnImplicitException =
    hasImplicitException && lclDsc->HasGCPtr() && !lclDsc->lvTracked;

if (!removedExplicitZeroInit && isEntire && !needsInitOnImplicitException &&
@MichalStrehovsky
Member Author

/azp run runtime-nativeaot-outerloop

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jakobbotsch
Member

> A caller exception filter can run managed code and trigger GC before the throwing frame unwinds,

How does this work when there is no GC safe point at that location? Is this for fully interruptible cases only? What happens for partially interruptible cases?

@MichalStrehovsky
Member Author

GPT-5.5 response follows, so don't take it at face value.

There are two separate concepts here:

  1. A GC safe point where normal cooperative GC can suspend a running method and use a call-site/live-state table.
  2. An exception-aborted frame where the method is no longer going to resume normally, but its physical stack frame is still present while first-pass EH/filter code is running.

This case is the second one.

Why GC can see this frame without a call-site safe point

The GC is not interrupting SubProblem at mov ebx, [rdx+8] as a normal safe point. SubProblem has faulted, first-pass EH is running, and the caller’s filter calls GC.Collect(). The SubProblem frame is still on the stack, so the stack walker/code manager is asked to enumerate it with ExecutionAborted set.

NativeAOT does this in the code manager:

bool executionAborted = ((CoffNativeMethodInfo *)pMethodInfo)->executionAborted;
if (executionAborted)
    flags = ICodeManagerFlags::ExecutionAborted;

Then TGcInfoDecoder::EnumerateLiveSlots has special logic for ExecutionAborted.

Fully interruptible case

For fully interruptible GC info, the decoder can compute liveness at essentially any instruction offset. So an aborted frame can still be reported at the throw/fault IP even if that IP is not a call-site safe point.

That means all reportable GC locations must contain valid values at any instruction where the method may throw and remain on the stack for first-pass EH. For untracked stack slots, “valid” generally means either a real object reference or null.

This is the model that exposes the bug.

Partially interruptible case

For partially interruptible methods, the decoder first tries the exact safepoint table for normal, non-aborted enumeration:

if (!executionAborted && m_SafePointIndex != m_NumSafePoints)
{
    // use safe point live state
}

But for aborted frames, it does not rely on call-site safe points. Instead it checks whether the faulting IP is inside an encoded interruptible range:

if (countIntersections == 0 && executionAborted)
{
    LOG(("Not reporting this frame because it is aborted and not fully interruptible.\n"));
    goto ExitSuccess;
}

So for partially interruptible methods:

  • If the aborted IP is not in an interruptible range, the decoder skips the frame entirely. This was the essence of PR #119403: avoid reporting non-interruptible aborted frames.
  • If the aborted IP is in an interruptible range, the decoder reports the frame using the interruptible-range lifetime encoding.
  • After tracked slots are handled, it reports untracked slots for the frame unless suppressed:
if (slotDecoder.GetNumUntracked() && !(inputFlags & (ParentOfFuncletStackFrame | NoReportUntracked)))
{
    ReportUntrackedSlots(...);
}

This is why the bug still happens for the current test even though SubProblem is shown as partially interruptible in the JIT dump. The faulting instruction is in a reportable/interruptible region, so the frame is not skipped, and the untracked stack slot is reported.
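
The reporting decision described above can be condensed into a small C++ model. This is a sketch only: the `ModelFlags` values and the `ReportsUntrackedSlots` function are hypothetical and do not match the runtime's actual constants or the structure of `EnumerateLiveSlots`; `ipInInterruptibleRange` stands in for `countIntersections != 0`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag values for this sketch; the real constants are defined
// in the runtime headers and may differ.
enum ModelFlags : std::uint32_t
{
    ExecutionAborted          = 1u << 0,
    ParentOfFuncletStackFrame = 1u << 1,
    NoReportUntracked         = 1u << 2,
};

// Returns true if, per the description above, the untracked slots of a frame
// would be reported to the GC.
bool ReportsUntrackedSlots(std::uint32_t flags, bool ipInInterruptibleRange, unsigned numUntracked)
{
    const bool executionAborted = (flags & ExecutionAborted) != 0;

    // An aborted frame whose IP lies outside every interruptible range is
    // skipped entirely (the behaviour attributed to PR #119403).
    if (executionAborted && !ipInInterruptibleRange)
    {
        return false;
    }

    // Otherwise untracked slots are reported unless explicitly suppressed.
    return numUntracked != 0 && (flags & (ParentOfFuncletStackFrame | NoReportUntracked)) == 0;
}
```

In this model the failing test corresponds to `ReportsUntrackedSlots(ExecutionAborted, true, 1)`, which is true, so the slot content has to be valid at the faulting instruction.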

Why the fix is still needed

The bad slot is untracked:

Stack slot id for offset 40 (0x28) (sp) (untracked) = 0.

Untracked GC slots are not protected by precise per-instruction liveness in the same way tracked locals are. Once the frame is considered reportable, untracked slots are reported. Therefore, an untracked GC stack slot must be initialized before any earlier instruction can throw and make the frame visible to first-pass EH GC reporting.

In this case, before the fix:

push rbx
sub  rsp, 48
mov  ebx, dword ptr [rdx+0x08]      ; throws
mov  gword ptr [rsp+0x28], rcx      ; initializes untracked GC slot too late

After the fix:

push rbx
sub  rsp, 48
xor  eax, eax
mov  qword ptr [rsp+0x28], rax      ; safe null before any throwing instruction
mov  ebx, dword ptr [rdx+0x08]      ; throws
mov  gword ptr [rsp+0x28], rcx

So the answer is: this is not limited to fully interruptible code. Fully interruptible frames can be reported at arbitrary aborted IPs; partially interruptible aborted frames are reported only if the IP is in an interruptible range. In either case, if the frame is reported, untracked GC stack slots must already contain valid GC values.
