
GH-148937: fix for free-threaded GC (RSS based defer)#148940

Open
nascheme wants to merge 8 commits into python:main from nascheme:ft-gc-threshold-fix-main

Conversation

Member

@nascheme nascheme commented Apr 23, 2026

Asking the OS for the process memory usage doesn't work well given how mimalloc works. It does not promptly return memory to the OS and so the memory doesn't drop after cyclic trash is freed.

Instead of asking the OS, use mimalloc APIs to compute how much memory is being used by all mimalloc arenas. We need to stop-the-world to do this but usually we can avoid doing a collection. So, from a performance perspective, this is worth it.

Tim Peters has a GC stress tester that quickly shows the issue, linked below. Before this fix, when I run it, the process RSS quickly goes up to 1 GB. After the fix, the RSS stays at about 100 MB. For comparison, the 3.13 GC keeps RSS at about 200 MB.

tim-gc-test.py

Benchmark results

@nascheme
Member Author

Note that this adds two extra stop/start-the-world points. We need STW to call the mimalloc APIs to compute the memory usage (iterating through arenas). We could likely consolidate one or both of these with existing STW points but I think it makes the code more complex. So I decided to keep it simple for now. I think we should backport this change to 3.14.

Comment thread Python/gc_free_threading.c Outdated
@nascheme nascheme force-pushed the ft-gc-threshold-fix-main branch from ac09833 to a853c00 Compare April 30, 2026 14:40
@read-the-docs-community

read-the-docs-community Bot commented Apr 30, 2026

Documentation build overview

📚 cpython-previews | 🛠️ Build #32516238 | 📁 Comparing 801737f against main (7686abe)


@nascheme
Member Author

Based on a suggestion from Sam, I changed it to instead estimate mimalloc memory use by counting full mimalloc pages. This requires a couple of changes to mimalloc itself but avoids the STW blocks and so should perform better. The accounting happens when a page transitions from non-full->full and so should have minimal performance overhead.

@nascheme
Member Author

Benchmark results from cyclotron. The first table compares 3.14.3t to this PR; note that the r-trash ratios are mostly small. The second table compares 3.13 (GIL, generational GC) with this PR.

base=./py-3.14t/bin/python vs new=/home/nas/src/cpython/python

cycle extra live t(s) r-t rss r-rss trash r-trash pause r-pause
10 0 100 21.05 0.5 39M 1.0 14k 0.2 2.60 0.3
10 0 10.0k 21.05 0.6 39M 1.0 20k 0.2 4.21 0.5
10 0 30.0k 21.06 1.0 41M 1.0 30k 0.2 4.30 0.5
10 10.0k 100 21.05 0.4 45M 0.3 8k 0.1 2.50 0.2
10 10.0k 10.0k 21.05 1.0 60M 0.4 10k 0.1 2.84 0.3
10 10.0k 30.0k 33.08 1.0 83M 0.5 30k 0.2 4.49 0.4
10 100.0k 100 21.05 0.6 130M 0.1 8k 0.1 2.70 0.3
10 100.0k 10.0k 21.05 1.0 278M 0.2 10k 0.1 2.93 0.3
10 100.0k 30.0k 21.06 0.5 594M 0.4 30k 0.2 5.04 0.3
10 300.0k 100 21.06 0.8 274M 0.1 8k 0.1 2.06 0.2
10 300.0k 10.0k 21.06 0.6 881M 0.3 20k 0.2 3.68 0.4
10 300.0k 30.0k 21.06 0.8 1.9G 0.5 30k 0.2 5.48 0.5
100 0 100 25.06 1.0 39M 1.0 16k 0.2 1.72 0.3
100 0 10.0k 33.06 1.1 41M 1.1 30k 0.3 2.52 0.5
100 0 30.0k 41.07 2.0 42M 1.1 60k 0.5 4.05 0.8
100 10.0k 100 25.05 1.2 40M 0.8 8k 0.1 2.18 0.3
100 10.0k 10.0k 29.06 1.4 41M 0.9 20k 0.2 2.18 0.3
100 10.0k 30.0k 21.06 1.0 45M 0.9 30k 0.2 2.77 0.4
100 100.0k 100 21.05 0.5 49M 0.3 8k 0.1 1.82 0.2
100 100.0k 10.0k 21.06 1.0 62M 0.4 10k 0.1 2.02 0.3
100 100.0k 30.0k 21.05 1.0 95M 0.5 30k 0.2 2.79 0.3
100 300.0k 100 25.05 1.2 64M 0.2 8k 0.1 2.44 0.3
100 300.0k 10.0k 33.05 1.6 126M 0.3 20k 0.2 2.40 0.4
100 300.0k 30.0k 21.05 0.7 242M 0.7 30k 0.3 3.27 0.5
1.0k 0 100 21.05 0.7 39M 1.0 16k 0.2 2.13 0.4
1.0k 0 10.0k 25.06 1.2 39M 1.0 30k 0.3 2.96 0.4
1.0k 0 30.0k 21.06 0.6 42M 1.0 30k 0.2 4.66 0.6
1.0k 10.0k 100 21.05 0.5 39M 1.0 16k 0.2 1.81 0.3
1.0k 10.0k 10.0k 21.05 1.0 38M 1.0 30k 0.3 2.38 0.4
1.0k 10.0k 30.0k 21.07 1.0 41M 1.0 30k 0.2 3.31 0.5
1.0k 100.0k 100 21.05 1.0 40M 0.8 9k 0.1 1.76 0.3
1.0k 100.0k 10.0k 21.05 0.8 42M 0.8 20k 0.2 2.35 0.4
1.0k 100.0k 30.0k 21.05 1.0 43M 0.8 30k 0.2 4.70 0.8
1.0k 300.0k 100 25.05 0.7 42M 0.5 9k 0.1 1.82 0.3
1.0k 300.0k 10.0k 21.06 1.0 50M 0.6 30k 0.3 2.27 0.4
1.0k 300.0k 30.0k 25.06 1.2 60M 0.8 30k 0.2 4.76 0.7

Uniform columns omitted:
wl: chain
cyc%: 100
stable: yes

Legend (base vs new, matched by wl/cycle/extra/live/cyc%):
wl workload mode (chain or tree)
cyc% fraction of allocation units made cyclic
t(s) total time for new build
r-t ratio of new/base total time (1.0 = equal, 2.0 = 2x slower)
rss peak RSS for new build
r-rss ratio of new/base peak RSS
trash max uncollected cyclic-garbage for new build
r-trash ratio of new/base max trash
pause max GC pause (ms) for new build
r-pause ratio of new/base max GC pause
stable yes if new build trash count was stable (non-rising)

base=/usr/bin/python3 vs new=/home/nas/src/cpython/python

cycle extra live t(s) r-t rss r-rss trash r-trash pause r-pause
10 0 100 21.05 1.0 39M 2.2 14k 2.9 2.60 1.4
10 0 10.0k 21.05 1.0 39M 1.7 20k 0.3 4.21 0.6
10 0 30.0k 21.06 0.8 41M 1.4 30k 0.2 4.30 0.3
10 10.0k 100 21.05 0.8 45M 2.1 8k 1.6 2.50 1.6
10 10.0k 10.0k 21.05 0.5 60M 0.6 10k 0.1 2.84 0.3
10 10.0k 30.0k 33.08 1.3 83M 0.4 30k 0.2 4.49 0.2
10 100.0k 100 21.05 1.0 130M 2.2 8k 1.6 2.70 1.2
10 100.0k 10.0k 21.05 0.7 278M 0.4 10k 0.1 2.93 0.1
10 100.0k 30.0k 21.06 1.0 594M 0.3 30k 0.2 5.04 0.2
10 300.0k 100 21.06 1.0 274M 1.9 8k 1.6 2.06 0.5
10 300.0k 10.0k 21.06 1.0 881M 0.4 20k 0.3 3.68 0.1
10 300.0k 30.0k 21.06 0.8 1.9G 0.4 30k 0.2 5.48 0.1
100 0 100 25.06 1.1 39M 2.1 16k 2.9 1.72 1.1
100 0 10.0k 33.06 0.9 41M 1.6 30k 0.3 2.52 0.3
100 0 30.0k 41.07 1.0 42M 1.2 60k 0.3 4.05 0.1
100 10.0k 100 25.05 1.2 40M 2.1 8k 1.5 2.18 1.1
100 10.0k 10.0k 29.06 1.1 41M 1.3 20k 0.2 2.18 0.3
100 10.0k 30.0k 21.06 0.7 45M 0.9 30k 0.2 2.77 0.1
100 100.0k 100 21.05 1.0 49M 2.5 8k 1.5 1.82 1.2
100 100.0k 10.0k 21.06 0.4 62M 0.6 10k 0.1 2.02 0.2
100 100.0k 30.0k 21.05 0.5 95M 0.4 30k 0.2 2.79 0.1
100 300.0k 100 25.05 1.1 64M 2.4 8k 1.5 2.44 1.6
100 300.0k 10.0k 33.05 0.8 126M 0.5 20k 0.2 2.40 0.3
100 300.0k 30.0k 21.05 1.0 242M 0.4 30k 0.2 3.27 0.1
1.0k 0 100 21.05 0.6 39M 1.7 16k 0.7 2.13 0.8
1.0k 0 10.0k 25.06 1.0 39M 1.6 30k 0.3 2.96 0.5
1.0k 0 30.0k 21.06 0.4 42M 1.1 30k 0.1 4.66 0.3
1.0k 10.0k 100 21.05 0.6 39M 1.7 16k 0.7 1.81 0.7
1.0k 10.0k 10.0k 21.05 0.6 38M 1.3 30k 0.3 2.38 0.4
1.0k 10.0k 30.0k 21.07 0.7 41M 1.2 30k 0.1 3.31 0.2
1.0k 100.0k 100 21.05 0.8 40M 1.8 9k 0.4 1.76 0.8
1.0k 100.0k 10.0k 21.05 0.4 42M 1.1 20k 0.2 2.35 0.3
1.0k 100.0k 30.0k 21.05 0.8 43M 0.8 30k 0.1 4.70 0.3
1.0k 300.0k 100 25.05 1.0 42M 1.6 9k 0.4 1.82 0.8
1.0k 300.0k 10.0k 21.06 1.0 50M 1.0 30k 0.3 2.27 0.3
1.0k 300.0k 30.0k 25.06 1.1 60M 0.7 30k 0.1 4.76 0.3

Contributor

@colesbury colesbury left a comment


I'm most concerned about the logic for full_page_bytes:

  • Abandoned/reclaimed pages (see comment below)
  • I think we're missing counts for large/huge pages (MI_BIN_HUGE) that don't get marked as full.

Comment thread Objects/obmalloc.c Outdated
Comment on lines +241 to +242
// own pool), so the counter stays valid across abandon/reclaim without any
// hand-off -- abandon and reclaim therefore have no hooks of their own.
Contributor


I think there's still a problem here where we can double count or lose pages that are counted in full_page_bytes:

  • Page becomes full
  • Page is abandoned - now no longer marked as full but still counted in full_page_bytes
  • Block is freed from page
  • Page is reclaimed
  • Block is allocated from page - now full and double counted

I think there's a lot of subtleties here with abandoned pages. I'm not entirely sure what the right approach is.

Member Author


I didn't find a good fix for that abandoned->freed->reclaimed hole. My hope is that it doesn't happen too often, so if the GC runs a bit more often as a result, that's okay.


// Total bytes (block_size * capacity) of pages currently in MI_BIN_FULL
// state whose pool association is this pool.
mi_decl_cache_align _Atomic(intptr_t) full_page_bytes; // = 0
Contributor


I'm worried about contention here because all full/not-full operations modify this shared variable. Repeated allocation/deallocation of a single block can cause the containing page to be repeatedly marked as full/not-full.

I'd prefer we do the counting in per-thread state instead of here, which is effectively per-interpreter state. We can add the total to a per-interpreter counter when the thread exits. That means gc_should_collect_mem_usage would need to loop over all thread states to get an estimate of the allocated bytes, but I think that's a worthwhile tradeoff.

Member Author


I added full_page_bytes to the mi_heap_t struct, which is per-thread. Putting it inside the Python thread state does work but it adds some extra complication.

Comment thread Objects/obmalloc.c Outdated
nascheme added 6 commits May 1, 2026 20:31
Asking the OS for the process memory usage doesn't work well given how
mimalloc works.  It does not promptly return memory to the OS and so the
memory doesn't drop after cyclic trash is freed.

Instead of asking the OS, use mimalloc APIs to compute how much memory
is being used by all mimalloc arenas.  We need to stop-the-world to do
this but usually we can avoid doing a collection.  So, from a
performance perspective, this is worth it.
It's probably better to call this inside gc_collect_main().  That
way, we are not doing the STW from inside the _PyObject_GC_Link() function.
This should have no significant performance impact since we hit this
only after the young object count hits the threshold.
This avoids using STW in exchange for less accurate memory usage
estimates.
This should avoid memory contention.  Avoid casting *intptr_t
to *Py_ssize_t.  Include large and huge pages in the count (promote
eagerly to MI_BIN_FULL).  Add a comment noting that abandoned
pages can potentially be lost (their byte count never being
subtracted).
@nascheme nascheme force-pushed the ft-gc-threshold-fix-main branch from 819a848 to 801737f Compare May 3, 2026 16:25
@nascheme
Copy link
Copy Markdown
Member Author

nascheme commented May 3, 2026

I modified the PR to handle mimalloc large and huge pages. That requires a bit of extra mimalloc change, which is a bit scary. So I think it would be better for this to bake in 3.15 for a while before we backport to 3.14 (assuming we think this approach is acceptable).

The biggest remaining issue, IMO, is the abandoned->free leak. I think the extra complication or slowdown we would pay to fix that is not worth it.

