cuda.core: require explicit stream for stream-scheduling APIs by Andy-Jost · Pull Request #2020 · NVIDIA/cuda-python

Andy-Jost · 2026-05-04T22:41:03Z

Closes #2001.

Summary

Removes the implicit fallback to default_stream() (or NULL) on every cuda.core API that schedules work on a stream. stream is now a required keyword-only argument; Stream_accept(None) raises TypeError. Callers that want the previous behavior must pass device.default_stream (or any explicit Stream) at the call site.
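
As a rough illustration of the policy described above, the following pure-Python sketch pairs a required keyword-only `stream` parameter with a validation helper that rejects `None`. `FakeStream`, `stream_accept`, and `allocate` are hypothetical stand-ins written for this example and are not the cuda.core implementation.

```python
class FakeStream:
    """Hypothetical stand-in for a stream object (not cuda.core.Stream)."""

    def __init__(self, name):
        self.name = name


def stream_accept(arg):
    """Validate a stream argument; None is rejected, not replaced by a default."""
    if arg is None:
        raise TypeError("stream must be passed explicitly; got None")
    if not isinstance(arg, FakeStream):
        raise TypeError(f"expected a FakeStream, got {type(arg).__name__}")
    return arg


def allocate(size, *, stream):
    """Hypothetical stream-ordered allocation: `stream` is keyword-only, no default."""
    s = stream_accept(stream)
    return {"size": size, "stream": s.name}


def raises_type_error(fn, *args, **kwargs):
    """Return True if calling fn with the given arguments raises TypeError."""
    try:
        fn(*args, **kwargs)
    except TypeError:
        return True
    return False


default_stream = FakeStream("default")

# The caller names the stream at the call site.
buf = allocate(1024, stream=default_stream)

omitted_rejected = raises_type_error(allocate, 1024)            # stream omitted
none_rejected = raises_type_error(allocate, 1024, stream=None)  # explicit None
```

In this sketch, omitting `stream` is rejected by the function signature, while an explicit `None` is rejected by the helper, so neither path can silently select a default.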

Changes

API surface

MemoryResource.allocate / MemoryResource.deallocate and overrides on DeviceMemoryResource, PinnedMemoryResource, ManagedMemoryResource, LegacyPinnedMemoryResource, and GraphMemoryResource.
Device.allocate.
GraphicsResource.map.
KernelOccupancy.max_potential_cluster_size / max_active_clusters.
Buffer.from_ipc_descriptor.

deallocate is keyword-only on every memory-resource subclass. The synchronous resources (LegacyPinnedMemoryResource, _SynchronousMemoryResource, VirtualMemoryResource) keep stream=None as a default, since those resources do not actually use the stream.

Stream_accept is promoted to cpdef so the pure-Python legacy/sync resources can call the centralized validation helper.

Exemptions (no change)

Graph.launch(stream) — same shape as Graph.upload(stream) and the kernel launch(stream, config, kernel, *args) API; stream stays the first positional argument.
VirtualMemoryResource.allocate / deallocate — cuMemCreate / cuMemMap are synchronous and not stream-ordered. Stream is keyword-only but optional and validated when provided.
Buffer.close(stream=None), GraphicsResource.unmap(stream=None), GraphicsResource.close(stream=None) — None here means "reuse the existing stream stored in the handle", not "fall back to default" — already on the exemption list in the issue.

Test Coverage

tests/test_module.py: added pytest.raises(TypeError, ...) for KernelOccupancy.max_potential_cluster_size / max_active_clusters when stream is omitted.
tests/test_graphics.py: added test_map_requires_explicit_stream (replaces test_map_with_default_stream).
All existing test sites and examples updated to pass stream= explicitly.
Full cuda_core/tests suite passes locally on Linux/A10/CUDA 13.1.

Related Work

Issue cuda.core: reject stream=None — require explicit stream everywhere #2001

…#2001)

Removes the implicit fallback to default_stream() (or NULL) on APIs that schedule work on a stream. `stream` is now a required keyword-only argument; `Stream_accept(None)` raises TypeError.

Affected APIs:
- MemoryResource.allocate / deallocate and overrides on DeviceMemoryResource, PinnedMemoryResource, ManagedMemoryResource, LegacyPinnedMemoryResource, GraphMemoryResource.
- Device.allocate.
- GraphicsResource.map.
- KernelOccupancy.max_potential_cluster_size / max_active_clusters.
- Graph.launch (stream was previously positional).

Stream_accept is promoted to cpdef so the pure-Python legacy/sync resources can call it.

Also fixes a latent bug uncovered while doing this: the C++ MR deallocation callback in Buffer's GC path was calling `mr.deallocate(ptr, size, stream)` positionally, which would fail with the new keyword-only signature for every garbage-collected DeviceMemoryResource/GraphMemoryResource buffer. Switched to `stream=stream`.

VirtualMemoryResource is exempt because cuMemCreate / cuMemMap are synchronous and not stream-ordered; it now accepts (and validates) an optional stream instead of rejecting any non-None value.

Buffer.from_ipc_descriptor is also exempt: stream there only seeds the deallocation stream stored in the handle (no work is scheduled), the same shape as Buffer.close(stream=None).

Tests, examples, and the v1.0.0 release note are updated accordingly.

Co-authored-by: Cursor <cursoragent@cursor.com>
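
The latent bug mentioned in this commit message follows from ordinary Python calling rules: a parameter declared after `*` cannot be supplied positionally. A minimal sketch, using a hypothetical `DemoResource` rather than any real memory resource:

```python
class DemoResource:
    """Hypothetical resource whose deallocate takes a keyword-only stream."""

    def __init__(self):
        self.calls = []

    def deallocate(self, ptr, size, *, stream):
        self.calls.append((ptr, size, stream))


mr = DemoResource()

# Positional form, as in the old callback: rejected by the new signature.
try:
    mr.deallocate(0x1000, 256, "s0")
    positional_ok = True
except TypeError:
    positional_ok = False

# Keyword form, as in the fix.
mr.deallocate(0x1000, 256, stream="s0")
```

Only the keyword call is recorded in `mr.calls`; the positional call fails before the method body runs.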

copy-pr-bot · 2026-05-04T22:41:07Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Andy-Jost · 2026-05-04T22:45:26Z

/ok to test

NVIDIA#2001)

Buffer.from_ipc_descriptor previously fell back to default_stream() when stream=None. That fallback is exactly the implicit-fallback pattern issue NVIDIA#2001 removes (the chosen stream depends on global state, not the call site), so it does not belong in the same exemption category as Buffer.close(stream=None) / GraphicsResource.unmap(stream=None) which genuinely reuse an existing stream.

stream is now keyword-only and required. Internal validation goes through Stream_accept like the other tightened APIs. Tests and the v1.0.0 release note updated accordingly.

Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-05-04T23:03:34Z

Doc Preview CI
🚀 View preview at https://nvidia.github.io/cuda-python/pr-preview/pr-2020/
https://nvidia.github.io/cuda-python/pr-preview/pr-2020/cuda-core/
https://nvidia.github.io/cuda-python/pr-preview/pr-2020/cuda-bindings/
https://nvidia.github.io/cuda-python/pr-preview/pr-2020/cuda-pathfinder/
Preview will be ready when the GitHub Pages deployment is complete.

Andy-Jost · 2026-05-04T23:17:18Z

/ok to test

…A#2001)

- Make `deallocate` keyword-only on the synchronous resources (`LegacyPinnedMemoryResource`, `_SynchronousMemoryResource`, `VirtualMemoryResource`) so every memory-resource API obeys the kw-only rule, with `stream=None` as the default since these resources do not actually use the stream.
- Revert `Graph.launch` to take `stream` positionally. It is the same shape as the kernel `launch(stream, config, kernel, *args)` API (already exempt in the issue) and shouldn't be the odd one out.
- Tighten `VirtualMemoryResource.deallocate` docstring to match `allocate`.
- Mark unused lambda args in `test_pass_object` as `_stream` to silence ARG005.

Co-authored-by: Cursor <cursoragent@cursor.com>

Andy-Jost · 2026-05-04T23:28:13Z

/ok to test

…A#2001)

Review follow-ups:

- Tighten the test-only `MemoryResource` subclasses (`DummyDeviceMemoryResource`, `DummyHostMemoryResource`, `DummyPinnedMemoryResource`, `DummyUnifiedMemoryResource`, `TrackingMR`, `StreamCaptureMR`) to match the new public API: `allocate(self, size, *, stream)` and `deallocate(self, ptr, size, *, stream)` with no default. Previously the mocks accepted `stream=None` positionally, which let tests bypass the new explicit-stream policy.
- Update the affected helper functions and call sites in `test_memory.py` to pass `stream=device.default_stream` explicitly. Fix the `super().deallocate(ptr, size, stream)` positional call in `test_mr_deallocate_receives_stream` to use `stream=stream`.
- Update `helpers/buffers.py` similarly (`make_scratch_buffer`, `PatternGen`).
- Add a direct test for the centralized `Stream_accept(None)` -> `TypeError` behavior in `test_stream.py`.
- Tighten the release note for `Buffer.from_ipc_descriptor`: lead with the removal of the silent fallback to the default stream rather than the positional-to-keyword shift.

Co-authored-by: Cursor <cursoragent@cursor.com>

Andy-Jost · 2026-05-04T23:53:22Z

/ok to test

`Buffer._reduce_helper` (the pickle/unpickle factory) previously called `Buffer.from_ipc_descriptor(mr, ipc_descriptor)` without a stream and relied on the implicit `default_stream()` fallback inside `Buffer_from_ipc_descriptor`. Making `from_ipc_descriptor`'s stream a required keyword-only argument broke this code path, causing every multiprocessing IPC test that pickles a `Buffer` (test_send_buffers, test_memory_ipc, test_event_ipc, test_serialize, test_workerpool, ...) to fail in the child process with:

    TypeError: from_ipc_descriptor() needs keyword-only argument stream

Fix: pass `default_stream()` explicitly from `_reduce_helper`. The parent process's stream isn't portable across processes, so the pickle path cannot thread an explicit stream through. The receiver can still override the deallocation stream via `buffer.close(stream=...)`.

The user-facing rule still holds: callers of `Buffer.from_ipc_descriptor` must pass an explicit stream.

Co-authored-by: Cursor <cursoragent@cursor.com>
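
The shape of this fix can be sketched with Python's `__reduce__` protocol. Everything below (`DemoBuffer`, `from_descriptor`, `_rebuild`, `receiver_default_stream`) is a hypothetical stand-in, and the reduce tuple is invoked by hand instead of going through `pickle`, so this is only an approximation of the real code path.

```python
class DemoBuffer:
    """Hypothetical buffer that can be rebuilt from a descriptor."""

    def __init__(self, descriptor, stream):
        self.descriptor = descriptor
        self.stream = stream

    @classmethod
    def from_descriptor(cls, descriptor, *, stream):
        # Required keyword-only stream, mirroring the tightened API.
        if stream is None:
            raise TypeError("stream must be passed explicitly")
        return cls(descriptor, stream)

    def __reduce__(self):
        # Only the descriptor travels; the stream is process-local.
        return (_rebuild, (self.descriptor,))


def receiver_default_stream():
    """Hypothetical: the default stream of the receiving process."""
    return "default-stream-of-receiver"


def _rebuild(descriptor):
    # The factory names the stream explicitly instead of relying on a fallback.
    return DemoBuffer.from_descriptor(descriptor, stream=receiver_default_stream())


original = DemoBuffer("ipc-descriptor", stream="producer-stream")
factory, args = original.__reduce__()
clone = factory(*args)  # what unpickling would do with the reduce tuple
```

The rebuilt object carries the receiver's default stream, not the producer's, which matches the commit's point that the sender's stream cannot be threaded across processes.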

Andy-Jost · 2026-05-05T00:27:10Z

/ok to test

Cython-generated functions raise
    "FUNC() needs keyword-only argument stream"
while pure-Python functions raise
    "FUNC() missing 1 required keyword-only argument: 'stream'"

The new tests for `Kernel.occupancy.max_potential_cluster_size`, `Kernel.occupancy.max_active_clusters`, and `GraphicsResource.map` were matching only the CPython phrasing and failed against the Cython forms. Loosen the regex to `keyword-only argument`, which matches both.

Co-authored-by: Cursor <cursoragent@cursor.com>
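
A small check of the loosened pattern against both phrasings, as a sketch: the Cython message is copied from the error quoted earlier in this PR, and the CPython message is produced by calling a throwaway function (`pure_python_func`, a name invented for this example) without its keyword-only argument.

```python
import re

PATTERN = r"keyword-only argument"

# Phrasing quoted in this PR for a Cython-generated function.
cython_msg = "from_ipc_descriptor() needs keyword-only argument stream"


def pure_python_func(*, stream):
    return stream


# Obtain the CPython phrasing by actually omitting the argument.
try:
    pure_python_func()
except TypeError as exc:
    cpython_msg = str(exc)

matches_cython = re.search(PATTERN, cython_msg) is not None
matches_cpython = re.search(PATTERN, cpython_msg) is not None
```

Because `pytest.raises(..., match=...)` applies `re.search` to the message, the same substring pattern works there as well.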

leofang

Thanks, Andy! LGTM overall. Caught one minor bug.

leofang

Overall the core API changes look correct and complete. All APIs from #2001 are properly converted to keyword-only with no default, Stream_accept correctly rejects None, and the exemptions are well-justified.

Inline comments flag one potential runtime concern (_mr_dealloc_callback passing None to pool-based MRs) and a couple of minor consistency items. See individual threads for details.

- examples/graph_update.py: use the dedicated `stream` created at the top of the example for the pinned allocation, instead of `device.default_stream`. Better model for users (Leo).
- _memory/_legacy.py: route the user-supplied `stream` through `Stream_accept` in `LegacyPinnedMemoryResource.deallocate` and `_SynchronousMemoryResource.deallocate` so a non-`Stream` argument raises the clean `TypeError` from `Stream_accept` instead of an `AttributeError` from `.sync()` (matches the validation the matching `allocate` methods already do).

Co-authored-by: Cursor <cursoragent@cursor.com>

…#2001)

Synchronous memory resources (`LegacyPinnedMemoryResource`, `_SynchronousMemoryResource`, the various test mocks `DummyDeviceMR`, `DummyHostMR`, `DummyPinnedMR`, `DummyUnifiedMR`, `NullMemoryResource`, `TrackingMR`, `StreamCaptureMR`) take a stream argument purely for interface conformance with stream-ordered MRs but never use it. Forcing every caller to manufacture a stream just to discard it adds ceremony and a misleading model.

Switch these MRs' allocate/deallocate signatures to keyword-only `stream=None` (validated via `Stream_accept` when provided), and drop the now-unused `stream=...` kwargs from ~35 call sites across examples, tests, and helpers. Also drop the `device` parameter from `buffer_initialization` and `buffer_close` test helpers (no longer needed) and remove leftover Device-setup boilerplate from the NullMemoryResource dlpack-failure tests.

The user-facing rule is unchanged for the genuinely stream-ordered APIs (`DeviceMemoryResource`, `PinnedMemoryResource`, `ManagedMemoryResource`, `GraphMemoryResource`, `Device.allocate`, `Buffer.from_ipc_descriptor`, etc.): stream remains required and keyword-only. The release note is updated to reflect the sync-MR exemption (folding `LegacyPinnedMemoryResource` in alongside `VirtualMemoryResource`).

Co-authored-by: Cursor <cursoragent@cursor.com>
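
The exemption for synchronous resources can be pictured roughly as follows. `FakeStream`, `stream_accept`, and `SyncResource` are hypothetical names for this sketch; the real resources and validation helper may differ in detail.

```python
class FakeStream:
    """Hypothetical stand-in for a stream object."""


def stream_accept(arg):
    """Hypothetical validator: only FakeStream instances pass."""
    if not isinstance(arg, FakeStream):
        raise TypeError(f"expected a FakeStream, got {type(arg).__name__}")
    return arg


class SyncResource:
    """Hypothetical synchronous resource: stream is accepted for interface
    conformance, validated when given, and otherwise unused."""

    def allocate(self, size, *, stream=None):
        if stream is not None:
            stream_accept(stream)
        return bytearray(size)

    def deallocate(self, ptr, size, *, stream=None):
        if stream is not None:
            stream_accept(stream)
        # Nothing is scheduled on the stream in this sketch.


mr = SyncResource()
buf = mr.allocate(16)                        # no stream needed
mr.deallocate(buf, 16, stream=FakeStream())  # accepted and ignored

try:
    mr.deallocate(buf, 16, stream="not-a-stream")
    bad_stream_rejected = False
except TypeError:
    bad_stream_rejected = True
```

Validating an optional argument only when it is present keeps the call sites short while still catching a wrong type early with a clear `TypeError`.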

…VIDIA#2001)

Issue: the C++ ``shared_ptr`` deleter for a buffer's device-pointer handle invokes ``MemoryResource.deallocate`` via ``_mr_dealloc_callback``. The handle's deallocation stream is set separately via ``set_deallocation_stream``; if it was never set (e.g. buffers minted via ``Buffer.from_handle(ptr, size, mr=mr)`` from DLPack import, IPC import, or third-party adapters), the callback would pass ``stream=None`` to ``mr.deallocate``.

After the strict-stream changes for NVIDIA#2001, the stream-ordered MR overrides reject ``stream=None`` via ``Stream_accept`` and raise ``TypeError``. The ``noexcept`` callback catches the exception, prints a warning to stderr, and returns -- silently **leaking** the underlying CUDA allocation (and any associated IPC handles).

Fix: when ``h_stream`` is empty in ``_mr_dealloc_callback``, fall back to ``default_stream()`` instead of ``None``. The C++ teardown path is the unique legitimate "no-stream-context" caller (no Python frame from which to obtain a stream), so this is the one place where an implicit default-stream fallback is necessary; everywhere else the policy remains "stream is required and must be passed explicitly".

Add ``test_mr_dealloc_callback_falls_back_to_default_stream`` covering the regression: a strict stream-ordered mock MR is used to back a ``Buffer.from_handle`` (no attached stream), and the test asserts that ``deallocate`` is invoked with the default stream rather than failing with ``TypeError`` and leaking.

Co-authored-by: Cursor <cursoragent@cursor.com>
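
The leak scenario and the fallback can be imitated in plain Python. This is a sketch under stated assumptions: `StrictResource`, `default_stream`, and `dealloc_callback` are hypothetical, and the `fallback` flag exists only to show the before and after behavior side by side.

```python
import sys


class StrictResource:
    """Hypothetical stream-ordered resource that rejects stream=None."""

    def __init__(self):
        self.freed = []

    def deallocate(self, ptr, size, *, stream):
        if stream is None:
            raise TypeError("stream must be passed explicitly")
        self.freed.append((ptr, stream))


def default_stream():
    """Hypothetical default-stream accessor."""
    return "default"


def dealloc_callback(mr, ptr, size, h_stream, *, fallback):
    """Teardown hook that must not raise; errors are only reported."""
    if h_stream:
        stream = h_stream
    else:
        stream = default_stream() if fallback else None
    try:
        mr.deallocate(ptr, size, stream=stream)
    except Exception as exc:
        # The error is swallowed, so the allocation is never released.
        print(f"warning: deallocation failed: {exc}", file=sys.stderr)


mr = StrictResource()
dealloc_callback(mr, 1, 1024, None, fallback=False)  # swallowed error, nothing freed
leaked = list(mr.freed)
dealloc_callback(mr, 1, 1024, None, fallback=True)   # freed on the default stream
```

Without the fallback the callback returns normally but the resource never sees a successful `deallocate`, which is the silent leak described in the commit message.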

Andy-Jost · 2026-05-05T17:28:17Z

ca2f5f70 adds missing Stream_accept calls for validating optional streams. Also fixes a case where the wrong stream was supplied in an ignored context (though this is removed in the subsequent commit).
91807eea removes ignored stream arguments supplied to synchronous resources, including Legacy* and Dummy* MRs, and TrackingMR, among others. Adds comments indicating which APIs accept but ignore streams for compatibility.
5e9a5ea2 addresses this comment by passing the default stream from Buffer to mr.deallocate when no deallocation stream was supplied. Although this introduces an implicit stream, which is discouraged, the alternative (leaking memory) is worse.

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	cuda_core/cuda/core/_memory/_buffer.pyx
#	cuda_core/cuda/core/_memory/_graph_memory_resource.pyx
#	cuda_core/cuda/core/_memory/_legacy.py
#	cuda_core/cuda/core/_memory/_memory_pool.pyx
#	cuda_core/cuda/core/_module.pyx
#	cuda_core/docs/source/release/1.0.0-notes.rst

leofang · 2026-05-06T02:58:20Z

-            stream = Stream._from_handle(Stream, h_stream)
-        mr.deallocate(int(ptr), size, stream)
+        stream = Stream._from_handle(Stream, h_stream) if h_stream else default_stream()
+        mr.deallocate(int(ptr), size, stream=stream)


My apologies. It seems my bot was hallucinating? I thought we already guaranteed that we always have a deallocation stream. There should be no way that we would need the default stream here. @Andy-Jost could you confirm?

Below is bot-generated:

For pool allocations via _MemPool.allocate(): the invariant holds — h_stream is always non-zero. In deviceptr_alloc_from_pool() (resource_handles.cpp:719), the stream is stored in the DevicePtrBox at creation and captured by the shared_ptr deleter. Same for Buffer.from_ipc_descriptor(). So for the normal async MR allocation path, the default_stream() fallback is unreachable.

The one path that breaks the invariant: Buffer.from_handle(ptr, size, mr=some_mr). This calls deviceptr_create_with_mr() (resource_handles.cpp:847) which creates the box with StreamHandle{} (empty). No stream parameter is accepted by from_handle(). So if someone constructs a buffer via from_handle(mr=pool_mr), the callback sees h_stream=0.

So the question is: is Buffer.from_handle(mr=pool_mr) with an async MR a legitimate use case? If not, the default_stream() fallback papers over something that should be an error. If yes, then from_handle() arguably should accept an optional stream when mr is provided.

For the test Andy added (test_mr_dealloc_callback_falls_back_to_default_stream), it constructs a buffer via Buffer.from_handle(1, 1024, mr=mr) — which is exactly the from_handle path, not the pool allocation path. This confirms the fallback only matters for from_handle, not for normal pool allocations.

Should this be flagged in the review? The options seem to be:

a) Keep the fallback but document it's specifically for the from_handle(mr=...) edge case, not for pool allocations

b) Make from_handle() accept an optional stream param when mr is provided, and treat missing-stream-on-async-MR as an error in the callback instead

leofang · 2026-05-06T03:08:42Z

    assert received["stream"].handle == stream.handle


+def test_mr_dealloc_callback_falls_back_to_default_stream():


Andy-Jost added this to the cuda.core v1.0.0 milestone May 4, 2026

Andy-Jost added the bug (Something isn't working), P0 (High priority - Must do!), and cuda.core (Everything related to the cuda.core module) labels May 4, 2026

Andy-Jost self-assigned this May 4, 2026

Andy-Jost force-pushed the explicit-stream-2001 branch from b9cf2bf to 7c4debe May 4, 2026 23:24

Andy-Jost marked this pull request as ready for review May 4, 2026 23:59

Andy-Jost requested review from leofang and rparolin May 4, 2026 23:59

leofang reviewed May 5, 2026

Comment thread cuda_core/cuda/core/_memory/_buffer.pyx

leofang reviewed May 5, 2026

Comment thread cuda_core/cuda/core/_module.pyx

leofang requested changes May 5, 2026

Comment thread cuda_core/examples/graph_update.py Outdated

Comment thread cuda_core/tests/helpers/buffers.py Outdated

leofang added the breaking (Breaking changes are introduced) and enhancement (Any code-related improvements) labels and removed the bug (Something isn't working) label May 5, 2026

leofang reviewed May 5, 2026

Comment thread cuda_core/cuda/core/_memory/_buffer.pyx

Comment thread cuda_core/cuda/core/_memory/_legacy.py Outdated

Comment thread cuda_core/tests/helpers/buffers.py Outdated

Andy-Jost and others added 2 commits May 5, 2026 10:01

Andy-Jost added 2 commits May 5, 2026 10:28

Merge branch 'main' into explicit-stream-2001

e1d3b8c

leofang mentioned this pull request May 5, 2026

Add managed-memory advise, prefetch, and discard-prefetch free functions #1775

Open

leofang reviewed May 6, 2026
