Skip to content

cuda.core: require explicit stream for stream-scheduling APIs#2020

Open
Andy-Jost wants to merge 11 commits intoNVIDIA:mainfrom
Andy-Jost:explicit-stream-2001
Open

cuda.core: require explicit stream for stream-scheduling APIs#2020
Andy-Jost wants to merge 11 commits intoNVIDIA:mainfrom
Andy-Jost:explicit-stream-2001

Conversation

@Andy-Jost
Copy link
Copy Markdown
Contributor

@Andy-Jost Andy-Jost commented May 4, 2026

Closes #2001.

Summary

Removes the implicit fallback to default_stream() (or NULL) on every cuda.core API that schedules work on a stream. stream is now a required keyword-only argument; Stream_accept(None) raises TypeError. Callers that want the previous behavior must pass device.default_stream (or any explicit Stream) at the call site.

Changes

API surface

  • MemoryResource.allocate / MemoryResource.deallocate and overrides on DeviceMemoryResource, PinnedMemoryResource, ManagedMemoryResource, LegacyPinnedMemoryResource, and GraphMemoryResource.
  • Device.allocate.
  • GraphicsResource.map.
  • KernelOccupancy.max_potential_cluster_size / max_active_clusters.
  • Buffer.from_ipc_descriptor.

deallocate is keyword-only on every memory-resource subclass. The synchronous resources (LegacyPinnedMemoryResource, _SynchronousMemoryResource, VirtualMemoryResource) keep stream=None as a default, since those resources do not actually use the stream.

Stream_accept is promoted to cpdef so the pure-Python legacy/sync resources can call the centralized validation helper.

Exemptions (no change)

  • Graph.launch(stream) — same shape as Graph.upload(stream) and the kernel launch(stream, config, kernel, *args) API; stream stays the first positional argument.
  • VirtualMemoryResource.allocate / deallocatecuMemCreate / cuMemMap are synchronous and not stream-ordered. Stream is keyword-only but optional and validated when provided.
  • Buffer.close(stream=None), GraphicsResource.unmap(stream=None), GraphicsResource.close(stream=None)None here means "reuse the existing stream stored in the handle", not "fall back to default" — already on the exemption list in the issue.

Test Coverage

  • tests/test_module.py: added pytest.raises(TypeError, ...) for KernelOccupancy.max_potential_cluster_size / max_active_clusters when stream is omitted.
  • tests/test_graphics.py: added test_map_requires_explicit_stream (replaces test_map_with_default_stream).
  • All existing test sites and examples updated to pass stream= explicitly.
  • Full cuda_core/tests suite passes locally on Linux/A10/CUDA 13.1.

Related Work

…#2001)

Removes the implicit fallback to default_stream() (or NULL) on APIs that
schedule work on a stream. `stream` is now a required keyword-only
argument; `Stream_accept(None)` raises TypeError.

Affected APIs:
- MemoryResource.allocate / deallocate and overrides on
  DeviceMemoryResource, PinnedMemoryResource, ManagedMemoryResource,
  LegacyPinnedMemoryResource, GraphMemoryResource.
- Device.allocate.
- GraphicsResource.map.
- KernelOccupancy.max_potential_cluster_size / max_active_clusters.
- Graph.launch (stream was previously positional).

Stream_accept is promoted to cpdef so the pure-Python legacy/sync
resources can call it.

Also fixes a latent bug uncovered while doing this: the C++ MR
deallocation callback in Buffer's GC path was calling
`mr.deallocate(ptr, size, stream)` positionally, which would fail with
the new keyword-only signature for every garbage-collected
DeviceMemoryResource/GraphMemoryResource buffer. Switched to
`stream=stream`.

VirtualMemoryResource is exempt because cuMemCreate / cuMemMap are
synchronous and not stream-ordered; it now accepts (and validates) an
optional stream instead of rejecting any non-None value.

Buffer.from_ipc_descriptor is also exempt: stream there only seeds the
deallocation stream stored in the handle (no work is scheduled), the
same shape as Buffer.close(stream=None).

Tests, examples, and the v1.0.0 release note are updated accordingly.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost Andy-Jost added this to the cuda.core v1.0.0 milestone May 4, 2026
@Andy-Jost Andy-Jost added bug Something isn't working P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels May 4, 2026
@Andy-Jost Andy-Jost self-assigned this May 4, 2026
@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented May 4, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@Andy-Jost
Copy link
Copy Markdown
Contributor Author

/ok to test

NVIDIA#2001)

Buffer.from_ipc_descriptor previously fell back to default_stream() when
stream=None. That fallback is exactly the implicit-fallback pattern issue
NVIDIA#2001 removes (the chosen stream depends on global state, not the call
site), so it does not belong in the same exemption category as
Buffer.close(stream=None) / GraphicsResource.unmap(stream=None) which
genuinely reuse an existing stream.

stream is now keyword-only and required. Internal validation goes through
Stream_accept like the other tightened APIs. Tests and the v1.0.0 release
note updated accordingly.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

@Andy-Jost
Copy link
Copy Markdown
Contributor Author

/ok to test

…A#2001)

- Make `deallocate` keyword-only on the synchronous resources
  (`LegacyPinnedMemoryResource`, `_SynchronousMemoryResource`,
  `VirtualMemoryResource`) so every memory-resource API obeys the
  kw-only rule, with `stream=None` as the default since these resources
  do not actually use the stream.
- Revert `Graph.launch` to take `stream` positionally. It is the same
  shape as the kernel `launch(stream, config, kernel, *args)` API
  (already exempt in the issue) and shouldn't be the odd one out.
- Tighten `VirtualMemoryResource.deallocate` docstring to match
  `allocate`.
- Mark unused lambda args in `test_pass_object` as `_stream` to silence
  ARG005.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost Andy-Jost force-pushed the explicit-stream-2001 branch from b9cf2bf to 7c4debe Compare May 4, 2026 23:24
@Andy-Jost
Copy link
Copy Markdown
Contributor Author

/ok to test

…A#2001)

Review follow-ups:

- Tighten the test-only `MemoryResource` subclasses (`DummyDeviceMemoryResource`,
  `DummyHostMemoryResource`, `DummyPinnedMemoryResource`,
  `DummyUnifiedMemoryResource`, `TrackingMR`, `StreamCaptureMR`) to match the
  new public API: `allocate(self, size, *, stream)` and
  `deallocate(self, ptr, size, *, stream)` with no default. Previously the
  mocks accepted `stream=None` positionally, which let tests bypass the new
  explicit-stream policy.
- Update the affected helper functions and call sites in `test_memory.py` to
  pass `stream=device.default_stream` explicitly. Fix the
  `super().deallocate(ptr, size, stream)` positional call in
  `test_mr_deallocate_receives_stream` to use `stream=stream`.
- Update `helpers/buffers.py` similarly (`make_scratch_buffer`, `PatternGen`).
- Add a direct test for the centralized `Stream_accept(None)` -> `TypeError`
  behavior in `test_stream.py`.
- Tighten the release note for `Buffer.from_ipc_descriptor`: lead with the
  removal of the silent fallback to the default stream rather than the
  positional-to-keyword shift.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost
Copy link
Copy Markdown
Contributor Author

/ok to test

@Andy-Jost Andy-Jost marked this pull request as ready for review May 4, 2026 23:59
@Andy-Jost Andy-Jost requested review from leofang and rparolin May 4, 2026 23:59
`Buffer._reduce_helper` (the pickle/unpickle factory) previously called
`Buffer.from_ipc_descriptor(mr, ipc_descriptor)` without a stream and
relied on the implicit `default_stream()` fallback inside
`Buffer_from_ipc_descriptor`. Making `from_ipc_descriptor`'s stream a
required keyword-only argument broke this code path, causing every
multiprocessing IPC test that pickles a `Buffer` (test_send_buffers,
test_memory_ipc, test_event_ipc, test_serialize, test_workerpool, ...)
to fail in the child process with:

    TypeError: from_ipc_descriptor() needs keyword-only argument stream

Fix: pass `default_stream()` explicitly from `_reduce_helper`. The
parent process's stream isn't portable across processes, so the pickle
path cannot thread an explicit stream through. The receiver can still
override the deallocation stream via `buffer.close(stream=...)`.

The user-facing rule still holds: callers of `Buffer.from_ipc_descriptor`
must pass an explicit stream.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost
Copy link
Copy Markdown
Contributor Author

/ok to test

Cython-generated functions raise

    "FUNC() needs keyword-only argument stream"

while pure-Python functions raise

    "FUNC() missing 1 required keyword-only argument: 'stream'"

The new tests for `Kernel.occupancy.max_potential_cluster_size`,
`Kernel.occupancy.max_active_clusters`, and `GraphicsResource.map`
were matching only the CPython phrasing and failed against the
Cython forms. Loosen the regex to `keyword-only argument`, which
matches both.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread cuda_core/cuda/core/_memory/_buffer.pyx
Comment thread cuda_core/cuda/core/_module.pyx
Copy link
Copy Markdown
Member

@leofang leofang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Andy! LGTM overall. Caught one minor bug.

Comment thread cuda_core/examples/graph_update.py Outdated
Comment thread cuda_core/tests/helpers/buffers.py Outdated
@leofang leofang added breaking Breaking changes are introduced enhancement Any code-related improvements and removed bug Something isn't working labels May 5, 2026
Copy link
Copy Markdown
Member

@leofang leofang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the core API changes look correct and complete. All APIs from #2001 are properly converted to keyword-only with no default, Stream_accept correctly rejects None, and the exemptions are well-justified.

Inline comments flag one potential runtime concern (_mr_dealloc_callback passing None to pool-based MRs) and a couple of minor consistency items. See individual threads for details.

Comment thread cuda_core/cuda/core/_memory/_buffer.pyx
Comment thread cuda_core/cuda/core/_memory/_legacy.py Outdated
Comment thread cuda_core/tests/helpers/buffers.py Outdated
- examples/graph_update.py: use the dedicated `stream` created at the
  top of the example for the pinned allocation, instead of
  `device.default_stream`. Better model for users (Leo).
- _memory/_legacy.py: route the user-supplied `stream` through
  `Stream_accept` in `LegacyPinnedMemoryResource.deallocate` and
  `_SynchronousMemoryResource.deallocate` so a non-`Stream` argument
  raises the clean `TypeError` from `Stream_accept` instead of an
  `AttributeError` from `.sync()` (matches the validation the matching
  `allocate` methods already do).

Co-authored-by: Cursor <cursoragent@cursor.com>
Andy-Jost and others added 2 commits May 5, 2026 10:01
…#2001)

Synchronous memory resources (`LegacyPinnedMemoryResource`,
`_SynchronousMemoryResource`, the various test mocks `DummyDeviceMR`,
`DummyHostMR`, `DummyPinnedMR`, `DummyUnifiedMR`, `NullMemoryResource`,
`TrackingMR`, `StreamCaptureMR`) take a stream argument purely for
interface conformance with stream-ordered MRs but never use it.
Forcing every caller to manufacture a stream just to discard it adds
ceremony and a misleading model.

Switch these MRs' allocate/deallocate signatures to keyword-only
`stream=None` (validated via `Stream_accept` when provided), and drop
the now-unused `stream=...` kwargs from ~35 call sites across
examples, tests, and helpers. Also drop the `device` parameter from
`buffer_initialization` and `buffer_close` test helpers (no longer
needed) and remove leftover Device-setup boilerplate from the
NullMemoryResource dlpack-failure tests.

The user-facing rule is unchanged for the genuinely stream-ordered
APIs (`DeviceMemoryResource`, `PinnedMemoryResource`,
`ManagedMemoryResource`, `GraphMemoryResource`, `Device.allocate`,
`Buffer.from_ipc_descriptor`, etc.): stream remains required and
keyword-only. The release note is updated to reflect the sync-MR
exemption (folding `LegacyPinnedMemoryResource` in alongside
`VirtualMemoryResource`).

Co-authored-by: Cursor <cursoragent@cursor.com>
…VIDIA#2001)

Issue: the C++ ``shared_ptr`` deleter for a buffer's device-pointer
handle invokes ``MemoryResource.deallocate`` via ``_mr_dealloc_callback``.
The handle's deallocation stream is set separately via
``set_deallocation_stream``; if it was never set (e.g. buffers minted via
``Buffer.from_handle(ptr, size, mr=mr)`` from DLPack import, IPC import,
or third-party adapters), the callback would pass ``stream=None`` to
``mr.deallocate``. After the strict-stream changes for NVIDIA#2001, the
stream-ordered MR overrides reject ``stream=None`` via ``Stream_accept``
and raise ``TypeError``. The ``noexcept`` callback catches the exception,
prints a warning to stderr, and returns -- silently **leaking** the
underlying CUDA allocation (and any associated IPC handles).

Fix: when ``h_stream`` is empty in ``_mr_dealloc_callback``, fall back
to ``default_stream()`` instead of ``None``. The C++ teardown path is
the unique legitimate "no-stream-context" caller (no Python frame from
which to obtain a stream), so this is the one place where an implicit
default-stream fallback is necessary; everywhere else the policy
remains "stream is required and must be passed explicitly".

Add ``test_mr_dealloc_callback_falls_back_to_default_stream`` covering
the regression: a strict stream-ordered mock MR is used to back a
``Buffer.from_handle`` (no attached stream), and the test asserts that
``deallocate`` is invoked with the default stream rather than failing
with ``TypeError`` and leaking.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost
Copy link
Copy Markdown
Contributor Author

  • ca2f5f70 adds missing Stream_accept calls for validating optional streams. Also fixes a case where the wrong stream was supplied in an ignored context (though this is removed in the subsequent commit).
  • 91807eea removes ignored stream arguments supplied to synchronous resources, including Legacy* and Dummy* MRs, and TrackingMR, among others. Adds comments indicating which APIs accept but ignore streams for compatibility.
  • 5e9a5ea2 addresses this comment, by passing the default stream from Buffer to mr.deallocate when no deallocation stream was supplied. Although this introduces an implicit stream, which is discouraged, the alternative (leaking memory) is worse.

Andy-Jost added 2 commits May 5, 2026 10:28
Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	cuda_core/cuda/core/_memory/_buffer.pyx
#	cuda_core/cuda/core/_memory/_graph_memory_resource.pyx
#	cuda_core/cuda/core/_memory/_legacy.py
#	cuda_core/cuda/core/_memory/_memory_pool.pyx
#	cuda_core/cuda/core/_module.pyx
#	cuda_core/docs/source/release/1.0.0-notes.rst
stream = Stream._from_handle(Stream, h_stream)
mr.deallocate(int(ptr), size, stream)
stream = Stream._from_handle(Stream, h_stream) if h_stream else default_stream()
mr.deallocate(int(ptr), size, stream=stream)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My apology. It seems my bot was hallucinating? I thought we guaranteed already that we always have a deallocation stream. There should be no way that we would need the default stream here. @Andy-Jost could you confirm?

Below is bot-generated:


For pool allocations via _MemPool.allocate(): the invariant holds — h_stream is always non-zero. In deviceptr_alloc_from_pool() (resource_handles.cpp:719), the stream is stored in the DevicePtrBox at creation and captured by the shared_ptr deleter. Same for Buffer.from_ipc_descriptor(). So for the normal async MR allocation path, the default_stream() fallback is unreachable.

The one path that breaks the invariant: Buffer.from_handle(ptr, size, mr=some_mr). This calls deviceptr_create_with_mr() (resource_handles.cpp:847) which creates the box with StreamHandle{} (empty). No stream parameter is accepted by from_handle(). So if someone constructs a buffer via from_handle(mr=pool_mr), the callback sees h_stream=0.

So the question is: is Buffer.from_handle(mr=pool_mr) with an async MR a legitimate use case? If not, the default_stream() fallback papers over something that should be an error. If yes, then from_handle() arguably should accept an optional stream when mr is provided.

For the test Andy added (test_mr_dealloc_callback_falls_back_to_default_stream), it constructs a buffer via Buffer.from_handle(1, 1024, mr=mr) — which is exactly the from_handle path, not the pool allocation path. This confirms the fallback only matters for from_handle, not for normal pool allocations.

Should this be flagged in the review? The options seem to be:

  • a) Keep the fallback but document it's specifically for the from_handle(mr=...) edge case, not for pool allocations
  • b) Make from_handle() accept an optional stream param when mr is provided, and treat missing-stream-on-async-MR as an error in the callback instead

assert received["stream"].handle == stream.handle


def test_mr_dealloc_callback_falls_back_to_default_stream():
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking Breaking changes are introduced cuda.core Everything related to the cuda.core module enhancement Any code-related improvements P0 High priority - Must do!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cuda.core: reject stream=None — require explicit stream everywhere

2 participants