cuda.core: require explicit stream for stream-scheduling APIs#2020
cuda.core: require explicit stream for stream-scheduling APIs#2020Andy-Jost wants to merge 11 commits intoNVIDIA:mainfrom
Conversation
…#2001) Removes the implicit fallback to default_stream() (or NULL) on APIs that schedule work on a stream. `stream` is now a required keyword-only argument; `Stream_accept(None)` raises TypeError. Affected APIs: - MemoryResource.allocate / deallocate and overrides on DeviceMemoryResource, PinnedMemoryResource, ManagedMemoryResource, LegacyPinnedMemoryResource, GraphMemoryResource. - Device.allocate. - GraphicsResource.map. - KernelOccupancy.max_potential_cluster_size / max_active_clusters. - Graph.launch (stream was previously positional). Stream_accept is promoted to cpdef so the pure-Python legacy/sync resources can call it. Also fixes a latent bug uncovered while doing this: the C++ MR deallocation callback in Buffer's GC path was calling `mr.deallocate(ptr, size, stream)` positionally, which would fail with the new keyword-only signature for every garbage-collected DeviceMemoryResource/GraphMemoryResource buffer. Switched to `stream=stream`. VirtualMemoryResource is exempt because cuMemCreate / cuMemMap are synchronous and not stream-ordered; it now accepts (and validates) an optional stream instead of rejecting any non-None value. Buffer.from_ipc_descriptor is also exempt: stream there only seeds the deallocation stream stored in the handle (no work is scheduled), the same shape as Buffer.close(stream=None). Tests, examples, and the v1.0.0 release note are updated accordingly. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
/ok to test |
NVIDIA#2001) Buffer.from_ipc_descriptor previously fell back to default_stream() when stream=None. That fallback is exactly the implicit-fallback pattern issue NVIDIA#2001 removes (the chosen stream depends on global state, not the call site), so it does not belong in the same exemption category as Buffer.close(stream=None) / GraphicsResource.unmap(stream=None) which genuinely reuse an existing stream. stream is now keyword-only and required. Internal validation goes through Stream_accept like the other tightened APIs. Tests and the v1.0.0 release note updated accordingly. Co-authored-by: Cursor <cursoragent@cursor.com>
|
|
/ok to test |
…A#2001) - Make `deallocate` keyword-only on the synchronous resources (`LegacyPinnedMemoryResource`, `_SynchronousMemoryResource`, `VirtualMemoryResource`) so every memory-resource API obeys the kw-only rule, with `stream=None` as the default since these resources do not actually use the stream. - Revert `Graph.launch` to take `stream` positionally. It is the same shape as the kernel `launch(stream, config, kernel, *args)` API (already exempt in the issue) and shouldn't be the odd one out. - Tighten `VirtualMemoryResource.deallocate` docstring to match `allocate`. - Mark unused lambda args in `test_pass_object` as `_stream` to silence ARG005. Co-authored-by: Cursor <cursoragent@cursor.com>
b9cf2bf to
7c4debe
Compare
|
/ok to test |
…A#2001) Review follow-ups: - Tighten the test-only `MemoryResource` subclasses (`DummyDeviceMemoryResource`, `DummyHostMemoryResource`, `DummyPinnedMemoryResource`, `DummyUnifiedMemoryResource`, `TrackingMR`, `StreamCaptureMR`) to match the new public API: `allocate(self, size, *, stream)` and `deallocate(self, ptr, size, *, stream)` with no default. Previously the mocks accepted `stream=None` positionally, which let tests bypass the new explicit-stream policy. - Update the affected helper functions and call sites in `test_memory.py` to pass `stream=device.default_stream` explicitly. Fix the `super().deallocate(ptr, size, stream)` positional call in `test_mr_deallocate_receives_stream` to use `stream=stream`. - Update `helpers/buffers.py` similarly (`make_scratch_buffer`, `PatternGen`). - Add a direct test for the centralized `Stream_accept(None)` -> `TypeError` behavior in `test_stream.py`. - Tighten the release note for `Buffer.from_ipc_descriptor`: lead with the removal of the silent fallback to the default stream rather than the positional-to-keyword shift. Co-authored-by: Cursor <cursoragent@cursor.com>
|
/ok to test |
`Buffer._reduce_helper` (the pickle/unpickle factory) previously called
`Buffer.from_ipc_descriptor(mr, ipc_descriptor)` without a stream and
relied on the implicit `default_stream()` fallback inside
`Buffer_from_ipc_descriptor`. Making `from_ipc_descriptor`'s stream a
required keyword-only argument broke this code path, causing every
multiprocessing IPC test that pickles a `Buffer` (test_send_buffers,
test_memory_ipc, test_event_ipc, test_serialize, test_workerpool, ...)
to fail in the child process with:
TypeError: from_ipc_descriptor() needs keyword-only argument stream
Fix: pass `default_stream()` explicitly from `_reduce_helper`. The
parent process's stream isn't portable across processes, so the pickle
path cannot thread an explicit stream through. The receiver can still
override the deallocation stream via `buffer.close(stream=...)`.
The user-facing rule still holds: callers of `Buffer.from_ipc_descriptor`
must pass an explicit stream.
Co-authored-by: Cursor <cursoragent@cursor.com>
|
/ok to test |
Cython-generated functions raise
"FUNC() needs keyword-only argument stream"
while pure-Python functions raise
"FUNC() missing 1 required keyword-only argument: 'stream'"
The new tests for `Kernel.occupancy.max_potential_cluster_size`,
`Kernel.occupancy.max_active_clusters`, and `GraphicsResource.map`
were matching only the CPython phrasing and failed against the
Cython forms. Loosen the regex to `keyword-only argument`, which
matches both.
Co-authored-by: Cursor <cursoragent@cursor.com>
leofang
left a comment
There was a problem hiding this comment.
Thanks, Andy! LGTM overall. Caught one minor bug.
leofang
left a comment
There was a problem hiding this comment.
Overall the core API changes look correct and complete. All APIs from #2001 are properly converted to keyword-only with no default, Stream_accept correctly rejects None, and the exemptions are well-justified.
Inline comments flag one potential runtime concern (_mr_dealloc_callback passing None to pool-based MRs) and a couple of minor consistency items. See individual threads for details.
- examples/graph_update.py: use the dedicated `stream` created at the top of the example for the pinned allocation, instead of `device.default_stream`. Better model for users (Leo). - _memory/_legacy.py: route the user-supplied `stream` through `Stream_accept` in `LegacyPinnedMemoryResource.deallocate` and `_SynchronousMemoryResource.deallocate` so a non-`Stream` argument raises the clean `TypeError` from `Stream_accept` instead of an `AttributeError` from `.sync()` (matches the validation the matching `allocate` methods already do). Co-authored-by: Cursor <cursoragent@cursor.com>
…#2001) Synchronous memory resources (`LegacyPinnedMemoryResource`, `_SynchronousMemoryResource`, the various test mocks `DummyDeviceMR`, `DummyHostMR`, `DummyPinnedMR`, `DummyUnifiedMR`, `NullMemoryResource`, `TrackingMR`, `StreamCaptureMR`) take a stream argument purely for interface conformance with stream-ordered MRs but never use it. Forcing every caller to manufacture a stream just to discard it adds ceremony and a misleading model. Switch these MRs' allocate/deallocate signatures to keyword-only `stream=None` (validated via `Stream_accept` when provided), and drop the now-unused `stream=...` kwargs from ~35 call sites across examples, tests, and helpers. Also drop the `device` parameter from `buffer_initialization` and `buffer_close` test helpers (no longer needed) and remove leftover Device-setup boilerplate from the NullMemoryResource dlpack-failure tests. The user-facing rule is unchanged for the genuinely stream-ordered APIs (`DeviceMemoryResource`, `PinnedMemoryResource`, `ManagedMemoryResource`, `GraphMemoryResource`, `Device.allocate`, `Buffer.from_ipc_descriptor`, etc.): stream remains required and keyword-only. The release note is updated to reflect the sync-MR exemption (folding `LegacyPinnedMemoryResource` in alongside `VirtualMemoryResource`). Co-authored-by: Cursor <cursoragent@cursor.com>
…VIDIA#2001) Issue: the C++ ``shared_ptr`` deleter for a buffer's device-pointer handle invokes ``MemoryResource.deallocate`` via ``_mr_dealloc_callback``. The handle's deallocation stream is set separately via ``set_deallocation_stream``; if it was never set (e.g. buffers minted via ``Buffer.from_handle(ptr, size, mr=mr)`` from DLPack import, IPC import, or third-party adapters), the callback would pass ``stream=None`` to ``mr.deallocate``. After the strict-stream changes for NVIDIA#2001, the stream-ordered MR overrides reject ``stream=None`` via ``Stream_accept`` and raise ``TypeError``. The ``noexcept`` callback catches the exception, prints a warning to stderr, and returns -- silently **leaking** the underlying CUDA allocation (and any associated IPC handles). Fix: when ``h_stream`` is empty in ``_mr_dealloc_callback``, fall back to ``default_stream()`` instead of ``None``. The C++ teardown path is the unique legitimate "no-stream-context" caller (no Python frame from which to obtain a stream), so this is the one place where an implicit default-stream fallback is necessary; everywhere else the policy remains "stream is required and must be passed explicitly". Add ``test_mr_dealloc_callback_falls_back_to_default_stream`` covering the regression: a strict stream-ordered mock MR is used to back a ``Buffer.from_handle`` (no attached stream), and the test asserts that ``deallocate`` is invoked with the default stream rather than failing with ``TypeError`` and leaking. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # cuda_core/cuda/core/_memory/_buffer.pyx # cuda_core/cuda/core/_memory/_graph_memory_resource.pyx # cuda_core/cuda/core/_memory/_legacy.py # cuda_core/cuda/core/_memory/_memory_pool.pyx # cuda_core/cuda/core/_module.pyx # cuda_core/docs/source/release/1.0.0-notes.rst
| stream = Stream._from_handle(Stream, h_stream) | ||
| mr.deallocate(int(ptr), size, stream) | ||
| stream = Stream._from_handle(Stream, h_stream) if h_stream else default_stream() | ||
| mr.deallocate(int(ptr), size, stream=stream) |
There was a problem hiding this comment.
My apology. It seems my bot was hallucinating? I thought we guaranteed already that we always have a deallocation stream. There should be no way that we would need the default stream here. @Andy-Jost could you confirm?
Below is bot-generated:
For pool allocations via _MemPool.allocate(): the invariant holds — h_stream is always non-zero. In deviceptr_alloc_from_pool() (resource_handles.cpp:719), the stream is stored in the DevicePtrBox at creation and captured by the shared_ptr deleter. Same for Buffer.from_ipc_descriptor(). So for the normal async MR allocation path, the default_stream() fallback is unreachable.
The one path that breaks the invariant: Buffer.from_handle(ptr, size, mr=some_mr). This calls deviceptr_create_with_mr() (resource_handles.cpp:847) which creates the box with StreamHandle{} (empty). No stream parameter is accepted by from_handle(). So if someone constructs a buffer via from_handle(mr=pool_mr), the callback sees h_stream=0.
So the question is: is Buffer.from_handle(mr=pool_mr) with an async MR a legitimate use case? If not, the default_stream() fallback papers over something that should be an error. If yes, then from_handle() arguably should accept an optional stream when mr is provided.
For the test Andy added (test_mr_dealloc_callback_falls_back_to_default_stream), it constructs a buffer via Buffer.from_handle(1, 1024, mr=mr) — which is exactly the from_handle path, not the pool allocation path. This confirms the fallback only matters for from_handle, not for normal pool allocations.
Should this be flagged in the review? The options seem to be:
- a) Keep the fallback but document it's specifically for the from_handle(mr=...) edge case, not for pool allocations
- b) Make from_handle() accept an optional stream param when mr is provided, and treat missing-stream-on-async-MR as an error in the callback instead
| assert received["stream"].handle == stream.handle | ||
|
|
||
|
|
||
| def test_mr_dealloc_callback_falls_back_to_default_stream(): |
Closes #2001.
Summary
Removes the implicit fallback to
default_stream()(or NULL) on everycuda.coreAPI that schedules work on a stream.streamis now a required keyword-only argument;Stream_accept(None)raisesTypeError. Callers that want the previous behavior must passdevice.default_stream(or any explicitStream) at the call site.Changes
API surface
MemoryResource.allocate/MemoryResource.deallocateand overrides onDeviceMemoryResource,PinnedMemoryResource,ManagedMemoryResource,LegacyPinnedMemoryResource, andGraphMemoryResource.Device.allocate.GraphicsResource.map.KernelOccupancy.max_potential_cluster_size/max_active_clusters.Buffer.from_ipc_descriptor.deallocateis keyword-only on every memory-resource subclass. The synchronous resources (LegacyPinnedMemoryResource,_SynchronousMemoryResource,VirtualMemoryResource) keepstream=Noneas a default, since those resources do not actually use the stream.Stream_acceptis promoted tocpdefso the pure-Python legacy/sync resources can call the centralized validation helper.Exemptions (no change)
Graph.launch(stream)— same shape asGraph.upload(stream)and the kernellaunch(stream, config, kernel, *args)API; stream stays the first positional argument.VirtualMemoryResource.allocate/deallocate—cuMemCreate/cuMemMapare synchronous and not stream-ordered. Stream is keyword-only but optional and validated when provided.Buffer.close(stream=None),GraphicsResource.unmap(stream=None),GraphicsResource.close(stream=None)—Nonehere means "reuse the existing stream stored in the handle", not "fall back to default" — already on the exemption list in the issue.Test Coverage
tests/test_module.py: addedpytest.raises(TypeError, ...)forKernelOccupancy.max_potential_cluster_size/max_active_clusterswhenstreamis omitted.tests/test_graphics.py: addedtest_map_requires_explicit_stream(replacestest_map_with_default_stream).stream=explicitly.cuda_core/testssuite passes locally on Linux/A10/CUDA 13.1.Related Work