Add green context support#1976
Conversation
Restructure tests into fixtures + classes with full resource cleanup:
- Fixtures: sm_resource, wq_resource, green_ctx (with CUDAError skip), green_ctx_active (with try/finally restore), fill_kernel
- _use_green_ctx context manager for safe push/pop in all tests
- TestSMResourceQuery: properties, arch constraints per CC
- TestSMResourceSplit: single/two-group splits, discovery, alignment, dry-run vs real parity
- TestGreenContextKernelLaunch: compile + launch + verify in green ctx, two independent green contexts, SM + workqueue combined

All set_current calls are paired with restore in finally blocks to prevent context stack leaks on test failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ok to test ac5c0fc
17c2be2 to 08d52d1
- Convert ContextOptions and SMResourceOptions/WorkqueueResourceOptions to cdef dataclasses for check_or_create_options compatibility.
- Cache SM metadata in typed cdef fields; fall back to arch-based granularity on CUDA 12.x where CUdevSmResource lacks minSmPartitionSize/smCoscheduledAlignment.
- Simplify Context to hold only ContextHandle (remove _h_green_ctx and _is_green fields). Green ctx association lives in ContextBox; is_green queries get_context_green_ctx() on demand.
- ContextOptions.resources accepts Sequence only (no bare resource).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
08d52d1 to 3013fe8
Switch from the push model (dev.set_current + dev.create_stream) to the explicit model (ctx.create_stream + ctx.resources) as the primary way to use green contexts.

Context.create_stream(options):
- Only supported on green contexts (raises on primary contexts).
- Delegates to Stream._init, which calls create_stream_handle in C++.
- C++ create_stream_handle auto-dispatches: checks get_context_green_ctx and calls cuGreenCtxStreamCreate for green contexts, or cuStreamCreateWithPriority for primary. Single function, no duplication.

Context.resources:
- Returns a DeviceResources namespace querying this context's resources (cuCtxGetDevResource / cuGreenCtxGetDevResource), not the full device.

dev.set_current(green_ctx) still works but is not the recommended path.

Tests rewritten to use the explicit model throughout. Push-model set_current kept as regression tests with _use_green_ctx helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
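The single-function auto-dispatch described in this commit can be modeled with a toy Python sketch (stand-in callables only; the real dispatch lives in the C++ handle layer):

```python
def make_stream_creator(green_ctx_of, green_create, primary_create):
    """Toy model of the create_stream_handle auto-dispatch: one entry
    point that branches on context type, so callers never duplicate
    the green/primary distinction. Not the real cuda.core API."""
    def create_stream_handle(ctx):
        if green_ctx_of(ctx) is not None:
            return green_create(ctx)   # cuGreenCtxStreamCreate path
        return primary_create(ctx)     # cuStreamCreateWithPriority path
    return create_stream_handle

# Stand-in "driver": contexts are plain dicts; green ones carry a key.
create = make_stream_creator(
    green_ctx_of=lambda ctx: ctx.get("green"),
    green_create=lambda ctx: "green-stream",
    primary_create=lambda ctx: "primary-stream",
)
```

With this shape, adding a new context flavor means touching the dispatcher once rather than every call site.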
62e4883 to 3287204
- Let the driver validate the nonblocking flag for green context streams: cuGreenCtxStreamCreate rejects CU_STREAM_DEFAULT. On failure, check if the context is green + nonblocking is False and raise a clear ValueError.
- cuCtxGetStreamPriorityRange failure (CUDA_ERROR_INVALID_CONTEXT) now raises: "No current CUDA context. Call dev.set_current() before creating streams."
- C++ create_stream_handle returns CUDA_ERROR_NOT_SUPPORTED if the context is green but cuGreenCtxStreamCreate is unavailable (CUDA < 12.5), instead of falling through to cuStreamCreateWithPriority.
- ctx.resources.workqueue now dispatches to cuGreenCtxGetDevResource for green contexts, matching the SM query path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
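The error-translation pattern described above (let the driver validate, then raise a clearer ValueError when the context is green and nonblocking is False) can be sketched as a toy, with a stand-in exception type rather than the real bindings:

```python
class CUDAError(Exception):
    """Stand-in for the driver error type (assumption, not the real bindings)."""

def create_green_stream(driver_create, ctx_is_green, nonblocking):
    # Let the driver validate first; only translate the error when the
    # likely cause is diagnosable (green ctx + blocking stream).
    try:
        return driver_create(nonblocking)
    except CUDAError as exc:
        if ctx_is_green and not nonblocking:
            raise ValueError(
                "green context streams must be created with nonblocking=True"
            ) from exc
        raise
```

Chaining with `from exc` keeps the original driver error visible in the traceback.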
3287204 to 2812c5b
/ok to test 2812c5b
Stream.resources delegates to DeviceResources._init_from_ctx via the stream's tracked context handle, returning the same resource view as ctx.resources for the stream's parent context. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
d5a7297 to 5b3c610
/ok to test 5b3c610
eebd4cf to 5989fd1
- dev.create_context raises ValueError (not NotImplementedError) when options or resources are missing.
- Cache version checks (_check_green_ctx_support, _check_workqueue_support) at module level; raise ValueError instead of NotImplementedError.
- Simplify _device_resources.pyx: merge _as_uint and _count_to_sm_count into _to_sm_count; inline unsigned int casts for coscheduled params.
- Add green context classes to api.rst (Context, ContextOptions, DeviceResources, SMResource, SMResourceOptions, WorkqueueResource, WorkqueueResourceOptions).
- Update all docstrings to NumPy style with Attributes/Parameters/Returns sections matching the existing codebase convention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5989fd1 to fa254a5
/ok to test fa254a5
Andy-Jost
left a comment
I see a few issues with the registries. It appears the context registry needs to be checked in one place.
I have a bigger concern with the stream registry. I don't see what problem it solves, and it appears to be corruptible through the user API.
```cpp
context_registry.unregister_handle(b->resource);
GILReleaseGuard gil;
```
nit: put the GIL release first for consistency in this file and to make it easier to spot.
For my own understanding: this seems to be consistent with the rest of the file? In the shared pointer deleter we:
- unregister a handle from the registry
- get GIL guard
- call C API
Are you saying we should swap 1 and 2 for the whole file, including code that is not touched by this PR?
```cpp
stream_registry.unregister_handle(b->resource);
GILReleaseGuard gil;
```

```cpp
stream_registry.unregister_handle(b->resource);
GILAcquireGuard gil;
```
```cpp
StreamHandle create_stream_handle_with_owner(CUstream stream, PyObject* owner) {
  if (auto h = stream_registry.lookup(stream)) {
    // Reuse handles that already carry structural context metadata, e.g.
    // cuda-core-owned streams. Owner-backed foreign streams still need a
    // fresh handle so the supplied owner is retained.
    if (get_box(h)->h_context) {
      return h;
    }
  }
```
This stream registry only allows one entry per unique CUstream, but there can be multiple PyObject owners, which will compete for the one and only cache slot. It can easily lead to registry corruption.
At a minimum, I don't think the stream + Python owner path should use a registry at all, since it is really just tying a Py_INCREF/Py_DECREF pair to a CUstream lifetime, which can be stacked any number of times.
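The one-slot hazard described above can be made concrete with a pure-Python toy (an illustration of the failure mode, not the real C++ registry):

```python
class OneSlotRegistry:
    """Toy model of a registry keyed on the raw CUstream value: exactly
    one slot per stream, so a second owner-backed registration silently
    evicts the first owner's entry."""

    def __init__(self):
        self._slots = {}  # raw stream value -> handle

    def register(self, raw_stream, handle):
        self._slots[raw_stream] = handle  # last writer wins

    def lookup(self, raw_stream):
        return self._slots.get(raw_stream)

registry = OneSlotRegistry()
registry.register(0xBEEF, ("stream_handle", "owner_a"))
registry.register(0xBEEF, ("stream_handle", "owner_b"))  # evicts owner_a
# owner_a's entry is no longer reachable through the registry; any
# lifetime bookkeeping attached to it (e.g. a paired incref/decref)
# is now answered by owner_b's entry instead.
```

By contrast, a plain incref/decref pair tied to each wrapped stream stacks naturally, with no shared slot to fight over.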
More broadly, I'm having trouble understanding why a stream registry is needed at all. The PR description says:
`Stream._from_handle` and `Stream_ensure_ctx` prefer the registry-backed handle before falling back to `cuStreamGetCtx`. This fixes a latent issue where streams created in a green context would lose their context association after a `set_current` swap.
I don't see how this comes into play. A registry allows a raw handle such as CUstream to round-trip through the driver without losing metadata, but set_current is about CUcontext, not CUstream. Where is the path that needs to match up a raw CUstream with an existing StreamHandle?
@Andy-Jost correct me if I am wrong, I would think every type we own should have its own registry, why would we differentiate CUcontext from others? It is not only confusing but also unsafe in general, considering that we do want to handle the general cases for each object via from_handle() (#1989).
For example, our Buffer internally tracks the allocation Stream, so having a buffer registry would help express the intent (though for buffer we may have other considerations such as performance).
Where is the path that needs to match up a raw CUstream with an existing StreamHandle?
One way to trigger this is:
```python
# cuda.core creates a green stream (cuGreenCtxStreamCreate under the hood)
ctx = dev.create_context(...)
s = ctx.create_stream()

# User extracts the raw handle and re-wraps it (round-trip through int)
raw = s.handle
s2 = Stream.from_handle(raw)

# Without registry: s2 is a fresh handle with no context dep
# With registry: s2 reuses the existing StreamHandle, preserving
# the ContextHandle -> GreenCtxHandle dependency chain
s2.context.is_green  # True with registry, might fail without
```

so in this case a (stream) registry lookup is indeed useful.
It can easily lead to registry corruption.
I think we do want to address the corruption issue. Would it help if we keep the registry but exclude the owner path (i.e. foreign streams don't register)?
```python
"""True if this context was created from device resources."""
if not self._h_context:
    return False
return get_context_green_ctx(self._h_context).get() != NULL
```
nit: consider putting this in the handle API as is_green(self._h_context).
…td::vector

Review comment 1: Consolidate create_context_handle_from_green_ctx with create_context_handle_ref by adding a private overload that takes an optional GreenCtxHandle. The green ctx path now delegates to it after calling cuCtxFromGreenCtx, ensuring registry lookup and deduplication.

Review comments 2-4: Move GILReleaseGuard to the first line in create_green_ctx_handle and create_context_handle_from_green_ctx for consistency with the rest of the file.

Review comment 6: Keep is_green check inline in _context.pyx using get_context_green_ctx (cannot add a C++ is_green function across separate .so boundaries without linker issues).

Review comment 8: Replace malloc/free with std::vector<CUdevResource> in Device.create_context for automatic cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/ok to test 41cf1de
Close #1563. Close #112.
Summary
Add green context support to cuda.core — the explicit-model API for querying device resources, splitting SMs, creating green contexts, and using them without touching the thread-local context stack.
Design
See the companion design doc for full rationale. Key decisions:
- `Context` type — no user-visible `GreenContext` subclass. A single `Context` wraps either a primary `CUcontext` or a `CUgreenCtx` + derived `CUcontext`. `ctx.is_green` distinguishes them. Inspired by the CUDA runtime's execution-context (EC) abstraction.
- `dev.resources` namespace — `DeviceResources` groups hardware resource queries (`dev.resources.sm`, `dev.resources.workqueue`). Follows the existing "plural = namespace" pattern (`dev.properties`, `kernel.attributes`).
- `ctx.resources` / `stream.resources` — same `DeviceResources` type, but queries the context's provisioned resources (`cuCtxGetDevResource` / `cuGreenCtxGetDevResource`) instead of the full device.
- `SMResourceOptions` with SoA broadcasting — single dataclass for `SMResource.split()`. Scalar fields broadcast; `count` drives the group count. `count=None` means discovery mode (translated to `smCount=0` internally).
- `WorkqueueResource` merges `CU_DEV_RESOURCE_TYPE_WORKQUEUE_CONFIG` and `CU_DEV_RESOURCE_TYPE_WORKQUEUE` under one user-facing class. Strings for option values (e.g. `sharing_scope="green_ctx_balanced"`).
- `ContextOptions(resources=[...])` → `dev.create_context()` — resource descriptor generation and `cuGreenCtxCreate` are internal. The user passes pre-split resource objects.
- `ctx.create_stream()` creates streams bound to a green context without calling `dev.set_current()`. The C++ handle layer auto-dispatches between `cuGreenCtxStreamCreate` and `cuStreamCreateWithPriority` based on the context type. Green context streams must be non-blocking.
- `ctx.close()` does not manage the context stack — closing a current context raises `RuntimeError`. `dev.set_current(green_ctx)` still works for backward compatibility but is not the recommended path.

New public API
- `Device.resources` → `DeviceResources` (namespace: `.sm`, `.workqueue`)
- `Context.resources` → `DeviceResources` (context-level query of provisioned resources)
- `Stream.resources` → `DeviceResources` (delegates to the stream's parent context)
- `Context.create_stream(options)` → `Stream` (green contexts only; raises on primary)
- `Context.is_green` → `bool`
- `SMResource` — properties: `sm_count`, `min_partition_size`, `coscheduled_alignment`, `flags`, `handle`; method: `split(options, *, dry_run=False)`
- `SMResourceOptions` — `count`, `coscheduled_sm_count`, `preferred_coscheduled_sm_count`
- `WorkqueueResource` — method: `configure(options)`
- `WorkqueueResourceOptions` — `sharing_scope`
- `ContextOptions.resources` — accepts `Sequence[SMResource | WorkqueueResource]`

Implementation details
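As a toy illustration of the `SMResourceOptions` broadcasting rule (hypothetical helper and dict layout; the real implementation is in Cython), the scalar-broadcast and discovery-mode semantics might look like:

```python
from collections.abc import Sequence

def broadcast_split_options(count, coscheduled_sm_count=None):
    """Toy sketch: `count` drives the group count; scalar fields
    broadcast to every group; mismatched Sequence lengths are an
    error; count=None is discovery mode (smCount=0)."""
    if count is None:
        return [{"smCount": 0}]
    counts = list(count) if isinstance(count, Sequence) else [count]
    n = len(counts)
    if isinstance(coscheduled_sm_count, Sequence):
        cosched = list(coscheduled_sm_count)
        if len(cosched) != n:
            raise ValueError("count and coscheduled_sm_count length mismatch")
    else:
        cosched = [coscheduled_sm_count] * n  # scalar broadcasts
    return [
        {"smCount": c, "coscheduledSmCount": k}
        for c, k in zip(counts, cosched)
    ]
```

The structure-of-arrays shape keeps one options object per split call instead of one object per group.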
C++ handle layer (`resource_handles.hpp/cpp`):

- `GreenCtxHandle` (`shared_ptr<const CUgreenCtx>`) — owning handle; destructor calls `cuGreenCtxDestroy`.
- `ContextBox` gains a `GreenCtxHandle` field so the derived `CUcontext` keeps the green ctx alive. `get_context_green_ctx()` provides reverse lookup.
- `create_green_ctx_handle()` combines `cuDevResourceGenerateDesc` + `cuGreenCtxCreate` in one call — the descriptor is transient (no `DevResourceDescHandle` needed since CUDA has no explicit destroy for it).
- `create_stream_handle()` auto-dispatches: checks `get_context_green_ctx()` on the provided `ContextHandle` and calls `cuGreenCtxStreamCreate` for green contexts, `cuStreamCreateWithPriority` for primary. Returns `CUDA_ERROR_NOT_SUPPORTED` if the context is green but `cuGreenCtxStreamCreate` is unavailable (CUDA < 12.5).
- `context_registry` / `stream_registry` (`HandleRegistry`) deduplicate handles by raw CUDA pointer, enabling identity-preserving `set_current` swaps.

Bug fix — stream context tracking:
`StreamBox` now carries a `ContextHandle` dependency, populated at creation time. `get_stream_context()` returns it without a driver call. `Stream._from_handle` and `Stream_ensure_ctx` prefer the registry-backed handle before falling back to `cuStreamGetCtx`. This fixes a latent issue where streams created in a green context would lose their context association after a `set_current` swap.

Error handling:
- `dev.create_context()` without resources raises `ValueError` with a clear message.
- `nonblocking=False` is caught by the driver (`CUDA_ERROR_INVALID_VALUE`) and re-raised as `ValueError` with a helpful message.
- `cuCtxGetStreamPriorityRange` failure (`CUDA_ERROR_INVALID_CONTEXT`) raises "Call dev.set_current() before creating streams."

Version guards:
- `IF CUDA_CORE_BUILD_MAJOR >= 13` gates `cuDevSmResourceSplit` (the general/structured form).
- `cy_driver_version() >= (12, 4, 0)` for all green ctx APIs; `>= (13, 1, 0)` for structured splits. Raises `ValueError` when unsupported.
- `cuDevSmResourceSplitByCount` for basic (homogeneous) splits. Per-group `coscheduled_sm_count` and heterogeneous counts require 13.1+ and raise `NotImplementedError` on 12.x.
- `_get_optional_driver_fn` — graceful `NULL` when bindings lack the symbol.

Test coverage
33 tests in `test_green_context.py`, organized with proper pytest fixtures and classes:

- Fixtures: `sm_resource`, `wq_resource`, `green_ctx` (with `CUDAError` → skip), `fill_kernel`
- `_use_green_ctx` context manager for safe push/pop in set_current regression tests
- `TestSMResourceQuery` — properties, arch constraints (pre-Hopper vs Hopper+)
- `TestWorkqueueResource` — query, configure valid/invalid
- `TestSMResourceSplitValidation` — scalar/Sequence mismatch, negative count, dry-run blocked
- `TestSMResourceSplit` — single/two-group splits with arch-aligned counts, discovery mode, alignment, dry-run parity
- `TestGreenContextLifecycle` — `is_green`, `create_stream` on primary raises, blocking stream raises, explicit stream creation, stream/event context tracking, close-while-current guard, set_current regression
- `TestContextResources` — green ctx SM resources are subset of device, two contexts have disjoint partitions, `stream.resources` matches `ctx.resources` (SM + workqueue)
- `TestGreenContextKernelLaunch` — compile + launch + host-verify via `ctx.create_stream()`, two independent green contexts with different fill values, SM + workqueue combined

Validation
-- Leo's bot