cuda.core: convert peer_accessible_by to a live MutableSet view #2018

Open
Andy-Jost wants to merge 11 commits into NVIDIA:main from Andy-Jost:ajost/peer-accessible-by-set-proxy
Conversation

@Andy-Jost
Contributor

@Andy-Jost Andy-Jost commented May 4, 2026

Summary

DeviceMemoryResource.peer_accessible_by previously returned a sorted tuple[int, ...] backed by a Python-level cache. This is fragile in two ways: the cache can diverge from driver state across multiple wrappers around the same memory pool (#1720), and tuple semantics force callers into bespoke splice/replace patterns instead of standard set operations. This PR replaces the property with a live driver-backed collections.abc.MutableSet view, matching the proxy patterns established by AdjacencySetProxy (graphs) and the in-flight AccessedBySet (managed-memory advice, #1775).

This is a breaking change, captured in the v1.0.0 release notes.

Changes

New: PeerAccessibleBySetProxy

A MutableSet subclass living in cuda_core/cuda/core/_memory/_peer_access_utils.pyx.

  • Reads (__contains__, __iter__, __len__) call cuMemPoolGetAccess.
  • Writes (add, discard, and bulk ops) call cuMemPoolSetAccess.
  • Iteration yields Device objects.
  • add, discard, __contains__ accept Device | int.
  • The owner device is silently filtered out of every write (matching the existing planner contract).
  • All bulk operations (update, |=, &=, -=, ^=, clear) issue exactly one cuMemPoolSetAccess call. This matters: peer-access transitions can take seconds per pool because every existing memory mapping is updated, so coalescing into a single driver call lets the toolkit handle the mappings in parallel.

The proxy is constructed fresh on every property access. There is nothing to cache or pickle.
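
The live-view idea can be sketched in plain Python without a GPU. The following is a minimal illustration, not the PR's Cython implementation: `FakePool` is a hypothetical stand-in for driver-side pool state, its `get_access`/`set_access` methods stand in for cuMemPoolGetAccess/cuMemPoolSetAccess, and elements are plain ints rather than Device objects.

```python
from collections.abc import MutableSet


class FakePool:
    """Hypothetical stand-in for driver-side pool state (not a real API)."""

    def __init__(self, owner_id, device_count):
        self.owner_id = owner_id
        self.device_count = device_count
        self._peers = set()
        self.set_calls = 0  # number of simulated batched driver calls

    def get_access(self, device_id):
        # Stands in for cuMemPoolGetAccess.
        return device_id in self._peers

    def set_access(self, to_add, to_remove):
        # Stands in for one batched cuMemPoolSetAccess call.
        self.set_calls += 1
        self._peers |= set(to_add)
        self._peers -= set(to_remove)


class PeerSetView(MutableSet):
    """Live view: holds no state of its own; every read queries the pool."""

    def __init__(self, pool):
        self._pool = pool

    def _current(self):
        # Ascending by device id because range() ascends.
        return [i for i in range(self._pool.device_count) if self._pool.get_access(i)]

    def __contains__(self, device_id):
        return isinstance(device_id, int) and self._pool.get_access(device_id)

    def __iter__(self):
        return iter(self._current())

    def __len__(self):
        return len(self._current())

    def _apply(self, desired):
        current = set(self._current())
        desired = set(desired) - {self._pool.owner_id}  # owner is silently filtered
        to_add, to_remove = desired - current, current - desired
        if to_add or to_remove:  # empty diff: skip the driver entirely
            self._pool.set_access(to_add, to_remove)

    def add(self, device_id):
        self._apply(set(self._current()) | {device_id})

    def discard(self, device_id):
        self._apply(set(self._current()) - {device_id})

    def __ior__(self, other):
        # Override the mixin, which would otherwise call add() once per element.
        self._apply(set(self._current()) | set(other))
        return self


pool = FakePool(owner_id=0, device_count=4)
view_a, view_b = PeerSetView(pool), PeerSetView(pool)
view_a |= {0, 1, 2, 3}            # owner 0 is filtered out; one batched call
seen_by_b = list(view_b)          # the second view reflects the change immediately
calls_after_bulk = pool.set_calls
view_a |= {1, 2}                  # empty diff, so no further call
calls_after_noop = pool.set_calls
```

Because neither view stores anything, two wrappers around the same pool cannot disagree, which is the divergence the cache removal is meant to rule out.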

Cache removal

The _peer_accessible_by field on DeviceMemoryResource is dropped, along with its initializations in __cinit__, _DMR_init, and from_allocation_handle. The owned/non-owned read split is gone — every read path now queries the driver directly. This eliminates the bug class fixed in #1720 and simplifies the implementation.

Module consolidation

All peer-access internals now live in _peer_access_utils.pyx (promoted from a .py). The cdef inline driver helpers (_query_peer_access_ids, _peer_access_includes, _set_pool_access) and the cpdef replace_peer_accessible_by setter helper moved out of _device_memory_resource.pyx. DeviceMemoryResource now exposes only the public peer_accessible_by property: the getter returns PeerAccessibleBySetProxy(self) and the setter delegates to replace_peer_accessible_by(self, devices). This keeps the DMR surface focused on its memory-management role.

Property setter

mr.peer_accessible_by = [...] is preserved and unchanged in behavior. It still does a single batched cuMemPoolSetAccess via the same shared plan_peer_access_update path the proxy uses for bulk ops.
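
The single batched call follows from computing a diff before touching the driver. A minimal sketch of that idea, under stated assumptions: `plan_update` is a hypothetical function written for illustration and is not the signature of plan_peer_access_update, whose details this description does not show.

```python
def plan_update(owner_id, current_ids, requested_ids):
    """Return (to_add, to_remove) as sorted tuples; the owner is never a peer."""
    current = set(current_ids)
    requested = set(requested_ids) - {owner_id}
    to_add = tuple(sorted(requested - current))
    to_remove = tuple(sorted(current - requested))
    return to_add, to_remove


# Replace peers {1, 2} with {2, 3}; device 0 owns the pool and is dropped
# from the request.
to_add, to_remove = plan_update(0, [1, 2], [0, 2, 3])

# Only a non-empty diff would need a driver call.
needs_driver_call = bool(to_add or to_remove)
```

Both the additions and the removals can then be packed into one descriptor array, so a replace costs one call regardless of how many devices change.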

_query_peer_access_ids GIL optimization

The driver loop that enumerates peer device IDs now runs inside a single nogil block instead of acquiring/releasing the GIL once per device. Per-call data is collected into a libcpp.vector[int], and because range(total) ascends the result is already sorted (so the trailing sorted() is gone). This is a local performance tweak with no behavior change.

Test coverage

The existing integration tests (test_memory_peer_access.py, memory_ipc/test_peer_access.py) were migrated from tuple equality to set-of-Device equality. Eight new tests pin the proxy's contract end-to-end, all using the existing mempool_device_x2 fixture so they run on CI:

  • MutableSet conformance. A new assert_single_member_mutable_set_interface(subject, member, non_member) helper exercises every MutableSet method against subjects whose backing store admits at most one insertable element (the peer-access proxy can hold at most {dev1} on a 2-GPU box). The original assert_mutable_set_interface(subject, items) is unchanged for normal multi-element subjects (AdjacencySetProxy).
  • Device/int interchangeability on add/discard/__contains__.
  • Owner-filtering contract on every write path (silent no-op).
  • Error paths: add(out_of_range) and add(non_coercible) raise; discard/__contains__ swallow the same inputs; remove(non_member) raises KeyError.
  • Live driver view: a proxy obtained before another wrapper modifies the pool reflects the change with no refresh step.
  • Iteration order is ascending by device_id; elements are Device instances; __repr__ shape; getter return type.
  • Single-call batching via a monkeypatch spy on _set_pool_access (the thin Python-visible helper extracted from _apply_peer_access_diff purely so tests can spy on the actual driver call). Every bulk op (|=/&=/-=/^=/update/difference_update/clear) and the property setter is asserted to issue exactly one cuMemPoolSetAccess, zero on no-op. This includes the dmr.peer_accessible_by |= {...} augmented-assignment-on-property pattern, where the empty-diff short-circuit prevents a second driver call from the trailing setter write.

Breaking change

DeviceMemoryResource.peer_accessible_by no longer returns a tuple[int, ...]. Callers must update:

  • Tuple comparisons (mr.peer_accessible_by == (0,)) -> set comparisons (mr.peer_accessible_by == {Device(0)}).
  • Iteration that expected ints -> iteration over Device objects (use [d.device_id for d in mr.peer_accessible_by] if int IDs are needed).

Setter usage (mr.peer_accessible_by = [...]) is unchanged.
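
A before/after sketch of caller code. To stay runnable without a GPU, `Device` here is a hypothetical frozen dataclass standing in for cuda.core's Device, and `peers` is a plain set standing in for the value of mr.peer_accessible_by.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Stand-in for cuda.core's Device; only device_id matters here."""

    device_id: int


peers = {Device(1), Device(0)}  # stands in for mr.peer_accessible_by

# Before: peers == (0, 1)
# After: compare against a set of Device objects.
matches = peers == {Device(0), Device(1)}

# Before: iteration yielded ints directly.
# After: pull the ids out of the Device objects.
peer_ids = sorted(d.device_id for d in peers)
```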

Test plan

  • Existing test_memory_peer_access.py tests pass after migration to set semantics.
  • Existing test_memory_peer_access_utils.py tests still pass (plan_peer_access_update and normalize_peer_access_targets are unchanged).
  • Existing memory_ipc/test_peer_access.py tests pass (IPC-enabled pools, parent-child IPC scenarios).
  • Full cuda_core test suite passes on a single-GPU box (2931 passed, 212 skipped — peer-access tests gated on mempool_device_x2/x3).
  • All new peer-access tests pass on a 2-GPU box.
  • pre-commit run --all-files passes (ruff, format, SPDX, cython-lint, RST checks).

Related work

DeviceMemoryResource.peer_accessible_by previously returned a sorted
tuple[int, ...] backed by a Python-level cache, which was prone to
divergence from driver state across multiple wrappers around the same
memory pool. The setter accepted Device | int and emitted a single
batched cuMemPoolSetAccess covering the diff against the cache.

This commit replaces the property with a live driver-backed view:

- Adds PeerAccessibleBySetProxy in _memory/_peer_access_utils.py, a
  collections.abc.MutableSet whose reads call cuMemPoolGetAccess and
  whose writes call cuMemPoolSetAccess. Iteration yields Device
  objects; add, discard, and __contains__ accept either a Device or a
  device-ordinal int. The proxy is constructed fresh on every property
  access, so there is nothing to cache or pickle.

- Drops the _peer_accessible_by cache field (and its initializations
  in __cinit__, _DMR_init, and from_allocation_handle), eliminating
  the owned/non-owned read split. All pools now share the same code
  path and always query the driver.

- All bulk operations on the proxy (update, |=, &=, -=, ^=, clear,
  pop) issue exactly one cuMemPoolSetAccess call. Peer-access
  transitions can take seconds per pool because every existing memory
  mapping is updated, so coalescing into a single driver call lets the
  toolkit handle the mappings in parallel. The property setter
  (mr.peer_accessible_by = [...]) preserves its original single-call
  behavior via the same shared planner path.

- Single-element add validates can_access_peer through
  plan_peer_access_update, matching the existing setter contract.

This is a breaking change captured in the v1.0.0 release notes.
Callers comparing against tuples must update to set comparisons
(mr.peer_accessible_by == {Device(0)}). Existing tests are migrated;
new tests for set-interface conformance are intentionally deferred to
a follow-up.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost Andy-Jost added this to the cuda.core v1.0.0 milestone May 4, 2026
@Andy-Jost Andy-Jost added the P1 (Medium priority - Should do), feature (New feature or request), cuda.core (Everything related to the cuda.core module), and breaking (Breaking changes are introduced) labels May 4, 2026
@Andy-Jost Andy-Jost self-assigned this May 4, 2026
@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 4, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@leofang
Member

leofang commented May 4, 2026

FYI: A breaking change must be labeled with P0.

…s.pyx

The previous commit left DeviceMemoryResource carrying three pass-through
def methods (_query_peer_access_ids, _peer_access_includes,
_apply_peer_access_diff) whose only purpose was to give the pure-Python
proxy in _peer_access_utils.py a way to call cdef helpers in
_device_memory_resource.pyx. These methods served no public role and
cluttered the class API.

Promote _peer_access_utils.py to a Cython module so the proxy and the
driver-touching helpers can live together:

- Convert _peer_access_utils.py to _peer_access_utils.pyx. cimports
  cydriver and DeviceMemoryResource from the .pxd; uses nogil and direct
  CUmemAccessDesc packing identically to before.

- Move _DMR_query_peer_access_ids, _DMR_peer_access_includes,
  _DMR_apply_peer_access_diff, and _DMR_replace_peer_accessible_by from
  _device_memory_resource.pyx into the new module as cdef helpers (and a
  cpdef replace_peer_accessible_by entry point used by the property
  setter).

- Drop the three pass-through def methods from DeviceMemoryResource. The
  class is left with the property getter and setter only; everything else
  is module-level in _peer_access_utils.

- The proxy now calls the module-level cdef helpers directly instead of
  routing through methods on mr.

No behavior change. The public surface (PeerAccessibleBySetProxy,
plan_peer_access_update, normalize_peer_access_targets, PeerAccessPlan)
is preserved at the same import paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost Andy-Jost added the P0 (High priority - Must do!) label and removed the P1 (Medium priority - Should do) label May 4, 2026
Andy-Jost and others added 5 commits May 4, 2026 16:30
Refactor _query_peer_access_ids so the entire driver loop runs inside a
single nogil block instead of acquiring/releasing the GIL once per
device. The flag query now uses a cached as_cu(mr._h_pool) handle and
fills a libcpp.vector[int]; because range(total) ascends, the result is
already sorted and the trailing sorted() call is dropped.

Also tighten the peer_accessible_by entry in 1.0.0-notes.rst: the
breaking-change blurb only needs to state the type/element change, so
remove the implementation-flavored details about input acceptance and
batched cuMemPoolSetAccess calls.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ge cases

Existing peer-access tests covered the integration path well (real
copies across peers, the full transition matrix, shared-pool consistency)
but only touched ``in``, ``==``, and the property setter on the new set
proxy. After the v1.0.0 break that surfaced ~25 ``MutableSet`` methods,
nothing was pinning the type-coercion contract, the owner-filtering
behavior, the ``KeyError``/value-error paths, or the "one
``cuMemPoolSetAccess`` per bulk op" performance invariant.

Add the following coverage in ``test_memory_peer_access.py``:

- A ``MutableSet`` conformance test using a relaxed
  ``assert_mutable_set_interface`` mode that admits subjects holding at
  most one insertable element. CI maxes at two GPUs (one peer), so the
  multi-element protocol pass cannot run there. The new
  ``support_multi_insert=False`` path takes one insertable item plus two
  non-member sentinels and exercises every ``MutableSet`` method
  (``add``/``discard``/``remove``/``pop``/``clear``/``update``,
  comparisons, isdisjoint, subset/superset, binary and in-place
  operators, ``__iter__``/``__len__``/``__repr__``).
- ``Device``/``int`` interchangeability on ``add``/``discard``/``__contains__``.
- The owner-device filtering contract on every write (silent no-op).
- Error paths: ``add(out_of_range)`` and ``add(non_coercible)`` raise
  while the lenient ``discard``/``__contains__`` paths swallow the same
  inputs; ``remove(non_member)`` raises ``KeyError``.
- "Live driver view" semantics: a proxy obtained before another wrapper
  modifies the pool reflects the change with no refresh step.
- ``__iter__`` ordering is ascending by ``device_id`` and elements are
  ``Device`` instances; ``__repr__`` includes the class name and tracks
  live contents; the getter returns the documented proxy type.
- A batching spy that monkeypatches the module-level
  ``_apply_peer_access_diff`` and asserts that every bulk op
  (``|=``/``&=``/``-=``/``^=``/``update``/``difference_update``/
  ``clear``) and the property setter issues at most one driver call,
  zero when the diff is empty.

To make the spy possible, ``_apply_peer_access_diff`` is now a
Python-visible ``def`` wrapper around a renamed
``_apply_peer_access_diff_cython`` ``cdef inline``. The proxy and the
property setter still call ``_apply_peer_access_diff`` by bare name,
which Cython resolves through the module's globals at runtime, so a
``monkeypatch.setattr(_peer_access_utils, "_apply_peer_access_diff", ...)``
intercepts them. The extra Python-level dispatch is negligible next to
``cuMemPoolSetAccess`` itself.
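
The interception mechanism can be illustrated in pure Python, where a bare-name call is looked up in the module globals each time it runs. This is a sketch only: the module `fake_peer_access_utils` and the functions `_apply_diff` and `bulk_update` are hypothetical, and the claim that Cython resolves the call the same way comes from the commit message above rather than from this example.

```python
import types

# Build a throwaway module so the example is self-contained.
utils = types.ModuleType("fake_peer_access_utils")
exec(
    '''
def _apply_diff(to_add, to_remove):
    return "driver"

def bulk_update(to_add, to_remove):
    # Bare-name call: resolved through the module globals at call time.
    return _apply_diff(to_add, to_remove)
''',
    utils.__dict__,
)

calls = []


def spy(to_add, to_remove):
    """Record the arguments instead of doing any real work."""
    calls.append((to_add, to_remove))
    return "spy"


before = utils.bulk_update((1,), ())
setattr(utils, "_apply_diff", spy)  # what monkeypatch.setattr does, minus the undo
after = utils.bulk_update((1,), ())
```

In a real test, pytest's `monkeypatch.setattr` would restore the original attribute when the test finishes.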

Co-authored-by: Cursor <cursoragent@cursor.com>
Augmented assignment on the ``peer_accessible_by`` property
(``dmr.peer_accessible_by |= {...}``) is two trips through the
proxy/setter pair, not one: Python fetches the proxy, the proxy mutates
itself in place via ``__ior__``, and Python then assigns the
(already-mutated) proxy back through the setter. That trailing setter
call computes the diff against current driver state, finds it empty,
and short-circuits inside the ``cdef inline`` before issuing any
``cuMemPoolSetAccess`` work — so the *driver-level* contract ("one
batched call per bulk op") still holds, but the wrapper is invoked
twice, which the spy was counting.

Also, the fixture's ``dmr.peer_accessible_by = []`` reset on an already
empty pool is itself an empty-delta wrapper call.

Filter the recorded calls down to those with non-empty deltas (the ones
that translate to real driver work) and switch the bulk-ops test to use
a locally bound proxy so augmented assignment goes through ``__ior__``
once with no extra setter invocation. The setter test stays on
``dmr.peer_accessible_by = ...`` because that is the public API
contract under test there.
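
The getter, `__ior__`, setter sequence is ordinary Python behavior and can be observed with a small recording class. `Proxy`, `Resource`, and `peers` are hypothetical names used only for this sketch.

```python
events = []


class Proxy:
    def __ior__(self, other):
        events.append("__ior__")
        return self  # in-place operators return the mutated object


class Resource:
    def __init__(self):
        self._proxy = Proxy()

    @property
    def peers(self):
        events.append("getter")
        return self._proxy

    @peers.setter
    def peers(self, value):
        events.append("setter")


r = Resource()

# Augmented assignment on the property: fetch, mutate, then write back.
r.peers |= {1}
property_events = list(events)

# Augmented assignment on a local name: no setter is involved.
events.clear()
local = r.peers
local |= {1}
local_events = list(events)
```

The trailing setter call is why a wrapper-level spy counts two invocations for one `|=` on the property.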

Co-authored-by: Cursor <cursoragent@cursor.com>
…hing

Move the actual ``cuMemPoolSetAccess`` invocation (descriptor-array
build + driver call) into a thin Python-visible ``def _set_pool_access``
in ``_peer_access_utils.pyx``. ``_apply_peer_access_diff`` now does only
the empty-diff short-circuit and delegates the work to
``_set_pool_access``, which Cython resolves through the module globals
at runtime so tests can intercept it via ``monkeypatch.setattr``.

Replace the previous internal-wrapper spy with a driver-call spy that
counts every real ``cuMemPoolSetAccess`` invocation. Earlier no-op
layers (e.g. the augmented-assignment-on-property pattern that writes
an already-mutated proxy back through the setter) short-circuit before
reaching ``_set_pool_access``, so the recorded count is exactly the
number of driver calls. The empty-delta filter and the local-binding
workaround in the bulk-ops test are gone; we now also assert that
``dmr.peer_accessible_by |= {...}`` directly on the property is still
exactly one driver call.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the ``support_multi_insert`` flag and ``non_members`` keyword
with two purpose-built helpers:

- ``assert_mutable_set_interface(subject, items)`` keeps the original
  signature and contract: at least five distinct insertable items,
  exercised against a reference set in the standard way. The graph
  ``AdjacencySetProxy`` test continues to use this unchanged.
- ``assert_single_member_mutable_set_interface(subject, member,
  non_member)`` is a focused pass for proxies whose backing store admits
  at most one insertable element at a time (here, the peer-access view
  on a 2-GPU box). It threads a single member and one non-member
  sentinel through every ``MutableSet`` method.

The two helpers share small private utilities (empty-state checks,
``__repr__`` shape) but keep their public surfaces small and linear.
A capacity-one proxy is a meaningfully different contract from a
general mutable set; naming that explicitly in the API reads better
than a flag and avoids forcing call sites to plumb sentinels through.
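
For a rough sense of what a capacity-one pass looks like, here is an abbreviated sketch. It is not the helper added in this PR: `check_single_member_set` and `SetBacked` are hypothetical, and only a subset of the ``MutableSet`` methods is covered.

```python
from collections.abc import MutableSet


class SetBacked(MutableSet):
    """Trivial MutableSet used as the subject under test."""

    def __init__(self, items=()):
        self._items = set(items)

    def __contains__(self, x):
        return x in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def add(self, x):
        self._items.add(x)

    def discard(self, x):
        self._items.discard(x)


def check_single_member_set(subject, member, non_member):
    """Abbreviated pass: the subject only ever holds set() or {member}."""
    subject.clear()
    assert len(subject) == 0 and member not in subject

    subject.add(member)
    subject.add(member)  # adding twice must be idempotent
    assert set(subject) == {member} and len(subject) == 1

    # non_member only ever appears on the right-hand side.
    assert subject.isdisjoint({non_member})
    assert subject <= {member, non_member}
    assert subject | {non_member} == {member, non_member}
    assert subject & {non_member} == set()

    subject.discard(non_member)  # lenient path: no error
    try:
        subject.remove(non_member)
    except KeyError:
        pass
    else:
        raise AssertionError("remove(non_member) should raise KeyError")

    subject -= {member}
    assert len(subject) == 0
    return True


result = check_single_member_set(SetBacked(), member=1, non_member=2)
```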

Co-authored-by: Cursor <cursoragent@cursor.com>
@Andy-Jost
Contributor Author

/ok to test

@Andy-Jost Andy-Jost marked this pull request as ready for review May 5, 2026 00:33
@Andy-Jost Andy-Jost requested a review from rparolin May 5, 2026 00:34
Comment on lines +168 to +177
def assert_single_member_mutable_set_interface(subject, member, non_member):
    """Exercise every MutableSet method on a subject with capacity one.

    Use this for proxies whose backing store admits at most one insertable
    element at a time (typically because the underlying resource is bounded
    by hardware, e.g. a peer-access view on a system with a single valid
    peer device). The subject only ever holds ``set()`` or ``{member}``;
    *non_member* supplies the right-hand side of comparisons, ``isdisjoint``,
    subset/superset, and binary/in-place operators so every ``MutableSet``
    method is exercised at least once.
Contributor Author

@rparolin you might be able to use this to test AccessedBySetProxy in #1775.

@leofang
Member

leofang commented May 5, 2026

Q: It is a big diff with a very niche use case. Are we sure this is P0?

@Andy-Jost
Contributor Author

Andy-Jost commented May 5, 2026

> Q: It is a big diff with a very niche use case. Are we sure this is P0?

The peer access list is currently a tuple. This change aligns with the set proxies in the graph module (AdjacencySetProxy) and the managed memory hint functions (AccessedBySetProxy). Without this, users would have to remember which interfaces use a proxy and which do not. I think as a matter of course we should favor container proxies; they are more Pythonic.

This change is not functionally as big as it might appear. (1) Most of the bulk is just code movement: I moved the peer access implementation from _device_memory_resource.pyx to _peer_access_utils.pyx just to keep the DMR implementation clean. (2) I had to change _peer_access_utils from a .py to a .pyx file to accommodate the moved code, and git didn't identify it as a move/rename.

Andy-Jost added 4 commits May 5, 2026 11:32
…-by-set-proxy

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	cuda_core/docs/source/api_private.rst
…-by-set-proxy

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	cuda_core/docs/source/api_private.rst
#	cuda_core/docs/source/release/1.0.0-notes.rst
…-by-set-proxy

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	cuda_core/cuda/core/_memory/_device_memory_resource.pyx
Labels

  • breaking: Breaking changes are introduced
  • cuda.core: Everything related to the cuda.core module
  • feature: New feature or request
  • P0: High priority - Must do!

2 participants