
Benchmarks: cuda.core #2005

Open
danielfrg wants to merge 4 commits into main from benchmarks-cuda-core

@danielfrg
Contributor

Description

This PR adds cuda.core benchmarks that match the ones we have been running for cuda.bindings.

I guess it's up for discussion whether we need these and what we want to compare them against.

Right now it basically tries to measure the extra latency of the cuda.core layer by comparing its results to the cuda.bindings ones, with benchmark IDs matched to that suite 1:1.
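As a rough sketch of what the 1:1 ID matching amounts to, the snippet below pairs two result mappings by benchmark ID and reports the difference. The helper `overhead_rows`, the benchmark IDs, and the timings are all hypothetical placeholders, not the contents of compare.py and not measurements from either suite.

```python
def overhead_rows(bindings_ns, core_ns):
    """Pair results by benchmark ID and compute the cuda.core overhead.

    Both arguments map benchmark ID -> mean latency in nanoseconds.
    IDs present in only one suite are skipped: there is nothing to compare.
    """
    rows = []
    for bench_id in sorted(bindings_ns.keys() & core_ns.keys()):
        base = bindings_ns[bench_id]
        core = core_ns[bench_id]
        # Absolute overhead (ns) and relative slowdown (ratio).
        rows.append((bench_id, base, core, core - base, core / base))
    return rows


# Placeholder IDs and numbers, for illustration only.
bindings_ns = {"example.bench_a": 100.0, "example.only_in_bindings": 50.0}
core_ns = {"example.bench_a": 250.0}

rows = overhead_rows(bindings_ns, core_ns)
for bench_id, base, core, delta, ratio in rows:
    print(f"{bench_id}: bindings={base:.0f} ns  core={core:.0f} ns  "
          f"delta={delta:+.0f} ns  ratio={ratio:.2f}x")
```

Reporting both the absolute delta and the ratio is a design choice here: for very cheap calls a large ratio can correspond to only a few nanoseconds of wrapper cost.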

The main question, I think, is about the "caching" that we get from cuda.core on Device. Device instances are singletons, so after the first call Device(0) doesn't hit the driver. There are probably other similar cases.
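To make the concern concrete, here is a minimal sketch of how a per-ordinal singleton changes what a benchmark loop measures. `FakeDevice` is a hypothetical stand-in written for this example; it is not cuda.core's Device and does not reflect how cuda.core implements its cache.

```python
class FakeDevice:
    """Hypothetical per-ordinal singleton; NOT cuda.core's Device."""

    _instances = {}
    driver_calls = 0  # counts simulated trips to the driver

    def __new__(cls, ordinal=0):
        inst = cls._instances.get(ordinal)
        if inst is None:
            # Only the first construction per ordinal pays the "driver" cost.
            cls.driver_calls += 1
            inst = super().__new__(cls)
            inst.ordinal = ordinal
            cls._instances[ordinal] = inst
        return inst


first = FakeDevice(0)
for _ in range(1000):
    again = FakeDevice(0)  # every iteration is a cache hit

print(FakeDevice.driver_calls)
```

In a loop like this, all iterations after the first time only the cache lookup, so the reported latency says little about the cost of the uncached path.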

I guess we could also introduce some sort of cleanup or spawn a fresh process per measurement, but that would come with other latencies.
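One possible shape for the process-spawn idea, sketched under assumptions: each sample runs in a fresh interpreter, and the child times the operation itself so that interpreter startup is not included in the number. The child body below uses a placeholder workload in place of the first, uncached call, and `cold_sample_ns` is a hypothetical helper, not something in this PR.

```python
import subprocess
import sys

# Code run in each fresh child process. The timed statement is a placeholder
# for the first (uncached) call that the benchmark would actually measure.
CHILD = """
import time
start = time.perf_counter_ns()
sum(range(1000))
print(time.perf_counter_ns() - start)
"""


def cold_sample_ns():
    """Run the child once and return the latency it measured, in ns."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


samples = [cold_sample_ns() for _ in range(3)]
print(samples)
```

Timing inside the child keeps spawn cost out of each sample, but the total wall time of the suite still grows with every spawn, and any import or initialization work inside the timed region would still be counted.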

@danielfrg danielfrg self-assigned this May 1, 2026
@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@danielfrg danielfrg added the cuda.bindings (Everything related to the cuda.bindings module) and performance labels May 1, 2026
@danielfrg danielfrg added this to the cuda.core v1.0.0 milestone May 1, 2026
@danielfrg danielfrg requested review from leofang, mdboom and rwgk May 1, 2026 19:00
@rwgk
Contributor

rwgk commented May 1, 2026

Do you have a side-by-side bindings-vs-core delta table that you could post here?


Quick "Low" findings from Cursor GPT-5.4 Extra High Fast

  • Low: benchmarks/cuda_core/compare.py and benchmarks/cuda_core/benchmarks/bench_ctx_device.py tell readers to consult BENCHMARK_PLAN.md, but there is no BENCHMARK_PLAN.md under benchmarks/cuda_core or elsewhere in the repo. The starred-row legend is useful, but the referenced deeper rationale document is missing.

  • Low: benchmarks/cuda_core/benchmarks/bench_ctx_device.py says Device() with no args returns the TLS-cached current device, but cuda_core/cuda/core/_device.pyx actually resolves that case by calling cuCtxGetDevice() when a context is active. The benchmark behavior itself is fine, and benchmarks/cuda_core/compare.py already treats that row as a different code path, but the benchmark comment is misleading about what work is really being measured.

