
feat: default GPU endpoints to minCudaVersion 12.8 #277

Open
KAJdev wants to merge 4 commits into main from zeke/ae-2408-flash-default-gpu-endpoints-to-mincuda-128


@KAJdev KAJdev commented Mar 17, 2026

GPU endpoints default to minCudaVersion = "12.8" to ensure workers only run on hosts with a recent CUDA driver. The value can be overridden per-endpoint via Endpoint(min_cuda_version=...) or directly on resource classes. CPU endpoints always have minCudaVersion cleared and excluded from their API payload.
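The default-and-clear behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `build_payload` and its `is_cpu` parameter are hypothetical names, and only the `minCudaVersion` key and the "12.8" default come from the description.

```python
from typing import Any, Dict, Optional

# Default taken from the PR description; everything else here is illustrative.
DEFAULT_MIN_CUDA_VERSION = "12.8"


def build_payload(
    name: str,
    is_cpu: bool,
    min_cuda_version: Optional[str] = DEFAULT_MIN_CUDA_VERSION,
) -> Dict[str, Any]:
    """Build a hypothetical API payload for an endpoint."""
    payload: Dict[str, Any] = {"name": name}
    # CPU endpoints never carry a CUDA constraint, whatever the caller passed.
    if not is_cpu and min_cuda_version is not None:
        payload["minCudaVersion"] = min_cuda_version
    return payload


gpu_default = build_payload("gpu-worker", is_cpu=False)
gpu_override = build_payload("gpu-worker", is_cpu=False, min_cuda_version="12.4")
cpu_payload = build_payload("cpu-worker", is_cpu=True, min_cuda_version="12.4")
```

In this sketch a GPU endpoint carries "12.8" unless overridden, while the CPU payload omits the key entirely even when a value is supplied.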

Validation

minCudaVersion is validated against the CudaVersion enum. Invalid values raise a ValueError listing the accepted versions.
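A rough sketch of what enum-backed validation of this kind might look like. The `CudaVersion` members below are an illustrative subset (the real enum's members are not shown in this thread), and `validate_min_cuda_version` is a hypothetical helper name.

```python
from enum import Enum
from typing import Optional


class CudaVersion(str, Enum):
    # Illustrative subset only; the real enum may list other versions.
    V12_4 = "12.4"
    V12_8 = "12.8"


def validate_min_cuda_version(value: Optional[str]) -> Optional[str]:
    """Return the value unchanged if it names a known version, else raise."""
    if value is None:
        return None
    accepted = [version.value for version in CudaVersion]
    if value not in accepted:
        # List the accepted versions so the caller can correct the input.
        raise ValueError(
            f"Invalid minCudaVersion {value!r}; "
            f"accepted versions: {', '.join(accepted)}"
        )
    return value
```

With this sketch, `validate_min_cuda_version("9.9")` raises a `ValueError` whose message names "12.4" and "12.8".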

Closes AE-2408

@KAJdev KAJdev force-pushed the zeke/ae-2408-flash-default-gpu-endpoints-to-mincuda-128 branch from 6b0514a to 8b625a6 Compare March 17, 2026 21:49

@runpod-Henrik runpod-Henrik left a comment


Question: Two test gaps in the plumbing

endpoint.py and resource_provisioner.py both changed to carry minCudaVersion through, but neither has a test:

  • No test for Endpoint(min_cuda_version="12.4") → _build_resource_config() → resource gets minCudaVersion="12.4"
  • No test for create_resource_from_manifest with {"minCudaVersion": "12.4"} in the manifest data → resource gets the value

The field-level tests on ServerlessResource are solid, but the plumbing from Endpoint decorator to provisioner is untested end-to-end.
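The two missing tests could take roughly the following shape. Because the real `Endpoint`, `ServerlessResource`, and provisioner signatures are not visible in this thread, the sketch uses stand-in classes (`FakeResource`, `FakeEndpoint`) and a stand-in `create_resource_from_manifest`; only the assertions are meant to carry over.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class FakeResource:
    # Stand-in for ServerlessResource; only the field under test is modelled.
    minCudaVersion: Optional[str] = "12.8"


@dataclass
class FakeEndpoint:
    # Stand-in for Endpoint; mirrors the snippet quoted in the review.
    min_cuda_version: Optional[str] = None

    def _build_resource_config(self) -> FakeResource:
        kwargs: Dict[str, Any] = {}
        if self.min_cuda_version is not None:
            kwargs["minCudaVersion"] = self.min_cuda_version
        return FakeResource(**kwargs)


def create_resource_from_manifest(data: Dict[str, Any]) -> FakeResource:
    # Stand-in for the provisioner function; the real signature is assumed.
    kwargs: Dict[str, Any] = {}
    if "minCudaVersion" in data:
        kwargs["minCudaVersion"] = data["minCudaVersion"]
    return FakeResource(**kwargs)


def test_endpoint_passes_min_cuda_version_to_resource() -> None:
    resource = FakeEndpoint(min_cuda_version="12.4")._build_resource_config()
    assert resource.minCudaVersion == "12.4"


def test_manifest_min_cuda_version_reaches_resource() -> None:
    resource = create_resource_from_manifest({"minCudaVersion": "12.4"})
    assert resource.minCudaVersion == "12.4"
```

In the real suite the stand-ins would be replaced by imports of the actual classes, so that the tests exercise the plumbing rather than a copy of it.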


Verdict

Clean, well-structured change with good test coverage on the field itself. The two plumbing tests above would close the remaining coverage gap.


@runpod-Henrik runpod-Henrik left a comment


1. Core change — clean

minCudaVersion added to ServerlessResource (default "12.8"), exposed as min_cuda_version on Endpoint, cleared for CPU, validated against the CudaVersion enum, plumbed through manifest → provisioner → GraphQL query, and included in _hashed_fields / _has_structural_changes. Test coverage is solid — 10 tests in TestMinCudaVersion covering defaults, overrides, validation, CPU clearing, hash and structural-change behaviour.

2. Issue: Existing GPU endpoints get silently re-provisioned on next deploy

minCudaVersion is in _hashed_fields. Existing deployed endpoints have no minCudaVersion in their stored config. After upgrading, the first flash deploy sees the new "12.8" default as a structural change and triggers re-provisioning for every GPU endpoint — even if nothing else changed. For busy production endpoints that's an unexpected rolling restart with no warning.
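To illustrate why a newly hashed field reads as a structural change, here is a minimal sketch. It assumes the config hash is computed over a fixed set of fields; `HASHED_FIELDS`, `config_hash`, and the `name` and `gpuCount` keys are hypothetical, and the real `_hashed_fields` mechanism may differ.

```python
import hashlib
import json
from typing import Any, Dict

# Hypothetical field list; only minCudaVersion is taken from the PR.
HASHED_FIELDS = ("name", "gpuCount", "minCudaVersion")


def config_hash(config: Dict[str, Any]) -> str:
    """Hash only the structural fields, with missing fields treated as None."""
    subset = {key: config.get(key) for key in HASHED_FIELDS}
    encoded = json.dumps(subset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


# Config stored before the upgrade: no minCudaVersion recorded.
stored = {"name": "worker", "gpuCount": 1}
# Same endpoint after the upgrade: the new default is filled in.
upgraded = {**stored, "minCudaVersion": "12.8"}

# True: the hashes differ although the user changed nothing.
changed = config_hash(stored) != config_hash(upgraded)
```

One possible mitigation, under the same assumptions, would be to treat a missing stored value as equal to the default when comparing, so that only an explicit change triggers re-provisioning.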

3. Issue: No way to opt out of the "12.8" floor

In _build_resource_config:

if self.min_cuda_version is not None:
    kwargs["minCudaVersion"] = self.min_cuda_version

None means "don't include in kwargs", which causes ServerlessResource to fall back to its "12.8" default. A user who passes Endpoint(min_cuda_version=None) expecting to remove the constraint gets "12.8" silently. If there are workloads that need to run on older drivers, they have no path to opt out.
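One common way to separate "argument not passed" from "explicitly None" is a sentinel default. The sketch below is an assumption about how an opt-out could be offered, not something this PR implements; `UNSET` and `resolve_min_cuda_version` are hypothetical names.

```python
from typing import Optional, Union


class _Unset:
    """Marker type for 'the caller did not pass this argument'."""

    def __repr__(self) -> str:
        return "UNSET"


UNSET = _Unset()
DEFAULT_MIN_CUDA_VERSION = "12.8"  # default taken from the PR description


def resolve_min_cuda_version(
    value: Union[str, None, _Unset] = UNSET,
) -> Optional[str]:
    """Return the effective minCudaVersion for a GPU endpoint."""
    if isinstance(value, _Unset):
        # Argument omitted: apply the default floor.
        return DEFAULT_MIN_CUDA_VERSION
    # Explicit None removes the constraint; a string overrides the default.
    return value
```

Here `resolve_min_cuda_version()` yields "12.8", while `resolve_min_cuda_version(None)` yields `None`, which gives users on older drivers an explicit way to drop the floor.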

Nit: SDK Reference shows None as the constructor default

min_cuda_version: Optional[str] = None

But the table note says "GPU endpoints default to "12.8" when not set." The effective default for GPU is "12.8" — the None in the Endpoint signature is an implementation detail. Users who see = None and try passing it explicitly to clear the constraint will be confused when it doesn't work. Consider documenting the signature as min_cuda_version: Optional[str] = None # GPU endpoints default to "12.8" or updating the table default column to show "12.8" for GPU.


Verdict: PASS WITH NITS

The implementation is correct. Items 2 and 3 are worth a quick look before merge — particularly whether the silent re-provision on upgrade is acceptable or needs a migration note in the changelog.

🤖 Reviewed by Henrik's AI-Powered Bug Finder
