feat: default GPU endpoints to minCudaVersion 12.8 #277
runpod-Henrik
left a comment
Question: Two test gaps in the plumbing
endpoint.py and resource_provisioner.py both changed to carry minCudaVersion through, but neither has a test:
- No test for `Endpoint(min_cuda_version="12.4")` → `_build_resource_config()` → resource gets `minCudaVersion="12.4"`
- No test for `create_resource_from_manifest` with `{"minCudaVersion": "12.4"}` in the manifest data → resource gets the value
The field-level tests on ServerlessResource are solid, but the plumbing from Endpoint decorator to provisioner is untested end-to-end.
Verdict
Clean, well-structured change with good test coverage on the field itself. The two plumbing tests above would close the remaining coverage gap.
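The two missing plumbing tests could look roughly like this. The names (`Endpoint`, `_build_resource_config`, `create_resource_from_manifest`, `ServerlessResource`) are taken from the diff; the stub bodies below are hypothetical stand-ins so the sketch is runnable, not the SDK's real implementation, which would be imported instead.

```python
# Stand-in stubs mirroring the names in the diff; the real SDK classes
# would be imported in the actual test module.
class ServerlessResource:
    def __init__(self, minCudaVersion="12.8", **kwargs):
        self.minCudaVersion = minCudaVersion

class Endpoint:
    def __init__(self, min_cuda_version=None):
        self.min_cuda_version = min_cuda_version

    def _build_resource_config(self):
        kwargs = {}
        if self.min_cuda_version is not None:
            kwargs["minCudaVersion"] = self.min_cuda_version
        return kwargs

def create_resource_from_manifest(data):
    return ServerlessResource(**data)

def test_endpoint_plumbs_min_cuda_version():
    # Endpoint(min_cuda_version="12.4") -> _build_resource_config()
    # -> resource gets minCudaVersion="12.4"
    config = Endpoint(min_cuda_version="12.4")._build_resource_config()
    assert ServerlessResource(**config).minCudaVersion == "12.4"

def test_manifest_plumbs_min_cuda_version():
    # Manifest data carrying minCudaVersion reaches the resource.
    resource = create_resource_from_manifest({"minCudaVersion": "12.4"})
    assert resource.minCudaVersion == "12.4"
```

Swapping the stubs for the real imports would exercise the decorator-to-provisioner path end-to-end.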
runpod-Henrik
left a comment
1. Core change — clean
minCudaVersion added to ServerlessResource (default "12.8"), exposed as min_cuda_version on Endpoint, cleared for CPU, validated against the CudaVersion enum, plumbed through manifest → provisioner → GraphQL query, and included in _hashed_fields / _has_structural_changes. Test coverage is solid — 10 tests in TestMinCudaVersion covering defaults, overrides, validation, CPU clearing, hash and structural-change behaviour.
2. Issue: Existing GPU endpoints get silently re-provisioned on next deploy
minCudaVersion is in _hashed_fields. Existing deployed endpoints have no minCudaVersion in their stored config. After upgrading, the first flash deploy sees the new "12.8" default as a structural change and triggers re-provisioning for every GPU endpoint — even if nothing else changed. For busy production endpoints that's an unexpected rolling restart with no warning.
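To illustrate the mechanism (an illustrative stand-in, not the SDK's actual hashing scheme): if the structural-change check hashes the fields in `_hashed_fields`, the new default alone is enough to make the stored and freshly built configs diverge.

```python
import hashlib
import json

def config_hash(config: dict) -> str:
    # Illustrative: hash the sorted JSON of the hashed fields; the SDK's
    # real _hashed_fields hashing may differ in detail.
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()

stored = {"gpuCount": 1}                              # deployed before the upgrade
upgraded = {"gpuCount": 1, "minCudaVersion": "12.8"}  # same endpoint, new default

# The hashes differ even though the user changed nothing, so the deploy
# is treated as a structural change and the endpoint is re-provisioned.
assert config_hash(stored) != config_hash(upgraded)
```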
3. Issue: No way to opt out of the "12.8" floor
In _build_resource_config:
```python
if self.min_cuda_version is not None:
    kwargs["minCudaVersion"] = self.min_cuda_version
```

`None` means "don't include in kwargs", which causes `ServerlessResource` to fall back to its `"12.8"` default. A user who passes `Endpoint(min_cuda_version=None)` expecting to remove the constraint gets `"12.8"` silently. If there are workloads that need to run on older drivers, they have no path to opt out.
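One hypothetical way to restore an opt-out path: a module-level sentinel that distinguishes "not passed" from an explicit `None`. The sentinel name and downstream handling here are assumptions, not the SDK's current API.

```python
# Sentinel distinguishing "argument not passed" from an explicit None.
_UNSET = object()

class Endpoint:
    def __init__(self, min_cuda_version=_UNSET):
        self.min_cuda_version = min_cuda_version

    def _build_resource_config(self):
        kwargs = {}
        if self.min_cuda_version is not _UNSET:
            # An explicit None is now forwarded, so the provisioner could
            # treat it as "no CUDA floor" instead of falling back to 12.8.
            kwargs["minCudaVersion"] = self.min_cuda_version
        return kwargs

# Omitted: field excluded, ServerlessResource default applies.
assert "minCudaVersion" not in Endpoint()._build_resource_config()
# Explicit None: forwarded, so the constraint can be cleared downstream.
assert Endpoint(min_cuda_version=None)._build_resource_config() == {"minCudaVersion": None}
# Explicit value: forwarded unchanged.
assert Endpoint(min_cuda_version="12.4")._build_resource_config() == {"minCudaVersion": "12.4"}
```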
Nit: SDK Reference shows None as the constructor default
```python
min_cuda_version: Optional[str] = None
```

But the table note says "GPU endpoints default to `"12.8"` when not set." The effective default for GPU is `"12.8"`; the `None` in the `Endpoint` signature is an implementation detail. Users who see `= None` and try passing it explicitly to clear the constraint will be confused when it doesn't work. Consider documenting the signature as `min_cuda_version: Optional[str] = None  # GPU endpoints default to "12.8"`, or updating the table's default column to show `"12.8"` for GPU.
Verdict: PASS WITH NITS
The implementation is correct. Items 2 and 3 are worth a quick look before merge — particularly whether the silent re-provision on upgrade is acceptable or needs a migration note in the changelog.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
GPU endpoints default to `minCudaVersion = "12.8"` to ensure workers only run on hosts with a recent CUDA driver. The value can be overridden per-endpoint via `Endpoint(min_cuda_version=...)` or directly on resource classes. CPU endpoints always have `minCudaVersion` cleared and excluded from their API payload.

Validation

`minCudaVersion` is validated against the `CudaVersion` enum. Invalid values raise a `ValueError` listing the accepted versions.

Closes AE-2408
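The enum-based validation described above could look like this minimal sketch. The enum members and the validator's name are illustrative assumptions; only the behaviour (reject unknown versions with a `ValueError` listing the accepted ones) comes from the description.

```python
from enum import Enum

class CudaVersion(str, Enum):
    # Illustrative subset; the real enum's members may differ.
    V12_4 = "12.4"
    V12_8 = "12.8"

def validate_min_cuda_version(value: str) -> str:
    # Reject anything outside the enum, listing the accepted versions.
    accepted = sorted(v.value for v in CudaVersion)
    if value not in accepted:
        raise ValueError(f"Invalid minCudaVersion {value!r}; accepted: {accepted}")
    return value
```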