Replies: 1 comment
Instead of rebuilding the whole cluster, I just added a new worker node (Ubuntu 24.04.4 LTS).
I have a Kubernetes cluster for dev purposes. The host OS is CentOS 10, and I am using QEMU/KVM (libvirt) as the virtualization layer. I got a GeForce RTX 3070 GPU working: I can run ML model training Podman containers successfully at the host level. Then I decided to enable GPU passthrough and added the PCI device to one of my Kubernetes worker nodes. The idea? You got it: run ML jobs in k8s. I followed the steps to set up the gpu-operator via its Helm chart. Sadly, it doesn't work on Fedora CoreOS 43.
Here is the error:
Mar 08 03:54:43 worker02 kubelet[1235]: E0308 03:54:43.280297 1235 pod_workers.go:1324] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"nvidia-driver-ctr\" with ImagePullBackOff: \"Back-off pulling image \\\"nvcr.io/nvidia/driver:580.105.08-fedora43\\\": ErrImagePull: rpc error: code = NotFound desc = failed to pull and unpack image \\\"nvcr.io/nvidia/driver:580.105.08-fedora43\\\": failed to resolve image: nvcr.io/nvidia/driver:580.105.08-fedora43: not found\"" pod="gpu-operator/nvidia-driver-daemonset-d2hvh" podUID="8551a036-564a-484a-992b-b465e2df527b"
Instead of looking for a workaround, I would like to hear your opinion. As DevOps engineers or SREs, what would be your go-to dev stack for stable dev environments (on-prem k8s with virtualization)? I'm asking so I can plan on rebuilding my dev environment.
I see images for Ubuntu 22.04 and 24.04, Rocky Linux, and RHEL 9 and 10.
Thoughts?
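For anyone triaging the same error, here is a minimal Python sketch that pulls the missing image reference out of a kubelet log line like the one above and splits its tag. The function name `parse_missing_image` and the regex are hypothetical helpers, not part of any tool. Reading the tag as `<driver version>-<OS id and version>` is an assumption based only on the shape of the tag in the log (`580.105.08-fedora43`).

```python
import re

# Relevant tail of the kubelet log line quoted in the post.
LOG_LINE = (
    "failed to resolve image: "
    "nvcr.io/nvidia/driver:580.105.08-fedora43: not found"
)

# Hypothetical pattern: capture "<repository>:<tag>" from a
# "failed to resolve image: ...: not found" message.
IMAGE_RE = re.compile(
    r"failed to resolve image: (?P<repo>[\w./-]+):(?P<tag>[\w.-]+): not found"
)


def parse_missing_image(line):
    """Return the parts of the unresolved image reference, or None if absent."""
    match = IMAGE_RE.search(line)
    if match is None:
        return None
    repo, tag = match.group("repo"), match.group("tag")
    # Assumption: everything before the first "-" is the driver version,
    # and the rest identifies the node OS the image was built for.
    driver_version, _, os_suffix = tag.partition("-")
    return {
        "repository": repo,
        "tag": tag,
        "driver_version": driver_version,
        "os_suffix": os_suffix,
    }


if __name__ == "__main__":
    print(parse_missing_image(LOG_LINE))
```

The `os_suffix` value (`fedora43` here) is the part to compare against the OS variants that actually have published driver images.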