CUDA problem on A100 nodes

https://github.com/NVIDIA/nvidia-container-toolkit/issues/48

SOLUTION: https://github.com/NVIDIA/nvidia-container-toolkit/issues/381

Test: run a PyTorch test, wait 10 seconds, then run the test a second time. The State column records whether the 1st and 2nd runs succeeded.
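The two-run procedure can be sketched in Python as below. This is a minimal illustration, not the script actually used on the cluster: the helper names `run_check` and `two_run_state` and the `CHECK` one-liner are hypothetical, and the check assumes that a small CUDA tensor operation is enough to show whether the GPU is still usable from inside the container.

```python
import subprocess
import sys
import time

# Hypothetical check: fails if CUDA is not visible or a tiny GPU op cannot run.
CHECK = (
    "import torch; "
    "assert torch.cuda.is_available(); "
    "torch.ones(1, device='cuda').sum().item()"
)


def run_check(cmd):
    """Run a command and return True if it exits with status 0."""
    return subprocess.run(cmd, capture_output=True).returncode == 0


def two_run_state(check, delay=10.0, sleep=time.sleep):
    """Call check(), wait `delay` seconds, call check() again.

    Returns (first_ok, second_ok), i.e. the State of a node.
    """
    first = check()
    sleep(delay)
    second = check()
    return first, second
```

Inside a GPU pod, `two_run_state(lambda: run_check([sys.executable, "-c", CHECK]))` would return a pair such as `(True, False)` for a node where the first run succeeds and the second fails. Running the check in a fresh subprocess each time means CUDA is initialized from scratch on both runs.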

Fixed:

| Node | State | Kernel | GPU |
|---|---|---|---|
| gpengine-uams.areon.net | | 5.4.0-182-generic | A100 |
| node-1-1.sdsc.optiputer.net | | 5.15.134+release+2.9.1-amd64 | A100 |
| node-2-1.sdsc.optiputer.net | | 5.10.187.release.2.9.0r4-amd64 | A100 |

Bad nodes:

| Node | State | Kernel | GPU | Driver |
|---|---|---|---|---|
| hcc-nrp-shor-c5805.unl.edu | | | | |
| node-1-3.sdsc.optiputer.net | | 5.10.187.release.2.9.0r4-amd64 | A100 | |

Good:

| Node | State | Kernel | GPU | Driver |
|---|---|---|---|---|
| hcc-nrp-shor-c5925.unl.edu | | | A10 | |
| k8s-4090-02.calit2.optiputer.net | | | 4090 | |
| k8s-dtn-01.uog.edu | | | | |
| rci-nrp-gpu-03.sdsu.edu | | | A100 | |
| ry-gpu-15.sdsc.optiputer.net | | | A4000 | |
| fiona8-0.calit2.uci.edu | | | | |
| fiona8.ucsc.edu | | | | |
| k8s-chase-ci-03.calit2.optiputer.net | | | | |
| k8s-chase-ci-07.calit2.optiputer.net | | | | |
| k8s-gpu-2.ucsc.edu | | | | |
Edited by Dima Mishin