CUDA problem on A100 nodes
https://github.com/NVIDIA/nvidia-container-toolkit/issues/48
SOLUTION: https://github.com/NVIDIA/nvidia-container-toolkit/issues/381
Test: run a PyTorch test, wait 10 seconds, then run a second test. The State column records whether the first and second runs succeeded.
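The test procedure above can be sketched roughly as follows. This is a minimal illustration, not the script actually used on these nodes: the helper names `cuda_check` and `run_twice`, the tensor size, and the choice to run both checks in the same process are all assumptions.

```python
import time
from typing import Callable, Tuple


def cuda_check() -> bool:
    """Hypothetical check: True if a small tensor operation succeeds on the GPU."""
    try:
        import torch  # imported lazily so run_twice() is usable without PyTorch

        if not torch.cuda.is_available():
            return False
        x = torch.ones(1024, device="cuda")
        # 1024 ones doubled and summed should give exactly 2048.
        return (x + x).sum().item() == 2048.0
    except Exception:
        # Any import, driver, or runtime error counts as a failed run.
        return False


def run_twice(check: Callable[[], bool], delay_s: float = 10.0) -> Tuple[bool, bool]:
    """Run `check`, wait `delay_s` seconds, run it again; return both results."""
    first = check()
    time.sleep(delay_s)
    second = check()
    return first, second
```

Calling `run_twice(cuda_check)` inside the container returns a pair of booleans, one per run, which corresponds to what the State column records for a node.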
Fixed:

| Node | State | Kernel | GPU |
|---|---|---|---|
| gpengine-uams.areon.net | | 5.4.0-182-generic | A100 |
| node-1-1.sdsc.optiputer.net | | 5.15.134+release+2.9.1-amd64 | A100 |
| node-2-1.sdsc.optiputer.net | | 5.10.187.release.2.9.0r4-amd64 | A100 |

Bad nodes:

| Node | State | Kernel | GPU | Driver |
|---|---|---|---|---|
| hcc-nrp-shor-c5805.unl.edu. | | | | |
| node-1-3.sdsc.optiputer.net | | 5.10.187.release.2.9.0r4-amd64 | A100 | |

Good:

| Node | State | Kernel | GPU | Driver |
|---|---|---|---|---|
| hcc-nrp-shor-c5925.unl.edu | | | A10 | |
| k8s-4090-02.calit2.optiputer.net | | | 4090 | |
| k8s-dtn-01.uog.edu | | | | |
| rci-nrp-gpu-03.sdsu.edu | | | A100 | |
| ry-gpu-15.sdsc.optiputer.net | | | A4000 | |
| fiona8-0.calit2.uci.edu | | | | |
| fiona8.ucsc.edu | | | | |
| k8s-chase-ci-03.calit2.optiputer.net | | | | |
| k8s-chase-ci-07.calit2.optiputer.net | | | | |
| k8s-gpu-2.ucsc.edu | | | | |

Edited by Dima Mishin