애플리케이션 프로그램을 실행하는 NVIDIA T4 GPU가 장착된 PowerEdge R7515 에서 심각한 Xid 오류가 감지되었으며 GPU 처리가 중단되었습니다.
1. GPU CUDA 및 드라이버를 최신 버전인 CUDA Toolkit 12로 업데이트합니다. 12.2.2 NVIDIA 데이터 센터 GPU 드라이버: 535.129.03(Linux), GPU 지속성 모드를 활성화한 다음 사용자의 APP 테스트를 실행해도 여전히 오류가 발생합니다. 심각한 xid 오류가 감지되었습니다.
2. Nvidia-bug-report 로그 검사에 NVRM:Xid:8 오류가 많고 GPU 현재 온도: 42 다.
Driver Version: 535.129.03
CUDA Version: 12.2
Temperature
GPU Current Temp: 42 C
GPU T.Limit Temp: N/A
GPU Shutdown Temp: 96 C
GPU Slowdown Temp: 93 C
GPU Max Operating Temp: 85 C
Nov 27 10:18:12 dell-PowerEdge-R7515 kernel: [160879.547208] NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002
...
/var/log/dmesg:
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: nvidia-nvlink: NVLInk Core is being initialized, major device number 236
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: NVRM: Loading NVIDIA UNIX x86_64 Kernel Module 535.129.03 Thu Oct 19 18:56:32 UTC 2023
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 535.129.03 Thu Oct 19 18:42:12 UTC 2023
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: [drm] [nvidia-drm] [GPU ID 0x00008100] Loading driver
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:81:00.0 on minor 1
Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: GPU at PCI:0000:81:00: GPU-61e384ab-7c8f-7ebf-efba-da6a7d862a33
Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: GPU Board Serial Number: 1320223053140
Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002
...
Nov 27 10:18:12 dell-PowerEdge-R7515 kernel: NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002
3. SLI3.0에서 부팅하고 NVIDIA Tesla GPU 629 현장 진단을 실행하면 통과됩니다.