
PowerEdge R7515 with NVIDIA T4 GPU Detects Critical Xid Errors and the GPU Stops Processing

Summary: This article provides a resolution for a PowerEdge R7515 whose NVIDIA T4 GPU detects critical Xid errors and stops processing.


Symptoms

A PowerEdge R7515 with an NVIDIA T4 GPU is running an application when critical Xid errors are detected and the GPU stops processing.

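1. The GPU CUDA toolkit and driver were updated to the latest versions (CUDA Toolkit 12.2.2; NVIDIA Data Center GPU Driver 535.129.03 for Linux) and GPU persistence mode was enabled. The user's application test was then rerun, but the error persisted: critical Xid errors were still detected.
2. The nvidia-bug-report log contains many NVRM Xid 8 errors. The GPU current temperature is 42 degrees Celsius.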
Driver Version: 535.129.03
CUDA Version: 12.2
Temperature
GPU Current Temp: 42 C
GPU T.Limit Temp: N/A
GPU Shutdown Temp: 96 C
GPU Slowdown Temp: 93 C
GPU Max Operating Temp: 85 C

Nov 27 10:18:12 dell-PowerEdge-R7515 kernel: [160879.547208] NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002
...
/var/log/dmesg:
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: nvidia-nvlink: NVLInk Core is being initialized, major device number 236
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: NVRM: Loading NVIDIA UNIX x86_64 Kernel Module 535.129.03 Thu Oct 19 18:56:32 UTC 2023
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 535.129.03 Thu Oct 19 18:42:12 UTC 2023
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: [drm] [nvidia-drm] [GPU ID 0x00008100] Loading driver
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:81:00.0 on minor 1
Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: GPU at PCI:0000:81:00: GPU-61e384ab-7c8f-7ebf-efba-da6a7d862a33
Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: GPU Board Serial Number: 1320223053140
Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002
...
Nov 27 10:18:12 dell-PowerEdge-R7515 kernel: NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002


3. The system was booted from SLI3.0 and the NVIDIA Tesla GPU 629 field diagnostics were run; they passed.
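
When a log holds many Xid entries, as in step 2, it can help to tally them by GPU, Xid code, and reporting process. The following is a minimal Python sketch, not a Dell or NVIDIA tool: the function name `summarize_xids` and the regular expression are illustrative assumptions based only on the line format shown in the excerpt above.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpt in this article:
#   NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
    r"(?: pid=(?P<pid>[^,]+),)?(?: name=(?P<name>[^,]+),)?"
)

def summarize_xids(lines):
    """Count Xid events per (PCI address, Xid code, process name)."""
    counts = Counter()
    for line in lines:
        m = XID_RE.search(line)
        if m:
            counts[(m.group("bus"), int(m.group("xid")), m.group("name"))] += 1
    return counts

# Sample lines copied from the log excerpt in this article.
sample = [
    "Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002",
    "Nov 27 10:18:12 dell-PowerEdge-R7515 kernel: [160879.547208] NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002",
    "Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: GPU Board Serial Number: 1320223053140",
]

for (bus, xid, name), n in summarize_xids(sample).items():
    print(f"GPU {bus}: Xid {xid} reported for process {name}, {n} time(s)")
```

On the sample lines, every Xid 8 entry is attributed to the Xorg process on the GPU at PCI address 0000:81:00, which is the detail the resolution relies on.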
 
 

Cause

N/A

Resolution

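1. Before testing, a driver error, bus error, or thermal issue was suspected. Based on the Xid 8 errors, the problem appears instead to be a user application error.
Xid error list
2. The user was asked to check the application. The user then disabled Xorg, and the problem went away.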
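
The thermal suspicion in step 1 can be checked against the temperature readout quoted under Symptoms. The sketch below is a hedged illustration only: `parse_temps` and `thermal_headroom` are hypothetical helper names, and the parsing assumes the exact "GPU <name> Temp: <value> C" layout shown in this article.

```python
import re

# Temperature readout copied from the Symptoms section of this article.
READOUT = """\
GPU Current Temp: 42 C
GPU T.Limit Temp: N/A
GPU Shutdown Temp: 96 C
GPU Slowdown Temp: 93 C
GPU Max Operating Temp: 85 C
"""

def parse_temps(text):
    """Return {label: degrees C} for lines with a numeric value; N/A lines are skipped."""
    temps = {}
    for line in text.splitlines():
        m = re.match(r"\s*GPU (.+?) Temp\s*:\s*(\d+) C\s*$", line)
        if m:
            temps[m.group(1)] = int(m.group(2))
    return temps

def thermal_headroom(temps):
    """Degrees C between the current temperature and the slowdown threshold."""
    return temps["Slowdown"] - temps["Current"]

temps = parse_temps(READOUT)
print(f"Headroom before slowdown: {thermal_headroom(temps)} C")
```

With the values in this article, the GPU sits well below its slowdown and maximum operating temperatures, which is consistent with ruling out a thermal cause.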
 

Affected Products

PowerEdge R7515