PowerEdge R7515 with NVIDIA T4 GPU: Critical Xid Error Detected and GPU Stops Processing

Summary: This article provides a resolution for a PowerEdge R7515 with an NVIDIA T4 GPU that detects a critical Xid error, after which the GPU stops processing.

Symptoms

A PowerEdge R7515 with an NVIDIA T4 GPU is running an application when a critical Xid error is detected and the GPU stops processing.

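1. The GPU CUDA toolkit and driver were updated to the latest versions (CUDA Toolkit 12.2.2; NVIDIA Data Center GPU Driver 535.129.03 for Linux) and GPU persistence mode was enabled. The user's application test was then run, but the error persisted: a critical Xid error was detected.
2. The nvidia-bug-report log contains many NVRM Xid 8 errors. The GPU's current temperature is 42 °C.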
Driver Version: 535.129.03
CUDA Version: 12.2
Temperature
GPU Current Temp: 42 C
GPU T.Limit Temp: N/A
GPU Shutdown Temp: 96 C
GPU Slowdown Temp: 93 C
GPU Max Operating Temp: 85 C

Nov 27 10:18:12 dell-PowerEdge-R7515 kernel: [160879.547208] NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002
...
/var/log/dmesg:
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: nvidia-nvlink: NVLInk Core is being initialized, major device number 236
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: NVRM: Loading NVIDIA UNIX x86_64 Kernel Module 535.129.03 Thu Oct 19 18:56:32 UTC 2023
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 535.129.03 Thu Oct 19 18:42:12 UTC 2023
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: [drm] [nvidia-drm] [GPU ID 0x00008100] Loading driver
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:81:00.0 on minor 1
Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: GPU at PCI:0000:81:00: GPU-61e384ab-7c8f-7ebf-efba-da6a7d862a33
Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: GPU Board Serial Number: 1320223053140
Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002
...
Nov 27 10:18:12 dell-PowerEdge-R7515 kernel: NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002


3. The server was booted from SLI 3.0 and the NVIDIA Tesla GPU 629 field diagnostics were run; the diagnostics passed.
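
When a log contains many Xid entries, it can help to tally them by Xid code and process name to see which process is associated with the errors. The following is a minimal Python sketch, not a Dell or NVIDIA tool. It assumes the kernel log lines follow the "NVRM: Xid (PCI:...): <code>, pid=<pid>, name=<name>" layout seen in the excerpts above; the function names parse_xid_events and count_by_process are hypothetical.

```python
import re
from collections import Counter

# Assumed line layout, taken from the log excerpts in this article:
#   ... kernel: NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002
# The pid and name fields are treated as optional because other Xid
# messages may not carry them.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
    r"(?:, pid=(?P<pid>\d+))?(?:, name=(?P<name>[^,]+))?"
)


def parse_xid_events(lines):
    """Return one dict per log line that contains an NVRM Xid message."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match is None:
            continue  # not an Xid line
        pid = match.group("pid")
        events.append({
            "pci": match.group("pci"),
            "xid": int(match.group("xid")),
            "pid": int(pid) if pid else None,
            "name": match.group("name"),
        })
    return events


def count_by_process(events):
    """Count events per (Xid code, process name) pair."""
    return Counter((event["xid"], event["name"]) for event in events)


if __name__ == "__main__":
    sample = [
        "Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: GPU Board Serial Number: 1320223053140",
        "Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002",
        "Nov 27 10:18:12 dell-PowerEdge-R7515 kernel: NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002",
    ]
    for (xid, name), count in count_by_process(parse_xid_events(sample)).items():
        print(f"Xid {xid} from {name}: {count} event(s)")
```

With the log lines from this article, every Xid 8 entry is attributed to the Xorg process (pid 1020), which is the kind of pattern such a tally is meant to surface.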
 
 

Cause

N/A

Resolution

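1. Xid 8 can indicate a driver error, a bus error, a thermal issue, or a user application error. The earlier testing addressed the driver, bus, and thermal possibilities, so the cause appears to be a user application error.
Reference: Xid error list
2. The user was asked to check the application. The user then disabled Xorg, and the issue no longer occurred.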
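
One of the candidate causes, a thermal issue, can be checked against the temperature readings quoted in the Symptoms section. The sketch below is illustrative only. It assumes the text layout shown in this article ("GPU <label> Temp: <value> C"); the helper names parse_temperatures and thermal_headroom are hypothetical. Readings reported as N/A are skipped.

```python
import re

# Assumed layout, copied from the temperature block in this article.
SAMPLE = """\
GPU Current Temp: 42 C
GPU T.Limit Temp: N/A
GPU Shutdown Temp: 96 C
GPU Slowdown Temp: 93 C
GPU Max Operating Temp: 85 C
"""

TEMP_RE = re.compile(
    r"^\s*GPU (?P<label>[A-Za-z. ]+?) Temp\s*:\s*(?P<value>\d+) C\s*$"
)


def parse_temperatures(text):
    """Map each label (for example "Current") to its reading in degrees C."""
    temps = {}
    for line in text.splitlines():
        match = TEMP_RE.match(line)
        if match:  # lines with non-numeric values such as N/A do not match
            temps[match.group("label")] = int(match.group("value"))
    return temps


def thermal_headroom(temps):
    """Degrees C between the current reading and the slowdown threshold."""
    return temps["Slowdown"] - temps["Current"]


if __name__ == "__main__":
    readings = parse_temperatures(SAMPLE)
    print(f"Headroom before slowdown: {thermal_headroom(readings)} C")
```

For the readings in this article the GPU is at 42 C against a 93 C slowdown threshold, which is consistent with a thermal cause being unlikely.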
 

Affected Products

PowerEdge R7515
Article Properties
Article Number: 000220148
Article Type: Solution
Last Modified: 26 Feb 2024
Version:  2