PowerEdge R7515 with NVIDIA T4 GPU Detected Critical Xid Error, and GPU stopped processing

Summary: This article describes how to resolve a Critical Xid Error detected on a PowerEdge R7515 with an NVIDIA T4 GPU, after which the GPU stops processing.

Symptoms

A PowerEdge R7515 with an NVIDIA T4 GPU detects a Critical Xid Error while running an application, and the GPU stops processing.

1. The CUDA toolkit and GPU driver were updated to the latest versions (CUDA Toolkit 12.2.2, NVIDIA Data Center GPU Driver 535.129.03 for Linux) and GPU persistence mode was enabled. The user's application test still failed with the error: Detected Critical Xid Error.
2. The nvidia-bug-report log contains many NVRM Xid 8 errors. The reported GPU Current Temp is 42 C.
Driver Version: 535.129.03
CUDA Version: 12.2
Temperature
GPU Current Temp: 42 C
GPU T.Limit Temp: N/A
GPU Shutdown Temp: 96 C
GPU Slowdown Temp: 93 C
GPU Max Operating Temp: 85 C

Nov 27 10:18:12 dell-PowerEdge-R7515 kernel: [160879.547208] NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002
...
/var/log/dmesg:
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: nvidia-nvlink: NVLInk Core is being initialized, major device number 236
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: NVRM: Loading NVIDIA UNIX x86_64 Kernel Module 535.129.03 Thu Oct 19 18:56:32 UTC 2023
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 535.129.03 Thu Oct 19 18:42:12 UTC 2023
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: [drm] [nvidia-drm] [GPU ID 0x00008100] Loading driver
Nov 25 13:36:52 dell-PowerEdge-R7515 kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:81:00.0 on minor 1
Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: GPU at PCI:0000:81:00: GPU-61e384ab-7c8f-7ebf-efba-da6a7d862a33
Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: GPU Board Serial Number: 1320223053140
Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002
...
Nov 27 10:18:12 dell-PowerEdge-R7515 kernel: NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002


3. After booting from Support Live Image (SLI) 3.0, the NVIDIA Tesla GPU 629 field diagnostics PASSED.
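To see which process triggers the Xid events, the kernel log lines can be grouped by PCI address, Xid code, and process name. The following is a minimal sketch, assuming the log lines follow the same format as the excerpts above; the function name `summarize_xid` and the sample list are illustrative and not part of any NVIDIA or Dell tool.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
#   NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002
# The pid and name fields are treated as optional (assumption: some
# Xid messages may omit them or use a different layout).
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+)"
    r"(?:, pid=(?P<pid>\d+))?(?:, name=(?P<name>[^,]+))?"
)

def summarize_xid(lines):
    """Count Xid events per (PCI address, Xid code, process name)."""
    counts = Counter()
    for line in lines:
        m = XID_RE.search(line)
        if m:
            counts[(m["pci"], int(m["xid"]), m["name"])] += 1
    return counts

# Sample lines copied from the log excerpts in this article.
sample = [
    "Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: GPU Board Serial Number: 1320223053140",
    "Nov 25 13:37:04 dell-PowerEdge-R7515 kernel: NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002",
    "Nov 27 10:18:12 dell-PowerEdge-R7515 kernel: NVRM: Xid (PCI:0000:81:00): 8, pid=1020, name=Xorg, Channel 00000002",
]

for (pci, xid, name), n in summarize_xid(sample).items():
    print(f"{pci} Xid {xid} from {name}: {n} event(s)")
```

If every Xid 8 event names the same process, as with Xorg in this log, that process is a reasonable first place to investigate.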
 
 

Cause

N/A

Resolution

1. Refer to the Xid error listings in NVIDIA's Xid Errors documentation for Xid 8. The earlier tests rule out a driver error, a bus error, and a thermal issue, so the error appears to be caused by the user application.
2. Ask the user to check the application. In this case, the user disabled Xorg and the issue no longer occurred.
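The thermal check in step 1 can be made explicit by comparing the current GPU temperature with the reported limits. The following is a minimal sketch, assuming the temperature text has the same layout as the Temperature block in the Symptoms section; the helper names `parse_gpu_temps` and `thermal_headroom` are hypothetical.

```python
import re

# Temperature block copied from the Symptoms section of this article.
SAMPLE = """\
GPU Current Temp: 42 C
GPU T.Limit Temp: N/A
GPU Shutdown Temp: 96 C
GPU Slowdown Temp: 93 C
GPU Max Operating Temp: 85 C
"""

# Captures the field name between "GPU" and "Temp" and its value.
TEMP_RE = re.compile(r"GPU (?P<field>.+?) Temp\s*:\s*(?P<val>\d+|N/A)")

def parse_gpu_temps(text):
    """Return {field: degrees C, or None when the value is N/A}."""
    temps = {}
    for m in TEMP_RE.finditer(text):
        val = m["val"]
        temps[m["field"]] = None if val == "N/A" else int(val)
    return temps

def thermal_headroom(temps):
    """Degrees C between the current and maximum operating temperature."""
    return temps["Max Operating"] - temps["Current"]

temps = parse_gpu_temps(SAMPLE)
print("Headroom below Max Operating Temp (C):", thermal_headroom(temps))
```

A large positive headroom, as with the values in this article, is consistent with ruling out a thermal issue.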
 

Affected Products

PowerEdge R7515