How to Troubleshoot and Resolve Memory Errors Within a Unified Computing System Environment

Summary: This article details how to troubleshoot and resolve memory errors within a Cisco Unified Computing System (UCS) environment.

Instructions

Error Identification:

Review the Faults tab in UCS Manager (UCSM) to determine whether memory errors are present and what their impact is.
Capture UCSM and chassis logs from the affected server BEFORE starting any troubleshooting. These logs preserve the historical data needed to tell whether the errors return after troubleshooting.

Error Confirmation:
Once errors are identified, clear all of them and monitor the error counters to see whether the errors return.

Log in to the UCS command line.
Reset memory errors using the following commands, replacing X/Y with the chassis and server numbers of the affected server:

CLI# scope server X/Y
CLI# reset-all-memory-errors
CLI# commit-buffer

Clear System Event Logs using the following commands:

CLI# scope server X/Y
CLI# clear sel
CLI# commit-buffer

Reset CIMC using the following commands:

CLI# scope server X/Y
CLI# scope cimc
CLI# reset
CLI# commit-buffer

Monitor the environment for 48 hours.

If memory errors persist, capture a fresh set of UCSM and Chassis logs, and go to the next section.
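The command sequences above can be sketched as a small Python helper that assembles them for one server and computes the end of the 48-hour monitoring window. This is only an illustration: the function names `build_clear_sequence` and `monitoring_deadline` are hypothetical, the helper only builds text and does not connect to UCS, and it assumes X/Y are the chassis and server numbers.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def build_clear_sequence(chassis_id: int, server_id: int) -> list[list[str]]:
    """Return the three command groups from this article for one server.

    Hypothetical helper: it only formats strings and runs nothing.
    """
    if chassis_id < 1 or server_id < 1:
        raise ValueError("chassis_id and server_id must be positive")
    scope = f"scope server {chassis_id}/{server_id}"
    return [
        # Reset memory errors
        [scope, "reset-all-memory-errors", "commit-buffer"],
        # Clear System Event Logs
        [scope, "clear sel", "commit-buffer"],
        # Reset CIMC
        [scope, "scope cimc", "reset", "commit-buffer"],
    ]


def monitoring_deadline(start: datetime) -> datetime:
    """Return the end of the 48-hour monitoring window that begins at start."""
    return start + timedelta(hours=48)


if __name__ == "__main__":
    # Example values only: chassis 1, server 3.
    for group in build_clear_sequence(1, 3):
        print("\n".join(group))
        print()
    print("Monitor until:", monitoring_deadline(datetime.now()))
```

Keeping each group separate mirrors the article, where every sequence ends with its own commit-buffer.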

Physical Troubleshooting:
Before replacing a DIMM, determine whether the errors are related to the socket, the DIMM, or the CPU.

To do this, swap the hardware components and monitor the environment as follows:

Put the ESXi host in maintenance mode.
Swap the faulted DIMMs with DIMMs that have not shown any issues.
Reboot the server and keep it in maintenance mode.
Monitor the server for 48 hours to see whether the issue recurs.

If you are unable to swap or reseat the components, contact Dell Support or engage additional resources for assistance.

If the errors persist after the swap, follow the actions below:

If DIMM errors follow the DIMM to a new slot, replace the DIMM.
If DIMM errors stay with the same DIMM slot, replace the motherboard.
If DIMM errors persist after both DIMM and motherboard replacement, start a WebEx session with Dell Support for live troubleshooting.
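The decision rules above can be sketched as a small Python function. This is a minimal illustration, not a Dell or Cisco tool: the function name `recommend_action`, the slot labels such as "A1" and "B1", and the two "not covered" results are assumptions added so the function can return something for cases the article does not address.

```python
from __future__ import annotations


def recommend_action(
    original_slot: str,
    new_slot: str,
    faulted_slots_after_swap: set[str],
    dimm_replaced: bool = False,
    motherboard_replaced: bool = False,
) -> str:
    """Map post-swap observations to the action listed in this article.

    original_slot: slot that first reported errors (hypothetical label).
    new_slot: slot the suspect DIMM was moved to.
    faulted_slots_after_swap: slots reporting errors after the 48-hour watch.
    """
    follows_dimm = new_slot in faulted_slots_after_swap
    stays_with_slot = original_slot in faulted_slots_after_swap

    if not (follows_dimm or stays_with_slot):
        # Assumption: the article gives no step when errors do not return.
        return "not covered: errors did not return"
    if dimm_replaced and motherboard_replaced:
        return "start a WebEx session with Dell Support"
    if follows_dimm and stays_with_slot:
        # Assumption: the article does not cover errors in both slots.
        return "not covered: errors in both slots"
    if follows_dimm:
        return "replace the DIMM"
    return "replace the motherboard"


if __name__ == "__main__":
    # Suspect DIMM moved from A1 to B1; errors now appear on B1.
    print(recommend_action("A1", "B1", {"B1"}))
```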

Affected Products

Converged Infrastructure

Article Number: 000194121

Article Type: How To

Last Modified: 10 Jan 2023

Version: 3
