Error Identification:
- Review the 'Faults' tab within UCS to determine whether errors are present and what their impact is.
- Capture UCSM and Chassis logs from the affected server BEFORE starting any troubleshooting. These logs preserve the historical data needed to tell whether the errors return after troubleshooting.
Error Confirmation:
Once errors are identified, clear them all and monitor the counters to see whether the errors return.
- Log in to the UCS command line.
- Reset memory errors using the following commands:
CLI# scope server X/Y
CLI# reset-all-memory-errors
CLI# commit-buffer
- Clear System Event Logs using the following commands:
CLI# scope server X/Y
CLI# clear sel
CLI# commit-buffer
- Reset CIMC using the following commands:
CLI# scope server X/Y
CLI# scope cimc
CLI# reset
CLI# commit-buffer
- Monitor the environment for 48 hours.
If memory errors persist, capture a fresh set of UCSM and Chassis logs, and go to the next section.
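The three command blocks above can be generated per server so that X/Y is filled in the same way each time. The following is a minimal Python sketch, not part of any UCS tooling: the function name build_clear_steps and its parameters are hypothetical, X is assumed to be the chassis ID and Y the server ID, and the script only produces text for an operator to enter at the UCS command line. It does not connect to UCS.

```python
def build_clear_steps(chassis, server):
    """Return the three command blocks from the steps above for one server.

    Assumption: X/Y in the procedure means chassis ID / server ID.
    Each block starts with its own 'scope server' line, as in the procedure.
    """
    if chassis < 1 or server < 1:
        raise ValueError("chassis and server IDs must be positive integers")

    scope = f"scope server {chassis}/{server}"
    return [
        ("Reset memory errors", [scope, "reset-all-memory-errors", "commit-buffer"]),
        ("Clear System Event Logs", [scope, "clear sel", "commit-buffer"]),
        ("Reset CIMC", [scope, "scope cimc", "reset", "commit-buffer"]),
    ]


if __name__ == "__main__":
    # Example IDs only; replace with the affected server's chassis/server pair.
    for title, commands in build_clear_steps(1, 3):
        print(f"- {title}:")
        for command in commands:
            print(f"CLI# {command}")
```

Keeping each block separate mirrors the procedure, so an operator can run and verify one step before moving to the next.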
Physical Troubleshooting:
Before replacing a DIMM, determine whether the errors are related to the socket, the DIMM, or the CPU.
This is done by swapping the hardware components and monitoring the environment. Instructions are provided below:
- Put the ESXi host in maintenance mode.
- Swap the faulted DIMMs with DIMMs that have not previously shown any issues.
- Reboot the server and keep it in maintenance mode.
- Monitor the server for 48 hours to see whether the issue recurs.
If you are unable to reseat the components, contact Dell Support or engage additional resources for assistance.
If the errors persist after the DIMMs are swapped and reseated, take the following actions:
- If DIMM errors follow the DIMM to a new slot, replace the DIMM.
- If DIMM errors stay with the same DIMM slot, replace the motherboard.
- If DIMM errors persist after DIMM and motherboard replacement, initiate a WebEx for live troubleshooting with Dell Support.
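The decision rules above can be written down as a small function, which may help when recording the outcome of each 48-hour monitoring window. This is an illustrative Python sketch only: the names Action and next_action and all parameter names are hypothetical, and the NONE_LISTED result covers the case where errors do not return, for which this procedure does not list an action.

```python
from enum import Enum


class Action(Enum):
    NONE_LISTED = "errors did not return; the procedure lists no further action"
    REPLACE_DIMM = "replace the DIMM"
    REPLACE_MOTHERBOARD = "replace the motherboard"
    ESCALATE = "initiate a WebEx for live troubleshooting with Dell Support"


def next_action(errors_returned, followed_dimm,
                dimm_replaced=False, motherboard_replaced=False):
    """Map the observations after a monitoring window to the listed action.

    errors_returned:      errors were seen again during monitoring
    followed_dimm:        True if errors moved with the DIMM to its new slot,
                          False if they stayed with the original slot
    dimm_replaced:        the DIMM has already been replaced
    motherboard_replaced: the motherboard has already been replaced
    """
    if not errors_returned:
        return Action.NONE_LISTED
    # Both replacements were tried and errors persist: escalate.
    if dimm_replaced and motherboard_replaced:
        return Action.ESCALATE
    if followed_dimm:
        return Action.REPLACE_DIMM
    return Action.REPLACE_MOTHERBOARD


if __name__ == "__main__":
    # Example: errors came back and stayed with the original slot.
    print(next_action(errors_returned=True, followed_dimm=False).value)
```

The escalation check comes first so that a server with both parts already replaced is never sent back through another replacement.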