The Dell PowerEdge BIOS continues to receive improvements to memory event messaging, error handling, and "self-healing" that occurs upon a server reboot. These improvements can remove the need for a scheduled maintenance window or onsite presence to replace a DDR4 DIMM that was logging error events.
Two main memory-related "self-healing" BIOS enhancements were implemented for PowerEdge servers with DDR4 memory running BIOS version 2.1.x and later. These enhancements change the recommended steps or actions to take if memory events occur and are logged to the Lifecycle log.
Memory retraining, which happens during boot (early in the Configuring Memory steps), optimizes the signal timing and margining for each DIMM and slot for best access. The signal timing and margining characteristics of a DIMM may change over time for several different reasons.
Previously, a BIOS update or a detected memory configuration change resulted in memory retraining during the subsequent boot. Starting with BIOS 2.1.x, additional correctable and uncorrectable memory error "triggers" were added for scheduled retraining:
Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0005 - "Persistent correctable memory error limit reached for a memory device at location XX."
Any of these errors being logged in the SEL or Lifecycle logs results in memory retraining being scheduled for the next reboot (warm or cold). BIOS automatically forces a cold reboot regardless of which type of reboot is initiated.
Critical - MEM0001 - "Multi-bit memory errors detected on memory device at location DIMM_XX."
This multi-bit error may result in the server rebooting due to a fatal error if the operating system is unable to handle that error. Memory retraining automatically occurs during that boot. If the multi-bit error occurs in a noncritical memory location that the operating system can handle, a reboot must be scheduled.
Memory retraining during POST may "self-heal" the failing DIMM and associated slot by optimizing the signal timing and margining. A DIMM replacement for these errors is not necessary unless memory retraining fails (UEFI0106) during boot or these same errors continue to occur.
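The retraining guidance above can be summarized as a small lookup. The following is a minimal Python sketch, not a Dell tool: the function name `retraining_action`, its `recurred_after_retraining` flag, and the wording of the returned strings are assumptions made for illustration. Only the message IDs and the decisions they map to come from this article.

```python
# Hypothetical helper: the names and return strings are illustrative assumptions.
# The message IDs and the decisions they map to are the ones listed in this article.
RETRAIN_ON_NEXT_REBOOT = {"MEM0701", "MEM0702", "MEM0005"}  # correctable error triggers
MULTI_BIT_ERROR = "MEM0001"                                  # uncorrectable error trigger
RETRAIN_FAILED = "UEFI0106"                                  # retraining failed during boot


def retraining_action(message_id: str, recurred_after_retraining: bool = False) -> str:
    """Return a suggested next step for one logged memory event."""
    message_id = message_id.strip().upper()
    if message_id == RETRAIN_FAILED:
        return "consider DIMM replacement: memory retraining failed during boot"
    if message_id in RETRAIN_ON_NEXT_REBOOT or message_id == MULTI_BIT_ERROR:
        if recurred_after_retraining:
            return "consider DIMM replacement: the same error continued after retraining"
        if message_id == MULTI_BIT_ERROR:
            return "retraining runs on the next boot; schedule a reboot if the OS handled the error"
        return "reboot the server: retraining is scheduled for the next boot"
    return "not covered by this sketch"


print(retraining_action("MEM0702"))
print(retraining_action("MEM0702", recurred_after_retraining=True))
```

The sketch treats DIMM replacement as the exception, matching the guidance that replacement is not necessary unless retraining fails or the same errors continue to occur.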
The second "self-healing" memory enhancement is Post Package Repair (PPR). PPR repairs a failing memory location by disabling the location or address at the hardware layer, enabling a spare memory row to be used instead. The exact number of spare memory rows available depends on the DRAM device and DIMM size.
Previously, this functionality was limited to the manufacturing process. As with the memory retraining enhancements mentioned earlier, there are certain correctable memory errors that result in PPR being scheduled on a specific DIMM slot for the next reboot (warm or cold). BIOS automatically forces a cold reboot regardless of what is initiated. Since the PPR operation is scheduled on a specific DIMM slot, DO NOT change DIMM slot locations until the PPR operation has been run. Examples of the errors are:
Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0005 - "Persistent correctable memory error limit reached for a memory device at location XX."
Any of these events in the logs results in PPR being scheduled for the next reboot (warm or cold), where it runs early in the Configuring Memory phase.
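Because PPR is scheduled against a specific DIMM slot, it can help to track which slots must stay in place until the next reboot. The following is a minimal Python sketch under stated assumptions: events are supplied as `(message_id, slot)` pairs collected since the last reboot, the slot names such as `A1` are made up for illustration, and the function names are hypothetical rather than part of any Dell tool.

```python
# Hypothetical sketch: the input shape, slot names, and function names are assumptions.
# The trigger message IDs are the ones listed in this article.
PPR_TRIGGER_IDS = {"MEM0701", "MEM0702", "MEM0005"}


def slots_pending_ppr(events):
    """Return the sorted slot names that have a PPR trigger logged since the last reboot."""
    return sorted({slot for message_id, slot in events if message_id in PPR_TRIGGER_IDS})


def safe_to_move(slot, events):
    """A DIMM should stay in its slot while PPR is still pending for that slot."""
    return slot not in slots_pending_ppr(events)


events = [
    ("MEM0701", "A1"),
    ("MEM0001", "B2"),  # multi-bit error: not one of the PPR examples listed above
    ("MEM0005", "A1"),
    ("MEM0702", "B3"),
]
print(slots_pending_ppr(events))
print(safe_to_move("A1", events))
```

With the sample events, the sketch reports slots `A1` and `B3` as pending, so the DIMMs in those slots should not be relocated before the reboot.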
After the reboot, verify that the PPR operation was successfully performed. An example of a successful PPR operation is similar to:
MEM9060 - "The Post Package Repair operation is successfully completed on the Dual In-line Memory Module (DIMM) device that was failing earlier."
An example of a failed PPR operation is similar to:
UEFI0278 - "Unable to complete the Post Package Repair (PPR) operation because of an issue in the DIMM memory slot X."
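The post-reboot verification can be sketched as a scan of log text for the two message IDs above. This is a minimal Python sketch, assuming the log entries are already available as plain text lines that contain the message ID. How the lines are exported is outside its scope, and the function name `ppr_outcome` is hypothetical.

```python
import re

# Matches message IDs of the form used in this article, for example MEM9060 or UEFI0278.
MESSAGE_ID = re.compile(r"\b(?:MEM|UEFI)\d{4}\b")


def ppr_outcome(log_lines):
    """Classify the PPR result from log lines: 'failed', 'succeeded', or 'not found'."""
    ids = {match.group(0) for line in log_lines for match in MESSAGE_ID.finditer(line)}
    # Design choice for this sketch: report a failure first if both IDs are present,
    # so that a failed repair on one slot is not hidden by a success on another.
    if "UEFI0278" in ids:
        return "failed"
    if "MEM9060" in ids:
        return "succeeded"
    return "not found"


lines = [
    'MEM9060 - "The Post Package Repair operation is successfully completed on the '
    'Dual In-line Memory Module (DIMM) device that was failing earlier."',
]
print(ppr_outcome(lines))
```

A result of `not found` only means that neither message appeared in the supplied lines. It does not by itself show whether PPR ran.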
A whitepaper (version 1.0) describing the memory-related Reliability, Availability, and Serviceability (RAS) features and capabilities of Dell PowerEdge servers is now available: Memory Errors and Dell EMC PowerEdge YX4X Server Memory RAS Features.
For more information about correctable error threshold events, reference 14G Intel and 15G Intel/AMD PowerEdge servers: DDR4 memory: managing Correctable error threshold events.
Updated April 24, 2020
Dell is continuing to enhance our "self-healing" capabilities. The following section lists the updates and enhancements associated with the different BIOS versions.
BIOS 2.1.x - Initial article publication of the "self-healing" capabilities available starting with BIOS 2.1.6 and higher, including example error messages and recommended actions.
BIOS 2.4.x and newer changes (December 2019)
BIOS 2.5.x and newer changes (February 2020)
Updated July 10, 2020
BIOS 2.7.x and newer changes (July 2020 block BIOS - targeted mid-July for web posting)
Updated January 13, 2021
BIOS 2.8.2 and newer changes (September 2020 block BIOS)
There are additional RAS feature enhancements being evaluated for inclusion in future BIOS updates.
This article will be updated as new information becomes available.
See Also: Guidance on troubleshooting memory by swap testing - Troubleshooting memory errors on PowerEdge systems by swap testing
Downloads and Drivers: Drivers & Downloads | Dell US