What is DDR4 Self-healing on Dell PowerEdge Servers with Intel Xeon Scalable Processors

Summary: Correctable and uncorrectable memory errors on PowerEdge servers with DDR4, and the resulting changes to troubleshooting steps

Symptoms

What is DDR4 "self-healing" on Dell PowerEdge Servers with Intel Xeon Scalable Processors (first or second generation) with BIOS version 2.1.x or above?

How do these DDR4 "self-healing" capabilities (BIOS enhancements) change recommended customer and Technical Support actions when encountering memory errors on a server?

What are the "self-healing" enhancements in the newer BIOS versions?

Cause

The Dell PowerEdge BIOS receives ongoing enhancements to memory event messaging, error handling, and "self-healing" that runs when the server reboots. These enhancements can avoid the need for a scheduled maintenance window or an onsite visit to replace a DDR4 DIMM that was logging error events.

Resolution

Two main memory-related "self-healing" BIOS enhancements were implemented for PowerEdge servers with DDR4 running BIOS version 2.1.x and later. These enhancements change the recommended actions to take when memory events occur and are logged to the LifeCycle log.

Note:

If you encounter memory errors with DDR4 on BIOS 2.0 or earlier, update the BIOS to the latest revision, which includes many memory self-healing capabilities and ongoing enhancements. We always encourage customers to update to the latest available BIOS release (and iDRAC firmware) so that they can take advantage of the latest self-healing enhancements.
Previous memory troubleshooting steps included moving failing DIMMs to a different slot to confirm whether the errors follow the DIMM or remain with the DIMM slot. With BIOS 2.1.x or later, the first recommended step is to restart the server (without moving DIMMs to a different slot). This allows the new BIOS enhancements to run, potentially resolving (self-healing) the DIMM errors without scheduling any DIMM replacements.
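
The note above amounts to a simple check on the BIOS version. The following Python sketch illustrates it under stated assumptions: the function names and returned strings are hypothetical, and the BIOS version is assumed to be available as a dotted string such as "2.1.6". It is not part of any Dell tool.

```python
from typing import Tuple


def parse_bios_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '2.1.6' into (2, 1, 6)."""
    return tuple(int(part) for part in version.strip().split("."))


def first_troubleshooting_step(bios_version: str) -> str:
    """Return the first recommended step for DDR4 memory errors.

    BIOS 2.0 or earlier: update first.
    BIOS 2.1.x or later: reboot without moving DIMMs.
    """
    major_minor = parse_bios_version(bios_version)[:2]
    if major_minor < (2, 1):
        return "update BIOS and iDRAC firmware to the latest release"
    return "reboot the server without moving DIMMs"


if __name__ == "__main__":
    for version in ("2.0.3", "2.1.6", "2.10.0"):
        print(version, "->", first_troubleshooting_step(version))
```

Comparing integer tuples rather than raw strings keeps a version such as 2.10.0 ordered after 2.9.x.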

1. Memory retraining enhancements

Memory retraining, which happens during boot (early in the Configuring Memory steps), optimizes the signal timing and margining for each DIMM and slot for best access. The signal timing and margining characteristics of a DIMM may change over time for several reasons:

Changes in Server memory configuration
BIOS changes (Memory Reference Code - MRC)
Different operating temperatures of the server or DIMM
The general age of the DIMM

Previously, only a BIOS update or a detected memory configuration change resulted in memory retraining during the subsequent boot. Starting with BIOS 2.1.x, additional correctable and uncorrectable memory error "triggers" were added for scheduled retraining:

Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0005 - "Persistent correctable memory error limit reached for a memory device at location XX."

If any of these errors is logged in the SEL or LifeCycle log, memory retraining is scheduled for the next reboot (warm or cold). The BIOS automatically forces a cold reboot regardless of which type is initiated.

Critical - MEM0001 - "Multi-bit memory errors detected on memory device at location DIMM_XX."

This multi-bit error may result in the server rebooting due to a fatal error if the Operating System is unable to handle that error. Memory retraining automatically occurs during that boot. If the multi-bit error occurs in a noncritical memory location that the Operating System can handle, a reboot must be scheduled.

Memory retraining during POST may "self-heal" the failing DIMM and associated slot by optimizing the signal timing and margining. A DIMM replacement for these errors is not necessary unless memory retraining fails (UEFI0106) during boot or these same errors continue to occur.
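
The retraining rules in this section can be summarized as a small decision function. The following Python sketch is illustrative only: the function name, the `already_retrained` flag, and the returned strings are assumptions made for this example. Only the message IDs come from this article, and the input is assumed to be the message IDs logged for a single DIMM location.

```python
from typing import Iterable

# Message IDs that schedule (or lead to) memory retraining, per this article.
RETRAIN_TRIGGERS = {"MEM0701", "MEM0702", "MEM0005", "MEM0001"}
# Message ID logged when memory retraining itself fails during boot.
RETRAIN_FAILED = "UEFI0106"

REBOOT = "reboot to allow memory retraining"
REPLACE = "replace DIMM"
NO_ACTION = "no retraining-related action"


def retraining_action(message_ids: Iterable[str], already_retrained: bool = False) -> str:
    """Suggest an action for one DIMM location.

    already_retrained (hypothetical flag): True when the events were logged
    after a reboot in which retraining had already run.
    """
    ids = set(message_ids)
    if RETRAIN_FAILED in ids:
        return REPLACE  # retraining failed during boot
    if ids & RETRAIN_TRIGGERS:
        # The same errors continuing after retraining means self-healing did not help.
        return REPLACE if already_retrained else REBOOT
    return NO_ACTION


if __name__ == "__main__":
    print(retraining_action(["MEM0701"]))
    print(retraining_action(["MEM0701"], already_retrained=True))
```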

2. Post Package Repair (PPR)

The second "self-healing" memory enhancement is PPR. PPR repairs a failing memory location by disabling the location or address at the hardware layer and enabling a spare memory row to be used instead. The exact number of spare memory rows available depends on the DRAM device and DIMM size.

Previously, this functionality was limited to the manufacturing process. As with the memory retraining enhancements mentioned earlier, there are certain correctable memory errors that result in PPR being scheduled on a specific DIMM slot for the next reboot (warm or cold). BIOS automatically forces a cold reboot regardless of what is initiated. Since the PPR operation is scheduled on a specific DIMM slot, DO NOT change DIMM slot locations until the PPR operation has been run. Examples of the errors are:

Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0005 - "Persistent correctable memory error limit reached for a memory device at location XX."

Any of these events in the logs results in PPR being scheduled for the next reboot (warm or cold), where it runs early in the Configuring Memory phase.

Note: A Message ID MEM8000 (Correctable memory error logging disabled for a memory device at location DIMM_XX.), without a corresponding MEM0005/MEM0701/MEM0702 on the same DIMM location, does not result in a PPR being scheduled for the next reboot.

See the July 10, 2020 update in this article for changes to the MEM8000 event, and version 1.1 or newer of the white paper.

After the reboot, verify that the PPR operation was successfully performed. An example of a successful PPR operation is similar to:

MEM9060 - "The Post Package Repair operation is successfully completed on the Dual In-line Memory Module (DIMM) device that was failing earlier."

A DIMM replacement for these correctable memory errors is not necessary unless the PPR operation fails. An example of a critical PPR failure message is:

UEFI0278 - "Unable to complete the Post Package Repair (PPR) operation because of an issue in the DIMM memory slot X."
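
Verifying the PPR result after the reboot is a matter of looking for the success or failure message ID in the log. The sketch below is one hedged way to express that check in Python. The enum, the function name, and the assumption that the input contains only the message IDs logged for one DIMM slot after the reboot are all hypothetical. MEM0804 and MEM0805 are included because this article lists them as the replacements for MEM9060 and UEFI0278 in BIOS 2.5.x and newer.

```python
from enum import Enum
from typing import Iterable


class PprOutcome(Enum):
    SUCCEEDED = "PPR succeeded, no DIMM replacement needed"
    FAILED = "PPR failed, replace the DIMM"
    NOT_REPORTED = "no PPR result found in the log"


# Success and failure message IDs named in this article.
PPR_SUCCESS_IDS = {"MEM9060", "MEM0804"}
PPR_FAILURE_IDS = {"UEFI0278", "MEM0805"}


def ppr_outcome(post_reboot_ids: Iterable[str]) -> PprOutcome:
    """Classify the PPR result from message IDs logged after the reboot."""
    ids = set(post_reboot_ids)
    # Design choice for this sketch: a failure ID wins over a success ID.
    if ids & PPR_FAILURE_IDS:
        return PprOutcome.FAILED
    if ids & PPR_SUCCESS_IDS:
        return PprOutcome.SUCCEEDED
    return PprOutcome.NOT_REPORTED


if __name__ == "__main__":
    print(ppr_outcome(["MEM9060"]).value)
    print(ppr_outcome(["UEFI0278"]).value)
```

Treating a failure ID as decisive is a conservative choice made for this example, not documented BIOS behavior.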

A white paper (version 1.0) describing the memory-related Reliability, Availability, and Serviceability (RAS) features and capabilities of PowerEdge servers is available: Memory Errors and Dell EMC PowerEdge YX4X Server Memory RAS Features.

For more information about correctable error threshold events, reference 14G Intel and 15G Intel/AMD PowerEdge servers: DDR4 memory: managing Correctable error threshold events.

Updated April 24, 2020

Dell is continuing to enhance our "self-healing" capabilities. The following section lists the updates and enhancements associated with the different BIOS versions.

BIOS 2.1.x - Initial article publication of the "self-healing" capabilities available starting with BIOS 2.1.6 and higher, including example error messages and recommended actions.

BIOS 2.4.x and newer changes (December 2019)

MEM0702 (Correctable error rate exceeded…) - Message severity changed from critical to warning. The recommended action was updated to reboot the server to allow "self-healing" (for example, Post Package Repair) to occur.
- iDRAC firmware from December 2019 or newer must also be installed to get the updated message
- Recommended Action: Reboot the server to allow PPR to run
MEM9060 - Message description updated to indicate "self-healing" was successfully completed

BIOS 2.5.x and newer changes (February 2020)

A "Correctable Error Logging" BIOS option was added to allow customers to disable all LifeCycle/SEL logging related to correctable errors. All the "self-healing" features continue to function; for example, PPR and memory retraining are still scheduled and run during the next reboot (early in the Configuring Memory process).
Addition of MEM08xx errors for RDIMMs and LRDIMMs replacing existing error messages and actions. Existing error messages are still used for platforms that do not support the "self-healing" capabilities.
- February 2020 or newer iDRAC is required for the new messages to be logged.

Note: Without the updated iDRAC, new BIOS messages are "unknown" in the SEL or LifeCycle logs.

MEM0802 - Replaced MEM0702 - correctable error rate exceeded
- Recommended Action: Reboot the server to allow PPR to run. Confirm that PPR was successful (MEM0804)
MEM0804 - Replaced MEM9060 indicating PPR was successful. Now includes DIMM slot location that ran PPR
- Recommended Action: None. This event indicates that "self-healing" occurred; no DIMM replacement is needed.
MEM0805 - Replaced UEFI0278 indicating PPR failed
- Recommended Action: Replace failing DIMM
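
Because older and newer BIOS or iDRAC combinations log different IDs for the same condition, a script that processes exported logs may want to normalize them first. The following Python sketch shows one way to do that using only the replacements and recommended actions listed above. The dictionary and function names are hypothetical, and nothing here queries a real iDRAC.

```python
# Legacy message ID -> MEM08xx replacement, as listed in this article.
LEGACY_TO_MEM08XX = {
    "MEM0702": "MEM0802",   # correctable error rate exceeded
    "MEM9060": "MEM0804",   # PPR was successful
    "UEFI0278": "MEM0805",  # PPR failed
}

# Recommended actions for the MEM08xx IDs, as listed in this article.
RECOMMENDED_ACTION = {
    "MEM0802": "Reboot the server to allow PPR to run",
    "MEM0804": "None",
    "MEM0805": "Replace failing DIMM",
}


def normalize_message_id(message_id: str) -> str:
    """Map a legacy ID to its MEM08xx replacement; other IDs pass through."""
    cleaned = message_id.strip().upper()
    return LEGACY_TO_MEM08XX.get(cleaned, cleaned)


def recommended_action(message_id: str) -> str:
    """Look up the action for an ID, accepting legacy or new forms."""
    return RECOMMENDED_ACTION.get(normalize_message_id(message_id), "Not covered by this sketch")


if __name__ == "__main__":
    for mid in ("MEM0702", "mem9060", "UEFI0278", "MEM0001"):
        print(mid, "->", normalize_message_id(mid), "|", recommended_action(mid))
```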

Updated July 10, 2020

BIOS 2.7.x and newer changes (July 2020 block BIOS - targeted mid-July for web posting)

MEM8000 (Correctable error logging disabled) - Starting with BIOS ~2.0.x, Dell Engineering made a BIOS change to enhance the rate of correctable error detection that may impact performance. This change resulted in an uptick in MEM8000 events that were not substantiated by results from DIMM failure analysis. Starting with BIOS 2.7.x there are two changes related to MEM8000. The first is that signaling of the MEM8000 event has been modified. Second, BIOS schedules self-healing (PPR) for the next reboot. iDRAC messages are not yet updated to reflect the new actions.
- Recommended Action: Reboot the server to allow self-healing/PPR to run. Confirm that PPR was successful (MEM0804).
MEM0001 (Uncorrectable error) - Results in self-healing (PPR) being scheduled for the next reboot. iDRAC messages are not yet updated to reflect the new actions.
- Recommended Action: None is needed if the MEM0001 is associated with a critical page that the Operating System is unable to recover; this is still a fatal error that results in a reboot. If the MEM0001 is associated with a noncritical page that the Operating System can recover from, a reboot must be scheduled to allow self-healing (PPR) to occur. Confirm that PPR was successful (MEM0804).

Updated January 13, 2021

BIOS 2.8.2 and newer changes (September 2020 block BIOS)

MEM9072 (Uncorrectable error identified by the memory patrol scrub process; the page is not consumed or in use) - Results in self-healing (PPR) being scheduled for the next reboot. iDRAC messages are not yet updated to reflect the new actions.
- Recommended Action: Schedule a reboot soon. Delaying the reboot could result in the page being consumed resulting in a MEM0001 error that could result in a reboot occurring. Memory self-healing (PPR) runs during that reboot. Confirm that PPR was successful (MEM0804).
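
The BIOS updates above change which events schedule PPR. As a rough summary, the Python sketch below records the earliest BIOS version at which this article says each message ID schedules PPR, and checks a given version against it. The table and function names are hypothetical, the version thresholds are read from this article only (2.1.x is taken as 2.1.0, and so on), and actual behavior may vary by platform and iDRAC firmware.

```python
from typing import Tuple

# Earliest BIOS version at which each event schedules PPR, per this article.
PPR_TRIGGER_SINCE = {
    "MEM0701": (2, 1, 0),
    "MEM0702": (2, 1, 0),
    "MEM0005": (2, 1, 0),
    "MEM0802": (2, 5, 0),  # replaced MEM0702 in BIOS 2.5.x
    "MEM8000": (2, 7, 0),
    "MEM0001": (2, 7, 0),
    "MEM9072": (2, 8, 2),
}


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse '2.7' or '2.8.2' into a three-part tuple, padding with zeros."""
    parts = [int(p) for p in version.strip().split(".")]
    parts += [0] * (3 - len(parts))
    return (parts[0], parts[1], parts[2])


def schedules_ppr(message_id: str, bios_version: str) -> bool:
    """Return True if this article indicates the event schedules PPR on that BIOS."""
    since = PPR_TRIGGER_SINCE.get(message_id)
    return since is not None and parse_version(bios_version) >= since


if __name__ == "__main__":
    for mid, ver in (("MEM8000", "2.5.4"), ("MEM8000", "2.7.7"), ("MEM9072", "2.8.2")):
        print(mid, ver, schedules_ppr(mid, ver))
```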

Note: The latest version of the Engineering white paper (version 1.3 - issue date November 20, 2020) is found at: https://downloads.dell.com/manuals/common/dellemc_poweredge_yx4x_memoryras.pdf
For Intel Xeon E and AMD EPYC content, continue to reference the original Engineering white paper (version 1.0) which is found at: PowerEdge YX4X Server Memory RAS Whitepaper v1.0 (dell.com)

There are additional RAS feature enhancements being evaluated for inclusion in future BIOS updates.

Note: For detailed descriptions and recommended actions for specific error code messages, reference the following link: Look Up (dell.com). Since error codes (such as MEM0001) apply to multiple generations of servers and platforms, the recommended actions may not be current for the particular BIOS version. The new error codes that have been added (such as MEM0802, MEM0804, MEM0805, and so on) only apply to servers with Intel Xeon Scalable Processors (first or second generation).

This article will be updated as new information becomes available.

See Also: Guidance on troubleshooting memory by swap testing - Troubleshooting memory errors on PowerEdge systems by swap testing

Downloads and Drivers: Drivers & Downloads | Dell US

Affected Products

Dell EMC XC Series XC6420 Appliance, Dell EMC XC Core 6420 System, OEMR R240, OEMR R340, OEMR R740xd2, OEMR T140, OEMR T340, OEMR XL R240, OEMR XL R340, PowerEdge C6420, PowerEdge FC640, PowerEdge M640, PowerEdge MX740C, PowerEdge R240, PowerEdge R340, PowerEdge R440, PowerEdge R540, PowerEdge R640, PowerEdge R740, PowerEdge R740XD, PowerEdge R740XD2, PowerEdge R940, PowerEdge T140, PowerEdge T340, PowerEdge T440, Dell EMC vSAN C6420 Ready Node ...

Products

VxRail 460 and 470 Nodes, VxRail E560F, VxRail P570, VxRail P570F, VxRail S570, VxRail V570F

Article Number: 000053203

Article Type: Solution

Last Modified: April 19, 2024

Version: 15
