What is DDR4 "self-healing"?
How do these DDR4 "self-healing" capabilities (BIOS enhancements) change the recommended customer and Technical Support actions when encountering memory errors on a server?
There are two main memory-related "self-healing" BIOS enhancements that were implemented for PowerEdge Servers with DDR4 running BIOS version 2.1.x and newer. These enhancements do change the recommended steps/actions to take if memory errors occur and are logged in vCenter, VxFM, dial home or in the LifeCycle log.
Note: If you are getting memory errors with DDR4 and you are running a BIOS version older than 2.1.x, update your BIOS to the latest revision to include the memory self-healing enhancements. Then reboot your node to continue with Post Package Repair (PPR). See the Resolution section for more details.
Note: Current memory troubleshooting steps incorporate moving failing DIMMs to a different slot to confirm whether or not the errors follow the DIMM or remain with the DIMM slot.
If the 13G node is running BIOS 2.8.x or higher, the first recommended step is a reboot/restart (without moving DIMMs to a different slot). This allows the new BIOS enhancements to run, potentially resolving (self-healing) the DIMM errors without the need for any DIMM replacements.
If the 14G node is running BIOS version 2.4.8 or higher, the first recommended step is a reboot/restart (without moving DIMMs to a different slot). This allows the new BIOS enhancements to run, potentially resolving (self-healing) the DIMM errors without the need for any DIMM replacements.
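The version check in the two paragraphs above can be sketched in Python. This is only an illustration, not a Dell tool: the names `MIN_BIOS`, `parse_version`, and `reboot_first` are hypothetical, the dotted version-string format is an assumption, and the minimum versions are copied from the preceding two paragraphs.

```python
# Hypothetical helper; not part of any Dell utility.
# Minimums copied from the text above: 2.8.x for 13G, 2.4.8 for 14G.
MIN_BIOS = {"13G": (2, 8), "14G": (2, 4, 8)}


def parse_version(version):
    """Turn a dotted version string such as '2.10.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def reboot_first(generation, bios_version):
    """Return True if a reboot (without moving DIMMs) is the first step."""
    return parse_version(bios_version) >= MIN_BIOS[generation]


print(reboot_first("13G", "2.10.0"))  # True
print(reboot_first("14G", "2.4.7"))   # False
```

Comparing tuples of integers rather than raw strings avoids ranking "2.10" below "2.8".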
1. Memory retraining - Upgrade the BIOS to 2.8.x or higher (13G) or 2.1.x or higher (14G) to enable the memory retraining enhancements for servers with DDR4 RAM installed. Memory retraining, which happens during boot, optimizes the signal timing/margining for each DIMM/slot for best access. Timing characteristics of a DIMM may change for several reasons, including but not limited to:
1. Changes in Server memory configuration
2. BIOS changes
3. Different operating temperatures of the Server or DIMM
4. The general age of the DIMM
Previously, only BIOS updates or detected memory configuration changes resulted in memory retraining during the subsequent boot. Starting with BIOS 2.1.x (14G) and 2.8.x (13G), additional correctable and uncorrectable memory error "triggers" were added for scheduled retraining:
Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0005 - "Persistent correctable memory error limit reached for a memory device at location(s) XX."
Any of the above errors logged in the VC events/dial home/SEL/LifeCycle logs results in memory retraining being scheduled for the next reboot (warm or cold). BIOS automatically forces a cold reboot regardless of what is initiated.
Critical - MEM0001 - "Multi-bit memory errors detected on memory device at location(s) DIMM_XX."
MEM0001 results in the server rebooting due to the fatal error. Memory retraining automatically occurs during that boot.
With either of these correctable or uncorrectable (multibit) memory errors, the resulting memory retraining on reboot/restart may "self-heal" the failing DIMM by optimizing the signal timing/margining for each DIMM/slot. A DIMM replacement for these errors is not necessary unless memory retraining fails (UEFI0106) during boot or these same errors continue to occur.
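The retraining logic described above can be summarized as a small lookup. The following Python sketch is illustrative only: the message IDs come from the text, while the function name `retraining_action`, the returned strings, and the order in which conditions are checked are assumptions.

```python
# Message IDs are taken from the text above; everything else is hypothetical.
SCHEDULES_RETRAINING = {"MEM0701", "MEM0702", "MEM0005"}  # correctable errors
FATAL_MULTIBIT = "MEM0001"      # server reboots; retraining runs during that boot
RETRAINING_FAILED = "UEFI0106"  # memory retraining failed during boot


def retraining_action(message_ids):
    """Suggest a next step from a collection of logged message IDs."""
    ids = set(message_ids)
    if RETRAINING_FAILED in ids:
        # Retraining itself failed, so self-healing is not possible.
        return "replace DIMM"
    if FATAL_MULTIBIT in ids:
        return "retraining ran during the fatal-error reboot; watch for recurrence"
    if ids & SCHEDULES_RETRAINING:
        return "reboot to run the scheduled retraining"
    return "no retraining trigger logged"
```

For example, `retraining_action(["MEM0701"])` suggests a reboot, while any log set containing UEFI0106 points to a DIMM replacement.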
2. Post Package Repair (PPR) - The second "self-healing" memory enhancement repairs a failing memory location on a DIMM by disabling the location/address at the hardware layer and enabling a spare memory row to be used instead. The exact number of spare memory rows available depends on the DRAM device and DIMM size.
Previously, this functionality was limited to the manufacturing process. Just like with the memory retraining enhancements mentioned earlier, there are certain correctable memory errors that result in PPR being scheduled on a specific DIMM slot for the next reboot (warm or cold). BIOS automatically forces a cold reboot regardless of what is initiated. Since the PPR operation is scheduled on a specific DIMM slot, DO NOT change DIMM slot locations until the PPR operation is run. Examples of the errors are:
Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0005 - "Persistent correctable memory error limit reached for a memory device at location(s) XX."
Any of the above errors being logged in the VC events/dial home/SEL/LifeCycle log results in Post Package Repair being scheduled for the next reboot (warm or cold).
After the reboot, verify that the PPR operation was successfully performed. An example of a successful PPR operation is similar to:
Message ID MEM9060 - "The PostPackage Repair operation is successfully completed on the Dual In-line Memory Module (DIMM) device that was failing earlier."
A DIMM replacement for these correctable memory errors is not necessary unless the PPR operation fails after the reboot. An example of a failing PPR message is:
Critical - Message ID UEFI0278 - "Unable to complete the Post Package Repair (PPR) operation because of an issue in the DIMM memory slot X."
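The post-reboot check can be sketched in Python as follows. This is a minimal illustration that assumes the message IDs logged after the reboot have already been collected into a list. The function name `ppr_result` and its return values are hypothetical, and MEM9060 and UEFI0278 are only the example messages given above, so other PPR-related IDs may exist.

```python
# Example message IDs from the text above; names and return values are hypothetical.
PPR_SUCCESS = "MEM9060"   # PPR completed successfully
PPR_FAILURE = "UEFI0278"  # PPR could not be completed


def ppr_result(post_reboot_ids):
    """Classify the PPR outcome from message IDs logged after the reboot."""
    ids = set(post_reboot_ids)
    if PPR_FAILURE in ids:
        return "failed"     # DIMM replacement is warranted
    if PPR_SUCCESS in ids:
        return "succeeded"  # no DIMM replacement needed
    return "unknown"        # neither example message found; review the logs


print(ppr_result(["MEM9060"]))   # succeeded
print(ppr_result(["UEFI0278"]))  # failed
```

A failure message is checked first so that a log containing both IDs is treated conservatively as a failed repair.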