VxFlex-IR: PowerEdge DIMM ECC correctable memory errors

Summary: A Dell 13G/14G server is posting MEMXXXX errors in the iDRAC event log. This event may result in the node hanging or in a Machine Check Exception. What should you do?


Symptoms

You have a 13G or 14G server reporting MEM errors in the iDRAC event log.

Cause

ECC memory errors are in most cases caused by random alpha particle bombardment. Alpha particles are part of the normal background radiation that occurs every day. On occasion an alpha particle knocks a single electron off a memory module, corrupting the data. Modern memory modules are designed to recognize these events and repair them. Each module keeps an internal counter of how many times it has repaired a memory error. A threshold is set in the BIOS; when the counter reaches it, the server is alerted that the number of memory events has exceeded that threshold.

Note: If message ID MEM8000 (Correctable memory error logging disabled for a memory device at location DIMM_XX) appears in isolation (i.e., not in a similar time frame to any corresponding MEM0005/MEM0701/MEM0702 messages), it does not result in a PPR being scheduled for the next reboot.

Message ID MEM8000 in isolation, or with a corresponding MCE (machine check exception), indicates a general failure of the DIMM module; it is not a situation where the correctable or uncorrectable buckets initially overflow. This type of memory event should be treated as a DIMM failure, and the listed DIMM module should be replaced at the customer's earliest convenience.
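The "in isolation" check described in the note above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event tuple layout, the one-hour window standing in for "a similar time frame", and the requirement that the companion message refer to the same DIMM location are all hypothetical choices, not values from Dell documentation.

```python
from datetime import datetime, timedelta

# Message IDs that, per the note above, accompany a scheduled PPR.
PPR_TRIGGER_IDS = {"MEM0005", "MEM0701", "MEM0702"}


def isolated_mem8000(events, window=timedelta(hours=1)):
    """Return DIMM locations with a MEM8000 event and no companion event.

    `events` is a list of (timestamp, message_id, location) tuples.
    The tuple layout and the default one-hour window are assumptions
    made for this sketch.
    """
    isolated = []
    for ts, msg_id, loc in events:
        if msg_id != "MEM8000":
            continue
        # Look for a PPR-trigger message on the same DIMM close in time.
        has_companion = any(
            other_id in PPR_TRIGGER_IDS
            and other_loc == loc
            and abs(other_ts - ts) <= window
            for other_ts, other_id, other_loc in events
        )
        if not has_companion and loc not in isolated:
            isolated.append(loc)
    return isolated
```

Any location returned by this function would fall under the "treat as a DIMM failure" guidance above, provided the assumed window matches how the logs are actually being read.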

Resolution

What is DDR4 "self-healing"?
How do these DDR4 "self-healing" capabilities (BIOS enhancements) change the recommended customer and Technical Support actions when encountering memory errors on a server?

There are two main memory-related "self-healing" BIOS enhancements that were implemented for PowerEdge Servers with DDR4 running BIOS version 2.1.x and newer. These enhancements do change the recommended steps/actions to take if memory errors occur and are logged in vCenter, VxFM, dial home or in the LifeCycle log.

Note: If you are getting memory errors with DDR4 and you are running a BIOS version older than 2.1.x, update the BIOS to the latest revision to include the memory self-healing enhancements. Then reboot the node to continue with Post Package Repair (PPR). See the Resolution section for more details.

Note: Current memory troubleshooting steps incorporate moving failing DIMMs to a different slot to confirm whether or not the errors follow the DIMM or remain with the DIMM slot.

If the 13G node is running BIOS 2.8.x or higher, the first recommended step is a reboot/restart (without moving DIMMs to a different slot). This allows the new BIOS enhancements to run, potentially resolving (self-healing) the DIMM errors without the need for any DIMM replacements.

If the 14G node is running BIOS version 2.4.8 or higher, the first recommended step is a reboot/restart (without moving DIMMs to a different slot). This allows the new BIOS enhancements to run, potentially resolving (self-healing) the DIMM errors without the need for any DIMM replacements.
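As a rough illustration of the version check in the two paragraphs above, the following Python sketch compares a dotted BIOS version string against the quoted minimums. The function and dictionary names are hypothetical, and "2.8.x" is interpreted here as "any 2.8 release", which is an assumption.

```python
# Minimum BIOS versions quoted above for the "reboot first" recommendation.
# "2.8.x" for 13G is read as "any 2.8 release" (an assumption).
MIN_BIOS_FOR_REBOOT_FIRST = {"13G": (2, 8), "14G": (2, 4, 8)}


def parse_version(text):
    """Turn a dotted version string such as '2.4.8' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def reboot_is_first_step(generation, bios_version):
    """Return True if a reboot (without moving DIMMs) is the first step."""
    minimum = MIN_BIOS_FOR_REBOOT_FIRST[generation]
    # Tuples compare element by element, so (2, 10, 0) >= (2, 8) is True.
    return parse_version(bios_version) >= minimum
```

Comparing tuples of integers rather than raw strings avoids the common mistake of treating "2.10.0" as older than "2.8.0".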

1. Memory retraining - Upgrade the BIOS to 2.8.x or higher (13G) or 2.1.x or higher (14G) to enable the memory retraining enhancements for servers with DDR4 RAM installed. Memory retraining, which happens during boot, optimizes the signal timing/margining for each DIMM/slot for best access. Timing characteristics of a DIMM may change for several different reasons.

Examples include but are not limited to:
1. Changes in Server memory configuration
2. BIOS changes
3. Different operating temperatures of the Server or DIMM
4. The general age of the DIMM

Previously, only a BIOS update or a detected memory configuration change would have resulted in memory retraining occurring during the subsequent boot. Starting with BIOS 2.1.x (14G) and 2.8.x (13G), additional correctable and uncorrectable memory error "triggers" were added for scheduled retraining:

Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0005 - "Persistent correctable memory error limit reached for a memory device at location(s) XX."

Any of the above errors logged in the VC events/dial home/SEL/LifeCycle logs results in memory retraining being scheduled for the next reboot (warm or cold). The BIOS automatically forces a cold reboot regardless of what is initiated.

Critical - MEM0001 - "Multi-bit memory errors detected on memory device at location(s) DIMM_XX."

MEM0001 results in the server rebooting due to the fatal error. Memory retraining automatically occurs during that boot.

With either of these correctable or uncorrectable (multibit) memory errors, the resulting memory retraining on reboot/restart may "self-heal" the failing DIMM by optimizing the signal timing/margining for each DIMM/slot. A DIMM replacement for these errors is not necessary unless memory retraining fails (UEFI0106) during boot or these same errors continue to occur.
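The retraining behaviour described above can be summarized as a small lookup, sketched here in Python. The message IDs come from this article; the function name and the returned strings are hypothetical and only paraphrase the guidance.

```python
# Correctable-error IDs that schedule retraining for the next reboot.
SCHEDULES_RETRAINING = {"MEM0701", "MEM0702", "MEM0005"}
# Fatal multi-bit error: the server reboots and retrains during that boot.
FATAL_MULTIBIT = "MEM0001"
# Logged when memory retraining fails during boot.
RETRAINING_FAILED = "UEFI0106"


def retraining_action(message_id):
    """Map a logged message ID to the retraining guidance in this article."""
    if message_id in SCHEDULES_RETRAINING:
        return "reboot to run the scheduled retraining"
    if message_id == FATAL_MULTIBIT:
        return "retraining runs during the automatic reboot"
    if message_id == RETRAINING_FAILED:
        return "retraining failed: DIMM replacement may be needed"
    return "no retraining action"
```

The sketch does not cover the case where the same errors keep recurring after a successful retraining, which the text above also treats as grounds for replacement.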

2. Post Package Repair (PPR) - The second "self-healing" memory enhancement repairs a failing memory location on a DIMM by disabling the location/address at the hardware layer and enabling a spare memory row to be used instead. The exact number of spare memory rows available depends on the DRAM device and DIMM size.

Previously, this functionality was limited to the manufacturing process. As with the memory retraining enhancements mentioned earlier, certain correctable memory errors result in PPR being scheduled on a specific DIMM slot for the next reboot (warm or cold). The BIOS automatically forces a cold reboot regardless of what is initiated. Since the PPR operation is scheduled on a specific DIMM slot, DO NOT change DIMM slot locations until the PPR operation has run. Examples of the errors are:

Warning - MEM0701 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0702 - "Correctable memory error rate exceeded for DIMM_XX."
Critical - MEM0005 - "Persistent correctable memory error limit reached for a memory device at location(s) XX."

Any of the above errors being logged in the VC events/ Dial home/SEL/LifeCycle log results in Post Package Repair being scheduled for the next reboot (warm or cold).

After the reboot, verify that the PPR operation was successfully performed. An example of a successful PPR operation is similar to:

Message ID MEM9060 - "The PostPackage Repair operation is successfully completed on the Dual In-line Memory Module (DIMM) device that was failing earlier."

A DIMM replacement for these correctable memory errors is not necessary unless the PPR operation fails after the reboot. An example of a failing PPR message is:
Critical - Message ID UEFI0278 - "Unable to complete the Post Package Repair (PPR) operation because of an issue in the DIMM memory slot X."
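The post-reboot verification can be expressed as a simple check over the message IDs logged after the reboot. This Python sketch is illustrative only: the function name and return values are hypothetical, and how the message IDs are collected from the log is left out.

```python
PPR_SUCCESS_ID = "MEM9060"   # PPR completed successfully
PPR_FAILURE_ID = "UEFI0278"  # PPR could not be completed


def ppr_outcome(post_reboot_ids):
    """Classify the PPR result from message IDs logged after the reboot.

    `post_reboot_ids` is any iterable of message ID strings; gathering
    them from the event log is outside the scope of this sketch.
    """
    ids = set(post_reboot_ids)
    # A failure message takes priority: it means replacement is warranted.
    if PPR_FAILURE_ID in ids:
        return "failed"
    if PPR_SUCCESS_ID in ids:
        return "succeeded"
    return "unknown"
```

A result of "failed" corresponds to the DIMM replacement case above, while "unknown" suggests that the PPR messages should be looked for again before any further action is taken.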

Additional Information

Affected Products

VxFlex Product Family

Products

VxFlex Product Family

Article Number: 000058157

Article Type: Solution

Last Modified: 15 Apr 2021

Version: 4
