NOTE: This article does not apply to newer systems with Intel Xeon Scalable Processors. For those systems, see the article "What is DDR4 Self-healing on Dell PowerEdge Servers with Intel Xeon Scalable Processors."
Troubleshooting memory errors on PowerEdge systems by swap testing
When a single-bit error (SBE), a multi-bit error (MBE), or both are reported on one or more DIMM locations, the DIMM itself might not be the cause, so some simple troubleshooting must be performed to determine exactly where the fault lies. See Figure 1 for an example of memory errors appearing in the iDRAC interface on an R715.
Figure 1: Memory errors as displayed in iDRAC 6 logs (English Only)
Isolating memory issues means swapping DIMMs into different memory sockets, channels, banks, and controllers. There are several ways to swap the DIMMs around to narrow down the fault, and you might have to use more than one of them to pinpoint the faulty DIMM or socket. These methods are illustrated below. To keep the explanation straightforward, we assume that the faulty DIMM is A1 or is in one of the sets marked in blue in the images.
Swapping DIMMs in groups (by Channel or Bank) rather than individually is the best method to identify the failed DIMM or DIMMs.
Once a group of DIMMs has been identified to contain the failed DIMM or DIMMs, then moving single DIMMs can be used to identify which DIMMs have failed.
Swapping DIMM A1 (marked in blue) with DIMM A9 (marked in red) tests the DIMM in a different memory channel and bank.
Figure 2: Swapping DIMM A1 with DIMM A9
Swapping DIMM A1 (marked in blue) with DIMM B1 (marked in red) puts the DIMM on an altogether different memory controller (CPU).
Figure 3: Swapping DIMM A1 with DIMM B1
Swapping the whole bank of DIMMs (A1, A2, A3 - marked blue) with another bank (B1, B2, B3 - marked red) tests the whole bank of DIMMs in a new bank, on a new memory controller.
Figure 4: Swapping DIMMs A1, A2, A3 with DIMMs B1, B2, B3
Swapping a whole channel of DIMMs (A1, A4, A7 - marked blue) with another channel (B1, B4, B7 - marked red) tests the whole channel of DIMMs in a new channel, on a new memory controller.
Figure 5: Swapping DIMMs A1, A4, A7 with DIMMs B1, B4, B7
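The swaps shown above differ in which part of the memory path they change. The following Python sketch illustrates that idea. It assumes the layout implied by the examples in this article (nine sockets per CPU, banks A1-A3, A4-A6, A7-A9, and channels A1/A4/A7, A2/A5/A8, A3/A6/A9). The function names and the bank and channel numbering are hypothetical, and the actual socket layout of a given system should be taken from its owner's manual.

```python
# Assumed layout, inferred from the examples in this article; not a Dell tool.
SLOTS_PER_BANK = 3  # assumption: A1-A3 form one bank, A1/A4/A7 one channel


def locate(slot):
    """Split a socket label such as 'A9' into controller, bank, and channel."""
    controller = slot[0].upper()      # 'A' or 'B', one memory controller per CPU
    index = int(slot[1:]) - 1         # zero-based socket number
    return {
        "controller": controller,
        "bank": index // SLOTS_PER_BANK + 1,
        "channel": index % SLOTS_PER_BANK + 1,
    }


def changed_by_swap(slot_a, slot_b):
    """List which parts of the memory path differ between two sockets."""
    a, b = locate(slot_a), locate(slot_b)
    return [key for key in ("controller", "bank", "channel") if a[key] != b[key]]


print(changed_by_swap("A1", "A9"))  # bank and channel change, same controller
print(changed_by_swap("A1", "B1"))  # only the controller changes
```

Under these assumptions, the A1/A9 swap exercises a different bank and channel on the same controller, while the A1/B1 swap moves the DIMM to the other controller.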
Generally, DIMM errors tend to follow the DIMMs identified in the errors. For example, with an SBE reporting on DIMM A1, swapping this DIMM with a different DIMM shows whether the error follows the DIMM to its new socket or stays with the original socket.
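The reasoning behind a single swap test can be sketched in Python as follows. This is only an illustration of the logic, not a Dell utility: the function name, its parameters, and the wording of the conclusions are assumptions, and the set of sockets reporting errors after the swap would have to be read manually from the system logs.

```python
def interpret_swap(original_slot, swapped_with, error_slots_after):
    """Classify the outcome of swapping the DIMMs in two sockets.

    original_slot     -- socket that reported the error before the swap
    swapped_with      -- socket the suspect DIMM was moved to
    error_slots_after -- set of sockets reporting errors after the swap
    """
    followed = swapped_with in error_slots_after   # error moved with the DIMM
    stayed = original_slot in error_slots_after    # error stayed at the socket
    if followed and stayed:
        return "inconclusive: errors on both sockets, test further"
    if followed:
        return "error followed the DIMM: suspect the DIMM"
    if stayed:
        return "error stayed with the socket: suspect the socket, channel, or controller"
    return "no error reported: keep monitoring"


# Example: A1 reported an SBE, the DIMM was moved to B1, and B1 now reports it.
print(interpret_swap("A1", "B1", {"B1"}))
```

The same check applies to group swaps by running it once for each pair of sockets that was exchanged.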