Troubleshooting memory errors on PowerEdge systems by swap testing

Summary: Swapping memory DIMMs to troubleshoot memory errors on Dell Technologies PowerEdge servers.

Symptoms

NOTE: This article does not apply to newer systems with Intel Xeon Scalable processors. For those systems, see the article "What is DDR4 Self-healing on Dell PowerEdge Servers with Intel Xeon Scalable Processors".

When a single-bit error (SBE) or multi-bit error (MBE) is reported against one or more DIMM locations, the DIMM itself is not necessarily the cause, so some simple troubleshooting is needed to determine where the fault lies. Figure 1 shows an example of memory errors in the iDRAC interface on a PowerEdge R715.

Figure 1: Memory errors as displayed in iDRAC 6 logs (English Only)

Isolating memory issues means swapping DIMMs into different memory sockets, channels, banks, and controllers. There are several ways to swap the DIMMs to narrow down the fault, and you might have to use more than one of them to pinpoint the faulty DIMM or socket. The methods are illustrated below. To keep the explanation simple, we assume that the faulty DIMM is A1, or is in one of the sets marked in blue in the images.

Swapping DIMMs in groups (by channel or bank) rather than individually is the best way to identify the failed DIMM or DIMMs.
Once a group has been identified as containing the failed DIMM or DIMMs, move single DIMMs to identify which ones have failed.


Method 1:

Swapping DIMM A1 (marked in blue) with DIMM A9 (marked in red) tries the DIMM in a different memory channel and bank.

Figure 2: Swapping DIMM A1 with DIMM A9
 

Method 2:

Swapping DIMM A1 (marked in blue) with DIMM B1 (marked in red) puts the DIMM on an altogether different memory controller (CPU).

Figure 3: Swapping DIMM A1 with DIMM B1
 

Method 3:

Swapping a whole bank of DIMMs (A1, A2, A3 - marked blue) with another bank (B1, B2, B3 - marked red) tests the whole bank of DIMMs in a new bank, on a new memory controller.

Figure 4: Swapping DIMMs A1, A2, A3 with DIMMs B1, B2, B3
 

Method 4:

Swapping a whole channel of DIMMs (A1, A4, A7 - marked blue) with another channel (B1, B4, B7 - marked red) tests the whole channel of DIMMs in a new channel, on a new memory controller.

Figure 5: Swapping DIMMs A1, A4, A7 with DIMMs B1, B4, B7
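
The group swaps in Methods 3 and 4 can be written down as a swap plan before opening the chassis. The following is a minimal Python sketch, assuming the nine-slots-per-CPU labeling shown in the figures (a bank is A1, A2, A3 and a channel is A1, A4, A7). The function names bank, channel, and swap_plan are hypothetical, and extending the pattern to the remaining banks and channels is an assumption, so check the slot layout in the owner's manual for your system.

```python
from typing import List, Tuple

# Assumption: slots are labeled <CPU letter><1..9>, as in Figures 2 to 5.
# Real PowerEdge models may group slots into banks and channels differently.

def bank(cpu: str, index: int) -> List[str]:
    """Slots of one bank, e.g. bank('A', 1) gives A1, A2, A3."""
    start = 3 * (index - 1) + 1
    return [f"{cpu}{n}" for n in range(start, start + 3)]

def channel(cpu: str, index: int) -> List[str]:
    """Slots of one channel, e.g. channel('A', 1) gives A1, A4, A7."""
    return [f"{cpu}{n}" for n in range(index, index + 7, 3)]

def swap_plan(source: List[str], target: List[str]) -> List[Tuple[str, str]]:
    """Pair each source slot with the target slot it is swapped with."""
    if len(source) != len(target):
        raise ValueError("source and target groups must be the same size")
    return list(zip(source, target))

# Method 3: swap a whole bank onto the other memory controller.
method_3 = swap_plan(bank("A", 1), bank("B", 1))
# Method 4: swap a whole channel onto the other memory controller.
method_4 = swap_plan(channel("A", 1), channel("B", 1))
```

Recording the plan this way makes it easier to map any error reported after the swap back to the DIMM that originally sat in that slot.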
 

Interpreting the results after swapping DIMMs

Generally, DIMM errors tend to follow the DIMMs identified in the errors. For example, with an SBE reported on DIMM A1, swapping this DIMM with a different DIMM results in one of the following:

  1. The error message is no longer reported, and the problem is resolved.
     • This indicates that reseating the memory resolved the issue.
  2. The error message follows the DIMM (DIMM A1 is swapped with DIMM B1, and the error message is now reported against DIMM B1).
     • This indicates that the DIMM has most likely failed and requires replacement.
  3. The error message follows the DIMM socket (DIMM A1 is swapped with DIMM B1, and the error message is still reported against DIMM A1).
     • This indicates that the system board or CPU is most likely at fault.
     • Swapping CPUs confirms which component requires replacement.
     • If the problem follows the CPU (the error message moves after swapping CPUs), replace the CPU.
     • If the problem stays with the DIMM socket, replace the system board.
  4. The error message follows neither the DIMM nor the socket (the error is reported against a different DIMM after swapping).
     • This indicates that a different DIMM or DIMMs is most likely bad.
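
The four outcomes above amount to a small decision procedure. The following is a minimal Python sketch of it, not a Dell tool: the names Verdict and interpret_swap are hypothetical, and the slot labels that report errors after the swap are assumed to have been read manually from the iDRAC or system event log.

```python
from enum import Enum
from typing import Iterable

class Verdict(Enum):
    RESEAT_FIXED = "Reseating resolved the issue"
    DIMM_FAULTY = "The moved DIMM has most likely failed; replace it"
    SOCKET_PATH_FAULTY = "System board or CPU is suspect; swap CPUs to confirm"
    OTHER_DIMM_SUSPECT = "A different DIMM or DIMMs is most likely bad"

def interpret_swap(original_slot: str, new_slot: str,
                   slots_with_errors: Iterable[str]) -> Verdict:
    """Classify the result of moving the suspect DIMM.

    original_slot: slot that first reported the error (for example "A1").
    new_slot: slot the suspect DIMM was moved to (for example "B1").
    slots_with_errors: slots reporting errors after the swap.
    """
    errors = set(slots_with_errors)
    if not errors:
        return Verdict.RESEAT_FIXED
    # Checked first: if both slots report errors, the DIMM is still suspect.
    if new_slot in errors:
        return Verdict.DIMM_FAULTY
    if original_slot in errors:
        return Verdict.SOCKET_PATH_FAULTY
    return Verdict.OTHER_DIMM_SUSPECT

# Example from the text: A1 swapped with B1, error now reported against B1.
result = interpret_swap("A1", "B1", ["B1"])
```

The sketch covers only the first swap. When the verdict points at the system board or CPU, the CPU swap described in outcome 3 is still needed to decide which of the two to replace.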
 
NOTE: Keep your firmware up to date, as this can reduce the risk of memory errors and prolong the life of the DIMMs.
For more information, see Dell Knowledge Base article Dell Repository Manager (DRM).

Cause

Not Applicable

Resolution

Not Applicable

Affected Products

PowerEdge C1100, PowerEdge C2100, PowerEdge C5125, PowerEdge C5220, PowerEdge C5230, PowerEdge C6105, PowerEdge C6145, PowerEdge C6220, PowerEdge C6220 II, PowerEdge C6320

Products

PowerEdge C6320p, PowerEdge FC430, PowerEdge FC630, PowerEdge FC830, PowerEdge M420, PowerEdge M520, PowerEdge M520 (for PE VRTX), PowerEdge M600, PowerEdge M605, PowerEdge M610, PowerEdge M610x, PowerEdge M620, PowerEdge M620 (for PE VRTX), PowerEdge M630, PowerEdge M630 (for PE VRTX), PowerEdge M710, PowerEdge M710HD, PowerEdge M805, PowerEdge M820, PowerEdge M820 (for PE VRTX), PowerEdge M830, PowerEdge M830 (for PE VRTX), PowerEdge M905, PowerEdge M910, PowerEdge M915, PowerEdge R200, PowerEdge R210, PowerEdge R210 II, PowerEdge R220, PowerEdge R230, PowerEdge R300, PowerEdge R310, PowerEdge R320, PowerEdge R330, PowerEdge R410, PowerEdge R415, PowerEdge R420, PowerEdge R430, PowerEdge R510, PowerEdge R515, PowerEdge R520, PowerEdge R530, PowerEdge R530xd, PowerEdge R610, PowerEdge R620, PowerEdge R630, PowerEdge R710, PowerEdge R715, PowerEdge R720, PowerEdge R720XD, PowerEdge R730, PowerEdge R730xd, PowerEdge R805, PowerEdge R810, PowerEdge R815, PowerEdge R820, PowerEdge R830, PowerEdge R900, PowerEdge R905, PowerEdge R910, PowerEdge R920, PowerEdge R930, PowerEdge T100, PowerEdge T105, PowerEdge T110, PowerEdge T110 II, PowerEdge T130, PowerEdge T20, PowerEdge T30, PowerEdge T300, PowerEdge T310, PowerEdge T320, PowerEdge T330, PowerEdge T410, PowerEdge T420, PowerEdge T430, PowerEdge T605, PowerEdge T610, PowerEdge T620, PowerEdge T630, PowerEdge T710 ...