Unsolved

1 Rookie

 • 

18 Posts

April 15th, 2024 22:44

Memory Channel Error Identification on PowerEdge R6515

Hello

I recently bought 16x Dell Part AA783423 as part of a memory upgrade, but one of the sticks seems bad.

I am experiencing memory errors on my Dell PowerEdge R6515 server and need assistance with identifying the problematic DIMM slot. The edac-util tool reports errors specifically at "mc#0csrow#3channel#2":

edac-util -v 
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#2: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#3: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#4: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#5: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#1: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#2: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#3: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#4: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#5: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#2: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#3: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#4: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#5: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#2: 74 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#3: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#4: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#5: 0 Corrected Errors

I would like guidance on which DIMM slot corresponds to this particular memory channel. Could you provide me with the memory layout or any specific documentation that could help me isolate and address this issue? Any additional troubleshooting steps or advice would also be appreciated.

Thank you for your assistance.
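
With this much output, the one non-zero counter is easy to miss. A minimal sketch of a filter that keeps only the lines with a non-zero count follows. The helper name edac_nonzero is hypothetical, and the filter assumes the "edac-util -v" line format shown above, where the count sits immediately before the word "Corrected" or "Uncorrected".

```shell
# Hypothetical helper: read `edac-util -v` output on stdin and print only
# the lines whose error count is greater than zero.
edac_nonzero() {
  awk '{
    for (i = 2; i <= NF; i++) {
      # The count is assumed to be the field just before the error type.
      if (($i == "Corrected" || $i == "Uncorrected") && $(i - 1) + 0 > 0) {
        print
        break
      }
    }
  }'
}

# Usage (on the affected server):
#   edac-util -v | edac_nonzero
```

Applied to the output above, this leaves only the mc#0csrow#3channel#2 line.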

Moderator

 • 

3.2K Posts

April 16th, 2024 03:25

Hi,

 

It is hard to map an EDAC error message to a physical DIMM slot, because we would need to refer to the architectural schematics, and that usually requires engineering to be involved.

I would suggest disabling EDAC and letting the server's Lifecycle Controller capture the error; this would be an easier and faster way. These errors occur when the Error Detection and Correction (EDAC) module reads the registers from the chipset. You may not notice any memory or CPU errors in the ESM/BMC/IPMI/iDRAC log because the registers are read-once and, when enabled, EDAC will get them first.

1 Rookie

 • 

18 Posts

April 16th, 2024 10:44

@DELL-Joey C​ Hello

I disabled EDAC by unloading the modules: rmmod amd64_edac edac_mce_amd

Now dmesg prints:

mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 17: d42040000000011b
mce: [Hardware Error]: TSC 0 ADDR 15250b5100 PPIN 2b4a63d009dc115 SYND bb4400800a800403 IPID 9600250f00
mce: [Hardware Error]: PROCESSOR 2:830f10 TIME 1713263957 SOCKET 0 APIC 2 microcode 830107a

and:
edac-util -v

edac-util: Error: No memory controller data found.

Checking the iDRAC Lifecycle log, I'm not seeing anything picked up there.
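
The raw machine-check record in the dmesg output above can be decoded by hand to some extent. The sketch below is an illustration under stated assumptions, not Dell or AMD tooling. The flag positions are the ones the Linux kernel appears to use for the AMD MCA_STATUS register (Val 63, Over 62, UC 61, En 60, MiscV 59, AddrV 58, PCC 57, SyndV 53, CECC 46, UECC 45, Deferred 44, Poison 43). The channel and csrow extraction (IPID bits 31:20 and SYND bits 2:0) mirrors what the amd64_edac driver appears to do for family 17h processors. Both should be checked against the kernel source and AMD's Processor Programming Reference for the installed CPU. The helper names pad16, mca_status_flags and umc_location are hypothetical.

```shell
# Left-pad a hex string (no 0x prefix) with zeros to 16 digits.
pad16() (
  s=$1
  while [ "${#s}" -lt 16 ]; do s="0$s"; done
  printf '%s\n' "$s"
)

# Print the names of the flag bits set in an MCA_STATUS value.
# Only the upper 32 bits are inspected; each entry is NAME:bit-within-upper-half
# (bit 63 of the register is bit 31 of the upper half, and so on).
# ASSUMPTION: bit positions as listed in the text above.
mca_status_flags() (
  hi=$(pad16 "$1" | cut -c1-8)
  out=""
  for flag in Val:31 Over:30 UC:29 En:28 MiscV:27 AddrV:26 PCC:25 \
              SyndV:21 CECC:14 UECC:13 Deferred:12 Poison:11; do
    name=${flag%%:*}
    bit=${flag##*:}
    if [ $(( (0x$hi >> $bit) & 1 )) -eq 1 ]; then
      out="$out${out:+ }$name"
    fi
  done
  printf '%s\n' "$out"
)

# Print the memory channel and chip-select row for a UMC error.
# ASSUMPTION: channel = IPID bits 31:20, csrow = SYND bits 2:0 (family 17h).
umc_location() (
  ipid_lo=$(pad16 "$1" | cut -c9-16)
  synd_lo=$(pad16 "$2" | cut -c9-16)
  printf 'channel=%d csrow=%d\n' \
    "$(( 0x$ipid_lo >> 20 ))" "$(( 0x$synd_lo & 7 ))"
)

# Usage with the values from the dmesg output above:
#   mca_status_flags d42040000000011b
#   umc_location 9600250f00 bb4400800a800403
```

For the values in this thread, the first call prints "Val Over En AddrV SyndV CECC" and the second prints "channel=2 csrow=3". If the assumptions hold, that is a corrected ECC error at the same csrow 3 / channel 2 that edac-util reported earlier. It still does not name the physical slot.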

Moderator

 • 

2.2K Posts

April 16th, 2024 12:37

Hello, 

If there is nothing in the iDRAC LCC log, then it's hard to say there is a memory error. See the Dell article "EDAC Errors in 'messages' Log in RedHat Enterprise Linux (RHEL) and PowerEdge":

These errors occur when the Error Detection and Correction (EDAC) module reads the registers from the chipset. You may not notice any memory or CPU errors in the ESM/BMC/IPMI/iDRAC log because the registers are read-once and when enabled, EDAC will get them first.

Resolution:
  • Blacklist the EDAC driver:
    • List the loaded EDAC modules:
      • # lsmod | grep -i edac
    • Take the output and blacklist those modules:
      • Edit '/etc/modprobe.d/blacklist.conf' with your favorite editor
      • Add the modules at the bottom of the file
      • Example:
        • blacklist i7core_edac
        • blacklist edac_core
  • Reboot
  • Run hardware diagnostics
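
The list-and-blacklist steps above can be sketched as a small filter. edac_blacklist_lines is a hypothetical helper that reads lsmod output and prints one "blacklist" line per module whose name contains "edac". The modules unloaded earlier in this thread were amd64_edac and edac_mce_amd, not the i7core_edac from the article's example, so the names will differ from system to system.

```shell
# Hypothetical helper: turn `lsmod` output (stdin) into blacklist lines
# for every module whose name contains "edac" (case-insensitive).
edac_blacklist_lines() {
  awk 'tolower($1) ~ /edac/ { print "blacklist " $1 }'
}

# Usage: review the generated lines first, then append them as root.
#   lsmod | edac_blacklist_lines
#   lsmod | edac_blacklist_lines >> /etc/modprobe.d/blacklist.conf
```

Printing the lines before appending them makes it easy to confirm that only EDAC modules were matched.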

Hope that helps!
