Unsolved
Memory Channel Error Identification on PowerEdge R6515
Hello
I recently bought 16x Dell part AA783423 as part of a memory upgrade, but one of the sticks seems bad.
I am experiencing memory errors on my Dell PowerEdge R6515 server and need assistance with identifying the problematic DIMM slot. The edac-util tool reports errors specifically at "mc#0csrow#3channel#2":
edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0csrow#0channel#0: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#1: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#2: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#3: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#4: 0 Corrected Errors
mc0: csrow0: mc#0csrow#0channel#5: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0csrow#1channel#0: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#1: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#2: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#3: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#4: 0 Corrected Errors
mc0: csrow1: mc#0csrow#1channel#5: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#2: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#3: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#4: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#5: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#2: 74 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#3: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#4: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#5: 0 Corrected Errors
I would like guidance on which DIMM slot corresponds to this particular memory channel. Could you provide me with the memory layout or any specific documentation that could help me isolate and address this issue? Any additional troubleshooting steps or advice would also be appreciated.
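For anyone who wants to pull the non-zero counters out of a long edac-util -v listing, here is a minimal Python sketch. It assumes the per-channel line format shown in the output above; the function name nonzero_corrected and the shortened sample text are only for illustration, and the script does not map a channel to a physical DIMM slot.

```python
import re

# Matches per-channel lines in the format shown in the post, e.g.
#   mc0: csrow3: mc#0csrow#3channel#2: 74 Corrected Errors
LINE_RE = re.compile(
    r"mc#(?P<mc>\d+)csrow#(?P<csrow>\d+)channel#(?P<channel>\d+):\s+"
    r"(?P<count>\d+)\s+Corrected Errors"
)


def nonzero_corrected(edac_output):
    """Return (mc, csrow, channel, count) tuples whose count is above zero."""
    hits = []
    for line in edac_output.splitlines():
        match = LINE_RE.search(line)
        if match and int(match.group("count")) > 0:
            hits.append(
                tuple(int(match.group(k)) for k in ("mc", "csrow", "channel", "count"))
            )
    return hits


# Shortened sample copied from the output in the post.
sample = """\
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#2: 74 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#3: 0 Corrected Errors
"""

print(nonzero_corrected(sample))
```

On a live system the text could come from subprocess.run(["edac-util", "-v"], capture_output=True, text=True).stdout instead of the pasted sample.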
Thank you for your assistance.
DELL-Joey C
Moderator
April 16th, 2024 03:25
Hi,
It is hard to interpret an EDAC error message, because we would need to refer to the architectural schematics; usually this would require engineering to be involved.
I would suggest disabling EDAC and letting the server's Lifecycle Controller capture the error, as this would be an easier and faster way. These messages appear when the Error Detection and Correction (EDAC) module reads the error registers from the chipset. You may not notice any memory or CPU errors in the ESM/BMC/IPMI/iDRAC log because the registers are read-once and, when EDAC is enabled, it gets to them first.
SDeltaE
1 Rookie
April 16th, 2024 10:44
@DELL-Joey C Hello
I disabled EDAC by unloading the modules: rmmod amd64_edac edac_mce_amd
Now dmesg prints:
Checking the iDRAC Lifecycle log, I'm not seeing anything picked up there.
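To double-check that the rmmod actually took effect, one option is to look for the module names in /proc/modules. The Python sketch below is only an illustration: the helper name loaded_edac_modules is made up, the module names are the two from the rmmod command above (they may differ on other kernels), and the example text is a hypothetical /proc/modules excerpt, not output from this server.

```python
def loaded_edac_modules(proc_modules_text, names=("amd64_edac", "edac_mce_amd")):
    """Return those of `names` that appear as loaded modules.

    Each line of /proc/modules starts with the module name, so the first
    whitespace-separated field is compared against the names of interest.
    """
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in names if name in loaded]


# Hypothetical excerpt, for illustration only.
example = """\
amd64_edac 65536 0 - Live 0x0000000000000000
edac_mce_amd 32768 1 amd64_edac, Live 0x0000000000000000
nvme 49152 2 - Live 0x0000000000000000
"""

print(loaded_edac_modules(example))
```

On the server itself, the text would come from open("/proc/modules").read(); an empty list would suggest that neither module is still loaded.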
DELL-Erman O
Moderator
April 16th, 2024 12:37
Hello,
If there is nothing in the iDRAC LCC log, then it's hard to say there is a memory error. Please see the Resolution section of the Dell article "EDAC Errors in 'messages' Log in RedHat Enterprise Linux (RHEL) and PowerEdge".
Hope that helps!