R740xd multiple memory errors

Question

Hello, I have five R740xd servers manufactured in 2020 that have been a large pain in my butt ever since. We have been using them for 4 years now, and throughout their life each one has reported memory errors on multiple occasions (basically daily). I reboot the servers and the errors go away for a day or two, then come back. The errors are reported on different banks on each boot and across all servers, and I've even had a server report every bank as bad during boot. Occasionally one of the five will crash the OS randomly, and we need to iDRAC in and cold reboot it.

Now, I've been back and forth with Dell warranty support multiple times over the last 4 years and even replaced multiple chips in the beginning, only to see the exact same problem come back on the exact same banks that were replaced. I have shuffled RAM around to different banks and between servers to try to isolate it to maybe a bad stick, but I've been unsuccessful in isolating it. It seems the banks will randomly change, even if the RAM hasn't moved. I was eventually told I didn't need to replace the memory anymore, that these were all false positives, and to just reboot the server whenever I see the error to get rid of it and 'repair' the RAM. I was recommended to be, and am, on the latest available BIOS (2.21.2) and all other available firmware and modules, applied using the SUU disk through the Lifecycle Controller, yet it still happens all the time.

I have 3 other 5-node vSAN Ready Node stacks, and none of them do this. I've been a server admin for about 15 years. I occasionally run across a single bad stick of RAM in a machine; it's normally due to mishandling or not using a static strap during install, and sometimes they do just go bad. It's very obvious when these problems happen: the OS will lock up in a matter of moments during a memory test or exhibit extremely weird behavior. Not these. They survive the diagnostic, just saying bad RAM and moving on, and run normally for weeks at a time before they abruptly crash... and all the while they are constantly screaming in the background that all their memory is bad and needs to be replaced. It almost seems that these 'false positives' are flagging sections of memory as bad and not allowing anything to read/write to them, until it finally tags an address the OS needs to operate in and it croaks.

Is Dell aware of this general type of behavior? Is this widespread? I haven't heard many conversations regarding anything like it, yet Dell gave me the impression that at some level some of this is by design, based on the responses from the tech I was dealing with. What are the odds there's a manufacturing defect on all 5 of these servers, which are now out of warranty due to Dell kicking the can down the road for 3 years? Is there anything I can do to fix this within the available memory options? I don't want any of the magic self-diagnostic or self-testing/repairing stuff. I just want these to behave like a normal server and crash if the RAM is actually bad, not pre-emptively fix false positives and create problems that didn't actually exist.

DELL-Rey G · Answer

My team has about 70 of the 14G systems, and I saw similar behavior on one system. I was swapping memory around between 2 systems, trying to figure out where the issue was. Support sent me a motherboard because the problem followed the system, and when I was swapping the CPUs to the new board, I noticed there was thermal paste on the CPU socket pins and CPU contact pads. I carefully cleaned the pins and pads with alcohol and put the original board back in, and all was good after that.
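
If it helps with the "does the problem follow the DIMM or the system" check, here is a rough Python sketch of how the swap test could be tallied. It assumes you have hand-entered each logged memory error as a (server, slot, DIMM serial) record; the record layout, the sample values, and the function names tally_errors and top_share are all made up for illustration and are not an iDRAC or Dell export format.

```python
from collections import Counter


def tally_errors(events):
    """Count memory error events per server, per (server, slot), and per DIMM serial."""
    by_server = Counter()
    by_slot = Counter()
    by_dimm = Counter()
    for server, slot, dimm_serial in events:
        by_server[server] += 1
        by_slot[(server, slot)] += 1
        by_dimm[dimm_serial] += 1
    return by_server, by_slot, by_dimm


def top_share(counter):
    """Fraction of all events that belong to the single most frequent key."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return counter.most_common(1)[0][1] / total


# Hypothetical, hand-entered records: (server, slot, DIMM serial).
# These values are invented for the example, not real log data.
events = [
    ("node1", "A1", "SN001"),
    ("node1", "B3", "SN007"),
    ("node1", "A5", "SN012"),
    ("node1", "A1", "SN020"),  # different DIMM in the same slot after a swap
    ("node2", "A2", "SN031"),
]

by_server, by_slot, by_dimm = tally_errors(events)

print("share on worst server:", top_share(by_server))
print("share on worst slot:  ", top_share(by_slot))
print("share on worst DIMM:  ", top_share(by_dimm))
```

If most of the errors land on one server while no single DIMM serial stands out, that points at the system (board, socket, CPU contact) rather than a stick, which is the pattern I saw. If one serial dominates no matter where it is installed, the stick itself is the more likely suspect.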

Rey
#Iwork4Dell

