1 Rookie

 • 

12 Posts


June 11th, 2024 04:34

Diagnosing Hardware Error Message

I have a Dell R730xd with iDRAC8, a PERC H330, and a PERC H730P.


I just developed a hardware error.  The two messages are:
Message 1:  "A fatal error was detected on a component at bus 0 device 3 function 1"
Message 2:  "A bus fatal error was detected on a component at slot 5"


Looking at the Hardware Inventory in iDRAC I can find:

BusNumber 0, FunctionNumber 2 as:
C610/X99 series chipset 6-Port SATA Controller [AHCI mode]

BusNumber 0, DeviceNumber 3, FunctionNumber 0 as:
Xeon E7 v3/Xeon E5 v3/Core i7 PCI Express Root Port 3

I'm guessing that the culprit is the first one, the SATA controller.
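One way to check that guess is to compare the bus/device/function triple in Message 1 against the inventory entries. The following is only a minimal sketch: the names `PciAddress`, `InventoryEntry`, `parse_error_address`, `same_device`, and `exact_match` are made up for illustration and are not part of iDRAC or any Dell tool, and it assumes the numbers in the message are decimal. The SATA controller entry lists no DeviceNumber in the post, so it is left as unknown rather than guessed.

```python
import re
from typing import NamedTuple, Optional


class PciAddress(NamedTuple):
    bus: int
    device: int
    function: int

    def __str__(self) -> str:
        # Conventional bus:device.function notation in hex.
        return f"{self.bus:02x}:{self.device:02x}.{self.function:x}"


class InventoryEntry(NamedTuple):
    description: str
    bus: int
    device: Optional[int]  # None means the post did not list a DeviceNumber
    function: int


# Matches the wording of Message 1 (assumption: the numbers are decimal).
_BDF_RE = re.compile(r"bus (\d+) device (\d+) function (\d+)")


def parse_error_address(message: str) -> Optional[PciAddress]:
    """Extract bus/device/function from an error message, if present."""
    match = _BDF_RE.search(message)
    if match is None:
        return None
    return PciAddress(*(int(group) for group in match.groups()))


def same_device(addr: PciAddress, entry: InventoryEntry) -> bool:
    """True if the entry sits on the same bus and device (any function)."""
    return entry.bus == addr.bus and entry.device == addr.device


def exact_match(addr: PciAddress, entry: InventoryEntry) -> bool:
    """True only if bus, device, and function all agree."""
    return (entry.bus, entry.device, entry.function) == tuple(addr)


MESSAGE_1 = "A fatal error was detected on a component at bus 0 device 3 function 1"
MESSAGE_2 = "A bus fatal error was detected on a component at slot 5"

# The two inventory entries quoted in the post.
inventory = [
    InventoryEntry("C610/X99 series chipset 6-Port SATA Controller [AHCI mode]", 0, None, 2),
    InventoryEntry("Xeon E7 v3/Xeon E5 v3/Core i7 PCI Express Root Port 3", 0, 3, 0),
]

addr = parse_error_address(MESSAGE_1)
print("Message 1 address:", addr)
for entry in inventory:
    print(entry.description,
          "| same device:", same_device(addr, entry),
          "| exact match:", exact_match(addr, entry))
```

With the two entries above, neither is an exact match for function 1, and only the Root Port 3 entry shares bus 0, device 3 with the address in Message 1. Message 2 names a slot rather than a bus/device/function triple, so the parser returns nothing for it. This does not prove where the fault is, but it suggests the error sits alongside a PCI Express root port rather than on the SATA controller.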

So, two questions:

1.  Are these error messages talking about the same issue or is it two issues?
2.  Is it the SATA controller and if so, is this on the motherboard or a separate card?

I'm trying to identify the pieces I need to be looking for to effect a repair.

1 Rookie

 • 

12 Posts

June 17th, 2024 17:24

@DELL-Chris H​ 

I had another read of the User Manual for the R730xd.  You are correct that, generally speaking, the servers only support one RAID controller because there is only one drive backplane.

However, it turns out that this model has an option for what they call a Flex Bay on the rear.  In that case, the rear has two drive bays connected to an H330 controller, which consumes a PCI slot.  Meanwhile, the H730P is a mini controller (not consuming a PCI slot) connected to the front drive bays.

I post this information for completeness' sake, in case someone else reads this thread looking for similar information.

Moderator

 • 

2.3K Posts

June 11th, 2024 09:37

Hello,

Fatal errors can occur for a variety of reasons, including but not limited to malfunctions in applications, firmware, or drivers. PCIe errors are categorized as correctable or uncorrectable, and uncorrectable errors are further divided into non-fatal and fatal.
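The error categories form a small hierarchy, which can be sketched as follows. This is only an illustration: the `Severity` enum and the keyword-based `classify` helper are hypothetical, and matching on words in the message text is an assumption, not how iDRAC determines severity.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    CORRECTABLE = "correctable"
    NON_FATAL = "uncorrectable (non-fatal)"
    FATAL = "uncorrectable (fatal)"

    @property
    def uncorrectable(self) -> bool:
        # Both non-fatal and fatal errors are uncorrectable.
        return self is not Severity.CORRECTABLE


def classify(message: str) -> Optional[Severity]:
    """Guess a severity from keywords in a log message (hypothetical heuristic)."""
    text = message.lower()
    # Check "non-fatal" before "fatal", since the former contains the latter.
    if "non-fatal" in text or "nonfatal" in text:
        return Severity.NON_FATAL
    if "fatal" in text:
        return Severity.FATAL
    if "correctable" in text and "uncorrectable" not in text:
        return Severity.CORRECTABLE
    return None


messages = [
    "A fatal error was detected on a component at bus 0 device 3 function 1",
    "A bus fatal error was detected on a component at slot 5",
]
for message in messages:
    severity = classify(message)
    print(severity.value if severity else "unknown", "|", message)
```

Under this heuristic, both messages from the original post land in the fatal uncorrectable category.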

Having reviewed the warnings, I believe both messages point to the same issue. In my experience, a single PCIe fault most likely generates multiple warnings. The SATA controller should be integrated into the motherboard. I believe your server is still operational on the PERC card, so if the issue persists, you can consider replacing the motherboard.

 

Before replacing anything, please try draining the residual power:

1. Power the server down.
2. Disconnect the server from all power cables and network cables.
3. Hold down the power button continuously for at least 10 seconds.
4. Reconnect the power cables and network cables to the system.
5. Wait about 2 minutes before powering on the server so that iDRAC can refresh.
6. Power the system on.

If the issue persists and you have a separate component in slot 5, please reseat it. Also check that all your firmware and drivers are up to date.

 

Hope that helps!

1 Rookie

 • 

12 Posts

June 11th, 2024 16:52

@DELL-Erman O​ 


Thanks, I'll try that.  I don't believe I have anything in slot 5, but I will double-check.  I checked the firmware versions, and they are the newest I have found.

The symptom that alerted me to the condition is that ESXi and all VMs were suddenly gone.  No connectivity whatsoever.  So, whatever the problem, it's affecting the ESXi boot volume, which is running off the PERC H330.

Is the H330 on the motherboard?

Moderator

 • 

8.6K Posts

June 11th, 2024 17:17

It is an independent device, so it isn't part of the system board. Depending on the version of the controller, it will be either in the integrated storage controller slot or in one of the PCIe slots. Another thing I noticed and wanted to check: it sounded like you said you have two different RAID controllers installed. Is that correct?

I ask because that may be the issue: only a single internal RAID controller would be supported.

 

1 Rookie

 • 

12 Posts

June 12th, 2024 03:54

@DELL-Chris H​ 

Interesting.


I bought the server used.  According to the documentation, the two-controller configuration was an option for the R730xd.  It's been running reliably without even a restart for 18 months.

Here's what iDRAC shows for controllers:

From this and what you're saying, it seems like the H330 is in a PCI slot and the H730P is embedded (presumably meaning in the integrated storage slot?).
The server documentation isn't clear on that.


Still not sure if I need a new controller card or motherboard.

1 Rookie

 • 

12 Posts

June 12th, 2024 04:12

I know the H330 controls the two drive bays on the rear.  Those are set up as mirrored boot drives running ESXi.
The H730P controls the front bays with the data stores.


Moderator

 • 

3.2K Posts

June 12th, 2024 08:08

Hi,

 

For the record, this server looks like an OEM unit (SimpliVity). Some troubleshooting steps and firmware might not work as intended. Can you also confirm that iDRAC and the BIOS are up to date? If all the firmware is updated, the next troubleshooting step is to disable C1E and C-states. The options are in BIOS > System Profile Settings.

1 Rookie

 • 

12 Posts

June 13th, 2024 04:17

@DELL-Joey C​ 

Thanks for your suggestions.  I believe the iDRAC, BIOS, and all drivers are up to date.

I have not tried disabling C1E.  I will look into that.

1 Rookie

 • 

12 Posts

June 17th, 2024 17:17

It turns out there was a card in slot 5.  I removed and reinstalled it, and the problem appeared to go away.  However, I see no reason why reseating a card that was securely anchored would solve the problem. I have therefore removed the card (an unused network card), and the server has been running for 36 hours so far without a problem.

No data loss
