tosaraja

1 Rookie

May 30th, 2024 08:57

Possible BCM57504 problems

This concerns Dell R650 servers with BCM57504 NICs. Our setup runs Ubuntu 22.04 or 24.04. Two interfaces, eno12399np0 and eno12409np1, are bonded together as bond0.

We are using the server as an OpenNebula build server, where OpenNebula creates a bridge br0 whenever it launches a KVM VM on the server. If the last VM is deleted, br0 is removed as well. Each VM gets its own virtual NIC, e.g. one-300234-0, which connects to br0, and the traffic is VLAN tagged. OpenNebula also creates a virtual NIC bond0.123@bond0, which I guess does the VLAN tagging for the VMs, and br0 gets bond0.123 as an interface.

Now the weirdest things happen randomly. Sometimes we end up with a virtual machine that cannot connect to one particular other VM within the same network cluster. That other VM is an sccache server running on another host, so I'll refer to it as the sccache VM so you don't get lost here. And by connect I mean ssh, ping, you name it. But just this one! The problematic VM can connect to any other VM we've tried within the same network segment at the same time. And those VMs can in turn connect to the sccache VM, which is unreachable only for the problematic VM.

When we dug deeper into the issue, we found out that our problematic VM never received the ARP reply when it asked who 192.x.x.x was. We confirmed with tshark that the sccache VM did send the reply, but the problematic VM never received it. Next we went outside the VM to the host OS and ran tshark on br0 and bond0. Neither of those saw the ARP response, as if the reply never reached the server at all. But if the sccache VM pings the problematic VM and sends an ARP request for its MAC address, the problematic VM does get the broadcast and sends a reply, which the sccache VM receives. So the line partially works between them. Only the ARP responses from the sccache VM to the problematic VM are lost somewhere within the problematic host. In other words, when sent from the sccache VM, broadcasts reach the problematic VM but unicasts do not.
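To make the broadcast versus unicast distinction concrete, here is a minimal Python sketch that builds and decodes the two kinds of ARP frame involved. It is only an illustration of the standard ARP-over-Ethernet layout with an 802.1Q tag; the MAC addresses, the 192.0.2.x addresses and the helper names build_arp and describe_arp are made up for the example and are not taken from the setup described above. Note that a request also carries the sender's own MAC and IP, and in the RFC 826 algorithm the target of a request records that pair before replying.

```python
import struct

BROADCAST = b"\xff" * 6
ETH_P_ARP = 0x0806
ETH_P_8021Q = 0x8100


def mac_str(raw: bytes) -> str:
    return ":".join(f"{b:02x}" for b in raw)


def ip_str(raw: bytes) -> str:
    return ".".join(str(b) for b in raw)


def build_arp(op, src_mac, src_ip, dst_mac, target_mac, target_ip, vlan=None):
    """Build an Ethernet frame carrying ARP (op 1 = request, 2 = reply)."""
    eth = dst_mac + src_mac
    if vlan is not None:
        # 802.1Q tag: TPID 0x8100 followed by the VLAN ID in the low 12 bits
        eth += struct.pack("!HH", ETH_P_8021Q, vlan & 0x0FFF)
    eth += struct.pack("!H", ETH_P_ARP)
    # Hardware type 1 (Ethernet), protocol 0x0800 (IPv4), lengths 6 and 4
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, op)
    return eth + arp + src_mac + src_ip + target_mac + target_ip


def describe_arp(frame: bytes):
    """Decode an ARP frame; return None if the frame is not ARP."""
    dst = frame[0:6]
    (ethertype,) = struct.unpack("!H", frame[12:14])
    offset, vlan = 14, None
    if ethertype == ETH_P_8021Q:
        tci, ethertype = struct.unpack("!HH", frame[14:18])
        vlan, offset = tci & 0x0FFF, 18
    if ethertype != ETH_P_ARP:
        return None
    _htype, _ptype, _hlen, _plen, op = struct.unpack(
        "!HHBBH", frame[offset:offset + 8]
    )
    body = frame[offset + 8:]
    return {
        "delivery": "broadcast" if dst == BROADCAST else "unicast",
        "kind": {1: "request", 2: "reply"}.get(op, f"op {op}"),
        "vlan": vlan,
        "sender_mac": mac_str(body[0:6]),
        "sender_ip": ip_str(body[6:10]),
        "target_ip": ip_str(body[16:20]),
    }


# Hypothetical addresses, for illustration only
vm_mac = bytes.fromhex("525400000001")
cache_mac = bytes.fromhex("525400000002")
vm_ip = bytes([192, 0, 2, 10])
cache_ip = bytes([192, 0, 2, 20])

# "Who has 192.0.2.20?" goes to the broadcast MAC
request = build_arp(1, vm_mac, vm_ip, BROADCAST, b"\x00" * 6, cache_ip, vlan=123)
# The answer is addressed to the asking VM's MAC only
reply = build_arp(2, cache_mac, cache_ip, vm_mac, vm_mac, vm_ip, vlan=123)

print(describe_arp(request))
print(describe_arp(reply))
```

The request is delivered to every port in the VLAN, while the reply depends on the switch and the NIC forwarding a frame for one specific destination MAC, so the two can fail independently.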

So we took another similar Dell server to help us out here. We set up port mirroring on the switch so that both servers receive the same packets.

What happened now was that while the first server never received the ARP reply, the second one got it. This was confirmed by running tshark on bond0 on both servers. So it starts looking like a firmware or hardware bug on the NIC side.

There is yet another very weird thing here, which we can't explain.

While the ARP reply is never seen in a capture on either the host or the VM, the ARP table in the VM actually gets filled with the correct entry. Where does that come from? How can it learn the address without ever receiving the response? And if it really got the ARP information, why is ping still stuck? Running ping under strace ends up in a loop printing "EAGAIN (Resource temporarily unavailable)", whatever that means. And once again, if I ping something else in the same segment, it works, and the ARP replies are visible.
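On the EAGAIN question: that is the error a non-blocking read returns when nothing is queued on the socket yet. In the ping trace it most likely just means ping keeps checking its socket and no echo reply has arrived, which would be consistent with the unicast traffic being lost, though that reading is an assumption. A minimal Python sketch of the same condition, using a local UDP socket as a stand-in because raw ICMP sockets need extra privileges (recv_without_waiting is a hypothetical helper name):

```python
import errno
import os
import socket


def recv_without_waiting(sock: socket.socket):
    """Try one non-blocking read; return (data, None) or (None, errno code)."""
    try:
        data, _addr = sock.recvfrom(2048)
        return data, None
    except BlockingIOError as exc:
        # Raised when the kernel reports EAGAIN / EWOULDBLOCK
        return None, exc.errno


# A socket that nobody sends to, so there is never anything to read
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 0))
sock.setblocking(False)

data, err = recv_without_waiting(sock)
print(data, errno.errorcode.get(err), os.strerror(err) if err else "")
sock.close()
```

So the error itself is not a fault in ping; it only says "nothing received so far".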

If we go ahead and delete this VM, the br0 gets deleted as well. If we now re-create a VM the br0 will be created again and most likely everything will work again. But just most likely. It's a game of dice here.

We've also ended up in situations where br0 never gets linked to bond0.123. Once again, this happens randomly while running the same tools and scripts. When it happens, the VMs that use br0 naturally have no network connection at all. How can br0 be left without an interface it's instructed to have?
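One way to catch the missing link automatically is to check which interfaces report br0 as their master. The sketch below parses text in the one-line-per-interface format printed by `ip -o link show`; the sample output is illustrative and shortened, not copied from the servers in question, and parse_links, bridge_members and uplink_attached are hypothetical helper names.

```python
import re

# Matches e.g. "7: bond0.123@bond0: <BROADCAST,UP> mtu 1500 ... master br0 ..."
LINK_RE = re.compile(
    r"^\d+:\s+(?P<name>[^:@\s]+)(?:@(?P<parent>[^:\s]+))?:\s+"
    r"<(?P<flags>[^>]*)>(?P<rest>.*)$"
)


def parse_links(text):
    """Map interface name -> parent device, master device and UP flag."""
    links = {}
    for line in text.splitlines():
        m = LINK_RE.match(line.strip())
        if not m:
            continue
        master = re.search(r"\bmaster\s+(\S+)", m.group("rest"))
        links[m.group("name")] = {
            "parent": m.group("parent"),
            "master": master.group(1) if master else None,
            "up": "UP" in m.group("flags").split(","),
        }
    return links


def bridge_members(links, bridge):
    """Names of all interfaces enslaved to the given bridge."""
    return sorted(n for n, info in links.items() if info["master"] == bridge)


def uplink_attached(links, bridge, uplink):
    """True if the uplink exists and reports the bridge as its master."""
    return uplink in links and links[uplink]["master"] == bridge


# Illustrative sample only
SAMPLE = """\
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
7: bond0.123@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UP mode DEFAULT group default qlen 1000
8: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
9: one-300234-0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master br0 state UNKNOWN mode DEFAULT group default qlen 1000
"""

links = parse_links(SAMPLE)
print(bridge_members(links, "br0"))
print(uplink_attached(links, "br0", "bond0.123"))
```

Running a check like this right after each VM launch would at least tell when the broken state appears, which might help correlate it with whatever OpenNebula was doing at that moment.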

What's the next thing we could try out and debug?

Responses(4)

DELL-Charles R

Moderator

May 30th, 2024 14:20

Hello,

I'm not familiar with OpenNebula. Maybe others in the community that are familiar could post their experience.

I'd first recommend making sure the system firmware is up to date and running the built-in hardware diagnostics to test the hardware.

Next, I recommend staying with Ubuntu Server 20.04 LTS or Ubuntu Server 22.04 LTS. Ubuntu 24.04 has not been validated for the R650.

Your PowerEdge R650 supports these operating systems: https://dell.to/456dhxm

To update firmware you can use this procedure:

How to update the firmware using HTTPS connection to iDRAC

https://dell.to/4aMMJSF

To run the hardware diagnostics:

Boot into the Lifecycle Controller (LCC) by pressing <F10> during POST, then run the hardware diagnostics.

tosaraja

1 Rookie

May 31st, 2024 04:30

@DELL-Charles R We update to the latest firmware monthly. We even tried rolling back to the previous firmware version for the Broadcom network cards. We have tried both Ubuntu 22.04 and 24.04, with the same results. We have multiple servers suffering from the same issues, so I doubt it is something hardware diagnostics would find... unless for some reason the same hardware failure occurred on multiple servers. But sure, we'll run it so that we can check that box off the list.

DELL-Joey C

Moderator

May 31st, 2024 09:44

Hi,

Have you perhaps tried asking for help or advice on the Ubuntu community forum? Since you have seen the same behavior on other servers, it might not be a hardware issue; a configuration problem seems more likely.

tosaraja

1 Rookie

May 31st, 2024 12:30

No. My initial logic was not to approach Ubuntu, since we could reproduce this problem in our Windows VMs as well. But as further investigation points to lower levels, and the host OS doesn't see the ARP reply either, I could actually start looking that way again.
