cbs_technical

13 Posts

March 30th, 2021 03:00

VMware 7.0 U2 losing contact with SD card

We've recently purchased 8 x R740 servers from Dell:

Internal Dual SD module with 2 x 16GB in Mirror configuration to hold ESXi installation
PERC H740P RAID controller connected to a 16 port BP14G backplane
The servers were purchased without disks (except the 2 bundled with each) so we have purchased 16 x Kingston DC500R 1.92 TB SATA SSD drives (SEDC500R/1920G). These disks are configured in the server with 1 global hot spare and the other 15 split into 3 x RAID5 arrays to create 3 VMFS volumes for VMs.

The ESXi image I am using is the latest DellEMC custom image (VMware-VMvisor-Installer-7.0.0.update02-17630552.x86_64-DellEMC_Customized-A00.iso)

ESXi is installed on the dual internal SD modules configured in RAID 1 from the factory.

I initially installed ESXi on one of the servers and this has been working fine with no issues. However, I installed it on another server and after a few hours I could not access any of the file systems from the ESXi shell. I could access most of the VMware UI but it timed out trying to access anything related to the devices. On the first couple of occasions when it locked up, I powered down the server from the iDRAC and it booted up fine only to fail again within a day or 2.

I picked another 3 servers and ran some tests on them. So far, 3 out of the 4 have experienced the issue.

There is nothing in the LC or system event logs to indicate that there is a problem with the hardware.

Following some further debugging, I can now restore the situation without a reboot by running an esxcfg-rescan on the affected path. Running 'esxcfg-mpath -b' first shows that vmhba32 is active according to VMware. I can also access my test VM. I have included all the steps I took to recover the latest lock up this morning at the bottom of this post.

A review of the VMware logs shows the following regular entries in vmkernel.log when the system is working fine:

2021-03-30T07:09:52.554Z cpu38:2097983)ScsiDeviceIO: 4298: Cmd(0x45d8cea6c240) 0x85, CmdSN 0x364 from world 2099800 to dev "naa.62cea7f09b32e20027f2031de52d2e8c" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0
2021-03-30T07:09:52.600Z cpu36:2097268)NMP: nmp_ThrottleLogForDevice:3861: Cmd 0x85 (0x45d8cea6c240, 2099800) to dev "mpx.vmhba32:C0:T0:L0" on path "vmhba32:C0:T0:L0" Failed:
2021-03-30T07:09:52.600Z cpu36:2097268)NMP: nmp_ThrottleLogForDevice:3869: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0. Act:NONE. cmdId.initiator=0x4304fc0f4b80 CmdSN 0x369
2021-03-30T07:09:52.600Z cpu36:2097268)ScsiDeviceIO: 4325: Cmd(0x45d8cea6c240) 0x85, CmdSN 0x369 from world 2099800 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0
2021-03-30T07:09:52.600Z cpu36:2097268)ScsiDeviceIO: 4325: Cmd(0x45d8cea6c240) 0x85, CmdSN 0x36a from world 2099800 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0
2021-03-30T07:10:32.190Z cpu58:2097913)NMP: nmp_ResetDeviceLogThrottling:3782: last error status from device mpx.vmhba32:C0:T0:L0 repeated 1 times
2021-03-30T07:39:52.616Z cpu32:2097268)NMP: nmp_ThrottleLogForDevice:3861: Cmd 0x85 (0x45d8cea4f640, 2099800) to dev "mpx.vmhba32:C0:T0:L0" on path "vmhba32:C0:T0:L0" Failed:
2021-03-30T07:39:52.616Z cpu32:2097268)NMP: nmp_ThrottleLogForDevice:3869: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0. Act:NONE. cmdId.initiator=0x4304fc0ef000 CmdSN 0x370
2021-03-30T07:39:52.616Z cpu32:2097268)ScsiDeviceIO: 4325: Cmd(0x45d8cea4f640) 0x85, CmdSN 0x370 from world 2099800 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0
2021-03-30T07:39:52.617Z cpu32:2097268)ScsiDeviceIO: 4325: Cmd(0x45d8cea4f640) 0x85, CmdSN 0x371 from world 2099800 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x0 0x0
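For anyone less familiar with the H:/D:/P: notation, here is a minimal Python sketch that pulls the opcode, status fields and sense key out of a ScsiDeviceIO line and maps them to their standard SCSI names. The function name `decode` and the lookup tables are hypothetical helpers written for this thread, not part of any VMware tool, and the tables only cover values that appear in the logs posted here.

```python
import re

# Lookup tables limited to the values that appear in this thread (assumption:
# anything else is simply reported as a hex number).
OPCODES = {
    0x00: "TEST UNIT READY",
    0x12: "INQUIRY",
    0x25: "READ CAPACITY(10)",
    0x28: "READ(10)",
    0x85: "ATA PASS-THROUGH(16)",
    0x9E: "SERVICE ACTION IN(16)",
    0xA0: "REPORT LUNS",
}
HOST_STATUS = {0x0: "OK", 0x5: "ABORT"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}

LINE_RE = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<op>0x[0-9a-f]+),.*?'
    r'to dev "(?P<dev>[^"]+)" failed '
    r'H:(?P<h>0x[0-9a-f]+) D:(?P<d>0x[0-9a-f]+) P:(?P<p>0x[0-9a-f]+)'
    r'(?: Valid sense data: (?P<key>0x[0-9a-f]+) 0x[0-9a-f]+ 0x[0-9a-f]+)?'
)


def decode(line):
    """Decode one ScsiDeviceIO failure line; return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    op, h, d = (int(m[k], 16) for k in ("op", "h", "d"))
    result = {
        "device": m["dev"],
        "opcode": OPCODES.get(op, hex(op)),
        "host": HOST_STATUS.get(h, hex(h)),
        "device_status": DEVICE_STATUS.get(d, hex(d)),
        "sense_key": None,
    }
    if m["key"] is not None:
        key = int(m["key"], 16)
        result["sense_key"] = SENSE_KEYS.get(key, hex(key))
    return result


# One line from the healthy state and one from the failure, copied from the post.
HEALTHY = ('2021-03-30T07:09:52.600Z cpu36:2097268)ScsiDeviceIO: 4325: '
           'Cmd(0x45d8cea6c240) 0x85, CmdSN 0x369 from world 2099800 to dev '
           '"mpx.vmhba32:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 '
           'Valid sense data: 0x5 0x0 0x0')
FAILING = ('2021-03-30T08:29:42.086Z cpu49:2097600)ScsiDeviceIO: 4315: '
           'Cmd(0x45d8cf64fc40) 0x12, cmdId.initiator=0x45388d69bcb8 CmdSN 0xc3be '
           'from world 0 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x5 D:0x0 P:0x0 '
           'Cancelled from path layer. Cmd count Active:1')

print(decode(HEALTHY))
print(decode(FAILING))
```

Read this way, the half-hourly entries in the healthy state are ATA pass-through commands that the device rejects with ILLEGAL REQUEST, which is consistent with the device simply not supporting that command. The failure-time entries are different in kind: ordinary commands such as INQUIRY are aborted at the host level (H:0x5) before the device returns any status.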

At the point of failure, these entries appear in vmkernel.log:

2021-03-30T08:29:42.086Z cpu56:2097601)ScsiPath: 8058: Cancelled Cmd(0x45d8ceb4cd40) 0xa0, cmdId.initiator=0x45388d59b8f8 CmdSN 0x0 from world 0 to path "vmhba32:C0:T0:L0". Cmd count Active:0 Queued:1.
2021-03-30T08:29:42.086Z cpu49:2097600)ScsiPath: 8058: Cancelled Cmd(0x45d8cf64fc40) 0x12, cmdId.initiator=0x45388d69bcb8 CmdSN 0xc3be from world 0 to path "vmhba32:C0:T0:L0". Cmd count Active:0 Queued:0.
2021-03-30T08:29:42.086Z cpu56:2097601)NMP: nmp_ThrottleLogForDevice:3861: Cmd 0xa0 (0x45d8ceb4cd40, 0) to dev "mpx.vmhba32:C0:T0:L0" on path "vmhba32:C0:T0:L0" Failed:
2021-03-30T08:29:42.086Z cpu56:2097601)NMP: nmp_ThrottleLogForDevice:3869: H:0x5 D:0x0 P:0x0 . Act:NONE. cmdId.initiator=0x45388d59b8f8 CmdSN 0x0
2021-03-30T08:29:42.086Z cpu49:2097600)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...
2021-03-30T08:29:42.086Z cpu49:2097600)ScsiDeviceIO: 4315: Cmd(0x45d8cf64fc40) 0x12, cmdId.initiator=0x45388d69bcb8 CmdSN 0xc3be from world 0 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x5 D:0x0 P:0x0 Cancelled from path layer. Cmd count Active:1
2021-03-30T08:29:42.086Z cpu49:2097600)Queued:0
2021-03-30T08:29:42.086Z cpu43:2097581)ScsiVmas: 1057: Inquiry for VPD page 00 to device mpx.vmhba32:C0:T0:L0 failed with error Timeout
2021-03-30T08:29:53.088Z cpu56:2097601)ScsiPath: 8058: Cancelled Cmd(0x45d8d3d58140) 0x0, cmdId.initiator=0x45388d61bc58 CmdSN 0x0 from world 0 to path "vmhba32:C0:T0:L0". Cmd count Active:0 Queued:1.
2021-03-30T08:29:53.088Z cpu54:2097580)VMW_SATP_LOCAL: satp_local_updatePath:856: Failed to update path "vmhba32:C0:T0:L0" state. Status=Transient storage condition, suggest retry
2021-03-30T08:30:22.085Z cpu49:2097600)ScsiPath: 8058: Cancelled Cmd(0x45d8cf64fc40) 0x12, cmdId.initiator=0x45388d69bcc8 CmdSN 0xc3bf from world 0 to path "vmhba32:C0:T0:L0". Cmd count Active:0 Queued:0.
2021-03-30T08:30:22.085Z cpu49:2097600)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...
2021-03-30T08:30:22.085Z cpu49:2097600)ScsiDeviceIO: 4315: Cmd(0x45d8cf64fc40) 0x12, cmdId.initiator=0x45388d69bcc8 CmdSN 0xc3bf from world 0 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x5 D:0x0 P:0x0 Cancelled from path layer. Cmd count Active:1
2021-03-30T08:30:22.086Z cpu49:2097600)Queued:0
2021-03-30T08:30:32.085Z cpu44:2097913)NMP: nmp_ResetDeviceLogThrottling:3782: last error status from device mpx.vmhba32:C0:T0:L0 repeated 3 times
2021-03-30T08:30:33.087Z cpu56:2097601)ScsiPath: 8058: Cancelled Cmd(0x45d8d3d58140) 0x0, cmdId.initiator=0x45388d61bc58 CmdSN 0x0 from world 0 to path "vmhba32:C0:T0:L0". Cmd count Active:0 Queued:0.
2021-03-30T08:30:33.087Z cpu56:2097601)NMP: nmp_ThrottleLogForDevice:3861: Cmd 0x0 (0x45d8d3d58140, 0) to dev "mpx.vmhba32:C0:T0:L0" on path "vmhba32:C0:T0:L0" Failed:
2021-03-30T08:30:33.087Z cpu56:2097601)NMP: nmp_ThrottleLogForDevice:3869: H:0x5 D:0x0 P:0x0 . Act:NONE. cmdId.initiator=0x45388d61bc58 CmdSN 0x0
2021-03-30T08:30:33.087Z cpu54:2097580)VMW_SATP_LOCAL: satp_local_updatePath:856: Failed to update path "vmhba32:C0:T0:L0" state. Status=Transient storage condition, suggest retry
2021-03-30T08:31:59.142Z cpu49:2097600)ScsiPath: 8058: Cancelled Cmd(0x45d8d3d25740) 0x9e, cmdId.initiator=0x45389b89a468 CmdSN 0xc3c0 from world 0 to path "vmhba32:C0:T0:L0". Cmd count Active:0 Queued:0.
2021-03-30T08:31:59.142Z cpu49:2097600)NMP: nmp_ThrottleLogForDevice:3861: Cmd 0x9e (0x45d8d3d25740, 0) to dev "mpx.vmhba32:C0:T0:L0" on path "vmhba32:C0:T0:L0" Failed:
2021-03-30T08:31:59.142Z cpu49:2097600)NMP: nmp_ThrottleLogForDevice:3869: H:0x5 D:0x0 P:0x0 . Act:EVAL. cmdId.initiator=0x45389b89a468 CmdSN 0xc3c0
2021-03-30T08:31:59.142Z cpu49:2097600)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...
2021-03-30T08:31:59.142Z cpu49:2097600)ScsiDeviceIO: 4315: Cmd(0x45d8d3d25740) 0x9e, cmdId.initiator=0x45389b89a468 CmdSN 0xc3c0 from world 0 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x5 D:0x0 P:0x0 Cancelled from path layer. Cmd count Active:1
2021-03-30T08:31:59.142Z cpu49:2097600)Queued:1
2021-03-30T08:32:10.083Z cpu43:2097601)ScsiPath: 8058: Cancelled Cmd(0x45d8d3d58140) 0x0, cmdId.initiator=0x45388d61bc58 CmdSN 0x0 from world 0 to path "vmhba32:C0:T0:L0". Cmd count Active:0 Queued:1.
2021-03-30T08:32:10.083Z cpu54:2097580)VMW_SATP_LOCAL: satp_local_updatePath:856: Failed to update path "vmhba32:C0:T0:L0" state. Status=Transient storage condition, suggest retry
2021-03-30T08:32:15.919Z cpu49:2097600)ScsiPath: 8058: Cancelled Cmd(0x45d8cf64fc40) 0x9e, cmdId.initiator=0x45389e61a718 CmdSN 0xc3c1 from world 0 to path "vmhba32:C0:T0:L0". Cmd count Active:0 Queued:0.
2021-03-30T08:32:15.919Z cpu49:2097600)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...
2021-03-30T08:32:15.919Z cpu49:2097600)ScsiDeviceIO: 4315: Cmd(0x45d8cf64fc40) 0x9e, cmdId.initiator=0x45389e61a718 CmdSN 0xc3c1 from world 0 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x5 D:0x0 P:0x0 Cancelled from path layer. Cmd count Active:1
2021-03-30T08:32:15.919Z cpu49:2097600)Queued:1
2021-03-30T08:32:26.084Z cpu43:2097601)ScsiPath: 8058: Cancelled Cmd(0x45d8d3d58140) 0x0, cmdId.initiator=0x45388d61bc58 CmdSN 0x0 from world 0 to path "vmhba32:C0:T0:L0". Cmd count Active:0 Queued:1.
2021-03-30T08:32:26.084Z cpu54:2097580)VMW_SATP_LOCAL: satp_local_updatePath:856: Failed to update path "vmhba32:C0:T0:L0" state. Status=Transient storage condition, suggest retry
2021-03-30T08:32:32.083Z cpu44:2097913)NMP: nmp_ResetDeviceLogThrottling:3782: last error status from device mpx.vmhba32:C0:T0:L0 repeated 3 times
2021-03-30T08:32:39.144Z cpu49:2097600)ScsiPath: 8058: Cancelled Cmd(0x45d8d3d25740) 0x25, cmdId.initiator=0x45389b89a518 CmdSN 0xc3c2 from world 0 to path "vmhba32:C0:T0:L0". Cmd count Active:0 Queued:0.
2021-03-30T08:32:39.144Z cpu49:2097600)NMP: nmp_ThrottleLogForDevice:3861: Cmd 0x25 (0x45d8d3d25740, 0) to dev "mpx.vmhba32:C0:T0:L0" on path "vmhba32:C0:T0:L0" Failed:
2021-03-30T08:32:39.144Z cpu49:2097600)NMP: nmp_ThrottleLogForDevice:3869: H:0x5 D:0x0 P:0x0 . Act:EVAL. cmdId.initiator=0x45389b89a518 CmdSN 0xc3c2
2021-03-30T08:32:39.144Z cpu49:2097600)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...
2021-03-30T08:32:39.144Z cpu49:2097600)ScsiDeviceIO: 4315: Cmd(0x45d8d3d25740) 0x25, cmdId.initiator=0x45389b89a518 CmdSN 0xc3c2 from world 0 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x5 D:0x0 P:0x0 Cancelled from path layer. Cmd count Active:1
2021-03-30T08:32:39.144Z cpu49:2097600)Queued:1
2021-03-30T08:32:50.082Z cpu43:2097601)ScsiPath: 8058: Cancelled Cmd(0x45d8d3d58140) 0x0, cmdId.initiator=0x45388d61bc58 CmdSN 0x0 from world 0 to path "vmhba32:C0:T0:L0". Cmd count Active:0 Queued:1.
2021-03-30T08:32:50.082Z cpu54:2097580)VMW_SATP_LOCAL: satp_local_updatePath:856: Failed to update path "vmhba32:C0:T0:L0" state. Status=Transient storage condition, suggest retry
2021-03-30T08:32:55.920Z cpu49:2097600)ScsiPath: 8058: Cancelled Cmd(0x45d8cf64fc40) 0x25, cmdId.initiator=0x45389e61a7c8 CmdSN 0xc3c3 from world 0 to path "vmhba32:C0:T0:L0". Cmd count Active:0 Queued:0.
2021-03-30T08:32:55.920Z cpu49:2097600)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...
2021-03-30T08:32:55.920Z cpu49:2097600)ScsiDeviceIO: 4315: Cmd(0x45d8cf64fc40) 0x25, cmdId.initiator=0x45389e61a7c8 CmdSN 0xc3c3 from world 0 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x5 D:0x0 P:0x0 Cancelled from path layer. Cmd count Active:1
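To pin down when a host first hit the problem, the "state in doubt" warnings can be counted per device. The following is a rough Python sketch under the assumption that the log lines keep the format shown above; `summarize_state_in_doubt` is a hypothetical helper name, and the two sample lines are copied from this post.

```python
import re
from collections import Counter

# Matches the NMP warning that marks the start of each failure in the logs above.
DOUBT_RE = re.compile(
    r'^(?P<ts>\S+) cpu\d+:\d+\)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:\d+: '
    r'NMP device "(?P<dev>[^"]+)" state in doubt'
)


def summarize_state_in_doubt(lines):
    """Return (warning count per device, first warning timestamp per device)."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        m = DOUBT_RE.match(line)
        if m:
            counts[m["dev"]] += 1
            # setdefault keeps the earliest timestamp if lines are in log order.
            first_seen.setdefault(m["dev"], m["ts"])
    return counts, first_seen


SAMPLE = [
    '2021-03-30T08:29:42.086Z cpu49:2097600)WARNING: NMP: '
    'nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" '
    'state in doubt; requested fast path state update...',
    '2021-03-30T08:29:42.086Z cpu43:2097581)ScsiVmas: 1057: Inquiry for VPD page 00 '
    'to device mpx.vmhba32:C0:T0:L0 failed with error Timeout',
    '2021-03-30T08:30:22.085Z cpu49:2097600)WARNING: NMP: '
    'nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" '
    'state in doubt; requested fast path state update...',
]

counts, first_seen = summarize_state_in_doubt(SAMPLE)
for dev, n in counts.items():
    print(f"{dev}: {n} warnings, first at {first_seen[dev]}")
```

On a host, the same function could be fed the lines of /var/log/vmkernel.log instead of the sample list. A device that only ever shows up here as mpx.vmhba32:C0:T0:L0 would point at the SD card path rather than the PERC.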

The only indication from the ESXi UI is that it can't update Bootbank.

I noticed that there was a firmware update for the SD module but I have experienced problems on the current release 1.9 and the latest release 1.13. Where I have upgraded to 1.13, I have upgraded the firmware for all other devices.

If you need any more information, let me know.

regards,
Aidan

Recovery steps from last failure:

[root@dc2-vm2:~] esxcfg-mpath -b
naa.62cea7f09b32e20027f2031de52d2e8c : Local DELL Disk (naa.62cea7f09b32e20027f2031de52d2e8c)
   vmhba2:C2:T0:L0 LUN:0 state:active sas Adapter: 52cea7f09b32e200  Target: 60f2031de52d2e8c

mpx.vmhba32:C0:T0:L0 : Local USB Direct-Access (mpx.vmhba32:C0:T0:L0)
   vmhba32:C0:T0:L0 LUN:0 state:active Local HBA vmhba32 channel 0 target 0
   

[root@dc2-vm2:~] esxcfg-mpath -L
vmhba32:C0:T0:L0 state:active mpx.vmhba32:C0:T0:L0 vmhba32 0 0 0 NMP active local usb.vmhba32 usb.0:0
vmhba2:C2:T0:L0 state:active naa.62cea7f09b32e20027f2031de52d2e8c vmhba2 2 0 0 HPP active local sas.52cea7f09b32e200 sas.60f2031de52d2e8c


[root@dc2-vm2:~] esxcli storage core device world list
Device                                World ID  Open Count  World Name
------------------------------------  --------  ----------  ----------
naa.62cea7f09b32e20027f2031de52d2e8c   2097225           1  idle0
naa.62cea7f09b32e20027f2031de52d2e8c   2097565           1  bcflushd
naa.62cea7f09b32e20027f2031de52d2e8c   2098156           1  J6AsyncReplayManager
naa.62cea7f09b32e20027f2031de52d2e8c   2099437           1  hostd
naa.62cea7f09b32e20027f2031de52d2e8c   2111296           1  vmm0:cbs-apa6
naa.62cea7f09b32e20027f2031de52d2e8c   2118636           1  sh
mpx.vmhba32:C0:T0:L0                   2099437           1  hostd
mpx.vmhba32:C0:T0:L0                   2118627           1  python


[root@dc2-vm2:~] esxcfg-rescan -d vmhba32
Rescan complete, however some dead paths were not removed because they were in use by the system. Please use the 'storage core device world list' command to see the VMkernel worlds still using these paths.


This locked up the VM. The existing SSH session froze and I could not open a console in the UI.


[root@dc2-vm2:~] esxcfg-rescan -d vmhba32
Rescan complete, however some dead paths were not removed because they were in use by the system. Please use the 'storage core device world list' command to see the VMkernel worlds still using these paths.

[root@dc2-vm2:~] esxcli storage core device world list
Device                                World ID  Open Count  World Name
------------------------------------  --------  ----------  ----------
naa.62cea7f09b32e20027f2031de52d2e8c   2097225           1  idle0
naa.62cea7f09b32e20027f2031de52d2e8c   2097565           1  bcflushd
naa.62cea7f09b32e20027f2031de52d2e8c   2097656           1  vmsyslogd
naa.62cea7f09b32e20027f2031de52d2e8c   2098156           1  J6AsyncReplayManager
naa.62cea7f09b32e20027f2031de52d2e8c   2111296           1  vmm0:cbs-apa6
naa.62cea7f09b32e20027f2031de52d2e8c   2118702           1  cat
mpx.vmhba32:C0:T0:L0                   2099437           1  hostd


This recovered the VM but lost vmhba32.
A df only showed disk1 on vmhba2 (RAID array)

[root@dc2-vm2:/vmfs/volumes/605f5530-46c79eec-6faa-2cea7fe99e94/cbs-apa6] df -h
Filesystem Size  Used Available Use% Mounted on
VMFS-6     1.7T 22.0G      1.7T   1% /vmfs/volumes/disk1


vmhba32 showing as "Failed" in UI


[root@dc2-vm2:~] esxcfg-rescan -u vmhba32


This recovered vmhba32 in the UI and on a df
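The rescan message above points at 'storage core device world list' to find out which worlds still hold the dead path open. A small Python sketch of reading that table follows; it assumes the four-column layout shown in this post, `worlds_using` is a hypothetical helper name, and the sample text is the first listing from the recovery steps.

```python
def worlds_using(device, table_text):
    """Return (world_id, world_name) pairs for rows whose device column matches."""
    holders = []
    for row in table_text.splitlines():
        # Columns: device, world ID, open count, world name (may contain spaces).
        fields = row.split(None, 3)
        if len(fields) == 4 and fields[0] == device and fields[1].isdigit():
            holders.append((int(fields[1]), fields[3].strip()))
    return holders


SAMPLE = """\
Device                                World ID  Open Count  World Name
------------------------------------  --------  ----------  ----------
naa.62cea7f09b32e20027f2031de52d2e8c   2097225           1  idle0
naa.62cea7f09b32e20027f2031de52d2e8c   2097565           1  bcflushd
naa.62cea7f09b32e20027f2031de52d2e8c   2098156           1  J6AsyncReplayManager
naa.62cea7f09b32e20027f2031de52d2e8c   2099437           1  hostd
naa.62cea7f09b32e20027f2031de52d2e8c   2111296           1  vmm0:cbs-apa6
naa.62cea7f09b32e20027f2031de52d2e8c   2118636           1  sh
mpx.vmhba32:C0:T0:L0                   2099437           1  hostd
mpx.vmhba32:C0:T0:L0                   2118627           1  python
"""

for world_id, name in worlds_using("mpx.vmhba32:C0:T0:L0", SAMPLE):
    print(world_id, name)
```

For the first listing this reports hostd and a python world on the SD card device. In the second listing only hostd remains on it.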

Responses(117)

DiegoLopez

4 Operator

•

2.7K Posts

0

March 30th, 2021 07:00

Hello @cbs_technical,

A couple of additional questions please: is your VMware Dell OEM supported? I mean, was the OS license bought with the server or later? I want to check if the OS is covered by the warranty because, if so, the case could be escalated to the OS Support team.

I am pretty sure the Kingston DC500R drives are not supported. As far as I know, those are not enterprise-certified drives. Can you please remove them from one of the systems and check if the problem still appears?

What model are the SD cards? Were they shipped with the server, or did you buy them?

Can you please confirm you followed these steps for the OS installation: Installing ESXi on flash media: https://dell.to/2Pj52vg

Can you please post a firmware inventory screenshot so we can check all firmware versions? (You can check this from the iDRAC.)

Thank you in advance.
Regards.

cbs_technical

13 Posts

0

March 30th, 2021 08:00

Hi Diego,

Thanks for the response:

A couple of additional questions please: is your VMware Dell OEM supported? I mean, was the OS license bought with the server or later? I want to check if the OS is covered by the warranty because, if so, the case could be escalated to the OS Support team.

I downloaded the latest DellEMC image from VMware:
VMware-VMvisor-Installer-7.0.0.update02-17630552.x86_64-DellEMC_Customized-A00.iso

I am using the free license.

I am pretty sure the Kingston DC500R drives are not supported. As far as I know, those are not enterprise-certified drives. Can you please remove them from one of the systems and check if the problem still appears?

The DC500R are part of Kingston's enterprise data centre series:

https://www.kingston.com/unitedkingdom/en/ssd/dc500-data-center-solid-state-drive

We already have 10 x T620 servers using Kingston enterprise drives and we haven't had any issues with them in the past. I did check with sales when we purchased these drives and they said they were supported.

I am not at work today but I can go in tomorrow and swap out the drives in one of the problem servers with a couple of the original Dell drives to test.

What model are the SD cards? Were they shipped with the server, or did you buy them?

They were supplied by Dell and pre-installed on the servers.

Can you please confirm you followed these steps for the OS installation: Installing ESXi on flash media: https://dell.to/2Pj52vg

I installed ESXi using these steps. The only difference is that I used the Virtual CD/DVD/ISO device from the iDRAC from a mounted ISO on my PC instead of removable media. The PC was connected to the server on the same wired LAN.

Can you please post a firmware inventory screenshot so we can check all firmware versions? (You can check this from the iDRAC.)

Here is the firmware inventory list from one of the servers having the issue:

OS COLLECTOR, v6.0, A00	6.0
Internal Dual SD Module	1.13
Backplane 1	2.52
Dell EMC iDRAC Service Module Embedded Package v3.5.1, A00	3.5.1
Power Supply.Slot.1	00.26.35
Power Supply.Slot.2	00.26.35
PERC H740P Adapter	51.13.2-3714
BIOS	2.10.0
Dell OS Driver Pack, 20.08.09, A00	20.08.09
Integrated Dell Remote Access Controller	4.40.00.00
Dell 64 Bit uEFI Diagnostics, version 4301, 4301A50, 4301.51	4301A50
Broadcom Gigabit Ethernet BCM5720 - 2C:EA:7F:E9:A5:84	21.60.16
Broadcom Gigabit Ethernet BCM5720 - 2C:EA:7F:E9:A5:85	21.60.16
Broadcom Gigabit Ethernet BCM5720 - 2C:EA:7F:E9:A5:86	21.60.16
Broadcom Gigabit Ethernet BCM5720 - 2C:EA:7F:E9:A5:87	21.60.16
System CPLD	1.1.4
Lifecycle Controller	4.40.00.00

regards,
Aidan

Dell-DylanJ

4 Operator

•

2.9K Posts

0

March 30th, 2021 09:00

Hello,

With the OS license not having been purchased from us, your support coverage wouldn't extend to the software side of things. That having been said, I can still grab an R740 and install our version of ESXi on it to try to confirm if there is an issue with that image or not. Looking through your post, I think our processes and configuration will be pretty much the same, except for the Kingston drives. I'm willing to bet that the system I pull will have differing makes and models of storage.

We can also look at hardware with you, but as you had indicated, I wouldn't jump to the conclusion that this is a hardware issue quite yet. I'll bring my test system up to the latest firmware, which may save you some time as well. If it seems to make a difference on my side, I'll certainly let you know.

The files and folders that you're losing access to - are those mounted on the SD storage right along with the rest of ESXi, or are they on your hard drives? With the fast path and HBA, I'd expect the issue to be communication with external storage, like a SAN. If this is all internal storage, the best play might be to export the PERC log and see if it gives any further indication of what is happening. Exporting the PERC log can be done through the OS, but would also require a reboot to complete the install of OpenManage. You can also acquire this log through the iDRAC using a SupportAssist collection without any additional software installation. If there is indeed a communication problem with those Kingston drives, I'd expect it to show up there.

If you'd like to send me the service tags of your systems, I'd be happy to check your support coverage for you. At the very least, this can help identify some options to assist you.

HONGLMN

6 Posts

0

March 30th, 2021 11:00

Hello.

I am having a very, very similar - if not identical - experience with our Dell R740xd servers.

They are pre-installed with from-factory SD cards and SSD drives for the local storage.

I am seeing the same messages as the original poster. I started having problems after the 7.0 U2 update.

cbs_technical

13 Posts

0

March 30th, 2021 12:00

Interesting @HONGLMN. As these are new machines, I have gone straight to 7.0 U2. What version were you running previously? I may be able to test the previous version you were running on my hardware to see if the problem goes away.

regards,
Aidan

cbs_technical

13 Posts

0

March 30th, 2021 12:00

Hi @HONGLMN

I've looked at the Dell releases, and the last one before 7.0 U2 was released on 15 Jan and corresponds to 7.0 U1c (VMware build 17325551).

I'll get one of my servers installed with that version and report any issues I get.

As a matter of interest, are you losing access to any of the VMs on the ESXi hosts when the problem occurs or just access to ESXi and the SD card? I'm not running vCenter so I'm accessing ESXi locally on each host.

regards,
Aidan

cbs_technical

13 Posts

0

March 30th, 2021 12:00

Hi Dylan,

Many thanks for your help.

The files and folders that you're losing access to - are those mounted on the SD storage right along with the rest of ESXi, or are they on your hard drives?

The problem is just with access to files on the SD card itself. I can still access files on the hard drives. My test VM sits on a datastore on the hard drive (RAID array) and when the problem arises the VMs continue on as normal. The problem happened a couple of times today and I managed to remount the SD card (vmhba32) without affecting the VM in any way. The PERC, which is seen as vmhba2 by ESXi, always remains mounted and has not had a problem.

The .locker folder is mounted on the first datastore and the logs are still being updated while access to the SD is lost.

On one of the problem machines, I re-installed ESXi on the first array, bypassing the SD card altogether, and this machine has been running for 4 days now without any issues.

I have not set up Support Assist on any of the servers yet but I will do this tomorrow on the problem machines to see if that highlights anything.

I will PM the service tags to you along with details of which units have had issues.

I will also follow Diego's advice and put some Dell disks in one of the problem machines to take the Kingston disks out of the equation.

regards,
Aidan

HONGLMN

6 Posts

0

March 30th, 2021 12:00

It was the Dell-specific 7.0 U1d distribution. I don't have the specific build number - I'm not in the office today.

Also, my ESXi servers are connected to vCenter, and there was (still is?) a bug using vCenter to upgrade them from 7.0 U1d to 7.0 U2. So, after they were borked by that bug, I upgraded them by booting them from a CD with the Dell-specific 7.0 U2 image and upgrading them manually to recover them.

But, these are production servers, and I am having to reboot them to recover them every few days. I installed Skyline Health Diagnostics and collected the log files. They show almost identical errors to what you originally posted.

H

HONGLMN

6 Posts

0

March 31st, 2021 08:00

Okay, I must have used 7.0 U1c and then upgraded within vCenter to 7.0 U1d before going to 7.0 U2.

In addition to the

Bootbank cannot be found at path '/bootbank'

messages, I start getting warnings about hostd performance issues and "possible storage bottleneck" which, according to what I can tell from Skyline Health Diagnostics, might be related to "storage issues including HBA."

So, the real world impact is that VMs start to appear unresponsive. (I don't think they're really "dead" though, they appear unresponsive because any I/O they do is really, really, really slow.) The ESXi server's CPUs are nearly 100% idle during this time.

In order to recover, I have to "Force restart VMs" from the ESXi server console and reboot the server. And that takes 30 minutes to happen due to I/O issues too.

cbs_technical

13 Posts

0

March 31st, 2021 14:00

Hi @Dell-DylanJ,

I have installed one of the problem servers with 7.0U2 on Dell disks as per @DiegoLopez's advice.

@HONGLMN I have set up another of the problem servers on 7.0 U1 (the stock Dell 7.0 U1c version) to see if the problem still occurs.

I now have a total of 7 servers configured with VMs and will keep checking to see if the SD card locks up in any of them. The server that I re-installed on the RAID array (bypassing the SD card) is still running fine without any issues.

I will get SupportAssist set up on the servers tomorrow and will post as soon as any of the servers hit a problem.

regards,
Aidan

NLO_elisa

6 Posts

0

April 2nd, 2021 04:00

@cbs_technical

Any news?

We have the same issue on our brand new PowerEdge R440.
We ran 7.0 U1 (Dell version A01, build 17325551) for 2 months without problems, then upgraded to 7.0 U2 (Dell version A00, build 17630552) on 2021/03/11.
The host became unresponsive twice in 2 weeks.

After analyzing VMware Skyline Health Diagnostics for vSphere:

2021-04-01T20:07:40.294Z cpu0:2129784)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba32:C0:T0:L0" state in doubt; requested fast path state update...
2021-04-01T20:07:40.294Z cpu0:2129784)ScsiDeviceIO: 4315: Cmd(0x45b90c7ee1c0) 0x28, cmdId.initiator=0x4304f04dd140 CmdSN 0x1 from world 2130054 to dev "mpx.vmhba32:C0:T0:L0" failed H:0x5 D:0x0 P:0x0 Cancelled from path layer. Cmd count Active:1

followed by

2021-04-01T19:07:41.056Z cpu16:2129742)ALERT: Bootbank cannot be found at path '/bootbank'
2021-04-01T20:07:40.349Z cpu20:2130064)ALERT: Bootbank cannot be found at path '/bootbank'

We can still log on to ESXi and vCenter but we are unable to issue commands; the only solution is to reboot the host.

Yesterday we rolled back to 7.0 U1 (Dell version A01, build 17325551). Hopefully this will solve the issue.

DELL-Chris H

Moderator

•

9.2K Posts

0

April 2nd, 2021 05:00

It is not recommended to attempt upgrades to ESXi 7.0 with the A00 or A01 version customized ISO image. The A00 / A01 Dell EMC customized ISO is suitable for new installations of ESXi 7.0.

Are any of you seeing these issues on fresh installs of 7.0?

cbs_technical

13 Posts

0

April 2nd, 2021 13:00

@NLO_elisa It looks like you are seeing the same issues and, like @HONGLMN, only after upgrading from 7.0 U1 to 7.0 U2. As these are new servers, I have only ever tried 7.0 U2 and have encountered the problem on 5 of the 7 servers I am testing.

Of the 7 I currently have on test:

* 5 of them are running on the SD card with 7.0 U2
* 1 of them is running on the SD card with 7.0 U1
* 1 is running on the local SSD drives (no SD in use) with 7.0 U2.

As of yet I have only had issues running on the SD card with 7.0 U2.

I have installed SupportAssist on all 7 servers and am just waiting for one to fail so that I can send a support collection to Dell for analysis. Nothing has failed yet since yesterday afternoon.

@DELL-Chris H I have only been using fresh installs of 7.0 U2 and 7.0 U1. No downgrades or upgrades.

regards,
Aidan

mattjudson

8 Posts

0

April 4th, 2021 18:00

We are experiencing the same issue. We have a sev 1 / sev 2 VMware case open on this and haven't got this much info yet. I will be looping in our Dell tech team on Monday after seeing all this as well.

We have 111 R540s that were upgraded from 7.0 U1c to 7.0 U2 last week, and almost immediately we started seeing ~10 new hosts disconnecting from vCenter per day. This is ultimately due to the hostd process deteriorating and /bootbank becoming inaccessible. Meanwhile, we have 40 other R540s that use mirrored SAS drives for the ESXi OS install instead of the mirrored SD cards, and they have not had any issue whatsoever.

There is definitely something funky with 7.0 U2 and the vmhba32 USB driver with the mirrored SD card modules on that specific config. Seeing this really makes me think we need to revert to U1c.

DELL-Joey C

Moderator

•

3.7K Posts

0

April 4th, 2021 23:00

Hi @mattjudson and @cbs_technical,

Thanks for updating us on the issue. It may be an issue with the driver communication between the OS and the IDSDM.

When the issue occurs, is iDRAC able to detect the SD cards? I also found this VMware KB: https://dell.to/3rPh6Sd. Is it related?
