A case of how fault HBA impacting other fabric device
As more and more virtual machines are deployed, troubleshooting becomes harder when some of them break down. This article walks through one such case.
Detailed Information
One day around 16:00, a customer found that dozens of virtual machines were suffering serious performance degradation. They checked the switches and storage arrays and found no hardware failure, and because the affected VMs were spread across different physical servers, they could not locate the cause of the fault.
Engineers arrived on site and, after discussing the symptoms with the customer, found that the physical servers hosting the faulty VMs were connected to two Brocade DCX switches and two VNX arrays. They decided to troubleshoot in the order: storage – switch – physical server – VM.
They quickly checked the two arrays and found neither a hardware failure nor any error logged around 16:00, so a storage issue was ruled out.
They then checked the switches and found that one switch port had started continuous link resets at 15:55. The log is as follows:
15:55:05.786101 SCN LR_PORT (0);g=0x0 D2,P0 D2,P0 221 NA
15:55:08.985736 SCN LR_PORT (0);g=0x0 D2,P0 D2,P0 221 NA
15:55:12.180839 SCN LR_PORT (0);g=0x0 D2,P0 D2,P0 221 NA
15:55:15.380806 SCN LR_PORT (0);g=0x0 D2,P0 D2,P0 221 NA
15:55:18.580879 SCN LR_PORT (0);g=0x0 D2,P0 D2,P0 221 NA
15:55:21.786292 SCN LR_PORT (0);g=0x0 D2,P0 D2,P0 221 NA
15:55:24.985815 SCN LR_PORT (0);g=0x0 D2,P0 D2,P0 221 NA
…….
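Repeating LR_PORT events like these can be tallied per port with a short script. This is a sketch that assumes the log line layout shown above (port number as the last field before the trailing NA):

```python
import re
from collections import Counter

# A Brocade fabric-log line of interest looks like:
# 15:55:05.786101 SCN LR_PORT (0);g=0x0 D2,P0 D2,P0 221 NA
LR_RE = re.compile(r"SCN LR_PORT .*?(\d+)\s+NA\s*$")

def count_link_resets(log_lines):
    """Count LR_PORT (link reset) events per port number."""
    counts = Counter()
    for line in log_lines:
        m = LR_RE.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts

log = [
    "15:55:05.786101 SCN LR_PORT (0);g=0x0 D2,P0 D2,P0 221 NA",
    "15:55:08.985736 SCN LR_PORT (0);g=0x0 D2,P0 D2,P0 221 NA",
]
print(count_link_resets(log))  # Counter({221: 2})
```

A port that resets every few seconds for minutes on end, as port 221 did here, stands out immediately in such a tally.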
The customer confirmed that port 221 connected to an ESXi host's HBA and that it was zoned to SP A and SP B of both VNXc and VNXd. The zone configuration for VNXc is as follows:
zone: DELL_R910E9_PCI1_VNXc
10:00:00:90:fa:12:01:01
50:06:01:61:46:e0:7a:eb <-port:1 spa
50:06:01:69:46:e0:7a:eb <-port:129 spa
50:06:01:67:46:e0:7a:eb <-port:17 spa
50:06:01:6f:46:e0:7a:eb <-port:145 spb
50:06:01:66:46:e0:7a:eb <-port:97 spb
50:06:01:6e:46:e0:7a:eb <-port:225 spb
The zone configuration for VNXd is as follows:
zone: DELL_R910E9_PCI1_VNXd
10:00:00:90:fa:12:01:01
50:06:01:61:46:e0:7b:5e <-port:2 spa
50:06:01:69:46:e0:7b:5e <-port:130 spa
50:06:01:67:46:e0:7b:5e <-port:18 spa
50:06:01:6f:46:e0:7b:5e <-port:146 spb
50:06:01:66:46:e0:7b:5e <-port:98 spb
50:06:01:6e:46:e0:7b:5e <-port:226 spb
Checking port 221 with porterrshow gave the following:
frames enc crc crc too too bad enc disc link loss loss frjt fbsy
tx rx in err g_eof shrt long eof out c3 fail sync sig
=================================================================
221: 555.8m 3.3g 0 2 2 0 0 0 146.8k 31.3k 2 2 3 0 0
We can see enc_out = 146.8k and disc_c3 = 31.3k (both cumulative). disc_c3 counts discarded Class 3 frames, so this indicates heavy frame loss.
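porterrshow abbreviates large counters with k/m/g suffixes, so comparing ports programmatically requires expanding them first. A minimal sketch (the column order is taken from the header above; the alert thresholds are illustrative only, not official guidance):

```python
SUFFIX = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}

def parse_count(token):
    """Expand porterrshow-style counts like '146.8k' into integers."""
    if token and token[-1].lower() in SUFFIX:
        return int(round(float(token[:-1]) * SUFFIX[token[-1].lower()]))
    return int(token)

# Column order from the porterrshow header above:
# frames_tx frames_rx enc_in crc crc_g_eof too_shrt too_long bad_eof
# enc_out disc_c3 link_fail loss_sync loss_sig frjt fbsy
row = "221: 555.8m 3.3g 0 2 2 0 0 0 146.8k 31.3k 2 2 3 0 0"
port, *vals = row.split()
enc_out, disc_c3 = parse_count(vals[8]), parse_count(vals[9])
if enc_out > 1_000 or disc_c3 > 1_000:  # illustrative thresholds
    print(f"port {port.rstrip(':')} looks suspect: "
          f"enc_out={enc_out}, disc_c3={disc_c3}")
```

Because these counters are cumulative since the last statsclear, what matters in practice is how fast they grow, not their absolute values.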
They then ran portstatsshow 221:
tim_txcrd_z 705147375 Time TX Credit Zero (2.5Us ticks)
tim_txcrd_z_vc 0- 3: 0 0 0 705147375
tim_txcrd_z_vc 4- 7: 0 0 0 0
tim_txcrd_z_vc 8-11: 0 0 0 0
tim_txcrd_z_vc 12-15: 0 0 0 0
er_enc_in 0 Encoding errors inside of frames
er_crc 2 Frames with CRC errors
er_trunc 0 Frames shorter than minimum
er_toolong 0 Frames longer than maximum
er_bad_eof 0 Frames with bad end-of-frame
er_enc_out 146801 Encoding error outside of frames
er_bad_os 1753318851 Invalid ordered set
er_rx_c3_timeout 0 Class 3 receive frames discarded due to timeout
er_tx_c3_timeout 31335 Class 3 transmit frames discarded due to timeout
er_c3_dest_unreach 0 Class 3 frames discarded due to destination unreachable
Compared with other ports, the tim_txcrd_z value of port 221 was much higher. tim_txcrd_z is defined as the number of times the port was polled and found unable to transmit frames because the transmit buffer-to-buffer credit (BB credit) was zero. The purpose of this statistic is to detect congestion or a device affected by latency.
Meanwhile, the value of er_bad_os was also very high, which points to a communication protocol error or an HBA fault. Combined with the high er_enc_out counter, poor contact or a failing physical transmission medium (cable or SFP) could also be the cause.
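Since tim_txcrd_z is counted in 2.5 µs ticks (as shown in the output above), it converts directly into total time spent with zero transmit credit. For port 221:

```python
TICK_SECONDS = 2.5e-6  # tim_txcrd_z is counted in 2.5 microsecond ticks

def zero_credit_seconds(tim_txcrd_z):
    """Convert a tim_txcrd_z tick count into seconds at zero TX credit."""
    return tim_txcrd_z * TICK_SECONDS

# Port 221 from the portstatsshow output above:
secs = zero_credit_seconds(705_147_375)
print(f"{secs:.0f} s spent with zero TX credit")  # roughly 1763 s (~29 min)
```

Note that the counter is cumulative since the last clear, so the figure is only meaningful relative to the port's uptime and to the other ports in the fabric.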
A performance problem on the device behind a single port can degrade the whole fabric. See the status of the DCXa switch ports connected to VNXc SP A & SP B:
portstatsshow 1
stat_wtx 3592663296 4-byte words transmitted
stat_wrx 2724642389 4-byte words received
stat_ftx 690181059 Frames transmitted
stat_frx 3815603788 Frames received
stat_c2_frx 0 Class 2 frames received
stat_c3_frx 3815603788 Class 3 frames received
stat_lc_rx 0 Link control frames received
stat_mc_rx 0 Multicast frames received
stat_mc_to 0 Multicast timeouts
stat_mc_tx 0 Multicast frames transmitted
tim_rdy_pri 0 Time R_RDY high priority
tim_txcrd_z 1117455850 Time TX Credit Zero (2.5Us ticks)
tim_txcrd_z_vc 0- 3: 0 0 0 1117455850
tim_txcrd_z_vc 4- 7: 0 0 0 0
tim_txcrd_z_vc 8-11: 0 0 0 0
tim_txcrd_z_vc 12-15: 0 0 0 0
The tim_txcrd_z values of the six ports connected to VNXc SP A & SP B were much higher than those of the other ports. It is reasonable to suspect that during the sustained link resets on port 221, the BB credits of these ports were exhausted. The congestion lasted for some time and degraded the performance of other devices in the fabric.
The ports connected to VNXd showed the same pattern.
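The comparison the engineers made by eye can be sketched as a simple outlier check: flag any port whose tim_txcrd_z is far above the fabric median. The port numbers, tick values, and the ×10 factor below are illustrative assumptions, not values from the post:

```python
from statistics import median

def flag_credit_starved(port_ticks, factor=10):
    """Flag ports whose tim_txcrd_z is far above the fabric-wide median."""
    med = median(port_ticks.values()) or 1  # avoid zero median on idle fabrics
    return {p: t for p, t in port_ticks.items() if t > factor * med}

# Hypothetical per-port tim_txcrd_z snapshot; ports 1 and 221 mirror the
# credit-starved values seen in the outputs above.
ticks = {0: 12_000, 1: 1_117_455_850, 2: 9_500,
         3: 14_200, 4: 11_000, 221: 705_147_375}
print(flag_credit_starved(ticks))  # flags ports 1 and 221
```

Running such a comparison across all switch ports quickly narrows the search to the slow-drain device and the storage ports it is starving.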
Based on this, the customer isolated the HBA connected to port 221, and all the VM applications recovered.
Author: Jiawen Zhang
dynamox
April 8th, 2015 04:00
Very nice. Do you have a similar cheat sheet for detecting slow drain on Cisco MDS?
ECN-APJ
April 9th, 2015 23:00
Hi Dynamox,
For detecting slow drain, you should check for output discards, link failures with the "LR Rcvd B2B" message, and credit loss.
lukasz.borek
January 22nd, 2021 00:00
Do we know the root cause of that issue? We hit the same problem yesterday: one ESX port generated lots of SCN LR_PORT events along with high enc_out and disc_c3 counters, overwhelming the ISLs and all devices connected to them. The result was high response time for all block devices connected to the affected ports.
The same story is described here: https://support.hpe.com/hpesc/public/docDisplay?docId=kc0132818en_us&docLocale=en_US