
April 8th, 2015 02:00

A case of how a faulty HBA impacts other fabric devices




Introduction

As more and more virtual machines are deployed, troubleshooting becomes difficult when some of them break down. This article introduces one such case.


Detailed Information

One day around 16:00, a customer found that dozens of virtual machines suffered serious performance degradation. After checking the switches and storage arrays, they found no hardware failure, and since the VMs were deployed on different physical servers, they could not locate the cause of the fault.

Engineers arrived on site and exchanged opinions with the customer. They found that the physical servers hosting the faulty VMs were connected to two Brocade DCX switches and two VNX arrays. They decided to troubleshoot the problem in this order: storage – switch – physical server – VM.

We quickly checked the two storage arrays: no hardware failure was found and there were no error logs around 16:00, so we excluded the possibility of a storage issue.

We continued to check the switches and found that one switch port had started continuous link resets at 15:55. The log is as follows (a small parsing sketch follows the log excerpt):

15:55:05.786101 SCN LR_PORT (0);g=0x0                       D2,P0  D2,P0  221   NA  

15:55:08.985736 SCN LR_PORT (0);g=0x0                       D2,P0  D2,P0  221   NA  

15:55:12.180839 SCN LR_PORT (0);g=0x0                       D2,P0  D2,P0  221   NA  

15:55:15.380806 SCN LR_PORT (0);g=0x0                       D2,P0  D2,P0  221   NA  

15:55:18.580879 SCN LR_PORT (0);g=0x0                       D2,P0  D2,P0  221   NA  

15:55:21.786292 SCN LR_PORT (0);g=0x0                       D2,P0  D2,P0  221   NA  

15:55:24.985815 SCN LR_PORT (0);g=0x0                       D2,P0  D2,P0  221   NA

…….
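For anyone who wants to automate this first check, below is a minimal Python sketch (added here for illustration; it is not part of the original case). It scans a saved switch log in the format of the excerpt above and reports how many LR_PORT events each port generated. The log file name switch_scn.log and the exact field layout are assumptions based on the excerpt.

```python
import re
from collections import defaultdict

# Assumption: each event line looks like the excerpt above, e.g.
# 15:55:05.786101 SCN LR_PORT (0);g=0x0  D2,P0  D2,P0  221  NA
# with the port number as the last integer before the trailing "NA".
LR_LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+SCN\s+LR_PORT\b.*\s(?P<port>\d+)\s+NA\s*$"
)

def count_link_resets(log_path):
    """Return {port: [timestamps]} for every LR_PORT SCN event in the log."""
    resets = defaultdict(list)
    with open(log_path) as log:
        for line in log:
            match = LR_LINE.match(line.strip())
            if match:
                resets[int(match.group("port"))].append(match.group("time"))
    return resets

if __name__ == "__main__":
    for port, times in sorted(count_link_resets("switch_scn.log").items()):
        print(f"port {port}: {len(times)} link resets "
              f"between {times[0]} and {times[-1]}")
```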

We confirmed with the customer that port 221 connected to an ESXi host's HBA and was zoned to SPA & SPB of both VNXc and VNXd. The zone configuration for VNXc is as follows:

zone:  DELL_R910E9_PCI1_VNXc                     

         10:00:00:90:fa:12:01:01

         50:06:01:61:46:e0:7a:eb           <-port:1     spa

         50:06:01:69:46:e0:7a:eb           <-port:129  spa

         50:06:01:67:46:e0:7a:eb           <-port:17    spa

         50:06:01:6f:46:e0:7a:eb           <-port:145   spb

         50:06:01:66:46:e0:7a:eb           <-port:97    spb

         50:06:01:6e:46:e0:7a:eb           <-port:225   spb

The zone configuration for VNXd is as follows:

zone: DELL_R910E9_PCI1_VNXd                     

         10:00:00:90:fa:12:01:01

         50:06:01:61:46:e0:7b:5e           <-port:2       spa

         50:06:01:69:46:e0:7b:5e           <-port:130    spa

         50:06:01:67:46:e0:7b:5e           <-port:18      spa

         50:06:01:6f:46:e0:7b:5e           <-port:146     spb

         50:06:01:66:46:e0:7b:5e           <-port:98      spb

         50:06:01:6e:46:e0:7b:5e           <-port:226     spb

We checked the error counters of port 221 as below:

frames      enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy

       tx     rx      in    err    g_eof  shrt   long   eof     out   c3    fail    sync   sig

     =================================================================

221:  555.8m   3.3g   0      2      2      0      0      0    146.8k  31.3k   2      2      3      0      0

We could see that enc_out = 146.8k and disc_c3 = 31.3k (both cumulative). A high disc_c3 value means a large number of Class 3 frames were discarded, i.e. frame loss.
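Since these counters are cumulative, a common follow-up (not described in the original article, added here as a hypothetical sketch) is to capture the counter line twice, a few minutes apart, and look only at counters that are still increasing. The field order and the k/m/g suffixes are inferred from the error-counter output shown above; the second snapshot values in the example are made up.

```python
# Hypothetical helper: parse one error-counter line in the format shown above
# and compare two snapshots of the same port so that only counters that are
# still growing are reported. Field order follows the header in the article.
FIELDS = ["frames_tx", "frames_rx", "enc_in", "crc_err", "crc_g_eof",
          "too_shrt", "too_long", "bad_eof", "enc_out", "disc_c3",
          "link_fail", "loss_sync", "loss_sig", "frjt", "fbsy"]

SUFFIX = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}

def parse_counter(token):
    """Convert values such as '146.8k' or '3.3g' into plain integers."""
    if token[-1].lower() in SUFFIX:
        return round(float(token[:-1]) * SUFFIX[token[-1].lower()])
    return int(token)

def parse_line(line):
    port, rest = line.split(":", 1)
    values = [parse_counter(tok) for tok in rest.split()]
    return {"port": int(port), **dict(zip(FIELDS, values))}

def growing_errors(before, after, keys=("enc_out", "disc_c3")):
    """Return how much each counter of interest increased between snapshots."""
    b, a = parse_line(before), parse_line(after)
    return {k: a[k] - b[k] for k in keys if a[k] > b[k]}

# Example: the first line is from the article; the second snapshot is made up.
before = "221:  555.8m  3.3g  0  2  2  0  0  0  146.8k  31.3k  2  2  3  0  0"
after  = "221:  556.1m  3.3g  0  2  2  0  0  0  147.9k  32.0k  2  2  3  0  0"
print(growing_errors(before, after))   # -> {'enc_out': 1100, 'disc_c3': 700}
```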

We then ran the command portstatsshow 221:

tim_txcrd_z             705147375   Time TX Credit Zero (2.5Us ticks)

tim_txcrd_z_vc  0- 3:   0           0           0           705147375

tim_txcrd_z_vc  4- 7:   0           0           0           0

tim_txcrd_z_vc  8-11:   0           0           0           0

tim_txcrd_z_vc 12-15:   0           0           0           0

er_enc_in               0           Encoding errors inside of frames

er_crc                  2           Frames with CRC errors

er_trunc                0           Frames shorter than minimum

er_toolong              0           Frames longer than maximum

er_bad_eof              0           Frames with bad end-of-frame

er_enc_out              146801      Encoding error outside of frames

er_bad_os               1753318851  Invalid ordered set

er_rx_c3_timeout        0           Class 3 receive frames discarded due to timeout

er_tx_c3_timeout        31335       Class 3 transmit frames discarded due to timeout

er_c3_dest_unreach      0           Class 3 frames discarded due to destination unreachable

Compared with other ports, the tim_txcrd_z value of port 221 was much higher. tim_txcrd_z is defined as the number of times the port was polled and was unable to transmit frames because the transmit buffer-to-buffer credit (BB credit) was zero. The purpose of this statistic is to detect congestion or a device affected by latency.
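As a rough illustration (added here, using the 2.5 µs tick size stated in the output above), the raw tick count can be converted into the total time the port spent with zero transmit credit. The counter is cumulative since the last statistics clear, so this is an upper bound on the time spent starved during the incident itself.

```python
# Convert tim_txcrd_z ticks into seconds. The 2.5 microsecond tick size comes
# from the portstatsshow output above; the counter is cumulative since the
# last stats clear, not scoped to the incident window.
TICK_SECONDS = 2.5e-6

def txcrd_zero_seconds(ticks):
    return ticks * TICK_SECONDS

print(txcrd_zero_seconds(705_147_375))  # port 221 -> ~1763 s, about 29 minutes
```

Roughly half an hour with zero transmit credit suggests sustained credit starvation rather than a brief glitch.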

Meanwhile, the value of er_bad_os was also very high, which indicates a communication protocol error or an HBA fault. Combined with the high er_enc_out counter, a poor connection or a high failure rate on the physical transmission media could also be the cause.

A performance problem on the device attached to one port can degrade the performance of the whole fabric. See the port status on switch DCXa, which connected to VNXc SPA & SPB:

portstatsshow 1

stat_wtx                3592663296  4-byte words transmitted

stat_wrx                2724642389  4-byte words received

stat_ftx                690181059   Frames transmitted

stat_frx                3815603788  Frames received

stat_c2_frx             0           Class 2 frames received

stat_c3_frx             3815603788  Class 3 frames received

stat_lc_rx              0           Link control frames received

stat_mc_rx              0           Multicast frames received

stat_mc_to              0           Multicast timeouts

stat_mc_tx              0           Multicast frames transmitted

tim_rdy_pri             0           Time R_RDY high priority

tim_txcrd_z             1117455850  Time TX Credit Zero (2.5Us ticks)

tim_txcrd_z_vc  0- 3:   0           0           0           1117455850

tim_txcrd_z_vc  4- 7:   0           0           0           0

tim_txcrd_z_vc  8-11:   0           0           0           0

tim_txcrd_z_vc 12-15:   0           0           0           0

The six ports connected to VNXc SPA & SPB had tim_txcrd_z values much higher than the other ports'. It is reasonable to suspect that during the continuous link resets on port 221, the BB credits of these ports were used up. The congestion lasted for a period of time and degraded the performance of other devices in the fabric.

The status of the ports connected to VNXd was the same.
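To spot this pattern quickly across a whole switch, one option (a hypothetical sketch, not a tool used in the original troubleshooting) is to collect tim_txcrd_z for every port and flag the outliers. The 10x-median threshold and the values for the healthy ports below are made up for illustration.

```python
from statistics import median

def flag_congested_ports(txcrd_z, factor=10.0):
    """Return ports whose tim_txcrd_z tick count is far above the median.

    txcrd_z maps port number -> tim_txcrd_z ticks, e.g. collected from
    portstatsshow on every port. The 10x-median threshold is only an
    illustrative starting point, not a vendor recommendation.
    """
    typical = median(txcrd_z.values())
    return sorted(port for port, ticks in txcrd_z.items()
                  if ticks > factor * max(typical, 1))

# Port 221 and the VNXc-facing port 1 use the values quoted in the article;
# the remaining ports carry made-up "healthy" values.
sample = {221: 705_147_375, 1: 1_117_455_850, 5: 12_000, 6: 8_500, 7: 15_200}
print(flag_congested_ports(sample))   # -> [1, 221]
```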

Based on this, the customer isolated the HBA connected to port 221, and all the VM applications recovered.

                                                                                                                                              

Author: Jiawen Zhang





2 Intern • 20.4K Posts

April 8th, 2015 04:00

Very nice. Do you have a similar cheat sheet for detecting slow drain on Cisco MDS?

2 Intern • 308 Posts

April 9th, 2015 23:00

Hi Dynamox,

For detecting slow drain, you should check for output discards, link failures with the "LR Rcvd B2B" message, and credit loss.

January 22nd, 2021 00:00

Do we know the root cause of that issue? We hit the same problem yesterday. One ESX port generated lots of SCN LR_PORT events along with rising enc_out and disc_c3 counters, killing the ISL links and all devices connected to them. The result was high response times for all block devices connected to the affected ports.

 

The same story is described here: https://support.hpe.com/hpesc/public/docDisplay?docId=kc0132818en_us&docLocale=en_US
