
April 8th, 2015 02:00

A case of how a faulty HBA impacts other fabric devices




Introduction

As more and more virtual machines are deployed, troubleshooting becomes difficult when some of them break down. This article introduces one such case.


Detailed Information

One day around 16:00, a customer found that dozens of virtual machines suffered serious performance degradation. After checking the switches and storage arrays, they found no hardware failure, and since the VMs were deployed on different physical servers, they could not locate the cause of the fault.

Engineers arrived on site and exchanged opinions with the customer. They found that the physical servers hosting the faulty VMs were connected to two Brocade DCX switches and two VNX arrays. They decided to troubleshoot the problem in this order: storage – switch – physical server – VM.

We quickly checked the two storage arrays: no hardware failure was found and there were no error logs around 16:00, so we excluded the possibility of a storage issue.

We continued to check the switches and found that one switch port had started continuous link resets at 15:55. The log is as follows (a small parsing sketch follows the log excerpt):

15:55:05.786101 SCN LR_PORT (0);g=0x0                       D2,P0  D2,P0  221   NA  

15:55:08.985736 SCN LR_PORT (0);g=0x0                       D2,P0  D2,P0  221   NA  

15:55:12.180839 SCN LR_PORT (0);g=0x0                       D2,P0  D2,P0  221   NA  

15:55:15.380806 SCN LR_PORT (0);g=0x0                       D2,P0  D2,P0  221   NA  

15:55:18.580879 SCN LR_PORT (0);g=0x0                       D2,P0  D2,P0  221   NA  

15:55:21.786292 SCN LR_PORT (0);g=0x0                       D2,P0  D2,P0  221   NA  

15:55:24.985815 SCN LR_PORT (0);g=0x0                       D2,P0  D2,P0  221   NA

…….
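For anyone who wants to automate this first check, below is a minimal Python sketch (added here for illustration; it is not part of the original case). It scans a saved switch log in the format of the excerpt above and reports how many LR_PORT events each port generated. The log file name switch_scn.log and the exact field layout are assumptions based on the excerpt.

```python
import re
from collections import defaultdict

# Assumption: each event line looks like the excerpt above, e.g.
# 15:55:05.786101 SCN LR_PORT (0);g=0x0  D2,P0  D2,P0  221  NA
# with the port number as the last integer before the trailing "NA".
LR_LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+SCN\s+LR_PORT\b.*\s(?P<port>\d+)\s+NA\s*$"
)

def count_link_resets(log_path):
    """Return {port: [timestamps]} for every LR_PORT SCN event in the log."""
    resets = defaultdict(list)
    with open(log_path) as log:
        for line in log:
            match = LR_LINE.match(line.strip())
            if match:
                resets[int(match.group("port"))].append(match.group("time"))
    return resets

if __name__ == "__main__":
    for port, times in sorted(count_link_resets("switch_scn.log").items()):
        print(f"port {port}: {len(times)} link resets "
              f"between {times[0]} and {times[-1]}")
```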

We confirmed with the customer that port 221 connected to an ESXi host's HBA and was zoned to SPA & SPB of both VNXc and VNXd. The zone configuration for VNXc is as follows:

zone:  DELL_R910E9_PCI1_VNXc                     

         10:00:00:90:fa:12:01:01

         50:06:01:61:46:e0:7a:eb           <-port:1     spa

         50:06:01:69:46:e0:7a:eb           <-port:129  spa

         50:06:01:67:46:e0:7a:eb           <-port:17    spa

         50:06:01:6f:46:e0:7a:eb           <-port:145   spb

         50:06:01:66:46:e0:7a:eb           <-port:97    spb

         50:06:01:6e:46:e0:7a:eb           <-port:225   spb

The zone configuration for VNXd is as follows:

zone: DELL_R910E9_PCI1_VNXd                     

         10:00:00:90:fa:12:01:01

         50:06:01:61:46:e0:7b:5e           <-port:2       spa

         50:06:01:69:46:e0:7b:5e           <-port:130    spa

         50:06:01:67:46:e0:7b:5e           <-port:18      spa

         50:06:01:6f:46:e0:7b:5e           <-port:146     spb

         50:06:01:66:46:e0:7b:5e           <-port:98      spb

         50:06:01:6e:46:e0:7b:5e           <-port:226     spb

We checked the error counters of port 221 as below:

frames      enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy

       tx     rx      in    err    g_eof  shrt   long   eof     out   c3    fail    sync   sig

     =================================================================

221:  555.8m   3.3g   0      2      2      0      0      0    146.8k  31.3k   2      2      3      0      0

We could see that enc_out = 146.8k and disc_c3 = 31.3k (both cumulative). A high disc_c3 value means a large number of Class 3 frames were discarded, i.e. frame loss.
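Since these counters are cumulative, a common follow-up (not described in the original article, added here as a hypothetical sketch) is to capture the counter line twice, a few minutes apart, and look only at counters that are still increasing. The field order and the k/m/g suffixes are inferred from the error-counter output shown above; the second snapshot values in the example are made up.

```python
# Hypothetical helper: parse one error-counter line in the format shown above
# and compare two snapshots of the same port so that only counters that are
# still growing are reported. Field order follows the header in the article.
FIELDS = ["frames_tx", "frames_rx", "enc_in", "crc_err", "crc_g_eof",
          "too_shrt", "too_long", "bad_eof", "enc_out", "disc_c3",
          "link_fail", "loss_sync", "loss_sig", "frjt", "fbsy"]

SUFFIX = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}

def parse_counter(token):
    """Convert values such as '146.8k' or '3.3g' into plain integers."""
    if token[-1].lower() in SUFFIX:
        return round(float(token[:-1]) * SUFFIX[token[-1].lower()])
    return int(token)

def parse_line(line):
    port, rest = line.split(":", 1)
    values = [parse_counter(tok) for tok in rest.split()]
    return {"port": int(port), **dict(zip(FIELDS, values))}

def growing_errors(before, after, keys=("enc_out", "disc_c3")):
    """Return how much each counter of interest increased between snapshots."""
    b, a = parse_line(before), parse_line(after)
    return {k: a[k] - b[k] for k in keys if a[k] > b[k]}

# Example: the first line is from the article; the second snapshot is made up.
before = "221:  555.8m  3.3g  0  2  2  0  0  0  146.8k  31.3k  2  2  3  0  0"
after  = "221:  556.1m  3.3g  0  2  2  0  0  0  147.9k  32.0k  2  2  3  0  0"
print(growing_errors(before, after))   # -> {'enc_out': 1100, 'disc_c3': 700}
```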

We then ran the command portstatsshow 221:

tim_txcrd_z             705147375   Time TX Credit Zero (2.5Us ticks)

tim_txcrd_z_vc  0- 3:   0           0           0           705147375

tim_txcrd_z_vc  4- 7:   0           0           0           0

tim_txcrd_z_vc  8-11:   0           0           0           0

tim_txcrd_z_vc 12-15:   0           0           0           0

er_enc_in               0           Encoding errors inside of frames

er_crc                  2           Frames with CRC errors

er_trunc                0           Frames shorter than minimum

er_toolong              0           Frames longer than maximum

er_bad_eof              0           Frames with bad end-of-frame

er_enc_out              146801      Encoding error outside of frames

er_bad_os               1753318851  Invalid ordered set

er_rx_c3_timeout        0           Class 3 receive frames discarded due to timeout

er_tx_c3_timeout        31335       Class 3 transmit frames discarded due to timeout

er_c3_dest_unreach      0           Class 3 frames discarded due to destination unreachable

Compared with other ports, the tim_txcrd_z value of port 221 was much higher. tim_txcrd_z is defined as the number of times the port was polled and was unable to transmit frames because the transmit buffer-to-buffer credit (BB credit) was zero. The purpose of this statistic is to detect congestion or a device affected by latency.
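As a rough illustration (added here, using the 2.5 µs tick size stated in the output above), the raw tick count can be converted into the total time the port spent with zero transmit credit. The counter is cumulative since the last statistics clear, so this is an upper bound on the time spent starved during the incident itself.

```python
# Convert tim_txcrd_z ticks into seconds. The 2.5 microsecond tick size comes
# from the portstatsshow output above; the counter is cumulative since the
# last stats clear, not scoped to the incident window.
TICK_SECONDS = 2.5e-6

def txcrd_zero_seconds(ticks):
    return ticks * TICK_SECONDS

print(txcrd_zero_seconds(705_147_375))  # port 221 -> ~1763 s, about 29 minutes
```

Roughly half an hour with zero transmit credit suggests sustained credit starvation rather than a brief glitch.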

Meanwhile, the value of er_bad_os was also very high, which indicates a communication protocol error or an HBA fault. Combined with the high er_enc_out counter, a poor connection or a high failure rate on the physical transmission media could also be the cause.

A performance problem on the device attached to one port can degrade the performance of the whole fabric. See the port status on switch DCXa, which connected to VNXc SPA & SPB:

portstatsshow 1

stat_wtx                3592663296  4-byte words transmitted

stat_wrx                2724642389  4-byte words received

stat_ftx                690181059   Frames transmitted

stat_frx                3815603788  Frames received

stat_c2_frx             0           Class 2 frames received

stat_c3_frx             3815603788  Class 3 frames received

stat_lc_rx              0           Link control frames received

stat_mc_rx              0           Multicast frames received

stat_mc_to              0           Multicast timeouts

stat_mc_tx              0           Multicast frames transmitted

tim_rdy_pri             0           Time R_RDY high priority

tim_txcrd_z             1117455850  Time TX Credit Zero (2.5Us ticks)

tim_txcrd_z_vc  0- 3:   0           0           0           1117455850

tim_txcrd_z_vc  4- 7:   0           0           0           0

tim_txcrd_z_vc  8-11:   0           0           0           0

tim_txcrd_z_vc 12-15:   0           0           0           0

The six ports connected to VNXc SPA & SPB had tim_txcrd_z values much higher than the other ports'. It is reasonable to suspect that during the continuous link resets on port 221, the BB credits of these ports were used up. The congestion lasted for a period of time and degraded the performance of other devices in the fabric.

The status of the ports connected to VNXd was the same.
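To spot this pattern quickly across a whole switch, one option (a hypothetical sketch, not a tool used in the original troubleshooting) is to collect tim_txcrd_z for every port and flag the outliers. The 10x-median threshold and the values for the healthy ports below are made up for illustration.

```python
from statistics import median

def flag_congested_ports(txcrd_z, factor=10.0):
    """Return ports whose tim_txcrd_z tick count is far above the median.

    txcrd_z maps port number -> tim_txcrd_z ticks, e.g. collected from
    portstatsshow on every port. The 10x-median threshold is only an
    illustrative starting point, not a vendor recommendation.
    """
    typical = median(txcrd_z.values())
    return sorted(port for port, ticks in txcrd_z.items()
                  if ticks > factor * max(typical, 1))

# Port 221 and the VNXc-facing port 1 use the values quoted in the article;
# the remaining ports carry made-up "healthy" values.
sample = {221: 705_147_375, 1: 1_117_455_850, 5: 12_000, 6: 8_500, 7: 15_200}
print(flag_congested_ports(sample))   # -> [1, 221]
```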

Based on this, the customer isolated the HBA connected to port 221, and all the VM applications recovered.

                                                                                                                                              

Author: Jiawen Zhang





2 Intern • 20.4K Posts

April 8th, 2015 04:00

Very nice. Do you have a similar cheat sheet for detecting slow drain on Cisco MDS?

2 Intern • 308 Posts

April 9th, 2015 23:00

Hi Dynamox,

For detecting slow drain, you should check for output discards, link failures with the "LR Rcvd B2B" message, and credit loss.

January 22nd, 2021 00:00

Do we know the root cause of that issue? We hit the same problem yesterday. One ESX port generated lots of SCN LR_PORT events along with rising enc_out and disc_c3 counters, killing the ISL links and all devices connected to them. The result was high response times for all block devices connected to the affected ports.

 

The same story is described here: https://support.hpe.com/hpesc/public/docDisplay?docId=kc0132818en_us&docLocale=en_US
