Connectrix B-Series switch: Defects FOS-849642 & FOS-847091 - Gen 7 directors and switches could encounter a failure condition that causes CRC errors, port faults, or a disruptive reboot

Summary: Gen 7 directors and switches (X7-8, X7-4, 7730 and 7720) running any version of Fabric OS (FOS) v9.1.x prior to v9.1.1c, or running FOS v9.2.0, could encounter a failure condition that causes CRC errors, port faults, or a disruptive reboot in response to severe congestion and activation of the oversubscription management behavior of the Traffic Optimizer feature (defects FOS-849642 and FOS-847091). These two defects will be corrected in FOS v9.1.1c and v9.2.0a. Pending this qualification, customers who are affected may choose to implement the workaround. ...

Article Content


Symptoms

Products Affected
Brocade X7-8, X7-4, 7730 and 7720 running FOS v9.1.x or FOS v9.2.0
Corrected in Releases
Brocade FOS v9.1.1c, v9.2.0a and higher versions

Only Gen 7 products are at risk.
Gen 7 directors (X7-8 and X7-4) with an FC64-48 and/or FC32-X7-48 port blade installed are at risk of encountering both the overflow and “verify” errors. FC32-64 and FC32-48 port blades installed in Gen 7 directors are not at risk of encountering either failure.
Gen 7 switches (G730 and G720) are only at risk of encountering the buffer overflow failure. These switches are not exposed to nor are they at risk of encountering the “verify” error failure condition.
To further be at risk, the fabric must experience severe congestion resulting in oversubscription management by Traffic
Optimizer. The following RASlog message will be observed if this level of response was ever encountered:
[TO-1006], 1011618/1002267, FID 128, INFO, Switch_100, Flows destined to dev02 device have been moved to PG_OVER_SUBSCRIPTION_4G_16G PG., cfs_ctrlr.c, line: 1470, comp:cfsd, ltime:2023/05/17-06:15:33:923058
The oversubscription management action by Traffic Optimizer exists only in FOS v9.1.x and later firmware. Gen 7 products
running FOS v9.0.x are not at risk of either failure condition.
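As an illustration of how the RASlog check above might be automated, the following Python sketch scans saved log lines for the TO-1006 message. The regular expression is derived only from the single sample message shown in this article, so the exact wording of other TO-1006 variants is an assumption, and the function name is hypothetical.

```python
import re

# Pattern based solely on the sample TO-1006 message in this article.
# Other message variants or priority-group names are assumptions.
TO_1006 = re.compile(
    r"\[TO-1006\].*?Flows destined to (?P<device>\S+) device "
    r"have been moved to (?P<group>PG_OVER_SUBSCRIPTION\w*) PG\."
)

def find_oversubscription_events(log_lines):
    """Return (device, priority group) pairs for each matching TO-1006 line."""
    events = []
    for line in log_lines:
        match = TO_1006.search(line)
        if match:
            events.append((match.group("device"), match.group("group")))
    return events

if __name__ == "__main__":
    sample = (
        "[TO-1006], 1011618/1002267, FID 128, INFO, Switch_100, "
        "Flows destined to dev02 device have been moved to "
        "PG_OVER_SUBSCRIPTION_4G_16G PG., cfs_ctrlr.c, line: 1470, "
        "comp:cfsd, ltime:2023/05/17-06:15:33:923058"
    )
    print(find_oversubscription_events([sample]))
```

An empty result only means the message was not found in the lines supplied; it does not prove the event never occurred if older log entries have rolled over.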


Buffer Overflow Failure Risk Conditions
For the buffer overflow condition to occur, in addition to a period of severe congestion, the F-ports on the Gen 7
director or switch must have been reconfigured from the default value to a greater number of buffers. FOS
assigns at most 28 buffers by default.
Any Gen 7 director or switch whose maximum F-Port buffer counts have been increased above the FOS default values
is potentially at risk of buffer overflow, and any X7-8 or X7-4 director that was previously running FOS v9.0.x could be
at risk of encountering “verify” errors. In both cases, Traffic Optimizer must also attempt to manage routing of frames in
response to an oversubscription event caused during a period of severe congestion.

To determine which directors and switches might be at risk, use the “portbuffershow” command to view the Buffer Usage values.
[Image: example “portbuffershow” output]
If the Buffer Usage values for ports that are on the same ASIC/chip and are zoned together add up to more than 256 buffers, the Gen 7 switch is considered at risk of a buffer overrun should a severe congestion event require oversubscription management from Traffic Optimizer. The failure will not occur on every oversubscription management event, because the number of buffers being managed at the time of the event must exceed 256 while Traffic Optimizer is managing oversubscription. However, being configured to potentially handle more than 256 buffers puts the switch at risk.
In the example output shown above, if all 8 F-ports are in one zone together, the switch is at risk of a frame
buffer overflow while Traffic Optimizer is managing an oversubscription condition, as the total buffer usage count in this example is 360.
However, in the following example, where the F-Ports are not all zoned together, the switch would not be at risk, as the two zones (shown in green) total 232 buffers and 128 buffers respectively.
[Image: example “portbuffershow” output with two zones highlighted in green]
The maximum number of ports utilized for oversubscription management is 8 ports. If more than 8 ports are zoned together from the same ASIC/chip, then total the 8 ports with the highest Buffer Usage values to determine risk.
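The risk calculation described above can be sketched in Python as follows. This is a minimal illustration, not a supported tool: the function names are hypothetical, and the per-port values in the usage example are invented to reproduce the totals quoted in this article (360, 232, and 128), since the actual per-port figures appear only in the images.

```python
MAX_MANAGED_PORTS = 8        # ports used for oversubscription management
BUFFER_RISK_THRESHOLD = 256  # total buffers above which the zone is at risk

def zone_buffer_total(buffer_usages):
    """Sum the highest Buffer Usage values (at most 8) for ports that are
    on the same ASIC/chip and zoned together."""
    highest = sorted(buffer_usages, reverse=True)[:MAX_MANAGED_PORTS]
    return sum(highest)

def at_risk(buffer_usages):
    """True if the zone's total exceeds the 256-buffer threshold."""
    return zone_buffer_total(buffer_usages) > BUFFER_RISK_THRESHOLD

if __name__ == "__main__":
    # Hypothetical per-port values chosen to match the article's totals.
    one_zone = [45] * 8                 # total 360
    zone_a, zone_b = [58] * 4, [32] * 4 # totals 232 and 128
    for name, ports in [("one zone", one_zone), ("zone A", zone_a), ("zone B", zone_b)]:
        print(name, zone_buffer_total(ports), at_risk(ports))
```

The values passed in should be the Buffer Usage figures taken from “portbuffershow” for one group of ports that share an ASIC/chip and a zone; each such group is evaluated separately.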

NOTE: Gen 7 directors and switches that have never had their F-Port buffer counts changed from default are not at risk of this frame buffer overflow issue. The maximum default setting for Max/Reserved Buffers is 28 for Gen 7 products; however, fewer buffers could be allocated depending on switch type and optic speed. Even with 8 ports zoned together, using the maximum default allocation of 28 buffers per port, the total maximum Buffer Usage is only 224 buffers (8 × 28), which is below the 256-buffer threshold.

“Verify” Failure Risk Conditions
In addition to the buffer overflow issue, X7-8 and X7-4 directors could also be at risk of “verify” error messages if the following conditions are met in this order:
  • X7-8 or X7-4 director previously running on FOS v9.0.x
  • The director is then upgraded to FOS v9.1.x
  • The director then has F-ports that log out and log in while at the v9.1.x version
  • The director then encounters an oversubscription event that requires management from Traffic Optimizer
  • The director then performs an HA fail-over (firmware upgrade causes a fail-over to happen)
  • The director encounters another oversubscription event that requires management from Traffic Optimizer 
X7-8 or X7-4 directors that meet all of these conditions, in the specified sequence, could be at risk to encounter “verify” errors during oversubscription management from Traffic Optimizer.
  • X7-8 or X7-4 directors that have only ever run FOS v9.1.x firmware are not at risk of the “verify” error, as only the v9.1 programming model is used for all ports. Gen 7 directors must have previously run FOS v9.0.x in order to be susceptible to this issue.
  • X7-8 or X7-4 directors that have been cold-booted / power cycled while running FOS v9.1.x firmware are also not at risk of the “verify” error, as all ports will use the v9.1 programming after the reboot.
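Because the “verify” conditions must occur in a specific order, they can be modeled as an ordered sequence check over a director's event history. The Python sketch below is illustrative only: the event names are hypothetical labels invented for this example, and FOS does not produce such a history directly, so it would have to be assembled by hand from upgrade records and logs.

```python
# Hypothetical event labels, listed in the order the article requires.
REQUIRED_SEQUENCE = [
    "ran_v9_0_x",
    "upgraded_to_v9_1_x",
    "f_port_relogin",
    "oversubscription_event",
    "ha_failover",
    "oversubscription_event",
]

def verify_error_risk(history):
    """Return True if the history contains the required events in order and
    no cold boot has occurred since the upgrade to v9.1.x."""
    step = 0
    for event in history:
        # A cold boot while on v9.1.x puts all ports on the v9.1
        # programming, which removes the risk.
        if event == "cold_boot" and step >= 2:
            return False
        if step < len(REQUIRED_SEQUENCE) and event == REQUIRED_SEQUENCE[step]:
            step += 1
    return step == len(REQUIRED_SEQUENCE)

if __name__ == "__main__":
    print(verify_error_risk(REQUIRED_SEQUENCE))
    print(verify_error_risk(["upgraded_to_v9_1_x", "oversubscription_event"]))
```

Events that are out of order or unrelated are simply skipped, which mirrors the requirement that all conditions be met in the specified sequence.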

Symptoms
Gen 7 directors and switches that have encountered an oversubscription management event will observe the following
Traffic Optimizer RASlog:
[TO-1006], 1011618/1002267, FID 128, INFO, Switch_100, Flows destined to b1a02 device have been moved to PG_OVER_SUBSCRIPTION_4G_16G PG., cfs_ctrlr.c, line: 1470, comp:cfsd, ltime:2023/05/17-06:15:33:923058
Additional symptoms that could appear due to these issues include:
  • Large counts of CRC errors on a link may be observed that are not fixed with optic/cable replacement
  • Frames may be discarded, credit on a link can be lost
  • Ports may be faulted, ASIC may halt and be faulted
  • A director may observe an unexpected HA fail-over or even a cold restart of the director
  • Switches may observe a cold restart

Cause

Oversubscription management by the Traffic Optimizer feature under specific conditions could cause failure scenarios
impacting transmission of frames or ports being managed. Under severe congestion scenarios, these failures could also
impact the performance of other Fabric OS (FOS) daemons, active on the switch, leading to software watchdog time-outs
resulting in an HA fail-over or switch panic.

Gen 7 directors and switches (X7-8, X7-4, 7730 and 7720) that encounter an overflow of frame buffers while attempting
to manage and re-route oversubscribed flows in response to a severe congestion event can experience unexpected errors. If
the number of frames overruns the buffer used to manage the oversubscription handling, the excess frames can
be missed during Traffic Optimizer handling. These excess frames can then be overwritten by other frames, leading
to frame CRC errors, or even port faults if header information is overwritten. Under severe congestion scenarios, the
management of these excess frames can block other FOS daemons, which can result in
watchdog time-outs. Critical daemons that time out will cause an HA fail-over or a disruptive switch reboot.

In addition to potential frame overflow handling, X7-8 and X7-4 directors that previously operated on FOS v9.0.x and were later upgraded to FOS v9.1.x could encounter “verify” errors after HA fail-overs (including those caused by firmware upgrades to higher versions of v9.1.x). Multiple “verify” error messages will be observed during oversubscription management by Traffic Optimizer because of a conflict in port programming that is created when some, but not all, ports are reset while at v9.1.x. The conflict is between the congestion management programming on ports that have not been reset since the director ran v9.0.x and the programming on ports that were reset, and later encountered congestion management, while at v9.1.x. It can appear after an HA fail-over event.

Resolution

Work-Around
“At risk” directors and switches can disable the Traffic Optimizer oversubscription management action.
Issue the following CLI command from the maintenance account to disable the oversubscription
management action within Traffic Optimizer:
maintenance> serviceexec trafoptdebug --enableosclassification 0
NOTE: The maintenance command needs to be run on all Logical Switches in the chassis.
NOTE: The setting is persistent across fail-overs and power cycles.

Corrective Action
A software solution provided in FOS v9.1.1c and higher will prevent these failures. The same solutions are also provided
in the FOS v9.2.0a and higher versions of FOS v9.2.x. Upgrading to these versions of FOS will prevent an overrun of
frames due to oversubscription management and will also prevent “verify” errors on X7 directors.
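The affected and corrected release ranges stated in this article can be expressed as a small version check. The following Python sketch is a hedged illustration: it assumes version strings of the simple form v9.1.1c (major.minor.maintenance plus an optional patch letter), rejects anything else, and treats releases outside v9.1.x and v9.2.x as not affected based on the statements above. The function names are hypothetical.

```python
import re

# Assumes simple version strings such as "v9.1.1c" or "9.2.0".
_VERSION = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)([a-z]?)$")

def parse_fos_version(text):
    """Split a version string into (major, minor, maintenance, letter)."""
    match = _VERSION.match(text.strip().lower())
    if not match:
        raise ValueError(f"unrecognized FOS version string: {text!r}")
    major, minor, maint, letter = match.groups()
    return int(major), int(minor), int(maint), letter

def is_affected(text):
    """True for v9.1.x releases prior to v9.1.1c and for v9.2.0."""
    major, minor, maint, letter = parse_fos_version(text)
    if (major, minor) == (9, 1):
        # An empty patch letter sorts before "a", so v9.1.1 < v9.1.1c.
        return (maint, letter) < (1, "c")
    if (major, minor) == (9, 2):
        return (maint, letter) == (0, "")
    return False

if __name__ == "__main__":
    for version in ["v9.0.1e", "v9.1.1b", "v9.1.1c", "v9.2.0", "v9.2.0a"]:
        print(version, is_affected(version))
```

A True result only indicates that the firmware level falls in the affected range; whether a given director or switch is actually at risk still depends on the buffer and upgrade-history conditions described earlier.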

For any Gen 7 director or switch (X7-8, X7-4, 7730 and 7720) that is still running a version of FOS v9.0.x and could be
“at risk” of encountering the issues described, it is recommended to wait for the release of FOS v9.1.1c before upgrading.

Gen 7 directors and switches that are currently operating on a v9.1.x or v9.2.0 release, and are determined to be at risk,
should implement the work-around. Deactivating the Traffic Optimizer oversubscription management action will prevent
both the buffer overrun and “verify” errors from occurring. After upgrading to v9.1.1c or v9.2.0a, the oversubscription
management action can be re-enabled by issuing the following CLI command from the maintenance account:
maintenance> serviceexec trafoptdebug --enableosclassification 1
NOTE: The maintenance command needs to be run on all Logical Switches in the chassis.

Any Gen 7 director or switch that has already encountered the “buffer overflow” failure will need to perform a cold restart
to fully recover from the failure condition:
Directors: Slot power off/on the impacted port blade
Switches: Reboot (cold restart) the switch
Option 1: Perform the reboot action shown above and then implement the work-around to disable the oversubscription management action from within Traffic Optimizer
Option 2: Upgrade to a version of FOS with the solution and then perform the reboot action shown above.

Upgrading to a version of FOS with the solution provided will prevent the “buffer overflow” failure from happening, but
once the failing condition is encountered, only a cold restart of the ASIC will resolve the failure condition.
Upgrading to a version of FOS with the solution provided will prevent and automatically recover from the “verify” error
condition without any further action.

After upgrading to a version of FOS that contains the solution, a check of internal memory will be performed to determine if the director or switch has previously encountered the failure and requires a reboot to recover from the error condition.
The following RASlog will be displayed should the failure condition be detected after upgrading FOS to a version with the solution:
2023/06/01-17:07:50 (GMT), [C5-1057], 5, SLOT 2 | CHASSIS, CRITICAL, Switch_3,
S10,C0: HW ASIC Chip is in an inconsistent state = 0x1002.
If the above RASlog is observed after upgrading FOS, then the director or switch has previously encountered the “buffer
overflow” failure prior to upgrade and will need to perform a cold restart to fully recover from the failure condition:
Directors: Slot power off/on the impacted port blade
Switches: Reboot (cold restart) the switch
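The post-upgrade check for the C5-1057 RASlog could be scripted along the following lines. This Python sketch relies only on the one sample message shown above, so the pattern is an assumption about the general format, and the function names are hypothetical. It searches the log as a single block of text because the sample message wraps across two lines.

```python
import re

# Pattern based on the single C5-1057 sample in this article; DOTALL lets
# the match span the line break seen in the sample.
C5_1057 = re.compile(
    r"\[C5-1057\].*?inconsistent state = (0x[0-9A-Fa-f]+)",
    re.DOTALL,
)

def inconsistent_asic_states(log_text):
    """Return the state code from each C5-1057 message found."""
    return C5_1057.findall(log_text)

def cold_restart_required(log_text):
    """True if at least one C5-1057 inconsistent-state message is present."""
    return bool(inconsistent_asic_states(log_text))

if __name__ == "__main__":
    sample = (
        "2023/06/01-17:07:50 (GMT), [C5-1057], 5, SLOT 2 | CHASSIS, "
        "CRITICAL, Switch_3,\n"
        "S10,C0: HW ASIC Chip is in an inconsistent state = 0x1002."
    )
    print(inconsistent_asic_states(sample))
    print(cold_restart_required(sample))
```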

Article Properties


Affected Product

Connectrix DS-7720B, Connectrix DS-7730B, Connectrix ED-DCX7-4B, Connectrix ED-DCX7-8B

Last Published Date

26 Oct 2023

Version

3

Article Type

Solution