What is the CSS MISSCOUNT value recommendation for 11.2.0.4 RAC?

My customer is upgrading to 11.2.0.4 and apparently the new Oracle default is 15 secs. Their OCR/Voting disk groups are being dismounted and cluster nodes evicted because writes are taking in excess of 15 secs. The prior CSS MISSCOUNT default was 200 secs.

They have created a separate ASM disk group for OCR/Voting: 5 x 2 GB fully provisioned thin LUNs, bound and pinned to the FC tier.

Any suggestion on CSS MISSCOUNT? The storage is VMAX40K.

Regards,

Mike

Responses(2)

jeff_browning

September 5th, 2014 12:00

According to the Oracle docs:

http://docs.oracle.com/cd/E11882_01/rac.112/e16794/crsref.htm#CWADD91142

The parameter should be set to 30 seconds, so 15 is obviously wrong. However, I would also want to figure out why an I/O to this puppy is taking more than 15 seconds.

PeterHG

September 8th, 2014 12:00

Hi Mike, it has been a while since I looked at this, but the EMC recommended setting for CSS MISSCOUNT is still 120 seconds (as it has been since the feature was introduced in 10g). In my experience, a very good reason for a relatively high value is that, while the 11gR2 default of 30 seconds is perfectly fine under normal conditions, a low value of, say, 15 seconds can "hide" the actual source of a problem that is reported as an ERROR condition in ocssd.log.

Two CSS timeout parameters are very important when tuning node eviction conditions:

MISSCOUNT (wait time for the cluster interconnect to resume normal operation)

DISKTIMEOUT (wait time for normal access to voting disks to resume)
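Before changing either value, it helps to confirm what the cluster is actually using. Here is a minimal Python sketch that wraps `crsctl get css misscount` and `crsctl get css disktimeout`. The helper names are hypothetical, and the parser assumes the output contains the parameter name followed by its integer value (as in "CRS-4678: Successful get misscount 30 for Cluster Synchronization Services."); check that against your own Grid Infrastructure version.

```python
import re
import subprocess


def parse_css_value(output: str, name: str) -> int:
    """Return the integer that follows the parameter name in crsctl output.

    Assumption: the output looks like
    'CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.'
    """
    match = re.search(rf"\b{re.escape(name)}\s+(\d+)\b", output)
    if match is None:
        raise ValueError(f"no value for {name!r} found in: {output!r}")
    return int(match.group(1))


def get_css_value(name: str) -> int:
    """Run `crsctl get css <name>` and return the value in seconds.

    Assumption: crsctl is on PATH for the user running this script.
    """
    result = subprocess.run(
        ["crsctl", "get", "css", name],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_css_value(result.stdout, name)
```

On a cluster node, calling `get_css_value("misscount")` and `get_css_value("disktimeout")` would report the two timeouts discussed here.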

Clearly, there is a balance to strike here. For example, when you know you are doing maintenance on the Oracle disk groups, CSS DISKTIMEOUT should be increased. For a production RAC environment, however, these settings need to reflect the wait times that are acceptable for the performance of the application. When configuring them, consider how long it is reasonable, in the rare event of a problem, to wait for redundant interconnect or storage paths to fail over, or for nodes to reboot, before Oracle RAC automatically starts node eviction to resolve access issues between nodes and to storage.

A question for you: are you certain MISSCOUNT is the problem parameter here? That would indicate a cluster interconnect issue, not a voting disk timeout, which is managed with DISKTIMEOUT (default 200 seconds).
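To make the distinction concrete, here is a deliberately simplified Python sketch of which timeout an observed delay would breach. The class and function names are hypothetical, the defaults are the ones quoted in this thread (30 and 200 seconds), and real CSS eviction logic is more involved than two comparisons.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CssTimeouts:
    misscount: int = 30      # seconds; 11gR2 default quoted in this thread
    disktimeout: int = 200   # seconds; default quoted in this thread


def implicated_timeouts(heartbeat_gap_s: float,
                        voting_io_s: float,
                        timeouts: CssTimeouts = CssTimeouts()) -> list:
    """List the CSS timeouts that the observed delays would breach.

    heartbeat_gap_s: longest gap seen in interconnect heartbeats, in seconds
    voting_io_s:     longest voting disk I/O seen, in seconds
    """
    breached = []
    if heartbeat_gap_s >= timeouts.misscount:
        breached.append("misscount")      # points at the interconnect
    if voting_io_s >= timeouts.disktimeout:
        breached.append("disktimeout")    # points at voting disk access
    return breached
```

Under this model, a voting disk write that takes 16 seconds with healthy heartbeats breaches neither default, which is why it is worth confirming in the logs which condition is really being reported.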

The alert.log and ocssd.log files are the places to start to understand what is actually happening. I am certainly interested to hear how you get on with this, so do let us know.

Regards, Peter
