26 Posts

March 29th, 2007 03:00

AutoStart 5.2 Solaris Oracle module failed to detect oracle database

Hi folks,

We have Oracle 10g (10.2.0.3) running on Solaris 9. When we monitor the database with AutoStart, it runs normally on one node with no issues, but once we move the database to the other node, the database tester reports that the database has failed and tries to restart it (which always succeeds).

If we disable monitoring and check the database manually, it is still running with no issues. So I'm not sure what is wrong with the testing process that causes AutoStart to restart the database.

I checked the Oracle module log and saw messages saying the database is in "shutdown in progress". That's weird, and it happens on only one node of the cluster.

Any idea?

Regards,
Kawin.

157 Posts

March 29th, 2007 03:00

Does the Oracle process fail the existence test or the response test? Have you checked the Oracle log to see whether the database is restarting for some reason? Does it happen every time you start the resource group on this node, or only when starting it up after a failover?

26 Posts

March 29th, 2007 03:00

I'm not sure whether it is the existence test or the response test. How can I tell which one?

What I see on the screen is that the Oracle process turns red (failed) and then brings the whole group down. I checked the Oracle log and the Oracle module log and found no errors. As I said, the database looks fine: with monitoring disabled, I can connect to it and even run queries while the tester indicates that the database has failed.

Regards,
Kawin.

26 Posts

March 29th, 2007 04:00

Hello,

I just remembered: the failure does not occur during database startup. It occurs about 5 minutes after the database has started. So the existence test (which checks the pmon and smon processes) might not be the issue.

Regards,
Kawin.

157 Posts

March 29th, 2007 04:00

First check whether that's the case. I can't remember the exact name, but when you click on the existence monitor, there are three values you can edit, and one of them is the delay before the first test takes place.
As I said above, first do a failover with monitoring on the group disabled. That will prevent the cluster from restarting the group if the Oracle process fails, and it will let you figure out what's going on and see whether all the processes that the existence monitor looks for exist.

26 Posts

March 29th, 2007 04:00

Thanks for the quick response. Can you help me identify which parameter I should edit?

Note: I have downtime to test it again tomorrow.

157 Posts

March 29th, 2007 04:00

Most likely it is failing the existence test. The existence test (you can find it under State Monitors in the left pane of the console) is a Perl script that checks whether the following Oracle processes exist: pmon, smon, lgwr, arc0, dbw0 and ckpt. If any of them are missing, your Oracle process goes into the failed state. Disable monitoring on the resource group, do a failover, and then check whether all of these processes exist. If they do not (and that is not a problem for you), you can edit the existence monitor. It is also possible that the monitor starts executing too soon, before all the processes are up. In that case you can increase the timing values (first start) in the monitor properties.
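
For the manual check with monitoring disabled, here is a minimal sketch in Python of the same idea. It is not the AutoStart Perl script, only an illustration. It assumes the usual Unix naming of Oracle background processes, `ora_<name>_<SID>`, and the function name `missing_processes` is made up for this example.

```python
import re

# Background processes named in the post above. The ora_<name>_<SID>
# pattern is the usual Oracle naming on Unix (an assumption here, not
# taken from the AutoStart monitor script itself).
REQUIRED = ("pmon", "smon", "lgwr", "arc0", "dbw0", "ckpt")


def missing_processes(ps_output, sid, required=REQUIRED):
    """Return the required background processes not found in `ps` output."""
    running = set()
    for line in ps_output.splitlines():
        # The process name is the last field of a `ps -ef` line.
        match = re.search(r"\bora_([a-z0-9]+)_(\S+)\s*$", line)
        if match and match.group(2) == sid:
            running.add(match.group(1))
    return [name for name in required if name not in running]
```

Feed it the text of `ps -ef` from each node together with your instance SID. An empty list means every process in the list was found, and a difference between the two nodes is the thing to look at.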

Happy troubleshooting. :)

157 Posts

March 29th, 2007 05:00

Why not? Are you sure that one of these processes did not stop? Also check whether it is the response test that is failing (the resource would go into the "no response" state and be marked with a red question mark).

26 Posts

March 29th, 2007 06:00

The Oracle processes are still running: I can see them, and I can even connect to the database and run a query (select * from v$database).

That's why I said it is so weird. If the problem were inside AutoStart, it should happen on both nodes, but it happens on only one node of the cluster.

157 Posts

March 29th, 2007 06:00

It is quite simple: there is something different on the other node that causes the existence monitor script to exit with code 1, and it is not AutoStart. Even if you can connect to the database, that does not mean that all of the processes the existence monitor is looking for are there.
You can actually check why the existence monitor script is failing:
- redirect the standard output and standard error of the oracle process to a file
- set the variable FT_TRACE_TESTS to 1 in the oracle process configuration

This way, the next time you start the process, the output of the existence monitor script will be redirected to the specified output file, with an _exist suffix.
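
If you want to run a check script by hand on each node and compare, a small wrapper that keeps both the output and the exit code is enough. This is only a sketch: `run_and_capture` is a made-up helper, and the command you pass in (for example a copy of the monitor script) is up to you. It is not part of AutoStart.

```python
import subprocess


def run_and_capture(command, output_path):
    """Run `command`, write its stdout and stderr to `output_path`,
    and return the exit code (non-zero means the check failed)."""
    with open(output_path, "w") as out:
        result = subprocess.run(command, stdout=out, stderr=subprocess.STDOUT)
    return result.returncode
```

Run the same command on the good node and the bad node, then compare the two output files and the two exit codes.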

26 Posts

March 30th, 2007 06:00

I ran the test again. The database became unresponsive about 8-9 minutes after it started, and this happened only once. Because I had disabled group monitoring, after a short while it went back online again. I waited more than 10 minutes and saw no further errors (no unresponsive state).

I think it may be the response test rather than the existence test.

The error messages in oraproc.log show ORA-03135: connection lost contact.
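
To see which Oracle errors show up in a log and how often, a short Python sketch like the one below can help. It assumes only that error codes appear in the text in the standard `ORA-NNNNN` form. The function name `count_ora_errors` is made up for this example, and nothing here depends on the actual layout of oraproc.log.

```python
import re
from collections import Counter

# Oracle error codes are written as ORA- followed by five digits.
ORA_CODE = re.compile(r"\bORA-\d{5}\b")


def count_ora_errors(log_text):
    """Count each ORA-NNNNN code that appears in the given log text."""
    return Counter(ORA_CODE.findall(log_text))
```

Read the log file into a string and pass it in. Comparing the counts from both nodes shows whether ORA-03135 appears on the problem node only.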

Any idea?

26 Posts

April 1st, 2007 20:00

Finally I found that the agent on the problem node had not started properly. I discovered this when I used "ftcli" and it could not query anything on that node. I restarted the agent/backbone, and then everything went back to working.

Regards,
Kawin.
