August 7th, 2014 12:00

DPA no longer gathers data after Networker jobsdb corruption

I do have this open as a case with EMC support but I have received nothing of substance in a week and wondered if another user may have met a similar problem.

A week ago we had issues with one of our Networker servers, which we believed were the result of jobsdb corruption. We resolved them by stopping Networker, deleting the jobsdb folder and restarting. This fixed the backup application issue.

However since then that server has not communicated with the DPA server successfully and we see in the server.log:

agent.reporter - agentReporterThread(): Couldn't send entry from the store & forward queue to the server

EMC's response has been to suggest deleting the Data Collection Agent (DCA). However, when I try to delete the DCA it takes 2 hours and then gives a server error, so their fix is not possible.  Does anyone have any idea what I could try next?

August 7th, 2014 17:00

If you had corruption in your jobsdb, it could be that the DPA agent has some corrupted information that the server won't accept (just a theory).

When you say you try and delete the DCA, do you mean delete the host from the inventory, or clicking "Remove Data Collection Agent" from the server settings screen?

I'd suggest:

- Uninstall the DPA Agent

- Go to Admin / System / Configure System Settings and expand Data Collection Agents

- See if the agent is still listed. If it is, click on the agent then click "REMOVE DATA COLLECTION AGENT"

- Reinstall the DPA Agent

Hope that helps.

Gareth

August 8th, 2014 02:00

Thanks for responding, Gareth. I am clicking "Remove Data Collection Agent"; my understanding is that removing the host from the inventory would remove historical data, which we do not want to do.  However, the system does not seem to want to remove the Data Collection Agent from the DPA server: it always gives a server error (and takes 2 hours to come up with that).

I can try reinstalling the DPA agent, but that would require me to go through change control as the backup server is a production server, so we would not be able to do it for 1 month or more.  I'm not entirely sure what that would achieve; could I not delete some metadata to reset the agent instead?

August 8th, 2014 05:00

I am pretty certain that upgrading to DPA 6.1.1 and a recent patch level would resolve the problem with removing the Data Collection Agent (I have seen something similar elsewhere), but I presume that would also be subject to a change control delay.

I had suggested the reinstall in order to clean up the metadata, but doing it manually is not overly complicated.

- Stop the DPA Agent

- Rename the /agent/data directory (to something like data_old)

- Start the DPA Agent (this will create a new clean data directory)
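The rename step above can be sketched in Python. This is only an illustration, assuming the agent has already been stopped by whatever means your platform uses; the function name `set_aside_data_dir` and the timestamped `_old_` suffix are hypothetical choices, not anything DPA provides.

```python
import time
from pathlib import Path


def set_aside_data_dir(data_dir: str) -> Path:
    """Rename the agent data directory so the agent recreates a clean one.

    Assumes the DPA agent is stopped before this is called. Returns the new
    path so the directory can be renamed back if the reset does not help.
    """
    src = Path(data_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"{src} is not a directory")

    # Timestamped suffix (hypothetical convention) avoids clobbering an
    # earlier data_old copy.
    dest = src.with_name(f"{src.name}_old_{time.strftime('%Y%m%d%H%M%S')}")
    if dest.exists():
        raise FileExistsError(f"{dest} already exists")

    src.rename(dest)
    return dest
```

Keeping the old directory rather than deleting it means the change can be reverted by stopping the agent and renaming the directory back.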

And you're right that removing the host would remove data; it is not something that should ever be done if you want to keep historical data, so I'm glad to hear that it is not what you were trying.

Gareth

August 8th, 2014 07:00

I tried renaming the data directory and restarting, and that had no impact; collection failed and there was nothing in the log, so I stopped the agent and renamed the directory back.

Next I amended another server to use this server as its collector and that worked fine, which seems to confirm that the agent is functioning okay (we have also successfully run historical data collections) and that it is communicating okay with the DPA server.

This therefore seems to point at the DPA database for this server being the problem; this is our largest backup server, with 70% of clients and 90% of data being backed up there.  Seeing as we are unable to change any parameters, the problem seems to lie with the data for this particular server in the database, and we probably need to do some kind of database integrity check to resolve the issue.

August 8th, 2014 09:00

Correction on the last note: the override test did not actually happen, as I chose not to save the policy override and the collection ran on the local agent.  I then tried the converse and that failed.  I went to try the original override (collecting data for a different server using the server that is giving us a problem) and found the Data Collection Agent had vanished.  I checked in the admin tab and it was no longer listed under Data Collection Agents.  I restarted the DPA agent and it is there again; I am now running a test collection and it appears to be working.

August 11th, 2014 03:00

Update:

It seems DPA successfully collected the backlog of data, but since then it has been giving an “unable to run request” error. It also continues to try to run the collection from another server even though that is no longer configured as an override (I have just restarted the DPA agent on that server to see if that clears it). I have also increased the timeout, as this is a large server and may take longer to return data. In addition, although the collector was set to Debug level = info (and restarted), it has been doing debug logging over the weekend. I can see this error near the bottom of the log:

DBG1    26621.26659    20140811:104238        com.base.i18n - convertToUTF8Start(): iconv to 'UTF-8' not possible or reading from UTF-8 hence not using a converter

DBG1    26621.26659    20140811:104238      com.base.runcmd - csystemreadline(): wrote 867 bytes, 0 remaining

WARN    26621.26659    20140811:104238 common.localdb.sqlit - csqlOpenConnection(): Failed to open db file /opt/emc/dpa/agent/data/networker_jobmonitor_backupserver_NetWorker.db: unable to open database file

ERR     26621.26659    20140811:104238 common.localdb.pstor - pstoreOpen(): Failed to open connection

ERR     26621.26659    20140811:104238         clctr.module - aapiRunRequest(): Unable to run request networker:jobmonitor as there was an error opening the required pstore /opt/emc/dpa/agent/data/networker_jobmonitor_backupserver_NetWorker.db.
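With debug logging on, the WARN and ERR lines are easy to lose among the DBG1 entries. Here is a minimal Python sketch for pulling them out, assuming every line follows the layout seen in the excerpt above (level, two dot-separated numbers, timestamp, component, function, message). Reading the two numbers as process and thread IDs is my assumption, and `parse_line` and `problems` are hypothetical helper names.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    level: str
    pid: int       # assumption: first number is the process ID
    tid: int       # assumption: second number is the thread ID
    stamp: str     # YYYYMMDD:HHMMSS as written in the log
    component: str
    function: str
    message: str


# Layout inferred from the excerpt, not from any documented format.
LINE_RE = re.compile(
    r"^(?P<level>[A-Z0-9]+)\s+(?P<pid>\d+)\.(?P<tid>\d+)\s+"
    r"(?P<stamp>\d{8}:\d{6})\s+(?P<component>\S+)\s+-\s+"
    r"(?P<function>\w+)\(\):\s+(?P<message>.*)$"
)


def parse_line(line: str) -> Optional[LogEntry]:
    """Return a LogEntry, or None if the line does not match the layout."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(
        m["level"], int(m["pid"]), int(m["tid"]), m["stamp"],
        m["component"], m["function"], m["message"],
    )


def problems(lines: Iterable[str]) -> List[LogEntry]:
    """Keep only WARN and ERR entries, in their original order."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e is not None and e.level in ("WARN", "ERR")]
```

Calling `problems(open(path))` on the agent log would return just the warning and error entries, which makes it easier to see which function failed first.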

I would guess that is saying there are corrupt characters of some kind in the pstore that make reading it impossible.  I am therefore going to reset the data directory, restart and see if the agent works okay after that.  It looks like the data from the period where we saw jobsdb corruption (before we deleted the jobsdb) has now been caught up with, so if this works I should be able to run a historical collect on the missing data.
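One way to test that guess before resetting anything: the component name `common.localdb.sqlit` and the message "unable to open database file" suggest the pstore is a SQLite file, though that is an inference from the log and not something confirmed here. In SQLite itself, that particular message usually points to a missing file or directory or a permissions problem, while damaged content tends to produce "file is not a database" or "database disk image is malformed". A rough Python sketch to tell these cases apart (the function name `diagnose_db` is hypothetical):

```python
import os
import sqlite3
from pathlib import Path


def diagnose_db(path: str) -> str:
    """Return a short description of why a SQLite file may fail to open."""
    parent = os.path.dirname(path) or "."
    if not os.path.isdir(parent):
        return "missing directory"
    if not os.path.exists(path):
        return "missing file"
    if not os.access(path, os.R_OK):
        return "not readable"

    # Open read-only through a file: URI so the check cannot modify the file.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
        try:
            row = conn.execute("PRAGMA integrity_check").fetchone()
        finally:
            conn.close()
    except sqlite3.DatabaseError as exc:
        return f"sqlite error: {exc}"

    if row is not None and row[0] == "ok":
        return "ok"
    return f"integrity_check: {row[0] if row else 'no result'}"
```

Run it as the same user the agent runs as, ideally with the agent stopped or against a copy of the file, since a permissions result only means something for that account.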

This has worked so far, but during the daytime jobsdb is not so busy.  Next I will run a small historical collect and see if that addresses the issue…

Regards


David

August 12th, 2014 09:00

Historical data collect worked and the DPA agent for this server is working okay.  The difficulty in resolving this stemmed from being unable to delete the Data Collection Agent in the DPA GUI. It seems that, after the deletion failed, attempting to use a different agent as the collector knocked out whatever was blocking the deletion, and the deletion then completed!  If this happens again I will run through this process and see if it is a successful workaround.
