1 Rookie • 11 Posts


June 3rd, 2021 08:00

Avamar Virtual Edition disks very full

Following the expiration of a large number of backups, the AVE disks are very full:

Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 1000G 974G 26G 98% /data01
/dev/sdc1 1000G 974G 27G 98% /data02
/dev/sdd1 1000G 971G 29G 98% /data03


The problem started on the 31st of May:

[attached image: ouzo12_0-1622733153140.png]

 

But now we are in a chicken-and-egg situation: it won't checkpoint and it won't garbage collect, and the HFS check won't run because it needs a new checkpoint. Not that it'd save space anyway...

Checkpoint failed with result MSG_ERR_DISKFULL : cp.20210603145550 started Thu Jun 3 15:56:20 2021 ended Thu Jun 3 15:56:20 2021, completed 0 of 6509 stripes
Last GC: finished Thu Jun 3 16:01:44 2021 after 00m 30s >> recovered 0.00 KB (MSG_ERR_DISKFULL)
Last hfscheck: finished Mon May 31 09:41:56 2021 after 01h 02m >> checked 5837 of 5837 stripes (OK)

So what can one do now? I have deleted all but the one remaining validated checkpoint:

Tag               Time                    Validate  Deletable
----------------- ----------------------- --------- ---------
cp.20210531065414 2021-05-31 07:54:14 BST Validated No

I wish there was a way to trash all checkpoints (including the current one) without affecting client configurations... I don't care about the backed up content anyway.
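
For reference, the table above came from the usual checkpoint listing; roughly like the following (a sketch only, assuming the standard Avamar CLI tools are on the path; exact output format may differ by release):

# List GSAN checkpoints and their validation status (run as admin on the utility node)
cplist

# The same Tag/Time/Validated/Deletable view via the Management Console CLI
mccli checkpoint show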

 

2K Posts

June 3rd, 2021 08:00

I strongly recommend you work with support on this issue.

However, if you don't care about any of the data added to the system since 05/31 and just want it running again in a hurry, you can shut down GSAN and force it to roll back using dpnctl.

dpnctl stop

dpnctl start --force_rollback

You will be prompted to select a checkpoint to which to roll back.

Important: All data added to the system after the creation of the checkpoint cp.20210531065414 will be irretrievably lost.

With the disks this full, it's possible that the rollback may fail due to lack of disk space. If you encounter this problem, please contact support as they will need to assist you.
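
Putting that together, the whole sequence would look roughly like this (a sketch only, run as admin on the utility node; the checkpoint offered at the prompt will be whatever exists on your system):

# Stop all Avamar services
dpnctl stop

# Restart GSAN and force a rollback; you will be prompted to select a checkpoint,
# e.g. cp.20210531065414 in this thread
dpnctl start --force_rollback

# Confirm services and recheck the data partitions afterwards
dpnctl status
df -h /data01 /data02 /data03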

2K Posts

June 3rd, 2021 08:00

Also, please note that running GC will make an OS capacity problem worse, not better. This is because garbage collection alters the "current" copy of the data. Checkpoint overhead is a result of the deltas between the current data and the oldest checkpoint, so introducing additional changes will generate additional checkpoint overhead.

Checkpoint overhead is normally rolled off following the successful hfscheck of a new checkpoint but if there is too big a spike in OS capacity, the system can end up in a situation similar to the one you find yourself in now. Support has ways to coax the system to create a new checkpoint despite high capacity (though this is not always possible if the capacity is extremely high).
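
A rough way to see that overhead on disk, if you're curious, is plain du against the data partitions. This assumes the usual layout of a cur directory for the live stripes plus one cp.* directory per checkpoint, and because checkpoints share data via hard links the numbers are only indicative:

# Current data vs. checkpoint copies on each data partition
for p in /data01 /data02 /data03; do
    du -sh $p/cur $p/cp.* 2>/dev/null
done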

1 Rookie • 11 Posts

June 4th, 2021 01:00

Thanks for the tip, I hadn't thought of rolling back to the last checkpoint. And as you can see, nothing has been added since the 31st anyway.

My only concern is that the problem will reoccur when the backups that expired on the 31st expire once more the next day.

But I'll give it a shot anyway.

 

1 Rookie • 11 Posts

June 4th, 2021 06:00

Well, it did successfully roll back to 31st May. Then MCS wouldn't start. It also needed to be rolled back (the Administrator's manual doesn't tell you this...)

/data0?/cur/ got renamed to a cur.* directory rather than being deleted. I removed it using 'rm -rf'.

Then everything started and the disk usage was 73%. Great! I thought.

But no, even though there was no checkpoint, gc, hfscheck or backup running, the background maintenance started writing more and more data to /cur/.

11:12:21 DEV tps    rd_sec/s  wr_sec/s  avgrq-sz avgqu-sz await  svctm %util
11:13:21 sda 23.45  548.47    181.66    31.14    0.40     17.17  2.59  6.07
11:13:21 sdb 452.42 123158.75 197466.17 708.69   139.06   310.72 2.23  100.77
11:13:21 sdc 467.35 104410.88 168975.81 584.97   134.63   291.65 2.16  100.77
11:13:21 sdd 488.61 107655.63 148116.96 523.47   136.58   282.90 2.06  100.75

Over the course of 5 hours, it went back to 98%...
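
For reference, that output is standard sysstat; watching both the I/O and the creeping usage needs nothing Avamar-specific, something like:

# Per-device I/O statistics every 60 seconds (sysstat package)
sar -d -p 60

# Watch the data partitions fill in near real time
watch -n 60 'df -h /data01 /data02 /data03'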

 

2K Posts

June 4th, 2021 06:00

If you use dpnctl to perform a rollback in one step, the MCS flush restore will happen automatically.

The cur.* directory is generated during rollback. This normally only sticks around if there has been a failed rollback. I do not ever recommend removing files from the data partitions manually using rm. There is a tool called clean.dpn that will clean up stale cur.* directories (along with stale hfscheck and hfscheck.* directories).

I'm not sure what you mean by "background maintenance". If the system has a lot of pending writes sitting in the write logs, it may have hit a threshold where it started flushing these changes to the stripe files, which will generate checkpoint overhead. Another possibility is asynchronous crunching, which is essentially defragmentation for stripes. The crunching process prepares garbage-collected stripes to ingest new data. Asynchronous crunching uses a rolling window to try to anticipate how much data the system will ingest during the next backup window, then tries to prepare exactly enough stripes to ingest that amount of data. If the backup ingest over the last two weeks or so has been really spiky, that might cause it to overestimate. In any case, if you repeat the rollback, you can disable asynchronous crunching temporarily to see if that helps:

avmaint config --ava asynccrunching=false

There is a performance penalty for doing this since the system now has to crunch stripes on demand while under load. Once the system gets back into a normal pattern of ingest and garbage collection, asynccrunching should be re-enabled.
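
Presumably re-enabling it later follows the same pattern as the command above:

# Re-enable asynchronous crunching once ingest and GC are back to a normal pattern
avmaint config --ava asynccrunching=true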

Avamar is very sensitive to sudden changes in capacity, especially as it gets closer to full. If you continue to have OS capacity issues, the system may be undersized.

I really strongly recommend working with support on these sorts of issues as they have tools and procedures to help determine if an issue like this is a single incident or more a systemic problem.

1 Rookie • 11 Posts

June 4th, 2021 08:00

Well, as this is AVE 18.2, I looked at the Administrator's guide and it said that rollback should be done using:

rollback.dpn --cptag=...

I did that, it said success, but it failed to start MCS due to MCS not having been rolled back (or something to that effect). So I rolled that back manually also.

In any case, I'm pretty much where I started 9 hours ago.

Yes, it was probably async crunching. This happens in the maintenance window, right? But I'm not an expert in the internals of AVE. And yes, it is undersized but I cannot do anything about that now.

I have now opened a Service Request, thanks for the responses. But I think in the end I will trash the whole thing and deploy a clean instance of a newer version.

 

2K Posts

June 4th, 2021 08:00

If GSAN is rolled back manually using rollback.dpn, MCS does have to be restored from flush. We don't want the MCS database to be out of sync with GSAN because it has knowledge of an alternate timeline.
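
For completeness, the manual MCS restore from flush looks roughly like the following; this is a sketch from memory, so treat the exact flags as assumptions and verify against the Administration Guide for your release:

# Stop MCS, restore it from the most recent flush backup, then start it again
# (flags are assumptions -- check the Administration Guide for your release)
dpnctl stop mcs
su - admin -c "mcserver.sh --restore --norestart"
dpnctl start mcs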

asynccrunching is not specifically a maintenance task - it will run any time the system is idle (no backups or maintenance tasks in progress) and the crunching goal for the day has not been met.

1 Rookie • 11 Posts

June 11th, 2021 05:00

So the SR has been open for 7 calendar days now. I opened it last Friday (June 4th).

On Monday (June 7th), I got an email asking me what the problem was (even though I had described it in the request).

Since Monday I've had no further contact. The case says 'Next Customer Contact: 14 June'.

So now I've had enough. I would have had to upgrade it to AVE 19.3 anyway, so I trashed it and installed a nice fresh empty AVE 19.3 in its place.

 

2K Posts

June 11th, 2021 07:00

I'm sorry to hear that your support experience wasn't very good. If you send me a DM with the SR number, I can bring it to support management's attention.

1 Rookie • 9 Posts

March 6th, 2023 19:00

Sadly, I've had a lot more bad support experiences in the last few years than we used to (it went from none to lots). You need to push back hard when it happens, and it will normally get sorted out if you do.
On the flip side, I've had out-of-support products (e.g. AER) supported well past the end of official support.

NOTE: there is a warning regarding capacity, and while I know that for large systems it fires far too early (it's all based on older hardware), you must remember that you need space to free up space.
I.e. I see 90% usage and don't care, but I do stress if I see us hit 95%, because at 98% we risk a clean on the Data Domain not happening soon enough to recover space and keep running without way too much intervention.

Yes, I know I'm late to the party. I'm just advising anyone stumbling on this to push back on support cases where you don't get the expected response times, to know your data intake rates, and to avoid having less than a week's worth of 'free space' like the plague.
I like to keep 30+ days... and yes, I've still run out, due to issues with Dell shipping additional disks for our Data Domains in a timely fashion, all down to how they place these orders internally, requiring unnecessary hardware to be sent on top of the 15 disks in a disk pack.
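
As a back-of-the-envelope version of "know your data intake rates": runway is just free capacity divided by average daily net growth. The numbers below are hypothetical, plain shell arithmetic:

# Days of runway = free capacity / average daily net growth (hypothetical figures)
free_tb=6
daily_growth_tb=0.4
awk -v f="$free_tb" -v d="$daily_growth_tb" 'BEGIN { printf "%.1f days of runway\n", f / d }'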
