In this video, we'll be discussing the best practices to prevent the Data Domain system from reaching its capacity threshold. These are the topics we will be discussing.
Let's look at each possible cause of high Data Domain utilization and its best practices. The first possible cause is that backups are not being expired, so the Data Domain may contain a large number of old files. In this example, the files are classified according to their age (daily, weekly, monthly, etc.).
When a large number of old files is seen, backups may not be getting expired, and this impacts storage usage. If we experience this particular scenario, it's best to review the current retention policy from the Avamar GUI or AUI. The next possible cause could be that backups are not being expired because the Data Domain contains a huge number of small files.
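As a rough illustration of the age breakdown just described, a small sketch can group files into daily, weekly, and monthly buckets so that unexpectedly old files stand out. The bucket boundaries below are assumptions for illustration, not values from the slide:

```python
from datetime import date

def age_bucket(file_date: date, today: date) -> str:
    """Classify a backup file by age, mirroring the daily/weekly/monthly
    grouping on the slide (the boundaries here are illustrative)."""
    age_days = (today - file_date).days
    if age_days <= 7:
        return "daily"
    if age_days <= 31:
        return "weekly"
    if age_days <= 365:
        return "monthly"
    return "older"  # candidates to check against the retention policy

# Hypothetical file dates, just to show the grouping:
today = date(2024, 6, 1)
files = {
    "backup_a": date(2024, 5, 30),
    "backup_b": date(2024, 5, 10),
    "backup_c": date(2022, 1, 15),
}
buckets = {name: age_bucket(d, today) for name, d in files.items()}
```

Anything falling into the "older" bucket is the kind of file worth cross-checking against the retention policy in the Avamar GUI or AUI.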
This particular example depicts the ingest of a large number of small files, 10 KB or smaller. As we can see, as the ingest of small files (10 KB or smaller) has increased, the utilization has also increased consistently; hence the impact on storage usage.
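To make the 10 KB observation concrete, a hypothetical sketch could track what share of an ingest batch falls at or below the small-file cutoff. The 10 KB cutoff comes from the slide; the batch sizes below are made up for illustration:

```python
SMALL_FILE_BYTES = 10 * 1024  # the 10 KB cutoff mentioned on the slide

def small_file_ratio(sizes_bytes):
    """Fraction of ingested files at or below the small-file cutoff."""
    if not sizes_bytes:
        return 0.0
    small = sum(1 for s in sizes_bytes if s <= SMALL_FILE_BYTES)
    return small / len(sizes_bytes)

# Hypothetical ingest batch: 3 of these 5 files are 10 KB or smaller.
batch = [2_048, 8_192, 512_000, 4_096, 1_048_576]
ratio = small_file_ratio(batch)
```

A ratio that climbs over successive batches would match the trend shown on the slide and suggest revisiting how the backup procedure groups small files.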
The recommendation is to set up the backup procedure so that small files are ingested rarely, if at all, should we get into this situation. The next possible cause could be that replication lag is preventing data from being removed during the cleaning process. Here is an example where the Data Domain shows alerts that the replication sync-as-of time is more than x hours ago.
In this particular situation, it is best to review the replication status for all the contexts with the command "replication status all", which shows detailed information about the context number, the destination path, whether the context is enabled or disabled, and the sync-as-of time.
We can dig further into a particular context using "replication status" with the context number, which in this example is context number 11. Then look at the output of the "df" command for the Data Domain capacity utilization, that is, the space utilization on this particular Data Domain.
If there is cleanable space and there is also a huge replication lag, as in this example, then break the replication pair and start cleaning to reclaim the space on the Data Domain. This particular KB has more information on breaking and resyncing replication, and it also explains how to perform those changes using both the Data Domain GUI and the Data Domain CLI.
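The decision just described can be sketched as a small helper: if the filesystem reports cleanable space and a context's sync-as-of lag exceeds some limit, the pair is a candidate for break-and-clean. The lag threshold and the example figures below are assumptions for illustration, not values from the KB:

```python
def should_break_and_clean(cleanable_gib: float, lag_hours: float,
                           lag_limit_hours: float = 24.0) -> bool:
    """True when reclaimable space is being held back by a large
    replication lag (the 24-hour limit here is illustrative)."""
    return cleanable_gib > 0 and lag_hours > lag_limit_hours

# Hypothetical figures, as might be read from `df` and
# `replication status 11` on the Data Domain:
cleanable = 850.0  # GiB of cleanable space reported by the filesystem
lag = 72.0         # sync-as-of time was 72 hours ago
decision = should_break_and_clean(cleanable, lag)
```

If the helper returns True, the KB's break-and-resync procedure is the next step; if there is no cleanable space, breaking the pair would not reclaim anything.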
The next possible scenario would be where snapshots might be unexpired on the Data Domain. In order to fix this particular issue, we first need to fetch the Avamar mtree name using "mtree list". Then, to view the snapshots on that particular Avamar mtree, "snapshot list" against the Avamar mtree name will list the snapshots on the Data Domain.
Then we log in to the Avamar CLI and use the "cplist" command to review the checkpoints. There is one thing to note here: whatever active snapshots are seen on the Data Domain should match the checkpoints on Avamar.
The remaining snapshots should show as expired. If we find that old snapshots are still present, are not being expired, and can be expired, this can be done using the "snapshot expire" command as shown on this particular slide. More information on all of these best practices is available in this particular KB article.
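The matching rule above, where every active Data Domain snapshot should correspond to an Avamar checkpoint and anything left over is a candidate for "snapshot expire", amounts to a set difference. A sketch with made-up snapshot and checkpoint names:

```python
def expire_candidates(dd_snapshots, avamar_checkpoints):
    """Active snapshots on the Data Domain mtree that have no matching
    Avamar checkpoint; these are the candidates for `snapshot expire`."""
    return sorted(set(dd_snapshots) - set(avamar_checkpoints))

# Hypothetical names for illustration, in a cp.<timestamp>-style format:
dd_snaps = ["cp.20240101120000", "cp.20240201120000", "cp.20230101120000"]
avamar_cps = ["cp.20240101120000", "cp.20240201120000"]
stale = expire_candidates(dd_snaps, avamar_cps)
```

Anything returned by the helper would be checked on the Avamar side before being expired on the Data Domain, since a matching checkpoint must never be expired out from under Avamar.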
The KB explains in detail everything that we have discussed, and it also includes a short video that can help with this particular situation. Next, let's look at the possible scenarios where the target Data Domain utilization is higher than the source Data Domain, and the best practices against the possible causes.
Whenever an Avamar rollback is performed on the associated Data Domain, the destination Data Domain can hold extra data depending on the rollback time. To mitigate this, and to reduce Avamar rollback situations, it is always good practice to review HFS check failures and hardware failures as soon as they occur.
Whenever the Avamar GUI or the proactive health check flags HFS check failures or hardware failures, it is best to address them as soon as we identify them. Also, there is a parameter called base file tracking that is disabled during the Avamar rollback process.
If this base file tracking parameter is left set to false (its default value is true), it may result in a very large space discrepancy between the source and target. So it is always best to check the base file tracking parameter after the Avamar rollback has completed and confirm it is set back to true, in order to avoid this space discrepancy.
The next possible scenario is cleaning discrepancies. If the source and destination Data Domains are running cleaning on different days, or one of the Data Domains is running cleaning more aggressively or for longer, there will be a discrepancy in the capacity utilization.
It is definitely not a good practice to run cleaning every day or every other day, as this can fragment the data, and read speeds can also be severely impacted by such a setting. All of these scenarios, and more, are discussed on this particular topic in this KB here.
Along with the examples, there is an explanation of each possible scenario where the target is higher than the source. Next, we will look into the topic of Data Domain space usage alerts. We may have noticed alerts where the Data Domain utilization is 95 percent or above, or 100 percent, where the active tier has already reached 100 percent.
When the Data Domain is at 95 percent or above, some processes, such as cleaning, may be affected. And when the active tier is at 100 percent, no new data will be written to the DDR, which may cause backups or replication to fail. There is also a space usage warning setting that we can set to a lower value.
Space usage warning alerts are set to 90 percent by default on Data Domains, and critical space usage warnings are set to 95 percent respectively. These values can be checked on the Data Domain system using the commands "filesys option show warning-space-usage" and "filesys option show critical-space-usage".
By checking these values, we can see what they are currently set to, and then we can change these defaults to a lower percentage. For example, the warning space usage can be set to 75 percent on the Data Domain, and the critical space usage to 85 percent respectively.
This way, the impact on the Data Domain system is reduced. When the critical threshold triggers at 85 percent, it is easier and faster to resolve the capacity issues without any impact on Data Domain processes, unlike the situation we discussed on the previous slide, where some Data Domain processes are impacted when the Data Domain utilization is 95 percent and above.
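The alerting behaviour discussed here can be sketched as a simple severity check against the warning and critical thresholds, using the 90/95 percent defaults from the slide and the suggested lower values of 75/85 percent:

```python
def space_alert(used_pct: float, warning: float = 90.0,
                critical: float = 95.0) -> str:
    """Map a utilization percentage to an alert level using the
    warning/critical thresholds described on the slide."""
    if used_pct >= critical:
        return "critical"
    if used_pct >= warning:
        return "warning"
    return "ok"

# With the lowered thresholds suggested above (75 warning, 85 critical),
# the same utilization raises an alert much earlier:
default_level = space_alert(88.0)              # below the 90% default warning
lowered_level = space_alert(88.0, 75.0, 85.0)  # above the lowered 85% critical
```

The point of lowering the thresholds is visible in the last two lines: at 88 percent utilization the default settings raise nothing, while the lowered settings already flag critical, leaving time to act before processes are impacted.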
However, one point to note here is that these warnings should be set up after the high Data Domain utilization issues have been resolved first. This particular KB explains in detail the Data Domain capacity resolution paths with their workflows, and each of these scenarios also has a respective video explaining in detail how to resolve Data Domain capacity issues. I hope this video helps with explaining the best practices.
Thank you for watching this video.