PowerEdge: How to fix Double Faults and Punctures in RAID Arrays
Summary: This article explains Double Faults and Punctures in a RAID array and describes how to fix them.
Warning: Following these steps results in the loss of all data on the array. Before performing the steps, ensure that all data on the array is backed up and that the steps do not affect any other arrays.
RAID arrays are not immune to data errors. RAID controller and hard drive firmware contain functionality to detect and correct many types of data errors before they are written to an array/drive.
Data errors can be caused by physical bad blocks, such as a "Head Crash" or degradation of the platter's ability to magnetically store bits in a specific location.
A bad block, also known as a bad Logical Block Address (LBA), can also be caused by logical data errors, such as a "bit flip" or incorrect data being written to a drive.
Bad LBAs are commonly reported with Sense Code 3/11/0.
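As a reading aid, the sketch below decodes a sense code written in the key/ASC/ASCQ form used above. It assumes the three fields are hexadecimal, and its lookup tables are deliberately tiny and illustrative; the function name decode_sense is hypothetical and is not part of any Dell tool.

```python
# Illustrative lookup only: a tiny subset of the SCSI sense tables,
# just enough to decode the code mentioned in this article.
SENSE_KEYS = {0x3: "MEDIUM ERROR"}
ADDITIONAL_SENSE = {(0x11, 0x00): "UNRECOVERED READ ERROR"}


def decode_sense(code: str) -> str:
    """Decode a 'key/asc/ascq' string whose fields are hexadecimal."""
    key, asc, ascq = (int(part, 16) for part in code.split("/"))
    key_name = SENSE_KEYS.get(key, f"sense key {key:X}h")
    detail = ADDITIONAL_SENSE.get((asc, ascq), f"ASC {asc:02X}h / ASCQ {ascq:02X}h")
    return f"{key_name}: {detail}"


print(decode_sense("3/11/0"))  # prints "MEDIUM ERROR: UNRECOVERED READ ERROR"
```

Codes that are missing from the tables are echoed back in hexadecimal rather than guessed.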
Dell hardware-based RAID controllers offer features such as Patrol Read and Check Consistency to correct many data error scenarios.
Performing regular Check Consistency operations corrects single faults, whether they are caused by a physical bad block or by a logical data error.
Check Consistency will also mitigate the risk of a double fault condition in the event of additional errors.
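To see why a single fault is correctable, consider how RAID 5 parity works in general: the parity block is the XOR of the data blocks in a stripe, so any one missing block can be recomputed from the others. The sketch below is a minimal model of that idea under assumed values (a three-drive set and made-up block contents); it does not represent how PERC firmware implements Check Consistency.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe across a hypothetical three-drive RAID 5 set:
# two data blocks plus one parity block.
d0 = b"\x0f\xa0\x33\x55"
d1 = b"\xf0\x0a\xcc\x01"
parity = xor_blocks([d0, d1])

# Single fault: d1 is unreadable, so it is recomputed from the survivors.
recovered = xor_blocks([d0, parity])
assert recovered == d1
```

If a second block in the same stripe were also unreadable, only one block would remain and the XOR could no longer recover either missing block. That is the double fault condition.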
Figure 1 Multiple Single Faults in a RAID 5 array - Optimal Array.
Figure 2 Double Fault with a Failed Drive (Data in Stripes 1 and 2 is lost) - Degraded Array.
Figure 3 Punctured Stripes (Data in Stripes 1 and 2 is lost due to double fault condition) - Optimal Array.
A puncture is a feature of Dell's PERC controllers designed to allow the controller to restore the redundancy of the array despite the loss of data caused by a double fault condition.
A puncture is also known as "rebuild with errors."
A puncture can occur in one of two situations: a double fault already exists, or a double fault does not exist.
A puncture can occur in three locations: a blank space, a non-critical data space, or a data space that is accessed.
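The "rebuild with errors" idea can be illustrated with a simplified model: a replaced drive is rebuilt stripe by stripe, and any stripe that cannot be reconstructed is recorded as punctured so that the rebuild can still finish. The function name rebuild_drive, the data layout, and the use of None for an unreadable block are all assumptions made for this sketch; actual controller behavior is more involved.

```python
from functools import reduce


def rebuild_drive(surviving):
    """Rebuild a replaced drive from the surviving drives of a RAID 5 set.

    surviving: list of drives, each a list of blocks (bytes), with None
    standing in for a block that cannot be read.
    Returns (rebuilt_blocks, punctured_stripes).
    """
    rebuilt, punctured = [], []
    for stripe, blocks in enumerate(zip(*surviving)):
        if any(block is None for block in blocks):
            # A surviving block is unreadable: the stripe cannot be
            # reconstructed, so it is marked and the rebuild continues.
            rebuilt.append(None)
            punctured.append(stripe)
        else:
            rebuilt.append(
                bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))
            )
    return rebuilt, punctured


drive_a = [b"\x01\x02", None, b"\x10\x20"]  # stripe 1 holds a bad block
drive_b = [b"\x03\x04", b"\x05\x06", b"\x30\x40"]
rebuilt, punctured = rebuild_drive([drive_a, drive_b])
print(punctured)  # prints [1]
```

In this model the rebuild completes and redundancy is restored for stripes 0 and 2, while the data in stripe 1 remains lost.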
Any condition that causes data to be inaccessible in the same stripe on more than one drive is a double fault.
Double faults cause the loss of all data within the impacted stripe.
All punctures are double faults, but not all double faults are punctures.
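The definition above can be expressed as a small check: a stripe has a double fault when more than one drive cannot supply its block for that stripe. The sketch below is a hypothetical helper, not a Dell utility; the drive names and bad-block maps are made up, and a failed drive is treated as unreadable in every stripe.

```python
def double_fault_stripes(bad_blocks, failed_drives=frozenset()):
    """Return the stripes in which more than one drive is unreadable.

    bad_blocks: dict mapping a drive name to the set of stripe numbers
    that contain a bad block on that drive.
    failed_drives: drives that are offline, hence unreadable everywhere.
    """
    affected = sorted(set().union(*bad_blocks.values()))
    result = []
    for stripe in affected:
        unreadable = set(failed_drives)
        unreadable.update(d for d, bad in bad_blocks.items() if stripe in bad)
        if len(unreadable) > 1:
            result.append(stripe)
    return result


# Single faults in different stripes on different drives.
bad = {"drive0": {1}, "drive1": {2}, "drive2": set()}
print(double_fault_stripes(bad))                            # prints []
print(double_fault_stripes(bad, failed_drives={"drive2"}))  # prints [1, 2]
```

While every drive is online, each stripe has at most one bad block and stays recoverable. Once drive2 fails, stripes 1 and 2 each have two unreadable members, which is the kind of situation described by the double fault figure caption.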
Proactive maintenance can correct existing errors and prevent some errors from occurring.
Update drivers and firmware on controllers, hard drives, backplanes, and other devices.
Perform routine Check Consistency operations.
Review logs for indications of problems.
Note: If the Check Consistency completes without errors, you can safely assume that the array is now healthy and the puncture is removed. Data can now be restored to the healthy array.
Caution: If a known or suspected double fault or puncture condition exists, follow these steps to minimize the risk of more severe problems:
Perform a routine Check Consistency (the array must be optimal)
Determine if hardware problems exist
Check the controller log
Perform hardware diagnostics
Contact Dell Technical Support as needed
Note: Even after these steps have been completed, additional concerns remain. Punctures can cause hard drives to go into a predictive failure status over time. Data errors that are propagated to a drive are reported as media errors on the drive, even though no hardware problems exist.
Note: Monitoring the system allows problems to be detected and corrected in a timely manner, which also reduces the risk of more serious problems.