Dell PERC 9 controllers (H330, H730, H730P, and H830) introduced a feature called Rapid Rebuild that shortens the time needed to rebuild failed drives under certain conditions. This feature is based on T10 Rebuild Assist. Dell has determined that data integrity issues can occur when this feature is used under certain conditions.
Table of contents
- Feature Operation
- Problem Statement
- How can I tell if this has happened
- Solution
Feature Operation
Any drive that is capable of Rapid Rebuild registers this capability with the controller. The feature is supported on parity RAID virtual disks (VDs): RAID 5, RAID 6, RAID 50, and RAID 60. It requires a server to have capable drives, a parity-based RAID level, and a configured hot spare (either global or dedicated to the exact VD). Each capable drive in the VD keeps track of its own failed blocks/sectors. A drive may then fail in such a way that it can still communicate with the PERC and tell the PERC which sectors are still "good". Instead of performing time-consuming RAID recovery XOR algorithms for the entire disk, the PERC copies the good sectors to the hot spare and only has to recover the known bad sectors. Without Rapid Rebuild, the PERC has to rebuild all sectors, which can be very time consuming for large-capacity drives.
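The difference between the two rebuild paths can be illustrated with a small sketch. It is a simplified model, not the PERC firmware: the function names, the in-memory lists standing in for drives, and the fixed (non-rotating) parity layout are all assumptions made for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID 5 style parity math)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rapid_rebuild(failed_drive, bad_sectors, surviving_drives):
    """Build the hot spare contents for one failed drive (hypothetical model).

    failed_drive     -- list of sectors read from the failing drive; entries
                        at the indices in bad_sectors are ignored
    bad_sectors      -- set of sector indices the drive reported as bad
    surviving_drives -- the other members of the VD (data and parity)

    Returns the hot spare contents and the number of sectors that needed
    XOR recovery.
    """
    hot_spare = []
    recovered = 0
    for index, sector in enumerate(failed_drive):
        if index in bad_sectors:
            # Known bad sector: recover it from the surviving members.
            hot_spare.append(xor_blocks([d[index] for d in surviving_drives]))
            recovered += 1
        else:
            # Good sector: a plain copy is enough.
            hot_spare.append(sector)
    return hot_spare, recovered
```

In this model, a full rebuild corresponds to passing every sector index in bad_sectors, so the XOR recovery runs for the whole drive rather than for the few sectors the drive reported as bad.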
Problem Statement
When the PERC is rebuilding the data for the "bad" sectors, it incorrectly writes data from cache to the failed drive instead of the hot spare. This results in data and associated parity not being written to the hot spare. In write through mode, parity errors will occur. In write back mode, errors will occur in both data and associated parity.
How can I tell if this has happened
Note: How to extract the PERC Controller log is explained in the article SLN295784.
If the PERC Controller log contains an entry like the one below, showing a drive state change to REBUILDASSIST, you have encountered the issue.
C0:EVT#395950-08/17/16 13:54:59: 114=State change on PD 0b(e0x20/s11) from OFFLINE(XX) to REBUILDASSIST(12)
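For a long log, the check can be scripted. The following is a minimal sketch that assumes the log has already been extracted to text and that every relevant entry follows the same layout as the sample line above; the function name and the regular expression are illustrative and are not part of any Dell tool.

```python
import re

# Matches a state change *to* REBUILDASSIST, based on the sample entry above.
REBUILD_ASSIST = re.compile(
    r"State change on PD (?P<pd>\S+) from (?P<old>\w+)\(\w+\) to REBUILDASSIST\("
)


def find_rebuild_assist_events(log_lines):
    """Return (physical disk, previous state) for each matching log line."""
    events = []
    for line in log_lines:
        match = REBUILD_ASSIST.search(line)
        if match:
            events.append((match.group("pd"), match.group("old")))
    return events
```

The number of events returned also matters for the solution, because more than one Rebuild Assist occurrence on the same VD changes the recommended recovery.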
Solution
- If your VD was in Write Through mode, only parity data is at risk, and running a consistency check (CC) will restore your parity. This works only if Rebuild Assist occurred once. If Rebuild Assist occurred more than once on the same VD, restore your data from a previous backup.
- If your VD was in Write Back mode and you have encountered the issue, there is no way to recover the lost data. Restore your data from a previous backup.
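The recovery choices above can be summarized as a small decision helper. This is only a sketch of the rules stated in this article; the function name, the policy strings, and the returned messages are hypothetical.

```python
def recommended_action(write_policy, rebuild_assist_count):
    """Map a VD's cache policy and Rebuild Assist count to a recovery step.

    write_policy         -- "write-through" or "write-back" (assumed labels)
    rebuild_assist_count -- Rebuild Assist occurrences seen for this VD
    """
    if rebuild_assist_count == 0:
        # Issue not encountered: prevention only.
        return "update firmware"
    if write_policy == "write-through" and rebuild_assist_count == 1:
        # Only parity is at risk after a single occurrence.
        return "run consistency check"
    # Write Back mode, or repeated occurrences on the same VD.
    return "restore from backup"
```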
If you have not encountered this issue, protect against this scenario by updating your PERC H730, H730P, and H830 controller firmware to 25.5.0.0018 or later, and your PERC H330 controller firmware to 25.5.0.0019 or later. These firmware versions disable the Rapid Rebuild feature.
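When checking many servers, the minimum versions can be compared programmatically. The sketch below assumes that firmware versions are dotted numeric strings that compare component by component; the table and function names are illustrative, and how the installed version is obtained is left out.

```python
# Minimum firmware versions named in this article, per controller model.
MIN_FIRMWARE = {
    "H730": "25.5.0.0018",
    "H730P": "25.5.0.0018",
    "H830": "25.5.0.0018",
    "H330": "25.5.0.0019",
}


def parse_version(version):
    """Turn '25.5.0.0018' into (25, 5, 0, 18) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def has_fixed_firmware(model, installed_version):
    """True if the installed firmware is at or above the listed minimum."""
    minimum = MIN_FIRMWARE[model.upper()]
    return parse_version(installed_version) >= parse_version(minimum)
```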
To download the latest firmware version, go to the "Drivers and Downloads" section for a 13G server and expand the "SAS Raid" menu.
The corrected firmware is now installed at the factory, so new servers are not exposed to this issue.
Dell Note: As part of on-going business process improvement across all key functions, Dell continually reviews key processes and implements improvements. Dell places a high focus on the development, test and manufacturing processes for our server and storage systems. These process improvements will help prevent future problems and are allowing Dell to react more rapidly and more aggressively to potential issues in the field.