Dell R750 Raid controller with Write-Back cache breaks

Question

Hi,

I've recently run into an issue with my disk operations.

I'm running Apache Kafka on 12 R750 machines.

Each machine is configured with 16 disks (1T MTFDDAK480TDS); in positions 0 and 1 I've set up RAID1 for the OS, with write-back cache enabled.

For the remaining 14 disks (2T HFS1T9G3H2X069N), I've configured EACH as its own RAID0 virtual disk with write-back cache (basically, I just want write-back enabled on each disk) and presented each disk separately to the OS.

The read/write workload is very intensive, but the actual IOPS are far below what the vendor specifies (about 10-15% of the rated read/write figures).
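To put a number on that gap, one option is to sample /proc/diskstats twice and divide the change in completed reads and writes by the interval. The sketch below assumes Linux and the standard /proc/diskstats field layout. The function names are made up for illustration, and the rated IOPS value passed to percent_of_rated is something you would take from the drive datasheet; it is not a figure from this setup.

```python
import time


def parse_diskstats(text):
    """Return {device: (reads_completed, writes_completed)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        # Field 2 is the device name, field 3 reads completed, field 7 writes completed.
        stats[fields[2]] = (int(fields[3]), int(fields[7]))
    return stats


def iops(before, after, seconds):
    """Per-device (read IOPS, write IOPS) between two parsed snapshots."""
    result = {}
    for dev, (reads_after, writes_after) in after.items():
        if dev in before:
            reads_before, writes_before = before[dev]
            result[dev] = (
                (reads_after - reads_before) / seconds,
                (writes_after - writes_before) / seconds,
            )
    return result


def percent_of_rated(measured_iops, rated_iops):
    """Measured IOPS as a percentage of the datasheet figure (rated_iops is an assumption)."""
    return 100.0 * measured_iops / rated_iops


def sample(interval=5.0, path="/proc/diskstats"):
    """Take two snapshots `interval` seconds apart and return per-device IOPS."""
    with open(path) as f:
        first = parse_diskstats(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_diskstats(f.read())
    return iops(first, second, interval)
```

Calling sample() on a broker under load and comparing the result per sdX device against the datasheet number would show whether the 10-15% figure holds for every disk or only for some of them.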

Machines: Intel Xeon Silver 4316 CPU, 256 GB RAM

Controller:

Name: PERC H755 Front (Embedded)
Device Description: RAID Controller in SL 3
PCI Slot: Not Applicable
Firmware Version: 52.21.0-4606
Driver Version: --NA--
Cache Memory Size: 8192 MB

Now for the issue:

About once a week, one of the machines in the cluster crashes.

The OS freezes with read and write errors such as:

[2109203.470930] EXT4-fs (dm-5): I/O error while writing superblock
[2109203.470931] EXT4-fs (dm-5): previous I/O error to superblock detected
[2109203.470940] EXT4-fs error (device dm-1): _ext4_find_entry:1577: inode #391899: comm kworker/u161:[illegible]: reading directory lblock 0
[2109203.470949] EXT4-fs warning (device sdg1): ext4_end_bio:311: I/O error 10 writing to inode 19923086 (offset 60616704 size 492160 starting block 66002410)
[2109203.470959] EXT4-fs (dm-5): I/O error while writing superblock
[2109203.471187] EXT4-fs warning (device sdg1): ext4_end_bio:311: I/O error 10 writing to inode 19923086 (offset 60616704 size 492160 starting block [illegible])
[2109203.471318] EXT4-fs error (device sdl1): mpage_map_and_submit_extent:2633: comm kworker/u162:8: Failed to mark inode 79167 22 dirty
[2109203.471335] EXT4-fs (sdl1): I/O error while writing superblock
[2109203.471338] EXT4-fs error (device sdl1) in ext4_writepages:2962: Journal has aborted
[2109203.471355] EXT4-fs (sdl1): I/O error while writing superblock
[2109203.471390] EXT4-fs error (device dm-1): _ext4_find_entry:1577: inode #391899: comm kworker/u161:2: reading directory lblock 0
[2109203.471399] EXT4-fs (dm-1): I/O error while writing superblock
[2109203.471424] Core dump to |/usr/share/apport/apport pipe failed
[2109203.471503] Read-error on swap-device (253:0:3028830)
[2109203.471508] EXT4-fs warning (device sdg1): ext4_end_bio:311: I/O error 10 writing to inode 19923086 (offset 60616704 size 492160 starting block 66003028)
[2109203.471570] EXT4-fs error (device dm-1): _ext4_find_entry:1577: inode #391899: comm kworker/u162:4: reading directory lblock 0
[2109203.471591] Core dump to |/usr/share/apport/apport pipe failed
[2109203.471595] JBD2: Detected IO errors while flushing file data on sdl1-8
[2109203.471604] EXT4-fs (sdl1): I/O error while writing superblock
[2109203.471629] Read-error on swap-device (253:0:6001568)
[2109203.471631] Read-error on swap-device (253:0:6001576)
[2109203.471735] Aborting journal on device sdn1-8.
[2109203.471791] Read-error on swap-device (253:0:7510600)
[2109203.471827] EXT4-fs warning (device sdg1): ext4_end_bio:311: I/O error 10 writing to inode [illegible] (offset [illegible] size 492160 starting block 66003200)
[2109203.471845] JBD2: Error -5 detected when updating journal superblock for sdn1-8.
[2109203.471864] EXT4-fs (dm-1): I/O error while writing superblock
[2109203.471971] EXT4-fs error (device dm-1): _ext4_find_entry:1577: inode [illegible]: comm kworker/u161:2: reading directory lblock 0
[2109203.471921] Read-error on swap-device (253:0:2980224)
[2109203.471937] EXT4-fs (dm-1): I/O error while writing superblock
[2109203.471958] Read-error on swap-device (253:0:7445152)
[illegible] 277952 starting block 66009607)
[2109203.472193] EXT4-fs error (device dm-2): ext4_wait_block_bitmap:519: comm supervisord: Cannot read block bitmap - block_group = 209, block_bitmap = 6815245
[2109203.472209] EXT4-fs error (device dm-2): ext4_discard_preallocations:4105: comm supervisord: Error -5 loading buddy information for 209
[2109203.472252] Core dump to |/usr/share/apport/apport pipe failed
[2109203.472255] EXT4-fs error (device sdn1) in ext4_reserve_inode_write:6031: Journal has aborted
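One way to read a log like this is to count the errors per block device. If many unrelated devices (sdg1, sdl1, sdn1, the dm-* volumes) all fail within the same fraction of a second, that would be consistent with the controller dropping out rather than with a single bad disk, though it does not prove it. The following is only a rough sketch: it assumes one kernel message per line, and the function name and regular expressions are illustrative and cover just the message shapes seen above.

```python
import re
from collections import Counter

# Patterns for the message shapes seen in this thread; not an exhaustive list.
DEVICE_PATTERNS = [
    # "EXT4-fs error (device sdn1) ..." and "EXT4-fs warning (device sdg1): ..."
    re.compile(r"EXT4-fs (?:error|warning) \(device ([\w-]+)\)"),
    # "EXT4-fs (dm-5): I/O error while writing superblock"
    re.compile(r"EXT4-fs \(([\w-]+)\): .*I/O error"),
    # "Aborting journal on device sdn1-8." (the "-8" journal suffix is dropped)
    re.compile(r"Aborting journal on device ([\w-]+?)-\d+\."),
]


def count_errors_by_device(log_text):
    """Count matching error lines per device name in kernel log text."""
    counts = Counter()
    for line in log_text.splitlines():
        for pattern in DEVICE_PATTERNS:
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
                break  # count each line at most once
    return counts
```

Feeding it the saved console output (or the text of the kernel log from around the crash) gives a quick per-device tally to attach to a support case.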

At this point the iDRAC shows question marks on all controller components, and the controller only starts working again after a cold reboot.

Some of the data disks' filesystems end up broken, and I need to fsck them before I can continue working.

I need a hint, something to check, because I'm about to stop using write-back and lose a lot of performance :(

Thanks!

DELL-Chris H · Answer

user_c9e2f8,

There really isn't enough information to make a determination at this point. Would you do me a favor and confirm some details for me?

How is it confirmed that it's writeback that's causing it?

Are the servers up to date and current? What are the current versions of the BIOS, iDRAC, RAID controller, etc.?

How many systems are being impacted?

What troubleshooting have you already tried?

If it happens across multiple drives, what other commonalities have you noticed?

When one machine in the cluster crashes, is it always the same machine, or a different one each time?

Lastly, what's occurring on the servers at the time of the crash/lock up? Any consistency to the timing?

Let me know what you see and we can go from there.

user_c9e2f8 · Answer

@DELL-Chris H Hi, sorry for the late response; I didn't get notified when you replied here :(

How is it confirmed that it's writeback that's causing it?

It is not confirmed.

Are the servers up to date and current? What are the current versions of the BIOS, iDRAC, RAID controller, etc.?

Some of them were upgraded to the latest versions (BIOS, controller) and some were not, but they fail either way (new or old firmware).

How many systems are being impacted?

Currently 12 R750 machines

What troubleshooting have you already tried?

I looked for anomalies in the machine metrics, such as higher disk I/O or CPU usage (found nothing unusual), and tried reading the controller logs with megacli. It seems the controller just disconnects suddenly...

If it happens across multiple drives, what other commonalities have you noticed?

TBH, nothing: the same errors at different times (about once a month per machine).

When one machine in the cluster crashes, is it always the same machine, or a different one each time?

It is usually a different one, though the same machine can crash twice within a few days, but that's rare.

Lastly, what's occurring on the servers at the time of the crash/lock up? Any consistency to the timing?

Very sporadic, at any hour, 7 days a week, nothing specific. It always shows errors saying the machine can't read data from the disks.
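To back up "always different machine, no pattern in the timing" with something checkable, the crash history could be tallied per host and per hour of day. This is a minimal sketch under the assumption that each crash is recorded as a (hostname, ISO-8601 timestamp) pair; the function names and the 7-day window are hypothetical choices, not anything the controller or iDRAC provides.

```python
from collections import Counter
from datetime import datetime, timedelta


def summarize_crashes(events):
    """Return (crashes per host, crashes per hour of day) for (host, ISO timestamp) pairs."""
    per_host = Counter()
    per_hour = Counter()
    for host, stamp in events:
        when = datetime.fromisoformat(stamp)
        per_host[host] += 1
        per_hour[when.hour] += 1
    return per_host, per_hour


def repeat_crashes(events, within_days=7):
    """Return the set of hosts that crashed twice within `within_days` days."""
    window = timedelta(days=within_days)
    last_seen = {}
    repeats = set()
    # Process events in chronological order so each crash is compared to the previous one.
    for when, host in sorted((datetime.fromisoformat(s), h) for h, s in events):
        previous = last_seen.get(host)
        if previous is not None and when - previous <= window:
            repeats.add(host)
        last_seen[host] = when
    return repeats
```

A flat per-hour histogram and an even spread across hosts would support the "sporadic" reading; a cluster around one hour or one host would be worth mentioning to support.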

Let me know what you see and we can go from there.

Another question: is it possible to somehow bypass the protection mechanism that forces me to press X?

In the meantime, I would like to auto-restart those machines (instead of waking up in the middle of the night). Since it's Kafka, I don't care too much about the data.

Thanks!
