1 Rookie • 4 Posts
R740 CPU lockups
Hi folks,
We have some issues with SAN storage causing kernel CPU lockups and kernel crashes on a pair of Dell R740 servers (Oracle RAC). Does this sound like a hardware issue, perhaps related to the BIOS, and would a later BIOS potentially resolve it? Oracle are suggesting it may be.
I can provide a full console log with stack traces if required.
Message from syslogd@xxxxx at Jul 9 00:45:11 ...
kernel:watchdog: BUG: soft lockup - CPU#43 stuck for 22s! [migration/43:228]
Message from syslogd@xxxxx at Jul 9 00:45:11 ...
kernel:watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [rcu_sched:12]
Message from syslogd@xxxxx at Jul 9 00:45:14 ...
kernel:watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:72680]
Message from syslogd@xxxxx at Jul 9 00:45:23 ...
kernel:watchdog: BUG: soft lockup - CPU#36 stuck for 22s! [perl:18444]
Message from syslogd@xxxxx at Jul 9 00:45:34 ...
kernel:watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [migration/19:108]
Message from syslogd@xxxxx at Jul 9 00:46:00 ...
kernel:watchdog: BUG: soft lockup - CPU#36 stuck for 23s! [perl:18444]
Message from syslogd@xxxxx at Jul 9 00:46:06 ...
kernel:watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [migration/19:108]
Message from syslogd@xxxxx at Jul 9 00:46:32 ...
kernel:watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [migration/19:108]
[ 274.185013] rcu: 13-...!: (1 GPs behind) idle=c42/0/0x1 softirq=535/535 fqs=14
[ 274.243373] sd 19:0:2:14: alua: port group 01 state A preferred supports tolusnA
[ 274.302739] rcu: 26-...!: (0 ticks this GP) idle=75a/1/0x4000000000000000 softirq=347/347 fqs=15
[ 310.498515] NMI watchdog: Watchdog detected hard LOCKUP on cpu 48
[ 310.498515] Modules linked in: mgag200 lpfc drm_kms_helper sd_mod syscopyarea nvmet_fc sysfillrect sysimgblt ahci fb_sys_fops nvmet drm_vram_helper ttm libahci nvme_fc nvme_fabrics uas igb nvme_core drm i40e megaraid_sas libata usb_storage scsi_transport_fc dca i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod fuse
[ 310.498522] CPU: 48 PID: 12 Comm: rcu_sched Not tainted 5.4.17-2136.326.6.1.el7uek.x86_64 #2
[ 310.498523] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.10.2 02/24/2021
[ 310.498523] RIP: 0010:native_queued_spin_lock_slowpath+0x6c/0x1fd
[ 310.498524] Code: ff ff 75 43 f0 0f ba 2f 08 0f 82 31 01 00 00 31 d2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 20 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 5d c3 cc cc cc cc f6 c4
[ 310.498524] RSP: 0018:ffffb99ac0263e20 EFLAGS: 00000002
[ 310.498525] RAX: 00000000006c0101 RBX: 0000000000000246 RCX: 0000000000000000
[ 310.498525] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9ee70340
[ 310.498525] RBP: ffffb99ac0263e20 R08: 00000000ffff0520 R09: 0000000000000003
[ 310.498526] R10: 0000000000000008 R11: 071c71c71c71c71c R12: 0000000000000001
[ 310.498526] R13: 0000000000004000 R14: 000000000000000f R15: ffffffff9ee70340
[ 310.498526] FS: 0000000000000000(0000) GS:ffffa109c0e00000(0000) knlGS:0000000000000000
[ 310.498527] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 310.498527] CR2: 00007fd66a1f8000 CR3: 0000003e3860a004 CR4: 00000000007606e0
[ 310.498527] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 310.498528] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 310.498528] PKRU: 55555554
[ 310.498528] Call Trace:
[ 310.498528] <NMI>
[ 310.498528] ? show_regs+0x59/0x60
[ 310.498529] ? watchdog_overflow_callback+0xb2/0x117
[ 310.498529] ? __perf_event_overflow+0x57/0xee
[ 310.498529] ? perf_event_overflow+0x14/0x1a
<snip>
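For anyone triaging a log like the one above, a rough Python sketch for tallying the soft lockup messages by CPU and by task can help show whether the stalls cluster on particular cores. The regular expression is only an assumption based on the message format shown in this post, and the names `SOFT_LOCKUP` and `tally_soft_lockups` are made up for illustration; other kernels may word the message differently.

```python
import re
from collections import Counter

# Assumed format, taken from the messages quoted in this post, e.g.:
# kernel:watchdog: BUG: soft lockup - CPU#43 stuck for 22s! [migration/43:228]
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def tally_soft_lockups(lines):
    """Count soft lockup reports per CPU number and per task name."""
    by_cpu = Counter()
    by_task = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            by_cpu[int(match.group("cpu"))] += 1
            by_task[match.group("task")] += 1
    return by_cpu, by_task


# A few of the lines from the post, used as sample input.
SAMPLE = [
    "kernel:watchdog: BUG: soft lockup - CPU#43 stuck for 22s! [migration/43:228]",
    "kernel:watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:72680]",
    "kernel:watchdog: BUG: soft lockup - CPU#36 stuck for 22s! [perl:18444]",
    "kernel:watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [migration/19:108]",
    "kernel:watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [migration/19:108]",
]

cpu_counts, task_counts = tally_soft_lockups(SAMPLE)
print("per CPU:", cpu_counts.most_common())
print("per task:", task_counts.most_common())
```

Feeding the full console log through the same function (one line per list element) would give the complete picture; the sample here only covers a handful of the messages.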
Thanks
Andy
DELL-Rey G
3 Apprentice • 949 Posts
July 10th, 2024 15:25
I would also suggest updating the CPLD after updating the BIOS and iDRAC. I have seen similar error messages in VMware when the vSAN storage that the operating system is running on is having problems. Is the operating system running on a BOSS card? If yes, I would also suggest updating the BOSS firmware.
Rey
#Iwork4Dell
DELL-Erman O
Moderator • 2.5K Posts
July 10th, 2024 14:08
Hi Andy,
Do you see any warnings in the system event log on iDRAC? I always suggest checking that the BIOS and iDRAC are up to date.
Enabling processor x2APIC might be helpful, as described in this article: PowerEdge R740: Enabling Processor x2APIC Support | CPU Best Practices | Dell Technologies Info Hub. I'm not familiar with Oracle, but checking for a kernel update can also help.
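As a quick way to check the BIOS level without logging in to iDRAC, the installed version can be read from the `Hardware name:` line the kernel prints in the trace. The Python sketch below parses that line and compares it against a candidate version. The helper names (`parse_bios`, `is_older`) are made up for illustration, and the candidate version in the example is a hypothetical placeholder, not a real Dell release number; take the actual latest version from the Dell support page for the server.

```python
import re
from datetime import date

# Assumed format, taken from the trace in the original post:
# Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.10.2 02/24/2021
HARDWARE_NAME = re.compile(
    r"Hardware name: .+?, BIOS (?P<version>\S+) "
    r"(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})"
)


def parse_bios(line):
    """Return (version, release_date) from a 'Hardware name:' line, or None."""
    match = HARDWARE_NAME.search(line)
    if match is None:
        return None
    released = date(int(match["year"]), int(match["month"]), int(match["day"]))
    return match["version"], released


def version_tuple(version):
    """Turn a dotted numeric version such as '2.10.2' into (2, 10, 2)."""
    return tuple(int(part) for part in version.split("."))


def is_older(installed, candidate):
    """True if the installed version sorts before the candidate version."""
    return version_tuple(installed) < version_tuple(candidate)


LOG_LINE = (
    "[ 310.498523] Hardware name: Dell Inc. PowerEdge R740/0923K0, "
    "BIOS 2.10.2 02/24/2021"
)
# Hypothetical placeholder only; replace with the version listed by Dell.
CANDIDATE_VERSION = "2.99.0"

installed_version, released_on = parse_bios(LOG_LINE)
print("installed BIOS:", installed_version, "dated", released_on.isoformat())
print("older than candidate:", is_older(installed_version, CANDIDATE_VERSION))
```

Comparing as integer tuples matters here because a plain string comparison would sort 2.10.2 before 2.9.x.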
Hope that helps!