1 Rookie

 • 

4 Posts


July 10th, 2024 09:20

R740 CPU lockups

Hi folks,

We have some issues with SAN storage causing kernel CPU lockups and kernel crashes on a pair of Dell R740 servers (Oracle RAC). Does this sound like a hardware issue, perhaps related to the BIOS? Would a later BIOS potentially resolve it? Oracle are suggesting it may be.

I can provide a full console log with stack traces if required.

Message from syslogd@xxxxx at Jul 9 00:45:11 ...
kernel:watchdog: BUG: soft lockup - CPU#43 stuck for 22s! [migration/43:228]

Message from syslogd@xxxxx at Jul 9 00:45:11 ...
kernel:watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [rcu_sched:12]

Message from syslogd@xxxxx at Jul 9 00:45:14 ...
kernel:watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:72680]

Message from syslogd@xxxxx at Jul 9 00:45:23 ...
kernel:watchdog: BUG: soft lockup - CPU#36 stuck for 22s! [perl:18444]

Message from syslogd@xxxxx at Jul 9 00:45:34 ...
kernel:watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [migration/19:108]

Message from syslogd@xxxxx at Jul 9 00:46:00 ...
kernel:watchdog: BUG: soft lockup - CPU#36 stuck for 23s! [perl:18444]

Message from syslogd@xxxxx at Jul 9 00:46:06 ...
kernel:watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [migration/19:108]

Message from syslogd@xxxxx at Jul 9 00:46:32 ...
kernel:watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [migration/19:108] 

[  274.185013] rcu:     13-...!: (1 GPs behind) idle=c42/0/0x1 softirq=535/535 fqs=14
[  274.243373] sd 19:0:2:14: alua: port group 01 state A preferred supports tolusnA
[  274.302739] rcu:     26-...!: (0 ticks this GP) idle=75a/1/0x4000000000000000 softirq=347/347 fqs=15
[  310.498515] NMI watchdog: Watchdog detected hard LOCKUP on cpu 48
[  310.498515] Modules linked in: mgag200 lpfc drm_kms_helper sd_mod syscopyarea nvmet_fc sysfillrect sysimgblt ahci fb_sys_fops nvmet drm_vram_helper ttm libahci nvme_fc nvme_fabrics uas igb nvme_core drm i40e megaraid_sas libata usb_storage scsi_transport_fc dca i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod fuse
[  310.498522] CPU: 48 PID: 12 Comm: rcu_sched Not tainted 5.4.17-2136.326.6.1.el7uek.x86_64 #2
[  310.498523] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.10.2 02/24/2021
[  310.498523] RIP: 0010:native_queued_spin_lock_slowpath+0x6c/0x1fd
[  310.498524] Code: ff ff 75 43 f0 0f ba 2f 08 0f 82 31 01 00 00 31 d2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 20 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 5d c3 cc cc cc cc f6 c4
[  310.498524] RSP: 0018:ffffb99ac0263e20 EFLAGS: 00000002
[  310.498525] RAX: 00000000006c0101 RBX: 0000000000000246 RCX: 0000000000000000
[  310.498525] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff9ee70340
[  310.498525] RBP: ffffb99ac0263e20 R08: 00000000ffff0520 R09: 0000000000000003
[  310.498526] R10: 0000000000000008 R11: 071c71c71c71c71c R12: 0000000000000001
[  310.498526] R13: 0000000000004000 R14: 000000000000000f R15: ffffffff9ee70340
[  310.498526] FS:  0000000000000000(0000) GS:ffffa109c0e00000(0000) knlGS:0000000000000000
[  310.498527] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  310.498527] CR2: 00007fd66a1f8000 CR3: 0000003e3860a004 CR4: 00000000007606e0
[  310.498527] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  310.498528] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  310.498528] PKRU: 55555554
[  310.498528] Call Trace:
[  310.498528]  <NMI>
[  310.498528]  ? show_regs+0x59/0x60
[  310.498529]  ? watchdog_overflow_callback+0xb2/0x117
[  310.498529]  ? __perf_event_overflow+0x57/0xee
[  310.498529]  ? perf_event_overflow+0x14/0x1a

<snip>
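In case it helps anyone triaging a similar log: a minimal Python sketch that tallies the soft lockup reports by CPU and by task, to show whether the lockups cluster on a few CPUs or are spread across the box. The function name `tally_soft_lockups` and the `SAMPLE` text are illustrative only (the sample lines are copied from the messages above), and the regex assumes the message format shown in this post.

```python
import re
from collections import Counter

# Matches lines such as:
#   kernel:watchdog: BUG: soft lockup - CPU#43 stuck for 22s! [migration/43:228]
# The task name may itself contain colons (e.g. kworker/4:1), so the PID is
# taken as the digits after the last colon inside the brackets.
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]]+):(?P<pid>\d+)\]"
)


def tally_soft_lockups(log_text):
    """Return (per-CPU counts, per-task counts) of soft lockup reports."""
    by_cpu, by_task = Counter(), Counter()
    for match in SOFT_LOCKUP.finditer(log_text):
        by_cpu[int(match["cpu"])] += 1
        by_task[match["task"]] += 1
    return by_cpu, by_task


# A few of the lines from this post, used as sample input.
SAMPLE = """\
kernel:watchdog: BUG: soft lockup - CPU#43 stuck for 22s! [migration/43:228]
kernel:watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [kworker/4:1:72680]
kernel:watchdog: BUG: soft lockup - CPU#36 stuck for 22s! [perl:18444]
kernel:watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [migration/19:108]
kernel:watchdog: BUG: soft lockup - CPU#19 stuck for 22s! [migration/19:108]
"""

by_cpu, by_task = tally_soft_lockups(SAMPLE)
print("Reports per CPU: ", by_cpu.most_common())
print("Reports per task:", by_task.most_common())
```

To run it against a real log, read the file contents into a string and pass that to `tally_soft_lockups` instead of `SAMPLE`.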

Thanks

Andy

3 Apprentice

 • 

949 Posts

July 10th, 2024 15:25

I would also suggest updating the CPLD after updating the BIOS and iDRAC. I have seen similar error messages in VMware when the vSAN storage that the operating system runs on is having problems. Is the operating system running on a BOSS card? If yes, I would also suggest updating the BOSS firmware.

Rey
#Iwork4Dell
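For reference, the BIOS level is already visible in the crash output: the `Hardware name:` line reports BIOS 2.10.2 dated 02/24/2021, so it was more than three years old at the time of the lockups. A minimal Python sketch for pulling that out of a kernel log line follows. The helper names are hypothetical, and the regex assumes the `Hardware name: <system>, BIOS <version> <mm/dd/yyyy>` layout seen in this thread.

```python
import re
from datetime import date, datetime

# Layout assumed from the trace in this thread:
#   Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.10.2 02/24/2021
HARDWARE_LINE = re.compile(
    r"Hardware name: (?P<system>.+), BIOS (?P<version>\S+) "
    r"(?P<date>\d{2}/\d{2}/\d{4})"
)


def parse_hardware_line(line):
    """Return (system, bios_version, bios_date) or None if the line does not match."""
    match = HARDWARE_LINE.search(line)
    if match is None:
        return None
    released = datetime.strptime(match["date"], "%m/%d/%Y").date()
    return match["system"], match["version"], released


def bios_age_days(released, on_day):
    """Number of days between the BIOS date and a given day."""
    return (on_day - released).days


LINE = ("[  310.498523] Hardware name: Dell Inc. PowerEdge R740/0923K0, "
        "BIOS 2.10.2 02/24/2021")

system, version, released = parse_hardware_line(LINE)
# 9 July 2024 is the date of the lockup messages in the original post.
print(system, version, released, bios_age_days(released, date(2024, 7, 9)))
```

This only reports what the running system says about itself; which BIOS, iDRAC, CPLD, and BOSS versions are current still has to be checked against Dell's support site for the service tag.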

Moderator

 • 

2.5K Posts

July 10th, 2024 14:08

Hi Andy, 

Do you see any warnings in the system event log on iDRAC? I always suggest checking that the BIOS and iDRAC are up to date.

Enabling processor x2APIC might be helpful, as described in this article: PowerEdge R740: Enabling Processor x2APIC Support | CPU Best Practices | Dell Technologies Info Hub. I'm not familiar with Oracle, but checking for a kernel update can also help.

 

Hope that helps!
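A rough way to check the x2APIC suggestion from the Linux side is sketched below in Python. It looks for the `x2apic` CPU flag in `/proc/cpuinfo` and collects any kernel log lines that mention x2APIC. Treat it as a sketch under assumptions: the function names are made up for illustration, the CPU flag only shows that the processor supports x2APIC (not that it is enabled in BIOS setup), and the exact wording of the kernel messages varies between kernel versions, so the log lines need to be read by eye.

```python
def cpu_has_x2apic(cpuinfo_text):
    """True if any 'flags' line in /proc/cpuinfo-style text lists x2apic."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "x2apic" in value.split():
            return True
    return False


def x2apic_log_lines(kernel_log_text):
    """Return kernel log lines that mention x2apic (case-insensitive)."""
    return [
        line for line in kernel_log_text.splitlines()
        if "x2apic" in line.lower()
    ]


if __name__ == "__main__":
    # /proc/cpuinfo exists on Linux; on other systems this simply reports the error.
    try:
        with open("/proc/cpuinfo") as handle:
            print("CPU reports x2apic support:", cpu_has_x2apic(handle.read()))
    except OSError as err:
        print("Could not read /proc/cpuinfo:", err)
```

For the log side, save the output of `dmesg` to a file, read it into a string, and pass it to `x2apic_log_lines`.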

 

 
