April 20th, 2021 09:00

PS4210 won't boot

I have a PS4210 that's been running fine for around 4 years.  Very rarely rebooted or power cycled.  When I rebooted it yesterday, it wouldn't boot up fully.  One controller has a red light, the other has two green lights.  The output below is from the active (working) controller.

At login, it gives this:

ATTENTION!
A critical health condition exists on the array. User intervention is required.
Please call your support provider.
Once the problem is repaired the array MUST be rebooted.

Once logged in:

CLI> support exec "raidtool"
Driver Status: *Admin Intervention Requested*
RAID LUN 0 Ok.
raid status unrecoverable.
12 Drives (0,2,4,6,8,1,3,5,7,9,10,11)
RAID 6 (64KB sectPerSU)
Capacity 5,761,185,873,920 bytes
RAID LUN 1 Ok.
raid status unrecoverable.
11 Drives (12,13,14,15,16,17,18,19,20,21,22)
RAID 6 (64KB sectPerSU)
Capacity 5,185,067,286,528 bytes
Available Drives List: 23
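
For anyone who wants to sanity-check output like this from a script, here is a minimal Python sketch (not an EqualLogic tool). It assumes the raidtool text keeps the layout shown above, and `parse_raid_luns` is a hypothetical helper name. It also assumes RAID 6 usable capacity is (drives - 2) times the per-drive size, which lets the two Capacity lines be cross-checked against each other.

```python
import re

# Sample text copied from the raidtool output above.
RAIDTOOL_OUTPUT = """\
Driver Status: *Admin Intervention Requested*
RAID LUN 0 Ok.
raid status unrecoverable.
12 Drives (0,2,4,6,8,1,3,5,7,9,10,11)
RAID 6 (64KB sectPerSU)
Capacity 5,761,185,873,920 bytes
RAID LUN 1 Ok.
raid status unrecoverable.
11 Drives (12,13,14,15,16,17,18,19,20,21,22)
RAID 6 (64KB sectPerSU)
Capacity 5,185,067,286,528 bytes
Available Drives List: 23
"""


def parse_raid_luns(text):
    """Return one dict per 'RAID LUN' stanza (hypothetical helper)."""
    luns = []
    for raw in text.splitlines():
        line = raw.strip()
        start = re.match(r"RAID LUN (\d+)", line)
        if start:
            luns.append({"lun": int(start.group(1))})
            continue
        if not luns:
            continue  # lines before the first LUN stanza
        drives = re.match(r"(\d+) Drives", line)
        capacity = re.match(r"Capacity ([\d,]+) bytes", line)
        if line.startswith("raid status"):
            luns[-1]["status"] = line[len("raid status"):].strip(" .")
        elif drives:
            luns[-1]["drives"] = int(drives.group(1))
        elif capacity:
            luns[-1]["capacity"] = int(capacity.group(1).replace(",", ""))
    return luns


for lun in parse_raid_luns(RAIDTOOL_OUTPUT):
    data_drives = lun["drives"] - 2  # assumption: RAID 6 spends two drives on parity
    print(lun["lun"], lun["status"], lun["capacity"] // data_drives)
# 0 unrecoverable 576118587392
# 1 unrecoverable 576118587392
```

Both LUNs work out to the same per-drive figure under that assumption, so the capacity lines are at least consistent with each other even though the status line reads unrecoverable.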

CLI> support exec "uname -a"
NetBSD 5.0_STABLE NetBSD 5.0_STABLE (EQL.PSS) #0: Mon Apr 30 01:53:12 EDT 2018 build@m64:/buildarea/V10.0.1__Mon_Apr_30_2018_01_40_29_EDT/bin/destdir.sbmips.64.release/EQL.PSS.64 sbmips

CLI> support exec "diskview -j"
Enc/Drive  State   Write   Read    Power   Drive     Bad     ForceWrite  Reset  Read     Scan    Max       Max
                   Retrys  Retrys  Cycles  Timeouts  Blocks  Retrys      Fail   Timeout  Errors  Cominits  HrstMsecs
____________________________________________________________________________________________________________________
0/ 0       Online  0       0       0       0         0       0           0      0        0       0         0
0/ 1       Online  0       0       0       0         0       0           0      0        0       0         0
0/ 2       Online  0       0       0       0         0       0           0      0        0       0         0
0/ 3       Online  0       0       0       0         0       0           0      0        0       0         0
0/ 4       Online  0       0       0       0         0       0           0      0        0       0         0
0/ 5       Online  0       0       0       0         0       0           0      0        0       0         0
0/ 6       Online  0       0       0       0         0       0           0      0        0       0         0
0/ 7       Online  0       0       0       0         0       0           0      0        0       0         0
0/ 8       Online  0       0       0       0         0       0           0      0        0       0         0
0/ 9       Online  0       0       0       0         0       0           0      0        0       0         0
0/10       Online  0       0       0       0         0       0           0      0        0       0         0
0/11       Online  0       0       0       0         0       0           0      0        0       0         0
0/12       Online  0       0       0       0         0       0           0      0        0       0         0
0/13       Online  0       0       0       0         0       0           0      0        0       0         0
0/14       Online  0       0       0       0         0       0           0      0        0       0         0
0/15       Online  0       0       0       0         0       0           0      0        0       0         0
0/16       Online  0       0       0       0         0       0           0      0        0       0         0
0/17       Online  0       0       0       0         0       0           0      0        0       0         0
0/18       Online  0       0       0       0         0       0           0      0        0       0         0
0/19       Online  0       0       0       0         0       0           0      0        0       0         0
0/20       Online  0       0       0       0         0       0           0      0        0       0         0
0/21       Online  0       0       0       0         0       0           0      0        0       0         0
0/22       Online  0       0       0       0         0       0           0      0        0       0         0
0/23       Online  0       0       0       0         0       0           0      0        0       0         0
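
If you capture this table regularly, a short script can flag any row worth a second look. This is only a sketch: it assumes each data row is "enclosure/drive, state, then integer counters" as in the listing above, `drives_needing_attention` is a hypothetical helper, and the second and third sample rows (including the "Failed" state name) are made up for illustration and do not come from this array.

```python
import re

# enclosure/drive, state, then one or more integer counters
ROW = re.compile(r"^\s*(\d+)/\s*(\d+)\s+(\S+)((?:\s+\d+)+)\s*$")


def drives_needing_attention(text):
    """Return (enclosure, drive, state, counters) for every row that is
    not Online or that has at least one non-zero counter."""
    flagged = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, ruler, or blank line
        enclosure, drive = int(match.group(1)), int(match.group(2))
        state = match.group(3)
        counters = [int(n) for n in match.group(4).split()]
        if state != "Online" or any(counters):
            flagged.append((enclosure, drive, state, counters))
    return flagged


# First row is from the listing above; the other two are invented examples.
sample = """\
0/ 0 Online 0 0 0 0 0 0 0 0 0 0 0
0/ 1 Online 0 0 3 0 0 0 0 0 0 0 0
0/ 2 Failed 0 0 0 0 0 0 0 0 0 0 0
"""

for enclosure, drive, state, counters in drives_needing_attention(sample):
    print(f"{enclosure}/{drive}: {state}, counters={counters}")
```

Run against the real listing above, the function returns an empty list, since all 24 drives are Online with zero counters.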

Verified that the support contracts expired just a couple of months ago.  Looking for suggestions on how to recover.  Thanks in advance!

1 Rookie

 • 

1.5K Posts

April 20th, 2021 09:00

Hello, 

 Curious why you rebooted?  

  One thing you can try is to power down the array, pull the active controller out a couple of inches, connect the serial port to the other CM, and power it up.  It's possible the missing cache is located in the other CM.  If so, the array should boot up.  Then you can reinsert the other CM. 

 If it doesn't come up fully, run those two commands again. 

 Depending on where you are you can purchase a one-time support call and get some assistance.  You also might want to investigate renewing the support contract. 

 Do you have a backup of this data? 

 Regards, 

Don 

8 Posts

April 20th, 2021 10:00

Many thanks for the reply Don, very timely and useful.

Rebooted because our software has the ability to reboot and shut down the entire system (this array, other servers, network switch, etc.) and I was testing that capability.

Performed your suggested actions and it booted off the other controller (haven't reinserted the bad one yet).  Array functions fine currently but I won't do anything else until another (current) set of backups is completed and taken offline.

So will the controller with the missing cache (once reinserted) then get that data from the one that's working now, and it will be ok, and no expected problems in the future?   Or should I expect that the controller that had the missing data will do that again someday and therefore it should be replaced?  (support stuff being handled by our contracts people)

Thanks again!

1 Rookie

 • 

1.5K Posts

April 20th, 2021 11:00

Hello, 

 It sounds like you didn't do a proper shutdown or proper restart when you tested your software?   A proper restart or shutdown should flush the cache automatically, and that prevents what happened here: the active CM was out of sync with the passive, but the passive held pending IO. 

 Re: Cache. Doesn't quite work that way.  The active CM always syncs cache with the passive.  So when you insert the controller it will become the passive and sync with the active automatically.  

 Glad that it came up OK. 

 Regards, 

Don 

8 Posts

April 20th, 2021 12:00

This procedure isn't new; we've run it dozens of times on 10+ different suites without issues.  We stop all IO first, then use the API to initiate the reboot properly via the EqualLogic itself.  What I was trying to get at was whether we have an actual hardware problem with this particular 4210 or it was just a fluke.  Sounds like it was just a fluke.   Thanks again for your help, I'll note this fix in our database so that if anyone else runs into it, there's at least somewhere to start.

1 Rookie

 • 

1.5K Posts

April 20th, 2021 13:00

Hello, 

 You are very welcome!  

 Which API are you using?   How long from when you stop IO to when you restart or shutdown? 

 I would not call this a "fix".   It's the first step in a triage process.  It can mean that there is a problem in the transaction database the array uses.  

 Regards, 

Don 

8 Posts

April 21st, 2021 10:00

We're using the PowerShell API, via the eqlpstools.dll file.   There's at least a 10 second delay between both ESXi servers being down (actually down, not just started shutdown) and when the EqualLogic is told to shut down or reboot.
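
To make the ordering explicit, here is a rough Python sketch of that sequence: confirm the hosts are really down, wait a settle period, then tell the array to shut down. It is not our actual PowerShell code. `shutdown_in_order`, `hosts_are_down`, and `shutdown_array` are hypothetical names standing in for whatever checks and API calls you use, and the 10 second default only mirrors the delay described above; nothing in this thread establishes how long is actually sufficient.

```python
import time


def shutdown_in_order(hosts_are_down, shutdown_array, settle_seconds=10,
                      poll_seconds=1.0, timeout_seconds=600,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll hosts_are_down() until it returns True, pause settle_seconds,
    then call shutdown_array().

    Raises TimeoutError, without touching the array, if the hosts are
    still up after timeout_seconds.
    """
    deadline = clock() + timeout_seconds
    while not hosts_are_down():
        if clock() >= deadline:
            raise TimeoutError("hosts still up; array left running")
        sleep(poll_seconds)
    sleep(settle_seconds)  # let any in-flight IO drain before the array goes down
    shutdown_array()


# Dry run with stand-ins: the "hosts" report down on the third poll.
states = iter([False, False, True])
events = []
shutdown_in_order(
    hosts_are_down=lambda: next(states),
    shutdown_array=lambda: events.append("array shutdown"),
    sleep=lambda seconds: events.append(f"sleep {seconds}"),
)
print(events)
# ['sleep 1.0', 'sleep 1.0', 'sleep 10', 'array shutdown']
```

Passing `sleep` and `clock` in as parameters is just so the sequence can be dry-run without waiting or touching real hardware.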

So I've reinserted the 2nd CM, and it's got green and amber lights, so it's assumed the secondary role.  I can log in, but the 'raidview' command returns 'Initialization has not started' and has been that way for an hour.   Since the volumes are up and the primary controller reports the LUNs ok, is there anything else to check?

1 Rookie

 • 

1.5K Posts

April 21st, 2021 10:00

Hello, 

 Re: API.  Ok thanks.  

 Re: RAIDTOOL.  Sounds like you are on the secondary CM, at the CLI> prompt rather than the GrpName> prompt? 

Only the active controller has access to the disks and RAID. 

 On the active CM, if you run GrpName>member select MEMBERNAME show, that will give you the current status of the active and passive CMs. 

  If the volumes are up then the RAIDset has to be up as well. 

 Regards, 

Don

8 Posts

April 21st, 2021 12:00

From the active CM:  member select show    

Returns: status online, no critical or warning conditions, 2 controllers, healthstatus normal, but no specific info about primary vs secondary controllers that I see.  I can get into the PS Group Manager gui (java) and everything looks good there, both CMs up, both batteries good.

8 Posts

April 21st, 2021 13:00

Yep, everything looks good here, though the secondary CM doesn't seem to know its temps:

member select xxxxeql show controllers
___________________________ Controller Information ____________________________
SlotID: 0                         Status: active
Model: 70-0485(TYPE 19)           BatteryStatus: ok
ProcessorTemperature: 46          ChipsetTemperature: 32
LastBootTime: 2021-04-20:13:21:46 SerialNumber:
Manufactured: 057K                   CN-04V1MG-XXXXX-XXX-XXXX
ECOLevel: C00                     CM Rev.: A02
FW Rev.: Storage Array Firmware   BootRomVersion: 4.4.6
V10.0.1 (R457114)                 BootRomBuilDate: Mon Jul 28 13:57:41
                                     EDT 2014
_______________________________________________________________________________
SlotID: 1                         Status: secondary
Model: 70-0485(TYPE 19)           BatteryStatus: ok
ProcessorTemperature: 0           ChipsetTemperature: 0
LastBootTime: 2021-04-21:12:04:07 SerialNumber:
Manufactured: 057K                  CN-04V1MG-XXXXX-XXX-XXXX
ECOLevel: C00                     CM Rev.: A02
FW Rev.: Storage Array Firmware   BootRomVersion: 4.4.6
V10.0.1 (R457114)                 BootRomBuilDate: Mon Jul 28 13:57:41
                                     EDT 2014
______________________________ Cache Information ______________________________
CacheMode: write-back Controller-Safe: disabled
Low-Battery-Safe: enabled
_______________________________________________________________________________
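
In case it helps anyone else keep an eye on this, here is a minimal Python sketch that pulls the per-slot status, battery, and temperature fields out of `show controllers` text and lists slots reporting 0 for both temperatures. It assumes the field labels appear exactly as in the output above; `parse_controllers` and `zero_temp_slots` are hypothetical helper names, and values are kept as strings.

```python
import re

# Field labels as they appear in the "show controllers" output above.
FIELDS = {
    "slot": r"SlotID:\s*(\d+)",
    "status": r"(?<![A-Za-z])Status:\s*(\S+)",  # skip the tail of "BatteryStatus:"
    "battery": r"BatteryStatus:\s*(\S+)",
    "cpu_temp": r"ProcessorTemperature:\s*(\d+)",
    "chipset_temp": r"ChipsetTemperature:\s*(\d+)",
}

SAMPLE = """\
___________________________ Controller Information ____________________________
SlotID: 0                         Status: active
Model: 70-0485(TYPE 19)           BatteryStatus: ok
ProcessorTemperature: 46          ChipsetTemperature: 32
_______________________________________________________________________________
SlotID: 1                         Status: secondary
Model: 70-0485(TYPE 19)           BatteryStatus: ok
ProcessorTemperature: 0           ChipsetTemperature: 0
"""


def parse_controllers(text):
    """Return one dict of string fields per 'SlotID:' stanza."""
    controllers = []
    for chunk in re.split(r"(?=SlotID:)", text):
        if not chunk.startswith("SlotID:"):
            continue  # banner text before the first stanza
        info = {}
        for name, pattern in FIELDS.items():
            match = re.search(pattern, chunk)
            info[name] = match.group(1) if match else None
        controllers.append(info)
    return controllers


def zero_temp_slots(controllers):
    """Slots whose processor and chipset temperatures both read 0."""
    return [c["slot"] for c in controllers
            if c["cpu_temp"] == "0" and c["chipset_temp"] == "0"]


controllers = parse_controllers(SAMPLE)
for c in controllers:
    print(c["slot"], c["status"], c["battery"], c["cpu_temp"], c["chipset_temp"])
print("slots with no temperature reading:", zero_temp_slots(controllers))
```

Because the patterns search within each stanza rather than line by line, the same sketch should cope with either the two-column layout or one field per line.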

1 Rookie

 • 

1.5K Posts

April 21st, 2021 13:00

Hello, 

 Re: Temp. I suspect that has to do with being booted up on a single CM.   Rebooting it again in the future, or removing that CM, waiting 1-2 minutes, and reinserting it, might resolve that.  You can also connect to the secondary CM via serial and reboot it to see if that restores the temp info. 

 Regards, 

Don 

1 Rookie

 • 

1.5K Posts

April 21st, 2021 13:00

Hello, 

Sorry, add "controllers": 

member select MEMBERNAME show controllers    You will see 'active' and 'secondary' as their status. 

mem sel aer11array-m1 show controllers
___________________________ Controller Information ____________________________
SlotID: 0
Status: active
Model: 70-0477(TYPE 14)
BatteryStatus: ok
ProcessorTemperature: 68
ChipsetTemperature: 27
LastBootTime: 2021-02-21:09:50:38
SerialNumber: XXXXXXXXXXXXXXXX
Manufactured: 02CC
ECOLevel: C00
CM Rev.: A03
FW Rev.: Storage Array Firmware V9.1.2 (R440008)
BootRomVersion: 3.9.1
BootRomBuilDate: Wed Nov 16 14:22:40 EST 2011
_______________________________________________________________________________
_______________________________________________________________________________
SlotID: 1
Status: secondary
Model: 70-0477(TYPE 14)
BatteryStatus: failed
ProcessorTemperature: 0
ChipsetTemperature: 0
LastBootTime: 2021-02-21:09:52:53
SerialNumber: XXXXXXXXXXXXXXX
Manufactured: 02CC
ECOLevel: C00
CM Rev.: A03
FW Rev.: Storage Array Firmware V9.1.2 (R440008)
BootRomVersion: 3.9.1
BootRomBuilDate: Wed Nov 16 14:22:40 EST 2011
_______________________________________________________________________________

______________________________ Cache Information ______________________________
CacheMode: write-back
Controller-Safe: disabled
Low-Battery-Safe: enabled

 

1 Rookie

 • 

1.5K Posts

April 21st, 2021 14:00

Hello, 

 The transaction DB is on the RAID set, not the controllers.  If that were bad you would not be able to boot.  It has a self-checking mechanism at boot up.  

 Regards, 

Don 

8 Posts

April 21st, 2021 14:00

Ok, looks like I'm good then.  Glad there's no lasting worries.  We'll check on the temps thing and look again into the reboot procedure and see if we missed something.

Thanks for all your help, Don!

Glenn

8 Posts

April 21st, 2021 14:00

Ok, so other than the temps, anything else to look at to determine if there is a problem in the transaction database ?

1 Rookie

 • 

1.5K Posts

April 21st, 2021 15:00

Hello, 

 Yup, looks good.   I am glad I could help you. 

  Take care. 

Regards,

Don
