PS4210 array still initializing after power loss

Question

Hi,

I'm having problems with a Dell PS4210 storage array. Two weeks ago there was a general blackout and the storage lost power completely. The configuration is three PS4210 arrays, one master and two slaves. All of them came back up except the master.

After running some commands to verify the array I get the following:

CLI> support exec "raidtool"
Driver Status: Admin Intervention Requested
RAID LUN 0 Degraded.
raid status unrecoverable.
11 Drives (0,2,4,6,8,1,3,5,7,f,f)
RAID 6 (64KB sectionPerSU)
Capacity 17,617,013,637,120 bytes
Available Drives List: 9

It has been stuck at "storage array still initializing" for 10 days now. Please help, this is urgent.

DELL-Erman O · Answer

Hi,

Can you try the steps dwilliam62 mentioned in these two threads I found?
First, try restarting the active controller to force a failover, and see if the passive controller can pick up the drives. The next step is to remove all the components (controllers and drives) from the backplane they sit in, make sure they are fully reseated, and power the array back up again.

Reference links:

https://dell.to/3S3k1Vk

https://dell.to/3S3k2sm

Hope that helps!

adres98 · Answer

Hi Erman Ozkurt,

Thanks for the quick answer.

These are the outputs from the commands:

CLI> support exec "uname -a"

You are running a support command, which is normally restricted to PS Series
Technical Support personnel. Do not use a support command without instruction
from Technical Support.
NetBSD 5.0_STABLE NetBSD 5.0_STABLE (EQL.PSS) #0: Thu Sep 1 08:03:53 EDT 2016 build@m64:/buildarea/V8.1.6__Thu_Sep_01_2016_07_43_26_EDT/bin/destdir.sbmips.64.release/EQL.PSS.64 sbmips

CLI>support exec "diskview -j"

You are running a support command, which is normally restricted to PS Series
Technical Support personnel. Do not use a support command without instruction
from Technical Support.
Enc/Drive  State       Write   Read    Power   Drive     Bad     ForceWrite  Reset  Read     Scan    Max       Max
                       Retrys  Retrys  Cycles  Timeouts  Blocks  Retrys      Fail   Timeout  Errors  Cominits  HrstMsecs
________________________________________________________________________________________________________________________
0/ 0       Online      0       0       0       0         0       0           0      0        0       0         0
0/ 1       Online      0       0       0       0         0       0           0      0        0       0         0
0/ 2       Online      0       0       0       0         0       0           0      0        0       0         0
0/ 3       Online      0       0       0       0         0       0           0      0        0       0         0
0/ 4       Online      0       0       0       0         0       0           0      0        0       0         0
0/ 5       Online      0       0       0       0         0       0           0      0        0       0         0
0/ 6       Online      0       0       0       0         0       0           0      0        0       0         0
0/ 7       Online      0       0       0       0         0       0           0      0        0       0         0
0/ 8       Online      0       0       0       0         0       0           0      0        0       0         0
0/ 9       Online      0       0       0       0         0       0           0      0        0       0         0
0/10       Bad/Failed  0       0       0       0         0       0           0      0        0       0         0
0/11       Slot Empty  0       0       0       0         0       0           0      0        0       0         0
CLI>
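
For anyone triaging similar output, here is a minimal Python sketch that pulls the slot and state out of rows like the ones above and lists the drives that are not "Online". It is only an illustration that assumes the row layout shown here (enclosure/slot, state, then eleven counters); the names `parse_rows` and `not_online` are made up for this example and are not part of any Dell tool.

```python
import re

# Rows copied from the diskview output above.
DISKVIEW_ROWS = """\
0/ 0 Online 0 0 0 0 0 0 0 0 0 0 0
0/ 1 Online 0 0 0 0 0 0 0 0 0 0 0
0/ 2 Online 0 0 0 0 0 0 0 0 0 0 0
0/ 3 Online 0 0 0 0 0 0 0 0 0 0 0
0/ 4 Online 0 0 0 0 0 0 0 0 0 0 0
0/ 5 Online 0 0 0 0 0 0 0 0 0 0 0
0/ 6 Online 0 0 0 0 0 0 0 0 0 0 0
0/ 7 Online 0 0 0 0 0 0 0 0 0 0 0
0/ 8 Online 0 0 0 0 0 0 0 0 0 0 0
0/ 9 Online 0 0 0 0 0 0 0 0 0 0 0
0/10 Bad/Failed 0 0 0 0 0 0 0 0 0 0 0
0/11 Slot Empty 0 0 0 0 0 0 0 0 0 0 0
"""

# Assumed layout: "<enclosure>/<slot> <state> <11 integer counters>".
# The state may contain a space ("Slot Empty"), so it is matched lazily.
ROW = re.compile(r"^(\d+)/\s*(\d+)\s+(.+?)((?:\s+\d+){11})\s*$")


def parse_rows(text):
    """Return a list of (enclosure, slot, state, counters) tuples."""
    drives = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            counters = [int(n) for n in match.group(4).split()]
            drives.append(
                (int(match.group(1)), int(match.group(2)), match.group(3), counters)
            )
    return drives


def not_online(drives):
    """Return (enclosure, slot, state) for every drive whose state is not Online."""
    return [(enc, slot, state) for enc, slot, state, _ in drives if state != "Online"]


if __name__ == "__main__":
    for enc, slot, state in not_online(parse_rows(DISKVIEW_ROWS)):
        print(f"enclosure {enc} slot {slot}: {state}")
```

With the rows above, this reports slot 10 as Bad/Failed and slot 11 as Slot Empty; all counters are zero for every drive.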

Given these results, what do you recommend I do?

Regards,

Thanks

adres98 · Answer

P.S. This is the result from the other command:

CLI> su exec raidtool -Z
You are running a support command, which is normally restricted to PS Series
Technical Support personnel. Do not use a support command without instruction
from Technical Support.
Active RAID LUNs: 0
Driver Status = *ADMIN INTERVENTION REQUIRED*.
Malloc Bytes = 0KB
Outstanding Active I/O's = 0
Pending I/O's = 0
Pending Resource Reqs = 0
Outstanding StripeLocks = 0
Allocated Sectors = 0

Device = 000
status = 008
outio = 00000000
drives = 09
disk luns: 7 5 3 1 8 6 4 2 0
disk count:10
disk lun= 0 status=0x00000400 drive active device=0
disk lun= 1 status=0x00000400 drive active device=0
disk lun= 2 status=0x00000400 drive active device=0
disk lun= 3 status=0x00000400 drive active device=0
disk lun= 4 status=0x00000400 drive active device=0
disk lun= 5 status=0x00000400 drive active device=0
disk lun= 6 status=0x00000400 drive active device=0
disk lun= 7 status=0x00000400 drive active device=0
disk lun= 8 status=0x00000400 drive active device=0
disk lun= 9 status=0x00001000 hot spare no-device
CLI>
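
As a cross-check of the listing above, the following Python sketch tallies the "disk lun=" lines by the description printed on each line. It is only an illustration: the regular expression assumes the line layout shown here, the function name `parse_luns` is hypothetical, and the descriptions ("drive active", "hot spare no-device") are taken verbatim from the output rather than decoded from the status bits.

```python
import re

# "disk lun=" lines copied from the raidtool -Z output above.
RAIDTOOL_LINES = """\
disk lun= 0 status=0x00000400 drive active device=0
disk lun= 1 status=0x00000400 drive active device=0
disk lun= 2 status=0x00000400 drive active device=0
disk lun= 3 status=0x00000400 drive active device=0
disk lun= 4 status=0x00000400 drive active device=0
disk lun= 5 status=0x00000400 drive active device=0
disk lun= 6 status=0x00000400 drive active device=0
disk lun= 7 status=0x00000400 drive active device=0
disk lun= 8 status=0x00000400 drive active device=0
disk lun= 9 status=0x00001000 hot spare no-device
"""

# Assumed layout: "disk lun= <n> status=<hex> <free-text description>".
LINE = re.compile(r"^disk lun=\s*(\d+)\s+status=(0x[0-9a-fA-F]+)\s+(.*)$")


def parse_luns(text):
    """Map each disk LUN number to (status_value, description)."""
    luns = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            luns[int(match.group(1))] = (int(match.group(2), 16), match.group(3))
    return luns


luns = parse_luns(RAIDTOOL_LINES)
active = sorted(n for n, (_, desc) in luns.items() if desc.startswith("drive active"))
spares = sorted(n for n, (_, desc) in luns.items() if desc.startswith("hot spare"))

if __name__ == "__main__":
    print(f"{len(luns)} disk LUNs listed")
    print(f"active: {active}")
    print(f"hot spare: {spares}")
```

The tally gives nine active disk LUNs (0 through 8) and one hot-spare entry with no device (LUN 9), which lines up with "drives = 09" and "disk count:10" in the same output.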

DELL-Erman O · Answer

Thanks for the outputs. I suspect the controller can't read the RAIDset for some reason. In the disk view, one drive appears to have failed and one slot is empty. Could you do a power down and flea power drain?

dwilliam62 · Answer

Hello,

This RAIDset is in a lost cache condition. The firmware is also well out of date.

When you log back in you can try clearing the cache. CLI>clearlostdata

This will disregard the lost cache and allow the array to boot. However, whatever data was lost is gone. That could include data for the internal databases the array uses, which could result in data loss.

If you have verified that each controller has taken the active role after the restarts, then the command to clear the cache is your last hope. If not, power down the array and boot each controller individually. For example, if CM0 was active, after powering down, pull it out slightly from the backplane and power up on CM1.

If the array boots, then plug CM0 back in.

If not, then clearlostdata is your best chance. Once the array has booted up, run filesystem checks on all your servers and VMs.

Regards,

Don

#Iwork4Dell

adres98 · Answer

Hello,

Yesterday I tried swapping between CM0 and CM1 as indicated, but the array only works with CM0; CM1 connects but none of its lights turn on. We also connected over serial to watch the boot sequence and see if it threw any other errors, and this is the result:

System will be rebooted in about 5 seconds...
[PWRDN]
[ENTRY][1][EQL][WDOG][WRESET]
[ENTRY][1][EQL][WDOG][210][WFP][MINIT][MSIZE][MLEV0][MLEV1][MLEV2][MLEV3][SCRUB][DDRP][BOOT2][PROC][BOARD][SMP][PCIE][FLASH][NAE][USB]

#########################################################
# #
# Dell (tm) Inc. Storage Array #
# Copyright 2001-2014 #
# #
# Controller Information: #
# Part=70-0485 Rev=A02 SN= ** ECO=C00 #
# #
#########################################################
Bootloader Version 4.4.6 (SWINT Rev:4)
Compiled on Mon Jul 28 13:57:41 EDT 2014
(type h for help)
Enter Ctrl-P for boot prompt

Executing bootcmd0 [dload sd primary/eqlstor.gz]
0X80231000/11912 0X85f08100/40043276 entrypt 80231000

Executing bootcmd1 [run]
cpu_online_map=ffff, userapp_cpu_map ffff
psb_os_active_mask=0, psb_os_mask=0
boot1_info: userapp_cpu_map=ffff, psb_os_cpu_map=0
cpu_online_map = 0xffff
Jumping to the application... 0x80231000
------------------------------------------------------------
Preparing ffff bitmask of cpus to run
No network device to cleanup
count = 16, total = 16
All slave cpus (16) ack'ed userapp init
count = 4, total = 4
All slave cpus (4) ack'ed message ring init
============ cpu_0 ==============
func = 0x80231000, args = 0x0
sp = 0xffffffff8f24dfe0, gp = 0xffffffff8f24c000
master_cpu = 0, master_mask = 00000001, buddy_mask = 0000ffff
psb_os_cpu_map = 00000000, mode = 1, kseg_master = 0
app_shared_mem: addr = 0000000000000000, size = 0000000000000000, orig = 0000000000000000

Dell Inc. Storage Array

sbs_early_init: Battery in Ship Mode at array restart - charge to full

plat_pfgxlp_wait_for_battery: Smart Battery Remaining Capacity = 678 mAh after 0-second delay
SP:1.00:cache_driver.cc:1056:WARNING:28.3.17:Active control module cache is now in write-through mode. Array performance is degraded.
SP:11.16:mips_pss_init.c:363:INFO:28.2.107:Control module in slot 0 with serial number CN-04V1MG-77921-56G-006W is designated as active.
MFS set up
Building databases...
Restoring password file...
SP:1663789922.70:emm.c:1333:INFO:28.2.6:Enclosure serial number: CN-01N9TR-70821-53K-11GP-A01.
SP:1663789923.05:cache_driver.cc:1058:INFO:28.2.39:Active control module cache set to write-back mode.
SP:1663789923.06:emm.c:2363:ERROR:28.4.47:Critical health conditions exist.
Correct immediately before they affect array operation.
The service life of the control module battery is depleted.
There are 1 outstanding health conditions. Correct these conditions before they affect array operation.
Wed Sep 21 15:52:10 EDT 2022
Sep 21 15:52:10 init: kernel security level changed from 0 to 1

PS Series Storage Arrays
Unauthorized Access Prohibited
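
The "SP:" lines in the boot log above are easier to scan once they are split into fields. The Python sketch below is one way to do that, under some assumptions that are not confirmed anywhere in this thread: that the fields are timestamp, source file, source line, severity, event code and message, and that the large timestamps are Unix epoch seconds while the small ones (1.00, 11.16) are seconds counted before the clock was set. The function name `parse_entries` is hypothetical.

```python
import re
from datetime import datetime, timezone

# "SP:" lines copied from the serial console output above.
LOG = """\
SP:1.00:cache_driver.cc:1056:WARNING:28.3.17:Active control module cache is now in write-through mode. Array performance is degraded.
SP:11.16:mips_pss_init.c:363:INFO:28.2.107:Control module in slot 0 with serial number CN-04V1MG-77921-56G-006W is designated as active.
SP:1663789922.70:emm.c:1333:INFO:28.2.6:Enclosure serial number: CN-01N9TR-70821-53K-11GP-A01.
SP:1663789923.05:cache_driver.cc:1058:INFO:28.2.39:Active control module cache set to write-back mode.
SP:1663789923.06:emm.c:2363:ERROR:28.4.47:Critical health conditions exist.
"""

# Assumed field order: SP:<timestamp>:<file>:<line>:<LEVEL>:<event code>:<message>
ENTRY = re.compile(r"^SP:(\d+\.\d+):([^:]+):(\d+):([A-Z]+):([\d.]+):(.*)$")


def parse_entries(text):
    """Split each SP line into a dict; 'when' is set only for epoch-like timestamps."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if not match:
            continue
        raw_ts = float(match.group(1))
        # Assumption: values above 1e9 are Unix epoch seconds; smaller values
        # are treated as pre-clock counters and left unconverted.
        when = (
            datetime.fromtimestamp(int(raw_ts), tz=timezone.utc)
            if raw_ts > 1e9
            else None
        )
        entries.append(
            {
                "raw_ts": raw_ts,
                "when": when,
                "source": f"{match.group(2)}:{match.group(3)}",
                "level": match.group(4),
                "code": match.group(5),
                "message": match.group(6),
            }
        )
    return entries


entries = parse_entries(LOG)

if __name__ == "__main__":
    for entry in entries:
        if entry["level"] in ("WARNING", "ERROR"):
            stamp = entry["when"].isoformat() if entry["when"] else f"t+{entry['raw_ts']}"
            print(f"{stamp} {entry['level']} {entry['code']} {entry['message']}")
```

Under the epoch assumption, 1663789922 converts to 2022-09-21 19:52:02 UTC, which is 15:52:02 EDT and agrees with the "Wed Sep 21 15:52:10 EDT 2022" line printed a few seconds later in the same log. The only ERROR entry is event 28.4.47, the critical health condition about the depleted control module battery.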

My question is whether it is possible to know what information will be lost if the "clearlostdata" command is executed. Also, each disk bay has two LED indicators: only two disks have both LEDs lit, and the others have just the top LED lit.

Is there any other procedure recommended before running the clearlostdata command?

Thanks

adres98 · Answer

In short, is the battery the reason the array is not completing startup?

tapelibraryfixer · Answer

'The service life of the control module battery is depleted.'

adres98 · Answer

Is there a way to rebuild the data from the snapshots of the volumes? Please, I would be really thankful for a hand.

adres98 · Answer

OK, thanks. Just so I know: I saw generic batteries online. Is there a way to use them as a temporary replacement until the originals arrive? There are no replacement parts in my local market. Thanks bro

tapelibraryfixer · Answer

Without the battery in a working state, data would be lost if power failed during normal operation.

So yes, the battery needs to work for the array to finish booting properly.

The controller LEDs should be green, green and green, orange.

dwilliam62 · Answer

Hello,

Something isn't quite making sense here. For sure at least one of the controllers has a bad battery. But one of the controllers was able to boot up all the way otherwise you could not have run raidtool. If both CMs have bad batteries on 4210/6210/6610s the array will constantly cycle between controllers. Is that what is happening with your array?

Re: Cache. No, there is no way to know what data will be lost if you run clearlostdata.

Re: Snapshots. A snapshot can only recover data that has already been written to disk. Lost cache data never made it to disk.

Re: Batteries. It's a module with batteries and a control board. I don't know of any "generic" batteries you can use. They would have to be a 100% replacement part.

When the batteries near their end of life, the array starts reporting this about 90 days or more in advance.

If you try booting with just one controller installed, what happens? If that CM fails to boot up to a login prompt, try the other one. If they both fail to boot to a login prompt, then yes, very likely both batteries are bad and will have to be replaced. You would then still likely need to run clearlostdata to get back online.

Regards,

Don

#iworkfordell

adres98 · Answer

Hello,

The array was running with just one CM (CM0); the other one was out of the chassis because it didn't turn on. However, we have just reinserted it and I managed to get it up. I am attaching a picture so you can see it.

If we run the "clearlostdata" command, could 100% of the information be lost? We need to make a decision, since the arrays have been offline for 20 days.

WhatsApp Image 2022-09-22 at 1.47.28 PM.jpeg

The first array is failing...

WhatsApp Image 2022-09-22 at 2.08.31 PM.jpeg

Thank you

dwilliam62 · Answer

Hello,

You will not lose 100% of the information stored on the array. You will only lose the data that was still in cache and had not yet been written to disk, which will be a very small amount of actual data.

Regards,

Don

#iworkfordell

adres98 · Answer

OK, we are going to run the command. Just in case, is there any estimate of how long the array will take to finish? Thanks
