2 of 3 SDSs unexpectedly disconnected
Hello everybody, I need your help.
We have a volume spread across 3 SDSs with this configuration:
sds1: 9 TB (1+4+4)
sds2: 5 TB (1+4)
sds3: 5 TB (1+4)
Spare Percentage = 35
Today we unexpectedly lost sds2 and sds3. The volume is degraded and the data has become unavailable.
Some hours later we created a new SDS2 and attached the HDDs of the lost SDS to it. We added it to the volume via the GUI without test and clean (the 'activate without test' checkbox).
Later still, we did the same with sds3.
Now the volume looks like this:
sds1: 9 TB (1+4+4)
sds2: 5 TB (1+4) *lost
sds2: 5 TB (1+4) *added
sds3: 5 TB (1+4) *lost
sds3: 5 TB (1+4) *added
ScaleIO is now running an intensive rebuild, but the data is still unavailable: every request to the disks ends with an I/O error.
Is it possible to restore the data in this situation? What should we do next? What will happen when the rebuild completes?
ScaleIO version 1.32.3455.5
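The capacity figures in the post above can be checked with a short calculation. This is only a rough sketch, assuming the commonly cited ScaleIO guidance that spare capacity should be at least as large as the biggest single SDS (or fault set). The function name `spare_report` and the dictionary keys are made up for illustration and are not part of any ScaleIO tool.

```python
import math

# Raw capacity per SDS in TB, as listed in the post (1+4+4, 1+4, 1+4).
SDS_CAPACITY_TB = {"sds1": 9, "sds2": 5, "sds3": 5}
SPARE_PERCENTAGE = 35  # value configured in the post


def spare_report(capacities, spare_pct):
    """Compare the reserved spare capacity with the largest single SDS.

    Assumption: spare capacity should cover the largest SDS so that a
    rebuild after losing that one node has somewhere to write.
    """
    total = sum(capacities.values())
    spare = total * spare_pct / 100
    largest = max(capacities.values())
    return {
        "total_tb": total,
        "spare_tb": spare,
        "largest_sds_tb": largest,
        "covers_largest": spare >= largest,
        # Smallest whole percentage whose spare capacity covers the largest SDS.
        "min_pct_for_largest": math.ceil(100 * largest / total),
    }


if __name__ == "__main__":
    for key, value in spare_report(SDS_CAPACITY_TB, SPARE_PERCENTAGE).items():
        print(f"{key}: {value}")
```

With the numbers from the post, 35% of 19 TB is 6.65 TB, which is less than the 9 TB held by sds1. Note that spare capacity only helps a rebuild after a single failure: ScaleIO keeps two copies of each piece of data, so losing two SDSs at once can take both copies of some data offline regardless of the spare setting.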
carterbury
12 Posts
0
June 23rd, 2017 08:00
Did you confirm network connectivity between your MDM and SDS 2 and 3? When I have lost an SDS it is usually network related or a reboot.
Did you check if the SDS process is still running on those servers? What OS are these nodes built on? I have had the SDS process crash when using Flash Cache on the latest release. I have had to stop using it, otherwise my SDS process fails.
What do you see in the logs? If using Linux, as I would recommend, check /opt/emc/scaleio/sds/logs and /opt/emc/scaleio/mdm/logs.
Check trc.0. Also check your regular /var/log/syslog for any OS issues.
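The connectivity check suggested above could be scripted roughly as follows. This is a minimal sketch, not an official ScaleIO procedure: the helper names `tcp_reachable` and `check_all` are hypothetical, and the port number has to be supplied by the caller. The SDS listening port is commonly 7072 by default, but treat that as an assumption and verify it against your own installation.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_all(targets, port, timeout=3.0):
    """Check every named host in `targets` (name -> IP) on the given port."""
    return {name: tcp_reachable(ip, port, timeout) for name, ip in targets.items()}
```

Run from the MDM host, something like `check_all({"sds2": "<sds2 data IP>", "sds3": "<sds3 data IP>"}, 7072)` returns a dictionary of name to True/False. A False result only says the port did not answer; it does not distinguish a network problem from a crashed SDS process, so the logs still need to be checked.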
pawelw1
306 Posts
0
June 23rd, 2017 09:00
Hi,
Instead of simply creating new SDSs and attaching disks from the failed ones, please try to follow this procedure:
https://support.emc.com/kb/201817
Basically, it should allow you to introduce the new SDSs as the existing ones and help with data recovery.
Cheers,
Pawel
glhf
6 Posts
0
June 25th, 2017 19:00
Hi, I can't follow this link: Salesforce says "we can't log you in". Can you give me another link or another way to get this article? Thank you.
pawelw1
306 Posts
0
June 26th, 2017 03:00
Please try to register with Salesforce. I'm not sure whether you need a support contract number to view the KBs, though.
glhf
6 Posts
0
June 27th, 2017 05:00
The network works fine. The loss of sds2 and sds3 was caused by a physical failure of the SDS server. The server was reinstalled with a fresh OS and ScaleIO software (SDS), both at the same versions as before. The OS is Debian Linux (before and after). The new SDSs, with the old disks and newly assigned IP addresses, were added to the MDM.
The old SDSs are lost irretrievably.
In the logs on the SDS that is combined with the MDM (/opt/emc/scaleio/sds/logs/trc.3):
23/06 09:14:59.934273 netSocket_CloseIfNotActive:00663: pSock 0x7ff69c00c850 socket(44) ownerType(CON) state(CONNECTED) type(SNDRCV) pollState(0x4)::Socket receive is Inactive, will close. (msgSend 2, msgRecv 0, bHasMsgToSend 0, bMemPending 0, rcvNotActiveCount 1)bOldClusterVer 0, bOldVer 0
23/06 09:14:59.934295 netSocket_SetState:01892: pSock 0x7ff69c00c850 socket(44) ownerType(CON) state(CONNECTED) type(SNDRCV) pollState(0x4)::Changing state to 4 (debug) 3
23/06 09:14:59.934301 netSocket_StartClosing:00536: pSock 0x7ff69c00c850 socket(44) ownerType(CON) state(CLOSING) type(SNDRCV) pollState(0x14)::Closing
23/06 09:14:59.934316 netSocket_Close:00137: pSock 0x7ff69c00c850 socket(44) ownerType(CON) state(CLOSING) type(SNDRCV) pollState(0x14)::Closing
23/06 09:14:59.934321 netSocket_SetState:01892: pSock 0x7ff69c00c850 socket(44) ownerType(CON) state(CLOSING) type(SNDRCV) pollState(0x14)::Changing state to 5 (debug) 4
23/06 09:14:59.934352 netSocket_Close:00153: pSock 0x7ff69c00c850 socket(44) ownerType(CON) state(CLOSED) type(SNDRCV) pollState(0x14)::Closed
23/06 09:14:59.934359 netCon_SocketClosedFromOs:00902: Socket closed - SERVER (Disconnect) in con 2beecb6900000004 #1, #1
23/06 09:14:59.934364 netCon_SocketClosedFromOs:00920: Con 2beecb6900000004 state changed CONNECTED => DISCONNECTED
23/06 09:14:59.934368 netSocket_Destroy:00246: pSock 0x7ff69c00c850 socket(4294967295) ownerType(CON) state(CLOSED) type(SNDRCV) pollState(0x14)::Destroyed
23/06 09:14:59.934374 contNet_DisconnectedNotif:00864: Con 2beecb6900000004 disconnected
23/06 09:14:59.934379 contTgt_DisconnectedNotif:00315: Tgt disconnect, tgt 2beecb6900000004 hCon 33b00e
23/06 09:14:59.934382 netCon_SocketClosedFromOs:01035: Con closed 2beecb6900000004 - by server (1)
23/06 09:14:59.934387 netConManager_FreeCon:00288: Con aborted 2beecb6900000004
23/06 09:14:59.934390 netConManager_FreeConDone:00254: ********** => Con deleted 2beecb6900000004
23/06 09:15:00.472653 raidComb_WriteAsync:03287: Write to comb retry 270280010031 (Lba 6170104 24), numOfSendRetry 0, 10 ms
23/06 09:15:01.578429 raidComb_WriteAsync:03287: Write to comb retry 270280010031 (Lba 6170104 24), numOfSendRetry 1, 1110 ms
23/06 09:15:01.813280 raidComb_WriteAsync:03287: Write to comb retry 2702800102df (Lba 6860896 8), numOfSendRetry 0, 20 ms
23/06 09:15:01.821675 raidComb_WriteAsync:03287: Write to comb retry 2702800102df (Lba 6860904 8), numOfSendRetry 0, 20 ms
23/06 09:15:01.883536 raidComb_WriteAsync:03287: Write to comb retry 27028001001a (Lba 6838416 8), numOfSendRetry 0, 20 ms
23/06 09:15:01.918471 raidComb_WriteAsync:03287: Write to comb retry 27028001026f (Lba 6156448 16), numOfSendRetry 0, 10 ms
23/06 09:15:02.177105 raidComb_WriteAsync:03287: Write to comb retry 270280010301 (Lba 241328 280), numOfSendRetry 0, 10 ms
23/06 09:15:02.481302 contCmd_Print:00108: Set Comb state - Comb Id 270380018004 - Start
23/06 09:15:02.481319 raidComb_SetCombState:00663: Set comb raid state 270380018004, raid state 1 => 64 (dit)
23/06 09:15:02.481406 contCmd_Print:00108: Set Comb state - Comb Id 27038001800c - Start
23/06 09:15:02.481426 raidComb_SetCombState:00663: Set comb raid state 27038001800c, raid state 1 => 64 (dit)
23/06 09:15:02.481434 contCmd_Print:00108: Set Comb state - Comb Id 27038001800f - Start
23/06 09:15:02.481452 raidComb_SetCombState:00663: Set comb raid state 27038001800f, raid state 1 => 64 (dit)
23/06 09:15:02.481461 contCmd_Print:00108: Set Comb state - Comb Id 270380018010 - Start
The logs on the second MDM show something similar.
In short: two SDSs unexpectedly went offline, but the disks holding the data are not corrupted.
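A trace excerpt like the one above is easier to read once it is condensed. The sketch below is an assumption-laden helper, not a ScaleIO tool: the line layout (date, time, `function:line:`, message) and the three message patterns are inferred purely from the lines quoted in this post, and the names `summarize`, `LINE_RE` and so on are made up for illustration. It makes no attempt to interpret what a raid state value such as 64 means.

```python
import re
from collections import Counter

# Layout inferred from the excerpt: "DD/MM HH:MM:SS.ffffff func:line: message"
LINE_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) (\w+):(\d+): (.*)$")
DISCONNECT_RE = re.compile(r"Con (\w+) state changed CONNECTED => DISCONNECTED")
RETRY_RE = re.compile(r"Write to comb retry (\w+)")
STATE_RE = re.compile(r"Set comb raid state (\w+), raid state (\d+) => (\d+)")


def summarize(lines):
    """Collect disconnects, write retries per comb, and comb state changes."""
    disconnects = []        # (timestamp, connection id)
    retries = Counter()     # comb id -> number of retry messages
    state_changes = []      # (timestamp, comb id, old state, new state)
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip anything that does not look like a trace line
        timestamp, _func, _lineno, message = match.groups()
        found = DISCONNECT_RE.search(message)
        if found:
            disconnects.append((timestamp, found.group(1)))
            continue
        found = RETRY_RE.search(message)
        if found:
            retries[found.group(1)] += 1
            continue
        found = STATE_RE.search(message)
        if found:
            comb, old, new = found.groups()
            state_changes.append((timestamp, comb, int(old), int(new)))
    return {
        "disconnects": disconnects,
        "write_retries": dict(retries),
        "state_changes": state_changes,
    }


# A few lines copied from the excerpt in this post.
SAMPLE = """\
23/06 09:14:59.934364 netCon_SocketClosedFromOs:00920: Con 2beecb6900000004 state changed CONNECTED => DISCONNECTED
23/06 09:15:00.472653 raidComb_WriteAsync:03287: Write to comb retry 270280010031 (Lba 6170104 24), numOfSendRetry 0, 10 ms
23/06 09:15:01.578429 raidComb_WriteAsync:03287: Write to comb retry 270280010031 (Lba 6170104 24), numOfSendRetry 1, 1110 ms
23/06 09:15:01.813280 raidComb_WriteAsync:03287: Write to comb retry 2702800102df (Lba 6860896 8), numOfSendRetry 0, 20 ms
23/06 09:15:02.481319 raidComb_SetCombState:00663: Set comb raid state 270380018004, raid state 1 => 64 (dit)
"""

if __name__ == "__main__":
    print(summarize(SAMPLE.splitlines()))
```

To run it against a real file, pass the open file object instead of the sample, for example `summarize(open("/opt/emc/scaleio/sds/logs/trc.3"))`. Lining up the disconnect timestamps from each node with the first write retries gives a quick view of the order in which things failed.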
pawelw1
306 Posts
0
June 27th, 2017 07:00
Hi,
Did you manage to register at support.emc.com to get access to the KB? PM me if you can't.
Thanks,
Pawel
glhf
6 Posts
0
June 27th, 2017 08:00
Pawel, I registered on Salesforce, but I still have no access to that article, for the same reason. I registered on support.emc.com too.
PS: I can't PM you because we have no connections.