How to replace VNX Unified Storage SP

Published OCT 02, 2024

This Dell EMC Customer Replaceable Unit (CRU) video demonstrates how to replace a Storage Processor (SP) in a VNX system.

So, in lab 7 part 3, we’re going to generate a procedure to remove and replace a CPU module in a VNX5700 or VNX7500 storage processor enclosure (SPE). So, we’re going to bring up the procedure for that. Again, at the VNX Procedure Generator, I’m going to select the 5700 and then “Next.” This is a File/Unified hardware replacement. Again, “Next.” In this case, I’m going to refine to the VNX5700 File/Unified SPE hardware, and I’m going to generate the procedure by selecting the radio button for “Replace SPE CPU module.”

And I’m going to go ahead and select “Next” and generate the procedure for that. Once you have the procedure, we’re again going to read all the upfront documentation on ESD and any known issues, and we’re going to start right in by trying to diagnose and identify the FRU or CRU that’s at fault. So, this is my procedure.

I’m going to go ahead and close that out. Make sure I have the correct one. Yeah. Replacing a CPU module. So, we’re gonna close out of that. And I’ve launched Unisphere. And I’m going to come up and I’m gonna look at the system and I’m gonna look at the storage hardware.

So, I select the VNX storage system, I select the storage tab, and then select the storage hardware, instead of storage for file. And again, I want to look for any fault conditions that may be present. You look for that in the tree. So, if you saw a red F, for example, it would propagate from a component on up the tree, and that’s where we can validate that something has failed. We’re going to assume that something has failed here, in this case a storage processor.

So, once we have determined that, we can go ahead and, as an option, run the Verify Storage System wizard to perform a health check. I could do that by coming over here, launching USM, and going through the procedure similar to what we did in part 2 of launching USM and collecting a verification of the system. So, I’m going to assume that has been done.

The next thing we might want to do, and it is recommended, is prepare the storage processor for the replacement procedure by trespassing the LUNs on the SP that we want to replace. So, for example, in a system with two SPs, A and B, where SP B needs to be replaced, we would want to trespass the LUNs that are currently owned by SP B. To do that, we would go to Storage and then to LUNs. There’s more than one way to do this; I’m just showing you one way here.

The current owner is shown here. So, again, I’m replacing SP B, so I would go ahead and select a LUN it owns, right-click, and trespass that LUN. And I’m going to go ahead and do that. Unlike other operations in Unisphere, it seemed you can’t select a lot of them for trespassing; you have to trespass one at a time, at least from Unisphere. So, I’m gonna go ahead and do all these LUNs. Actually, I found that if we hold the Ctrl key, we can select several of these. It looks like it depends on how many you select, but in this case, with three selected, I’m gonna try to trespass them all, and it looks like it’s gonna work. So, the idea is that we need to trespass the LUNs.
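The selection step above, picking out every LUN whose current owner is the SP about to be replaced, can be sketched in a few lines of Python. This is only an illustration of the logic: the `Lun` record, its field names, and the sample inventory are hypothetical and are not taken from Unisphere or from the lab system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lun:
    """Hypothetical record for one row of the LUN table."""
    lun_id: int
    name: str
    current_owner: str  # "SP A" or "SP B"


def luns_to_trespass(luns, sp_to_replace):
    """Return the LUNs currently owned by the SP that is about to be replaced."""
    return [lun for lun in luns if lun.current_owner == sp_to_replace]


# Made-up inventory for illustration only.
inventory = [
    Lun(0, "LUN 0", "SP A"),
    Lun(1, "LUN 1", "SP B"),
    Lun(2, "LUN 2", "SP B"),
    Lun(3, "LUN 3", "SP A"),
]

pending = luns_to_trespass(inventory, "SP B")
print([lun.lun_id for lun in pending])
```

Once every LUN in the returned list has been trespassed, the same call should return an empty list, which matches the check of the current owner column described here.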

So now I see that everything that was owned by SP B, at least on this system, has moved off it, so I’ve trespassed the LUNs. Next, I can go ahead and establish a serial connection using the same parameters we’ve been using all along, which are listed for you in the guide, and then I’m going to log on to Control Station 0 and become the superuser, or root user. We will again stop our ConnectHome and email services.

And verify that. And now we do need to power down the storage processor. So, again, we’re gonna use a navicli command to do this, and we’ve already trespassed the LUNs. Again, if you don’t have a security file, you would have to either create one or use the full credentials on the command line. So, here is an example of the command, which is fairly lengthy. You’ll find that sometimes in a PuTTY session you don’t have enough wraparound to issue this command, so I might have to expand that window out a bit.

But at any rate, we address the SP that stays up, in this case SP A, and we’re gonna shut down the peer SP, which is SP B. And remember, we already trespassed the LUNs, so I won’t do that again. Oops! Oh, I’m gonna try it one more time. You have to know how to type in this business, apparently. So, do you want to shut down and hold the SP now? Yes, I do. And we’re gonna perform that.

While that one is going down, I can use a ping command to verify that it has actually gone down. If I ping that IP address, it should come back as not available. I’m gonna go to 153, which is SP B, and I can see that it has no response, so I Ctrl-C out of that. So, at this point, I’ve trespassed the LUNs and I’ve halted the storage processor, and you can go ahead and locate the documentation to perform the removal and replacement of the components per the procedure guide.
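The ping check described above can also be scripted. The following is a minimal sketch, assuming it runs on a Linux host whose `ping` accepts `-c` (count) and `-W` (reply timeout in seconds); the function name and the address in the usage comment are hypothetical placeholders, not values from the lab.

```python
import subprocess


def is_reachable(host, run=subprocess.run):
    """Send one ICMP echo request and report whether the host answered.

    Assumes a Linux-style ping: -c 1 sends a single request and
    -W 2 waits at most two seconds for the reply.
    """
    result = run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # ping exits with status 0 only when a reply was received.
    return result.returncode == 0


# Example usage (placeholder address, substitute the SP's management IP):
#   if not is_reachable("192.0.2.153"):
#       print("SP does not respond; it appears to be down")
```

The `run` parameter exists only so the function can be exercised without a network; in normal use it is left at its default.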

So, remember, when you replace that particular CPU module, you’re gonna have to move the DIMMs over, change all that out, put it all back in, and cable it back up the same way you had it before. After that’s done, we’ll be ready to continue.

After the SP or X-Blade has been prepared for removal, begin by removing the two power supply cooling modules in front of the faulty SP or X-Blade.

Then, squeeze the two orange tabs labeled with a black number two towards each other to unlock the latches. Push the latches away from each other to fully release the module from the enclosure. Place the module onto an anti-static surface and transfer the memory modules into the new SP or X-Blade.

To reinstall the SP or X-Blade, align the module with the guides on the side of the enclosure. Slide it into the enclosure until the latches start to move inwards. Push the latches toward each other to fully seat and lock the module into place. To reinstall the power supplies, align each module with its slot and push it into the enclosure. Raise the black latch to secure the module into place.

You should hear an audible click when the latch is properly engaged.

So, if we go to EMC Unisphere, as you see here, and look at the storage hardware, we can see the “F” that we talked about earlier, where it says SP B is unmanageable, or removed in this case. And that again is propagated up the tree. So, after you’ve performed the replacement and changed all the components, you’re going to put the module back in, and then you’re going to have to power up the storage processor.

And again, consult your guide, because this step is release dependent. This is how it is issued with 5.31, but if you have something prior to that, make sure you consult the guide as to how you go about it. We’re going to issue the reboot peer SP command, so I’ve typed that in here, and we’re going to issue it. At this point, we’re going to wait for the storage processor to power up; it is powered up when the fault LED is not lit and the power LED is green.

So, again, as with the part 2 replacement for the I/O module, this can take, you know, up to ten minutes or so, so we’re gonna let this power up. You can then validate on the front LEDs, or you could go to the storage hardware section and make sure the “F” has gone out and the storage processor is manageable again. So, just as a note while waiting for the SP to come back up, don’t forget you have SP event logs and you can look at these logs, for example, looking at A here, and these can give you clues as to what’s going on in the system as well. A lot of these are informational. And we look for different ones.

There’s a critical one here, where caching was disabled. It gives you the time and date, so we can take a look at that. Now, we probably can’t read this until the SP has come back up to the point where we can communicate with it, and it hasn’t at this point.

And, of course, there are always the alerts that come up here; you see several, and you can always click on them to get details as well. So, we’re still waiting for the system to come back up. We can monitor it by looking, again, at the storage hardware. Once this gets cleared, we should be all set to go. It looks like it’s cleared now and the SP is finally present.
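The waiting described above, checking repeatedly until the SP is back or a time limit passes, follows a generic polling pattern. Below is a minimal sketch of that pattern; the function name, the 15-second interval, and the stand-in check are assumptions made for illustration, and the 600-second default simply mirrors the "up to ten minutes or so" mentioned earlier.

```python
import time


def wait_until(check, timeout_s=600.0, interval_s=15.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns True or timeout_s elapses.

    Returns True as soon as check() succeeds, or False once the deadline
    has passed without a successful check.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Demonstration with a stand-in check that succeeds on the third poll.
attempts = iter([False, False, True])
print(wait_until(lambda: next(attempts), interval_s=0))
```

In practice `check` would be whatever test indicates the SP is manageable again, for example a reachability test; the LED state and the Unisphere view still need to be confirmed by eye.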

So, the SP is rebooted and we’re going to continue on with the lab, where you’re gonna have to restore any LUNs that were trespassed. I can see here, just by looking at the current ownership, that the LUNs have come back under their original SP once the SP has come up. And so, that’s a good thing. And that was done… I don’t want to go to storage; I want to go to LUNs. And again, we can see everything looks good there.

So, it’s always a good idea to, you know, try to restore those as well, as we did in lab 2. We could do it with the CLI or with Unisphere. If I go to the System tab and look at trespassed LUNs, there don’t appear to be any LUNs that are trespassed, so we’re in good shape here. I’ll say “OK” to that. And again, in case you did have trespassed LUNs, we’ll go ahead and just take a look at that.

All I did is bring up a PuTTY session and issue a nas_storage list command, and this is what we saw before. Then I could actually go over and fail those back if we had LUNs that needed to be failed back. I have to fail back, of course, using the serial number name, so I’ll just cut and paste that in there. And this should come back and do that. Again, we showed you how to do that earlier, in part 2 of lab 7.
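The ownership check described above comes down to comparing each LUN's current owner with its default owner. Here is a minimal sketch of that comparison; the record type, field names, and sample rows are hypothetical, and in a real system the values would come from Unisphere or the CLI rather than being typed in.

```python
from typing import NamedTuple


class LunOwnership(NamedTuple):
    """Hypothetical ownership record for one LUN."""
    lun_id: int
    default_owner: str
    current_owner: str


def still_trespassed(luns):
    """Return the LUNs whose current owner differs from their default owner."""
    return [lun for lun in luns if lun.current_owner != lun.default_owner]


# Made-up sample: LUN 2 has not yet returned to its default owner.
sample = [
    LunOwnership(1, "SP B", "SP B"),
    LunOwnership(2, "SP B", "SP A"),
    LunOwnership(3, "SP A", "SP A"),
]

print([lun.lun_id for lun in still_trespassed(sample)])
```

An empty result corresponds to the situation in the lab, where no trespassed LUNs were reported and nothing needed to be failed back.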

So, that looks good. After this passes, we can go ahead and run nas_checkup again. Once you’ve run nas_checkup, you could also run the Verify Storage System wizard again. And once you’ve done that, you would go ahead and enable the ConnectHome notifications, and that would complete the lab exercise. So, here we see the command has returned. Then we go ahead and run nas_checkup again, and we’re gonna let that run.

So, again, the check has run clean. I see some warnings, but no failures. At this point, you can go ahead and enable the ConnectHome and email services and, if you want, run a verification check again with the Verify Storage System software. If you need to do that, you can refer to the previous lab. And that concludes part 3 of lab 7.
