Replace a failed node within the ObjectScale Software Bundle
Use this Node Replacement service procedure to replace a failed node.
Prerequisites
Ensure that the new node has the same operating system version and networking configuration as the other nodes within the ObjectScale Software Bundle. Ensure that the system has spare FTT (failures to tolerate) quota: if the system is FTT=1, ensure that no other nodes are down; if the system is FTT=2, ensure that at most one other node is down. Ensure that there are no other ongoing service procedures or recoveries.
NOTE: If this FTT requirement is not met, do not proceed with these steps; call Dell Support.
About this task
When a node fails, all pods on that node enter the Terminating state. Stateless pods are rescheduled to another available node after five minutes; stateful pods remain in the Terminating state.
Steps
The ObjectScale Software Bundle contains a CMO Platform Manager running on Kubernetes within the cluster that is used to request cluster management tasks, such as service procedures. The CMO Platform Manager APIs require a Keycloak token to authenticate requests for cluster management tasks.
Collect the Keycloak account information from the secret:
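The secret name and namespace vary by deployment; as a minimal sketch, assuming a hypothetical secret named keycloak-admin-credentials in an objectscale namespace, the account fields can be decoded as follows:

```shell
# Hypothetical secret name and namespace -- substitute the values from your deployment.
SECRET_NAME="keycloak-admin-credentials"
NAMESPACE="objectscale"

# Secret data is base64-encoded, so decode each field.
kubectl get secret "$SECRET_NAME" -n "$NAMESPACE" -o jsonpath='{.data.username}' | base64 -d; echo
kubectl get secret "$SECRET_NAME" -n "$NAMESPACE" -o jsonpath='{.data.password}' | base64 -d; echo
```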
NOTE: If the remove_os_packages parameter is set to true, the OS packages are removed from the node. This precludes the user from adding the node back to the cluster without reinstalling those OS packages.
Scale down the node using the CMO Platform Manager scale down API.
NOTE: If the node is unreachable (the logs read "Unreachable=1"), the scale down operation reports failure even though the scale down completes successfully.
When the operation is finished, the operation "state" is marked as "complete".
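The operation state can be polled with the Keycloak token. The endpoint hostname and API path below are assumptions for illustration, not the documented CMO Platform Manager API; substitute the values from your deployment.

```shell
# Hypothetical endpoint and operation ID -- substitute real values from your deployment.
CMO_ENDPOINT="https://cmo-platform-manager.local"
TOKEN="replace-with-keycloak-token"
OPERATION_ID="replace-with-operation-id"

# Poll the operation and inspect its "state" field.
curl -sk -H "Authorization: Bearer $TOKEN" \
  "$CMO_ENDPOINT/api/v1/operations/$OPERATION_ID" | grep -o '"state": *"[a-z]*"'
```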
Confirm that the new node appears in the node list.
kubectl get node
NOTE: Although the status of this operation may appear as failed, the failed node may still have been removed successfully. Check the node status.
Delete the PVC, volumes, and LVGs of stateful pods on the removed node.
Retrieve all PVCs bound to the node to be removed.
NOTE: The node name is listed in the volume.kubernetes.io/selected-node annotation in the describe output of each PVC.
The PVC names and their details can be obtained with the following commands.
Get PVC names:
kubectl get pvc
Get the details for each listed PVC:
kubectl describe pvc <PVC_NAME>
Get the node for each listed PVC:
PVCs are namespace-scoped resources. Repeat this step for all namespaces used by ObjectScale.
for i in `kubectl get pvc --no-headers -o jsonpath="{.items[*].metadata.name}"`; do echo "=== $i"; kubectl get pvc $i -o json | grep selected-node | grep -v "{}"; done
Delete the available capacity (AC) custom resources for the removed node:
kubectl get ac | grep <NODE_ID> | awk '{print $1}' | xargs kubectl delete ac
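The volume and LVG custom resources for the removed node can be cleaned up with pipelines analogous to the AC deletion. The volume and lvg resource names below are assumed CSI Bare-Metal short names; confirm them with kubectl api-resources before deleting.

```shell
NODE_ID="replace-with-node-id"

# Delete LVG custom resources that reference the removed node.
kubectl get lvg | grep "$NODE_ID" | awk '{print $1}' | xargs -r kubectl delete lvg

# Delete volume custom resources that reference the removed node
# (add -n <NAMESPACE> if volumes are namespace-scoped in your version).
kubectl get volume | grep "$NODE_ID" | awk '{print $1}' | xargs -r kubectl delete volume

# Delete the PVCs identified in the previous step (PVCs are namespace-scoped).
kubectl delete pvc "replace-with-pvc-name" -n "replace-with-namespace"
```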
Verify resource removal.
Check for CSI Bare-Metal Node:
kubectl get csibmnodes | grep <NODE_ID>
Check for available capacity:
kubectl get ac | grep <NODE_ID>
Check for drive CRs:
kubectl get drive | grep <NODE_ID>
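The three checks above can be rolled into one loop; each resource type should show no entries for the removed node once cleanup is complete. This loop is a convenience sketch, not part of the documented procedure:

```shell
NODE_ID="replace-with-node-id"

# Each resource type should show no entries for the removed node.
for resource in csibmnodes ac drive; do
  echo "=== $resource"
  kubectl get "$resource" | grep "$NODE_ID" || echo "none found"
done
```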
Delete pending stateful pods.
kubectl get pods -o wide -A | grep Pending
NOTE: After the removal of a failed node, some pods may be left in the Pending state. These are likely StatefulSet pods that were previously running on the removed node, including SS, influxdb, bookie, and atlas pods. Once deleted, they and their associated volumes are re-created on another available node.
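The Pending pods can be deleted in one pass by extracting the namespace and pod name from the listing above; the column positions are an assumption based on default kubectl get pods -A output:

```shell
# Column 1 is the namespace and column 2 is the pod name; delete each Pending pod.
kubectl get pods -o wide -A | grep Pending | \
  awk '{print $2 " -n " $1}' | xargs -r -n 3 kubectl delete pod
```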