
April 9th, 2021 11:00

csi-powerscale 1.5.0 NFS client entry changes

In csi-powerscale up through 1.4.0, NFS clients would be entered using the node FQDN if resolvable. It seems in 1.5.0, this has changed again and now entries are added using just the IP.

This is fine, but it caused an issue with existing volumes, especially single-writer volumes. Volumes attached at the time of upgrade stay attached, but when they are detached/unmounted, the existing FQDN client entry is not removed from the export. The next mount attempt then fails:

[Warning] AttachVolume.Attach failed for volume "k8s-abc1234zyx" : rpc error: code = FailedPrecondition desc = runid=41 export '76' in access zone 'MYAZ' already has other clients added to it, and the access mode is SINGLE_NODE_WRITER, thus the request fails

My guess is the 1.5.0 version is now only looking for the IP to remove, and so doesn't clean up the FQDN entry that was originally added by 1.4.0?
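In case anyone else hits the stale-entry failure above, one possible manual workaround is to remove the leftover FQDN client on the PowerScale side. This is just a sketch: the flag names are assumed from the OneFS `isi nfs exports` CLI and may vary by OneFS release; export ID 76 and zone MYAZ come from the error above, and the FQDN is a placeholder.

```shell
remove_stale_client() {
  # Hypothetical helper: drop a leftover FQDN entry from an export's
  # client list. Flag names assumed from the OneFS CLI; verify against
  # your OneFS release before running this against a production cluster.
  export_id="$1"; zone="$2"; client="$3"
  # Show the export's current client list first:
  isi nfs exports view "$export_id" --zone "$zone"
  # Then remove the stale entry:
  isi nfs exports modify "$export_id" --zone "$zone" --remove-clients "$client"
}

# e.g. remove_stale_client 76 MYAZ worker1.example.com
```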

Another issue I've noticed is that "localhost" is also being added to the client lists on the export. And localhost is left behind on the export when the volume un-mounts.

42 Posts

July 9th, 2021 01:00

Hi @drb45 

I apologize for the delayed response. 

Are you using the current csi-powerscale 1.6 version?

If you are using the 1.5 driver, kindly try this:

Example: kubectl patch daemonset isilon-node -n isilon -p '{"spec": {"template": {"spec": {"dnsPolicy": "ClusterFirst"}}}}'

If you are using the 1.6 driver, the dnsPolicy value is configurable via the myvalues.yaml file; in myvalues.yaml, kindly set dnsPolicy: "ClusterFirst" for all nodes.

Please apply the dnsPolicy: "ClusterFirst" change and see if the problem persists.
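After patching, a quick way to confirm the setting took effect (the field path is the standard DaemonSet spec; the isilon namespace follows the command above):

```shell
# Read back the dnsPolicy from the patched DaemonSet.
get_dns_policy() {
  kubectl get daemonset isilon-node -n "$1" \
    -o jsonpath='{.spec.template.spec.dnsPolicy}'
}

# e.g. get_dns_policy isilon   # expect: ClusterFirst
```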
 
Regards
Thar_J

42 Posts

April 12th, 2021 00:00

Hi 

Regarding localhost :

This was done explicitly to enhance security: an empty export is accessible to all hosts on the network, so we populate it with a dummy 'localhost' entry to restrict that access.

 

42 Posts

April 12th, 2021 01:00

We would like to know whether this is a Helm-based or an Operator-based installation.

15 Posts

April 12th, 2021 05:00

It's a Helm install.

On the localhost thing: I was pretty sure, but I just tested it to verify that an empty NFS client list doesn't make an export wide open; it can't be mounted by any clients. But I can see the logic in the dummy localhost entry.

42 Posts

April 15th, 2021 02:00

Hi 

Just wanted to understand whether there was any recent change in the network config (e.g., a DNS change).

Kindly attach the output of the commands below for each of the driver pods; we would like to analyze the logs.

kubectl get pods -n isilon

kubectl logs <pod-name> -n isilon -c driver
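A small helper to collect those logs from every pod in the namespace in one go (the driver container name follows the commands above; adjust if your install differs):

```shell
# Dump the driver container's log for each pod into a per-pod file.
collect_driver_logs() {
  ns="${1:-isilon}"
  for pod in $(kubectl get pods -n "$ns" -o name); do
    kubectl logs -n "$ns" "$pod" -c driver > "$(basename "$pod").driver.log"
  done
}

# e.g. collect_driver_logs isilon
```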

 

15 Posts

April 15th, 2021 05:00

There were no network or DNS changes. I had to roll back to version 1.4.0 because of the change in behavior. After rolling back, it resumed the previous behavior of using FQDN on the export client lists.

42 Posts

April 16th, 2021 00:00

Hi @drb45

I understand that after rolling back the csi-powerscale driver from 1.5 to 1.4, the previous FQDN behaviour works fine. We have tried the upgrade in various environments without issue, so we would like to isolate the underlying cause of the FQDN issue.

We would like to know the steps you followed to upgrade the driver from 1.4.0 to 1.5.0 previously.

Kindly let us know when you plan to upgrade the current PowerScale driver from 1.4.0 to 1.5.0; we would like to help you with that.

Regards

Thar_J

15 Posts

April 16th, 2021 07:00

I'll have to build a test cluster to try and reproduce this, because I can't keep upgrading/downgrading in our live environment since it's an actively used application. I'll report back when I've had a chance to do that.

If it helps, our cluster is running RKE 1.18.15 (being upgraded to 1.19.9 later today) on RHEL 7 nodes.

My process to upgrade was:

  • Updated my config file to align with the changes from 1.4 -> 1.5
  • Created the cluster credential secret using the new secrets.json file
  • Ran the csi-install.sh script with the --upgrade flag
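For reference, those steps roughly correspond to the following commands. This is a sketch based on the standard Helm install flow; the namespace, secret name, and installer path are my assumptions, so adjust them to your environment.

```shell
# Sketch of the 1.4 -> 1.5 upgrade steps listed above (names assumed).
upgrade_driver() {
  # Step 1 (done beforehand): myvalues.yaml updated for the 1.5 chart changes.
  # Step 2: create the cluster credential secret from the new secrets.json.
  kubectl create secret generic isilon-creds -n isilon \
    --from-file=config=secrets.json
  # Step 3: run the installer in upgrade mode.
  ./dell-csi-helm-installer/csi-install.sh --namespace isilon \
    --values myvalues.yaml --upgrade
}
```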

15 Posts

June 10th, 2021 12:00

Hi @Thar_J

Sorry for the long radio silence on this, I've been slammed with other projects and this hasn't been a high priority since 1.4.0 was working fine.

Today I had some time so I built a new K8s (RKE) 1.19.10 cluster on RHEL 7 nodes to test this on, which matches our production cluster. I started with csi-powerscale v1.3.0.1 because I was also testing something else which I'll put in another post (fsGroup behavior). I then upgraded it to 1.4.0, and 1.5.0. Below are my findings:

  • v1.3.0.1: NFS client(s) = worker node FQDN
  • v1.4.0: NFS client(s) = worker node FQDN
  • v1.5.0: NFS client(s) = worker node IP

Between versions there were no changes at all to the OS or networking config on the nodes, in fact they weren't even rebooted.

 

42 Posts

June 13th, 2021 23:00

Hi @drb45 

The CSI driver first tries the FQDN; if that doesn't work, it tries the IP. Priority is given to the FQDN.
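A minimal sketch of that selection logic (my illustration, not the driver's actual code): reverse-resolve the node IP and use the name when the lookup succeeds, otherwise fall back to the raw IP.

```shell
# Pick the NFS client entry for a node IP: FQDN first, IP as fallback.
client_entry() {
  ip="$1"
  # getent prints "IP  NAME ..." when reverse resolution succeeds.
  name=$(getent hosts "$ip" | awk '{print $2}')
  if [ -n "$name" ]; then
    echo "$name"   # resolvable: use the resolved name
  else
    echo "$ip"     # unresolvable: fall back to the raw IP
  fi
}
```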

Regards

Thar_J

 

15 Posts

June 14th, 2021 05:00

I guess something changed between 1.4.0 and 1.5.0 then, since the observed behavior changed between the two versions while there were no changes at all to the underlying hosts (not even a reboot).

It would be fine except for the issue I first mentioned: RWO volumes which are mounted at the time of the CSI upgrade don't get their NFS client list cleaned up properly when they unmount. The existing FQDN entry isn't removed (since 1.5.0 is looking for an IP entry), and the next mount then fails because there's already a client defined on the export.

15 Posts

July 9th, 2021 13:00

@Thar_J we're still running on v1.4.0 because of the change in the client entry behavior affecting currently-mounted volumes.

On my test cluster with v1.6.0, I applied dnsPolicy: "ClusterFirst" and can confirm that reverted the behavior to the pre-1.5 behavior of using the node DNS name. So it appears that will allow me to go ahead with upgrading to 1.5 and then 1.6 without affecting RWO volumes that are actively mounted.

That said, I still don't think there's a clean path for me to migrate from "ClusterFirst" to the default/recommended "ClusterFirstWithHostNet" - but what I'll plan to do is make that switch during our next maintenance when all the RWO pods will get killed anyway.

Thanks for the help looking into this.

42 Posts

July 11th, 2021 22:00

Hi @drb45 

 

Glad to hear that applying dnsPolicy: "ClusterFirst" reverted the behavior to the pre-1.5 behavior of using the node DNS name.

Kindly plan to upgrade the CSI-Powerscale at your convenience.

Regards

Thar_J

August 7th, 2021 03:00

Hi Thar_J, 

I'm using csi-powerscale 1.6 and below is in myvalues.yaml file. 

dnsPolicy: "ClusterFirstWithHostNet"

However, I did run the following command:

kubectl patch daemonset isilon-node -n isilon -p '{"spec": {"template": {"spec": {"dnsPolicy": "ClusterFirst"}}}}'

When I created a pod, it failed to start with the error below:

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 84s default-scheduler Successfully assigned default/nginx-pv-pod to desmond-virtualbox
Warning FailedAttachVolume 16s (x8 over 84s) attachdetach-controller AttachVolume.Attach failed for volume "k8s-115ff725f7" : rpc error: code = Internal desc = runid=73 internal error occured when attempting to add client ip 'desmond-virtualbox=#=#=kube-cluster=#=#=192.168.0.216' to export '5', error : 'failed to add client ip '192.168.0.216' to export id '5' : 'failed to add client to export id '5' with access zone 'System' : 'Get "https://192.168.0.200:8080/platform/2/protocols/nfs/exports/5?zone=System": context deadline exceeded'''

Where did it go wrong? 

August 7th, 2021 23:00

Hi Thar_J, 

Please ignore my question; I managed to resolve it. The issue was that the test Isilon system on my laptop did not have DNS configured, so it was unable to resolve the worker node hostname when the CSI driver tried to add the hostname to the NFS export client list. After I configured a DNS server, it works perfectly.
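For anyone hitting the same thing, a quick sanity check from any host that shares the Isilon's DNS setup (the hostname below is the worker node name from my earlier error; substitute your own):

```shell
# Report whether a node hostname resolves, i.e. whether the DNS path the
# driver depends on is working for that name.
check_node_dns() {
  if getent hosts "$1" >/dev/null; then
    echo "$1 resolves"
  else
    echo "$1 does not resolve"
  fi
}

# e.g. check_node_dns desmond-virtualbox
```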
