Use msgrv2 always for all ODF modes (internal, external, provider) #2793
Conversation
Initially in 4.13, when encryption in-transit was implemented, we started to use msgrv2 for internal mode only. External mode & provider mode were excluded.

For external mode, the concern was whether the msgrv2 port would always be available on the RHCS cluster. But we have now reached Ceph v18, and msgrv2 has been on by default since Ceph Nautilus 14.2. So we can now safely use msgrv2 for external mode.

For provider mode, the concern was the availability of the required kernel versions on old client clusters. That concern also no longer seems valid, so we can now use msgrv2 for provider mode as well.

Also, as we would always use msgrv2, we can now simplify the GetCephFSKernelMountOptions function to remove the conditional checks: if encryption is enabled use "secure", otherwise "prefer-crc".

Signed-off-by: Malay Kumar Parida <[email protected]>
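A minimal sketch of the simplified helper, assuming only the logic stated above (the real function in ocs-operator takes the StorageCluster CR; a plain bool is used here to keep the sketch self-contained):

```go
// getCephFSKernelMountOptions mirrors the simplified logic: with msgrv2
// required for every ODF mode, only the encryption check remains.
func getCephFSKernelMountOptions(encryptionInTransitEnabled bool) string {
	if encryptionInTransitEnabled {
		// Encrypt all client traffic on the wire.
		return "secure"
	}
	// No encryption: still use the msgrv2 port, with CRC integrity checking.
	return "prefer-crc"
}
```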
/hold
/cc @travisn
LGTM, assuming we can confirm that RHCS will always have the msgrv2 port available.
@parth-gr Can you also confirm if the external cluster script collects the v2 endpoint (port 3300) when detecting the mons? Commonly I would expect both v1 and v2 to be available, so we just want to start relying exclusively on the v2 port.
Do the existing PVs/VSCs continue to work if ms_mode is changed from legacy to prefer-crc? Before 4.17 we (provider) weren't using this option, so it was the default.
LGTM and I'd like a quick opinion from @Madhu-1, thanks.
For external mode we wouldn't be requiring msgr2 from the Ceph components; it only changes the clients to start using the msgr2 port. By default we will just be changing the port, and existing volumes can continue using the v1 port until the next time they are mounted. So I would expect no issues with this transition, but we should certainly pay extra attention during upgrade testing.
Yes, existing volumes would continue to use the legacy v1 port without any issues. The mount option is consulted only at volume mount time, so any new volumes mounted after this change will use the prefer-crc mount option.
@parth-gr: changing LGTM is restricted to collaborators. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Does that mean the v2 port can still mount/remount a PVC that was created via the v1 port?
@malayparida2000 are we sure RHCS doesn't support running only on v1 ports? If that's supported, we need to support the same; if not, we are good. Can we get this confirmation?
On the support question, I may have to check with the support team. But yes, on an RHCS cluster the msgrv2 port can be disabled if the customer wants, so technically an RHCS cluster can be v1-port only.
Yes, if remounting happens, the volume will then mount using the v2 port.
We need to check on this one before making it the default.
Just an update on the discussion here,
Agreed, let's go ahead with this requirement to always use the v2 port for downstream, including for external clusters. We just need to ensure the mon endpoints that are imported are also setting v2 ports. I believe this is already something Rook enforces during the cluster reconcile. @parth-gr Could you test to confirm that if
For the external cluster, we check whether the leader endpoint provided by the Python script has v2; if it contains a v2 port, Rook internally updates it. So we should first default the collection of the v2 port in the Python script, and we may also add the CR spec
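A rough Go sketch of the kind of endpoint check being discussed (the helper name and endpoint format are assumptions for illustration; the real collection happens in the external-cluster Python script and in Rook's reconcile):

```go
import "net"

// hasMsgr2Port reports whether a mon endpoint such as "10.0.0.1:3300"
// already carries the msgrv2 port. Hypothetical helper for illustration.
func hasMsgr2Port(endpoint string) bool {
	_, port, err := net.SplitHostPort(endpoint)
	if err != nil {
		return false
	}
	return port == "3300" // 3300 = msgrv2, 6789 = legacy msgrv1
}
```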
@parth-gr My question is a bit different; here are a few points to consider:
To add, we also might need to check,
}
// Always require msgr2 for all modes of ODF
networkSpec.Connections.RequireMsgr2 = true
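For context, a minimal Go sketch of where this lands in the Rook CephCluster spec (the import path and field names follow the rook.io v1 API as I understand it; the helper itself is hypothetical):

```go
import (
	cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

// newNetworkSpec illustrates the effect of this change: msgr2 is required
// unconditionally, for internal, external, and provider mode alike, so the
// mons and OSDs advertise only the v2 port (3300).
func newNetworkSpec() cephv1.NetworkSpec {
	return cephv1.NetworkSpec{
		Connections: &cephv1.ConnectionsSpec{
			RequireMsgr2: true,
		},
	}
}
```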
I am not sure we can blindly change that for provider mode. What happens to existing deployments?
For existing deployments, the mons would still have the v1 port available after the upgrade, so there would be no problem for existing volumes that are already mounted. Any new volumes created after the upgrade would use the v2 port.
@malayparida2000 I am not sure that will work out of the box. Don't we need to inform the CSI running on the clients that we moved to v2? If so, how do we do that in provider mode? The CSI is running on many different clusters.
When the ODF version upgrade happens from 4.17 to 4.18, where this change will come in, the subscription channel for ocs-operator would change. That would change the hash for the clients and trigger a reconciliation for them based on the channelName.
When that happens, the client would then get the new info from the provider cluster. That's my understanding.
> would now get the new info from the provider cluster

- This config will be pushed to clients right after the provider gets upgraded, but not when the client gets upgraded.
- Nevertheless, based on all the info presented, even 4.17 clients will have the v2 endpoint, which should be supported by the kernel on client clusters.

> don't we need to inform the CSI running on the clients that we moved to v2

- Two changes will happen after the provider upgrade: new v2 endpoints and the mount option ms_mode=prefer-crc will be pushed to client clusters, and IIUC they can be changed on the fly without any issues for both existing and new workloads.
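To illustrate the second change, a hedged Go sketch of how the mount option could travel to clients as a CephFS StorageClass parameter (the kernelMountOptions parameter name comes from ceph-csi; the helper and its wiring are assumptions, not the exact provider-server code):

```go
// buildCephFSStorageClassParams shows how ms_mode could ride along as a
// StorageClass parameter. It is consulted only at mount time, which is why
// already-mounted volumes keep the v1 port until they are remounted.
func buildCephFSStorageClassParams(encryptionInTransitEnabled bool) map[string]string {
	msMode := "prefer-crc"
	if encryptionInTransitEnabled {
		msMode = "secure"
	}
	return map[string]string{
		"kernelMountOptions": "ms_mode=" + msMode, // hypothetical wiring
	}
}
```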
For downstream we can use the
To capture our discussion:
From Slack by Parth
@parth-gr this should mean this PR is now no longer blocked from the external mode perspective, right?
@malayparida2000 can we do some quick testing?
Sounds good to do a test, will do it.
Currently there is some infra issue with the availability of external mode setups from QE. In any case, the scenario we are concerned about is the behavior of the cluster upon upgrade from 4.17 to 4.18, which will be easier to test once this change is officially part of the build. I also have to document the whole process, so I will do the e2e testing of the flow after the PR is merged and part of the build.
/unhold
/lgtm
@parth-gr: changing LGTM is restricted to collaborators. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
LGTM. Since Ohad has an unresolved comment, we need to wait for that to be resolved.
Ohad will be on PTO till the 21st of Oct. This is an important change, and with further delay to its merge we lose precious time we might need for follow-up changes based on its results. This PR also blocks other stories, thereby blocking the Epic's progress. As we have acks from all PoVs, including provider mode from Leela, we should go ahead and merge this.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: malayparida2000, parth-gr, travisn. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
/retest
/retest
/test ocs-operator-bundle-e2e-aws
/retest
/override ci/prow/ocs-operator-bundle-e2e-aws
@iamniting: Overrode contexts on behalf of iamniting: ci/prow/ocs-operator-bundle-e2e-aws. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Merged commit 38cbf3a into red-hat-storage:main.
@malayparida2000 We missed rebasing it against main and running make deps-update; there are changes in metrics/vendor/ from this PR in my local.
I didn't understand. How does this PR affect the vendor folder?