
Use msgrv2 always for all ODF modes (internal, external, provider) #2793

Merged (2 commits) on Oct 17, 2024

Conversation

@malayparida2000 (Contributor) commented Sep 11, 2024

Initially in 4.13, when encryption in-transit was implemented, we started to use msgrv2 for internal mode only. External mode and provider mode were excluded.

For external mode, the concern was whether the msgrv2 port would always be available on the RHCS cluster. But we have now reached Ceph v18, and msgrv2 has been on by default since Ceph Nautilus (14.2). So we can now safely use msgrv2 for external mode.

For provider mode, the concern was the availability of the required kernel versions on old client clusters. That concern also no longer seems valid, so we can use msgrv2 for provider mode as well.

Also, as we would always use msgrv2, we can now simplify the GetCephFSKernelMountOptions function to remove the conditional checks: if encryption is enabled use "secure", otherwise use "prefer-crc".
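For illustration, a minimal sketch of what the simplified helper could look like after this change. The receiver type (ocsv1 = the ocs-operator v1 API package) and the encryption field path (Rook's NetworkSpec.Connections.Encryption.Enabled) are assumptions, not a copy of the ocs-operator source:

    // GetCephFSKernelMountOptions picks the CephFS kernel ms_mode option.
    // With msgrv2 always required, only the encryption check remains.
    func GetCephFSKernelMountOptions(sc *ocsv1.StorageCluster) string {
        n := sc.Spec.Network
        if n != nil && n.Connections != nil &&
            n.Connections.Encryption != nil && n.Connections.Encryption.Enabled {
            return "secure" // encryption in-transit: fully encrypted msgr2
        }
        return "prefer-crc" // msgr2 with CRC integrity checking
    }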

@malayparida2000 (Contributor Author)

/hold

@openshift-ci bot added the do-not-merge/hold label (a PR should not merge because someone has issued a /hold command) on Sep 11, 2024.
@malayparida2000 (Contributor Author)

/cc @travisn

@openshift-ci bot requested a review from travisn on September 11, 2024 17:02.
@travisn (Contributor) left a comment

LGTM, assuming we can confirm that RHCS will always have the msgrv2 port available.

@travisn (Contributor) commented Sep 11, 2024

@parth-gr Can you also confirm if the external cluster script collects the v2 endpoint (port 3300) when detecting the mons? Commonly I would expect both v1 and v2 to be available, so we just want to start relying exclusively on the v2 port.

@leelavg (Contributor) left a comment

Do existing PVs/VSCs continue to work if ms_mode is changed from legacy to prefer-crc? Before 4.17 we (provider) weren't using this option, so it was the default.

LGTM and I'd like a quick opinion from @Madhu-1, thanks.

@travisn (Contributor) commented Sep 12, 2024

For external mode we wouldn't be requiring msgr2 from the Ceph components; it only changes the clients to start using the msgr2 port. By default we will just be changing the port, and existing volumes can continue using the v1 port until the next time they are mounted. So I would expect no issues with this transition, but we should certainly pay extra attention during upgrade testing.

@malayparida2000 (Contributor Author)

Do existing PVs/VSCs continue to work if ms_mode is changed from legacy to prefer-crc? Before 4.17 we (provider) weren't using this option, so it was the default.

LGTM and I'd like a quick opinion from @Madhu-1, thanks.

Yes, existing volumes would continue to use the legacy v1 port without any issues. The mount option is used only during the initial mounting of a volume, so any new volumes mounted after this change will use the prefer-crc mount option.

openshift-ci bot commented Sep 12, 2024

@parth-gr: changing LGTM is restricted to collaborators

@parth-gr (Member) commented Sep 12, 2024

Yes, existing volumes would continue to use the legacy v1 port without any issues. The mount option is used only during the initial mounting of a volume.

Does that mean a PVC created over the v1 port can still be mounted/remounted over the v2 port?

@Madhu-1 (Member) commented Sep 12, 2024

@malayparida2000 are we sure that RHCS doesn't support running on v1 ports only? If it does, we also need to support the same; if not, we are good. Can we get this confirmation?

@malayparida2000 (Contributor Author)

@malayparida2000 are we sure that RHCS doesn't support running on v1 ports only? If it does, we also need to support the same; if not, we are good. Can we get this confirmation?

As for support, I may have to check with the support team. But yes, the msgrv2 port on an RHCS cluster can be disabled if the customer wants, so technically an RHCS cluster can run on the v1 port only.

@malayparida2000 (Contributor Author)

Yes, existing volumes would continue to use the legacy v1 port without any issues. The mount option is used only during the initial mounting of a volume.

Does that mean a PVC created over the v1 port can still be mounted/remounted over the v2 port?

Yes, if remounting happens, the volume will then mount using the v2 port.

@Madhu-1 (Member) commented Sep 12, 2024

@malayparida2000 are we sure that RHCS doesn't support running on v1 ports only? If it does, we also need to support the same; if not, we are good. Can we get this confirmation?

As for support, I may have to check with the support team. But yes, the msgrv2 port on an RHCS cluster can be disabled if the customer wants, so technically an RHCS cluster can run on the v1 port only.

We need to check on this one before making this the default.

@malayparida2000 (Contributor Author)

Just an update on the discussion here:
Travis and I got in touch with Federico (Ceph PM) and Orit. Their opinion is that on RHCS clusters the msgrv2 port has been available and in use by default since Nautilus (14.2), and there is no reasonable general reason for the v2 port to be turned off or unavailable on an RHCS cluster. So we should be safe in making the v2 port the default for ODF. We are also in touch with Illya about this and will soon make a decision.

@travisn (Contributor) commented Oct 3, 2024

Just an update on the discussion here: Travis and I got in touch with Federico (Ceph PM) and Orit. Their opinion is that on RHCS clusters the msgrv2 port has been available and in use by default since Nautilus (14.2), and there is no reasonable general reason for the v2 port to be turned off or unavailable on an RHCS cluster. So we should be safe in making the v2 port the default for ODF. We are also in touch with Illya about this and will soon make a decision.

Agreed, let's go ahead with this requirement to always use the v2 port for downstream, including for external clusters.

We just need to ensure the mon endpoints that are imported are also setting v2 ports. I believe this is already something that Rook enforces during the cluster reconcile. @parth-gr Could you test to confirm that if RequireMsgr2 is set in the cephcluster CR for the external cluster scenario, we replace any v1 ports with v2 ports as part of the rook cluster reconcile?

@parth-gr (Member) commented Oct 4, 2024

Agreed, let's go ahead with this requirement to always use the v2 port for downstream, including for external clusters.
We just need to ensure the mon endpoints that are imported are also setting v2 ports. I believe this is already something that Rook enforces during the cluster reconcile. @parth-gr Could you test to confirm that if RequireMsgr2 is set in the cephcluster CR for the external cluster scenario, we replace any v1 ports with v2 ports as part of the rook cluster reconcile?

For the external cluster, we check whether the leader endpoint provided by the Python script has v2; if it contains a v2 port, Rook internally sets RequireMsgr2: true: https://github.com/rook/rook/blob/58a0c834bf5bce6782de5c387f3dac732393ef33/pkg/operator/ceph/cluster/cluster_external.go#L119

So we should first make the Python script collect the v2 port by default, and we may also set the CR spec RequireMsgr2 to true.
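As a hedged illustration of that detection (the function name is invented, not the actual Rook source linked above; only the default port numbers are standard Ceph facts):

    import "net"

    // endpointHasMsgr2Port reports whether a mon endpoint such as
    // "10.1.2.3:3300" already uses the default msgrv2 port.
    func endpointHasMsgr2Port(endpoint string) bool {
        _, port, err := net.SplitHostPort(endpoint)
        if err != nil {
            return false
        }
        return port == "3300" // 3300 = msgrv2 default, 6789 = msgrv1 default
    }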

@travisn (Contributor) commented Oct 4, 2024

Agreed, let's go ahead with this requirement to always use the v2 port for downstream, including for external clusters.
We just need to ensure the mon endpoints that are imported are also setting v2 ports. I believe this is already something that Rook enforces during the cluster reconcile. @parth-gr Could you test to confirm that if RequireMsgr2 is set in the cephcluster CR for the external cluster scenario, we replace any v1 ports with v2 ports as part of the rook cluster reconcile?

For the external cluster, we check whether the leader endpoint provided by the Python script has v2; if it contains a v2 port, Rook internally sets RequireMsgr2: true: https://github.com/rook/rook/blob/58a0c834bf5bce6782de5c387f3dac732393ef33/pkg/operator/ceph/cluster/cluster_external.go#L119

So we should first make the Python script collect the v2 port by default, and we may also set the CR spec RequireMsgr2 to true.

@parth-gr My question is a bit different, here are a few points to consider:

  1. Upstream we don't want to default to v2, only for downstream
  2. OCS operator (with this PR) is planning to set RequireMsgr2: true for external mode (it was already set for internal mode), so there is no need for Rook to automatically set requireMsgr2.
  3. The code you linked is perfect for upstream, but I think we need to add another check such that if RequireMsgr2: true, we force the v2 port for the external mon endpoints (a hedged sketch follows after this list).
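A minimal sketch of the check proposed in point 3, assuming we only rewrite the default v1 port and leave custom ports untouched (the function is hypothetical, not Rook code):

    import "net"

    // forceMsgr2Port rewrites a mon endpoint from the default msgrv1 port
    // (6789) to the default msgrv2 port (3300) when msgr2 is required.
    func forceMsgr2Port(endpoint string, requireMsgr2 bool) string {
        host, port, err := net.SplitHostPort(endpoint)
        if !requireMsgr2 || err != nil || port != "6789" {
            return endpoint // already v2, or a custom port: leave as-is
        }
        return net.JoinHostPort(host, "3300")
    }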

@malayparida2000 (Contributor Author)

Agreed, let's go ahead with this requirement to always use the v2 port for downstream, including for external clusters.
We just need to ensure the mon endpoints that are imported are also setting v2 ports. I believe this is already something that Rook enforces during the cluster reconcile. @parth-gr Could you test to confirm that if RequireMsgr2 is set in the cephcluster CR for the external cluster scenario, we replace any v1 ports with v2 ports as part of the rook cluster reconcile?

For the external cluster, we check whether the leader endpoint provided by the Python script has v2; if it contains a v2 port, Rook internally sets RequireMsgr2: true: https://github.com/rook/rook/blob/58a0c834bf5bce6782de5c387f3dac732393ef33/pkg/operator/ceph/cluster/cluster_external.go#L119
So we should first make the Python script collect the v2 port by default, and we may also set the CR spec RequireMsgr2 to true.

@parth-gr My question is a bit different, here are a few points to consider:

  1. Upstream we don't want to default to v2, only for downstream
  2. OCS operator (with this PR) is planning to set RequireMsgr2: true for external mode (it was already set for internal mode), so there is no need for Rook to automatically set requireMsgr2.
  3. The code you linked is perfect for upstream, but I think we need to add another check such that if RequireMsgr2: true, force the v2 port for the external mon endpoints.

To add, we also might need to check:
4. For upgraded clusters, when requireMsgr2 is true, we have to make sure the rook-ceph-csi-config CM changes the mon ports to v2 ports, as it does for internal mode.

}
// Always require msgr2 for all modes of ODF
networkSpec.Connections.RequireMsgr2 = true
Contributor:

I am not sure we can blindly change that for provider mode. What happens to existing deployments?

Contributor Author:

For existing deployments, the mons would still have the v1 port available after the upgrade, so there would be no problem for existing volumes that are already mounted. Any new volumes created after the upgrade would use the v2 port.

Contributor:

@malayparida2000 I am not sure that will work out of the box. Don't we need to inform the CSI running on the clients that we moved to v2? If so, how do we do that in provider mode? The CSI is running on many different clusters.

@malayparida2000 (Contributor Author) commented Oct 8, 2024

When the ODF version upgrade happens from 4.17 to 4.18, where this change will come in, the subscription channel changes for ocs-operator. That triggers a change in the hash for the clients, which triggers a reconciliation for them based on this.

When that happens, the client would get the new info from the provider cluster. That's my understanding.

Contributor:

would now get the new info from the provider cluster

  • this config will be pushed to clients right after the provider gets upgraded, not when the client gets upgraded
  • nevertheless, based on all the info presented, even 4.17 clients will have the v2 endpoint, which should be supported by the kernel on client clusters

don't we need to inform the CSI running on the clients that we moved to v2

  • two changes will happen after the provider upgrade: new v2 endpoints and the mount option ms_mode=prefer-crc will be pushed to client clusters, and IIUC they can be changed on the fly without any issues for both existing and new workloads.

@parth-gr (Member) commented Oct 7, 2024

@parth-gr My question is a bit different, here are a few points to consider:

  1. Upstream we don't want to default to v2, only for downstream
  2. OCS operator (with this PR) is planning to set RequireMsgr2: true for external mode (it was already set for internal mode), so there is no need for Rook to automatically set requireMsgr2.
  3. The code you linked is perfect for upstream, but I think we need to add another check such that if RequireMsgr2: true, force the v2 port for the external mon endpoints.

For downstream we can use the RequireMsgr2: true info, but the problem is with extracting the data: how do we make sure the downstream Python script always scrapes the v2 port?

@travisn (Contributor) commented Oct 7, 2024

For downstream we can use the RequireMsgr2: true info, but the problem is with extracting the data: how do we make sure the downstream Python script always scrapes the v2 port?

To capture our discussion:

  • We will make the change in Rook to enforce the v2 port for external clusters if requiremsgr2 is true
  • The external python script will be changed in the downstream-only branch to only use the v2 port.

@malayparida2000 (Contributor Author)

For downstream we can use the RequireMsgr2: true info, but the problem is with extracting the data: how do we make sure the downstream Python script always scrapes the v2 port?

To capture our discussion:

  • We will make the change in Rook to enforce the v2 port for external clusters if requiremsgr2 is true
  • The external python script will be changed in the downstream-only branch to only use the v2 port.

From Slack, by Parth:

From the two changes we discussed, I think we don't need the Rook upstream change, as we always check cluster.Spec.RequireMsgr2(): https://github.com/rook/rook/blob/master/pkg/operator/ceph/cluster/cluster_external.go#L126
If it is set from the CR, it will still work with the same check.

@parth-gr this should mean this PR is no longer blocked from the external mode perspective, right?
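For context, a rough sketch of what an accessor like cluster.Spec.RequireMsgr2() boils down to, assuming Rook's Connections field layout (the exact Rook implementation may differ; the implicit cases reflect that encryption and compression only work over msgrv2):

    // RequireMsgr2 reports whether the cluster must use msgrv2-only
    // connections: either asked for explicitly, or implied by enabling
    // encryption or compression on the wire.
    func (c *ClusterSpec) RequireMsgr2() bool {
        conns := c.Network.Connections
        if conns == nil {
            return false
        }
        return conns.RequireMsgr2 ||
            (conns.Encryption != nil && conns.Encryption.Enabled) ||
            (conns.Compression != nil && conns.Compression.Enabled)
    }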

@parth-gr (Member) commented Oct 9, 2024

@malayparida2000 can we do some quick testing (end-to-end)?
And after this change, it surely won't block red-hat-storage/rook#748.

@malayparida2000 (Contributor Author)

@malayparida2000 can we do some quick testing (end-to-end)? And after this change, it surely won't block red-hat-storage/rook#748.

Sounds good to do a test, will do it.

@malayparida2000 (Contributor Author) commented Oct 14, 2024

@malayparida2000 can we do some quick testing (end-to-end)? And after this change, it surely won't block red-hat-storage/rook#748.

Sounds good to do a test, will do it.

Currently there is an infra issue with the availability of external mode setups from QE. In any case, the scenario we are concerned about is the behavior of the cluster upon upgrade from 4.17 to 4.18, which will be easier to test once this change is officially part of the build. I also have to document the whole process, so I will do the e2e testing of the flow after the PR is merged and part of the build.

@malayparida2000 (Contributor Author)

/unhold

@openshift-ci bot removed the do-not-merge/hold label on Oct 14, 2024.
@parth-gr (Member)

/lgtm

openshift-ci bot commented Oct 14, 2024

@parth-gr: changing LGTM is restricted to collaborators

@malayparida2000 (Contributor Author)

The PR now has LGTM from both the internal mode and external mode perspectives.
@nb-ohad / @leelavg Can you please ack LGTM from the provider-client mode perspective so we can finally merge this?

@leelavg (Contributor) commented Oct 15, 2024

LGTM, but since Ohad has an unresolved comment we need to wait for that to be resolved.

@malayparida2000 (Contributor Author)

Ohad will be on PTO till the 21st of Oct. This is an important change, and with further delay to its merge we are losing precious time we might need to make follow-up changes based on its results. This PR also blocks other stories, thereby blocking the Epic's progress. As we have acks from all PoVs, including provider mode from Leela, we should go ahead and merge this.
/cc @travisn

@openshift-ci bot requested a review from travisn on October 16, 2024 15:49.
@travisn (Contributor) commented Oct 16, 2024

/lgtm
/approve

@openshift-ci bot added the lgtm label (the PR is ready to be merged) on Oct 16, 2024.
openshift-ci bot commented Oct 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: malayparida2000, parth-gr, travisn

@openshift-ci bot added the approved label (the PR has been approved by an approver from all required OWNERS files) on Oct 16, 2024.
@malayparida2000 (Contributor Author)

/retest

Two similar /retest comments followed.

@malayparida2000 (Contributor Author)

/test ocs-operator-bundle-e2e-aws

@malayparida2000 (Contributor Author)

/retest

@iamniting (Member)

/override ci/prow/ocs-operator-bundle-e2e-aws

openshift-ci bot commented Oct 17, 2024

@iamniting: Overrode contexts on behalf of iamniting: ci/prow/ocs-operator-bundle-e2e-aws

@openshift-merge-bot openshift-merge-bot bot merged commit 38cbf3a into red-hat-storage:main Oct 17, 2024
11 checks passed
@rewantsoni (Member)

@malayparida2000 We missed rebasing it against main and running make deps-update; there are changes in metrics/vendor/ from this PR in my local checkout.

@malayparida2000 (Contributor Author)

@malayparida2000 We missed rebasing it against main and running make deps-update; there are changes in metrics/vendor/ from this PR in my local checkout.

I didn't understand; how does this PR affect the vendor folder?
