K8s 1.30 provider slim s390x #1252

Merged
kubevirt-bot merged 10 commits into kubevirt:main on Nov 8, 2024


chandramerla
Contributor

@chandramerla chandramerla commented Aug 21, 2024

What this PR does / why we need it:
Enables s390x support in the Kubernetes 1.30 provider.
This is required to run s390x-based tests that spin up a k8s cluster using the k8s provider.

Special notes for your reviewer:
An earlier PR (created for the 1.28 provider), whose review comments are addressed in this PR: #1201
The prow job changes required to publish the provider are in kubevirt/project-infra#3566

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note:

NONE

@kubevirt-bot kubevirt-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels Aug 21, 2024
@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 21, 2024
@kubevirt-bot
Contributor

Hi @chandramerla. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@dhiller
Contributor

dhiller commented Aug 21, 2024

/cc

@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 22, 2024
@chandramerla chandramerla marked this pull request as ready for review August 22, 2024 05:10
@kubevirt-bot kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 22, 2024
@chandramerla
Contributor Author

I ran publish.sh inside the containers used by the kubevirtci prow jobs that build and publish the gocli, centos9, k8s provider, and alpine container images. With the publish.sh changes in this PR and the prow job changes in kubevirt/project-infra#3566, I successfully published multi-arch manifest lists for all of these images to icr.io (I don't have write access to quay.io). The published manifest lists include s390x images for gocli, centos9, and k8s-1.30 (slim).
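The check that a published manifest list actually contains an s390x entry could be sketched roughly as below. This is a hypothetical helper, not part of publish.sh: it assumes the raw manifest list JSON (for example, as printed by `docker manifest inspect`) is already available as a string, and it relies only on the standard `manifests[].platform.architecture` layout shared by Docker manifest lists and OCI image indexes. The function name, the sample document, and the digests are made up for illustration.

```python
import json
from typing import Set


def manifest_architectures(manifest_list_json: str) -> Set[str]:
    """Return the CPU architectures listed in a manifest list / OCI image index."""
    doc = json.loads(manifest_list_json)
    archs = set()
    for entry in doc.get("manifests", []):
        # Each entry describes one per-platform image in the list.
        arch = entry.get("platform", {}).get("architecture")
        if arch:
            archs.add(arch)
    return archs


# Illustrative input only: the digests are placeholders, not real images.
SAMPLE = json.dumps({
    "schemaVersion": 2,
    "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json",
    "manifests": [
        {"digest": "sha256:aaaa", "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:bbbb", "platform": {"architecture": "s390x", "os": "linux"}},
    ],
})

if __name__ == "__main__":
    archs = manifest_architectures(SAMPLE)
    print("s390x present:", "s390x" in archs)
```

Running this against the real JSON for each published image would give a quick yes/no on whether the s390x variant made it into the list.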

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 23, 2024
@chandramerla chandramerla force-pushed the k8s-1.30-provider-slim-s390x branch 3 times, most recently from 6746500 to 173b4bd Compare August 23, 2024 15:17
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 24, 2024
@chandramerla chandramerla force-pushed the k8s-1.30-provider-slim-s390x branch 2 times, most recently from 4f24d4a to 7289dd0 Compare August 25, 2024 04:58
@chandramerla
Contributor Author

/sig ci

@kubevirt-bot kubevirt-bot added the sig/ci Denotes an issue or PR as being related to sig-ci, marks changes to the CI system. label Aug 26, 2024
@brianmcarey
Member

/test ?

@kubevirt-bot
Contributor

@brianmcarey: The following commands are available to trigger required jobs:

  • /test check-provision-alpine-with-test-tooling
  • /test check-provision-centos-base
  • /test check-provision-k8s-1.29
  • /test check-provision-k8s-1.30
  • /test check-provision-k8s-1.31
  • /test check-provision-manager
  • /test check-up-kind-ovn
  • /test check-up-kind-sriov

The following commands are available to trigger optional jobs:

  • /test check-gocli
  • /test check-up-kind-1.27-vgpu
  • /test check-up-kind-1.30-vgpu

Use /test all to run the following jobs that were automatically triggered:

  • check-provision-alpine-with-test-tooling
  • check-provision-k8s-1.29
  • check-provision-k8s-1.30
  • check-provision-k8s-1.31
  • check-provision-manager
  • check-up-kind-1.27-vgpu
  • check-up-kind-1.30-vgpu
  • check-up-kind-sriov

In response to this:

/test ?


@brianmcarey
Member

/test check-provision-k8s-1.30
/test check-gocli

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 7, 2024
@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 5, 2024
@dhiller
Contributor

dhiller commented Nov 5, 2024

@chandramerla would you mind taking a look at the failing check-provision lanes please?

@chandramerla
Contributor Author

> @chandramerla would you mind taking a look at the failing check-provision lanes please?

@dhiller Thanks for the labels.
I'm seeing all the check-provision jobs (even outside this PR) failing on the same sig-network KubeVirt conformance tests.

The common error across all those tests is:

/usr/bin/virt-chroot create-tap --tap-name tap0 --uid 107 --gid 107 --queue-number 0 --mtu 1480, err: exit status 1, output: Error: failed to create tap device named tap0. Reason: [0]: operation not permitted

This has been happening since last evening/night; before that I had not seen these sig-network failures.
I don't have much of a clue here, so I was trying to reach @brianmcarey about it.

@dhiller
Contributor

dhiller commented Nov 5, 2024

> > @chandramerla would you mind taking a look at the failing check-provision lanes please?
>
> @dhiller Thanks for the labels. I'm seeing all the check-provision jobs (even outside this PR) failing on the same sig-network KubeVirt conformance tests.
>
> The common error across all those tests is:
>
> /usr/bin/virt-chroot create-tap --tap-name tap0 --uid 107 --gid 107 --queue-number 0 --mtu 1480, err: exit status 1, output: Error: failed to create tap device named tap0. Reason: [0]: operation not permitted
>
> This has been happening since last evening/night; before that I had not seen these sig-network failures. I don't have much of a clue here, so I was trying to reach @brianmcarey about it.

@EdDev @ormergi do you have any idea why this is happening? Why is the tap creation failing with an "operation not permitted" error?

@dhiller
Contributor

dhiller commented Nov 5, 2024

What we noticed is that the last change to kubevirtci happened on 29th Oct, whereas the failures only started yesterday.

@dhiller
Contributor

dhiller commented Nov 5, 2024

So it seems that some change in kubevirt itself might be responsible here.

@kubevirt-commenter-bot

/retest-required
This bot automatically retries required jobs that failed/flaked on approved PRs.
Silence the bot with an /lgtm cancel or /hold comment for consistent failures.


@EdDev
Member

EdDev commented Nov 6, 2024

> > The common error across all those tests is:
> >
> > /usr/bin/virt-chroot create-tap --tap-name tap0 --uid 107 --gid 107 --queue-number 0 --mtu 1480, err: exit status 1, output: Error: failed to create tap device named tap0. Reason: [0]: operation not permitted
> >
> > This has been happening since last evening/night; before that I had not seen these sig-network failures. I don't have much of a clue here, so I was trying to reach @brianmcarey about it.
>
> @EdDev @ormergi do you have any idea why this is happening? Why is the tap creation failing with an "operation not permitted" error?

I do not know what has changed to trigger this, but I do not think it is related to the k/k project.

The permission problem can be caused by SELinux or cgroups; these are the causes I have seen in the past.
I suggest you run this one on the kubevirt/kubevirt project.
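As a rough first step for the SELinux hypothesis above, the sketch below reports whether SELinux is enforcing on a node. It is only an illustration: the helper name is hypothetical, it assumes selinuxfs is mounted at the usual `/sys/fs/selinux` location (where the `enforce` file holds `1` or `0`), and it says nothing about cgroups or about which policy rule, if any, denied the tap creation.

```python
from pathlib import Path


def selinux_mode(enforce_file: str = "/sys/fs/selinux/enforce") -> str:
    """Return 'enforcing', 'permissive', or 'disabled' based on the selinuxfs status file."""
    path = Path(enforce_file)
    if not path.exists():
        # No selinuxfs entry: SELinux is disabled or selinuxfs is not mounted.
        return "disabled"
    return "enforcing" if path.read_text().strip() == "1" else "permissive"


if __name__ == "__main__":
    print("SELinux mode:", selinux_mode())
```

If the node reports enforcing, re-running the failing lane with the node in permissive mode would be one way to confirm or rule out SELinux as the cause.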


@ormergi
Contributor

ormergi commented Nov 6, 2024

I noticed the new provider uses a generic cloud image instead of a Vagrant box for the cluster node OS.
Maybe it's worth trying with a Vagrant box to rule this out as the issue?

Also, I saw that the kernel args are slightly different. For example, the s390x provider kernel args don't specify nopti; maybe that causes some slowness when the tap device is created with SELinux involved.
https://github.com/kubevirt/kubevirtci/blob/5481bd92fa3804fad0f14821cda8aff5ae209f24/cluster-provision/centos9/scripts/kernel.s390x.args
https://github.com/kubevirt/kubevirtci/blob/5481bd92fa3804fad0f14821cda8aff5ae209f24/cluster-provision/centos9/scripts/kernel.args

@brianmcarey
Member

> I noticed the new provider uses a generic cloud image instead of a Vagrant box for the cluster node OS. Maybe it's worth trying with a Vagrant box to rule this out as the issue?
>
> Also, I saw that the kernel args are slightly different. For example, the s390x provider kernel args don't specify nopti; maybe that causes some slowness when the tap device is created with SELinux involved.
> https://github.com/kubevirt/kubevirtci/blob/5481bd92fa3804fad0f14821cda8aff5ae209f24/cluster-provision/centos9/scripts/kernel.s390x.args
> https://github.com/kubevirt/kubevirtci/blob/5481bd92fa3804fad0f14821cda8aff5ae209f24/cluster-provision/centos9/scripts/kernel.args

Hi @ormergi, thanks for taking a look at this. We have been seeing this on all kubevirtci PRs since the 4th, for example on a simple PR: #1317

@ormergi
Contributor

ormergi commented Nov 6, 2024

> Hi @ormergi, thanks for taking a look at this. We have been seeing this on all kubevirtci PRs since the 4th, for example on a simple PR: #1317

Thanks for the heads-up.
I see the last change was the CentOS bump #1314. It's not the first time we have seen issues with such changes, so let's try reverting it and see if things stabilize. WDYT?

@brianmcarey
Member

> > Hi @ormergi, thanks for taking a look at this. We have been seeing this on all kubevirtci PRs since the 4th, for example on a simple PR: #1317
>
> Thanks for the heads-up. I see the last change was the CentOS bump #1314. It's not the first time we have seen issues with such changes, so let's try reverting it and see if things stabilize. WDYT?

OK, but these lanes all passed when run against that PR. Since these lanes take the nightly builds of kubevirt and this started on Nov 4th, I was wondering whether some change in kubevirt on the 3rd could have caused this.


@brianmcarey
Member

/hold

To avoid needless retesting.

@kubevirt-bot kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 7, 2024
@brianmcarey
Member

/hold cancel

/retest

@kubevirt-bot kubevirt-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 8, 2024
@kubevirt-bot kubevirt-bot merged commit d4d136b into kubevirt:main Nov 8, 2024
11 checks passed
kubevirt-bot added a commit to kubevirt-bot/kubevirt that referenced this pull request Nov 8, 2024
[d4d136b K8s 1.30 provider slim s390x](kubevirt/kubevirtci#1252)
[6c4fa8f crio: pin cri-o to previous minor version for each provider](kubevirt/kubevirtci#1320)
[12789ee kind-1.28: add conformance.json](kubevirt/kubevirtci#1286)

```release-note
NONE
```

Signed-off-by: kubevirt-bot <[email protected]>
kubevirt-bot added a commit to kubevirt-bot/kubevirt that referenced this pull request Nov 9, 2024
[d4d136b K8s 1.30 provider slim s390x](kubevirt/kubevirtci#1252)
[6c4fa8f crio: pin cri-o to previous minor version for each provider](kubevirt/kubevirtci#1320)
[12789ee kind-1.28: add conformance.json](kubevirt/kubevirtci#1286)

```release-note
NONE
```

Signed-off-by: kubevirt-bot <[email protected]>
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. sig/ci Denotes an issue or PR as being related to sig-ci, marks changes to the CI system. size/XL
8 participants