kubevirt,vgpu: Bump vgpu lanes to use new kind-1.30-vgpu provider #3499

Open
wants to merge 1 commit into
base: main

Conversation

brianmcarey
Member

What this PR does / why we need it:

The vgpu lanes currently test against Kubernetes v1.27, which is no longer supported on the main branch of KubeVirt.

Update these lanes to use the new provider kind-1.30-vgpu.

This requires the following PR to be merged into kubevirt first: kubevirt/kubevirt#12244

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

/cc @dhiller @xpivarc

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note:


@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. size/XS labels Jun 28, 2024
@brianmcarey
Member Author

/hold

as mentioned in the description - this requires kubevirt/kubevirt#12244 to be merged.

@kubevirt-bot kubevirt-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 28, 2024
@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-kind-1.30-vgpu

You can trigger rehearsal for all jobs by commenting either /rehearse or /rehearse all
on this PR.

For a specific PR you can comment /rehearse {job-name}.

For a list of jobs that you can rehearse you can comment /rehearse ?.

@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-kind-1.30-vgpu

@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-kind-1.30-vgpu

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 12, 2024
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 15, 2024
@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-kind-1.30-vgpu

@dhiller
Contributor

dhiller commented Jul 16, 2024

@brianmcarey so I figure that, given the still-flaky history of the lane, we are not there yet?

@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-kind-1.30-vgpu

@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-kind-1.30-vgpu

@xpivarc
Member

xpivarc commented Aug 13, 2024

@brianmcarey is this ready?

@brianmcarey
Member Author

> @brianmcarey is this ready?

No, I'm still seeing intermittent failures on cluster-up. I haven't had a chance to come back to this yet.

@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-kind-1.30-vgpu

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 13, 2024
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 29, 2024
@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-kind-1.30-vgpu

Contributor

@dhiller dhiller left a comment

/approve

@brianmcarey should we repeatedly run rehearse, so we get more feedback here, or do you think this is not required?

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Aug 29, 2024
@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhiller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 29, 2024
@brianmcarey
Member Author

> /approve
>
> @brianmcarey should we repeatedly run rehearse, so we get more feedback here, or do you think this is not required?

Yes, I still think it's not stable.

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-kind-1.30-vgpu

@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-kind-1.30-vgpu

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 11, 2024
The vgpu lanes currently test against Kubernetes v1.27, which is
no longer supported on the main branch of KubeVirt.

Update these lanes to use the new provider kind-1.30-vgpu.

This requires the following PR to be merged into kubevirt first:
kubevirt/kubevirt#12244

Signed-off-by: Brian Carey <bcarey@redhat.com>
@kubevirt-bot kubevirt-bot removed the lgtm Indicates that a PR is ready to be merged. label Sep 12, 2024
@kubevirt-bot
Contributor

New changes are detected. LGTM label has been removed.

@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 12, 2024
@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-kind-1.30-vgpu

@brianmcarey
Member Author

/rehearse

@kubevirt-bot
Contributor

Rehearsal jobs created for this PR:

rehearsal-pull-kubevirt-e2e-kind-1.30-vgpu

@kubevirt-bot
Contributor

@brianmcarey: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-project-infra-prow-deploy-test | 959e381 | link | true | /test pull-project-infra-prow-deploy-test |
| rehearsal-pull-kubevirt-e2e-kind-1.30-vgpu | 959e381 | link | unknown | /test pull-kubevirt-e2e-kind-1.30-vgpu |

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@brianmcarey
Member Author

brianmcarey commented Sep 17, 2024

I have been looking into the failures with this new provider. About 50% of the time, the kind cluster fails to create because some cgroups are missing in the kind node container. I added a retry around the kind cluster create step in the provider, and with it the provider comes up successfully every time. I haven't been able to identify why these cgroups are present on some runs and missing on others; there may be an issue with the mount that kind does on /sys.

For now it is probably good enough to just add the retry in the provider.
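For illustration only, here is a minimal Python sketch of the retry idea described above. It is not the actual provider change: the function name `create_with_retry`, the attempt count, the delay, and the cleanup step are all assumptions, and the cluster name `vgpu` in the commented example is only inferred from the `vgpu-control-plane` node name in the logs.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def create_with_retry(
    create_cmd: Sequence[str],
    cleanup_cmd: Optional[Sequence[str]] = None,
    attempts: int = 3,
    delay_seconds: float = 5.0,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run create_cmd until it exits 0 and return the attempt number that succeeded.

    Raises RuntimeError if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = run(list(create_cmd), check=False)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Assumption: a failed create can leave a half-created cluster
            # behind, so optionally clean up before the next attempt.
            if cleanup_cmd is not None:
                run(list(cleanup_cmd), check=False)
            sleep(delay_seconds)
    raise RuntimeError(
        f"command failed after {attempts} attempts: {' '.join(create_cmd)}"
    )


# Hypothetical invocation (requires the kind CLI on PATH):
# create_with_retry(
#     ["kind", "create", "cluster", "--name", "vgpu"],
#     cleanup_cmd=["kind", "delete", "cluster", "--name", "vgpu"],
# )
```

The `run` and `sleep` parameters exist only so the loop can be exercised without a real kind installation. A retry like this hides the missing-cgroup problem rather than fixing it.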

Welcome to Debian GNU/Linux 12 (bookworm)!

Failed to attach 1 to compat systemd cgroup /init.scope: No such file or directory
Couldn't move remaining userspace processes, ignoring: Input/output error
Queued start job for default target graphical.target.
-.slice: Failed to migrate controller cgroups from /, ignoring: Input/output error
[  OK  ] Created slice kubelet.slic… used to run Kubernetes / Kubelet.
[  OK  ] Created slice system-modpr…lice - Slice /system/modprobe.
[  OK  ] Started systemd-ask-passwo…quests to Console Directory Watch.
[  OK  ] Set up automount proc-sys-…rmats File System Automount Point.
[  OK  ] Reached target cryptsetup.…get - Local Encrypted Volumes.
[  OK  ] Reached target integrityse…Local Integrity Protected Volumes.
[  OK  ] Reached target paths.target - Path Units.
[  OK  ] Reached target slices.target - Slice Units.
[  OK  ] Reached target swap.target - Swaps.
[  OK  ] Reached target veritysetup… - Local Verity Protected Volumes.
[  OK  ] Listening on systemd-journ…socket - Journal Audit Socket.
[  OK  ] Listening on systemd-journ…t - Journal Socket (/dev/log).
[  OK  ] Listening on systemd-journald.socket - Journal Socket.
[  OK  ] Reached target sockets.target - Socket Units.
Failed to attach 146 to compat systemd cgroup /dev-hugepages.mount: No such file or directory
         Mounting dev-hugepages.mount - Huge Pages File System...
Failed to attach 146 to compat systemd cgroup /dev-hugepages.mount: No such file or directory
Failed to attach 147 to compat systemd cgroup /sys-kernel-debug.mount: No such file or directory
         Mounting sys-kernel-debug.… - Kernel Debug File System...
Failed to attach 147 to compat systemd cgroup /sys-kernel-debug.mount: No such file or directory
Failed to attach 149 to compat systemd cgroup /sys-kernel-tracing.mount: No such file or directory
         Mounting sys-kernel-tracin… - Kernel Trace File System...
Failed to attach 149 to compat systemd cgroup /sys-kernel-tracing.mount: No such file or directory
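To tell the failing runs apart from the good ones, a small helper can pull the missing cgroup paths out of a node's boot log. This is a hypothetical triage sketch, not part of the provider; the function name `missing_cgroups` is made up here, and the pattern is based only on the message format visible in the log above.

```python
import re
from typing import List

# Matches lines such as:
#   Failed to attach 146 to compat systemd cgroup /dev-hugepages.mount: No such file or directory
ATTACH_FAILURE = re.compile(
    r"Failed to attach (?P<pid>\d+) to compat systemd cgroup (?P<cgroup>\S+): "
    r"No such file or directory"
)


def missing_cgroups(boot_log: str) -> List[str]:
    """Return the distinct cgroup paths systemd failed to attach to, in first-seen order."""
    seen: List[str] = []
    for match in ATTACH_FAILURE.finditer(boot_log):
        cgroup = match.group("cgroup")
        if cgroup not in seen:
            seen.append(cgroup)
    return seen
```

An empty result means the boot log does not contain this particular failure signature.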

@xpivarc
Member

xpivarc commented Sep 17, 2024

@brianmcarey Have you seen the issues on cgroupv2?

@brianmcarey
Member Author

brianmcarey commented Sep 17, 2024

> @brianmcarey Have you seen the issues on cgroupv2?

Trying locally with cgroups v2, we hit a different issue:

Sep 17 14:57:16 vgpu-control-plane systemd[1]: Stopped kubelet.service - kubelet: The Kubernetes Node Agent.
Sep 17 14:57:16 vgpu-control-plane systemd[1]: Starting kubelet.service - kubelet: The Kubernetes Node Agent...
Sep 17 14:57:16 vgpu-control-plane sh[489]: ERROR: this script needs /sys/fs/cgroup/cgroup.procs to be empty (for writing the top-level cgroup.subtree_control)

https://github.com/kubernetes-sigs/kind/blob/52394ea8a92eed848d086318e983697f4a5afa93/images/base/files/kind/bin/create-kubelet-cgroup-v2.sh#L26
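As a rough way to inspect the condition that error message describes, the sketch below reports whether a directory looks like a cgroup v2 hierarchy and which PIDs are listed in its top-level `cgroup.procs`. It is an illustrative assumption, not code from kind: the function names are hypothetical, and the version check relies on the common heuristic that a `cgroup.controllers` file exists at the root of a v2 hierarchy (a hybrid setup would be reported as v1).

```python
from pathlib import Path
from typing import List

DEFAULT_ROOT = Path("/sys/fs/cgroup")


def cgroup_version(root: Path = DEFAULT_ROOT) -> int:
    """Return 2 if root looks like a cgroup v2 unified hierarchy, else 1."""
    return 2 if (root / "cgroup.controllers").is_file() else 1


def top_level_pids(root: Path = DEFAULT_ROOT) -> List[int]:
    """Return the PIDs listed in root/cgroup.procs (empty if the file is missing)."""
    procs = root / "cgroup.procs"
    if not procs.is_file():
        return []
    return [int(pid) for pid in procs.read_text().split()]
```

Run inside the node container, a non-empty `top_level_pids()` on a v2 hierarchy would correspond to the state the script refuses to work with.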

@brianmcarey
Member Author

With the following applied, the lane passes every time:
https://github.com/kubevirt/kubevirtci/pull/1275/files

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/XS
4 participants