Moved windows workflow to Kubernetes hosted runner #18967

Merged
Eliasj42 merged 42 commits into main from windows-runner-staging-2
Nov 14, 2024

Conversation

Eliasj42
Contributor

@Eliasj42 Eliasj42 commented Oct 31, 2024

Moved the Windows MSVC workflow to a Kubernetes-hosted Windows runner. Also moved a file location from GCP to Azure.

Workflow running successfully: https://github.com/iree-org/iree/actions/runs/11827489914/job/32955730437

Addresses #18813: runs the Windows workflow on a self-hosted runner in Kubernetes, on a 32 vCPU machine. sccache still needs to be implemented.

The job currently takes about 1 hour. We may need to upgrade to a 64 vCPU machine, although implementing sccache should also save time.

ci-exactly: windows_x64_msvc

@Eliasj42 Eliasj42 added the infrastructure (Relating to build systems, CI, or testing) and platform/windows 🚪 (Windows-specific build, execution, benchmarking, and deployment) labels Oct 31, 2024
@Eliasj42 Eliasj42 self-assigned this Oct 31, 2024
Member

Please link logs of this running in the PR description.

Member

Thanks for linking https://github.com/iree-org/iree/actions/runs/11619486082/job/32359204800.

Here's what I'm looking for / seeing:

Timing

  • Total time: 57m
  • Time for checkout: 2m30s (good - GitHub's large hosted runners were taking 8-10 minutes sometimes)
  • Installing deps: around 1 minute total (fine, low priority to optimize, way better than the 20 minutes installing Visual Studio / MSVC before)
  • Build: 52 minutes
  • Test: 1 minute

Other misc things

.github/workflows/ci_windows_x64_msvc.yml
Comment on lines 60 to 61
- name: "Clean up build dir"
  run: Remove-Item -Path "C:\mnt\azure\build-windows" -Recurse -Force
Member

Why is this needed? Are these runners persistent?

Contributor Author

yes

Contributor Author

I run it at the beginning, though, because if the workflow fails, it never reaches a cleanup step at the end. Do you know how to add this to the process that runs after the workflow is completed, regardless of if it succeeds or not?

Member

Why? I thought our Linux cluster had ephemeral runners? Wouldn't the Windows cluster follow the same convention? (All of this should go on the tracking issue, PR description, etc. - I'm reading between the lines here when it would be much easier to review the design if it was documented)

Member

> Do you know how to add this to the process that runs after the workflow is completed, regardless of if it succeeds or not?

In the workflow file there are options: https://stackoverflow.com/questions/58858429/how-to-run-a-github-actions-step-even-if-the-previous-step-fails-while-still-f

For this sort of self-hosted runner setup though, it's usually safer to configure at the runner level, like https://github.com/iree-org/iree/blob/main/build_tools/github_actions/runner/config/hooks/post_job.sh (maybe ARC has a similar hook you can connect to)
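
As a minimal sketch of the workflow-file option, a cleanup step can carry `if: always()` so that it still runs when an earlier step fails. The job name, runner label, and build step below are placeholders rather than the actual contents of ci_windows_x64_msvc.yml; only the cleanup path is taken from the step under review.

```yaml
# Sketch only: job name, runner label, and the build step are placeholders.
jobs:
  windows_x64_msvc:
    runs-on: example-windows-runner  # hypothetical self-hosted runner label
    steps:
      - name: "Build and test"
        run: echo "build and test steps go here"

      # always() makes this step run whether the earlier steps succeeded,
      # failed, or were cancelled.
      - name: "Clean up build dir"
        if: always()
        # -ErrorAction SilentlyContinue is an addition so that a missing
        # directory does not fail the cleanup itself.
        run: Remove-Item -Path "C:\mnt\azure\build-windows" -Recurse -Force -ErrorAction SilentlyContinue
```

This only helps while the runner is still alive to execute later steps. If the runner itself goes away mid-job, no further steps run, which is one reason a runner-level hook is the safer place for cleanup.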

Collaborator

@saienduri saienduri Nov 2, 2024

@ScottTodd These are still ephemeral runners that scale up and down based on GitHub activity. Every runner that is spun up uses the same storage mount, which we clean up after every run. We use that mount because of some storage problems with the storage that the runners come up with. Because we clear the storage completely every time, this should be fine.

Member

Oh, the OS partition is ephemeral but the storage is persistent? Do multiple runners share the same storage disk then...? Can you elaborate on "some storage problems" somewhere (GitHub tracking issue would be best, but I'll settle for a PR review thread like here or one of our internal chats) so we can investigate?

Collaborator

This was initially for the nightly case, but yes, as we scale this up and move it to presubmit, we should move to an ephemeral storage mount, which is pretty easy to do in Kubernetes (in which case we remove the cleanup step as well). @Eliasj42 has more context on the storage problems, but I remember he tried an exhaustive list of things over a couple of days, which is when we decided to pivot to a mount.
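
For illustration, an ephemeral storage mount in Kubernetes is commonly expressed as an `emptyDir` volume in the pod spec. The fragment below is a hypothetical runner pod template, not the cluster's actual configuration: the image, the volume name, and the choice of `C:\mnt\azure` as the mount point are all assumptions.

```yaml
# Hypothetical runner pod template fragment; image and names are placeholders.
spec:
  nodeSelector:
    kubernetes.io/os: windows
  containers:
    - name: runner
      image: example.azurecr.io/windows-runner:latest  # placeholder image
      volumeMounts:
        - name: build-scratch
          mountPath: C:\mnt\azure  # assumed parent of the build directory
  volumes:
    - name: build-scratch
      # emptyDir starts empty for each pod and is deleted with the pod,
      # so no explicit cleanup step is needed between jobs.
      emptyDir: {}
```

Whether an `emptyDir` volume avoids the storage problems mentioned in this thread is not something the thread establishes, so it would need to be tried on the actual cluster.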

Member

Okay, but we'll be aiming for presubmit (pull_request) or at least postsubmit (push). Nightly already works today and is free, so this would "just" be speeding up that build and taking on some extra billing costs if we stop after just this PR.

Member

Postsubmit with a 1 hour build on a single runner may be reasonable to start putting mileage on this.
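
The presubmit and postsubmit distinction in this thread maps onto workflow triggers. A sketch of a postsubmit-only configuration, assuming the default branch is `main` (the branch this PR merged into), might look like the following. The triggers actually present in ci_windows_x64_msvc.yml are not shown in this conversation.

```yaml
on:
  # Postsubmit: run after each push to main.
  push:
    branches:
      - main
  # Presubmit: uncomment to also run on pull requests targeting main.
  # pull_request:
  #   branches:
  #     - main
```

Starting with `push` only keeps the 1 hour build off the critical path of every pull request while the runner accumulates some history.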

@Eliasj42 Eliasj42 force-pushed the windows-runner-staging-2 branch 3 times, most recently from ae985df to 38e56c4 Compare October 31, 2024 19:45
@ScottTodd
Member

Please also follow the policies at https://iree.dev/developers/general/contributing/

  • Follow the branch naming conventions

And general GitHub best practices:

  • Capitalize the first word in the PR title (and follow similar writing guidelines for the PR description)
  • Avoid force pushing once code review has started

@Eliasj42 Eliasj42 changed the title moved windows workflow to Kubernetes hosted runner Moved windows workflow to Kubernetes hosted runner Oct 31, 2024
Member

@ScottTodd ScottTodd left a comment

Thanks! This is much better now. If we have capacity, a 1 hour build might be okay to enable on presubmit. I'd like to draw the line at 20 minutes, but we do have some 20 minute build + 20 minute test workflows. Another 20 minutes total on top of that isn't too crazy.

Elias Joseph added 15 commits November 13, 2024 19:07
Signed-off-by: Elias Joseph <[email protected]>

Signed-off-by: Elias Joseph <[email protected]>

Signed-off-by: Elias Joseph <[email protected]>

added build dir env variables

Signed-off-by: Elias Joseph <[email protected]>

Signed-off-by: Elias Joseph <[email protected]>
Elias Joseph added 15 commits November 13, 2024 19:07
Signed-off-by: Elias Joseph <[email protected]>
Eliasj42 and others added 3 commits November 13, 2024 11:10
Signed-off-by: Elias Joseph <[email protected]>
…g/iree into windows-runner-staging-2

ask git why this merge is necessary I have no idea
Member

@ScottTodd ScottTodd left a comment

Can you share the latest logs of this workflow running, and put them in the PR description?

Elias Joseph added 2 commits November 13, 2024 23:01
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
Signed-off-by: Elias Joseph <[email protected]>
SCCACHE_AZURE_CONNECTION_STRING: "${{ secrets.AZURE_CCACHE_CONNECTION_STRING }}"
SCCACHE_AZURE_BLOB_CONTAINER: ccache-container
SCCACHE_CCACHE_ZSTD_LEVEL: 10
SCCACHE_AZURE_KEY_PREFIX: "ci_windows_x64_msvc"
Member

FYI looks like we'll need some extra setup to use sccache on Windows beyond just running the build_all.sh script.

Linux uses source ./build_tools/cmake/setup_sccache.sh. Can write a similar script for Windows or work something back into build_all.sh.

Fine to keep this here, but it does make it seem like caching is expected to work, when I think it won't. Something to look into as part of #18557
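
As a rough sketch of what the extra setup could involve, sccache is usually wired into a CMake build through the `CMAKE_C_COMPILER_LAUNCHER` and `CMAKE_CXX_COMPILER_LAUNCHER` variables. This is not the contents of setup_sccache.sh or build_all.sh. The Ninja generator, the `BUILD_DIR` value, and the use of a bash shell are assumptions; the `SCCACHE_AZURE_*` values are copied from the lines under review.

```yaml
steps:
  - name: "Configure and build with sccache (sketch)"
    shell: bash  # assumption: bash is available, as build_all.sh requires it
    env:
      SCCACHE_AZURE_CONNECTION_STRING: "${{ secrets.AZURE_CCACHE_CONNECTION_STRING }}"
      SCCACHE_AZURE_BLOB_CONTAINER: ccache-container
      SCCACHE_AZURE_KEY_PREFIX: "ci_windows_x64_msvc"
      BUILD_DIR: build-windows  # hypothetical build directory
    run: |
      # Start the cache server and reset counters so the stats reflect this job.
      sccache --start-server
      sccache --zero-stats
      # The launcher variables make CMake prefix each compiler call with sccache.
      cmake -G Ninja -B "${BUILD_DIR}" -S . \
        -DCMAKE_C_COMPILER_LAUNCHER=sccache \
        -DCMAKE_CXX_COMPILER_LAUNCHER=sccache
      cmake --build "${BUILD_DIR}"
      # Hit and miss counts show whether caching is actually working.
      sccache --show-stats
```

Checking the `sccache --show-stats` output in the logs is a quick way to confirm that caching works rather than assuming it from the environment variables alone. MSVC builds may also need debug information embedded in object files (`/Z7` instead of `/Zi`) before compilations become cacheable.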

@Eliasj42 Eliasj42 merged commit a70ea83 into main Nov 14, 2024
27 of 28 checks passed
@Eliasj42 Eliasj42 deleted the windows-runner-staging-2 branch November 14, 2024 01:05
Labels
infrastructure Relating to build systems, CI, or testing platform/windows 🚪 Windows-specific build, execution, benchmarking, and deployment
3 participants