
bug: placement group should not be added for reservation #8220

Open
vsoch wants to merge 1 commit into main from reservation-placement-group-bug

Conversation

@vsoch (Contributor) commented Feb 14, 2025

Reservations are special cases: they are often created with placement already in mind, or have instances in different availability zones (or far enough apart) that adding a placement group automatically prevents provisioning a large set of resources. Changing the default behavior to always require the user to specify a placement group for EFA is overkill. A better balance: in the case that EFA is enabled and there is no placement group, when there is a reservation, do not add the group automatically, but issue a warning the user can choose to act on. TL;DR: when a user deploys a cluster via a reservation, they are responsible for adding the placement group.
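The proposed decision can be sketched as follows. This is a minimal, hypothetical Go sketch of the behavior described above; the `NodeGroup` struct and `shouldCreatePlacementGroup` function are illustrative names, not eksctl's actual types:

```go
// Sketch of the proposed behavior (hypothetical types and names, not
// eksctl's real implementation): when EFA is enabled and no placement
// group is configured, only auto-create one if there is no capacity
// reservation; with a reservation, warn and leave placement to the user.
package main

import "fmt"

type NodeGroup struct {
	EFAEnabled          bool
	PlacementGroup      string // empty means none configured
	CapacityReservation string // reservation ID; empty means none
}

// shouldCreatePlacementGroup reports whether a placement group should be
// added automatically, along with an optional warning for the user.
func shouldCreatePlacementGroup(ng NodeGroup) (create bool, warning string) {
	if !ng.EFAEnabled || ng.PlacementGroup != "" {
		return false, ""
	}
	if ng.CapacityReservation != "" {
		// Reserved instances may already be placed (or may span zones),
		// so forcing a placement group can block provisioning.
		return false, "EFA is enabled without a placement group; since a " +
			"capacity reservation is in use, none will be created automatically"
	}
	return true, ""
}

func main() {
	create, warning := shouldCreatePlacementGroup(NodeGroup{
		EFAEnabled:          true,
		CapacityReservation: "cr-0123456789abcdef0", // placeholder ID
	})
	fmt.Println(create, warning != "") // prints: false true
}
```

The key design point is that the reservation path degrades to a warning rather than an error, so existing configs keep working while users are told why no placement group appears.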

Description

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the userdocs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes
  • (Core team) Added labels for change area (e.g. area/nodegroup) and kind (e.g. kind/improvement)

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@bryantbiggs (Member) commented Feb 14, 2025

Reservations are zonal - in what scenario(s) are you trying to use reservations across multiple AZs?

@vsoch (Contributor, Author) commented Feb 14, 2025

@bryantbiggs we had reservations for 16 GPU nodes (each with 8 GPUs), and 11 of them (I guess) had to have been in a different zone. We would bring up the cluster asking for 16, but only got 11. The bug was that eksctl was creating the placement group and thus preventing the entire provisioning. It took us a few hours of waiting for clusters and trying different things before we finally noticed that, and I did a custom build of eksctl so we could move forward with the study. It was a loss of time (and money) that I'm hoping others don't need to face (for reservations you pay as soon as the clock starts ticking, regardless of whether you can use your resources), and I'd like a way to run experiments under this setup that doesn't require me to custom compile eksctl.

This is just one suggestion for a change, and it might not be the right or best way - if this case is rare enough, we could instead add an extra flag that explicitly says "do not automatically create the group." I'll add @bollig from AWS, who can give more detail on our issue if needed.

@vsoch force-pushed the reservation-placement-group-bug branch from 0c9cf9c to 947f898 on February 14, 2025 at 22:28
@bryantbiggs (Member) commented Feb 14, 2025

In your scenario, you had multiple reservations that you added to one node group, and these reservations were not in the same AZ?

@vsoch (Contributor, Author) commented Feb 14, 2025

No, I don't think so - we had one reservation ID. You can see it (and my notes) here: https://github.com/converged-computing/performance-study/blob/3feacac373f3b9f1170928e5a65d79e794a9b404/experiments/aws/eks/gpu/eks-config.yaml#L26-L34

@bryantbiggs (Member) commented Feb 14, 2025

Hmm, something seems off. With one reservation, and I see you have it limited to one AZ, there shouldn't have been any negative impact from a placement group as far as I am aware.

@vsoch (Contributor, Author) commented Feb 14, 2025

I can tell you with absolute, 100% certainty we only got 11 nodes when the placement group was forced, and @bollig can confirm. If you look up CASE 172324112900882 in your internal system, you'll see the conversation with your support team, who advised us to remove it (after which I did the custom build, because it was impossible otherwise). After that, it worked.

@bryantbiggs (Member) commented Feb 14, 2025

To be clear, I'm not disputing what you experienced. More importantly, I want to solve this issue in the correct location. Let me dig around internally on the placement group behavior a bit.

@vsoch (Contributor, Author) commented Feb 15, 2025

Sounds good - thanks @bryantbiggs ! And it could be this allocation was special for us (I don't know the details) and this issue is extremely unlikely to happen for anyone else. If that's the case, then the custom build is probably our best bet, and we can pray to the GPU gods that circumstances change in the future to make them easier to get! 😆

@dims (Contributor) commented Feb 15, 2025

@vsoch

