
bug: placement group should not be added for reservation #8220

Open
vsoch wants to merge 1 commit into main from reservation-placement-group-bug

Conversation

@vsoch (Contributor) commented Feb 14, 2025

Reservations are special cases: they are often created with placement already in mind, or have instances in different availability zones (or far enough apart) that adding a placement group automatically prevents provisioning a large set of resources. Changing the default behavior to always require the user to specify a placement group for EFA is overkill. A better balance: in the case that EFA is enabled and there is no placement group, when there is a reservation, do not add the group automatically, but issue a warning the user can choose to act on. TL;DR: when a user deploys a cluster via a reservation, they are responsible for adding the placement group.
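The proposed decision can be sketched as follows. This is a minimal, hypothetical Go sketch of the behavior described above; the `NodeGroup` struct and `shouldCreatePlacementGroup` function are illustrative names, not eksctl's actual types:

```go
// Sketch of the proposed behavior (hypothetical types and names, not
// eksctl's real implementation): when EFA is enabled and no placement
// group is configured, only auto-create one if there is no capacity
// reservation; with a reservation, warn and leave placement to the user.
package main

import "fmt"

type NodeGroup struct {
	EFAEnabled          bool
	PlacementGroup      string // empty means none configured
	CapacityReservation string // reservation ID; empty means none
}

// shouldCreatePlacementGroup reports whether a placement group should be
// added automatically, along with an optional warning for the user.
func shouldCreatePlacementGroup(ng NodeGroup) (create bool, warning string) {
	if !ng.EFAEnabled || ng.PlacementGroup != "" {
		return false, ""
	}
	if ng.CapacityReservation != "" {
		// Reserved instances may already be placed (or may span zones),
		// so forcing a placement group can block provisioning.
		return false, "EFA is enabled without a placement group; since a " +
			"capacity reservation is in use, none will be created automatically"
	}
	return true, ""
}

func main() {
	create, warning := shouldCreatePlacementGroup(NodeGroup{
		EFAEnabled:          true,
		CapacityReservation: "cr-0123456789abcdef0", // placeholder ID
	})
	fmt.Println(create, warning != "") // prints: false true
}
```

The key design point is that the reservation path degrades to a warning rather than an error, so existing configs keep working while users are told why no placement group appears.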

Description

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the userdocs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes
  • (Core team) Added labels for change area (e.g. area/nodegroup) and kind (e.g. kind/improvement)

BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯

  • Backfilled missing tests for code in same general area 🎉
  • Refactored something and made the world a better place 🌟

@bryantbiggs (Member) commented Feb 14, 2025

Reservations are zonal - in what scenario(s) are you trying to use reservations across multiple AZs?

@vsoch (Contributor, Author) commented Feb 14, 2025

@bryantbiggs we had reservations for 16 GPU nodes (each with 8 GPUs), and 11 of them (I guess) had to have been in a different zone. We would bring up the cluster asking for 16, but only got 11. The bug was that eksctl was creating the placement group and thus preventing the entire provisioning. It took us a few hours of waiting for clusters and trying different things before we finally noticed that, and I did a custom build of eksctl so we could move forward with the study. It was a loss of time (and money) that I'm hoping others don't need to face (for reservations you pay as soon as the clock starts ticking, regardless of whether you can use your resources), and I'd like a way to run experiments under this setup that doesn't require me to custom compile eksctl.

This is just one suggestion for a change, and it might not be the right or best way - if this case is rare enough, we could instead add an extra flag that explicitly says "do not automatically create the group." I'll add @bollig from AWS, who can give more detail on our issue if needed.

@vsoch force-pushed the reservation-placement-group-bug branch from 0c9cf9c to 947f898 on February 14, 2025 at 22:28
@bryantbiggs (Member) commented Feb 14, 2025

In your scenario, you had multiple reservations that you added to one node group, and these reservations were not in the same AZ?

@vsoch (Contributor, Author) commented Feb 14, 2025

No, I don't think so - we had one reservation ID. You can see it (and my notes) here: https://github.com/converged-computing/performance-study/blob/3feacac373f3b9f1170928e5a65d79e794a9b404/experiments/aws/eks/gpu/eks-config.yaml#L26-L34

@bryantbiggs (Member) commented Feb 14, 2025

Hmm, something seems off. With one reservation, and I see you have it limited to one AZ, there shouldn't have been any negative impact from a placement group as far as I am aware.

@vsoch (Contributor, Author) commented Feb 14, 2025

I can tell you with absolute, 100% certainty we only got 11 nodes when the placement group was forced, and @bollig can confirm. If you look up CASE 172324112900882 in your internal system, you'll see the conversation with your support team, who advised us to remove it (after which I did the custom build, because it was impossible otherwise). After that, it worked.

@bryantbiggs (Member) commented Feb 14, 2025

To be clear, I'm not disputing what you experienced. More importantly, I want to solve this issue in the correct location. Let me dig around internally on the placement group behavior a bit.

@vsoch (Contributor, Author) commented Feb 15, 2025

Sounds good - thanks @bryantbiggs ! And it could be this allocation was special for us (I don't know the details) and this issue is extremely unlikely to happen for anyone else. If that's the case, then the custom build is probably our best bet, and we can pray to the GPU gods that circumstances change in the future to make them easier to get! 😆

@dims (Contributor) commented Feb 15, 2025

@vsoch

