Skip to content
This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Resource manager overhaul #552

Draft
wants to merge 19 commits into
base: master
Choose a base branch
from
Draft

Conversation

wild-endeavor
Copy link
Contributor

@wild-endeavor wild-endeavor commented Apr 14, 2023

TL;DR

This change reworks the way task resources are handled in Admin. Currently the behavior:

  • For matchable task resources, settings at different levels of specificity completely wipe out each other. That is, if you have a default of CPU="1" set at the project level for project X and Memory="1Gi" set at the project/domain level (e.g. X-development), then the latter setting wipes out the first one entirely. With this change, both the CPU and Memory setting will be returned.
  • This merging affect both the matchable resources endpoints (e.g. http://localhost:8088/api/v1/project_attributes/flytesnacks?resource_type=0) and when workflows are launched with tasks/containers that do not have resources set.

Also,

  • Defaults that were previously set in code, have been removed. (Another PR will need to be put in to remove any defaults in the Helm charts/sandbox environment.)

What this PR does not change:

  • Limits are still set to requests for all resources. This was done because having memory (and possibly other resource types) defaults not equal to limits results in oversubscription and then possibly strange/inconsistent eviction logic. From K8s docs: "Guaranteed pods are guaranteed only when requests and limits are specified for all the containers and they are equal." Please see the rejected IDL change for a little more discussion.

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

Tracking Issue

flyteorg/flyte#3574
flyteorg/flyte#3467
flyteorg/flyte#3399

Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
…il, MergeTaskResourceSpec return nil if low is not nil but all empty, ConstrainTaskResourceSpec before checking default requests, first merge limits with default limits in case default limits was empty, fix resources_test

Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
…orking, node weirdness but don't think it's on the admin side.

Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
@codecov
Copy link

codecov bot commented May 3, 2023

Codecov Report

Merging #552 (26ec924) into master (cca9e17) will increase coverage by 0.90%.
The diff coverage is 50.11%.

❗ Current head 26ec924 differs from pull request most recent head f6fa4ca. Consider uploading reports for the commit f6fa4ca to get more accurate results

@@            Coverage Diff             @@
##           master     #552      +/-   ##
==========================================
+ Coverage   58.78%   59.68%   +0.90%     
==========================================
  Files         170      170              
  Lines       16055    13413    -2642     
==========================================
- Hits         9438     8006    -1432     
+ Misses       5783     4557    -1226     
- Partials      834      850      +16     
Flag Coverage Δ
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/manager/impl/util/shared.go 69.07% <ø> (+3.55%) ⬆️
pkg/repositories/database.go 43.33% <ø> (+2.81%) ⬆️
pkg/repositories/gormimpl/resource_repo.go 56.20% <0.00%> (-7.67%) ⬇️
pkg/rpc/adminservice/base.go 3.88% <0.00%> (+0.49%) ⬆️
pkg/runtime/application_config_provider.go 18.18% <0.00%> (-15.16%) ⬇️
pkg/runtime/task_resource_provider.go 8.69% <0.00%> (-24.64%) ⬇️
pkg/manager/impl/util/resources.go 50.25% <47.87%> (-40.42%) ⬇️
pkg/manager/impl/resources/resource_manager.go 64.65% <56.39%> (+0.61%) ⬆️
pkg/workflowengine/impl/prepare_execution.go 85.88% <70.00%> (+0.88%) ⬆️
pkg/manager/impl/execution_manager.go 72.26% <92.30%> (+2.25%) ⬆️
... and 4 more

... and 144 files with indirect coverage changes

Copy link
Contributor

@hamersaw hamersaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments - mostly just clarifications, but my two main concerns (probably unwarranted) are:

  1. dealing with explicit 0 values to denote that it does not exist, specifically useful for overriding the default limits
  2. conversion between types (flyteidl vs admin vs etc). I think most of these functions can be greatly simplified if we operate over a singular type and covert before calling them.

@@ -87,12 +86,12 @@ var testCluster = "C1"
var outputURI = "output uri"

var resourceDefaults = runtimeInterfaces.TaskResourceSet{
CPU: resource.MustParse("200m"),
Memory: resource.MustParse("200Gi"),
CPU: testutils.GetPtr(resource.MustParse("200m")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than adding a GetPtr function everywhere can resource.MustParse return a pointer? Or have another function resource.MustParsePtr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is K8s code unf... don't want to wade through getting a change merged through that process. I can try though.

// then both should be limited by any platform limits.
// anything 0 or empty is not set.
// if both requests and limits end up empty, return nil. if one is empty, return nil for it
func (m *ExecutionManager) getResources(ctx context.Context, taskResources *core.Resources, platformResources workflowengineInterfaces.TaskResources) *core.Resources {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic seems sound to me, but would love to see this simplified. For example, can we convert to a similar type and use a function like:

func UnionResources(resources...) (Resources) {
    ...
}

Comment on lines -248 to -250
if !taskResourceRequirements.Defaults.EphemeralStorage.IsZero() ||
!taskResourceRequirements.Limits.EphemeralStorage.IsZero() ||
!platformTaskResources.Defaults.EphemeralStorage.IsZero() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to check for cases where a user manually sets a resource limit to 0 to override the default limits? For example - flyteadmin configurations default limit to 5G Memory, if a user wants a task with unbounded Memory can they set limits(mem="0') to override this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah not sure. So as of this PR, "0", is not treated specially until the very end. It's like any other number. It's at prepare_execution.go::addExecutionOverrides that we check or zero and then don't apply them. Added a unit test to see this directly. Easy enough add logic to execution_manager.go to not add resources if they're 0. What do you think?

if len(requestEntries) == 0 && len(limitEntries) == 0 {
return nil
}
res := core.Resources{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the code this replaces (current lines 210-213) we init the Request and Limits to empty lists (ie. []*core.Resources_ResourceEntry{}). I don't believe there is some chance for a nil pointer exception downstream if we do not do this right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we solve this with more testing. looked through the code and i don't think it's used. Also ran locally with no admin config, no matchable resources, and no task resources, and things seem to be running almost correctly. The pod is still being created with

    resources:
      limits:
        cpu: "0"
        memory: "0"
      requests:
        cpu: "0"
        memory: "0"

but the fly crd has node resources

resources: {}

and has execution config task resources:

executionConfig:
  Interruptible: null
  MaxParallelism: 25
  OverwriteCache: false
  RecoveryExecution: {}
  TaskPluginImpls: {}
  TaskResources:
    Limits:
      CPU: "0"
      EphemeralStorage: "0"
      GPU: "0"
      Memory: "0"
      Storage: "0"
    Requests:
      CPU: "0"
      EphemeralStorage: "0"
      GPU: "0"
      Memory: "0"
      Storage: "0"

which is correct because these are non-nullable. and resources doesn't show up at all in the task template part of the task definition.

if downstream code can't handle a nil for some reason it should handle it there i feel.

Signed-off-by: Yee Hing Tong <[email protected]>
func (m *ExecutionManager) getResources(ctx context.Context, taskResources *core.Resources, platformResources workflowengineInterfaces.TaskResources) *core.Resources {

// requests: coalesce(task request, platform default)
// limits: coalesce(task limits, task requests)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO this only makes sense for non-compressible resources like memory and storage.
Compressible resources like cpu should not be "forced" to inherit limits from requests if not explicitly specified. At the very least the limit should be explicitly removable/nullable.

@flixr
Copy link

flixr commented May 5, 2023

Personally I'm not in favor of "merging" these settings across hierarchies and I prefer that those are applied as a whole from the most specific level.
While I can also see the appeal of just overriding some settings in higher/more specific levels, IMHO this just makes it more implicit (hence more complicated and harder to understand) with only a small benefit.
But in the end this is not so important to me.

Either way the important part is: it needs to be possible to unset specific settings (i.e. cpu limits), either by omission or by explicit null/0.

@flixr
Copy link

flixr commented May 5, 2023

Also I'm wondering if we should rather leverage k8s limit ranges instead of re-implementing all this in flyteadmin?

Signed-off-by: Yee Hing Tong <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants