-
Notifications
You must be signed in to change notification settings - Fork 63
Resource manager overhaul #552
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
…il, MergeTaskResourceSpec return nil if low is not nil but all empty, ConstrainTaskResourceSpec before checking default requests, first merge limits with default limits in case default limits was empty, fix resources_test Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
…does it Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
…orking, node weirdness but don't think it's on the admin side. Signed-off-by: Yee Hing Tong <[email protected]>
Signed-off-by: Yee Hing Tong <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #552 +/- ##
==========================================
+ Coverage 58.78% 59.68% +0.90%
==========================================
Files 170 170
Lines 16055 13413 -2642
==========================================
- Hits 9438 8006 -1432
+ Misses 5783 4557 -1226
- Partials 834 850 +16
Flags with carried forward coverage won't be shown. Click here to find out more.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments - mostly just clarifications, but my two main concerns (probably unwarranted) are:
- dealing with explicit
0
values to denote that it does not exist, specifically useful for overriding the default limits - conversion between types (flyteidl vs admin vs etc). I think most of these functions can be greatly simplified if we operate over a singular type and covert before calling them.
@@ -87,12 +86,12 @@ var testCluster = "C1" | |||
var outputURI = "output uri" | |||
|
|||
var resourceDefaults = runtimeInterfaces.TaskResourceSet{ | |||
CPU: resource.MustParse("200m"), | |||
Memory: resource.MustParse("200Gi"), | |||
CPU: testutils.GetPtr(resource.MustParse("200m")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than adding a GetPtr
function everywhere can resource.MustParse
return a pointer? Or have another function resource.MustParsePtr
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is K8s code unf... don't want to wade through getting a change merged through that process. I can try though.
// then both should be limited by any platform limits. | ||
// anything 0 or empty is not set. | ||
// if both requests and limits end up empty, return nil. if one is empty, return nil for it | ||
func (m *ExecutionManager) getResources(ctx context.Context, taskResources *core.Resources, platformResources workflowengineInterfaces.TaskResources) *core.Resources { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic seems sound to me, but would love to see this simplified. For example, can we convert to a similar type and use a function like:
func UnionResources(resources...) (Resources) {
...
}
if !taskResourceRequirements.Defaults.EphemeralStorage.IsZero() || | ||
!taskResourceRequirements.Limits.EphemeralStorage.IsZero() || | ||
!platformTaskResources.Defaults.EphemeralStorage.IsZero() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have to check for cases where a user manually sets a resource limit to 0 to override the default limits? For example - flyteadmin configurations default limit to 5G Memory, if a user wants a task with unbounded Memory can they set limits(mem="0')
to override this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah not sure. So as of this PR, "0", is not treated specially until the very end. It's like any other number. It's at prepare_execution.go::addExecutionOverrides
that we check or zero and then don't apply them. Added a unit test to see this directly. Easy enough add logic to execution_manager.go
to not add resources if they're 0. What do you think?
if len(requestEntries) == 0 && len(limitEntries) == 0 { | ||
return nil | ||
} | ||
res := core.Resources{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the code this replaces (current lines 210-213) we init the Request
and Limits
to empty lists (ie. []*core.Resources_ResourceEntry{}
). I don't believe there is some chance for a nil pointer exception downstream if we do not do this right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we solve this with more testing. looked through the code and i don't think it's used. Also ran locally with no admin config, no matchable resources, and no task resources, and things seem to be running almost correctly. The pod is still being created with
resources:
limits:
cpu: "0"
memory: "0"
requests:
cpu: "0"
memory: "0"
but the fly crd has node resources
resources: {}
and has execution config task resources:
executionConfig:
Interruptible: null
MaxParallelism: 25
OverwriteCache: false
RecoveryExecution: {}
TaskPluginImpls: {}
TaskResources:
Limits:
CPU: "0"
EphemeralStorage: "0"
GPU: "0"
Memory: "0"
Storage: "0"
Requests:
CPU: "0"
EphemeralStorage: "0"
GPU: "0"
Memory: "0"
Storage: "0"
which is correct because these are non-nullable. and resources doesn't show up at all in the task template part of the task definition.
if downstream code can't handle a nil for some reason it should handle it there i feel.
Signed-off-by: Yee Hing Tong <[email protected]>
func (m *ExecutionManager) getResources(ctx context.Context, taskResources *core.Resources, platformResources workflowengineInterfaces.TaskResources) *core.Resources { | ||
|
||
// requests: coalesce(task request, platform default) | ||
// limits: coalesce(task limits, task requests) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO this only makes sense for non-compressible resources like memory and storage.
Compressible resources like cpu should not be "forced" to inherit limits from requests if not explicitly specified. At the very least the limit should be explicitly removable/nullable.
Personally I'm not in favor of "merging" these settings across hierarchies and I prefer that those are applied as a whole from the most specific level. Either way the important part is: it needs to be possible to unset specific settings (i.e. cpu limits), either by omission or by explicit null/0. |
Also I'm wondering if we should rather leverage k8s limit ranges instead of re-implementing all this in flyteadmin? |
Signed-off-by: Yee Hing Tong <[email protected]>
TL;DR
This change reworks the way task resources are handled in Admin. Currently the behavior:
CPU="1"
set at the project level for project X andMemory="1Gi"
set at the project/domain level (e.g. X-development), then the latter setting wipes out the first one entirely. With this change, both the CPU and Memory setting will be returned.http://localhost:8088/api/v1/project_attributes/flytesnacks?resource_type=0
) and when workflows are launched with tasks/containers that do not have resources set.Also,
What this PR does not change:
Guaranteed
pods are guaranteed only when requests and limits are specified for all the containers and they are equal." Please see the rejected IDL change for a little more discussion.Type
Are all requirements met?
Complete description
Tracking Issue
flyteorg/flyte#3574
flyteorg/flyte#3467
flyteorg/flyte#3399