-
Notifications
You must be signed in to change notification settings - Fork 30
[build-tools] [ENG-15188] Cancel hanging workflows #525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[build-tools] [ENG-15188] Cancel hanging workflows #525
Conversation
Functionality will be needed in `steps` and can be imported from there See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
Added `infoCallbackFn` to `spawnAsync.SpawnOptions` to allow for tracking when the process writes something at info level See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
Functionality will be needed in `steps` and can be imported from there See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
Added warn and kill timeouts to build step command execution See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
Improved readability by extracting timeouts to class fields and setting them up to a private method See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
sjchmiela
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some tests too?
packages/steps/src/BuildStep.ts
Outdated
| const BUILD_STEP_WARN_TIMEOUT_MS = 15 * 60 * 1000; | ||
| const BUILD_STEP_KILL_TIMEOUT_MS = 30 * 60 * 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this customizable per step, please? Like
steps:
- run: maestro test
error-no-logs: 30m # not sure about the param name or value format / parsing, maybe you can figure something. Maybe GitHub has something similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GHA has jobs.<job_id>.timeout-minutes and jobs.<job_id>.steps[*].timeout-minutes, although it's not the same as what we are doing since it only looks at total job/step execution time - even if the job/step is alive and printing, if it passes the timeout it gets cancelled. So in this case I think we shouldn't try to mimick GHA as it would be confusing to people familiar with it, maybe something like no-logs-timeout-minutes? WDYT?
packages/steps/src/BuildStep.ts
Outdated
| const ppid = nullthrows(spawnPromise.child.pid); | ||
| const pids = await getParentAndDescendantProcessPidsAsync(ppid); | ||
| pids.forEach((pid) => { | ||
| process.kill(pid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't killing child.pid sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be, but we do it the same way with Install Dependencies and Install Pods build steps and I seem to vaguely remember there was a reason for it, but I can't recall it now
packages/steps/src/BuildStep.ts
Outdated
| // stdin is /dev/null, std{out,err} are piped into logger. | ||
| stdio: ['ignore', 'pipe', 'pipe'], | ||
| }); | ||
| this.commandTimedOut = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect we might not need to assign commandTimedOut or commandWarnTimeout or commandKillTimeout on this. Is there a reason to do so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just readability, if all is defined within the function executing the command over half of the function body is taken by timeout handing, which is not the main purpose of the function. Having those as properties allows for extracting the timeout setup to a separate private method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should move this (setting up timeouts and catching-and-throwing-speciifc-error) to spawnAsync?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That might work. Are you aware of any cases where we use spawnAsync and might not want to have timeouts?
Co-authored-by: Stanisław Chmiela <[email protected]>
Co-authored-by: Stanisław Chmiela <[email protected]>
Moved the creation, refreshing, handling and removal of optional timeouts into `spawnAsync` itself, while the caller might, but does not have to, define if the timeouts are needed and if they are what their params are See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
| errorClass?: IErrorClass; | ||
| errorMessage?: string; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we need to customize errorClass and errorMessage? Couldn't we just pass in warnTimeoutMs and killTimeoutMs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to, if not passed a default error with a default message will be thrown. But I thought it might be useful to customise them depending on where the functionality is used (workflow command, build step, etc.) and have the appropriate message
| logger.warn( | ||
| noLogsTimeout.warn?.message ?? | ||
| SPAWN_WARN_TIMEOUT_DEFAULT_MESSAGE.replace( | ||
| '${minutes}', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this simpler.
| const errorMessage = | ||
| noLogsTimeout?.kill?.errorMessage ?? | ||
| SPAWN_KILL_TIMEOUT_DEFAULT_ERROR_MESSAGE.replace('${minutes}', spawnKillTimeout.toString()); | ||
| const ErrorClass = noLogsTimeout?.kill?.errorClass ?? Error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use some well-known error class for this? Maybe we want to use UserFacingError for this, or add a new subclass of UserFacingError in eas-build-job and use that?
| pipeSpawnOutput(logger, spawnPromise.child, options); | ||
| } | ||
| } | ||
| spawnResult = await spawnPromise; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think in some places we don't await the SpawnPromise by design. E.g. here.
Instead of try-catch and awaiting the spawnPromise, could we attach the catch as spawnPromise.catch(err => ...)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, doing it like this breaks one of the tests, and I'm not exactly sure why. Might be the .catch() interferes with the original implementation of spawnAsync from @expo/spawn-async. Investigating
Moved the creation of timeout messages to helper functions to unclutter the main function body See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
Removed the iterative killing of descendant pids See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
This reverts commit 138b2bb.
This reverts commit 3094b81.
Even though it's always set at this point, TS still thinks it can be undefined See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
Returning the SpawnPromise unawaited, as before, but with `catch` and `finally` callbacks attached See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
Added a new subclass of UserFacingError for spawn timeouts See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
Allowed for setting the timeout in minutes on the step definition See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
Inner promise from `@expo/spawn-async` rejecting caused the outer process to exit unexpectedly See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
Updated tests for StepsConfigParser checking for customizable timeouts See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
Why
https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
How
Added warn and kill timeouts for workflow command execution, which get reset after logs are printed, and removed after workflow step finishes. If they time out, first a warning is printed and then process gets killed. Similar solution works for builds
Test Plan
Automated tests pass
TODO: add more tests