Skip to content

Conversation

@radoslawkrzemien
Copy link
Contributor

@radoslawkrzemien radoslawkrzemien commented Apr 4, 2025

Why

https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows

How

Added warn and kill timeouts for workflow command execution, which get reset after logs are printed, and removed after workflow step finishes. If they time out, first a warning is printed and then process gets killed. Similar solution works for builds

Test Plan

Automated tests pass
TODO: add more tests

@linear
Copy link

linear bot commented Apr 4, 2025

Improved readability by extracting timeouts to class fields and setting them up to a private method

See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
@radoslawkrzemien radoslawkrzemien marked this pull request as ready for review April 8, 2025 09:45
Copy link
Contributor

@sjchmiela sjchmiela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some tests too?

Comment on lines 63 to 64
const BUILD_STEP_WARN_TIMEOUT_MS = 15 * 60 * 1000;
const BUILD_STEP_KILL_TIMEOUT_MS = 30 * 60 * 1000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this customizable per step, please? Like

steps:
  - run: maestro test
    error-no-logs: 30m # not sure about the param name or value format / parsing, maybe you can figure something. Maybe GitHub has something similar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into it

Copy link
Contributor Author

@radoslawkrzemien radoslawkrzemien Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GHA has jobs.<job_id>.timeout-minutes and jobs.<job_id>.steps[*].timeout-minutes, although it's not the same as what we are doing since it only looks at total job/step execution time - even if the job/step is alive and printing, if it passes the timeout it gets cancelled. So in this case I think we shouldn't try to mimick GHA as it would be confusing to people familiar with it, maybe something like no-logs-timeout-minutes? WDYT?

const ppid = nullthrows(spawnPromise.child.pid);
const pids = await getParentAndDescendantProcessPidsAsync(ppid);
pids.forEach((pid) => {
process.kill(pid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't killing child.pid sufficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be, but we do it the same way with Install Dependencies and Install Pods build steps and I seem to vaguely remember there was a reason for it, but I can't recall it now

// stdin is /dev/null, std{out,err} are piped into logger.
stdio: ['ignore', 'pipe', 'pipe'],
});
this.commandTimedOut = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we might not need to assign commandTimedOut or commandWarnTimeout or commandKillTimeout on this. Is there a reason to do so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just readability, if all is defined within the function executing the command over half of the function body is taken by timeout handing, which is not the main purpose of the function. Having those as properties allows for extracting the timeout setup to a separate private method

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should move this (setting up timeouts and catching-and-throwing-speciifc-error) to spawnAsync?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might work. Are you aware of any cases where we use spawnAsync and might not want to have timeouts?

radoslawkrzemien and others added 3 commits April 8, 2025 13:57
Moved the creation, refreshing, handling and removal of optional timeouts into `spawnAsync` itself, while the caller might, but does not have to, define if the timeouts are needed and if they are what their params are

See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
Comment on lines +25 to +26
errorClass?: IErrorClass;
errorMessage?: string;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we need to customize errorClass and errorMessage? Couldn't we just pass in warnTimeoutMs and killTimeoutMs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to, if not passed a default error with a default message will be thrown. But I thought it might be useful to customise them depending on where the functionality is used (workflow command, build step, etc.) and have the appropriate message

logger.warn(
noLogsTimeout.warn?.message ??
SPAWN_WARN_TIMEOUT_DEFAULT_MESSAGE.replace(
'${minutes}',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this simpler.

const errorMessage =
noLogsTimeout?.kill?.errorMessage ??
SPAWN_KILL_TIMEOUT_DEFAULT_ERROR_MESSAGE.replace('${minutes}', spawnKillTimeout.toString());
const ErrorClass = noLogsTimeout?.kill?.errorClass ?? Error;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use some well-known error class for this? Maybe we want to use UserFacingError for this, or add a new subclass of UserFacingError in eas-build-job and use that?

pipeSpawnOutput(logger, spawnPromise.child, options);
}
}
spawnResult = await spawnPromise;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think in some places we don't await the SpawnPromise by design. E.g. here.

Instead of try-catch and awaiting the spawnPromise, could we attach the catch as spawnPromise.catch(err => ...)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, doing it like this breaks one of the tests, and I'm not exactly sure why. Might be the .catch() interferes with the original implementation of spawnAsync from @expo/spawn-async. Investigating

Moved the creation of timeout messages to helper functions to unclutter the main function body

See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
Returning the SpawnPromise unawaited, as before, but with `catch` and `finally` callbacks attached

See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
Inner promise from `@expo/spawn-async` rejecting caused the outer process to exit unexpectedly

See: https://linear.app/expo/issue/ENG-15188/replicate-the-cancel-hanging-build-functionality-for-workflows
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants