Skip to content

Conversation

aednichols
Copy link
Collaborator

@aednichols aednichols commented Oct 15, 2025

Description

Create the predefinedMachineType attribute and document as an alpha feature with some limitations. Centaur tests include a CPU type n2 and a GPU g2.

To keep this PR bounded, I elected to defer any cost considerations to a future story.

We had previously discussed creating an allowlist of types n1-standard-*, n2-standard-* etc. we can calculate cost for. It didn't make sense to me after finding our cost logic only works with *-custom-* and not *-standard-*.

Closes #7535
Supersedes #7545

Release Notes Confirmation

CHANGELOG.md

  • I updated CHANGELOG.md in this PR
  • I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

  • I added a suggested release notes entry in this Jira ticket
  • I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

@aednichols aednichols marked this pull request as ready for review October 15, 2025 21:34
@aednichols aednichols requested a review from a team as a code owner October 15, 2025 21:34
@aednichols aednichols changed the title AN-751 Optional gcp_machine_type runtime attribute AN-751 predefinedMachineType runtime attribute Oct 15, 2025
override def validateValue: PartialFunction[WomValue, ErrorOr[MachineType]] = {
case WomString(s) => MachineType(s).validNel
case other =>
s"Invalid '$key': String value required but got ${other.womType.friendlyName}.".invalidNel
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirming, in the case that the user supplies an invalid machine type, we let Batch handle that error?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, right now they will get an error from Batch. It might be worth adding a test case to characterize the behavior, actually.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was about to suggest that!

Copy link
Contributor

@LizBaldo LizBaldo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for thoroughly documenting the new option, with the necessary warnings as the behavior can be confusing.
I have a clarifying question about what the batch UI displays, and a request to create a backlog ticket to handle cost estimation once this is not an alpha feature anymore :)

new Exception(
s"Task $jobTag failed. $returnCodeMessage GCP Batch task exited with ${errorCode}(${errorCode.code}). ${message}"
)
) with NoStackTrace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the no stack trace?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we deliberately create exceptions in the program flow, my opinion is that they should never have a stack trace as it clutters the log and is not relevant for debugging.

A second order issue is that users often diligently copy-paste entire stack traces, rendering Slack threads and Zendesk cases unreadable.

After:

2025-10-16 14:38:12 cromwell-system-akka.dispatchers.engine-dispatcher-5 INFO  -
  WorkflowManagerActor: Workflow 974aa6ec-eccf-4267-8e83-65f230967dd6 failed (during ExecutingWorkflowState): cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor$$anon$1: Task minimal_hello_world.hello_world:NA:1 failed.
  The job was stopped before the command finished. GCP Batch task exited with Success(0).

Before:

2025-10-16 16:58:09 cromwell-system-akka.dispatchers.engine-dispatcher-111 INFO  -
  WorkflowManagerActor: Workflow 70e6cac9-e991-48a6-92e9-da333c209e1e failed (during ExecutingWorkflowState): java.lang.Exception: Task minimal_hello_world.hello_world:NA:1 failed.
  The job was stopped before the command finished. GCP Batch task exited with Success(0). 
	at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor$.StandardException(GcpBatchAsyncBackendJobExecutionActor.scala:83)
	at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor.handleFailedRunStatus$1(GcpBatchAsyncBackendJobExecutionActor.scala:1152)
	at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor.$anonfun$handleExecutionFailure$1(GcpBatchAsyncBackendJobExecutionActor.scala:1168)
	at scala.util.Try$.apply(Try.scala:210)
	at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor.handleExecutionFailure(GcpBatchAsyncBackendJobExecutionActor.scala:1160)
	at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor.handleExecutionFailure(GcpBatchAsyncBackendJobExecutionActor.scala:144)
	at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$handleExecutionResult$11.applyOrElse(StandardAsyncExecutionActor.scala:1506)
	at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$handleExecutionResult$11.applyOrElse(StandardAsyncExecutionActor.scala:1503)
	at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:490)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

)

/**
* The "compute resource" concept is a suggestion to Batch regarding how many jobs can fit on a single VM.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I 100% understand. Why do we supply the compute resource if it is not used and lead to UI confusion?

Copy link
Collaborator Author

@aednichols aednichols Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because otherwise Google displays default values that are even more wrong.

If we make a one-line change to develop to omit setComputeResource() we still get the right machine shape, we just get Google's default values in the UI:

Screenshot 2025-10-16 at 15 32 25

In the future we could enhance the code to calculate a CPU and memory size for each predefinedMachineShape and set them in the UI as well. As far as I can tell this is a nice-to-have, maybe it will happen as part of the cost enhancements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha, yeah definitely a follow up thing to do if you can add that to the cost ticket

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature request: Support standard machine types on Google Backend for running workflows on modern GPUs

4 participants