AN-751 `predefinedMachineType` runtime attribute #7817

aednichols · 2025-10-15T14:10:33Z

Description

Create the `predefinedMachineType` attribute and document it as an alpha feature with some limitations. Centaur tests cover a CPU machine type (n2) and a GPU machine type (g2).

To keep this PR bounded, I elected to defer any cost considerations to a future story.

We had previously discussed creating an allowlist of types (`n1-standard-*`, `n2-standard-*`, etc.) that we can calculate cost for. That stopped making sense to me once I found that our cost logic only works with `*-custom-*` types and not `*-standard-*` types.

Closes #7535
Supersedes #7545

Release Notes Confirmation

`CHANGELOG.md`

I updated CHANGELOG.md in this PR
I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

I added a suggested release notes entry in this Jira ticket
I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

lucymcnatt · 2025-10-15T21:51:38Z

...s/google/batch/src/main/scala/cromwell/backend/google/batch/util/MachineTypeValidation.scala

+  override def validateValue: PartialFunction[WomValue, ErrorOr[MachineType]] = {
+    case WomString(s) => MachineType(s).validNel
+    case other =>
+      s"Invalid '$key': String value required but got ${other.womType.friendlyName}.".invalidNel


Just confirming, in the case that the user supplies an invalid machine type, we let Batch handle that error?

Yes, right now they will get an error from Batch. It might be worth adding a test case to characterize the behavior, actually.

was about to suggest that!
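
To make the behavior discussed in this thread concrete, here is a minimal sketch of type-only validation. It is not the code from the PR: `AttrValue`, `StrValue`, `IntValue`, and `MachineTypeSketch` are hypothetical stand-ins for Cromwell's `WomValue` hierarchy, and plain `Either` replaces the cats-based `ErrorOr`. The point is that any string passes, so an unknown machine type name is only rejected later, by Batch.

```scala
// Hypothetical stand-ins for Cromwell's WomValue hierarchy; not the real types.
sealed trait AttrValue { def friendlyName: String }
final case class StrValue(value: String) extends AttrValue { val friendlyName = "String" }
final case class IntValue(value: Int) extends AttrValue { val friendlyName = "Int" }

// Thin wrapper around the raw machine type name.
final case class MachineType(name: String)

object MachineTypeSketch {
  val key = "predefinedMachineType"

  // Checks only that the value is a string. The name itself is not compared
  // against any catalog of machine types, so typos are left for Batch to reject.
  def validate(v: AttrValue): Either[List[String], MachineType] = v match {
    case StrValue(s) => Right(MachineType(s))
    case other =>
      Left(List(s"Invalid '$key': String value required but got ${other.friendlyName}."))
  }
}
```

With this shape, `validate(StrValue("not-a-real-type"))` succeeds locally, which is why a test characterizing the Batch-side failure is useful.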

LizBaldo

Thanks for thoroughly documenting the new option, with the necessary warnings, as the behavior can be confusing.
I have a clarifying question about what the Batch UI displays, and a request to create a backlog ticket to handle cost estimation once this is no longer an alpha feature :)

LizBaldo · 2025-10-16T14:06:18Z

.../main/scala/cromwell/backend/google/batch/actors/GcpBatchAsyncBackendJobExecutionActor.scala

    new Exception(
      s"Task $jobTag failed. $returnCodeMessage GCP Batch task exited with ${errorCode}(${errorCode.code}). ${message}"
-    )
+    ) with NoStackTrace


Why the no stack trace?

When we deliberately create exceptions in the program flow, my opinion is that they should never have a stack trace as it clutters the log and is not relevant for debugging.

A second-order issue is that users often diligently copy-paste entire stack traces, rendering Slack threads and Zendesk cases unreadable.

After:

2025-10-16 14:38:12 cromwell-system-akka.dispatchers.engine-dispatcher-5 INFO - WorkflowManagerActor: Workflow 974aa6ec-eccf-4267-8e83-65f230967dd6 failed (during ExecutingWorkflowState): cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor$$anon$1: Task minimal_hello_world.hello_world:NA:1 failed. The job was stopped before the command finished. GCP Batch task exited with Success(0).

Before:

2025-10-16 16:58:09 cromwell-system-akka.dispatchers.engine-dispatcher-111 INFO - WorkflowManagerActor: Workflow 70e6cac9-e991-48a6-92e9-da333c209e1e failed (during ExecutingWorkflowState): java.lang.Exception: Task minimal_hello_world.hello_world:NA:1 failed. The job was stopped before the command finished. GCP Batch task exited with Success(0).
    at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor$.StandardException(GcpBatchAsyncBackendJobExecutionActor.scala:83)
    at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor.handleFailedRunStatus$1(GcpBatchAsyncBackendJobExecutionActor.scala:1152)
    at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor.$anonfun$handleExecutionFailure$1(GcpBatchAsyncBackendJobExecutionActor.scala:1168)
    at scala.util.Try$.apply(Try.scala:210)
    at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor.handleExecutionFailure(GcpBatchAsyncBackendJobExecutionActor.scala:1160)
    at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor.handleExecutionFailure(GcpBatchAsyncBackendJobExecutionActor.scala:144)
    at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$handleExecutionResult$11.applyOrElse(StandardAsyncExecutionActor.scala:1506)
    at cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$handleExecutionResult$11.applyOrElse(StandardAsyncExecutionActor.scala:1503)
    at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:490)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
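
For readers unfamiliar with the trait, a small standalone sketch of what mixing in `scala.util.control.NoStackTrace` does. The object and helper names here are made up for illustration. Mixing the trait into `new Exception(...)` creates an anonymous subclass, which would explain why the "After" log line shows a `$$anon$1` class name instead of `java.lang.Exception`.

```scala
import scala.util.control.NoStackTrace

object NoStackTraceSketch {
  // Hypothetical helper names, for illustration only.

  // Ordinary exception: the JVM records the stack at construction time.
  def withTrace(msg: String): Exception = new Exception(msg)

  // Same message, but the mixin skips stack capture (by default), so
  // loggers have no frames to print after the message.
  def withoutTrace(msg: String): Exception = new Exception(msg) with NoStackTrace

  def main(args: Array[String]): Unit = {
    val loud = withTrace("Task failed.")
    val quiet = withoutTrace("Task failed.")
    println(s"frames with trace:    ${loud.getStackTrace.length}")
    println(s"frames without trace: ${quiet.getStackTrace.length}")
    // The runtime class is an anonymous subclass of Exception.
    println(quiet.getClass.getName)
  }
}
```

The message is unchanged, so the log still says which task failed and why; only the frames are dropped.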

LizBaldo · 2025-10-16T14:08:15Z

...ogle/batch/src/main/scala/cromwell/backend/google/batch/api/GcpBatchRequestFactoryImpl.scala

    )

+    /**
+     * The "compute resource" concept is a suggestion to Batch regarding how many jobs can fit on a single VM.


I am not sure I 100% understand. Why do we supply the compute resource if it is not used and leads to UI confusion?

Because otherwise Google displays default values that are even more wrong.

If we make a one-line change to develop to omit `setComputeResource()`, we still get the right machine shape; we just get Google's default values in the UI.

In the future we could enhance the code to calculate a CPU and memory size for each `predefinedMachineType` and set them in the UI as well. As far as I can tell this is a nice-to-have; maybe it will happen as part of the cost enhancements.

Ah, gotcha. Yeah, definitely a follow-up thing to do, if you can add that to the cost ticket.
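
As a rough illustration of the possible follow-up mentioned above, here is one way a CPU and memory hint could be derived from a machine type name. Everything in it is an assumption: `ComputeHint`, `ComputeHintSketch`, and `hintFor` are hypothetical names, only `*-standard-N` names are handled, and the ratio table is a single illustrative entry that would need to be checked against Google's machine type documentation. The field names mirror the milli-CPU and MiB units that Batch's compute resource uses.

```scala
// Hypothetical value type mirroring the two fields of a Batch compute resource.
final case class ComputeHint(cpuMilli: Long, memoryMib: Long)

object ComputeHintSketch {
  // Illustrative ratio table (GiB of memory per vCPU), keyed by family.
  // Assumed values; verify against GCP documentation before relying on them.
  val gibPerVcpu: Map[String, Double] = Map("n2-standard" -> 4.0)

  // Matches names such as "n2-standard-4": a family ending in "-standard", then a vCPU count.
  private val Standard = """([a-z0-9]+-standard)-(\d{1,4})""".r

  // Returns None for any name the table does not cover, so the caller can
  // fall back to whatever it does today.
  def hintFor(machineType: String): Option[ComputeHint] = machineType match {
    case Standard(family, cpus) =>
      gibPerVcpu.get(family).map { ratio =>
        val n = cpus.toLong
        ComputeHint(cpuMilli = n * 1000, memoryMib = (n * ratio * 1024).toLong)
      }
    case _ => None
  }
}
```

Returning `Option` keeps the hint purely cosmetic: unknown shapes such as `e2-medium` yield `None` rather than a guessed value.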

aednichols added 19 commits October 9, 2025 17:04

Machine class for GCP

696b193

Test cases

e2f5bbb

Merge remote-tracking branch 'origin/develop' into aen_an_751

83d50e5

New validation

2b6b199

Merge remote-tracking branch 'origin/develop' into aen_an_751

29a0441

scalafmtAll

775b1ed

Enhanced toString

9153f00

Enhance tests

2eca2f1

Disable no-op Scaladoc generation

192172a

We write the HTML docs to disk on the CI instance and then throw them away.

Enhance tests to check instance metadata

68ce96e

Add GPU test

678e29d

Docs

c95c551

Remove Life Sciences references

a5af024

Fix markdown syntax

9359832

Maybe this fixes syntax?

11791ce

Changelog

a21c860

Further clean up nvidiaDriverVersion

dd6fa13

Extra explain cpuPlatform

19966c6

Clarify comment

721bd0c

aednichols marked this pull request as ready for review October 15, 2025 21:34

aednichols requested a review from a team as a code owner October 15, 2025 21:34

aednichols changed the title ~~AN-751 Optional gcp_machine_type runtime attribute~~ AN-751 predefinedMachineType runtime attribute Oct 15, 2025

Rename: camelCase to match other attrs

67dc13d

lucymcnatt reviewed Oct 15, 2025

aednichols assigned mbaumann-broad Oct 15, 2025

aednichols added 2 commits October 15, 2025 18:01

Just Say No to stack traces

aa8dbc4

Test orderly failure for invalid type

5f799c8

LizBaldo approved these changes Oct 16, 2025

aednichols added 2 commits October 16, 2025 14:25

e2-medium is cheapest sensible VM

c5e9025

Docs

750328b

Boop tests by updating docs

112dedd
