
Consider using TTL more extensively in k8s executor #6452

@BioWilko

Description

New feature

Currently ttlSecondsAfterFinished is an optional config option for the k8s executor which adds metadata to jobs/pods telling the k8s control plane to delete those resources a given number of seconds after completion or failure.
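
For context, this is roughly the Job manifest fragment that option produces, written here as the Groovy map that would be serialised to JSON. It is an illustration only (the names and values are made up), but the spec fields are standard Kubernetes batch/v1 Job fields:

    // Illustration only -- not Nextflow's internal request builder.
    // ttlSecondsAfterFinished is a standard batch/v1 Job field: once the Job
    // finishes (success or failure), the control plane deletes it and its pods
    // after the given number of seconds, with no client-side polling needed.
    def jobSpec = [
        apiVersion: 'batch/v1',
        kind      : 'Job',
        metadata  : [ name: 'nf-example-task', namespace: 'my-namespace' ],
        spec      : [
            ttlSecondsAfterFinished: 300,   // clean up 5 minutes after the Job finishes
            backoffLimit           : 0,
            template               : [ /* pod template omitted */ ]
        ]
    ]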

However, the Nextflow task monitor pool loop also actively deletes resources itself, some (apparently variable) amount of time after completion. If you have lots of fast-running processes this adds up to a lot of resources for the loop to monitor, slowing the main process down. Worse, if the control plane deletes a job before Nextflow tries to, the deletion fails with an error like this:

Oct-06 18:32:43.421 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Unexpected error in tasks monitor pool loop
org.codehaus.groovy.runtime.InvokerInvocationException: nextflow.k8s.client.K8sResponseException: Request DELETE /apis/batch/v1/namespaces/<NAMESPACE>/jobs/nf-62e16f1931488513c9732bfd1e7d5718-f9506 returned an error code=404

  {
      "kind": "Status",
      "apiVersion": "v1",
      "metadata": {
          
      },
      "status": "Failure",
      "message": "jobs.batch \"nf-62e16f1931488513c9732bfd1e7d5718-f9506\" not found",
      "reason": "NotFound",
      "details": {
          "name": "nf-62e16f1931488513c9732bfd1e7d5718-f9506",
          "group": "batch",
          "kind": "jobs"
      },
      "code": 404
  }

        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:348)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
        at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
        at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
        at nextflow.processor.TaskPollingMonitor$_start_closure2.doCall(TaskPollingMonitor.groovy:323)
        at nextflow.processor.TaskPollingMonitor$_start_closure2.call(TaskPollingMonitor.groovy)
        at groovy.lang.Closure.run(Closure.java:505)
        at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: nextflow.k8s.client.K8sResponseException: Request DELETE /apis/batch/v1/namespaces/<NAMESPACE>/jobs/nf-62e16f1931488513c9732bfd1e7d5718-f9506 returned an error code=404

This appears in the log and puts the pipeline into a state where it no longer spawns new jobs/pods, but also never finishes with an error.

I suggest refactoring some of the k8s functionality to make better use of Kubernetes features such as TTL, letting the control plane cheaply provide much of the same cleanup, or at the very least handling the case where a resource no longer exists when Nextflow tries to delete it, as in the sketch below.
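
For the last point, a minimal sketch of what tolerating an already-deleted resource could look like. The wrapper name is made up, and matching on the exception message is only an assumption based on the log above; the actual delete call is passed in as a closure rather than guessing at the client API:

    import nextflow.k8s.client.K8sResponseException

    // Sketch only -- not Nextflow's actual cleanup code. It treats "NotFound"
    // on delete as success: if the control plane already removed the resource
    // (e.g. via ttlSecondsAfterFinished) there is nothing left to clean up.
    void deleteIgnoringNotFound(String name, Closure deleteAction) {
        try {
            deleteAction.call(name)
        }
        catch( K8sResponseException e ) {
            if( e.message?.contains('code=404') )
                return              // already gone -- do not crash the monitor loop
            throw e                 // any other API error should still surface
        }
    }

Called from the task cleanup path in place of the bare delete, something like this would keep a TTL-expired job from taking down the whole task monitor loop.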
