Description
New feature
Currently ttlSecondsAfterFinished is an optional config option for the k8s executor which adds metadata to jobs/pods telling the Kubernetes control plane to delete those resources a given number of seconds after they complete or fail.
However, the Nextflow monitor pool loop also actively deletes resources some amount of time after completion. From my observation this delay is variable, and if you have lots of fast-running processes it can add up to a large number of resources for the loop to monitor, slowing the main process down. Also, if the control plane deletes a job before Nextflow attempts to, the log shows an error like this:
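For context, a minimal sketch of how such a TTL might be configured; the exact key and scope for the option (shown here as a pod directive in nextflow.config) are assumptions on my part and may differ in practice:

// nextflow.config -- hedged sketch; the placement of the
// ttlSecondsAfterFinished option is assumed, not confirmed.
process {
    executor = 'k8s'
    // Ask the Kubernetes TTL-after-finished controller to remove the
    // job/pod 10 minutes after it completes or fails.
    pod = [ [ttlSecondsAfterFinished: 600] ]
}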
Oct-06 18:32:43.421 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Unexpected error in tasks monitor pool loop
org.codehaus.groovy.runtime.InvokerInvocationException: nextflow.k8s.client.K8sResponseException: Request DELETE /apis/batch/v1/namespaces/<NAMESPACE>/jobs/nf-62e16f1931488513c9732bfd1e7d5718-f9506 returned an error code=404
{
    "kind": "Status",
    "apiVersion": "v1",
    "metadata": {},
    "status": "Failure",
    "message": "jobs.batch \"nf-62e16f1931488513c9732bfd1e7d5718-f9506\" not found",
    "reason": "NotFound",
    "details": {
        "name": "nf-62e16f1931488513c9732bfd1e7d5718-f9506",
        "group": "batch",
        "kind": "jobs"
    },
    "code": 404
}
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:348)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:328)
at groovy.lang.MetaClassImpl.doInvokeMethod(MetaClassImpl.java:1333)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1088)
at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1007)
at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:645)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:628)
at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:82)
at nextflow.processor.TaskPollingMonitor$_start_closure2.doCall(TaskPollingMonitor.groovy:323)
at nextflow.processor.TaskPollingMonitor$_start_closure2.call(TaskPollingMonitor.groovy)
at groovy.lang.Closure.run(Closure.java:505)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: nextflow.k8s.client.K8sResponseException: Request DELETE /apis/batch/v1/namespaces/<NAMESPACE>/jobs/nf-62e16f1931488513c9732bfd1e7d5718-f9506 returned an error code=404
This error puts the pipeline into a state where it no longer spawns new jobs/pods, but it also never terminates with an error.
I suggest considering a refactor of some of the k8s functionality to make better use of Kubernetes features (such as TTL), to better utilise the control plane to achieve many of the same features cheaply, or at the very least to handle the case where a resource no longer exists by the time Nextflow tries to delete it.
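To illustrate the last point, here is a hedged Groovy sketch (not the actual Nextflow implementation) of how a delete call could tolerate a resource that the TTL controller has already removed; the helper name and the way the 404 is detected from the exception message are assumptions for illustration only:

import nextflow.k8s.client.K8sResponseException

// Hedged sketch: wrap any resource-delete call so that a 404
// (resource already removed, e.g. by the TTL controller) is ignored
// instead of crashing the task monitor pool loop.
void deleteIgnoringNotFound(Closure deleteAction, String resourceName) {
    try {
        deleteAction.call()
    }
    catch( K8sResponseException e ) {
        // The response shown in the log above carries reason=NotFound / code=404
        // when the job or pod is already gone; anything else is a real error.
        if( e.message?.contains('code=404') )
            System.err.println "Resource ${resourceName} already deleted -- ignoring 404"
        else
            throw e
    }
}

The monitor loop could then wrap whatever delete call it makes internally with a helper like this, so that racing the TTL controller is no longer fatal to the run.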