Processes and workflows recursion [EXPERIMENTAL] #2521
Replies: 15 comments 65 replies
-
This seems like it could be incredibly useful! I have no comments apart from my strong support.
-
I'd tried to develop a pipeline for generating simulated datasets that could be used to test variant calling tools: https://github.com/dfornika/simulate-outbreak-dataset The idea is to do a rudimentary simulation of an outbreak by introducing mutations into a reference genome, then passing those mutated genomes forward to another iteration. I was never quite able to get it working the way I wanted using DSL2 syntax. I'm looking forward to giving it another try with the new recursion syntax.
-
Christmas came early! Very excited to try this out. Thanks @pditommaso!
-
This is a good initial implementation. For this feature to be robust, I think it needs to be rooted in the concept of a finite state machine. FSMs are used to design digital circuits that require recursion or iteration. Digital circuits are just workflows implemented in hardware, so they are a good analogy for Nextflow. See also recurrent neural networks. Taking the above example from @dfornika and this state machine diagram, we can see that this example works in Nextflow only because the state and the output happen to be the same. But what if they aren't?

I think a process/workflow that supports recursion should be able to have a sense of "state" that is distinct from the output (but can be the same). Being able to define state variables for a process/workflow would (I think) solve limitations 1, 3, 4, and 5. If you have state, the input and output don't have to be identical. State can be used to keep track of the recursion level (either by hand, or we could add a convenience function). State allows you to distinguish between the initial input and the accumulated outputs in the scan example. In fact, if you can define state variables, then I think the difference between `recurse` and `scan` largely goes away.

Regarding limitation 2, passing a queue channel into a recursive process/workflow should work just like any other process/workflow: each item in the channel should be processed independently. Instead of calling a process once per item like normal, each item will trigger a recursive sequence of tasks, but the overall parallelism of the queue channel remains the same.

So basically I think we need to rework this feature to allow for state channels. I think the example by @dfornika is a good one to work with: how would this example be written if we could define state channels? Another good example would be to implement a multiplier "workflow" that takes two integers (as lists of 0s and 1s) and produces the product.

If we do go forward with this feature, I think it will be important to give some examples for when recursion should and should not be used. Many recursive/iterative use cases can just be implemented in an imperative language (a Python script) and then called in a single process. As far as I can tell, it is only when the state consists of files (especially large files) that recursion should be used at the workflow level.
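To make the state-vs-output distinction concrete, the multiplier suggested above can be sketched in plain Python (hypothetical names, not Nextflow syntax): a shift-and-add state machine whose state `(a, b, acc)` is strictly larger than its output (`acc` alone). Plain integers with bit operations stand in for the lists of 0s and 1s.

```python
# Hypothetical sketch (plain Python, not Nextflow): a shift-and-add
# multiplier modelled as a state machine whose state (a, b, acc) is
# distinct from its final output (acc alone).

def multiplier_step(state):
    """One recursion step: state in, state out."""
    a, b, acc = state
    if b & 1:            # low bit of the multiplier set?
        acc += a         # ...then add the shifted multiplicand
    return (a << 1, b >> 1, acc)

def run_until(state, done):
    """Drive the state machine until the stop condition holds."""
    while not done(state):
        state = multiplier_step(state)
    return state[2]      # the output is only part of the state

# 6 * 5 == 30
print(run_until((6, 5, 0), done=lambda s: s[1] == 0))
```

Note the stop condition inspects the state (`b == 0`), not the output, which is exactly the distinction argued for above.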
-
Another thought I had: maybe the recursion state (such as an iteration counter) could be threaded through explicitly as extra inputs and outputs. Rewriting the example:

```nextflow
nextflow.enable.dsl=2
nextflow.preview.recursion=true

process foo {
    input:
      path infiles
      val i
      path outfiles

    output:
      path infiles
      val i_
      path 'out_*.txt', includeInputs: true

    script:
      """
      echo "Task ${i} inputs: ${infiles[i]}" > out_${i}.txt
      # ${i_ = i + 1}
      """
}

workflow {
    infiles = channel.fromPath("sample*.txt").toSortedList(it -> it.name)
    foo.recurse( infiles, 0, [] ).until { infiles, i, outfiles -> (i == infiles.size()) }
    foo.out[2].view(it -> it.text)
}
```

Also this way you can keep track of inputs vs outputs.
-
How would recursion work with sub-workflows? The examples given so far appear to involve only recursive processes.
-
What if I have to change the parameters for each recursion iteration? Any ideas?
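One possible answer, sketched in plain Python (hypothetical, not an official Nextflow feature): thread an iteration counter through the recursion state and derive each iteration's parameters from it. In Nextflow terms the counter would be an extra `val` input/output, as in the explicit-state example earlier in this thread.

```python
# Sketch (plain Python, hypothetical): vary a parameter per iteration by
# carrying an iteration counter in the recursion state, so each step can
# derive its own parameter value.

def step(value, i):
    threshold = 10 * (i + 1)   # parameter derived from the iteration number
    return value + threshold, i + 1

value, i = 1, 0
while i < 3:
    value, i = step(value, i)
print(value)   # 1 + 10 + 20 + 30 = 61
```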
-
I also asked in the Nextflow-help chat and got a satisfying answer there. Thanks @bentsherman!
-
Is it possible to extend this to rerun a sub-workflow for different sets of inputs? Sort of:
-
How about an implementation like this?

```nextflow
process append_square {
    input:
      val(x)
    output:
      val(y)
    exec:
      y = x + x[-1]**2
}

workflow MyRecursion {
    take:
      data
    main:
      result = data |
        append_square |
        branch {
          done: it[-1] >= 10
          feedback: true
        }
    emit:
      result.done
    recurse:
      result.feedback
}

workflow {
    Channel.of([2], [3], [4]) |
      MyRecursion |
      view
}
```
I see a few advantages to doing it this way:
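The proposed `emit`/`recurse` branching semantics can be sketched in plain Python (hypothetical, not Nextflow): items that satisfy the `done` condition are emitted, and everything else is fed back into another iteration.

```python
# Sketch (plain Python, hypothetical): the emit/recurse feedback loop
# proposed above -- keep appending the square of the last element until
# the "done" branch condition (last element >= 10) holds.

def append_square(xs):
    return xs + [xs[-1] ** 2]

def my_recursion(data):
    done, feedback = [], data
    while feedback:
        advanced = [append_square(xs) for xs in feedback]
        done += [xs for xs in advanced if xs[-1] >= 10]     # "done" branch
        feedback = [xs for xs in advanced if xs[-1] < 10]   # fed back in
    return done

for result in my_recursion([[2], [3], [4]]):
    print(result)
```

One nice property of this formulation is that each item leaves the loop as soon as its own condition is met, rather than all items iterating the same number of times.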
-
Hi, I'm failing to use the recursive pattern on a workflow. I'm very new to Nextflow, so please forgive me for basic errors. I'm trying to figure out how to use Nextflow to "reduce" over an ordered set of items/files with a process/task in a hierarchical manner, e.g. for a channel with 8 items, named a-h, I want to calculate something like `concat(concat(concat(a,b), concat(c,d)), concat(concat(e,f), concat(g,h)))`, where `concat` is a process that combines two items. In the following example:

```nextflow
include { CREATEDATA } from './module'
include { concat_wrapper as concat_wrapper1 } from './module'
include { concat_wrapper as concat_wrapper2 } from './module'
include { concat_wrapper as concat_wrapper3 } from './module'

workflow {
    (sortkeys, files) = Channel.from( 'A'..'H' ) | CREATEDATA
    keyed_files = sortkeys.merge(files).toSortedList({ a, b -> a[0] <=> b[0] })
    keyed_files.view()
    concat_wrapper1(keyed_files)
    concat_wrapper2(concat_wrapper1.out)
    concat_wrapper3(concat_wrapper2.out)
    concat_wrapper3.out.view()
}
```

where `./module` contains:

```nextflow
process CREATEDATA {
    input:
      val key_in
    output:
      val key_out
      path "${key_out}.txt"
    script:
      key_out = key_in
      """
      echo $key_in >> ${key_in}.txt
      """
}

process CONCAT {
    input:
      tuple val(first_key), path(first_path), val(second_key), path(second_path)
    output:
      tuple val(key_out), path("${key_out}.txt")
    script:
      key_out = "${first_key}_${second_key}"
      """
      touch ${key_out}.txt
      cat $first_path >> ${key_out}.txt
      cat $second_path >> ${key_out}.txt
      """
}

workflow concat_wrapper {
    take:
      keyed_files
    main:
      keyed_files_out = CONCAT(keyed_files.flatten().collate(4)).toSortedList({ a, b -> a[0] <=> b[0] })
    emit:
      keyed_files_out
}
```

The above works as expected, but the manual looping over the `concat_wrapper` calls is what I'd like to replace with recursion:

```nextflow
nextflow.preview.recursion=true

include { CREATEDATA } from './module'
include { concat_wrapper } from './module'

workflow {
    (sortkeys, files) = Channel.from( 'A'..'H' ) | CREATEDATA
    keyed_files = sortkeys.merge(files).toSortedList({ a, b -> a[0] <=> b[0] })
    keyed_files.view()
    concat_wrapper.recurse(keyed_files).times(2)
    concat_wrapper.out.view()
}
```

But running the above does not work, and the terminal seems to hang after the first invocation of `concat_wrapper`.
-
So just to clarify, in the current state of the world, is there any option to proceed if we need to recurse and our initial recursion input happens to come from running another process or workflow? Something like:
-
A basic question: the `until` examples above evaluate the output with a simple inline condition. But if I wanted to evaluate the output in some other way, perhaps using a custom function, how would I do that? Say, for the sake of argument, I actually wanted to know how many times the letter "i" showed up in the output in order to call the recursion to a finish. I could easily design a process that counts the letter "i", but how can that be used in this evaluation?
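One way to think about it, sketched in plain Python (hypothetical; the random-letter generator just stands in for a real task): the stop condition is an arbitrary predicate over the current output, so anything computable from the output, such as a letter count, could in principle be evaluated there.

```python
# Sketch (plain Python, hypothetical): a custom stop condition -- stop
# once the letter "i" has appeared at least five times in the output.
import random

random.seed(0)

def step(text):
    # stands in for a task that appends one letter to its output each run
    return text + random.choice("irrelevant")

def recurse_until(text, done):
    while not done(text):
        text = step(text)
    return text

result = recurse_until("", done=lambda t: t.count("i") >= 5)
print(result.count("i"))   # 5 -- the loop stops the moment the count is reached
```

Whether the predicate inside Nextflow's `until { ... }` closure can call out to a separate counting process, rather than plain Groovy code reading the output file, is exactly the open question here.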
-
So does this mean that none of the steps of the iteration will be cached, since the input file is being modified?
-
Hi all, for me it seems that this feature does not quite work for my use case. Have a look at this minimal (non-)working example:

This workflow mimics a Bayesian optimization loop: in the first step, candidates are generated, which are then evaluated in parallel and then combined in a final step. If I execute the workflow, it stops without an error message before the combining step. Any idea what is going on there? Best, Johannes
-
Computational workflows are usually based on a DAG, i.e. a directed acyclic graph of the tasks to be computed; as such, the concept of recursion or iteration does not fit well into this model.
However, there are not-uncommon use cases in which recursion could be useful in a computational workflow.
Using Nextflow's DSL1, it's possible to create an iteration using an output channel linked to an upstream process, which essentially creates a continuous feedback loop.
https://nextflow-io.github.io/patterns/index.html#_feedback_loop
However, this solution is quite tricky to implement; moreover, it's not supported by Nextflow DSL2, since the `channel.create` operation is not allowed anymore.
Recursion in DSL2
As of version 21.11.0-edge, Nextflow provides an (experimental) syntax that brings native support for recursion to workflows, without the need to hack around with channel creation.
The overall idea is to provide a new `recurse` operation that re-executes the process or sub-workflow to which it is applied, feeding the last output produced back in as the next input, either for a fixed number of times or until a user-provided condition is satisfied. As usual, an example is the best way to describe it:
In the snippet above, `foo` represents an arbitrary Nextflow process taking two input values; nothing new has been added here. In order to repeat the execution of the `foo` process, the keyword `recurse` is applied to it, passing the initial input values to be used for the first iteration as arguments, e.g. `(10,20)`. Iteration i+1 automatically takes as inputs the outputs of iteration i, and the recursion stops after 4 iterations, as stated by `.times(4)`.
If we execute this piece of code, the following output is printed:
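A plain-Python sketch (hypothetical names, not Nextflow) of the `recurse(...).times(n)` semantics described above:

```python
# Sketch (plain Python, hypothetical): each iteration's outputs become
# the next iteration's inputs, for a fixed number of iterations.

def foo(x, y):
    # stands in for an arbitrary two-input, two-output process
    return x + y, x

def recurse_times(process, inputs, n):
    for _ in range(n):
        inputs = process(*inputs)
        print(inputs)
    return inputs

recurse_times(foo, (10, 20), 4)
```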
In place of `times`, it is also possible to use `until` to provide a condition that determines when to stop. For example:
Note that `until` receives as arguments the outputs of the current iteration; therefore, in the example above, `it` holds the `result.txt` file produced by the process execution, and the iteration terminates when the output file size is greater than 100 bytes.
Running this snippet, the following output is printed:
Accumulators
Another important use case when dealing with recursion is the ability to repeat the execution not just with the previous output, but with the outputs produced by all previous executions.
In reactive programming this is known as the scan operation.
Nextflow implements a similar concept by applying the `scan` operation to a workflow or process component. For example:
Running the above snippet, the following output will be printed:
Note that when using `scan`, no termination condition should be provided, because the number of times the process (or workflow) is executed depends only on the items emitted by the input channel(s), as in any other Nextflow process.
Limitations & caveats
This is still an exploratory feature with a lot of limitations:
- When using `recurse` or `scan`, the `input` and `output` definitions, either for a process or a workflow, should be identical.
- The `recurse` operation allows as input only value objects and value channels; queue channels cannot be used.
- The input of the `scan` operation contains both the actual input plus the accumulated output, and there's no way to distinguish them (apart from hacking on the file names).
Conclusion
This feature can potentially have a big impact on Nextflow workflows. For this reason, I think it's very important to collect feedback from the community and from whoever may be interested in this feature.
Do you think it's useful? How could it be improved? Your comments are welcome.
Resources
You can find the above examples at this repo.