Nextflow language improvements #3107

bentsherman · 2022-08-09T22:27:54Z

Maintainer

As a Nextflow user (and now developer), I have a lot of little qualms with the Nextflow language. The fact that I fell in love with Nextflow despite all of its quirks is truly a testament to its excellence as a workflow manager. But man, there is a lot of weirdness and confusion when you’re learning Nextflow. Like, it’s not hard to learn, but there’s always this sinking feeling that you’re not doing something right, and that feeling never completely goes away.

I’ve been poring over issues and discussions from the past year or so, and I’ve seen a lot of weird bug reports or comments about the language itself, so I decided it’s time to collect everything into a mega-thread and start coming up with solutions.

Basically, I’m going to lay out everything I think is weird about the Nextflow language and try to propose solutions. I’m posting it here because I want to hear from Nextflow users. Like you! Please feel free to comment below, and I will try to incorporate your suggestions into the big picture.

I will continue to update this post as I see new issues or solutions.

NOTE: The checkboxes are reset every time I update this post, so they might not reflect the current status of things.

Basic scripting

See issues with the lang/dsl2 label.

syntax errors within a process or config block are not specific enough (better error messages #2082)
support function overloading (Function call with channel fails (seemingly) non-deterministically. #2447, Explore the possibility to allow the support function overloading #3011)
exit doesn't print message when called inside a process argument (Exit error message not printed to screen when used between process input brackets. #3046)
develop a linter / formal grammar for Nextflow scripts (?)
provide more flexibility for publishing files (How to use `publishDir` on a workflow output? #1933)

Formal grammar and linter

Some folks in the nextflow / nf-core Slacks are currently trying to develop a linter. We'll see how far they get, but overall I think there is a lack of clarity in what is possible in the Nextflow language. Beyond the core features, there is a large space of possible syntax since Nextflow inherits from Groovy and Java.

While I hope we can better formalize the Nextflow language (which we'll probably need to in order to fix issues like #2082), for now I think the most important thing is to make the docs clearer about what is possible in the Nextflow language, and maybe even provide some guidelines or conventions to help users along.

Flexibility for publishing files

A lot of users were asking in the thread linked above for the ability to publish files at the workflow level. As far as I can tell, users are just trying to not repeat the same publishDir settings for every process. As shown in the comments below (#3107 (reply in thread)), you can use process selectors to apply some directives to many processes at once, so I think this covers it.
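
As a rough sketch of the selector approach, one config block can apply the same publishDir settings to many processes at once. The `params.outdir` parameter, the `publish` label and the `ALIGN_.*` name pattern are illustrative assumptions, not something taken from the linked thread:

```groovy
// nextflow.config (sketch; `params.outdir`, the `publish` label and the
// `ALIGN_.*` name pattern are hypothetical)
params.outdir = 'results'

process {
    // applies to every process that declares `label 'publish'`
    withLabel: 'publish' {
        publishDir = [ path: "${params.outdir}", mode: 'copy' ]
    }

    // applies to every process whose name matches the pattern
    withName: 'ALIGN_.*' {
        publishDir = [ path: "${params.outdir}/align", mode: 'copy' ]
    }
}
```

This keeps the publishing rules in configuration, so the process definitions themselves need no publishDir directive.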

Another idea is to implement a publish operator that basically does the same thing as the directive but as a standalone operator (#1540).

Processes

See issues with the lang/processes label.

support arity for path inputs/outputs (Output file("*.foo") returns a single file or list #1236, Inconsistent behaviour for glob process outputs #2425)
support optional input channels (Optional inputs for DSL2 #1694)
support each input repeater with tuple (Use of input repeaters ("each") in DSL2 #1966)
support record input/output type (Using custom objects with paths #2085)
support map input/output type (Add map input/output type for processes #2127)
support named input channels for processes and workflows (Support named channel inputs for processes and workflows #2257)
discourage when block in documentation (Suggest adding the process "when:" as a process directive. #2518)
support nullable input/output path (Nullable input/output paths #2893)
add glob() method for globs instead of file() (Add glob() method for globs in favor of file() #3109)

Map inputs/outputs, named inputs, optional inputs

These concepts are related but have slightly different uses. Maps are useful as an alternative to tuples. Named and optional inputs have the same usefulness as named and optional outputs. In particular, I think optional inputs would solve the problem that channel topics (#2842) is trying to solve. Nullable paths are useful when you want an element of a tuple / map to be optional.

Multiple input channels and the `merge` operator

In general, multiple input channels should only be used to enumerate the outer product of the channels, such as by using each or value channels. I have seen cases where users have a tuple channel and they want to split it up into multiple input channels instead of one long tuple input. While this may look "nicer", it is not the paradigm of Nextflow and often leads to suffering. Using multiple input channels in this particular way is equivalent to the merge operator, which should be removed from Nextflow. What I think users are actually reaching for here is to either use maps or named/optional inputs, so I think these language improvements will resolve some of this tension.

Operators

See issues with the lang/operators label.

`merge`

The merge operator was deprecated with DSL2 but revived because of some use cases that users wanted (see above section on multiple input channels). The problem is that merging two channels from different processes leads to random combinations because processes can execute tasks out of order. While it is safe to use with other operators, this behavior is not guaranteed and could change in the future. While it is safe with channels from the same process, in that case you are probably setting up your input/output channels in a bad way (see above section on multiple input channels). As stated above, if we can implement some improvements to process inputs/outputs then I think we'll be able to truly put the merge operator to rest.

Modules

See issues with the lang/modules label.

Process selector of original process name applies to modules included as original name and as aliased name. #2490

Configuration

See issues with the lang/config label.

See #2723 for ongoing discussion about a new configuration syntax. Lots of config-related issues are due to limitations of the current config syntax.

Add option to ignore process selector warnings #2700
rename pod directive to podOptions
rename pod.config to pod.configMap
deprecate pod.pullPolicy in favor of pod.imagePullPolicy
deprecate pod.runAsUser in favor of pod.securityContext

mahesh-panchal · 2022-08-10T09:50:31Z

Metadata handling, grouping and operations.

Nextflow is really flexible when it comes to passing around data, but effectively everything needs to be passed around as a list including metadata, which can lead to large input tuples depending on the metadata needing to be passed around. Nf-core took a step to pass metadata around as a Map as the first element of most input tuples, but this combines metadata for different things. For a more explicit example, let's take a basic alignment workflow. Examples of metadata that are passed around are things like:

Sample ID
Library ID
Lane
Single-end/Paired-end
Genomic Interval
but these properties pertain to different things, and depending on the task you want to group on different properties.

The specific issue I have is when it comes to grouping data again based on metadata properties. If one uses a separate element of a tuple for each part of the metadata then one needs to use the index, which is not informative to the reader ( e.g. r_ch.join(l_ch, by:[0,2]) - the reader needs to go to the process to see what the tuple structure is - which could mean trying to track it down through multiple subworkflows). If one passes this all around as a single Map, then one is forced to perform channel data manipulation to take the unneeded metadata out, perform the join, and then reincorporate the separated-out metadata.

I think it would help readability and workflow maintenance if input tuples were maps rather than lists, and operators could select which keys grouping happened on.
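
For context, the Map-based workaround described above might look roughly like the sketch below with the operators as they are today. The meta fields and file names are made up for illustration, and the explicit key is built by hand because `join` only accepts integer indices:

```groovy
workflow {
    // meta maps and file names are hypothetical
    reads_ch = Channel.of(
        [ [sample: 's1', library: 'libA', lane: 'L001'], 's1_L001.fq' ],
        [ [sample: 's1', library: 'libA', lane: 'L002'], 's1_L002.fq' ]
    )
    stats_ch = Channel.of(
        [ [sample: 's1', lane: 'L002'], 's1_L002.stats' ],
        [ [sample: 's1', lane: 'L001'], 's1_L001.stats' ]
    )

    // pull the named fields out into an explicit key at index 0
    keyed_reads = reads_ch.map { meta, reads -> [ [meta.sample, meta.lane], meta, reads ] }
    keyed_stats = stats_ch.map { meta, stats -> [ [meta.sample, meta.lane], stats ] }

    // join on the key (index 0 is the default), then drop the key again
    keyed_reads
        .join( keyed_stats )
        .map { key, meta, reads, stats -> [ meta, reads, stats ] }
        .view()
}
```

The two extra `map` steps around the join are exactly the boilerplate that key-based grouping on maps would remove.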

3 replies

bentsherman Aug 10, 2022
Maintainer Author

Excellent point. There are a few aspects to improving support for maps. I have already listed the need for a map input/output type and the issue of modifying maps above. What you are talking about is using maps with operators that have the by option.

I just checked the source code and indeed these operators only support integer indices. So if we only add support for string indices then I think maps will work with these operators. I will add it to the list and create a separate issue.

mahesh-panchal Aug 11, 2022

Having a map input type would be a useful alternative to tuples.

And similarly supporting string indices would also be appreciated.

cjw85 Aug 11, 2022

Channel emissions being maps/named tuples would be a massive win! Numeric indexing is brittle when someone wants to change the contents of a channel and it makes operators on channels less readable.

The fudge at the moment is to emit multiple channels each with the same meta data and then join items back together by key. Code is more readable by virtue of having names for things, but less readable for having (somewhat) pointless operator chains everywhere.

mahesh-panchal · 2022-08-10T10:47:09Z

Disk space usage while running is often a concern. The process.scratch goes a long way to help disk space usage while running, but frequently users don't want to cache the outputs of intermediate processes as they take up too much space. It would be nice if there were directives to say which processes should be used for check-pointing, and which processes should have their work directory emptied as soon as all processes needing that input have run to completion.

1 reply

bentsherman Aug 10, 2022
Maintainer Author

Yes I am currently investigating this problem, albeit very slowly. See #452 for discussion of this issue. My plan is to add a temporary option to output paths, so that a "temporary" output is deleted as soon as its immediate consumers are done, and Nextflow checks these consumers when trying to resume a task with missing temporary outputs.

I suppose this issue fits into the scope of "language" improvements so I will add it.

mahesh-panchal · 2022-08-10T19:24:57Z

Inheritance in configuration is a desired feature. One common example is with the publishDir directive.
One would like to for example set the mode of publishDir generally to 'copy', and have it inherited by every more specific use of publishDir.

Example:
nextflow.config:

process {
    // General publishDir setting
    publishDir = [
        path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
        mode: params.publish_dir_mode,                                            // Would like this inherited
        saveAs: { filename -> filename.equals('versions.yml') ? null : filename } // Would like this inherited
    ]

    withName: 'MY_TASK' {
         // I only want to supply a new path here, but I need to write out everything that is not the default again.
        publishDir = [
            path: { "${params.outdir}/task/$sample" },
            mode: params.publish_dir_mode,                                            // Needs to be included again or it uses default setting
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename } // Needs to be included again or it uses default setting
        ]
    }

}

Another aspect that would benefit from this also is parameter specification. For example when something has a complex configuration, but you want to add on an extra string:

// Example from nf-core/rnaseq
    process {
        withName: '.*:ALIGN_STAR:STAR_ALIGN|.*:ALIGN_STAR:STAR_ALIGN_IGENOMES' {
            ext.args   = [
                '--quantMode TranscriptomeSAM',
                '--twopassMode Basic',
                '--outSAMtype BAM Unsorted',
                '--readFilesCommand zcat',
                '--runRNGseed 0',
                '--outFilterMultimapNmax 20',
                '--alignSJDBoverhangMin 1',
                '--outSAMattributes NH HI AS NM MD',
                '--quantTranscriptomeBan Singleend',
                params.save_unaligned ? '--outReadsUnmapped Fastx' : ''
            ].join(' ').trim()
        }
    }

In its current form this needs to be written out again in some form if I want to modify it, whereas it would be nice to be able to say I want to append something to all of this instead.

16 replies

mahesh-panchal Aug 12, 2022

pattern and saveAs from publishDir select which files to save. Use the process selector expressions to group processes as you need.
Take a look at the nf-core conf/modules.config files. They basically contain every possible combination of need I think.

cjw85 Aug 12, 2022

Nice to know it's possible in a roundabout way. Thanks!

I really like the idea of a publish operator, that would be a nice addition to the language making it follow more a Unix pipeline type affair, with output being the final stage of the control flow.

mahesh-panchal Aug 12, 2022

The problem I have with a publish operator is that it's not override-able.

From a user perspective, from the configuration they can switch off publishing certain files if desired ( e.g. if they have space issues, or only want a subset of the workflow output) without changing the workflow logic, and in its current state it does allow publishing anything from anywhere. This is one thing that I think should remain as configuration and not workflow logic.

cjw85 Aug 16, 2022

I guess I have a different understanding of what "publishing" an output should mean, logically and conceptually. As a workflow developer I publish files that I want users to have as output. The idea of allowing users to flexibly and arbitrarily configure the outputs they want is somewhat of an anathema to me.

In some very specific use cases I know my group adds options to enable publishing of some large files, based on user feedback. These are typically intermediate files that aren't a product of the workflow per-se (an answer to a question), but something a user might need in order to perform their own additional analysis. The classic example is a large BAM/CRAM file.

I imagine an argument against this approach is that it puts a burden on the developer to enable such parameter flags, when a user could use the configuration system to make such choices. But how many users (not workflow developers) are going to be able to construct configuration code like the examples on this page? So my answer to the original counter-argument is "so what?". Workflow developers should take more care, responsibility, and pride in constructing workflows to provide useful and meaningful outputs.

bentsherman Aug 18, 2022
Maintainer Author

In my former lab we also liked being able to toggle the publishing of intermediate files with params. But it's only a matter of changing the publish pattern, which you could do just as easily with a publish operator. And actually the operator might be easier to work with because you can explicitly select which channels to feed into the publish operator, rather than constructing a pattern that matches all of your desired outputs.

mahesh-panchal · 2022-08-10T19:43:52Z

The behaviour of file is a little frustrating. Depending on the input, the output could be either a Path or a List. One problem I have is when passing a glob. The ordering of the List then depends on the OS instead of the glob, so file("data/reads_{1,2}.fa") may come out as [ data/reads_1.fa, data/reads_2.fa ] or [ data/reads_2.fa, data/reads_1.fa ]. However, I need the data in the order defined by the glob. Since this path could be defined in a CSV file, etc., it also needs to support a single file path, but that means I cannot use sort directly on the output. If the input is not a glob, sort will split the resulting Path on the / and sort the folders and file by name, which means the code then also needs to check whether the output of file is a List or a Path before doing the sort.

Example snippet:

workflow {
    Channel.fromPath( 'test.tsv', checkIfExists: true )
        .splitCsv( header: ['sample_id','datatype','sequences'], skip: 1, sep:'\t' )
        .branch { record -> def seqs = file( record.sequences, checkIfExists: true)
            // If seqs is not a list, path is the absolute path
            // If seqs is a list, path does not preserve glob order and are relative paths
            if ( seqs instanceof List ) {
                seqs = seqs.sort() { it.name }
            }
            tsv_pacbio_ch : record.datatype == 'PacBio'
                return [ [id: record.sample_id ], seqs ]
            tsv_illumina_ch : record.datatype == 'Illumina'
                return [ [id: record.sample_id ], seqs ] 
        } | mix | view
}

There's a similar issue inside processes where for example if the input could be a single file or list, one needs to test for the object type before calling size (I think) for example to get the number of files supplied.

5 replies

bentsherman Aug 10, 2022
Maintainer Author

This is kind of like the glob output path returning either a single file or a list. The fundamental problem is the same -- a function should always return the same type, otherwise it just creates confusion. So a solution would be to create a new glob() function that always returns a list, even if it's empty or contains only one item. Then we deprecate the glob support in file() in favor of glob().

I don't think we can help the OS-dependent ordering of files, but if you know that you'll always get a list from glob() then you can sort it without any type checking.

As for the ambiguity of path inputs, I think we can apply the solution that Paolo proposed for path outputs: #2425 (comment). Basically, add an option (e.g. cardinality) to specify the number of expected paths, including a wildcard if you don't know the exact number but want it to be a list.

mahesh-panchal Aug 11, 2022

Agreed that a function should always return the same type.

How would the glob function be used in practice though? Take the example above where it's a user-supplied string. The programmer then needs to figure out if the string is a glob or not before calling the correct function. I think it would be nicer if file just returned a List of one Path when the string is not a glob.

cjw85 Aug 11, 2022

This one is just one of those things that needs to change, and break backwards compatibility. It's just bad design.

bentsherman Aug 11, 2022
Maintainer Author

@mahesh-panchal It turns out there is already a files() method that does this, just wasn't documented 😆
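
A minimal sketch of how the earlier snippet could look with files(), assuming it behaves as described here (always returning a list, whether or not the string contains a glob). The `test.tsv` layout is the one from the example above:

```groovy
workflow {
    Channel.fromPath( 'test.tsv', checkIfExists: true )
        .splitCsv( header: ['sample_id','datatype','sequences'], skip: 1, sep: '\t' )
        .map { record ->
            // files() is assumed to return a list in both the glob and the
            // single-path case, so no instanceof check is needed before sorting
            def seqs = files( record.sequences ).sort { it.name }
            [ [id: record.sample_id], seqs ]
        }
        .view()
}
```

Sorting by file name restores a predictable order for simple globs such as `reads_{1,2}.fa`, though it is still not literally "the order defined by the glob".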

bentsherman Aug 11, 2022
Maintainer Author

@mahesh-panchal I also missed your question about glob. It's a good point, user could provide a path string that may or may not contain a glob. I was thinking to deprecate glob support in file() but now I think it's better just to leave as is and update the docs.

mahesh-panchal · 2022-08-10T20:06:55Z

I just saw the updates to the text. How do you use flatMap to replace transpose? I use it a lot in this workflow:
https://github.com/nf-core/genomeassembler/blob/dev/subworkflows/local/prepare_input.nf

6 replies

mahesh-panchal Aug 10, 2022

The yaml input file allows a list of input files for various types of data and the number of files per sample can vary. They all need to be processed individually before being processed together, so I use transpose to associate meta data with each file once the list of files has been read in.

bentsherman Aug 10, 2022
Maintainer Author

Okay I understand. Paolo has also given some context here: #3105 (comment)

I think we just need to improve the docs on it then.

mahesh-panchal Aug 11, 2022

A little sleep helps, it seems. So my transpose operations are all flatMap { meta, files -> files.collect{ [ meta, it ] } }, but yes, it's nice to have an operator to do the inverse of groupTuple. I thought the docs explained the concept clearly enough, but it would help if it was explained as the inverse operation to groupTuple.
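
To make the equivalence concrete, here is a small sketch for the simple one-list-per-tuple case; the meta maps and file names are invented for illustration:

```groovy
workflow {
    // each item pairs one meta map with a list of files (hypothetical names)
    ch = Channel.of(
        [ [id: 'A'], [ 'a_1.fq', 'a_2.fq' ] ],
        [ [id: 'B'], [ 'b_1.fq' ] ]
    )

    // flatMap version: emit one [ meta, file ] pair per file
    ch.flatMap { meta, files -> files.collect { [ meta, it ] } }
        .view { "flatMap:   $it" }

    // transpose version: should emit the same pairs for this 1D case
    ch.transpose()
        .view { "transpose: $it" }
}
```

Both branches emit three items here, one per file, each still carrying its meta map.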

cjw85 Aug 11, 2022

As I commented in the issue Ben linked, your use of transpose is the toy 1D case that kinda works alright and obeys normal expectations of a transpose operator. My issue with transpose is what it's doing with higher dimensional structures.

I think an operator which is a strict inverse of groupTuple is useful.

bentsherman Aug 11, 2022
Maintainer Author

Good point, I would be interested to see if anyone is using transpose in the multi-dimensional case...

mahesh-panchal · 2022-08-11T07:23:36Z

Agreed that distinct is confusing. Since channels are supposed to be asynchronous, what does consecutive mean when input order into the channel queue is not guaranteed? The output would not be consistent.
How is this operator used in practice?

4 replies

bentsherman Aug 11, 2022
Maintainer Author

I guess it would only be useful when receiving items from a channel factory or operator. Channel factories preserve their order by definition, but operators only happen to preserve it because they don't use parallelism. And of course on a process output it wouldn't make sense. Here again I would be interested to see how it's being used out in the wild.

cjw85 Aug 11, 2022

I made a comment on this below, lemme clean up...

distinct and unique
The difference between these two is that distinct only removes consecutive duplicates whereas unique removes any duplicates. Still kinda confusing, like it's not obvious which one is which. Might be better to deprecate distinct and add e.g. consecutive option to unique.

Haha, this seems backwards; cf. the Unix tool uniq, which only detects neighbouring duplicates. Admittedly that's a gotcha newbies fall foul of; I would class it as acceptable since people should understand we are dealing with streams and a simple unique is not in fact so simple.

I would have a unique that works like uniq for people to use on the factories, and give it an option to ignore order. I considered just saying people should do channel.sort().uniq() like you would on the command line, but actually a unique operation on a channel can take shortcuts and doesn't need to create the fully sorted stream to work.

bobamess Aug 12, 2022

For those who code in R, note that unlike the Unix command uniq, R's unique() function omits duplicated and not just repeated elements/rows. That is, an element is omitted if it is equal to any previous element and not just if it is equal to the immediately previous one. See https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/unique
To match the Unix uniq, R has the rle() function.

bentsherman Aug 12, 2022
Maintainer Author

According to the source code Nextflow operators are inspired by RxJava, which is basically the standard reactive programming library. Looking at their wiki, I think their distinct and distinctUntilChanged are equivalent to unique and distinct in Nextflow (respectively).

https://github.com/ReactiveX/RxJava/wiki/Filtering-Observables#distinct
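
A quick side-by-side sketch of the two Nextflow operators on the same values, using a channel factory so the input order is fixed:

```groovy
workflow {
    // distinct drops only consecutive repeats: emits 1, 2, 1, 3
    Channel.of( 1, 1, 2, 2, 1, 3 )
        .distinct()
        .view { "distinct: $it" }

    // unique drops every repeat, wherever it occurs: emits 1, 2, 3
    Channel.of( 1, 1, 2, 2, 1, 3 )
        .unique()
        .view { "unique: $it" }
}
```

So distinct is the one that behaves like Unix uniq, and unique is the one that behaves like R's unique().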

mahesh-panchal · 2022-08-11T07:52:42Z

countBy is deprecated in the code, but not in the docs.
Someone once asked how to implement the symmetric difference of two channels and this would have been a useful operator, although it wouldn't extend to complex data with countBy.
Symmetric difference:

workflow {
    Channel.fromList( [ 'A', 'B', 'C', 'D' ] )
        .mix( Channel.fromList( [ 'C','D','E' ] ) )
        // .countBy() // deprecated
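        .map{ it -> [ it, 1 ] }              // tag each item so occurrences can be grouped
        .groupTuple()                        // [ item, [1, ...] ] with one entry per occurrence
        .filter { it -> it[1].size() == 1 }  // keep items that occurred exactly once
        .map{ val, count -> val }            // drop the occurrence list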
        .view()
}

It can be difficult to implement operations that need to filter channels based on the contents of others.

1 reply

bentsherman Aug 11, 2022
Maintainer Author

I just reviewed the code where operators are defined and there are a few inconsistencies, operators that aren't documented or whose deprecation status doesn't match the docs. I might just make a PR to resolve all of them at once.

But I'm not sure why countBy is deprecated in the code. It's a useful operator to have. It is a special case of reduce, but so are most of the math operators: min, max, sum, count, etc. I would be fine with keeping it.
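
To illustrate the "special case of reduce" point, counting occurrences can be sketched with reduce alone. This is not the countBy implementation, just one way to get an equivalent result:

```groovy
workflow {
    Channel.of( 'a', 'b', 'a', 'c', 'b', 'a' )
        .reduce( [:] ) { counts, item ->
            // increment the running count for this item
            counts[item] = (counts[item] ?: 0) + 1
            counts   // the returned map becomes the accumulator for the next item
        }
        .view()   // a single map of item -> number of occurrences
}
```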

mahesh-panchal · 2022-08-11T08:18:38Z

There needs to be more documentation and functionality for workflow programmers to report warnings and errors to workflow users.

What's the best way of telling users of a workflow that something is not correct?
For example the error function doesn't look like it's really for workflow programmers to report errors to users. It reports there's a problem on line X ( e.g. -- Check script 'main.nf' at line: 7 or see '.nextflow.log' file for more details ), but the workflow programmer would like the workflow to end with a useful error message to the user.
exit <val>, <message> currently only works correctly if you don't use it within process arguments ( see #3046 ).

Reporting warnings is also tricky, for example if you want to tell a user that a channel is empty.
ch.ifEmpty { log.warn("My warning") } puts something in the channel so I need to exit otherwise I get an input error, whereas I would rather a more graceful exit where nothing downstream is executed, but I get a nice warning message just to be clear and the other unrelated processes finish.

Parameter validation, and schema validation for CSV/TSV inputs would be appreciated too ( along with perhaps a Channel factory for YAML/ other common input file formats and schema validation )

3 replies

cjw85 Aug 11, 2022

Agreed, this is an area where I think nf-core plugs holes in Nextflow, holes that really should be part of the language.

bentsherman Aug 11, 2022
Maintainer Author

I tested the error function and it worked fine for me. It prints the message and points you to the line where it happened.

$ cat error.nf 

error "workflow failed. sorry."

$ nextflow run error.nf 
N E X T F L O W  ~  version 22.08.0-edge
Launching `error.nf` [golden_morse] DSL2 - revision: 4be99e7509
workflow failed. sorry.

 -- Check script 'error.nf' at line: 2 or see '.nextflow.log' file for more details

AFAIK both error and exit are fine to use for reporting errors, not sure if one is preferred over the other.

Not sure what to do about warning the user about an empty channel. I guess you could append a filter operator but that feels like a hack.
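
For what it's worth, the filter hack could be sketched like this. The `'EMPTY'` sentinel is a hypothetical placeholder value, and the sketch assumes `log` is usable inside the closure as it is elsewhere in a script:

```groovy
workflow {
    ch = Channel.empty()   // stand-in for a channel that turned out to be empty

    checked_ch = ch
        .ifEmpty {
            log.warn "No input samples found; skipping downstream steps"
            'EMPTY'   // hypothetical sentinel, emitted only when ch is empty
        }
        .filter { it != 'EMPTY' }   // remove the sentinel again

    // nothing is emitted here, so a process consuming checked_ch would not run
    checked_ch.view()
}
```

The warning is printed once, and unrelated branches of the workflow are left to finish normally.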

Regarding input validation, that is indeed something that nf-core has developed and would like to incorporate into Nextflow itself. See the thread on the new config syntax (#2723) for related discussion.

cjw85 Aug 11, 2022

Don't get me started on input validation 🤣

cjw85 · 2022-08-11T09:06:36Z

@bentsherman

On the merge operator:

While it is safe with channels from the same process

Is this definitively true? I've asked this question to Seqera staff several times and no one has ever been able to confidently tell me that the implementation guarantees that channels emitted from the same process have an identical ordering (e.g. that there is a lock on the set of channels from a process such that items can only be injected from a single instance of the process at a time).

4 replies

bentsherman Aug 11, 2022
Maintainer Author

Let's test it and see:

$ cat merge.nf 

process foo {
    input:
        val i
    output:
        val x
        val y
        val z
    exec:
    x = i * 1
    y = i * 2
    z = i * 3
}

workflow {
    // should print nothing if all outputs are consistently ordered
    Channel.of(1 .. 1000) | foo | merge | filter { x, y, z -> (y != 2 * x) || z != (3 * x) } | view
}

$ nextflow run merge.nf -pool-size 100
N E X T F L O W  ~  version 22.08.0-edge
Launching `merge.nf` [happy_dubinsky] DSL2 - revision: da3eb1d99c
executor >  local (1000)
[ae/5db328] process > foo (990) [100%] 1000 of 1000 ✔

While the process processes inputs in an arbitrary order, it still emits outputs in a consistent order. The underlying mechanism is that a channel emits items in the same order it receives them. Might seem like a trivial statement, but that's how you know that process outputs will always be consistently ordered.

cjw85 Aug 11, 2022

I'm not sure running it once is proof of the guarantee I'm after.

You have three channels (x, y, z) and a multiplicity of process instances (1, 2, 3, ...) wanting to put values into those channels. When values are taken from the processes and put into the channels, is it all done in a serial fashion such that x, y, z are taken from process instance 1 and put into channels x, y, z before any data is collected from process instance 2 and put into the channels? I can easily envisage, if collection is done in e.g. multiple threads (one per process), that the channels could become differently ordered.

bentsherman Aug 11, 2022
Maintainer Author

I updated my example to be bigger, feel free to keep scaling it further if you need more convincing. Every task is executed in a separate thread. If I have some extra time one day I might see if I can find more definitive evidence in the source code. I share your suspicion but perhaps it performs the output emissions with a queue so that the outputs are consistent.

IMO it's kind of a moot point anyway because you shouldn't be using merge in the first place, ever.

cjw85 Aug 11, 2022

Every task is executed in a separate thread.

This statement alone will mean I remain unconvinced until proven otherwise.

On the idea that merge shouldn't be used, this is a definite area of improvement for the docs. The operators docs talk a lot about operations based on keys, but I don't think I ever read a statement along the lines of "you do put a useful key in all your channels don't you?". I think someone has already commented it as being a best practice thing to do. It takes a while to realise the power of doing so.
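
As a sketch of the keyed alternative, the foo example from earlier in this thread could carry its input value along as a key and pair outputs with join instead of merge. The emit names `xs` and `ys` are made up for illustration:

```groovy
process foo {
    input:
    val i

    output:
    tuple val(i), val(x), emit: xs   // key each output with the input value
    tuple val(i), val(y), emit: ys

    exec:
    x = i * 1
    y = i * 2
}

workflow {
    foo( Channel.of(1 .. 10) )

    // pair outputs by their shared key rather than by arrival order
    foo.out.xs
        .join( foo.out.ys )
        .view()   // each item is [ i, x, y ]
}
```

With a key in place, the result no longer depends on whether the output channels happen to be identically ordered.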

cjw85 · 2022-08-11T09:20:15Z

syntax errors within a process or config block are not specific enough

I think this should be the number one priority for Seqera to fix, to the extent I would say no new features until this is fixed. My developers have spent countless hours looking through each other's code trying to find minor typos.

5 replies

bentsherman Aug 11, 2022
Maintainer Author

I doubt we're going to do that, but it is probably one of the largest holes in the Nextflow language. See the linked issue for related discussion. I don't think I'm Groovy enough to know how to solve it yet. Best I could do for now is to expand the vague error message to at least direct users to look for a syntax error in that area.

cjw85 Aug 12, 2022

I've run out of fingers counting the number of people who have abandoned trying to use Nextflow because of this problem. I think a lot of people could see past the issues of the other themes I've raised if the debugging experience were not complete c... . I'll certainly forgive everything else if just this were fixed. 😄

benbfly Jul 25, 2023

I agree whole-heartedly with @cjw85 , one year later. I don't have the perspective on how much it's improved in the last year (since I'm new), but it's still a MASSIVE impediment to people developing for Nextflow. All the new cloud and Tower features are great, but these glaring problems at the core of the language remain a huge problem.
As long as this remains a problem, I would also mention it early and often in the training materials, so that people don't feel crazy and go out of their mind when learning to develop for nextflow. It's the hardest programming language they will ever learn by a mile.

pditommaso Jul 25, 2023
Maintainer

Received. We are well aware of these problems and there are plans to bring the language to the next level. Stay tuned.

benbfly Jul 25, 2023

Thanks @pditommaso . I appreciate that it's not an easy problem to tackle and will take time and lots of effort. We all support you in this!
I think the most important thing that could be done immediately is to let people know about these problems in flashing red lights at the beginning of training materials and documentation. It's demoralizing to go through training materials and documentation that feels like "look how easy operators and channels are", and then when you start building real workflows you start getting into very confusing error messages and start feeling crazy. That is why the comments above from @cjw85 resonated with me - and we are not inexperienced programmers.

cjw85 · 2022-08-11T09:45:54Z

cjw85
Aug 11, 2022

I think the documentation should be more detailed

I'd suggest some style review of the documentation by a technical writer, or at least discussion with (new) users. I've always felt the documentation assumes the reader immediately understands the power of various concepts. Take the groupTuple docs: they state the basic premise of grouping on a key and show an example. But they don't immediately state what the default key is, so the reader naturally starts trying to figure this out from the example. It's possible to not notice that the next line tells you the answer. A lot of this could be solved by giving the method signature and default parameters first, like numpy/pandas/scipy/almost anything else does. This gives the user a summary of the method, and they can start to see the gist of things before seeing an example.
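To make the groupTuple point concrete, here is a minimal sketch that spells out the grouping key instead of relying on the default. The sample names and file names are made up for illustration; `by: 0` (the first tuple element) is, as far as I know, what groupTuple uses when no key is given.

```groovy
// Minimal sketch; the channel contents are invented for illustration.
workflow {
    Channel
        .of( ['sampleA', 'a_1.fq'], ['sampleB', 'b_1.fq'], ['sampleA', 'a_2.fq'] )
        // `by: 0` groups on the first element of each tuple.
        // Leaving it out should give the same result, since 0 is the default key index.
        .groupTuple(by: 0)
        .view()
}
```

Seeing `by: 0` written out is roughly what a signature-first reference entry would tell the reader before any example.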

Ironically, I think it labours the point on Channels and perhaps makes them more intimidating than they are. It's interesting that Seqera have publicly commented that they recognise many of their users are coming from Python and don't want to write Groovy-like things. This should be leveraged. There's an obvious common, fundamental concept in Python that can be used as an analogy to Channels 😉

5 replies

bentsherman Aug 11, 2022
Maintainer Author

We're planning to revamp the Nextflow website and docs to be, well, better. Both in terms of organization and writing quality. The biggest improvement IMO will be to have clearly delineated reference docs vs user guide vs developer docs. The numpy/pandas docs are a good suggestion too, I like how they just list the arguments and then provide examples.

I don't know what public comments you're talking about, but in the Nextflow docs I'm reluctant to make too many Python comparisons. For better or worse, it's important that Nextflow users learn how to do things "the Groovy way", not try to pretend they're still in Python. I assume you're talking about generators -- it's a fair analogy, but I'm not sure how helpful it would be to just say e.g. "channels in Nextflow are kind of like generators in Python". Need to think on that one. I do agree that the channel docs could probably be more concise.

cjw85 Aug 11, 2022

Teaching by analogy is immensely powerful.

I think the statement "for those coming from Python, Channels are tantamount to generators in Python, and in Nextflow we mostly pass around references to filepaths in them" flattens the Everest learning curve down to a molehill. It's a valid point that users should accept that Nextflow is not Python, but people need an on-ramp to learn.

I've found it to be the elephant in the room whilst trying to get others to use Nextflow: no one wants to learn Groovy. Those that try find it a miserable experience. There are very few examples online, the learning resources are pretty patchy, and you quickly descend into reading Java docs. The nf-core community and templating are, to me, a stunning example of the lack of desire: there are long-time users of Nextflow who have avoided developing more than a cursory understanding of Groovy. The language is the new COBOL: it hangs around in a few areas, but doesn't have a critical mass of users to remain appealing to learn and use. Unfortunately I think Nextflow suffers because of this.

mahesh-panchal Aug 12, 2022

I chose to use Nextflow specifically because I don't know python, but read that Groovy was a superset of Java and I've used Java before. However I don't think I've really needed to learn Groovy to use Nextflow. The majority of what I've done that Nextflow doesn't have operators for, is basically Map or List manipulation, and for 95% of that Working with collections has been my reference. Then I'll go play about here https://groovyconsole.appspot.com/ to get syntax correct, but I think the problems themselves are mostly mathematics or structure manipulation. Between the Nextflow docs and that, I've learned a lot of what I needed, and StackOverflow and Slack took care of the rest. What's been the best learning tool though for me has been to abstract away the details, and write toy examples to explore functionality. This is a skill that I think all courses should teach. Analogies help, but in the end, getting a user to go away from their specific use case and abstract it to figure out what they're trying to manipulate and play with is a key skill.

Actually, I forgot about closures. Anonymous functions are not a common thing, but there's still lots of stuff on the Groovy docs page above that I've never needed to care about or use.

cjw85 Aug 12, 2022

I definitely agree with the sentiment that you shouldn't have to learn Groovy to use Nextflow, at least not in a purist sense in isolation. The heart of the issue is that currently many users do because of a confluence of other issues in the Nextflow language and documentation.

Users do need a practical understanding of Groovy, which is what you're getting at when you talk about exploring with toy examples.

The Nextflow scripting section of the Nextflow docs is quite useful.

Several comments:

I think you're in the minority having used Java before. And that's a massive leg up in trying to comprehend Groovy documentation and help resources. Many assume, explicitly in some cases but implicitly in many, that the reader is coming from Java.
Nextflow is not a superset of Groovy; we learnt this the hard way when trying to write functions with optional arguments. Similar to Ben's comment, you never quite know if something is supposed to work, whether you've done something wrong in Nextflow, or something wrong in Groovy.
"Problems are mathematics or structure manipulation" --- this is precisely what Ben and I are getting at when discussing that channel operators have unusual or unexpected semantics. With a wholesale rewrite of the operators to be named and behave like operators in other contexts (SQL, R tidyverse, Python Pandas, linear algebra) this problem would largely go away, through, dare I say it, analogy!

I don't mean to turn this into a rant about Groovy. As I say, I think not needing to "learn Groovy" in and of itself should be one of the aims of improving Nextflow as a language.

benbfly Jul 25, 2023

I so completely agree with @cjw85 here, and it's actually very analogous to the comments above about syntax error reporting.
The one that got me was the same one above, the documentation for groupTuple. Scatter/gather is a central tenet of this kind of distributed computing platform, yet the operator documentation remains completely inadequate.
This is in contrast to some of the documentation on cloud platforms, Tower, etc. which are highly detailed technical documentation.
Python/Groovy is a complete red herring. I have programmed way more in Java than in Python and feel very comfortable with Groovy style. The fact is that Nextflow is big enough now that it needs rigorous technical documentation (which can be in addition to more informal training materials and "patterns" documents.)

da-i · 2022-08-11T14:44:59Z

da-i
Aug 11, 2022

I fully agree with you @bentsherman that there is that constant feeling of doing things incorrectly/inefficiently. It's also interesting that some best practices are implemented in a different project that is not part of the language, such as a linter, that is developed by the same organisation but that is not compatible with the generic language. In my mind things should be applicable to all applications of the language and not a specific subset.

I would also like to mention the other recent discussion here that goes hand in hand with re-usability of modules with respect to the publishing of files: reusable code should not publish, but their results should be publishable.

2 replies

cjw85 Aug 11, 2022

Similar to my comments above, there are definitely aspects of nf-core and its tools that should be implemented as part of core Nextflow. That's not to say that I believe nf-core or its practices should wholesale be part of Nextflow.

bentsherman Aug 11, 2022
Maintainer Author

Thank you for your comments @da-i. I would also like to see a linter for Nextflow as I think it would force us to make sure that Nextflow operates as a "real" language with a well-defined grammar, etc. There are some nf-core folks working on such a linter, I'm sure that if they get it to work then we will try to make it part of Nextflow and not merely nf-core.

I think I've seen some discussion about module re-usability. Is there an existing issue about it? Would love to add it to the list.

mahesh-panchal · 2022-08-13T02:48:15Z

mahesh-panchal
Aug 13, 2022

Inspired by the publishDir discussion on whether this should be workflow logic or configuration, is the when: block that's currently part of the process script scope. Since this controls data flow, this is something that I think should be in the workflow scope instead, like filter or branch.

One thing nf-core has done is to move the specification of the when: condition to the configuration by including the snippet

    when:
    task.ext.when == null || task.ext.when

in the process block. This was done because there are workflows that want to execute processes based on the presence of certain tool options. However, nf-core had earlier switched to using process selector expressions and ext.args to pass tool parameters, rather than a complex params map, and since process configuration is not accessible in the workflow scope, using ext.when from the process configuration scope meant we could control process execution again based on tool parameters. This is not ideal as workflow logic is now also in the configuration.

Since configuration files are executable code for the time being, one could allow process selector expressions within the params configuration scope.

process.putAll(params.subMap(params.keySet().findAll{ it.startsWith('with') }))

Then there's no need to use the when: as tool options would be specified in the params configuration scope and therefore visible in the workflow scope too allowing branch, filter and if to control workflow logic.

I guess what this boils down to is:

Allow process selector expressions in params configuration scope to pass parameters to processes and subworkflows.
Deprecate when: in the process script scope, and include it as an option to processes for code readability purposes e.g. MARK_DUPLICATES( sorted_bam_ch, when: !params.skip_markduplicates ) ( instead of having branching/filtering everywhere ).

Allowing process selector expressions in the params configuration scope would also deprecate the need for addParams and mean that parameter specification would only be in the configuration files rather than at various levels of nesting in workflows and subworkflow module files.

The downside of not having it in the configuration though is that it prevents users from disabling entire subworkflows they don't want to execute. So perhaps there should also be a process.enabled directive which can be used to disable processes that don't play well with the host execution system.

3 replies

cjw85 Aug 13, 2022

TL;DR: I don’t see the point of when:, just use control logic in a workflow scope.

using ext.when from the process configuration scope meant we could control process execution again based on tool parameters. This is not ideal as workflow logic is now also in the configuration

Things like this are why I say nf-core has come up with solutions to problems that ought not to exist. I've had a few conversations with nf-core people along the lines of "why can't I just do it X <in this simple way>" with the answer being "Nextflow doesn't allow Y so you have to do it <in this horribly convoluted way>". * Some of it also is perhaps born out of nf-core's desire to modularise and abstract processes into oblivion. This is the mistake that other workflow language communities (notably CWL) made by attempting to create generic wrappers for all programs so that the same wrappers could be used in all workflows. Modularity is great, but like most things in moderation.

edited above paragraph to escape <...> comments.

I dislike

MARK_DUPLICATES( sorted_bam_ch, when: !params.skip_markduplicates )

it seems like it's adding something that is already perfectly well served by:

if( params.run_markduplicates ) {
    MARK_DUPLICATES( sorted_bam_ch )
}

without adding anything to the language. There's nothing wrong to me with mixing stream/dataflow syntax with a more imperative style. Why is the above not also the obvious solution to:

The downside of not having it in the configuration though is that it prevents users from disabling entire subworkflows

We should remember the old C++ adage: “there is a much smaller and cleaner language struggling to get out.” It seems to me nf-core has resorted to configuration-from-a-distance and state injection, partly because simple things like a publish operator are not present.

To be clear, I’m not saying that Nextflow should disallow some of what nf-core is trying to do. Rather, care needs to be taken in pushing things into the language when taking a step back could lead to something simpler.

*Aside, I think it ties into my earlier point about Groovy: if more people had the impetus and skills to send PRs to Nextflow with better solutions to these problems, the language would be in a better state with regard to these concerns.

mahesh-panchal Aug 13, 2022

if( params.run_markduplicates ) {
    MARK_DUPLICATES( sorted_bam_ch )
}

I prefer this too, but when: also gives you access to variables in the process task context too, so if a solution for that can be proposed with the current syntax I'm all for it.
http://nextflow-io.github.io/patterns/index.html#_execute_when_empty

Scratch that; I hadn't thought about it enough. The above could also be achieved with a branch, so I guess, just deprecate when: then.

bentsherman Aug 18, 2022
Maintainer Author

See #2518 for related discussion. In most cases you can use if statements or branch operator to control process execution. Even if you want to condition based on process-specific directives like ext, I think you could define those args as params, which would then be referenced by both the workflow logic and the config file where ext is set for each process.

For now I think we'll just discourage the when: block in the docs, by adding a note that explains these alternatives.
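A rough sketch of the two alternatives, using the names from this thread. The process body, the channel contents and the `skip_markduplicates` parameter are stand-ins for illustration, not a real pipeline.

```groovy
// Hypothetical parameter, named after the example earlier in this thread.
params.skip_markduplicates = false

// Stand-in process: it only echoes its input so the sketch needs no real files.
process MARK_DUPLICATES {
    input:
    val bam

    output:
    val bam

    script:
    """
    echo "marking duplicates in ${bam}"
    """
}

workflow {
    sorted_bam_ch = Channel.of( 'a.sorted.bam', 'b.sorted.bam' )

    // Option 1: an if statement. The process is not called at all
    // when the flag is set.
    if( !params.skip_markduplicates ) {
        MARK_DUPLICATES( sorted_bam_ch )
    }

    // Option 2 (use instead of option 1): always call the process, but
    // filter its input so that nothing reaches it when the flag is set.
    // MARK_DUPLICATES( sorted_bam_ch.filter { !params.skip_markduplicates } )
}
```

Either way the condition lives in the workflow block, next to the rest of the dataflow logic, rather than inside the process.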

mahesh-panchal · 2022-08-13T03:00:46Z

mahesh-panchal
Aug 13, 2022

Something cosmetic is the use of the branch operator vs the if control statement. Ideally, in the dataflow paradigm, one should use the branch operator; however, in practice we start using if to prevent the flood of non-executed processes on screen. This then results in the configuration file needing to use if too, to prevent all the "process selector does not match" warnings.

It would be nice if there was an option to toggle the printing of non-executed processes to screen/log ( on by default, so one can turn off for debugging )

5 replies

bentsherman Aug 18, 2022
Maintainer Author

See #2700 for related discussion. Seems like we just need a way to disable process selector warnings. I have a feeling that the next config syntax won't support if statements anyway.

mahesh-panchal Aug 18, 2022

This one isn't about the selector warnings, but more that there's different behaviour on screen with which processes are being run. If you use if, the process doesn't appear in the log, but if you use branch, the process appears but shows nothing is executed. Then scale this up with large workflows, where you may have a handful of processes that run, but your screen is also flooded with all the processes that are not run if branch was used for all conditional execution. Some end users of workflows have mistaken this for an error and are wondering why certain processes are not run.

bentsherman Aug 18, 2022
Maintainer Author

My understanding is that Nextflow basically has a "build" step and a "run" step, and if statements are evaluated during the build step. In other words, Nextflow first constructs the entire DAG and then "ignites" it. If you condition a process only on branch or filter, then that process is part of the DAG, and it is impossible for Nextflow to know if the process will be invoked until, well, running the workflow all the way through.

mahesh-panchal Aug 18, 2022

Sure, but the execution of the DAG is separate from the logging right? A process only gets logged if a process is invoked.

bentsherman Aug 18, 2022
Maintainer Author

Okay I see what you're saying. That should be easy to implement since the entire process list is printed every time, so just filter processes based on whether or not they've been invoked yet.

cjw85 · 2022-08-14T19:24:41Z

cjw85
Aug 14, 2022

support each input repeater with other types e.g. tuple

This was one of those wrinkles in the grammar that took me ages to realise was a bug and I wasn't doing something wrong.

I have a more radical proposal: deprecate each, since it doesn't achieve anything that can't be achieved with operators. I gave up using each and stuck with product operators after hitting this bug. But even when I was using it, it just felt like a dirty shortcut to using an operator --- looking at a workflow scope, a product is more explicit; it's easy to miss an each.

2 replies

mahesh-panchal Aug 15, 2022

I agree.

It would be beneficial to force programmers to be more explicit in the workflow scope.
There are lots of things that can be hard to detect. For example, in most circumstances output channels are queue channels, but when all inputs are value channels, the output channel is also a value channel (which can then lead to reader confusion, e.g. why is collect needed here but not there?).

benbfly Jul 25, 2023

Agree. I just wrote my first each process, and am experiencing a completely weird bug that occurs in a sporadic, non-deterministic manner. I don't know if it's because of the "each/tuple" issue, because the Nextflow documentation is confusing:

Note: Input repeaters currently do not support tuples. However, you can emulate an input repeater on a channel of tuples by using the combine or cross operator with other input channels to produce all of the desired input combinations.

I can't even understand what this means. Does this mean the each channel can't be a tuple, or that the channel you're crossing it with can't be a tuple (the latter is what I'm trying to do)?
This is a good case that illustrates two problems with core Nextflow - (1) inadequate documentation, (2) no error message or misleading error message when you use this "unsupported" usage.
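For what it's worth, this is a sketch of what the note seems to be describing: building the combinations in the workflow with combine and passing them to the process as a single tuple input, rather than using each. The process name, sample names and modes are hypothetical.

```groovy
// Hypothetical process: it receives every (sample, reads, mode) combination
// as one tuple, so no `each` input is needed.
process ALIGN {
    input:
    tuple val(sample), val(reads), val(mode)

    output:
    stdout

    script:
    """
    echo "${sample} ${reads} ${mode}"
    """
}

workflow {
    // A channel of tuples, plus the values we would otherwise declare with `each`.
    samples_ch = Channel.of( ['s1', 's1.fq'], ['s2', 's2.fq'] )
    modes_ch   = Channel.of( 'fast', 'sensitive' )

    // combine emits the cartesian product, e.g. ['s1', 's1.fq', 'fast'].
    ALIGN( samples_ch.combine(modes_ch) ) | view
}
```

The product is then visible in the workflow block, which is the explicitness argued for earlier in this thread.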

mahesh-panchal · 2022-08-15T08:23:47Z

mahesh-panchal
Aug 15, 2022

Make -resume the default behaviour. The poll ( way back then ) even preferred it: https://nextflow.io/blog/2019/troubleshooting-nextflow-resume.html

0 replies

mahesh-panchal · 2022-08-15T08:51:30Z

mahesh-panchal
Aug 15, 2022

I think it would help if Channel creation were also limited to the workflow scope.

I think it would also be beneficial to limit params to the workflow scope ( as it's visible from everywhere currently ), and as I've mentioned before implement some kind of selector that would replace addParams to change where it's visible so parameter configuration is all in one place.

7 replies

cjw85 Aug 15, 2022

no one passes params in a channel

No reason you can't. addParams feels to me like something added in to cope with other deficiencies, either in the language or writing style (a style caused by deficiencies in the language --- all comes back to that). I think I'm getting back to my point of a small simpler language trying to get out.

bentsherman Aug 18, 2022
Maintainer Author

I agree that channel logic should only exist in workflow blocks. But params are global, so I don't think it makes sense to restrict their scope. Like Chris said, you can always pass params as process inputs to keep your processes pure.

I haven't used addParams at all, can you give an example of how it's typically used?

mahesh-panchal Aug 18, 2022

As I understand it, addParams is for when you have an existing workflow, that you include in your own but need to set the workflow params.

E.g. BUSCO workflow written by someone else available on-line for reuse:

params.mode    = 'genome'
params.contigs = ''

workflow {
    BUSCO_WORKFLOW()
}

workflow BUSCO_WORKFLOW {
    BUSCO (
        Channel.fromPath( params.contigs, checkIfExists: true),
        params.mode
    )
    ...
}

Then I want to include this workflow in my own, so I go and do:

include { BUSCO_WORKFLOW as BUSCO_GENOME        } from './subworkflows/busco' addParams( mode: 'genome' )
include { BUSCO_WORKFLOW as BUSCO_TRANSCRIPTOME } from './subworkflows/busco' addParams( mode: 'transcript' )

I don't use it myself since I'm normally the one writing the workflows, but it used to be used extensively in nf-core to pass around the tool options for processes, before switching to the script injection method using ext.args. See: https://github.com/nf-core/rnaseq/blob/bc5fc76f40b2da6082a854927184c9d6e5060393/modules/nf-core/subworkflow/fastqc_umitools_trimgalore.nf#L9-L11

bentsherman Aug 18, 2022
Maintainer Author

I see. Seems like those params should be exposed as inputs to the workflow. But maybe this params approach is more convenient / less verbose.
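Something like this is what I have in mind, reusing the names from the example above. The BUSCO process here is only a stand-in so the sketch is self-contained, and the input file name is made up.

```groovy
// Stand-in for the real BUSCO process; only the interface matters here.
process BUSCO {
    input:
    val contigs
    val mode

    output:
    stdout

    script:
    """
    echo "busco ${contigs} --mode ${mode}"
    """
}

// The workflow declares what it needs via take: instead of reading params.
workflow BUSCO_WORKFLOW {
    take:
    contigs
    mode

    main:
    BUSCO( contigs, mode )

    emit:
    BUSCO.out
}

workflow {
    contigs_ch = Channel.of( 'assembly.fasta' )

    // The caller decides the mode explicitly; no addParams needed.
    BUSCO_WORKFLOW( contigs_ch, 'genome' )
    BUSCO_WORKFLOW.out.view()
}
```

The trade-off is the one mentioned above: explicit inputs are clearer, but they get verbose when a subworkflow has many options.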

mahesh-panchal Aug 18, 2022

Scale of number of inputs is definitely an influencing factor with how people choose to parameterize.

mahesh-panchal · 2022-08-15T09:28:00Z

mahesh-panchal
Aug 15, 2022

Making the directives block in the process script scope more explicit may help too. At the moment they can be in different places in the process block, e.g. between input: and script:, or above input:, etc.

One user has also said they found it confusing that directives specified above input: use variables that are declared below, e.g. in the input: block. Normally variables are declared first, and then used.

6 replies

mahesh-panchal Aug 19, 2022

I could be wrong, but I did think it was clearly understood that process directives were configuration. To me, because directives are often written inside the process in the docs, this was what users thought should be used. Also the docs were often very vague on how to use the complete functionality of a directive in a config. publishDir is a clear example of this. The docs demonstrate how to use publishDir within processes, and that it can be used multiple times in the same process, but one needs to look at pod (https://www.nextflow.io/docs/latest/process.html#pod) to infer how to use the full publishDir syntax in the config, which the majority of users would likely skip over.

As I understood it, nf-core only uses container packaging and labels in the processes. The rest goes into config files. publishDir was also one of those until the config syntax was clarified there.

bentsherman Aug 19, 2022
Maintainer Author

Going back to your original comment, maybe we could encourage a structure similar to the workflow syntax:

process foo {
  input:
  val x

  // directives:
  cpus 2
  memory 6.GB

  script:
  y = -x
  """
  echo 'foo! cpus ${task.cpus}! memory ${task.memory}!'
  """

  output:
  val y
}

workflow {
  foo(1) | view
}

mahesh-panchal Aug 19, 2022

I hadn't considered output but that also makes sense too. I quite like this.

mahesh-panchal Aug 19, 2022

Ensure the docs are updated too: https://www.nextflow.io/docs/latest/process.html#script

A process contains one and only one script block, and it must be the last statement when the process contains input and output declarations.

Did this change with the implementation of stub:?

bentsherman Aug 19, 2022
Maintainer Author

It's a bit misleading, but what is meant is that if you only provide a script string without specifying script:, the string must be the last statement in the process. The process is basically a groovy function where the last statement is the return value, so if the process "returns" a string then nextflow interprets it as the script. As far as I can tell, you can put the script block anywhere as long as you specify the script: guard.
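To illustrate, a minimal sketch of the two forms as I understand them. The process names are made up.

```groovy
// Implicit form: no script: guard, so the script string has to be
// the last statement in the process body.
process implicit_script {
    output:
    stdout

    """
    echo 'implicit script block'
    """
}

// Explicit form: with the script: guard, the block can come before output:.
process explicit_script {
    script:
    """
    echo 'explicit script block'
    """

    output:
    stdout
}

workflow {
    implicit_script() | view
    explicit_script() | view
}
```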

mahesh-panchal · 2022-08-15T09:32:16Z

mahesh-panchal
Aug 15, 2022

Allow environment variables to be empty if undefined. Current behaviour reports the variable is undefined.
e.g.

params.my_var = "${HOM}" ?: 'var'

results in

No such variable: HOM

3 replies

cjw85 Aug 15, 2022

If you invert the above you have my_var == 'var' unless the HOM environment variable is set, in which case set params.my_var to that. Typically such logic is handled in an argument parser. I think this could be included as part of the rewritten parameter parsing/specification/validation.

bentsherman Aug 18, 2022
Maintainer Author

I'm surprised you can access environment variables like that at all. Nextflow probably shouldn't support that in the first place. You can do it the Groovy way instead:

params.my_var = System.getenv('OLDPWD') ?: 'none'

mahesh-panchal Aug 19, 2022

This may have been an afterthought of having something similar to the command line syntax.

--my_var ${OLDPWD:-'none'}

Are there thoughts yet on how environment variables will be handled in the new syntax? One wrinkle with them is they cannot be used in the file passed to -params-file. Implicit variables like $projectDir are also not supported in the -params-file.

mahesh-panchal · 2022-08-22T08:09:51Z

mahesh-panchal
Aug 22, 2022

The docs could use a section clarifying optional bracketing.

Lots of people don't know where the optional brackets go, and it causes obscure syntax errors.
e.g. Users might try

output:
path ("*log", type: 'dir'), emit: logs

when the brackets should be:

output:
path ("*log", type: 'dir', emit: logs)

2 replies

bentsherman Aug 22, 2022
Maintainer Author

This is a whole thing with the Groovy DSL syntax, which makes various parentheses and dots optional so that you can have "natural language":

// equivalent to: turn(left).then(right)
turn left then right

// equivalent to: take(2.pills).of(chloroquinine).after(6.hours)
take 2.pills of chloroquinine after 6.hours

// equivalent to: paint(wall).with(red, green).and(yellow)
paint wall with red, green and yellow

// with named parameters too
// equivalent to: check(that: margarita).tastes(good)
check that: margarita tastes good

// with closures as parameters
// equivalent to: given({}).when({}).then({})
given { } when { } then { }

It's good for whipping up a quick DSL but for Nextflow I think it causes confusion. This is one point where I'd like the docs to maybe explain this syntax but also give clear guidelines so that it doesn't feel like anything goes.

cjw85 Aug 22, 2022

It might be preferable if the examples in the docs used explicit brackets: as soon as tuples become involved the binding rules catch a lot of people out.

mahesh-panchal · 2022-08-25T12:02:25Z

mahesh-panchal
Aug 25, 2022

Although workflow configuration will be changed in the future, it would be helpful to have the docs updated in the meantime to describe the three formats of configuration input in one place (command line, -params-file, and config file), and to describe their uses and what works in which format (e.g. where quoting is necessary, implicit variable access like $projectDir, environment variables, closures, lists and maps, etc).

Also when params are used to set other params, make it explicit to use -params-file or command-line to supply params because of: #2662

0 replies

SamStudio8 · 2022-08-31T16:29:44Z

SamStudio8
Aug 31, 2022

I thought it relevant to cross-post my recent attempt to re-open a request for a workflow.onStart here: #1138.

0 replies

bentsherman · 2023-07-25T21:13:28Z

bentsherman
Jul 25, 2023
Maintainer Author

Thanks again to everyone who participated in this discussion, it helped me immensely to map out the landscape of issues that people have with the language. Many of these issues have either been addressed, are under development or review, have been incorporated somehow into our internal roadmap, or have been spun off into separate issues.

As a result, I figured it was about time to summarize the ongoing efforts and provide some closure to this particular discussion. At this point, if anyone wants to keep discussing a particular issue, I encourage you to do it in, well, that particular issue on GitHub, or create a new one if you don't see it here.

Better error messages

Check out the error-improvements label to see the status of issues around better error messages. I found a few ways to either return an error earlier or intercept it and provide more context. These fixes should cover a large portion of weird errors that users experience.

Long-term, we continue to explore how to evolve the DSL and the configuration to make weird errors like these less likely in the first place. These are both massive efforts, so nothing new really on the what or when.

Linting and code completion remain important efforts, but I don't have much time to work on them at the moment. For now I recommend npm-groovy-lint, which goes a long way.

Flexibility for publishing files

Initially there was some discussion around having a publish operator, but we concluded that we really need a better way to define top-level inputs and outputs, then the questions around publishDir might be easier to address.

Processes

Lots of suggestions for how to make processes more expressive. I have implemented solutions (or at least drafts) for many of these items, and you can visit the individual issues/PRs to see what's going on.

Documentation

Many of the issues discussed here boiled down to improving the documentation in some way or another, which we have done in a massive way. I've been working to make the docs more "reference-complete", meaning that they comprehensively describe everything that is possible rather than only teaching by example. We've added tons of little clarifications that were missing, such as the notes around merge, groupTuple, each, etc. And there are still a few major efforts in review, which you can track using the kind/docs label. All of these together should cover just about everything that was discussed.

0 replies


Replies: 23 comments · 83 replies
