Nextflow language improvements #3107
Replies: 23 comments 83 replies
-
Metadata handling, grouping and operations. Nextflow is really flexible when it comes to passing around data, but effectively everything needs to be passed around as a list including metadata, which can lead to large input tuples depending on the metadata needing to be passed around. Nf-core took a step to pass metadata around as a Map as the first element of most input tuples, but this combines metadata for different things. For a more explicit example, let's take a basic alignment workflow. Examples of metadata that are passed around are things like:
The specific issue I have is when it comes to grouping data again based on metadata properties. If one uses a separate element of a tuple for each part of the metadata then one needs to use the index which is not informative to the reader ( e.g. I think it would help readability and workflow maintenance if input tuples were maps rather than lists, and operators could select which keys grouping happened on. |
Beta Was this translation helpful? Give feedback.
-
Disk space usage while running is often a concern. The |
Beta Was this translation helpful? Give feedback.
-
Inheritance in configuration is a desired feature. One common example is with the Example: process {
// General publishDir setting
publishDir = [
path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
mode: params.publish_dir_mode, // Would like this inherited
saveAs: { filename -> filename.equals('versions.yml') ? null : filename } // Would like this inherited
]
withName: 'MY_TASK' {
// I only want to supply a new path here, but I need to write out everything that not default again.
publishDir = [
path: { "${params.outdir}/task/$sample" },
mode: params.publish_dir_mode, // Needs to be included again or it uses default setting
saveAs: { filename -> filename.equals('versions.yml') ? null : filename } // Needs to be included again or it uses default setting
]
}
} Another aspect that would benefit from this also is parameter specification. For example when something has a complex configuration, but you want to add on an extra string: // Example from nf-core/rnaseq
process {
withName: '.*:ALIGN_STAR:STAR_ALIGN|.*:ALIGN_STAR:STAR_ALIGN_IGENOMES' {
ext.args = [
'--quantMode TranscriptomeSAM',
'--twopassMode Basic',
'--outSAMtype BAM Unsorted',
'--readFilesCommand zcat',
'--runRNGseed 0',
'--outFilterMultimapNmax 20',
'--alignSJDBoverhangMin 1',
'--outSAMattributes NH HI AS NM MD',
'--quantTranscriptomeBan Singleend',
params.save_unaligned ? '--outReadsUnmapped Fastx' : ''
].join(' ').trim()
} In it's current form this needs to be written out again in some form if I want to modify it, whereas it would be nice to be able to say I want to append something to all of this instead. |
Beta Was this translation helpful? Give feedback.
-
The behaviour of Example snippet: workflow {
Channel.fromPath( 'test.tsv', checkIfExists: true )
.splitCsv( header: ['sample_id','datatype','sequences'], skip: 1, sep:'\t' )
.branch { record -> def seqs = file( record.sequences, checkIfExists: true)
// If seqs is not a list, path is the absolute path
// If seqs is a list, path does not preserve glob order and are relative paths
if ( seqs instanceof List ) {
seqs = seqs.sort() { it.name }
}
tsv_pacbio_ch : record.datatype == 'PacBio'
return [ [id: record.sample_id ], seqs ]
tsv_illumina_ch : record.datatype == 'Illumina'
return [ [id: record.sample_id ], seqs ]
} | mix | view
} There's a similar issue inside processes where for example if the input could be a single file or list, one needs to test for the object type before calling |
Beta Was this translation helpful? Give feedback.
-
I just saw the updates to the text. How do you use |
Beta Was this translation helpful? Give feedback.
-
Agreed that |
Beta Was this translation helpful? Give feedback.
-
workflow {
Channel.fromList( [ 'A', 'B', 'C', 'D' ] )
.mix( Channel.fromList( [ 'C','D','E' ] ) )
// .countBy() // deprecated
.map{ it -> [ it, 1 ] }
.groupTuple()
.filter { it -> it[1].size() == 1 }
.map{ val, count -> val }
.view()
} It can be difficult to implement operations that need to filter channels based on the contents of others. |
Beta Was this translation helpful? Give feedback.
-
There needs to be more documentation and functionality for workflow programmers to report warnings and errors to workflow users. What's the best way of telling users of a workflow that something is not correct? Reporting warnings is also tricky, for example if you want to tell a user that a channel is empty. Parameter validation, and schema validation for CSV/TSV inputs would be appreciated too ( along with perhaps a Channel factory for YAML/ other common input file formats and schema validation ) |
Beta Was this translation helpful? Give feedback.
-
On the merge operator:
Is this definitively true? I've asked this question to Seqera staff several times and no one has ever been able to confidently tell me that the implementation gurantees that channels emitted from the same process have an identical ordering (e.g. that there is a lock on the set of channel from a process such that items can only be injected from a single instance of the process at a time). |
Beta Was this translation helpful? Give feedback.
-
I think this should be the number one priority for Seqera to fix, to the extent I would say no new features until this is fixed. My developers have spent countless hours looking through each others code trying to find minor typos. |
Beta Was this translation helpful? Give feedback.
-
I'd suggest some style-review in the documentation by a technical writer, or at least discussion with (new) users. I've always felt the documentation assumes the reader immediately understands the power of various concepts. Take the Ironically, I think it labours the point on Channels and perhaps makes them more intimidating than they are. It's interesting that Seqera have publically commented that they recognise many of their users are coming from Python and don't want to write Groovy-like things. This should be leveraged. There's an obvious common, fundamental concept in Python that can be used as an analogy to Channels 😉 |
Beta Was this translation helpful? Give feedback.
-
I fully agree with you @bentsherman that there is that constant feeling doing things incorrectly/inefficiently. It's also interesting that some best practices are implemented in a different project that is not part of the language, such as a linter, that is developed by the same organisation but that is not compatible with the generic language. In my mind things should be applicable to all applications of the language and not a specific subset. I would also like to mention the other recent discussion here that goes hand in hand with re-usability of modules with respect to the publishing of files: reusable code should not pubish, but their results should be publishable. |
Beta Was this translation helpful? Give feedback.
-
Inspired by the One thing nf-core has done is to move the specification of the when:
task.ext.when == null || task.ext.when in the Since configuration files are currently executable code for the time being, one could implement allowing process selector expressions within the process.putAll(params.subMap(params.keySet().findAll{ it.startsWith('with') })) Then there's no need to use the I guess what this boils down to is:
Allowing process selector expressions in the params configuration scope would also deprecate the need for The downside of not having it in the configuration though is that it prevents users from disabling entire subworkflows they don't want to execute. So perhaps there should also be a |
Beta Was this translation helpful? Give feedback.
-
Something cosmetic is the use of the It would be nice if there was an option to toggle the printing of non-executed processes to screen/log ( on by default, so one can turn off for debugging ) |
Beta Was this translation helpful? Give feedback.
-
This was one of those wrinkles in the grammar that took me ages to realise was a bug and I wasn't doing something wrong. I have a more radical proposal: deprecate |
Beta Was this translation helpful? Give feedback.
-
Make |
Beta Was this translation helpful? Give feedback.
-
I think it would help if Channel creation were also limited to the I think it would also be beneficial to limit |
Beta Was this translation helpful? Give feedback.
-
Making the directives block in the process script scope more explicit may help too. At the moment they can be in different places in the process block e.g. between One user has also said they found it confusing that directives specified above |
Beta Was this translation helpful? Give feedback.
-
Allow environment variables to be empty if undefined. Current behaviour reports the variable is undefined. params.my_var = "${HOM}" ?: 'var' results in
|
Beta Was this translation helpful? Give feedback.
-
The docs could use a section clarifying optional bracketing. Lot's of people don't know where the optional brackets go and it causes the obscure syntax errors. input:
path ("*log", type: 'dir'), emit: logs when the brackets should be: input:
path ("*log", type: 'dir', emit: logs) |
Beta Was this translation helpful? Give feedback.
-
Although workflow configuration will be changed in the future, it would be helpful to have the docs updated in the mean time to describe the three formats of configuration input in one place ( command line, Also when params are used to set other params, make it explicit to use |
Beta Was this translation helpful? Give feedback.
-
I thought it relevant to cross-post my recent attempt to re-open a request for a |
Beta Was this translation helpful? Give feedback.
-
Thanks again to everyone who participated in this discussion, it helped me immensely to map out the landscape of issues that people have with the language. Many of these issues have either been addressed, are under development or review, have been incorporated somehow into our internal roadmap, or have been spun off into separate issues. As a result, I figured it was about time to summarize the ongoing efforts and provide some closure to this particular discussion. At this point, if anyone wants to keep discussing a particular issue, I encourage you to do it in, well, that particular issue on GitHub, or create a new one if you don't see it here. Better error messagesCheck out the error-improvements label to see the status of issues around better error messages. I found a few ways to either return an error earlier or intercept it and provide more context. These fixes should cover a large portion of weird errors that users experience. Long-term, we continue to explore how to evolve the DSL and the configuration to make weird errors like these less likely in the first place. These are both massive efforts, so nothing new really on the what or when. Linting and code completion remain important efforts, but I don't have much time to work on them at the moment. For now I recommend npm-groovy-lint, which goes a long way. Flexibility for publishing filesInitially there was some discussion around having a ProcessesLots of suggestions for how to make processes more expressive. I have implemented solutions (or at least drafts) for many of these items, and you can visit the individual issues/PRs to see what's going on. DocumentationMuch of the issues discussed here boiled down to improving the documentation in some way or another, which we have done in a massive way. I've been working to make the docs more "reference-complete", meaning that it comprehensively describes what all is possible rather than only teaching by example. We've added tons of little clarifications that were missing, such as the notes around |
Beta Was this translation helpful? Give feedback.
-
As a Nextflow user (and now developer), I have a lot of little qualms with the Nextflow language. The fact that I fell in love with Nextflow despite all of its quirks is truly a testament to its excellence as a workflow manager. But man, there is a lot of weirdness and confusion when you’re learning Nextflow. Like, it’s not hard to learn, but there’s always this sinking feeling that you’re not doing something right, and that feeling never completely goes away.
I’ve been poring over issues and discussions from the past year or so, and I’ve seen a lot of weird bug reports or comments about the language itself, so I decided it’s time to collect everything into a mega-thread and start coming up with solutions.
Basically, I’m going to lay out everything I think is weird about the Nextflow language and try to propose solutions. I’m posting it here because I want to hear from Nextflow users. Like you! Please feel free to comment below, and I will try to incorporate your suggestions into the big picture.
I will continue to update this post as I see new issues or solutions.
NOTE: The checkboxes are reset every time I update this post, so they might not reflect the current status of things.
Basic scripting
See issues with the
lang/dsl2
label.exit
doesn't print message when called inside a process argument (Exit error message not printed to screen when used between process input brackets. #3046)Formal grammar and linter
Some folks in the nextflow / nf-core Slacks are currently trying to develop a linter. We'll see how far they get, but overall I think there is a lack of clarity in what is possible in the Nextflow language. Beyond the core features, there is a large space of possible syntax since Nextflow inherits from Groovy and Java.
While I hope we can better formalize the Nextflow language (which we'll probably need to in order to fix issues like #2082), for now I think the most important thing is to make the docs clearer about what is possible in the Nextflow language, and maybe even provide some guidelines or conventions to help users along.
Flexibility for publishing files
A lot of users were asking in the thread linked above for the ability to publish files at the workflow level. As far as I can tell, users are just trying to not repeat the same publishDir settings for every process. As shown in the comments below (#3107 (reply in thread)), you can use process selectors to apply some directives to many processes at once, so I think this covers it.
Another idea is to implement a
publish
operator that basically does the same thing as the directive but as a standalone operator (#1540).Processes
See issues with the
lang/processes
label.each
input repeater with tuple (Use of input repeaters ("each") in DSL2 #1966)when
block in documentation (Suggest adding the process "when:" as a process directive. #2518)glob()
method for globs instead offile()
(Addglob()
method for globs in favor offile()
#3109)Map inputs/outputs, named inputs, optional inputs
These concepts are related but have slightly different uses. Maps are useful as an alternative to tuples. Named and optional inputs have the same usefulness as named and optional outputs. In particular, I think optional inputs would solve the problem that channel topics (#2842) is trying to solve. Nullable paths are useful when you want an element of a tuple / map to be optional.
Multiple input channels and the
merge
operatorIn general, multiple input channels should only be used to enumerate the outer product of the channels, such as by using
each
or value channels. I have seen cases where users have a tuple channel and they want to split it up into multiple input channels instead of one long tuple input. While this may look "nicer", it is not the paradigm of Nextflow and often leads to suffering. Using multiple input channels in this particular way is equivalent to themerge
operator, which should be removed from Nextflow. What I think users are actually reaching for here is to either use maps or named/optional inputs, so I think these language improvements will resolve some of this tension.Operators
See issues with the
lang/operators
label.merge
operator (Replacement for `merge`? #2113, Is channel output from multiMap guaranteed to be in the same order it comes in? #2682, Add notes about merge, multiMap, and multiple input channels #3037)combine
,cross
, andjoin
(Inconsistent Channel Operator Behavior #2437, Add doc tests, move some snippets to separate files #3959)tap
operator in favor ofset
(Deprecate tap operator, extend set operator to return input channel #2981)collect
,toList
, andtoSortedList
(Behaviour of.collect()
and.toList()
#3105)transpose
(Behaviour of.collect()
and.toList()
#3105)by
option (Support string indices for operators withby
option #3108)distinct
andunique
(Add doc tests, move some snippets to separate files #3959)merge
The
merge
operator was deprecated with DSL2 but revived because of some use cases that users wanted (see above section on multiple input channels). The problem is that merging two channels from different processes leads to random combinations because processes can execute tasks out of order. While it is safe to use with other operators, this behavior is not guaranteed and could change in the future. While it is safe with channels from the same process, in that case you are probably setting up our input/output channels in a bad way (see above section on multiple input channels). As stated above, if we can implement some improvements to process inputs/outputs then I think we'll be able to truly put themerge
operator to rest.Modules
See issues with the
lang/modules
label.Configuration
See issues with the
lang/config
label.See #2723 for ongoing discussion about a new configuration syntax. Lots of config-related issues are due to limitations of the current config syntax.
pod
directive topodOptions
pod.config
topod.configMap
pod.pullPolicy
in favor ofpod.imagePullPolicy
pod.runAsUser
in favor ofpod.securityContext
Beta Was this translation helpful? Give feedback.
All reactions