Update bin, lib, template docs #5564

Draft: wants to merge 24 commits into base: master
Changes from 23 commits
Commits
6627bbc
Update examples and language for custom scripts
christopher-hakkaart Dec 2, 2024
aac6dfe
Update text after modifying bin documentation
christopher-hakkaart Dec 2, 2024
6822188
Mirror notes
christopher-hakkaart Dec 2, 2024
f568c0e
Proof read edits
christopher-hakkaart Dec 2, 2024
9a2424e
New section
christopher-hakkaart Dec 3, 2024
e1795ea
Update syntax docs (#5542)
bentsherman Nov 26, 2024
6d703db
Prevent NPE with null AWS Batch response
pditommaso Nov 27, 2024
9963084
Update wave deps
pditommaso Nov 27, 2024
de31b6d
Fix missing wave response (#5547) [ci fast]
pditommaso Nov 27, 2024
62c565a
Incorrect CPU value in Azure example (#5549) [ci skip]
adamrtalbot Nov 29, 2024
c0d98d9
Update changelog [ci skip]
pditommaso Nov 27, 2024
2794d3e
Detecting errors in data unstaging (#5345)
jorgee Dec 2, 2024
a91fd9d
Bump [email protected]
pditommaso Dec 3, 2024
3b810d0
Bump [email protected]
pditommaso Dec 3, 2024
eff621e
Bump [email protected]
pditommaso Dec 3, 2024
6960eab
Bump [email protected] [ci fast]
pditommaso Dec 3, 2024
20a4b6d
[release 24.11.0-edge] Update timestamp and build number
pditommaso Dec 3, 2024
a636831
Update changelog [ci skip]
pditommaso Dec 3, 2024
8aa2f4a
Trying new layout
christopher-hakkaart Dec 3, 2024
53dfd4d
Merge branch 'nextflow-io:master' into docs-pythonscripts
christopher-hakkaart Dec 3, 2024
0433bf4
Quick improvements when reading
christopher-hakkaart Dec 3, 2024
2a4deaf
Merge branch 'docs-pythonscripts' of https://github.com/christopher-h…
christopher-hakkaart Dec 3, 2024
4d8f2f9
Add in another example and note
christopher-hakkaart Dec 4, 2024
0b9c83f
Apply suggestions from review
christopher-hakkaart Dec 4, 2024
2 changes: 1 addition & 1 deletion docs/cache-and-resume.md
@@ -26,7 +26,7 @@ The task hash is computed from the following metadata:
- Task {ref}`inputs <process-input>`
- Task {ref}`script <process-script>`
- Any global variables referenced in the task script
- Any {ref}`bundled scripts <bundling-executables>` used in the task script
- Any {ref}`bundled scripts <structure-bin>` used in the task script
- Whether the task is a {ref}`stub run <process-stub>`
- Task attempt

1 change: 1 addition & 0 deletions docs/index.md
@@ -78,6 +78,7 @@ module
notifications
secrets
sharing
structure
vscode
dsl1
```
70 changes: 28 additions & 42 deletions docs/module.md
@@ -184,41 +184,25 @@ Ciao world!

## Module templates

Process script {ref}`templates <process-template>` can be included alongside a module in the `templates` directory.

For example, suppose we have a project L with a module that defines two processes, P1 and P2, both of which use templates. The template files can be made available in the local `templates` directory:
Template files can be stored in the `templates` directory alongside a module.

```
Project L
|── myModules.nf
└── templates
|── P1-template.sh
└── P2-template.sh
Project A
├── main.nf
└── modules
└── sayhello
├── sayhello.nf
└── templates
└── sayhello.sh
```

Then, we have a second project A with a workflow that includes P1 and P2:

```
Pipeline A
└── main.nf
```
Template files can be invoked like regular scripts from a process in your pipeline using the `template` function. Variables prefixed with the dollar character (`$`) are interpreted as Nextflow variables when the template file is executed by Nextflow.

Finally, we have a third project B with a workflow that also includes P1 and P2:
See {ref}`process-template` for more information about using template files.

```
Pipeline B
└── main.nf
```
Storing template files with the module that uses them encourages sharing of modules across pipelines. For example, future projects can include the module above by cloning the `modules` directory, without modifying the process or the template.

With the possibility to keep the template files inside the project L, A and B can use the modules defined in L without any changes. A future project C would do the same, just cloning L (if not available on the system) and including its module.

Beside promoting the sharing of modules across pipelines, there are several advantages to keeping the module template under the script path:

1. Modules are self-contained
2. Modules can be tested independently from the pipeline(s) that import them
3. Modules can be made into libraries

Having multiple template locations enables a structured project organization. If a project has several modules, and they all use templates, the project could group module scripts and their templates as needed. For example:
Beyond facilitating module sharing across pipelines, organizing template locations allows for a well-structured project. For example, complex projects with multiple modules that rely on templates can be organized into logical groups:

```
baseDir
@@ -240,43 +224,45 @@ baseDir
|── mymodules6.nf
└── templates
|── P5-template.sh
|── P6-template.sh
└── P7-template.sh
└── P6-template.sh
```

Template files can also be stored in the project `templates` directory. See {ref}`structure-template` for more information about the project directory structure.

(module-binaries)=

## Module binaries

:::{versionadded} 22.10.0
:::

Modules can define binary scripts that are locally scoped to the processes defined by the tasks.
Modules can define binary scripts that are locally scoped to the processes.

To enable this feature, set the following flag in your pipeline script or configuration file:

```nextflow
nextflow.enable.moduleBinaries = true
```

The binary scripts must be placed in the module directory names `<module-dir>/resources/usr/bin`:
Binary scripts must be placed in the module directory named `<module-dir>/resources/usr/bin`. For example:

```
<module-dir>
|── main.nf
└── resources
└── usr
└── bin
|── your-module-script1.sh
└── another-module-script2.py
└── script.py
```

Those scripts will be made accessible like any other command in the task environment, provided they have been granted the Linux execute permissions.
Binary scripts can be invoked like regular commands from the locally scoped module without modifying the `PATH` environment variable or using an absolute path. Each script should include a shebang to specify the interpreter and inputs should be supplied as arguments. See {ref}`structure-bin` for more information about custom scripts in `bin` directories.
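
As a rough sketch of the layout described above, the following commands create a module `bin` directory, add a script with an `env`-based shebang that takes its input as an argument, and grant it execute permission. The names `modules/sayhello` and `hello.sh` are hypothetical, and the `PATH` change at the end only imitates, outside of a pipeline run, the script being reachable as a regular command; it is not a description of how Nextflow implements this.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical module directory, created in a temporary location for illustration.
workdir="$(mktemp -d)"
bin_dir="$workdir/modules/sayhello/resources/usr/bin"
mkdir -p "$bin_dir"

# A script with a portable shebang that receives its input as an argument.
cat > "$bin_dir/hello.sh" <<'EOF'
#!/usr/bin/env bash
echo "Hello $1"
EOF

# Grant execute permission so the script can be called like a regular command.
chmod a+x "$bin_dir/hello.sh"

# Imitate by hand the script being available on PATH (assumption: done here
# manually, only to show that the script runs without an absolute path).
greeting="$(export PATH="$bin_dir:$PATH"; hello.sh world)"
echo "$greeting"
```

Inside a process script, the equivalent call would simply be `hello.sh world`, provided `nextflow.enable.moduleBinaries` is set.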

To use this feature, the module binaries must be enabled in your pipeline script or configuration file:

```nextflow
nextflow.enable.moduleBinaries = true
```

:::{note}
This feature requires the use of a local or shared file system for the pipeline work directory, or {ref}`wave-page` when using cloud-based executors.
Module binary scripts require a local or shared file system for the pipeline work directory or {ref}`wave-page` when using cloud-based executors.
:::

Scripts can also be stored in the project-level `bin` directory. See {ref}`structure-bin` for more information.

## Sharing modules

Modules are designed to be easy to share and re-use across different pipelines, which helps eliminate duplicate work and spread improvements throughout the community. While Nextflow does not provide an explicit mechanism for sharing modules, there are several ways to do it:
99 changes: 47 additions & 52 deletions docs/process.md
@@ -24,11 +24,11 @@ See {ref}`syntax-process` for a full description of the process syntax.

## Script

The `script` block defines, as a string expression, the script that is executed by the process.
The `script` block defines the string expression that is executed by the process.

A process may contain only one script, and if the `script` guard is not explicitly declared, the script must be the final statement in the process block.
A process can contain only one script block. If the `script` guard is not explicitly declared, the script must be the final statement in the process block.

The script string is executed as a [Bash](<http://en.wikipedia.org/wiki/Bash_(Unix_shell)>) script in the host environment. It can be any command or script that you would normally execute on the command line or in a Bash script. Naturally, the script may only use commands that are available in the host environment.
The script string is executed as a [Bash](<http://en.wikipedia.org/wiki/Bash_(Unix_shell)>) script in the host environment. It can be any command or script that you would execute on the command line or in a Bash script and can only use commands that are available in the host environment.

The script block can be a simple string or a multi-line string. The latter approach makes it easier to write scripts with multiple commands spanning multiple lines. For example:

@@ -42,19 +42,17 @@ process doMoreThings {
}
```

As explained in the script tutorial section, strings can be defined using single-quotes or double-quotes, and multi-line strings are defined by three single-quote or three double-quote characters.
Strings can be defined using single-quotes or double-quotes. Multi-line strings are defined by three single-quote or three double-quote characters.

There is a subtle but important difference between them. Like in Bash, strings delimited by a `"` character support variable substitutions, while strings delimited by `'` do not.
There is a subtle but important difference between single-quote (`'`) and double-quote (`"`) strings. Like in Bash, strings delimited by the `"` character support variable substitutions, while strings delimited by `'` do not.

In the above code fragment, the `$db` variable is replaced by the actual value defined elsewhere in the pipeline script.
For example, in the above code fragment, the `$db` variable is replaced by the actual value defined elsewhere in the pipeline script.
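
The same quoting rule can be checked directly in Bash. This is a minimal sketch; the value assigned to `db` and the message text are made up for illustration.

```bash
#!/usr/bin/env bash

db='/data/reference'   # hypothetical value

# Single quotes: no substitution, the text "$db" is kept literally.
single='database is $db'

# Double quotes: $db is replaced by its value.
double="database is $db"

printf '%s\n' "$single"   # prints: database is $db
printf '%s\n' "$double"   # prints: database is /data/reference
```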

:::{warning}
Since Nextflow uses the same Bash syntax for variable substitutions in strings, you must manage them carefully depending on whether you want to evaluate a *Nextflow* variable or a *Bash* variable.
Nextflow uses the same Bash syntax for variable substitutions in strings. You must manage them carefully depending on whether you want to evaluate a *Nextflow* variable or a *Bash* variable.
:::

When you need to access a system environment variable in your script, you have two options.

If you don't need to access any Nextflow variables, you can define your script block with single-quotes:
System environment variables and Nextflow variables can be accessed by your script. If you don't need to access any Nextflow variables, you can define your script block with single-quotes and use the dollar character (`$`) to access system environment variables. For example:

```nextflow
process printPath {
@@ -64,7 +62,7 @@ process printPath {
}
```

Otherwise, you can define your script with double-quotes and escape the system environment variables by prefixing them with a back-slash `\` character, as shown in the following example:
Otherwise, you can define your script with double-quotes and escape the system environment variables by prefixing them with a back-slash `\` character. For example:

```nextflow
process doOtherThings {
@@ -76,34 +74,30 @@ process doOtherThings {
}
```

In this example, `$MAX` is a Nextflow variable that must be defined elsewhere in the pipeline script. Nextflow replaces it with the actual value before executing the script. Meanwhile, `$DB` is a Bash variable that must exist in the execution environment, and Bash will replace it with the actual value during execution.

:::{tip}
Alternatively, you can use the {ref}`process-shell` block definition, which allows a script to contain both Bash and Nextflow variables without having to escape the first.
:::
In this example, `$MAX` is a Nextflow variable that is defined elsewhere in the pipeline script. Nextflow replaces it with the actual value before executing the script. In contrast, `$DB` is a Bash variable that must exist in the execution environment. Bash will replace it with the actual value during execution.

### Scripts *à la carte*

The process script is interpreted by Nextflow as a Bash script by default, but you are not limited to Bash.
The process script is interpreted as Bash by default.

You can use your favourite scripting language (Perl, Python, R, etc), or even mix them in the same pipeline.
However, you can use your favorite scripting language (Perl, Python, R, etc) for each process. You can also mix languages in the same pipeline.

A pipeline may be composed of processes that execute very different tasks. With Nextflow, you can choose the scripting language that best fits the task performed by a given process. For example, for some processes R might be more useful than Perl, whereas for others you may need to use Python because it provides better access to a library or an API, etc.
A pipeline may be composed of processes that execute very different tasks. You can choose the scripting language that best fits the task performed by a given process. For example, R might be more useful than Perl for some processes, whereas for others you may need to use Python because it provides better access to a library or an API.

To use a language other than Bash, simply start your process script with the corresponding [shebang](<http://en.wikipedia.org/wiki/Shebang_(Unix)>). For example:
To use a language other than Bash, start your process script with the corresponding [shebang](<http://en.wikipedia.org/wiki/Shebang_(Unix)>). For example:

```nextflow
process perlTask {
"""
#!/usr/bin/perl
#!/usr/bin/env perl

print 'Hi there!' . '\n';
"""
}

process pythonTask {
"""
#!/usr/bin/python
#!/usr/bin/env python

x = 'Hello'
y = 'world!'
@@ -118,12 +112,12 @@ workflow {
```

:::{tip}
Since the actual location of the interpreter binary file can differ across platforms, it is wise to use the `env` command followed by the interpreter name, e.g. `#!/usr/bin/env perl`, instead of the absolute path, in order to make your script more portable.
Use `env` to resolve the interpreter's location instead of hard-coding the interpreter path.
:::

### Conditional scripts

The `script` block is like a function that returns a string. This means that you can write arbitrary code to determine the script, as long as the final statement is a string.
The `script` block is like a function that returns a string. You can write arbitrary code to determine the script as long as the final statement is a string.

If-else statements based on task inputs can be used to produce a different script. For example:

@@ -155,57 +149,58 @@ process align {
}
```

In the above example, the process will execute one of several scripts depending on the value of the `mode` parameter. By default it will execute the `tcoffee` command.
In the above example, the process will execute one of several scripts depending on the value of the `mode` parameter. By default, the process will execute the `tcoffee` command.

(process-template)=

### Template
### Template files

Process scripts can be externalized to **template** files, which allows them to be reused across different processes and tested independently from the pipeline execution.
Process scripts can be externalized to **template** files and reused across multiple processes. Template files can be stored in the project or module `templates` directory. See {ref}`structure-templates` and {ref}`module-templates` for more information about directory structures.

A template can be used in place of an embedded script using the `template` function in the script section:
In template files, variables prefixed with the dollar character (`$`) are interpreted as Nextflow variables when the template script is executed by Nextflow.

```nextflow
process templateExample {
```
#!/usr/bin/env bash

echo "Hello ${x}"
```

Template files can be invoked like regular scripts from any process in your pipeline using the `template` function.

```
process sayHello {

input:
val STR
val x

output:
stdout

script:
template 'my_script.sh'
template 'sayhello.sh'
}

workflow {
Channel.of('this', 'that') | templateExample
Channel.of("Foo") | sayHello | view
}
```

By default, Nextflow looks for the template script in the `templates` directory located alongside the Nextflow script in which the process is defined. An absolute path can be used to specify a different location. However, this practice is discouraged because it hinders pipeline portability.
All template variables must be defined. The pipeline will fail if a template variable is missing, regardless of where it occurs in the template.

An example template script is provided below:
Templates can be tested independently of pipeline execution by providing each input as an environment variable. For example:

```bash
#!/bin/bash
echo "process started at `date`"
echo $STR
echo "process completed"
x='foo' bash templates/sayhello.sh
```
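
A self-contained sketch of this kind of command-line check is shown below. It assumes a template that reads a single variable named `x`, as in the `sayhello.sh` example; the temporary directory is only there so the snippet can run anywhere. The environment variable name must match the variable name used inside the template.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Create a throwaway templates directory holding the example template.
workdir="$(mktemp -d)"
mkdir -p "$workdir/templates"

cat > "$workdir/templates/sayhello.sh" <<'EOF'
#!/usr/bin/env bash

echo "Hello ${x}"
EOF

# Run the template directly, supplying the input as an environment variable.
# Under Bash, ${x} is read from the environment rather than filled in by Nextflow.
output="$(x='Foo' bash "$workdir/templates/sayhello.sh")"
echo "$output"
```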

Variables prefixed with the dollar character (`$`) are interpreted as Nextflow variables when the template script is executed by Nextflow and Bash variables when executed directly. For example, the above script can be executed from the command line by providing each input as an environment variable:

```bash
STR='foo' bash templates/my_script.sh
```
Template scripts are only recommended for Bash scripts. Languages that do not prefix variables with `$` (e.g., Python and R) can't be executed directly as a template script from the command line as variables prefixed with `$` are interpreted as Bash variables. Similarly, template variables escaped with `\$` will be interpreted as Bash variables when executed by Nextflow but not the command line.

The following caveats should be considered:

- Template scripts are recommended only for Bash scripts. Languages that do not prefix variables with `$` (e.g. Python and R) can't be executed directly as a template script.

- Variables escaped with `\$` will be interpreted as Bash variables when executed by Nextflow, but will not be interpreted as variables when executed from the command line. This practice should be avoided to ensure that the template script behaves consistently.

- Template variables are evaluated even if they are commented out in the template script. If a template variable is missing, it will cause the pipeline to fail regardless of where it occurs in the template.
:::{warning}
Template variables are evaluated even if they are commented out in the template script.
:::

:::{tip}
Template scripts are generally discouraged due to the caveats described above. The best practice for using a custom script is to embed it in the process definition at first and move it to a separate file with its own command line interface once the code matures.
The best practice for using a custom script is to first embed it in the process definition and transfer it to a separate file with its own command line interface once the code matures.
:::

(process-shell)=
26 changes: 0 additions & 26 deletions docs/sharing.md
@@ -93,32 +93,6 @@ Read the {ref}`container-page` page to learn more about how to use containers wi
For maximal reproducibility, make sure to define a specific version for each tool. Otherwise, your pipeline might use different versions across subsequent runs, which can introduce subtle differences to your results.
:::

(bundling-executables)=

#### The `bin` directory

As for custom scripts, you can include executable scripts in the `bin` directory of your pipeline repository. When configured correctly, these scripts can be executed like a regular command from any process script (i.e. without modifying the `PATH` environment variable or using an absolute path), and changing the script will cause the task to be re-executed on a resumed run (i.e. just like changing the process script itself).

To configure a custom script:

1. Save the script in the `bin` directory (relative to the pipeline repository root).
2. Specify a portable shebang (see note below for details).
3. Make the script executable. For example: `chmod a+x bin/my_script.py`

:::{tip}
To maximize the portability of your bundled script, use `env` to dynamically resolve the location of the interpreter instead of hard-coding it in the shebang line.

For example, shebang definitions `#!/usr/bin/python` and `#!/usr/local/bin/python` both hard-code specific paths to the Python interpreter. Instead, the following approach is more portable:

```bash
#!/usr/bin/env python
```
:::

#### The `lib` directory

Any Groovy scripts or JAR files in the `lib` directory will be automatically loaded and made available to your pipeline scripts. The `lib` directory is a useful way to provide utility code or external libraries without cluttering the pipeline scripts.

### Data

In general, input data should be provided by external sources using parameters which can be controlled by the user. This way, a pipeline can be easily reused to process different datasets which are appropriate for the pipeline.