From e644f5bbd9b41048f5c23512ed3531f9c6dbd90b Mon Sep 17 00:00:00 2001
From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com>
Date: Tue, 8 Apr 2025 13:25:39 +0100
Subject: [PATCH 01/36] First draft of splitting and grouping side quest

First draft of a splitting and grouping side quest. This includes
separating samples using filter, then grouping and spreading by
intervals. Early version but introduces key operator concepts to
participants.
---
 docs/side_quests/splitting-and-grouping.md    | 1521 +++++++++++++++++
 .../splitting_and_grouping/data/intervals.txt |    3 +
 .../data/samplesheet.csv                      |    9 +
 side-quests/splitting_and_grouping/main.nf    |   31 +
 4 files changed, 1564 insertions(+)
 create mode 100644 docs/side_quests/splitting-and-grouping.md
 create mode 100644 side-quests/splitting_and_grouping/data/intervals.txt
 create mode 100644 side-quests/splitting_and_grouping/data/samplesheet.csv
 create mode 100644 side-quests/splitting_and_grouping/main.nf

diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md
new file mode 100644
index 0000000000..325f43eb64
--- /dev/null
+++ b/docs/side_quests/splitting-and-grouping.md
@@ -0,0 +1,1521 @@
# Splitting and Grouping

Nextflow helps you work with your data in flexible ways. One of the most useful things you can do is split your data into different streams and then group related items back together.

Think of it like sorting mail: you might first separate letters by their destination, process each pile differently, and then recombine items going to the same person. In Nextflow, we use special operators to do this with our scientific data.

Nextflow's channel system is at the heart of this flexibility. Channels act as pipelines that connect different parts of your workflow, allowing data to flow through your analysis. You can create multiple channels from a single data source, process each channel differently, and then merge channels back together when needed. This approach lets you design workflows that naturally mirror the branching and converging paths of complex bioinformatics analyses.

In this side quest, we'll explore how to split and group data using Nextflow's powerful channel operators. We'll start with a samplesheet containing information about different samples and their associated data. By the end of this side quest, you'll be able to manipulate and combine data streams effectively, making your workflows more efficient and easier to understand. In particular, you will learn how to:

- Read data from files using `splitCsv`
- Filter and transform data with `filter` and `map`
- Combine related data using `join` and `groupTuple`

These skills will help you build workflows that can handle multiple samples and different types of data efficiently.

---

## 0. Warmup

### 0.1 Prerequisites

Before taking on this side quest you should:

- Complete the [Hello Nextflow](../hello_nextflow/README.md) tutorial
- Understand basic Nextflow concepts (processes, channels, operators)

### 0.2 Starting Point

Let's move into the project directory.

```bash
cd side-quests/splitting_and_grouping
```

You'll find a `data` directory containing a samplesheet and an intervals file, alongside a main workflow file.

```console title="Directory contents"
> tree
.
├── data
│   ├── intervals.txt
│   └── samplesheet.csv
└── main.nf
```

The samplesheet describes each sample: its ID, repeat number, type (normal or tumor), and the paths to its two FASTQ files.
```console title="samplesheet.csv"
id,repeat,type,fastq1,fastq2
sampleA,1,normal,sampleA_rep1_normal_R1.fastq.gz,sampleA_rep1_normal_R2.fastq.gz
sampleA,1,tumor,sampleA_rep1_tumor_R1.fastq.gz,sampleA_rep1_tumor_R2.fastq.gz
sampleA,2,normal,sampleA_rep2_normal_R1.fastq.gz,sampleA_rep2_normal_R2.fastq.gz
sampleA,2,tumor,sampleA_rep2_tumor_R1.fastq.gz,sampleA_rep2_tumor_R2.fastq.gz
sampleB,1,normal,sampleB_rep1_normal_R1.fastq.gz,sampleB_rep1_normal_R2.fastq.gz
sampleB,1,tumor,sampleB_rep1_tumor_R1.fastq.gz,sampleB_rep1_tumor_R2.fastq.gz
sampleC,1,normal,sampleC_rep1_normal_R1.fastq.gz,sampleC_rep1_normal_R2.fastq.gz
sampleC,1,tumor,sampleC_rep1_tumor_R1.fastq.gz,sampleC_rep1_tumor_R2.fastq.gz
```

Note there are 8 rows in total, describing 4 normal and 4 tumor samples across 3 sample IDs. sampleA has 2 repeats, while sampleB and sampleC only have 1 each.

We're going to read in this samplesheet, then group and split the samples based on their data.

---

## 1. Read in samplesheet

### 1.1. Read in samplesheet with splitCsv

Let's start by reading in the samplesheet with `splitCsv`. In the main workflow file, you'll see that we've already started the workflow.

```groovy title="main.nf" linenums="1"
workflow {
    samplesheet = Channel.fromPath("./data/samplesheet.csv")
}
```

We can use the [`splitCsv` operator](https://www.nextflow.io/docs/latest/operator.html#splitcsv) to split the samplesheet into a channel of maps, where each map represents a row from the CSV file.

```groovy title="main.nf" linenums="1"
workflow {
    samplesheet = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
}
```

The `header: true` option tells Nextflow to use the first row of the CSV file as the header row, and those header names become the keys for the values in each map. Let's inspect what the channel contains after `splitCsv`. To do this, we can use the `view` operator.

```groovy title="main.nf" linenums="1"
workflow {
    samplesheet = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
        .view()
}
```

```bash title="Read the samplesheet"
nextflow run main.nf
```

```console title="Read samplesheet with splitCsv"
 N E X T F L O W ~ version 24.10.5

Launching `main.nf` [berserk_cray] DSL2 - revision: 8f31622c03

[id:sampleA, repeat:1, type:normal, fastq1:sampleA_rep1_normal_R1.fastq.gz, fastq2:sampleA_rep1_normal_R2.fastq.gz]
[id:sampleA, repeat:1, type:tumor, fastq1:sampleA_rep1_tumor_R1.fastq.gz, fastq2:sampleA_rep1_tumor_R2.fastq.gz]
[id:sampleA, repeat:2, type:normal, fastq1:sampleA_rep2_normal_R1.fastq.gz, fastq2:sampleA_rep2_normal_R2.fastq.gz]
[id:sampleA, repeat:2, type:tumor, fastq1:sampleA_rep2_tumor_R1.fastq.gz, fastq2:sampleA_rep2_tumor_R2.fastq.gz]
[id:sampleB, repeat:1, type:normal, fastq1:sampleB_rep1_normal_R1.fastq.gz, fastq2:sampleB_rep1_normal_R2.fastq.gz]
[id:sampleB, repeat:1, type:tumor, fastq1:sampleB_rep1_tumor_R1.fastq.gz, fastq2:sampleB_rep1_tumor_R2.fastq.gz]
[id:sampleC, repeat:1, type:normal, fastq1:sampleC_rep1_normal_R1.fastq.gz, fastq2:sampleC_rep1_normal_R2.fastq.gz]
[id:sampleC, repeat:1, type:tumor, fastq1:sampleC_rep1_tumor_R1.fastq.gz, fastq2:sampleC_rep1_tumor_R2.fastq.gz]
```

We can see that each row from the CSV file has been converted into a map with keys matching the header row. A map is a key-value data structure similar to dictionaries in Python, objects in JavaScript, or hashes in Ruby.
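
Since these row maps are plain Groovy maps, it can help to see one in isolation. Here is a minimal Groovy sketch; the values simply mirror the first samplesheet row, and it is not part of the workflow code:

```groovy title="Sketch: working with a row map"
// A map like those emitted by splitCsv(header: true)
def row = [
    id: 'sampleA',
    repeat: '1',
    type: 'normal',
    fastq1: 'sampleA_rep1_normal_R1.fastq.gz',
    fastq2: 'sampleA_rep1_normal_R2.fastq.gz',
]

println(row.id)      // dot syntax: sampleA
println(row['type']) // bracket syntax works too: normal
```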
Each map contains:

- `id`: The sample identifier (sampleA, sampleB, sampleC)
- `repeat`: The replicate number (1 or 2)
- `type`: The sample type (normal or tumor)
- `fastq1`: Path to the first FASTQ file
- `fastq2`: Path to the second FASTQ file

This format makes it easy to access specific fields from each sample. For example, we could access the sample ID with `row.id` or the FASTQ paths with `row.fastq1` and `row.fastq2`.

This means we have successfully read in the samplesheet and have access to the data in each row. We can start to implement this in our pipeline.

### 1.2. Use dump to pretty print the data

For a prettier output format, we can use the [`dump` operator](https://www.nextflow.io/docs/latest/operator.html#dump) instead of `view`:

```groovy title="main.nf" linenums="1"
workflow {
    samplesheet = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
        .dump(tag: 'samples', pretty: true)
}
```

```bash title="Read the samplesheet"
nextflow run main.nf
```

```console title="Read samplesheet with dump"
 N E X T F L O W ~ version 24.10.5

Launching `./main.nf` [grave_stone] DSL2 - revision: b2bafa8755
```

Wait?! Where is our output? `dump` only prints its channel when dumping is explicitly enabled at runtime with the `-dump-channels` command line option. The `tag` parameter gives the channel a name you can use to select which dumps to show. Let's enable it:

```bash title="Enable dump"
nextflow run main.nf -dump-channels samples
```

```console title="Read samplesheet with dump"
 N E X T F L O W ~ version 24.10.5

Launching `main.nf` [wise_kirch] DSL2 - revision: 7f194f2473

[DUMP: samples] {
    "id": "sampleA",
    "repeat": "1",
    "type": "normal",
    "fastq1": "sampleA_rep1_normal_R1.fastq.gz",
    "fastq2": "sampleA_rep1_normal_R2.fastq.gz"
}
[DUMP: samples] {
    "id": "sampleA",
    "repeat": "1",
    "type": "tumor",
    "fastq1": "sampleA_rep1_tumor_R1.fastq.gz",
    "fastq2": "sampleA_rep1_tumor_R2.fastq.gz"
}
[DUMP: samples] {
    "id": "sampleA",
    "repeat": "2",
    "type": "normal",
    "fastq1": "sampleA_rep2_normal_R1.fastq.gz",
    "fastq2": "sampleA_rep2_normal_R2.fastq.gz"
}
[DUMP: samples] {
    "id": "sampleA",
    "repeat": "2",
    "type": "tumor",
    "fastq1": "sampleA_rep2_tumor_R1.fastq.gz",
    "fastq2": "sampleA_rep2_tumor_R2.fastq.gz"
}
[DUMP: samples] {
    "id": "sampleB",
    "repeat": "1",
    "type": "normal",
    "fastq1": "sampleB_rep1_normal_R1.fastq.gz",
    "fastq2": "sampleB_rep1_normal_R2.fastq.gz"
}
[DUMP: samples] {
    "id": "sampleB",
    "repeat": "1",
    "type": "tumor",
    "fastq1": "sampleB_rep1_tumor_R1.fastq.gz",
    "fastq2": "sampleB_rep1_tumor_R2.fastq.gz"
}
[DUMP: samples] {
    "id": "sampleC",
    "repeat": "1",
    "type": "normal",
    "fastq1": "sampleC_rep1_normal_R1.fastq.gz",
    "fastq2": "sampleC_rep1_normal_R2.fastq.gz"
}
[DUMP: samples] {
    "id": "sampleC",
    "repeat": "1",
    "type": "tumor",
    "fastq1": "sampleC_rep1_tumor_R1.fastq.gz",
    "fastq2": "sampleC_rep1_tumor_R2.fastq.gz"
}
```

This is a long output, but we can see that each row from the CSV file has been converted into a map with keys matching the header row. It is clearer to read, at the cost of taking up a lot of room in a small terminal. If you want something more concise, remove the `pretty: true` parameter and the console output will look much like `view`'s.
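
As an aside, `view` can also take an optional closure that formats each item before printing. A minimal sketch, reusing the `samplesheet` channel from above (the one-line summary format is just an illustration):

```groovy title="Sketch: view with a formatting closure"
samplesheet.view { row -> "${row.id} rep${row.repeat} (${row.type})" }
```

This prints one compact line per row, a handy middle ground between plain `view` and pretty-printed `dump`.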
Both `dump` and `view` are useful for debugging, and we will continue to use them throughout this side quest. Feel free to sprinkle them in whenever you want to check what a channel contains.

### Takeaway

In this section, you've learned:

- **Reading in a samplesheet**: How to read in a samplesheet with `splitCsv`
- **Viewing data**: How to use `view` to print the data
- **Dumping data**: How to use `dump` to pretty print the data

We now have a channel of maps, each representing a row from the samplesheet. Next, we'll transform this data into a format suitable for our pipeline by extracting metadata and organizing the file paths.

---

## 2. Filter and transform data

### 2.1. Filter data with `filter`

We can use the [`filter` operator](https://www.nextflow.io/docs/latest/operator.html#filter) to filter the data based on a condition. Let's say we only want to process normal samples. We can do this by filtering on the `type` field. Let's insert this before the `dump` operator.

_Before:_

```groovy title="main.nf" linenums="1"
workflow {
    samplesheet = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
        .dump(tag: 'samples', pretty: true)
}
```

_After:_

```groovy title="main.nf" linenums="1"
workflow {
    samplesheet = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
        .filter { sample -> sample.type == 'normal' }
        .dump(tag: 'samples')
}
```

!!! note
We drop the `pretty: true` parameter from `dump` here so that each sample stays on a single line, which makes the filtered output easier to scan.

```bash title="View normal samples"
nextflow run main.nf -dump-channels samples
```

```console title="View normal samples"
 N E X T F L O W ~ version 24.10.5

Launching `main.nf` [stupefied_pike] DSL2 - revision: 8761d1b103

[DUMP: samples] ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz']
[DUMP: samples] ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']
[DUMP: samples] ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']
[DUMP: samples] ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']
```

We have successfully filtered the data to only include normal samples. Let's recap how this works. The `filter` operator takes a closure that is applied to each element in the channel. If the closure returns `true`, the element is included in the output channel; if it returns `false`, the element is excluded.

In this case, we want to keep only the samples where `sample.type == 'normal'`. Inside the closure, the variable name `sample` refers to each element in the channel, and the closure checks whether `sample.type` equals `'normal'`. If it does, the sample is kept; if not, it is dropped.

```groovy title="main.nf" linenums="4"
        .filter { sample -> sample.type == 'normal' }
```

### 2.2. Save results of filter to a new channel

While useful, this approach discards the tumor samples. Instead, let's rewrite our pipeline to save all the samples to one channel called `samplesheet`, then filter that channel to just the normal samples and save the results to a new channel called `normal_samples`.
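
Before we refactor, here is `filter`'s contract in isolation, as a minimal sketch with toy values that is separate from our pipeline:

```groovy title="Sketch: filter keeps items where the closure is true"
Channel.of(1, 2, 3, 4)
    .filter { n -> n % 2 == 0 }
    .view()
// Prints: 2, then 4
```

With that contract in mind, let's restructure the workflow: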
+ +_Before:_ + +```groovy title="main.nf" linenums="1" +workflow { + samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .filter { sample -> sample.type == 'normal' } + .dump(tag: 'samples') +} +``` + +_After:_ + +```groovy title="main.nf" linenums="1" +workflow { + samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + normal_samples = samplesheet + .filter { sample -> sample.type == 'normal' } + .view() +} +``` + +Once again, run the pipeline to see the results: + +```bash title="View normal samples" +nextflow run main.nf +``` + +```console title="View normal samples" + N E X T F L O W ~ version 24.10.5 + +Launching `main.nf` [lonely_miescher] DSL2 - revision: 7e26f19fd3 + +[id:sampleA, repeat:1, type:normal, fastq1:sampleA_rep1_normal_R1.fastq.gz, fastq2:sampleA_rep1_normal_R2.fastq.gz] +[id:sampleA, repeat:2, type:normal, fastq1:sampleA_rep2_normal_R1.fastq.gz, fastq2:sampleA_rep2_normal_R2.fastq.gz] +[id:sampleB, repeat:1, type:normal, fastq1:sampleB_rep1_normal_R1.fastq.gz, fastq2:sampleB_rep1_normal_R2.fastq.gz] +[id:sampleC, repeat:1, type:normal, fastq1:sampleC_rep1_normal_R1.fastq.gz, fastq2:sampleC_rep1_normal_R2.fastq.gz] +``` + +Success! We have filtered the data to only include normal samples. If we wanted, we still have access to the tumor samples within the `samplesheet` channel. Since we managed it for the normal samples, let's do it for the tumor samples as well: + +_Before:_ + +```groovy title="main.nf" linenums="1" +workflow { + samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + normal_samples = samplesheet + .filter { sample -> sample.type == 'normal' } + .view() +} +``` + +_After:_ + +```groovy title="main.nf" linenums="1" +workflow { + samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + normal_samples = samplesheet + .filter { sample -> sample.type == 'normal' } + .view() + tumor_samples = samplesheet + .filter { sample -> sample.type == 'tumor' } + .view() +} +``` + +```bash title="View tumor samples" +nextflow run main.nf +``` + +```console title="View tumor samples" + N E X T F L O W ~ version 24.10.5 + +Launching `main.nf` [focused_kirch] DSL2 - revision: 87d6672658 + +[id:sampleA, repeat:1, type:normal, fastq1:sampleA_rep1_normal_R1.fastq.gz, fastq2:sampleA_rep1_normal_R2.fastq.gz] +[id:sampleA, repeat:1, type:tumor, fastq1:sampleA_rep1_tumor_R1.fastq.gz, fastq2:sampleA_rep1_tumor_R2.fastq.gz] +[id:sampleA, repeat:2, type:normal, fastq1:sampleA_rep2_normal_R1.fastq.gz, fastq2:sampleA_rep2_normal_R2.fastq.gz] +[id:sampleA, repeat:2, type:tumor, fastq1:sampleA_rep2_tumor_R1.fastq.gz, fastq2:sampleA_rep2_tumor_R2.fastq.gz] +[id:sampleB, repeat:1, type:normal, fastq1:sampleB_rep1_normal_R1.fastq.gz, fastq2:sampleB_rep1_normal_R2.fastq.gz] +[id:sampleB, repeat:1, type:tumor, fastq1:sampleB_rep1_tumor_R1.fastq.gz, fastq2:sampleB_rep1_tumor_R2.fastq.gz] +[id:sampleC, repeat:1, type:normal, fastq1:sampleC_rep1_normal_R1.fastq.gz, fastq2:sampleC_rep1_normal_R2.fastq.gz] +[id:sampleC, repeat:1, type:tumor, fastq1:sampleC_rep1_tumor_R1.fastq.gz, fastq2:sampleC_rep1_tumor_R2.fastq.gz] +``` + +We've managed to separate out the normal and tumor samples into two different channels but they're mixed up when we `view` them in the console! Here's where dump could be useful, because it can label the different channels with a tag. 
_Before:_

```groovy title="main.nf" linenums="1"
workflow {
    samplesheet = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
    normal_samples = samplesheet
        .filter { sample -> sample.type == 'normal' }
        .view()
    tumor_samples = samplesheet
        .filter { sample -> sample.type == 'tumor' }
        .view()
}
```

_After:_

```groovy title="main.nf" linenums="1"
workflow {
    samplesheet = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
    normal_samples = samplesheet
        .filter { sample -> sample.type == 'normal' }
        .dump(tag: 'normal')
    tumor_samples = samplesheet
        .filter { sample -> sample.type == "tumor" }
        .dump(tag: 'tumor')
}
```

```bash title="View normal and tumor samples"
nextflow run main.nf -dump-channels normal,tumor
```

```console title="View normal and tumor samples"
 N E X T F L O W ~ version 24.10.5

Launching `main.nf` [spontaneous_jones] DSL2 - revision: 0e794240ef

[DUMP: tumor] ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']
[DUMP: normal] ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz']
[DUMP: tumor] ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']
[DUMP: normal] ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']
[DUMP: tumor] ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']
[DUMP: normal] ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']
[DUMP: tumor] ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']
[DUMP: normal] ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']
```

Note how the `normal` and `tumor` tags are used to label the different channels. This is useful for debugging and for understanding the data flow in our pipeline.

### Takeaway

In this section, you've learned:

- **Filtering data**: How to filter data with `filter`
- **Splitting data**: How to split data into different channels based on a condition
- **Dumping data**: How to use `dump` to label and print the data

We've now separated out the normal and tumor samples into two different channels. Next, we'll join the normal and tumor samples on the `id` field.

---

## 3. Join on ID

In the previous section, we separated the normal and tumor samples into two different channels. These could be processed independently using specific processes or workflows based on their type. But what happens when we want to compare the normal and tumor samples from the same patient? At this point, we need to join them back together, making sure to match the samples based on their `id` field.

Nextflow includes many methods for combining channels, but in this case the most appropriate operator is [`join`](https://www.nextflow.io/docs/latest/operator.html#join). This acts like a SQL `JOIN` operation: we specify the key to match on and how the join should behave.
### 3.1. Use `map` and `join` to combine based on sample ID

If we check the [`join`](https://www.nextflow.io/docs/latest/operator.html#join) documentation, we can see that it joins two channels based on the first item in each tuple. Let's run the pipeline to check our data structure and see how we need to modify it to join on the `id` field.

```bash title="View normal and tumor samples"
nextflow run main.nf -dump-channels normal,tumor
```

```console title="View normal and tumor samples"
 N E X T F L O W ~ version 24.10.5

Launching `main.nf` [spontaneous_jones] DSL2 - revision: 0e794240ef

[DUMP: tumor] ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']
[DUMP: normal] ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz']
[DUMP: tumor] ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']
[DUMP: normal] ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']
[DUMP: tumor] ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']
[DUMP: normal] ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']
[DUMP: tumor] ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']
[DUMP: normal] ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']
```

Looking at the output, each element in our channels is still a plain map, so there is no tuple with a key in the first position for `join` to match on. For `join` to work, we need to restructure each element so that the `id` comes first. After that, we can simply use the `join` operator to combine the two channels.

To isolate the `id` field, we can use the [`map` operator](https://www.nextflow.io/docs/latest/operator.html#map) to create a new tuple with the `id` field as the first element and the whole sample map as the second.
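
As a minimal sketch of that reshaping, using a toy map rather than our samplesheet:

```groovy title="Sketch: keying a map by one of its fields"
Channel.of([id: 'sampleA', type: 'normal'])
    .map { sample -> [sample.id, sample] }
    .view()
// Prints: [sampleA, [id:sampleA, type:normal]]
```

Now let's apply the same idea to both of our channels: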
+ +_Before:_ + +```groovy title="main.nf" linenums="1" +workflow { + samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + normal_samples = samplesheet + .filter { sample -> sample.type == 'normal' } + .dump(tag: 'normal') + tumor_samples = samplesheet + .filter { sample -> sample.type == "tumor" } + .dump(tag: 'tumor') +} +``` + +_After:_ + +```groovy title="main.nf" linenums="1" +workflow { + samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + normal_samples = samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [sample.id, sample] } + .dump(tag: 'normal') + tumor_samples = samplesheet + .filter { sample -> sample.type == "tumor" } + .map { sample -> [sample.id, sample] } + .dump(tag: 'tumor') +} +``` + +```bash title="View normal and tumor samples with ID as element 0" +nextflow run main.nf -dump-channels normal,tumor +``` + +```console title="View normal and tumor samples with ID as element 0" + N E X T F L O W ~ version 24.10.5 + +Launching `main.nf` [sick_jones] DSL2 - revision: 9b183fbc7c + +[DUMP: tumor] ['sampleA', ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']] +[DUMP: normal] ['sampleA', ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz']] +[DUMP: tumor] ['sampleA', ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']] +[DUMP: normal] ['sampleA', ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']] +[DUMP: tumor] ['sampleB', ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']] +[DUMP: normal] ['sampleB', ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']] +[DUMP: tumor] ['sampleC', ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']] +[DUMP: normal] ['sampleC', ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']] +``` + +It might be subtle, but you should be able to see the first element in each tuple is the `id` field. Now we can use the `join` operator to combine the two channels based on the `id` field. + +Once again, we will use `dump` to selectively print the joined outputs. 
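
First, here is `join` on its own, as a minimal sketch with toy tuples that is independent of our pipeline:

```groovy title="Sketch: join matches on the first element"
left  = Channel.of(['a', 1], ['b', 2])
right = Channel.of(['a', 'x'], ['b', 'y'])

left.join(right).view()
// Prints: [a, 1, x] and [b, 2, y]
```

Now let's wire it into the workflow: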
+ +_Before:_ + +```groovy title="main.nf" linenums="1" +workflow { + samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + normal_samples = samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [sample.id, sample] } + .dump(tag: 'normal') + tumor_samples = samplesheet + .filter { sample -> sample.type == "tumor" } + .map { sample -> [sample.id, sample] } + .dump(tag: 'tumor') +} +``` + +_After:_ + +```groovy title="main.nf" linenums="1" +workflow { + samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + normal_samples = samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [sample.id, sample] } + .dump(tag: 'normal') + tumor_samples = samplesheet + .filter { sample -> sample.type == "tumor" } + .map { sample -> [sample.id, sample] } + .dump(tag: 'tumor') + joined_samples = normal_samples + .join(tumor_samples) + .dump(tag: 'joined') +} +``` + +```bash title="View normal and tumor samples" +nextflow run main.nf -dump-channels joined +``` + +```console title="View joined normal and tumor samples" + N E X T F L O W ~ version 24.10.5 + +Launching `main.nf` [thirsty_poitras] DSL2 - revision: 95a2b8902b + +[DUMP: joined] ['sampleA', ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']] +[DUMP: joined] ['sampleA', ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']] +[DUMP: joined] ['sampleB', ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']] +[DUMP: joined] ['sampleC', ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']] +``` + +It's a little hard to tell because it's so wide, but you should be able to see the samples have been joined by the `id` field. 
Each tuple now has the format: + +- `id`: The sample ID +- `normal_sample`: The normal sample including type, replicate and path to fastq files +- `tumor_sample`: The tumor sample including type, replicate and path to fastq files + +If you want you can use the `pretty` parameter of `dump` to make it easier to read: + +_After:_ + +```groovy title="main.nf" linenums="1" +workflow { + samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + normal_samples = samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [sample.id, sample] } + .dump(tag: 'normal') + tumor_samples = samplesheet + .filter { sample -> sample.type == "tumor" } + .map { sample -> [sample.id, sample] } + .dump(tag: 'tumor') + joined_samples = normal_samples + .join(tumor_samples) + .dump(tag: 'joined', pretty: true) +} +``` + +```bash title="View normal and tumor samples" +nextflow run main.nf -dump-channels joined +``` + +```console title="View normal and tumor samples" + N E X T F L O W ~ version 24.10.5 + +Launching `main.nf` [tender_feynman] DSL2 - revision: 3505c6a732 + +[DUMP: joined] [ + "sampleA", + { + "id": "sampleA", + "repeat": "1", + "type": "normal", + "fastq1": "sampleA_rep1_normal_R1.fastq.gz", + "fastq2": "sampleA_rep1_normal_R2.fastq.gz" + }, + { + "id": "sampleA", + "repeat": "1", + "type": "tumor", + "fastq1": "sampleA_rep1_tumor_R1.fastq.gz", + "fastq2": "sampleA_rep1_tumor_R2.fastq.gz" + } +] +[DUMP: joined] [ + "sampleA", + { + "id": "sampleA", + "repeat": "2", + "type": "normal", + "fastq1": "sampleA_rep2_normal_R1.fastq.gz", + "fastq2": "sampleA_rep2_normal_R2.fastq.gz" + }, + { + "id": "sampleA", + "repeat": "2", + "type": "tumor", + "fastq1": "sampleA_rep2_tumor_R1.fastq.gz", + "fastq2": "sampleA_rep2_tumor_R2.fastq.gz" + } +] +[DUMP: joined] [ + "sampleB", + { + "id": "sampleB", + "repeat": "1", + "type": "normal", + "fastq1": "sampleB_rep1_normal_R1.fastq.gz", + "fastq2": "sampleB_rep1_normal_R2.fastq.gz" + }, + { + "id": "sampleB", + "repeat": "1", + "type": "tumor", + "fastq1": "sampleB_rep1_tumor_R1.fastq.gz", + "fastq2": "sampleB_rep1_tumor_R2.fastq.gz" + } +] +[DUMP: joined] [ + "sampleC", + { + "id": "sampleC", + "repeat": "1", + "type": "normal", + "fastq1": "sampleC_rep1_normal_R1.fastq.gz", + "fastq2": "sampleC_rep1_normal_R2.fastq.gz" + }, + { + "id": "sampleC", + "repeat": "1", + "type": "tumor", + "fastq1": "sampleC_rep1_tumor_R1.fastq.gz", + "fastq2": "sampleC_rep1_tumor_R2.fastq.gz" + } +] +``` + +!!! warning +The `join` operator will discard any un-matched tuples. In this example, we made sure all samples were matched for tumor and normal but if this is not true you must use the parameter `remainder: true` to keep the unmatched tuples. Check the [documentation](https://www.nextflow.io/docs/latest/operator.html#join) for more details. + +### Takeaway + +In this section, you've learned: + +- How to use `map` to isolate a field in a tuple +- How to use `join` to combine tuples based on the first field + +With this knowledge, we can successfully combine channels based on a shared field. Next, we'll consider the situation where you want to join on multiple fields. + +### 3.2. Join on multiple fields + +We have 2 replicates for sampleA, but only 1 for sampleB and sampleC. In this case we were able to join them effectively by using the `id` field, but what would happen if they were out of sync? We could mix up the normal and tumor samples from different replicates! This could be disastrous! 
+ +To avoid this, we can join on multiple fields. There are actually multiple ways to achieve this but we are going to focus on creating a new joining key which includes both the sample `id` and `replicate` number. + +Let's start by creating a new joining key. We can do this in the same way as before by using the [`map` operator](https://www.nextflow.io/docs/latest/operator.html#map) to create a new tuple with the `id` and `repeat` fields as the first element. + +_Before:_ + +```groovy title="main.nf" linenums="1" +workflow { + samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + normal_samples = samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [sample.id, sample] } + .dump(tag: 'normal') + tumor_samples = samplesheet + .filter { sample -> sample.type == "tumor" } + .map { sample -> [sample.id, sample] } + .dump(tag: 'tumor') + joined_samples = normal_samples + .join(tumor_samples) + .dump(tag: 'joined', pretty: true) +} +``` + +_After:_ + +```groovy title="main.nf" linenums="1" +workflow { + samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + normal_samples = samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [ + [sample.id, sample.repeat], + sample + ] + } + .dump(tag: 'normal') + tumor_samples = samplesheet + .filter { sample -> sample.type == "tumor" } + .map { sample -> [ + [sample.id, sample.repeat], + sample + ] + } + .dump(tag: 'tumor') + joined_samples = normal_samples + .join(tumor_samples) + .dump(tag: 'joined', pretty: true) +} +``` + +Now we should see the join is occurring but using both the `id` and `repeat` fields. + +```bash title="View normal and tumor samples" +nextflow run main.nf -dump-channels joined +``` + +```console title="View normal and tumor samples" + N E X T F L O W ~ version 24.10.5 + +Launching `main.nf` [cranky_lorenz] DSL2 - revision: 2be25de1df + +[DUMP: joined] [ + [ + "sampleA", + "1" + ], + { + "id": "sampleA", + "repeat": "1", + "type": "normal", + "fastq1": "sampleA_rep1_normal_R1.fastq.gz", + "fastq2": "sampleA_rep1_normal_R2.fastq.gz" + }, + { + "id": "sampleA", + "repeat": "1", + "type": "tumor", + "fastq1": "sampleA_rep1_tumor_R1.fastq.gz", + "fastq2": "sampleA_rep1_tumor_R2.fastq.gz" + } +] +[DUMP: joined] [ + [ + "sampleA", + "2" + ], + { + "id": "sampleA", + "repeat": "2", + "type": "normal", + "fastq1": "sampleA_rep2_normal_R1.fastq.gz", + "fastq2": "sampleA_rep2_normal_R2.fastq.gz" + }, + { + "id": "sampleA", + "repeat": "2", + "type": "tumor", + "fastq1": "sampleA_rep2_tumor_R1.fastq.gz", + "fastq2": "sampleA_rep2_tumor_R2.fastq.gz" + } +] +[DUMP: joined] [ + [ + "sampleB", + "1" + ], + { + "id": "sampleB", + "repeat": "1", + "type": "normal", + "fastq1": "sampleB_rep1_normal_R1.fastq.gz", + "fastq2": "sampleB_rep1_normal_R2.fastq.gz" + }, + { + "id": "sampleB", + "repeat": "1", + "type": "tumor", + "fastq1": "sampleB_rep1_tumor_R1.fastq.gz", + "fastq2": "sampleB_rep1_tumor_R2.fastq.gz" + } +] +[DUMP: joined] [ + [ + "sampleC", + "1" + ], + { + "id": "sampleC", + "repeat": "1", + "type": "normal", + "fastq1": "sampleC_rep1_normal_R1.fastq.gz", + "fastq2": "sampleC_rep1_normal_R2.fastq.gz" + }, + { + "id": "sampleC", + "repeat": "1", + "type": "tumor", + "fastq1": "sampleC_rep1_tumor_R1.fastq.gz", + "fastq2": "sampleC_rep1_tumor_R2.fastq.gz" + } +] +``` + +Note how we have a tuple of two elements (`id` and `repeat` fields) as the first element of each joined result. 
This demonstrates how complex items can be used as a joining key, enabling fairly intricate matching between samples from the same conditions. + +### 3.3. Use subMap to create a new joining key + +We have an issue from the above example. We have lost the field names from the original joining key, i.e. the `id` and `repeat` fields are just a list of two values. If we want to retain the field names so we can access them later by name we can use the [`subMap` method](). + +The `subMap` method takes a map and returns a new map with only the key-value pairs specified in the argument. In this case we want to specify the `id` and `repeat` fields. + +_Before:_ + +```groovy title="main.nf" linenums="1" +workflow { + samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + normal_samples = samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [ + [sample.id, sample.repeat], + sample + ] + } + .dump(tag: 'normal') + tumor_samples = samplesheet + .filter { sample -> sample.type == "tumor" } + .map { sample -> [ + [sample.id, sample.repeat], + sample + ] + } + .dump(tag: 'tumor') + joined_samples = normal_samples + .join(tumor_samples) + .dump(tag: 'joined', pretty: true) +} +``` + +_After:_ + +```groovy title="main.nf" linenums="1" +workflow { + samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + normal_samples = samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [ + sample.subMap(['id', 'repeat']), + sample + ] + } + .dump(tag: 'normal') + tumor_samples = samplesheet + .filter { sample -> sample.type == "tumor" } + .map { sample -> [ + sample.subMap(['id', 'repeat']), + sample + ] + } + .dump(tag: 'tumor') + joined_samples = normal_samples + .join(tumor_samples) + .dump(tag: 'joined', pretty: true) +} +``` + +```bash title="View normal and tumor samples" +nextflow run main.nf -dump-channels joined +``` + +```console title="View normal and tumor samples" + N E X T F L O W ~ version 24.10.5 + +Launching `main.nf` [insane_gautier] DSL2 - revision: bf5b9a6d37 + +[DUMP: joined] [ + { + "id": "sampleA", + "repeat": "1" + }, + { + "id": "sampleA", + "repeat": "1", + "type": "normal", + "fastq1": "sampleA_rep1_normal_R1.fastq.gz", + "fastq2": "sampleA_rep1_normal_R2.fastq.gz" + }, + { + "id": "sampleA", + "repeat": "1", + "type": "tumor", + "fastq1": "sampleA_rep1_tumor_R1.fastq.gz", + "fastq2": "sampleA_rep1_tumor_R2.fastq.gz" + } +] +[DUMP: joined] [ + { + "id": "sampleA", + "repeat": "2" + }, + { + "id": "sampleA", + "repeat": "2", + "type": "normal", + "fastq1": "sampleA_rep2_normal_R1.fastq.gz", + "fastq2": "sampleA_rep2_normal_R2.fastq.gz" + }, + { + "id": "sampleA", + "repeat": "2", + "type": "tumor", + "fastq1": "sampleA_rep2_tumor_R1.fastq.gz", + "fastq2": "sampleA_rep2_tumor_R2.fastq.gz" + } +] +[DUMP: joined] [ + { + "id": "sampleB", + "repeat": "1" + }, + { + "id": "sampleB", + "repeat": "1", + "type": "normal", + "fastq1": "sampleB_rep1_normal_R1.fastq.gz", + "fastq2": "sampleB_rep1_normal_R2.fastq.gz" + }, + { + "id": "sampleB", + "repeat": "1", + "type": "tumor", + "fastq1": "sampleB_rep1_tumor_R1.fastq.gz", + "fastq2": "sampleB_rep1_tumor_R2.fastq.gz" + } +] +[DUMP: joined] [ + { + "id": "sampleC", + "repeat": "1" + }, + { + "id": "sampleC", + "repeat": "1", + "type": "normal", + "fastq1": "sampleC_rep1_normal_R1.fastq.gz", + "fastq2": "sampleC_rep1_normal_R2.fastq.gz" + }, + { + "id": "sampleC", + "repeat": "1", + "type": "tumor", + "fastq1": "sampleC_rep1_tumor_R1.fastq.gz", + 
"fastq2": "sampleC_rep1_tumor_R2.fastq.gz" + } +] +``` + +Now we have a new joining key that not only includes the `id` and `repeat` fields but also retains the field names so we can access them later by name, e.g. `sample.id` and `sample.repeat`. + +### 3.4. Use a named closure in map + +Since we are re-using the same map in multiple places, we run the risk of introducing errors if we accidentally change the map in one place but not the other. To avoid this, we can use a named closure in the map. A named closure allows us to make a reusable function we can call later within a map. + +To do so, first we define the closure as a new variable: + +_Before:_ + +```groovy title="main.nf" linenums="1" +workflow { + samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) +``` + +_After:_ + +```groovy title="main.nf" linenums="1" +workflow { + getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } + samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) +``` + +We have taken the map we used previously and defined it as a named variable we can call later. Let's implement it in our workflow: + +_Before:_ + +```groovy title="main.nf" linenums="5" + normal_samples = samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [ + sample.subMap(['id', 'repeat']), + sample + ] + } + .dump(tag: 'normal') + tumor_samples = samplesheet + .filter { sample -> sample.type == "tumor" } + .map { sample -> [ + sample.subMap(['id', 'repeat']), + sample + ] + } + .dump(tag: 'tumor') +``` + +_After:_ + +```groovy title="main.nf" linenums="5" + normal_samples = samplesheet + .filter { sample -> sample.type == 'normal' } + .map ( getSampleIdAndReplicate ) + .dump(tag: 'normal') + tumor_samples = samplesheet + .filter { sample -> sample.type == "tumor" } + .map ( getSampleIdAndReplicate ) + .dump(tag: 'tumor') +``` + +!!! note +The `map` operator has switched from using `{ }` to using `( )` to pass the closure as an argument. This is because the `map` operator expects a closure as an argument and `{ }` is used to define an anonymous closure. When calling a named closure, use the `( )` syntax. 
+ +```bash title="View normal and tumor samples" +nextflow run main.nf -dump-channels joined +``` + +```console title="View normal and tumor samples" + N E X T F L O W ~ version 24.10.5 + +Launching `main.nf` [trusting_boltzmann] DSL2 - revision: 0b1cd77e3b + +[DUMP: joined] [ + { + "id": "sampleA", + "repeat": "1" + }, + { + "id": "sampleA", + "repeat": "1", + "type": "normal", + "fastq1": "sampleA_rep1_normal_R1.fastq.gz", + "fastq2": "sampleA_rep1_normal_R2.fastq.gz" + }, + { + "id": "sampleA", + "repeat": "1", + "type": "tumor", + "fastq1": "sampleA_rep1_tumor_R1.fastq.gz", + "fastq2": "sampleA_rep1_tumor_R2.fastq.gz" + } +] +[DUMP: joined] [ + { + "id": "sampleA", + "repeat": "2" + }, + { + "id": "sampleA", + "repeat": "2", + "type": "normal", + "fastq1": "sampleA_rep2_normal_R1.fastq.gz", + "fastq2": "sampleA_rep2_normal_R2.fastq.gz" + }, + { + "id": "sampleA", + "repeat": "2", + "type": "tumor", + "fastq1": "sampleA_rep2_tumor_R1.fastq.gz", + "fastq2": "sampleA_rep2_tumor_R2.fastq.gz" + } +] +[DUMP: joined] [ + { + "id": "sampleB", + "repeat": "1" + }, + { + "id": "sampleB", + "repeat": "1", + "type": "normal", + "fastq1": "sampleB_rep1_normal_R1.fastq.gz", + "fastq2": "sampleB_rep1_normal_R2.fastq.gz" + }, + { + "id": "sampleB", + "repeat": "1", + "type": "tumor", + "fastq1": "sampleB_rep1_tumor_R1.fastq.gz", + "fastq2": "sampleB_rep1_tumor_R2.fastq.gz" + } +] +[DUMP: joined] [ + { + "id": "sampleC", + "repeat": "1" + }, + { + "id": "sampleC", + "repeat": "1", + "type": "normal", + "fastq1": "sampleC_rep1_normal_R1.fastq.gz", + "fastq2": "sampleC_rep1_normal_R2.fastq.gz" + }, + { + "id": "sampleC", + "repeat": "1", + "type": "tumor", + "fastq1": "sampleC_rep1_tumor_R1.fastq.gz", + "fastq2": "sampleC_rep1_tumor_R2.fastq.gz" + } +] +``` + +Using a named closure in the map allows us to reuse the same map in multiple places which reduces our risk of introducing errors. It also makes the code more readable and easier to maintain. + +### Takeaway + +In this section, you've learned: + +- **Manipulating Tuples**: How to use `map` to isolate a field in a tuple +- **Joining Tuples**: How to use `join` to combine tuples based on the first field +- **Creating Joining Keys**: How to use `subMap` to create a new joining key +- **Named Closures**: How to use a named closure in map + +You now have a workflow that can split a samplesheet, filter the normal and tumor samples, join them together by sample ID and replicate number, then dump the results. + +This is a common pattern in bioinformatics workflows where you need to match up samples after processing independently, so it is a useful skill. Next, we will look at aggregating samples by fields. + +### 4. Aggregating samples + +In the previous section, we learned how to split a samplesheet and filter the normal and tumor samples. But this only covers a single type of joining. What if we want to group samples by a specific attribute? For example, instead of joining matched normal-tumor pairs, we might want to process all samples from "sampleA" together regardless of their type. This pattern is common in bioinformatics workflows where you may want to process related samples separately for efficiency reasons before comparing or combining the results at the end. + +Nextflow includes built in methods to do this, the main one we will look at is `groupTuple`. + +### 4.1. Grouping samples using `groupTuple` + +Let's start by grouping the samples by our `id` field. We can do this by using the `groupTuple` operator. 
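
Before applying it to our samples, here is `groupTuple` in isolation, as a minimal sketch with toy tuples that is separate from the pipeline:

```groovy title="Sketch: groupTuple collects values that share a key"
Channel.of(['a', 1], ['b', 2], ['a', 3])
    .groupTuple()
    .view()
// Prints: [a, [1, 3]] and [b, [2]]
```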
As a reminder, what we are trying to achieve is to take all of the samples with the same `id` and group them together. We had 3 samples in the starting samplesheet (A, B and C), so we should end up with 3 grouped samples at the end of this step.

The first step is similar to what we did in the previous section. We must isolate our grouping variable as the first element of the tuple. Remember, our first element is currently a map of `id` and `repeat` fields:

```json title="First element of each tuple"
{
    "id": "sampleA",
    "repeat": "1"
}
```

We can reuse the `subMap` method from before to isolate our `id` field after joining. Like before, we will use `map` to apply the `subMap` method to the first element of the tuple for each sample.

_Before:_

```groovy title="main.nf" linenums="13"
    joined_samples = normal_samples
        .join(tumor_samples)
        .dump(tag: 'joined', pretty: true)
}
```

_After:_

```groovy title="main.nf" linenums="13"
    joined_samples = normal_samples
        .join(tumor_samples)
        .dump(tag: 'joined', pretty: true)

    joined_samples.map { samples, normal, tumor ->
        [
            samples.subMap('id'),
            normal,
            tumor
        ]
    }
    .dump(tag: 'grouped')
}
```

Let's run it again and check the channel contents:

```bash title="View grouped samples"
nextflow run main.nf -dump-channels grouped
```

```console title="View grouped samples"
 N E X T F L O W ~ version 24.10.5

Launching `main.nf` [amazing_euler] DSL2 - revision: 765de536ee

[DUMP: grouped] [['id':'sampleA'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']]
[DUMP: grouped] [['id':'sampleA'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']]
[DUMP: grouped] [['id':'sampleB'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']]
[DUMP: grouped] [['id':'sampleC'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]
```

We can see that we have successfully isolated the `id` field, but we have not grouped the samples yet.

Let's now group the samples by the `id` field, using the [`groupTuple` operator](https://www.nextflow.io/docs/latest/operator.html#grouptuple).

_Before:_

```groovy title="main.nf" linenums="21"
    joined_samples.map { samples, normal, tumor ->
        [
            samples.subMap('id'),
            normal,
            tumor
        ]
    }
    .dump(tag: 'grouped')
}
```

_After:_

```groovy title="main.nf" linenums="21"
    grouped_samples = joined_samples.map { samples, normal, tumor ->
        [
            samples.subMap('id'),
            normal,
            tumor
        ]
    }
        .groupTuple()
        .dump(tag: 'grouped')
}
```

Simple, huh? We just added a single line of code.
Let's see what happens when we run it: + +```bash title="View grouped samples" +nextflow run main.nf -dump-channels grouped +``` + +```console title="View grouped samples" + N E X T F L O W ~ version 24.10.5 + +Launching `main.nf` [condescending_baekeland] DSL2 - revision: 73b96e0f01 + +[DUMP: grouped] [['id':'sampleA'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']]] +[DUMP: grouped] [['id':'sampleB'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']]] +[DUMP: grouped] [['id':'sampleC'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]] +``` + +It's a little awkward to read, but you should see there are 3 groups of samples, one for each `id` field. `sampleA` has 2 normal and 2 tumor samples, `sampleB` has 1 normal and 1 tumor sample, and `sampleC` has 1 normal and 1 tumor sample. + +If you're having trouble visualizing it, you can use the `pretty` flag of `dump` to make it easier to read: + +_Before:_ + +```groovy title="main.nf" linenums="24" + .dump(tag: 'grouped') +} +``` + +_After:_ + +```groovy title="main.nf" linenums="24" + .dump(tag: 'grouped', pretty: true) +} +``` + +```console title="View grouped samples" + N E X T F L O W ~ version 24.10.5 + +Launching `main.nf` [nice_poisson] DSL2 - revision: a102e91428 + +[DUMP: grouped] [ + { + "id": "sampleA" + }, + [ + { + "id": "sampleA", + "repeat": "1", + "type": "normal", + "fastq1": "sampleA_rep1_normal_R1.fastq.gz", + "fastq2": "sampleA_rep1_normal_R2.fastq.gz" + }, + { + "id": "sampleA", + "repeat": "2", + "type": "normal", + "fastq1": "sampleA_rep2_normal_R1.fastq.gz", + "fastq2": "sampleA_rep2_normal_R2.fastq.gz" + } + ], + [ + { + "id": "sampleA", + "repeat": "1", + "type": "tumor", + "fastq1": "sampleA_rep1_tumor_R1.fastq.gz", + "fastq2": "sampleA_rep1_tumor_R2.fastq.gz" + }, + { + "id": "sampleA", + "repeat": "2", + "type": "tumor", + "fastq1": "sampleA_rep2_tumor_R1.fastq.gz", + "fastq2": "sampleA_rep2_tumor_R2.fastq.gz" + } + ] +] +[DUMP: grouped] [ + { + "id": "sampleB" + }, + [ + { + "id": "sampleB", + "repeat": "1", + "type": "normal", + "fastq1": "sampleB_rep1_normal_R1.fastq.gz", + "fastq2": "sampleB_rep1_normal_R2.fastq.gz" + } + ], + [ + { + "id": "sampleB", + "repeat": "1", + "type": "tumor", + "fastq1": "sampleB_rep1_tumor_R1.fastq.gz", + "fastq2": "sampleB_rep1_tumor_R2.fastq.gz" + } + ] +] +[DUMP: grouped] [ + { + "id": "sampleC" + }, + [ + { + "id": "sampleC", + "repeat": "1", + "type": "normal", + "fastq1": "sampleC_rep1_normal_R1.fastq.gz", + "fastq2": "sampleC_rep1_normal_R2.fastq.gz" + } + ], + [ + { + "id": "sampleC", + "repeat": "1", + "type": "tumor", + "fastq1": 
"sampleC_rep1_tumor_R1.fastq.gz", + "fastq2": "sampleC_rep1_tumor_R2.fastq.gz" + } + ] +] +``` + +Note our data has changed structure. What was previously a list of tuples is now a list of lists of tuples. This is because when we use `groupTuple`, Nextflow creates a new list for each group. This is important when trying to handle the data downstream. + +It's possible to use a simpler data structure than this, by separating our the sample information from the sequencing data. We generally refer to this as a `metamap`, but this will be covered in a later side quest. For now, you should just understand that we can group up samples using the `groupTuple` operator and that the data structure will change as a result. + +!!! note +[`transpose`](https://www.nextflow.io/docs/latest/reference/operator.html#transpose) is the opposite of groupTuple. It unpacks the items in a channel and flattens them. Try and add `transpose` and undo the grouping we performed above! + +### Takeaway + +In this section, you've learned: + +- **Grouping samples**: How to use `groupTuple` to group samples by a specific attribute + +You now have a workflow that can split a samplesheet, filter the normal and tumor samples, join them together by sample ID and replicate number, then group them by `id`. + +## 5. Spread samples over intervals + +!!! AUTHORS NOTE !!! SWAP SPREADING AND GROUPTUPLE + +Spreading samples over different conditions is a common pattern in bioinformatics workflows. For example, it is used to spread variant calling over a range of intervals. This can help distribute work across multiple cores or nodes and make the pipelines more efficient and be turned around faster. + +In the next section, we will demonstrate how to take our existing samples and repeat each one for every interval. In this way, we will have a single sample for each input interval. We will also multiply our number of samples by the number of intervals, so get ready for a busy terminal! + +### 5.1. Spread samples over intervals using `combine` + +Let's start by creating a channel of intervals. To keep life simple, we will just use 3 intervals we will manually define. In a real workflow, you could read these in from a file input or even create a channel with lots of interval files. + +_Before:_ + +```groovy title="main.nf" linenums="24" + .dump(tag: 'grouped', pretty: true) +} +``` + +_After:_ + +```groovy title="main.nf" linenums="24" + .dump(tag: 'grouped', pretty: true) + + intervals = Channel.of('chr1', 'chr2', 'chr3') + .dump(tag: "intervals") +} +``` + +Now remember, we want to repeat each sample for each interval. This is sometimes referred to as the Cartesian product of the samples and intervals. We can achieve this by using the [`combine` operator](https://www.nextflow.io/docs/latest/operator.html#combine). This will take every item from channel 1 and repeat it for each item in channel 2. 
Let's add a combine operator to our workflow:

_Before:_

```groovy title="main.nf" linenums="26"
    intervals = Channel.of('chr1', 'chr2', 'chr3')
        .dump(tag: "intervals")
}
```

_After:_

```groovy title="main.nf" linenums="26"
    intervals = Channel.of('chr1', 'chr2', 'chr3')
        .dump(tag: "intervals")

    grouped_samples.combine(intervals)
        .dump(tag: 'combined')
}
```

Now let's run it and see what happens:

```bash title="View combined samples"
nextflow run main.nf -dump-channels combined
```

```console title="View combined samples"
 N E X T F L O W ~ version 24.10.5

Launching `main.nf` [dreamy_carlsson] DSL2 - revision: 0abb4c9e41

[DUMP: combined] [['id':'sampleA'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']], 'chr1']
[DUMP: combined] [['id':'sampleA'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']], 'chr2']
[DUMP: combined] [['id':'sampleA'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']], 'chr3']
[DUMP: combined] [['id':'sampleB'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']], 'chr1']
[DUMP: combined] [['id':'sampleB'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']], 'chr2']
[DUMP: combined] [['id':'sampleB'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']], 'chr3']
[DUMP: combined] [['id':'sampleC'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']], 'chr1']
[DUMP: combined] [['id':'sampleC'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']], 'chr2']
[DUMP: combined] [['id':'sampleC'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']], 'chr3']
```

Success! We have repeated every sample for every interval in our 3-interval list, effectively tripling the number of items in our channel.

### Takeaway

In this section, you've learned:

- **Spreading samples over intervals**: How to use `combine` to repeat samples over intervals

## Summary

You've now seen how to split a samplesheet, filter the normal and tumor samples, join them together by sample ID and replicate number, then group them by `id`. You've also seen how to spread samples over intervals using the `combine` operator.
Group after intervals diff --git a/side-quests/splitting_and_grouping/data/intervals.txt b/side-quests/splitting_and_grouping/data/intervals.txt new file mode 100644 index 0000000000..c0a1f9e3f7 --- /dev/null +++ b/side-quests/splitting_and_grouping/data/intervals.txt @@ -0,0 +1,3 @@ +chr1 +chr2 +chr3 diff --git a/side-quests/splitting_and_grouping/data/samplesheet.csv b/side-quests/splitting_and_grouping/data/samplesheet.csv new file mode 100644 index 0000000000..a4cac668e1 --- /dev/null +++ b/side-quests/splitting_and_grouping/data/samplesheet.csv @@ -0,0 +1,9 @@ +id,repeat,type,fastq1,fastq2 +sampleA,1,normal,sampleA_rep1_normal_R1.fastq.gz,sampleA_rep1_normal_R2.fastq.gz +sampleA,1,tumor,sampleA_rep1_tumor_R1.fastq.gz,sampleA_rep1_tumor_R2.fastq.gz +sampleA,2,normal,sampleA_rep2_normal_R1.fastq.gz,sampleA_rep2_normal_R2.fastq.gz +sampleA,2,tumor,sampleA_rep2_tumor_R1.fastq.gz,sampleA_rep2_tumor_R2.fastq.gz +sampleB,1,normal,sampleB_rep1_normal_R1.fastq.gz,sampleB_rep1_normal_R2.fastq.gz +sampleB,1,tumor,sampleB_rep1_tumor_R1.fastq.gz,sampleB_rep1_tumor_R2.fastq.gz +sampleC,1,normal,sampleC_rep1_normal_R1.fastq.gz,sampleC_rep1_normal_R2.fastq.gz +sampleC,1,tumor,sampleC_rep1_tumor_R1.fastq.gz,sampleC_rep1_tumor_R2.fastq.gz diff --git a/side-quests/splitting_and_grouping/main.nf b/side-quests/splitting_and_grouping/main.nf new file mode 100644 index 0000000000..def5228d58 --- /dev/null +++ b/side-quests/splitting_and_grouping/main.nf @@ -0,0 +1,31 @@ +workflow { + getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } + samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + normal_samples = samplesheet + .filter { sample -> sample.type == 'normal' } + .map ( getSampleIdAndReplicate ) + .dump(tag: 'normal') + tumor_samples = samplesheet + .filter { sample -> sample.type == "tumor" } + .map ( getSampleIdAndReplicate ) + .dump(tag: 'tumor') + joined_samples = normal_samples + .join(tumor_samples) + .dump(tag: 'joined', pretty: true) + grouped_samples = joined_samples.map { samples, normal, tumor -> + [ + samples.subMap('id'), + normal, + tumor + ] + } + .groupTuple() + .dump(tag: 'grouped', pretty: true) + + intervals = Channel.of('chr1', 'chr2', 'chr3') + .dump(tag: "intervals") + + grouped_samples.combine(intervals) + .dump(tag: 'combined') +} From 5ef395374855c591a7528451cc136c2ec1fba6ce Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Tue, 8 Apr 2025 15:52:06 +0100 Subject: [PATCH 02/36] Spread over intervals prior to grouping This commit reverses the order to spread over intervals prior to grouping. This achieves two things: 1. It explains everything once and only once to make the tutorial simpler 2. It provides a real world reason for using groupTuple This makes the flow of the tutorial easier to understand, at the cost of very verbose outputs. --- docs/side_quests/splitting-and-grouping.md | 558 +++++++++++++++------ side-quests/splitting_and_grouping/main.nf | 28 -- 2 files changed, 402 insertions(+), 184 deletions(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 325f43eb64..7a38a63668 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -1178,55 +1178,230 @@ In this section, you've learned: You now have a workflow that can split a samplesheet, filter the normal and tumor samples, join them together by sample ID and replicate number, then dump the results. 
-This is a common pattern in bioinformatics workflows where you need to match up samples after processing independently, so it is a useful skill. Next, we will look at aggregating samples by fields.
+This is a common pattern in bioinformatics workflows where you need to match up samples after processing independently, so it is a useful skill. Next, we will look at repeating a sample multiple times.

-### 4. Aggregating samples
+## 4. Spread samples over intervals

Spreading samples over different conditions is a common pattern in bioinformatics workflows. For example, it is used to spread variant calling over a range of intervals. This can help distribute work across multiple cores or nodes, making pipelines more efficient and faster to turn around.

In the next section, we will demonstrate how to take our existing samples and repeat each one for every interval. In this way, we will have a single sample for each input interval. We will also multiply our number of samples by the number of intervals, so get ready for a busy terminal!

### 4.1. Spread samples over intervals using `combine`

Let's start by creating a channel of intervals. To keep life simple, we will just use 3 manually defined intervals. In a real workflow, you could read these in from a file input or even create a channel with lots of interval files.

_Before:_

```groovy title="main.nf" linenums="15"
    .dump(tag: 'joined', pretty: true)
}
```

_After:_

```groovy title="main.nf" linenums="15"
    .dump(tag: 'joined', pretty: true)
    intervals = Channel.of('chr1', 'chr2', 'chr3')
        .dump(tag: "intervals")
}
```

Now remember, we want to repeat each sample for each interval. This is sometimes referred to as the Cartesian product of the samples and intervals. We can achieve this by using the [`combine` operator](https://www.nextflow.io/docs/latest/operator.html#combine). This will take every item from channel 1 and repeat it for each item in channel 2.
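If you want to see this Cartesian behaviour in isolation first, here is a minimal, self-contained sketch you could run as its own script. The letter and number values are made up purely for the demonstration; they are not part of our workflow:

```groovy
workflow {
    ch_letters = Channel.of('a', 'b')
    ch_numbers = Channel.of(1, 2, 3)

    // Every item from ch_letters is paired with every item from ch_numbers,
    // giving 2 x 3 = 6 tuples: [a, 1], [a, 2], [a, 3], [b, 1], [b, 2], [b, 3]
    ch_letters.combine(ch_numbers).view()
}
```

The exact order of emission may vary between runs, but the product is always complete. Our samples channel will behave exactly the same way, just with bigger tuples.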
Let's add a combine operator to our workflow: + +_Before:_ + +```groovy title="main.nf" linenums="26" + intervals = Channel.of('chr1', 'chr2', 'chr3') + .dump(tag: "intervals") +} +``` + +_After:_ + +```groovy title="main.nf" linenums="26" + intervals = Channel.of('chr1', 'chr2', 'chr3') + .dump(tag: "intervals") + + combined_samples = joined_samples.combine(intervals) + .dump(tag: 'combined') +} +``` + +Now let's run it and see what happens: + +```bash title="View combined samples" +nextflow run main.nf -dump-channels combined +``` + +```console title="View combined samples" + N E X T F L O W ~ version 24.10.5 + +Launching `main.nf` [extravagant_maxwell] DSL2 - revision: 459bde3584 + +[DUMP: combined] [['id':'sampleA', 'repeat':'1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], 'chr1'] +[DUMP: combined] [['id':'sampleA', 'repeat':'1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], 'chr2'] +[DUMP: combined] [['id':'sampleA', 'repeat':'1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], 'chr3'] +[DUMP: combined] [['id':'sampleA', 'repeat':'2'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz'], 'chr1'] +[DUMP: combined] [['id':'sampleA', 'repeat':'2'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz'], 'chr2'] +[DUMP: combined] [['id':'sampleA', 'repeat':'2'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz'], 'chr3'] +[DUMP: combined] [['id':'sampleB', 'repeat':'1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz'], 'chr1'] +[DUMP: combined] [['id':'sampleB', 'repeat':'1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz'], 'chr2'] +[DUMP: combined] [['id':'sampleB', 'repeat':'1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 
'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz'], 'chr3']
[DUMP: combined] [['id':'sampleC', 'repeat':'1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz'], 'chr1']
[DUMP: combined] [['id':'sampleC', 'repeat':'1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz'], 'chr2']
[DUMP: combined] [['id':'sampleC', 'repeat':'1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz'], 'chr3']
```

Success! We have repeated every sample for every single interval in our 3-interval list. We've effectively tripled the number of items in our channel. It's a little hard to read though, so in the next section we will tidy it up.

### 4.2. Organise the channel

We can use the `map` operator to tidy and refactor our sample data so it's easier to understand. Let's move the interval string into the grouping map that sits in the first element.

_Before:_

```groovy title="main.nf" linenums="19"
    combined_samples = joined_samples.combine(intervals)
        .dump(tag: 'combined')
}
```

_After:_

```groovy title="main.nf" linenums="19"
    combined_samples = joined_samples.combine(intervals)
        .map { grouping_key, normal, tumor, interval ->
            [
                grouping_key + [interval: interval],
                normal,
                tumor
            ]

        }
        .dump(tag: 'combined')
}
```

Wait, what did we do here? Let's go over it piece by piece.

First, we use a `map` operator to iterate over every item in the channel. By using the names `grouping_key`, `normal`, `tumor` and `interval`, we can refer to the elements in the tuple by name instead of by index. This makes the code more readable and easier to understand.

```groovy
.map { grouping_key, normal, tumor, interval ->
```

Next, we create a new map by combining the `grouping_key` with the `interval` field. Remember, the `grouping_key` is the first element of the tuple, which is a map of `id` and `repeat` fields. The `interval` is just a string, but we make it into a new map with the key `interval` and the string as its value. By 'adding' them (`+`), Groovy will merge them together to produce the union of the two maps.

```groovy
grouping_key + [interval: interval],
```

Finally, we return all of this as one tuple of three elements: the new map, the normal sample data, and the tumor sample data.
+ +```groovy +[ + grouping_key + [interval: interval], + normal, + tumor +] +``` + +Let's run it again and check the channel contents: + +```bash title="View combined samples" +nextflow run main.nf -dump-channels combined +``` + +```console title="View combined samples" + N E X T F L O W ~ version 24.10.5 + +Launching `main.nf` [focused_curie] DSL2 - revision: 9953685fec + +[DUMP: combined] [['id':'sampleA', 'repeat':'1', 'interval':'chr1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']] +[DUMP: combined] [['id':'sampleA', 'repeat':'1', 'interval':'chr2'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']] +[DUMP: combined] [['id':'sampleA', 'repeat':'1', 'interval':'chr3'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']] +[DUMP: combined] [['id':'sampleA', 'repeat':'2', 'interval':'chr1'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']] +[DUMP: combined] [['id':'sampleA', 'repeat':'2', 'interval':'chr2'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']] +[DUMP: combined] [['id':'sampleA', 'repeat':'2', 'interval':'chr3'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']] +[DUMP: combined] [['id':'sampleB', 'repeat':'1', 'interval':'chr1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']] +[DUMP: combined] [['id':'sampleB', 'repeat':'1', 'interval':'chr2'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']] +[DUMP: combined] [['id':'sampleB', 'repeat':'1', 'interval':'chr3'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']] +[DUMP: combined] [['id':'sampleC', 'repeat':'1', 'interval':'chr1'], ['id':'sampleC', 'repeat':'1', 
'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]
[DUMP: combined] [['id':'sampleC', 'repeat':'1', 'interval':'chr2'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]
[DUMP: combined] [['id':'sampleC', 'repeat':'1', 'interval':'chr3'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]
```

Using `map` to coerce your data into the correct structure can be tricky, but it's crucial for splitting and grouping effectively.

### Takeaway

In this section, you've learned:

- **Spreading samples over intervals**: How to use `combine` to repeat samples over intervals

## 5. Aggregating samples

In the previous section, we learned how to split a samplesheet and filter the normal and tumor samples. But this only covers a single type of joining. What if we want to group samples by a specific attribute? For example, instead of joining matched normal-tumor pairs, we might want to process all samples from "sampleA" together regardless of their type.

This pattern is common in bioinformatics workflows where you may want to process related samples separately for efficiency reasons before comparing or combining the results at the end. Nextflow includes built-in methods to do this; the main one we will look at is `groupTuple`.

### 5.1. Grouping samples using `groupTuple`

-Let's start by grouping the samples by our `id` field. We can do this by using the `groupTuple` operator.
+Let's start by grouping all of our samples that have the same `id` and `interval` fields; this would be typical of an analysis where we wanted to group technical replicates but keep meaningfully different samples separated.

-As a reminder, what we are trying to achieve it to take all of the samples with the same `id` and group them together. We had 3 samples in the starting samplesheet (A, B and C) so we should end up with 3 grouped samples at the end of this step.
+To do this, we should separate out our grouping variables so we can use them in isolation.

The first step is similar to what we did in the previous section. We must isolate our grouping variable as the first element of the tuple. Remember, our first element is currently a map of `id`, `repeat` and `interval` fields:

```groovy title="main.nf" linenums="1"
{
    "id": "sampleA",
    "repeat": "1",
    "interval": "chr1"
}
```

We can reuse the `subMap` method from before to isolate our `id` and `interval` fields from the map.
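If `subMap` is unfamiliar, here is a quick standalone illustration of the two map tricks this section leans on: `subMap` to pick keys out, and `+` to merge maps. The `sample` map below is a made-up stand-in you could paste into `nextflow console` to experiment with:

```groovy
def sample = [id: 'sampleA', repeat: '1', interval: 'chr1', type: 'normal']

// subMap copies only the named keys into a new map
def grouping_key = sample.subMap('id', 'interval')
println grouping_key                  // [id:sampleA, interval:chr1]

// '+' merges two maps into a new one (the right-hand side wins on clashes)
println grouping_key + [repeat: '2']  // [id:sampleA, interval:chr1, repeat:2]
```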
Like before, we will use `map` operator to apply the `subMap` method to the first element of the tuple for each sample. _Before:_ -```groovy title="main.nf" linenums="13" - joined_samples = normal_samples - .join(tumor_samples) - .dump(tag: 'joined', pretty: true) +```groovy title="main.nf" linenums="19" + combined_samples = joined_samples.combine(intervals) + .map { grouping_key, normal, tumor, interval -> + [ + grouping_key + [interval: interval], + normal, + tumor + ] + + } + .dump(tag: 'combined') } ``` _After:_ -```groovy title="main.nf" linenums="13" - joined_samples = normal_samples - .join(tumor_samples) - .dump(tag: 'joined', pretty: true) +```groovy title="main.nf" linenums="19" + combined_samples = joined_samples.combine(intervals) + .map { grouping_key, normal, tumor, interval -> + [ + grouping_key + [interval: interval], + normal, + tumor + ] - joined_samples.map { samples, normal, tumor -> - [ - samples.subMap('id'), - normal, - tumor - ] - } - .dump(tag: 'grouped') + } + .dump(tag: 'combined') + + grouped_samples = combined_samples.map { grouping_key, normal, tumor -> + [ + grouping_key.subMap('id', 'interval'), + normal, + tumor + ] + + } + .dump(tag: 'grouped') } ``` @@ -1239,44 +1414,55 @@ nextflow run main.nf -dump-channels grouped ```console title="View grouped samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [amazing_euler] DSL2 - revision: 765de536ee +Launching `main.nf` [fabulous_baekeland] DSL2 - revision: 5d2d687351 -[DUMP: grouped] [['id':'sampleA'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']] -[DUMP: grouped] [['id':'sampleA'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']] -[DUMP: grouped] [['id':'sampleB'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']] -[DUMP: grouped] [['id':'sampleC'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']] +[DUMP: grouped] [['id':'sampleA', 'interval':'chr1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']] +[DUMP: grouped] [['id':'sampleA', 'interval':'chr2'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']] +[DUMP: grouped] [['id':'sampleA', 'interval':'chr3'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], 
['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']] +[DUMP: grouped] [['id':'sampleA', 'interval':'chr1'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']] +[DUMP: grouped] [['id':'sampleA', 'interval':'chr2'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']] +[DUMP: grouped] [['id':'sampleA', 'interval':'chr3'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']] +[DUMP: grouped] [['id':'sampleB', 'interval':'chr1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']] +[DUMP: grouped] [['id':'sampleB', 'interval':'chr2'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']] +[DUMP: grouped] [['id':'sampleB', 'interval':'chr3'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']] +[DUMP: grouped] [['id':'sampleC', 'interval':'chr1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']] +[DUMP: grouped] [['id':'sampleC', 'interval':'chr2'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']] +[DUMP: grouped] [['id':'sampleC', 'interval':'chr3'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']] ``` -We can see that we have successfully isolated the `id` field, but not grouped the samples yet. +We can see that we have successfully isolated the `id` and `interval` fields, but not grouped the samples yet. -Let's now group the samples by the `id` field, using the [`groupTuple` operator](https://www.nextflow.io/docs/latest/operator.html#grouptuple). 
+Let's now group the samples by this new grouping element, using the [`groupTuple` operator](https://www.nextflow.io/docs/latest/operator.html#grouptuple).

_Before:_

```groovy title="main.nf" linenums="30"
    grouped_samples = combined_samples.map { grouping_key, normal, tumor ->
        [
            grouping_key.subMap('id', 'interval'),
            normal,
            tumor
        ]

        }
        .dump(tag: 'grouped')
}
```

_After:_

```groovy title="main.nf" linenums="30"
    grouped_samples = combined_samples.map { grouping_key, normal, tumor ->
        [
            grouping_key.subMap('id', 'interval'),
            normal,
            tumor
        ]

        }
        .groupTuple()
        .dump(tag: 'grouped')
}
```

```bash title="View grouped samples"
nextflow run main.nf -dump-channels grouped
```

```console title="View grouped samples"
 N E X T F L O W ~ version 24.10.5

Launching `main.nf` [reverent_nightingale] DSL2 - revision: 72c6664d6f

[DUMP: grouped] [['id':'sampleA', 'interval':'chr1'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']]]
[DUMP: grouped] [['id':'sampleA', 'interval':'chr2'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz',
'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']]] +[DUMP: grouped] [['id':'sampleA', 'interval':'chr3'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']]] +[DUMP: grouped] [['id':'sampleB', 'interval':'chr1'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']]] +[DUMP: grouped] [['id':'sampleB', 'interval':'chr2'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']]] +[DUMP: grouped] [['id':'sampleB', 'interval':'chr3'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']]] +[DUMP: grouped] [['id':'sampleC', 'interval':'chr1'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]] +[DUMP: grouped] [['id':'sampleC', 'interval':'chr2'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]] +[DUMP: grouped] [['id':'sampleC', 'interval':'chr3'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]] ``` -It's a little awkward to read, but you should see there are 3 groups of samples, one for each `id` field. `sampleA` has 2 normal and 2 tumor samples, `sampleB` has 1 normal and 1 tumor sample, and `sampleC` has 1 normal and 1 tumor sample. - -If you're having trouble visualizing it, you can use the `pretty` flag of `dump` to make it easier to read: +It's a little awkward to read! 
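Before we tidy the display, it's worth pausing on how `groupTuple` decided what belongs together: by default it groups on the first element of each tuple, and because Groovy maps compare by value, two `[id: ..., interval: ...]` maps with identical contents count as the same key. A minimal sketch with toy tuples (not our samplesheet data) makes this visible:

```groovy
workflow {
    Channel.of(
        [[id: 'sampleA', interval: 'chr1'], 'rep1'],
        [[id: 'sampleA', interval: 'chr1'], 'rep2'],
        [[id: 'sampleB', interval: 'chr1'], 'rep1']
    )
        .groupTuple() // groups on the first element of each tuple
        .view()
    // Emits:
    // [[id:sampleA, interval:chr1], [rep1, rep2]]
    // [[id:sampleB, interval:chr1], [rep1]]
}
```

Our grouped output above has exactly this shape, just with much larger tuples.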
If you're having trouble visualizing it, you can use the `pretty` flag of `dump` to make it easier to read:

_Before:_

```groovy title="main.nf" linenums="40"
        .dump(tag: 'grouped')
}
```

_After:_

```groovy title="main.nf" linenums="40"
        .dump(tag: 'grouped', pretty: true)
}
```

Note, we only include the first sample to keep this concise!

```console title="View grouped samples"
 N E X T F L O W ~ version 24.10.5

Launching `main.nf` [dreamy_lichterman] DSL2 - revision: 953a5dd264

[DUMP: grouped] [
    {
        "id": "sampleA",
        "interval": "chr1"
    },
    [
        {
            "id": "sampleA",
            "repeat": "1",
            "type": "normal",
            "fastq1": "sampleA_rep1_normal_R1.fastq.gz",
            "fastq2": "sampleA_rep1_normal_R2.fastq.gz"
        },
        {
            "id": "sampleA",
            "repeat": "2",
            "type": "normal",
            "fastq1": "sampleA_rep2_normal_R1.fastq.gz",
            "fastq2": "sampleA_rep2_normal_R2.fastq.gz"
        }
    ],
    [
        {
            "id": "sampleA",
            "repeat": "1",
            "type": "tumor",
            "fastq1": "sampleA_rep1_tumor_R1.fastq.gz",
            "fastq2": "sampleA_rep1_tumor_R2.fastq.gz"
        },
        {
            "id": "sampleA",
            "repeat": "2",
            "type": "tumor",
            "fastq1": "sampleA_rep2_tumor_R1.fastq.gz",
            "fastq2": "sampleA_rep2_tumor_R2.fastq.gz"
        }
    ]
]
...
```

Note our data has changed structure. What was previously a list of tuples is now a list of lists of tuples. This is because when we use `groupTuple`, Nextflow creates a new list for each group. This is important to remember when trying to handle the data downstream.

It's possible to use a simpler data structure than this, by separating out the sample information from the sequencing data. We generally refer to this as a `metamap`, but this will be covered in a later side quest. For now, you should just understand that we can group up samples using the `groupTuple` operator and that the data structure will change as a result.

!!! note
[`transpose`](https://www.nextflow.io/docs/latest/reference/operator.html#transpose) is the opposite of groupTuple. It unpacks the items in a channel and flattens them. Try adding `transpose` to undo the grouping we performed above!

### 5.2. Reduce duplication of data

We have a lot of duplicated data in our workflow. Each item in a grouped sample repeats the `id` and `interval` fields. Since this information is available in the metamap, let's just save it once. As a reminder, our data is structured like this:

```groovy
[
    {
        "id": "sampleC",
        "interval": "chr3"
    },
    [
        {
            "id": "sampleC",
            "repeat": "1",
            "type": "normal",
            "fastq1": "sampleC_rep1_normal_R1.fastq.gz",
            "fastq2": "sampleC_rep1_normal_R2.fastq.gz"
        }
    ],
    [
        {
            "id": "sampleC",
            "repeat": "1",
            "type": "tumor",
            "fastq1": "sampleC_rep1_tumor_R1.fastq.gz",
            "fastq2": "sampleC_rep1_tumor_R2.fastq.gz"
        }
    ]
]
```

We could parse the data after grouping to remove the duplication, but this requires us to handle all of the outputs. Instead, we can parse the data before grouping, which means the duplicated fields are never included in the first place.

In the same `map` operator where we isolate the `id` and `interval` fields, we can also grab just the `fastq1` and `fastq2` fields for our sample data, leaving out the duplicated `id` and `interval` fields.

_Before:_

```groovy title="main.nf" linenums="30"
    grouped_samples = combined_samples.map { grouping_key, normal, tumor ->
        [
            grouping_key.subMap('id', 'interval'),
            normal,
            tumor
        ]

        }
        .groupTuple()
        .dump(tag: 'grouped', pretty: true)
}
```

_After:_

```groovy title="main.nf" linenums="30"
    grouped_samples = combined_samples.map { grouping_key, normal, tumor ->
        [
            grouping_key.subMap('id', 'interval'),
            normal.subMap("fastq1", "fastq2"),
            tumor.subMap("fastq1", "fastq2")
        ]

        }
        .groupTuple()
        .dump(tag: 'grouped', pretty: true)
}
```

```bash title="View grouped samples"
nextflow run main.nf -dump-channels grouped
```

```console title="View grouped samples"
 N E X T F L O W ~ version 24.10.5

Launching `main.nf` [modest_stallman] DSL2 - revision: 5be827a6e8

[DUMP: grouped] [
    {
        "id": "sampleA",
        "interval": "chr1"
    },
    [
        {
            "fastq1": "sampleA_rep1_normal_R1.fastq.gz",
            "fastq2": "sampleA_rep1_normal_R2.fastq.gz"
        },
        {
            "fastq1": "sampleA_rep2_normal_R1.fastq.gz",
            "fastq2": "sampleA_rep2_normal_R2.fastq.gz"
        }
    ],
    [
        {
            "fastq1": "sampleA_rep1_tumor_R1.fastq.gz",
            "fastq2": "sampleA_rep1_tumor_R2.fastq.gz"
        },
        {
            "fastq1": "sampleA_rep2_tumor_R1.fastq.gz",
            "fastq2": "sampleA_rep2_tumor_R2.fastq.gz"
        }
    ]
]
...
```

Now we have a much cleaner output. We can see that the `id` and `interval` fields are only included once, and the `fastq1` and `fastq2` fields are included in the sample data.

### Takeaway

In this section, you've learned:

- **Grouping samples**: How to use `groupTuple` to group samples by a specific attribute

You now have a workflow that can split a samplesheet, filter the normal and tumor samples, join them together by sample ID and replicate number, then group them by `id` and `interval`.

-## 5. Spread samples over intervals
-
-!!! AUTHORS NOTE !!! SWAP SPREADING AND GROUPTUPLE
-
-Spreading samples over different conditions is a common pattern in bioinformatics workflows. For example, it is used to spread variant calling over a range of intervals. This can help distribute work across multiple cores or nodes, making pipelines more efficient and faster to turn around.
-
-In the next section, we will demonstrate how to take our existing samples and repeat each one for every interval. In this way, we will have a single sample for each input interval.
We will also multiply our number of samples by the number of intervals, so get ready for a busy terminal!

## Summary

-### 5.1. Spread samples over intervals using `combine`

In this side quest, you've learned how to split and group data using channels. By modifying the data as it flows through the pipeline, you can construct a pipeline that handles as many samples as you need with no loops or while statements. It gracefully scales to large numbers of samples. Here's what we achieved:

1. **Read in samplesheet with splitCsv**

- We read `samplesheet.csv` into a channel of maps, one per row, keyed by the CSV header fields
- We inspected the channel with `view`, then with `dump` (which is prettier!)

2. **Use filter (and/or map) to manipulate into 2 separate channels**

- We used a named closure with `map` to avoid repeating ourselves
- We showed that the same elements can end up in two channels by filtering twice

3. **Join on ID**

- We extracted a grouping key with `map`, then matched the normal and tumor samples with `join`

4. **Use groupTuple to group up samples by ID**

- We isolated the grouping fields with `subMap` so that related samples could be collected together

5. **Combine by intervals**

- We used `combine` to repeat every sample for every interval, the Cartesian product of the two channels

6. **Group after intervals**

- We used `groupTuple` to gather replicates that share an `id` and `interval`, trimming duplicated fields before grouping

This approach offers several advantages over writing a pipeline as more standard code, such as using for and while loops:

- We can scale to as many or as few samples as we want with no additional code
- We focus on handling the flow of data through the pipeline, instead of iterating over samples
- We can be as complex or simple as required
- The pipeline becomes more declarative, focusing on what should happen rather than how it should happen
- Nextflow will optimize execution for us by running independent operations in parallel

By mastering these channel operations, you can build flexible, scalable pipelines that handle complex data relationships without resorting to loops or iterative programming. This declarative approach allows Nextflow to optimize execution and parallelize independent operations automatically.
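To tie the whole pattern together, here is a compact end-to-end sketch of the workflow this side quest builds up. It assumes the same `data/samplesheet.csv` as above and uses the same operators, but skips the intermediate `dump` calls and the final field trimming for brevity:

```groovy
workflow {
    ch_samplesheet = Channel.fromPath('./data/samplesheet.csv')
        .splitCsv(header: true)

    // Split one channel into two by filtering twice
    ch_normal = ch_samplesheet
        .filter { sample -> sample.type == 'normal' }
        .map { sample -> [sample.subMap(['id', 'repeat']), sample] }
    ch_tumor = ch_samplesheet
        .filter { sample -> sample.type == 'tumor' }
        .map { sample -> [sample.subMap(['id', 'repeat']), sample] }

    ch_intervals = Channel.of('chr1', 'chr2', 'chr3')

    ch_normal
        .join(ch_tumor)                    // pair normal and tumor by [id, repeat]
        .combine(ch_intervals)             // spread every pair over every interval
        .map { key, normal, tumor, interval ->
            [key.subMap('id') + [interval: interval], normal, tumor]
        }
        .groupTuple()                      // gather replicates per id and interval
        .view()
}
```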
-Launching `main.nf` [dreamy_carlsson] DSL2 - revision: 0abb4c9e41 +### Key Concepts -[DUMP: combined] [['id':'sampleA'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']], 'chr1'] -[DUMP: combined] [['id':'sampleA'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']], 'chr2'] -[DUMP: combined] [['id':'sampleA'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']], 'chr3'] -[DUMP: combined] [['id':'sampleB'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']], 'chr1'] -[DUMP: combined] [['id':'sampleB'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']], 'chr2'] -[DUMP: combined] [['id':'sampleB'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']], 'chr3'] -[DUMP: combined] [['id':'sampleC'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']], 'chr1'] -[DUMP: combined] [['id':'sampleC'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']], 'chr2'] -[DUMP: combined] [['id':'sampleC'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 
'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']], 'chr3'] -``` +1. **Reading Samplesheets** -Success! We have repeated every sample for every single interval in our 3 interval list. We've effectively tripled the number of items in our channel. + ```nextflow + // Read CSV with header + Channel.fromPath('samplesheet.csv') + .splitCsv(header: true) + ``` -### Takeaway +2. **Filtering** -In this section, you've learned: + ```nextflow + // Filter channel based on condition + channel.filter { it.type == 'tumor' } + ``` -- **Spreading samples over intervals**: How to use `combine` to repeat samples over intervals +3. **Joining Channels** -## Summary + ```nextflow + // Join two channels by key + tumor_ch.join(normal_ch) -You've now seen how to split a samplesheet, filter the normal and tumor samples, join them together by sample ID and replicate number, then group them by `id`. You've also seen how to spread samples over intervals using the `combine` operator. + // Extract a key and join by this value + tumor_ch.map { [it.patient_id, it] } + .join( + normal_ch.map { [it.patient_id, it] } + ) + ``` -## Contents +4. **Grouping Data** -1. Read in samplesheet with splitCsv + ```nextflow + // Group by the first element in each tuple + channel.groupTuple() + ``` -- Samplesheet details here -- Show with view, then show with dump (is prettier!) +5. **Combining Channels** -2. Use filter (and/or map) to manipulate into 2 separate channels + ```nextflow + // Combine with Cartesian product + samples_ch.combine(intervals_ch) + ``` -- Use named closure in map here? -- Show that elements can be in two channels by filtering twice +## Resources -3. Join on ID -4. Use groupTuple to group up samples by ID -5. Combine by intervals -6. 
Group after intervals +- [filter](https://www.nextflow.io/docs/latest/operator.html#filter) +- [map](https://www.nextflow.io/docs/latest/operator.html#map) +- [join](https://www.nextflow.io/docs/latest/operator.html#join) +- [groupTuple](https://www.nextflow.io/docs/latest/operator.html#grouptuple) +- [combine](https://www.nextflow.io/docs/latest/operator.html#combine) diff --git a/side-quests/splitting_and_grouping/main.nf b/side-quests/splitting_and_grouping/main.nf index def5228d58..d8ebe7139b 100644 --- a/side-quests/splitting_and_grouping/main.nf +++ b/side-quests/splitting_and_grouping/main.nf @@ -1,31 +1,3 @@ workflow { - getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - normal_samples = samplesheet - .filter { sample -> sample.type == 'normal' } - .map ( getSampleIdAndReplicate ) - .dump(tag: 'normal') - tumor_samples = samplesheet - .filter { sample -> sample.type == "tumor" } - .map ( getSampleIdAndReplicate ) - .dump(tag: 'tumor') - joined_samples = normal_samples - .join(tumor_samples) - .dump(tag: 'joined', pretty: true) - grouped_samples = joined_samples.map { samples, normal, tumor -> - [ - samples.subMap('id'), - normal, - tumor - ] - } - .groupTuple() - .dump(tag: 'grouped', pretty: true) - - intervals = Channel.of('chr1', 'chr2', 'chr3') - .dump(tag: "intervals") - - grouped_samples.combine(intervals) - .dump(tag: 'combined') } From b7717ac2b99a4fb7f16ec25e7aa17b10977c458f Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Tue, 8 Apr 2025 16:24:37 +0100 Subject: [PATCH 03/36] add splitting and grouping to nav bar --- mkdocs.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/mkdocs.yml b/mkdocs.yml index ea9e4bf608..4fec7d0fae 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -41,6 +41,7 @@ nav: - side_quests/nf-core.md - side_quests/nf-test.md - side_quests/workflows_of_workflows.md + - side_quests/splitting-and-grouping.md - Fundamentals Training: - basic_training/index.md - basic_training/orientation.md From 7f76a634bef3dc4f8bef80ed1b58c7a2bb8d2e1f Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Wed, 9 Apr 2025 15:03:47 +0100 Subject: [PATCH 04/36] Add before/after to first code change --- docs/side_quests/splitting-and-grouping.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 7a38a63668..124b1aa84c 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -77,6 +77,8 @@ workflow { We can use the [`splitCsv` operator](https://www.nextflow.io/docs/latest/operator.html#splitcsv) to split the samplesheet into a channel of maps, where each map represents a row from the CSV file. +_Before:_ + ```groovy title="main.nf" linenums="1" workflow { samplesheet = Channel.fromPath("./data/samplesheet.csv") @@ -86,6 +88,8 @@ workflow { The `header: true` option tells Nextflow to use the first row of the CSV file as the header row, which will be used as keys for the values. Let's see what Nextflow can see after reading with splitCsv. To do this, we can use the `view` operator. 
+_After:_ + ```groovy title="main.nf" linenums="1" workflow { samplesheet = Channel.fromPath("./data/samplesheet.csv") From 48cdd25a7ec83afe4bcaba6909bd74586554bcb3 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Wed, 9 Apr 2025 15:11:22 +0100 Subject: [PATCH 05/36] Add ch_* prefix to all channels for clarity over objects --- docs/side_quests/splitting-and-grouping.md | 151 +++++++++++---------- side-quests/splitting_and_grouping/main.nf | 2 +- 2 files changed, 78 insertions(+), 75 deletions(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 124b1aa84c..4500405ba0 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -71,17 +71,20 @@ Let's start by reading in the samplesheet with `splitCsv`. In the main workflow ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") } ``` +!!! note +Throughout this tutorial, we'll use the `ch_` prefix for all channel variables to clearly indicate they are Nextflow channels. + We can use the [`splitCsv` operator](https://www.nextflow.io/docs/latest/operator.html#splitcsv) to split the samplesheet into a channel of maps, where each map represents a row from the CSV file. _Before:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) } ``` @@ -92,7 +95,7 @@ _After:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) .view() } @@ -137,7 +140,7 @@ For a prettier output format, we can use the [`dump` operator](https://www.nextf ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) .dump(tag: 'samples', pretty: true) } @@ -251,7 +254,7 @@ _Before:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) .dump(tag: 'samples', pretty: true) } @@ -261,7 +264,7 @@ _After:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) .filter { sample -> sample.type == 'normal' } .dump(tag: 'samples') @@ -302,7 +305,7 @@ _Before:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) .filter { sample -> sample.type == 'normal' } .dump(tag: 'samples') @@ -313,9 +316,9 @@ _After:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) - normal_samples = samplesheet + ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .view() } @@ -344,9 +347,9 @@ _Before:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = 
Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) - normal_samples = samplesheet + ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .view() } @@ -356,12 +359,12 @@ _After:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) - normal_samples = samplesheet + ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .view() - tumor_samples = samplesheet + ch_tumor_samples = ch_samplesheet .filter { sample -> sample.type == 'tumor' } .view() } @@ -392,12 +395,12 @@ _Before:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) - normal_samples = samplesheet + ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .view() - tumor_samples = samplesheet + ch_tumor_samples = ch_samplesheet .filter { sample -> sample.type == 'tumor' } .view() } @@ -407,12 +410,12 @@ _After:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) - normal_samples = samplesheet + ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .dump(tag: 'normal') - tumor_samples = samplesheet + ch_tumor_samples = ch_samplesheet .filter { sample -> sample.type == "tumor" } .dump(tag: 'tumor') } @@ -488,12 +491,12 @@ _Before:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) - normal_samples = samplesheet + ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .dump(tag: 'normal') - tumor_samples = samplesheet + ch_tumor_samples = ch_samplesheet .filter { sample -> sample.type == "tumor" } .dump(tag: 'tumor') } @@ -503,13 +506,13 @@ _After:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) - normal_samples = samplesheet + ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .map { sample -> [sample.id, sample] } .dump(tag: 'normal') - tumor_samples = samplesheet + ch_tumor_samples = ch_samplesheet .filter { sample -> sample.type == "tumor" } .map { sample -> [sample.id, sample] } .dump(tag: 'tumor') @@ -545,11 +548,11 @@ _Before:_ workflow { samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) - normal_samples = samplesheet + ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .map { sample -> [sample.id, sample] } .dump(tag: 'normal') - tumor_samples = samplesheet + ch_tumor_samples = ch_samplesheet .filter { sample -> sample.type == "tumor" } .map { sample -> [sample.id, sample] } .dump(tag: 'tumor') @@ -562,16 +565,16 @@ _After:_ workflow { samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) - normal_samples = samplesheet + ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .map { sample -> [sample.id, sample] } .dump(tag: 'normal') - tumor_samples = samplesheet + ch_tumor_samples = ch_samplesheet .filter { sample -> sample.type == 
"tumor" } .map { sample -> [sample.id, sample] } .dump(tag: 'tumor') - joined_samples = normal_samples - .join(tumor_samples) + joined_samples = ch_normal_samples + .join(ch_tumor_samples) .dump(tag: 'joined') } ``` @@ -603,18 +606,18 @@ _After:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) - normal_samples = samplesheet + ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .map { sample -> [sample.id, sample] } .dump(tag: 'normal') - tumor_samples = samplesheet + ch_tumor_samples = ch_samplesheet .filter { sample -> sample.type == "tumor" } .map { sample -> [sample.id, sample] } .dump(tag: 'tumor') - joined_samples = normal_samples - .join(tumor_samples) + joined_samples = ch_normal_samples + .join(ch_tumor_samples) .dump(tag: 'joined', pretty: true) } ``` @@ -722,18 +725,18 @@ _Before:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) - normal_samples = samplesheet + ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .map { sample -> [sample.id, sample] } .dump(tag: 'normal') - tumor_samples = samplesheet + ch_tumor_samples = ch_samplesheet .filter { sample -> sample.type == "tumor" } .map { sample -> [sample.id, sample] } .dump(tag: 'tumor') - joined_samples = normal_samples - .join(tumor_samples) + ch_joined_samples = ch_normal_samples + .join(ch_tumor_samples) .dump(tag: 'joined', pretty: true) } ``` @@ -742,9 +745,9 @@ _After:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) - normal_samples = samplesheet + ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .map { sample -> [ [sample.id, sample.repeat], @@ -752,7 +755,7 @@ workflow { ] } .dump(tag: 'normal') - tumor_samples = samplesheet + ch_tumor_samples = ch_samplesheet .filter { sample -> sample.type == "tumor" } .map { sample -> [ [sample.id, sample.repeat], @@ -760,8 +763,8 @@ workflow { ] } .dump(tag: 'tumor') - joined_samples = normal_samples - .join(tumor_samples) + ch_joined_samples = ch_normal_samples + .join(ch_tumor_samples) .dump(tag: 'joined', pretty: true) } ``` @@ -871,9 +874,9 @@ _Before:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) - normal_samples = samplesheet + ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .map { sample -> [ [sample.id, sample.repeat], @@ -881,7 +884,7 @@ workflow { ] } .dump(tag: 'normal') - tumor_samples = samplesheet + ch_tumor_samples = ch_samplesheet .filter { sample -> sample.type == "tumor" } .map { sample -> [ [sample.id, sample.repeat], @@ -889,8 +892,8 @@ workflow { ] } .dump(tag: 'tumor') - joined_samples = normal_samples - .join(tumor_samples) + ch_joined_samples = ch_normal_samples + .join(ch_tumor_samples) .dump(tag: 'joined', pretty: true) } ``` @@ -899,9 +902,9 @@ _After:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) - 
normal_samples = samplesheet + ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .map { sample -> [ sample.subMap(['id', 'repeat']), @@ -909,7 +912,7 @@ workflow { ] } .dump(tag: 'normal') - tumor_samples = samplesheet + ch_tumor_samples = ch_samplesheet .filter { sample -> sample.type == "tumor" } .map { sample -> [ sample.subMap(['id', 'repeat']), @@ -917,8 +920,8 @@ workflow { ] } .dump(tag: 'tumor') - joined_samples = normal_samples - .join(tumor_samples) + ch_joined_samples = ch_normal_samples + .join(ch_tumor_samples) .dump(tag: 'joined', pretty: true) } ``` @@ -1044,7 +1047,7 @@ We have taken the map we used previously and defined it as a named variable we c _Before:_ ```groovy title="main.nf" linenums="5" - normal_samples = samplesheet + ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .map { sample -> [ sample.subMap(['id', 'repeat']), @@ -1052,7 +1055,7 @@ _Before:_ ] } .dump(tag: 'normal') - tumor_samples = samplesheet + ch_tumor_samples = ch_samplesheet .filter { sample -> sample.type == "tumor" } .map { sample -> [ sample.subMap(['id', 'repeat']), @@ -1065,11 +1068,11 @@ _Before:_ _After:_ ```groovy title="main.nf" linenums="5" - normal_samples = samplesheet + ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .map ( getSampleIdAndReplicate ) .dump(tag: 'normal') - tumor_samples = samplesheet + ch_tumor_samples = ch_samplesheet .filter { sample -> sample.type == "tumor" } .map ( getSampleIdAndReplicate ) .dump(tag: 'tumor') @@ -1205,7 +1208,7 @@ _After:_ ```groovy title="main.nf" linenums="24" .dump(tag: 'joined', pretty: true) - intervals = Channel.of('chr1', 'chr2', 'chr3') + ch_intervals = Channel.of('chr1', 'chr2', 'chr3') .dump(tag: "intervals") } ``` @@ -1215,7 +1218,7 @@ Now remember, we want to repeat each sample for each interval. 
This is sometimes _Before:_ ```groovy title="main.nf" linenums="26" - intervals = Channel.of('chr1', 'chr2', 'chr3') + ch_intervals = Channel.of('chr1', 'chr2', 'chr3') .dump(tag: "intervals") } ``` @@ -1223,10 +1226,10 @@ _Before:_ _After:_ ```groovy title="main.nf" linenums="26" - intervals = Channel.of('chr1', 'chr2', 'chr3') + ch_intervals = Channel.of('chr1', 'chr2', 'chr3') .dump(tag: "intervals") - combined_samples = joined_samples.combine(intervals) + ch_combined_samples = ch_joined_samples.combine(ch_intervals) .dump(tag: 'combined') } ``` @@ -1265,7 +1268,7 @@ We can use the `map` operator to tidy and refactor our sample data so it's easie _Before:_ ```groovy title="main.nf" linenums="19" - combined_samples = joined_samples.combine(intervals) + ch_combined_samples = joined_samples.combine(ch_intervals) .dump(tag: 'combined') } ``` @@ -1273,7 +1276,7 @@ _Before:_ _After:_ ```groovy title="main.nf" linenums="19" - combined_samples = joined_samples.combine(intervals) + ch_combined_samples = ch_joined_samples.combine(ch_intervals) .map { grouping_key, normal, tumor, interval -> [ grouping_key + [interval: interval], @@ -1370,7 +1373,7 @@ We can reuse the `subMap` method from before to isolate our `id` and `interval` _Before:_ ```groovy title="main.nf" linenums="19" - combined_samples = joined_samples.combine(intervals) + ch_combined_samples = ch_joined_samples.combine(ch_intervals) .map { grouping_key, normal, tumor, interval -> [ grouping_key + [interval: interval], @@ -1386,7 +1389,7 @@ _Before:_ _After:_ ```groovy title="main.nf" linenums="19" - combined_samples = joined_samples.combine(intervals) + ch_combined_samples = ch_joined_samples.combine(ch_intervals) .map { grouping_key, normal, tumor, interval -> [ grouping_key + [interval: interval], @@ -1397,7 +1400,7 @@ _After:_ } .dump(tag: 'combined') - grouped_samples = combined_samples.map { grouping_key, normal, tumor -> + ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> [ grouping_key.subMap('id', 'interval'), normal, @@ -1441,7 +1444,7 @@ Let's now group the samples by this new grouping element, using the [`groupTuple _Before:_ ```groovy title="main.nf" linenums="30" - grouped_samples = combined_samples.map { grouping_key, normal, tumor -> + ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> [ grouping_key.subMap('id', 'interval'), grouping_key, @@ -1457,7 +1460,7 @@ _Before:_ _After:_ ```groovy title="main.nf" linenums="29" - grouped_samples = combined_samples.map { grouping_key, normal, tumor -> + ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> [ grouping_key.subMap('id', 'interval'), normal, @@ -1601,7 +1604,7 @@ In the same `map` operator where we isolate the `id` and `interval` fields, we c _Before:_ ```groovy title="main.nf" linenums="30" - grouped_samples = combined_samples.map { grouping_key, normal, tumor -> + ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> [ grouping_key.subMap('id', 'interval'), normal, @@ -1617,7 +1620,7 @@ _Before:_ _After:_ ```groovy title="main.nf" linenums="30" - grouped_samples = combined_samples.map { grouping_key, normal, tumor -> + ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> [ grouping_key.subMap('id', 'interval'), normal.subMap("fastq1", "fastq2"), diff --git a/side-quests/splitting_and_grouping/main.nf b/side-quests/splitting_and_grouping/main.nf index d8ebe7139b..77cc26224d 100644 --- a/side-quests/splitting_and_grouping/main.nf +++ 
b/side-quests/splitting_and_grouping/main.nf @@ -1,3 +1,3 @@ workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") } From a9324c22ca309d86d19d270e92dd9cf4a2044b5a Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Wed, 9 Apr 2025 15:12:10 +0100 Subject: [PATCH 06/36] Remove "pipeline" from opening statement to focus on channels --- docs/side_quests/splitting-and-grouping.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 4500405ba0..8976bfe1a4 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -4,7 +4,7 @@ Nextflow helps you work with your data in flexible ways. One of the most useful Think of it like sorting mail: you might first separate letters by their destination, process each pile differently, and then recombine items going to the same person. In Nextflow, we use special operators to do this with our scientific data. -Nextflow's channel system is at the heart of this flexibility. Channels act as pipelines that connect different parts of your workflow, allowing data to flow through your analysis. You can create multiple channels from a single data source, process each channel differently, and then merge channels back together when needed. This approach lets you design workflows that naturally mirror the branching and converging paths of complex bioinformatics analyses. +Nextflow's channel system is at the heart of this flexibility. Channels connect different parts of your workflow, allowing data to flow through your analysis. You can create multiple channels from a single data source, process each channel differently, and then merge channels back together when needed. This approach lets you design workflows that naturally mirror the branching and converging paths of complex bioinformatics analyses. In this side quest, we'll explore how to split and group data using Nextflow's powerful channel operators. We'll start with a samplesheet containing information about different samples and their associated data. By the end of this side quest, you'll be able to manipulate and combine data streams effectively, making your workflows more efficient and easier to understand. From 8e83189ed41d4e8cf152db840df886626f861b05 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Wed, 9 Apr 2025 15:12:43 +0100 Subject: [PATCH 07/36] Add "you will" statement to the beginning of the tutorial --- docs/side_quests/splitting-and-grouping.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 8976bfe1a4..6ba0ac7da2 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -8,6 +8,8 @@ Nextflow's channel system is at the heart of this flexibility. Channels connect In this side quest, we'll explore how to split and group data using Nextflow's powerful channel operators. We'll start with a samplesheet containing information about different samples and their associated data. By the end of this side quest, you'll be able to manipulate and combine data streams effectively, making your workflows more efficient and easier to understand. 
+You will: + - Read data from files using `splitCsv` - Filter and transform data with `filter` and `map` - Combine related data using `join` and `groupTuple` From 1bc2dcbba415af525da4b10af19cdc53c2468131 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Wed, 9 Apr 2025 15:13:36 +0100 Subject: [PATCH 08/36] Fix typo and phrasing about combining --- docs/side_quests/splitting-and-grouping.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 6ba0ac7da2..41fa396a4f 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -460,7 +460,7 @@ We've now separated out the normal and tumor samples into two different channels In the previous section, we separated out the normal and tumor samples into two different channels. These could be processed independently using specific processes or workflows based on their type. But what happens when we want to compare the normal and tumor samples from the same patient? At this point, we need to join them back together making sure to match the samples based on their `id` field. -Nextflow includes many methods for combing channels, but in this case the most appropriate operator is [`join`](https://www.nextflow.io/docs/latest/operator.html#join). This acts like a SQL `JOIN` operation, where we specify the key to join on and the type of join to perform. +Nextflow includes many methods for combining channels, but in this case the most appropriate operator is [`join`](https://www.nextflow.io/docs/latest/operator.html#join). This acts like a SQL `JOIN` operation, where we specify the key to join on and the type of join to perform. ### 3.1. Use `map` and `join` to combine based on sample ID From 2b906d93ba2a5bf7571b440d1ab51342470f6585 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Wed, 9 Apr 2025 15:18:21 +0100 Subject: [PATCH 09/36] Refine explanation of intervals and the combining method --- docs/side_quests/splitting-and-grouping.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 41fa396a4f..c5cc34a75c 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -1191,9 +1191,9 @@ This is a common pattern in bioinformatics workflows where you need to match up ## 4. Spread samples over intervals -Spreading samples over different conditions is a common pattern in bioinformatics workflows. For example, it is used to spread variant calling over a range of intervals. This can help distribute work across multiple cores or nodes and make the pipelines more efficient and be turned around faster. +A key pattern in bioinformatics workflows is distributing analysis across genomic regions. For instance, variant calling can be parallelized by dividing the genome into intervals (like chromosomes or smaller regions). This parallelization strategy significantly improves pipeline efficiency by distributing computational load across multiple cores or nodes, reducing overall execution time. -In the next section, we will demonstrate how to take our existing samples and repeat each one for every interval. In this way, we will have a single sample for each input interval. 
We will also multiply our number of samples by the number of intervals, so get ready for a busy terminal! +In the following section, we'll demonstrate how to distribute our sample data across multiple genomic intervals. We'll pair each sample with every interval, allowing parallel processing of different genomic regions. This will multiply our dataset size by the number of intervals, creating multiple independent analysis units that can be brought back together later. ### 4.1. Spread samples over intervals using `combine` From 25c33ab8b880f53dafe9eb80b33353797b2f6164 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Wed, 9 Apr 2025 15:33:22 +0100 Subject: [PATCH 10/36] add another before/after snippet --- docs/side_quests/splitting-and-grouping.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index c5cc34a75c..c9a76a8c8a 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -140,6 +140,18 @@ This means we have successfully read in the samplesheet and have access to the d For a prettier output format, we can use the [`dump` operator](https://www.nextflow.io/docs/latest/operator.html#dump) instead of `view`: +_Before:_ + +```groovy title="main.nf" linenums="1" +workflow { + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .view() +} +``` + +_After:_ + ```groovy title="main.nf" linenums="1" workflow { ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") From 60f3e90e613a7005a12b45d6495d0096088d0ba8 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Wed, 9 Apr 2025 18:49:58 +0100 Subject: [PATCH 11/36] Use single BAM as input file --- docs/side_quests/splitting-and-grouping.md | 598 ++++++++---------- .../data/samplesheet.csv | 18 +- 2 files changed, 280 insertions(+), 336 deletions(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index c9a76a8c8a..e4f8c0504b 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -45,18 +45,18 @@ You'll find a `data` directory containing a samplesheet and a main workflow file └── main.nf ``` -The samplesheet contains information about different samples and their associated data. In particular, it contains information about the sample's ID, repeat number, type (normal or tumor), and the paths to the fastq files. +The samplesheet contains information about different samples and their associated data. In particular, it contains information about the sample's ID, repeat number, type (normal or tumor), and the paths to the BAM files (which don't actually exist, but we will pretend they do). 
```console title="samplesheet.csv"
id,repeat,type,bam
sampleA,1,normal,sampleA_r1_normal.bam
sampleA,1,tumor,sampleA_rep1_tumor.bam
sampleB,1,normal,sampleB_rep1_normal.bam
sampleB,1,tumor,sampleB_rep1_tumor.bam
sampleC,1,normal,sampleC_rep1_normal.bam
sampleC,1,tumor,sampleC_rep1_tumor.bam
sampleD,1,normal,sampleD_rep1_normal.bam
sampleD,1,tumor,sampleD_rep1_tumor.bam
```

Note there are 8 samples in total, 4 normal and 4 tumor: sampleA, sampleB, sampleC and sampleD each contribute one normal and one tumor sample from a single repeat.

```console title="Read samplesheet with splitCsv"
 N E X T F L O W ~ version 24.10.5

Launching `main.nf` [elated_fermat] DSL2 - revision: bd6b0224e9

[id:sampleA, repeat:1, type:normal, bam:sampleA_r1_normal.bam]
[id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]
[id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam]
[id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]
[id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam]
[id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]
[id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam]
[id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]
```

We can see that each row from the CSV file has been converted into a map with keys matching the header row. A map is a key-value data structure similar to dictionaries in Python, objects in JavaScript, or hashes in Ruby.

Each map contains:

- `id`: The sample identifier (sampleA, sampleB, sampleC or sampleD)
- `repeat`: The replicate number (always 1 in this samplesheet)
- `type`: The sample type (normal or tumor)
- `bam`: Path to the BAM file

This format makes it easy to access specific fields from each sample.
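To see why, it helps to remember that each row is just a plain Groovy map you can experiment with outside Nextflow. A minimal standalone sketch (the values are copied from the samplesheet above, purely for illustration):

```groovy
// A row as emitted by splitCsv(header: true): keys come from the CSV header
def row = [id: 'sampleA', repeat: '1', type: 'normal', bam: 'sampleA_r1_normal.bam']

// Walk the key/value pairs to inspect the record
row.each { key, value -> println "${key} -> ${value}" }
```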
For example, we could access the sample ID with `row.id` or the FASTQ paths with `row.fastq1` and `row.fastq2`. +This format makes it easy to access specific fields from each sample. For example, we could access the sample ID with `row.id` or the BAM path with `row.bam`. This means we have successfully read in the samplesheet and have access to the data in each row. We can start to implement this in our pipeline. @@ -179,63 +178,55 @@ nextflow run main.nf -dump-channels samples ```console title="Read samplesheet with dump" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [wise_kirch] DSL2 - revision: 7f194f2473 +Launching `main.nf` [cheesy_celsius] DSL2 - revision: 0e9d501bcc [DUMP: samples] { "id": "sampleA", "repeat": "1", "type": "normal", - "fastq1": "sampleA_rep1_normal_R1.fastq.gz", - "fastq2": "sampleA_rep1_normal_R2.fastq.gz" + "bam": "sampleA_r1_normal.bam" } [DUMP: samples] { "id": "sampleA", "repeat": "1", "type": "tumor", - "fastq1": "sampleA_rep1_tumor_R1.fastq.gz", - "fastq2": "sampleA_rep1_tumor_R2.fastq.gz" + "bam": "sampleA_rep1_tumor.bam" } [DUMP: samples] { - "id": "sampleA", - "repeat": "2", + "id": "sampleB", + "repeat": "1", "type": "normal", - "fastq1": "sampleA_rep2_normal_R1.fastq.gz", - "fastq2": "sampleA_rep2_normal_R2.fastq.gz" + "bam": "sampleB_rep1_normal.bam" } [DUMP: samples] { - "id": "sampleA", - "repeat": "2", + "id": "sampleB", + "repeat": "1", "type": "tumor", - "fastq1": "sampleA_rep2_tumor_R1.fastq.gz", - "fastq2": "sampleA_rep2_tumor_R2.fastq.gz" + "bam": "sampleB_rep1_tumor.bam" } [DUMP: samples] { - "id": "sampleB", + "id": "sampleC", "repeat": "1", "type": "normal", - "fastq1": "sampleB_rep1_normal_R1.fastq.gz", - "fastq2": "sampleB_rep1_normal_R2.fastq.gz" + "bam": "sampleC_rep1_normal.bam" } [DUMP: samples] { - "id": "sampleB", + "id": "sampleC", "repeat": "1", "type": "tumor", - "fastq1": "sampleB_rep1_tumor_R1.fastq.gz", - "fastq2": "sampleB_rep1_tumor_R2.fastq.gz" + "bam": "sampleC_rep1_tumor.bam" } [DUMP: samples] { - "id": "sampleC", + "id": "sampleD", "repeat": "1", "type": "normal", - "fastq1": "sampleC_rep1_normal_R1.fastq.gz", - "fastq2": "sampleC_rep1_normal_R2.fastq.gz" + "bam": "sampleD_rep1_normal.bam" } [DUMP: samples] { - "id": "sampleC", + "id": "sampleD", "repeat": "1", "type": "tumor", - "fastq1": "sampleC_rep1_tumor_R1.fastq.gz", - "fastq2": "sampleC_rep1_tumor_R2.fastq.gz" + "bam": "sampleD_rep1_tumor.bam" } ``` @@ -262,7 +253,7 @@ We now have a channel of maps, each representing a row from the samplesheet. Nex ### 2.1. Filter data with `filter` -We can use the [`filter` operator](https://www.nextflow.io/docs/latest/operator.html#filter) to filter the data based on a condition. Let's say we only want to process normal samples. We can do this by filtering the data based on the `type` field. Let's insert this before the `dump` operator. +We can use the [`filter` operator](https://www.nextflow.io/docs/latest/operator.html#filter) to filter the data based on a condition. Let's say we only want to process normal samples. We can do this by filtering the data based on the `type` field. Let's insert this before the `view` operator. 
_Before:_ @@ -281,7 +272,7 @@ workflow { ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) .filter { sample -> sample.type == 'normal' } - .dump(tag: 'samples') + .view() } ``` @@ -295,12 +286,12 @@ nextflow run main.nf ```console title="View normal samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [stupefied_pike] DSL2 - revision: 8761d1b103 +Launching `main.nf` [adoring_cori] DSL2 - revision: 194d61704d -[DUMP: samples] ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'] -[DUMP: samples] ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'] -[DUMP: samples] ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'] -[DUMP: samples] ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'] +[id:sampleA, repeat:1, type:normal, bam:sampleA_r1_normal.bam] +[id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam] +[id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam] +[id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam] ``` We have successfully filtered the data to only include normal samples. Let's recap how this works. The `filter` operator takes a closure that is applied to each element in the channel. If the closure returns `true`, the element is included in the output channel. If the closure returns `false`, the element is excluded from the output channel. @@ -313,6 +304,8 @@ In this case, we want to keep only the samples where `sample.type == 'normal'`. ### 2.2. Save results of filter to a new channel +#TODO: Move this later after making the tumor only channel, put it at the end in one section! + While useful, we are discarding the tumor samples. Instead, let's rewrite our pipeline to save all the samples to one channel called `samplesheet`, then filter that channel to just the normal samples and save the results to a new channel called `normal_samples`. _Before:_ @@ -322,7 +315,7 @@ workflow { ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) .filter { sample -> sample.type == 'normal' } - .dump(tag: 'samples') + .view() } ``` @@ -347,15 +340,15 @@ nextflow run main.nf ```console title="View normal samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [lonely_miescher] DSL2 - revision: 7e26f19fd3 +Launching `main.nf` [astonishing_noether] DSL2 - revision: 8e49cf6956 -[id:sampleA, repeat:1, type:normal, fastq1:sampleA_rep1_normal_R1.fastq.gz, fastq2:sampleA_rep1_normal_R2.fastq.gz] -[id:sampleA, repeat:2, type:normal, fastq1:sampleA_rep2_normal_R1.fastq.gz, fastq2:sampleA_rep2_normal_R2.fastq.gz] -[id:sampleB, repeat:1, type:normal, fastq1:sampleB_rep1_normal_R1.fastq.gz, fastq2:sampleB_rep1_normal_R2.fastq.gz] -[id:sampleC, repeat:1, type:normal, fastq1:sampleC_rep1_normal_R1.fastq.gz, fastq2:sampleC_rep1_normal_R2.fastq.gz] +[id:sampleA, repeat:1, type:normal, bam:sampleA_r1_normal.bam] +[id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam] +[id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam] +[id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam] ``` -Success! We have filtered the data to only include normal samples. If we wanted, we still have access to the tumor samples within the `samplesheet` channel. 
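Under the hood, `filter` simply applies a boolean test to each element and keeps the ones that pass, which is the same idea as `findAll` on a plain Groovy list. A standalone sketch of the predicate (illustrative only, not workflow code):

```groovy
// The same closure we hand to filter, applied to ordinary row-like maps
def rows = [
    [id: 'sampleA', type: 'normal'],
    [id: 'sampleA', type: 'tumor'],
]
def normals = rows.findAll { sample -> sample.type == 'normal' }
assert normals*.id == ['sampleA']
```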
Since we managed it for the normal samples, let's do it for the tumor samples as well: +Success! We have filtered the data to only include normal samples. Note that we can use view and save the new channel. If we wanted, we still have access to the tumor samples within the `samplesheet` channel. Since we managed it for the normal samples, let's do it for the tumor samples as well: _Before:_ @@ -391,20 +384,22 @@ nextflow run main.nf ```console title="View tumor samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [focused_kirch] DSL2 - revision: 87d6672658 +Launching `main.nf` [gloomy_roentgen] DSL2 - revision: e6b3917a8e -[id:sampleA, repeat:1, type:normal, fastq1:sampleA_rep1_normal_R1.fastq.gz, fastq2:sampleA_rep1_normal_R2.fastq.gz] -[id:sampleA, repeat:1, type:tumor, fastq1:sampleA_rep1_tumor_R1.fastq.gz, fastq2:sampleA_rep1_tumor_R2.fastq.gz] -[id:sampleA, repeat:2, type:normal, fastq1:sampleA_rep2_normal_R1.fastq.gz, fastq2:sampleA_rep2_normal_R2.fastq.gz] -[id:sampleA, repeat:2, type:tumor, fastq1:sampleA_rep2_tumor_R1.fastq.gz, fastq2:sampleA_rep2_tumor_R2.fastq.gz] -[id:sampleB, repeat:1, type:normal, fastq1:sampleB_rep1_normal_R1.fastq.gz, fastq2:sampleB_rep1_normal_R2.fastq.gz] -[id:sampleB, repeat:1, type:tumor, fastq1:sampleB_rep1_tumor_R1.fastq.gz, fastq2:sampleB_rep1_tumor_R2.fastq.gz] -[id:sampleC, repeat:1, type:normal, fastq1:sampleC_rep1_normal_R1.fastq.gz, fastq2:sampleC_rep1_normal_R2.fastq.gz] -[id:sampleC, repeat:1, type:tumor, fastq1:sampleC_rep1_tumor_R1.fastq.gz, fastq2:sampleC_rep1_tumor_R2.fastq.gz] +[id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam] +[id:sampleA, repeat:1, type:normal, bam:sampleA_r1_normal.bam] +[id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam] +[id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam] +[id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam] +[id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam] +[id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam] +[id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam] ``` We've managed to separate out the normal and tumor samples into two different channels but they're mixed up when we `view` them in the console! Here's where dump could be useful, because it can label the different channels with a tag. 
+#TODO: remove this bit + _Before:_ ```groovy title="main.nf" linenums="1" @@ -442,16 +437,16 @@ nextflow run main.nf -dump-channels normal,tumor ```console title="View normal and tumor samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [spontaneous_jones] DSL2 - revision: 0e794240ef +Launching `main.nf` [sharp_carlsson] DSL2 - revision: 61e1be6afd -[DUMP: tumor] ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'] -[DUMP: normal] ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'] -[DUMP: tumor] ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz'] -[DUMP: normal] ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'] -[DUMP: tumor] ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz'] -[DUMP: normal] ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'] -[DUMP: tumor] ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz'] -[DUMP: normal] ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'] +[DUMP: tumor] ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam'] +[DUMP: normal] ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'] +[DUMP: tumor] ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam'] +[DUMP: normal] ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'] +[DUMP: tumor] ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam'] +[DUMP: normal] ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'] +[DUMP: tumor] ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam'] +[DUMP: normal] ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'] ``` Note how the `normal` and `tumor` tags are used to label the different channels. This is useful for debugging and for understanding the data flow in our pipeline. 
@@ -485,16 +480,16 @@ nextflow run main.nf -dump-channels normal,tumor ```console title="View normal and tumor samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [spontaneous_jones] DSL2 - revision: 0e794240ef +Launching `main.nf` [sharp_carlsson] DSL2 - revision: 61e1be6afd -[DUMP: tumor] ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'] -[DUMP: normal] ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'] -[DUMP: tumor] ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz'] -[DUMP: normal] ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'] -[DUMP: tumor] ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz'] -[DUMP: normal] ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'] -[DUMP: tumor] ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz'] -[DUMP: normal] ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'] +[DUMP: tumor] ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam'] +[DUMP: normal] ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'] +[DUMP: tumor] ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam'] +[DUMP: normal] ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'] +[DUMP: tumor] ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam'] +[DUMP: normal] ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'] +[DUMP: tumor] ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam'] +[DUMP: normal] ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'] ``` We can see that the `id` field is the first element in each map. For `join` to work, we should isolate the `id` field in each tuple. After that, we can simply use the `join` operator to combine the two channels. 
@@ -540,16 +535,16 @@ nextflow run main.nf -dump-channels normal,tumor ```console title="View normal and tumor samples with ID as element 0" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [sick_jones] DSL2 - revision: 9b183fbc7c +Launching `main.nf` [peaceful_morse] DSL2 - revision: 34daafdfb3 -[DUMP: tumor] ['sampleA', ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']] -[DUMP: normal] ['sampleA', ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz']] -[DUMP: tumor] ['sampleA', ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']] -[DUMP: normal] ['sampleA', ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']] -[DUMP: tumor] ['sampleB', ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']] -[DUMP: normal] ['sampleB', ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']] -[DUMP: tumor] ['sampleC', ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']] -[DUMP: normal] ['sampleC', ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']] +[DUMP: normal] ['sampleA', ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam']] +[DUMP: tumor] ['sampleA', ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']] +[DUMP: normal] ['sampleB', ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam']] +[DUMP: tumor] ['sampleB', ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']] +[DUMP: normal] ['sampleC', ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam']] +[DUMP: tumor] ['sampleC', ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']] +[DUMP: normal] ['sampleD', ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam']] +[DUMP: tumor] ['sampleD', ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']] ``` It might be subtle, but you should be able to see the first element in each tuple is the `id` field. Now we can use the `join` operator to combine the two channels based on the `id` field. 
@@ -577,7 +572,7 @@ _After:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } @@ -600,19 +595,19 @@ nextflow run main.nf -dump-channels joined ```console title="View joined normal and tumor samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [thirsty_poitras] DSL2 - revision: 95a2b8902b +Launching `main.nf` [hopeful_agnesi] DSL2 - revision: 78b21768c2 -[DUMP: joined] ['sampleA', ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']] -[DUMP: joined] ['sampleA', ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']] -[DUMP: joined] ['sampleB', ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']] -[DUMP: joined] ['sampleC', ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']] +[DUMP: joined] ['sampleA', ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']] +[DUMP: joined] ['sampleB', ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']] +[DUMP: joined] ['sampleC', ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']] +[DUMP: joined] ['sampleD', ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']] ``` It's a little hard to tell because it's so wide, but you should be able to see the samples have been joined by the `id` field. 
Each tuple now has the format: - `id`: The sample ID -- `normal_sample`: The normal sample including type, replicate and path to fastq files -- `tumor_sample`: The tumor sample including type, replicate and path to fastq files +- `normal_sample`: The normal sample including type, replicate and path to bam file +- `tumor_sample`: The tumor sample including type, replicate and path to bam file If you want you can use the `pretty` parameter of `dump` to make it easier to read: @@ -643,7 +638,7 @@ nextflow run main.nf -dump-channels joined ```console title="View normal and tumor samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [tender_feynman] DSL2 - revision: 3505c6a732 +Launching `main.nf` [desperate_einstein] DSL2 - revision: 2dce0e5352 [DUMP: joined] [ "sampleA", @@ -651,66 +646,58 @@ Launching `main.nf` [tender_feynman] DSL2 - revision: 3505c6a732 "id": "sampleA", "repeat": "1", "type": "normal", - "fastq1": "sampleA_rep1_normal_R1.fastq.gz", - "fastq2": "sampleA_rep1_normal_R2.fastq.gz" + "bam": "sampleA_r1_normal.bam" }, { "id": "sampleA", "repeat": "1", "type": "tumor", - "fastq1": "sampleA_rep1_tumor_R1.fastq.gz", - "fastq2": "sampleA_rep1_tumor_R2.fastq.gz" + "bam": "sampleA_rep1_tumor.bam" } ] [DUMP: joined] [ - "sampleA", + "sampleB", { - "id": "sampleA", - "repeat": "2", + "id": "sampleB", + "repeat": "1", "type": "normal", - "fastq1": "sampleA_rep2_normal_R1.fastq.gz", - "fastq2": "sampleA_rep2_normal_R2.fastq.gz" + "bam": "sampleB_rep1_normal.bam" }, { - "id": "sampleA", - "repeat": "2", + "id": "sampleB", + "repeat": "1", "type": "tumor", - "fastq1": "sampleA_rep2_tumor_R1.fastq.gz", - "fastq2": "sampleA_rep2_tumor_R2.fastq.gz" + "bam": "sampleB_rep1_tumor.bam" } ] [DUMP: joined] [ - "sampleB", + "sampleC", { - "id": "sampleB", + "id": "sampleC", "repeat": "1", "type": "normal", - "fastq1": "sampleB_rep1_normal_R1.fastq.gz", - "fastq2": "sampleB_rep1_normal_R2.fastq.gz" + "bam": "sampleC_rep1_normal.bam" }, { - "id": "sampleB", + "id": "sampleC", "repeat": "1", "type": "tumor", - "fastq1": "sampleB_rep1_tumor_R1.fastq.gz", - "fastq2": "sampleB_rep1_tumor_R2.fastq.gz" + "bam": "sampleC_rep1_tumor.bam" } ] [DUMP: joined] [ - "sampleC", + "sampleD", { - "id": "sampleC", + "id": "sampleD", "repeat": "1", "type": "normal", - "fastq1": "sampleC_rep1_normal_R1.fastq.gz", - "fastq2": "sampleC_rep1_normal_R2.fastq.gz" + "bam": "sampleD_rep1_normal.bam" }, { - "id": "sampleC", + "id": "sampleD", "repeat": "1", "type": "tumor", - "fastq1": "sampleC_rep1_tumor_R1.fastq.gz", - "fastq2": "sampleC_rep1_tumor_R2.fastq.gz" + "bam": "sampleD_rep1_tumor.bam" } ] ``` @@ -749,7 +736,7 @@ workflow { .filter { sample -> sample.type == "tumor" } .map { sample -> [sample.id, sample] } .dump(tag: 'tumor') - ch_joined_samples = ch_normal_samples + joined_samples = ch_normal_samples .join(ch_tumor_samples) .dump(tag: 'joined', pretty: true) } @@ -792,7 +779,7 @@ nextflow run main.nf -dump-channels joined ```console title="View normal and tumor samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [cranky_lorenz] DSL2 - revision: 2be25de1df +Launching `main.nf` [infallible_torricelli] DSL2 - revision: c02356ebe1 [DUMP: joined] [ [ @@ -803,75 +790,67 @@ Launching `main.nf` [cranky_lorenz] DSL2 - revision: 2be25de1df "id": "sampleA", "repeat": "1", "type": "normal", - "fastq1": "sampleA_rep1_normal_R1.fastq.gz", - "fastq2": "sampleA_rep1_normal_R2.fastq.gz" + "bam": "sampleA_r1_normal.bam" }, { "id": "sampleA", "repeat": "1", "type": "tumor", - "fastq1": 
"sampleA_rep1_tumor_R1.fastq.gz", - "fastq2": "sampleA_rep1_tumor_R2.fastq.gz" + "bam": "sampleA_rep1_tumor.bam" } ] [DUMP: joined] [ [ - "sampleA", - "2" + "sampleB", + "1" ], { - "id": "sampleA", - "repeat": "2", + "id": "sampleB", + "repeat": "1", "type": "normal", - "fastq1": "sampleA_rep2_normal_R1.fastq.gz", - "fastq2": "sampleA_rep2_normal_R2.fastq.gz" + "bam": "sampleB_rep1_normal.bam" }, { - "id": "sampleA", - "repeat": "2", + "id": "sampleB", + "repeat": "1", "type": "tumor", - "fastq1": "sampleA_rep2_tumor_R1.fastq.gz", - "fastq2": "sampleA_rep2_tumor_R2.fastq.gz" + "bam": "sampleB_rep1_tumor.bam" } ] [DUMP: joined] [ [ - "sampleB", + "sampleC", "1" ], { - "id": "sampleB", + "id": "sampleC", "repeat": "1", "type": "normal", - "fastq1": "sampleB_rep1_normal_R1.fastq.gz", - "fastq2": "sampleB_rep1_normal_R2.fastq.gz" + "bam": "sampleC_rep1_normal.bam" }, { - "id": "sampleB", + "id": "sampleC", "repeat": "1", "type": "tumor", - "fastq1": "sampleB_rep1_tumor_R1.fastq.gz", - "fastq2": "sampleB_rep1_tumor_R2.fastq.gz" + "bam": "sampleC_rep1_tumor.bam" } ] [DUMP: joined] [ [ - "sampleC", + "sampleD", "1" ], { - "id": "sampleC", + "id": "sampleD", "repeat": "1", "type": "normal", - "fastq1": "sampleC_rep1_normal_R1.fastq.gz", - "fastq2": "sampleC_rep1_normal_R2.fastq.gz" + "bam": "sampleD_rep1_normal.bam" }, { - "id": "sampleC", + "id": "sampleD", "repeat": "1", "type": "tumor", - "fastq1": "sampleC_rep1_tumor_R1.fastq.gz", - "fastq2": "sampleC_rep1_tumor_R2.fastq.gz" + "bam": "sampleD_rep1_tumor.bam" } ] ``` @@ -947,86 +926,78 @@ nextflow run main.nf -dump-channels joined ```console title="View normal and tumor samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [insane_gautier] DSL2 - revision: bf5b9a6d37 +Launching `main.nf` [sharp_waddington] DSL2 - revision: c02356ebe1 [DUMP: joined] [ - { - "id": "sampleA", - "repeat": "1" - }, + [ + "sampleA", + "1" + ], { "id": "sampleA", "repeat": "1", "type": "normal", - "fastq1": "sampleA_rep1_normal_R1.fastq.gz", - "fastq2": "sampleA_rep1_normal_R2.fastq.gz" + "bam": "sampleA_r1_normal.bam" }, { "id": "sampleA", "repeat": "1", "type": "tumor", - "fastq1": "sampleA_rep1_tumor_R1.fastq.gz", - "fastq2": "sampleA_rep1_tumor_R2.fastq.gz" + "bam": "sampleA_rep1_tumor.bam" } ] [DUMP: joined] [ + [ + "sampleB", + "1" + ], { - "id": "sampleA", - "repeat": "2" - }, - { - "id": "sampleA", - "repeat": "2", + "id": "sampleB", + "repeat": "1", "type": "normal", - "fastq1": "sampleA_rep2_normal_R1.fastq.gz", - "fastq2": "sampleA_rep2_normal_R2.fastq.gz" + "bam": "sampleB_rep1_normal.bam" }, { - "id": "sampleA", - "repeat": "2", + "id": "sampleB", + "repeat": "1", "type": "tumor", - "fastq1": "sampleA_rep2_tumor_R1.fastq.gz", - "fastq2": "sampleA_rep2_tumor_R2.fastq.gz" + "bam": "sampleB_rep1_tumor.bam" } ] [DUMP: joined] [ + [ + "sampleC", + "1" + ], { - "id": "sampleB", - "repeat": "1" - }, - { - "id": "sampleB", + "id": "sampleC", "repeat": "1", "type": "normal", - "fastq1": "sampleB_rep1_normal_R1.fastq.gz", - "fastq2": "sampleB_rep1_normal_R2.fastq.gz" + "bam": "sampleC_rep1_normal.bam" }, { - "id": "sampleB", + "id": "sampleC", "repeat": "1", "type": "tumor", - "fastq1": "sampleB_rep1_tumor_R1.fastq.gz", - "fastq2": "sampleB_rep1_tumor_R2.fastq.gz" + "bam": "sampleC_rep1_tumor.bam" } ] [DUMP: joined] [ + [ + "sampleD", + "1" + ], { - "id": "sampleC", - "repeat": "1" - }, - { - "id": "sampleC", + "id": "sampleD", "repeat": "1", "type": "normal", - "fastq1": "sampleC_rep1_normal_R1.fastq.gz", - "fastq2": "sampleC_rep1_normal_R2.fastq.gz" + 
"bam": "sampleD_rep1_normal.bam" }, { - "id": "sampleC", + "id": "sampleD", "repeat": "1", "type": "tumor", - "fastq1": "sampleC_rep1_tumor_R1.fastq.gz", - "fastq2": "sampleC_rep1_tumor_R2.fastq.gz" + "bam": "sampleD_rep1_tumor.bam" } ] ``` @@ -1043,7 +1014,7 @@ _Before:_ ```groovy title="main.nf" linenums="1" workflow { - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) ``` @@ -1052,7 +1023,7 @@ _After:_ ```groovy title="main.nf" linenums="1" workflow { getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } - samplesheet = Channel.fromPath("./data/samplesheet.csv") + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) ``` @@ -1102,7 +1073,7 @@ nextflow run main.nf -dump-channels joined ```console title="View normal and tumor samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [trusting_boltzmann] DSL2 - revision: 0b1cd77e3b +Launching `main.nf` [silly_lalande] DSL2 - revision: 76bcb0b16b [DUMP: joined] [ { @@ -1113,75 +1084,67 @@ Launching `main.nf` [trusting_boltzmann] DSL2 - revision: 0b1cd77e3b "id": "sampleA", "repeat": "1", "type": "normal", - "fastq1": "sampleA_rep1_normal_R1.fastq.gz", - "fastq2": "sampleA_rep1_normal_R2.fastq.gz" + "bam": "sampleA_r1_normal.bam" }, { "id": "sampleA", "repeat": "1", "type": "tumor", - "fastq1": "sampleA_rep1_tumor_R1.fastq.gz", - "fastq2": "sampleA_rep1_tumor_R2.fastq.gz" + "bam": "sampleA_rep1_tumor.bam" } ] [DUMP: joined] [ { - "id": "sampleA", - "repeat": "2" + "id": "sampleB", + "repeat": "1" }, { - "id": "sampleA", - "repeat": "2", + "id": "sampleB", + "repeat": "1", "type": "normal", - "fastq1": "sampleA_rep2_normal_R1.fastq.gz", - "fastq2": "sampleA_rep2_normal_R2.fastq.gz" + "bam": "sampleB_rep1_normal.bam" }, { - "id": "sampleA", - "repeat": "2", + "id": "sampleB", + "repeat": "1", "type": "tumor", - "fastq1": "sampleA_rep2_tumor_R1.fastq.gz", - "fastq2": "sampleA_rep2_tumor_R2.fastq.gz" + "bam": "sampleB_rep1_tumor.bam" } ] [DUMP: joined] [ { - "id": "sampleB", + "id": "sampleC", "repeat": "1" }, { - "id": "sampleB", + "id": "sampleC", "repeat": "1", "type": "normal", - "fastq1": "sampleB_rep1_normal_R1.fastq.gz", - "fastq2": "sampleB_rep1_normal_R2.fastq.gz" + "bam": "sampleC_rep1_normal.bam" }, { - "id": "sampleB", + "id": "sampleC", "repeat": "1", "type": "tumor", - "fastq1": "sampleB_rep1_tumor_R1.fastq.gz", - "fastq2": "sampleB_rep1_tumor_R2.fastq.gz" + "bam": "sampleC_rep1_tumor.bam" } ] [DUMP: joined] [ { - "id": "sampleC", + "id": "sampleD", "repeat": "1" }, { - "id": "sampleC", + "id": "sampleD", "repeat": "1", "type": "normal", - "fastq1": "sampleC_rep1_normal_R1.fastq.gz", - "fastq2": "sampleC_rep1_normal_R2.fastq.gz" + "bam": "sampleD_rep1_normal.bam" }, { - "id": "sampleC", + "id": "sampleD", "repeat": "1", "type": "tumor", - "fastq1": "sampleC_rep1_tumor_R1.fastq.gz", - "fastq2": "sampleC_rep1_tumor_R2.fastq.gz" + "bam": "sampleD_rep1_tumor.bam" } ] ``` @@ -1257,20 +1220,20 @@ nextflow run main.nf -dump-channels combined ```console title="View combined samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [extravagant_maxwell] DSL2 - revision: 459bde3584 +Launching `main.nf` [festering_davinci] DSL2 - revision: 0ec9c8b25a -[DUMP: combined] [['id':'sampleA', 'repeat':'1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 
'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], 'chr1'] -[DUMP: combined] [['id':'sampleA', 'repeat':'1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], 'chr2'] -[DUMP: combined] [['id':'sampleA', 'repeat':'1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], 'chr3'] -[DUMP: combined] [['id':'sampleA', 'repeat':'2'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz'], 'chr1'] -[DUMP: combined] [['id':'sampleA', 'repeat':'2'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz'], 'chr2'] -[DUMP: combined] [['id':'sampleA', 'repeat':'2'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz'], 'chr3'] -[DUMP: combined] [['id':'sampleB', 'repeat':'1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz'], 'chr1'] -[DUMP: combined] [['id':'sampleB', 'repeat':'1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz'], 'chr2'] -[DUMP: combined] [['id':'sampleB', 'repeat':'1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz'], 'chr3'] -[DUMP: combined] [['id':'sampleC', 'repeat':'1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz'], 'chr1'] -[DUMP: combined] [['id':'sampleC', 'repeat':'1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz'], 'chr2'] -[DUMP: combined] [['id':'sampleC', 'repeat':'1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 
'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz'], 'chr3'] +[DUMP: combined] [['id':'sampleA', 'repeat':'1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam'], 'chr1'] +[DUMP: combined] [['id':'sampleA', 'repeat':'1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam'], 'chr2'] +[DUMP: combined] [['id':'sampleA', 'repeat':'1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam'], 'chr3'] +[DUMP: combined] [['id':'sampleB', 'repeat':'1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam'], 'chr1'] +[DUMP: combined] [['id':'sampleB', 'repeat':'1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam'], 'chr2'] +[DUMP: combined] [['id':'sampleB', 'repeat':'1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam'], 'chr3'] +[DUMP: combined] [['id':'sampleC', 'repeat':'1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam'], 'chr1'] +[DUMP: combined] [['id':'sampleC', 'repeat':'1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam'], 'chr2'] +[DUMP: combined] [['id':'sampleC', 'repeat':'1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam'], 'chr3'] +[DUMP: combined] [['id':'sampleD', 'repeat':'1'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam'], 'chr1'] +[DUMP: combined] [['id':'sampleD', 'repeat':'1'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam'], 'chr2'] +[DUMP: combined] [['id':'sampleD', 'repeat':'1'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam'], 'chr3'] ``` Success! We have repeated every sample for every single interval in our 3 interval list. We've effectively tripled the number of items in our channel. It's a little hard to read though, so in the next section we will tidy it up. 
@@ -1336,20 +1299,20 @@ nextflow run main.nf -dump-channels combined

```console title="View combined samples"
 N E X T F L O W   ~  version 24.10.5

-Launching `main.nf` [focused_curie] DSL2 - revision: 9953685fec
+Launching `main.nf` [thirsty_turing] DSL2 - revision: 7e1d6928c6

-[DUMP: combined] [['id':'sampleA', 'repeat':'1', 'interval':'chr1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']]
-[DUMP: combined] [['id':'sampleA', 'repeat':'1', 'interval':'chr2'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']]
-[DUMP: combined] [['id':'sampleA', 'repeat':'1', 'interval':'chr3'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']]
-[DUMP: combined] [['id':'sampleA', 'repeat':'2', 'interval':'chr1'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']]
-[DUMP: combined] [['id':'sampleA', 'repeat':'2', 'interval':'chr2'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']]
-[DUMP: combined] [['id':'sampleA', 'repeat':'2', 'interval':'chr3'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']]
-[DUMP: combined] [['id':'sampleB', 'repeat':'1', 'interval':'chr1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']]
-[DUMP: combined] [['id':'sampleB', 'repeat':'1', 'interval':'chr2'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']]
-[DUMP: combined] [['id':'sampleB', 'repeat':'1', 'interval':'chr3'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']]
-[DUMP: combined] [['id':'sampleC', 'repeat':'1', 'interval':'chr1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]
-[DUMP: combined] [['id':'sampleC', 'repeat':'1', 'interval':'chr2'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]
-[DUMP: combined] [['id':'sampleC', 'repeat':'1', 'interval':'chr3'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]
+[DUMP: combined] [['id':'sampleA', 'repeat':'1', 'interval':'chr1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']]
+[DUMP: combined] [['id':'sampleA', 'repeat':'1', 'interval':'chr2'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']]
+[DUMP: combined] [['id':'sampleA', 'repeat':'1', 'interval':'chr3'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']]
+[DUMP: combined] [['id':'sampleB', 'repeat':'1', 'interval':'chr1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']]
+[DUMP: combined] [['id':'sampleB', 'repeat':'1', 'interval':'chr2'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']]
+[DUMP: combined] [['id':'sampleB', 'repeat':'1', 'interval':'chr3'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']]
+[DUMP: combined] [['id':'sampleC', 'repeat':'1', 'interval':'chr1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']]
+[DUMP: combined] [['id':'sampleC', 'repeat':'1', 'interval':'chr2'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']]
+[DUMP: combined] [['id':'sampleC', 'repeat':'1', 'interval':'chr3'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']]
+[DUMP: combined] [['id':'sampleD', 'repeat':'1', 'interval':'chr1'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']]
+[DUMP: combined] [['id':'sampleD', 'repeat':'1', 'interval':'chr2'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']]
+[DUMP: combined] [['id':'sampleD', 'repeat':'1', 'interval':'chr3'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']]
```

Using `map` to coerce your data into the correct structure can be tricky, but it's crucial for splitting and grouping effectively.

@@ -1360,6 +1323,8 @@ In this section, you've learned:

 - **Spreading samples over intervals**: How to use `combine` to repeat samples over intervals

+#TODO: Suggestion, tidy up data here instead of at the end? (i.e. section 5.2)
+
### 5. Aggregating samples

In the previous section, we learned how to split a samplesheet and filter the normal and tumor samples. But this only covers a single type of joining. What if we want to group samples by a specific attribute? For example, instead of joining matched normal-tumor pairs, we might want to process all samples from "sampleA" together regardless of their type. This pattern is common in bioinformatics workflows where you may want to process related samples separately for efficiency reasons before comparing or combining the results at the end.
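Before wiring this into our workflow, here is a minimal, standalone sketch of the idea using the [`groupTuple`](https://www.nextflow.io/docs/latest/operator.html#grouptuple) operator — the sample and file names here are made up purely for illustration:

```groovy
// Standalone sketch (hypothetical data): groupTuple collects all items
// that share the same first element into a single tuple per key.
workflow {
    Channel.of(
            ['sampleA', 'sampleA_rep1.bam'],
            ['sampleA', 'sampleA_rep2.bam'],
            ['sampleB', 'sampleB_rep1.bam']
        )
        .groupTuple()
        .view()
    // emits: [sampleA, [sampleA_rep1.bam, sampleA_rep2.bam]]
    //        [sampleB, [sampleB_rep1.bam]]
}
```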
@@ -1435,20 +1400,20 @@ nextflow run main.nf -dump-channels grouped

```console title="View grouped samples"
 N E X T F L O W   ~  version 24.10.5

-Launching `main.nf` [fabulous_baekeland] DSL2 - revision: 5d2d687351
+Launching `main.nf` [grave_lagrange] DSL2 - revision: ed7032f1d7

-[DUMP: grouped] [['id':'sampleA', 'interval':'chr1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']]
-[DUMP: grouped] [['id':'sampleA', 'interval':'chr2'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']]
-[DUMP: grouped] [['id':'sampleA', 'interval':'chr3'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz']]
-[DUMP: grouped] [['id':'sampleA', 'interval':'chr1'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']]
-[DUMP: grouped] [['id':'sampleA', 'interval':'chr2'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']]
-[DUMP: grouped] [['id':'sampleA', 'interval':'chr3'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']]
-[DUMP: grouped] [['id':'sampleB', 'interval':'chr1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']]
-[DUMP: grouped] [['id':'sampleB', 'interval':'chr2'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']]
-[DUMP: grouped] [['id':'sampleB', 'interval':'chr3'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']]
-[DUMP: grouped] [['id':'sampleC', 'interval':'chr1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]
-[DUMP: grouped] [['id':'sampleC', 'interval':'chr2'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]
-[DUMP: grouped] [['id':'sampleC', 'interval':'chr3'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]
+[DUMP: grouped] [['id':'sampleA', 'interval':'chr1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']]
+[DUMP: grouped] [['id':'sampleA', 'interval':'chr2'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']]
+[DUMP: grouped] [['id':'sampleA', 'interval':'chr3'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']]
+[DUMP: grouped] [['id':'sampleB', 'interval':'chr1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']]
+[DUMP: grouped] [['id':'sampleB', 'interval':'chr2'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']]
+[DUMP: grouped] [['id':'sampleB', 'interval':'chr3'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']]
+[DUMP: grouped] [['id':'sampleC', 'interval':'chr1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']]
+[DUMP: grouped] [['id':'sampleC', 'interval':'chr2'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']]
+[DUMP: grouped] [['id':'sampleC', 'interval':'chr3'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']]
+[DUMP: grouped] [['id':'sampleD', 'interval':'chr1'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']]
+[DUMP: grouped] [['id':'sampleD', 'interval':'chr2'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']]
+[DUMP: grouped] [['id':'sampleD', 'interval':'chr3'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']]
```

We can see that we have successfully isolated the `id` and `interval` fields, but not grouped the samples yet.

@@ -1461,7 +1426,6 @@ _Before:_

 ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor ->
     [
         grouping_key.subMap('id', 'interval'),
-        grouping_key,
         normal,
         tumor
     ]

@@ -1496,17 +1460,20 @@ nextflow run main.nf -dump-channels grouped

```console title="View grouped samples"
 N E X T F L O W   ~  version 24.10.5

-Launching `main.nf` [reverent_nightingale] DSL2 - revision: 72c6664d6f
+Launching `main.nf` [agitated_gates] DSL2 - revision: 024454556c

-[DUMP: grouped] [['id':'sampleA', 'interval':'chr1'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']]]
-[DUMP: grouped] [['id':'sampleA', 'interval':'chr2'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']]]
-[DUMP: grouped] [['id':'sampleA', 'interval':'chr3'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'fastq1':'sampleA_rep1_normal_R1.fastq.gz', 'fastq2':'sampleA_rep1_normal_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'normal', 'fastq1':'sampleA_rep2_normal_R1.fastq.gz', 'fastq2':'sampleA_rep2_normal_R2.fastq.gz']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleA_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep1_tumor_R2.fastq.gz'], ['id':'sampleA', 'repeat':'2', 'type':'tumor', 'fastq1':'sampleA_rep2_tumor_R1.fastq.gz', 'fastq2':'sampleA_rep2_tumor_R2.fastq.gz']]]
-[DUMP: grouped] [['id':'sampleB', 'interval':'chr1'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']]]
-[DUMP: grouped] [['id':'sampleB', 'interval':'chr2'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']]]
-[DUMP: grouped] [['id':'sampleB', 'interval':'chr3'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'fastq1':'sampleB_rep1_normal_R1.fastq.gz', 'fastq2':'sampleB_rep1_normal_R2.fastq.gz']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleB_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleB_rep1_tumor_R2.fastq.gz']]]
-[DUMP: grouped] [['id':'sampleC', 'interval':'chr1'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]]
-[DUMP: grouped] [['id':'sampleC', 'interval':'chr2'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]]
-[DUMP: grouped] [['id':'sampleC', 'interval':'chr3'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'fastq1':'sampleC_rep1_normal_R1.fastq.gz', 'fastq2':'sampleC_rep1_normal_R2.fastq.gz']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'fastq1':'sampleC_rep1_tumor_R1.fastq.gz', 'fastq2':'sampleC_rep1_tumor_R2.fastq.gz']]]
+[DUMP: grouped] [['id':'sampleA', 'interval':'chr1'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']]]
+[DUMP: grouped] [['id':'sampleA', 'interval':'chr2'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']]]
+[DUMP: grouped] [['id':'sampleA', 'interval':'chr3'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']]]
+[DUMP: grouped] [['id':'sampleB', 'interval':'chr1'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']]]
+[DUMP: grouped] [['id':'sampleB', 'interval':'chr2'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']]]
+[DUMP: grouped] [['id':'sampleB', 'interval':'chr3'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']]]
+[DUMP: grouped] [['id':'sampleC', 'interval':'chr1'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']]]
+[DUMP: grouped] [['id':'sampleC', 'interval':'chr2'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']]]
+[DUMP: grouped] [['id':'sampleC', 'interval':'chr3'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']]]
+[DUMP: grouped] [['id':'sampleD', 'interval':'chr1'], [['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam']], [['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']]]
+[DUMP: grouped] [['id':'sampleD', 'interval':'chr2'], [['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam']], [['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']]]
+[DUMP: grouped] [['id':'sampleD', 'interval':'chr3'], [['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam']], [['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']]]
```

It's a little awkward to read! If you're having trouble visualizing it, you can use the `pretty` flag of `dump` to make it easier to read:

@@ -1521,7 +1488,7 @@ _Before:_

_After:_

```groovy title="main.nf" linenums="40"
-    .dump(tag: 'grouped', pretty: true)
+        .dump(tag: 'grouped', pretty: true)
}
```

@@ -1530,7 +1497,7 @@ Note, we only include the first sample to keep this concise!

```console title="View grouped samples"
 N E X T F L O W   ~  version 24.10.5

-Launching `main.nf` [dreamy_lichterman] DSL2 - revision: 953a5dd264
+Launching `main.nf` [big_golick] DSL2 - revision: 61ae66acee

[DUMP: grouped] [
    {
@@ -1542,15 +1509,7 @@ Launching `main.nf` [dreamy_lichterman] DSL2 - revision: 953a5dd264
            "id": "sampleA",
            "repeat": "1",
            "type": "normal",
-            "fastq1": "sampleA_rep1_normal_R1.fastq.gz",
-            "fastq2": "sampleA_rep1_normal_R2.fastq.gz"
-        },
-        {
-            "id": "sampleA",
-            "repeat": "2",
-            "type": "normal",
-            "fastq1": "sampleA_rep2_normal_R1.fastq.gz",
-            "fastq2": "sampleA_rep2_normal_R2.fastq.gz"
+            "bam": "sampleA_r1_normal.bam"
        }
    ],
    [
@@ -1558,15 +1517,7 @@ Launching `main.nf` [dreamy_lichterman] DSL2 - revision: 953a5dd264
            "id": "sampleA",
            "repeat": "1",
            "type": "tumor",
-            "fastq1": "sampleA_rep1_tumor_R1.fastq.gz",
-            "fastq2": "sampleA_rep1_tumor_R2.fastq.gz"
-        },
-        {
-            "id": "sampleA",
-            "repeat": "2",
-            "type": "tumor",
-            "fastq1": "sampleA_rep2_tumor_R1.fastq.gz",
-            "fastq2": "sampleA_rep2_tumor_R2.fastq.gz"
+            "bam": "sampleA_rep1_tumor.bam"
        }
    ]
]

@@ -1595,8 +1546,7 @@ We have a lot of duplicated data in our workflow. Each item in the grouped sampl
            "id": "sampleC",
            "repeat": "1",
            "type": "normal",
-            "fastq1": "sampleC_rep1_normal_R1.fastq.gz",
-            "fastq2": "sampleC_rep1_normal_R2.fastq.gz"
+            "bam": "sampleC_rep1_normal.bam"
        }
    ],
    [
@@ -1604,8 +1554,7 @@ We have a lot of duplicated data in our workflow. Each item in the grouped sampl
            "id": "sampleC",
            "repeat": "1",
            "type": "tumor",
-            "fastq1": "sampleC_rep1_tumor_R1.fastq.gz",
-            "fastq2": "sampleC_rep1_tumor_R2.fastq.gz"
+            "bam": "sampleC_rep1_tumor.bam"
        }
    ]
]

@@ -1613,7 +1562,7 @@ We have a lot of duplicated data in our workflow. Each item in the grouped sampl

 We could parse the data after grouping to remove the duplication, but this requires us to handle all of the outputs. Instead, we can parse the data before grouping, which will mean the results are never included in the first place.

-In the same `map` operator where we isolate the `id` and `interval` fields, we can also grab the `fastq1` and `fastq2` fields for our sample data and _not_ include the `id` and `interval` fields.
+In the same `map` operator where we isolate the `id` and `interval` fields, we can also grab the `bam` field for our sample data and _not_ include the `id` and `interval` fields.
_Before:_

@@ -1637,8 +1586,7 @@ _After:_

 ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor ->
     [
         grouping_key.subMap('id', 'interval'),
-        normal.subMap("fastq1", "fastq2"),
-        tumor.subMap("fastq1", "fastq2")
+        normal.subMap("bam")
     ]
 }

@@ -1654,7 +1602,7 @@ nextflow run main.nf -dump-channels grouped

```console title="View grouped samples"
 N E X T F L O W   ~  version 24.10.5

-Launching `main.nf` [modest_stallman] DSL2 - revision: 5be827a6e8
+Launching `main.nf` [drunk_baekeland] DSL2 - revision: b46fad3c6c

[DUMP: grouped] [
    {
@@ -1663,29 +1611,25 @@ Launching `main.nf` [modest_stallman] DSL2 - revision: 5be827a6e8
    },
    [
        {
-            "fastq1": "sampleA_rep1_normal_R1.fastq.gz",
-            "fastq2": "sampleA_rep1_normal_R2.fastq.gz"
-        },
-        {
-            "fastq1": "sampleA_rep2_normal_R1.fastq.gz",
-            "fastq2": "sampleA_rep2_normal_R2.fastq.gz"
+            "bam": "sampleA_r1_normal.bam"
        }
-    ],
+    ]
+]
+[DUMP: grouped] [
+    {
+        "id": "sampleA",
+        "interval": "chr2"
+    },
    [
        {
-            "fastq1": "sampleA_rep1_tumor_R1.fastq.gz",
-            "fastq2": "sampleA_rep1_tumor_R2.fastq.gz"
-        },
-        {
-            "fastq1": "sampleA_rep2_tumor_R1.fastq.gz",
-            "fastq2": "sampleA_rep2_tumor_R2.fastq.gz"
+            "bam": "sampleA_r1_normal.bam"
        }
    ]
]
...
```

-Now we have a much cleaner output. We can see that the `id` and `interval` fields are only included once, and the `fastq1` and `fastq2` fields are included in the sample data
+Now we have a much cleaner output. We can see that the `id` and `interval` fields are only included once, and the `bam` field is included in the sample data.

### Takeaway

diff --git a/side-quests/splitting_and_grouping/data/samplesheet.csv b/side-quests/splitting_and_grouping/data/samplesheet.csv
index a4cac668e1..d20f27f416 100644
--- a/side-quests/splitting_and_grouping/data/samplesheet.csv
+++ b/side-quests/splitting_and_grouping/data/samplesheet.csv
@@ -1,9 +1,9 @@
-id,repeat,type,fastq1,fastq2
-sampleA,1,normal,sampleA_rep1_normal_R1.fastq.gz,sampleA_rep1_normal_R2.fastq.gz
-sampleA,1,tumor,sampleA_rep1_tumor_R1.fastq.gz,sampleA_rep1_tumor_R2.fastq.gz
-sampleA,2,normal,sampleA_rep2_normal_R1.fastq.gz,sampleA_rep2_normal_R2.fastq.gz
-sampleA,2,tumor,sampleA_rep2_tumor_R1.fastq.gz,sampleA_rep2_tumor_R2.fastq.gz
-sampleB,1,normal,sampleB_rep1_normal_R1.fastq.gz,sampleB_rep1_normal_R2.fastq.gz
-sampleB,1,tumor,sampleB_rep1_tumor_R1.fastq.gz,sampleB_rep1_tumor_R2.fastq.gz
-sampleC,1,normal,sampleC_rep1_normal_R1.fastq.gz,sampleC_rep1_normal_R2.fastq.gz
-sampleC,1,tumor,sampleC_rep1_tumor_R1.fastq.gz,sampleC_rep1_tumor_R2.fastq.gz
+id,repeat,type,bam
+sampleA,1,normal,sampleA_r1_normal.bam
+sampleA,1,tumor,sampleA_rep1_tumor.bam
+sampleB,1,normal,sampleB_rep1_normal.bam
+sampleB,1,tumor,sampleB_rep1_tumor.bam
+sampleC,1,normal,sampleC_rep1_normal.bam
+sampleC,1,tumor,sampleC_rep1_tumor.bam
+sampleD,1,normal,sampleD_rep1_normal.bam
+sampleD,1,tumor,sampleD_rep1_tumor.bam

From 5bdbce726c48c0c6efe5b061f1ff4bb7f11b3183 Mon Sep 17 00:00:00 2001
From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com>
Date: Wed, 9 Apr 2025 19:17:08 +0100
Subject: [PATCH 12/36] Remove dump from tutorial to simplify and focus
 splitting-grouping and not debugging

---
 docs/side_quests/splitting-and-grouping.md    | 843 ++++--------------
 .../data/samplesheet.csv                      |  14 +-
 2 files changed, 180 insertions(+), 677 deletions(-)

diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md
index e4f8c0504b..a0e37b18a2 100644
--- a/docs/side_quests/splitting-and-grouping.md
+++ b/docs/side_quests/splitting-and-grouping.md
@@ -131,111 +131,7 @@ Each map contains:
 - `type`: The sample type (normal or tumor)
 - `bam`: Path to the BAM file

-This format makes it easy to access specific fields from each sample. For example, we could access the sample ID with `row.id` or the BAM path with `row.bam`.
-
-This means we have successfully read in the samplesheet and have access to the data in each row. We can start to implement this in our pipeline.
-
-### 1.2. Use dump to pretty print the data
-
-For a prettier output format, we can use the [`dump` operator](https://www.nextflow.io/docs/latest/operator.html#dump) instead of `view`:
-
-_Before:_
-
-```groovy title="main.nf" linenums="1"
-workflow {
-    ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
-        .splitCsv(header: true)
-        .view()
-}
-```
-
-_After:_
-
-```groovy title="main.nf" linenums="1"
-workflow {
-    ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
-        .splitCsv(header: true)
-        .dump(tag: 'samples', pretty: true)
-}
-```
-
-```bash title="Read the samplesheet"
-nextflow run main.nf
-```
-
-```console title="Read samplesheet with dump"
- N E X T F L O W   ~  version 24.10.5
-
-Launching `./main.nf` [grave_stone] DSL2 - revision: b2bafa8755
-```
-
-Wait?! Where is our output? `dump` is a special operator that prints the data to the console only when specifically enabled. That is what the `tag` parameter is for. Let's enable it:
-
-```bash title="Enable dump"
-nextflow run main.nf -dump-channels samples
-```
-
-```console title="Read samplesheet with dump"
- N E X T F L O W   ~  version 24.10.5
-
-Launching `main.nf` [cheesy_celsius] DSL2 - revision: 0e9d501bcc
-
-[DUMP: samples] {
-    "id": "sampleA",
-    "repeat": "1",
-    "type": "normal",
-    "bam": "sampleA_r1_normal.bam"
-}
-[DUMP: samples] {
-    "id": "sampleA",
-    "repeat": "1",
-    "type": "tumor",
-    "bam": "sampleA_rep1_tumor.bam"
-}
-[DUMP: samples] {
-    "id": "sampleB",
-    "repeat": "1",
-    "type": "normal",
-    "bam": "sampleB_rep1_normal.bam"
-}
-[DUMP: samples] {
-    "id": "sampleB",
-    "repeat": "1",
-    "type": "tumor",
-    "bam": "sampleB_rep1_tumor.bam"
-}
-[DUMP: samples] {
-    "id": "sampleC",
-    "repeat": "1",
-    "type": "normal",
-    "bam": "sampleC_rep1_normal.bam"
-}
-[DUMP: samples] {
-    "id": "sampleC",
-    "repeat": "1",
-    "type": "tumor",
-    "bam": "sampleC_rep1_tumor.bam"
-}
-[DUMP: samples] {
-    "id": "sampleD",
-    "repeat": "1",
-    "type": "normal",
-    "bam": "sampleD_rep1_normal.bam"
-}
-[DUMP: samples] {
-    "id": "sampleD",
-    "repeat": "1",
-    "type": "tumor",
-    "bam": "sampleD_rep1_tumor.bam"
-}
-```
-
-This is a long output, but we can see that each row from the CSV file has been converted into a map with keys matching the header row. It's more clear to read, at the cost of being too much content for a small terminal. If you want it to be more concise, you can remove the `pretty: true` parameter and the console output will be similar to `view`.
-
-!!! note
-    If the output is too tall for your terminal but you have a very wide terminal, you can remove `pretty: true` from the `dump` operator to make it more concise.
-
-Both dump and view are useful for debugging and we will continue to use them throughout this side quest. Feel free to intersperse them if you need additional clarification at any step.
+This format makes it easy to access specific fields from each sample. For example, we could access the sample ID with `sample.id` or the BAM file path with `sample.bam`. The output above shows each row from the CSV file converted into a map with keys matching the header row.
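As a quick standalone illustration (separate from the pipeline we are building), a snippet like the following — the message format is made up — pulls individual named fields out of each row map:

```groovy
// Standalone sketch: access named fields on each row map produced by splitCsv.
workflow {
    Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
        .map { row -> "Sample ${row.id} (${row.type}): ${row.bam}" }
        .view()
}
```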
Now that we've successfully read in the samplesheet and have access to the data in each row, we can begin implementing our pipeline logic.

### Takeaway

@@ -243,7 +139,6 @@ In this section, you've learned:

 - **Reading in a samplesheet**: How to read in a samplesheet with `splitCsv`
 - **Viewing data**: How to use `view` to print the data
-- **Dumping data**: How to use `dump` to pretty print the data

 We now have a channel of maps, each representing a row from the samplesheet. Next, we'll transform this data into a format suitable for our pipeline by extracting metadata and organizing the file paths.

@@ -261,7 +156,7 @@ _Before:_

 workflow {
     ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
         .splitCsv(header: true)
-        .dump(tag: 'samples', pretty: true)
+        .view()
 }
```

@@ -276,9 +171,6 @@ workflow {
 }
```

-!!! note
-    We drop the `pretty: true` parameter from `dump` because it makes it easier to see the difference
-
```bash title="View normal samples"
nextflow run main.nf
```

@@ -302,7 +194,7 @@ In this case, we want to keep only the samples where `sample.type == 'normal'`.
     .filter { sample -> sample.type == 'normal' }
```

-### 2.2. Save results of filter to a new channel
+### 2.2. Filter to just the tumor samples

#TODO: Move this later after making the tumor only channel, put it at the end in one section!

@@ -327,7 +219,7 @@ workflow {
         .splitCsv(header: true)
     ch_normal_samples = ch_samplesheet
         .filter { sample -> sample.type == 'normal' }
-        .view()
+    ch_normal_samples.view()
 }
```

@@ -348,17 +240,17 @@ Launching `main.nf` [astonishing_noether] DSL2 - revision: 8e49cf6956
 [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam]
```

-Success! We have filtered the data to only include normal samples. Note that we can use view and save the new channel. If we wanted, we still have access to the tumor samples within the `samplesheet` channel. Since we managed it for the normal samples, let's do it for the tumor samples as well:
+Success! We have filtered the data to only include normal samples. Note that we can use `view` and save the new channel. If we wanted, we still have access to the tumor samples within the `ch_samplesheet` channel. Since we managed it for the normal samples, let's do it for the tumor samples as well:

 _Before:_

```groovy title="main.nf" linenums="1"
 workflow {
     ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
-        .splitCsv(header: true)
+        .splitCsv(header: true)
     ch_normal_samples = ch_samplesheet
-        .filter { sample -> sample.type == 'normal' }
-        .view()
+        .filter { sample -> sample.type == 'normal' }
+    ch_normal_samples.view()
 }
```

_After:_

```groovy title="main.nf" linenums="1"
workflow {
    ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
    ch_normal_samples = ch_samplesheet
        .filter { sample -> sample.type == 'normal' }
-        .view()
    ch_tumor_samples = ch_samplesheet
        .filter { sample -> sample.type == 'tumor' }
-        .view()
+    ch_normal_samples.view()
+    ch_tumor_samples.view()
}
```

```console title="View normal and tumor samples"
 N E X T F L O W   ~  version 24.10.5

Launching `main.nf` [gloomy_roentgen] DSL2 - revision: e6b3917a8e

 [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam]
```

-We've managed to separate out the normal and tumor samples into two different channels but they're mixed up when we `view` them in the console! Here's where dump could be useful, because it can label the different channels with a tag.
-
-#TODO: remove this bit
+We've managed to separate out the normal and tumor samples into two different channels but they're mixed up when we `view` them in the console!
If we want, we can remove one of the `view` operators to see the data in each channel separately. Let's remove the `view` operator for the normal samples:

_Before:_

@@ -408,10 +298,10 @@ workflow {
         .splitCsv(header: true)
     ch_normal_samples = ch_samplesheet
         .filter { sample -> sample.type == 'normal' }
-        .view()
     ch_tumor_samples = ch_samplesheet
         .filter { sample -> sample.type == 'tumor' }
-        .view()
+    ch_normal_samples.view()
+    ch_tumor_samples.view()
 }
```

_After:_

```groovy title="main.nf" linenums="1"
workflow {
    ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
    ch_normal_samples = ch_samplesheet
        .filter { sample -> sample.type == 'normal' }
-        .dump(tag: 'normal')
    ch_tumor_samples = ch_samplesheet
-        .filter { sample -> sample.type == "tumor" }
-        .dump(tag: 'tumor')
+        .filter { sample -> sample.type == 'tumor' }
+    ch_tumor_samples.view()
}
```

```bash title="View normal and tumor samples"
-nextflow run main.nf -dump-channels normal,tumor
+nextflow run main.nf
```

```console title="View normal and tumor samples"
 N E X T F L O W   ~  version 24.10.5

-Launching `main.nf` [sharp_carlsson] DSL2 - revision: 61e1be6afd
+Launching `main.nf` [pensive_moriondo] DSL2 - revision: 012d38e59f

-[DUMP: tumor] ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']
-[DUMP: normal] ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam']
-[DUMP: tumor] ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']
-[DUMP: normal] ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam']
-[DUMP: tumor] ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']
-[DUMP: normal] ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam']
-[DUMP: tumor] ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']
-[DUMP: normal] ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam']
+[id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]
+[id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]
+[id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]
+[id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]
```

-Note how the `normal` and `tumor` tags are used to label the different channels. This is useful for debugging and for understanding the data flow in our pipeline.
+Note how we can only see the tumor samples in the output. This is because we removed the `view` operator for the normal samples.

### Takeaway

@@ -457,7 +342,7 @@ In this section, you've learned:

 - **Filtering data**: How to filter data with `filter`
 - **Splitting data**: How to split data into different channels based on a condition
-- **Dumping data**: How to use `dump` to label and print the data
+- **Viewing data**: How to use `view` to print the data

 We've now separated out the normal and tumor samples into two different channels. Next, we'll join the normal and tumor samples on the `id` field.

@@ -471,25 +356,21 @@ Nextflow includes many methods for combining channels, but in this case the most

 ### 3.1. Use `map` and `join` to combine based on sample ID

-If we check the [`join`](https://www.nextflow.io/docs/latest/operator.html#join) documentation, we can see that it joins two channels based on the first item in each tuple. Let's run the pipeline to check our data structure and see how we need to modify it to join on the `id` field.
+If we check the [`join`](https://www.nextflow.io/docs/latest/operator.html#join) documentation, we can see that it joins two channels based on the first item in each tuple. If you no longer have the console output available, run the pipeline again to check our data structure and see how we need to modify it to join on the `id` field.

```bash title="View normal and tumor samples"
-nextflow run main.nf -dump-channels normal,tumor
+nextflow run main.nf
```

```console title="View normal and tumor samples"
 N E X T F L O W   ~  version 24.10.5

-Launching `main.nf` [sharp_carlsson] DSL2 - revision: 61e1be6afd
+Launching `main.nf` [sleepy_lichterman] DSL2 - revision: 012d38e59f

-[DUMP: tumor] ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']
-[DUMP: normal] ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam']
-[DUMP: tumor] ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']
-[DUMP: normal] ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam']
-[DUMP: tumor] ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']
-[DUMP: normal] ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam']
-[DUMP: tumor] ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']
-[DUMP: normal] ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam']
+[id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]
+[id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]
+[id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]
+[id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]
```

 We can see that the `id` field is the first element in each map. For `join` to work, we should isolate the `id` field in each tuple. After that, we can simply use the `join` operator to combine the two channels.
@@ -504,10 +385,9 @@ workflow {
         .splitCsv(header: true)
     ch_normal_samples = ch_samplesheet
         .filter { sample -> sample.type == 'normal' }
-        .dump(tag: 'normal')
     ch_tumor_samples = ch_samplesheet
-        .filter { sample -> sample.type == "tumor" }
+        .filter { sample -> sample.type == 'tumor' }
+    ch_tumor_samples.view()
 }
```

@@ -520,51 +400,51 @@ workflow {
        .splitCsv(header: true)
    ch_normal_samples = ch_samplesheet
        .filter { sample -> sample.type == 'normal' }
        .map { sample -> [sample.id, sample] }
-        .dump(tag: 'normal')
    ch_tumor_samples = ch_samplesheet
-        .filter { sample -> sample.type == "tumor" }
+        .filter { sample -> sample.type == 'tumor' }
        .map { sample -> [sample.id, sample] }
-        .dump(tag: 'tumor')
+    ch_normal_samples.view()
+    ch_tumor_samples.view()
}
```

```bash title="View normal and tumor samples with ID as element 0"
-nextflow run main.nf -dump-channels normal,tumor
+nextflow run main.nf
```

```console title="View normal and tumor samples with ID as element 0"
 N E X T F L O W   ~  version 24.10.5

-Launching `main.nf` [peaceful_morse] DSL2 - revision: 34daafdfb3
+Launching `main.nf` [trusting_ptolemy] DSL2 - revision: 882ae9add4

-[DUMP: normal] ['sampleA', ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam']]
-[DUMP: tumor] ['sampleA', ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']]
-[DUMP: normal] ['sampleB', ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam']]
-[DUMP: tumor] ['sampleB', ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']]
-[DUMP: normal] ['sampleC', ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam']]
-[DUMP: tumor] ['sampleC', ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']]
-[DUMP: normal] ['sampleD', ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam']]
-[DUMP: tumor] ['sampleD', ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']]
+[sampleA, [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam]]
+[sampleA, [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]]
+[sampleB, [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam]]
+[sampleB, [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]]
+[sampleC, [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam]]
+[sampleC, [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]
+[sampleD, [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam]]
+[sampleD, [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]
```

It might be subtle, but you should be able to see the first element in each tuple is the `id` field. Now we can use the `join` operator to combine the two channels based on the `id` field.

-Once again, we will use `dump` to selectively print the joined outputs.
+Once again, we will use `view` to print the joined outputs.
_Before:_

```groovy title="main.nf" linenums="1"
workflow {
-    samplesheet = Channel.fromPath("./data/samplesheet.csv")
+    ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
        .splitCsv(header: true)
    ch_normal_samples = ch_samplesheet
        .filter { sample -> sample.type == 'normal' }
        .map { sample -> [sample.id, sample] }
-        .dump(tag: 'normal')
    ch_tumor_samples = ch_samplesheet
-        .filter { sample -> sample.type == "tumor" }
+        .filter { sample -> sample.type == 'tumor' }
        .map { sample -> [sample.id, sample] }
-        .dump(tag: 'tumor')
+    ch_normal_samples.view()
+    ch_tumor_samples.view()
}
```

_After:_

@@ -577,30 +457,28 @@ workflow {
    ch_normal_samples = ch_samplesheet
        .filter { sample -> sample.type == 'normal' }
        .map { sample -> [sample.id, sample] }
-        .dump(tag: 'normal')
    ch_tumor_samples = ch_samplesheet
-        .filter { sample -> sample.type == "tumor" }
+        .filter { sample -> sample.type == 'tumor' }
        .map { sample -> [sample.id, sample] }
-        .dump(tag: 'tumor')
-    joined_samples = ch_normal_samples
+    ch_joined_samples = ch_normal_samples
        .join(ch_tumor_samples)
-        .dump(tag: 'joined')
+    ch_joined_samples.view()
}
```

```bash title="View normal and tumor samples"
-nextflow run main.nf -dump-channels joined
+nextflow run main.nf
```

```console title="View joined normal and tumor samples"
 N E X T F L O W   ~  version 24.10.5

-Launching `main.nf` [hopeful_agnesi] DSL2 - revision: 78b21768c2
+Launching `main.nf` [astonishing_heyrovsky] DSL2 - revision: 49857f9ecc

-[DUMP: joined] ['sampleA', ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']]
-[DUMP: joined] ['sampleB', ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']]
-[DUMP: joined] ['sampleC', ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']]
-[DUMP: joined] ['sampleD', ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']]
+[sampleA, [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]]
+[sampleB, [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]]
+[sampleC, [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]
+[sampleD, [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]
```

It's a little hard to tell because it's so wide, but you should be able to see the samples have been joined by the `id` field.
Each tuple now has the format:

@@ -609,99 +487,6 @@ It's a little hard to tell because it's so wide, but you should be able to see t

 - `normal_sample`: The normal sample including type, replicate and path to bam file
 - `tumor_sample`: The tumor sample including type, replicate and path to bam file

-If you want you can use the `pretty` parameter of `dump` to make it easier to read:
-
-_After:_
-
-```groovy title="main.nf" linenums="1"
-workflow {
-    ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
-        .splitCsv(header: true)
-    ch_normal_samples = ch_samplesheet
-        .filter { sample -> sample.type == 'normal' }
-        .map { sample -> [sample.id, sample] }
-        .dump(tag: 'normal')
-    ch_tumor_samples = ch_samplesheet
-        .filter { sample -> sample.type == "tumor" }
-        .map { sample -> [sample.id, sample] }
-        .dump(tag: 'tumor')
-    joined_samples = ch_normal_samples
-        .join(ch_tumor_samples)
-        .dump(tag: 'joined', pretty: true)
-}
-```
-
-```bash title="View normal and tumor samples"
-nextflow run main.nf -dump-channels joined
-```
-
-```console title="View normal and tumor samples"
- N E X T F L O W   ~  version 24.10.5
-
-Launching `main.nf` [desperate_einstein] DSL2 - revision: 2dce0e5352
-
-[DUMP: joined] [
-    "sampleA",
-    {
-        "id": "sampleA",
-        "repeat": "1",
-        "type": "normal",
-        "bam": "sampleA_r1_normal.bam"
-    },
-    {
-        "id": "sampleA",
-        "repeat": "1",
-        "type": "tumor",
-        "bam": "sampleA_rep1_tumor.bam"
-    }
-]
-[DUMP: joined] [
-    "sampleB",
-    {
-        "id": "sampleB",
-        "repeat": "1",
-        "type": "normal",
-        "bam": "sampleB_rep1_normal.bam"
-    },
-    {
-        "id": "sampleB",
-        "repeat": "1",
-        "type": "tumor",
-        "bam": "sampleB_rep1_tumor.bam"
-    }
-]
-[DUMP: joined] [
-    "sampleC",
-    {
-        "id": "sampleC",
-        "repeat": "1",
-        "type": "normal",
-        "bam": "sampleC_rep1_normal.bam"
-    },
-    {
-        "id": "sampleC",
-        "repeat": "1",
-        "type": "tumor",
-        "bam": "sampleC_rep1_tumor.bam"
-    }
-]
-[DUMP: joined] [
-    "sampleD",
-    {
-        "id": "sampleD",
-        "repeat": "1",
-        "type": "normal",
-        "bam": "sampleD_rep1_normal.bam"
-    },
-    {
-        "id": "sampleD",
-        "repeat": "1",
-        "type": "tumor",
-        "bam": "sampleD_rep1_tumor.bam"
-    }
-]
-

!!! warning

    The `join` operator will discard any un-matched tuples. In this example, we made sure all samples were matched for tumor and normal but if this is not true you must use the parameter `remainder: true` to keep the unmatched tuples. Check the [documentation](https://www.nextflow.io/docs/latest/operator.html#join) for more details.
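To see what `remainder: true` changes, here is a minimal standalone sketch with deliberately unmatched keys — the sample names and file names are hypothetical:

```groovy
// Standalone sketch: sampleB has no partner in ch_right, so without
// remainder: true it would be silently dropped; with it, the missing
// side is emitted as null instead.
workflow {
    ch_left  = Channel.of(['sampleA', 'A_normal.bam'], ['sampleB', 'B_normal.bam'])
    ch_right = Channel.of(['sampleA', 'A_tumor.bam'])
    ch_left
        .join(ch_right, remainder: true)
        .view()
    // emits: [sampleA, A_normal.bam, A_tumor.bam]
    //        [sampleB, B_normal.bam, null]
}
```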
@@ -731,14 +516,12 @@ workflow {
    ch_normal_samples = ch_samplesheet
        .filter { sample -> sample.type == 'normal' }
        .map { sample -> [sample.id, sample] }
-        .dump(tag: 'normal')
    ch_tumor_samples = ch_samplesheet
-        .filter { sample -> sample.type == "tumor" }
+        .filter { sample -> sample.type == 'tumor' }
        .map { sample -> [sample.id, sample] }
-        .dump(tag: 'tumor')
-    joined_samples = ch_normal_samples
+    ch_joined_samples = ch_normal_samples
        .join(ch_tumor_samples)
-        .dump(tag: 'joined', pretty: true)
+    ch_joined_samples.view()
}
```

@@ -755,104 +538,34 @@ workflow {
            sample
        ]
        }
-        .dump(tag: 'normal')
    ch_tumor_samples = ch_samplesheet
-        .filter { sample -> sample.type == "tumor" }
+        .filter { sample -> sample.type == 'tumor' }
        .map { sample -> [
            [sample.id, sample.repeat],
            sample
        ]
        }
-        .dump(tag: 'tumor')
    ch_joined_samples = ch_normal_samples
        .join(ch_tumor_samples)
-        .dump(tag: 'joined', pretty: true)
+    ch_joined_samples.view()
}
```

Now we should see the join is occurring but using both the `id` and `repeat` fields.

```bash title="View normal and tumor samples"
-nextflow run main.nf -dump-channels joined
+nextflow run main.nf
```

```console title="View normal and tumor samples"
 N E X T F L O W   ~  version 24.10.5

-Launching `main.nf` [infallible_torricelli] DSL2 - revision: c02356ebe1
+Launching `main.nf` [extravagant_varahamihira] DSL2 - revision: 8c61f8cc77

-[DUMP: joined] [
-    [
-        "sampleA",
-        "1"
-    ],
-    {
-        "id": "sampleA",
-        "repeat": "1",
-        "type": "normal",
-        "bam": "sampleA_r1_normal.bam"
-    },
-    {
-        "id": "sampleA",
-        "repeat": "1",
-        "type": "tumor",
-        "bam": "sampleA_rep1_tumor.bam"
-    }
-]
-[DUMP: joined] [
-    [
-        "sampleB",
-        "1"
-    ],
-    {
-        "id": "sampleB",
-        "repeat": "1",
-        "type": "normal",
-        "bam": "sampleB_rep1_normal.bam"
-    },
-    {
-        "id": "sampleB",
-        "repeat": "1",
-        "type": "tumor",
-        "bam": "sampleB_rep1_tumor.bam"
-    }
-]
-[DUMP: joined] [
-    [
-        "sampleC",
-        "1"
-    ],
-    {
-        "id": "sampleC",
-        "repeat": "1",
-        "type": "normal",
-        "bam": "sampleC_rep1_normal.bam"
-    },
-    {
-        "id": "sampleC",
-        "repeat": "1",
-        "type": "tumor",
-        "bam": "sampleC_rep1_tumor.bam"
-    }
-]
-[DUMP: joined] [
-    [
-        "sampleD",
-        "1"
-    ],
-    {
-        "id": "sampleD",
-        "repeat": "1",
-        "type": "normal",
-        "bam": "sampleD_rep1_normal.bam"
-    },
-    {
-        "id": "sampleD",
-        "repeat": "1",
-        "type": "tumor",
-        "bam": "sampleD_rep1_tumor.bam"
-    }
-]
+[[sampleA, 1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]]
+[[sampleB, 1], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]]
+[[sampleC, 1], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]
+[[sampleD, 1], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]
```

Note how we have a tuple of two elements (`id` and `repeat` fields) as the first element of each joined result. This demonstrates how complex items can be used as a joining key, enabling fairly intricate matching between samples from the same conditions.
@@ -876,18 +589,16 @@ workflow {
            sample
        ]
        }
-        .dump(tag: 'normal')
    ch_tumor_samples = ch_samplesheet
-        .filter { sample -> sample.type == "tumor" }
+        .filter { sample -> sample.type == 'tumor' }
        .map { sample -> [
            [sample.id, sample.repeat],
            sample
        ]
        }
-        .dump(tag: 'tumor')
    ch_joined_samples = ch_normal_samples
        .join(ch_tumor_samples)
-        .dump(tag: 'joined', pretty: true)
+    ch_joined_samples.view()
}
```

@@ -904,102 +615,32 @@ workflow {
            sample
        ]
        }
-        .dump(tag: 'normal')
    ch_tumor_samples = ch_samplesheet
-        .filter { sample -> sample.type == "tumor" }
+        .filter { sample -> sample.type == 'tumor' }
        .map { sample -> [
            sample.subMap(['id', 'repeat']),
            sample
        ]
        }
-        .dump(tag: 'tumor')
    ch_joined_samples = ch_normal_samples
        .join(ch_tumor_samples)
-        .dump(tag: 'joined', pretty: true)
+    ch_joined_samples.view()
}
```

```bash title="View normal and tumor samples"
-nextflow run main.nf -dump-channels joined
+nextflow run main.nf
```

```console title="View normal and tumor samples"
 N E X T F L O W   ~  version 24.10.5

-Launching `main.nf` [sharp_waddington] DSL2 - revision: c02356ebe1
+Launching `main.nf` [irreverent_shaw] DSL2 - revision: 83d2d53944

-[DUMP: joined] [
-    [
-        "sampleA",
-        "1"
-    ],
-    {
-        "id": "sampleA",
-        "repeat": "1",
-        "type": "normal",
-        "bam": "sampleA_r1_normal.bam"
-    },
-    {
-        "id": "sampleA",
-        "repeat": "1",
-        "type": "tumor",
-        "bam": "sampleA_rep1_tumor.bam"
-    }
-]
-[DUMP: joined] [
-    [
-        "sampleB",
-        "1"
-    ],
-    {
-        "id": "sampleB",
-        "repeat": "1",
-        "type": "normal",
-        "bam": "sampleB_rep1_normal.bam"
-    },
-    {
-        "id": "sampleB",
-        "repeat": "1",
-        "type": "tumor",
-        "bam": "sampleB_rep1_tumor.bam"
-    }
-]
-[DUMP: joined] [
-    [
-        "sampleC",
-        "1"
-    ],
-    {
-        "id": "sampleC",
-        "repeat": "1",
-        "type": "normal",
-        "bam": "sampleC_rep1_normal.bam"
-    },
-    {
-        "id": "sampleC",
-        "repeat": "1",
-        "type": "tumor",
-        "bam": "sampleC_rep1_tumor.bam"
-    }
-]
-[DUMP: joined] [
-    [
-        "sampleD",
-        "1"
-    ],
-    {
-        "id": "sampleD",
-        "repeat": "1",
-        "type": "normal",
-        "bam": "sampleD_rep1_normal.bam"
-    },
-    {
-        "id": "sampleD",
-        "repeat": "1",
-        "type": "tumor",
-        "bam": "sampleD_rep1_tumor.bam"
-    }
-]
+[[id:sampleA, repeat:1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]]
+[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]]
+[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]
+[[id:sampleD, repeat:1], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]
```

Now we have a new joining key that not only includes the `id` and `repeat` fields but also retains the field names so we can access them later by name, e.g. `sample.id` and `sample.repeat`.
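If the behaviour of `subMap` itself is unclear, this tiny standalone Groovy illustration (with a made-up sample map) shows exactly what it returns:

```groovy
// subMap returns a new map containing only the requested keys.
def sample = [id: 'sampleA', repeat: '1', type: 'normal', bam: 'sampleA_rep1_normal.bam']
println sample.subMap(['id', 'repeat'])   // prints: [id:sampleA, repeat:1]
```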
@@ -1039,7 +680,6 @@ _Before:_
            sample
        ]
        }
-        .dump(tag: 'normal')
    ch_tumor_samples = ch_samplesheet
        .filter { sample -> sample.type == "tumor" }
        .map { sample -> [
            sample.subMap(['id', 'repeat']),
            sample
        ]
        }
-        .dump(tag: 'tumor')
```

_After:_

```groovy
    ch_normal_samples = ch_samplesheet
        .filter { sample -> sample.type == 'normal' }
        .map ( getSampleIdAndReplicate )
-        .dump(tag: 'normal')
    ch_tumor_samples = ch_samplesheet
        .filter { sample -> sample.type == "tumor" }
        .map ( getSampleIdAndReplicate )
-        .dump(tag: 'tumor')
```

!!! note

    The `map` operator has switched from using `{ }` to using `( )` to pass the closure as an argument. This is because the `map` operator expects a closure as an argument and `{ }` is used to define an anonymous closure. When calling a named closure, use the `( )` syntax.

```bash title="View normal and tumor samples"
-nextflow run main.nf -dump-channels joined
+nextflow run main.nf
```

```console title="View normal and tumor samples"
 N E X T F L O W   ~  version 24.10.5

-Launching `main.nf` [silly_lalande] DSL2 - revision: 76bcb0b16b
+Launching `main.nf` [modest_roentgen] DSL2 - revision: ec9412b708

-[DUMP: joined] [
-    {
-        "id": "sampleA",
-        "repeat": "1"
-    },
-    {
-        "id": "sampleA",
-        "repeat": "1",
-        "type": "normal",
-        "bam": "sampleA_r1_normal.bam"
-    },
-    {
-        "id": "sampleA",
-        "repeat": "1",
-        "type": "tumor",
-        "bam": "sampleA_rep1_tumor.bam"
-    }
-]
-[DUMP: joined] [
-    {
-        "id": "sampleB",
-        "repeat": "1"
-    },
-    {
-        "id": "sampleB",
-        "repeat": "1",
-        "type": "normal",
-        "bam": "sampleB_rep1_normal.bam"
-    },
-    {
-        "id": "sampleB",
-        "repeat": "1",
-        "type": "tumor",
-        "bam": "sampleB_rep1_tumor.bam"
-    }
-]
-[DUMP: joined] [
-    {
-        "id": "sampleC",
-        "repeat": "1"
-    },
-    {
-        "id": "sampleC",
-        "repeat": "1",
-        "type": "normal",
-        "bam": "sampleC_rep1_normal.bam"
-    },
-    {
-        "id": "sampleC",
-        "repeat": "1",
-        "type": "tumor",
-        "bam": "sampleC_rep1_tumor.bam"
-    }
-]
-[DUMP: joined] [
-    {
-        "id": "sampleD",
-        "repeat": "1"
-    },
-    {
-        "id": "sampleD",
-        "repeat": "1",
-        "type": "normal",
-        "bam": "sampleD_rep1_normal.bam"
-    },
-    {
-        "id": "sampleD",
-        "repeat": "1",
-        "type": "tumor",
-        "bam": "sampleD_rep1_tumor.bam"
-    }
-]
+[[id:sampleA, repeat:1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]]
+[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]]
+[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]
+[[id:sampleD, repeat:1], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]
```

Using a named closure in the map allows us to reuse the same mapping logic in multiple places, which reduces our risk of introducing errors. It also makes the code more readable and easier to maintain.
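The definition of `getSampleIdAndReplicate` is not shown in this hunk; a sketch consistent with the `subMap` call used above — only a sketch, since the real definition lives earlier in `main.nf` — would look like this:

```groovy
// Hypothetical definition: a named closure that builds the [key, sample]
// tuple, placed above the workflow block so both filters can reuse it.
def getSampleIdAndReplicate = { sample ->
    [ sample.subMap(['id', 'repeat']), sample ]
}
```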
To keep life simple, we will jus _Before:_ ```groovy title="main.nf" linenums="15" - .dump(tag: 'joined', pretty: true) + ch_joined_samples.view() } ``` _After:_ -```groovy title="main.nf" linenums="24" - .dump(tag: 'joined', pretty: true) +```groovy title="main.nf" linenums="15" ch_intervals = Channel.of('chr1', 'chr2', 'chr3') - .dump(tag: "intervals") } ``` @@ -1194,46 +761,44 @@ Now remember, we want to repeat each sample for each interval. This is sometimes _Before:_ -```groovy title="main.nf" linenums="26" +```groovy title="main.nf" linenums="15" ch_intervals = Channel.of('chr1', 'chr2', 'chr3') - .dump(tag: "intervals") } ``` _After:_ -```groovy title="main.nf" linenums="26" +```groovy title="main.nf" linenums="15" ch_intervals = Channel.of('chr1', 'chr2', 'chr3') - .dump(tag: "intervals") ch_combined_samples = ch_joined_samples.combine(ch_intervals) - .dump(tag: 'combined') + .view() } ``` Now let's run it and see what happens: ```bash title="View combined samples" -nextflow run main.nf -dump-channels combined +nextflow run main.nf ``` ```console title="View combined samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [festering_davinci] DSL2 - revision: 0ec9c8b25a +Launching `main.nf` [stupefied_brattain] DSL2 - revision: 6a4891d696 -[DUMP: combined] [['id':'sampleA', 'repeat':'1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam'], 'chr1'] -[DUMP: combined] [['id':'sampleA', 'repeat':'1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam'], 'chr2'] -[DUMP: combined] [['id':'sampleA', 'repeat':'1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam'], 'chr3'] -[DUMP: combined] [['id':'sampleB', 'repeat':'1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam'], 'chr1'] -[DUMP: combined] [['id':'sampleB', 'repeat':'1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam'], 'chr2'] -[DUMP: combined] [['id':'sampleB', 'repeat':'1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam'], 'chr3'] -[DUMP: combined] [['id':'sampleC', 'repeat':'1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam'], 'chr1'] -[DUMP: combined] [['id':'sampleC', 'repeat':'1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam'], 'chr2'] -[DUMP: combined] [['id':'sampleC', 'repeat':'1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam'], 'chr3'] -[DUMP: combined] [['id':'sampleD', 'repeat':'1'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam'], 'chr1'] -[DUMP: combined] [['id':'sampleD', 'repeat':'1'], ['id':'sampleD', 
'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam'], 'chr2'] -[DUMP: combined] [['id':'sampleD', 'repeat':'1'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam'], 'chr3'] +[[id:sampleA, repeat:1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], chr1] +[[id:sampleA, repeat:1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], chr2] +[[id:sampleA, repeat:1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], chr3] +[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam], chr1] +[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam], chr2] +[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam], chr3] +[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam], chr1] +[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam], chr2] +[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam], chr3] +[[id:sampleD, repeat:1], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam], chr1] +[[id:sampleD, repeat:1], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam], chr2] +[[id:sampleD, repeat:1], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam], chr3 ``` Success! We have repeated every sample for every single interval in our 3 interval list. We've effectively tripled the number of items in our channel. It's a little hard to read though, so in the next section we will tidy it up. 
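If the cardinality is hard to picture, here is a minimal standalone sketch (hypothetical values, separate from our workflow) showing what `combine` does: it emits every pairwise combination of the two channels, i.e. their Cartesian product.

```groovy
workflow {
    ch_letters = Channel.of('a', 'b')       // 2 items
    ch_numbers = Channel.of(1, 2, 3)        // 3 items

    // combine emits the Cartesian product: 2 x 3 = 6 tuples
    ch_letters.combine(ch_numbers).view()
    // [a, 1], [a, 2], [a, 3], [b, 1], [b, 2], [b, 3]
}
```

The emission order may vary between runs, but the item count is always the product of the two channel sizes.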
@@ -1245,8 +810,8 @@ We can use the `map` operator to tidy and refactor our sample data so it's easie _Before:_ ```groovy title="main.nf" linenums="19" - ch_combined_samples = joined_samples.combine(ch_intervals) - .dump(tag: 'combined') + ch_combined_samples = ch_joined_samples.combine(ch_intervals) + .view() } ``` @@ -1262,7 +827,7 @@ _After:_ ] } - .dump(tag: 'combined') + .view() } ``` @@ -1293,26 +858,26 @@ Finally, we return all of this as one tuple of the 3 elements, the new map, the Let's run it again and check the channel contents: ```bash title="View combined samples" -nextflow run main.nf -dump-channels combined +nextflow run main.nf ``` ```console title="View combined samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [thirsty_turing] DSL2 - revision: 7e1d6928c6 +Launching `main.nf` [stupefied_kare] DSL2 - revision: 7d98ee6805 -[DUMP: combined] [['id':'sampleA', 'repeat':'1', 'interval':'chr1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']] -[DUMP: combined] [['id':'sampleA', 'repeat':'1', 'interval':'chr2'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']] -[DUMP: combined] [['id':'sampleA', 'repeat':'1', 'interval':'chr3'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']] -[DUMP: combined] [['id':'sampleB', 'repeat':'1', 'interval':'chr1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']] -[DUMP: combined] [['id':'sampleB', 'repeat':'1', 'interval':'chr2'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']] -[DUMP: combined] [['id':'sampleB', 'repeat':'1', 'interval':'chr3'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']] -[DUMP: combined] [['id':'sampleC', 'repeat':'1', 'interval':'chr1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']] -[DUMP: combined] [['id':'sampleC', 'repeat':'1', 'interval':'chr2'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']] -[DUMP: combined] [['id':'sampleC', 'repeat':'1', 'interval':'chr3'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']] -[DUMP: combined] [['id':'sampleD', 'repeat':'1', 'interval':'chr1'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']] -[DUMP: combined] [['id':'sampleD', 'repeat':'1', 'interval':'chr2'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']] -[DUMP: combined] [['id':'sampleD', 'repeat':'1', 'interval':'chr3'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 
'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']]
+[[id:sampleA, repeat:1, interval:chr1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]]
+[[id:sampleA, repeat:1, interval:chr2], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]]
+[[id:sampleA, repeat:1, interval:chr3], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]]
+[[id:sampleB, repeat:1, interval:chr1], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]]
+[[id:sampleB, repeat:1, interval:chr2], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]]
+[[id:sampleB, repeat:1, interval:chr3], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]]
+[[id:sampleC, repeat:1, interval:chr1], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]
+[[id:sampleC, repeat:1, interval:chr2], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]
+[[id:sampleC, repeat:1, interval:chr3], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]
+[[id:sampleD, repeat:1, interval:chr1], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]
+[[id:sampleD, repeat:1, interval:chr2], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]
+[[id:sampleD, repeat:1, interval:chr3], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]
```

Using `map` to coerce your data into the correct structure can be tricky, but it's crucial for splitting and grouping effectively.
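The reshaping above leans on Groovy's map addition: `+` on two maps returns a new merged map and leaves the originals untouched. A tiny standalone illustration (hypothetical values):

```groovy
def grouping_key  = [id: 'sampleA', repeat: '1']
def with_interval = grouping_key + [interval: 'chr1']   // merged copy

println grouping_key    // [id:sampleA, repeat:1], unchanged
println with_interval   // [id:sampleA, repeat:1, interval:chr1]
```

Because the original map is never mutated, the same grouping key can safely be combined with every interval it meets.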
@@ -1361,7 +926,7 @@ _Before:_ ] } - .dump(tag: 'combined') + .view() } ``` @@ -1377,7 +942,6 @@ _After:_ ] } - .dump(tag: 'combined') ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> [ @@ -1387,33 +951,33 @@ _After:_ ] } - .dump(tag: 'grouped') + .view() } ``` Let's run it again and check the channel contents: ```bash title="View grouped samples" -nextflow run main.nf -dump-channels grouped +nextflow run main.nf ``` ```console title="View grouped samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [grave_lagrange] DSL2 - revision: ed7032f1d7 +Launching `main.nf` [silly_leibniz] DSL2 - revision: 1db0b1e3de -[DUMP: grouped] [['id':'sampleA', 'interval':'chr1'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']] -[DUMP: grouped] [['id':'sampleA', 'interval':'chr2'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']] -[DUMP: grouped] [['id':'sampleA', 'interval':'chr3'], ['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam'], ['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']] -[DUMP: grouped] [['id':'sampleB', 'interval':'chr1'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']] -[DUMP: grouped] [['id':'sampleB', 'interval':'chr2'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']] -[DUMP: grouped] [['id':'sampleB', 'interval':'chr3'], ['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam'], ['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']] -[DUMP: grouped] [['id':'sampleC', 'interval':'chr1'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']] -[DUMP: grouped] [['id':'sampleC', 'interval':'chr2'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']] -[DUMP: grouped] [['id':'sampleC', 'interval':'chr3'], ['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam'], ['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']] -[DUMP: grouped] [['id':'sampleD', 'interval':'chr1'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']] -[DUMP: grouped] [['id':'sampleD', 'interval':'chr2'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']] -[DUMP: grouped] [['id':'sampleD', 'interval':'chr3'], ['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam'], ['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']] +[[id:sampleA, interval:chr1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]] +[[id:sampleA, interval:chr2], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, 
bam:sampleA_rep1_tumor.bam]] +[[id:sampleA, interval:chr3], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]] +[[id:sampleB, interval:chr1], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]] +[[id:sampleB, interval:chr2], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]] +[[id:sampleB, interval:chr3], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]] +[[id:sampleC, interval:chr1], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] +[[id:sampleC, interval:chr2], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] +[[id:sampleC, interval:chr3], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] +[[id:sampleD, interval:chr1], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] +[[id:sampleD, interval:chr2], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] +[[id:sampleD, interval:chr3], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] ``` We can see that we have successfully isolated the `id` and `interval` fields, but not grouped the samples yet. @@ -1431,7 +995,7 @@ _Before:_ ] } - .dump(tag: 'grouped') + .view() } ``` @@ -1447,81 +1011,32 @@ _After:_ } .groupTuple() - .dump(tag: 'grouped') + .view() } ``` Simple, huh? We just added a single line of code. 
Let's see what happens when we run it: -```bash title="View grouped samples" -nextflow run main.nf -dump-channels grouped -``` - -```console title="View grouped samples" - N E X T F L O W ~ version 24.10.5 - -Launching `main.nf` [agitated_gates] DSL2 - revision: 024454556c - -[DUMP: grouped] [['id':'sampleA', 'interval':'chr1'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']]] -[DUMP: grouped] [['id':'sampleA', 'interval':'chr2'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']]] -[DUMP: grouped] [['id':'sampleA', 'interval':'chr3'], [['id':'sampleA', 'repeat':'1', 'type':'normal', 'bam':'sampleA_r1_normal.bam']], [['id':'sampleA', 'repeat':'1', 'type':'tumor', 'bam':'sampleA_rep1_tumor.bam']]] -[DUMP: grouped] [['id':'sampleB', 'interval':'chr1'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']]] -[DUMP: grouped] [['id':'sampleB', 'interval':'chr2'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']]] -[DUMP: grouped] [['id':'sampleB', 'interval':'chr3'], [['id':'sampleB', 'repeat':'1', 'type':'normal', 'bam':'sampleB_rep1_normal.bam']], [['id':'sampleB', 'repeat':'1', 'type':'tumor', 'bam':'sampleB_rep1_tumor.bam']]] -[DUMP: grouped] [['id':'sampleC', 'interval':'chr1'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']]] -[DUMP: grouped] [['id':'sampleC', 'interval':'chr2'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']]] -[DUMP: grouped] [['id':'sampleC', 'interval':'chr3'], [['id':'sampleC', 'repeat':'1', 'type':'normal', 'bam':'sampleC_rep1_normal.bam']], [['id':'sampleC', 'repeat':'1', 'type':'tumor', 'bam':'sampleC_rep1_tumor.bam']]] -[DUMP: grouped] [['id':'sampleD', 'interval':'chr1'], [['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam']], [['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']]] -[DUMP: grouped] [['id':'sampleD', 'interval':'chr2'], [['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam']], [['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']]] -[DUMP: grouped] [['id':'sampleD', 'interval':'chr3'], [['id':'sampleD', 'repeat':'1', 'type':'normal', 'bam':'sampleD_rep1_normal.bam']], [['id':'sampleD', 'repeat':'1', 'type':'tumor', 'bam':'sampleD_rep1_tumor.bam']]] -``` - -It's a little awkward to read! If you're having trouble visualizing it, you can use the `pretty` flag of `dump` to make it easier to read: - -_Before:_ - -```groovy title="main.nf" linenums="40" - .dump(tag: 'grouped') -} -``` - -_After:_ +#TODO: AUTHORS NOTE WE CHANGED THE SAMPLE SHEET DURING DEV HERE. GO BACK AND FIX IT -```groovy title="main.nf" linenums="40" - .dump(tag: 'grouped', pretty: true) -} +```bash title="View grouped samples" +nextflow run main.nf ``` -Note, we only include the first sample to keep this concise! 
- ```console title="View grouped samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [big_golick] DSL2 - revision: 61ae66acee +Launching `main.nf` [irreverent_mestorf] DSL2 - revision: 5cb6b8c8da -[DUMP: grouped] [ - { - "id": "sampleA", - "interval": "chr1" - }, - [ - { - "id": "sampleA", - "repeat": "1", - "type": "normal", - "bam": "sampleA_r1_normal.bam" - } - ], - [ - { - "id": "sampleA", - "repeat": "1", - "type": "tumor", - "bam": "sampleA_rep1_tumor.bam" - } - ] -] -... +[[id:sampleA, interval:chr1], [[id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam]], [[id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]]] +[[id:sampleA, interval:chr2], [[id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam]], [[id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]]] +[[id:sampleA, interval:chr3], [[id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam]], [[id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]]] +[[id:sampleB, interval:chr1], [[id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam]], [[id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]] +[[id:sampleB, interval:chr2], [[id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam]], [[id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]] +[[id:sampleB, interval:chr3], [[id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam]], [[id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]] +[[id:sampleC, interval:chr1], [[id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam]], [[id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]] +[[id:sampleC, interval:chr2], [[id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam]], [[id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]] +[[id:sampleC, interval:chr3], [[id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam]], [[id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]] ``` Note our data has changed structure. What was previously a list of tuples is now a list of lists of tuples. This is because when we use `groupTuple`, Nextflow creates a new list for each group. This is important to remember when trying to handle the data downstream. 
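A minimal standalone sketch (hypothetical tuples, not our samplesheet) makes the change in shape easier to see:

```groovy
workflow {
    Channel.of(
        ['keyA', 'item1'],
        ['keyA', 'item2'],
        ['keyB', 'item3']
    )
        .groupTuple()   // groups on the first element by default
        .view()
    // [keyA, [item1, item2]]
    // [keyB, [item3]]
}
```

Every element after the grouping key is collected into a list, which is why our normal and tumor maps now arrive wrapped in lists. Also note that `groupTuple` has to wait to be sure each group is complete; if you know in advance how many items each group should contain, passing `groupTuple(size: n)` lets Nextflow emit a group as soon as it is full.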
@@ -1566,7 +1081,7 @@ In the same `map` operator where we isolate the `id` and `interval` fields, we c _Before:_ -```groovy title="main.nf" linenums="30" +```groovy title="main.nf" linenums="27" ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> [ grouping_key.subMap('id', 'interval'), @@ -1576,56 +1091,44 @@ _Before:_ } .groupTuple() - .dump(tag: 'grouped', pretty: true) + .view() } ``` _After:_ -```groovy title="main.nf" linenums="30" +```groovy title="main.nf" linenums="27" ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> [ grouping_key.subMap('id', 'interval'), - normal.subMap("bam") + normal.subMap("bam"), + tumor.subMap("bam") ] } .groupTuple() - .dump(tag: 'grouped', pretty: true) + .view() } ``` ```bash title="View grouped samples" -nextflow run main.nf -dump-channels grouped +nextflow run main.nf ``` ```console title="View grouped samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [drunk_baekeland] DSL2 - revision: b46fad3c6c - -[DUMP: grouped] [ - { - "id": "sampleA", - "interval": "chr1" - }, - [ - { - "bam": "sampleA_r1_normal.bam" - } - ] -] -[DUMP: grouped] [ - { - "id": "sampleA", - "interval": "chr2" - }, - [ - { - "bam": "sampleA_r1_normal.bam" - } - ] -] +Launching `main.nf` [boring_hopper] DSL2 - revision: 23cdd7ec26 + +[[id:sampleA, interval:chr1], [[bam:sampleA_rep1_normal.bam], [bam:sampleB_rep1_normal.bam]], [[bam:sampleA_rep1_tumor.bam], [bam:sampleB_rep1_tumor.bam]]] +[[id:sampleA, interval:chr2], [[bam:sampleA_rep1_normal.bam], [bam:sampleB_rep1_normal.bam]], [[bam:sampleA_rep1_tumor.bam], [bam:sampleB_rep1_tumor.bam]]] +[[id:sampleA, interval:chr3], [[bam:sampleA_rep1_normal.bam], [bam:sampleB_rep1_normal.bam]], [[bam:sampleA_rep1_tumor.bam], [bam:sampleB_rep1_tumor.bam]]] +[[id:sampleB, interval:chr1], [[bam:sampleC_rep1_normal.bam]], [[bam:sampleC_rep1_tumor.bam]]] +[[id:sampleB, interval:chr2], [[bam:sampleC_rep1_normal.bam]], [[bam:sampleC_rep1_tumor.bam]]] +[[id:sampleB, interval:chr3], [[bam:sampleC_rep1_normal.bam]], [[bam:sampleC_rep1_tumor.bam]]] +[[id:sampleC, interval:chr1], [[bam:sampleD_rep1_normal.bam]], [[bam:sampleD_rep1_tumor.bam]]] +[[id:sampleC, interval:chr2], [[bam:sampleD_rep1_normal.bam]], [[bam:sampleD_rep1_tumor.bam]]] +[[id:sampleC, interval:chr3], [[bam:sampleD_rep1_normal.bam]], [[bam:sampleD_rep1_tumor.bam]]] ... ``` @@ -1646,7 +1149,7 @@ In this side quest, you've learned how to split and group data using channels. B 1. **Read in samplesheet with splitCsv** - Samplesheet details here -- Show with view, then show with dump (is prettier!) +- Show with view, then show with view 2. 
**Use filter (and/or map) to manipulate into 2 separate channels** diff --git a/side-quests/splitting_and_grouping/data/samplesheet.csv b/side-quests/splitting_and_grouping/data/samplesheet.csv index d20f27f416..d9ec938f31 100644 --- a/side-quests/splitting_and_grouping/data/samplesheet.csv +++ b/side-quests/splitting_and_grouping/data/samplesheet.csv @@ -1,9 +1,9 @@ id,repeat,type,bam -sampleA,1,normal,sampleA_r1_normal.bam +sampleA,1,normal,sampleA_rep1_normal.bam sampleA,1,tumor,sampleA_rep1_tumor.bam -sampleB,1,normal,sampleB_rep1_normal.bam -sampleB,1,tumor,sampleB_rep1_tumor.bam -sampleC,1,normal,sampleC_rep1_normal.bam -sampleC,1,tumor,sampleC_rep1_tumor.bam -sampleD,1,normal,sampleD_rep1_normal.bam -sampleD,1,tumor,sampleD_rep1_tumor.bam +sampleA,2,normal,sampleB_rep1_normal.bam +sampleA,2,tumor,sampleB_rep1_tumor.bam +sampleB,1,normal,sampleC_rep1_normal.bam +sampleB,1,tumor,sampleC_rep1_tumor.bam +sampleC,1,normal,sampleD_rep1_normal.bam +sampleC,1,tumor,sampleD_rep1_tumor.bam From b19f44b50a6f1acd3571cdd60f7fa3e2b421b816 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Thu, 10 Apr 2025 09:44:58 +0100 Subject: [PATCH 13/36] Update samplesheet --- docs/side_quests/splitting-and-grouping.md | 185 ++++++++++----------- side-quests/splitting_and_grouping/main.nf | 32 ++++ 2 files changed, 123 insertions(+), 94 deletions(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index a0e37b18a2..1c4b68fc93 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -110,16 +110,16 @@ nextflow run main.nf ```console title="Read samplesheet with splitCsv" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [elated_fermat] DSL2 - revision: bd6b0224e9 +Launching `main.nf` [deadly_mercator] DSL2 - revision: bd6b0224e9 -[id:sampleA, repeat:1, type:normal, bam:sampleA_r1_normal.bam] +[id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam] [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam] -[id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam] -[id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam] -[id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam] -[id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam] -[id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam] -[id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam] +[id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam] +[id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam] +[id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam] +[id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam] +[id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam] +[id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam] ``` We can see that each row from the CSV file has been converted into a map with keys matching the header row. A map is a key-value data structure similar to dictionaries in Python, objects in JavaScript, or hashes in Ruby. 
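Since each row is an ordinary Groovy map, fields can be accessed by name as soon as the channel is created. A quick sketch (assuming the samplesheet above):

```groovy
workflow {
    Channel.fromPath('./data/samplesheet.csv')
        .splitCsv(header: true)
        .map { row -> "${row.id} repeat ${row.repeat} is a ${row.type} sample" }
        .view()
}
```

One thing to remember is that `splitCsv` reads every value as a string, so a repeat number of `1` arrives as the string `'1'`, not the integer `1`.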
@@ -178,12 +178,12 @@ nextflow run main.nf ```console title="View normal samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [adoring_cori] DSL2 - revision: 194d61704d +Launching `main.nf` [admiring_brown] DSL2 - revision: 194d61704d -[id:sampleA, repeat:1, type:normal, bam:sampleA_r1_normal.bam] -[id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam] -[id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam] -[id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam] +[id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam] +[id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam] +[id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam] +[id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam] ``` We have successfully filtered the data to only include normal samples. Let's recap how this works. The `filter` operator takes a closure that is applied to each element in the channel. If the closure returns `true`, the element is included in the output channel. If the closure returns `false`, the element is excluded from the output channel. @@ -198,7 +198,7 @@ In this case, we want to keep only the samples where `sample.type == 'normal'`. #TODO: Move this later after making the tumor only channel, put it at the end in one section! -While useful, we are discarding the tumor samples. Instead, let's rewrite our pipeline to save all the samples to one channel called `samplesheet`, then filter that channel to just the normal samples and save the results to a new channel called `normal_samples`. +While useful, we are discarding the tumor samples. Instead, let's rewrite our pipeline to save all the samples to one channel called `ch_samplesheet`, then filter that channel to just the normal samples and save the results to a new channel called `ch_normal_samples`. _Before:_ @@ -232,12 +232,12 @@ nextflow run main.nf ```console title="View normal samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [astonishing_noether] DSL2 - revision: 8e49cf6956 +Launching `main.nf` [trusting_poisson] DSL2 - revision: 639186ee74 -[id:sampleA, repeat:1, type:normal, bam:sampleA_r1_normal.bam] -[id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam] -[id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam] -[id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam] +[id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam] +[id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam] +[id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam] +[id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam] ``` Success! We have filtered the data to only include normal samples. Note that we can use view and save the new channel. If we wanted, we still have access to the tumor samples within the `ch_samplesheet` channel. 
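The closure given to `filter` can be any expression that evaluates to a Boolean, so conditions can be combined freely. For example (a hypothetical condition, not part of this workflow), keeping only the first repeat of each normal sample would look like this:

```groovy
// Note the string comparison for repeat: splitCsv yields strings
ch_first_normals = ch_samplesheet
    .filter { sample -> sample.type == 'normal' && sample.repeat == '1' }
```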
Since we managed it for the normal samples, let's do it for the tumor samples as well: @@ -276,16 +276,16 @@ nextflow run main.nf ```console title="View tumor samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [gloomy_roentgen] DSL2 - revision: e6b3917a8e +Launching `main.nf` [big_bernard] DSL2 - revision: 897c9e44cc +[id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam] [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam] -[id:sampleA, repeat:1, type:normal, bam:sampleA_r1_normal.bam] -[id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam] -[id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam] -[id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam] -[id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam] -[id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam] -[id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam] +[id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam] +[id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam] +[id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam] +[id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam] +[id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam] +[id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam] ``` We've managed to separate out the normal and tumor samples into two different channels but they're mixed up when we `view` them in the console! If we want, we can remove one of the `view` operators to see the data in each channel separately. Let's remove the `view` operator for the normal samples: @@ -326,12 +326,12 @@ nextflow run main.nf ```console title="View normal and tumor samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [pensive_moriondo] DSL2 - revision: 012d38e59f +Launching `main.nf` [loving_bardeen] DSL2 - revision: 012d38e59f [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam] -[id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam] -[id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam] -[id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam] +[id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam] +[id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam] +[id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam] ``` Note how we can only see the tumor samples in the output. This is because we removed the `view` operator for the normal samples. @@ -348,7 +348,7 @@ We've now separated out the normal and tumor samples into two different channels --- -## 3. Join on ID +## 3. Join on sample ID In the previous section, we separated out the normal and tumor samples into two different channels. These could be processed independently using specific processes or workflows based on their type. But what happens when we want to compare the normal and tumor samples from the same patient? At this point, we need to join them back together making sure to match the samples based on their `id` field. 
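In its simplest form, `join` pairs up tuples whose first elements are equal. A standalone sketch with hypothetical values:

```groovy
workflow {
    ch_left  = Channel.of(['sampleA', 'left-A'], ['sampleB', 'left-B'])
    ch_right = Channel.of(['sampleA', 'right-A'], ['sampleB', 'right-B'])

    // join matches on the first element of each tuple by default
    ch_left.join(ch_right).view()
    // [sampleA, left-A, right-A]
    // [sampleB, left-B, right-B]
}
```

This is why our first task below is to restructure the tuples so that the matching key sits in the first position.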
@@ -365,12 +365,12 @@ nextflow run main.nf ```console title="View normal and tumor samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [sleepy_lichterman] DSL2 - revision: 012d38e59f +Launching `main.nf` [loving_bardeen] DSL2 - revision: 012d38e59f [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam] -[id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam] -[id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam] -[id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam] +[id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam] +[id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam] +[id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam] ``` We can see that the `id` field is the first element in each map. For `join` to work, we should isolate the `id` field in each tuple. After that, we can simply use the `join` operator to combine the two channels. @@ -415,16 +415,16 @@ nextflow run main.nf ```console title="View normal and tumor samples with ID as element 0" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [trusting_ptolemy] DSL2 - revision: 882ae9add4 +Launching `main.nf` [dreamy_sax] DSL2 - revision: 882ae9add4 [sampleA, [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam]] [sampleA, [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]] -[sampleB, [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam]] -[sampleB, [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]] -[sampleC, [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam]] -[sampleC, [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] -[sampleD, [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam]] -[sampleD, [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] +[sampleA, [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam]] +[sampleA, [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]] +[sampleB, [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam]] +[sampleB, [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] +[sampleC, [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam]] +[sampleC, [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] ``` It might be subtle, but you should be able to see the first element in each tuple is the `id` field. Now we can use the `join` operator to combine the two channels based on the `id` field. 
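One behavior worth knowing before we rely on `join`: by default it silently discards any item that never finds a partner. If an unmatched sample should be an error rather than a silent drop, the operator accepts options for that (a sketch, not used in this side quest):

```groovy
// Fail loudly if a normal sample has no tumor partner, or vice versa
ch_joined_samples = ch_normal_samples
    .join(ch_tumor_samples, failOnMismatch: true)
```

Alternatively, `remainder: true` keeps unmatched items and emits them with a `null` in place of the missing partner.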
@@ -473,12 +473,12 @@ nextflow run main.nf ```console title="View joined normal and tumor samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [astonishing_heyrovsky] DSL2 - revision: 49857f9ecc +Launching `main.nf` [elegant_waddington] DSL2 - revision: c552f22069 [sampleA, [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]] -[sampleB, [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]] -[sampleC, [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] -[sampleD, [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] +[sampleA, [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]] +[sampleB, [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] +[sampleC, [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] ``` It's a little hard to tell because it's so wide, but you should be able to see the samples have been joined by the `id` field. Each tuple now has the format: @@ -560,12 +560,12 @@ nextflow run main.nf ```console title="View normal and tumor samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [extravagant_varahamihira] DSL2 - revision: 8c61f8cc77 +Launching `main.nf` [prickly_wing] DSL2 - revision: 3bebf22dee [[sampleA, 1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]] -[[sampleB, 1], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]] -[[sampleC, 1], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] -[[sampleD, 1], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] +[[sampleA, 2], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]] +[[sampleB, 1], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] +[[sampleC, 1], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] ``` Note how we have a tuple of two elements (`id` and `repeat` fields) as the first element of each joined result. This demonstrates how complex items can be used as a joining key, enabling fairly intricate matching between samples from the same conditions. 
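This works because Groovy compares lists and maps by value rather than by reference, so two independently constructed keys match whenever their contents match. A couple of one-liners to convince yourself:

```groovy
assert ['sampleA', '1'] == ['sampleA', '1']                          // lists compare by content
assert [id: 'sampleA', repeat: '1'] == [id: 'sampleA', repeat: '1']  // and so do maps
```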
@@ -635,12 +635,12 @@ nextflow run main.nf ```console title="View normal and tumor samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [irreverent_shaw] DSL2 - revision: 83d2d53944 +Launching `main.nf` [curious_hopper] DSL2 - revision: 90283e523d [[id:sampleA, repeat:1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]] -[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]] -[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] -[[id:sampleD, repeat:1], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] +[[id:sampleA, repeat:2], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]] +[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] +[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] ``` Now we have a new joining key that not only includes the `id` and `repeat` fields but also retains the field names so we can access them later by name, e.g. `sample.id` and `sample.repeat`. @@ -710,12 +710,12 @@ nextflow run main.nf ```console title="View normal and tumor samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [modest_roentgen] DSL2 - revision: ec9412b708 +Launching `main.nf` [angry_meninsky] DSL2 - revision: 2edc226b1d [[id:sampleA, repeat:1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]] -[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]] -[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] -[[id:sampleD, repeat:1], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] +[[id:sampleA, repeat:2], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]] +[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] +[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] ``` Using a named closure in the map allows us to reuse the same map in multiple places which reduces our risk of introducing errors. It also makes the code more readable and easier to maintain. 
@@ -785,20 +785,20 @@ nextflow run main.nf ```console title="View combined samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [stupefied_brattain] DSL2 - revision: 6a4891d696 +Launching `main.nf` [nasty_albattani] DSL2 - revision: 040e367b95 [[id:sampleA, repeat:1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], chr1] [[id:sampleA, repeat:1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], chr2] [[id:sampleA, repeat:1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], chr3] -[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam], chr1] -[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam], chr2] -[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam], chr3] -[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam], chr1] -[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam], chr2] -[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam], chr3] -[[id:sampleD, repeat:1], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam], chr1] -[[id:sampleD, repeat:1], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam], chr2] -[[id:sampleD, repeat:1], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam], chr3 +[[id:sampleA, repeat:2], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam], chr1] +[[id:sampleA, repeat:2], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam], chr2] +[[id:sampleA, repeat:2], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam], chr3] +[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam], chr1] +[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam], chr2] +[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam], chr3] +[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam], chr1] +[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam], chr2] +[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, 
bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam], chr3] ``` Success! We have repeated every sample for every single interval in our 3 interval list. We've effectively tripled the number of items in our channel. It's a little hard to read though, so in the next section we will tidy it up. @@ -864,20 +864,20 @@ nextflow run main.nf ```console title="View combined samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [stupefied_kare] DSL2 - revision: 7d98ee6805 +Launching `main.nf` [sick_moriondo] DSL2 - revision: 583df36829 [[id:sampleA, repeat:1, interval:chr1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]] [[id:sampleA, repeat:1, interval:chr2], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]] [[id:sampleA, repeat:1, interval:chr3], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]] -[[id:sampleB, repeat:1, interval:chr1], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]] -[[id:sampleB, repeat:1, interval:chr2], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]] -[[id:sampleB, repeat:1, interval:chr3], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]] -[[id:sampleC, repeat:1, interval:chr1], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] -[[id:sampleC, repeat:1, interval:chr2], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] -[[id:sampleC, repeat:1, interval:chr3], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] -[[id:sampleD, repeat:1, interval:chr1], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] -[[id:sampleD, repeat:1, interval:chr2], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] -[[id:sampleD, repeat:1, interval:chr3], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] +[[id:sampleA, repeat:2, interval:chr1], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]] +[[id:sampleA, repeat:2, interval:chr2], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]] +[[id:sampleA, repeat:2, interval:chr3], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]] +[[id:sampleB, repeat:1, interval:chr1], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] +[[id:sampleB, repeat:1, interval:chr2], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] +[[id:sampleB, repeat:1, interval:chr3], [id:sampleB, repeat:1, 
type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]
+[[id:sampleC, repeat:1, interval:chr1], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]
+[[id:sampleC, repeat:1, interval:chr2], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]
+[[id:sampleC, repeat:1, interval:chr3], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]
```

Using `map` to coerce your data into the correct structure can be tricky, but it's crucial for splitting and grouping effectively.

@@ -964,20 +964,20 @@ nextflow run main.nf

```console title="View grouped samples"
 N E X T F L O W ~ version 24.10.5

-Launching `main.nf` [silly_leibniz] DSL2 - revision: 1db0b1e3de
+Launching `main.nf` [suspicious_cantor] DSL2 - revision: bb6b28c9d4

[[id:sampleA, interval:chr1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]]
[[id:sampleA, interval:chr2], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]]
[[id:sampleA, interval:chr3], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]]
-[[id:sampleB, interval:chr1], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]]
-[[id:sampleB, interval:chr2], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]]
-[[id:sampleB, interval:chr3], [id:sampleB, repeat:1, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleB_rep1_tumor.bam]]
-[[id:sampleC, interval:chr1], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]
-[[id:sampleC, interval:chr2], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]
-[[id:sampleC, interval:chr3], [id:sampleC, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]
-[[id:sampleD, interval:chr1], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]
-[[id:sampleD, interval:chr2], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]
-[[id:sampleD, interval:chr3], [id:sampleD, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleD, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]
+[[id:sampleA, interval:chr1], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]]
+[[id:sampleA, interval:chr2], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]]
+[[id:sampleA, interval:chr3], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]]
+[[id:sampleB, interval:chr1], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, 
type:tumor, bam:sampleC_rep1_tumor.bam]] +[[id:sampleB, interval:chr2], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] +[[id:sampleB, interval:chr3], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] +[[id:sampleC, interval:chr1], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] +[[id:sampleC, interval:chr2], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] +[[id:sampleC, interval:chr3], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] ``` We can see that we have successfully isolated the `id` and `interval` fields, but not grouped the samples yet. @@ -1017,8 +1017,6 @@ _After:_ Simple, huh? We just added a single line of code. Let's see what happens when we run it: -#TODO: AUTHORS NOTE WE CHANGED THE SAMPLE SHEET DURING DEV HERE. GO BACK AND FIX IT - ```bash title="View grouped samples" nextflow run main.nf ``` @@ -1026,7 +1024,7 @@ nextflow run main.nf ```console title="View grouped samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [irreverent_mestorf] DSL2 - revision: 5cb6b8c8da +Launching `main.nf` [desperate_fourier] DSL2 - revision: 3b18d673f3 [[id:sampleA, interval:chr1], [[id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam]], [[id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]]] [[id:sampleA, interval:chr2], [[id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam]], [[id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]]] @@ -1118,7 +1116,7 @@ nextflow run main.nf ```console title="View grouped samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [boring_hopper] DSL2 - revision: 23cdd7ec26 +Launching `main.nf` [deadly_dubinsky] DSL2 - revision: 9ca3088b35 [[id:sampleA, interval:chr1], [[bam:sampleA_rep1_normal.bam], [bam:sampleB_rep1_normal.bam]], [[bam:sampleA_rep1_tumor.bam], [bam:sampleB_rep1_tumor.bam]]] [[id:sampleA, interval:chr2], [[bam:sampleA_rep1_normal.bam], [bam:sampleB_rep1_normal.bam]], [[bam:sampleA_rep1_tumor.bam], [bam:sampleB_rep1_tumor.bam]]] @@ -1129,7 +1127,6 @@ Launching `main.nf` [boring_hopper] DSL2 - revision: 23cdd7ec26 [[id:sampleC, interval:chr1], [[bam:sampleD_rep1_normal.bam]], [[bam:sampleD_rep1_tumor.bam]]] [[id:sampleC, interval:chr2], [[bam:sampleD_rep1_normal.bam]], [[bam:sampleD_rep1_tumor.bam]]] [[id:sampleC, interval:chr3], [[bam:sampleD_rep1_normal.bam]], [[bam:sampleD_rep1_tumor.bam]]] -... ``` Now we have a much cleaner output. We can see that the `id` and `interval` fields are only included once, and the `bam` field is included in the sample data. 
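When a structure like this is eventually handed to a process, plain lists of files are usually easier to work with than lists of single-entry maps. One way to flatten them with another `map` (a sketch building on the channel above; `ch_for_calling` is a hypothetical name):

```groovy
ch_for_calling = ch_grouped_samples.map { key, normals, tumors ->
    // collect pulls the bam value out of each single-entry map
    [ key, normals.collect { it.bam }, tumors.collect { it.bam } ]
}
// e.g. [[id:sampleA, interval:chr1], [sampleA_rep1_normal.bam, sampleB_rep1_normal.bam], [sampleA_rep1_tumor.bam, sampleB_rep1_tumor.bam]]
```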
diff --git a/side-quests/splitting_and_grouping/main.nf b/side-quests/splitting_and_grouping/main.nf index 77cc26224d..d34ae77032 100644 --- a/side-quests/splitting_and_grouping/main.nf +++ b/side-quests/splitting_and_grouping/main.nf @@ -1,3 +1,35 @@ workflow { + getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map ( getSampleIdAndReplicate ) + ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == "tumor" } + .map ( getSampleIdAndReplicate ) + ch_joined_samples = ch_normal_samples + .join(ch_tumor_samples) + ch_intervals = Channel.of('chr1', 'chr2', 'chr3') + + ch_combined_samples = ch_joined_samples.combine(ch_intervals) + .map { grouping_key, normal, tumor, interval -> + [ + grouping_key + [interval: interval], + normal, + tumor + ] + + } + + ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> + [ + grouping_key.subMap('id', 'interval'), + normal.subMap("bam"), + tumor.subMap("bam") + ] + + } + .groupTuple() + .view() } From 1b3a809f382588fec7838ce9e78aee0b9a03da97 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Thu, 10 Apr 2025 09:45:50 +0100 Subject: [PATCH 14/36] fixup: Remove TODO --- docs/side_quests/splitting-and-grouping.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 1c4b68fc93..44e57c2a1d 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -196,8 +196,6 @@ In this case, we want to keep only the samples where `sample.type == 'normal'`. ### 2.2. Filter to just the tumor samples -#TODO: Move this later after making the tumor only channel, put it at the end in one section! - While useful, we are discarding the tumor samples. Instead, let's rewrite our pipeline to save all the samples to one channel called `ch_samplesheet`, then filter that channel to just the normal samples and save the results to a new channel called `ch_normal_samples`. _Before:_ From 5182860049afd752f1a4bd0ce4ae7a535e65e506 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Thu, 10 Apr 2025 11:19:03 +0100 Subject: [PATCH 15/36] Finish with a section on flattening the data to close out grouping --- docs/side_quests/splitting-and-grouping.md | 365 +++++++++++++++------ 1 file changed, 257 insertions(+), 108 deletions(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 44e57c2a1d..149020c072 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -718,6 +718,95 @@ Launching `main.nf` [angry_meninsky] DSL2 - revision: 2edc226b1d Using a named closure in the map allows us to reuse the same map in multiple places which reduces our risk of introducing errors. It also makes the code more readable and easier to maintain. +### 3.5. Reduce duplication of data + +We have a lot of duplicated data in our workflow. Each item in the joined samples repeats the `id` and `repeat` fields. Since this information is already available in the grouping key, we can avoid this redundancy. 
As a reminder, our current data structure looks like this:
+
+```groovy
+[
+    [
+        "id": "sampleC",
+        "repeat": "1"
+    ],
+    [
+        "id": "sampleC",
+        "repeat": "1",
+        "type": "normal",
+        "bam": "sampleC_rep1_normal.bam"
+    ],
+    [
+        "id": "sampleC",
+        "repeat": "1",
+        "type": "tumor",
+        "bam": "sampleC_rep1_tumor.bam"
+    ]
+]
+```
+
+Since the `id` and `repeat` fields are available in the grouping key, let's remove them from the sample data to avoid duplication. We can do this by using the `subMap` method to create a new map with only the `type` and `bam` fields. This approach allows us to maintain all necessary information while eliminating redundancy in our data structure.
+
+_Before:_
+
+```groovy title="main.nf" linenums="15"
+workflow {
+    getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] }
+    ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
+        .splitCsv(header: true)
+    ch_normal_samples = ch_samplesheet
+        .filter { sample -> sample.type == 'normal' }
+        .map ( getSampleIdAndReplicate )
+    ch_tumor_samples = ch_samplesheet
+        .filter { sample -> sample.type == "tumor" }
+        .map ( getSampleIdAndReplicate )
+    ch_joined_samples = ch_normal_samples
+        .join(ch_tumor_samples)
+    ch_joined_samples.view()
+}
+```
+
+_After:_
+
+```groovy title="main.nf" linenums="15"
+workflow {
+    getSampleIdAndReplicate = { sample ->
+        [
+            sample.subMap(['id', 'repeat']),
+            sample.subMap(['type', 'bam'])
+        ]
+    }
+    ch_samplesheet = Channel.fromPath("./data/samplesheet.csv")
+        .splitCsv(header: true)
+    ch_normal_samples = ch_samplesheet
+        .filter { sample -> sample.type == 'normal' }
+        .map ( getSampleIdAndReplicate )
+    ch_tumor_samples = ch_samplesheet
+        .filter { sample -> sample.type == "tumor" }
+        .map ( getSampleIdAndReplicate )
+    ch_joined_samples = ch_normal_samples
+        .join(ch_tumor_samples)
+    ch_joined_samples.view()
+}
+```
+
+Now, when the closure returns the tuple, the first element contains the `id` and `repeat` fields and the second element contains the `type` and `bam` fields. We have effectively removed the `id` and `repeat` fields from the sample data and now store them only in the grouping key. This approach eliminates redundancy while maintaining all necessary information.
+
+```bash title="View deduplicated data"
+nextflow run main.nf
+```
+
+```console title="View deduplicated data"
+ N E X T F L O W ~ version 24.10.5
+
+Launching `main.nf` [trusting_pike] DSL2 - revision: 09d3c7a81b
+
+[[id:sampleA, repeat:1], [type:normal, bam:sampleA_rep1_normal.bam], [type:tumor, bam:sampleA_rep1_tumor.bam]]
+[[id:sampleA, repeat:2], [type:normal, bam:sampleB_rep1_normal.bam], [type:tumor, bam:sampleB_rep1_tumor.bam]]
+[[id:sampleB, repeat:1], [type:normal, bam:sampleC_rep1_normal.bam], [type:tumor, bam:sampleC_rep1_tumor.bam]]
+[[id:sampleC, repeat:1], [type:normal, bam:sampleD_rep1_normal.bam], [type:tumor, bam:sampleD_rep1_tumor.bam]]
+```
+
+We can see we only state the `id` and `repeat` fields once in the grouping key and we have the `type` and `bam` fields in the sample data. We haven't lost any information but we managed to make our channel contents more succinct.
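+If you want to see what `subMap` does on its own, here is a minimal sketch you can reason about without running the pipeline (the sample map below is invented for illustration):
+
+```groovy title="subMap sketch"
+def sample = [id: 'sampleA', repeat: '1', type: 'normal', bam: 'sampleA_rep1_normal.bam']
+
+// subMap returns a new map containing only the requested keys
+println(sample.subMap(['id', 'repeat'])) // [id:sampleA, repeat:1]
+println(sample.subMap(['type', 'bam'])) // [type:normal, bam:sampleA_rep1_normal.bam]
+```
+
+The original map is not modified, which is why we can safely build both the grouping key and the trimmed sample data from the same `sample` variable.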
+

### Takeaway

In this section, you've learned:

@@ -726,8 +815,9 @@ In this section, you've learned:
- **Joining Tuples**: How to use `join` to combine tuples based on the first field
- **Creating Joining Keys**: How to use `subMap` to create a new joining key
- **Named Closures**: How to use a named closure in map
-
-You now have a workflow that can split a samplesheet, filter the normal and tumor samples, join them together by sample ID and replicate number, then print the results.
+- **Deduplicating Data**: How to remove duplicate data from the channel
+
+You now have a workflow that can split a samplesheet, filter the normal and tumor samples, join them together by sample ID and replicate number, then print the results.

This is a common pattern in bioinformatics workflows where you need to match up samples after processing independently, so it is a useful skill.

Next, we will look at repeating a sample multiple times.

@@ -783,20 +872,20 @@ nextflow run main.nf

```console title="View combined samples"
 N E X T F L O W ~ version 24.10.5

-Launching `main.nf` [nasty_albattani] DSL2 - revision: 040e367b95
+Launching `main.nf` [soggy_fourier] DSL2 - revision: fa8f5edb22

-[[id:sampleA, repeat:1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], chr1]
-[[id:sampleA, repeat:1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], chr2]
-[[id:sampleA, repeat:1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], chr3]
-[[id:sampleA, repeat:2], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam], chr1]
-[[id:sampleA, repeat:2], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam], chr2]
-[[id:sampleA, repeat:2], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam], chr3]
-[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam], chr1]
-[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam], chr2]
-[[id:sampleB, repeat:1], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam], chr3]
-[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam], chr1]
-[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam], chr2]
-[[id:sampleC, repeat:1], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam], chr3]
+[[id:sampleA, repeat:1], [type:normal, bam:sampleA_rep1_normal.bam], [type:tumor, bam:sampleA_rep1_tumor.bam], chr1]
+[[id:sampleA, repeat:1], [type:normal, bam:sampleA_rep1_normal.bam], [type:tumor, bam:sampleA_rep1_tumor.bam], chr2]
+[[id:sampleA, repeat:1], [type:normal, bam:sampleA_rep1_normal.bam], [type:tumor, bam:sampleA_rep1_tumor.bam], chr3]
+[[id:sampleA, repeat:2], [type:normal, bam:sampleB_rep1_normal.bam],
[type:tumor, bam:sampleB_rep1_tumor.bam], chr1] +[[id:sampleA, repeat:2], [type:normal, bam:sampleB_rep1_normal.bam], [type:tumor, bam:sampleB_rep1_tumor.bam], chr2] +[[id:sampleA, repeat:2], [type:normal, bam:sampleB_rep1_normal.bam], [type:tumor, bam:sampleB_rep1_tumor.bam], chr3] +[[id:sampleB, repeat:1], [type:normal, bam:sampleC_rep1_normal.bam], [type:tumor, bam:sampleC_rep1_tumor.bam], chr1] +[[id:sampleB, repeat:1], [type:normal, bam:sampleC_rep1_normal.bam], [type:tumor, bam:sampleC_rep1_tumor.bam], chr2] +[[id:sampleB, repeat:1], [type:normal, bam:sampleC_rep1_normal.bam], [type:tumor, bam:sampleC_rep1_tumor.bam], chr3] +[[id:sampleC, repeat:1], [type:normal, bam:sampleD_rep1_normal.bam], [type:tumor, bam:sampleD_rep1_tumor.bam], chr1] +[[id:sampleC, repeat:1], [type:normal, bam:sampleD_rep1_normal.bam], [type:tumor, bam:sampleD_rep1_tumor.bam], chr2] +[[id:sampleC, repeat:1], [type:normal, bam:sampleD_rep1_normal.bam], [type:tumor, bam:sampleD_rep1_tumor.bam], chr3] ``` Success! We have repeated every sample for every single interval in our 3 interval list. We've effectively tripled the number of items in our channel. It's a little hard to read though, so in the next section we will tidy it up. @@ -862,20 +951,20 @@ nextflow run main.nf ```console title="View combined samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [sick_moriondo] DSL2 - revision: 583df36829 +Launching `main.nf` [sad_hawking] DSL2 - revision: 1f6f6250cd -[[id:sampleA, repeat:1, interval:chr1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]] -[[id:sampleA, repeat:1, interval:chr2], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]] -[[id:sampleA, repeat:1, interval:chr3], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]] -[[id:sampleA, repeat:2, interval:chr1], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]] -[[id:sampleA, repeat:2, interval:chr2], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]] -[[id:sampleA, repeat:2, interval:chr3], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]] -[[id:sampleB, repeat:1, interval:chr1], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] -[[id:sampleB, repeat:1, interval:chr2], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] -[[id:sampleB, repeat:1, interval:chr3], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] -[[id:sampleC, repeat:1, interval:chr1], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] -[[id:sampleC, repeat:1, interval:chr2], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] -[[id:sampleC, repeat:1, interval:chr3], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] +[[id:sampleA, repeat:1, 
interval:chr1], [type:normal, bam:sampleA_rep1_normal.bam], [type:tumor, bam:sampleA_rep1_tumor.bam]]
+[[id:sampleA, repeat:1, interval:chr2], [type:normal, bam:sampleA_rep1_normal.bam], [type:tumor, bam:sampleA_rep1_tumor.bam]]
+[[id:sampleA, repeat:1, interval:chr3], [type:normal, bam:sampleA_rep1_normal.bam], [type:tumor, bam:sampleA_rep1_tumor.bam]]
+[[id:sampleA, repeat:2, interval:chr1], [type:normal, bam:sampleB_rep1_normal.bam], [type:tumor, bam:sampleB_rep1_tumor.bam]]
+[[id:sampleA, repeat:2, interval:chr2], [type:normal, bam:sampleB_rep1_normal.bam], [type:tumor, bam:sampleB_rep1_tumor.bam]]
+[[id:sampleA, repeat:2, interval:chr3], [type:normal, bam:sampleB_rep1_normal.bam], [type:tumor, bam:sampleB_rep1_tumor.bam]]
+[[id:sampleB, repeat:1, interval:chr1], [type:normal, bam:sampleC_rep1_normal.bam], [type:tumor, bam:sampleC_rep1_tumor.bam]]
+[[id:sampleB, repeat:1, interval:chr2], [type:normal, bam:sampleC_rep1_normal.bam], [type:tumor, bam:sampleC_rep1_tumor.bam]]
+[[id:sampleB, repeat:1, interval:chr3], [type:normal, bam:sampleC_rep1_normal.bam], [type:tumor, bam:sampleC_rep1_tumor.bam]]
+[[id:sampleC, repeat:1, interval:chr1], [type:normal, bam:sampleD_rep1_normal.bam], [type:tumor, bam:sampleD_rep1_tumor.bam]]
+[[id:sampleC, repeat:1, interval:chr2], [type:normal, bam:sampleD_rep1_normal.bam], [type:tumor, bam:sampleD_rep1_tumor.bam]]
+[[id:sampleC, repeat:1, interval:chr3], [type:normal, bam:sampleD_rep1_normal.bam], [type:tumor, bam:sampleD_rep1_tumor.bam]]
```

Using `map` to coerce your data into the correct structure can be tricky, but it's crucial for splitting and grouping effectively.

@@ -886,8 +975,6 @@ In this section, you've learned:

- **Spreading samples over intervals**: How to use `combine` to repeat samples over intervals

-#TODO: Suggestion, tidy up data here instead of at the end? (i.e. section 5.2)
-
### 5. Aggregating samples

In the previous section, we learned how to split a samplesheet and filter the normal and tumor samples. But this only covers a single type of joining. What if we want to group samples by a specific attribute? For example, instead of joining matched normal-tumor pairs, we might want to process all samples from "sampleA" together regardless of their type. This pattern is common in bioinformatics workflows where you may want to process related samples separately for efficiency reasons before comparing or combining the results at the end.
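Before we apply this to our samplesheet, here is a toy sketch of the idea with invented values. By default, `groupTuple` collects every tuple that shares the same first element into a single item:

```groovy title="groupTuple sketch"
workflow {
    Channel.of(
        ['sampleA', 'rep1.bam'],
        ['sampleA', 'rep2.bam'],
        ['sampleB', 'rep1.bam']
    )
        .groupTuple() // groups on the first element of each tuple
        .view()
    // Expected output:
    // [sampleA, [rep1.bam, rep2.bam]]
    // [sampleB, [rep1.bam]]
}
```

With that picture in mind, let's build a real grouping key for our samples.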
@@ -962,24 +1049,27 @@ nextflow run main.nf ```console title="View grouped samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [suspicious_cantor] DSL2 - revision: bb6b28c9d4 +Launching `main.nf` [loving_escher] DSL2 - revision: 3adccba898 -[[id:sampleA, interval:chr1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]] -[[id:sampleA, interval:chr2], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]] -[[id:sampleA, interval:chr3], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam]] -[[id:sampleA, interval:chr1], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]] -[[id:sampleA, interval:chr2], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]] -[[id:sampleA, interval:chr3], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]] -[[id:sampleB, interval:chr1], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] -[[id:sampleB, interval:chr2], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] -[[id:sampleB, interval:chr3], [id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam], [id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]] -[[id:sampleC, interval:chr1], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] -[[id:sampleC, interval:chr2], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] -[[id:sampleC, interval:chr3], [id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam], [id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]] +[[id:sampleA, interval:chr1], [type:normal, bam:sampleA_rep1_normal.bam], [type:tumor, bam:sampleA_rep1_tumor.bam]] +[[id:sampleA, interval:chr2], [type:normal, bam:sampleA_rep1_normal.bam], [type:tumor, bam:sampleA_rep1_tumor.bam]] +[[id:sampleA, interval:chr3], [type:normal, bam:sampleA_rep1_normal.bam], [type:tumor, bam:sampleA_rep1_tumor.bam]] +[[id:sampleA, interval:chr1], [type:normal, bam:sampleB_rep1_normal.bam], [type:tumor, bam:sampleB_rep1_tumor.bam]] +[[id:sampleA, interval:chr2], [type:normal, bam:sampleB_rep1_normal.bam], [type:tumor, bam:sampleB_rep1_tumor.bam]] +[[id:sampleA, interval:chr3], [type:normal, bam:sampleB_rep1_normal.bam], [type:tumor, bam:sampleB_rep1_tumor.bam]] +[[id:sampleB, interval:chr1], [type:normal, bam:sampleC_rep1_normal.bam], [type:tumor, bam:sampleC_rep1_tumor.bam]] +[[id:sampleB, interval:chr2], [type:normal, bam:sampleC_rep1_normal.bam], [type:tumor, bam:sampleC_rep1_tumor.bam]] +[[id:sampleB, interval:chr3], [type:normal, bam:sampleC_rep1_normal.bam], [type:tumor, bam:sampleC_rep1_tumor.bam]] +[[id:sampleC, interval:chr1], [type:normal, bam:sampleD_rep1_normal.bam], [type:tumor, bam:sampleD_rep1_tumor.bam]] +[[id:sampleC, interval:chr2], [type:normal, bam:sampleD_rep1_normal.bam], [type:tumor, bam:sampleD_rep1_tumor.bam]] +[[id:sampleC, interval:chr3], [type:normal, bam:sampleD_rep1_normal.bam], 
[type:tumor, bam:sampleD_rep1_tumor.bam]] ``` We can see that we have successfully isolated the `id` and `interval` fields, but not grouped the samples yet. +!!! note +We are discarding the `replicate` field here. This is because we don't need it for further downstream processing. After completing this tutorial, see if you can include it without affecting the later grouping! + Let's now group the samples by this new grouping element, using the [`groupTuple` operator](https://www.nextflow.io/docs/latest/operator.html#grouptuple). _Before:_ @@ -1022,17 +1112,17 @@ nextflow run main.nf ```console title="View grouped samples" N E X T F L O W ~ version 24.10.5 -Launching `main.nf` [desperate_fourier] DSL2 - revision: 3b18d673f3 +Launching `main.nf` [festering_almeida] DSL2 - revision: 78988949e3 -[[id:sampleA, interval:chr1], [[id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam]], [[id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]]] -[[id:sampleA, interval:chr2], [[id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam]], [[id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]]] -[[id:sampleA, interval:chr3], [[id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam], [id:sampleA, repeat:2, type:normal, bam:sampleB_rep1_normal.bam]], [[id:sampleA, repeat:1, type:tumor, bam:sampleA_rep1_tumor.bam], [id:sampleA, repeat:2, type:tumor, bam:sampleB_rep1_tumor.bam]]] -[[id:sampleB, interval:chr1], [[id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam]], [[id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]] -[[id:sampleB, interval:chr2], [[id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam]], [[id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]] -[[id:sampleB, interval:chr3], [[id:sampleB, repeat:1, type:normal, bam:sampleC_rep1_normal.bam]], [[id:sampleB, repeat:1, type:tumor, bam:sampleC_rep1_tumor.bam]]] -[[id:sampleC, interval:chr1], [[id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam]], [[id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]] -[[id:sampleC, interval:chr2], [[id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam]], [[id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]] -[[id:sampleC, interval:chr3], [[id:sampleC, repeat:1, type:normal, bam:sampleD_rep1_normal.bam]], [[id:sampleC, repeat:1, type:tumor, bam:sampleD_rep1_tumor.bam]]] +[[id:sampleA, interval:chr1], [[type:normal, bam:sampleA_rep1_normal.bam], [type:normal, bam:sampleB_rep1_normal.bam]], [[type:tumor, bam:sampleA_rep1_tumor.bam], [type:tumor, bam:sampleB_rep1_tumor.bam]]] +[[id:sampleA, interval:chr2], [[type:normal, bam:sampleA_rep1_normal.bam], [type:normal, bam:sampleB_rep1_normal.bam]], [[type:tumor, bam:sampleA_rep1_tumor.bam], [type:tumor, bam:sampleB_rep1_tumor.bam]]] +[[id:sampleA, interval:chr3], [[type:normal, bam:sampleA_rep1_normal.bam], [type:normal, bam:sampleB_rep1_normal.bam]], [[type:tumor, bam:sampleA_rep1_tumor.bam], [type:tumor, bam:sampleB_rep1_tumor.bam]]] +[[id:sampleB, interval:chr1], [[type:normal, bam:sampleC_rep1_normal.bam]], [[type:tumor, bam:sampleC_rep1_tumor.bam]]] +[[id:sampleB, interval:chr2], [[type:normal, bam:sampleC_rep1_normal.bam]], [[type:tumor, bam:sampleC_rep1_tumor.bam]]] +[[id:sampleB, 
interval:chr3], [[type:normal, bam:sampleC_rep1_normal.bam]], [[type:tumor, bam:sampleC_rep1_tumor.bam]]]
+[[id:sampleC, interval:chr1], [[type:normal, bam:sampleD_rep1_normal.bam]], [[type:tumor, bam:sampleD_rep1_tumor.bam]]]
+[[id:sampleC, interval:chr2], [[type:normal, bam:sampleD_rep1_normal.bam]], [[type:tumor, bam:sampleD_rep1_tumor.bam]]]
+[[id:sampleC, interval:chr3], [[type:normal, bam:sampleD_rep1_normal.bam]], [[type:tumor, bam:sampleD_rep1_tumor.bam]]]
```

Note our data has changed structure. What was previously a list of tuples is now a list of lists of tuples. This is because when we use `groupTuple`, Nextflow creates a new list for each group. This is important to remember when trying to handle the data downstream.

@@ -1042,50 +1132,47 @@ It's possible to use a simpler data structure than this, by separating our the s

-# 5.2. Reduce duplication of data
+# 5.2. Reorganise data into more efficient structure

-We have a lot of duplicated data in our workflow. Each item in the grouped sample repeats the `id` and `interval` fields. Since this information is available in the metamap, let's just save it once. As a reminder, our data is structured like this:
+Let's consider the inputs to a typical Nextflow process. Generally, inputs can be in the form of values or files. In this example, we have a set of values for sample information (`id` and `interval`) and a set of files for sequencing data (`normal` and `tumor`). The `input` block of a process might look like this:
+
+```groovy title="main.nf"
+    input:
+    tuple val(sampleInfo), path(normalBam), path(tumorBam)
+```
+
+Currently, our data structure isn't optimal for this. We have a tuple where the first element is a map and the second and third elements are lists of maps. Here is the current data structure:

```groovy
[
-    {
-        "id": "sampleC",
-        "interval": "chr3"
-    },
-    [
-        {
-            "id": "sampleC",
-            "repeat": "1",
-            "type": "normal",
-            "bam": "sampleC_rep1_normal.bam"
-        }
-    ],
-    [
-        {
-            "id": "sampleC",
-            "repeat": "1",
-            "type": "tumor",
-            "bam": "sampleC_rep1_tumor.bam"
-        }
-    ]
+    [id:sampleA, interval:chr1],
+    [[type:normal, bam:sampleA_rep1_normal.bam], [type:normal, bam:sampleB_rep1_normal.bam]],
+    [[type:tumor, bam:sampleA_rep1_tumor.bam], [type:tumor, bam:sampleB_rep1_tumor.bam]]
+],
+[
+    [id:sampleB, interval:chr1],
+    [[type:normal, bam:sampleC_rep1_normal.bam]],
+    [[type:tumor, bam:sampleC_rep1_tumor.bam]]
]
```

-We could parse the data after grouping to remove the duplication, but this requires us to handle all of the outputs. Instead, we can parse the data before grouping, which will mean the results are never included in the first place.
+What we need is to flatten the data structure so that the second and third elements are lists of BAM files - essentially removing the `type` field and simplifying the structure.

-In the same `map` operator where we isolate the `id` and `interval` fields, we can also grab the `bam` field for our sample data and _not_ include the `id` and `interval` fields.
+Once again, we'll employ the `map` operator to manipulate our data structure as it flows through the channel. This time, we'll take the lists of data associated with each BAM file and specifically extract the `bam` field, dropping the `type` field in the process.
-_Before:_
+To operate on a list of maps, we can use the `collect` method, a built-in method in Groovy which iterates over each element in the list and applies a closure to it. You can see examples in the Nextflow documentation [here](https://www.nextflow.io/docs/latest/script.html#closures). In our case, we'll extract the `bam` field from each map and return a list of just BAM file paths. Our closure will look like this:

-```groovy title="main.nf" linenums="27"
-    ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor ->
-        [
-            grouping_key.subMap('id', 'interval'),
-            normal,
-            tumor
-        ]
+```groovy
+{ bam_data -> bam_data.bam }
+```

-    }
+Note this is conceptually similar to, but distinct from, the Nextflow [`collect` operator](https://www.nextflow.io/docs/latest/reference/operator.html#collect). The `collect` method takes a closure that will be applied to each element in the list. In this case, we want to extract the `bam` field from each map.
+
+Let's append our map to the end of our pipeline and show the resulting data structure:
+
+_Before:_
+
+```groovy title="main.nf" linenums="38"
    .groupTuple()
    .view()
}
```

_After:_

-```groovy title="main.nf" linenums="27"
-    ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor ->
+```groovy title="main.nf" linenums="38"
+    .groupTuple()
+    .map { sample_info, normal, tumor ->
        [
-            grouping_key.subMap('id', 'interval'),
-            normal.subMap("bam"),
-            tumor.subMap("bam")
+            sample_info,
+            normal.collect { bam_data -> bam_data.bam },
+            tumor.collect { bam_data -> bam_data.bam }
        ]
-
    }
-    .groupTuple()
    .view()
}
```

```bash title="View flattened samples"
nextflow run main.nf
```

```console title="View flattened samples"
+
 N E X T F L O W ~ version 24.10.5

Launching `main.nf` [compassionate_moriondo] DSL2 - revision: 71a8c35ed9

[[id:sampleA, interval:chr1], [sampleA_rep1_normal.bam, sampleB_rep1_normal.bam], [sampleA_rep1_tumor.bam, sampleB_rep1_tumor.bam]]
[[id:sampleA, interval:chr2], [sampleA_rep1_normal.bam, sampleB_rep1_normal.bam], [sampleA_rep1_tumor.bam, sampleB_rep1_tumor.bam]]
[[id:sampleA, interval:chr3], [sampleA_rep1_normal.bam, sampleB_rep1_normal.bam], [sampleA_rep1_tumor.bam, sampleB_rep1_tumor.bam]]
[[id:sampleB, interval:chr1], [sampleC_rep1_normal.bam], [sampleC_rep1_tumor.bam]]
[[id:sampleB, interval:chr2],
[sampleC_rep1_normal.bam], [sampleC_rep1_tumor.bam]]
+[[id:sampleB, interval:chr3], [sampleC_rep1_normal.bam], [sampleC_rep1_tumor.bam]]
+[[id:sampleC, interval:chr1], [sampleD_rep1_normal.bam], [sampleD_rep1_tumor.bam]]
+[[id:sampleC, interval:chr2], [sampleD_rep1_normal.bam], [sampleD_rep1_tumor.bam]]
+[[id:sampleC, interval:chr3], [sampleD_rep1_normal.bam], [sampleD_rep1_tumor.bam]]
```

-Now we have a much cleaner output. We can see that the `id` and `interval` fields are only included once, and the `bam` field is included in the sample data.
+Note how the channel is now structured as a 3-part tuple:
+
+- `sample_info` is a map of sample information
+- `normal` is a list of BAM file paths
+- `tumor` is a list of BAM file paths
+
+`groupTuple` is a powerful operator but can generate complex data structures. It's important to understand how the data structure changes as it flows through the pipeline so you can manipulate it as needed. Using a `map` at the end of a pipeline helps refine the output into a structure that fits our processes' inputs.
+
+!!! exercise
+
+    Can you manipulate the data earlier in the pipeline to avoid the need for the final `map`?
+
+    ??? solution
+        If we parse the data right at the start of our pipeline to _only_ include the `bam` field, we can avoid passing the `type` field through the pipeline which makes our entire pipeline cleaner while retaining the same functionality:
+
+        _Before:_
+
+        ```groovy title="main.nf" linenums="1"
+        workflow {
+            getSampleIdAndReplicate = { sample ->
+                [
+                    sample.subMap(['id', 'repeat']),
+                    sample.subMap(['type', 'bam'])
+                ]
+            }
+        ```
+
+        _After:_
+
+        ```groovy title="main.nf" linenums="1"
+        workflow {
+            getSampleIdAndReplicate = { sample ->
+                [
+                    sample.subMap(['id', 'repeat']),
+                    sample.bam
+                ]
+            }
+        ```
+
+        Once we have done this we can remove the `map` operator from the end of the pipeline:
+
+        _Before:_
+
+        ```groovy title="main.nf" linenums="38"
+        .groupTuple()
+        .map { sample_info, normal, tumor ->
+            [
+                sample_info,
+                normal.collect { bam_data -> bam_data.bam },
+                tumor.collect { bam_data -> bam_data.bam }
+            ]
+        }
+        .view()
+        ```
+
+        _After:_
+
+        ```groovy title="main.nf" linenums="38"
+        .groupTuple()
+        .view()
+        ```
+
+        Sometimes parsing data earlier in the pipeline is the right choice to avoid complicated code. In this tutorial, we wanted to take you step-by-step, which meant we had to do things the long way.

### Takeaway

In this section, you've learned:

- **Grouping samples**: How to use `groupTuple` to group samples by a specific attribute
-
-You now have a workflow that can split a samplesheet, filter the normal and tumor samples, join them together by sample ID and replicate number, then group them by `id`.
+- **Flattening data structure**: How to use `map` to flatten the data structure
+
+You now have a workflow that can split a samplesheet, filter the normal and tumor samples, join them together by sample ID and replicate number, then group them by `id`.
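+If you want to double-check a grouping like this, remember that `transpose` undoes `groupTuple`: appending it to the grouped channel should give you back one item per replicate. A quick sketch, with the output we would expect shown as comments:
+
+```groovy title="Undoing the grouping"
+ch_grouped_samples
+    .transpose() // pairs the grouped lists back up: one item per replicate
+    .view()
+// [[id:sampleA, interval:chr1], sampleA_rep1_normal.bam, sampleA_rep1_tumor.bam]
+// [[id:sampleA, interval:chr1], sampleB_rep1_normal.bam, sampleB_rep1_tumor.bam]
+// ...
+```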
## Summary From c35c50fbc64351f6521747d8e0b7a15b7d92f81a Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Thu, 10 Apr 2025 11:19:48 +0100 Subject: [PATCH 16/36] revert main.nf --- side-quests/splitting_and_grouping/main.nf | 32 ---------------------- 1 file changed, 32 deletions(-) diff --git a/side-quests/splitting_and_grouping/main.nf b/side-quests/splitting_and_grouping/main.nf index d34ae77032..77cc26224d 100644 --- a/side-quests/splitting_and_grouping/main.nf +++ b/side-quests/splitting_and_grouping/main.nf @@ -1,35 +1,3 @@ workflow { - getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map ( getSampleIdAndReplicate ) - ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == "tumor" } - .map ( getSampleIdAndReplicate ) - ch_joined_samples = ch_normal_samples - .join(ch_tumor_samples) - ch_intervals = Channel.of('chr1', 'chr2', 'chr3') - - ch_combined_samples = ch_joined_samples.combine(ch_intervals) - .map { grouping_key, normal, tumor, interval -> - [ - grouping_key + [interval: interval], - normal, - tumor - ] - - } - - ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> - [ - grouping_key.subMap('id', 'interval'), - normal.subMap("bam"), - tumor.subMap("bam") - ] - - } - .groupTuple() - .view() } From 3e90e13956f197fe61cf729b261804f668ba8604 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Thu, 10 Apr 2025 11:20:34 +0100 Subject: [PATCH 17/36] Sort out notes --- docs/side_quests/splitting-and-grouping.md | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 149020c072..b7acd9bec6 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -78,7 +78,8 @@ workflow { ``` !!! note -Throughout this tutorial, we'll use the `ch_` prefix for all channel variables to clearly indicate they are Nextflow channels. + + Throughout this tutorial, we'll use the `ch_` prefix for all channel variables to clearly indicate they are Nextflow channels. We can use the [`splitCsv` operator](https://www.nextflow.io/docs/latest/operator.html#splitcsv) to split the samplesheet into a channel of maps, where each map represents a row from the CSV file. @@ -486,7 +487,8 @@ It's a little hard to tell because it's so wide, but you should be able to see t - `tumor_sample`: The tumor sample including type, replicate and path to bam file !!! warning -The `join` operator will discard any un-matched tuples. In this example, we made sure all samples were matched for tumor and normal but if this is not true you must use the parameter `remainder: true` to keep the unmatched tuples. Check the [documentation](https://www.nextflow.io/docs/latest/operator.html#join) for more details. + + The `join` operator will discard any un-matched tuples. In this example, we made sure all samples were matched for tumor and normal but if this is not true you must use the parameter `remainder: true` to keep the unmatched tuples. Check the [documentation](https://www.nextflow.io/docs/latest/operator.html#join) for more details. ### Takeaway @@ -699,7 +701,8 @@ _After:_ ``` !!! 
note
-The `map` operator has switched from using `{ }` to using `( )` to pass the closure as an argument. This is because the `map` operator expects a closure as an argument and `{ }` is used to define an anonymous closure. When calling a named closure, use the `( )` syntax.
+
+    The `map` operator has switched from using `{ }` to using `( )` to pass the closure as an argument. This is because the `map` operator expects a closure as an argument and `{ }` is used to define an anonymous closure. When calling a named closure, use the `( )` syntax.

```bash title="View normal and tumor samples"
nextflow run main.nf
```

@@ -1068,7 +1071,8 @@ ...

We can see that we have successfully isolated the `id` and `interval` fields, but not grouped the samples yet.

!!! note
-We are discarding the `replicate` field here. This is because we don't need it for further downstream processing. After completing this tutorial, see if you can include it without affecting the later grouping!
+
+    We are discarding the `replicate` field here. This is because we don't need it for further downstream processing. After completing this tutorial, see if you can include it without affecting the later grouping!

Let's now group the samples by this new grouping element, using the [`groupTuple` operator](https://www.nextflow.io/docs/latest/operator.html#grouptuple).

@@ -1130,7 +1134,8 @@ Note our data has changed structure. What was previously a list of tuples is now

It's possible to use a simpler data structure than this, by separating out the sample information from the sequencing data. We generally refer to this as a `metamap`, but this will be covered in a later side quest. For now, you should just understand that we can group up samples using the `groupTuple` operator and that the data structure will change as a result.

!!! note
-[`transpose`](https://www.nextflow.io/docs/latest/reference/operator.html#transpose) is the opposite of groupTuple. It unpacks the items in a channel and flattens them. Try and add `transpose` and undo the grouping we performed above!
+
+    [`transpose`](https://www.nextflow.io/docs/latest/reference/operator.html#transpose) is the opposite of groupTuple. It unpacks the items in a channel and flattens them. Try and add `transpose` and undo the grouping we performed above!

-# 5.2.
Reorganise data into more efficient structure From 9508b0dde2d2b133778bcc68c483545783fa5dd5 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Thu, 10 Apr 2025 11:23:38 +0100 Subject: [PATCH 18/36] Re-order side quests slightly for more logical order, from least to most complex --- mkdocs.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mkdocs.yml b/mkdocs.yml index 4fec7d0fae..1e1e27bb89 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -37,11 +37,11 @@ nav: - nf4_science/rnaseq/next_steps.md - Side Quests: - side_quests/index.md + - side_quests/workflows_of_workflows.md + - side_quests/splitting-and-grouping.md - side_quests/orientation.md - side_quests/nf-core.md - side_quests/nf-test.md - - side_quests/workflows_of_workflows.md - - side_quests/splitting-and-grouping.md - Fundamentals Training: - basic_training/index.md - basic_training/orientation.md From 019fb4080ba5aaf6ecd7d3237e9cc0fafefdf65e Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Thu, 10 Apr 2025 11:29:15 +0100 Subject: [PATCH 19/36] remove intervals file --- side-quests/splitting_and_grouping/data/intervals.txt | 3 --- 1 file changed, 3 deletions(-) delete mode 100644 side-quests/splitting_and_grouping/data/intervals.txt diff --git a/side-quests/splitting_and_grouping/data/intervals.txt b/side-quests/splitting_and_grouping/data/intervals.txt deleted file mode 100644 index c0a1f9e3f7..0000000000 --- a/side-quests/splitting_and_grouping/data/intervals.txt +++ /dev/null @@ -1,3 +0,0 @@ -chr1 -chr2 -chr3 From ef33c16e6dca0a9247496a456f275e3246ef30dd Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Thu, 10 Apr 2025 12:00:42 +0100 Subject: [PATCH 20/36] Add earlier map to simplify data as core concept instead of challenge --- docs/side_quests/splitting-and-grouping.md | 135 ++++++++++++--------- 1 file changed, 78 insertions(+), 57 deletions(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index b7acd9bec6..465be6e245 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -1,6 +1,6 @@ # Splitting and Grouping -Nextflow helps you work with your data in flexible ways. One of the most useful things you can do is split your data into different streams and then group related items back together. +Nextflow helps you work with your data in flexible ways. One of the most useful things you can do is split your data into different streams and then group related items back together. This capability is particularly valuable in bioinformatics workflows where you often need to process different sample types separately before combining results for comparison or joint analysis. Think of it like sorting mail: you might first separate letters by their destination, process each pile differently, and then recombine items going to the same person. In Nextflow, we use special operators to do this with our scientific data. @@ -1137,7 +1137,7 @@ It's possible to use a simpler data structure than this, by separating our the s [`transpose`](https://www.nextflow.io/docs/latest/reference/operator.html#transpose) is the opposite of groupTuple. It unpacks the items in a channel and flattens them. Try and add `transpose` and undo the grouping we performed above! -# 5.2. Reorganise data into more efficient structure +# 5.2. 
Reorganise the data Let's consider the inputs to a typical Nextflow process. Generally, inputs can be in the form of values or files. In this example, we have a set of values for sample information (`id` and `interval`) and a set of files for sequencing data (`normal` and `tumor`). The `input` block of a process might look like this: @@ -1227,61 +1227,82 @@ Note how the channel is now structured as a 3-part tuple: `groupTuple` is a powerful operator but can generate complex data structures. It's important to understand how the data structure changes as it flows through the pipeline so you can manipulate it as needed. Using a `map` at the end of a pipeline helps refine the output into a structure that fits our processes pipeline. -!!! exercise - - Can you manipulate the data earlier in the pipeline to avoid the need for the final `map`? - - ??? solution - If we parse the data right at the start of our pipeline to _only_ include the `bam` field, we can avoid passing the `type` field through the pipeline which makes our entire pipeline cleaner while retaining the same functionality: - - _Before:_ - - ```groovy title="main.nf" linenums="1" - workflow { - getSampleIdAndReplicate = { sample -> - [ - sample.subMap(['id', 'repeat']), - sample.subMap(['id', 'bam']) - ] - } - ``` - - _After:_ - - ```groovy title="main.nf" linenums="1" - workflow { - getSampleIdAndReplicate = { sample -> - [ - sample.subMap(['id', 'repeat']), - sample.bam - ] - } - ``` - - Once we have done this we can remove the `map` operator from the end of the pipeline: - - _Before:_ - - ```groovy title="main.nf" linenums="38" - .groupTuple() - .map { sample_info, normal, tumor -> - [ - sample_info, - normal.collect { bam_data -> bam_data.bam }, - tumor.collect { bam_data -> bam_data.bam } - ] - } - .view() - ``` - - _After:_ - - ```groovy title="main.nf" linenums="38" - .groupTuple() - .view() - ``` - - Sometimes parsing data earlier in the pipeline is the right choice to avoid complicated code. In this tutorial, we wanted to take you step-by-step which meant we had to do things the long way. +## 5.3. Simplify the data + +One issue we have faced in this pipeline is that we have a moderately complicated data structure which we have had to coerce throughout the pipeline. What if we could simplify it at the start? Then we would only handle the relevant fields in the pipeline and avoid the need for the final `map` operator. + +If we parse the data right at the start of our pipeline to _only_ include the `bam` field, we can avoid passing the `type` field through the pipeline which makes our entire pipeline cleaner while retaining the same functionality: + +_Before:_ + +```groovy title="main.nf" linenums="1" +workflow { + getSampleIdAndReplicate = { sample -> + [ + sample.subMap(['id', 'repeat']), + sample.subMap(['type', 'bam']) + ] + } +``` + +_After:_ + +```groovy title="main.nf" linenums="1" +workflow { + getSampleIdAndReplicate = { sample -> + [ + sample.subMap(['id', 'repeat']), + sample.bam + ] + } +``` + +A reminder, this will select only the BAM files once we have separated the channels into normal and tumor. We are losing the `type` field, but we know which samples are normal and tumor because they have been filtered and the channel should only contain one type per sample. 
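+With this change in place, each item in the filtered channels is just a grouping key plus a bare file name. For example, the first normal sample would now look roughly like this (values shown for illustration):
+
+```groovy
+[[id: 'sampleA', repeat: '1'], 'sampleA_rep1_normal.bam']
+```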
Once we have done this we can remove the `map` operator from the end of the pipeline: + +_Before:_ + +```groovy title="main.nf" linenums="38" + .groupTuple() + .map { sample_info, normal, tumor -> + [ + sample_info, + normal.collect { bam_data -> bam_data.bam }, + tumor.collect { bam_data -> bam_data.bam } + ] + } + .view() +} +``` + +_After:_ + +```groovy title="main.nf" linenums="38" + .groupTuple() + .view() +} +``` + +Sometimes parsing data earlier in the pipeline is the right choice to avoid complicated code. + +```bash title="View flattened samples" +nextflow run main.nf +``` + +```console title="View flattened samples" + N E X T F L O W ~ version 24.10.5 + +Launching `main.nf` [reverent_angela] DSL2 - revision: 656a31b305 + +[[id:sampleA, interval:chr1], [sampleA_rep1_normal.bam, sampleB_rep1_normal.bam], [sampleA_rep1_tumor.bam, sampleB_rep1_tumor.bam]] +[[id:sampleA, interval:chr2], [sampleA_rep1_normal.bam, sampleB_rep1_normal.bam], [sampleA_rep1_tumor.bam, sampleB_rep1_tumor.bam]] +[[id:sampleA, interval:chr3], [sampleA_rep1_normal.bam, sampleB_rep1_normal.bam], [sampleA_rep1_tumor.bam, sampleB_rep1_tumor.bam]] +[[id:sampleB, interval:chr1], [sampleC_rep1_normal.bam], [sampleC_rep1_tumor.bam]] +[[id:sampleB, interval:chr2], [sampleC_rep1_normal.bam], [sampleC_rep1_tumor.bam]] +[[id:sampleB, interval:chr3], [sampleC_rep1_normal.bam], [sampleC_rep1_tumor.bam]] +[[id:sampleC, interval:chr1], [sampleD_rep1_normal.bam], [sampleD_rep1_tumor.bam]] +[[id:sampleC, interval:chr2], [sampleD_rep1_normal.bam], [sampleD_rep1_tumor.bam]] +[[id:sampleC, interval:chr3], [sampleD_rep1_normal.bam], [sampleD_rep1_tumor.bam]] +``` ### Takeaway From 8572a6fca62bc881eccc91c62f2830d0194aea85 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Thu, 10 Apr 2025 12:09:26 +0100 Subject: [PATCH 21/36] Tidy code up a bit --- docs/side_quests/splitting-and-grouping.md | 85 ++++++++-------------- 1 file changed, 32 insertions(+), 53 deletions(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 465be6e245..52bed4e253 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -653,26 +653,28 @@ To do so, first we define the closure as a new variable: _Before:_ -```groovy title="main.nf" linenums="1" -workflow { +```groovy title="main.nf" linenums="2" ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) + ch_normal_samples = ch_samplesheet ``` _After:_ -```groovy title="main.nf" linenums="1" -workflow { - getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } +```groovy title="main.nf" linenums="2" ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) + + getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } + + ch_normal_samples = ch_samplesheet ``` We have taken the map we used previously and defined it as a named variable we can call later. 
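Because the closure is an ordinary value, you can sanity-check it against a single row before wiring it into a channel. For example (the row below is invented for illustration):

```groovy title="Testing the closure"
def row = [id: 'sampleA', repeat: '1', type: 'normal', bam: 'sampleA_rep1_normal.bam']

println(getSampleIdAndReplicate(row))
// [[id:sampleA, repeat:1], [id:sampleA, repeat:1, type:normal, bam:sampleA_rep1_normal.bam]]
```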
Let's implement it in our workflow: _Before:_ -```groovy title="main.nf" linenums="5" +```groovy title="main.nf" linenums="7" ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .map { sample -> [ @@ -691,13 +693,15 @@ _Before:_ _After:_ -```groovy title="main.nf" linenums="5" +```groovy title="main.nf" linenums="7" ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .map ( getSampleIdAndReplicate ) + ch_tumor_samples = ch_samplesheet .filter { sample -> sample.type == "tumor" } .map ( getSampleIdAndReplicate ) + ``` !!! note @@ -750,45 +754,19 @@ Since the `id` and `repeat` fields are available in the grouping key, let's remo _Before:_ -```groovy title="main.nf" linenums="15" -workflow { +```groovy title="main.nf" linenums="5" getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map ( getSampleIdAndReplicate ) - ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == "tumor" } - .map ( getSampleIdAndReplicate ) - ch_joined_samples = ch_normal_samples - .join(ch_tumor_samples) - ch_joined_samples.view() -} ``` _After:_ -```groovy title="main.nf" linenums="15" -workflow { +```groovy title="main.nf" linenums="5" getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample.subMap(['type', 'bam']) ] } - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map ( getSampleIdAndReplicate ) - ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == "tumor" } - .map ( getSampleIdAndReplicate ) - ch_joined_samples = ch_normal_samples - .join(ch_tumor_samples) - ch_joined_samples.view() -} ``` Now, when the closure returns the tuple, the first element is the `id` and `repeat` fields and the second element is the `type` and `bam` fields. We have effectively removed the `id` and `repeat` fields from the sample data and uniquely store them in the grouping key. This approach eliminates redundancy while maintaining all necessary information. @@ -835,14 +813,17 @@ Let's start by creating a channel of intervals. To keep life simple, we will jus _Before:_ -```groovy title="main.nf" linenums="15" +```groovy title="main.nf" linenums="21" + .join(ch_tumor_samples) ch_joined_samples.view() } ``` _After:_ -```groovy title="main.nf" linenums="15" +```groovy title="main.nf" linenums="21" + .join(ch_tumor_samples) + ch_intervals = Channel.of('chr1', 'chr2', 'chr3') } ``` @@ -851,14 +832,14 @@ Now remember, we want to repeat each sample for each interval. 
This is sometimes _Before:_ -```groovy title="main.nf" linenums="15" +```groovy title="main.nf" linenums="23" ch_intervals = Channel.of('chr1', 'chr2', 'chr3') } ``` _After:_ -```groovy title="main.nf" linenums="15" +```groovy title="main.nf" linenums="23" ch_intervals = Channel.of('chr1', 'chr2', 'chr3') ch_combined_samples = ch_joined_samples.combine(ch_intervals) @@ -899,7 +880,7 @@ We can use the `map` operator to tidy and refactor our sample data so it's easie _Before:_ -```groovy title="main.nf" linenums="19" +```groovy title="main.nf" linenums="25" ch_combined_samples = ch_joined_samples.combine(ch_intervals) .view() } @@ -907,7 +888,7 @@ _Before:_ _After:_ -```groovy title="main.nf" linenums="19" +```groovy title="main.nf" linenums="25" ch_combined_samples = ch_joined_samples.combine(ch_intervals) .map { grouping_key, normal, tumor, interval -> [ @@ -1004,7 +985,7 @@ We can reuse the `subMap` method from before to isolate our `id` and `interval` _Before:_ -```groovy title="main.nf" linenums="19" +```groovy title="main.nf" linenums="25" ch_combined_samples = ch_joined_samples.combine(ch_intervals) .map { grouping_key, normal, tumor, interval -> [ @@ -1020,7 +1001,7 @@ _Before:_ _After:_ -```groovy title="main.nf" linenums="19" +```groovy title="main.nf" linenums="25" ch_combined_samples = ch_joined_samples.combine(ch_intervals) .map { grouping_key, normal, tumor, interval -> [ @@ -1078,7 +1059,7 @@ Let's now group the samples by this new grouping element, using the [`groupTuple _Before:_ -```groovy title="main.nf" linenums="30" +```groovy title="main.nf" linenums="35" ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> [ grouping_key.subMap('id', 'interval'), @@ -1093,7 +1074,7 @@ _Before:_ _After:_ -```groovy title="main.nf" linenums="29" +```groovy title="main.nf" linenums="35" ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> [ grouping_key.subMap('id', 'interval'), @@ -1177,7 +1158,7 @@ Let's append our map to the end of our pipeline and show the resulting data stru _Before:_ -```groovy title="main.nf" linenums="38" +```groovy title="main.nf" linenums="42" .groupTuple() .view() } @@ -1185,7 +1166,7 @@ _Before:_ _After:_ -```groovy title="main.nf" linenums="38" +```groovy title="main.nf" linenums="42" .groupTuple() .map { sample_info, normal, tumor -> [ @@ -1235,8 +1216,7 @@ If we parse the data right at the start of our pipeline to _only_ include the `b _Before:_ -```groovy title="main.nf" linenums="1" -workflow { +```groovy title="main.nf" linenums="5" getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), @@ -1247,8 +1227,7 @@ workflow { _After:_ -```groovy title="main.nf" linenums="1" -workflow { +```groovy title="main.nf" linenums="5" getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), @@ -1261,7 +1240,7 @@ A reminder, this will select only the BAM files once we have separated the chann _Before:_ -```groovy title="main.nf" linenums="38" +```groovy title="main.nf" linenums="43" .groupTuple() .map { sample_info, normal, tumor -> [ @@ -1276,7 +1255,7 @@ _Before:_ _After:_ -```groovy title="main.nf" linenums="38" +```groovy title="main.nf" linenums="43" .groupTuple() .view() } From 6d749c4517362e95a8ab505070e363a7ea3dc71f Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Thu, 10 Apr 2025 12:09:49 +0100 Subject: [PATCH 22/36] Fix side quest order --- mkdocs.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mkdocs.yml 
b/mkdocs.yml index 1e1e27bb89..3d74de54c0 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -37,11 +37,11 @@ nav: - nf4_science/rnaseq/next_steps.md - Side Quests: - side_quests/index.md + - side_quests/orientation.md - side_quests/workflows_of_workflows.md - side_quests/splitting-and-grouping.md - - side_quests/orientation.md - - side_quests/nf-core.md - side_quests/nf-test.md + - side_quests/nf-core.md - Fundamentals Training: - basic_training/index.md - basic_training/orientation.md From ff32156c6b904bd562725e219105bcb6565e0993 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Thu, 10 Apr 2025 12:17:13 +0100 Subject: [PATCH 23/36] Reduce examples and left shift for clarity --- docs/side_quests/splitting-and-grouping.md | 644 +++++++++------------ 1 file changed, 287 insertions(+), 357 deletions(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 52bed4e253..24549894cc 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -85,23 +85,19 @@ We can use the [`splitCsv` operator](https://www.nextflow.io/docs/latest/operato _Before:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) -} +```groovy title="main.nf" linenums="2" +ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) ``` The `header: true` option tells Nextflow to use the first row of the CSV file as the header row, which will be used as keys for the values. Let's see what Nextflow can see after reading with splitCsv. To do this, we can use the `view` operator. _After:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .view() -} +```groovy title="main.nf" linenums="2" +ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .view() ``` ```bash title="Read the samplesheet" @@ -153,23 +149,19 @@ We can use the [`filter` operator](https://www.nextflow.io/docs/latest/operator. _Before:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .view() -} +```groovy title="main.nf" linenums="2" +ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .view() ``` _After:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .filter { sample -> sample.type == 'normal' } - .view() -} +```groovy title="main.nf" linenums="2" +ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .filter { sample -> sample.type == 'normal' } + .view() ``` ```bash title="View normal samples" @@ -201,25 +193,21 @@ While useful, we are discarding the tumor samples. 
Instead, let's rewrite our pi _Before:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .filter { sample -> sample.type == 'normal' } - .view() -} +```groovy title="main.nf" linenums="2" +ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .filter { sample -> sample.type == 'normal' } + .view() ``` _After:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - ch_normal_samples.view() -} +```groovy title="main.nf" linenums="2" +ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) +ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } +ch_normal_samples.view() ``` Once again, run the pipeline to see the results: @@ -243,29 +231,25 @@ Success! We have filtered the data to only include normal samples. Note that we _Before:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - ch_normal_samples.view() -} +```groovy title="main.nf" linenums="2" +ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) +ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } +ch_normal_samples.view() ``` _After:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - ch_normal_samples.view() - ch_tumor_samples.view() -} +```groovy title="main.nf" linenums="2" +ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) +ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } +ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } +ch_normal_samples.view() +ch_tumor_samples.view() ``` ```bash title="View tumor samples" @@ -291,31 +275,27 @@ We've managed to separate out the normal and tumor samples into two different ch _Before:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - ch_normal_samples.view() - ch_tumor_samples.view() -} +```groovy title="main.nf" linenums="2" +ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) +ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } +ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } +ch_normal_samples.view() +ch_tumor_samples.view() ``` _After:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - ch_tumor_samples.view() -} +```groovy title="main.nf" 
linenums="2" +ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) +ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } +ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } +ch_tumor_samples.view() ``` ```bash title="View normal and tumor samples" @@ -378,33 +358,29 @@ To isolate the `id` field, we can use the [`map` operator](https://www.nextflow. _Before:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - ch_tumor_samples.view() -} +```groovy title="main.nf" linenums="2" +ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) +ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } +ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } +ch_tumor_samples.view() ``` _After:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map { sample -> [sample.id, sample] } - ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - .map { sample -> [sample.id, sample] } - ch_normal_samples.view() - ch_tumor_samples.view() -} +```groovy title="main.nf" linenums="2" +ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) +ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [sample.id, sample] } +ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + .map { sample -> [sample.id, sample] } +ch_normal_samples.view() +ch_tumor_samples.view() ``` ```bash title="View normal and tumor samples with ID as element 0" @@ -432,37 +408,33 @@ Once again, we will use `view` to print the joined outputs. 
_Before:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map { sample -> [sample.id, sample] } - ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - .map { sample -> [sample.id, sample] } - ch_normal_samples.view() - ch_tumor_samples.view() -} +```groovy title="main.nf" linenums="2" +ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) +ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [sample.id, sample] } +ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + .map { sample -> [sample.id, sample] } +ch_normal_samples.view() +ch_tumor_samples.view() ``` _After:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map { sample -> [sample.id, sample] } - ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - .map { sample -> [sample.id, sample] } - ch_joined_samples = ch_normal_samples - .join(ch_tumor_samples) - ch_joined_samples.view() -} +```groovy title="main.nf" linenums="2" +ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) +ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [sample.id, sample] } +ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + .map { sample -> [sample.id, sample] } +ch_joined_samples = ch_normal_samples + .join(ch_tumor_samples) +ch_joined_samples.view() ``` ```bash title="View normal and tumor samples" @@ -509,46 +481,32 @@ Let's start by creating a new joining key. 
We can do this in the same way as bef _Before:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map { sample -> [sample.id, sample] } - ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - .map { sample -> [sample.id, sample] } - ch_joined_samples = ch_normal_samples - .join(ch_tumor_samples) - ch_joined_samples.view() -} +```groovy title="main.nf" linenums="4" +ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [sample.id, sample] } +ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + .map { sample -> [sample.id, sample] } ``` _After:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map { sample -> [ - [sample.id, sample.repeat], - sample - ] - } - ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - .map { sample -> [ - [sample.id, sample.repeat], - sample - ] - } - ch_joined_samples = ch_normal_samples - .join(ch_tumor_samples) - ch_joined_samples.view() -} +```groovy title="main.nf" linenums="4" +ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [ + [sample.id, sample.repeat], + sample + ] + } +ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + .map { sample -> [ + [sample.id, sample.repeat], + sample + ] + } ``` Now we should see the join is occurring but using both the `id` and `repeat` fields. 
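To convince yourself that `join` really matches on the whole two-element key, here is a minimal standalone sketch you can paste into a scratch script (the sample values are made up for illustration):

```groovy
workflow {
    // join matches items on their first element; Groovy lists compare
    // by value, so a [id, repeat] list works as a composite key
    ch_left  = Channel.of([['sampleA', '1'], 'normal'], [['sampleA', '2'], 'normal'])
    ch_right = Channel.of([['sampleA', '2'], 'tumor'], [['sampleA', '1'], 'tumor'])

    ch_left.join(ch_right).view()
    // emits [[sampleA, 1], normal, tumor] and [[sampleA, 2], normal, tumor]
}
```

Note that arrival order does not matter: `join` buffers items until the matching key turns up on the other channel.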
@@ -578,54 +536,40 @@ The `subMap` method takes a map and returns a new map with only the key-value pa _Before:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map { sample -> [ - [sample.id, sample.repeat], - sample - ] - } - ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - .map { sample -> [ - [sample.id, sample.repeat], - sample - ] - } - ch_joined_samples = ch_normal_samples - .join(ch_tumor_samples) - ch_joined_samples.view() -} +```groovy title="main.nf" linenums="4" +ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [ + [sample.id, sample.repeat], + sample + ] + } +ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + .map { sample -> [ + [sample.id, sample.repeat], + sample + ] + } ``` _After:_ -```groovy title="main.nf" linenums="1" -workflow { - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map { sample -> [ - sample.subMap(['id', 'repeat']), - sample - ] - } - ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - .map { sample -> [ - sample.subMap(['id', 'repeat']), - sample - ] - } - ch_joined_samples = ch_normal_samples - .join(ch_tumor_samples) - ch_joined_samples.view() -} +```groovy title="main.nf" linenums="4" +ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [ + sample.subMap(['id', 'repeat']), + sample + ] + } +ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + .map { sample -> [ + sample.subMap(['id', 'repeat']), + sample + ] + } ``` ```bash title="View normal and tumor samples" @@ -654,20 +598,20 @@ To do so, first we define the closure as a new variable: _Before:_ ```groovy title="main.nf" linenums="2" - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - ch_normal_samples = ch_samplesheet +ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) +ch_normal_samples = ch_samplesheet ``` _After:_ ```groovy title="main.nf" linenums="2" - ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) +ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) - getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } +getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } - ch_normal_samples = ch_samplesheet +ch_normal_samples = ch_samplesheet ``` We have taken the map we used previously and defined it as a named variable we can call later. 
Let's implement it in our workflow: @@ -675,32 +619,32 @@ We have taken the map we used previously and defined it as a named variable we c _Before:_ ```groovy title="main.nf" linenums="7" - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map { sample -> [ - sample.subMap(['id', 'repeat']), - sample - ] - } - ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == "tumor" } - .map { sample -> [ - sample.subMap(['id', 'repeat']), - sample - ] - } +ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [ + sample.subMap(['id', 'repeat']), + sample + ] + } +ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == "tumor" } + .map { sample -> [ + sample.subMap(['id', 'repeat']), + sample + ] + } ``` _After:_ ```groovy title="main.nf" linenums="7" - ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map ( getSampleIdAndReplicate ) +ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map ( getSampleIdAndReplicate ) - ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == "tumor" } - .map ( getSampleIdAndReplicate ) +ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == "tumor" } + .map ( getSampleIdAndReplicate ) ``` @@ -755,18 +699,18 @@ Since the `id` and `repeat` fields are available in the grouping key, let's remo _Before:_ ```groovy title="main.nf" linenums="5" - getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } +getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } ``` _After:_ ```groovy title="main.nf" linenums="5" - getSampleIdAndReplicate = { sample -> - [ - sample.subMap(['id', 'repeat']), - sample.subMap(['type', 'bam']) - ] - } +getSampleIdAndReplicate = { sample -> + [ + sample.subMap(['id', 'repeat']), + sample.subMap(['type', 'bam']) + ] + } ``` Now, when the closure returns the tuple, the first element is the `id` and `repeat` fields and the second element is the `type` and `bam` fields. We have effectively removed the `id` and `repeat` fields from the sample data and uniquely store them in the grouping key. This approach eliminates redundancy while maintaining all necessary information. @@ -814,18 +758,16 @@ Let's start by creating a channel of intervals. To keep life simple, we will jus _Before:_ ```groovy title="main.nf" linenums="21" - .join(ch_tumor_samples) - ch_joined_samples.view() -} + .join(ch_tumor_samples) +ch_joined_samples.view() ``` _After:_ ```groovy title="main.nf" linenums="21" - .join(ch_tumor_samples) + .join(ch_tumor_samples) - ch_intervals = Channel.of('chr1', 'chr2', 'chr3') -} +ch_intervals = Channel.of('chr1', 'chr2', 'chr3') ``` Now remember, we want to repeat each sample for each interval. This is sometimes referred to as the Cartesian product of the samples and intervals. We can achieve this by using the [`combine` operator](https://www.nextflow.io/docs/latest/operator.html#combine). This will take every item from channel 1 and repeat it for each item in channel 2. Let's add a combine operator to our workflow: @@ -833,18 +775,16 @@ Now remember, we want to repeat each sample for each interval. 
This is sometimes _Before:_ ```groovy title="main.nf" linenums="23" - ch_intervals = Channel.of('chr1', 'chr2', 'chr3') -} +ch_intervals = Channel.of('chr1', 'chr2', 'chr3') ``` _After:_ ```groovy title="main.nf" linenums="23" - ch_intervals = Channel.of('chr1', 'chr2', 'chr3') +ch_intervals = Channel.of('chr1', 'chr2', 'chr3') - ch_combined_samples = ch_joined_samples.combine(ch_intervals) - .view() -} +ch_combined_samples = ch_joined_samples.combine(ch_intervals) + .view() ``` Now let's run it and see what happens: @@ -881,25 +821,23 @@ We can use the `map` operator to tidy and refactor our sample data so it's easie _Before:_ ```groovy title="main.nf" linenums="25" - ch_combined_samples = ch_joined_samples.combine(ch_intervals) - .view() -} +ch_combined_samples = ch_joined_samples.combine(ch_intervals) + .view() ``` _After:_ ```groovy title="main.nf" linenums="25" - ch_combined_samples = ch_joined_samples.combine(ch_intervals) - .map { grouping_key, normal, tumor, interval -> - [ - grouping_key + [interval: interval], - normal, - tumor - ] - - } - .view() -} +ch_combined_samples = ch_joined_samples.combine(ch_intervals) + .map { grouping_key, normal, tumor, interval -> + [ + grouping_key + [interval: interval], + normal, + tumor + ] + + } + .view() ``` Wait? What did we do here? Let's go over it piece by piece. @@ -986,42 +924,40 @@ We can reuse the `subMap` method from before to isolate our `id` and `interval` _Before:_ ```groovy title="main.nf" linenums="25" - ch_combined_samples = ch_joined_samples.combine(ch_intervals) - .map { grouping_key, normal, tumor, interval -> - [ - grouping_key + [interval: interval], - normal, - tumor - ] - - } - .view() -} +ch_combined_samples = ch_joined_samples.combine(ch_intervals) + .map { grouping_key, normal, tumor, interval -> + [ + grouping_key + [interval: interval], + normal, + tumor + ] + + } + .view() ``` _After:_ ```groovy title="main.nf" linenums="25" - ch_combined_samples = ch_joined_samples.combine(ch_intervals) - .map { grouping_key, normal, tumor, interval -> - [ - grouping_key + [interval: interval], - normal, - tumor - ] - - } - - ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> - [ - grouping_key.subMap('id', 'interval'), - normal, - tumor - ] - - } - .view() -} +ch_combined_samples = ch_joined_samples.combine(ch_intervals) + .map { grouping_key, normal, tumor, interval -> + [ + grouping_key + [interval: interval], + normal, + tumor + ] + + } + +ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> + [ + grouping_key.subMap('id', 'interval'), + normal, + tumor + ] + + } + .view() ``` Let's run it again and check the channel contents: @@ -1060,32 +996,30 @@ Let's now group the samples by this new grouping element, using the [`groupTuple _Before:_ ```groovy title="main.nf" linenums="35" - ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> - [ - grouping_key.subMap('id', 'interval'), - normal, - tumor - ] - - } - .view() -} +ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> + [ + grouping_key.subMap('id', 'interval'), + normal, + tumor + ] + + } + .view() ``` _After:_ ```groovy title="main.nf" linenums="35" - ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> - [ - grouping_key.subMap('id', 'interval'), - normal, - tumor - ] - - } - .groupTuple() - .view() -} +ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> + [ + grouping_key.subMap('id', 'interval'), + normal, + tumor + ] + + 
} + .groupTuple() + .view() ``` Simple, huh? We just added a single line of code. Let's see what happens when we run it: @@ -1159,24 +1093,22 @@ Let's append our map to the end of our pipeline and show the resulting data stru _Before:_ ```groovy title="main.nf" linenums="42" - .groupTuple() - .view() -} +.groupTuple() +.view() ``` _After:_ ```groovy title="main.nf" linenums="42" - .groupTuple() - .map { sample_info, normal, tumor -> - [ - sample_info, - normal.collect { bam_data -> bam_data.bam }, - tumor.collect { bam_data -> bam_data.bam } - ] - } - .view() +.groupTuple() +.map { sample_info, normal, tumor -> + [ + sample_info, + normal.collect { bam_data -> bam_data.bam }, + tumor.collect { bam_data -> bam_data.bam } + ] } +.view() ``` ```bash title="View flattened samples" @@ -1217,23 +1149,23 @@ If we parse the data right at the start of our pipeline to _only_ include the `b _Before:_ ```groovy title="main.nf" linenums="5" - getSampleIdAndReplicate = { sample -> - [ - sample.subMap(['id', 'repeat']), - sample.subMap(['type', 'bam']) - ] - } +getSampleIdAndReplicate = { sample -> + [ + sample.subMap(['id', 'repeat']), + sample.subMap(['type', 'bam']) + ] + } ``` _After:_ ```groovy title="main.nf" linenums="5" - getSampleIdAndReplicate = { sample -> - [ - sample.subMap(['id', 'repeat']), - sample.bam - ] - } +getSampleIdAndReplicate = { sample -> + [ + sample.subMap(['id', 'repeat']), + sample.bam + ] + } ``` A reminder, this will select only the BAM files once we have separated the channels into normal and tumor. We are losing the `type` field, but we know which samples are normal and tumor because they have been filtered and the channel should only contain one type per sample. Once we have done this we can remove the `map` operator from the end of the pipeline: @@ -1241,24 +1173,22 @@ A reminder, this will select only the BAM files once we have separated the chann _Before:_ ```groovy title="main.nf" linenums="43" - .groupTuple() - .map { sample_info, normal, tumor -> - [ - sample_info, - normal.collect { bam_data -> bam_data.bam }, - tumor.collect { bam_data -> bam_data.bam } - ] - } - .view() +.groupTuple() +.map { sample_info, normal, tumor -> + [ + sample_info, + normal.collect { bam_data -> bam_data.bam }, + tumor.collect { bam_data -> bam_data.bam } + ] } +.view() ``` _After:_ ```groovy title="main.nf" linenums="43" - .groupTuple() - .view() -} +.groupTuple() +.view() ``` Sometimes parsing data earlier in the pipeline is the right choice to avoid complicated code. From 8eeba21f0324be8956fd89bd63544e1f8f2e5615 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Tue, 22 Apr 2025 17:45:52 +0100 Subject: [PATCH 24/36] Fix typo in path to splitting-and-grouping --- docs/side_quests/splitting-and-grouping.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 24549894cc..0a4ab574ae 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -32,7 +32,7 @@ Before taking on this side quest you should: Let's move into the project directory. ```bash -cd side-quests/splitting-and-grouping +cd side-quests/splitting_and_grouping ``` You'll find a `data` directory containing a samplesheet and a main workflow file. 
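Picking up the earlier point that parsing data early keeps the rest of the pipeline simple, the early-extraction pattern is easy to try on its own. A minimal sketch, assuming rows shaped like this samplesheet (the values are invented):

```groovy
workflow {
    // Pulling the BAM path out of each row up front means later
    // grouping steps carry a plain file path instead of a whole map
    getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample.bam ] }

    Channel.of(
            [id: 'sampleA', repeat: '1', type: 'normal', bam: 'sampleA_rep1_normal.bam'],
            [id: 'sampleA', repeat: '2', type: 'normal', bam: 'sampleA_rep2_normal.bam']
        )
        .map(getSampleIdAndReplicate)
        .view()
    // [[id:sampleA, repeat:1], sampleA_rep1_normal.bam] ...
}
```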
From c15f9fc5eabcf10d572f6afd6d2f16e26275607c Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Tue, 22 Apr 2025 17:47:57 +0100 Subject: [PATCH 25/36] Fix samplesheet items to be more consistent --- docs/side_quests/splitting-and-grouping.md | 6 +++--- .../splitting_and_grouping/data/samplesheet.csv | 12 ++++++------ 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 0a4ab574ae..04a411c6a8 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -49,14 +49,14 @@ The samplesheet contains information about different samples and their associate ```console title="samplesheet.csv" id,repeat,type,bam -sampleA,1,normal,sampleA_r1_normal.bam +sampleA,1,normal,sampleA_rep1_normal.bam sampleA,1,tumor,sampleA_rep1_tumor.bam +sampleA,2,normal,sampleA_rep2_normal.bam +sampleA,2,tumor,sampleA_rep2_tumor.bam sampleB,1,normal,sampleB_rep1_normal.bam sampleB,1,tumor,sampleB_rep1_tumor.bam sampleC,1,normal,sampleC_rep1_normal.bam sampleC,1,tumor,sampleC_rep1_tumor.bam -sampleD,1,normal,sampleD_rep1_normal.bam -sampleD,1,tumor,sampleD_rep1_tumor.bam ``` Note there are 8 samples in total, 4 normal and 4 tumor. sampleA has 2 repeats, while sampleB and sampleC only have 1. diff --git a/side-quests/splitting_and_grouping/data/samplesheet.csv b/side-quests/splitting_and_grouping/data/samplesheet.csv index d9ec938f31..5d887ef17c 100644 --- a/side-quests/splitting_and_grouping/data/samplesheet.csv +++ b/side-quests/splitting_and_grouping/data/samplesheet.csv @@ -1,9 +1,9 @@ id,repeat,type,bam sampleA,1,normal,sampleA_rep1_normal.bam sampleA,1,tumor,sampleA_rep1_tumor.bam -sampleA,2,normal,sampleB_rep1_normal.bam -sampleA,2,tumor,sampleB_rep1_tumor.bam -sampleB,1,normal,sampleC_rep1_normal.bam -sampleB,1,tumor,sampleC_rep1_tumor.bam -sampleC,1,normal,sampleD_rep1_normal.bam -sampleC,1,tumor,sampleD_rep1_tumor.bam +sampleA,2,normal,sampleA_rep2_normal.bam +sampleA,2,tumor,sampleA_rep2_tumor.bam +sampleB,1,normal,sampleB_rep1_normal.bam +sampleB,1,tumor,sampleB_rep1_tumor.bam +sampleC,1,normal,sampleC_rep1_normal.bam +sampleC,1,tumor,sampleC_rep1_tumor.bam From d8c7671182ce965ea31675d698028a4f37fe3fc6 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Tue, 22 Apr 2025 17:50:38 +0100 Subject: [PATCH 26/36] More consistent phrasing of first steps --- docs/side_quests/splitting-and-grouping.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 04a411c6a8..6d3e17e987 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -81,17 +81,12 @@ workflow { Throughout this tutorial, we'll use the `ch_` prefix for all channel variables to clearly indicate they are Nextflow channels. -We can use the [`splitCsv` operator](https://www.nextflow.io/docs/latest/operator.html#splitcsv) to split the samplesheet into a channel of maps, where each map represents a row from the CSV file. - _Before:_ ```groovy title="main.nf" linenums="2" ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) ``` -The `header: true` option tells Nextflow to use the first row of the CSV file as the header row, which will be used as keys for the values. Let's see what Nextflow can see after reading with splitCsv. 
To do this, we can use the `view` operator. - _After:_ ```groovy title="main.nf" linenums="2" @@ -100,6 +95,12 @@ ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .view() ``` +We can use the [`splitCsv` operator](https://www.nextflow.io/docs/latest/operator.html#splitcsv) to split the samplesheet into a channel of maps, where each map represents a row from the CSV file. + +The `header: true` option tells Nextflow to use the first row of the CSV file as the header row, which will be used as keys for the values. Let's see what Nextflow can see after reading with splitCsv. To do this, we can use the `view` operator. + +Run the pipeline: + ```bash title="Read the samplesheet" nextflow run main.nf ``` From 8cf70b55251bc504e39f8eee2853dcdb2f5db733 Mon Sep 17 00:00:00 2001 From: Adam Talbot <12817534+adamrtalbot@users.noreply.github.com> Date: Tue, 22 Apr 2025 17:52:48 +0100 Subject: [PATCH 27/36] Apply suggestions from code review Co-authored-by: Friederike Hanssen --- docs/side_quests/splitting-and-grouping.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 6d3e17e987..6e12e4faa7 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -129,7 +129,7 @@ Each map contains: - `type`: The sample type (normal or tumor) - `bam`: Path to the BAM file -This format makes it easy to access specific fields from each sample. For example, we could access the sample ID with `sample.id` or the BAM file path with `sample.bam`. The output above shows each row from the CSV file converted into a map with keys matching the header row. Now that we've successfully read in the samplesheet and have access to the data in each row, we can begin implementing our pipeline logic. +This format makes it easy to access specific fields from each sample. For example, we could access the sample ID with `id` or the BAM file path with `bam`. The output above shows each row from the CSV file converted into a map with keys matching the header row. Now that we've successfully read in the samplesheet and have access to the data in each row, we can begin implementing our pipeline logic. ### Takeaway @@ -332,7 +332,7 @@ We've now separated out the normal and tumor samples into two different channels In the previous section, we separated out the normal and tumor samples into two different channels. These could be processed independently using specific processes or workflows based on their type. But what happens when we want to compare the normal and tumor samples from the same patient? At this point, we need to join them back together making sure to match the samples based on their `id` field. -Nextflow includes many methods for combining channels, but in this case the most appropriate operator is [`join`](https://www.nextflow.io/docs/latest/operator.html#join). This acts like a SQL `JOIN` operation, where we specify the key to join on and the type of join to perform. +Nextflow includes many methods for combining channels, but in this case the most appropriate operator is [`join`](https://www.nextflow.io/docs/latest/operator.html#join). If you are familiar with SQL, it acts like the `JOIN` operation, where we specify the key to join on and the type of join to perform. ### 3.1. 
Use `map` and `join` to combine based on sample ID From 67ce29f1be74427e988e6ab210bca56b496d3653 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Tue, 22 Apr 2025 17:57:14 +0100 Subject: [PATCH 28/36] Add reference to join documentation --- docs/side_quests/splitting-and-grouping.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 6e12e4faa7..02877cd2dd 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -332,7 +332,7 @@ We've now separated out the normal and tumor samples into two different channels In the previous section, we separated out the normal and tumor samples into two different channels. These could be processed independently using specific processes or workflows based on their type. But what happens when we want to compare the normal and tumor samples from the same patient? At this point, we need to join them back together making sure to match the samples based on their `id` field. -Nextflow includes many methods for combining channels, but in this case the most appropriate operator is [`join`](https://www.nextflow.io/docs/latest/operator.html#join). If you are familiar with SQL, it acts like the `JOIN` operation, where we specify the key to join on and the type of join to perform. +Nextflow includes many methods for combining channels, but in this case the most appropriate operator is [`join`](https://www.nextflow.io/docs/latest/operator.html#join). If you are familiar with SQL, it acts like the `JOIN` operation, where we specify the key to join on and the type of join to perform. ### 3.1. Use `map` and `join` to combine based on sample ID @@ -529,6 +529,8 @@ Note how we have a tuple of two elements (`id` and `repeat` fields) as the first element of each joined result. This demonstrates how complex items can be used as a joining key, enabling fairly intricate matching between samples from the same conditions. +If you want to explore more ways to join on different keys, check out the [join operator documentation](https://www.nextflow.io/docs/latest/operator.html#join) for additional options and examples. ### 3.3. Use subMap to create a new joining key We have an issue from the above example. We have lost the field names from the original joining key, i.e. the `id` and `repeat` fields are just a list of two values. If we want to retain the field names so we can access them later by name we can use the `subMap` method. From 63c89dc248af331d037fc318431ac4df493cb879 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Wed, 23 Apr 2025 12:27:25 +0100 Subject: [PATCH 29/36] Fix headers and indentation --- docs/side_quests/splitting-and-grouping.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting-and-grouping.md index 02877cd2dd..74b565341d 100644 --- a/docs/side_quests/splitting-and-grouping.md +++ b/docs/side_quests/splitting-and-grouping.md @@ -900,7 +900,7 @@ In this section, you've learned: - **Spreading samples over intervals**: How to use `combine` to repeat samples over intervals -### 5. Aggregating samples +## 5. Aggregating samples In the previous section, we learned how to split a samplesheet and filter the normal and tumor samples.
But this only covers a single type of joining. What if we want to group samples by a specific attribute? For example, instead of joining matched normal-tumor pairs, we might want to process all samples from "sampleA" together regardless of their type. This pattern is common in bioinformatics workflows where you may want to process related samples separately for efficiency reasons before comparing or combining the results at the end. @@ -1055,7 +1055,7 @@ It's possible to use a simpler data structure than this, by separating out the s [`transpose`](https://www.nextflow.io/docs/latest/reference/operator.html#transpose) is the opposite of groupTuple. It unpacks the items in a channel and flattens them. Try adding `transpose` to undo the grouping we performed above! -# 5.2. Reorganise the data +### 5.2. Reorganise the data Let's consider the inputs to a typical Nextflow process. Generally, inputs can be in the form of values or files. In this example, we have a set of values for sample information (`id` and `interval`) and a set of files for sequencing data (`normal` and `tumor`). The `input` block of a process might look like this: @@ -1143,7 +1143,7 @@ Note how the channel is now structured as a 3-part tuple: `groupTuple` is a powerful operator but can generate complex data structures. It's important to understand how the data structure changes as it flows through the pipeline so you can manipulate it as needed. Using a `map` at the end of a pipeline helps refine the output into a structure that fits our pipeline's processes. -## 5.3. Simplify the data +### 5.3. Simplify the data One issue we have faced in this pipeline is that we have a moderately complicated data structure which we have had to coerce throughout the pipeline. What if we could simplify it at the start? Then we would only handle the relevant fields in the pipeline and avoid the need for the final `map` operator.
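The `transpose` exercise mentioned above is also worth trying in isolation before moving on. A minimal sketch with invented values:

```groovy
workflow {
    // transpose is the inverse of groupTuple: it pairs each element of
    // the nested list back up with its key, one emission per file
    Channel.of(['sampleA', ['rep1.bam', 'rep2.bam']])
        .transpose()
        .view()
    // [sampleA, rep1.bam]
    // [sampleA, rep2.bam]
}
```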
From 5a4ab537395e46af902d6fe2acbd1957fd192c94 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Wed, 23 Apr 2025 12:21:18 +0100 Subject: [PATCH 30/36] Rename splitting and grouping file to be consistent with other filenames --- .../{splitting-and-grouping.md => splitting_and_grouping.md} | 0 mkdocs.yml | 2 +- 2 files changed, 1 insertion(+), 1 deletion(-) rename docs/side_quests/{splitting-and-grouping.md => splitting_and_grouping.md} (100%) diff --git a/docs/side_quests/splitting-and-grouping.md b/docs/side_quests/splitting_and_grouping.md similarity index 100% rename from docs/side_quests/splitting-and-grouping.md rename to docs/side_quests/splitting_and_grouping.md diff --git a/mkdocs.yml b/mkdocs.yml index 3d74de54c0..895406d608 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -39,7 +39,7 @@ nav: - side_quests/index.md - side_quests/orientation.md - side_quests/workflows_of_workflows.md - - side_quests/splitting-and-grouping.md + - side_quests/splitting_and_grouping.md - side_quests/nf-test.md - side_quests/nf-core.md - Fundamentals Training: From 29cc3debe3213be05c360f2f04a1cda750721bd2 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Wed, 23 Apr 2025 12:27:25 +0100 Subject: [PATCH 31/36] Refine summary messages --- docs/side_quests/splitting_and_grouping.md | 34 +++++++--------------- 1 file changed, 10 insertions(+), 24 deletions(-) diff --git a/docs/side_quests/splitting_and_grouping.md b/docs/side_quests/splitting_and_grouping.md index 74b565341d..1866f0b7aa 100644 --- a/docs/side_quests/splitting_and_grouping.md +++ b/docs/side_quests/splitting_and_grouping.md @@ -1228,31 +1228,17 @@ In this section, you've learned: In this side quest, you've learned how to split and group data using channels. By modifying the data as it flows through the pipeline, you can construct a pipeline that handles as many samples as possible with no loops or while statements. It gracefully scales to large numbers of samples. Here's what we achieved: -1. **Read in samplesheet with splitCsv** +1. **Read in samplesheet with splitCsv**: We read in a samplesheet and viewed the contents. -- Samplesheet details here -- Show with view, then show with view +2. **Use filter (and/or map) to manipulate into 2 separate channels**: We used `filter` to split the data into two channels based on the `type` field. -2. **Use filter (and/or map) to manipulate into 2 separate channels** +3. **Join on ID**: We used `join` to join the two channels on the `id` field. -- Use named closure in map here? -- Show that elements can be in two channels by filtering twice +4. **Use groupTuple to group up samples by ID**: We used `groupTuple` to group the samples by the `id` field. -3. **Join on ID** +5. **Combine by intervals**: We used `combine` to combine the two channels on the `interval` field. -- Show that elements can be in two channels by filtering twice - -4. **Use groupTuple to group up samples by ID** - -- Show that elements can be in two channels by filtering twice - -5. **Combine by intervals** - -- Show that elements can be in two channels by filtering twice - -6. **Group after intervals** - -- Show that elements can be in two channels by filtering twice +6. **Group after intervals**: We used `groupTuple` to group the samples by the `interval` field. 
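To make the summary concrete, here is one possible end-to-end sketch of those steps on a couple of hand-written rows (names and values are illustrative, not part of the lesson code):

```groovy
workflow {
    // Hand-written rows standing in for splitCsv output (step 1)
    ch_rows = Channel.of(
        [id: 'sampleA', repeat: '1', type: 'normal', bam: 'a1_n.bam'],
        [id: 'sampleA', repeat: '1', type: 'tumor', bam: 'a1_t.bam']
    )
    keyed = { row -> [row.subMap(['id', 'repeat']), row.bam] }

    ch_normal = ch_rows.filter { row -> row.type == 'normal' }.map(keyed) // step 2
    ch_tumor  = ch_rows.filter { row -> row.type == 'tumor' }.map(keyed)  // step 2
    ch_joined = ch_normal.join(ch_tumor)                                  // step 3

    ch_joined
        .combine(Channel.of('chr1', 'chr2'))                              // step 5
        .map { key, normal, tumor, interval ->
            [key.subMap(['id']) + [interval: interval], normal, tumor]
        }
        .groupTuple()                                                     // step 6
        .view()
}
```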
This approach offers several advantages over writing a pipeline as more standard code, such as using for and while loops: @@ -1274,14 +1260,14 @@ By mastering these channel operations, you can build flexible, scalable pipeline .splitCsv(header: true) ``` -2. **Filtering** +1. **Filtering** ```nextflow // Filter channel based on condition channel.filter { it.type == 'tumor' } ``` -3. **Joining Channels** +1. **Joining Channels** ```nextflow // Join two channels by key @@ -1294,14 +1280,14 @@ By mastering these channel operations, you can build flexible, scalable pipeline ) ``` -4. **Grouping Data** +1. **Grouping Data** ```nextflow // Group by the first element in each tuple channel.groupTuple() ``` -5. **Combining Channels** +1. **Combining Channels** ```nextflow // Combine with Cartesian product From 6099c1c97a9e588e7b9a100f7fc7d0b975e70ac2 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Mon, 28 Apr 2025 16:18:53 +0100 Subject: [PATCH 32/36] Use bullet points instead of numbers for key concepts of splitting-and-grouping --- docs/side_quests/splitting_and_grouping.md | 62 +++++++++++----------- 1 file changed, 31 insertions(+), 31 deletions(-) diff --git a/docs/side_quests/splitting_and_grouping.md b/docs/side_quests/splitting_and_grouping.md index 1866f0b7aa..a627bf3375 100644 --- a/docs/side_quests/splitting_and_grouping.md +++ b/docs/side_quests/splitting_and_grouping.md @@ -1252,47 +1252,47 @@ By mastering these channel operations, you can build flexible, scalable pipeline ### Key Concepts -1. **Reading Samplesheets** +- **Reading Samplesheets** - ```nextflow - // Read CSV with header - Channel.fromPath('samplesheet.csv') - .splitCsv(header: true) - ``` + ```nextflow + // Read CSV with header + Channel.fromPath('samplesheet.csv') + .splitCsv(header: true) + ``` -1. **Filtering** +- **Filtering** - ```nextflow - // Filter channel based on condition - channel.filter { it.type == 'tumor' } - ``` + ```nextflow + // Filter channel based on condition + channel.filter { it.type == 'tumor' } + ``` -1. **Joining Channels** +- **Joining Channels** - ```nextflow - // Join two channels by key - tumor_ch.join(normal_ch) + ```nextflow + // Join two channels by key + tumor_ch.join(normal_ch) - // Extract a key and join by this value - tumor_ch.map { [it.patient_id, it] } - .join( - normal_ch.map { [it.patient_id, it] } - ) - ``` + // Extract a key and join by this value + tumor_ch.map { [it.patient_id, it] } + .join( + normal_ch.map { [it.patient_id, it] } + ) + ``` -1. **Grouping Data** +- **Grouping Data** - ```nextflow - // Group by the first element in each tuple - channel.groupTuple() - ``` + ```nextflow + // Group by the first element in each tuple + channel.groupTuple() + ``` -1. 
**Combining Channels** +- **Combining Channels** - ```nextflow - // Combine with Cartesian product - samples_ch.combine(intervals_ch) - ``` + ```nextflow + // Combine with Cartesian product + samples_ch.combine(intervals_ch) + ``` ## Resources From ddea988df3ab2fb5e7ae00a660f59e5f178c8edc Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Mon, 28 Apr 2025 17:01:16 +0100 Subject: [PATCH 33/36] splitting and grouping use before/after correctly --- docs/side_quests/splitting_and_grouping.md | 814 ++++++++++----------- 1 file changed, 407 insertions(+), 407 deletions(-) diff --git a/docs/side_quests/splitting_and_grouping.md b/docs/side_quests/splitting_and_grouping.md index a627bf3375..827fa8e210 100644 --- a/docs/side_quests/splitting_and_grouping.md +++ b/docs/side_quests/splitting_and_grouping.md @@ -81,19 +81,19 @@ workflow { Throughout this tutorial, we'll use the `ch_` prefix for all channel variables to clearly indicate they are Nextflow channels. -_Before:_ +=== "After" -```groovy title="main.nf" linenums="2" -ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") -``` + ```groovy title="main.nf" linenums="2" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .view() + ``` -_After:_ +=== "Before" -```groovy title="main.nf" linenums="2" -ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .view() -``` + ```groovy title="main.nf" linenums="2" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + ``` We can use the [`splitCsv` operator](https://www.nextflow.io/docs/latest/operator.html#splitcsv) to split the samplesheet into a channel of maps, where each map represents a row from the CSV file. @@ -148,22 +148,22 @@ We now have a channel of maps, each representing a row from the samplesheet. Nex We can use the [`filter` operator](https://www.nextflow.io/docs/latest/operator.html#filter) to filter the data based on a condition. Let's say we only want to process normal samples. We can do this by filtering the data based on the `type` field. Let's insert this before the `view` operator. -_Before:_ +=== "After" -```groovy title="main.nf" linenums="2" -ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .view() -``` + ```groovy title="main.nf" linenums="2" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .filter { sample -> sample.type == 'normal' } + .view() + ``` -_After:_ +=== "Before" -```groovy title="main.nf" linenums="2" -ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .filter { sample -> sample.type == 'normal' } - .view() -``` + ```groovy title="main.nf" linenums="2" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .view() + ``` ```bash title="View normal samples" nextflow run main.nf @@ -192,24 +192,24 @@ In this case, we want to keep only the samples where `sample.type == 'normal'`. While useful, we are discarding the tumor samples. Instead, let's rewrite our pipeline to save all the samples to one channel called `ch_samplesheet`, then filter that channel to just the normal samples and save the results to a new channel called `ch_normal_samples`. 
-_Before:_ +=== "After" -```groovy title="main.nf" linenums="2" -ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) - .filter { sample -> sample.type == 'normal' } - .view() -``` + ```groovy title="main.nf" linenums="2" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + ch_normal_samples.view() + ``` -_After:_ +=== "Before" -```groovy title="main.nf" linenums="2" -ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) -ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } -ch_normal_samples.view() -``` + ```groovy title="main.nf" linenums="2" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + .filter { sample -> sample.type == 'normal' } + .view() + ``` Once again, run the pipeline to see the results: @@ -230,28 +230,28 @@ Launching `main.nf` [trusting_poisson] DSL2 - revision: 639186ee74 Success! We have filtered the data to only include normal samples. Note that we can use view and save the new channel. If we wanted, we still have access to the tumor samples within the `ch_samplesheet` channel. Since we managed it for the normal samples, let's do it for the tumor samples as well: -_Before:_ +=== "After" -```groovy title="main.nf" linenums="2" -ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) -ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } -ch_normal_samples.view() -``` - -_After:_ - -```groovy title="main.nf" linenums="2" -ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) -ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } -ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } -ch_normal_samples.view() -ch_tumor_samples.view() -``` + ```groovy title="main.nf" linenums="2" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + ch_normal_samples.view() + ch_tumor_samples.view() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="2" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + ch_normal_samples.view() + ``` ```bash title="View tumor samples" nextflow run main.nf @@ -274,30 +274,30 @@ Launching `main.nf` [big_bernard] DSL2 - revision: 897c9e44cc We've managed to separate out the normal and tumor samples into two different channels but they're mixed up when we `view` them in the console! If we want, we can remove one of the `view` operators to see the data in each channel separately. 
Let's remove the `view` operator for the normal samples: -_Before:_ - -```groovy title="main.nf" linenums="2" -ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) -ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } -ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } -ch_normal_samples.view() -ch_tumor_samples.view() -``` +=== "After" + + ```groovy title="main.nf" linenums="2" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + ch_tumor_samples.view() + ``` -_After:_ +=== "Before" -```groovy title="main.nf" linenums="2" -ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) -ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } -ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } -ch_tumor_samples.view() -``` + ```groovy title="main.nf" linenums="2" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + ch_normal_samples.view() + ch_tumor_samples.view() + ``` ```bash title="View normal and tumor samples" nextflow run main.nf @@ -357,32 +357,32 @@ We can see that the `id` field is the first element in each map. For `join` to w To isolate the `id` field, we can use the [`map` operator](https://www.nextflow.io/docs/latest/operator.html#map) to create a new tuple with the `id` field as the first element. 
-_Before:_ +=== "After" -```groovy title="main.nf" linenums="2" -ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) -ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } -ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } -ch_tumor_samples.view() -``` - -_After:_ - -```groovy title="main.nf" linenums="2" -ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) -ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map { sample -> [sample.id, sample] } -ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - .map { sample -> [sample.id, sample] } -ch_normal_samples.view() -ch_tumor_samples.view() -``` + ```groovy title="main.nf" linenums="2" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [sample.id, sample] } + ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + .map { sample -> [sample.id, sample] } + ch_normal_samples.view() + ch_tumor_samples.view() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="2" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + ch_tumor_samples.view() + ``` ```bash title="View normal and tumor samples with ID as element 0" nextflow run main.nf @@ -407,36 +407,36 @@ It might be subtle, but you should be able to see the first element in each tupl Once again, we will use `view` to print the joined outputs. 
-_Before:_ - -```groovy title="main.nf" linenums="2" -ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) -ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map { sample -> [sample.id, sample] } -ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - .map { sample -> [sample.id, sample] } -ch_normal_samples.view() -ch_tumor_samples.view() -``` +=== "After" -_After:_ - -```groovy title="main.nf" linenums="2" -ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) -ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map { sample -> [sample.id, sample] } -ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - .map { sample -> [sample.id, sample] } -ch_joined_samples = ch_normal_samples - .join(ch_tumor_samples) -ch_joined_samples.view() -``` + ```groovy title="main.nf" linenums="2" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [sample.id, sample] } + ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + .map { sample -> [sample.id, sample] } + ch_joined_samples = ch_normal_samples + .join(ch_tumor_samples) + ch_joined_samples.view() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="2" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [sample.id, sample] } + ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + .map { sample -> [sample.id, sample] } + ch_normal_samples.view() + ch_tumor_samples.view() + ``` ```bash title="View normal and tumor samples" nextflow run main.nf @@ -480,35 +480,35 @@ To avoid this, we can join on multiple fields. There are actually multiple ways Let's start by creating a new joining key. We can do this in the same way as before by using the [`map` operator](https://www.nextflow.io/docs/latest/operator.html#map) to create a new tuple with the `id` and `repeat` fields as the first element. 
-_Before:_ - -```groovy title="main.nf" linenums="4" -ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map { sample -> [sample.id, sample] } -ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - .map { sample -> [sample.id, sample] } -``` - -_After:_ - -```groovy title="main.nf" linenums="4" -ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map { sample -> [ - [sample.id, sample.repeat], - sample - ] - } -ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - .map { sample -> [ - [sample.id, sample.repeat], - sample - ] - } -``` +=== "After" + + ```groovy title="main.nf" linenums="4" + ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [ + [sample.id, sample.repeat], + sample + ] + } + ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + .map { sample -> [ + [sample.id, sample.repeat], + sample + ] + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="4" + ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [sample.id, sample] } + ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + .map { sample -> [sample.id, sample] } + ``` Now we should see the join is occurring but using both the `id` and `repeat` fields. @@ -537,43 +537,43 @@ We have an issue from the above example. We have lost the field names from the o The `subMap` method takes a map and returns a new map with only the key-value pairs specified in the argument. In this case we want to specify the `id` and `repeat` fields. -_Before:_ - -```groovy title="main.nf" linenums="4" -ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map { sample -> [ - [sample.id, sample.repeat], - sample - ] - } -ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - .map { sample -> [ - [sample.id, sample.repeat], - sample - ] - } -``` - -_After:_ +=== "After" -```groovy title="main.nf" linenums="4" -ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map { sample -> [ - sample.subMap(['id', 'repeat']), - sample - ] - } -ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == 'tumor' } - .map { sample -> [ - sample.subMap(['id', 'repeat']), - sample - ] - } -``` + ```groovy title="main.nf" linenums="4" + ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [ + sample.subMap(['id', 'repeat']), + sample + ] + } + ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + .map { sample -> [ + sample.subMap(['id', 'repeat']), + sample + ] + } + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="4" + ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [ + [sample.id, sample.repeat], + sample + ] + } + ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == 'tumor' } + .map { sample -> [ + [sample.id, sample.repeat], + sample + ] + } + ``` ```bash title="View normal and tumor samples" nextflow run main.nf @@ -598,58 +598,58 @@ Since we are re-using the same map in multiple places, we run the risk of introd To do so, first we define the closure as a new variable: -_Before:_ +=== "After" -```groovy title="main.nf" linenums="2" -ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: 
true) -ch_normal_samples = ch_samplesheet -``` + ```groovy title="main.nf" linenums="2" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) -_After:_ + getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } -```groovy title="main.nf" linenums="2" -ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") - .splitCsv(header: true) + ch_normal_samples = ch_samplesheet + ``` -getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } +=== "Before" -ch_normal_samples = ch_samplesheet -``` + ```groovy title="main.nf" linenums="2" + ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") + .splitCsv(header: true) + ch_normal_samples = ch_samplesheet + ``` We have taken the map we used previously and defined it as a named variable we can call later. Let's implement it in our workflow: -_Before:_ - -```groovy title="main.nf" linenums="7" -ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map { sample -> [ - sample.subMap(['id', 'repeat']), - sample - ] - } -ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == "tumor" } - .map { sample -> [ - sample.subMap(['id', 'repeat']), - sample - ] - } -``` +=== "After" -_After:_ + ```groovy title="main.nf" linenums="7" + ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map ( getSampleIdAndReplicate ) -```groovy title="main.nf" linenums="7" -ch_normal_samples = ch_samplesheet - .filter { sample -> sample.type == 'normal' } - .map ( getSampleIdAndReplicate ) + ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == "tumor" } + .map ( getSampleIdAndReplicate ) -ch_tumor_samples = ch_samplesheet - .filter { sample -> sample.type == "tumor" } - .map ( getSampleIdAndReplicate ) + ``` -``` +=== "Before" + + ```groovy title="main.nf" linenums="7" + ch_normal_samples = ch_samplesheet + .filter { sample -> sample.type == 'normal' } + .map { sample -> [ + sample.subMap(['id', 'repeat']), + sample + ] + } + ch_tumor_samples = ch_samplesheet + .filter { sample -> sample.type == "tumor" } + .map { sample -> [ + sample.subMap(['id', 'repeat']), + sample + ] + } + ``` !!! note @@ -699,22 +699,22 @@ We have a lot of duplicated data in our workflow. Each item in the joined sample Since the `id` and `repeat` fields are available in the grouping key, let's remove them from the sample data to avoid duplication. We can do this by using the `subMap` method to create a new map with only the `type` and `bam` fields. This approach allows us to maintain all necessary information while eliminating redundancy in our data structure. -_Before:_ +=== "After" -```groovy title="main.nf" linenums="5" -getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } -``` + ```groovy title="main.nf" linenums="5" + getSampleIdAndReplicate = { sample -> + [ + sample.subMap(['id', 'repeat']), + sample.subMap(['type', 'bam']) + ] + } + ``` -_After:_ +=== "Before" -```groovy title="main.nf" linenums="5" -getSampleIdAndReplicate = { sample -> - [ - sample.subMap(['id', 'repeat']), - sample.subMap(['type', 'bam']) - ] - } -``` + ```groovy title="main.nf" linenums="5" + getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), sample ] } + ``` Now, when the closure returns the tuple, the first element is the `id` and `repeat` fields and the second element is the `type` and `bam` fields. 
We have effectively removed the `id` and `repeat` fields from the sample data and now store them only in the grouping key. This approach eliminates redundancy while maintaining all necessary information.

@@ -758,37 +758,37 @@ In the following section, we'll demonstrate how to distribute our sample data acr
 
 Let's start by creating a channel of intervals. To keep life simple, we will just use three manually defined intervals. In a real workflow, you could read these in from a file or even create a channel containing many interval files.
 
-_Before:_
+=== "After"
 
-```groovy title="main.nf" linenums="21"
-    .join(ch_tumor_samples)
-ch_joined_samples.view()
-```
+    ```groovy title="main.nf" linenums="21"
+        .join(ch_tumor_samples)
 
-_After:_
+    ch_intervals = Channel.of('chr1', 'chr2', 'chr3')
+    ```
 
-```groovy title="main.nf" linenums="21"
-    .join(ch_tumor_samples)
+=== "Before"
 
-ch_intervals = Channel.of('chr1', 'chr2', 'chr3')
-```
+    ```groovy title="main.nf" linenums="21"
+        .join(ch_tumor_samples)
+    ch_joined_samples.view()
+    ```
 
 Now remember, we want to repeat each sample for each interval. This is sometimes referred to as the Cartesian product of the samples and intervals. We can achieve this by using the [`combine` operator](https://www.nextflow.io/docs/latest/operator.html#combine). This will take every item from the first channel and repeat it for each item in the second. Let's add a combine operator to our workflow:
 
-_Before:_
+=== "After"
 
-```groovy title="main.nf" linenums="23"
-ch_intervals = Channel.of('chr1', 'chr2', 'chr3')
-```
+    ```groovy title="main.nf" linenums="23"
+    ch_intervals = Channel.of('chr1', 'chr2', 'chr3')
 
-_After:_
+    ch_combined_samples = ch_joined_samples.combine(ch_intervals)
+        .view()
+    ```
 
-```groovy title="main.nf" linenums="23"
-ch_intervals = Channel.of('chr1', 'chr2', 'chr3')
+=== "Before"
 
-ch_combined_samples = ch_joined_samples.combine(ch_intervals)
-    .view()
-```
+    ```groovy title="main.nf" linenums="23"
+    ch_intervals = Channel.of('chr1', 'chr2', 'chr3')
+    ```
 
 Now let's run it and see what happens:
 
@@ -821,27 +821,27 @@ Success! We have repeated every sample for every single interval in our 3 interval
 
 We can use the `map` operator to tidy and refactor our sample data so it's easier to understand. Let's move the interval string into the grouping map in the first element of the tuple.
 
-_Before:_
+=== "After"
 
-```groovy title="main.nf" linenums="25"
-ch_combined_samples = ch_joined_samples.combine(ch_intervals)
-    .view()
-```
+    ```groovy title="main.nf" linenums="25"
+    ch_combined_samples = ch_joined_samples.combine(ch_intervals)
+        .map { grouping_key, normal, tumor, interval ->
+            [
+                grouping_key + [interval: interval],
+                normal,
+                tumor
+            ]
 
-_After:_
+        }
+        .view()
+    ```
 
-```groovy title="main.nf" linenums="25"
-ch_combined_samples = ch_joined_samples.combine(ch_intervals)
-    .map { grouping_key, normal, tumor, interval ->
-        [
-            grouping_key + [interval: interval],
-            normal,
-            tumor
-        ]
 
-    }
-    .view()
-```
+=== "Before"
+
+    ```groovy title="main.nf" linenums="25"
+    ch_combined_samples = ch_joined_samples.combine(ch_intervals)
+        .view()
+    ```
 
 Wait? What did we do here? Let's go over it piece by piece.
 
@@ -924,44 +924,44 @@ The first step is similar to what we did in the previous section. We must isolat
 
 We can reuse the `subMap` method from before to isolate our `id` and `interval` fields from the map. Like before, we will use the `map` operator to apply the `subMap` method to the first element of the tuple for each sample.
 
-_Before:_ - -```groovy title="main.nf" linenums="25" -ch_combined_samples = ch_joined_samples.combine(ch_intervals) - .map { grouping_key, normal, tumor, interval -> - [ - grouping_key + [interval: interval], - normal, - tumor - ] - - } - .view() -``` - -_After:_ - -```groovy title="main.nf" linenums="25" -ch_combined_samples = ch_joined_samples.combine(ch_intervals) - .map { grouping_key, normal, tumor, interval -> - [ - grouping_key + [interval: interval], - normal, - tumor - ] - - } - -ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> - [ - grouping_key.subMap('id', 'interval'), - normal, - tumor - ] - - } - .view() -``` +=== "After" + + ```groovy title="main.nf" linenums="25" + ch_combined_samples = ch_joined_samples.combine(ch_intervals) + .map { grouping_key, normal, tumor, interval -> + [ + grouping_key + [interval: interval], + normal, + tumor + ] + + } + + ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> + [ + grouping_key.subMap('id', 'interval'), + normal, + tumor + ] + + } + .view() + ``` + +=== "Before" + + ```groovy title="main.nf" linenums="25" + ch_combined_samples = ch_joined_samples.combine(ch_intervals) + .map { grouping_key, normal, tumor, interval -> + [ + grouping_key + [interval: interval], + normal, + tumor + ] + + } + .view() + ``` Let's run it again and check the channel contents: @@ -996,34 +996,34 @@ We can see that we have successfully isolated the `id` and `interval` fields, bu Let's now group the samples by this new grouping element, using the [`groupTuple` operator](https://www.nextflow.io/docs/latest/operator.html#grouptuple). -_Before:_ +=== "After" -```groovy title="main.nf" linenums="35" -ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> - [ - grouping_key.subMap('id', 'interval'), - normal, - tumor - ] + ```groovy title="main.nf" linenums="35" + ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> + [ + grouping_key.subMap('id', 'interval'), + normal, + tumor + ] - } - .view() -``` + } + .groupTuple() + .view() + ``` -_After:_ +=== "Before" -```groovy title="main.nf" linenums="35" -ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> - [ - grouping_key.subMap('id', 'interval'), - normal, - tumor - ] + ```groovy title="main.nf" linenums="35" + ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> + [ + grouping_key.subMap('id', 'interval'), + normal, + tumor + ] - } - .groupTuple() - .view() -``` + } + .view() + ``` Simple, huh? We just added a single line of code. Let's see what happens when we run it: @@ -1059,7 +1059,7 @@ It's possible to use a simpler data structure than this, by separating our the s Let's consider the inputs to a typical Nextflow process. Generally, inputs can be in the form of values or files. In this example, we have a set of values for sample information (`id` and `interval`) and a set of files for sequencing data (`normal` and `tumor`). 
The `input` block of a process might look like this:
 
-```groovy title="main.nf"
+```groovy
 input:
 tuple val(sampleInfo), path(normalBam), path(tumorBam)
 ```
 
@@ -1093,26 +1093,26 @@ Note this is conceptually similar but distinct to the Nextflow [`collect` operat
 
 Let's append our map to the end of our pipeline and show the resulting data structure:
 
-_Before:_
+=== "After"
 
-```groovy title="main.nf" linenums="42"
-.groupTuple()
-.view()
-```
+    ```groovy title="main.nf" linenums="42"
+    .groupTuple()
+    .map { sample_info, normal, tumor ->
+        [
+            sample_info,
+            normal.collect { bam_data -> bam_data.bam },
+            tumor.collect { bam_data -> bam_data.bam }
+        ]
+    }
+    .view()
+    ```
 
-_After:_
+=== "Before"
 
-```groovy title="main.nf" linenums="42"
-.groupTuple()
-.map { sample_info, normal, tumor ->
-    [
-        sample_info,
-        normal.collect { bam_data -> bam_data.bam },
-        tumor.collect { bam_data -> bam_data.bam }
-    ]
-}
-.view()
-```
+    ```groovy title="main.nf" linenums="42"
+    .groupTuple()
+    .view()
+    ```
 
 ```bash title="View flattened samples"
 nextflow run main.nf
 ```
 
@@ -1149,50 +1149,50 @@ One issue we have faced in this pipeline is that we have a moderately complicate
 
 If we parse the data right at the start of our pipeline to _only_ include the `bam` field, we can avoid passing the `type` field through the pipeline, which makes the entire pipeline cleaner while retaining the same functionality:
 
-_Before:_
+=== "After"
 
-```groovy title="main.nf" linenums="5"
-getSampleIdAndReplicate = { sample ->
-    [
-        sample.subMap(['id', 'repeat']),
-        sample.subMap(['type', 'bam'])
-    ]
-  }
-```
+    ```groovy title="main.nf" linenums="5"
+    getSampleIdAndReplicate = { sample ->
+        [
+            sample.subMap(['id', 'repeat']),
+            sample.bam
+        ]
+      }
+    ```
 
-_After:_
+=== "Before"
 
-```groovy title="main.nf" linenums="5"
-getSampleIdAndReplicate = { sample ->
-    [
-        sample.subMap(['id', 'repeat']),
-        sample.bam
-    ]
-  }
-```
+    ```groovy title="main.nf" linenums="5"
+    getSampleIdAndReplicate = { sample ->
+        [
+            sample.subMap(['id', 'repeat']),
+            sample.subMap(['type', 'bam'])
+        ]
+      }
+    ```
 
 A reminder, this will select only the BAM files once we have separated the channels into normal and tumor. We lose the `type` field, but we still know which samples are normal and which are tumor, because the channels have already been filtered and each should contain only one sample type. Once we have done this, we can remove the `map` operator from the end of the pipeline:
 
-_Before:_
+=== "After"
 
-```groovy title="main.nf" linenums="43"
-.groupTuple()
-.map { sample_info, normal, tumor ->
-    [
-        sample_info,
-        normal.collect { bam_data -> bam_data.bam },
-        tumor.collect { bam_data -> bam_data.bam }
-    ]
-}
-.view()
-```
+    ```groovy title="main.nf" linenums="43"
+    .groupTuple()
+    .view()
+    ```
 
-_After:_
+=== "Before"
 
-```groovy title="main.nf" linenums="43"
-.groupTuple()
-.view()
-```
+    ```groovy title="main.nf" linenums="43"
+    .groupTuple()
+    .map { sample_info, normal, tumor ->
+        [
+            sample_info,
+            normal.collect { bam_data -> bam_data.bam },
+            tumor.collect { bam_data -> bam_data.bam }
+        ]
+    }
+    .view()
+    ```
 
 Sometimes parsing data earlier in the pipeline is the right choice to avoid complicated code.
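
If you want to convince yourself of how this final shape behaves, the whole combine-and-group pattern can be exercised in isolation. Below is a minimal, self-contained sketch of the same flow; the script name (`scratch.nf`), sample IDs, and BAM file names are all made up for illustration, with plain strings standing in for real file paths:

```groovy title="scratch.nf (hypothetical)"
workflow {
    // Dummy joined items, shaped like ch_joined_samples:
    // [grouping key, normal BAM, tumor BAM]
    ch_joined_samples = Channel.of(
        [[id: 'sampleA', repeat: '1'], 'a_rep1_normal.bam', 'a_rep1_tumor.bam'],
        [[id: 'sampleA', repeat: '2'], 'a_rep2_normal.bam', 'a_rep2_tumor.bam']
    )

    ch_intervals = Channel.of('chr1', 'chr2', 'chr3')

    // Cartesian product of samples and intervals, then regroup by id + interval
    ch_joined_samples
        .combine(ch_intervals)
        .map { grouping_key, normal, tumor, interval ->
            [ grouping_key.subMap('id') + [interval: interval], normal, tumor ]
        }
        .groupTuple()
        .view()
}
```

Running this with `nextflow run scratch.nf` should print one grouped item per id and interval, with the two repeats' BAM strings collected into lists, mirroring the structure we built from the samplesheet above.
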
From 4333a26d8d1ce9c4f04c49865e76fe5fede222a2 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Mon, 28 Apr 2025 17:22:36 +0100 Subject: [PATCH 34/36] Splitting and grouping highlight lines --- docs/side_quests/splitting_and_grouping.md | 117 ++++++++++----------- 1 file changed, 56 insertions(+), 61 deletions(-) diff --git a/docs/side_quests/splitting_and_grouping.md b/docs/side_quests/splitting_and_grouping.md index 827fa8e210..8b15cb5e2d 100644 --- a/docs/side_quests/splitting_and_grouping.md +++ b/docs/side_quests/splitting_and_grouping.md @@ -81,7 +81,7 @@ workflow { Throughout this tutorial, we'll use the `ch_` prefix for all channel variables to clearly indicate they are Nextflow channels. -=== "After" +=== "After" hl_lines="2-3" ```groovy title="main.nf" linenums="2" ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") @@ -148,7 +148,7 @@ We now have a channel of maps, each representing a row from the samplesheet. Nex We can use the [`filter` operator](https://www.nextflow.io/docs/latest/operator.html#filter) to filter the data based on a condition. Let's say we only want to process normal samples. We can do this by filtering the data based on the `type` field. Let's insert this before the `view` operator. -=== "After" +=== "After" hl_lines="3" ```groovy title="main.nf" linenums="2" ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") @@ -194,7 +194,7 @@ While useful, we are discarding the tumor samples. Instead, let's rewrite our pi === "After" - ```groovy title="main.nf" linenums="2" + ```groovy title="main.nf" linenums="2" hl_lines="3 5" ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) ch_normal_samples = ch_samplesheet @@ -232,7 +232,7 @@ Success! We have filtered the data to only include normal samples. Note that we === "After" - ```groovy title="main.nf" linenums="2" + ```groovy title="main.nf" linenums="2" hl_lines="5-8" ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) ch_normal_samples = ch_samplesheet @@ -288,7 +288,7 @@ We've managed to separate out the normal and tumor samples into two different ch === "Before" - ```groovy title="main.nf" linenums="2" + ```groovy title="main.nf" linenums="2" hl_lines="7" ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) ch_normal_samples = ch_samplesheet @@ -359,7 +359,7 @@ To isolate the `id` field, we can use the [`map` operator](https://www.nextflow. === "After" - ```groovy title="main.nf" linenums="2" + ```groovy title="main.nf" linenums="2" hl_lines="5 8 9" ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) ch_normal_samples = ch_samplesheet @@ -409,7 +409,7 @@ Once again, we will use `view` to print the joined outputs. === "After" - ```groovy title="main.nf" linenums="2" + ```groovy title="main.nf" linenums="2" hl_lines="9-11" ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) ch_normal_samples = ch_samplesheet @@ -425,7 +425,7 @@ Once again, we will use `view` to print the joined outputs. === "Before" - ```groovy title="main.nf" linenums="2" + ```groovy title="main.nf" linenums="2" hl_lines="9-10" ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) ch_normal_samples = ch_samplesheet @@ -482,7 +482,7 @@ Let's start by creating a new joining key. 
We can do this in the same way as bef === "After" - ```groovy title="main.nf" linenums="4" + ```groovy title="main.nf" linenums="4" hl_lines="3-7 10-14" ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .map { sample -> [ @@ -539,7 +539,7 @@ The `subMap` method takes a map and returns a new map with only the key-value pa === "After" - ```groovy title="main.nf" linenums="4" + ```groovy title="main.nf" linenums="4" hl_lines="4 11" ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .map { sample -> [ @@ -558,7 +558,7 @@ The `subMap` method takes a map and returns a new map with only the key-value pa === "Before" - ```groovy title="main.nf" linenums="4" + ```groovy title="main.nf" linenums="4" hl_lines="4 11" ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .map { sample -> [ @@ -600,7 +600,7 @@ To do so, first we define the closure as a new variable: === "After" - ```groovy title="main.nf" linenums="2" + ```groovy title="main.nf" linenums="2" hl_lines="4" ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) @@ -621,7 +621,7 @@ We have taken the map we used previously and defined it as a named variable we c === "After" - ```groovy title="main.nf" linenums="7" + ```groovy title="main.nf" linenums="7" hl_lines="3 7" ch_normal_samples = ch_samplesheet .filter { sample -> sample.type == 'normal' } .map ( getSampleIdAndReplicate ) @@ -701,7 +701,7 @@ Since the `id` and `repeat` fields are available in the grouping key, let's remo === "After" - ```groovy title="main.nf" linenums="5" + ```groovy title="main.nf" linenums="5" hl_lines="2-5" getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), @@ -760,7 +760,7 @@ Let's start by creating a channel of intervals. To keep life simple, we will jus === "After" - ```groovy title="main.nf" linenums="21" + ```groovy title="main.nf" linenums="21" hl_lines="3" .join(ch_tumor_samples) ch_intervals = Channel.of('chr1', 'chr2', 'chr3') @@ -777,7 +777,7 @@ Now remember, we want to repeat each sample for each interval. This is sometimes === "After" - ```groovy title="main.nf" linenums="23" + ```groovy title="main.nf" linenums="23" hl_lines="3-4" ch_intervals = Channel.of('chr1', 'chr2', 'chr3') ch_combined_samples = ch_joined_samples.combine(ch_intervals) @@ -823,24 +823,23 @@ We can use the `map` operator to tidy and refactor our sample data so it's easie === "After" - ```groovy title="main.nf" linenums="25" + ```groovy title="main.nf" linenums="25" hl_lines="2-9" ch_combined_samples = ch_joined_samples.combine(ch_intervals) - .map { grouping_key, normal, tumor, interval -> - [ - grouping_key + [interval: interval], - normal, - tumor - ] - - } - .view() + .map { grouping_key, normal, tumor, interval -> + [ + grouping_key + [interval: interval], + normal, + tumor + ] + } + .view() ``` === "Before" ```groovy title="main.nf" linenums="25" ch_combined_samples = ch_joined_samples.combine(ch_intervals) - .view() + .view() ``` Wait? What did we do here? Let's go over it piece by piece. 
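
Before breaking it down, it can help to see the map-merge idiom from the closure above on its own. This is a small sketch in plain Groovy, runnable with the `groovy` command outside Nextflow (an assumption for illustration; it is not part of the tutorial's `main.nf`):

```groovy
// `+` on two maps returns a new merged map; on duplicate keys,
// the right-hand map wins.
def grouping_key = [id: 'sampleA', repeat: '1']
def merged = grouping_key + [interval: 'chr1']

assert merged == [id: 'sampleA', repeat: '1', interval: 'chr1']

// The original map is left untouched, so other items flowing
// through the channel are unaffected.
assert grouping_key == [id: 'sampleA', repeat: '1']
```

This is why `grouping_key + [interval: interval]` can safely extend the grouping key for each combined item without mutating anything shared between channel items.
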
@@ -926,16 +925,15 @@ We can reuse the `subMap` method from before to isolate our `id` and `interval` === "After" - ```groovy title="main.nf" linenums="25" + ```groovy title="main.nf" linenums="25" hl_lines="10-17" ch_combined_samples = ch_joined_samples.combine(ch_intervals) - .map { grouping_key, normal, tumor, interval -> - [ - grouping_key + [interval: interval], - normal, - tumor - ] - - } + .map { grouping_key, normal, tumor, interval -> + [ + grouping_key + [interval: interval], + normal, + tumor + ] + } ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> [ @@ -943,9 +941,8 @@ We can reuse the `subMap` method from before to isolate our `id` and `interval` normal, tumor ] - - } - .view() + } + .view() ``` === "Before" @@ -998,31 +995,29 @@ Let's now group the samples by this new grouping element, using the [`groupTuple === "After" - ```groovy title="main.nf" linenums="35" + ```groovy title="main.nf" linenums="35" hl_lines="8" ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> - [ - grouping_key.subMap('id', 'interval'), - normal, - tumor - ] - - } - .groupTuple() - .view() - ``` + [ + grouping_key.subMap('id', 'interval'), + normal, + tumor + ] + } + .groupTuple() + .view() + ``` === "Before" ```groovy title="main.nf" linenums="35" ch_grouped_samples = ch_combined_samples.map { grouping_key, normal, tumor -> - [ - grouping_key.subMap('id', 'interval'), - normal, - tumor - ] - - } - .view() + [ + grouping_key.subMap('id', 'interval'), + normal, + tumor + ] + } + .view() ``` Simple, huh? We just added a single line of code. Let's see what happens when we run it: @@ -1095,7 +1090,7 @@ Let's append our map to the end of our pipeline and show the resulting data stru === "After" - ```groovy title="main.nf" linenums="42" + ```groovy title="main.nf" linenums="42" hl_lines="2-8" .groupTuple() .map { sample_info, normal, tumor -> [ @@ -1151,7 +1146,7 @@ If we parse the data right at the start of our pipeline to _only_ include the `b === "After" - ```groovy title="main.nf" linenums="5" + ```groovy title="main.nf" linenums="5" hl_lines="4" getSampleIdAndReplicate = { sample -> [ sample.subMap(['id', 'repeat']), @@ -1182,7 +1177,7 @@ A reminder, this will select only the BAM files once we have separated the chann === "Before" - ```groovy title="main.nf" linenums="43" + ```groovy title="main.nf" linenums="43" hl_lines="2-8" .groupTuple() .map { sample_info, normal, tumor -> [ From edaa09ac3d704a5988f9d0d65590a435cf32e958 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Mon, 28 Apr 2025 17:26:52 +0100 Subject: [PATCH 35/36] fixup --- docs/side_quests/splitting_and_grouping.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/side_quests/splitting_and_grouping.md b/docs/side_quests/splitting_and_grouping.md index 8b15cb5e2d..caef4eff31 100644 --- a/docs/side_quests/splitting_and_grouping.md +++ b/docs/side_quests/splitting_and_grouping.md @@ -81,9 +81,9 @@ workflow { Throughout this tutorial, we'll use the `ch_` prefix for all channel variables to clearly indicate they are Nextflow channels. -=== "After" hl_lines="2-3" +=== "After" - ```groovy title="main.nf" linenums="2" + ```groovy title="main.nf" linenums="2" hl_lines="2-3" ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) .view() @@ -148,9 +148,9 @@ We now have a channel of maps, each representing a row from the samplesheet. 
Nex We can use the [`filter` operator](https://www.nextflow.io/docs/latest/operator.html#filter) to filter the data based on a condition. Let's say we only want to process normal samples. We can do this by filtering the data based on the `type` field. Let's insert this before the `view` operator. -=== "After" hl_lines="3" +=== "After" - ```groovy title="main.nf" linenums="2" + ```groovy title="main.nf" linenums="2" hl_lines="3" ch_samplesheet = Channel.fromPath("./data/samplesheet.csv") .splitCsv(header: true) .filter { sample -> sample.type == 'normal' } From 4173bf6c74445cb58099150683cc7d568d5cf6c6 Mon Sep 17 00:00:00 2001 From: adamrtalbot <12817534+adamrtalbot@users.noreply.github.com> Date: Fri, 2 May 2025 14:11:58 +0100 Subject: [PATCH 36/36] Fix formatting issue --- docs/side_quests/splitting_and_grouping.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/side_quests/splitting_and_grouping.md b/docs/side_quests/splitting_and_grouping.md index caef4eff31..60386db325 100644 --- a/docs/side_quests/splitting_and_grouping.md +++ b/docs/side_quests/splitting_and_grouping.md @@ -1005,7 +1005,7 @@ Let's now group the samples by this new grouping element, using the [`groupTuple } .groupTuple() .view() - ``` + ``` === "Before"