
Commit 5d5ddeb

Merge pull request #68 from seqeralabs/Parsing_move
Parsing move
2 parents dca815b + 9418321

File tree

31 files changed: 304 additions, 679 deletions


.gitpod.Dockerfile

Lines changed: 0 additions & 32 deletions
This file was deleted.

.gitpod.yml

Lines changed: 5 additions & 14 deletions
@@ -13,27 +13,18 @@ github:
     # add a "Review in Gitpod" button to pull requests (defaults to false)
     addBadge: false
 
-image:
-  file: .gitpod.Dockerfile
+# Old container: nfcore/gitpod:latest
+image: nfcore/gitpod:latest
 
 # List the start up tasks. Learn more https://www.gitpod.io/docs/config-start-tasks/
 tasks:
   - name: Start web server
     command: gp await-port 23000 && gp preview https://training.seqera.io
 
   - name: Download Nextflow Tutorial
-    init: |
-      echo 'init script' # runs during prebuild
-      echo 'start script'
-
 
     command: |
-      curl -s https://get.nextflow.io | bash
-      chmod +x nextflow
-      sudo mv nextflow /usr/local/bin/
-      docker pull nextflow/rnaseq-nf
-      sudo apt install -y tree
-      sudo apt install -y graphviz
-      unset JAVA_TOOL_OPTIONS
-      alias conda_activate=". /opt/conda/etc/profile.d/conda.sh; conda activate base"
       cd nf-training
+      conda init bash
+      unset JAVA_TOOL_OPTIONS
+      docker pull nextflow/rnaseq-nf

asciidocs/channels.adoc

Lines changed: 298 additions & 0 deletions
@@ -360,3 +360,301 @@ process fastqc {
}
----

=== Text files

The `splitText` operator allows you to split multi-line strings or text-file items emitted by a source channel into chunks of _n_ lines, which are emitted by the resulting channel. For example:

----
Channel
    .fromPath('data/meta/random.txt') // <1>
    .splitText() // <2>
    .view() // <3>
----

<1> Instructs Nextflow to make a channel from the path "data/meta/random.txt".
<2> The `splitText` operator splits each item into chunks of one line by default.
<3> View the contents of the channel.
You can define the number of lines in each chunk by using the parameter `by`, as shown in the following example:

----
Channel
    .fromPath('data/meta/random.txt')
    .splitText( by: 2 )
    .subscribe {
        print it;
        print "--- end of the chunk ---\n"
    }
----

TIP: The `subscribe` operator executes a user-defined function each time a new value is emitted by the source channel.
An optional closure can be specified in order to transform the text chunks produced by the operator. The following example shows how to split text files into chunks of 10 lines and transform them to uppercase:

----
Channel
    .fromPath('data/meta/random.txt')
    .splitText( by: 10 ) { it.toUpperCase() }
    .view()
----

You can also keep a count of the emitted lines:

----
count = 0

Channel
    .fromPath('data/meta/random.txt')
    .splitText()
    .view { "${count++}: ${it.toUpperCase().trim()}" }
----
Finally, you can also use the operator on plain files (outside of the channel context), as follows:

----
def f = file('data/meta/random.txt')
def lines = f.splitText()
def count = 0
for( String row : lines ) {
    log.info "${count++} ${row.toUpperCase()}"
}
----
=== Comma-separated values (.csv)

The `splitCsv` operator allows you to parse text items emitted by a channel that are formatted using the CSV format.

It then splits them into records or groups them into a list of records with a specified length.

In the simplest case, just apply the `splitCsv` operator to a channel emitting CSV-formatted text files or text entries. The following example shows how to view only the first and fourth columns of each row:

----
Channel
    .fromPath("data/meta/patients_1.csv")
    .splitCsv()
    // row is a list object
    .view { row -> "${row[0]},${row[3]}" }
----

When the CSV begins with a header line defining the column names, you can specify the parameter `header: true`, which allows you to reference each value by its column name, as shown in the following example:

----
Channel
    .fromPath("data/meta/patients_1.csv")
    .splitCsv(header: true)
    // row is a map keyed by the column names
    .view { row -> "${row.patient_id},${row.num_samples}" }
----

Alternatively, you can provide custom header names by specifying a list of strings in the `header` parameter, as shown below:

----
Channel
    .fromPath("data/meta/patients_1.csv")
    .splitCsv(header: ['col1', 'col2', 'col3', 'col4', 'col5'] )
    // row is a map keyed by the custom header names
    .view { row -> "${row.col1},${row.col4}" }
----

You can also process multiple CSV files at the same time:

----
Channel
    .fromPath("data/meta/patients_*.csv") // <-- just use a pattern
    .splitCsv(header: true)
    .view { row -> "${row.patient_id}\t${row.num_samples}" }
----

TIP: Notice that you can change the output format simply by using a different delimiter in the `view` closure.
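For instance, a minimal variation of the example above (a sketch, not part of the training scripts) prints the same fields separated by a semicolon instead of a tab:

----
Channel
    .fromPath("data/meta/patients_*.csv")
    .splitCsv(header: true)
    // same records as before, only the output delimiter changes
    .view { row -> "${row.patient_id};${row.num_samples}" }
----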
Finally, you can also operate on CSV files outside the channel context, as follows:

----
def f = file('data/meta/patients_1.csv')
def lines = f.splitCsv()
for( List row : lines ) {
    log.info "${row[0]} -- ${row[2]}"
}
----
[discrete]
=== Exercise

Try inputting FASTQ reads into the RNA-Seq workflow from earlier using `.splitCsv`.

.Click here for the answer:
[%collapsible]
====
Add a CSV text file named "fastq.csv" containing the following example input:

[source,nextflow,linenums]
----
gut,/workspace/nf-training-public/nf-training/data/ggal/gut_1.fq,/workspace/nf-training-public/nf-training/data/ggal/gut_2.fq
----

Then replace the input channel for the reads in `script7.nf`, changing the following lines:

[source,nextflow,linenums]
----
Channel
    .fromFilePairs( params.reads, checkIfExists: true )
    .into { read_pairs_ch; read_pairs2_ch }
----

To an input channel built with `splitCsv`:

[source,nextflow,linenums]
----
Channel
    .fromPath("fastq.csv")
    .splitCsv()
    .view { row -> "${row[0]},${row[1]},${row[2]}" }
    .into { read_pairs_ch; read_pairs2_ch }
----

Finally, change the cardinality of the processes that use the input data. For example, for the quantification process we change it from:

[source,nextflow,linenums]
----
process quantification {
    tag "$sample_id"

    input:
    path salmon_index from index_ch
    tuple val(sample_id), path(reads) from read_pairs_ch

    output:
    path sample_id into quant_ch

    script:
    """
    salmon quant --threads $task.cpus --libType=U -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id
    """
}
----

To:

[source,nextflow,linenums]
----
process quantification {
    tag "$sample_id"

    input:
    path salmon_index from index_ch
    tuple val(sample_id), path(reads1), path(reads2) from read_pairs_ch

    output:
    path sample_id into quant_ch

    script:
    """
    salmon quant --threads $task.cpus --libType=U -i $salmon_index -1 ${reads1} -2 ${reads2} -o $sample_id
    """
}
----

Repeat the same change for the fastqc step. The workflow should now run from a CSV file.
====
=== Tab-separated values (.tsv)

Parsing TSV files works in a similar way; just add the `sep: '\t'` option in the `splitCsv` context:

----
Channel
    .fromPath("data/meta/regions.tsv", checkIfExists: true)
    // use `sep` option to parse TAB separated files
    .splitCsv(sep: '\t')
    // row is a list object
    .view()
----

[discrete]
=== Exercise

Try using the tab-separation technique on the file "data/meta/regions.tsv", but print just the first column, and remove the header.

.Answer:
[%collapsible]
====
----
Channel
    .fromPath("data/meta/regions.tsv", checkIfExists: true)
    // use `sep` option to parse TAB separated files
    .splitCsv(sep: '\t', header: true)
    // row is a map keyed by the column names
    .view { row -> "${row.patient_id}" }
----
====
== More complex file formats

=== JSON

We can also easily parse the JSON file format using the following Groovy snippet:

----
import groovy.json.JsonSlurper

def f = file('data/meta/regions.json')
def records = new JsonSlurper().parse(f)

for( def entry : records ) {
    log.info "$entry.patient_id -- $entry.feature"
}
----

IMPORTANT: When using an older Groovy version, you may need to replace `parse(f)` with `parseText(f.text)`.
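As a sketch of that older form (same file as above, only the parsing call changes):

----
import groovy.json.JsonSlurper

def f = file('data/meta/regions.json')
// read the file contents as a string and parse the text instead
def records = new JsonSlurper().parseText(f.text)
----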
=== YAML

In a similar way, YAML files can be parsed like this:

----
import org.yaml.snakeyaml.Yaml

def f = file('data/meta/regions.json')
def records = new Yaml().load(f.text)

for( def entry : records ) {
    log.info "$entry.patient_id -- $entry.feature"
}
----
=== Storage of parsers into modules

The best way to store parser scripts is to keep them in a Nextflow module file.

This follows the DSL2 way of working.

See the following Nextflow script:

----
nextflow.preview.dsl=2

include { parseJsonFile } from './modules/parsers.nf'

process foo {
    input:
    tuple val(meta), path(data_file)

    """
    echo your_command $meta.region_id $data_file
    """
}

workflow {
    Channel.fromPath('data/meta/regions*.json') \
        | flatMap { parseJsonFile(it) } \
        | map { entry -> tuple(entry, "/some/data/${entry.patient_id}.txt") } \
        | foo
}
----

To get this script to work, we first need to create a file called `parsers.nf` and store it in the `modules` folder in the current directory.

The file should contain the `parseJsonFile` function; Nextflow will then use it as a custom function within the workflow scope.
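As a minimal sketch of what `parsers.nf` could contain, reusing the `JsonSlurper` approach shown earlier (the exact contents of the training material's module may differ):

----
import groovy.json.JsonSlurper

def parseJsonFile(json_file) {
    // parse the JSON file and return its records, so that the caller
    // can flatMap over the individual entries
    def records = new JsonSlurper().parse(json_file)
    return records
}
----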

asciidocs/containers.adoc

Lines changed: 1 addition & 6 deletions
@@ -381,12 +381,7 @@ Conda is a popular package and environment manager. The built-in support for Conda
 allows Nextflow pipelines to automatically create and activate the Conda
 environment(s), given the dependencies specified by each process.
 
-For this Gitpod tutorial you need to activate conda by typing:
-
-
-```bash
-conda_activate
-```
+For this Gitpod tutorial you need to open a new terminal to ensure that conda is activated (see the + button on the terminal).
 
 You should now see that the beginning of your command prompt has `(base)` written. If you wish to deactivate conda at any point, simply enter `conda deactivate`.

asciidocs/index.adoc

Lines changed: 0 additions & 1 deletion
@@ -30,5 +30,4 @@ include::config.adoc[]
 include::executors.adoc[]
 include::cache_and_resume.adoc[]
 include::debugging.adoc[]
-include::parsing.adoc[]
 :leveloffset: -1
