From 5c7a9d632be513c6eb973ce22429733cb7bc53f3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E2=80=98ens-ftricomi=E2=80=99?= Date: Mon, 13 May 2024 11:47:13 +0100 Subject: [PATCH] added README --- README.md | 181 +++++++++++++++++++++ pipelines/nextflow/workflows/statistics.nf | 8 +- 2 files changed, 185 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index e69de29..dc77ac3 100644 --- a/README.md +++ b/README.md @@ -0,0 +1,181 @@ +# Genebuild statistics pipeline + +The pipeline provides Busco, Omark completeness scores, calculates statistics for Ensembl website when the core database is available. +If only the assembly accession and the taxon id are available the pipeline provide Busco score (mode=genome) for the assembly. + +## Running options + +The following options require a list of mandatory arguments (see `Mandatory arguments`). + +## Busco pipeline `--run_busco_core` + +Busco is a measure of completeness of genome assembly and annotation of the gene set. See the documentation for further details [Busco userguide](https://busco.ezlab.org/busco_userguide.html) + +Docker image available in https://hub.docker.com/r/ezlabgva/busco + +#### `--mode` +Select Busco mode, i.e. genome mode (assess a genome assembly), protein mode (assess a gene set) or both. By default, run both modes. +--busco_mode STR Busco mode: genome or protein, default is to run both' + +#### `--copyToFtp` +Boolean option to copy output in Ensembl ftp, default false + +#### `--host` +The host name for the databases + +#### `--port` +The port number of the host + +#### `--user` +The read/wrote username for the host. + +#### `--user_r` +The read only username for the host. + +#### `--password` +The database password. + +```bash +nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl --enscode $ENSCODE --csvFile --outDir --host --port --user --user_r --password --mode --run_busco_core true -profile slurm +``` + +## OMArk pipeline `--run_omark` + +OMArk is a software of proteome (protein-coding gene repertoire) quality assessment. It provides measure of proteome completeness, characterize all protein coding genes in the light of existing homologs, and identify the presence of contamination from other species. +Further information available in the official repo https://github.com/DessimozLab/OMArk + +#### `--copyToFtp` +Boolean option to copy output in Ensembl ftp, default false + +#### `--host` +The host name for the databases + +#### `--port` +The port number of the host + +#### `--user` +The read/wrote username for the host. + +#### `--user_r` +The read only username for the host. + +#### `--password` +The database password. + +```bash +nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl --enscode $ENSCODE --csvFile --outDir --host --port --user --user_r --password --run_omark true -profile slurm +``` + +## Ensembl statistics pipeline `--run_ensembl_stats` + +The pipeline calculate core statistics for Ensembl browser. + +### `--apply_stats` +Boolean option to upload Ensembl statistics in a mysql db, default false + +#### `--copyToFtp` +Boolean option to copy output in Ensembl ftp, default false + +#### `--host` +The host name for the databases + +#### `--port` +The port number of the host + +#### `--user` +The read/wrote username for the host. + +#### `--user_r` +The read only username for the host. + +#### `--password` +The database password. + +```bash +nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl --enscode $ENSCODE --csvFile --outDir --host --port --user --user_r --password --run_ensembl_stats true -profile slurm +``` + +## Busco NCBI genome pipeline `--run_busco_ncbi` + +Option available to check the quality of the genome by running Busco in genome mode. + +```bash +nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS --bioperl --enscode $ENSCODE --csvFile --outDir --run_busco_ncbi true -profile slurm +``` + + +## Requirements + +### Mandatory arguments + +#### `--csvFile` +The structure of the file can cahnge according to the running options +| Running mode | csv file format | +|-----------------|--------| +| --run_busco_core | core | +| | | +| --run_omark | core | +| | | +| --run_busco_ncbi | gca,taxon_id | +| | , | + +#### `--enscode` +Path to the root directory containing the Perl repositories + +#### `--outDir` +Path to the directory where to store the results of the pipeline + +### Optional arguments + +#### `--bioperl` +Path to the directory containing the BioPerl 1.6.924 library. If not provided, the value passed to `--enscode` will be used as root, i.e. `/bioperl-1.6.924`. + +#### `--cacheDir` +Path to the directory to use as cache for the intermediate files. If not provided, the value passed to `--outDir` will be used as root, i.e. `/cache`. + +#### `--files_latency` +Sleep time (in seconds) after the genome and proteins have been fetched. Needed by several file systems due to their internal latency. By default, 60 seconds. + +### Pipeline configuration + +#### Using the provided nextflow.config +We are using profiles to be able to run the pipeline on different HPC clusters. The default is `standard`. + +* `standard`: uses LSF to run the compute heavy jobs. It expects the usage of `scratch` to use a low latency filesystem. +* `slurm`: uses SLURM to run the compute heavy jobs. It expects the usage of `scratch` to use a low latency filesystem. + + +#### Using a local configuration file +You can use a local config with `-c` to finely configure your pipeline. All parameters can be configured, we recommend setting these ones as well: + +* `process.scratch`: The patch to the scratch directory to use +* `workDir`: The directory where nextflow stores any file + +### Information about all the parameters + +```bash +nextflow run ./ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf --help +``` + + +#### Ensembl dependencies +These are the Ensembl repositories required by this pipeline: + +| Repository name | branch | URL| +|-----------------|--------|----| +| ensembl | default | https://github.com/Ensembl/ensembl.git | +| ensembl-analysis | main | https://github.com/Ensembl/ensembl-analysis.git | +| ensembl-io | default | https://github.com/Ensembl/ensembl-io.git | +|core_meta_updates + +It is recommended that all the repositories are cloned into the same folder. + +Remember that, following the instructions in [Ensembl's Perl API installation](http://www.ensembl.org/info/docs/api/api_installation.html), you will also need to have BioPerl v1.6.924 available in your system. If you do not, you can install it executing the following commands: + +```bash +wget https://github.com/bioperl/bioperl-live/archive/release-1-6-924.zip +unzip release-1-6-924.zip +mv bioperl-live-release-1-6-924 bioperl-1.6.924 +``` + +It is recommended to install it in the same folder as the Ensembl repositories. diff --git a/pipelines/nextflow/workflows/statistics.nf b/pipelines/nextflow/workflows/statistics.nf index 2ce5c7c..52f9a78 100644 --- a/pipelines/nextflow/workflows/statistics.nf +++ b/pipelines/nextflow/workflows/statistics.nf @@ -65,7 +65,7 @@ if (params.help) { log.info 'Usage: ' log.info 'nextflow -C ensembl-genes-nf/pipelines/nextflow/workflows/nextflow.config \ run ensembl-genes-nf/pipelines/nextflow/workflows/main.nf \ - --enscode --csvFile --outDir --host --port --user --bioperl --project \ + -entry STATISTICS --enscode --csvFile --outDir --host --port --user --bioperl --project \ --run_busco_core --run_busco_ncbi --run_omark --run_ensembl_stats \ --apply_stats --copyToFtp --busco_mode' log.info '' @@ -75,7 +75,7 @@ if (params.help) { log.info ' --user STR Db user' log.info ' --user_r STR Db user read_only' log.info ' --enscode STR Enscode path' - log.info ' --outDir STR Output directory. Default is workDir' + log.info ' --outDir STR Output directory.' log.info ' --csvFile STR Path for the csv containing the db name' log.info ' --bioperl STR BioPerl path (optional)' log.info ' --project STR Project, for the formatting of the output ("ensembl" or "brc")' @@ -145,8 +145,8 @@ workflow STATISTICS{ } minimal_data = Channel.from(minimalProcessedData) - data1=minimal_data - data1.each{ d-> d.view()} + //data1=minimal_data + //data1.each{ d-> d.view()} /* data = Channel.fromPath(params.csvFile, type: 'file', checkIfExists: true) .splitCsv( header:true)