Commit 5c7a9d6: added README

ens-ftricomi committed May 13, 2024
1 parent ed6c2ac commit 5c7a9d6
Showing 2 changed files with 185 additions and 4 deletions.
181 changes: 181 additions & 0 deletions README.md
# Genebuild statistics pipeline

The pipeline provides Busco and OMArk completeness scores and calculates statistics for the Ensembl website when the core database is available.
If only the assembly accession and the taxon ID are available, the pipeline provides a Busco score (mode=genome) for the assembly.

## Running options

Each of the following running options requires a set of mandatory arguments (see `Mandatory arguments`).

## Busco pipeline `--run_busco_core`

Busco measures the completeness of a genome assembly and of the annotated gene set. See the [Busco user guide](https://busco.ezlab.org/busco_userguide.html) for further details.

A Docker image is available at https://hub.docker.com/r/ezlabgva/busco.

#### `--mode`
Select the Busco mode, i.e. genome mode (assess a genome assembly), protein mode (assess a gene set), or both. By default, both modes are run.
`--busco_mode STR` Busco mode: `genome` or `protein`; the default is to run both.

#### `--copyToFtp`
Boolean option to copy the output to the Ensembl FTP site; the default is `false`.

#### `--host`
The host name of the database server.

#### `--port`
The port number of the host.

#### `--user`
The read/write username for the host.

#### `--user_r`
The read-only username for the host.

#### `--password`
The database password.

```bash
nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config \
    run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS \
    --bioperl <bioperl_lib> --enscode $ENSCODE \
    --csvFile <csv_file_path> --outDir <output_dir_path> \
    --host <mysql_host> --port <mysql_port> \
    --user <user> --user_r <read_user> --password <mysql_password> \
    --mode <busco_mode> --run_busco_core true \
    -profile slurm
```

## OMArk pipeline `--run_omark`

OMArk is a software tool for proteome (protein-coding gene repertoire) quality assessment. It provides measures of proteome completeness, characterises all protein-coding genes in the light of existing homologs, and identifies contamination from other species.
Further information is available in the official repository: https://github.com/DessimozLab/OMArk

#### `--copyToFtp`
Boolean option to copy the output to the Ensembl FTP site; the default is `false`.

#### `--host`
The host name of the database server.

#### `--port`
The port number of the host.

#### `--user`
The read/write username for the host.

#### `--user_r`
The read-only username for the host.

#### `--password`
The database password.

```bash
nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config \
    run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS \
    --bioperl <bioperl_lib> --enscode $ENSCODE \
    --csvFile <csv_file_path> --outDir <output_dir_path> \
    --host <mysql_host> --port <mysql_port> \
    --user <user> --user_r <read_user> --password <mysql_password> \
    --run_omark true \
    -profile slurm
```

## Ensembl statistics pipeline `--run_ensembl_stats`

The pipeline calculates the core statistics for the Ensembl browser.

#### `--apply_stats`
Boolean option to load the Ensembl statistics into a MySQL database; the default is `false`.

#### `--copyToFtp`
Boolean option to copy the output to the Ensembl FTP site; the default is `false`.

#### `--host`
The host name of the database server.

#### `--port`
The port number of the host.

#### `--user`
The read/write username for the host.

#### `--user_r`
The read-only username for the host.

#### `--password`
The database password.

```bash
nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config \
    run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS \
    --bioperl <bioperl_lib> --enscode $ENSCODE \
    --csvFile <csv_file_path> --outDir <output_dir_path> \
    --host <mysql_host> --port <mysql_port> \
    --user <user> --user_r <read_user> --password <mysql_password> \
    --run_ensembl_stats true \
    -profile slurm
```

## Busco NCBI genome pipeline `--run_busco_ncbi`

This option checks the quality of the genome by running Busco in genome mode, using only the assembly accession and the taxon ID.

```bash
nextflow -C $ENSCODE/ensembl-genes-nf/nextflow.config \
    run $ENSCODE/ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf -entry STATISTICS \
    --bioperl <bioperl_lib> --enscode $ENSCODE \
    --csvFile <csv_file_path> --outDir <output_dir_path> \
    --run_busco_ncbi true \
    -profile slurm
```


## Requirements

### Mandatory arguments

#### `--csvFile`
The structure of the file changes according to the running option:

| Running mode | csv header line | csv data line |
|--------------------|-----------------|----------------------|
| `--run_busco_core` | `core` | `<db_name>` |
| `--run_omark` | `core` | `<db_name>` |
| `--run_busco_ncbi` | `gca,taxon_id` | `<gca>,<taxon_id>` |
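
For example, a csv file for `--run_busco_core` or `--run_omark` would contain the `core` header followed by a database name; the database name below is a made-up placeholder:

```text
core
homo_sapiens_core_110_38
```

For `--run_busco_ncbi`, the file pairs an assembly accession with its taxon ID (here the human GRCh38 accession, purely as an illustration):

```text
gca,taxon_id
GCA_000001405.29,9606
```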

#### `--enscode`
Path to the root directory containing the Perl repositories

#### `--outDir`
Path to the directory where the results of the pipeline will be stored

### Optional arguments

#### `--bioperl`
Path to the directory containing the BioPerl 1.6.924 library. If not provided, the value passed to `--enscode` will be used as root, i.e. `<enscode>/bioperl-1.6.924`.

#### `--cacheDir`
Path to the directory to use as cache for the intermediate files. If not provided, the value passed to `--outDir` will be used as root, i.e. `<outDir>/cache`.

#### `--files_latency`
Sleep time (in seconds) after the genome and proteins have been fetched. Needed on several file systems because of their internal latency. The default is 60 seconds.

### Pipeline configuration

#### Using the provided nextflow.config
We use profiles to run the pipeline on different HPC clusters. The default is `standard`.

* `standard`: uses LSF to run the compute-heavy jobs. It expects `scratch` to point to a low-latency filesystem.
* `slurm`: uses SLURM to run the compute-heavy jobs. It expects `scratch` to point to a low-latency filesystem.


#### Using a local configuration file
You can use a local config file with `-c` to fine-tune your pipeline. All parameters can be configured; we recommend setting these as well (see the sketch below):

* `process.scratch`: the path to the scratch directory to use
* `workDir`: the directory where Nextflow stores its working files
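
A minimal local configuration file could look like the sketch below; the paths are placeholders to adapt to your own filesystem, not values required by the pipeline:

```groovy
// local.config -- placeholder paths, adapt to your system
process.scratch = '/path/to/node/scratch'   // low-latency scratch space used by the compute-heavy jobs
workDir         = '/path/to/nextflow/work'  // where Nextflow stores its working files
```

It can then be passed to the pipeline on the command line with `-c local.config`.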

### Information about all the parameters

```bash
nextflow run ./ensembl-genes-nf/pipelines/nextflow/workflows/statistics.nf --help
```


#### Ensembl dependencies
These are the Ensembl repositories required by this pipeline:

| Repository name | branch | URL|
|-----------------|--------|----|
| ensembl | default | https://github.com/Ensembl/ensembl.git |
| ensembl-analysis | main | https://github.com/Ensembl/ensembl-analysis.git |
| ensembl-io | default | https://github.com/Ensembl/ensembl-io.git |
| core_meta_updates | | |

It is recommended that all the repositories are cloned into the same folder.
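
For example, assuming `$ENSCODE` is that shared folder, the repositories whose URLs are listed above could be cloned as follows (core_meta_updates is left out here because its URL is not given in the table):

```bash
# Clone the Ensembl dependencies side by side in $ENSCODE
cd $ENSCODE
git clone https://github.com/Ensembl/ensembl.git                         # default branch
git clone --branch main https://github.com/Ensembl/ensembl-analysis.git
git clone https://github.com/Ensembl/ensembl-io.git                      # default branch
```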

Remember that, following the instructions in [Ensembl's Perl API installation](http://www.ensembl.org/info/docs/api/api_installation.html), you will also need to have BioPerl v1.6.924 available in your system. If you do not, you can install it executing the following commands:

```bash
wget https://github.com/bioperl/bioperl-live/archive/release-1-6-924.zip
unzip release-1-6-924.zip
mv bioperl-live-release-1-6-924 bioperl-1.6.924
```

It is recommended to install it in the same folder as the Ensembl repositories.
8 changes: 4 additions & 4 deletions pipelines/nextflow/workflows/statistics.nf
```diff
@@ -65,7 +65,7 @@ if (params.help) {
 log.info 'Usage: '
 log.info 'nextflow -C ensembl-genes-nf/pipelines/nextflow/workflows/nextflow.config \
 run ensembl-genes-nf/pipelines/nextflow/workflows/main.nf \
---enscode --csvFile --outDir --host --port --user --bioperl --project \
+-entry STATISTICS --enscode --csvFile --outDir --host --port --user --bioperl --project \
 --run_busco_core --run_busco_ncbi --run_omark --run_ensembl_stats \
 --apply_stats --copyToFtp --busco_mode'
 log.info ''
@@ -75,7 +75,7 @@ if (params.help) {
 log.info ' --user STR Db user'
 log.info ' --user_r STR Db user read_only'
 log.info ' --enscode STR Enscode path'
-log.info ' --outDir STR Output directory. Default is workDir'
+log.info ' --outDir STR Output directory.'
 log.info ' --csvFile STR Path for the csv containing the db name'
 log.info ' --bioperl STR BioPerl path (optional)'
 log.info ' --project STR Project, for the formatting of the output ("ensembl" or "brc")'
@@ -145,8 +145,8 @@ workflow STATISTICS{
 }
 
 minimal_data = Channel.from(minimalProcessedData)
-data1=minimal_data
-data1.each{ d-> d.view()}
+//data1=minimal_data
+//data1.each{ d-> d.view()}
 /*
 data = Channel.fromPath(params.csvFile, type: 'file', checkIfExists: true)
 .splitCsv( header:true)
```
