This is where the workflows for running APAeval participants (= "method workflows") live.
NOTE: The following sections give in-depth instructions on how to create new APAeval method workflows. If you're looking for instructions on how to run an existing workflow for one of our benchmarked methods, please refer to the README.md in the respective directory. You can find quick links to those directories in the participant overview table below. In any case, make sure you have the APAeval conda environment set up and running.
List of bioinformatic methods benchmarked in APAeval. Please update columns as the method workflows progress.
Method | Citation | Type | Status in APAeval | Benchmarked | OpenEBench link
---|---|---|---|---|---
APA-Scan | Fahmi et al. 2020 | Identification, relative quantification, differential usage | Issue #26, PR #160 | No (incompatible with APAeval input and metrics, bugs) | https://dev-openebench.bsc.es/tool/apa-scan
APAlyzer | Wang & Tian 2020 | Relative quantification, differential usage | Snakemake workflow | No (incompatible with APAeval metrics) | https://dev-openebench.bsc.es/tool/apalyzer
APAtrap | Ye et al. 2018 | Identification, absolute and relative quantification, differential usage | Nextflow workflow (high time/memory consumption), Issue #244 | Yes | NA
Aptardi | Lusk et al. 2021 | Identification | Nextflow workflow (high time/memory consumption, only tested on small test files, no ML model building, uses authors’ published model) | No (time/memory issues) | https://openebench.bsc.es/tool/aptardi
CSI-UTR | Harrison et al. 2019 | Differential usage | Issue #388, Nextflow workflow (only tested on small test files) | No (incompatible with APAeval inputs, bugs) | NA
DaPars | Xia et al. 2014 | Identification, relative quantification, differential usage | Nextflow workflow | Yes | NA
DaPars2 | Feng et al. 2018 | Identification, relative quantification, differential usage | Snakemake workflow | Yes | NA
diffUTR | Gerber et al. 2021 | Differential usage | Nextflow workflow (only tested on small test files) | No (incompatible with APAeval metrics) | https://dev-openebench.bsc.es/tool/diffutr
GETUTR | Kim et al. 2015 | Identification, relative quantification, differential usage | Nextflow workflow | Yes | https://openebench.bsc.es/tool/getutr
IsoSCM | Shenker et al. 2015 | Identification, relative quantification, differential usage | Nextflow workflow | Yes | https://dev-openebench.bsc.es/tool/isoscm
LABRAT | Goering et al. 2020 | Relative quantification, differential usage | Nextflow workflow (only tested on small test files), Issue #406 | No (incompatible with APAeval metrics) | https://openebench.bsc.es/tool/labrat
MISO | Katz et al. 2010 | Absolute and relative quantification, differential usage | Issue #36, PR #85 | No (incompatible with APAeval input) | https://openebench.bsc.es/tool/miso
mountainClimber | Cass & Xiao 2019 | Identification, quantification, differential usage (according to publication) | Issue #37, PR #86 | No (bugs, utter lack of user-friendliness) | https://openebench.bsc.es/tool/mountainclimber
PAQR | Gruber et al. 2014 | Absolute and relative quantification, differential usage | Snakemake workflow, Issue #457 | Yes | https://openebench.bsc.es/tool/paqr
QAPA | Ha et al. 2018 | Absolute and relative quantification, differential usage | Nextflow workflow (hardcoded defaults, build mode in beta, we recommend using pre-built annotations), Issue #457 | Yes | https://openebench.bsc.es/tool/qapa
Roar | Grassi et al. 2016 | Relative quantification, differential usage | PR #161, Issue #38 | No (incompatible with APAeval input) | https://openebench.bsc.es/tool/roar
TAPAS | Arefeen et al. 2018 | Identification, relative quantification, differential usage | Nextflow workflow (differential usage functionality not implemented) | Yes | https://openebench.bsc.es/tool/tapas
Method workflows contain all steps that need to be run per method (in OEB terms: per participant). Depending on the participant, a method workflow may have to perform pre-processing steps to convert the APAeval-sanctioned input files into a format the participant can consume. This does not include, e.g., adapter trimming or mapping of reads, as those steps are already performed in our general pre-processing pipeline. After pre-processing, the actual execution of the method has to be implemented, and subsequently post-processing steps might be required to convert the obtained output into the format defined by the APAeval specifications.
- Sanctioned input files: Each processed input dataset APAeval uses for its challenges is provided as a .bam file (see specifications for file formats). If participants need other file formats, these HAVE TO be created as part of the pre-processing within method workflows (see "Method execution" below). Similarly, for each dataset we provide a gencode annotation in .gtf format, as well as a reference PAS atlas in .bed format for participants that depend on pre-defined PAS. All other annotation formats that might be needed HAVE TO be created from those. Non-sanctioned annotation files or similar auxiliary files MUST NOT be downloaded as part of the method workflows, in order to ensure comparability of all participants’ performance.
As several method workflows might have to do the same pre-processing tasks, we created a utils directory, where scripts (which have their corresponding docker images uploaded to the APAeval dockerhub) are stored. Please check the utils directory before writing your own conversion scripts, and/or add your pre-processing scripts to the utils directory if you think others might be able to re-use them.
- Method execution: For each method to be benchmarked (“participant”), one method workflow has to be written. The workflow MUST include all necessary pre- and post-processing steps that are needed to get from the input formats provided by APAeval (see "Sanctioned input files" above) to the output specified by APAeval in their metrics specifications (see "Post-processing" below). The workflow should include run mode parameters for the benchmarking events that it qualifies for, set to either true or false (e.g. run_identification = true); a sketch of this pattern is shown after this list. Each run of the method workflow should output files for the events whose run modes are set to true. If a method has distinct run modes other than those concerning the three benchmarking events, the calls to those should also be parameterized. If those run modes could significantly alter the behaviour of the method, please discuss with the APAeval community whether the distinct run modes should actually be treated as distinct participants in APAeval (see section on parameters). That could for example be the case if the method can be run with either mathematical model A or model B, and the expected results would be quite different. At the moment we can't foresee all possibilities, so we count on you to report and discuss any such cases. In any case, please do document extensively how the method can be used and how you employed it. In general, all relevant participant parameters should be configurable in the workflow config files. Parameters, file names, run modes, etc. MUST NOT be hardcoded within the workflow.
IMPORTANT: Do not download any other annotation files just because your participant's documentation says so. Instead, create all files the participant needs from the ones provided by APAeval. If you don't know how, please don't hesitate to start discussions within the APAeval community! Chances are high that somebody has already encountered a similar problem and will be able to help.
- Post-processing: To ensure compatibility with the OEB benchmarking events, specifications for file formats (output of method workflows = input for benchmarking workflows) are provided by APAeval. There is one specification per metric (= statistical parameter to assess the performance of a participant), but the calculation of several metrics can require a common input file format (thus, the file has to be created only once by the method workflow). The required method workflow outputs are: a bed file containing coordinates of identified PAS (and their respective expression in TPM, if applicable); a tsv file containing information on differential expression (if applicable); and a json file containing information about compute resource and time requirements (see output specifications for a detailed description of the file formats). These files have to be created within the method workflows as post-processing steps.
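To illustrate the run-mode parameterization mentioned under "Method execution", here is a minimal Snakemake-style sketch. The config keys, paths, and OUTCODEs below are hypothetical placeholders, not prescribed by APAeval; check the output specifications for the actual codes and formats.

```python
# Snakefile (sketch) -- run modes and all paths come from the config, nothing is hardcoded.
# All keys and OUTCODEs below are illustrative placeholders.
configfile: "config/config.MYTOOL.yaml"

final_outputs = []
if config["run_identification"]:
    # e.g. bed file with identified PAS, named PARTICIPANT.CHALLENGE.OUTCODE.ext
    final_outputs.append(f"{config['outdir']}/{config['participant']}.{config['challenge']}.01.bed")
if config["run_quantification"]:
    final_outputs.append(f"{config['outdir']}/{config['participant']}.{config['challenge']}.02.bed")
if config["run_differential_usage"]:
    final_outputs.append(f"{config['outdir']}/{config['participant']}.{config['challenge']}.03.tsv")

rule all:
    input: final_outputs
```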
Method workflows should be implemented in either Nextflow or Snakemake, and individual steps should be isolated through the use of containers. For more information on how to create these containers, see the section on containers.
To implement a method workflow for a participant, copy either the snakemake template or the nextflow template (dsl1 or dsl2) into the participant's directory and adapt the workflow directory names as described in the template's README. Don't forget to adapt the README itself as well.
Example:
method_workflows/
|--QAPA/
   |--QAPA_snakemake/
      |--workflow/Snakefile
      |--config/config.QAPA.yaml
      |--envs/QAPA.yaml
      |--envs/QAPA.Dockerfile
      |-- ...
|--MISO/
   |--MISO_nextflow/
      |-- ...
For the sake of reproducibility and interoperability, we require the use of Docker containers in our method workflows. Not only does the participant to be benchmarked have to be available in a container; any other tools used for pre- or post-processing in a method workflow should be containerized as well. Whether you get individual containers for all the tools of your workflow or combine them inside one container is up to you (the former being the more flexible option, of course).
IMPORTANT: Do check out the utils directory before you work on containers for pre- or post-processing tools; maybe someone has already done the same thing. If not, and you're going to build useful containers, don't forget to add them there as well.
Here are some pointers on how to best approach the containerization:
- Check if your participant (or other tool) is already available as a Docker container, e.g. at
  - dockerhub
  - biocontainers
  - google for `[TOOL_NAME] Docker` or `[TOOL_NAME] Dockerfile`
- If no Docker image is available for your tool:
  - create a container on BioContainers via either a bioconda recipe or a Dockerfile
  - naming conventions:
    - if your container only contains one tool: `apaeval/{tool_name}:{tool_version}`, e.g. `apaeval/my_tool:v1.0.0`
    - if you combine all tools required for your workflow: `apaeval/mwf_{participant_name}:{commit_hash}`, where `commit_hash` is the short SHA of the Git commit in the APAeval repo that last modified the corresponding Dockerfile, e.g. `65132f2`
- Now you just have to specify the docker image(s) in your method workflow:
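For example, in Snakemake this can be done with the `container` directive. The rule below is only a sketch: the image name follows the convention above, and the tool name and its command-line options are hypothetical.

```python
# Snakemake rule (sketch): image name follows the naming convention above;
# "my_tool" and its command-line options are hypothetical.
rule run_my_tool:
    input:
        bam="{sample}.bam"
    output:
        bed="{sample}.my_tool_raw.bed"
    container:
        "docker://apaeval/my_tool:v1.0.0"
    shell:
        "my_tool --bam {input.bam} --out {output.bed}"
```

In Nextflow, the per-process `container` directive serves the same purpose.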
For more information about input files, see "Sanctioned input files" above. For development and debugging you can use the small test input dataset we provide with this repository. You should use the `.bam` and/or `.gtf` files as input to your workflow. The `.bed` file serves as an example for a ground truth file. As long as the `test_data` directory doesn't contain a "poly(A) sites database file", which some methods will require, you should also use the `.bed` file for testing purposes.
Both the snakemake template and the nextflow template contain example `samples.csv` files. Here you'd fill in the paths to the samples you'd be running, and any other sample-specific information required by the workflow you're implementing. Thus, you can/must adapt the fields of this `samples.csv` according to your workflow's requirements.
Moreover, both workflow languages require additional information in `config` files. This is the place to specify run- or participant-specific parameters.
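A common pattern in a Snakemake-based method workflow is to read both files at the top of the Snakefile. The sketch below assumes hypothetical column and key names; adapt them to the template you copied.

```python
# Snakefile (sketch): read run parameters and the sample sheet from the config.
import pandas as pd

configfile: "config/config.MYTOOL.yaml"

# Sample sheet with one row per sample; the column names are illustrative.
samples = pd.read_csv(config["samples"]).set_index("sample", drop=False)

def bam_for(wildcards):
    # Look up the input bam path for a given sample.
    return samples.loc[wildcards.sample, "bam"]
```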
Important notes:
- Describe in your README extensively where parameters (sample info, participant specific parameters) have to be specified for a new run of the pipeline.
- Describe in the README if your participant has different run modes, or parameter settings that might alter the participant's performance considerably. In such a case you should suggest that the different modes be treated in APAeval as entirely distinct participants. Feel free to start discussions about this in our GitHub discussions board.
- Parameterize your code as much as possible, so that the user will only have to change the sample sheet and config file, and not the code. E.g. output file paths should be built from information the user has filled into the sample sheet or config file.
- For information on how files need to be named, see below!
In principle you are free to store output files how it best suits you (or the participant).
However, the "real" and final outputs for each run of the benchmarking will need to be copied to a directory in the format
PATH/TO/APAEVAL/EVENT/PARTICIPANT/
This directory must contain:
- Output files (check formats and filenames)
- Configuration files (with parameter settings), e.g. `config.yaml` and `samples.csv`
- `logs/` directory with all log files created by the workflow execution
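Using the MISO identification example from below, such a directory could look like this (all names apart from the output file are illustrative):

Identification_01/
|--MISO/
   |--MISO.P19_siControl_R1.01.bed
   |--config.yaml
   |--samples.csv
   |--logs/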
File formats for the 3 benchmarking events are described in the output specification.
As mentioned above, it is best to parameterize filenames, such that for each run the names and codes can be set by changing only the sample sheet and config file!
File names must adhere to the following schema: `PARTICIPANT.CHALLENGE.OUTCODE.ext`
For the codes please refer to the following documents:
- PARTICIPANT: same as directory name in `method_workflows`
- CHALLENGE: `sample_name` in APAeval Zenodo snapshot
- OUTCODE: in `method_workflow_file_specification.md`
Example:
`Identification_01/MISO/MISO.P19_siControl_R1.01.bed` would be the output of MISO (your participant) for the identification benchmarking event (OUTCODE 01, we know that from `method_workflow_file_specification.md`), run on dataset "P19_siControl_R1" (exact name as `sample_name` in APAeval Zenodo snapshot).
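If you assemble the file name programmatically, a minimal sketch could look like this; the values mirror the MISO example above, and in a real workflow they would come from the config and sample sheet rather than being hardcoded.

```python
# Sketch: build an output file name following PARTICIPANT.CHALLENGE.OUTCODE.ext.
participant = "MISO"               # directory name in method_workflows
challenge = "P19_siControl_R1"     # sample_name from the APAeval Zenodo snapshot
outcode = "01"                     # identification, per method_workflow_file_specification.md
ext = "bed"

outfile = f"{participant}.{challenge}.{outcode}.{ext}"
assert outfile == "MISO.P19_siControl_R1.01.bed"
```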
At least 2 independent reviews are required before your code can be merged into the main APAeval branch. Why not review some other PR while you wait for yours to be accepted? You can find some instructions in Sam's PR review guide.