This is where the workflows for running APAeval participants (= "method workflows") live.
NOTE: The following sections give in-depth instructions on how to create new APAeval method workflows. If you're looking for instructions on how to run an existing workflow for one of our benchmarked methods, please refer to the README.md in the respective directory. You can find quick links to those directories in the participant overview table below. In any case, make sure you have the APAeval conda environment set up and running.
List of bioinformatic methods benchmarked in APAeval. Please update columns as the method workflows progress.
Method | Citation | Type | Status in APAeval | Benchmarked | OpenEBench link
---|---|---|---|---|---
APA-Scan | Fahmi et al. 2020 | Identification, relative quantification, differential usage | Issue #26, PR #160 | No (incompatible with APAeval input and metrics, bugs) | https://dev-openebench.bsc.es/tool/apa-scan
APAlyzer | Wang & Tian 2020 | Relative quantification, differential usage | Snakemake workflow | No (incompatible with APAeval metrics) | https://dev-openebench.bsc.es/tool/apalyzer
APAtrap | Ye et al. 2018 | Identification, absolute and relative quantification, differential usage | Nextflow workflow (high time/memory consumption), Issue #244 | Yes | NA
Aptardi | Lusk et al. 2021 | Identification | Nextflow workflow (high time/memory consumption, only tested on small test files, no ML model building, uses authors’ published model) | No (time/memory issues) | https://openebench.bsc.es/tool/aptardi
CSI-UTR | Harrison et al. 2019 | Differential usage | Issue #388, Nextflow workflow (only tested on small test files) | No (incompatible with APAeval inputs, bugs) | NA
DaPars | Xia et al. 2014 | Identification, relative quantification, differential usage | Nextflow workflow | Yes | NA
DaPars2 | Feng et al. 2018 | Identification, relative quantification, differential usage | Snakemake workflow | Yes | NA
diffUTR | Gerber et al. 2021 | Differential usage | Nextflow workflow (only tested on small test files) | No (incompatible with APAeval metrics) | https://dev-openebench.bsc.es/tool/diffutr
GETUTR | Kim et al. 2015 | Identification, relative quantification, differential usage | Nextflow workflow | Yes | https://openebench.bsc.es/tool/getutr
IsoSCM | Shenker et al. 2015 | Identification, relative quantification, differential usage | Nextflow workflow | Yes | https://dev-openebench.bsc.es/tool/isoscm
LABRAT | Goering et al. 2020 | Relative quantification, differential usage | Nextflow workflow (only tested on small test files), Issue #406 | No (incompatible with APAeval metrics) | https://openebench.bsc.es/tool/labrat
MISO | Katz et al. 2010 | Absolute and relative quantification, differential usage | Issue #36, PR #85 | No (incompatible with APAeval input) | https://openebench.bsc.es/tool/miso
mountainClimber | Cass & Xiao 2019 | Identification, quantification, differential usage (according to publication) | Issue #37, PR #86 | No (bugs, utter lack of user-friendliness) | https://openebench.bsc.es/tool/mountainclimber
PAQR | Gruber et al. 2014 | Absolute and relative quantification, differential usage | Snakemake workflow, Issue #457 | Yes | https://openebench.bsc.es/tool/paqr
QAPA | Ha et al. 2018 | Absolute and relative quantification, differential usage | Nextflow workflow (hardcoded defaults, build mode in beta, we recommend using pre-built annotations), Issue #457 | Yes | https://openebench.bsc.es/tool/qapa
Roar | Grassi et al. 2016 | Relative quantification, differential usage | PR #161, Issue #38 | No (incompatible with APAeval input) | https://openebench.bsc.es/tool/roar
TAPAS | Arefeen et al. 2018 | Identification, relative quantification, differential usage | Nextflow workflow (differential usage functionality not implemented) | Yes | https://openebench.bsc.es/tool/tapas
Method workflows contain all steps that need to be run per method (in OEB terms: per participant). Depending on the participant, a method workflow may have to perform pre-processing steps to convert the APAeval-sanctioned input files into a format the participant can consume. This does not include, e.g., adapter trimming or mapping of reads, as those steps are already performed in our general pre-processing pipeline. After pre-processing, the actual execution of the method has to be implemented, and subsequently post-processing steps might be required to convert the obtained output into the format defined by the APAeval specifications.
- Sanctioned input files: Each processed input dataset APAeval uses for its challenges is provided as a .bam file (see specifications for file formats). If participants need other file formats, these HAVE TO be created as part of the pre-processing within method workflows (see "Method execution" below). Similarly, for each dataset we provide a gencode annotation in .gtf format, as well as a reference PAS atlas in .bed format for participants that depend on pre-defined PAS. All other annotation formats that might be needed HAVE TO be created from those. Non-sanctioned annotation files or similar auxiliary files MUST NOT be downloaded as part of the method workflows, in order to ensure comparability of all participants’ performance.
As several method workflows might have to do the same pre-processing tasks, we created a utils directory, where scripts (which have their corresponding docker images uploaded to the APAeval dockerhub) are stored. Please check the utils directory before writing your own conversion scripts, and/or add your pre-processing scripts to the utils directory if you think others might be able to re-use them.
- Method execution: For each method to be benchmarked (“participant”), one method workflow has to be written. The workflow MUST include all necessary pre- and post-processing steps that are needed to get from the input formats provided by APAeval (see "Sanctioned input files" above) to the output specified by APAeval in their metrics specifications (see "Post-processing" below). The workflow should include run mode parameters for the benchmarking events that it qualifies for, set to either true or false (e.g. run_identification = true); a sketch of this pattern is shown after this list. Each run of the method workflow should output files for the events whose run modes are set to true. If a method has distinct run modes other than those concerning the three benchmarking events, the calls to those should also be parameterized. If those run modes could significantly alter the behaviour of the method, please discuss with the APAeval community whether the distinct run modes should actually be treated as distinct participants in APAeval (see section on parameters). That could for example be the case if the method can be run with either mathematical model A or model B, and the expected results would be quite different. At the moment we can't foresee all possibilities, so we count on you to report and discuss any such cases. In any case, please do document extensively how the method can be used and how you employed it. In general, all relevant participant parameters should be configurable in the workflow config files. Parameters, file names, run modes, etc. MUST NOT be hardcoded within the workflow.
IMPORTANT: Do not download any other annotation files just because your participant's documentation says so. Instead, create all files the participant needs from the ones provided by APAeval. If you don't know how, please don't hesitate to start discussions within the APAeval community! Chances are high that somebody has already encountered a similar problem and will be able to help.
- Post-processing: To ensure compatibility with the OEB benchmarking events, specifications for file formats (output of method workflows = input for benchmarking workflows) are provided by APAeval. There is one specification per metric (= statistical parameter to assess the performance of a participant), but the calculation of several metrics can require a common input file format (thus, the file has to be created only once by the method workflow). The required method workflow outputs are: a bed file containing coordinates of identified PAS (and their respective expression in TPM, if applicable); a tsv file containing information on differential expression (if applicable); and a json file containing information about compute resource and time requirements (see output specifications for a detailed description of the file formats). These files have to be created within the method workflows as post-processing steps.
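To illustrate the run-mode parameterization mentioned under "Method execution", here is a minimal Snakemake-style sketch. The config keys, paths, and OUTCODEs below are hypothetical placeholders, not prescribed by APAeval; check the output specifications for the actual codes and formats.

```python
# Snakefile (sketch) -- run modes and all paths come from the config, nothing is hardcoded.
# All keys and OUTCODEs below are illustrative placeholders.
configfile: "config/config.MYTOOL.yaml"

final_outputs = []
if config["run_identification"]:
    # e.g. bed file with identified PAS, named PARTICIPANT.CHALLENGE.OUTCODE.ext
    final_outputs.append(f"{config['outdir']}/{config['participant']}.{config['challenge']}.01.bed")
if config["run_quantification"]:
    final_outputs.append(f"{config['outdir']}/{config['participant']}.{config['challenge']}.02.bed")
if config["run_differential_usage"]:
    final_outputs.append(f"{config['outdir']}/{config['participant']}.{config['challenge']}.03.tsv")

rule all:
    input: final_outputs
```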
Method workflows should be implemented in either Nextflow or Snakemake, and individual steps should be isolated through the use of containers. For more information on how to create these containers, see the section on containers.
To implement a method workflow for a participant, copy either the snakemake template or the nextflow template (dsl1 or dsl2) into the participant's directory and adapt the workflow directory names as described in the template's README. Don't forget to adapt the README itself as well.
Example:
method_workflows/
|--QAPA/
   |--QAPA_snakemake/
      |--workflow/Snakefile
      |--config/config.QAPA.yaml
      |--envs/QAPA.yaml
      |--envs/QAPA.Dockerfile
      |-- ...
|--MISO/
   |--MISO_nextflow/
      |-- ...
For the sake of reproducibility and interoperability, we require the use of Docker containers in our method workflows. Not only does the participant to be benchmarked have to be available in a container; any other tools used for pre- or post-processing in a method workflow should be containerized as well. Whether you get individual containers for all the tools of your workflow or combine them inside one container is up to you (the former being the more flexible option, of course).
IMPORTANT: Do check out the utils directory before you work on containers for pre- or post-processing tools; maybe someone has already done the same thing. If not, and you're going to build useful containers, don't forget to add them there as well.
Here are some pointers on how to best approach the containerization:
- Check if your participant (or other tool) is already available as a Docker container, e.g. at
  - dockerhub
  - biocontainers
  - google for `[TOOL_NAME] Docker` or `[TOOL_NAME] Dockerfile`
- If no Docker image is available for your tool:
  - create a container on BioContainers via either a bioconda recipe or a Dockerfile
  - naming conventions:
    - if your container only contains one tool: `apaeval/{tool_name}:{tool_version}`, e.g. `apaeval/my_tool:v1.0.0`
    - if you combine all tools required for your workflow: `apaeval/mwf_{participant_name}:{commit_hash}`, where `commit_hash` is the short SHA of the Git commit in the APAeval repo that last modified the corresponding Dockerfile, e.g. `65132f2`
- Now you just have to specify the docker image(s) in your method workflow:
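For example, in Snakemake this can be done with the `container` directive. The rule below is only a sketch: the image name follows the convention above, and the tool name and its command-line options are hypothetical.

```python
# Snakemake rule (sketch): image name follows the naming convention above;
# "my_tool" and its command-line options are hypothetical.
rule run_my_tool:
    input:
        bam="{sample}.bam"
    output:
        bed="{sample}.my_tool_raw.bed"
    container:
        "docker://apaeval/my_tool:v1.0.0"
    shell:
        "my_tool --bam {input.bam} --out {output.bed}"
```

In Nextflow, the per-process `container` directive serves the same purpose.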
For more information about input files, see "Sanctioned input files" above. For development and debugging you can use the small test input dataset we provide with this repository. You should use the `.bam` and/or `.gtf` files as input to your workflow. The `.bed` file serves as an example for a ground truth file. As long as the `test_data` directory doesn't contain a "poly(A) sites database file", which some methods will require, you should also use the `.bed` file for testing purposes.
Both the snakemake template and the nextflow template contain example `samples.csv` files. Here you'd fill in the paths to the samples you'd be running, and any other sample-specific information required by the workflow you're implementing. Thus, you can/must adapt the fields of this `samples.csv` according to your workflow's requirements.
Moreover, both workflow languages require additional information in `config` files. This is the place to specify run- or participant-specific parameters.
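A common pattern in a Snakemake-based method workflow is to read both files at the top of the Snakefile. The sketch below assumes hypothetical column and key names; adapt them to the template you copied.

```python
# Snakefile (sketch): read run parameters and the sample sheet from the config.
import pandas as pd

configfile: "config/config.MYTOOL.yaml"

# Sample sheet with one row per sample; the column names are illustrative.
samples = pd.read_csv(config["samples"]).set_index("sample", drop=False)

def bam_for(wildcards):
    # Look up the input bam path for a given sample.
    return samples.loc[wildcards.sample, "bam"]
```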
Important notes:
- Describe in your README extensively where parameters (sample info, participant specific parameters) have to be specified for a new run of the pipeline.
- Describe in the README if your participant has different run modes, or parameter settings that might alter the participant's performance considerably. In such a case you should suggest that the different modes be treated in APAeval as entirely distinct participants. Feel free to start discussions about this in our GitHub discussions board.
- Parameterize your code as much as possible, so that the user will only have to change the sample sheet and config file, and not the code. E.g. output file paths should be built from information the user has filled into the sample sheet or config file.
- For information on how files need to be named, see below!
In principle you are free to store output files how it best suits you (or the participant).
However, the "real" and final outputs for each run of the benchmarking will need to be copied to a directory in the format
PATH/TO/APAEVAL/EVENT/PARTICIPANT/
This directory must contain:
- Output files (check formats and filenames)
- Configuration files (with parameter settings), e.g. `config.yaml` and `samples.csv`
- `logs/` directory with all log files created by the workflow execution
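Using the MISO identification example from below, such a directory could look like this (all names apart from the output file are illustrative):

Identification_01/
|--MISO/
   |--MISO.P19_siControl_R1.01.bed
   |--config.yaml
   |--samples.csv
   |--logs/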
File formats for the 3 benchmarking events are described in the output specification.
As mentioned above, it is best to parameterize filenames, such that for each run the names and codes can be set by changing only the sample sheet and config file!
File names must adhere to the following schema: `PARTICIPANT.CHALLENGE.OUTCODE.ext`
For the codes please refer to the following documents:
- PARTICIPANT: same as directory name in `method_workflows`
- CHALLENGE: `sample_name` in APAeval Zenodo snapshot
- OUTCODE: in `method_workflow_file_specification.md`
Example:
`Identification_01/MISO/MISO.P19_siControl_R1.01.bed` would be the output of MISO (your participant) for the identification benchmarking event (OUTCODE 01, we know that from `method_workflow_file_specification.md`), run on dataset "P19_siControl_R1" (exact name as `sample_name` in APAeval Zenodo snapshot).
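If you assemble the file name programmatically, a minimal sketch could look like this; the values mirror the MISO example above, and in a real workflow they would come from the config and sample sheet rather than being hardcoded.

```python
# Sketch: build an output file name following PARTICIPANT.CHALLENGE.OUTCODE.ext.
participant = "MISO"               # directory name in method_workflows
challenge = "P19_siControl_R1"     # sample_name from the APAeval Zenodo snapshot
outcode = "01"                     # identification, per method_workflow_file_specification.md
ext = "bed"

outfile = f"{participant}.{challenge}.{outcode}.{ext}"
assert outfile == "MISO.P19_siControl_R1.01.bed"
```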
At least 2 independent reviews are required before your code can be merged into the main APAeval branch. Why not review some other PR while you wait for yours to be accepted? You can find some instructions in Sam's PR review guide.