
Conversation


@ftegenfe ftegenfe commented Jan 20, 2026

FOR CONTRIBUTOR:

  • I have read the CONTRIBUTING.md document and this tool is appropriate for the tools-iuc repo.
  • License permits unrestricted use (educational + commercial)
  • This PR adds a new tool or tool collection
  • This PR updates an existing tool or tool collection
  • This PR does something else (explain below)

#end if
##
## softlink result files - the directory name depends on OrthoDB version used in the mapping
&& results_path="\$(ODB-mapper CONFIG project)/Results"
Member

Out of interest, where is this saved? Can we not provide this path upfront?

Author

As in a previous comment, the path depends on the OrthoDB API version used. In principle the mapper can be run on different OrthoDB versions, although I have not implemented that here.

Author

In principle the whole result path could be interesting. In a Galaxy context, how can one provide a whole path with its contents?

Member

You can use PWD/CWD and put the results in ./results.
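A minimal sketch of that suggestion, assuming the Results directory reported by ODB-mapper can simply be linked into a fixed ./results directory in the job working directory (the exact command wrapping is up to the tool author):

    && results_path="\$(ODB-mapper CONFIG project)/Results"
    ## link the version-dependent Results directory to a fixed location
    && mkdir -p ./results
    && ln -s "\$results_path"/* ./results/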

Member

ping ...

Contributor

In principle the whole result path could be interesting. In a Galaxy context, how can one provide a whole path with its contents?

You mean that you want to provide it as an output? For this the directory datatype could help, or tar.gz?

@bgruening
Member

Ok, then please clean up the test comment and fix the last failing test:

Output hits: Test output file (refhits.og) is missing. If you are using planemo, try adding --update_test_data to generate it.
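For reference, the missing test file can usually be regenerated with planemo's --update_test_data flag; the tool path below is only a guess at the repository layout:

    planemo test --update_test_data tools/odb-mapper/odb-mapper.xml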

Member

@bgruening bgruening left a comment

I think this one is ready to go ... just one open comment. Would you like to look at this?

@@ -0,0 +1,67 @@
<tool id="ODB-mapper" name="Map to orthology" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="@PROFILE@">
Contributor

Shouldn't name and id be orthologer?

Author

@ftegenfe ftegenfe Jan 27, 2026

Sorry, I had missed this comment! True, the orthologer package contains two main tools.
One computes orthology from a set of FASTA files. The other is based on the first, but maps a given FASTA file to OrthoDB data. The priority was to add the mapper tool first; my intention is to add the orthologer tool as well.
Maybe another naming scheme would be appropriate?
Thinking about it a bit more: computing orthology is usually done over many FASTA files. That is not very suitable in this environment, as it will quickly use a lot of resources.
I noticed that FastOMA gives a warning about that particular issue.

Contributor

In this case we should add suite and auto_tool_repositories to the .shed.yml file, see e.g. https://github.com/galaxyproject/tools-iuc/blob/main/tools/ampvis2/.shed.yml

How about name="orthologer map" and <description>FASTA to OrthoDB orthology</description>? This would render as "orthologer map: FASTA to OrthoDB orthology".

Contributor

Thinking about it a bit more: computing orthology is usually done over many FASTA files. That is not very suitable in this environment, as it will quickly use a lot of resources.

Indeed, in the current setup a separate job would be created for each input FASTA, which gives the maximum possible parallelisation (at the cost of job-creation overhead). What is the processing time per FASTA? An alternative would be to use multiple="true" for the FASTA input: then a single job would be created, and the loop over the files needs to be done in the tool command. (The user may still create appropriately sized jobs by using collections, but that is for advanced users.)
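A minimal sketch of the multiple="true" variant, assuming the command stages each selected dataset into a working subdirectory before running the mapper over them; the inputs/ directory and parameter name are illustrative only:

    <param name="fasta" type="data" format="fasta" multiple="true" label="Input FASTA file(s)"/>

    ## in the <command> section: stage every selected dataset, then loop over inputs/
    mkdir -p inputs &&
    #for $f in $fasta
        ln -s '$f' 'inputs/${f.element_identifier}.fasta' &&
    #end for
    ## ... run the mapper once per file in inputs/ (the actual ODB-mapper invocation is omitted here)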

Not sure if the FastOMA comment applies to your tool (xref). The reasoning there was that the tool wraps a whole workflow instead of implementing the workflow in Galaxy, which limits the achievable parallelisation.

Author

To summarize, the two modes in orthologer are quite different:

  1. ODB-mapper - maps FASTA to OrthoDB orthology
  2. orthologer - computes orthology among a set of given FASTA files (no input from OrthoDB)

Your suggestion of allowing multiple FASTA files for ODB-mapper makes sense. I will do that.
I'm working on the second tool so that I can test it locally, but I would prefer to publish it later.

Your suggestion for the names also makes sense. I will make the changes.

I will also update the .shed.yml according to your suggestion, to prepare for including the second tool.


<test expect_num_outputs="4">
<param name="fasta" value="example.fs"/>
<param name="node" value="1489911"/>
<output name="hits" file="refhits.og" lines_diff="2"/>
Contributor

Can you add test assertions for the other outputs as well?

Author

There are 4 outputs. Three contain the clusters in different formats:

  1. annotations - mapped genes with annotation details and OrthoDB cluster IDs
  2. hits - only the mapped genes together with OrthoDB cluster IDs
  3. clusters - full clusters, the raw output from the mapping
  4. summary - metadata: percentage of genes mapped, resource usage, OrthoDB API version

Looking through this now, it would be enough to keep only output 1, as output 2 is just the first two columns of output 1.
The summary file could also be ignored; however, it crucially records the OrthoDB API version used and is therefore useful.

I suggest the following:

  • remove outputs 2 and 3
  • test against the annotations file
  • keep the summary output
  • add a test on the summary output using something weak, like the number of lines, for the assertion (see the sketch below)
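A minimal sketch of such an assertion, assuming the summary output is declared under the name summary; the line count and text are purely illustrative:

    <output name="summary">
        <assert_contents>
            <!-- weak check: roughly the expected number of lines -->
            <has_n_lines n="12" delta="4"/>
            <!-- and that the OrthoDB API version is reported at all -->
            <has_text text="OrthoDB"/>
        </assert_contents>
    </output>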

</tests>
<help><![CDATA[

This tool maps a given fasta file against OrthoDB orthology.
Contributor

Does this query online data? How much data is transferred?

Author

Yes, this downloads data, and the amount varies a lot depending on which target level is chosen: from about 1 MB for the most specific levels to a few GB for the top node (3.6 GB for the eukaryota root).
Mapping at higher levels will also take longer.

Contributor

I think we should cache the OrthoDB data locally. Potentially downloading gigabytes per job does not seem like a good idea if it can be avoided.

It seems that MAP_ORTHODB_DATA allows specifying where the data is saved, so we could provide this via reference data (a data table). A rough sketch follows.
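A minimal sketch of that idea, assuming a data table (here called orthodb_data, with a path column) that points at pre-downloaded OrthoDB data; all names are illustrative:

    <param name="orthodb_cache" type="select" label="Cached OrthoDB data">
        <options from_data_table="orthodb_data">
            <validator type="no_options" message="No cached OrthoDB data is available"/>
        </options>
    </param>

    ## in the <command> section, before running the mapper:
    export MAP_ORTHODB_DATA='$orthodb_cache.fields.path' &&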

Author

Ok, yes, that is true; I'm just not sure how to set it in a Galaxy environment.


This tool maps a given fasta file against OrthoDB orthology.
The level is given as an NCBI taxid (e.g 33208 for Metazoa).
If no level is given, a level is selected using busco autolineage.
Contributor

The auto-lineage mode transfers quite a bit of data. Can we use BUSCO reference data that is already cached in Galaxy?

Author

I don't know if we can use the cached BUSCO data, but yes, that would be useful.
Another option is to require the user to provide a level, thereby avoiding the auto-lineage option.

Contributor

Hi @ftegenfe, check the docs: there are two variables that we should definitely use:

  • BUSCO_OFFLINE - set to 0 if BUSCO should run offline; if so, it will look in BUSCO_DATA for files
  • BUSCO_DATA - the BUSCO data install directory

Author

If I do not allow auto-lineage, those two will not be used.
But yes, if we do allow it, BUSCO_DATA could point to some storage where the BUSCO lineage files reside.

Contributor

My guess would be that you need to do something like:

  • set up a select for a BUSCO DB (see the sketch after this comment):
    <param name="cached_db" label="Cached database with lineage" type="select">
  • add BUSCO_OFFLINE=0 && export BUSCO_DATA="$cached_db.fields.path" && at the beginning of the tool script (can you double check whether 0 really means offline? It looks strange to me)

Since the reference data is huge we need to trick a bit in the tests, see:

## tool tests can not run with --offline (otherwise we would need to store a lot of data at IUC)

If a lineage needs to be selected somewhere, you can do it like so:
<param argument="--lineage_dataset" type="select" label="Lineage">
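A minimal sketch of the cached-DB select suggested above, assuming a BUSCO-style data table (called busco_database here, with a path column); the value of the offline flag is left to the discussion below:

    <param name="cached_db" label="Cached database with lineage" type="select">
        <options from_data_table="busco_database"/>
    </param>

    ## in the <command> section: point BUSCO at the cached lineage data
    export BUSCO_DATA='$cached_db.fields.path' &&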

Author

Hi, BUSCO_OFFLINE=1 means offline.
At the moment I have disabled the possibility to run with auto-lineage, i.e. the mapping level parameter is required.
I have been away from work for a bit. I will have a look at the data tables documentation links you sent regarding setting MAP_ORTHODB_DATA.

@ftegenfe
Author

What is further required in terms of changes? Has this tool been approved for merge with the main branch?
