added orthologer tool #7604
| #end if | ||
| ## | ||
| ## softlink result files - the directory name depends on OrthoDB version used in the mapping | ||
| && results_path="\$(ODB-mapper CONFIG project)/Results" |
Out of interest, where is this saved? Can we not provide this path up front?
As in a previous comment, the path depends on the OrthoDB API version used. In principle the mapper can be run on different OrthoDB versions although I have not implemented it here.
In principle the whole result path could be interesting. In a galaxy context, how can one provide a whole path with its content?
You can use PWD/CWD and put the results in ./results.
ping ...
In principle the whole result path could be interesting. In a galaxy context, how can one provide a whole path with its content?
You mean that you want to provide it as an output? For this the directory datatype could help, or tar.gz?
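A minimal sketch of the tar.gz option, assuming the results directory is known at the end of the tool command. The directory and the file `example.og` below are placeholders created only for illustration; in the real tool the path would come from `ODB-mapper CONFIG project`, as in the diff above.

```shell
# Placeholder for the directory reported by "ODB-mapper CONFIG project"
# (created here only so the sketch runs on its own).
results_path="$(mktemp -d)/Results"
mkdir -p "$results_path"
printf 'demo\n' > "$results_path/example.og"   # hypothetical result file

# Pack the whole directory so it can be collected as a single output dataset.
tar -czf results.tar.gz -C "$results_path" .

# List the archive contents to confirm what was packed.
tar -tzf results.tar.gz
```

Packing relative to the directory (`-C`) keeps the OrthoDB-version-dependent path out of the archive, so the output layout would not change between versions.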
Co-authored-by: Björn Grüning <[email protected]>
Co-authored-by: Saim Momin <[email protected]>
…nto add_orthologer
Ok, then please clean up the test comment and fix the last failing test:
bgruening left a comment
I think this one is ready to go ... just one open comment. Would you like to look at this?
tools/orthologer/ODB-mapper.xml
Outdated
| @@ -0,0 +1,67 @@ | |||
| <tool id="ODB-mapper" name="Map to orthology" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="@PROFILE@"> | |||
Shouldn't name and id be orthologer?
Sorry I had missed this comment! True, the orthologer package contains two main tools.
One that computes orthology from a set of fasta files. The other one is based on that one but maps a given fasta file to OrthoDB data. The priority was to first add the mapper tool. My intention is to add the orthologer tool as well.
Maybe another naming scheme would be appropriate?
Thinking about it a bit more - computing orthology is usually done over many fasta files. That's not very suitable in this environment. It will quickly use a lot of resources.
I noticed that FastOMA gives a warning about that particular issue.
In this case we should add suite and auto_tool_repositories to the .shed.yml file, see e.g. https://github.com/galaxyproject/tools-iuc/blob/main/tools/ampvis2/.shed.yml
How about name="orthologer map" and <description>FASTA to OrthoDB orthology</description>. This would render as orthologer map: FASTA to OrthoDB orthology.
Thinking about it a bit more - computing orthology is usually done over many fasta files. That's not very suitable in this environment. It will quickly use a lot of resources.
Indeed, in the current setup a separate job would be created for each input FASTA, which gives the maximum possible parallelisation (at the cost of job-creation overhead). What is the processing time per FASTA? An alternative would be to use multiple="true" in the FASTA input. Then a single job would be created, and the loop over the files would need to be done in the tool command. (The user may still create appropriately sized jobs by using collections, but that is for advanced users.)
Not sure if the FastOMA comment applies to your tool (xref). The reasoning there was that the tool wraps a whole workflow instead of implementing the workflow in Galaxy, which limits the achievable parallelisation.
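A rough sketch of what the loop in the tool command could look like with multiple="true". The function `map_one` is a hypothetical stand-in for the real mapping call (the actual ODB-mapper arguments are not shown in this thread, so none are assumed), and the three demo FASTA files are created only so the sketch runs on its own.

```shell
# Hypothetical stand-in for the real mapping call on one FASTA file.
map_one() {
    printf 'mapped %s\n' "$1"
}

# Demo inputs; in Galaxy the list of paths would come from the
# multiple="true" parameter instead.
workdir="$(mktemp -d)"
for name in a.fs b.fs c.fs; do
    printf '>seq1\nMKV\n' > "$workdir/$name"
done

# Single job: iterate over all inputs and record one line per file.
: > mapped.log
count=0
for fasta in "$workdir"/*.fs; do
    map_one "$fasta" >> mapped.log
    count=$((count + 1))
done
echo "processed $count files"
```

With this layout the files are processed sequentially inside one job, so the parallelisation trade-off mentioned above still applies.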
To summarize, the two modes in orthologer are quite different:
- ODB-mapper - maps FASTA to OrthoDB orthology
- orthologer - computes orthology among a set of given FASTA files (no input from OrthoDB)
Your suggestion of allowing multiple fasta files to ODB-mapper makes sense. I will do that.
I'm working on the second to test locally but would prefer to publish that one later.
Your suggestion for names also makes sense. I will make the changes.
I will also update the .shed.yml according to your suggestion in order to prepare for including the 2nd tool.
tools/orthologer/ODB-mapper.xml
Outdated
| <test expect_num_outputs="4"> | ||
| <param name="fasta" value="example.fs"/> | ||
| <param name="node" value="1489911"/> | ||
| <output name="hits" file="refhits.og" lines_diff="2"/> |
Can you add test assertions for the other outputs as well?
There are 4 outputs. Three contain the clusters in different formats, and the fourth is a summary:
- annotations - mapped genes with details on annotation, with OrthoDB cluster IDs
- hits - only the mapped genes together with OrthoDB cluster IDs
- clusters - full clusters, raw output from the mapping
- summary - metadata: percentage of genes mapped, resource usage, OrthoDB API version
Looking through this now, it would be enough to keep only annotations, as hits is just the first two columns of annotations.
The summary file could also be dropped. However, it records the OrthoDB API version used, which is crucial, so it is useful to keep.
I suggest the following:
- remove the hits and clusters outputs
- test against the annotations file
- keep the summary output
- optionally add a test on the summary output using something weak, like the number of lines, for the assertion
| </tests> | ||
| <help><![CDATA[ | ||
|
|
||
| This tool maps a given fasta file against OrthoDB orthology. |
Does this query online data? How much data is transferred?
Yes, this downloads data, and the amount varies a lot depending on which target level is chosen.
It ranges from about 1 MB for the most specific levels to a few GB for the top nodes (3.6 GB for the Eukaryota root).
Mapping at higher levels also takes longer.
I think we should cache the OrthoDB data locally. Potentially downloading several GB per job does not seem like a good idea if it can be avoided.
It seems that MAP_ORTHODB_DATA lets us specify where the data is saved, so we could provide this via reference data (a data table).
Ok, yes that is true - not sure how to set it in a galaxy environment though.
Some docs are here: https://docs.galaxyproject.org/en/master/admin/data_tables.html https://docs.galaxyproject.org/en/master/dev/data_managers.html
Maybe the busco data table gives a bit of inspiration...
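As a sketch of the MAP_ORTHODB_DATA suggestion above: the variable would be exported at the start of the tool command, before the mapper runs. The cache directory name used here is an assumption for illustration only; in a Galaxy tool the path would come from a data table entry rather than being hard-coded.

```shell
# Hypothetical cache location (assumption); a real tool would read this
# path from the selected data table entry.
orthodb_cache="$PWD/orthodb_cache"
mkdir -p "$orthodb_cache"

# Point the mapper at the cached data instead of a per-job download
# location, as suggested in the comment above.
export MAP_ORTHODB_DATA="$orthodb_cache"
echo "OrthoDB data directory: $MAP_ORTHODB_DATA"
```

Whether the mapper skips the download when the directory is already populated would need to be confirmed against the orthologer documentation.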
tools/orthologer/ODB-mapper.xml
Outdated
|
|
||
| This tool maps a given fasta file against OrthoDB orthology. | ||
| The level is given as an NCBI taxid (e.g 33208 for Metazoa). | ||
| If no level is given, a level is selected using busco autolineage. |
The auto lineage mode transfers quite a bit of data. Can we use busco reference data that is cached in Galaxy already?
I don't know if we can use cached busco data but yes that would be useful.
Another option is to require the user to provide a level hence avoiding autolineage option.
Hi @ftegenfe, check the docs: there are two variables that we should definitely use:
- BUSCO_OFFLINE: set to 0 if BUSCO should run offline - if so it will look in BUSCO_DATA for files
- BUSCO_DATA: BUSCO data install directory
If I do not allow auto lineage, those two will not be used.
But yes if we do allow it, BUSCO_DATA could point to some storage where the BUSCO lineage files reside.
My guess would be that you need to do something like:
- set up a select for a BUSCO DB (tools-iuc/tools/busco/busco.xml, line 164 in 157d1ec):
  <param name="cached_db" label="Cached database with lineage" type="select">
- add BUSCO_OFFLINE=0 && export BUSCO_DATA="$cached_db.fields.path" && at the beginning of the tool script (can you double check if 0 really means offline -- looks strange to me)
Since the reference data is huge we need to trick a bit, see tools-iuc/tools/busco/busco.xml, line 20 in 157d1ec:
  ## tool tests can not run with --offline (otherwise we would need to store a lot of data at IUC)
If a lineage needs to be selected somewhere you can do it like so (tools-iuc/tools/busco/busco.xml, line 222 in 157d1ec):
  <param argument="--lineage_dataset" type="select" label="Lineage">
Hi, BUSCO_OFFLINE=1 means offline.
At the moment I have disabled the possibility to run with auto-lineage. That is, the mapping level parameter is now required.
I have been away from work for a bit. I will have a look at the data tables documentation links you sent with respect to setting MAP_ORTHODB_DATA.
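If auto-lineage were re-enabled later, the two variables could be set at the beginning of the tool command roughly as follows. This is only a sketch: the directory name is an assumption, and in Galaxy the path would be taken from the selected cached database entry as suggested earlier in the thread.

```shell
# Hypothetical location of cached BUSCO lineage files (assumption); in
# Galaxy this would come from the selected data table entry.
busco_cache="$PWD/busco_downloads"
mkdir -p "$busco_cache"

# Per the reply above, BUSCO_OFFLINE=1 (not 0) means offline.
export BUSCO_OFFLINE=1
export BUSCO_DATA="$busco_cache"

# Show the resulting settings.
env | grep '^BUSCO_' | sort
```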
Co-authored-by: M Bernt <[email protected]>
What is further required in terms of changes? Has this tool been approved for merge with the main branch?
FOR CONTRIBUTOR: