Missing manuals for some annotation packages #444

Open
jwokaty opened this issue Mar 31, 2025 · 4 comments

@jwokaty
Collaborator

jwokaty commented Mar 31, 2025

A few annotation packages are missing PDF manuals. I'm not sure yet how many are affected (it seems to be 11 on devel), but so far I have noticed these on both release and devel:

https://bioconductor.org/packages/3.20/data/annotation/html/AHCytoBands.html
https://bioconductor.org/packages/3.20/data/annotation/html/AHEnsDbs.html

The propagation log shows, for example, the following for packages that should have manuals:

========================================================================
Wed 26 Mar 09:30:01 EDT 2025
------------------------------------------------------------------------

Updating 3.21/data/annotation repo with source packages...
‘/home/biocbuild/public_html/BBS/3.21/data-annotation/OUTGOING/source/TENET.AnnotationHub_1.0.0.tar.gz’ -> ‘/home/biocpush/PACKAGES/3.21/data/annotation/src/contrib/TENET.AnnotationHub_1.0.0.tar.gz’
Warning message:
In list.old.pkgs(suffix = ".tar.gz") :
  pkgs with bad version format: BSgenome.Gmax.NCBI.Gmv40_4.0.tar.gz, BSgenome.Vvinifera.URGI.IGGP12Xv0_0.1.tar.gz, BSgenome.Vvinifera.URGI.IGGP12Xv2_0.1.tar.gz, BSgenome.Vvinifera.URGI.IGGP8X_0.1.tar.gz, DO.db_2.9.tar.gz
TENET.AnnotationHub_0.99.5.tar.gz 
                             TRUE 

Updating 3.21/data/annotation repo with reference manuals...
'/home/biocbuild/public_html/BBS/3.21/data-annotation/OUTGOING/manuals/AHCytoBands.pdf' -> '/home/biocpush/PACKAGES/3.21/data/annotation/manuals/AHCytoBands/man/AHCytoBands.pdf' 
'/home/biocbuild/public_html/BBS/3.21/data-annotation/OUTGOING/manuals/AHEnsDbs.pdf' -> '/home/biocpush/PACKAGES/3.21/data/annotation/manuals/AHEnsDbs/man/AHEnsDbs.pdf'
'/home/biocbuild/public_html/BBS/3.21/data-annotation/OUTGOING/manuals/AHLRBaseDbs.pdf' -> '/home/biocpush/PACKAGES/3.21/data/annotation/manuals/AHLRBaseDbs/man/AHLRBaseDbs.pdf'

The log states that AHCytoBands.pdf and AHEnsDbs.pdf were copied. While AHCytoBands.pdf does exist at that path, AHEnsDbs.pdf does not. Later in the same log:

manuals/AHCytoBands/man/
manuals/AHEnsDbs/man/
manuals/AHLRBaseDbs/man/
manuals/AHMeSHDbs/man/
manuals/AHPathbankDbs/man/
manuals/AHPubMedDbs/man/
manuals/AHWikipathwaysDbs/man/

In the OUTGOING folder on biocbuild, we see:

biocpush@nebbiolo1:~$ ls /home/biocbuild/public_html/BBS/3.21/data-annotation/OUTGOING/manuals/
AHCytoBands.pdf    AHWikipathwaysDbs.pdf               chromhmmData.pdf       EpiTxDb.Sc.sacCer3.pdf  MPO.db.pdf             SomaScan.db.pdf
AHEnsDbs.pdf       AlphaMissense.v2023.hg19.pdf        CTCF.pdf               geneplast.data.pdf      org.Mxanthus.db.pdf    synaptome.data.pdf
AHLRBaseDbs.pdf    AlphaMissense.v2023.hg38.pdf        ENCODExplorerData.pdf  GeneSummary.pdf         PANTHER.db.pdf         synaptome.db.pdf
AHMeSHDbs.pdf      alternativeSplicingEvents.hg38.pdf  EPICv2manifest.pdf     GenomicState.pdf        rat2302frmavecs.pdf    TENET.AnnotationHub.pdf
AHPathbankDbs.pdf  cadd.v1.6.hg19.pdf                  EpiTxDb.Hs.hg38.pdf    JASPAR2022.pdf          rGenomeTracksData.pdf  UniProtKeywords.pdf
AHPubMedDbs.pdf    cadd.v1.6.hg38.pdf                  EpiTxDb.Mm.mm10.pdf    JASPAR2024.pdf          scAnnotatR.models.pdf

In the propagation scripts for data annotation, biocViews::extractManuals is called. To my knowledge no other package type uses it, so I wonder if it is overwriting the manuals created by R CMD check. It might be helpful to run extractManuals on AHCytoBands, AHEnsDbs, and alternativeSplicingEvents.hg38 (which has a manual that appears to have been generated by extractManuals), or to run the propagation script by script to see how the manuals change in the CRAN-style repository.

@jwokaty jwokaty self-assigned this Mar 31, 2025
@hpages
Contributor

hpages commented Apr 1, 2025

FWIW you can get the full list with:

biocpush@nebbiolo1:~/PACKAGES/3.21/data/annotation/manuals$ ls | sort > ~/sandbox/all_pkgs.txt
biocpush@nebbiolo1:~/PACKAGES/3.21/data/annotation/manuals$ ls */man/*.pdf | cut -d '/' -f 1 | sort > ~/sandbox/pkgs_with_manual.txt
biocpush@nebbiolo1:~/PACKAGES/3.21/data/annotation/manuals$ cd ~/sandbox/
biocpush@nebbiolo1:~/sandbox$ diff all_pkgs.txt pkgs_with_manual.txt
5,11d4
< AHCytoBands
< AHEnsDbs
< AHLRBaseDbs
< AHMeSHDbs
< AHPathbankDbs
< AHPubMedDbs
< AHWikipathwaysDbs
162d154
< chromhmmData
226d217
< grasp2db
782d772
< rGenomeTracksData
817d806
< scAnnotatR.models
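
The same comparison can be done in one pass, without the intermediate files. The following is only a minimal sketch: the function name list_pkgs_missing_manual is made up for illustration, and it assumes the manuals/<pkg>/man/<pkg>.pdf layout shown in the log above.

```shell
# Print every package directory under a manuals/ tree that has no PDF in
# its man/ subdirectory. The function name is illustrative only; the
# layout <manuals_dir>/<pkg>/man/*.pdf is assumed from the log above.
list_pkgs_missing_manual() {
    manuals_dir="$1"
    for pkg_dir in "$manuals_dir"/*/; do
        # Skip the unexpanded glob when the directory is empty
        [ -d "$pkg_dir" ] || continue
        pkg=$(basename "$pkg_dir")
        # ls exits non-zero when the glob matches no file
        if ! ls "$pkg_dir"man/*.pdf >/dev/null 2>&1; then
            printf '%s\n' "$pkg"
        fi
    done
}
```

It would be called as list_pkgs_missing_manual ~/PACKAGES/3.21/data/annotation/manuals and should report the same package names as the diff above.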

Looking at AHCytoBands:

The package has no man page. However, R CMD Rd2pdf should still be able to generate a minimalist manual when used in a "standard" way. For example, the following produces an AHCytoBands.pdf file:

wget https://bioconductor.org/packages/3.21/data/annotation/src/contrib/AHCytoBands_0.99.1.tar.gz
tar zxf AHCytoBands_0.99.1.tar.gz
R CMD Rd2pdf AHCytoBands
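
As a rough illustration of what a "standard" call could look like when applied to any source tarball, the three steps above can be wrapped in a small helper. This is a sketch only: build_manual_from_tarball is a hypothetical name, not a function from BBS or biocViews, and the choice of the --no-preview, --force and --output options of R CMD Rd2pdf is an assumption about what a propagation script would want.

```shell
# Unpack a source tarball and run R CMD Rd2pdf on the package directory.
# build_manual_from_tarball is a hypothetical helper, not part of BBS
# or biocViews.
build_manual_from_tarball() {
    tarball="$1"    # e.g. AHCytoBands_0.99.1.tar.gz
    out_dir="$2"    # directory that will receive <pkg>.pdf
    base=$(basename "$tarball")
    pkg="${base%%_*}"    # package name is everything before the first "_"
    work_dir=$(mktemp -d) || return 1
    tar zxf "$tarball" -C "$work_dir" || { rm -rf "$work_dir"; return 1; }
    mkdir -p "$out_dir"
    # --no-preview: do not open a PDF viewer
    # --force: overwrite an existing output file
    R CMD Rd2pdf --no-preview --force \
        --output="$out_dir/$pkg.pdf" "$work_dir/$pkg"
    status=$?
    rm -rf "$work_dir"
    return "$status"
}
```

For example, build_manual_from_tarball AHCytoBands_0.99.1.tar.gz manuals would be expected to leave manuals/AHCytoBands.pdf behind, provided R and a LaTeX installation are available.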

The problem is that the prepareRepos-data-annotation.sh script on nebbiolo1 does not call R CMD Rd2pdf in the "standard" way. The script relies on this buildManualsFromTarball() function from the biocViews package:
https://github.com/Bioconductor/biocViews/blob/8aa6dafe1f6f65a55ff5dd6f3565690c49a5618e/R/repository.R#L77-L102
Seems to me that the function is trying to do things in an unnecessarily complicated way, and, as a result, is shooting itself in the foot. In particular, it fails to generate the manual for packages with no man pages.

I didn't look at the other annotation packages with a missing manual.

@jwokaty
Collaborator Author

jwokaty commented Apr 1, 2025

@hpages Thanks for your analysis. It's hard to follow why some choices were made to use the function in biocViews.

If this is the case and R CMD Rd2pdf can produce all manuals, I will make a PR to switch to that and stop using the function in biocViews.

@hpages
Contributor

hpages commented Apr 1, 2025

It's more complicated than that.

The reason you see this line

'/home/biocbuild/public_html/BBS/3.21/data-annotation/OUTGOING/manuals/AHCytoBands.pdf' -> '/home/biocpush/PACKAGES/3.21/data/annotation/manuals/AHCytoBands/man/AHCytoBands.pdf'

in the log file is because AHCytoBands is part of the small subset of annotation packages that go thru automated builds (once a week). The irony here is that the builds were actually able to produce AHCytoBands.pdf, and updateReposPkgs-data-annotation.sh was able to propagate it to the staging repo on nebbiolo1. But AHCytoBands.pdf got removed later when prepareRepos-data-annotation.sh called buildManualsFromTarball(). This situation is unique to the data-annotation packages because they were originally not going thru automated builds. However, a few years ago, we started to run the automated builds on a small subset of them. So for the packages in this small subset, the manuals are actually produced twice.
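
One way to spot manuals that were propagated and then removed is to cross-check the copy lines in the log against what is actually on disk. The sketch below assumes the copy lines have the 'src' -> 'dest' form shown above, with plain single quotes; report_vanished_manuals is a hypothetical helper name.

```shell
# Read a propagation log on stdin and print every destination PDF that a
# "'src' -> 'dest'" copy line mentions but that no longer exists on disk.
# report_vanished_manuals is a hypothetical helper name.
report_vanished_manuals() {
    # Splitting on single quotes gives: $2 = source, $3 = " -> ", $4 = dest
    awk -F"'" '$3 == " -> " && $4 ~ /\.pdf$/ { print $4 }' |
    while IFS= read -r dest; do
        [ -f "$dest" ] || printf '%s\n' "$dest"
    done
}
```

Fed the propagation log on stdin after prepareRepos-data-annotation.sh has run, it should print the manuals that were copied by updateReposPkgs-data-annotation.sh but have since disappeared from the staging repo.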

The whole approach to generation and propagation of the manuals would need to be revisited. At least 2 things would need to happen:

  1. We should stop trying to propagate the manuals generated by the automated builds. As reported in issue #440 ("Manuals generated by bioconductor includes internal functions"), these manuals are not appropriate, and that would be hard to fix because we don't control how R CMD check calls R CMD Rd2pdf to produce them.
  2. The business of producing the manuals should only be done by prepareRepos-data-annotation.sh. The script already does it for data-annotation packages but it should also do it for software and data-experiment. This means that buildManualsFromTarball() needs to be fixed/simplified. Also, right now the function fills the propagate-data-annotation-*.log file with a ton of output that is not helpful, so it should also be improved to generate less but more useful output.

This will make the manual business simpler/better because (a) they will be handled the same way for all the builds, (b) the code that handles them will be in one place only, (c) we'll have full control on how we generate them (we will no longer rely on R CMD check for that), (d) they will be produced only once for each package, and (e) they will no longer include internal functions.

Hope this helps.

@hpages
Contributor

hpages commented Apr 1, 2025

> It's hard to follow why some choices were made to use the function in biocViews.

That's because most data-annotation packages are not going thru the build system. See above for the details.
