-
Notifications
You must be signed in to change notification settings - Fork 0
optimize main functions #86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements a major optimization overhaul of the package codebase to improve runtime and RAM efficiency by switching from long format VCF data (one row per sample-variant) to wide format (one row per variant with sample-by-variant matrix), along with transitioning from data.frame to data.table and matrix data structures.
- Changed VCF data format from long to wide with matrix-based genotype storage
- Migrated data structures from
data.frametodata.tableandmatrixfor efficiency - Updated
import.vcfandapply.polygenic.scorewith breaking changes requiring new parameter configurations
Reviewed Changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| vignettes/UserGuide.Rmd | Updated examples to use new VCF import format with long.format = TRUE parameter |
| tests/testthat/test-vcf-pgs-merge.R | Added tests for both wide and long format VCF handling |
| tests/testthat/test-vcf-import.R | Updated import tests to handle new wide/long format outputs |
| tests/testthat/test-strand-flip-handling.R | Minor formatting fix (added semicolon) |
| tests/testthat/test-sample-by-snp-matrix-utility.R | Enhanced matrix utility tests with new wide format conversion functions |
| tests/testthat/test-plotting.R | Updated plotting tests to use vcf.long.format = TRUE |
| tests/testthat/test-pgs-application.R | Comprehensive updates to test both wide and long format equivalence |
| tests/testthat/test-dosage-calculator.R | Updated dosage calculation tests for matrix input support |
| tests/testthat/helper-test-utils.R | Added utility functions for converting between VCF formats |
| man/import.vcf.Rd | Updated documentation for new VCF import interface |
| man/combine.vcf.with.pgs.Rd | Updated documentation for data.table compatibility |
| man/apply.polygenic.score.Rd | Updated documentation for new VCF format parameter |
| R/variant-by-sample-matrix-utility.R | Rewrote matrix utilities for data.table efficiency and added VCF conversion functions |
| R/run-pgs-statistics.R | Migrated statistical functions to data.table for performance |
| R/handle-vcf.R | Major refactor of VCF import to support wide/long format outputs |
| R/handle-multiallelic-sites.R | Enhanced multiallelic site handling for matrix-based processing |
| R/combine-vcf-with-pgs.R | Optimized VCF-PGS merging using data.table operations |
| R/calculate-dosage.R | Enhanced dosage calculation to support matrix inputs |
| R/assess-strand-flip.R | Optimized strand flip assessment with vectorized operations |
| R/apply-pgs.R | Complete rewrite for matrix-based processing and data.table efficiency |
| NEWS.md | Added version 4.0.0 changelog with breaking changes |
| NAMESPACE | Updated imports to use data.table instead of reshape2 |
| DESCRIPTION | Version bump to 4.0.0 and removed reshape2 dependency |
whelena
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good and I don't have any major comments
In this PR is a massive overhaul of the package codebase for the purpose of increasing runtime and RAM efficiency.
Large changes are as follows:
data.frameto primarilydata.tableandmatrixdata.tablesyntaxThe following functions contain breaking changes from
v3.1.0import.vcfnow has a different output format. Long VCF format is still a supported output format for back-compatibility; however the output object has a different naming scheme than previously.apply.polygenic.scoreexpects a widevcf.datainput by default, to make compatible with long format, thevcf.long.formatargument must be set toTRUEinstead of the defaultFALSE.The output of
apply.polygenic.scorehas received a couple of new columns, but all former elements are preserved.This PR is accompanied by a version increment to
v4.0.0due to breaking changes.In future PRs, major documentation sources (README, vignettes, examples) will be comprehensively updated to reflect new default usage of
apply.polygenic.score.I have read the code review guidelines and the code review best practice on GitHub check-list.
The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)-[brief_description_of_branch].
I have set up or verified the branch protection rule following the github standards before opening this pull request.
I have added the changes included in this pull request to
NEWSunder the next release version or unreleased, and updated the date.I have updated the version number in
metadata.yamlandDESCRIPTION.Both
R CMD buildandR CMD checkrun successfully.Testing Results
All tests PASS
Local R CMD check passes with no NOTEs, warnings, or errors.