@alkaZeltser
Collaborator

This PR is a major overhaul of the package codebase aimed at improving runtime and RAM efficiency.
The major changes are as follows:

  • The VCF data format has switched from long (one row per unique sample-variant pair) to wide (one row per unique variant), accompanied by a sample-by-variant matrix. This is a breaking change!
  • Data structures have been changed from primarily data.frame to primarily data.table and matrix.
  • Where necessary, code that manipulates data structures has been updated to data.table syntax.
  • Additional functionality has been rewritten to handle matrices more efficiently (e.g. via masks and empty matrix initialization).
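
To illustrate the long-to-wide change described above, here is a minimal base R sketch on made-up data. It is not the package's implementation: the column names (`variant.id`, `sample.id`, `dosage`), the toy values, and the variant-by-sample orientation of the matrix are all assumptions chosen for illustration.

```r
# Toy long-format table: one row per sample-variant pair (hypothetical values).
long <- data.frame(
    variant.id = c('chr1:100', 'chr1:100', 'chr2:200', 'chr2:200'),
    sample.id = c('S1', 'S2', 'S1', 'S2'),
    dosage = c(0, 1, 2, NA),
    stringsAsFactors = FALSE
    );

variants <- unique(long$variant.id);
samples <- unique(long$sample.id);

# Empty matrix initialization: pre-allocate an NA-filled matrix once,
# with one row per variant and one column per sample (assumed orientation).
dosage.matrix <- matrix(
    data = NA_real_,
    nrow = length(variants),
    ncol = length(samples),
    dimnames = list(variants, samples)
    );

# Fill all cells in one vectorized step using (row name, column name) pairs.
dosage.matrix[cbind(long$variant.id, long$sample.id)] <- long$dosage;

# A logical mask marks missing calls without looping over rows.
missing.mask <- is.na(dosage.matrix);
```

In the wide layout, per-variant fields are stored once per variant rather than repeated for every sample, and the genotype values live in a single numeric matrix, which is presumably where the RAM savings come from.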

The following functions contain breaking changes from v3.1.0:

  • import.vcf now has a different output format. Long VCF format is still supported as an output format for backward compatibility; however, the output object uses a different naming scheme than before.
  • apply.polygenic.score expects a wide vcf.data input by default. To use long-format input, the vcf.long.format argument must be set to TRUE instead of the default FALSE.

The output of apply.polygenic.score has received a couple of new columns, but all former elements are preserved.

This PR is accompanied by a version increment to v4.0.0 due to breaking changes.

In future PRs, major documentation sources (README, vignettes, examples) will be comprehensively updated to reflect new default usage of apply.polygenic.score.

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have set up or verified the branch protection rule following the github standards before opening this pull request.

  • I have added the changes included in this pull request to NEWS under the next release version or unreleased, and updated the date.

  • I have updated the version number in metadata.yaml and DESCRIPTION.

  • Both R CMD build and R CMD check run successfully.

Testing Results

All tests PASS
Local R CMD check passes with no NOTEs, warnings, or errors.

@jarbet requested a review from Copilot on July 30, 2025, 17:12
Copilot AI left a comment

Pull Request Overview

This PR implements a major optimization overhaul of the package codebase to improve runtime and RAM efficiency by switching from long format VCF data (one row per sample-variant) to wide format (one row per variant with sample-by-variant matrix), along with transitioning from data.frame to data.table and matrix data structures.

  • Changed VCF data format from long to wide with matrix-based genotype storage
  • Migrated data structures from data.frame to data.table and matrix for efficiency
  • Updated import.vcf and apply.polygenic.score with breaking changes requiring new parameter configurations

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| vignettes/UserGuide.Rmd | Updated examples to use new VCF import format with long.format = TRUE parameter |
| tests/testthat/test-vcf-pgs-merge.R | Added tests for both wide and long format VCF handling |
| tests/testthat/test-vcf-import.R | Updated import tests to handle new wide/long format outputs |
| tests/testthat/test-strand-flip-handling.R | Minor formatting fix (added semicolon) |
| tests/testthat/test-sample-by-snp-matrix-utility.R | Enhanced matrix utility tests with new wide format conversion functions |
| tests/testthat/test-plotting.R | Updated plotting tests to use vcf.long.format = TRUE |
| tests/testthat/test-pgs-application.R | Comprehensive updates to test both wide and long format equivalence |
| tests/testthat/test-dosage-calculator.R | Updated dosage calculation tests for matrix input support |
| tests/testthat/helper-test-utils.R | Added utility functions for converting between VCF formats |
| man/import.vcf.Rd | Updated documentation for new VCF import interface |
| man/combine.vcf.with.pgs.Rd | Updated documentation for data.table compatibility |
| man/apply.polygenic.score.Rd | Updated documentation for new VCF format parameter |
| R/variant-by-sample-matrix-utility.R | Rewrote matrix utilities for data.table efficiency and added VCF conversion functions |
| R/run-pgs-statistics.R | Migrated statistical functions to data.table for performance |
| R/handle-vcf.R | Major refactor of VCF import to support wide/long format outputs |
| R/handle-multiallelic-sites.R | Enhanced multiallelic site handling for matrix-based processing |
| R/combine-vcf-with-pgs.R | Optimized VCF-PGS merging using data.table operations |
| R/calculate-dosage.R | Enhanced dosage calculation to support matrix inputs |
| R/assess-strand-flip.R | Optimized strand flip assessment with vectorized operations |
| R/apply-pgs.R | Complete rewrite for matrix-based processing and data.table efficiency |
| NEWS.md | Added version 4.0.0 changelog with breaking changes |
| NAMESPACE | Updated imports to use data.table instead of reshape2 |
| DESCRIPTION | Version bump to 4.0.0 and removed reshape2 dependency |

@whelena left a comment

Looks good, and I don't have any major comments.

@alkaZeltser merged commit d595990 into main on Aug 5, 2025
5 of 6 checks passed
@alkaZeltser deleted the nzeltser-optimize-main-functions branch on Aug 5, 2025, 22:32