optimize main functions #86

alkaZeltser · 2025-07-30T00:15:41Z

This PR is a massive overhaul of the package codebase aimed at reducing runtime and improving RAM efficiency.
Large changes are as follows:

VCF data format switched from long (one row for every unique sample-variant pair) to wide (one row for every unique variant), accompanied by a sample-by-variant matrix. This is a breaking change!
Data structures have been changed from primarily data.frame to primarily data.table and matrix
Where necessary, code that manipulates data structures has been updated to data.table syntax
Additional functionality re-written to more efficiently handle matrices (e.g. via the implementation of masks and empty matrix initialization)
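
As a rough illustration of the long-to-wide switch and the mask idea described above, the sketch below reshapes a small long-format table into a per-variant table plus a genotype matrix using data.table. It is a minimal sketch and not the package's implementation: the column names (`variant.id`, `sample.id`, `dosage`), the toy weights, the variant-by-sample orientation, and the choice to zero out missing dosages are all assumptions made for the example.

```r
library(data.table);

# Hypothetical long-format genotype table: one row per sample-variant pair.
# Column names are illustrative, not the package's actual schema.
long.vcf <- data.table(
    variant.id = c('rs1', 'rs1', 'rs2', 'rs2', 'rs3', 'rs3'),
    CHROM = c('chr1', 'chr1', 'chr1', 'chr1', 'chr2', 'chr2'),
    POS = c(100L, 100L, 250L, 250L, 75L, 75L),
    sample.id = c('S1', 'S2', 'S1', 'S2', 'S1', 'S2'),
    dosage = c(0, 1, 2, NA, 1, 1)
    );

# Wide format, part 1: one row per unique variant (fixed fields only).
variant.table <- unique(long.vcf[, .(variant.id, CHROM, POS)]);

# Wide format, part 2: a matrix with variants in rows and samples in columns.
wide.dt <- dcast(long.vcf, variant.id ~ sample.id, value.var = 'dosage');
genotype.matrix <- as.matrix(wide.dt, rownames = 'variant.id');

# Keep matrix rows in the same order as the variant table.
genotype.matrix <- genotype.matrix[variant.table$variant.id, , drop = FALSE];

# A logical mask marks missing genotypes once, so later steps can reuse it
# instead of re-scanning the data (assumed handling: missing dosage counts as 0).
missing.mask <- is.na(genotype.matrix);
dosage.filled <- genotype.matrix;
dosage.filled[missing.mask] <- 0;

# Toy per-variant weights (hypothetical values), applied with one matrix product.
weights <- c(rs1 = 0.5, rs2 = -0.2, rs3 = 1.0);
scores <- drop(crossprod(dosage.filled, weights[rownames(dosage.filled)]));
```

The per-variant fields are stored once rather than repeated for every sample, and the per-sample score becomes a single matrix product instead of a grouped aggregation over a long table.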

The following functions contain breaking changes from v3.1.0:

import.vcf now has a different output format. Long VCF format is still a supported output format for backward compatibility; however, the output object has a different naming scheme than before.
apply.polygenic.score expects a wide vcf.data input by default. To use long-format input, the vcf.long.format argument must be set to TRUE instead of the default FALSE.

The output of apply.polygenic.score has received a couple of new columns, but all former elements are preserved.

This PR is accompanied by a version increment to v4.0.0 due to breaking changes.

In future PRs, major documentation sources (README, vignettes, examples) will be comprehensively updated to reflect new default usage of apply.polygenic.score.

I have read the code review guidelines and the code review best practice on GitHub check-list.
The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
I have set up or verified the branch protection rule following the github standards before opening this pull request.
I have added the changes included in this pull request to NEWS under the next release version or unreleased, and updated the date.
I have updated the version number in metadata.yaml and DESCRIPTION.
Both R CMD build and R CMD check run successfully.

Testing Results

All tests PASS
Local R CMD check passes with no NOTEs, warnings, or errors.

Copilot

Pull Request Overview

This PR implements a major optimization overhaul of the package codebase to improve runtime and RAM efficiency by switching from long format VCF data (one row per sample-variant) to wide format (one row per variant with sample-by-variant matrix), along with transitioning from data.frame to data.table and matrix data structures.

Changed VCF data format from long to wide with matrix-based genotype storage
Migrated data structures from data.frame to data.table and matrix for efficiency
Updated import.vcf and apply.polygenic.score with breaking changes requiring new parameter configurations

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.

File	Description
vignettes/UserGuide.Rmd	Updated examples to use new VCF import format with `long.format = TRUE` parameter
tests/testthat/test-vcf-pgs-merge.R	Added tests for both wide and long format VCF handling
tests/testthat/test-vcf-import.R	Updated import tests to handle new wide/long format outputs
tests/testthat/test-strand-flip-handling.R	Minor formatting fix (added semicolon)
tests/testthat/test-sample-by-snp-matrix-utility.R	Enhanced matrix utility tests with new wide format conversion functions
tests/testthat/test-plotting.R	Updated plotting tests to use `vcf.long.format = TRUE`
tests/testthat/test-pgs-application.R	Comprehensive updates to test both wide and long format equivalence
tests/testthat/test-dosage-calculator.R	Updated dosage calculation tests for matrix input support
tests/testthat/helper-test-utils.R	Added utility functions for converting between VCF formats
man/import.vcf.Rd	Updated documentation for new VCF import interface
man/combine.vcf.with.pgs.Rd	Updated documentation for data.table compatibility
man/apply.polygenic.score.Rd	Updated documentation for new VCF format parameter
R/variant-by-sample-matrix-utility.R	Rewrote matrix utilities for data.table efficiency and added VCF conversion functions
R/run-pgs-statistics.R	Migrated statistical functions to data.table for performance
R/handle-vcf.R	Major refactor of VCF import to support wide/long format outputs
R/handle-multiallelic-sites.R	Enhanced multiallelic site handling for matrix-based processing
R/combine-vcf-with-pgs.R	Optimized VCF-PGS merging using data.table operations
R/calculate-dosage.R	Enhanced dosage calculation to support matrix inputs
R/assess-strand-flip.R	Optimized strand flip assessment with vectorized operations
R/apply-pgs.R	Complete rewrite for matrix-based processing and data.table efficiency
NEWS.md	Added version 4.0.0 changelog with breaking changes
NAMESPACE	Updated imports to use data.table instead of reshape2
DESCRIPTION	Version bump to 4.0.0 and removed reshape2 dependency

R/calculate-dosage.R

whelena

Looks good and I don't have any major comments

R/apply-pgs.R

R/assess-strand-flip.R

tests/testthat/helper-test-utils.R

alkaZeltser added 30 commits July 10, 2025 19:19

add split matrix workflow

cfcbb37

improve NA handling

5520a5a

add split matrix multiallelic workflow

186935c

add split matrix import functionality and modify output (breaking change)

2b3512b

update dosage calc tests

ac5a2fe

update pgs application tests

30bbad6

update vcf import tests

9ea556d

update merge tests

bc37b65

add helper test function

e2fab67

add data.table to namespace

107f3ab

rewrite vcf merging in data.table for efficiency

9a2cd7f

minor vcf.import refactoring for efficiency

2b99db7

minor updates to tests for data.table refactor

3950f53

update vignette with long.vcf option, temporary fix

79d5c37

update plotting test

4151242

refactor wide format logic for efficiency, fix multiallelic site handling bugs

29826c7

refactor dosage calculation for efficiency

3a346ed

fix merging bug

b2026bc

refactor multiallelic handling of wide input for efficiency

4772d50

refactor stats with data.table

b5a2f24

refactor sample by variant matrix utiltity for efficiency

48853ca

add helper function to convert between long form input to wide form input

34a8655

tweak tested error messages

7731229

add missing wide format equivalence check

73bcdc3

update expected output formatting check for matrix utility

714bdf3

update version to 4.0.0

51063f2

fix data.table conversion bug

48e0554

temp fixes to user guide

2fd8e19

remove commented old code

2c78981

data.table syntax handling for CRAN

6beb190

alkaZeltser added 15 commits July 15, 2025 16:19

update vcf import documentation

3b7dfe2

remove reshape2 dependency

21ef108

update docs with new vcf format toggle

14cc47d

update example for vcf pgs merge

ce2684f

formatting tweaks

44a3b05

resolve merge

02de3ce

update sample by snp matrix utility

83989ec

modify apply.polygenic.score to convert long format to wide format

baac820

remove redundant long vcf format workflow

e7c8b99

streamline allele validator

f015326

vectorize allele flipping

cdfe4bb

refactor strand flip assessment for efficiency

a81f93b

update NEWS

e43168e

clean up old code

f563064

lintr fixes

666d8cc

alkaZeltser requested review from jarbet, rhughwhite and whelena July 30, 2025 00:59

jarbet requested a review from Copilot July 30, 2025 17:12

Copilot AI reviewed Jul 30, 2025

R/calculate-dosage.R (outdated, resolved)

whelena approved these changes Jul 31, 2025

R/apply-pgs.R (resolved)

R/apply-pgs.R (resolved)

R/assess-strand-flip.R (outdated, resolved)

tests/testthat/helper-test-utils.R (outdated, resolved)

alkaZeltser added 3 commits August 5, 2025 15:09

clean up dead code

65872f2

only report unique instances of invalid alleles

2a54800

remove redundant vcf format conversion function

8b66e5f

alkaZeltser merged commit d595990 into main Aug 5, 2025
5 of 6 checks passed

alkaZeltser deleted the nzeltser-optimize-main-functions branch August 5, 2025 22:32
