Skip to content
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 4 additions & 3 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,8 @@ Version: 3.0.2
Authors@R: c(
person('Paul', 'Boutros', role = 'cre', email = '[email protected]'),
person('Nicole', 'Zeltser', role = 'aut', comment = c(ORCID = '0000-0001-7246-2771')),
person('Rachel', 'Dang', role = 'ctb'))
person('Rachel', 'Dang', role = 'ctb'),
person('Raag', 'Agrawal', role = 'ctb'))
Description: Simple and transparent parsing of genotype/dosage data
from an input Variant Call Format (VCF) file, matching of genotype
coordinates to the component Single Nucleotide Polymorphisms (SNPs)
Expand All @@ -14,7 +15,7 @@ Description: Simple and transparent parsing of genotype/dosage data
in accordance with the additive weighted sum of dosages model. Methods
are designed in reference to best practices described by
Collister, Liu, and Clifton (2022) <doi:10.3389/fgene.2022.818574>.
Depends:
Depends:
R (>= 4.2.0)
Imports:
vcfR,
Expand All @@ -23,7 +24,7 @@ Imports:
reshape2,
BoutrosLab.plotting.general,
lattice
Suggests:
Suggests:
knitr,
rmarkdown,
testthat (>= 3.0.0)
Expand Down
5 changes: 5 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,10 @@
# Unreleased

# ApplyPolygenicScore unreleased
## Changed
* Fixed regression of combine.vcf.with.pgs() function that prevented it from handling multiple rsIDs on the same line.
* added new contributor

# ApplyPolygenicScore 3.0.2

## Changed
Expand Down
41 changes: 24 additions & 17 deletions R/combine-vcf-with-pgs.R
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great improvement on clarity. I noticed that the original iteration uses the Indiv column, is that column not necessary or not always available?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we replicate all columns in the dataframe so we don't explicitely need to reference Indiv.

Original file line number Diff line number Diff line change
Expand Up @@ -105,23 +105,30 @@ combine.vcf.with.pgs <- function(vcf.data, pgs.weight.data) {
missing.snp.pgs.weight.data <- subset(missing.snp.merged.data, select = colnames(pgs.weight.data));
rm(missing.snp.merged.data);

# Split VCF$ID column into separate rows for each rsID (multiple rsIDs are separated by ;)
# most efficient way to do this is to use the data.table package
if (any(grepl(';', vcf.data$ID))) {
data.table::setDT(vcf.data);
split.rsid.vcf.data <- merge(
x = vcf.data,
# split only entries with multiple rsIDs, save in new column, and merge back with the original data
y = vcf.data[grepl(';', get('ID')), unique(unlist(strsplit(as.character(get('ID')), ';', fixed = TRUE))), by = .(get('Indiv'), get('CHROM'), get('POS'))
][,.(new.ID = get('V1'), get('Indiv'), get('CHROM'), get('POS'))],
by = c('CHROM', 'POS', 'Indiv'),
all = TRUE
);
# replace entries with multiple rsIDs with the new, split, rsID
split.rsid.vcf.data <- split.rsid.vcf.data[!is.na(new.ID), ID := new.ID][, new.ID := NULL];
} else {
split.rsid.vcf.data <- vcf.data;
}
# Expand the VCF$ID column to a row-per-rsID format.
# Some variants have multiple rsIDs in the ID column separated by semicolons.
# We detect such cases using grepl, split them, and expand the data so that each rsID has its own row.
# we create a new data frame with the expanded rsID data
if (any(grepl(';', vcf.data$ID))) {
split.rows <- strsplit(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe keep only unique IDs after splitting? the original code has it so I assume there might be edge cases where the ID column have non-unique values after splitting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be a row where rsABC;rsABC exists?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be a rare case, but technically possible. Not too hard to implement w/ Raag's changes but the thing about these operations is that the data can get really big. The starting data frame has the number of rows equivalent to the product of the cohort size and number of variants. The goal of data.table was to minimize the number of copies of this potentially massive object being held in memory at any given time. Not sure what the most efficient strategy is with a base R implementation.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have a small ish file that works with both versions you can try benchmarking with R's peakRAM package.

x = as.character(vcf.data$ID),
split = ';',
fixed = TRUE
);

row.indices <- rep(
x = seq_len(nrow(vcf.data)),
times = lengths(split.rows)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fairly certain that rep needs to be run in "each" mode not "times".

);

expanded.vcf <- vcf.data[row.indices, ];

expanded.vcf$ID <- unlist(split.rows);

split.rsid.vcf.data <- expanded.vcf;
} else {
split.rsid.vcf.data <- vcf.data;
}

# merge missing SNP data on split rsID
merged.vcf.with.missing.pgs.data <- merge(
Expand Down