Ragrawal fix semicolon #75

raagagrawal · 2025-03-22T06:23:00Z

I have read the code review guidelines and the code review best practice on GitHub check-list.
The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
I have set up or verified the branch protection rule following the github standards before opening this pull request.
I have added the changes included in this pull request to NEWS under the next release version or unreleased, and updated the date.
I have updated the version number in metadata.yaml and DESCRIPTION.
Both R CMD build and R CMD check run successfully.

Closes #...
#74

raagagrawal · 2025-03-22T06:29:43Z

This PR closes #74.

I replaced the data.table syntax with base R and validated that it now doesn't error out on a sample dataset with semicolons between rsIDs.

rachelmadang · 2025-03-23T19:10:30Z

R/combine-vcf-with-pgs.R

-                all = TRUE
+            split.rows <- strsplit(
+                as.character(vcf.data$ID),
+                ';',


I would like to see each of the arguments explicitly written out for readability.

Done! Now with semicolons as well

whelena · 2025-03-23T20:57:04Z

R/combine-vcf-with-pgs.R

This is a great improvement in clarity. I noticed that the original iteration uses the Indiv column. Is that column not necessary, or not always available?

We replicate all columns in the data frame, so we don't need to reference Indiv explicitly.
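
To illustrate the reply above, here is a minimal base R sketch of the row expansion on toy data. The `gt` column and all values are made up for illustration, and this is not necessarily the exact code in `R/combine-vcf-with-pgs.R`; only `vcf.data`, `split.rows`, `row.indices`, `ID`, and `Indiv` come from this PR.

```r
# Toy stand-in for vcf.data; the gt column and all values are hypothetical
vcf.data <- data.frame(
    ID    = c('rs1;rs2', 'rs3', 'rs4;rs5;rs6'),
    Indiv = c('sample1', 'sample1', 'sample2'),
    gt    = c('0/1', '1/1', '0/0'),
    stringsAsFactors = FALSE
)

# Split each ID on semicolons: one character vector per original row
split.rows <- strsplit(
    x     = as.character(vcf.data$ID),
    split = ';',
    fixed = TRUE
)

# Repeat each row index once per rsID found in that row
row.indices <- rep(
    x     = seq_len(nrow(vcf.data)),
    times = lengths(split.rows)
)

# Indexing by row replicates every column at once, Indiv included
expanded.data <- vcf.data[row.indices, , drop = FALSE]
expanded.data$ID <- unlist(split.rows)
rownames(expanded.data) <- NULL
```

Because whole rows are indexed, `Indiv` and any other column travel with each rsID without being named.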

whelena · 2025-03-23T20:59:57Z

R/combine-vcf-with-pgs.R

+    # We detect such cases using grepl, split them, and expand the data so that each rsID has its own row.
+    # we create a new data frame with the expanded rsID data
+    if (any(grepl(';', vcf.data$ID))) {
+        split.rows <- strsplit(


Maybe keep only unique IDs after splitting? The original code does this, so I assume there might be edge cases where the ID column has non-unique values after splitting.

Would this be a row where rsABC;rsABC exists?

I think that would be a rare case, but technically possible. Not too hard to implement w/ Raag's changes but the thing about these operations is that the data can get really big. The starting data frame has the number of rows equivalent to the product of the cohort size and number of variants. The goal of data.table was to minimize the number of copies of this potentially massive object being held in memory at any given time. Not sure what the most efficient strategy is with a base R implementation.

If you have a smallish file that works with both versions, you can try benchmarking with R's peakRAM package.
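
One possible way to handle the `rsABC;rsABC` case from this thread, sketched on toy data: deduplicate each row's split IDs before expanding, so the deduplication runs on the list of split IDs and not on the expanded data frame. This is an assumption about how it could be done in base R, not a description of what the final commit does, and its memory behaviour on real data has not been measured here.

```r
# Toy input with a repeated rsID inside a single ID field (values are hypothetical)
vcf.data <- data.frame(
    ID    = c('rsABC;rsABC', 'rsDEF;rsGHI'),
    Indiv = c('sample1', 'sample1'),
    stringsAsFactors = FALSE
)

# Split on semicolons, then keep only the unique rsIDs within each row
split.rows <- lapply(
    X   = strsplit(x = as.character(vcf.data$ID), split = ';', fixed = TRUE),
    FUN = unique
)

# Expand rows using the deduplicated counts
row.indices <- rep(
    x     = seq_len(nrow(vcf.data)),
    times = lengths(split.rows)
)
expanded.data <- vcf.data[row.indices, , drop = FALSE]
expanded.data$ID <- unlist(split.rows)
rownames(expanded.data) <- NULL
```

The first row yields a single `rsABC` row instead of two identical ones; the second row still expands to `rsDEF` and `rsGHI`.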

rachelmadang · 2025-03-24T17:44:53Z

Given that this bug slipped past due to a missing unit test, please add an appropriate test.

alkaZeltser · 2025-03-24T18:18:31Z

R/combine-vcf-with-pgs.R

+
+        row.indices <- rep(
+            x           = seq_len(nrow(vcf.data)),
+            times       = lengths(split.rows)


Fairly certain that rep needs to be run in "each" mode not "times".
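
For reference, the two `rep` modes can be compared on a toy input. This is a small sketch with made-up rsIDs, not taken from the PR's test file. With a vector of per-row counts, `times` repeats each element by its own count, while `each` uses only the first element of the vector (and warns).

```r
# Toy split result: the first row holds two rsIDs, the others one each
split.rows <- list(c('rs1', 'rs2'), 'rs3', 'rs4')
n.ids <- lengths(split.rows)

# times with a vector: element i is repeated n.ids[i] times
by.times <- rep(x = seq_along(split.rows), times = n.ids)
# 1 1 2 3

# each with a vector: only n.ids[1] is used, so every element is repeated twice
by.each <- suppressWarnings(rep(x = seq_along(split.rows), each = n.ids))
# 1 1 2 2 3 3

# Only the times result lines up with the flattened rsIDs
length(by.times) == length(unlist(split.rows))
```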

alkaZeltser · 2025-03-24T18:20:14Z

@raagagrawal @forbiddenpersimmon @dan-knight @whelena I'm gonna contribute to this PR, starting with making some unit tests. Stay tuned.

rachelmadang

LGTM!

alkaZeltser · 2025-04-01T00:52:13Z

Confirmed that new fixes work with @raagagrawal's test file and all new unit tests pass. Going to merge this in.

raagagrawal added 2 commits March 21, 2025 23:21

fix

9409634

remove commented code

8c2daa1

raagagrawal requested review from alkaZeltser, dan-knight, rachelmadang and whelena March 22, 2025 06:27

update NEWS and contributors

c9d677c

rachelmadang reviewed Mar 23, 2025

arguments written out

9b95a6d

whelena approved these changes Mar 23, 2025

alkaZeltser reviewed Mar 24, 2025

alkaZeltser added 2 commits March 26, 2025 11:33

save original vcf rsID for output

7e701a2

add tests for semicolon-separated ID merge

0fd71b7

alkaZeltser requested a review from rachelmadang March 26, 2025 18:46

rachelmadang approved these changes Mar 26, 2025

handle duplicate rsIDs

7aff72b

alkaZeltser merged commit 07b8d71 into main Apr 1, 2025
6 checks passed

raagagrawal deleted the ragrawal-fix-semicolon branch April 1, 2025 01:09
