Skip to content

Conversation

@raagagrawal
Copy link
Contributor

@raagagrawal raagagrawal commented Mar 22, 2025

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)-[brief_description_of_branch].

  • I have set up or verified the branch protection rule following the github standards before opening this pull request.

  • I have added the changes included in this pull request to NEWS under the next release version or unreleased, and updated the date.

  • I have updated the version number in metadata.yaml and DESCRIPTION.

  • Both R CMD build and R CMD check run successfully.

Closes #...
#74

@raagagrawal
Copy link
Contributor Author

This PR closes #74.

I replaced the data.table syntax with base R and validated that it now doesn't error out on a sample dataset with semicolons between rsIDs.

all = TRUE
split.rows <- strsplit(
as.character(vcf.data$ID),
';',

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see each of the arguments explicitly written out for readability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Now with semicolons as well

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great improvement on clarity. I noticed that the original iteration uses the Indiv column, is that column not necessary or not always available?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we replicate all columns in the dataframe so we don't explicitely need to reference Indiv.

# We detect such cases using grepl, split them, and expand the data so that each rsID has its own row.
# we create a new data frame with the expanded rsID data
if (any(grepl(';', vcf.data$ID))) {
split.rows <- strsplit(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe keep only unique IDs after splitting? the original code has it so I assume there might be edge cases where the ID column have non-unique values after splitting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be a row where rsABC;rsABC exists?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be a rare case, but technically possible. Not too hard to implement w/ Raag's changes but the thing about these operations is that the data can get really big. The starting data frame has the number of rows equivalent to the product of the cohort size and number of variants. The goal of data.table was to minimize the number of copies of this potentially massive object being held in memory at any given time. Not sure what the most efficient strategy is with a base R implementation.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have a small ish file that works with both versions you can try benchmarking with R's peakRAM package.

@rachelmadang
Copy link

Given that this bug slipped past due to missing unit test, please add appropriate test.


row.indices <- rep(
x = seq_len(nrow(vcf.data)),
times = lengths(split.rows)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fairly certain that rep needs to be run in "each" mode not "times".

@alkaZeltser
Copy link
Collaborator

@raagagrawal @forbiddenpersimmon @dan-knight @whelena I'm gonna contribute to this PR, starting with making some unit tests. Stay tuned.

Copy link

@rachelmadang rachelmadang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@alkaZeltser
Copy link
Collaborator

Confirmed that new fixes work with @raagagrawal's test file and all new unit tests pass. Going to merge this in.

@alkaZeltser alkaZeltser merged commit 07b8d71 into main Apr 1, 2025
6 checks passed
@raagagrawal raagagrawal deleted the ragrawal-fix-semicolon branch April 1, 2025 01:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants