Ragrawal fix semicolon #75
This PR closes #74. I replaced the data.table syntax with base R and validated that it no longer errors out on a sample dataset with semicolons between rsIDs.
R/combine-vcf-with-pgs.R
Outdated
    all = TRUE
    split.rows <- strsplit(
        as.character(vcf.data$ID),
        ';',
I would like to see each of the arguments explicitly written out for readability.
Done! Now with semicolons as well
This is a great improvement in clarity. I noticed that the original iteration uses the Indiv column. Is that column not necessary, or not always available?
We replicate all columns in the data frame, so we don't need to reference Indiv explicitly.
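To illustrate the point about replicating all columns, here is a minimal sketch of a base R row expansion. It is not the merged implementation: the toy `vcf.data` (with made-up `CHROM`, `ID`, and `Indiv` values) and the name `expanded.vcf.data` are assumptions for illustration only.

```r
# Toy stand-in for vcf.data; the real object has more rows and columns.
vcf.data <- data.frame(
    CHROM = c('chr1', 'chr2'),
    ID = c('rs1;rs2', 'rs3'),
    Indiv = c('sample1', 'sample1'),
    stringsAsFactors = FALSE
    );

expanded.vcf.data <- vcf.data;

if (any(grepl(pattern = ';', x = vcf.data$ID, fixed = TRUE))) {
    # One character vector of rsIDs per original row
    split.rows <- strsplit(
        x = as.character(vcf.data$ID),
        split = ';',
        fixed = TRUE
        );
    # Repeat each row index once per rsID found in that row
    row.indices <- rep(
        x = seq_len(nrow(vcf.data)),
        times = lengths(split.rows)
        );
    # Indexing by row copies every column, so Indiv comes along
    # without being named explicitly
    expanded.vcf.data <- vcf.data[row.indices, , drop = FALSE];
    expanded.vcf.data$ID <- unlist(split.rows);
    rownames(expanded.vcf.data) <- NULL;
    }

print(expanded.vcf.data);
```

Because the rows are selected by index, any other columns present in the data frame are carried over in the same way.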
R/combine-vcf-with-pgs.R
Outdated
    # We detect such cases using grepl, split them, and expand the data so that each rsID has its own row.
    # we create a new data frame with the expanded rsID data
    if (any(grepl(';', vcf.data$ID))) {
        split.rows <- strsplit(
Maybe keep only unique IDs after splitting? The original code does this, so I assume there might be edge cases where the ID column has non-unique values after splitting.
Would this be a row where rsABC;rsABC exists?
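One possible way to handle a row like rsABC;rsABC, sketched under the assumption that duplicates only need to be removed within a single row (the `ids` vector and `unique.split.rows` name are hypothetical and not taken from the PR):

```r
# Hypothetical ID values, including a row with a repeated rsID
ids <- c('rsABC;rsABC', 'rs1;rs2', 'rs3');

split.rows <- strsplit(x = ids, split = ';', fixed = TRUE);

# Drop repeated rsIDs within each row before computing repeat counts
unique.split.rows <- lapply(X = split.rows, FUN = unique);

row.indices <- rep(
    x = seq_along(ids),
    times = lengths(unique.split.rows)
    );

print(unlist(unique.split.rows));
print(row.indices);
```

This does not address the same rsID appearing in two different rows; that would need a separate check on the expanded data.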
I think that would be a rare case, but technically possible. Not too hard to implement w/ Raag's changes, but the thing about these operations is that the data can get really big. The starting data frame has a number of rows equal to the product of the cohort size and the number of variants. The goal of data.table was to minimize the number of copies of this potentially massive object held in memory at any given time. Not sure what the most efficient strategy is with a base R implementation.
If you have a smallish file that works with both versions, you can try benchmarking with R's peakRAM package.
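A rough sketch of what such a benchmark could look like, assuming the peakRAM package is installed. The helper `expand.ids` and the synthetic `toy.data` are hypothetical stand-ins, not code or data from this PR; a real comparison would run both the data.table and base R versions on the same file.

```r
# Hypothetical helper wrapping a base R expansion of semicolon-separated IDs
expand.ids <- function(vcf.data) {
    split.rows <- strsplit(
        x = as.character(vcf.data$ID),
        split = ';',
        fixed = TRUE
        );
    row.indices <- rep(
        x = seq_len(nrow(vcf.data)),
        times = lengths(split.rows)
        );
    expanded <- vcf.data[row.indices, , drop = FALSE];
    expanded$ID <- unlist(split.rows);
    rownames(expanded) <- NULL;
    return(expanded);
    }

# Synthetic data: 10,000 rows, half of them holding two rsIDs
toy.data <- data.frame(
    ID = rep(x = c('rs1;rs2', 'rs3'), times = 5000),
    Indiv = 'sample1',
    stringsAsFactors = FALSE
    );

# Only run the benchmark if peakRAM is available
if (requireNamespace('peakRAM', quietly = TRUE)) {
    print(peakRAM::peakRAM(expand.ids(toy.data)));
    }
```

Passing a second call for the data.table version to the same `peakRAM()` invocation would put both measurements side by side.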
Given that this bug slipped past due to a missing unit test, please add an appropriate test.
R/combine-vcf-with-pgs.R
Outdated
    row.indices <- rep(
        x = seq_len(nrow(vcf.data)),
        times = lengths(split.rows)
Fairly certain that rep needs to be run in "each" mode not "times".
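For reference when weighing this comment, a small sketch of how the two arguments of base R's `rep` behave on a toy `split.rows` (the values are made up). Per the `rep` documentation, `times` may be a vector with one count per element of `x`, while `each` is a single count, and only its first element is used if a vector is supplied.

```r
# Toy split result: rows 1, 2, 3 hold 2, 1, and 3 rsIDs respectively
split.rows <- list(c('rs1', 'rs2'), 'rs3', c('rs4', 'rs5', 'rs6'));
row.ids <- seq_along(split.rows);

# 'times' as a vector: one repeat count per element of x
by.times <- rep(x = row.ids, times = lengths(split.rows));
print(by.times);  # 1 1 2 3 3 3

# 'each' as a scalar: the same repeat count for every element of x
by.each <- rep(x = row.ids, each = 2);
print(by.each);  # 1 1 2 2 3 3
```

With a different number of rsIDs per row, the per-element counts in `by.times` line up with `unlist(split.rows)`, whereas a single `each` value cannot vary by row.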
@raagagrawal @forbiddenpersimmon @dan-knight @whelena I'm gonna contribute to this PR, starting with making some unit tests. Stay tuned.
rachelmadang
left a comment
LGTM!
Confirmed that new fixes work with @raagagrawal's test file and all new unit tests pass. Going to merge this in.
I have read the code review guidelines and the code review best practice on GitHub check-list.
The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
I have set up or verified the branch protection rule following the github standards before opening this pull request.
I have added the changes included in this pull request to NEWS under the next release version or unreleased, and updated the date.
I have updated the version number in metadata.yaml and DESCRIPTION.
Both R CMD build and R CMD check run successfully.
Closes #74