-
Notifications
You must be signed in to change notification settings - Fork 40
Add extra args to collate to handle edge cases #302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #302 +/- ##
==========================================
- Coverage 94.41% 93.48% -0.94%
==========================================
Files 57 57
Lines 3188 3223 +35
==========================================
+ Hits 3010 3013 +3
- Misses 178 210 +32
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Hello, this is a courtesy notice that Pcytominer's |
|
Hi @bethac07 - thanks for your efforts on this! I wanted to check in to see if you're still planning to move forward with this PR. If there's any way we may assist or if you have any questions just let us know. Thanks for taking a moment to make a contribution here! |
|
Hi @d33bs, Just for funsies, I've rebased it and cleaned up some docstrings, but if collate is being deprecated, spending a bunch of time writing tests (which is all this needed aside from those docstrings) is probably not worth my nor your time. In a perfect world, despite the deprecation and ruff comments I'd still want to merge it so that it can be accessed if needed in a pre 2.0 |
|
Thanks @bethac07 ! I understand what you mean about this and feel similarly. Just the same, I'd like to keep our CI tests passing to help avoid confusion with future changes. Would it be alright with you for me to modify the code so it passes |
|
Sure, fine with me :) |
|
Hi @bethac07 - I made some changes to help us pass linting checks. Double checking: is this ready for review and moving forward (the PR is still in draft state and includes |
|
That was just the tests, if we can pull without, it's all set! Thanks much for the help |
d33bs
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| shutil.rmtree(backend_dir) | ||
|
|
||
|
|
||
| def find_and_fix_metadata(path_to_plate_folder, overwrite=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a docstring and type hints to this function to help increase the documentation and understanding.
| append_metadata(image_csv, overwrite) | ||
|
|
||
|
|
||
| def append_metadata(path_to_csv, overwrite=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding a docstring and type hints to this function to help increase the documentation and understanding.
| download_flags=[], | ||
| upload_flags=[], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a blank / empty list for default arguments somewhat dangerous for Python functions because when the function evaluates at runtime it quietly will couple all calls to the same list object (creating sometimes surprising results). Consider using None as a default here and creating a new list each time within the function instead, e.g.:
if download_flags is None:
download_flags = [] | all_meta = path_to_csv.split("/")[-2] | ||
| plate = "-".join(all_meta.split("-")[:-2]) | ||
| well = all_meta.split("-")[-2] | ||
| site = all_meta.split("-")[-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These looked to have a very specific formatting expectation that could go awry. Consider adding an example of what's expected here with a code comment or adding to a docstring.
|
|
||
| remote_aggregated_file = f"{aws_remote}/backend/{batch}/{plate}/{plate}.csv" | ||
|
|
||
| sync_cmd = f"aws s3 sync --exclude * --include */Cells.csv --include */Nuclei.csv --include */Cytoplasm.csv --include */Image.csv {remote_input_dir} {input_dir}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a need for run_check_errors to take str arguments anymore after these changes? If not, consider revising run_check_errors to only take list, reducing the complexity of that function (and potential for edge cases).
| "*/Image.csv", | ||
| remote_input_dir, | ||
| input_dir, | ||
| *download_flags, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a simple test we could add to help describe and confirm the new download behaviors? I bring this up as this change may impact #301 (or vice versa) and it could be important to understand distinction.
| print(f"Downloading CSVs from {remote_input_dir} to {input_dir}") | ||
| run_check_errors(sync_cmd) | ||
|
|
||
| if overwrite_metadata or append_metadata: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a simple test we could add to help implement append_metadata? Similar to the above, I am concerned that if we don't validate the functionality that the code might become more difficult to troubleshoot (or less usable than it is currently).
| errors="ignore", | ||
| ) | ||
| edited = True | ||
| insertion_index = list(df.columns).index("ModuleError_01LoadData") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This hard-coding could result in inflexibility for the function. Consider making this more flexible with a default index selection and leveraging df.columns.get_loc() to avoid a cast to list for df.columns.

Description
Collate handles the standard cases pretty well at this point, but there are a number of edge cases I've been writing extra code in my own fork to handle; it would be nice to add first-class support
What is the nature of your change?
Checklist
Please ensure that all boxes are checked before indicating that a pull request is ready for review.