Feature/auto detect format #77
base: main
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

```
@@ Coverage Diff @@
##             main      #77      +/-   ##
==========================================
- Coverage   98.82%   98.68%   -0.14%
==========================================
  Files           5        5
  Lines         255      305      +50
==========================================
+ Hits          252      301      +49
- Misses          3        4       +1
```

☔ View full report in Codecov by Sentry.
Hi @rajatkriplani, thanks for this! I approved the CI workflows now; would you mind having a look at the failed checks? Additionally, we usually open PRs separately for different features / issues, but I noticed in this PR the diff with … Feel free to have a go, and let me know if you get stuck at any point.
Force-pushed 47105b4 to aacdbe8
Hi @sfmig, I have improved the test coverage, but the following lines are still uncovered:

```python
except Exception as e:  # Catch other potential file reading errors
    raise ValueError(f"Could not read file {file_path}: {e}") from e
```

For the above, mocking is required, so should I go with `unittest.mock`?
Hi @rajatkriplani, yes, do have a go at using `unittest.mock` for this. You may find examples of its use in the …
Hope this helps!
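As a minimal sketch of the approach being discussed: patching `builtins.open` with `unittest.mock.patch` lets a test force the generic `except Exception` branch without touching the filesystem. The `_read_json` helper below is a hypothetical stand-in for the file-reading part of `_detect_format`, not the actual function from this PR.

```python
import json
from pathlib import Path
from unittest.mock import patch


def _read_json(file_path: Path) -> dict:
    # Hypothetical stand-in for the file-reading part of _detect_format
    try:
        with open(file_path) as f:
            return json.load(f)
    except json.JSONDecodeError as e:
        raise ValueError(
            f"Error decoding JSON data from file {file_path}: {e}"
        ) from e
    except Exception as e:  # catch other potential file reading errors
        raise ValueError(f"Could not read file {file_path}: {e}") from e


# Patch builtins.open so it raises before any real I/O happens,
# exercising the otherwise hard-to-reach except branch.
with patch("builtins.open", side_effect=OSError("disk error")):
    try:
        _read_json(Path("annotations.json"))
    except ValueError as e:
        caught = str(e)
```

In a pytest test the same patch would typically be combined with `pytest.raises(ValueError, match="Could not read file")`.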
Also, do have a look at the docs building check which seems to be failing (alongside the code coverage checks) |
@sfmig I have added mocking for some of the tests; please have a look.
Hello @sfmig
Hello @sfmig just dropping a quick follow-up here since Zulip’s been a bit quiet — would love any further thoughts or feedback on this when you get a chance. |
sfmig left a comment
Hi @rajatkriplani, thanks for having a go!
I think this is a good attempt, although we will likely need to iterate a bit. I left some in-line comments in the code, but also have some general comments:
- In general, the PR feels a bit too verbose. If you can, I would try to keep the implementation as minimal as possible. For example, some of the error handling is dealt with further down the pipeline using the validators module. I would recommend having a look at the full annotation pipeline (i.e. the process of reading an annotation file as a dataframe), understanding it well, and then trying to keep only the essential bits here.
- Could you have a look at the CI checks, and investigate why they are failing? I found these two likely culprits in the logs:
```
/home/runner/work/ethology/ethology/ethology/annotations/io/load_bboxes.py:docstring of ethology.annotations.io.load_bboxes.from_files:39: ERROR: Unexpected indentation. [docutils]
/home/runner/work/ethology/ethology/ethology/annotations/io/load_bboxes.py:docstring of ethology.annotations.io.load_bboxes.from_files:40: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils]
```
Hope this helps!
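For context on those docutils errors: in reStructuredText (which Sphinx parses from NumPy-style docstrings), an indented block such as a list or literal block must be preceded by a blank line, otherwise docutils reports "Unexpected indentation". The sketch below illustrates the pattern; the signature and wording are simplified placeholders, not the actual `from_files` docstring from this PR.

```python
def from_files(file_paths, format="auto"):
    """Load bounding box annotations from one or more files.

    Parameters
    ----------
    format : str, optional
        Annotation format, by default ``"auto"``. Accepted values:

        - ``"VIA"``
        - ``"COCO"``
        - ``"auto"`` (infer the format from the file contents)

    Note the blank line before the bullet list above: omitting it is
    what typically triggers the "Unexpected indentation" error.
    """
```

Running `sphinx-build` locally (or the docs CI job) should confirm whether the warning disappears.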
```python
if not file_path.is_file():
    raise FileNotFoundError(f"Annotation file not found: {file_path}")

try:
    with open(file_path) as f:
        # Load only enough to check keys, avoid loading huge files
        # if possible.
        # For simplicity here, load the whole thing.
        # Optimization is possible if needed.
        data = json.load(f)
except json.JSONDecodeError as e:
    raise ValueError(
        f"Error decoding JSON data from file {file_path}: {e}"
    ) from e
except Exception as e:  # Catch other potential file reading errors
    raise ValueError(f"Could not read file {file_path}: {e}") from e

if not isinstance(data, dict):
    raise ValueError(
        f"Expected JSON root to be a dictionary, but got {type(data)} "
        f"in file {file_path}"
    )
```
I think we could do away with these since later we run the validators when loading the data...
Would you mind having a look and checking if we can remove this?
Thanks for the detailed feedback! I've refactored `_detect_format` to be more minimal and rely on the downstream validators.
```python
# --- Format Detection Logic ---
determined_format: Literal["VIA", "COCO"]
if format == "auto":
```
In the multiple file case: could we instead infer the format from every file, and throw an error if there is no consensus between them?
Added a new helper `_determine_format_from_paths` that calls `_detect_format` for each file in a list and raises a `ValueError` if the detected formats are inconsistent.
```diff
 file_paths: Path | str | list[Path | str],
-format: Literal["VIA", "COCO"],
 images_dirs: Path | str | list[Path | str] | None = None,
+format: Literal["VIA", "COCO", "auto"] = "auto",  # Changed default and
```
Feel free to remove the comments that highlight the changes: it adds a bit of clutter and the changes are clear to the reviewer in the Github interface
Done
```diff
 def _from_multiple_files(
-    list_filepaths: list[Path | str], format: Literal["VIA", "COCO"]
+    list_filepaths: list[Path], format: Literal["VIA", "COCO"]
```
The `list_filepaths` type annotation should be `list[Path | str]`, right?
```python
@pytest.fixture
def invalid_json_file(tmp_path: Path) -> Path:
```
Would you mind having a look at the existing fixtures, and re-using them whenever possible (rather than creating new ones)?
Description

What is this PR

Why is this PR needed?
This PR aims to improve user experience by automatically detecting the format based on the file's content.

What does this PR do?
This PR introduces automatic detection for the format of input annotation files:
- Adds `_detect_format` within `load_bboxes.py`.
- Changes the `from_files` function signature to make the `format` argument optional, defaulting to `"auto"`.
- Adds tests covering `format="auto"` success and failure scenarios.

References
Closes #43

How has this PR been tested?
The code has been tested locally by running `pytest`. New unit tests have been added to `tests/test_unit/test_annotations/test_load_bboxes.py`.

Is this a breaking change?
No. This PR only adds a new, optional behavior (`format="auto"`) as the default.

Does this PR require an update to the documentation?
Yes. The docstring for the `ethology.annotations.io.load_bboxes.from_files` function …

Checklist: