@rajatkriplani

Description

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?
This PR aims to improve user experience by automatically detecting the format based on the file's content.

What does this PR do?
This PR introduces automatic detection for the format of input annotation files:

  • Adds a private helper function _detect_format within load_bboxes.py
  • Modifies the from_files function signature to make the format argument optional, defaulting to "auto".
  • When format="auto", from_files now calls _detect_format on the first input file to determine the format for loading.
  • Adds new unit tests to cover the format="auto" success and failure scenarios.

References

Closes #43

How has this PR been tested?

The code has been tested locally by running pytest. New unit tests have been added to tests/test_unit/test_annotations/test_load_bboxes.py

Is this a breaking change?

No. This PR only adds a new, optional behavior (format="auto") as the default.

Does this PR require an update to the documentation?

Yes, the docstring for the ethology.annotations.io.load_bboxes.from_files function.

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit

@codecov

codecov bot commented Apr 3, 2025

Codecov Report

Attention: Patch coverage is 98.11321% with 1 line in your changes missing coverage. Please review.

Project coverage is 98.68%. Comparing base (abc3c5b) to head (a8eaf5d).

Files with missing lines                  Patch %   Lines
ethology/annotations/io/load_bboxes.py    98.11%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #77      +/-   ##
==========================================
- Coverage   98.82%   98.68%   -0.14%     
==========================================
  Files           5        5              
  Lines         255      305      +50     
==========================================
+ Hits          252      301      +49     
- Misses          3        4       +1     

@sfmig
Member

sfmig commented Apr 3, 2025

Hi @rajatkriplani, thanks for this!

I approved the CI workflows now, would you mind having a look at the failed checks?

Additionally, we usually open PRs separately for different features / issues, but I noticed in this PR the diff with main includes the labelling guide as well as the auto detect format work. Would you mind changing it so that here you only show the auto detect format work? That makes it easier for the reviewer to go through.

Feel free to have a go, let me know if you get stuck at any point.

@rajatkriplani rajatkriplani force-pushed the feature/auto-detect-format branch from 47105b4 to aacdbe8 on April 3, 2025 14:17
@rajatkriplani
Author

Hi @sfmig, I have improved the test coverage, but the following lines in ethology/annotations/io/load_bboxes.py are still uncovered:

except Exception as e:  # Catch other potential file reading errors
    raise ValueError(f"Could not read file {file_path}: {e}") from e

Covering them requires mocking, so should I go with unittest.mock?

@sfmig
Member

sfmig commented Apr 8, 2025

Hi @rajatkriplani,

Yes do have a go at using unittest.mock for this. You may find examples of its use in the movement tests codebase.

Hope this helps!
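
For reference, one way to reach that generic except branch is to patch builtins.open so that it raises an error other than json.JSONDecodeError. The sketch below uses only the standard library; read_json is a hypothetical helper that mirrors the error handling quoted above, not the function in this PR.

```python
import json
from pathlib import Path
from unittest.mock import patch


def read_json(file_path: Path) -> dict:
    """Hypothetical reader mirroring the error handling quoted above."""
    try:
        with open(file_path) as f:
            return json.load(f)
    except json.JSONDecodeError as e:
        raise ValueError(
            f"Error decoding JSON data from file {file_path}: {e}"
        ) from e
    except Exception as e:  # other file reading errors, e.g. permissions
        raise ValueError(f"Could not read file {file_path}: {e}") from e


def test_read_json_wraps_os_errors() -> None:
    """Force ``open`` to fail so the generic ``except`` branch runs."""
    with patch("builtins.open", side_effect=PermissionError("denied")):
        try:
            read_json(Path("annotations.json"))
        except ValueError as e:
            assert "Could not read file" in str(e)
            assert isinstance(e.__cause__, PermissionError)
        else:
            raise AssertionError("expected ValueError")
```

In a pytest suite the try/except/else would normally be replaced by pytest.raises(ValueError, match="Could not read file"); it is spelled out here only to keep the example free of third-party imports.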

@sfmig
Member

sfmig commented Apr 8, 2025

Also, do have a look at the docs building check which seems to be failing (alongside the code coverage checks)

@sfmig sfmig self-requested a review April 8, 2025 12:33
@rajatkriplani
Author

@sfmig I have done the mocking for some tests. Please have a look.

@rajatkriplani
Author

Hello @sfmig
I am assuming only the doc build is remaining. Please let me know how to make changes to the docs (and which doc).

@rajatkriplani
Author

Hello @sfmig just dropping a quick follow-up here since Zulip’s been a bit quiet — would love any further thoughts or feedback on this when you get a chance.

Member

@sfmig sfmig left a comment

Hi @rajatkriplani, thanks for having a go!

I think this is a good attempt, although we will likely need to iterate a bit. I left some in-line comments in the code, but also have some general comments:

  • In general, the PR feels a bit too verbose. If you can, I would try to keep the implementation as minimal as possible. For example, some of the error handling is dealt with further down the pipeline by the validators module. I would recommend having a look at the full annotation pipeline (i.e. the process of reading an annotation file as a dataframe), understanding it well, and then trying to keep only the essential bits here.
  • Could you have a look at the CI checks, and investigate why they are failing? I found these two likely culprits in the logs:
/home/runner/work/ethology/ethology/ethology/annotations/io/load_bboxes.py:docstring of ethology.annotations.io.load_bboxes.from_files:39: ERROR: Unexpected indentation. [docutils]
/home/runner/work/ethology/ethology/ethology/annotations/io/load_bboxes.py:docstring of ethology.annotations.io.load_bboxes.from_files:40: WARNING: Block quote ends without a blank line; unexpected unindent. [docutils]

Hope this helps!
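
Both docutils messages are typical of a docstring in which an indented block or list directly follows a paragraph without a blank line, or in which a list item's continuation line is not aligned with the item's text. The sketch below shows a layout that usually parses cleanly, assuming numpydoc-style sections. The function is an illustrative stub and the wording of the parameter description is not copied from the PR.

```python
def from_files(file_paths, images_dirs=None, format="auto"):
    """Read bounding box annotations from one or more files.

    Parameters
    ----------
    format : {"VIA", "COCO", "auto"}, optional
        Format of the input files. The options are:

        - "VIA" or "COCO": read every file in that format.
        - "auto": infer the format from the file contents. Continuation
          lines of a list item are aligned with the item's text.

        The default is "auto".

    """
```

The blank lines before and after the bullet list are what separate it from the surrounding paragraphs; removing either one is a common way to trigger the "Unexpected indentation" and "unexpected unindent" messages.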

Comment on lines 54 to 75
if not file_path.is_file():
    raise FileNotFoundError(f"Annotation file not found: {file_path}")

try:
    with open(file_path) as f:
        # Load only enough to check keys, avoid loading huge files
        # if possible
        # For simplicity here, load the whole thing.
        # Optimization is possible if needed.
        data = json.load(f)
except json.JSONDecodeError as e:
    raise ValueError(
        f"Error decoding JSON data from file {file_path}: {e}"
    ) from e
except Exception as e:  # Catch other potential file reading errors
    raise ValueError(f"Could not read file {file_path}: {e}") from e

if not isinstance(data, dict):
    raise ValueError(
        f"Expected JSON root to be a dictionary, but got {type(data)} "
        f"in file {file_path}"
    )
Member

I think we could do away with these since later we run the validators when loading the data...

Would you mind having a look and checking if we can remove this?

Author

@rajatkriplani rajatkriplani Jun 10, 2025

Thanks for the detailed feedback! I've refactored _detect_format to be more minimal and rely on the downstream validators.


# --- Format Detection Logic ---
determined_format: Literal["VIA", "COCO"]
if format == "auto":
Member

In the multiple file case: could we instead infer the format from every file, and throw an error if there is no consensus between them?

Author

Added a new helper, _determine_format_from_paths, that calls _detect_format for each file in a list and raises a ValueError if the detected formats are inconsistent.

  file_paths: Path | str | list[Path | str],
- format: Literal["VIA", "COCO"],
  images_dirs: Path | str | list[Path | str] | None = None,
+ format: Literal["VIA", "COCO", "auto"] = "auto", # Changed default and
Member

Feel free to remove the comments that highlight the changes: they add a bit of clutter, and the changes are clear to the reviewer in the GitHub interface.

Author

Done


  def _from_multiple_files(
-     list_filepaths: list[Path | str], format: Literal["VIA", "COCO"]
+     list_filepaths: list[Path], format: Literal["VIA", "COCO"]
Member

The list_filepaths type annotation should be list[Path | str], right?



@pytest.fixture
def invalid_json_file(tmp_path: Path) -> Path:
Member

Would you mind having a look at the existing fixtures, and re-using them whenever possible (rather than creating new ones)?

@sfmig sfmig mentioned this pull request Sep 11, 2025

Development

Successfully merging this pull request may close these issues.

Automatically detect likely format of input annotation file

3 participants