Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: parallel file validation for a datapackage #1722

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

pierrecamilleri
Copy link
Collaborator

@pierrecamilleri pierrecamilleri commented Dec 13, 2024

Context

  • In validator.py, Resource was imported as a type, but then used as a class to retrieve a class method (Resource.from_descriptor).
  • The tests would not intercept this error, as they were performed on a schema with a foreign key, and the validation would fall back on synchronous validation in this case.
  • Importing Resource (not as a type) into validator.py leads to a circular import. A small workaround would have been to import Resource locally, but this was not very satisfactory.

Suggested changes and fixes

  • I repaired the tests, using datapackages with no foreign keys to test parallel validation. I checked that these tests would have failed because with the bug.
  • I moved the implementation of Validator.validate_resource and Validator.validate_package methods to the Resource.validate and Package.validate methods. This solves the circular import problem, and removes duplication (method signatures).
  • The tests have been dispatched as well.
  • I soft deprecated the Validator class, keeping the methods (forwarding input parameters) as they are part of the public API. No warning, only a deprecation message in the docstring, and a mention that there is no plan to remove the class in the future.

Note

I could not produce any significant performance gain in my tests, I wonder if there is some shared resources' lock that prevents the validation from effectively running the validation in parallel (but I have not spent much time on this yet).

Archive of explorations
  • Test that it works correctly
    • Works without error, however, I could not produce any significant performance gain in my tests.
  • Check why the tests did not intercept this error
    • The tests for parallel processing are run for a schema with a foreign key - which fall backs to single-core processing. So the validate_parallel function was actually never executed in the tests.
  • See if I can find a more elegant way to deal with the circular import. I actually don't see the benefit of having this validator.py class instead of dispatching the methods to Resource and DataPackage directly, but I may have missed something.

@pierrecamilleri pierrecamilleri marked this pull request as ready for review January 27, 2025 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding --parallel CLI flag breaks validation
1 participant