[datasets] Mocking small versions of datasets for unittests

As docTR grows, the number of supported datasets will increase. We cannot afford to add several minutes to the CI tests for every dataset that we add. So I suggest the following:
- add a pytest fixture in `tests/conftest.py` that creates the data files in a temporary folder and returns the path to it
- use this fixture in the dataset unittests instead of downloading a subsample or the full dataset
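
As a rough illustration of the idea, here is a minimal sketch of such a fixture. The names `build_mock_dataset` and `mock_dataset_root`, the folder layout and the label schema are all hypothetical and do not match the format of any specific docTR dataset. Each real dataset would need its own file structure reproduced. The images are encoded with the standard library only, so the sketch needs nothing beyond pytest.

```python
import json
import struct
import zlib
from pathlib import Path

import pytest


def _png_bytes(width: int, height: int) -> bytes:
    """Encode a blank white 8-bit grayscale PNG with the standard library only."""

    def chunk(tag: bytes, data: bytes) -> bytes:
        # A PNG chunk is: length, tag, data, CRC32 over tag + data
        body = tag + data
        return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))

    # Width, height, bit depth 8, color type 0 (grayscale), default remaining fields
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    # Each row starts with a filter byte (0 = none) followed by white pixels
    rows = b"".join(b"\x00" + b"\xff" * width for _ in range(height))
    return (
        b"\x89PNG\r\n\x1a\n"
        + chunk(b"IHDR", ihdr)
        + chunk(b"IDAT", zlib.compress(rows))
        + chunk(b"IEND", b"")
    )


def build_mock_dataset(root: Path, num_samples: int = 3) -> Path:
    """Write a tiny dataset (images + labels) under `root` and return `root`.

    The layout and label keys below are assumptions made for illustration.
    """
    img_dir = root / "images"
    img_dir.mkdir(parents=True, exist_ok=True)
    labels = {}
    for idx in range(num_samples):
        name = f"sample_{idx}.png"
        (img_dir / name).write_bytes(_png_bytes(32, 32))
        # Hypothetical annotation: one box (xmin, ymin, xmax, ymax) and its text
        labels[name] = {"boxes": [[2, 2, 20, 10]], "words": ["hello"]}
    (root / "labels.json").write_text(json.dumps(labels))
    return root


@pytest.fixture(scope="session")
def mock_dataset_root(tmp_path_factory):
    # Built once per test session inside a pytest-managed temporary folder
    return build_mock_dataset(tmp_path_factory.mktemp("mock_dataset"))
```

A dataset unittest would then take `mock_dataset_root` as an argument and point the dataset class at that folder instead of triggering a download. Keeping the file creation in a plain helper function, separate from the fixture, also makes the expected file structure easy to read.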

The only drawback I can see is the implementation time. In return, those unittests will no longer need internet access, the CI will be considerably faster, and any developer will be able to read the structure of the dataset files directly in the unittest.

If we move forward with this, we'll have to do PRs for the following datasets:
- [x] CORD #722
- [x] DocArtefacts #719
- [x] FUNSD #722
- [x] IC03 #722
- [x] IC13 #662
- [x] IIIT5K #722
- [x] SROIE #722
- [x] SVHN #634
- [x] SVT #722
- [x] SynthText #722
- [x] DetectionDataset
- [x] RecognitionDataset
- [x] OCRDataset

[datasets] Mocking small versions of datasets for unittests #680
