Closed
Description
As docTR grows, the number of supported datasets will increase. We cannot afford to add several minutes to the CI tests for every dataset that we add. So I suggest the following:
- adding a pytest fixture in tests/conftest.py that will create the data files in a temporary folder and return the path to it
- using this for dataset unit tests instead of downloading the subsamples or the full dataset
The only drawback I can see is the implementation time. The advantages are that we won't need internet access to run those unit tests anymore, the CI will be considerably faster, and any developer will be able to read the structure of the dataset files directly in the unit test.
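To make the proposal concrete, here is a minimal sketch of what such a fixture could look like. The layout (an images folder plus a labels.json mapping file names to words), the helper `write_mock_recognition_set` and the fixture name `mock_recognition_set` are assumptions made for illustration, not the actual format of any dataset listed below. Each real dataset would need its own files written in its own structure.

```python
import json
import struct
import zlib
from pathlib import Path

import pytest


def _write_png(path: Path, width: int, height: int) -> None:
    """Write a small, valid, all-black RGB PNG using only the standard library."""

    def chunk(tag: bytes, data: bytes) -> bytes:
        crc = zlib.crc32(tag + data) & 0xFFFFFFFF
        return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)

    # Each scanline: one filter byte (0) followed by `width` black RGB pixels
    raw = b"".join(b"\x00" + b"\x00\x00\x00" * width for _ in range(height))
    # 8-bit depth, color type 2 (RGB), default compression/filter/interlace
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    path.write_bytes(
        b"\x89PNG\r\n\x1a\n"
        + chunk(b"IHDR", ihdr)
        + chunk(b"IDAT", zlib.compress(raw))
        + chunk(b"IEND", b"")
    )


def write_mock_recognition_set(root: Path, num_samples: int = 3) -> Path:
    """Create a tiny fake dataset under `root` (hypothetical layout)."""
    img_dir = root / "images"
    img_dir.mkdir(parents=True, exist_ok=True)
    labels = {}
    for idx in range(num_samples):
        name = f"img_{idx}.png"
        _write_png(img_dir / name, width=32, height=16)
        labels[name] = f"word{idx}"
    (root / "labels.json").write_text(json.dumps(labels))
    return root


@pytest.fixture(scope="session")
def mock_recognition_set(tmp_path_factory):
    # Built once per test session in a temporary folder; no download needed
    return write_mock_recognition_set(tmp_path_factory.mktemp("recognition"))


def test_mock_recognition_set(mock_recognition_set):
    labels = json.loads((mock_recognition_set / "labels.json").read_text())
    images = sorted((mock_recognition_set / "images").iterdir())
    assert len(images) == len(labels) == 3
```

Keeping the file-writing logic in a plain helper and wrapping it in the fixture means the expected dataset structure stays readable in one place, and the same helper can be reused for datasets that share a layout.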
If we move forward with this, we'll have to do PRs for the following datasets:
- CORD: Mock Sroie / Funsd / Cord / Synthtext / DocArtefacts / IIIT5K / SVT / IC03 (all ^^) #722
- DocArtefacts: Mock Doc-Artefacts dataset #719
- FUNSD: Mock Sroie / Funsd / Cord / Synthtext / DocArtefacts / IIIT5K / SVT / IC03 (all ^^) #722
- IC03: Mock Sroie / Funsd / Cord / Synthtext / DocArtefacts / IIIT5K / SVT / IC03 (all ^^) #722
- IC13: ICDAR2013 dataset integration #662
- IIIT5K: Mock Sroie / Funsd / Cord / Synthtext / DocArtefacts / IIIT5K / SVT / IC03 (all ^^) #722
- SROIE: Mock Sroie / Funsd / Cord / Synthtext / DocArtefacts / IIIT5K / SVT / IC03 (all ^^) #722
- SVHN: SVHN dataset integration #634
- SVT: Mock Sroie / Funsd / Cord / Synthtext / DocArtefacts / IIIT5K / SVT / IC03 (all ^^) #722
- SynthText: Mock Sroie / Funsd / Cord / Synthtext / DocArtefacts / IIIT5K / SVT / IC03 (all ^^) #722
- DetectionDataset
- RecognitionDataset
- OCRDataset