add coco-text as a test/train set #1131

Thomas-MMJ · 2023-01-23T21:28:39Z

🚀 The feature

You might consider adding COCO-text as one of the supported datasets:

https://vision.cornell.edu/se3/coco-text-2/#download

Motivation, pitch

It is another high-quality dataset, with text on objects at various angles (sides of vehicles, signs, etc.).

Alternatives

No response

Additional context

No response

felixdittrich92 · 2023-04-20T19:17:37Z

Hey @Thomas-MMJ 👋 ,

Thanks for the request. Would you like to add it yourself?
If so, I'm happy to guide you if you need any help :)

dvando · 2024-02-22T08:57:22Z

Hi @felixdittrich92 , has anybody worked on this? I'd love to hop into the project and contribute to this issue. :)

felixdittrich92 · 2024-02-22T09:08:07Z

Hey @dvando 👋,

No, it's still open.
Sure, feel free to work on it. If you have any questions or need some help, contact me :)

dvando · 2024-04-17T09:47:07Z

Hi @felixdittrich92, my apologies, it took me a while to actually work on it; I've been dealing with some issues at work.

I've got some questions about the download URLs. COCO-text has 2 separate URLs: the first one is for the images and the second is for the labels. But VisionDataset only accepts 1 URL, which I believe should point to a single archive containing both the images and their labels.

I also checked the other datasets (funsd, cord, synthtext, etc.), and all of them initialize VisionDataset with 1 URL only. I was thinking about merging the files myself, but I'm not sure that's the right thing to do. (Changing the base class should not be an option, I believe.)

Sorry, and thanks in advance. :)

felixdittrich92 · 2024-04-17T10:30:06Z

Hi @dvando 😄
No stress ^^

Option 1: You could take a look at https://github.com/mindee/doctr/blob/main/doctr/datasets/imgur5k.py (here the user needs to provide the paths to the data and we only provide the loader).
Option 2: What's the dataset size in MB/GB? What's the license? If neither is a problem, we could combine the dataset and upload it :)
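
To illustrate Option 1, here is a minimal sketch of a loader that only reads data the user has already downloaded. The class name `UserProvidedDataset` and the parameter names `img_folder` and `label_path` are hypothetical and chosen for illustration; this is not the doctr API, and the label file is simply assumed to be JSON.

```python
import json
import os


class UserProvidedDataset:
    """Hypothetical loader: the user downloads the data, we only read it."""

    def __init__(self, img_folder: str, label_path: str) -> None:
        # Fail early with a clear message when one of the paths is wrong
        if not os.path.isdir(img_folder):
            raise FileNotFoundError(f"unable to find image folder: {img_folder}")
        if not os.path.isfile(label_path):
            raise FileNotFoundError(f"unable to find label file: {label_path}")

        self.img_folder = img_folder
        # Assumption: the labels are stored as a single JSON document
        with open(label_path, encoding="utf-8") as f:
            self.labels = json.load(f)
```

With this approach nothing is downloaded or re-hosted by the library, so a dataset split across two URLs is not a problem.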

dvando · 2024-04-17T13:06:14Z

So with option 1, the user downloads the images and the labels themselves? That sounds okay.
The dataset is ~13 GB in size and has a CC BY 4.0 license.

Both sound fine to me. Which one do you prefer, @felixdittrich92? :)

felixdittrich92 · 2024-04-17T13:07:56Z

Option 1 👍

felixdittrich92 · 2024-04-17T13:09:34Z

As reference PR: #1359 :)

sarjil77 · 2025-01-19T19:00:12Z

@felixdittrich92,
I would like to work on this and contribute here. Could you please guide me?

Thanks in advance.

felixdittrich92 · 2025-01-20T06:56:30Z

Sure @sarjil77 :)

First download the dataset: the annotations from https://bgshih.github.io/cocotext/ and the images from http://images.cocodataset.org/zips/train2014.zip

The following PR can be used as reference: https://github.com/mindee/doctr/pull/1359/files

In doctr/datasets, create a new Python file coco_text.py that contains the conversion logic (ref.: https://github.com/mindee/doctr/blob/main/doctr/datasets/wildreceipt.py).
With detection_task=True it returns only the converted detection annotations. As the name says, with use_polygons=True it returns 4-point polygon coordinates, otherwise boxes. With recognition_task=True it crops all the polygons/boxes on the fly to create a recognition dataset. If detection_task=False and recognition_task=False, it returns the OCR E2E dataset containing the polygon/box annotations and the corresponding labels.
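
The geometry part of that conversion could be sketched roughly as follows. This is an illustration only, not the doctr implementation: the helper names `convert_bbox` and `build_target` are hypothetical, and the annotation keys `bbox` (as `[x, y, width, height]`) and `utf8_string` are assumptions about the COCO-Text annotation layout. The recognition_task cropping needs the image data and is left out.

```python
from typing import Any, Dict, List, Union


def convert_bbox(bbox: List[float], use_polygons: bool) -> List[Any]:
    """Convert an [x, y, width, height] box (assumed COCO convention)."""
    x, y, w, h = bbox
    if use_polygons:
        # 4 corner points: top-left, top-right, bottom-right, bottom-left
        return [[x, y], [x + w, y], [x + w, y + h], [x, y + h]]
    # Straight box as [xmin, ymin, xmax, ymax]
    return [x, y, x + w, y + h]


def build_target(
    annotations: List[Dict[str, Any]],
    use_polygons: bool = False,
    detection_task: bool = False,
) -> Union[List[Any], Dict[str, Any]]:
    """Build the per-image target for the detection or the E2E case."""
    geoms = [convert_bbox(ann["bbox"], use_polygons) for ann in annotations]
    if detection_task:
        # Detection only needs the geometries
        return geoms
    # E2E case: geometries plus the corresponding text labels
    return {"boxes": geoms, "labels": [ann["utf8_string"] for ann in annotations]}
```

For example, `convert_bbox([10, 20, 30, 40], use_polygons=False)` gives `[10, 20, 40, 60]`.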

In tests/conftest.py we need to create a mock of the original annotations for testing purposes, for example:

doctr/tests/conftest.py, line 675 in ebfc9f3:

def mock_wildreceipt_dataset(tmpdir_factory, mock_image_stream):

This fixture is used in doctr/tests/pytorch/test_datasets_pt.py, line 741 in ebfc9f3:

def test_wildreceipt_dataset(input_size, num_samples, rotate, recognition, detection, mock_wildreceipt_dataset):

and in doctr/tests/tensorflow/test_datasets_tf.py, line 714 in ebfc9f3:

def test_wildreceipt_dataset(input_size, num_samples, rotate, recognition, detection, mock_wildreceipt_dataset):

You can place the coco_text tests directly after the wildreceipt test cases :)
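
As a rough idea of what such a mock could look like, here is a plain helper that writes a tiny annotation file and placeholder images; a fixture in tests/conftest.py could call it with a temporary directory. Everything here is an assumption for illustration: the helper name `make_mock_coco_text`, the file name `cocotext.v2.json`, the folder name `train2014`, and the JSON keys (`imgs`, `anns`, `imgToAnns`, `bbox`, `utf8_string`) should be checked against the real annotation file before use.

```python
import json
from pathlib import Path
from typing import Tuple


def make_mock_coco_text(root: Path, image_bytes: bytes, num_images: int = 3) -> Tuple[Path, Path]:
    """Write a tiny COCO-Text-like annotation file plus placeholder images."""
    img_dir = root / "train2014"
    img_dir.mkdir(parents=True, exist_ok=True)

    labels = {"imgs": {}, "anns": {}, "imgToAnns": {}}
    for idx in range(num_images):
        file_name = f"mock_image_{idx}.jpg"
        # The caller supplies the image content (e.g. a mocked image stream)
        (img_dir / file_name).write_bytes(image_bytes)
        labels["imgs"][str(idx)] = {"id": idx, "file_name": file_name, "set": "train"}
        labels["anns"][str(idx)] = {
            "id": idx,
            "image_id": idx,
            "bbox": [10.0, 20.0, 100.0, 30.0],  # assumed [x, y, width, height]
            "utf8_string": "mock",
        }
        labels["imgToAnns"][str(idx)] = [idx]

    label_path = root / "cocotext.v2.json"
    label_path.write_text(json.dumps(labels), encoding="utf-8")
    return img_dir, label_path
```

The returned image folder and label path can then be passed to the dataset under test.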

As a last step, we add the documentation entries.
See the first two modified files here: https://github.com/mindee/doctr/pull/1359/files

If you need any further information feel free to ask 👍

sarjil77 · 2025-01-20T07:03:46Z

Thanks @felixdittrich92, will further look into it.

Thomas-MMJ added the type: enhancement Improvement label Jan 23, 2023

felixdittrich92 added module: datasets Related to doctr.datasets awaiting response Waiting for feedback labels Jul 24, 2023

felixdittrich92 added this to the 2.0.0 milestone Feb 9, 2024

felixdittrich92 added the good first issue Good for newcomers label Feb 9, 2024

felixdittrich92 assigned dvando Apr 17, 2024

felixdittrich92 unassigned dvando Jan 17, 2025

sarjil77 mentioned this issue Jan 20, 2025

Adding Gujarati Language support #1845

Merged
