add coco-text as a test/train set #1131

Open
Thomas-MMJ opened this issue Jan 23, 2023 · 11 comments
Labels: awaiting response (Waiting for feedback), good first issue (Good for newcomers), module: datasets (Related to doctr.datasets), type: enhancement (Improvement)
@Thomas-MMJ

🚀 The feature

You might consider adding COCO-Text as one of the supported datasets:

https://vision.cornell.edu/se3/coco-text-2/#download

Motivation, pitch

It is another high-quality dataset, with text on objects at various angles (sides of vehicles, signs, etc.).

Alternatives

No response

Additional context

No response

@Thomas-MMJ added the "type: enhancement" label Jan 23, 2023
@felixdittrich92
Contributor

Hey @Thomas-MMJ 👋 ,

Thanks for the request. Would you like to add it yourself?
If so, I'm happy to guide you if you need any help :)

@felixdittrich92 added the "module: datasets" and "awaiting response" labels Jul 24, 2023
@felixdittrich92 added this to the 2.0.0 milestone Feb 9, 2024
@felixdittrich92 added the "good first issue" label Feb 9, 2024
@dvando

dvando commented Feb 22, 2024

Hi @felixdittrich92 , has anybody worked on this? I'd love to hop into the project and contribute to this issue. :)

@felixdittrich92
Contributor

Hey @dvando 👋,

No, it's still open.
Sure, feel free to work on it. If you have any questions or need some help, contact me :)

@dvando

dvando commented Apr 17, 2024

Hi @felixdittrich92 , my apologies, it took me a while to actually work on it; I've been dealing with some issues at work.

I've got some questions about the download URLs. COCO-Text has 2 separate URLs: the first one is for the images, and the second is for the labels. But VisionDataset only accepts 1 URL, which I believe has to point to a single archive containing both the images and their labels.

I also checked the other datasets (funsd, cord, synthtext, etc.), and all of them initialize VisionDataset with 1 URL only. I was thinking about merging the files myself, but I'm not sure that's the right thing to do. (Changing the base class should not be an option, I believe.)

Sorry, and thanks in advance. :)

@felixdittrich92
Contributor

Hi @dvando 😄
No stress ^^

Option 1: You could take a look at https://github.com/mindee/doctr/blob/main/doctr/datasets/imgur5k.py (here the user needs to provide the paths to the data and we provide only the loader)
Option 2: What's the dataset size in MB / GB? What's the license? If neither is a problem, we could combine the dataset and upload it :)
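
For illustration, Option 1 (loader only, the user supplies the paths) could be sketched roughly as below. This is not doctr's actual code: the class name `UserProvidedDataset` and the flat `{file name: annotation}` label layout are hypothetical, and a real implementation would presumably follow imgur5k.py and doctr's own base classes instead of a standalone class.

```python
import json
import os
from typing import Any, Dict, List, Tuple


class UserProvidedDataset:
    """Hypothetical loader-only dataset: the caller passes already-downloaded paths."""

    def __init__(self, img_folder: str, label_path: str) -> None:
        # Nothing is downloaded here, so fail early with a clear message.
        if not os.path.isdir(img_folder):
            raise FileNotFoundError(f"unable to find image folder: {img_folder}")
        if not os.path.isfile(label_path):
            raise FileNotFoundError(f"unable to find label file: {label_path}")

        # Assumed layout (illustration only): {"<file name>": <annotation>, ...}
        with open(label_path, encoding="utf-8") as f:
            labels: Dict[str, Any] = json.load(f)

        # Store (image path, raw annotation) pairs; conversion would happen later.
        self.data: List[Tuple[str, Any]] = [
            (os.path.join(img_folder, name), target)
            for name, target in sorted(labels.items())
        ]

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: int) -> Tuple[str, Any]:
        return self.data[index]
```

The point of this design is that the two separate downloads (images and labels) never have to be merged into one archive: the constructor only validates and indexes what the user already has on disk.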

@dvando

dvando commented Apr 17, 2024

So with option 1, the user should download the images and the labels by themselves? That sounds okay.
The dataset is ~13 GB in size and has a CC BY 4.0 license.

Both sound fine to me, which one do you prefer @felixdittrich92 ? :)

@felixdittrich92
Contributor

> So with option 1, the user should download the images and the labels by themselves? That sounds okay. The dataset is ~13 GB in size and has a CC BY 4.0 license.
>
> Both sound fine to me, which one do you prefer @felixdittrich92 ? :)

Option 1 👍

@felixdittrich92
Contributor

As a reference, see PR #1359 :)

@sarjil77
Contributor

@felixdittrich92,
I would like to work on this and contribute here. Please guide me.

thanks in advance.

@felixdittrich92
Contributor

Sure @sarjil77 :)

First, download the dataset: the annotations from https://bgshih.github.io/cocotext/ and the images from http://images.cocodataset.org/zips/train2014.zip
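
If you prefer to script the image download, a minimal sketch could look like this. The helper name `fetch_and_extract` and the target directory are my own placeholders; only the `train2014.zip` URL comes from the comment above. The annotations page is not a direct file link, so that file still has to be fetched by hand. Keep in mind that the archive is large (the whole dataset was quoted at ~13 GB earlier in this thread).

```python
import urllib.request
import zipfile
from pathlib import Path

# Image archive URL as given in the comment above.
TRAIN2014_URL = "http://images.cocodataset.org/zips/train2014.zip"


def fetch_and_extract(url: str, dest_dir: str) -> Path:
    """Download a zip archive into dest_dir (if not already there) and extract it."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Name the local archive after the last URL segment, e.g. train2014.zip.
    archive = dest / url.rsplit("/", 1)[-1]
    if not archive.exists():
        urllib.request.urlretrieve(url, str(archive))

    # Extract next to the archive; existing files are overwritten.
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return dest


if __name__ == "__main__":
    # "data/coco_text" is an arbitrary example location.
    fetch_and_extract(TRAIN2014_URL, "data/coco_text")
```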

The following PR can be used as reference: https://github.com/mindee/doctr/pull/1359/files

In doctr/datasets, create a new Python file coco_text.py that contains the conversion logic (reference: https://github.com/mindee/doctr/blob/main/doctr/datasets/wildreceipt.py).
With detection_task=True, it returns only the converted detection annotations. As the name says, use_polygons=True returns 4-point polygon coordinates, otherwise boxes. With recognition_task=True, it crops all the polygons/boxes on the fly to create a recognition dataset. If both detection_task=False and recognition_task=False, it returns the end-to-end (E2E) OCR dataset containing the polygon/box annotations and the corresponding labels.
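
To make the three modes concrete, here is a rough sketch of the conversion for a single image. It is only an illustration under assumptions: I assume each COCO-Text annotation carries a COCO-style `bbox` of `[x, y, width, height]` and the transcription under `utf8_string` (check this against the actual annotation file), the function names are hypothetical, and plain lists stand in for whatever array types the real loader returns. Raising an error when both flags are set is also my assumption.

```python
from typing import Any, Dict, List, Tuple, Union

Box = List[float]             # [xmin, ymin, xmax, ymax]
Polygon = List[List[float]]   # four [x, y] corners, clockwise from top-left
Geometry = Union[Box, Polygon]


def xywh_to_geometry(bbox: List[float], use_polygons: bool) -> Geometry:
    """Convert an [x, y, width, height] box to a 4-point polygon or a 2-corner box."""
    x, y, w, h = bbox
    if use_polygons:
        return [[x, y], [x + w, y], [x + w, y + h], [x, y + h]]
    return [x, y, x + w, y + h]


def build_target(
    anns: List[Dict[str, Any]],
    use_polygons: bool = False,
    detection_task: bool = False,
    recognition_task: bool = False,
) -> Union[List[Geometry], List[Tuple[Geometry, str]], Dict[str, Any]]:
    """Build the target of one image for the requested task."""
    if detection_task and recognition_task:
        # Assumption: the two modes are mutually exclusive.
        raise ValueError("detection_task and recognition_task cannot both be True")

    geoms = [xywh_to_geometry(a["bbox"], use_polygons) for a in anns]
    texts = [a["utf8_string"] for a in anns]

    if detection_task:
        # Detection: geometries only.
        return geoms
    if recognition_task:
        # Recognition: a real loader would crop the image with each geometry;
        # this sketch only pairs the crop region with its transcription.
        return list(zip(geoms, texts))
    # End-to-end OCR: geometries plus their labels.
    return {"boxes": geoms, "labels": texts}
```

For example, `build_target(anns, use_polygons=True, detection_task=True)` returns one list of four corner points per annotation, while leaving both task flags at False returns a dict with the geometries and the matching labels.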

In tests/conftest.py we need to create a mock of the original annotations for testing purposes for example:

def mock_wildreceipt_dataset(tmpdir_factory, mock_image_stream):

This fixture is used in

def test_wildreceipt_dataset(input_size, num_samples, rotate, recognition, detection, mock_wildreceipt_dataset):

and
def test_wildreceipt_dataset(input_size, num_samples, rotate, recognition, detection, mock_wildreceipt_dataset):

You can place the coco_text tests directly after the wildreceipt test cases :)
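
As a starting point for the mock, something along these lines might work. Everything here is an assumption to be checked against the real data: the helper name `write_mock_coco_text`, the file names, and the JSON layout (top-level `imgs`, `anns` and `imgToAnns` keys with `bbox` and `utf8_string` per annotation) are my guess at a minimal COCO-Text-like file, not a verified schema. A fixture in tests/conftest.py could call this helper with a directory from `tmpdir_factory` and the bytes provided by `mock_image_stream`.

```python
import json
from pathlib import Path
from typing import Tuple


def write_mock_coco_text(root: str, image_bytes: bytes, num_images: int = 3) -> Tuple[str, str]:
    """Write a tiny COCO-Text-like image folder and label file; return both paths."""
    img_dir = Path(root) / "train2014"
    img_dir.mkdir(parents=True, exist_ok=True)

    imgs, anns, img_to_anns = {}, {}, {}
    for idx in range(num_images):
        file_name = f"mock_image_{idx}.jpg"
        # Every mock image reuses the same bytes.
        (img_dir / file_name).write_bytes(image_bytes)

        imgs[str(idx)] = {"id": idx, "file_name": file_name, "width": 640, "height": 480}
        ann_id = 1000 + idx
        anns[str(ann_id)] = {
            "id": ann_id,
            "image_id": idx,
            "bbox": [10.0, 20.0, 100.0, 30.0],  # assumed [x, y, width, height]
            "utf8_string": f"word{idx}",
        }
        img_to_anns[str(idx)] = [ann_id]

    label_path = Path(root) / "cocotext_mock.json"
    label_path.write_text(
        json.dumps({"imgs": imgs, "anns": anns, "imgToAnns": img_to_anns}),
        encoding="utf-8",
    )
    return str(img_dir), str(label_path)
```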

As the last step, we add the documentation entries.
See the first two modified files here: https://github.com/mindee/doctr/pull/1359/files

If you need any further information, feel free to ask 👍

@sarjil77
Contributor

Thanks @felixdittrich92, I will look into it further.
