Require detailed explanation on a few points #423
-
Hello fg-mindee & charlesmindee,
-
Hi @vishveshtrivedi 👋 Glad to hear that the library is useful! Here are some answers to your questions:
- The --pretrained flag will start your training from the version we trained. You will need to format your dataset to the format mentioned in the README for it to work.
- Passing return_model_output=True as an argument of your…
Hope this helps, let me know if I misunderstood something :)
-
Thanks for the quick reply.
Thanks a lot!!
-
Correct!
Regarding the quantity of data, cf. my answer on point 6 ;)
For now, it's true that users cannot change the threshold for postprocessing. But once your model is instantiated, you can always do: model.postprocessor.box_thresh = your_new_threshold. Setting a lower value will keep more boxes.
It's the postprocessing from the paper; if you want to check the code: https://github.com/mindee/doctr/blob/main/doctr/models/detection/core.py#L85-L116
predictions are the postprocessed results (the boxes), while out_map is the logits tensor coming out of the model (a segmentation map of sorts) :) Let me know if that isn't clear!
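To make the threshold tweak above concrete, here is a minimal sketch, assuming the TensorFlow db_resnet50 detection model from the doctr version discussed in this thread (the threshold value and the dummy input are placeholders; attribute names and output keys may differ slightly between releases):

```python
import numpy as np
from doctr.models import detection

# Instantiate a pretrained DB-based detection model
model = detection.db_resnet50(pretrained=True)

# Loosen the postprocessing: a lower box_thresh keeps more candidate boxes
model.postprocessor.box_thresh = 0.05  # illustrative value, the default is higher

# Dummy batch: one 1024x1024 RGB page with pixel values in [0, 1]
pages = np.random.rand(1, 1024, 1024, 3).astype(np.float32)

# return_model_output=True also exposes the raw segmentation map ("out_map"),
# i.e. the logits tensor, alongside the postprocessed predictions (the boxes)
out = model(pages, return_model_output=True)
```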
-
Thanks a lot for the reply!!
-
Simply put:
If at some point we see that we need bigger architectures, we'll try them; for now we favour lighter models 👍
-
Thanks!!!!
-
Hi @fg-mindee,
-
Hi @vishveshtrivedi, The bin_thresh value is used to binarize the raw segmentation map: if you lower it, you will most likely detect more words, but the risk is to lose the space between words. It should probably lead to a higher recall and a lower precision for the detection task. The recognition task does not use bin_thresh, but the boxes detected in the detection task with bin_thresh are used to recognize words, so in a way it is related. If you have too high a bin_thresh you will end up with (too) large boxes, maybe with more than one word in each box. This can lead to a bad final recognition result, because we don't have spaces in our vocabularies and thus our models can only deal with one word per box.
Thank you and have a good day!
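As a rough illustration of the role of bin_thresh (a conceptual sketch only, not the library's implementation), the raw segmentation map is thresholded before boxes are fitted around the connected text regions:

```python
import numpy as np

# Dummy probability map, standing in for the detection model's output for one page
prob_map = np.random.rand(1024, 1024).astype(np.float32)

bin_thresh = 0.3  # illustrative value
binary_map = (prob_map >= bin_thresh).astype(np.uint8)

# Lowering bin_thresh marks more pixels as text, so neighbouring words are more
# likely to merge into one oversized box; since the vocabularies contain no
# space character, a box holding several words hurts the recognition step.
```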
-
Hi @charlesmindee, Thanks a lot!
-
For the first point, it seems quite logical to detect more words when you decrease the threshold, as explained above.
-
Hi @charlesmindee,
-
Hi, for the first case the result is logical, as you mentioned; for the second case it is quite weird. I must admit I can't really explain that: it is strange because the model did recognize the right digits but replaced the second slash and removed the first one. Which recognition model did you use?
-
Hi @charlesmindee,
-
Hi @vishveshtrivedi,
-
Hi @charlesmindee, Also, an interesting thing I noticed: when I changed the image DPI from 500 to 600 (we are converting the PDF to images), the recognition improved a lot in a few images. Is there some reason for the recognition being so sensitive to DPI? Finally, would it be recommended for the input image to be preprocessed in a certain way (binarized, greyscale, deblurred, etc.)? Thank you so much!
-
Hi @vishveshtrivedi, If you increase the DPI, you will have higher-resolution images from your PDF pages, and this can help the recognition model distinguish letters written in small fonts, or slightly blurred lines which can't be resolved at a lower resolution. However, 500 DPI is already a huge resolution (4134 x 5846 pixels for an A4 page). Are you feeding the model with a document from_pdf, or do you convert your PDF to images before creating your document object? We almost exclusively work with a DPI of 144, and it seems to be enough for A4 PDF pages. You don't need to binarize/greyscale/... or otherwise preprocess your images before feeding the model, it should work fine! Of course, if you work with particularly noisy or blurred documents, preprocessing the data should only improve your performance. Have a nice day!
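For reference, a hedged sketch of converting a PDF to images at a chosen DPI before building the document object (pdf2image is just one common tool for this conversion; the thread does not say which converter was actually used):

```python
from pdf2image import convert_from_path  # requires poppler to be installed

# Render each page of the PDF as a PIL image at 144 DPI
pages = convert_from_path("path/to/document.pdf", dpi=144)
print(len(pages), pages[0].size)  # number of pages, (width, height) of the first one
```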
-
Hi @charlesmindee, Thanks a lot!
-
OK, you can also use our PDF converter by instantiating a document with from_pdf(); it will use a 144 DPI rate for the conversion. Thanks, and don't hesitate to come back with new questions!
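A minimal sketch of that route, assuming the DocumentFile reader (the module path and return type have moved between releases: older versions expose it under doctr.documents and may need an extra .as_images() call, newer ones under doctr.io return numpy arrays directly):

```python
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Render the PDF pages with the built-in converter (144 DPI by default)
doc = DocumentFile.from_pdf("path/to/document.pdf")

# Run the end-to-end predictor on the rendered pages
predictor = ocr_predictor(pretrained=True)
result = predictor(doc)
```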
-
Hi @charlesmindee,
-
Hi @charlesmindee,
-
Hi @charlesmindee,
-
Hi @charlesmindee,