feat: integrate got-ocr2.0 as image reader #355

phv2312 · 2024-10-02T15:32:33Z

Description

Integrate got-ocr2.0 OCR as an image reader.
Add a new extension manager to make it easy to switch between the supported loaders.
Also, thanks to @cin-jimmy for his suggestion on GitHub stale (issue).
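
To illustrate the idea of switching loaders per file extension, here is a minimal sketch. It is not the code in `libs/kotaemon/kotaemon/indices/ingests/extensions.py`; the class name `ExtensionManager`, its methods, and the loader names `default` and `got-ocr2` are assumptions chosen for illustration only, and a loader is modeled as a plain callable.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Assumption: a loader is any callable that turns a file path into text.
Loader = Callable[[str], str]


@dataclass
class ExtensionManager:
    """Hypothetical registry: file extension -> named loaders, one active."""

    _loaders: Dict[str, Dict[str, Loader]] = field(default_factory=dict)
    _active: Dict[str, str] = field(default_factory=dict)

    def register(self, ext: str, name: str, loader: Loader) -> None:
        ext = ext.lower()
        self._loaders.setdefault(ext, {})[name] = loader
        # The first loader registered for an extension becomes its default.
        self._active.setdefault(ext, name)

    def set_loader(self, ext: str, name: str) -> None:
        ext = ext.lower()
        if name not in self._loaders.get(ext, {}):
            raise KeyError(f"no loader named {name!r} for {ext!r}")
        self._active[ext] = name

    def get_loader(self, ext: str) -> Loader:
        ext = ext.lower()
        return self._loaders[ext][self._active[ext]]


# Usage: register two stand-in loaders for images, then switch between them.
manager = ExtensionManager()
manager.register(".png", "default", lambda path: f"default:{path}")
manager.register(".png", "got-ocr2", lambda path: f"got-ocr2:{path}")
manager.set_loader(".png", "got-ocr2")
result = manager.get_loader(".png")(("page.png"))
```

Keeping the active choice in a single mapping means every settings entry point can read and write the same state, which is one way to avoid two settings panels disagreeing about the loader for an extension.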

Type of change

New features (non-breaking change).
Bug fix (non-breaking change).
Breaking change (fix or feature that would cause existing functionality not to work as expected).

Checklist

I have performed a self-review of my code.
I have added thorough tests if it is a core feature.
There is a reference to the original bug report and related work.
I have commented on my code, particularly in hard-to-understand areas.
The feature is well documented.

cin-niko · 2024-10-04T08:56:57Z

@phv2312, can you add a docker-compose file that lets users choose the Docker image for the OCR service? I think it will help people test more easily.

libs/kotaemon/kotaemon/indices/ingests/extensions.py

integration/got-ocr2.md

phv2312 · 2024-10-26T04:15:16Z

Hi @taprosoft @cin-niko. Sorry for the lack of updates for so long. Could you please review this PR again?

docker-compose.dev.yml

docker-compose.yml

integration/got-ocr2.md

libs/kotaemon/kotaemon/indices/ingests/extensions.py

phv2312 · 2024-12-15T11:39:02Z

Hi @cin-niko and @taprosoft. I have updated the PR according to niko's comments and rebased onto the latest master.
Could you check this PR again?

cin-niko · 2024-12-16T05:59:18Z

@phv2312 Overall it looks good, but setting the loader through the extensions feature doesn't seem to work.
For example:

Set pdf loader in Settings -> Retrieval Settings -> File loader: Works
Set pdf loader in Settings -> Loader settings -> Loader .pdf: Doesn't work

taprosoft · 2024-12-16T06:18:34Z

@phv2312 Sorry for the late comment. Overall the logic is fine, but the current settings UI is a bit cluttered. I will push a small change to improve this before merging.

phv2312 changed the title ~~Feat/integrate 3rd~~ feat: integrate 3rd Oct 2, 2024

phv2312 changed the title ~~feat: integrate 3rd~~ feat: integrate got-ocr2.0 Oct 2, 2024

phv2312 changed the title ~~feat: integrate got-ocr2.0~~ feat: integrate got-ocr2.0 as image reader Oct 2, 2024

phv2312 requested a review from taprosoft October 2, 2024 15:43

phv2312 mentioned this pull request Oct 3, 2024

[BUG] - OCR is not running #307

Closed

taprosoft reviewed Oct 5, 2024

libs/kotaemon/kotaemon/indices/ingests/extensions.py (outdated, resolved)

ngduyanhece reviewed Oct 24, 2024

libs/kotaemon/kotaemon/indices/ingests/extensions.py (outdated, resolved)

ngduyanhece reviewed Oct 24, 2024

integration/got-ocr2.md (resolved)

phv2312 force-pushed the feat/integrate_3rd branch from 25904f0 to de703a3 Compare October 25, 2024 07:55

Cinnamon deleted a comment from ngduyanhece Oct 25, 2024

phv2312 requested a review from taprosoft October 27, 2024 05:27

cin-niko reviewed Oct 28, 2024

phv2312 added 13 commits December 15, 2024 14:03

feat: update stale action

bb0f46e

feat: comfort gocr2 format

93ec972

fix: resolve conflicts

4b7e9ca

feat: resolve conflicts

84e5683

feat: update state for extension manager while load setting

8157733

fix: resolve conflicts

53c7203

feat: comfort pre-commit

e6b4439

feat: update ocr loader

02f8a81

feat: change default loader for image

fca8a10

feat: update guideline

d9b9bc1

feat: update github stales & remove unncessary files

dfc416e

feat: update extentions for pdf

3f5d1fb

feat: introduce docker-compose

a358fd2

phv2312 force-pushed the feat/integrate_3rd branch from a488e31 to a358fd2 Compare December 15, 2024 07:05

phv2312 added 2 commits December 15, 2024 14:09

chore: refactor by pre-commit

609a1f0

feat: update docling reader into extension manager

ae56308

feat: bring extension manager to kotaemon

9bbd8a1

refactor: move exteions manager to ktem

038dabb

fix: remove duplicate loader setting in retrieval settings

cb4fabc
