localdocs: avoid cases where batch can make no progress #3094

cebtenzzre · 2024-10-15T15:56:25Z

It has been brought to my attention that some users are still experiencing hangs in LocalDocs indexing after v3.4.1 that they did not see in v3.3.x or earlier. This is likely a result of #2986. Although the duration timer itself has been in place since #2396 without much issue, we previously would process at least one page of a PDF even if beginning the transaction took just under 100ms. Now we may not even process a single word.

I've now realized that the QPdfDocument we use to get metadata may also contribute to this, which this PR doesn't entirely fix: we may still only get a single word per iteration.

TODO: This should be changed such that we do not need to open the PDF document every iteration. Done.
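
To illustrate the failure mode and the guarantee this PR aims for, here is a minimal sketch of a time-budgeted batch loop in standard C++. It is not the code in database.cpp: `runBatch`, `setup`, `processOne`, and the integer work queue are hypothetical names chosen for illustration. The sketch starts the timer only after the setup step (standing in for beginning the transaction) and checks the deadline only after one item has been handled, so each batch makes some progress even when the budget is already spent.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <functional>

// Hypothetical stand-in for the real scan loop; all names are illustrative.
using Clock = std::chrono::steady_clock;

// Processes queued items until the time budget is spent, but always handles
// at least one item so a slow setup step cannot starve the batch.
inline std::size_t runBatch(std::deque<int> &queue,
                            std::chrono::milliseconds budget,
                            const std::function<void()> &setup,
                            const std::function<void(int)> &processOne)
{
    assert(budget.count() >= 0);

    // Models beginning the transaction. The timer starts afterwards, so this
    // cost is not counted against the budget.
    setup();
    const auto deadline = Clock::now() + budget;

    std::size_t processed = 0;
    while (!queue.empty()) {
        // The deadline is only consulted once one item has been processed.
        if (processed > 0 && Clock::now() >= deadline)
            break;
        const int item = queue.front();
        queue.pop_front();
        processOne(item);
        ++processed;
    }
    return processed;
}
```

With a zero budget, a call still consumes exactly one item from the queue, which is the property that keeps a batch from making no progress.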

Checklist

This PR has been thoroughly tested.


cebtenzzre · 2024-10-15T21:00:58Z

I was able to confirm that for large PDFs with many high-resolution images, the open for reading metadata can reliably take more than 100ms, which before this PR would prevent any words from being read.

Before (current main):

open duration avg 994 ms over 10 opens
avg 0 pages
open duration avg 988 ms over 10 opens
avg 0 pages
open duration avg 988 ms over 10 opens
avg 0 pages

Before the docx change, we would always read at least one page, even if it meant exceeding 100ms.

After (this PR):

open duration n/a over 1 opens
avg 76 pages
open duration n/a over 9 opens
avg 14 pages
open duration n/a over 3 opens
avg 61 pages
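
The commit message for "localdocs: handle metadata in document reader" notes that recording the metadata once avoids reopening the PDF on every pass through scanQueue. Below is a minimal sketch of that idea, assuming a hypothetical `DocumentReader` that caches a hypothetical `DocMetadata` record. The real reader and its QPdfDocument calls are not shown; the loader callback is an assumption standing in for the expensive open.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical metadata record; the field names are assumptions.
struct DocMetadata {
    std::string title;
    std::string author;
    int pageCount = 0;
};

// Hypothetical reader that performs the expensive open at most once and
// reuses the recorded metadata on later calls.
class DocumentReader {
public:
    explicit DocumentReader(std::function<DocMetadata()> openAndRead)
        : m_openAndRead(std::move(openAndRead))
    {
        assert(m_openAndRead);
    }

    // Returns the metadata, invoking the loader only on the first call.
    const DocMetadata &metadata()
    {
        if (!m_metadata)
            m_metadata = m_openAndRead();
        return *m_metadata;
    }

private:
    std::function<DocMetadata()> m_openAndRead;  // stands in for opening the PDF
    std::optional<DocMetadata> m_metadata;       // empty until first requested
};
```

Keeping the reader alive across batches means a later iteration pays only for reading words, not for another open of a large file.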

manyoso

one minor nit

gpt4all-chat/src/database.cpp


cebtenzzre added 5 commits October 15, 2024 15:31

localdocs: do not count transaction start time against the batch timer

70403a9

Signed-off-by: Jared Van Bortel <[email protected]>

localdocs: explicitly process at least one document per batch

0ca3317

Signed-off-by: Jared Van Bortel <[email protected]>

localdocs: explicitly process at least one word

47e23f4

Signed-off-by: Jared Van Bortel <[email protected]>

localdocs: handle metadata in document reader

296b29d

Recording this metadata once avoids the need to open the PDF document every time we enter scanQueue. Signed-off-by: Jared Van Bortel <[email protected]>

changelog: add this PR

61dc351

Signed-off-by: Jared Van Bortel <[email protected]>

cebtenzzre marked this pull request as ready for review October 15, 2024 19:33

cebtenzzre force-pushed the ensure-localdocs-progress branch from 78a3cfb to 61dc351 October 15, 2024 19:33

cebtenzzre requested a review from manyoso October 15, 2024 21:03

manyoso requested changes Oct 16, 2024


gpt4all-chat/src/database.cpp (outdated)

localdocs: remove magic number from comment

519a915

Signed-off-by: Jared Van Bortel <[email protected]>

cebtenzzre requested a review from manyoso October 16, 2024 15:00

manyoso approved these changes Oct 16, 2024


cebtenzzre merged commit 735dd82 into main Oct 16, 2024
4 of 10 checks passed

cebtenzzre deleted the ensure-localdocs-progress branch October 16, 2024 15:11

cebtenzzre added a commit that referenced this pull request Oct 16, 2024

localdocs: avoid cases where batch can make no progress (#3094)

36a3826

Signed-off-by: Jared Van Bortel <[email protected]>
