Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

localdocs: avoid cases where batch can make no progress #3094

Merged
merged 6 commits into from
Oct 16, 2024

Conversation

cebtenzzre
Copy link
Member

@cebtenzzre cebtenzzre commented Oct 15, 2024

It has been brought to my attention that some users are still experiencing hangs in LocalDocs indexing after v3.4.1 that they did not see in v3.3.x or earlier. This is likely a result of #2986. Although the duration timer itself has been in #2396 without much issue, before we would at least process one page of a PDF if beginning the transaction took just under 100ms. Now we may not even process a single word.

I've now realized that the QPdfDocument we use to get metadata may also contribute to this, which this PR doesn't entirely fix - we may still only get a single word per iteration.

TODO: This should be changed such that we do not need to open the PDF document every iteration. Done.

Checklist

  • This PR has been thoroughly tested.

Recording this metadata once avoids the need to open the PDF document
every time we enter scanQueue.

Signed-off-by: Jared Van Bortel <[email protected]>
Signed-off-by: Jared Van Bortel <[email protected]>
@cebtenzzre cebtenzzre marked this pull request as ready for review October 15, 2024 19:33
@cebtenzzre
Copy link
Member Author

cebtenzzre commented Oct 15, 2024

I was able to confirm that for large PDFs with many high-resolution images, the open for reading metadata can reliably take more than 100ms, which before this PR would prevent any words from being read.

Before (current main):

open duration avg 994 ms over 10 opens
avg 0 pages
open duration avg 988 ms over 10 opens
avg 0 pages
open duration avg 988 ms over 10 opens
avg 0 pages

Before the docx change, we would always read at least one page, even if it meant exceeding 100ms.

After (this PR):

open duration n/a over 1 opens
avg 76 pages
open duration n/a over 9 opens
avg 14 pages
open duration n/a over 3 opens
avg 61 pages

Copy link
Collaborator

@manyoso manyoso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one minor nit

gpt4all-chat/src/database.cpp Outdated Show resolved Hide resolved
@cebtenzzre cebtenzzre merged commit 735dd82 into main Oct 16, 2024
4 of 10 checks passed
@cebtenzzre cebtenzzre deleted the ensure-localdocs-progress branch October 16, 2024 15:11
cebtenzzre added a commit that referenced this pull request Oct 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants