
[BUG]: Document processing API is not online (bulk file uploads) #2913

Open
rthwm opened this issue Dec 29, 2024 · 6 comments
Labels
possible bug Bug was reported but is not confirmed or is unable to be replicated.

Comments

@rthwm

rthwm commented Dec 29, 2024

How are you running AnythingLLM?

Docker (local)

What happened?

When uploading files in bulk into documents, I receive the error message "document processing api is not online" at random on different files as they are being uploaded.

In experimentation, I selected 8 PDF files that were all over 300 MB each. One of the 8 failed with the above error. If I wait for the other 7 to complete and then re-upload the one that failed, it uploads successfully.

In small batches this is manageable, as I can pinpoint the one that failed and re-upload it. In bulk testing, however, multiple files fail and it's impossible to keep track of which ones succeeded and which ones failed, so the only solution I've found is to delete all the files, then re-upload them 4 to 6 at a time (which takes HOURS when uploading hundreds of documents).

  1. It appears as if the API that manages the upload is limited in the number of documents it can process at one time, and/or that if it tries to start an upload while the API is busy handling other files, it fails as "not online".

  2. If a file fails, the system doesn't appear to retry the upload. It just errors, and the user must track which file failed and re-submit it after the queue has finished. This is next to impossible with bulk files.

A) It would be nice, when uploading files in bulk or uploading large files, to be able to control how many documents the processor handles at once. For example, if I am uploading 1,500 PDF files, a setting to limit the processor to no more than 4 documents at a time (to minimize failures and make it easier to track which files failed on upload).

B) It would be nice if a log file or report were produced after a bulk upload listing which files failed and which succeeded. This would make it easier to identify which files need to be re-uploaded.

C) During the upload process, if a file fails because the API is unavailable, have the system automatically retry it: either move the file to the bottom of the queue and try again, or retry automatically and fail after X attempts. (A rough sketch of all three suggestions follows below.)
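
For illustration, here is a rough TypeScript sketch of the kind of client-side queue I mean. The /api/v1/document/upload path, the port, and the bearer-token auth are assumptions on my part based on the developer API; verify against your own instance before using it. It caps concurrency (A), retries failures with backoff (C), and prints a report of what failed (B):

    // Hypothetical bulk uploader sketch: bounded concurrency + retry + report.
    // Endpoint path, port, and auth header are assumptions; adjust as needed.
    // Requires Node 18+ (global fetch, FormData, Blob).
    import { readFile, readdir } from "node:fs/promises";
    import path from "node:path";

    const BASE_URL = "http://localhost:3001/api/v1"; // assumed instance URL
    const API_KEY = process.env.ANYTHINGLLM_API_KEY ?? "";
    const CONCURRENCY = 4;  // suggestion A: cap in-flight uploads
    const MAX_ATTEMPTS = 3; // suggestion C: retry before giving up

    async function uploadOne(filePath: string): Promise<void> {
      const form = new FormData();
      form.append("file", new Blob([await readFile(filePath)]), path.basename(filePath));
      const res = await fetch(`${BASE_URL}/document/upload`, {
        method: "POST",
        headers: { Authorization: `Bearer ${API_KEY}` },
        body: form,
      });
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
    }

    async function uploadAll(files: string[]): Promise<void> {
      const failed: { file: string; error: string }[] = [];
      const queue = [...files];
      const worker = async () => {
        for (let f = queue.shift(); f !== undefined; f = queue.shift()) {
          let lastErr = "";
          for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try { await uploadOne(f); lastErr = ""; break; }
            catch (e) {
              lastErr = String(e);
              await new Promise((r) => setTimeout(r, 2000 * attempt)); // backoff
            }
          }
          if (lastErr) failed.push({ file: f, error: lastErr });
        }
      };
      await Promise.all(Array.from({ length: CONCURRENCY }, worker));
      // Suggestion B: a report of which files failed, so re-upload is targeted.
      console.log(`${files.length - failed.length}/${files.length} succeeded`);
      for (const { file, error } of failed) console.log(`FAILED ${file}: ${error}`);
    }

    const dir = process.argv[2] ?? "./docs";
    readdir(dir).then((names) => uploadAll(names.map((n) => path.join(dir, n))));

Something like this built into the app itself (a concurrency setting plus a post-upload failure report) would cover all three suggestions.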

Thank you.

Are there known steps to reproduce?

Windows running the Docker version; upload 100+ large (30 MB+) documents into the document manager.

@rthwm rthwm added the possible bug Bug was reported but is not confirmed or is unable to be replicated. label Dec 29, 2024
@timothycarambat
Member

When you upload this many files, are you using the built-in CPU embedder or something external like Ollama or OpenAI?

@rthwm
Author

rthwm commented Dec 30, 2024

Built-in embedder.

@timothycarambat
Member

Then this is likely arising from resource constraints, as the local embedder runs on CPU only and, depending on the document chunk throughput, could be crashing or failing to allocate. It's unrelated to the retry mechanism proposed, but swapping to something like Ollama or OpenAI may alleviate it, since embedding can then run off-machine or use the GPU on the device.
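
For reference, the embedder can be swapped via the UI (Embedding Preference) or the instance .env in the Docker deployment. Roughly like the following for an Ollama embedder running on the host machine; treat the exact key names as assumptions and verify them against your version's .env.example:

    # Assumed .env keys for an Ollama embedder; check your version's
    # .env.example, as names may differ across releases.
    EMBEDDING_ENGINE='ollama'
    EMBEDDING_BASE_PATH='http://host.docker.internal:11434'
    EMBEDDING_MODEL_PREF='nomic-embed-text:latest'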

@rthwm
Author

rthwm commented Dec 31, 2024

I've switched it over to Ollama and am rebuilding the embeddings now (it's going to take a while). Once this completes, I will try uploading another batch of PDFs and see what happens. I'll post back on whether this fixed the issue.

@rthwm
Author

rthwm commented Jan 1, 2025

Alright: after switching to Ollama, I am still getting "document processing API is not online" during bulk uploads. There seem to be far fewer of these errors, but in a batch upload of around 900 PDF/TXT files I've seen the API-offline error come up about 6 times so far and counting. The next issue (as described initially): once the upload finishes, I will have to delete everything I just uploaded, because I can't isolate which files failed versus which succeeded. The failures do seem related to the number of documents being processed at once / the CollectorApi being busy.

@rthwm
Author

rthwm commented Jan 1, 2025

Another item to note: I went through the log files to see if I could isolate an error containing the words "document processing API is not online". Interestingly enough, there is no log entry with this exact phrase; searching for "not online" produces no results. The only reference in the logs (which I can't fully confirm corresponds to this exact error; it is repeated a few times through the logs on different files) is:

2025-01-01 13:02:15 [backend] info: [CollectorApi] Document Cook_better_food.pdf uploaded processed and successfully. It is now available in documents.
2025-01-01 13:02:15 [backend] info: [CollectorApi] fetch failed
2025-01-01 13:02:15 [backend] info: [CollectorApi] fetch failed
2025-01-01 13:02:15 [backend] info: [CollectorApi] fetch failed
2025-01-01 13:02:15 [backend] info: [TELEMETRY SENT] {"event":"document_uploaded","distinctId":"08fe1348-286a-4313-9d72-f6d357f86f90","properties":{"runtime":"docker"}}

This portion of the log may not be fully relevant to the error I am seeing on the front end, as the front-end error doesn't correlate to any direct reference in the backend logs that I can see. It would be nice if the error message were changed from "document processing API is not online" to "document processing API is offline", as searching the logs for "offline" would make failures easier to find. Even so, I've gone through the logs line by line (searching for the word "failed") and can't find anything that directly shows this specific error (API is not online) is even happening.
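
In the meantime, tallying the [CollectorApi] lines quoted above is the closest I can get to counting failures from the logs. A quick sketch (the log file path here is hypothetical; the matched strings are exactly the ones in the excerpt):

    // Tally success vs. "fetch failed" lines in the backend log.
    // Log path is hypothetical; matched strings come from the excerpt above.
    import { readFileSync } from "node:fs";

    const log = readFileSync(process.argv[2] ?? "anythingllm-backend.log", "utf8");
    const lines = log.split("\n").filter((l) => l.includes("[CollectorApi]"));
    const ok = lines.filter((l) => l.includes("uploaded processed and successfully"));
    const failed = lines.filter((l) => l.includes("fetch failed"));
    console.log(`succeeded: ${ok.length}, fetch failed: ${failed.length}`);

The success lines at least name the document; the "fetch failed" lines unfortunately don't, which is the core of the traceability problem.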

From the front-end, I see 3 different errors at random times.

  1. Text content was empty for document_Name.pdf (I know what this error means; it isn't important or related to this topic).
  2. document processing API is not online
  3. fetch failed (doesn't show the "API is not online" error; just says "fetch failed")

What I am unsure about in the logs: when I see, for example, "2025-01-01 13:02:15 [backend] info: [CollectorApi] fetch failed", is this a log entry for error #3 only, or is it the recorded entry for both #2 and #3?

For this upload experiment, I uploaded 961 files (half PDF, half TXT); 829 were successfully uploaded. That means 132 files failed to process/upload because of one of the 3 errors above. I have found no easy way to isolate which files failed due to error #1 versus errors #2 or #3 (which I understand is separate from, but related to, the API-not-online issue).
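
The best workaround I can think of for isolating them is diffing the upload directory against the filenames in the backend's success lines. A rough sketch (paths are hypothetical; the log format is taken from the excerpt above):

    // Recover the failed-file list by diffing the upload directory against
    // filenames in the backend's success log lines. Paths are hypothetical.
    import { readFileSync, readdirSync } from "node:fs";

    const uploaded = new Set<string>();
    const re = /\[CollectorApi\] Document (.+?) uploaded processed and successfully/;
    for (const line of readFileSync("anythingllm-backend.log", "utf8").split("\n")) {
      const m = re.exec(line);
      if (m) uploaded.add(m[1]);
    }

    const missing = readdirSync("./docs").filter((name) => !uploaded.has(name));
    console.log(`${missing.length} files with no success entry:`);
    for (const name of missing) console.log(name);

This still wouldn't separate error #1 from #2/#3, but it would at least recover the list of files that never got a success entry.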
