[BUG]: Document processing API is not online (bulk file uploads) #2913
Comments
When you upload this many files, are you using the built-in CPU embedder or something external like Ollama or OpenAI?
Built-in embedder.
Then this is likely arising from resource constraints: the local embedder runs on CPU only and, depending on document chunk throughput, could be crashing or failing to allocate. It's unrelated to the retry mechanism proposed, but swapping to something like Ollama or OpenAI may alleviate it, since those can run off-machine or use the GPU on the device.
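For anyone else making the same switch, the change lives in the embedding variables of the AnythingLLM `.env`. A minimal sketch, assuming the variable names from the project's `.env.example` and an Ollama instance reachable from the container (verify both against your version):

```
# Assumed variable names; confirm against your release's .env.example
EMBEDDING_ENGINE='ollama'
EMBEDDING_BASE_PATH='http://host.docker.internal:11434'  # Ollama as seen from the container
EMBEDDING_MODEL_PREF='nomic-embed-text:latest'           # any embedding model pulled into Ollama
EMBEDDING_MODEL_MAX_CHUNK_LENGTH=8192
```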
I've switched it over to Ollama and am rebuilding the embeddings now (going to take a while). Once this completes, I will try uploading another batch of PDFs and see what happens. I'll post back whether this fixed the issue.
Alright, after switching to Ollama, I am still getting "document processing API is not online" during bulk uploads. Granted, there don't seem to be nearly as many of these errors, but in a batch upload of around 900 PDF/TXT files, I've seen the API offline error come up about 6 times now and counting. The next issue (as described initially): once the upload finishes, I will have to delete everything I just uploaded, as I can't isolate which files failed versus which ones succeeded. The failures do seem to be related to the number of documents being processed at once / the CollectorApi being busy.
Another item to note: I went through the log files to see if I could isolate an error containing the words "document processing API is not online". Interestingly, there is no log entry with this exact phrase. Searching for "not online" produces no results. The only reference in the logs (which I can't fully confirm is for this exact error; it is repeated a few times through the logs on different files) is:

2025-01-01 13:02:15 [backend] info: [CollectorApi] Document Cook_better_food.pdf uploaded processed and successfully. It is now available in documents.

This portion of the log may not be fully relevant to the error I am seeing on the front-end, as the front-end error doesn't correlate to any direct reference in the backend logs that I can see. It would be nice if the error message were changed from "document processing API is not online" to "document processing API is offline", as it would make searching the logs for failures related to "offline" a little easier. Even with that, I've gone through the logs line by line (searching for the word "failed") and can't find anything that directly shows this specific error (API is not online) is even happening. From the front-end, I see 3 different errors at random times.
What I am unsure about in the logs: when I see (for example) "2025-01-01 13:02:15 [backend] info: [CollectorApi] fetch failed", is this a log report for error #3 only, or is it the recorded entry for #2 and #3 together? For this upload experiment, I uploaded 961 files (half PDF, the other half TXT); 829 were successfully uploaded. This indicates that 132 files failed to process/upload because of one of the 3 errors previously mentioned. I have found no easy method to isolate which files failed due to error #1 versus errors #2 or #3 (which I understand is a separate but related issue from the API-not-online one).
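In the meantime, one way to get that failed-file report is to diff the batch you attempted against what actually landed in the documents storage folder. A minimal Node/TypeScript sketch; the storage path and the assumption that processed files keep the original file name as a prefix (e.g. `Cook_better_food.pdf-<uuid>.json`) are both guesses to verify locally:

```ts
// List source files that have no processed counterpart after a bulk upload.
import { readdirSync } from "fs";

const sourceDir = "/path/to/upload/batch";     // the files you tried to upload
const processedDir =
  "/app/server/storage/documents/custom-documents"; // assumed storage location

const attempted = readdirSync(sourceDir);
const processed = readdirSync(processedDir);

// A source file counts as failed if no processed document starts with its name.
const failed = attempted.filter(
  (name) => !processed.some((doc) => doc.startsWith(name))
);

console.log(`${failed.length} of ${attempted.length} files missing a processed counterpart:`);
for (const name of failed) console.log(`  ${name}`);
```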
How are you running AnythingLLM?
Docker (local)
What happened?
When uploading bulk files into documents, I receive the error message "document processing API is not online" at random on different files as they're being uploaded.
In experimentation, I selected 8 PDF files, each over 300 MB. One of the 8 failed with the above error. If I wait for the other 7 to complete and then re-upload the one that failed, it uploads successfully.
In small batches this is manageable, as I can pinpoint the one that failed and re-upload it. However, in bulk testing, multiple files fail and it's impossible to keep track of which ones succeeded and which ones failed, so the only solution I've found is to delete all the files and re-upload them 4 to 6 at a time (which takes HOURS when uploading hundreds of documents).
It appears as if the API that manages the upload is limited in the number of documents it can process at one time, and/or if it tries to start an upload while the API is busy handling other files, the request fails as "not online".
If a file fails, the system doesn't appear to retry the upload. It just errors out, and the user must track which file failed and re-submit it after the queue has finished. This is next to impossible with bulk files.
A) It would be nice, when uploading files in bulk or uploading large files, to control how many documents are processed at once. For example, if I am uploading 1,500 PDF files, a setting to limit the processor to no more than 4 documents at a time (to minimize failures and make it easier to track which files failed on upload). A rough sketch of A and C follows after this list.
B) It would be nice if a log file or report were produced after a bulk upload, listing which files failed and which succeeded. This would make it easier to identify which files need to be re-uploaded.
C) During the upload process, if a file fails to upload because the API is unavailable, have the system automatically retry it: either move the file to the bottom of the queue and retry, or retry in place and give up after X attempts (see the sketch below).
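To make (A) and (C) concrete, here is a rough client-side sketch of capped concurrency plus retry in TypeScript. The `upload` callback is a hypothetical stand-in for whatever call actually performs the upload, and the concurrency and backoff numbers are illustrative, not tuned:

```ts
// Retry a single upload with linear backoff so a busy collector has time to drain.
async function uploadWithRetry(
  file: string,
  upload: (file: string) => Promise<void>,
  maxAttempts = 3
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await upload(file);
      return true;
    } catch {
      if (attempt < maxAttempts)
        await new Promise((r) => setTimeout(r, 2000 * attempt));
    }
  }
  return false; // exhausted all attempts
}

// Drain a shared queue with N workers so at most `concurrency` uploads run
// at once; return the files that never succeeded (suggestion B's report).
async function uploadBatch(
  files: string[],
  upload: (file: string) => Promise<void>,
  concurrency = 4
): Promise<string[]> {
  const queue = [...files];
  const failed: string[] = [];
  const workers = Array.from({ length: concurrency }, async () => {
    for (let f = queue.shift(); f !== undefined; f = queue.shift()) {
      if (!(await uploadWithRetry(f, upload))) failed.push(f);
    }
  });
  await Promise.all(workers);
  return failed;
}
```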
Thank you.
Are there known steps to reproduce?
Windows running the Docker version; upload 100+ large (30 MB+) documents into the document manager.