Ahh yes, you've hit a painfully under-discussed issue in RAG pipelines, especially with larger files; we logged this exact failure mode in our open-source diagnosis map.
We got tired of this same babysitting loop, so we built auto-resume logic plus async chunk recovery into our toolchain. 👉 https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md If you're interested, happy to show how we hook it up: zero restart, just skip forward and retry only the failed chunk.
-
I’m running into frequent LLM timeouts when LightRAG processes large documents for entities and relations (using Ollama/API). When this happens, the whole file processing just stops, and nothing else gets indexed for that document.
Right now, I have to manually restart the process for that file if it bombs out with a timeout. The LLM cache helps a lot here—it lets the system skip over chunks that were already processed, so there’s no extra LLM overhead for those. But the main annoyance is that I can’t just hit a “retry” or “resume” button in the UI. Instead, I have to delete the file and upload it again to restart processing, which is obviously messy and inefficient.
Feature Suggestions
• Automatic retry/resume on error: It would be great if LightRAG could auto-retry the processing job if it fails with a timeout (or similar transient error), instead of just quitting. Right now, I have to babysit it and manually restart.
• Manual resume button in UI: There should be a button in the web UI that lets me resume processing right where it left off for a file, without deleting/reuploading. The status should be preserved.
• Clearer error reporting: Better logging and status in the UI when things fail — show which chunk caused the timeout, the error details, and so on.
Basically, I need a way—either automatic or manual—for LightRAG to pick up where it left off after a timeout, without extra manual work or reuploading files. The LLM cache already takes care of skipping processed chunks, so this should be doable. It’s a pretty common workflow for batch jobs in other systems.
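The resume behavior described above (skip chunks already in the LLM cache, retry transient timeouts with backoff, record only the chunks that still failed so a later pass can pick them up) can be sketched in plain Python. Note this is a minimal illustration, not LightRAG's actual code: `extract` is a hypothetical callable that sends one chunk to the LLM, and `cache` is any dict-like store keyed by chunk index.

```python
import time

def process_document(chunks, extract, cache, max_retries=3, backoff=5.0):
    """Process chunks sequentially, skipping cached ones and retrying
    transient failures (e.g. LLM timeouts) instead of aborting the file.

    `extract` and `cache` are hypothetical stand-ins for the LLM call
    and the per-chunk result cache.
    """
    failed = []
    for i, chunk in enumerate(chunks):
        if i in cache:                    # already processed: no LLM call needed
            continue
        for attempt in range(1, max_retries + 1):
            try:
                cache[i] = extract(chunk)
                break
            except TimeoutError:
                if attempt == max_retries:
                    failed.append(i)      # record and move on; resumable later
                else:
                    time.sleep(backoff * attempt)  # simple linear backoff
    return failed                          # empty list means the file completed
```

Calling this again on the same file after a failure only re-sends the chunks in `failed`, since everything else is served from the cache — which is exactly the "resume without reupload" behavior requested.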