
Optimize for OpenAI Prompt Caching: Restructure entity extraction prompts for 50% cost reduction and faster indexing #2355

@adorosario

Description

Summary

OpenAI introduced automatic prompt caching in October 2024 for GPT-4o, GPT-4o-mini, o1-preview, and o1-mini models. This feature provides a 50% discount on cached prompt tokens and faster processing times for prompts longer than 1024 tokens.

However, LightRAG's current prompt structure prevents effective caching during indexing, missing a significant opportunity to reduce costs and improve indexing latency.

The Problem

Current Prompt Structure

In lightrag/operate.py:2807-2820, the entity extraction system prompt embeds variable content (input_text) directly into the system message:

# `content` is the text of the chunk currently being indexed, so the formatted
# system prompt string is different for every chunk
entity_extraction_system_prompt = PROMPTS[
    "entity_extraction_system_prompt"
].format(**{**context_base, "input_text": content})

This creates a system prompt that looks like:

---Role--- (static, ~100 tokens)
---Instructions--- (static, ~400 tokens)  
---Examples--- (static, ~800 tokens)
---Real Data to be Processed---
<Input>
Entity_types: [static during indexing run]
Text:

{input_text} ← THIS CHANGES FOR EVERY CHUNK ❌


### Why This Prevents Caching

OpenAI's prompt caching works by caching the **longest shared prefix** of prompts. Since `input_text` is embedded at the end of the system prompt, every chunk creates a completely different system prompt string. There is no shared prefix across chunks, so **nothing gets cached**.

### Reference

From the prompt template in `lightrag/prompt.py:11-69`:

```python
PROMPTS["entity_extraction_system_prompt"] = """---Role---
...
---Real Data to be Processed---
<Input>
Entity_types: [{entity_types}]
Text:

{input_text} # Variable content embedded in system prompt

"""

The Solution

Restructure Prompts for Caching

To leverage OpenAI's automatic prompt caching, the prompts should be restructured:

Optimal structure:

  • System message: Static instructions + examples + entity types (~1300 tokens, cacheable!)
  • User message: Just the variable input_text (~150 tokens per chunk)

This would allow the ~1300 token system message to be cached and reused for ALL chunks during an indexing run, with only the small user message varying.
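
Concretely, each per-chunk request would then take the following shape (a minimal sketch; the variable names are illustrative placeholders, not LightRAG's actual code):

```python
# Shape of one per-chunk Chat Completions request after restructuring.
# The system message is byte-identical for every chunk, so it forms the shared
# prefix that OpenAI's automatic prompt caching can reuse once it crosses the
# 1,024-token minimum; only the short user message changes from chunk to chunk.
STATIC_SYSTEM_PROMPT = "---Role---\n...instructions, examples, entity types..."  # ~1,300 tokens, identical every call
chunk_text = "...the current chunk's text..."                                    # ~150 tokens, different every call

messages = [
    {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # cacheable prefix
    {"role": "user", "content": chunk_text},              # per-chunk suffix
]
```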

Proposed Changes

  1. Split the system prompt template (lightrag/prompt.py):

    • Remove {input_text} from entity_extraction_system_prompt
    • Keep only the static instructions, examples, and entity types
  2. Modify the user prompt template:

    • Make entity_extraction_user_prompt contain the variable input_text
  3. Update the extraction logic (lightrag/operate.py):

    • Format system prompt once (without input_text)
    • Format user prompt with input_text for each chunk (see the sketch after the restructured templates below)

Example Restructured Template

PROMPTS["entity_extraction_system_prompt"] = """---Role---
You are a Knowledge Graph Specialist responsible for extracting entities and relationships from the input text.

---Instructions---
[... all the static instructions ...]

---Examples---
[... all the examples ...]

---Entity Types---
Entity_types: [{entity_types}]
"""

PROMPTS["entity_extraction_user_prompt"] = """---Task---
Extract entities and relationships from the following input text.

---Input Text---

{input_text}


---Output---
"""

Expected Impact

Cost Savings

For a typical indexing run of 8,000 chunks:

  • Current: ~1,450 tokens × 8,000 chunks = ~11.6M prompt tokens (all counted as new)
  • With caching: ~1,450 tokens (first chunk) + ~150 tokens × 7,999 chunks = ~1.2M new prompt tokens + ~10.4M cached tokens (billed at a 50% discount)
  • Result: ~45% cost reduction on prompt tokens during indexing (worked through in the sketch below)
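
A quick back-of-the-envelope check of those numbers, using the token estimates from this issue (~1,300 static tokens + ~150 variable tokens per chunk, 8,000 chunks, cached tokens billed at 50%):

```python
# Back-of-the-envelope check of the prompt-token savings estimated above.
chunks = 8_000
static_tokens = 1_300     # instructions + examples + entity types (estimate)
variable_tokens = 150     # per-chunk input text (estimate)

# Current structure: every prompt token is billed as new.
current = (static_tokens + variable_tokens) * chunks                              # ~11.6M

# Restructured: after the first request, the static prefix is served from cache.
new_tokens = (static_tokens + variable_tokens) + variable_tokens * (chunks - 1)   # ~1.2M
cached_tokens = static_tokens * (chunks - 1)                                      # ~10.4M
with_caching = new_tokens + 0.5 * cached_tokens  # cached tokens billed at 50%

print(f"current billed tokens:        {current:,.0f}")
print(f"with caching (billed-equiv.): {with_caching:,.0f}")
print(f"prompt-token cost reduction:  {1 - with_caching / current:.0%}")          # ~45%
```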

Latency Improvements

  • Cached prompt tokens process significantly faster than new tokens
  • Reduces overall indexing time, especially for large document collections
  • More responsive during bulk upload operations

Automatic Activation

OpenAI's prompt caching is automatic for prompts > 1024 tokens:

  • No API changes required beyond restructuring prompts (see the verification sketch below for confirming cache hits)
  • Works with existing GPT-4o, GPT-4o-mini, o1-preview, o1-mini models
  • Cache persists 5-10 minutes (max 1 hour), perfect for batch indexing
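
To confirm that caching is actually kicking in during indexing, the cached token count can be read from the Chat Completions usage stats. A sketch using the official openai Python client; the model choice and prompt contents are placeholders:

```python
from openai import OpenAI

# Placeholders standing in for the real prompts built during indexing.
static_system_prompt = "..."  # the ~1,300-token static instructions/examples/entity types
chunk_text = "..."            # one chunk's text

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": static_system_prompt},  # identical across chunks
        {"role": "user", "content": chunk_text},              # varies per chunk
    ],
)

# From the second request onward (same >=1,024-token prefix, within the cache
# retention window), cached_tokens should be > 0, reported in 128-token increments.
details = resp.usage.prompt_tokens_details
print("prompt tokens:", resp.usage.prompt_tokens)
print("cached prompt tokens:", details.cached_tokens if details else 0)
```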

Additional Benefits

This optimization would:

  • ✅ Reduce prompt-token costs during indexing by ~45% for OpenAI users
  • ✅ Improve indexing latency significantly
  • ✅ Make LightRAG more cost-effective for large-scale deployments
  • ✅ Require minimal code changes
  • ✅ Work automatically without user configuration

Affected Files

  • lightrag/prompt.py - Prompt templates
  • lightrag/operate.py - Entity extraction logic (lines ~2807-2850)

Thank you for considering this optimization! Happy to provide more details or assist with implementation if helpful.
