
Oracle Graphs, RDF (Resource Description Framework) to improve chunking #130


Open

ddrechse opened this issue Mar 31, 2025 · 3 comments

Assignee: ddrechse
Labels: enhancement (New feature or request)

Comments


ddrechse commented Mar 31, 2025

The following is an idea for a new feature that would help users determine the best chunking size for their RAG application.

By combining RDF's semantic relationships with systematic chunk size testing, you can achieve 15-30% higher precision in retrieval tasks compared to unstructured text chunking.

Yes, you can use RDF-parsed PDF data to determine optimal chunking sizes for embedding models by leveraging semantic structure and chunking best practices. Here's how to approach it:

Key Steps for RDF-Based Chunking Optimization

  1. Leverage RDF Structure for Semantic Chunking
    RDF triples (subject-predicate-object) provide inherent semantic relationships that can guide chunking:

    • Group triples sharing common subjects or entities into cohesive chunks.

    • Use ontology hierarchies to preserve related concepts in the same chunk.

    • Prioritize chunks with triples connected via owl:sameAs or other semantic links.
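
A minimal sketch of the first bullet, assuming rdflib and an RDF file already extracted from the PDF (the file name is a placeholder):

```python
# Sketch: group triples by shared subject into candidate chunks (rdflib).
from collections import defaultdict
from rdflib import Graph

g = Graph()
g.parse("document.ttl", format="turtle")  # hypothetical RDF export of the PDF

chunks = defaultdict(list)
for s, p, o in g:
    chunks[s].append((s, p, o))  # one candidate chunk per subject

# Serialize each group into text that can be embedded downstream
chunk_texts = {s: " ".join(f"{s} {p} {o}." for s, p, o in ts)
               for s, ts in chunks.items()}
```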

  2. Chunk Size Guidelines
    Based on embedding model requirements and RDF characteristics:

    • Small chunks (100-250 tokens): Ideal for focused semantic search (e.g., text-embedding-3-small) [4][7].

    • Medium chunks (500-1k tokens): Balances context and precision for general RAG systems [6][8].

    • Large chunks (1k-6k tokens): Suitable for broad-context analysis (e.g., Azure OpenAI's 8k token limit) [3][7].
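
As a quick check of which band a chunk falls into, token counts can be measured with tiktoken; a sketch (cl100k_base is the encoding used by the OpenAI embedding models mentioned above):

```python
# Sketch: classify a chunk into the small/medium/large bands by token count.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def size_band(text):
    n = len(enc.encode(text))
    if n <= 250:
        return "small"   # focused semantic search
    if n <= 1000:
        return "medium"  # general RAG
    return "large"       # broad-context analysis
```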

  3. Hybrid Chunking Strategies

    | Strategy | RDF Application | Use Case |
    | -- | -- | -- |
    | Fixed-size chunking | Split RDF graphs into groups of 3-5 related triples | Basic entity extraction |
    | Variable-size chunking | Use SPARQL to cluster triples by shared subjects | Knowledge graph traversal |
    | Overlapping chunks | Add 10-20% overlap using rdfs:seeAlso links | Context preservation in RAG |
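
A sketch of the first and third rows combined, assuming the triples arrive in a meaningful order; the group size and overlap follow the table's suggested values:

```python
# Sketch: fixed-size triple chunks with one-triple (~20%) overlap.
def chunk_triples(triples, size=5, overlap=1):
    step = size - overlap  # neighboring chunks share `overlap` triples
    return [triples[i:i + size] for i in range(0, len(triples), step)]
```

This would pair naturally with the subject-grouping sketch above, e.g. `chunk_triples(list(g))`.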
  4. Evaluation Workflow

    • Step 1: Preprocess RDF data (remove redundant triples, merge duplicates).

    • Step 2: Test multiple chunk sizes using metrics like:

      • Cosine similarity scores between query and chunk embeddings [4][6].

      • Precision/recall in downstream tasks (e.g., Q&A accuracy).

    • Step 3: Optimize using tools like:

```python
# Example: Evaluate chunk sizes with OpenAI embeddings
from openai import OpenAI

client = OpenAI()

def evaluate_chunk(text_chunk):
    # Embed one candidate chunk; compare embeddings across chunk sizes
    response = client.embeddings.create(input=text_chunk,
                                        model="text-embedding-3-small")
    return response.data[0].embedding
```
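
Building on `evaluate_chunk`, one hedged way to compare candidate chunkings per Step 2 is mean best-match cosine similarity over a test-query set; `chunks` and `test_queries` are hypothetical inputs you supply:

```python
# Sketch: score a candidate chunking by query-chunk cosine similarity.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_chunking(chunks, test_queries):
    chunk_vecs = [evaluate_chunk(c) for c in chunks]
    query_vecs = [evaluate_chunk(q) for q in test_queries]
    # Average each query's best-matching chunk; higher is better
    return float(np.mean([max(cosine(q, c) for c in chunk_vecs)
                          for q in query_vecs]))
```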

Technical Considerations

  • Token Limits: Ensure chunks comply with model constraints (e.g., 8k tokens for Ada-002) [3][8].

  • Semantic Density: RDF's structured nature often allows smaller chunks than raw text (250-500 tokens) [7][8].

  • Overlap Management: Use RDF reification to link overlapping chunks without duplicating data [1][6].
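
To illustrate the reification point, here is one possible (assumed, not canonical) encoding in rdflib, where an overlap link references shared content instead of copying it; `EX` and its predicates are hypothetical:

```python
# Sketch: link two overlapping chunks via RDF reification (rdflib).
from rdflib import Graph, Namespace, BNode
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()

overlap = BNode()
g.add((overlap, RDF.type, RDF.Statement))           # reified statement
g.add((overlap, RDF.subject, EX.chunk1))
g.add((overlap, RDF.predicate, EX.overlapsWith))    # hypothetical predicate
g.add((overlap, RDF.object, EX.chunk2))
g.add((overlap, EX.sharedTriples, EX.tripleSet42))  # pointer, not a copy
```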

Implementation Tools

  • Elasticsearch: Use ingest pipelines to auto-chunk RDF serializations (JSON-LD) [1].

  • Azure AI Search: Apply Text Split skill with RDF-aware boundaries (e.g., </rdf:Description>) [3].

  • SPARQL: Query to identify optimal chunk boundaries:

```sparql
SELECT ?subject (COUNT(?p) AS ?tripleCount)
WHERE { ?subject ?p ?o }
GROUP BY ?subject
HAVING (COUNT(?p) >= 3 && COUNT(?p) <= 5)
```
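
The same query can be run in-process with rdflib, which supports SPARQL 1.1 aggregates; the input file name is a placeholder:

```python
# Sketch: find subjects with 3-5 triples as candidate chunk seeds (rdflib).
from rdflib import Graph

g = Graph()
g.parse("document.ttl", format="turtle")  # hypothetical input file

query = """
SELECT ?subject (COUNT(?p) AS ?tripleCount)
WHERE { ?subject ?p ?o }
GROUP BY ?subject
HAVING (COUNT(?p) >= 3 && COUNT(?p) <= 5)
"""
for row in g.query(query):
    print(row.subject, row.tripleCount)
```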

By combining RDF's semantic relationships with systematic chunk size testing, you can achieve 15-30% higher precision in retrieval tasks compared to unstructured text chunking [6][7]. Start with 500-token chunks for general RAG applications, then refine based on your specific data topology.

Citations:

  1. https://www.elastic.co/search-labs/blog/chunking-via-ingest-pipelines
  2. https://www.pinecone.io/learn/chunking-strategies/
  3. https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents
  4. https://www.restack.io/p/text-chunking-answer-text-embedding-3-small-chunk-size-cat-ai
  5. https://vectorshift.ai/blog/maximizing-llm-performance-with-effective-chunking-strategies-for-vector-embeddings
  6. https://training.continuumlabs.ai/knowledge/retrieval-augmented-generation/mastering-chunking-in-retrieval-augmented-generation-rag-systems
  7. https://unstructured.io/blog/chunking-for-rag-best-practices
  8. https://dev.to/simplr_sh/the-best-way-to-chunk-text-data-for-generating-embeddings-with-openai-models-56c9
  9. https://www.reddit.com/r/LocalLLaMA/comments/18j39qt/what_embedding_models_are_you_using_for_rag/
  10. https://www.reddit.com/r/LangChain/comments/1acudx2/efficient_chunking_strategies_for_pdf_information/
  11. https://www.reddit.com/r/LangChain/comments/15mq21r/what_are_the_text_chunkingsplitting_and_embedding/
  12. https://weaviate.io/blog/how-to-choose-an-embedding-model
  13. https://docs.llamaindex.ai/en/stable/optimizing/basic_strategies/basic_strategies/
  14. https://unstructured.io/blog/understanding-embedding-models-make-an-informed-choice-for-your-rag
  15. https://blog.dailydoseofds.com/p/5-chunking-strategies-for-rag
  16. https://robkerr.ai/chunking-text-into-vector-databases/
  17. https://www.mongodb.com/developer/products/atlas/choosing-chunking-strategy-rag/

@ddrechse ddrechse self-assigned this Apr 10, 2025
@ddrechse ddrechse added the enhancement New feature or request label Apr 10, 2025

ddrechse commented Jun 3, 2025

Short answer: yes, and possibly a unique feature that Oracle can provide.

Overview

Chunking technical documents for vector embeddings is a critical step in Retrieval-Augmented Generation (RAG) and similar workflows. The goal is to split documents into meaningful segments that preserve semantic context, which in turn improves retrieval accuracy and relevance. Traditional chunking strategies include fixed-size, sentence/paragraph-based, and semantic chunking[3][5][9]. However, these methods can struggle with complex technical documents where natural boundaries are not always clear.

Can RDF Graphs Help with Chunking?

RDF (Resource Description Framework) graphs represent documents as interconnected entities and relationships, providing a structured, semantic view of content. While the cited sources do not directly mention using RDF graphs for chunking recommendations, the following inferences can be made:

  • Semantic Awareness: RDF graphs can capture the relationships between concepts, sections, and entities in a technical document. This semantic structure can be leveraged to identify logical chunk boundaries, such as grouping content by related concepts or processes, rather than arbitrary token or sentence limits.
  • Improved Contextual Chunks: By analyzing the RDF graph, you can identify clusters of closely related nodes (concepts or sections) and use these as the basis for chunking (see the sketch after this list). This could lead to more contextually coherent chunks, which are likely to produce higher-quality embeddings and retrieval results.
  • Alignment with Document Structure: Many chunking strategies already recommend using document structure (headings, sections, etc.) for variable-sized chunks[3][5][6]. RDF graphs can enhance this by making implicit relationships explicit, especially in documents where structure is not clearly marked.
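
A minimal sketch of the clustering idea, assuming rdflib plus networkx and treating the RDF graph as undirected; connected components stand in for a real community-detection step:

```python
# Sketch: derive chunk candidates from clusters of related RDF nodes.
import networkx as nx
from rdflib import Graph

g = Graph()
g.parse("document.ttl", format="turtle")  # hypothetical input

nxg = nx.Graph()
for s, p, o in g:
    nxg.add_edge(s, o)  # ignore predicates; treat the graph as undirected

# Each connected component becomes one candidate chunk of related nodes
clusters = list(nx.connected_components(nxg))
```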

Practical Chunking Strategies Enhanced by RDF

| Chunking Method | How RDF Graphs Can Enhance It |
| -- | -- |
| Fixed-size chunks | Less relevant; RDF adds little value here |
| Sentence/paragraph-based | RDF can help group sentences/paragraphs by topic |
| Semantic chunking | RDF excels by defining semantically coherent groups |
| Hybrid/custom combinations | RDF enables smart merging/splitting based on meaning |

Considerations

  • Implementation Complexity: Using RDF graphs requires parsing the document into an RDF representation, which may involve NLP and entity recognition steps.
  • Tooling: Most vector database and chunking tools do not natively support RDF-based chunking, so a custom preprocessing pipeline is needed.
  • Potential Benefits: For highly technical or unstructured documents, RDF-guided chunking could significantly improve the semantic quality of chunks, leading to better downstream retrieval and generation performance.

Conclusion

While standard chunking strategies (fixed-size, sentence/paragraph, semantic) remain widely used and effective for many scenarios[3][5][9], RDF graphs can provide a more nuanced, context-aware approach to chunking technical documents. By leveraging the semantic structure captured in an RDF graph, you can make more informed chunking decisions, potentially leading to more meaningful vector embeddings and superior retrieval results—especially in complex or poorly structured documents. However, this approach requires additional tooling and processing compared to conventional methods.

[1] https://www.pinecone.io/learn/chunking-strategies/
[2] https://robkerr.ai/chunking-text-into-vector-databases/
[3] https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents
[4] https://www.mongodb.com/developer/products/atlas/choosing-chunking-strategy-rag/
[5] https://www.linkedin.com/posts/pavan-belagatti_chunking-embedding-are-the-two-key-steps-activity-7271056642023038976-epCw
[6] https://www.reddit.com/r/LangChain/comments/15q5jzv/how_should_i_chunk_text_from_a_textbook_for_the/
[7] https://unstract.com/blog/vector-db-retrieval-to-chunk-or-not-to-chunk/
[8] https://community.openai.com/t/document-sections-better-rendering-of-chunks-for-long-documents/329066
[9] https://docs.aws.amazon.com/bedrock/latest/userguide/kb-chunking.html


ddrechse commented Jun 3, 2025

This is a possible implementation; Oracle may be the only database that can do it.

How RDF Graphs Enhance and Implement Document Chunking
Implementation Overview
RDF graphs can be used to enhance document chunking by explicitly modeling the structure and relationships within a document, allowing for more context-aware and semantically meaningful chunking. Here’s how this can be implemented:

  1. Parse and Segment the Document
  • Start by parsing the document into logical units (e.g., paragraphs, sections).
  • Each unit becomes a node (chunk) in the graph [1].
  2. Build the RDF Graph
  • Represent each chunk as an RDF resource (node).
  • Define relationships (edges) between chunks, such as:
    • parent-child (e.g., section contains paragraph)
    • sibling (e.g., paragraphs within the same section)
    • successor/predecessor (e.g., order of paragraphs) [1][6]
  • Optionally, include metadata like position, heading, or summary as properties.
  3. Store in a Graph Database
  • Save the RDF graph in a graph database or a compatible RDF store.
  • This enables efficient querying and traversal of relationships [1][7].
  4. Chunking Recommendations and Retrieval
  • When generating chunks for embedding or retrieval:
    • Use the graph to pull not only the target chunk but also its context (parent, siblings, children) [1].
    • This can improve retrieval by providing richer, more relevant context to the embedding or search process.
  • For example, when a query matches a paragraph, you can automatically include its section heading or adjacent paragraphs by traversing the graph [1][7].
  5. Integration with Embedding and RAG Workflows
  • During vector embedding, enrich the chunk's text with context derived from the graph (e.g., prepend section titles, include summaries).
  • During retrieval, use the graph structure to dynamically expand or refine the set of candidate chunks based on their relationships and positions [1][7].
  6. Tooling and Automation
  • Use parsers like LlamaParse or Docling to automate the initial segmentation and metadata extraction [1].
  • Consider plugins for flexibility in parsing different document formats.
Example Workflow

  1. Document Ingestion: Parse the document into sections and paragraphs.
  2. Graph Construction: Create RDF triples (sketched below) representing:
    • Paragraph A is part of Section 1.
    • Section 1 is part of Document X.
    • Paragraph A precedes Paragraph B.
  3. Chunk Storage: Store chunks and their relationships in a graph database.
  4. Query/Embedding: When processing a chunk, retrieve its parent section and adjacent paragraphs to provide richer input for embedding or retrieval.
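
A minimal sketch of the Graph Construction step with rdflib; the `EX` namespace and `precedes` predicate are illustrative choices, while `DCTERMS.isPartOf` is one standard way to express containment:

```python
# Sketch: model the example workflow's structure as RDF triples (rdflib).
from rdflib import Graph, Namespace
from rdflib.namespace import DCTERMS

EX = Namespace("http://example.org/doc/")  # hypothetical namespace
g = Graph()

g.add((EX.paragraphA, DCTERMS.isPartOf, EX.section1))  # Paragraph A in Section 1
g.add((EX.section1, DCTERMS.isPartOf, EX.documentX))   # Section 1 in Document X
g.add((EX.paragraphA, EX.precedes, EX.paragraphB))     # reading order

print(g.serialize(format="turtle"))
```
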
Benefits

  • Enables context-aware chunking and retrieval.
  • Preserves document structure and semantics.
  • Facilitates advanced retrieval strategies like contextual RAG [7].
| Step                    | Action                                           |
|-------------------------|--------------------------------------------------|
| Parse Document          | Segment into logical chunks                      |
| Build RDF Graph         | Model chunks and relationships as RDF triples    |
| Store in Graph Database | Save for efficient traversal and querying        |
| Chunking/Embedding      | Use graph to enhance context and chunk selection |
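
A hedged sketch of the Chunking/Embedding step: expanding a matched paragraph with its parent section and neighbors by querying the graph built above (same illustrative vocabulary):

```python
# Sketch: expand a matched chunk with its structural context (rdflib SPARQL).
context_query = """
PREFIX ex: <http://example.org/doc/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?context WHERE {
  { ex:paragraphA dcterms:isPartOf ?context }    # parent section
  UNION { ex:paragraphA ex:precedes ?context }   # next paragraph
  UNION { ?context ex:precedes ex:paragraphA }   # previous paragraph
}
"""
for row in g.query(context_query):
    print(row.context)  # prepend/append these chunks before embedding
```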

This approach leverages the semantic and structural power of RDF graphs to make chunking more intelligent and context-sensitive, especially for complex technical documents [1][6][7].

  1. https://www.reddit.com/r/LLMDevs/comments/1gsobaf/document_chunking_into_graph/
  2. https://w3c.github.io/cogai/chunks-and-rules.html
  3. Chunks and Chunk Rules w3c/EasierRDF#71
  4. https://blog.kuzudb.com/post/in-praise-of-rdf/
  5. https://www.cs.umd.edu/~abadi/papers/sw-graph-scale.pdf
  6. https://github.com/w3c/cogai/blob/master/chunks-and-rules.md
  7. https://www.ontotext.com/knowledgehub/fundamentals/what-is-graph-rag/


ddrechse commented Jun 3, 2025

Next Steps

Turn this
https://docs.oracle.com/en/database/oracle/oracle-database/23/vecse/ai-vector-search-users-guide.pdf

into RDF triples that can be loaded into Oracle.
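
A possible starting point, assuming pypdf for text extraction and rdflib for output; the paragraph-splitting heuristic and vocabulary are placeholders, and N-Triples is one serialization that Oracle's RDF bulk-loading tools can ingest:

```python
# Sketch: turn the AI Vector Search PDF into RDF triples (pypdf + rdflib).
from pypdf import PdfReader
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS

EX = Namespace("http://example.org/vecse/")  # hypothetical namespace
g = Graph()
doc = EX.aiVectorSearchGuide

reader = PdfReader("ai-vector-search-users-guide.pdf")
for i, page in enumerate(reader.pages):
    text = page.extract_text() or ""
    # Naive paragraph split; a real pipeline would detect headings/sections
    for j, para in enumerate(p for p in text.split("\n\n") if p.strip()):
        node = EX[f"page{i}_para{j}"]
        g.add((node, DCTERMS.isPartOf, doc))
        g.add((node, EX.text, Literal(para.strip())))

g.serialize(destination="vecse.nt", format="nt")  # load into Oracle from here
```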
