Add a field to DoclingDocument for metadata #73

Open
ceberam opened this issue Nov 27, 2024 · 0 comments

@ceberam
Collaborator

ceberam commented Nov 27, 2024

The DoclingDocument class defines a unified data model for rich document representation and the docling library provides tools for conversion of common document formats (e.g., PDF, MS Office formats, and HTML) into DoclingDocument objects.

Currently, the DoclingDocument data model covers the content of a document. Common formats, however, typically carry structured information about that content (metadata), which DoclingDocument does not represent.
Including such metadata would help with managing document collections and would strengthen downstream applications such as enterprise search and RAG over those collections.

Examples of metadata fields are:

  • in PDF documents: title, author, subject, content creator, keywords, creation date, modification date
  • in MS Office documents: title, author, subject, keywords, company, category, comments
  • in HTML documents: description, title, country code (and, in general, any content in <meta> tags).
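
To illustrate how the format-specific fields above might be folded into shared names, here is a minimal sketch. The mapping table, the target field names, and the `normalize_metadata` helper are all hypothetical and not part of docling. The PDF keys follow the standard document information dictionary; the HTML keys are common `<meta name=...>` values.

```python
from typing import Any, Dict, Tuple

# Hypothetical mapping from format-specific metadata keys to shared field names.
# None of these names come from docling; they only illustrate the idea.
COMMON_FIELD_BY_SOURCE_KEY: Dict[str, Dict[str, str]] = {
    "pdf": {
        "Title": "title",
        "Author": "author",
        "Subject": "subject",
        "Keywords": "keywords",
        "CreationDate": "created",
        "ModDate": "modified",
    },
    "html": {
        "title": "title",
        "author": "author",
        "description": "subject",
        "keywords": "keywords",
    },
}


def normalize_metadata(
    source_format: str, raw: Dict[str, Any]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split raw metadata into (common fields, format-specific leftovers)."""
    mapping = COMMON_FIELD_BY_SOURCE_KEY.get(source_format, {})
    common: Dict[str, Any] = {}
    custom: Dict[str, Any] = {}
    for key, value in raw.items():
        target = mapping.get(key)
        if target is None:
            # Unknown keys are preserved rather than dropped.
            custom[key] = value
        else:
            common[target] = value
    return common, custom


common, custom = normalize_metadata(
    "pdf", {"Title": "Annual report", "Producer": "ExampleWriter"}
)
```

Here `common` holds the keys that have a shared equivalent, while `custom` keeps everything else so that no source metadata is lost.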

This feature request consists of:

  • Adding a field in DoclingDocument for document metadata.
  • Identifying and defining a set of fields for the metadata that is general enough for common document types.
  • Eventually addressing the use case of custom metadata fields that are specific to a document type and that could be populated by docling's PDF parser or document backends.
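
As a starting point for discussion, the three items above could be sketched roughly as follows. This is only an illustration under assumptions: `DocumentMetadata`, its field names, and the `custom` dictionary are hypothetical, and `DocumentWithMetadata` is a plain dataclass standing in for DoclingDocument, not its actual definition.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class DocumentMetadata:
    """Hypothetical set of metadata fields shared by common document types."""

    title: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    subject: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    created: Optional[datetime] = None
    modified: Optional[datetime] = None
    # Format-specific entries (e.g. a PDF producer, an Office "company" value)
    # that a parser or backend could populate.
    custom: Dict[str, Any] = field(default_factory=dict)


@dataclass
class DocumentWithMetadata:
    """Stand-in for DoclingDocument, showing only the proposed new field."""

    name: str
    metadata: DocumentMetadata = field(default_factory=DocumentMetadata)


doc = DocumentWithMetadata(
    name="report",
    metadata=DocumentMetadata(
        title="Annual report",
        authors=["Jane Doe"],
        custom={"pdf:Producer": "ExampleWriter"},
    ),
)
```

Defaulting `metadata` to an empty object would keep existing documents valid, and the open-ended `custom` dictionary is one possible way to handle type-specific fields without widening the common set.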
@ceberam ceberam self-assigned this Nov 27, 2024
@ceberam ceberam added the enhancement New feature or request label Nov 27, 2024