The DoclingDocument class defines a unified data model for rich document representation, and the docling library provides tools for converting common document formats (e.g., PDF, MS Office formats, and HTML) into DoclingDocument objects.
Currently, the DoclingDocument data model covers the content of a document, but common formats typically carry structured information on top of that content (metadata), which DoclingDocument does not yet represent.
Including such metadata would be beneficial for handling document collections and strengthening downstream applications like enterprise search and RAG on document collections.
Examples of metadata fields are:
in PDF documents: title, author, subject, content creator, keywords, creation date, modification date
in MS Office documents: title, author, subject, keywords, company, category, comments
in HTML documents: description, title, country code (and, in general, any content in a <meta> tag).
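To make the HTML case concrete, here is a minimal sketch of collecting such fields using only Python's standard `html.parser`. The names `MetaCollector` and `extract_html_metadata` are hypothetical and are not part of docling; this only illustrates what kind of information is available, not how docling's HTML backend works or should work.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <title> text and <meta name=... content=...> pairs."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name") and attr.get("content") is not None:
            # Lower-case the name so "Keywords" and "keywords" map to one key.
            self.meta[attr["name"].lower()] = attr["content"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] = self.meta.get("title", "") + data


def extract_html_metadata(html: str) -> dict[str, str]:
    """Return a flat dict of metadata found in the given HTML string."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.meta
```

A call such as `extract_html_metadata(page_source)` returns a plain dictionary keyed by the lower-cased `name` attribute, plus a `title` entry when a `<title>` element is present.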
This feature request consists of:
Adding a field in DoclingDocument for document metadata.
Identifying and defining a set of fields for the metadata that is general enough for common document types.
Eventually addressing the use case of custom metadata fields that are specific to a document type and that could be populated by docling's PDF parser or document backends.
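As a starting point for discussion, the three items above could look roughly like the sketch below. Everything in it is an assumption: `DocumentMetadata`, its field names, the `custom` dictionary, and `metadata_from_pdf_info` are hypothetical and do not exist in docling. It uses a standard-library dataclass to stay self-contained; an actual implementation would presumably follow whatever modelling approach DoclingDocument already uses. Dates are kept as raw strings to avoid committing to a parsing strategy.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class DocumentMetadata:
    """Hypothetical common metadata fields shared across document types."""

    title: Optional[str] = None
    authors: list[str] = field(default_factory=list)
    subject: Optional[str] = None
    keywords: list[str] = field(default_factory=list)
    created: Optional[str] = None   # raw date string as found in the source
    modified: Optional[str] = None  # raw date string as found in the source
    # Format-specific fields that do not fit the common set.
    custom: dict[str, Any] = field(default_factory=dict)


# Standard PDF Info dictionary keys that map one-to-one to common fields.
_PDF_COMMON = {
    "Title": "title",
    "Subject": "subject",
    "CreationDate": "created",
    "ModDate": "modified",
}


def metadata_from_pdf_info(info: dict[str, str]) -> DocumentMetadata:
    """Map a PDF Info dictionary (given as a plain dict) to DocumentMetadata."""
    meta = DocumentMetadata()
    for key, value in info.items():
        if key in _PDF_COMMON:
            setattr(meta, _PDF_COMMON[key], value)
        elif key == "Author":
            meta.authors = [value]
        elif key == "Keywords":
            meta.keywords = [k.strip() for k in value.split(",") if k.strip()]
        else:
            # Anything else (e.g. Creator, Producer) goes to the custom bucket.
            meta.custom[f"pdf:{key}"] = value
    return meta
```

Keeping a small typed core next to an open `custom` mapping is one way to cover the common fields while still letting a PDF parser or another document backend attach type-specific entries.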