Passing custom metadata per document #8

@r-gg

Issue Description

When converting multiple documents, I want to pass several metadata fields that differ from document to document. Several of the default Haystack converters already support this (e.g. MarkdownToDocument). As with those converters, one should be able to pass either:

  1. a single dictionary whose fields will be added to the metadata of all chunks or
  2. a list of dictionaries having the same length as the list of passed documents (mapping fields of each dictionary to the metadata fields of the chunks of the respective document).
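To make the two accepted shapes concrete, here is a minimal sketch of how a `meta` argument could be normalized into one dictionary per source. The function name `normalize_meta` and its signature are hypothetical and are not taken from the docling-haystack code base; the behavior is only meant to mirror what is described above.

```python
from typing import Any, Dict, List, Optional, Union

Meta = Dict[str, Any]


def normalize_meta(
    meta: Optional[Union[Meta, List[Meta]]], sources_count: int
) -> List[Meta]:
    """Return one metadata dict per source (hypothetical helper)."""
    if meta is None:
        # No metadata passed: one empty dict per source.
        return [{} for _ in range(sources_count)]
    if isinstance(meta, dict):
        # Case 1: a single dict is copied once per source.
        return [dict(meta) for _ in range(sources_count)]
    if isinstance(meta, list):
        # Case 2: a list must line up one-to-one with the sources.
        if len(meta) != sources_count:
            raise ValueError(
                "The length of the meta list must match the number of sources."
            )
        return [dict(m) for m in meta]
    raise ValueError("meta must be None, a dict, or a list of dicts.")
```

Copying each dictionary keeps later per-document changes from leaking into the metadata of other documents.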

This is, however, not supported by the current implementation. The workaround of setting the metadata after conversion (with export type DOC_CHUNKS) does not work, for the following reason: when converting multiple documents (i.e. len(paths) > 1), it is difficult to track which chunks belong to which document. Some documents can share the same filename and binary_hash, so for chunks from those documents it is impossible to tell which original document a chunk came from.

Possible Solution

Add an optional meta parameter to the component's DoclingConverter.run() method and extend the existing meta dictionaries (returned by the _meta_extractor) with the dictionary or dictionaries passed in the new meta parameter.
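A rough sketch of the merge step, under stated assumptions: the extracted metadata is modeled here as one list of chunk dictionaries per input document, and `merge_user_meta` is a hypothetical name, not an existing function in the component. Letting user-supplied keys win on a collision is also an assumption; the opposite choice would work just as well.

```python
from typing import Any, Dict, List, Optional, Union

Meta = Dict[str, Any]


def merge_user_meta(
    extracted_per_doc: List[List[Meta]],
    meta: Optional[Union[Meta, List[Meta]]] = None,
) -> List[Meta]:
    """Extend each chunk's extracted meta with the user meta of its document."""
    doc_count = len(extracted_per_doc)
    if meta is None:
        per_doc: List[Meta] = [{} for _ in range(doc_count)]
    elif isinstance(meta, dict):
        # One dict applies to every document.
        per_doc = [meta for _ in range(doc_count)]
    else:
        # A list must contain exactly one dict per document.
        if len(meta) != doc_count:
            raise ValueError("meta list length must match the number of documents.")
        per_doc = meta

    merged: List[Meta] = []
    for chunk_metas, user_meta in zip(extracted_per_doc, per_doc):
        for chunk_meta in chunk_metas:
            # Assumption: user-supplied keys override extracted keys.
            merged.append({**chunk_meta, **user_meta})
    return merged


# Two documents with identical filename and binary_hash stay distinguishable.
extracted = [
    [{"filename": "report.pdf", "binary_hash": 123}],
    [{"filename": "report.pdf", "binary_hash": 123}],
]
chunks = merge_user_meta(extracted, [{"source_id": "a"}, {"source_id": "b"}])
```

Because the merge happens while the chunk-to-document association is still known, the ambiguity described in the issue never arises.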
