Merged Cells in Excel #1939

Open
@paul-yangmy

Description

Question

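  1. When I use the following code to parse an Excel file:
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, ExcelFormatOption
    from docling.pipeline.simple_pipeline import SimplePipeline

    converter = DocumentConverter(
        allowed_formats=[InputFormat.XLSX],
        format_options={InputFormat.XLSX: ExcelFormatOption(pipeline_cls=SimplePipeline)},
    )
    and the sheet contains merged cells in the header, the parser treats the sheet as multiple tables. Is there any way to make it produce exactly one Markdown table per sheet?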
  2. Also, is there a parameter for parsing only a particular sheet, rather than every sheet in the workbook?
  3. The JSON output of the export_to_dict function is rather complex. Could you provide a brief explanation of what each field represents? My goal is to extract tables from Excel in a row-based JSON format, but currently I can only achieve this by exporting to Markdown and then splitting the content on blank lines :(
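
For context, the Markdown workaround mentioned in item 3 could be sketched roughly as follows. This is only an illustration using the Python standard library: `markdown_tables_to_rows` is a hypothetical helper name, not part of the library, and it assumes plain pipe tables with a header row, a separator row, no escaped `|` characters inside cells, and unique column names (duplicate headers, which merged cells may produce, would overwrite each other in the row dicts).

```python
import re


def _split_row(line: str) -> list[str]:
    """Split one Markdown table line into trimmed cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def markdown_tables_to_rows(markdown: str) -> list[list[dict[str, str]]]:
    """Return one list of row dicts per pipe table found in the Markdown text.

    Hypothetical helper: blocks are separated by blank lines, and any block
    with fewer than two table lines is skipped as non-table content.
    """
    tables = []
    for block in re.split(r"\n\s*\n", markdown):
        lines = [ln for ln in block.splitlines() if ln.strip().startswith("|")]
        if len(lines) < 2:
            continue  # needs at least a header row and a separator row
        cells = [_split_row(ln) for ln in lines]
        header = cells[0]
        body = cells[2:]  # cells[1] is assumed to be the |---|---| separator
        tables.append([dict(zip(header, row)) for row in body])
    return tables
```

Calling `json.dumps(markdown_tables_to_rows(markdown_text), ensure_ascii=False)` on the exported Markdown would then give row-based JSON, one array of row objects per table.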


Labels: question (Further information is requested)
