Complete text in rows #231

pankpy · 2024-11-04T17:38:12Z

Thank you for the initiative. I am using it for table extraction, and it returns tables/dataframes as expected. However, in some rows the cell text is incomplete, or the text is split across multiple lines. Is there any way to fix this?

cau-git · 2024-11-05T09:01:33Z

@pankpy Could you please provide an example to illustrate the behaviour? Thanks.

pankpy · 2024-11-05T12:07:37Z

Thank you. Please find the attached files.

from docling.datamodel.base_models import InputFormat
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # Not using scanned documents
pipeline_options.do_table_structure = True

doc_converter = DocumentConverter(  # all of the below is optional, has internal defaults.
    allowed_formats=[
        InputFormat.PDF,
        InputFormat.IMAGE,
        InputFormat.DOCX,
        InputFormat.HTML,
        InputFormat.PPTX,
    ],  # whitelist formats, non-matching files are ignored.
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,  # pipeline options go here.
            backend=PyPdfiumDocumentBackend,  # optional: pick an alternative backend
        ),
        InputFormat.DOCX: WordFormatOption(
            pipeline_cls=SimplePipeline  # default for office formats and HTML
        ),
    },
)

###############

# Raw string so the backslashes in the Windows path are not read as escape sequences.
result = doc_converter.convert(r"E:\zPankaj\Sample.pdf")  # previously convert_single
print(result.document.export_to_markdown())

print('VERIFY RESULT', result.document)
print('RESULT TYPE', type(result.document))

# Export every detected table to a DataFrame and save it as an Excel file.
for i, table in enumerate(result.document.tables):
    df = table.export_to_dataframe()
    print(df)
    df.to_excel(f'Output Sample_S df_{i}.xlsx')

Sample.pdf
Output Sample_S df_0.xlsx
Output Sample_S df_1.xlsx

cau-git added the table structure label Nov 6, 2024
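
To help illustrate the behaviour requested above, here is a minimal sketch that lists which cells of an exported DataFrame contain line breaks and, optionally, collapses them into single spaces. It uses only pandas. The helper names `find_multiline_cells` and `collapse_line_breaks` and the `sample` DataFrame are hypothetical stand-ins for the output of `table.export_to_dataframe()`, and it is an assumption that the split text shows up as literal newline characters inside a cell.

```python
import pandas as pd


def find_multiline_cells(df: pd.DataFrame) -> list[tuple[int, str]]:
    """Return (row position, column name) for each string cell containing a line break."""
    hits = []
    for row_pos, (_, row) in enumerate(df.iterrows()):
        for col, value in row.items():
            if isinstance(value, str) and ("\n" in value or "\r" in value):
                hits.append((row_pos, str(col)))
    return hits


def collapse_line_breaks(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which whitespace runs (including line breaks) become one space."""

    def _clean(value):
        # Leave non-string cells (numbers, NaN) untouched.
        if isinstance(value, str):
            return " ".join(value.split())
        return value

    return df.apply(lambda column: column.map(_clean))


# Hypothetical data standing in for one exported table.
sample = pd.DataFrame(
    {
        "Item": ["Widget", "Gadget"],
        "Description": ["Long text that\nwraps onto a second line", "Short text"],
    }
)

multiline = find_multiline_cells(sample)
cleaned = collapse_line_breaks(sample)
```

This only tidies text that was extracted but wrapped. It cannot restore text that is missing from a cell altogether, which would have to be addressed in the conversion step itself.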
