Complete text in rows #231

Open
pankpy opened this issue Nov 4, 2024 · 2 comments

pankpy commented Nov 4, 2024

Thank you for the initiative. I am using it for table extraction, and it returns tables/dataframes as expected. However, in some rows the cell text is incomplete, and in others the text is split across multiple lines. Is there any way to fix this?

Contributor

cau-git commented Nov 5, 2024

@pankpy Could you please provide an example to illustrate the behaviour? Thanks.

Author

pankpy commented Nov 5, 2024

Thank you. Please find attached files.

from docling.datamodel.base_models import InputFormat
from docling.document_converter import (
    DocumentConverter,
    PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # Not using scanned documents
pipeline_options.do_table_structure = True

doc_converter = DocumentConverter(  # all of the below is optional, has internal defaults.
    allowed_formats=[
        InputFormat.PDF,
        InputFormat.IMAGE,
        InputFormat.DOCX,
        InputFormat.HTML,
        InputFormat.PPTX,
    ],  # whitelist formats, non-matching files are ignored.
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,  # pipeline options go here.
            backend=PyPdfiumDocumentBackend,  # optional: pick an alternative backend
        ),
        InputFormat.DOCX: WordFormatOption(
            pipeline_cls=SimplePipeline  # default for office formats and HTML
        ),
    },
)

###############

# Raw string so the backslashes in the Windows path are not read as escape sequences.
result = doc_converter.convert(r"E:\zPankaj\Sample.pdf")  # previously convert_single
print(result.document.export_to_markdown())

print('VERIFY RESULT', result.document)
print('RESULT TYPE', type(result.document))

# Export every detected table to a DataFrame and save it as an Excel file.
for i, table in enumerate(result.document.tables):
    df = table.export_to_dataframe()
    print(df)
    df.to_excel(f'Output Sample_S df_{i}.xlsx')
Sample.pdf
Output Sample_S df_0.xlsx
Output Sample_S df_1.xlsx
Pycharm_prints
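
For the rows where the text comes back split across several lines, one possible post-processing workaround is to collapse the line breaks inside each cell after export. The sketch below is only an illustration under assumptions: `normalize_cell` and `normalize_rows` are hypothetical helper names, not part of docling, and the sample rows are made up rather than taken from Sample.pdf. It cannot recover text that is missing from a row; it only rejoins text that was extracted but wrapped.

```python
import re


def normalize_cell(value):
    """Collapse line breaks and runs of whitespace in one cell into single spaces."""
    if not isinstance(value, str):
        return value  # leave numbers, None, and other non-string values untouched
    return re.sub(r"\s+", " ", value).strip()


def normalize_rows(rows):
    """Apply normalize_cell to every cell of a table given as a list of rows."""
    return [[normalize_cell(cell) for cell in row] for row in rows]


# Made-up sample data standing in for an exported table (hypothetical values).
rows = [
    ["Item", "Description"],
    ["A-1", "Long text that\nwraps onto a\nsecond line"],
    ["A-2", 42],
]
cleaned = normalize_rows(rows)
print(cleaned)
```

If the tables are held as pandas DataFrames, as in the script above, the same helper could presumably be applied column by column with `df = df.apply(lambda col: col.map(normalize_cell))` before calling `df.to_excel(...)`.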
