Use Max Num Pages also in MS Word and Powerpoint #689

Closed
JamMaster1999 opened this issue Jan 7, 2025 · 9 comments
Labels
enhancement New feature or request

Comments

@JamMaster1999

Bug

I am setting max_num_pages in my convert method but it proceeds to perform on all the pages.

Steps to reproduce

```python
import os
from pathlib import Path
from typing import Optional


def init_docling(
    input_file: str,
    output_folder: str,
    max_pages: Optional[int] = None,
):
    from docling.document_converter import DocumentConverter, PdfFormatOption, WordFormatOption

    # Create output directories
    os.makedirs(output_folder, exist_ok=True)

    converter = DocumentConverter()
    # Note: the limit is hardcoded to 3 here; max_pages is not used
    result = converter.convert(input_file, max_num_pages=3)
    # Add debug prints
    print(f"Number of pages in result: {len(result.document.pages)}")
    print(f"Result object: {result}")
    # Save result
    doc_filename = os.path.splitext(os.path.basename(input_file))[0]
    md_path = Path(os.path.join(output_folder, f"{doc_filename}.md"))
    result.document.save_as_markdown(md_path)
```

Docling version

2.14.0

Python version

3.12.8

@JamMaster1999 JamMaster1999 added the bug Something isn't working label Jan 7, 2025
@trinanjan12

@JamMaster1999 I don't think max_num_pages is used to load the PDF only up to that number of pages. The reason it errors out is this condition in document.py: `if not self.page_count <= self.limits.max_num_pages`. Otherwise, the pipeline runs on all the pages.

@JamMaster1999
Author

Thanks @trinanjan12. So I have to slice the file beforehand? Also, this is for DOCX; I haven't tried PDF.

@trinanjan12

@JamMaster1999 I guess so, for now.

@JamMaster1999
Author

JamMaster1999 commented Jan 7, 2025 via email

@trinanjan12

Maybe @dolfim-ibm can comment on this.

@dolfim-ibm
Contributor

The max_num_pages option is currently designed to skip documents which are longer than the provided value, not to truncate them. This is currently only used in the PDF pipeline.
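
To make the difference between skipping and truncating concrete, here is a minimal sketch in plain Python. It does not use docling: `Limits`, `convert_skip`, and `convert_truncate` are hypothetical names for illustration only, and the check merely mirrors the `page_count <= max_num_pages` condition quoted earlier in this thread.

```python
import sys
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Limits:
    # Hypothetical stand-in for a converter's document limits.
    max_num_pages: int = sys.maxsize


def convert_skip(pages: List[str], limits: Limits) -> Optional[List[str]]:
    """Skip semantics: reject the whole document if it is too long."""
    if not len(pages) <= limits.max_num_pages:
        return None  # the document is skipped, nothing is converted
    return pages


def convert_truncate(pages: List[str], limits: Limits) -> List[str]:
    """Truncate semantics (what this issue asks for): keep the first N pages."""
    return pages[: limits.max_num_pages]


pages = ["p1", "p2", "p3", "p4", "p5"]
limits = Limits(max_num_pages=3)
skipped = convert_skip(pages, limits)          # 5 pages > 3, so skipped
truncated = convert_truncate(pages, limits)    # only the first 3 pages
```

Under skip semantics a five-page document with a limit of 3 yields no result at all, while truncate semantics would return its first three pages.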

@dolfim-ibm
Contributor

I will flag this issue as a feature request for using the parameter also in the other backends.

@dolfim-ibm dolfim-ibm added enhancement New feature or request and removed bug Something isn't working labels Jan 7, 2025
@dolfim-ibm dolfim-ibm changed the title Max Num Pages is not working Use Max Num Pages also in MS Word and Powerpoint Jan 7, 2025
@JamMaster1999
Author

@dolfim-ibm Got it! Is there a way to make max_num_pages truncate documents instead? I am thinking of something more like a page range, where it processes a specific set of pages in the document rather than skipping the document entirely.
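
As a rough illustration of the page-range idea, the sketch below selects a 1-based, inclusive range from a list of already-split pages. `select_page_range` is a hypothetical helper, not a docling API, and it assumes the pages are available as a plain list.

```python
from typing import List, Tuple


def select_page_range(pages: List[str], page_range: Tuple[int, int]) -> List[str]:
    """Return the pages in a 1-based, inclusive (start, end) range.

    Hypothetical helper for illustration; an end beyond the last page
    is clamped to the document length.
    """
    start, end = page_range
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {page_range}")
    # Convert the 1-based inclusive range to a 0-based half-open slice.
    return pages[start - 1 : end]


pages = ["p1", "p2", "p3", "p4"]
middle = select_page_range(pages, (2, 3))
tail = select_page_range(pages, (3, 10))
```

With this shape, truncation to the first N pages is just the special case `(1, N)`.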

@cau-git
Contributor

cau-git commented Jan 30, 2025

@JamMaster1999 We will track this in a clean new feature request here: #845

@cau-git cau-git closed this as completed Jan 30, 2025
4 participants