
Add option to skip corrupt PDFs in PDFMergerUtility with improved exception handling #208


Open: wants to merge 2 commits into base: trunk

SwethaMuthuvel commented Jul 4, 2025

What This PR Does

This pull request improves the robustness and debuggability of PDFMergerUtility by:

  1. Adding a skipCorruptFiles flag

    • Allows users to skip unreadable or corrupt PDF files during merge.
    • Default behavior remains unchanged (i.e., throws on error).
  2. Wrapping IOException with source context

    • Converts vague errors like:
      IOException: Could not parse object stream
      
      into more useful messages like:
      IOException: Failed to load PDF from source: /path/to/file.pdf
      
    • Helps identify exactly which file failed.
  3. Applying both changes consistently in both merge modes

    • optimizedMergeDocuments(...)
    • legacyMergeDocuments(...)
    • A warning is logged whenever a file is skipped.
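
The behavior described in the list above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: it does not use PDFBox, and `SkipCorruptSketch`, `SourceLoader`, and `loadAll` are hypothetical names standing in for the merger and the PDF loading step. Only the flag name `skipCorruptFiles`, its setter, and the wrapped message text come from the description.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical sketch; it only mirrors the behavior described in the PR text.
class SkipCorruptSketch {
    private static final Logger LOG =
            Logger.getLogger(SkipCorruptSketch.class.getName());

    /** Hypothetical stand-in for the step that parses one PDF source. */
    interface SourceLoader {
        String load(String source) throws IOException;
    }

    // Default stays false so existing callers still fail on the first error.
    private boolean skipCorruptFiles = false;

    void setSkipCorruptFiles(boolean skipCorruptFiles) {
        this.skipCorruptFiles = skipCorruptFiles;
    }

    /** Loads every source, either failing fast or skipping unreadable ones. */
    List<String> loadAll(List<String> sources, SourceLoader loader)
            throws IOException {
        List<String> loaded = new ArrayList<>();
        for (String source : sources) {
            try {
                loaded.add(loader.load(source));
            } catch (IOException e) {
                // Wrap with the source name and keep the original as the cause.
                IOException wrapped = new IOException(
                        "Failed to load PDF from source: " + source, e);
                if (!skipCorruptFiles) {
                    throw wrapped;
                }
                LOG.warning("Skipping unreadable source: " + source
                        + " (" + e.getMessage() + ")");
            }
        }
        return loaded;
    }
}
```

Passing the original exception as the cause keeps the low-level parser message available in the stack trace, while the outer message names the failing file.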

Why This Helps

  • Improves debuggability — pinpoints which file caused the failure.
  • Makes batch operations resilient — avoids total failure from one bad input.
  • Scales better — suitable for bulk merging scenarios.
  • Does not break existing behavior — opt-in via setSkipCorruptFiles(true).

Swetha Muthuvel added 2 commits July 4, 2025 13:05
- Removed duplicate LOG.info calls from optimized and legacy merge methods.
- Introduced shared field 'lastMergeSkippedCount' to track skipped corrupt PDFs.
- Log merge summary once from mergeDocuments(), improving clarity and avoiding redundant output.
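
The commit notes above describe a shared counter and a single summary log. The sketch below shows one way that pattern could look; it is an assumption-laden illustration, not the committed code. Only the field name `lastMergeSkippedCount` and the method name `mergeDocuments()` come from the commit message. The class, the getter, the `readable` predicate, and the private helpers are hypothetical.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.logging.Logger;

// Hypothetical sketch of "count in the merge paths, log once in the caller".
class MergeSummarySketch {
    private static final Logger LOG =
            Logger.getLogger(MergeSummarySketch.class.getName());

    // Shared by both merge paths; reset at the start of every merge.
    private int lastMergeSkippedCount;

    int getLastMergeSkippedCount() {
        return lastMergeSkippedCount;
    }

    /** Entry point: dispatches to one merge path, then logs one summary. */
    void mergeDocuments(List<String> sources, Predicate<String> readable,
                        boolean optimized) {
        lastMergeSkippedCount = 0;
        if (optimized) {
            optimizedMerge(sources, readable);
        } else {
            legacyMerge(sources, readable);
        }
        // The only summary log; the merge paths themselves stay silent.
        LOG.info("Merged " + (sources.size() - lastMergeSkippedCount)
                + " of " + sources.size() + " sources, skipped "
                + lastMergeSkippedCount);
    }

    private void optimizedMerge(List<String> sources, Predicate<String> readable) {
        countSkipped(sources, readable);
    }

    private void legacyMerge(List<String> sources, Predicate<String> readable) {
        countSkipped(sources, readable);
    }

    // Placeholder for real merge work: only tallies unreadable sources.
    private void countSkipped(List<String> sources, Predicate<String> readable) {
        for (String source : sources) {
            if (!readable.test(source)) {
                lastMergeSkippedCount++;
            }
        }
    }
}
```

Resetting the counter at the top of `mergeDocuments()` keeps a previous run's total from leaking into the next summary when the same instance is reused.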
lehmi (Contributor) commented Jul 4, 2025

Please reformat the code first using our formatter rules so that your proposed changes are easier to evaluate.

THausherr (Contributor) commented

I'm wondering what the use case of this change would be. Wouldn't the target file be worthless if parts of the source are missing?

Is this for a school / university project, or is this part of an AI training / evaluation?
