Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Docling version and improve OCR options handling with new docling ver. #574

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

eshwarprasadS
Copy link
Contributor

This PR intends to bump the version of docling from docling>=2.4.2,<=2.8.3 to docling>=2.18.0. This is to bring in the fix for the particular docling chunking failure issue on markdowns with unescaped special characters (docling-project/docling#823)

The primary changes are:

  • updates to requirements.txt
  • updates to CI environment handling in tox.ini and chunkers.py
  • Removing legacy patterns using bare docling.parse in taxonomy.py, since the pdf parsed doc content is not necessary to be passed to DocumentChunker

@mergify mergify bot added CI/CD Affects CI/CD configuration testing Relates to testing ci-failure dependencies Pull requests that update a dependency file labels Mar 18, 2025
Signed-off-by: eshwarprasadS <[email protected]>
@mergify mergify bot added ci-failure and removed ci-failure labels Mar 18, 2025
@mergify mergify bot removed the ci-failure label Mar 18, 2025
Copy link
Member

@aakankshaduggal aakankshaduggal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mergify mergify bot added the one-approval label Mar 19, 2025
Comment on lines +185 to +186
if filepaths:
return filepaths
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this change and it makes sense, but I'm surprised the existing unit tests didn't break. The difference here is we're returning 1 variable rather than 2. Usually Python complains when the number of expected outputs from a function doesn't match the number of outputs it actually received.

It sounds like our existing unit tests don't have enough coverage. I'm a bit concerned that we might run into unexpected behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dug a little deeper and it looks like the existing logic is technically unaffected by this change, but can we add a unit test to validate the outputs of this function?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI/CD Affects CI/CD configuration dependencies Pull requests that update a dependency file one-approval testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants