Add OCR fallback for scanned/non-searchable PDFs (#1156) #1268
base: main
Conversation
@microsoft-github-policy-service agree
Thanks for the contribution. This looks promising. Let me do some testing. NOTE: I'm not sure we should throw a dependency error if no text is found. What if the PDF just doesn't have text?
I think that scenario will be rare, mostly something like the last page of a PDF. In 99% of cases, a PDF with no extractable text holds its content in non-extractable form, such as images or charts.
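One way to address the concern raised above, sketched under assumptions: treat missing OCR dependencies as a soft failure and return the (possibly empty) extracted text instead of raising. The names `extract_with_fallback`, `extract_text`, and `run_ocr` are hypothetical and are not part of markitdown; the callables are injected so the sketch runs without pdfminer or pytesseract installed.

```python
from typing import Callable, Optional


def extract_with_fallback(
    data: bytes,
    extract_text: Callable[[bytes], str],
    run_ocr: Optional[Callable[[bytes], str]] = None,
) -> str:
    """Return embedded text if present, otherwise try OCR when it is available.

    `extract_text` stands in for pdfminer's text extraction and `run_ocr` for
    an OCR pipeline. Both are hypothetical injection points for illustration.
    """
    text = extract_text(data)
    if text and text.strip():
        return text  # searchable PDF: nothing else to do

    if run_ocr is None:
        # OCR is not installed. An empty result is a valid answer for a PDF
        # that simply has no text, so do not raise a dependency error here.
        return ""

    return run_ocr(data)
```

With this shape, a dependency error would only be appropriate if the caller explicitly asked for OCR. Whether that trade-off suits the converter is a design question for the maintainers.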
I had thought about a very similar feature, but leveraging an optional
file_stream.seek(0)
text = pdfminer.high_level.extract_text(file_stream)
if text and text.strip():
    return DocumentConverterResult(markdown=text)

# If no text found, fall back to OCR
if _ocr_dependency_exc_info is not None:
    raise MissingDependencyException(
        "OCR dependencies are missing. Please install pytesseract and pdf2image for OCR support."
    ) from _ocr_dependency_exc_info[1].with_traceback(_ocr_dependency_exc_info[2])

file_stream.seek(0)
images = convert_from_bytes(file_stream.read())
ocr_text = []
for img in images:
    ocr_text.append(pytesseract.image_to_string(img))
ocr_output = "\n\n".join(ocr_text)
return DocumentConverterResult(markdown=ocr_output)
There is an issue with PDFs that contain both mineable text and images that contain text: the all-or-nothing check above skips OCR as soon as any text is found. It would be nice to have a more sophisticated branching mechanism that accounts for this, and/or an API that lets the markitdown caller override the behavior.
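A minimal sketch of what such branching could look like, assuming text is available per page and OCR can be run per page. `OcrMode`, `merge_pages`, and the `ocr_page` callback are hypothetical names for illustration, not existing markitdown APIs; the OCR step is injected so the example runs on its own.

```python
from enum import Enum
from typing import Callable, List


class OcrMode(Enum):
    AUTO = "auto"      # OCR only pages without extractable text
    ALWAYS = "always"  # OCR every page and append it to the mined text
    NEVER = "never"    # never run OCR


def merge_pages(
    mined_pages: List[str],
    ocr_page: Callable[[int], str],
    mode: OcrMode = OcrMode.AUTO,
) -> str:
    """Combine per-page mined text with OCR output according to `mode`."""
    out = []
    for index, mined in enumerate(mined_pages):
        mined = mined.strip()
        if mode is OcrMode.NEVER:
            page = mined
        elif mode is OcrMode.ALWAYS:
            ocr = ocr_page(index).strip()
            page = "\n".join(part for part in (mined, ocr) if part)
        else:  # AUTO: fall back to OCR only when the page has no mined text
            page = mined if mined else ocr_page(index).strip()
        out.append(page)
    # Skip pages that ended up empty so the output has no blank gaps.
    return "\n\n".join(page for page in out if page)


pages = ["text one", "   ", "text three"]
print(merge_pages(pages, lambda i: f"ocr {i}"))
```

Note that `ALWAYS` would likely duplicate text on pages whose mined text is also visible to the OCR engine, so a real implementation would probably OCR only the embedded images or deduplicate the result.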
Description
Added OCR support to the PDF converter to handle scanned and non-searchable PDF files. When a PDF does not contain extractable text, the converter will now use OCR (via pytesseract and pdf2image) to extract text content from the PDF images.
Changes
PdfConverter now first attempts text extraction with pdfminer, as before.
Example Usage
Related Issues
Closes #1156 — Pdf file conversion not working when pdf file is non scanable