About text extraction from scanned invoice pdf #1450
Tesseract can struggle with text that is low contrast compared to its background.
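To illustrate what "low contrast" means in pixel terms, here is a minimal sketch of a linear contrast stretch on 8-bit grayscale values. It is pure Python and works on a plain list of numbers; `stretch_contrast` is a hypothetical helper for illustration, not something OCRmyPDF or Tesseract provides, and real preprocessing would normally be done with an imaging library on the page image before OCR.

```python
def stretch_contrast(pixels):
    """Linearly rescale 8-bit grayscale values so they span the full 0-255 range."""
    if not pixels:
        return []
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A flat image has no contrast to stretch; return it unchanged.
        return list(pixels)
    span = hi - lo
    # Integer arithmetic keeps the result deterministic.
    return [(p - lo) * 255 // span for p in pixels]


# Faint gray text on a light background: the values sit close together.
faint = [180, 197, 231]
print(stretch_contrast(faint))  # [0, 85, 255]
```

After stretching, the darkest and lightest values are pushed to 0 and 255, which is the kind of separation between text and background that a binarizing OCR engine tends to handle better.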
Hello,
I am having trouble extracting text with OCRmyPDF. It extracts most of the details from the scanned invoice, but not all of the text; some of it is left behind. I have tried several ways to get the missing text extracted, but none have worked.
This is the code:
import os
import subprocess
import ocrmypdf

# Step 1: Upload PDF
pdf_filename = input("Enter the full path of the PDF file: ")

def convert_to_grayscale(input_pdf, output_pdf):
    """Convert a PDF to grayscale using Ghostscript."""
    try:
        subprocess.run(
            [
                "gswin64c",  # Windows-compatible Ghostscript executable
                "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4", "-dNOPAUSE",
                "-dBATCH", "-sOutputFile=" + output_pdf, "-sColorConversionStrategy=Gray",
                "-dProcessColorModel=/DeviceGray", "-dDownsampleGrayImages=false",
                "-dJPEGQ=100", "-dAutoFilterGrayImages=false", "-dGrayImageFilter=/FlateEncode",
                "-dDownsampleColorImages=false", "-dDownsampleGrayImages=false", "-dJPEGQ=95",
                "-dAutoFilterGrayImages=true", "-dUseCIEColor=false", "-dMaxBitmap=50000000",
                "-dCompressFonts=true", "-dEmbedAllFonts=true", "-dSubsetFonts=true",
                "-dUseArtBox=true", input_pdf
            ],
            check=True
        )
        print(f"Grayscale conversion completed: {output_pdf}")
    except subprocess.CalledProcessError as e:
        print(f"Error during Ghostscript grayscale conversion: {e}")

# Step 2: Convert PDF to grayscale
grayscale_pdf_filename = "grayscale_output.pdf"  # Output grayscale PDF file name
print("Converting PDF to grayscale using Ghostscript...")
convert_to_grayscale(pdf_filename, grayscale_pdf_filename)

# Step 3: Add OCR layer using OCRmyPDF
output_pdf_filename = "ocr_output.pdf"  # Output OCR PDF file name
print("Running OCRmyPDF to add OCR layer to the grayscale PDF...")
try:
    ocrmypdf.ocr(
        grayscale_pdf_filename,
        output_pdf_filename,
        tesseract_config="--psm 3 --oem 3",  # Tesseract configurations
        lang=None,  # Default language
        rotate_pages=True,  # Auto-rotate pages
        deskew=True,  # Deskew pages
        image_dpi=300,  # Image DPI for output
        jpeg_quality=95,  # JPEG quality
        optimize=3,  # Optimize PDF
        compress_text=False,  # Avoid compressing text
        force_ocr=True,  # Force OCR on all pages
        remove_background=False,  # Preserve background
        clean=True  # Clean up the output
    )
    print("OCR processing completed.")
except Exception as e:
    print(f"Error during OCR processing: {e}")

# Step 4: Extract text using pdftotext
output_text_filename = "extracted_text.txt"  # Path for extracted text file
print("Extracting text using pdftotext...")
try:
    # Run pdftotext to extract text
    subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", output_pdf_filename, output_text_filename],
        check=True
    )
    print(f"Text extraction completed: {output_text_filename}")
except subprocess.CalledProcessError as e:
    print(f"Error during text extraction: {e}")

# Step 5: Display extracted text
if os.path.exists(output_text_filename):
    with open(output_text_filename, 'r', encoding='utf-8') as file:
        extracted_text = file.read()
    print("Extracted Text:")
    print(extracted_text)
else:
    print(f"Error: {output_text_filename} not found.")
It is not extracting the "Faktura VAT" part.
This is the extracted text:
Sprzedawca
PPHU NATHALIE-MEBLE.PL
Natalia Pietrus nr 39/2024/WDTTR
Trebaczéw 69
63-642 Perzéw
NIP: PL 6192020532 Data wystawienia:
Lo.
20.11.2024
Data dostawy / wykonania ustugi: 16.11.2024
Strona: W/
Bank: Santander Bank Polska SA Nr rachunku: PL39 1090 1144 0000 0001 3016 3134
Kod SWIFT: WBKPPLPP
Nabywea: Odbiorea:
(DFDS LOGISTICS B.V.) (DFDS LOGISTICS B.V.)
BURGEMEESTER VAN LIERPLEIN 57 BURGEMEESTER VAN LIERPLEIN 57
3134 ZB VLAARDINGEN , THE NETHERLANDS 3134 ZB VLAARDINGEN , THE NETHERLANDS
NIP: NL 801283929B02 NIP: NL 801283929B02
Opis: — zalgezniki:
[tp. Nazwa towaru/ustugi Kod CN/ PKWiU Tlosé
| Transport-zlecenie nr/ booking number: 1 sat _* 950,00 950,00
delforw1 15296434
przelew 05.01.2025 950,00 EUR 4 103,81
4 103,81 = 4 103,81
(B) Comarch ERP Optima, v. 2024.5,1.1941, nr klucza 5000034062
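One way to confirm mechanically which expected labels are absent from OCR output like the text above is a fuzzy search that tolerates small recognition errors (such as "Nabywea" for "Nabywca"). The sketch below uses only the standard library; `best_ratio`, `missing_labels`, and the 0.75 threshold are assumptions made for illustration, not part of OCRmyPDF.

```python
from difflib import SequenceMatcher


def best_ratio(label, text):
    """Return the highest similarity between `label` and any word window in `text`."""
    label_cf = label.casefold()
    n = len(label.split())
    best = 0.0
    for line in text.splitlines():
        # Strip trailing punctuation so "Nabywea:" compares as "Nabywea".
        words = [w.strip(".,:;") for w in line.split()]
        if not words:
            continue
        # Slide a window of as many words as the label has across the line.
        for i in range(max(1, len(words) - n + 1)):
            window = " ".join(words[i:i + n]).casefold()
            best = max(best, SequenceMatcher(None, label_cf, window).ratio())
    return best


def missing_labels(labels, text, threshold=0.75):
    """List the labels for which no sufficiently similar text was found."""
    return [label for label in labels if best_ratio(label, text) < threshold]


sample = "Sprzedawca\nNabywea: Odbiorea:\nData wystawienia:"
print(missing_labels(["Sprzedawca", "Nabywca", "Faktura VAT"], sample))
# ['Faktura VAT']
```

A check like this makes it easier to compare preprocessing or OCR settings, because each run reports the same list of labels that were or were not found.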
Please help me with a fix or a solution.
This is the PDF file:
file.pdf
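For anyone trying to reproduce this, a hedged sketch of the OCR step follows. As far as I can tell from the OCRmyPDF Python API documentation, the language is passed as `language` (the script above uses `lang`), the page segmentation and engine modes have their own `tesseract_pagesegmode` and `tesseract_oem` arguments, and `tesseract_config` expects Tesseract configuration file paths rather than command-line flags. The choice of `pol` plus `eng` is an assumption based on the Polish labels in the invoice and requires those Tesseract language packs to be installed; `build_ocr_options` and `run_ocr` are hypothetical helper names. Whether this recovers the "Faktura VAT" heading is not verified here.

```python
def build_ocr_options(languages=("pol", "eng")):
    """Collect keyword arguments for ocrmypdf.ocr() under its documented names."""
    return {
        "language": list(languages),       # assumption: Polish invoice, so "pol" is added
        "rotate_pages": True,              # auto-rotate pages
        "deskew": True,                    # straighten skewed scans
        "force_ocr": True,                 # rasterize and OCR every page
        "tesseract_pagesegmode": 3,        # equivalent of "--psm 3"
        "tesseract_oem": 3,                # equivalent of "--oem 3"
        "sidecar": "extracted_text.txt",   # also write the recognized text to a file
    }


def run_ocr(input_pdf, output_pdf, **overrides):
    """Run OCRmyPDF with the options above; extra keyword arguments override them."""
    import ocrmypdf  # imported lazily so the options can be inspected without it

    options = {**build_ocr_options(), **overrides}
    return ocrmypdf.ocr(input_pdf, output_pdf, **options)


if __name__ == "__main__":
    # Print the options instead of running OCR, so this works without OCRmyPDF installed.
    for name, value in build_ocr_options().items():
        print(f"{name} = {value!r}")
```

Calling `run_ocr("grayscale_output.pdf", "ocr_output.pdf")` would then replace the `ocrmypdf.ocr(...)` call in the script, and the sidecar file gives the recognized text directly without a separate pdftotext step.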