Form-Filled PDF extractions #673

Open
jackdorney1999 opened this issue Jan 3, 2025 · 4 comments
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

@jackdorney1999

Question

How can I ensure that form-filled data is present in the images of the PDF pages?

Hi there,

I am using Docling as part of an attribute extraction framework, and I need to handle attributes that were entered into fillable PDF forms. I have seen that the form-filled data can be extracted when exporting to Markdown, using the following pipeline options in my Python implementation:

# Set up pipeline options with the given resolution
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.images_scale = resolution
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
pipeline_options.ocr_options = RapidOcrOptions()

# Initialize document converter
doc_converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

# Convert the input file
conversion_result = doc_converter.convert(input_file)

# Save the JSON representation of the document
docling_doc = conversion_result.document
json_output_path = os.path.join(docling_folder, "doc.json")
with open(json_output_path, "w") as fp:
    fp.write(json.dumps(docling_doc.export_to_dict()))

# Save the Markdown file
markdown_content = conversion_result.document.export_to_markdown(image_mode='EMBEDDED')
markdown_output_path = os.path.join(markdown_folder, f"{pdf_name}.md")
with open(markdown_output_path, "w") as fp:
    fp.write(markdown_content)

# Save images for each page
for page_no, page in conversion_result.document.pages.items():
    page_image_filename = os.path.join(image_folder, f"{pdf_name}-page-{page_no}.png")
    with open(page_image_filename, "wb") as fp:
        page.image.pil_image.save(fp, format="PNG")

I have found that:

pipeline_options.table_structure_options.do_cell_matching = True

means the form-filled data is present in the Markdown (even though the form-filled part of this PDF is not a table).

However, when I export images of the PDF pages, the form-filled data is missing from them, so none of the attributes I want to extract are visible.

Is there a way to ensure that the form-filled data appears in the page images? Are there pipeline parameters that would enable this?

Thanks
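
When debugging a case like this, it can help to first confirm that the values really live in interactive form fields rather than in the page content itself. The following is only a rough sketch, independent of Docling: the helper name `has_acroform_marker` is hypothetical, and the check is a heuristic that looks for the `/AcroForm` key in the raw bytes. It can return a false negative when the document catalog is stored inside a compressed object stream.

```python
def has_acroform_marker(pdf_bytes: bytes) -> bool:
    """Heuristic check for an interactive form dictionary in a PDF.

    Returns True if the literal key "/AcroForm" occurs in the raw bytes.
    Assumption: the document catalog is not hidden in a compressed
    object stream; if it is, this check can miss the form.
    """
    return b"/AcroForm" in pdf_bytes


# Typical use (path is illustrative):
#   from pathlib import Path
#   has_acroform_marker(Path("sample_pdf.pdf").read_bytes())
```

If the marker is present, the filled-in values are most likely stored as field values and widget appearances, which a renderer may or may not draw onto the page image.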

jackdorney1999 added the question (Further information is requested) label Jan 3, 2025
@cau-git
Contributor

cau-git commented Jan 6, 2025

@jackdorney1999 Hi, can you please attach an example document and the minimal code to reproduce your issue? Thanks.

@jackdorney1999
Author

jackdorney1999 commented Jan 7, 2025

@cau-git here is the code that I am using for this:

# Standard library imports
import json
import logging
import os
import time
from pathlib import Path

# Docling imports
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import ImageRefMode, PictureItem, TableItem, DoclingDocument

def extraction_pipeline_rapid_ocr(input_file, output_dir, resolution):
    start_time = time.time()

    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)

    # Get the name of the PDF (without extension) for folder structure
    pdf_name = Path(input_file).stem
    pdf_output_dir = os.path.join(output_dir, pdf_name)

    # Create nested folders for the specific PDF with resolution
    docling_folder = os.path.join(pdf_output_dir, f'DoclingDocument_{resolution}')
    markdown_folder = os.path.join(pdf_output_dir, f'Markdown_{resolution}')
    image_folder = os.path.join(pdf_output_dir, f'Images_{resolution}')

    os.makedirs(docling_folder, exist_ok=True)
    os.makedirs(markdown_folder, exist_ok=True)
    os.makedirs(image_folder, exist_ok=True)

    # Set up pipeline options with the given resolution
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.images_scale = resolution
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
    pipeline_options.ocr_options = RapidOcrOptions()

    # Initialize document converter
    doc_converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )

    # Convert the input file
    conversion_result = doc_converter.convert(input_file)

    # Save the JSON representation of the document
    docling_doc = conversion_result.document
    json_output_path = os.path.join(docling_folder, "doc.json")
    with open(json_output_path, "w") as fp:
        fp.write(json.dumps(docling_doc.export_to_dict()))

    # Save the Markdown file
    markdown_content = conversion_result.document.export_to_markdown(image_mode='EMBEDDED')
    markdown_output_path = os.path.join(markdown_folder, f"{pdf_name}.md")
    with open(markdown_output_path, "w") as fp:
        fp.write(markdown_content)

    # Save images for each page
    for page_no, page in conversion_result.document.pages.items():
        page_image_filename = os.path.join(image_folder, f"{pdf_name}-page-{page_no}.png")
        with open(page_image_filename, "wb") as fp:
            page.image.pil_image.save(fp, format="PNG")


    elapsed = time.time() - start_time
    logging.info(f"Document converted and files exported in {elapsed:.2f} seconds.")


extraction_pipeline_rapid_ocr(
    input_file='sample_pdf.pdf',
    output_dir='outputs/github_test',
    resolution=1.0,
)

This also produced inconsistent Markdown:

## Sample Fillable PDF Form

Fillable  PDF  forms  can  be  customised  to  your  needs.  They  allow  form  recipients  to  fill  out information on screen like a web page form, then print, save or email the results.

Name

Date

Address

## Fillable Fields

What are your favourite activities? Reading Walking Music Other: /Yes /Yes

## Tick Boxes (multiple options can be selected)

What is your favourite activity? Reading Walking Music Other:

## Radio Buttons (only one option can be selected)

These buttons can be printable or visible only when onscreen.

## Buttons (to prompt certain actions)

Test 123

Jan

1 2012

1, springfield road, uk

<!-- image -->

Please find the example PDF and the extracted page image attached:
sample_pdf.pdf
sample_pdf-page-1
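
The `/Yes /Yes` tokens in the Markdown above look like checkbox state names from the form leaking into the text. As a minimal sketch for spotting them in exported Markdown, the snippet below scans for such tokens. The function name `find_checkbox_states` is hypothetical, and it assumes the form uses the common `/Yes` and `/Off` state names; a PDF is free to use other names for the "on" state, so the pattern may need adjusting.

```python
import re

# Assumption: "/Yes" and "/Off" are the state names used by this form.
# Tokens must stand alone (whitespace or line boundary on both sides).
CHECKBOX_STATE = re.compile(r"(?<!\S)/(Yes|Off)(?!\S)")


def find_checkbox_states(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, state) for each checkbox-state token found."""
    hits = []
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        for match in CHECKBOX_STATE.finditer(line):
            hits.append((line_no, match.group(1)))
    return hits
```

Line numbers are 1-based, so the result can be matched against the Markdown file directly.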

@PeterStaar-IBM
Contributor

@jackdorney1999 The PDF parser should be able to extract the text from filled fields and also tell whether a piece of text comes from a filled-out field. I will sync with @cau-git on how we can propagate this through the Docling pipeline.

@jackdorney1999
Author

@PeterStaar-IBM Thanks for reaching out. Please let me know if you require any further information from me.

cau-git added the enhancement (New feature or request) label Jan 31, 2025

3 participants