Form-Filled PDF extractions #673

jackdorney1999 · 2025-01-03T17:33:23Z

Question

How can I ensure that form-filled data is present in the images of the PDF pages?

Hi there,

I am attempting to use Docling as part of an attribute extraction framework, and I need to handle attributes that have been entered into fillable PDF forms. I have seen that the form-filled data can be extracted when exporting to Markdown, using the following pipeline parameters in a Python implementation:

# Set up pipeline options with the given resolution
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.images_scale = resolution
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
pipeline_options.ocr_options = RapidOcrOptions()

# Initialize document converter
doc_converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

# Convert the input file
conversion_result = doc_converter.convert(input_file)

# Save the JSON representation of the document
docling_doc = conversion_result.document
json_output_path = os.path.join(docling_folder, "doc.json")
with open(json_output_path, "w") as fp:
    fp.write(json.dumps(docling_doc.export_to_dict()))

# Save the Markdown file
markdown_content = conversion_result.document.export_to_markdown(image_mode='EMBEDDED')
markdown_output_path = os.path.join(markdown_folder, f"{pdf_name}.md")
with open(markdown_output_path, "w") as fp:
    fp.write(markdown_content)

# Save images for each page
for page_no, page in conversion_result.document.pages.items():
    page_image_filename = os.path.join(image_folder, f"{pdf_name}-page-{page_no}.png")
    with open(page_image_filename, "wb") as fp:
        page.image.pil_image.save(fp, format="PNG")

I have found that setting:

pipeline_options.table_structure_options.do_cell_matching = True

causes the form-filled data to appear in the Markdown (even though the form-filled part of this PDF is not a table).

However, when I export the page images, the form-filled data is missing from them, so none of the attributes I want to extract are visible.

Is there a way to ensure that the form-filled data is present in the images of the PDF pages? Are there pipeline parameters that would enable this?

Thanks
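
One possible workaround outside the Docling pipeline is to render the pages with a PDF library that draws interactive form fields. The following is a minimal sketch, assuming the third-party pypdfium2 package (and Pillow) is installed. `render_pages_with_form_fields` and `page_image_path` are hypothetical helper names, and whether Docling's own `generate_page_images` option draws form fields is not established in this thread.

```python
from pathlib import Path


def page_image_path(image_folder, pdf_name, page_no):
    """Mirror the naming scheme used in the post: <pdf_name>-page-<n>.png."""
    return Path(image_folder) / f"{pdf_name}-page-{page_no}.png"


def render_pages_with_form_fields(input_file, image_folder, scale=1.0):
    """Render every page to PNG, asking pdfium to draw form field content."""
    # Third-party dependency, imported lazily so the helper above works without it.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(str(input_file))
    # The form environment has to be initialised before any page is loaded,
    # otherwise filled-in field values are not drawn.
    pdf.init_forms()

    pdf_name = Path(input_file).stem
    Path(image_folder).mkdir(parents=True, exist_ok=True)

    saved = []
    for index in range(len(pdf)):
        page = pdf[index]
        bitmap = page.render(scale=scale, may_draw_forms=True)
        out_path = page_image_path(image_folder, pdf_name, index + 1)
        bitmap.to_pil().save(out_path, format="PNG")
        saved.append(out_path)

    pdf.close()
    return saved
```

Calling `render_pages_with_form_fields("sample_pdf.pdf", "outputs/images")` would write one PNG per page. Comparing those images with the ones saved from `conversion_result.document.pages` should show whether the missing values are a rendering issue.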

cau-git · 2025-01-06T11:56:56Z

@jackdorney1999 Hi, can you please attach an example document and the minimal code to reproduce your issue? Thanks.

jackdorney1999 · 2025-01-07T16:45:32Z

@cau-git here is the code that I am using for this:

# Standard library imports
import json
import logging
import os
import time
from pathlib import Path

# Docling imports
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling_core.types.doc import ImageRefMode, PictureItem, TableItem, DoclingDocument

def extraction_pipeline_rapid_ocr(input_file, output_dir, resolution):
    start_time = time.time()

    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)

    # Get the name of the PDF (without extension) for folder structure
    pdf_name = Path(input_file).stem
    pdf_output_dir = os.path.join(output_dir, pdf_name)

    # Create nested folders for the specific PDF with resolution
    docling_folder = os.path.join(pdf_output_dir, f'DoclingDocument_{resolution}')
    markdown_folder = os.path.join(pdf_output_dir, f'Markdown_{resolution}')
    image_folder = os.path.join(pdf_output_dir, f'Images_{resolution}')

    os.makedirs(docling_folder, exist_ok=True)
    os.makedirs(markdown_folder, exist_ok=True)
    os.makedirs(image_folder, exist_ok=True)

    # Set up pipeline options with the given resolution
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.images_scale = resolution
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
    pipeline_options.ocr_options = RapidOcrOptions()

    # Initialize document converter
    doc_converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )

    # Convert the input file
    conversion_result = doc_converter.convert(input_file)

    # Save the JSON representation of the document
    docling_doc = conversion_result.document
    json_output_path = os.path.join(docling_folder, "doc.json")
    with open(json_output_path, "w") as fp:
        fp.write(json.dumps(docling_doc.export_to_dict()))

    # Save the Markdown file
    markdown_content = conversion_result.document.export_to_markdown(image_mode='EMBEDDED')
    markdown_output_path = os.path.join(markdown_folder, f"{pdf_name}.md")
    with open(markdown_output_path, "w") as fp:
        fp.write(markdown_content)

    # Save images for each page
    for page_no, page in conversion_result.document.pages.items():
        page_image_filename = os.path.join(image_folder, f"{pdf_name}-page-{page_no}.png")
        with open(page_image_filename, "wb") as fp:
            page.image.pil_image.save(fp, format="PNG")

    # Report the total elapsed time for conversion and export
    elapsed = time.time() - start_time
    logging.info(f"Document converted and files exported in {elapsed:.2f} seconds.")

extraction_pipeline_rapid_ocr(
    input_file='sample_pdf.pdf',
    output_dir='outputs/github_test',
    resolution=1.0
)

This also gave inconsistent Markdown output:

## Sample Fillable PDF Form

Fillable  PDF  forms  can  be  customised  to  your  needs.  They  allow  form  recipients  to  fill  out information on screen like a web page form, then print, save or email the results.

Name

Date

Address

## Fillable Fields

What are your favourite activities? Reading Walking Music Other: /Yes /Yes

## Tick Boxes (multiple options can be selected)

What is your favourite activity? Reading Walking Music Other:

## Radio Buttons (only one option can be selected)

These buttons can be printable or visible only when onscreen.

## Buttons (to prompt certain actions)

Test 123

Jan

1 2012

1, springfield road, uk

<!-- image -->

Please find the example PDF and the extracted image attached:
sample_pdf.pdf
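
To make the inconsistency in the Markdown above easier to track across pipeline settings, a small check can report which expected field values actually appear in the exported text. This is a sketch only: `find_field_values` is a hypothetical helper, the excerpt reproduces just the labels and trailing values from the output quoted above, and the pairing of labels with values is an assumption based on how the form reads.

```python
def find_field_values(markdown, expected):
    """Report, per field label, whether its expected value occurs in the Markdown.

    Whitespace is collapsed and case is ignored, so a value that the export
    splits across several lines (for example a date) still counts as present.
    """
    normalised = " ".join(markdown.split()).lower()
    return {
        label: " ".join(value.split()).lower() in normalised
        for label, value in expected.items()
    }


# Excerpt of the Markdown quoted above: field labels and the trailing values only.
markdown_excerpt = """Name

Date

Address

Test 123

Jan

1 2012

1, springfield road, uk
"""

# Assumed label-to-value pairing; the thread does not state it explicitly.
expected_values = {
    "Name": "Test 123",
    "Date": "Jan 1 2012",
    "Address": "1, springfield road, uk",
}

report = find_field_values(markdown_excerpt, expected_values)
print(report)
```

A check like this only confirms that a value is present somewhere in the text. It does not confirm that the value sits next to its label, which is the part that goes wrong in the output above.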

PeterStaar-IBM · 2025-01-11T13:56:14Z

@jackdorney1999 The PDF parser should be able to extract the text from a filled field and also tell whether that text comes from a filled-out field. I will sync with @cau-git on how we can propagate this through the Docling pipeline.
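
Until that is propagated, field values can be read straight from the PDF's form dictionary, independently of Docling. The following is a minimal sketch, assuming the third-party pypdf package is installed. `read_form_fields` and `normalise_field_value` are hypothetical helper names, and the `/Yes` on-state is an assumption taken from the `/Yes /Yes` entries in the Markdown output quoted earlier; other PDFs may use different on-state names.

```python
def normalise_field_value(value):
    """Map common checkbox and radio states to booleans; pass other values through.

    Assumption: ticked boxes are stored as "/Yes" and unticked ones as "/Off".
    """
    if value == "/Yes":
        return True
    if value == "/Off":
        return False
    return value


def read_form_fields(pdf_path):
    """Return {field name: value} for every interactive form field in the PDF."""
    # Third-party dependency, imported lazily so the helper above works without it.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # get_fields() returns None when the document has no interactive form.
    fields = reader.get_fields() or {}
    return {
        name: normalise_field_value(field.value)
        for name, field in fields.items()
    }
```

Calling `read_form_fields("sample_pdf.pdf")` would give a mapping from field names to values that comes from the form itself rather than from page layout, so each value stays attached to its field name.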

jackdorney1999 · 2025-01-16T09:32:49Z

@PeterStaar-IBM Thanks for reaching out. Please let me know if you require any further information from me.

jackdorney1999 added the "question" label (Further information is requested) · Jan 3, 2025

cau-git added the "enhancement" label (New feature or request) · Jan 31, 2025

cau-git assigned cau-git and PeterStaar-IBM · Jan 31, 2025
