Form-Filled PDF extractions #673
Comments
@jackdorney1999 Hi, can you please attach an example document and the minimal code to reproduce your issue? Thanks.
@cau-git Here is the code that I am using for this:
# Standard library imports
import json
import logging
import os
import time
from pathlib import Path
# Docling imports
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling_core.types.doc import ImageRefMode, PictureItem, TableItem, DoclingDocument
def extraction_pipeline_rapid_ocr(input_file, output_dir, resolution):
    start_time = time.time()
    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)
    # Get the name of the PDF (without extension) for folder structure
    pdf_name = Path(input_file).stem
    pdf_output_dir = os.path.join(output_dir, pdf_name)
    # Create nested folders for the specific PDF with resolution
    docling_folder = os.path.join(pdf_output_dir, f'DoclingDocument_{resolution}')
    markdown_folder = os.path.join(pdf_output_dir, f'Markdown_{resolution}')
    image_folder = os.path.join(pdf_output_dir, f'Images_{resolution}')
    os.makedirs(docling_folder, exist_ok=True)
    os.makedirs(markdown_folder, exist_ok=True)
    os.makedirs(image_folder, exist_ok=True)
    # Set up pipeline options with the given resolution
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.images_scale = resolution
    pipeline_options.generate_page_images = True
    pipeline_options.generate_picture_images = True
    pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
    pipeline_options.ocr_options = RapidOcrOptions()
    # Initialize document converter
    doc_converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )
    # Convert the input file
    conversion_result = doc_converter.convert(input_file)
    # Save the JSON representation of the document
    docling_doc = conversion_result.document
    json_output_path = os.path.join(docling_folder, "doc.json")
    with open(json_output_path, "w") as fp:
        fp.write(json.dumps(docling_doc.export_to_dict()))
    # Save the Markdown file
    markdown_content = conversion_result.document.export_to_markdown(image_mode='EMBEDDED')
    markdown_output_path = os.path.join(markdown_folder, f"{pdf_name}.md")
    with open(markdown_output_path, "w") as fp:
        fp.write(markdown_content)
    # Save images for each page
    for page_no, page in conversion_result.document.pages.items():
        page_image_filename = os.path.join(image_folder, f"{pdf_name}-page-{page_no}.png")
        with open(page_image_filename, "wb") as fp:
            page.image.pil_image.save(fp, format="PNG")
    end_time = time.time() - start_time
    logging.info(f"Document converted and files exported in {end_time:.2f} seconds.")
extraction_pipeline_rapid_ocr(
    input_file='sample_pdf.pdf',
    output_dir='outputs/github_test',
    resolution=1.0
)
This also gave inconsistent markdown:
## Sample Fillable PDF Form
Fillable PDF forms can be customised to your needs. They allow form recipients to fill out information on screen like a web page form, then print, save or email the results.
Name
Date
Address
## Fillable Fields
What are your favourite activities? Reading Walking Music Other: /Yes /Yes
## Tick Boxes (multiple options can be selected)
What is your favourite activity? Reading Walking Music Other:
## Radio Buttons (only one option can be selected)
These buttons can be printable or visible only when onscreen.
## Buttons (to prompt certain actions)
Test 123
Jan
1 2012
1, springfield road, uk
<!-- image -->
Please find the example PDF and the extracted image attached.
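To make the inconsistency in the Markdown above easier to see, here is a minimal sketch that reports on which line of an exported Markdown string each expected label or filled-in value first appears. It uses only the standard library. The helper name `locate_values` and the shortened sample string are hypothetical illustrations and not part of Docling.

```python
from typing import Dict, List, Optional


def locate_values(markdown: str, expected: List[str]) -> Dict[str, Optional[int]]:
    """Map each expected string to the 1-based line where it first appears.

    Matching is case-insensitive substring matching; a value that is not
    found anywhere maps to None.
    """
    lines = markdown.splitlines()
    found: Dict[str, Optional[int]] = {}
    for value in expected:
        needle = value.casefold()
        found[value] = next(
            (i for i, line in enumerate(lines, start=1) if needle in line.casefold()),
            None,
        )
    return found


# Shortened copy of the Markdown shown above (hypothetical test input).
sample_markdown = """\
## Sample Fillable PDF Form
Name
Date
Address
## Buttons (to prompt certain actions)
Test 123
Jan
1 2012
1, springfield road, uk
"""

positions = locate_values(
    sample_markdown,
    ["Name", "Test 123", "1, Springfield Road, UK", "Signature"],
)
for value, line_no in positions.items():
    print(f"{value!r}: {'missing' if line_no is None else f'line {line_no}'}")
```

With this input, the filled-in values are found, but several lines after the labels they belong to, which matches the detached ordering in the export above.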
@jackdorney1999 The PDF parser should be able to extract the text from filled fields and also identify that the text comes from a filled-out field. I will sync with @cau-git on how we can propagate this through the Docling pipeline.
@PeterStaar-IBM Thanks for reaching out. Please let me know if you require any further information from me.
Question
How can I ensure that form-filled data is present in the images of the PDF pages?
Hi there,
I am attempting to use Docling as part of an attribute extraction framework. I need to be able to handle attributes that may be entered in form-filled PDFs. I have seen that it is possible to extract the form-filled data when exporting to Markdown, using the following pipeline options in a Python implementation:
# Set up pipeline options with the given resolution
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.images_scale = resolution
pipeline_options.generate_page_images = True
pipeline_options.generate_picture_images = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
pipeline_options.ocr_options = RapidOcrOptions()
# Initialize document converter
doc_converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
# Convert the input file
conversion_result = doc_converter.convert(input_file)
# Save the JSON representation of the document
docling_doc = conversion_result.document
json_output_path = os.path.join(docling_folder, "doc.json")
with open(json_output_path, "w") as fp:
    fp.write(json.dumps(docling_doc.export_to_dict()))
# Save the Markdown file
markdown_content = conversion_result.document.export_to_markdown(image_mode='EMBEDDED')
markdown_output_path = os.path.join(markdown_folder, f"{pdf_name}.md")
with open(markdown_output_path, "w") as fp:
    fp.write(markdown_content)
# Save images for each page
for page_no, page in conversion_result.document.pages.items():
    page_image_filename = os.path.join(image_folder, f"{pdf_name}-page-{page_no}.png")
    with open(page_image_filename, "wb") as fp:
        page.image.pil_image.save(fp, format="PNG")
I have found that setting:
pipeline_options.table_structure_options.do_cell_matching = True
means the form-filled data is present in the Markdown (even though the form-filled part of this PDF is not a table).
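One way to check this observation, sketched under the assumption that you have exported the Markdown twice (once with `do_cell_matching = False` and once with it set to `True`), is to list the lines that appear only in the second export. The helper name `added_lines` and the two sample strings are hypothetical; the sketch uses only the standard-library `difflib` module and makes no claim about what the flag does inside Docling.

```python
import difflib
from typing import List


def added_lines(baseline_md: str, candidate_md: str) -> List[str]:
    """Return the lines that ndiff marks as present only in candidate_md."""
    diff = difflib.ndiff(baseline_md.splitlines(), candidate_md.splitlines())
    # ndiff prefixes lines unique to the second sequence with "+ ".
    return [line[2:] for line in diff if line.startswith("+ ")]


# Hypothetical exports: the second one additionally contains filled-in values.
without_matching = "Name\nDate\nAddress"
with_matching = "Name\nDate\nAddress\nTest 123\nJan"

extra = added_lines(without_matching, with_matching)
print(extra)
```

If the list is empty for real exports, the flag made no difference to the Markdown for that document, and the form-filled text is coming from somewhere else in the pipeline.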
However, when I extract images of the PDF pages, the form-filled data is missing from them, so none of the attributes I am looking to extract are present.
Is there a way to ensure that the form-filled data is present in the images of the PDF pages? Are there parameters in the pipeline that would enable this?
Thanks