Add support for ImageElements -> Parse images #64

ic-xu · 2024-09-12T12:11:35Z

The images in the PDF were lost after parsing before. Now RAG needs to use images, so I added the image extraction function.

Filimoa · 2024-09-12T23:41:12Z

Thanks for the PR - this looks awesome! I'll take a deeper look into this ASAP.

ic-xu · 2024-09-13T02:52:44Z

Okay. If you need anything from me, please let me know. I'm happy to discuss.

Filimoa · 2024-09-17T17:28:16Z

PdfMiner looks good - I changed the image schema to encode the image data as a base64 str to allow for easy serialization. We also added a MIME type. Finally, I added a test - that looks good.
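To illustrate why a base64 str serializes more easily than raw bytes, here is a minimal sketch. The `ImageSketch` class and its field names are hypothetical stand-ins, not the actual openparse schema, and the sample bytes are only the PNG file signature.

```python
import base64
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ImageSketch:
    """Hypothetical image element; NOT the actual openparse schema."""

    image: str      # image bytes encoded as base64 text
    mime_type: str  # e.g. "image/png"

    @classmethod
    def from_bytes(cls, raw: bytes, mime_type: str) -> "ImageSketch":
        # base64 turns arbitrary bytes into ASCII text that JSON can carry
        return cls(image=base64.b64encode(raw).decode("ascii"), mime_type=mime_type)

    def to_bytes(self) -> bytes:
        # Decode back to the original image bytes
        return base64.b64decode(self.image)


raw = b"\x89PNG\r\n\x1a\n"  # PNG signature, used only as sample data
element = ImageSketch.from_bytes(raw, "image/png")

# json.dumps would raise TypeError on raw bytes; the base64 str round-trips cleanly
payload = json.dumps(asdict(element))
restored = ImageSketch(**json.loads(payload))
assert restored.to_bytes() == raw
```

Keeping the MIME type next to the data lets a consumer decide how to decode or render the image without inspecting the bytes.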

Looks like the pymupdf implementation is failing for me.

import openparse

basic_doc_path = "/Users/sergey/Downloads/pdf-with-image.pdf"
pdf_obj = openparse.Pdf(basic_doc_path)
parsed_basic_doc = ingest(pdf_obj)

Returns

ValidationError: 1 validation error for Bbox
  Value error, y1 must be greater than y0 [type=value_error, input_value={'x0': 72.0, 'y0': 719.54...0, 'page_height': 792.0}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/value_error

I noticed you flipped the coordinates. Maybe that's why?

fy0 = page.rect.height - node["bbox"][1]
fy1 = page.rect.height - node["bbox"][3]

To be honest, I'm not even sure how important it is to implement this for pymupdf, because those documents are already OCR'd and I'm not sure how that affects this.
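A minimal sketch of why flipping each y value independently trips the `y1 must be greater than y0` check: subtracting both edges from the page height reverses their order, so the flipped values need to be swapped as well. The helper name `flip_vertical` and the sample numbers are hypothetical, and this assumes the flip is needed at all, which depends on the coordinate convention the Bbox expects.

```python
from typing import Tuple


def flip_vertical(y0: float, y1: float, page_height: float) -> Tuple[float, float]:
    """Mirror a vertical span across the page and keep the result ordered.

    Hypothetical helper for illustration; not part of openparse or PyMuPDF.
    """
    a = page_height - y0
    b = page_height - y1
    # Mirroring reverses the order of the two edges, so re-sort them
    return min(a, b), max(a, b)


page_height = 792.0
y0, y1 = 72.5, 200.0  # illustrative span with y0 < y1, measured from the top of the page

# Naive flip, as in the two lines quoted above: the order is reversed
naive = (page_height - y0, page_height - y1)
assert naive[0] > naive[1]  # would fail a "y1 must be greater than y0" validator

# Flip plus swap keeps y0 < y1
fixed = flip_vertical(y0, y1, page_height)
assert fixed == (592.0, 719.5)
```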

Filimoa · 2024-09-17T17:43:54Z

You can see the changes in the "parse-images-pdf-miner" branch - I can't seem to figure out how to merge it into this

ic-xu · 2024-09-18T02:17:13Z

Indeed, for pymupdf, if the image has already been OCR'd, there is no need to parse the image again. However, I had assumed you would handle OCR in a later step and only parse text at the text-parsing stage.

ic-xu · 2024-09-18T02:18:51Z

How about this: I'll submit the image parsing part for pdfminer first. As for pymupdf, I'll leave it as it is, without adding an image parsing module. Is that okay?

Filimoa · 2024-09-18T04:28:12Z

Sounds good!

陈旭 added 6 commits April 25, 2024 10:45

fix:Fix the bug that the layout of PPT is reversed when parsing it

4378137

Merge remote-tracking branch 'origin/main'

a512fbd

Merge remote-tracking branch 'main/main'

de863ea

Merge remote-tracking branch 'main/main'

63deec6

Merge remote-tracking branch 'main/main'

f970b64

feat: Add image extraction function

e52c7a6

ic-xu changed the title ~~Parse images~~ Add support for ImageElements -> Parse images Sep 12, 2024

feat: Add image extraction function

00f02a2

Filimoa merged commit 00f02a2 into Filimoa:main Sep 24, 2024
5 checks passed
