Open
Description
Describe the bug
ImageExtraction does not extract all the images in a PDF.
To Reproduce
- For a PDF file with 9 pages, there is one image on each of pages 6, 7, and 8 (page numbers start at 0)
- ImageExtraction only detected the image on page 7 and ignored the images on pages 6 and 8
# read the Document
doc: typing.Optional[Document] = None
text_l: SimpleTextExtraction = SimpleTextExtraction()
image_l: ImageExtraction = ImageExtraction()
with open(file_path, "rb") as in_file_handle:
    doc = PDF.loads(in_file_handle, [text_l, image_l])
# check whether we have read a Document
assert doc is not None
images = []
# list the XObject names found in each page's resources
for page in range(0, 9):
    if "XObject" in doc.get_page(page)["Resources"]:
        for k, v in doc.get_page(page)["Resources"]["XObject"].items():
            print("%d\t%s" % (page, k))
# collect the images reported by the ImageExtraction listener
for page, content in image_l.get_images().items():
    images += content
    print(f"image page: {page}")
Expected behaviour
The ImageExtraction listener should return all the images.
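To make the mismatch explicit, here is a minimal sketch (plain Python, not part of borb) that compares the pages listing XObjects with the pages for which images were returned. The helper name `pages_missing_images` and the XObject names `Im0`, `Im1`, `Im2` are hypothetical. In the reproduction script, the first dict could presumably be built from `doc.get_page(page)["Resources"]["XObject"]` and the second from `image_l.get_images()`. Note that an XObject is not necessarily an image (it can also be a form), so a reported page is a hint to inspect, not proof of a missed image.

```python
from typing import Dict, List, Sequence


def pages_missing_images(
    xobjects_by_page: Dict[int, Sequence[str]],
    images_by_page: Dict[int, Sequence[object]],
) -> List[int]:
    """Return pages that list XObjects but yielded no extracted images."""
    missing = []
    for page, names in sorted(xobjects_by_page.items()):
        # A page is suspicious if it has XObjects but no extracted image.
        if names and not images_by_page.get(page):
            missing.append(page)
    return missing


# Hypothetical data mirroring this report: XObjects on pages 6, 7 and 8,
# but an image returned only for page 7.
xobjects = {6: ["Im0"], 7: ["Im1"], 8: ["Im2"]}
extracted = {7: ["<image>"]}
print(pages_missing_images(xobjects, extracted))
```

With the data above, the pages reported as missing match the ones described in this issue (6 and 8).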
Desktop (please complete the following information):
- OS: Windows 10
- borb version 2.1.10