BUG: ImageExtraction not extracting all the images in pdf #162

@luojunhui1

Description

Describe the bug
ImageExtraction does not extract all of the images in a PDF.

To Reproduce

  1. Take a 9-page PDF with one image on each of pages 6, 7 and 8 (page numbers start at 0).
  2. ImageExtraction only detects the image on page 7 and ignores the images on pages 6 and 8.
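import typing

# import paths assumed for borb 2.1.x
from borb.pdf import Document, PDF
from borb.toolkit import ImageExtraction, SimpleTextExtraction

# file_path must point to the PDF under test (its value is not given in this report)

# read the Document, attaching both listeners
doc: typing.Optional[Document] = None
text_l: SimpleTextExtraction = SimpleTextExtraction()
image_l: ImageExtraction = ImageExtraction()

with open(file_path, "rb") as in_file_handle:
    doc = PDF.loads(in_file_handle, [text_l, image_l])

# check whether we have read a Document
assert doc is not None

images = []

# list the XObject resources declared on each of the 9 pages
for page in range(0, 9):
    resources = doc.get_page(page)["Resources"]
    if "XObject" in resources:
        for k, v in resources["XObject"].items():
            print("%d\t%s" % (page, k))

# list the pages for which ImageExtraction reported images
for page, content in image_l.get_images().items():
    images += content
    print(f"image page: {page}")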

Expected behaviour
The ImageExtraction listener should return all of the images, i.e. those on pages 6, 7 and 8.
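
To make the gap between expected and actual output explicit, the two printouts from the reproduction script can be compared programmatically. The following is only a minimal sketch: `pages_missing_from_extraction` is a hypothetical helper name, not part of borb, and the sample values simply restate the pages described in this report. Note that an XObject entry may also be a form rather than an image, so a page flagged here is a candidate for a missed image, not proof of one.

```python
from typing import Dict, Iterable, List, Set


def pages_missing_from_extraction(
    pages_with_xobjects: Iterable[int],
    extracted: Dict[int, List[object]],
) -> List[int]:
    """Return pages that declare XObjects but yielded no extracted image."""
    # pages for which the listener reported at least one image
    reported: Set[int] = {page for page, imgs in extracted.items() if imgs}
    return sorted(set(pages_with_xobjects) - reported)


# Values restating this report: images are expected on pages 6, 7 and 8,
# but the listener only reported page 7.
xobject_pages = [6, 7, 8]
extracted_images = {7: ["<image on page 7>"]}  # stand-in for real image objects

print(pages_missing_from_extraction(xobject_pages, extracted_images))  # prints [6, 8]
```

In the reproduction script, the first argument would be the pages collected in the XObject loop and the second would be the dictionary returned by `image_l.get_images()`.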

Desktop:

  • OS: Windows 10
  • borb version: 2.1.10
