Got error when running partition_pdf #2

Open
@sigurn2

I ran your notebook, 01_semi_structured_data.ipynb, in Colab:

from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="statement_of_changes.pdf",
    extract_images_in_pdf=False,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=".",
)

and got the following error:

WARNING:unstructured:This function will be deprecated in a future release and `unstructured` will simply use the DEFAULT_MODEL from `unstructured_inference.model.base` to set default model name
---------------------------------------------------------------------------
UnidentifiedImageError                    Traceback (most recent call last)
<ipython-input-10-c47946c825bc> in <cell line: 6>()
      4 from unstructured.partition.pdf import partition_pdf
      5 
----> 6 raw_pdf_elements = partition_pdf(
      7     filename="statement_of_changes.pdf",
      8     extract_images_in_pdf=False,

10 frames
/usr/local/lib/python3.10/dist-packages/PIL/Image.py in open(fp, mode, formats)
   3281         fp.seek(0)
   3282     except (AttributeError, io.UnsupportedOperation):
-> 3283         fp = io.BytesIO(fp.read())
   3284         exclusive_fp = True
   3285 

UnidentifiedImageError: cannot identify image file '/tmp/tmpt9l2pd51/88be9f82-5a19-4ec0-baa1-a029cf45dfc4-1.ppm'

I have no idea how to resolve it.
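
In case it helps with triage, here is a minimal sketch for collecting environment details in the same Colab session. It is only a diagnostic aid, not a fix. The package list and the helper name `collect_env_info` are my own choices. It also assumes that the `.ppm` temp file comes from a poppler-based PDF-to-image step (`pdftoppm`), which the traceback does not confirm.

```python
import importlib.metadata as md
import shutil
import sys


def collect_env_info(packages=("unstructured", "unstructured-inference", "pdf2image", "Pillow")):
    """Return installed package versions plus a few hints relevant to the error.

    The default package list is an assumption about which libraries are involved.
    """
    info = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            info[name] = md.version(name)
        except md.PackageNotFoundError:
            info[name] = None  # not installed in this environment

    # pdftoppm ships with poppler-utils; assumed (not confirmed) to be the
    # tool that wrote the .ppm file named in the error message.
    info["pdftoppm"] = shutil.which("pdftoppm")

    # If PIL was imported before a pip upgrade, the loaded module can differ
    # from the installed version until the runtime is restarted.
    loaded_pil = sys.modules.get("PIL")
    info["PIL (loaded)"] = getattr(loaded_pil, "__version__", None)
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

If `Pillow` and `PIL (loaded)` show different versions, restarting the runtime and re-running the cell may be worth trying before anything else. That is a guess on my part, not a confirmed cause.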
