Text extraction hangs on macOS 10.14 #14

@devcsrj

I am trying to use python-pdfbox with this vanilla snippet:

import pdfbox

# pdf and txt are assumed to be pathlib.Path objects (input PDF, output text file)
converter = pdfbox.PDFBox()
converter.extract_text(
    input_path=str(pdf.absolute()),
    output_path=str(txt.absolute()))

But the call never returns. I inspected the stack trace in the debugger, and it hangs at this line:

[Screenshot: Screen Shot 2019-10-02 at 6 25 24 AM]

I confirmed that a Java process is spawned:

➜ jps
5416 Jps
5385
329    <-- spawned process

But it is just stuck there.
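
One way to capture where the Python side is blocked without attaching a debugger is the standard library's faulthandler module, which can dump every thread's traceback if a call runs past a deadline. A minimal sketch follows; the helper name `dump_stack_if_stuck` and the 30-second default are illustrative assumptions, not part of python-pdfbox:

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def dump_stack_if_stuck(timeout_s=30.0, file=None):
    """Dump all Python thread tracebacks if the body runs longer than timeout_s.

    Hypothetical diagnostic helper. `file` must expose fileno();
    it defaults to sys.stderr.
    """
    target = file if file is not None else sys.stderr
    # Arm a watchdog that writes the tracebacks to `target` after timeout_s.
    faulthandler.dump_traceback_later(timeout_s, file=target)
    try:
        yield
    finally:
        # Disarm the watchdog once the body returns or raises.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the `converter.extract_text(...)` call in `with dump_stack_if_stuck():` would print the blocked frame to stderr while leaving the call itself untouched.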

Running the jar cached by python-pdfbox directly in the terminal works:

java -jar pdfbox-app-2.0.17.jar ExtractText '/Users/devcsrj/Projects/devcsrj/klerk/dist/17/SENATE/regular-1/journal-28.pdf' '/Users/devcsrj/Projects/devcsrj/klerk/dist/17/SENATE/regular-1/journal-28.txt'
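
To check whether the hang depends on how the Java process is launched rather than on PDFBox itself, the same command can be run from Python with a deadline. This is a minimal sketch using only the standard library's subprocess module. The helper names `build_extract_command` and `run_with_timeout` and the 60-second limit are assumptions for illustration, and this is not necessarily how python-pdfbox starts the JVM:

```python
import subprocess


def build_extract_command(jar_path, pdf_path, txt_path):
    """Return the argv for the ExtractText command shown above."""
    return ["java", "-jar", str(jar_path), "ExtractText",
            str(pdf_path), str(txt_path)]


def run_with_timeout(cmd, timeout_s=60.0):
    """Run cmd; return (returncode, stderr). returncode is None on timeout."""
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,  # decode output as text (Python 3.7 compatible)
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before re-raising on timeout.
        err = exc.stderr or ""
        if isinstance(err, bytes):
            err = err.decode(errors="replace")
        return None, err
    return done.returncode, done.stderr
```

If `run_with_timeout(build_extract_command(jar, pdf, txt))` returns promptly with code 0, the jar and JVM behave when launched as a plain child process, which would suggest looking at how the library starts or communicates with the JVM.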

So I am no longer sure what's going on. Thoughts?


Environment

Python

python-pdfbox = "==0.1.7"
python_version = "3.7"

Java

openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-20190711112007.graal.jdk8u-src-tar-gz-b08)
OpenJDK 64-Bit GraalVM CE 19.2.0 (build 25.222-b08-jvmci-19.2-b02, mixed mode)

OS

macOS Mojave 10.14.4
