Submit Document Object to self-hosted nlm-ingestor #78

Open
BainMcKay opened this issue Jul 20, 2024 · 2 comments

BainMcKay commented Jul 20, 2024

Using Python, I download Google Drive files to a local server and cache them in a tmp folder on the server so that a failed run can restart from a checkpoint. I load each file into a Document object, which I then parse with semantic parsing. I also want to submit the Document object to a local nlm-ingestor server for processing. But if I submit the filename and the Document object, the request fails with a 404. I don't want to create a publicly available downloads folder on the nlm-ingestor server. Is there a way to submit the Document object itself, rather than a URL, to [self.parse_pdf(pdf_file)] in [file_reader.py]?

BainMcKay commented Jul 26, 2024

Found the issue.

  1. The URL in the example does not match the routing rule in the server code. It should be [http://yourserverip/api/parsedocument?renderFormat=all]. The additional folders in the example URL are not part of the server's REST API routing path; the REST route is [api/parsedocument].
  2. The PDF rule parser looks for a style attribute, which did not exist in the Tika text extraction from the CV PDF documents I was using. It looks like there was an attempt to assign a default value when the style attribute is not found, and the result is that the document is flushed with an opaque error [404 NOT FOUND]. I tried adding conditionals for the missing style, but the dependency threads down through the code. So instead I added a condition: if the style attribute is not found, report it to the console log and flush the document. The calling client API then switches to another parsing algorithm, which does work.
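For anyone hitting the same 404, here is a minimal sketch of posting the PDF bytes straight from memory to the route in item 1, so no public downloads folder is needed. It assumes the server accepts a multipart/form-data upload under a form field named `file`; that field name, the helper names, and the timeout are my assumptions and are not confirmed anywhere in this thread.

```python
import uuid
import urllib.request


def build_parse_request(server_url, filename, pdf_bytes, field_name="file"):
    """Build a multipart/form-data POST that carries the PDF bytes in memory.

    field_name="file" is an assumption about what the server expects.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        server_url,
        data=head + pdf_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def parse_document(server_url, filename, pdf_bytes, timeout=120):
    """Send the request and return the raw response body as bytes."""
    request = build_parse_request(server_url, filename, pdf_bytes)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Example call, using the route from item 1 (replace yourserverip):
# with open("/tmp/cache/cv.pdf", "rb") as fh:
#     raw = parse_document(
#         "http://yourserverip/api/parsedocument?renderFormat=all",
#         "cv.pdf",
#         fh.read(),
#     )
```

If the file is already loaded in memory, the same bytes can be passed in directly instead of re-reading the cached copy.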

BUG: The style parser bug needs to be fixed for the parser to work.
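
Until that is fixed, the workaround from item 2 (log the missing style, flush the document, let the client fall back to another parser) can be sketched roughly as below. This is not nlm-ingestor's code: the dict-shaped elements, `MissingStyleError`, `require_style`, and `parse_with_fallback` are all hypothetical names used only to illustrate the control flow.

```python
import logging

log = logging.getLogger("ingest")


class MissingStyleError(ValueError):
    """Hypothetical error: an extracted element has no style attribute."""


def require_style(element):
    """Return the element's style, or log and raise so the document is flushed."""
    style = element.get("style")
    if style is None:
        log.warning("style attribute not found; flushing document")
        raise MissingStyleError("style attribute not found")
    return style


def parse_with_fallback(elements, primary, fallback):
    """Use the primary parser only if every element has a style attribute."""
    try:
        for element in elements:
            require_style(element)
        return primary(elements)
    except MissingStyleError:
        # The caller switches to another parsing algorithm.
        return fallback(elements)
```

Raising a specific exception keeps the failure visible in the log instead of surfacing as an opaque 404.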

@jamesvillarrubia
Collaborator

You may need to download the most recent jar file, 2.9.2_v2 (tika-server-standard-nlm-modified-2.9.2_v2.jar), or downgrade to 2.4.1v6. There was a big update to bring nlm-ingestor in line with Apache Tika's most recent releases, which also required modifying Tika's jars. Bugs in the style parser were introduced in 2.9.2_v1 and may be fixed in v2.
