Submit Document Object to self-hosted nlm-ingestor #78

BainMcKay · 2024-07-20T20:02:33Z

Using Python, I am downloading Google Drive files to a local server and caching them in a tmp folder on the server, so that a failed run can restart from a checkpoint. I load each file into a Document object, which I then parse with semantic parsing. I also want to submit the Document object to a local nlm-ingestor server for processing. But if I submit the filename and the Document object, it fails with a 404. I don't want to create a publicly available downloads folder on the nlm-ingestor server. Is there a way to submit the Document object, rather than a URL, to [self.parse_pdf(pdf_file)] in [file_reader.py]?

BainMcKay · 2024-07-26T18:04:27Z

Found the issue.

The URL in the example does not match the routing rule in the server code. It should be [http://yourserverip/api/parsedocument?renderFormat=all]. The additional folders in the example URL are not part of the server's REST API routing path. The REST route is [api/parsedocument].
The PDF rule parser looks for a style attribute, which did not exist in the Tika text extraction from the CV PDF documents I was using. It looks like there was an attempt to assign a default value if the style attribute was not found, causing the document to flush with an opaque error [404 NOT FOUND]. I tried conditionals based on the style not being found, but the problem threads down through the code. So I added a condition: if the style attribute is not found, report it to the console log and flush the document. The calling client API then switches to another parsing algorithm, which does work.
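The client-side switch described above could look roughly like the following. This is only a sketch of the general pattern, not the code used in this thread: `parse_with_fallback` and the parser callables passed to it are hypothetical names, and each parser is assumed to raise an exception when it cannot handle the document.

```python
import logging

log = logging.getLogger("ingest")


def parse_with_fallback(data, parsers):
    """Try each (name, parser) pair in order and return the first success.

    `parsers` is a list of (name, callable) pairs; each callable takes the
    raw document bytes and either returns a result or raises an exception.
    """
    failed = []
    for name, parser in parsers:
        try:
            return name, parser(data)
        except Exception as exc:  # e.g. the opaque 404 from the ingestor
            log.warning("parser %s failed: %s", name, exc)
            failed.append(name)
    raise RuntimeError(f"all parsers failed: {failed}")
```

Returning the parser name alongside the result makes it easy to record which algorithm actually handled each document.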

BUG: The style parser bug needs to be fixed for the parser to work.
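For anyone landing here with the original question, a minimal sketch of posting the cached file's bytes straight to that route, so no public downloads folder is needed. Assumptions not confirmed in this thread: the server reads the upload from a multipart form field named "file", and the path casing is as quoted above (check the route declaration in your server code). `build_parse_request` and `submit_pdf_bytes` are hypothetical helper names.

```python
import urllib.request
import uuid


def build_parse_request(server, filename, data, render_format="all"):
    """Build a multipart/form-data POST carrying the PDF bytes.

    Assumption: the server expects the upload in a form field named "file".
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    # Route as quoted in this thread; the casing on your server may differ.
    url = f"{server.rstrip('/')}/api/parsedocument?renderFormat={render_format}"
    return urllib.request.Request(
        url,
        data=head + data + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def submit_pdf_bytes(server, filename, data):
    """Send the request and return the raw response body as bytes."""
    with urllib.request.urlopen(build_parse_request(server, filename, data)) as resp:
        return resp.read()
```

With the file already cached locally, the call would be something like `submit_pdf_bytes("http://yourserverip", "cv.pdf", open("/tmp/cv.pdf", "rb").read())`, where the path is a placeholder.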

jamesvillarrubia · 2024-08-07T00:55:59Z

You may need to download the most recent jar file, 2.9.2_v2 (tika-server-standard-nlm-modified-2.9.2_v2.jar), or downgrade to 2.4.1v6. There was a big update to bring nlm-ingestor in line with Apache Tika's most recent updates, but Tika's jars had to be modified too. Bugs in the style parser were introduced in 2.9.2_v1 and may be fixed in v2.
