Commit f97814c

Author: Ambika Sukla
Message: added tesseract and removed unwanted files
1 parent e59831b commit f97814c

10 files changed: +226 -1253 lines changed

Dockerfile

Lines changed: 17 additions & 5 deletions
@@ -5,10 +5,22 @@ ENV APP_HOME /app
 # install Java
 RUN mkdir -p /usr/share/man/man1 && \
     apt-get update -y && \
-    apt-get install -y openjdk-17-jre-headless && \
-    apt-get install -y libxml2-dev && \
-    apt-get install -y libxslt-dev && \
-    apt-get install -y build-essential
+    apt-get install -y openjdk-17-jre-headless
+# install essential packages
+RUN apt-get install -y \
+    libxml2-dev libxslt-dev \
+    build-essential libmagic-dev
+# install tesseract
+RUN apt-get install -y \
+    tesseract-ocr \
+    lsb-release \
+    && echo "deb https://notesalexp.org/tesseract-ocr5/$(lsb_release -cs)/ $(lsb_release -cs) main" | tee /etc/apt/sources.list.d/notesalexp.list > /dev/null \
+    && apt-get update -oAcquire::AllowInsecureRepositories=true \
+    && apt-get install notesalexp-keyring -oAcquire::AllowInsecureRepositories=true -y --allow-unauthenticated \
+    && apt-get update \
+    && apt-get install -y \
+    tesseract-ocr libtesseract-dev \
+    && wget -P /usr/share/tesseract-ocr/5/tessdata/ https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
 RUN apt-get install unzip -y && \
     apt-get install git -y && \
     apt-get autoremove -y
@@ -21,4 +33,4 @@ RUN pip install -r requirements.txt
 RUN python -m nltk.downloader stopwords
 RUN python -m nltk.downloader punkt
 RUN chmod +x run.sh
-CMD ./run.sh
+CMD ./run.sh
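
The Dockerfile change above installs the `tesseract` binary and downloads `eng.traineddata` into `/usr/share/tesseract-ocr/5/tessdata/`. A minimal sketch of a sanity check that could be run inside the built image is shown below. The helper name `check_tesseract_install` is hypothetical and not part of nlm-ingestor; only the model path is taken from the diff.

```python
import shutil
from pathlib import Path

# Path taken from the wget target in the Dockerfile hunk above.
TESSDATA_FILE = Path("/usr/share/tesseract-ocr/5/tessdata/eng.traineddata")


def check_tesseract_install(tessdata_file: Path = TESSDATA_FILE) -> dict:
    """Report whether the tesseract binary and the English model are present.

    Hypothetical helper for illustration; not part of nlm-ingestor.
    """
    binary = shutil.which("tesseract")  # None when the binary is not on PATH
    return {
        "binary_path": binary,
        "binary_found": binary is not None,
        "eng_model_found": tessdata_file.is_file(),
    }


if __name__ == "__main__":
    for key, value in check_tesseract_install().items():
        print(f"{key}: {value}")
```

Running the script in the container prints one line per check; a missing binary or model file suggests the tesseract `RUN` step did not complete as intended.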

README.md

Lines changed: 22 additions & 13 deletions
@@ -8,15 +8,16 @@ The PDF parser works off text layer and also offers a OCR option (apply_ocr) to
 Check out the notebook [pdf_visual_ingestor_step_by_step](notebooks/pdf_visual_ingestor_step_by_step.ipynb) to experiment directly with the PDF parser.

 The PDF Parser offers the following features:
-1. Sections and subsections along with their levels.
-2. Paragraphs - combines lines.
-3. Links between sections and paragraphs.
-5. Tables along with the section the tables are found in.
-6. Lists and nested lists.
-7. Join content spread across pages.
-8. Removal of repeating headers and footers.
-9. Watermark removal.
-10. OCR with boundary boxes
+
+1. Sections and subsections along with their levels.
+2. Paragraphs - combines lines.
+3. Links between sections and paragraphs.
+5. Tables along with the section the tables are found in.
+6. Lists and nested lists.
+7. Join content spread across pages.
+8. Removal of repeating headers and footers.
+9. Watermark removal.
+10. OCR with boundary boxes

 ### HTML
 A special HTML parser that creates layout aware blocks to make RAG performance better with higher quality chunks.
@@ -47,14 +48,22 @@ In some cases, your PDFs may result in errors in the Java server and you will ne
 python -m nlm_ingestor.ingestion_daemon
 ```
 ### Run the docker file
-A docker image is available via github container registry. Before running the following code, you may need to authenticate with docker first
-cat ~/TOKEN.txt | docker login https://ghcr.io -u USERNAME --password-stdin
-where TOKEN.txt is the token you create as described here: https://docs.github.com/en/[email protected]/packages/working-with-a-github-packages-registry/working-with-the-docker-registry
+A docker image is available via public github container registry.

+Pull the docker image
 ```
 docker pull ghcr.io/nlmatics/nlm-ingestor:latest
-docker run nlm-ingestor-<version>
 ```
+Run the docker image mapping the port 5001 to port of your choice.
+```
+docker run -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor:latest-<version>
+```
+Once you have the server running, your llmsherpa url will be:
+"http://localhost:5010/api/parseDocument?renderFormat=all"
+- to apply OCR add &applyOcr=yes
+- to use the new indent parser which uses a different alogrithm to assign header levels, add &useNewIndentParser=yes
+- this server is good for your development - in production it is recommended to run this behind a secure gateway using nginx or cloud gateways
+
 ### Test the ingestor server
 Sample test code to test the server with llmsherpa parser is in this [notebook](notebooks/test_llmsherpa_api.ipynb).
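
The added README text describes a base URL plus two optional query parameters, `applyOcr` and `useNewIndentParser`. As a rough sketch, the URL could be assembled as follows. The function `build_parse_url` and its argument names are hypothetical; the host, port 5010, path, and parameter names come from the README lines in the diff.

```python
from urllib.parse import urlencode


def build_parse_url(host="localhost", port=5010,
                    apply_ocr=False, use_new_indent_parser=False):
    """Build the parseDocument URL described in the README diff.

    Hypothetical helper for illustration; not part of nlm-ingestor or llmsherpa.
    """
    params = {"renderFormat": "all"}
    if apply_ocr:
        params["applyOcr"] = "yes"  # the server enables OCR only for "yes"
    if use_new_indent_parser:
        params["useNewIndentParser"] = "yes"
    return f"http://{host}:{port}/api/parseDocument?{urlencode(params)}"


if __name__ == "__main__":
    print(build_parse_url())
    print(build_parse_url(apply_ocr=True, use_new_indent_parser=True))
```

The port defaults to 5010 only because the README example maps `5010:5001`; any other host port chosen in `docker run -p` would be passed in instead.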

nlm_ingestor/file_parser/tika_parser.py

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ def __init__(self):

     def parse_to_html(self, filepath, do_ocr=False):
         # Turn off OCR by default
-        timeout = 9000
+        timeout = 3000
         headers = {
             "X-Tika-OCRskipOcr": "true"
         }

nlm_ingestor/ingestion_daemon/__main__.py

Lines changed: 7 additions & 5 deletions
@@ -20,15 +20,17 @@ def parse_document(
     render_format: str = "all",
 ):
     render_format = request.args.get('renderFormat', 'all')
-    use_new_indent_parser = request.args.get('useNewIndentParser', 'all')
+    use_new_indent_parser = request.args.get('useNewIndentParser', 'no')
+    apply_ocr = request.args.get('applyOcr', 'no')
     file = request.files['file']
     tmp_file = None
     try:
         parse_options = {
             "parse_and_render_only": True,
             "render_format": render_format,
-            "use_new_indent_parser": use_new_indent_parser,
-            "parse_pages": ()
+            "use_new_indent_parser": use_new_indent_parser == "yes",
+            "parse_pages": (),
+            "apply_ocr": apply_ocr == "yes"
         }
         # save the incoming file to a temporary location
         filename = secure_filename(file.filename)
@@ -52,6 +54,7 @@ def parse_document(
         )

     except Exception as e:
+        print("error uploading file, stacktrace: ", traceback.format_exc())
         logger.error(
             f"error uploading file, stacktrace: {traceback.format_exc()}",
             exc_info=True,
@@ -65,8 +68,7 @@ def parse_document(

 def main():
     logger.info("Starting ingestor service..")
-    app.run(host="0.0.0.0", port=5001, debug=False)
-
+    app.run(host="0.0.0.0", port=5001, debug=True)

 if __name__ == "__main__":
     main()
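
The first hunk above turns two query parameters into booleans by comparing them with the literal string `"yes"`. The sketch below isolates that logic, using a plain `dict` as a stand-in for Flask's `request.args`. The helper names `flag_enabled` and `build_parse_options` are hypothetical; the option keys and defaults are copied from the diff.

```python
def flag_enabled(args, name):
    """Return True only when the query parameter is exactly "yes".

    Mirrors the diff's `request.args.get(name, 'no') == "yes"` pattern;
    `args` is any mapping with a dict-style .get().
    """
    return args.get(name, "no") == "yes"


def build_parse_options(args):
    """Assemble the parse_options dict the way the new code in the diff does."""
    return {
        "parse_and_render_only": True,
        "render_format": args.get("renderFormat", "all"),
        "use_new_indent_parser": flag_enabled(args, "useNewIndentParser"),
        "parse_pages": (),
        "apply_ocr": flag_enabled(args, "applyOcr"),
    }


if __name__ == "__main__":
    print(build_parse_options({}))
    print(build_parse_options({"applyOcr": "yes"}))
    print(build_parse_options({"applyOcr": "true"}))  # not "yes", so OCR stays off
```

One consequence of the exact-match comparison is that values such as `true`, `1`, or `Yes` leave the flag disabled. The removed line defaulted `useNewIndentParser` to `'all'`, a non-empty string, which would have counted as enabled wherever downstream code tested the option for truthiness.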

nlm_ingestor/ingestor/ingestor_api.py

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@ def ingest_document(
     logger.info(f"Parsing {mime_type} at {doc_location} with name {doc_name}")
     if mime_type == "application/pdf":
         logger.info("using pdf parser")
+        print("testing..", parse_options)
         pdfi = pdf_ingestor.PDFIngestor(doc_location, parse_options)
         return_dict = pdfi.return_dict
     elif mime_type in {"text/markdown", "text/x-markdown"}:

nlm_ingestor/ingestor/pdf_ingestor.py

Lines changed: 0 additions & 3 deletions
@@ -7,12 +7,9 @@

 from bs4 import BeautifulSoup

-from . import table_builder
 from nlm_ingestor.file_parser import pdf_file_parser
 from timeit import default_timer
-from .visual_ingestor import table_parser
 from .visual_ingestor import visual_ingestor
-from nlm_ingestor.ingestor.visual_ingestor import block_renderer
 from nlm_ingestor.ingestor.visual_ingestor.new_indent_parser import NewIndentParser
 from nlm_ingestor.ingestor_utils.utils import NpEncoder, \
     detect_block_center_aligned, detect_block_center_of_page
