Update PDF extraction and OCR options for hybrid chunking #557

Open · wants to merge 34 commits into main

Commits (34)
6790918
Update PDF extraction and OCR options for hybrid chunking
aakankshaduggal Feb 12, 2025
51fb86d
Update docling versions
aakankshaduggal Feb 12, 2025
90a7a4b
Update easyocr params
aakankshaduggal Feb 12, 2025
19ba945
Add docling-core[chunking] to requirements
aakankshaduggal Feb 12, 2025
def01ea
Update accelerator options for easy ocr
aakankshaduggal Feb 13, 2025
c7d20c2
Merge branch 'main' into hybrid-chunker
aakankshaduggal Feb 25, 2025
635171c
Update the pdf doc parser unload function for v2 parser
aakankshaduggal Feb 25, 2025
89bda78
Update exception and initialize chunks
aakankshaduggal Feb 25, 2025
7b6f051
Update chunk documents to avoid exporting to JSON and re-reading
aakankshaduggal Feb 25, 2025
f6c0eb7
Update tests for chunk documents, check for empty chunks
aakankshaduggal Feb 25, 2025
d1b7e31
Update import for chunking
aakankshaduggal Feb 25, 2025
1f1cfd3
Adding transformers and semchunk because module requires chunking extra
aakankshaduggal Feb 25, 2025
f1477b7
Take transformers out of the block from leanimports
aakankshaduggal Feb 25, 2025
5c28c5e
Update test_chunkers to update functional tests
aakankshaduggal Feb 25, 2025
f24e852
Update lean imports to not run into OOM error
aakankshaduggal Feb 25, 2025
c9878c1
Update src/instructlab/sdg/utils/chunkers.py
aakankshaduggal Feb 26, 2025
30bd3b5
Update use_gpu to None for easy ocr
aakankshaduggal Feb 27, 2025
0376926
Increase from 1.7 to a larger value to avoid the PyTorch MPS backend …
aakankshaduggal Feb 27, 2025
7709f0c
Remove docling core test from lean imports, add transformers back and…
aakankshaduggal Feb 28, 2025
ec99a89
fix: Move docling_core import inside method to avoid top-level transf…
Mar 11, 2025
09e7b09
fix: Remove semchunk and transformers from requirements.txt
Mar 11, 2025
bd3488b
ci: Update GitHub Actions workflow to use macos-latest-xlarge runners
Mar 11, 2025
7c11b3c
ci: Update GitHub Actions workflow free disk space condition to refle…
Mar 11, 2025
2cdb25e
test: Add fixture to force CPU usage on macOS CI environments
Mar 11, 2025
0a3eecd
test: Add fixture to force CPU usage on macOS CI environments, disabl…
Mar 11, 2025
755851c
test: test debug logging and MPS handling in macOS CI environment fix…
eshwarprasadS Mar 11, 2025
16f6b9f
test: Reorder imports and minor formatting in macOS CI fixture
eshwarprasadS Mar 11, 2025
bde48a3
test: Simplify macOS device handling in chunkers and test fixture
eshwarprasadS Mar 11, 2025
cb5f4b1
refactor: Simplify MPS handling in macOS CI test fixture
eshwarprasadS Mar 11, 2025
259bdb5
fix: add CI env check condition to disabling MPS
eshwarprasadS Mar 11, 2025
0d5a749
Revert latest commit to fix broken test
eshwarprasadS Mar 11, 2025
728443e
ci: update test workflow to use macos-latest platform
eshwarprasadS Mar 13, 2025
a33749b
ci: add CI environment variable to enable macOS MPS handling
eshwarprasadS Mar 13, 2025
a8273f4
refactor: Remove PDF extraction using docling parse, remove all refer…
eshwarprasadS Mar 13, 2025
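Several of the commits above adjust docling's EasyOCR and accelerator settings ("Update easyocr params", "Update accelerator options for easy ocr", "Update use_gpu to None for easy ocr"). As a rough illustration only, a docling 2.x PDF pipeline with those knobs can be configured along these lines; the class names are taken from docling's public API as an assumption, and the exact values used in this PR are not shown here:

```python
# Illustrative sketch, not the PR's code: configure docling's PDF pipeline with
# EasyOCR and an explicit accelerator device (docling >= 2.x API, assumed here).
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    EasyOcrOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    # use_gpu=None lets EasyOCR pick its own device (cf. "Update use_gpu to None for easy ocr").
    ocr_options=EasyOcrOptions(use_gpu=None),
    # Pin the accelerator explicitly, e.g. CPU on CI runners without a usable GPU.
    accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CPU),
)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
doc = converter.convert("sample.pdf").document  # a DoclingDocument, ready for chunking
```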
3 changes: 3 additions & 0 deletions .github/workflows/test.yml
@@ -107,6 +107,9 @@ jobs:
      - name: Run unit and functional tests with tox
        run: |
          tox
+       env:
+         # Increase from 1.7 to a greater value to avoid the PyTorch MPS backend running OOM.
+         PYTORCH_MPS_HIGH_WATERMARK_RATIO: 2.0

- name: Remove llama-cpp-python from cache
if: always()
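The later commits deal with macOS runners where the PyTorch MPS backend runs out of memory under CI; the workflow change above raises `PYTORCH_MPS_HIGH_WATERMARK_RATIO` above its 1.7 default, and the test commits add a fixture that forces CPU. A minimal sketch of such a fixture (the fixture name and exact mechanism are assumptions, not the PR's actual code):

```python
# Hypothetical autouse fixture: report MPS as unavailable in macOS CI so
# torch-based code (docling models, tokenizers, etc.) falls back to CPU.
import os
import platform

import pytest
import torch


@pytest.fixture(autouse=True)
def force_cpu_on_macos_ci(monkeypatch):
    if platform.system() == "Darwin" and os.environ.get("CI"):
        # torch.backends.mps.is_available() is what most libraries consult
        # before selecting the "mps" device; make it return False here.
        monkeypatch.setattr(torch.backends.mps, "is_available", lambda: False)
    yield
```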
6 changes: 3 additions & 3 deletions requirements.txt
@@ -1,9 +1,9 @@
# SPDX-License-Identifier: Apache-2.0
click>=8.1.7,<9.0.0
datasets>=2.18.0,<3.0.0
-docling[tesserocr]>=2.4.2,<=2.8.3; sys_platform != 'darwin'
-docling>=2.4.2,<=2.8.3; sys_platform == 'darwin'
-docling-parse>=2.0.0,<3.0.0
+docling-core[chunking]>=2.9.0
+docling[tesserocr]>=2.9.0; sys_platform != 'darwin'
+docling>=2.9.0; sys_platform == 'darwin'
GitPython>=3.1.42,<4.0.0
gguf>=0.6.0
httpx>=0.25.0,<1.0.0
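The switch to `docling-core[chunking]` (together with the transformers/semchunk commits) is what lets converted documents be chunked in memory with the hybrid chunker instead of being exported to JSON and re-read. A minimal sketch of that flow, assuming docling-core's public `HybridChunker` API and an illustrative `max_tokens` value rather than the PR's settings:

```python
# Sketch only: chunk a DoclingDocument in memory with docling-core's HybridChunker.
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker

doc = DocumentConverter().convert("knowledge_doc.pdf").document

chunker = HybridChunker(max_tokens=512)  # token-aware splitting plus merging of small peers
chunks = [chunk.text for chunk in chunker.chunk(dl_doc=doc)]
print(f"{len(chunks)} chunks produced")
```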