Migrate from spaCy and nltk #154

nonprofittechy · 2025-10-30T19:06:38Z

This just merges the smaller changes into main that have already been reviewed.

The purpose of this PR is to remove all of the clunky older ML libraries that this package used to rely on, both to improve performance and to reduce costs (since those functions were being hosted on an always-on server).

Fix #135
Fix #145
Fix #133
Fix #131
Fix #128
Fix #79
Fix #37
Fix #39
Fix #41
Fix #42

Harder than anticipated, for a few reasons:

* Can't get things in order. I guess that's the point of PDFMiner, but...
* PDFMiner doesn't give you AcroForms at all. It has a completely hardcoded way of getting them, outside the context of the page.
* We can do a few things:
  * Kinda put all of the fields back in the original text (see replace_original_text). Doesn't work too well though: lots of duplicate pieces of text put many of the fields in the same place when they should be in different places.
  * Could gather fields with the same adjacent text, and get all parts of that text in the PDF. Not guaranteed to be in order though.
  * For each field, get all of the surrounding context. Is okay! But it consistently gets too much text for GPT4. Even if we make it smaller, sometimes the surrounding context isn't the full sentence, or it picks up too much from other fields (too much shared text, which confuses the two fields).

TBH next goal is to try the PDFPageAndFieldInterpreter approach, notes in there.
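
The "surrounding context for each field" option described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `Box`, `TextLine`, `Field`, and `surrounding_context` are hypothetical names, the text lines are assumed to have already been extracted with bounding boxes in PDF coordinates (origin at the bottom left), and no PDFMiner calls are used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Axis-aligned bounding box on one page (hypothetical structure)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def expanded(self, margin: float) -> "Box":
        # Grow the box by `margin` points on every side.
        return Box(self.x0 - margin, self.y0 - margin,
                   self.x1 + margin, self.y1 + margin)

    def overlaps(self, other: "Box") -> bool:
        # Two boxes overlap unless one lies entirely to one side of the other.
        return not (self.x1 < other.x0 or other.x1 < self.x0
                    or self.y1 < other.y0 or other.y1 < self.y0)


@dataclass
class TextLine:
    text: str
    box: Box


@dataclass
class Field:
    name: str
    box: Box


def surrounding_context(field: Field, lines: List[TextLine],
                        margin: float = 40.0, max_chars: int = 300) -> str:
    """Collect text near a field, in reading order, capped at max_chars."""
    window = field.box.expanded(margin)
    nearby = [ln for ln in lines if ln.box.overlaps(window)]
    # Origin is bottom left, so a larger y1 means higher on the page:
    # sort top to bottom, then left to right.
    nearby.sort(key=lambda ln: (-ln.box.y1, ln.box.x0))
    context = " ".join(ln.text.strip() for ln in nearby)
    return context[:max_chars]
```

The trade-off noted above shows up directly in the two parameters: a larger `margin` pulls in labels that belong to neighboring fields, while a smaller `margin` or `max_chars` can cut a sentence short.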

BryceStevenWilley and others added 30 commits June 16, 2024 22:04

Got PDF fields in the text of the PDF working

db6a496

Mypy fixes

78cdfde

Add license and passive voice test dataset from PassivePy repo for future evaluations

85e7d8c

Checkpoint - this basically works, could do more optimization though

2932927

Simplified to use chatcompletion API again; performance reaching 94%

6e58010

Do tokenization only with regex

a329380

Remove references to responses API as we don't use that anymore; update unit test accordingly

749face

Format with black; remove extra global call

d5223fa

Update formfyxer/passive_voice_detection.py

4a82d3d

Co-authored-by: Bryce Willey

Update formfyxer/passive_voice_detection.py

9a7d5cd

Co-authored-by: Bryce Willey

Address feedback from PR; move integration test, drop note to performance in the repo itself instead of just PR

f90822a

Merge branch 'replace-passivepy' of github.com:SuffolkLITLab/FormFyxer into replace-passivepy

a720859

Merge pull request #147 from SuffolkLITLab/replace-passivepy

c08bfd4

Replace passivepy with a call to an LLM

Merge branch 'main' into migrate-from-spaCy-and-nltk

cbf3db5

Merge with main

0b3fc7e

Fix #149 - allow working without .env; pull creds from docassemble config when it's available there

4a66841

Less repetitive code

f726f78

Formatting

15d69be

Merge branch 'main' into migrate-from-spaCy-and-nltk

f52093a

Merge branch 'main' into migrate-from-spaCy-and-nltk

cb93ac2

Merge branch 'migrate-from-spaCy-and-nltk' into use-docassemble-config-when-available

330cf73

Typing

44e8582

Correct ignore for optional import

d2cc7c8

Merge pull request #150 from SuffolkLITLab/use-docassemble-config-when-available

6586554

Allow working without .env; pull creds from docassemble config

Update to gpt-5-nano; refactor screen grouping to use LLM

bf577a0

Refactor - use prompts/.txt consistently; use cleaner API for text completion functions

0ddf093

Merge from pdf_context_extract branch before updating normalize_name function

9758e78

Fix merge

62631ad

GPT-based, in-context field labeling

dc4cce0

nonprofittechy and others added 15 commits October 7, 2025 08:52

WIP - working on eval

a15d6de

Use LLM for renaming fields; start on tests, some integration tests and started promptfoo but still needs work

e83e7f0

Remove joblibs; nltk dependency

b5fb08c

Pretty much remove all of the janky dependencies

8f0bfd9

remove unused dependencies; fix title fallback discovered in smoke test

01ebb16

cleanup some imports

ceb2994

Cleanup readme, remove some redundant comments

8e5228f

Typing changes

ee7fec2

Update test to match 2 message format (system and user)

9d277db

Update formfyxer/prompts/guess_form_name.txt

8fb38c8

Co-authored-by: Copilot

Update formfyxer/prompts/describe_form.txt

6834824

Co-authored-by: Copilot

Update formfyxer/pdf_wrangling.py

be15493

Co-authored-by: Bryce Willey

Update formfyxer/pdf_wrangling.py

fa8081f

Co-authored-by: Bryce Willey

Remove unused test; simplify repetitive code

68f6080

Merge pull request #153 from SuffolkLITLab/field-grouping-with-llm

c345bbb

Finish the migration to LLMs; removing NLTK, etc.

nonprofittechy merged commit 1126236 into main Oct 30, 2025
3 checks passed

nonprofittechy deleted the migrate-from-spaCy-and-nltk branch October 30, 2025 19:07

3 participants
