Skip to content

Conversation

@symroe
Copy link
Member

@symroe symroe commented Nov 20, 2025

In this PR I do a couple of things:

  1. Remove the SOPN testing baseline code. This was more of a one off feature that we added to compare the existing SOPN tooling against AWS Textract. We wanted a way to validate that Textract was at least as good as our existing code. We validated this to the point where we're getting rid of the existing code (the next commit). It might be useful to have the ability to generate a baseline in future, but this code is already quite stale so I think it's better to get rid of it and write new code if we want something like this in future.
  2. We kept both old and new systems for parsing PDFs into DataFrames. We don't want to use Camelot any more, partly because Textract is much better (it deals with images, for one thing), but also because it's not maintained any more. In the second commit, I remove everything I can find to do with Camelot.

A couple of extra points:

  1. I've not removed all local PDF libraries. This is because we have some tooling to split up PDFs by page that relies on these libraries. There are changes I'd like to make to this system over time, but I consider that out of scope for this PR. This means we still need to install some local libraries at the system level.
  2. I've removed some of the parse tables tests because they were tied to Camelot not Textract. We want something like these tests in future, but they would be testing code that is likely to change again in the near future, so I've decided to delete them with an eye to re-making them as part of future work. Again, the refactor wasn't worth the time for code that's about to change anyway.

This was used to capture baselines for Camelot. We no longer need this
code
This is a bit of a tangle to remove cleanly. I think I've removed too
much, especially some of the tests that really should be converted to
use AWS Textract rather than just removing them.

The plan is to built these up again with a bit of a rethink / redesign
of the whole system. Maintaining these tests while refactoring wouldn't
be a good idea, so I suggest revisiting them later.
@symroe symroe requested a review from chris48s November 20, 2025 13:11
@chris48s
Copy link
Member

chris48s commented Nov 21, 2025

It is quite hard for me to provide a really good review of this because I'm not super familiar with this area of the code. I also don't have a really clear idea of what your vision is for the next bits of work following on from this. I will try and give some sensible-ish comments as best I can.


Having deleted all this code, are we now in a position to remove camelot-py[cv] from pyproject.toml, even if we have to keep the other python/system libs for PDF parsing?


I've not removed all local PDF libraries

Not for this PR, but if we are keeping pdfminer.six and pypdf long-term, we should see if we can get them upgraded to current versions.


One thing we lose by removing the Camelot backend is the ability to have a "working" local application with out connecting it to AWS. That doesn't mean we should keep Camelot, but do you think there is any mileage in having a "mock" PDF parsing backend? I am thinking we write a "parser" where you feed it any PDF and it throws the PDF away and just returns the same hard-coded list of people/parties. Worth bothering with?


There's a code formatting error to resolve with ruff to get the build passing.


.PHONY: populate-sopn-testing-database
populate-sopn-testing-database: migrate-db
python manage.py candidates_import_from_live_site
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit of a tangent, but while we're getting trigger happy with the delete button, shall we bin candidates_import_from_live_site too? I reckon now we have https://github.com/DemocracyClub/yournextrepresentative/blob/master/scripts/get-prod-db.sh this is basically just going to rot.

@chris48s
Copy link
Member

chris48s commented Dec 2, 2025

Quick thing I noticed while working on something else..
There are some tests in the test suite with a:

@skipIf(should_skip_conversion_tests(), "Required conversion libs not installed")

decorator on them. Can you double-check if that is something we can get rid of as part of this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants