Remove camelot #2624

symroe · 2025-11-20T13:10:50Z

In this PR I do a couple of things:

Remove the SOPN testing baseline code. This was more of a one off feature that we added to compare the existing SOPN tooling against AWS Textract. We wanted a way to validate that Textract was at least as good as our existing code. We validated this to the point where we're getting rid of the existing code (the next commit). It might be useful to have the ability to generate a baseline in future, but this code is already quite stale so I think it's better to get rid of it and write new code if we want something like this in future.
We kept both old and new systems for parsing PDFs into DataFrames. We don't want to use Camelot any more, partly because Textract is much better (it deals with images, for one thing), but also because it's not maintained any more. In the second commit, I remove everything I can find to do with Camelot.

A couple of extra points:

I've not removed all local PDF libraries. This is because we have some tooling to split up PDFs by page that relies on these libraries. There are changes I'd like to make to this system over time, but I consider that out of scope for this PR. This means we still need to install some local libraries at the system level.
I've removed some of the parse tables tests because they were tied to Camelot not Textract. We want something like these tests in future, but they would be testing code that is likely to change again in the near future, so I've decided to delete them with an eye to re-making them as part of future work. Again, the refactor wasn't worth the time for code that's about to change anyway.

This was used to capture baselines for Camelot. We no longer need this code

This is a bit of a tangle to remove cleanly. I think I've removed too much, especially some of the tests that really should be converted to use AWS Textract rather than just removing them. The plan is to built these up again with a bit of a rethink / redesign of the whole system. Maintaining these tests while refactoring wouldn't be a good idea, so I suggest revisiting them later.

chris48s · 2025-11-21T09:03:31Z

It is quite hard for me to provide a really good review of this because I'm not super familiar with this area of the code. I also don't have a really clear idea of what your vision is for the next bits of work following on from this. I will try and give some sensible-ish comments as best I can.

Having deleted all this code, are we now in a position to remove camelot-py[cv] from pyproject.toml, even if we have to keep the other python/system libs for PDF parsing?

I've not removed all local PDF libraries

Not for this PR, but if we are keeping pdfminer.six and pypdf long-term, we should see if we can get them upgraded to current versions.

One thing we lose by removing the Camelot backend is the ability to have a "working" local application with out connecting it to AWS. That doesn't mean we should keep Camelot, but do you think there is any mileage in having a "mock" PDF parsing backend? I am thinking we write a "parser" where you feed it any PDF and it throws the PDF away and just returns the same hard-coded list of people/parties. Worth bothering with?

There's a code formatting error to resolve with ruff to get the build passing.

chris48s · 2025-11-21T09:03:56Z

Makefile

-
-.PHONY: populate-sopn-testing-database
-populate-sopn-testing-database: migrate-db
-	python manage.py candidates_import_from_live_site


This is a bit of a tangent, but while we're getting trigger happy with the delete button, shall we bin candidates_import_from_live_site too? I reckon now we have https://github.com/DemocracyClub/yournextrepresentative/blob/master/scripts/get-prod-db.sh this is basically just going to rot.

chris48s · 2025-12-02T10:37:08Z

Quick thing I noticed while working on something else..
There are some tests in the test suite with a:

@skipIf(should_skip_conversion_tests(), "Required conversion libs not installed")

decorator on them. Can you double-check if that is something we can get rid of as part of this PR.

symroe added 2 commits November 15, 2025 10:22

Remove SOPN tooling

0a5dac7

This was used to capture baselines for Camelot. We no longer need this code

symroe requested a review from chris48s November 20, 2025 13:11

chris48s reviewed Nov 21, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Remove camelot #2624

Remove camelot #2624

Uh oh!

symroe commented Nov 20, 2025

Uh oh!

chris48s commented Nov 21, 2025 •

edited

Loading

Uh oh!

chris48s Nov 21, 2025

Uh oh!

chris48s commented Dec 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Remove camelot #2624

Are you sure you want to change the base?

Remove camelot #2624

Uh oh!

Conversation

symroe commented Nov 20, 2025

Uh oh!

chris48s commented Nov 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chris48s Nov 21, 2025

Choose a reason for hiding this comment

Uh oh!

chris48s commented Dec 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

chris48s commented Nov 21, 2025 •

edited

Loading