Conversation

@Luis-manzur
Contributor

This pull request introduces a new post-extraction text cleanup mechanism to the codebase. The main addition is a cleanup_extracted_text method, which allows for sanitizing plain text after it has been extracted from source documents. This enables removal of extraction artifacts and unwanted content, improving the quality of processed text. The method is implemented as a no-op in the abstract base class and is overridden with custom logic in a subclass for SCOTUS slip opinions. Additionally, the sample caller is updated to use the cleaned text for further processing.
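A minimal sketch of the mechanism described above. The class names `BaseSite` and `ScotusSlipSite` are hypothetical stand-ins, not the actual Juriscraper classes; only the method name `cleanup_extracted_text` and the marker string are taken from this PR.

```python
class BaseSite:
    """Hypothetical stand-in for the abstract base class."""

    def cleanup_extracted_text(self, content: str) -> str:
        """No-op by default: return the extracted text unchanged."""
        return content


class ScotusSlipSite(BaseSite):
    """Hypothetical stand-in for the SCOTUS slip opinions subclass."""

    def cleanup_extracted_text(self, content: str) -> str:
        """Remove the page-proof marker line left behind by extraction."""
        return content.replace("Page Proof Pending Publication\n", "")


raw = "brought against the Board under\nPage Proof Pending Publication\n  the APA."

# The base class leaves the text alone; the subclass strips the marker line.
base_result = BaseSite().cleanup_extracted_text(raw)
scotus_result = ScotusSlipSite().cleanup_extracted_text(raw)
```

Keeping the default as a no-op means scrapers that do not need cleanup are unaffected, and callers can invoke the method unconditionally.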

This PR addresses #1651.

@Luis-manzur Luis-manzur requested review from flooie and grossir November 5, 2025 20:42
@Luis-manzur Luis-manzur linked an issue Nov 5, 2025 that may be closed by this pull request
@Luis-manzur Luis-manzur moved this to PRs to Review in Case Law Sprint Nov 5, 2025
Contributor

@grossir grossir left a comment

This is almost ready; some small style changes are needed.

:param content: The scraped text
:return: Cleaned text
"""
content = content.replace("Page Proof Pending Publication\n", "")
Contributor

I picked up an existing opinion's text, and this will work as expected:

>>> text = "ost joined a suit brought against the Board under\nPage Proof Pending Publication\n  the Administrative Procedure Act (APA). The complaint c"
>>> text.replace("Page Proof Pending Publication\n", "")
'ost joined a suit brought against the Board under\n  the Administrative Procedure Act (APA). The complaint c'

@grossir grossir assigned Luis-manzur and unassigned flooie Nov 5, 2025
@Luis-manzur Luis-manzur requested a review from grossir November 6, 2025 14:49
Contributor

@grossir grossir left a comment

This looks ready to me. Just one comment below on the order of calls (which should end up being the same in CL). I think @flooie will want to check this one, since it introduces a new feature to Juriscraper.

Comment on lines +150 to +152
cleaned_extracted_text = site.cleanup_extracted_text(extracted_content)

metadata_dict = site.extract_from_text(cleaned_extracted_text)
Contributor

I think site.extract_from_text should be called first, on the raw extracted_content, and site.cleanup_extracted_text afterwards. We may use some of the content we want to clean as a reference point for finding the content we want to extract: site.cleanup_extracted_text reduces the amount of text in the file, while site.extract_from_text may rely on that extra text to locate metadata.
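A small sketch of why the order of the two calls can matter. `SketchSite` and its `page_proof` metadata field are hypothetical, invented only to illustrate an extractor that uses the soon-to-be-removed marker as a landmark; only the two method names come from this PR.

```python
MARKER = "Page Proof Pending Publication\n"


class SketchSite:
    """Hypothetical site whose extractor depends on text the cleanup removes."""

    def extract_from_text(self, scraped_text: str) -> dict:
        # Hypothetical metadata: flag opinions that still carry the marker.
        return {"page_proof": MARKER in scraped_text}

    def cleanup_extracted_text(self, content: str) -> str:
        return content.replace(MARKER, "")


site = SketchSite()
extracted_content = "under\nPage Proof Pending Publication\n  the APA."

# Suggested order: extract metadata from the raw text, then clean it.
metadata_raw_first = site.extract_from_text(extracted_content)
cleaned_extracted_text = site.cleanup_extracted_text(extracted_content)

# Reversed order: the landmark is gone before the extractor can see it.
metadata_after_cleanup = site.extract_from_text(cleaned_extracted_text)
```

With extraction first, the extractor sees the marker; after cleanup, the same extractor no longer finds it.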

@flooie
Contributor

flooie commented Nov 13, 2025

@Luis-manzur this needs a test. Since it's a new feature, it should probably be a new test. Please try to keep it minimal, in the same vein as the extract_from_text tests.
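One possible shape for such a minimal test, sketched under assumptions: `FakeScotusSite` is a hypothetical stand-in for the real scraper class, and the table of input/expected pairs is only a guess at a compact layout, since the structure of the existing extract_from_text tests is not shown in this thread.

```python
import unittest


class FakeScotusSite:
    """Hypothetical stand-in; a real test would import the scraper's Site class."""

    def cleanup_extracted_text(self, content: str) -> str:
        return content.replace("Page Proof Pending Publication\n", "")


class CleanupExtractedTextTest(unittest.TestCase):
    # (raw extracted text, expected cleaned text) pairs
    cases = [
        (
            "Board under\nPage Proof Pending Publication\n  the APA.",
            "Board under\n  the APA.",
        ),
        # Text without the marker should pass through unchanged.
        ("No marker here.\n", "No marker here.\n"),
    ]

    def test_cleanup_extracted_text(self):
        site = FakeScotusSite()
        for raw, expected in self.cases:
            with self.subTest(raw=raw):
                self.assertEqual(site.cleanup_extracted_text(raw), expected)


# Run the test case programmatically so the sketch is self-contained.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanupExtractedTextTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```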

@Luis-manzur Luis-manzur assigned flooie and unassigned Luis-manzur Nov 14, 2025

Labels

None yet

Projects

Status: PRs to Review

Development

Successfully merging this pull request may close these issues.

add clean_extracted_text method

4 participants