1651 add clean extracted text method #1653
grossir
left a comment
This is almost ready; some small style changes are needed.
    :param content: The scraped text
    :return: Cleaned text
    """
    content = content.replace("Page Proof Pending Publication\n", "")
I tested this against the text of an existing opinion, and it works as expected:
>>> text = "ost joined a suit brought against the Board under\nPage Proof Pending Publication\n the Administrative Procedure Act (APA). The complaint c"
>>> text.replace("Page Proof Pending Publication\n", "")
'ost joined a suit brought against the Board under\n the Administrative Procedure Act (APA). The complaint c'
juriscraper/opinions/united_states/federal_appellate/scotus_slip.py
This looks ready to me. Just one comment below on the order of calls (which should end up being the same in CL). I think @flooie will want to check this one, since it introduces a new feature to Juriscraper.
    cleaned_extracted_text = site.cleanup_extracted_text(extracted_content)

    metadata_dict = site.extract_from_text(cleaned_extracted_text)
I think the site.extract_from_text call should go first, on the uncleaned text, since some of the content we want to clean up may serve as a reference for finding the content we want to extract. site.cleanup_extracted_text reduces the amount of text in the file, while site.extract_from_text may rely on that extra text to find things.
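To illustrate the ordering concern, here is a minimal sketch. The two functions are stand-ins rather than Juriscraper's actual methods, and the `is_page_proof` field is a hypothetical piece of metadata invented for this example; the only thing taken from the PR is the marker string being removed.

```python
MARKER = "Page Proof Pending Publication\n"


def extract_from_text(text: str) -> dict:
    # Hypothetical extractor: uses the marker itself as a signal.
    # "is_page_proof" is an invented field, for illustration only.
    return {"is_page_proof": MARKER in text}


def cleanup_extracted_text(text: str) -> str:
    # Same replacement as the SCOTUS slip override under review.
    return text.replace(MARKER, "")


raw = (
    "ost joined a suit brought against the Board under\n"
    "Page Proof Pending Publication\n"
    " the Administrative Procedure Act (APA). The complaint c"
)

# Extract first, while the marker is still present, then clean.
metadata = extract_from_text(raw)
cleaned = cleanup_extracted_text(raw)

# Extracting from the cleaned text loses the signal.
metadata_after_cleanup = extract_from_text(cleaned)
```

With this ordering `metadata` still sees the marker, while `metadata_after_cleanup` does not, which is the kind of loss the comment above warns about.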
@Luis-manzur this needs a test; since it's a new feature, probably a new test. Please try to keep it minimal, in the same vein as the extract_from_text tests.
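A minimal test along those lines might look like the sketch below. It assumes a table of (input, expected) pairs, and it exercises a stand-in function rather than the real scraper class; the test class name and the case layout are hypothetical, not the project's existing test structure.

```python
import unittest


def cleanup_extracted_text(content: str) -> str:
    # Stand-in for the SCOTUS slip override quoted in this review.
    return content.replace("Page Proof Pending Publication\n", "")


class CleanupExtractedTextTest(unittest.TestCase):
    # (raw extracted text, expected cleaned text) pairs
    cases = [
        (
            "under\nPage Proof Pending Publication\n the Administrative",
            "under\n the Administrative",
        ),
        # Text without the marker should pass through unchanged.
        ("no marker here\n", "no marker here\n"),
    ]

    def test_cleanup_extracted_text(self):
        for raw, expected in self.cases:
            with self.subTest(raw=raw):
                self.assertEqual(cleanup_extracted_text(raw), expected)
```

In the real suite the stand-in would be replaced by a call on the actual site object.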
…thod' into 1651-add-clean_extracted_text-method
for more information, see https://pre-commit.ci
This pull request introduces a new post-extraction text cleanup mechanism to the codebase. The main addition is a cleanup_extracted_text method, which allows for sanitizing plain text after it has been extracted from source documents. This enables removal of extraction artifacts and unwanted content, improving the quality of processed text. The method is implemented as a no-op in the abstract base class and is overridden with custom logic in a subclass for SCOTUS slip opinions. Additionally, the sample caller is updated to use the cleaned text for further processing.
This PR addresses #1651.
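The no-op-plus-override arrangement described in the summary can be sketched roughly as follows. `BaseSite` and `ScotusSlipSite` are hypothetical names used only for this illustration, not the project's real class names; the replacement string is the one from the diff under review.

```python
class BaseSite:
    """Hypothetical stand-in for the abstract base class."""

    def cleanup_extracted_text(self, content: str) -> str:
        # Default behavior: return the extracted text unchanged (no-op).
        return content


class ScotusSlipSite(BaseSite):
    """Hypothetical stand-in for the SCOTUS slip opinions scraper."""

    def cleanup_extracted_text(self, content: str) -> str:
        # Remove the page-proof banner left over from text extraction.
        return content.replace("Page Proof Pending Publication\n", "")


sample = "brought against the Board under\nPage Proof Pending Publication\n the APA"

unchanged = BaseSite().cleanup_extracted_text(sample)
cleaned = ScotusSlipSite().cleanup_extracted_text(sample)
```

Scrapers that do not override the method keep their current behavior, so the feature is opt-in per court.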