@grossir grossir commented Jan 10, 2025

Solves #1292

Now parsing: disposition, docket_number and judges

@grossir grossir requested a review from flooie January 10, 2025 01:19
@grossir grossir self-assigned this Jan 10, 2025
Comment on lines 33 to 34
"aff in pt, vacate, & rem in pt": "Affirm in part, vacate and remand in part",
"aff in pt & vacate": "Affirm and vacate", # https://www.courtlistener.com/opinion/9502826/state-v-scott/pdf/
Contributor

We may need to fine-tune these dispositions.

Contributor Author

I can see that this one:
"aff in pt & vacate": "Affirm and vacate"
should be:
"aff in pt & vacate": "Affirm in part and vacate"

Any others?

Contributor

I would suggest the past tense. "Dismiss" should match "Affirmed": Dismissed, Reversed and Remanded, etc.

Also, "Aff in pt & vacate" should be "Affirmed in Part and Vacated".
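
A minimal sketch of how the mapping might look with the past-tense suggestion applied. The names `DISPOSITIONS` and `normalize_disposition` are hypothetical and not taken from the PR. Only the two abbreviations quoted in this thread are included, and the past-tense wording of the first entry is an assumption extrapolated from the suggestion above.

```python
# Hypothetical mapping: only the two abbreviations quoted in this thread.
# The long forms follow the reviewer's past-tense suggestion; the first
# entry's wording is an assumption, not text from the merged code.
DISPOSITIONS = {
    "aff in pt, vacate, & rem in pt": "Affirmed in part, vacated and remanded in part",
    "aff in pt & vacate": "Affirmed in part and vacated",
}


def normalize_disposition(raw: str) -> str:
    """Return the long-form disposition, or the stripped input if unknown."""
    # Lowercase and collapse runs of whitespace so minor formatting
    # differences in the source text do not cause a lookup miss.
    key = " ".join(raw.lower().split())
    return DISPOSITIONS.get(key, raw.strip())
```

Falling back to the stripped input rather than raising keeps unknown abbreviations visible in the output, which may make it easier to spot entries that still need a mapping.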

@flooie flooie left a comment

The current regex pattern for citations is too restrictive. It seems to require a citation for processing, causing cases without one to be skipped. This should be loosened to ensure that all relevant data is captured, even when a citation is missing.

The back scraper and extract_from_text method only work with PDFs from 2005 onward, as that’s when the court transitioned to a new format. Before 2005, the extraction fails due to a change in text patterns.

We should add HTML cleanup code as well.

The regular backscraper is likely to fail starting around 2009-2010 because of overly strict regex constraints. Adjusting the pattern to accommodate format variations would improve reliability.

Overall, I liked what you did.
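
One way to loosen the citation requirement described above is to wrap the citation in an optional group, so a document without one still yields the other fields. This is only a sketch: the header layout, the `2024 S.D. 12` citation shape, the `#`-prefixed docket number, and the names `HEADER_PATTERN` and `extract_header` are all assumptions for illustration, not the scraper's actual pattern.

```python
import re

# Hypothetical header layout; the real opinion text may differ.
HEADER_PATTERN = re.compile(
    r"(?:(?P<citation>\d{4} S\.D\. \d+)\s+)?"  # citation is optional
    r"#(?P<docket_number>\d+)"                 # docket number is required
)


def extract_header(text: str) -> dict:
    """Return the fields found in the text, omitting any that are missing."""
    match = HEADER_PATTERN.search(text)
    if not match:
        return {}
    # Optional groups that did not participate in the match are None.
    return {k: v for k, v in match.groupdict().items() if v is not None}
```

With this shape, `extract_header("#30123")` still returns the docket number even though no citation is present, instead of skipping the case.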


flooie commented Jan 23, 2025

@grossir this is still with you, right?


grossir commented Jan 23, 2025

@flooie yes, I still have to get back to this

@flooie flooie moved this from Buffer Zone to PRs to Review in Case Law Sprint Mar 24, 2025
@grossir grossir force-pushed the fix_sd_extract_from_text branch from 5712f7b to d680a1a Compare April 8, 2025 16:49
@grossir grossir force-pushed the fix_sd_extract_from_text branch from d680a1a to e94da40 Compare April 8, 2025 16:57
@grossir grossir requested a review from flooie April 8, 2025 17:05
@grossir grossir assigned flooie and unassigned grossir Apr 8, 2025

grossir commented Apr 8, 2025

I didn't implement logic for the HTML pages before 2005, since we already have that data and it would complicate the scraper logic for little gain.

@flooie flooie left a comment

Lots of work, thanks @grossir

@flooie flooie merged commit 9f022ce into main Apr 18, 2025
13 checks passed
@flooie flooie deleted the fix_sd_extract_from_text branch April 18, 2025 14:51
@github-project-automation github-project-automation bot moved this from PRs to Review to Done in Case Law Sprint Apr 18, 2025

3 participants