1301 nd neutral citation in the html #1477

Open
wants to merge 10 commits into
base: main
from

Conversation

Luis-manzur
Contributor

This pull request introduces improvements to the nd scraper in juriscraper to extract citations from HTML and updates the CHANGES.md file to reflect this enhancement. The most significant changes include adding citation handling to the scraper, modifying the _process_html method to parse and store citations, and updating the ordered fields for the scraper.
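To illustrate why the ordered fields have to change along with `_process_html`, here is a minimal sketch of how positional values could be paired with field names. The field names and their order are assumptions made for illustration and may not match the scraper's real list; `build_case` is a hypothetical helper.

```python
# Hypothetical field order; the real list lives on the scraper class and may differ.
ORDERED_FIELDS = ["citation", "name", "docket", "date", "nature_of_suit", "judge"]


def build_case(values: list[str]) -> dict[str, str]:
    """Pair positional values scraped from one row with their field names."""
    # zip() stops at the shorter input, so any extra trailing values are ignored.
    return dict(zip(ORDERED_FIELDS, values))
```

Because the pairing is purely positional, appending a citation to the values without adding a matching entry to the field list would shift every later field by one.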

Enhancements to the nd scraper:

@Luis-manzur Luis-manzur requested a review from flooie July 2, 2025 14:43
@Luis-manzur Luis-manzur linked an issue Jul 2, 2025 that may be closed by this pull request
@Luis-manzur Luis-manzur moved this to PRs to Review in Case Law Sprint Jul 2, 2025
@flooie
Contributor

flooie commented Jul 2, 2025

Can you update this with a history?

# History:
#  - 2014-07-10: Created by Andrei Chelaru
#  - 2014-11-07: Updated by mlr to account for new website.
#  - 2014-12-09: Updated by mlr to make the date range wider and more thorough.
#  - 2015-08-19: Updated by Andrei Chelaru to add backwards scraping support.
#  - 2015-08-27: Updated by Andrei Chelaru to add explicit waits
#  - 2021-12-28: Updated by flooie to remove selenium.
#  - 2024-02-21; Updated by grossir: handle dynamic backscrapes

Take a look at some others that have this, please.

@flooie
Contributor

flooie commented Jul 2, 2025

I think we should switch judges to author or author_str, whichever is used in cl.

Contributor

@flooie flooie left a comment

Take a look at the comments. You removed a call but not the code. Let's see if we still need the clean_name function, and we certainly can't exclude every case that doesn't match the pattern, because that excludes lots of others. It's possible they are also publishing some without a citation, I guess.

@flooie flooie assigned Luis-manzur and unassigned flooie Jul 2, 2025
@Luis-manzur
Contributor Author

Take a look at the comments. You removed a call but not the code. Let's see if we still need the clean_name function, and we certainly can't exclude every case that doesn't match the pattern, because that excludes lots of others. It's possible they are also publishing some without a citation, I guess.

If there is no citation, it will continue as normal, but the citation will be blank.

@Luis-manzur Luis-manzur assigned flooie and unassigned Luis-manzur Jul 2, 2025
@@ -50,11 +54,24 @@ def _process_html(self) -> None:

for idx, txt in enumerate(raw_values[:5]):
    if idx == 0:
        # Separate case name and citation if present
        match = re.match(
            r"^(.*?)(\s*((\d{4}\sND\s\d+)|(1 \d\.N\.W\d d\d+)))?\s*$",
Contributor

What's with the 1?

Contributor

I don't see any way to actually identify the nw2d and nw3d citations, but the regex here looks wrong to me.
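As a rough illustration of the concern, the sketch below compares the second alternative from the diff with a pattern for the conventional volume, reporter, page shape (for example `17 N.W.3d 123`). The conventional pattern and the `classify` helper are assumptions made for this comparison, and the sample strings are constructed by hand, not taken from the court's site.

```python
import re

# Second alternative of the pattern, copied from the diff under review.
DIFF_PATTERN = re.compile(r"1 \d\.N\.W\d d\d+")

# Assumed shape of a well-formed citation: volume, reporter series, page.
CONVENTIONAL = re.compile(r"\b\d+\s+N\.W\.\s?[23]d\s+\d+\b")


def classify(text: str) -> str:
    """Report which of the two patterns, if any, finds a citation in text."""
    if CONVENTIONAL.search(text):
        return "conventional"
    if DIFF_PATTERN.search(text):
        return "diff-only"
    return "none"
```

Under these assumptions, the diff's pattern requires a literal `1 ` prefix and a space before the `d`, so it does not match a well-formed N.W.2d or N.W.3d citation.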

@flooie flooie assigned Luis-manzur and unassigned flooie Jul 3, 2025
@Luis-manzur Luis-manzur assigned flooie and unassigned Luis-manzur Jul 3, 2025
if "(consolidated w/" in name:
    other_dockets = ",".join(re.findall(r"\d{8}", name))
Contributor

You updated the function but didn't update the docstring; it still says it's returning extra docket numbers. But shouldn't those extra docket numbers be included in the final docket numbers? Can you create a second example HTML file that captures this edge case so I can see how it is processed?
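One way the extra docket numbers could be folded into the final docket value is sketched below. `merge_consolidated_dockets` is a hypothetical helper, not code from the PR; it assumes docket numbers are eight digits, as the `\d{8}` pattern in the diff suggests, and the `", "` separator is an arbitrary choice.

```python
import re


def merge_consolidated_dockets(docket: str, name: str) -> str:
    """Hypothetical helper: append dockets named in a '(consolidated w/ ...)' note."""
    if "(consolidated w/" not in name:
        return docket
    # Assumes docket numbers are exactly eight digits, as in the diff's pattern.
    extra = re.findall(r"\d{8}", name)
    # dict.fromkeys() drops duplicates while keeping first-seen order.
    return ", ".join(dict.fromkeys([docket, *extra]))
```

A second example HTML file with a consolidated case would show whether the real markup fits these assumptions.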

Comment on lines 60 to 77
            txt,
        )
        if match:
            case_name = match.group(1).strip()
            # If matched with the second regex (1 \d\.N\.W\d d\d+), set citation to ""
            if match.group(5):
                citation = ""
            else:
                citation = (
                    match.group(2).strip()
                    if match.group(2)
                    else ""
                )
            txt = case_name
        else:
            citation = ""
        txt = self.clean_name(txt)
        values.append(citation)
Contributor

This is getting a bit unwieldy. I think the few examples of nw3d with broken citations should be ignored. Focus solely on the ND format and ignore the rest. In my opinion, it's not worth the hassle for bad data.
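A minimal sketch of what an ND-only split might look like follows. `split_name_and_citation` and `ND_CITATION` are hypothetical names, not the scraper's code, and the sketch assumes the neutral citation, when present, is the last thing in the string and has the form `2025 ND 123`.

```python
import re

# Assumption: an optional "YYYY ND N" citation sits at the end of the string.
ND_CITATION = re.compile(r"^(?P<name>.*?)\s*(?P<citation>\d{4}\sND\s\d+)?\s*$")


def split_name_and_citation(txt: str) -> tuple[str, str]:
    """Return (case name, citation); the citation is "" when none is found."""
    match = ND_CITATION.match(txt)
    if not match:
        # Only reachable for input the pattern cannot span, such as embedded newlines.
        return txt.strip(), ""
    return match.group("name").strip(), match.group("citation") or ""
```

With a single optional group, a row without a citation still yields its case name with a blank citation, so no case is excluded for failing to match.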

@flooie flooie assigned Luis-manzur and unassigned flooie Jul 4, 2025
@flooie
Contributor

flooie commented Jul 4, 2025

@Luis-manzur back to you

@Luis-manzur Luis-manzur assigned flooie and unassigned Luis-manzur Jul 4, 2025
Labels
None yet
Projects
Status: PRs to Review
Development

Successfully merging this pull request may close these issues.

nd now includes the neutral citation in the scraped HTML
2 participants