1301 nd neutral citation in the html #1477
base: main
Conversation
Can you update this with a history? Take a look at some of the others that have one, please.
I think we should switch judges to author or author_str, whichever is used in CL.
Take a look at the comments. You removed a call but not the code. Let's see if we still need the clean_name function. We certainly can't exclude every case that doesn't match the pattern, because that excludes lots of others. It's possible they are also publishing without a citation, I guess.
If there is no citation, it will continue as normal, but the citation will be blank.
@@ -50,11 +54,24 @@ def _process_html(self) -> None:

for idx, txt in enumerate(raw_values[:5]):
if idx == 0:
# Separate case name and citation if present
match = re.match(
r"^(.*?)(\s*((\d{4}\sND\s\d+)|(1 \d\.N\.W\d d\d+)))?\s*$",
What's with the 1?
I don't see any way to actually identify the NW2d and NW3d citations, but the regex here looks wrong to me.
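For reference, a minimal check of what the second alternative in that regex accepts. The two sample strings are made up for illustration and are not taken from the court's site; `NW_ALTERNATIVE` and `matches_nw_alternative` are hypothetical names for this sketch only.

```python
import re

# The second alternative from the regex under review, copied verbatim.
NW_ALTERNATIVE = re.compile(r"1 \d\.N\.W\d d\d+")


def matches_nw_alternative(text: str) -> bool:
    """Return True if the sub-pattern matches anywhere in text."""
    return NW_ALTERNATIVE.search(text) is not None


# Hypothetical sample strings, for illustration only.
conventional_citation = "998 N.W.2d 123"  # usual North Western Reporter shape
literal_shape = "1 2.N.W3 d45"            # the exact shape the pattern spells out

print(matches_nw_alternative(conventional_citation))
print(matches_nw_alternative(literal_shape))
```

The pattern requires a literal `1`, a space, a digit, then `.N.W`, a digit, a space, `d`, and more digits, so a conventionally formatted N.W.2d or N.W.3d citation would not satisfy it.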
if "(consolidated w/" in name:
other_dockets = ",".join(re.findall(r"\d{8}", name))
You updated the function but didn't update the docstring. It still says it's returning extra docket numbers. But shouldn't those extra docket numbers be included in the final docket numbers? Can you create a second example HTML file that captures these edge cases so I can see how they are processed?
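One way to fold the extra dockets into the final docket string, as a rough sketch only. The helper name `collect_docket_numbers`, the `", "` separator, and the sample case name are assumptions for illustration; the scraper's real field handling may differ.

```python
import re


def collect_docket_numbers(primary_docket: str, name: str) -> str:
    """Merge the primary docket with any 8-digit dockets found in a
    '(consolidated w/ ...)' note in the case name (hypothetical helper)."""
    dockets = [primary_docket]
    if "(consolidated w/" in name:
        for extra in re.findall(r"\d{8}", name):
            # Skip duplicates so the primary docket is not listed twice.
            if extra not in dockets:
                dockets.append(extra)
    return ", ".join(dockets)


# Made-up example input, not taken from the court's site.
print(
    collect_docket_numbers(
        "20240001", "State v. Doe (consolidated w/ 20240002 & 20240003)"
    )
)
```

A second example HTML file with a consolidated case would be the natural fixture for checking this behavior end to end.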
txt,
)
if match:
case_name = match.group(1).strip()
# If matched with the second regex (1 \d\.N\.W\d d\d+), set citation to ""
if match.group(5):
citation = ""
else:
citation = (
match.group(2).strip()
if match.group(2)
else ""
)
txt = case_name
else:
citation = ""
txt = self.clean_name(txt)
values.append(citation)
This is getting a bit unwieldy. I think the few examples of NW3d with broken citations should be ignored. Focus solely on the ND format and ignore the rest. In my opinion, it's not worth the hassle for bad data.
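A minimal sketch of what an ND-only version could look like, assuming the citation always trails the case name in the form `YYYY ND N`. The names `ND_CITATION` and `split_name_and_citation` are hypothetical, and the sample inputs are made up.

```python
import re

# Case name, then an optional trailing neutral citation such as "2025 ND 12".
ND_CITATION = re.compile(r"^(?P<name>.*?)\s*(?P<citation>\d{4}\sND\s\d+)?\s*$")


def split_name_and_citation(txt: str) -> tuple[str, str]:
    """Return (case_name, citation); citation is "" when none is found."""
    match = ND_CITATION.match(txt)
    if not match:
        # Fall back to the raw text with a blank citation.
        return txt.strip(), ""
    return match.group("name").strip(), match.group("citation") or ""


print(split_name_and_citation("State v. Doe 2025 ND 12"))
print(split_name_and_citation("State v. Doe"))
```

Anything that is not in the ND format simply stays in the case name with a blank citation, which avoids special-casing the broken N.W.3d entries.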
@Luis-manzur back to you
This pull request introduces improvements to the nd scraper in juriscraper to extract citations from HTML and updates the CHANGES.md file to reflect this enhancement. The most significant changes include adding citation handling to the scraper, modifying the _process_html method to parse and store citations, and updating the ordered fields for the scraper.

Enhancements to the nd scraper:

- juriscraper/opinions/united_states/state/nd.py: Added "citation" to the ordered_fields list to include citations as part of the extracted data.
- juriscraper/opinions/united_states/state/nd.py: Updated the _process_html method to parse case names and citations from raw HTML using a regular expression. The method now separates case names from citations, appends citations to the values list, and ensures proper handling when no citation is present.
- juriscraper/opinions/united_states/state/nd.py: Adjusted the _process_html method to account for the added "citation" field when creating the case dictionary by zipping ordered_fields with the values list.
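
The zipping step in the last item can be sketched as follows. The field names, their order, and the sample values are assumptions for illustration and are not copied from nd.py; `build_case` is a hypothetical helper.

```python
def build_case(ordered_fields: list[str], values: list[str]) -> dict[str, str]:
    """Pair each field name with the value at the same position."""
    # zip() silently truncates to the shorter input, so check lengths first.
    if len(ordered_fields) != len(values):
        raise ValueError(
            f"expected {len(ordered_fields)} values, got {len(values)}"
        )
    return dict(zip(ordered_fields, values))


# Hypothetical field order and made-up values.
ordered_fields = ["citation", "name", "docket", "date"]
values = ["2025 ND 12", "State v. Doe", "20240001", "01/15/2025"]

case = build_case(ordered_fields, values)
print(case["citation"])
```

The length check matters here because adding a field to ordered_fields without appending the matching value would otherwise shift or drop data without raising an error.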