1301 nd neutral citation in the html #1477

Luis-manzur · 2025-07-02T14:43:40Z

This pull request introduces improvements to the nd scraper in juriscraper to extract citations from HTML and updates the CHANGES.md file to reflect this enhancement. The most significant changes include adding citation handling to the scraper, modifying the _process_html method to parse and store citations, and updating the ordered fields for the scraper.

Enhancements to the `nd` scraper:

- `juriscraper/opinions/united_states/state/nd.py`: Added "citation" to the `ordered_fields` list to include citations as part of the extracted data.
- `juriscraper/opinions/united_states/state/nd.py`: Updated the `_process_html` method to parse case names and citations from raw HTML using a regular expression. The method now separates case names from citations, appends citations to the `values` list, and ensures proper handling when no citation is present.
- `juriscraper/opinions/united_states/state/nd.py`: Adjusted the `_process_html` method to account for the added "citation" field when creating the case dictionary by zipping `ordered_fields` with the `values` list.
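To illustrate why a citation value has to be appended even when it is empty, here is a minimal sketch of the zip step. The field names in `ORDERED_FIELDS`, the helper `build_case`, and the sample values are assumptions for illustration, not the actual contents of `nd.py`.

```python
# Hypothetical field order; the real ordered_fields list in nd.py may differ.
ORDERED_FIELDS = ["name", "citation", "docket", "date"]


def build_case(values):
    """Pair each field name with the value at the same position."""
    return dict(zip(ORDERED_FIELDS, values))


# Citation present: every value lands under its own key.
with_citation = build_case(
    ["State v. Doe", "2025 ND 1", "20240001", "2025-01-02"]
)

# No citation, but an empty string still holds the slot.
without_citation = build_case(["State v. Doe", "", "20240001", "2025-01-02"])

# If the empty placeholder were skipped, zip would silently shift the
# remaining values one key to the left and drop the last field.
misaligned = build_case(["State v. Doe", "20240001", "2025-01-02"])
```

Because `zip` stops at the shorter input without raising, a missing placeholder would not fail loudly; it would only show up as a docket number stored under the citation key.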

flooie · 2025-07-02T15:38:47Z

can you update this with a history

# History:
#  - 2014-07-10: Created by Andrei Chelaru
#  - 2014-11-07: Updated by mlr to account for new website.
#  - 2014-12-09: Updated by mlr to make the date range wider and more thorough.
#  - 2015-08-19: Updated by Andrei Chelaru to add backwards scraping support.
#  - 2015-08-27: Updated by Andrei Chelaru to add explicit waits
#  - 2021-12-28: Updated by flooie to remove selenium.
#  - 2024-02-21; Updated by grossir: handle dynamic backscrapes

sorta look at some others that have this please

flooie · 2025-07-02T15:43:02Z

I think we should switch `judges` to `author` or `author_str`, whichever is used in CL.

juriscraper/opinions/united_states/state/nd.py

flooie

Take a look at the comments. You removed a call but not the code. Let's see if we still need the clean name function. We certainly can't exclude every case that doesn't match the pattern, since that would exclude lots of others. It's possible they are also publishing without some citation, I guess.

Luis-manzur · 2025-07-02T19:46:11Z

> Take a look at the comments. You removed a call but not the code. Let's see if we still need the clean name function. We certainly can't exclude every case that doesn't match the pattern, since that would exclude lots of others. It's possible they are also publishing without some citation, I guess.

If there is no citation, processing continues as normal and the citation is left blank.

flooie · 2025-07-03T00:50:38Z

juriscraper/opinions/united_states/state/nd.py

@@ -50,11 +54,24 @@ def _process_html(self) -> None:

            for idx, txt in enumerate(raw_values[:5]):
                if idx == 0:
+                    # Separate case name and citation if present
+                    match = re.match(
+                        r"^(.*?)(\s*((\d{4}\sND\s\d+)|(1 \d\.N\.W\d d\d+)))?\s*$",


What's with the 1?

I don't see any way to actually identify the N.W.2d and N.W.3d citations, but the regex here looks wrong to me.
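A quick check of the second alternative in isolation shows the mismatch. The sample strings below are made up; they assume the conventional North Western Reporter shape of volume, reporter abbreviation, then page.

```python
import re

# Second alternative from the regex under review, copied verbatim.
NW_ALTERNATIVE = re.compile(r"1 \d\.N\.W\d d\d+")

# Made-up citations in the conventional "<volume> N.W.2d <page>" shape.
samples = ["995 N.W.2d 123", "12 N.W.3d 456"]

# The pattern requires a literal "1 ", one digit, ".N.W", one digit,
# " d" and then digits, so neither conventional citation matches.
conventional_hits = [bool(NW_ALTERNATIVE.search(s)) for s in samples]

# The only shape it accepts looks like this contrived string.
contrived_hit = bool(NW_ALTERNATIVE.search("1 5.N.W3 d456"))
```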

flooie · 2025-07-04T12:59:58Z

juriscraper/opinions/united_states/state/nd.py

        if "(consolidated w/" in name:
-            other_dockets = ",".join(re.findall(r"\d{8}", name))


You updated the function but didn't update the docstring. It still says it's returning extra docket numbers. But shouldn't those extra docket numbers be included in the final docket numbers? Can you create a second example HTML file that captures this edge case so I can see how it is processed?
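One way the extra docket numbers could be folded into the final docket string is sketched below. The helper name `merge_dockets`, the sample values, and the comma-joined output format are assumptions; only the `(consolidated w/` marker and the `\d{8}` pattern come from the diff.

```python
import re


def merge_dockets(name: str, docket: str) -> tuple[str, str]:
    """Return (clean_name, all_dockets) for a possibly consolidated case.

    Hypothetical helper, not the scraper's clean_name method.
    """
    marker = "(consolidated w/"
    if marker not in name:
        return name.strip(), docket

    head, _, tail = name.partition(marker)
    # Eight-digit docket numbers listed after the marker.
    extra = re.findall(r"\d{8}", tail)
    # Keep the primary docket first and skip repeats of it.
    dockets = [docket] + [d for d in extra if d != docket]
    return head.strip(), ", ".join(dockets)
```

Called with an invented name such as `"State v. Doe (consolidated w/ 20240002 & 20240003)"` and docket `"20240001"`, this returns the bare case name together with all three docket numbers.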

flooie · 2025-07-04T13:00:55Z

juriscraper/opinions/united_states/state/nd.py

+                        txt,
+                    )
+                    if match:
+                        case_name = match.group(1).strip()
+                        # If matched with the second regex (1 \d\.N\.W\d d\d+), set citation to ""
+                        if match.group(5):
+                            citation = ""
+                        else:
+                            citation = (
+                                match.group(2).strip()
+                                if match.group(2)
+                                else ""
+                            )
+                        txt = case_name
+                    else:
+                        citation = ""
+                    txt = self.clean_name(txt)
+                    values.append(citation)


This is getting a bit unwieldy. I think the few examples of N.W.3d with broken citations should be ignored. Focus solely on the ND format and ignore the rest. In my opinion it's not worth the hassle for bad data.
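Restricting the match to the ND neutral format could look like the sketch below. `split_name_and_citation` is a hypothetical helper and the sample names are invented; the pattern assumes the citation, when present, sits at the end of the case name.

```python
import re

# "<year> ND <number>" at the end of the string, e.g. "2025 ND 123".
ND_CITATION = re.compile(r"\s*(\d{4}\sND\s\d+)\s*$")


def split_name_and_citation(text: str) -> tuple[str, str]:
    """Split a trailing ND neutral citation from a case name.

    Returns an empty citation when none is found, so the name is
    never dropped just because the pattern did not match.
    """
    match = ND_CITATION.search(text)
    if match is None:
        return text.strip(), ""
    return text[: match.start()].strip(), match.group(1)
```

Anything that is not in the ND format, including a malformed reporter citation, is simply left in the name and the citation comes back empty.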

flooie · 2025-07-04T13:01:33Z

@Luis-manzur back to you

flooie · 2025-07-10T14:24:47Z

juriscraper/opinions/united_states/state/nd.py

                if idx == 0:
-                    txt, _ = self.clean_name(txt)
+                    # Separate case name and citation if present
+                    match = re.match(
+                        r"^(.*?)(\s*(\d{4}\sND\s\d+))?\s*$",
+                        txt,
+                    )
+                    if match:
+                        case_name = match.group(1).strip()
+                        citation = (
+                            match.group(2).strip() if match.group(2) else ""
+                        )
+                        txt = case_name
+                    else:
+                        citation = ""
+                    txt, other_docket = self.clean_name(txt)
+                    values.append(citation)
                else:
                    txt = txt.split(":", 1)[1].strip()
                values.append(txt)


Looks like this code crashes if you force it into the else.

juriscraper/opinions/united_states/state/nd.py

flooie

This is not working for some scenarios. Can you take another look?

Luis-manzur added 3 commits July 2, 2025 10:32

feat(nd): Add citation extraction to case name processing

cf246c8

feat(nd): update nd example files

ebe178a

feat(nd): update nd example files

da04fd1

Luis-manzur requested a review from flooie July 2, 2025 14:43

Luis-manzur assigned flooie Jul 2, 2025

Luis-manzur added this to Case Law Sprint Jul 2, 2025

Luis-manzur linked an issue Jul 2, 2025 that may be closed by this pull request

nd now includes the neutral citation in the scraped HTML #1301

Open

Luis-manzur moved this to PRs to Review in Case Law Sprint Jul 2, 2025

flooie reviewed Jul 2, 2025

flooie reviewed Jul 2, 2025

flooie requested changes Jul 2, 2025

flooie assigned Luis-manzur and unassigned flooie Jul 2, 2025

feat(nd): enhance citation regex and clean case name processing

4fb7e64

Luis-manzur assigned flooie and unassigned Luis-manzur Jul 2, 2025

chore: add history to nd

9331199

flooie reviewed Jul 3, 2025

flooie assigned Luis-manzur and unassigned flooie Jul 3, 2025

Luis-manzur and others added 2 commits July 3, 2025 11:25

feat(nd): refine citation handling and simplify clean_name method

a405db9

Merge branch 'main' into 1301-nd-neutral-citation-in-the-html

332b73b

Luis-manzur assigned flooie and unassigned Luis-manzur Jul 3, 2025

flooie reviewed Jul 4, 2025

flooie assigned Luis-manzur and unassigned flooie Jul 4, 2025

Luis-manzur added 2 commits July 4, 2025 14:34

fix(nd): simplify citation extraction logic in case name processing

cce7716

fix(nd): update clean_name method to return additional docket information

60d8472

Luis-manzur assigned flooie and unassigned Luis-manzur Jul 4, 2025

Luis-manzur and others added 2 commits July 4, 2025 14:41

Merge branch 'main' into 1301-nd-neutral-citation-in-the-html

29559b0

Merge branch 'main' into 1301-nd-neutral-citation-in-the-html

2b3e4fe

flooie assigned Luis-manzur and unassigned flooie Jul 10, 2025

flooie reviewed Jul 10, 2025

flooie requested changes Jul 10, 2025

Luis-manzur and others added 4 commits July 10, 2025 15:00

Merge branch 'main' into 1301-nd-neutral-citation-in-the-html

e54957e

fix(nd): correct regex to ensure citation is captured properly

30b2a71

Merge branch 'main' into 1301-nd-neutral-citation-in-the-html

4b449c7

# Conflicts: # CHANGES.md

chore: update CHANGES.md to include reference for nd updates

8822601

Luis-manzur assigned flooie and unassigned Luis-manzur Jul 28, 2025
