feat(OpinionSite): return "lower_court_id" field #1434

Open
wants to merge 4 commits into
base: main

grossir
Contributor

@grossir grossir commented Jun 11, 2025

Solves #1432

This new field will go into "Docket.appeal_from_id"

Also, make the `tex` scraper return "lower_court_id"

@grossir grossir moved this to PRs to Review in Case Law Sprint Jun 11, 2025
@grossir grossir force-pushed the 1432-opinion-site-return-lower-court-id branch from abe7443 to 716ec9d Compare June 11, 2025 19:58
@flooie
Contributor

flooie commented Jun 16, 2025

I think we need to fix the Texas scraper with respect to IDs before we move this forward. Perhaps we should make this a draft.

@flooie flooie moved this from PRs to Review to Waiting on Feedback in Case Law Sprint Jun 16, 2025
@flooie flooie assigned grossir and unassigned flooie Jun 16, 2025
@grossir grossir moved this from Waiting on Feedback to Blocked in Case Law Sprint Jun 16, 2025
    @@ -162,16 +163,21 @@ def parse_lower_court_info(title: str) -> tuple[str, str]:
            if match := re.search(texapp_regex, title):
                lower_court = match.group("lower_court")
                lower_court_number = title[match.end() :].split(",")[0]
Contributor

The docstrings on this function are not correct.

Also, instead of not returning a BODA id, I created one in courts-db: `txboda`
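
For illustration, here is a minimal sketch of how the function might look with the return annotation and docstring matching the three values it returns, and with the `txboda` id in place of the empty string. It is written as a standalone function rather than the scraper's static method, the regexes are copied unchanged from the PR, and the docstring wording is an assumption, not the project's agreed text.

```python
import re

# Regexes copied unchanged from the PR under review
TEXAPP_REGEX = r" from (?P<lower_court>.*)\s*\("
OTHER_COURTS_REGEX = (
    r"\((?P<lower_court>(BODA|U\.S\. (Fif|5)th Circuit))\s"
    r"(?P<lower_number>(Cause No. )?[\d-]+)\)$"
)


def parse_lower_court_info(title: str) -> tuple[str, str, str]:
    """Parses lower court information from the title string

    :param title: the case title string
    :return: (lower_court, lower_court_number, lower_court_id);
        three empty strings when no known format matches
    """
    if match := re.search(TEXAPP_REGEX, title):
        lower_court = match.group("lower_court")
        lower_court_number = title[match.end() :].split(",")[0]
        return lower_court, lower_court_number, "texapp"

    if match := re.search(OTHER_COURTS_REGEX, title):
        lower_court = match.group("lower_court")
        lower_court_number = match.group("lower_number")
        if lower_court == "BODA":
            # "txboda" is the courts-db id mentioned in the comment above
            return "Board of Disciplinary Appeals", lower_court_number, "txboda"
        # the regex only allows BODA or the Fifth Circuit
        return lower_court, lower_court_number, "ca5"

    return "", "", ""
```

Called on the example strings from the PR's own comments, `parse_lower_court_info("(BODA Cause No. 67623)")` would yield the BODA name, `"Cause No. 67623"`, and `"txboda"`.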

@flooie
Contributor

flooie commented Jun 16, 2025

@grossir I find this


    @staticmethod
    def parse_lower_court_info(title: str) -> tuple[str, str]:
        """Parses lower court information from the title string

        :param title string
        :return lower_court, lower_court_number
        """

        # format when appeal comes from texapp. Example:
        # ' from Harris County; 1st Court of Appeals District (01-22-00182-CV, 699 SW3d 20, 03-23-23)'
        texapp_regex = r" from (?P<lower_court>.*)\s*\("

        # Examples:
        #  "(U.S. Fifth Circuit 23-10804)"
        #  "(U.S. 5th Circuit 19-51012)"
        # "(BODA Cause No. 67623)"
        other_courts_regex = r"\((?P<lower_court>(BODA|U\.S\. (Fif|5)th Circuit))\s(?P<lower_number>(Cause No. )?[\d-]+)\)$"

        if match := re.search(texapp_regex, title):
            lower_court = match.group("lower_court")
            lower_court_number = title[match.end() :].split(",")[0]
            return lower_court, lower_court_number, "texapp"
        elif match := re.search(other_courts_regex, title):
            lower_court = match.group("lower_court")
            lower_court_number = match.group("lower_number")

            if lower_court == "BODA":
                lower_court = "Board of Disciplinary Appeals"
                lower_court_id = ""
            else:
                # if this is not a BODA match, then it can only be a
                # Fifth Circuit match. Update this if the regex above changes
                lower_court_id = "ca5"

            return lower_court, lower_court_number, lower_court_id
        return "", "", ""

to be problematic. Can we change it back to return just the lower court number, and extract the remaining data in `extract_from_text`?


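    def extract_from_text(self, scraped_text: str) -> dict:
        """Extract the lower court name and id from the opinion text

        :param scraped_text: text extracted from the opinion PDF
        :return: metadata dict, with lower court fields when recognized
        """
        # header sections are delimited by runs of 15 or more "═"
        match = re.split(r"═{15,}", scraped_text)
        court_id = ""
        metadata = {"Docket": {}}
        # re.split always returns at least one element, so check the
        # length before indexing the second section
        if len(match) < 2:
            return metadata
        lower_court = match[1].replace("On Petition for Review from the", "").strip()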
        if lower_court.startswith("Court of Appeals"):
            court_id = "texapp"
        elif lower_court.startswith("Board of Disciplinary Appeals"):
            court_id = "txboda"
        elif lower_court.startswith("United States Court of Appeals for the Fifth Circuit"):
            court_id = "ca5"
        if court_id != "":
            metadata['Docket']['lower_court_str'] = lower_court
            metadata['Docket']['lower_court_id'] = court_id
        return metadata

I think I like the way the court names are written here. They match and look much nicer to me.

@grossir
Contributor Author

grossir commented Jun 17, 2025

@flooie

I was checking the PDFs and I would keep the data from the HTML source, because it has:

  • lower complexity: for example, on the PDF the separator is not always "On Petition for Review from the"; I have also found "On Certified Question from the", and there may be other variations to account for
  • more information: the "lower_court_str" also mentions the county it's coming from; not only the district

About the formatting being prettier or more standard in the PDF, when we implement the frontend we will just use the "appeal_from_id", which links to a Court object which has the standard court name; so I don't think a standard name should matter too much for "lower_court_str" / "appeal_from_str"
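
To illustrate the extra handling the PDF route would need, here is a rough sketch that accepts more than one lead-in phrase before mapping the court name to an id. The function name, the `LEAD_INS` list and the `COURT_IDS` table are hypothetical; the two phrases are the ones mentioned above, the prefixes and ids come from the `extract_from_text` snippet earlier in this thread, and any further variation would have to be added by hand.

```python
import re

# Lead-in phrases seen so far on the PDFs; there may be others
LEAD_INS = (
    "On Petition for Review from the",
    "On Certified Question from the",
)

# Hypothetical prefix -> court id table, based on the snippet proposed above
COURT_IDS = (
    ("Court of Appeals", "texapp"),
    ("Board of Disciplinary Appeals", "txboda"),
    ("United States Court of Appeals for the Fifth Circuit", "ca5"),
)

LEAD_IN_REGEX = re.compile(
    r"^\s*(?:" + "|".join(re.escape(phrase) for phrase in LEAD_INS) + r")\s+"
)


def lower_court_from_pdf_header(header: str) -> tuple[str, str]:
    """Return (lower_court_str, lower_court_id) from a PDF header section

    Returns two empty strings when the lead-in or the court is unrecognized.
    """
    match = LEAD_IN_REGEX.match(header)
    if not match:
        return "", ""
    # collapse line breaks and repeated spaces in the court name
    lower_court = " ".join(header[match.end() :].split())
    for prefix, court_id in COURT_IDS:
        if lower_court.startswith(prefix):
            return lower_court, court_id
    return "", ""
```

A header with an unlisted lead-in falls through to the empty result, which is the maintenance cost described in the first bullet above.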

@grossir grossir moved this from Blocked to Waiting on Feedback in Case Law Sprint Jun 17, 2025
@grossir grossir assigned flooie and unassigned grossir Jun 17, 2025
@flooie
Contributor

flooie commented Jun 17, 2025

@grossir I think the HTML is providing a non-standard name for the court, and I much prefer the format from the PDF.

Let me take a look at a bigger sample.

Projects
Status: Waiting on Feedback
2 participants