Implement Selenium-Based Scraper for Louisiana Third Circuit Court of Appeals Opinions #1338

Closed
wants to merge 3 commits into from

@DrJfrost DrJfrost commented Mar 2, 2025

PR Description:

This pull request adds a Selenium-based scraper for the Louisiana Third Circuit Court of Appeals. The scraper drives the court's JavaScript-heavy website in a browser to extract opinion data. The implementation, features, and testing steps are summarized below.

Overview

The Louisiana Third Circuit Court of Appeals website requires JavaScript for rendering content, making traditional static scraping methods ineffective. To address this, we implemented a Selenium-based scraper that interacts with the site's "Search Opinions" form, dynamically selects search parameters (year, month, or specific opinion date), submits the form, and parses the resulting modal to extract case details.
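The workflow described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the element IDs (`YEAR_SELECT_ID`, `MONTH_SELECT_ID`, `SUBMIT_ID`, `MODAL_ID`), the function name `fetch_results_html`, and the assumption that the month dropdown uses numeric option values are all hypothetical and would need to match the real page. Selenium is imported inside the function so the input check can run on machines where Selenium is not installed.

```python
# Hypothetical element IDs; the real page's markup may differ.
YEAR_SELECT_ID = "year"
MONTH_SELECT_ID = "month"
SUBMIT_ID = "search-submit"
MODAL_ID = "results-modal"


def fetch_results_html(search_url: str, year: int, month: int, timeout: int = 30) -> str:
    """Submit the search form for one month and return the modal's inner HTML."""
    # Reject bad input before starting a browser.
    if not 1 <= month <= 12:
        raise ValueError(f"month must be between 1 and 12, got {month}")

    # Imported lazily so the validation above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import Select, WebDriverWait

    options = Options()
    options.add_argument("-headless")  # run Firefox without a visible window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(search_url)
        # Pick the year and month from the dropdown menus.
        Select(driver.find_element(By.ID, YEAR_SELECT_ID)).select_by_visible_text(str(year))
        # Assumes numeric option values ("1".."12"); adjust to the real markup.
        Select(driver.find_element(By.ID, MONTH_SELECT_ID)).select_by_value(str(month))
        driver.find_element(By.ID, SUBMIT_ID).click()
        # Block until the results modal is visible, or raise TimeoutException.
        modal = WebDriverWait(driver, timeout).until(
            EC.visibility_of_element_located((By.ID, MODAL_ID))
        )
        return modal.get_attribute("innerHTML")
    finally:
        # Always release the browser process, even when the search fails.
        driver.quit()
```

Using an explicit wait rather than a fixed sleep keeps the scraper from parsing the modal before the site has finished rendering it.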


Key Features

  1. Dynamic Search Functionality:

    • Supports searching opinions by:
      • Year and Month: Selects the year and month from dropdown menus and retrieves all opinions published during that period.
      • Specific Opinion Date: Searches for opinions issued on a specific date.
    • Gracefully handles edge cases such as invalid inputs or searches that return no results.
  2. Modal Handling:

    • Waits for the modal containing search results to load after form submission.
    • Extracts and processes the HTML content of the modal using lxml.html.
  3. Data Extraction:

    • Parses the modal's HTML to extract key case details, including:
      • Case title
      • Docket number
      • Opinion date
      • Parish and lower court information
      • Download URL for the opinion
    • Cleans and harmonizes extracted data for consistency.
  4. Error Handling and Validation:

    • Handles missing fields, empty results, and unexpected HTML structures without crashing.
    • Validates dates to ensure no future dates are included unless explicitly marked as approximate.
  5. JSON Output:

    • Converts parsed cases into a JSON collection for easy integration with downstream systems.
  6. WebDriver Management:

    • Initializes and cleans up the Selenium WebDriver efficiently:
      • Runs in headless mode for performance.
      • Ensures proper resource cleanup after scraping.
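As an illustration of the cleaning, validation, and JSON steps listed above, a sketch using only the standard library might look like this. The helper names (`normalize_text`, `clean_case`, `cases_to_json`), the raw field names, and the assumed `MM/DD/YYYY` input date format are hypothetical and not taken from the PR; the output keys mirror the example output in this description.

```python
import json
from datetime import date, datetime
from typing import Optional


def normalize_text(value: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(value.split())


def clean_case(raw: dict, today: Optional[date] = None) -> dict:
    """Validate one raw row and return a harmonized case dict."""
    today = today or date.today()
    for field in ("name", "docket", "date", "url"):
        if not (raw.get(field) or "").strip():
            raise ValueError(f"missing required field: {field}")

    # Assumption: the site shows dates as MM/DD/YYYY.
    filed = datetime.strptime(raw["date"].strip(), "%m/%d/%Y").date()
    approximate = bool(raw.get("date_filed_is_approximate", False))
    # Future dates are rejected unless the date is flagged as approximate.
    if filed > today and not approximate:
        raise ValueError(f"future opinion date: {filed.isoformat()}")

    return {
        "name": normalize_text(raw["name"]),
        "url": raw["url"].strip(),
        "date": filed.isoformat(),
        "docket": normalize_text(raw["docket"]),
        "lower_court": normalize_text(raw.get("lower_court") or ""),
        "status": "Published",  # assumption: matches the example output
        "date_filed_is_approximate": approximate,
    }


def cases_to_json(cases: list) -> str:
    """Serialize cleaned cases as a JSON array."""
    return json.dumps(cases, indent=4)
```

Passing `today` explicitly makes the future-date check deterministic in tests.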

Implementation Details

  1. Class Structure:

    • The scraper extends the OpinionSiteLinear class from Juriscraper, leveraging its built-in methods for parsing and validation.
    • Implements Selenium-specific methods (_setup_webdriver, _teardown_webdriver, _download) to interact with the website.
  2. HTML Parsing:

    • Uses lxml.html to parse the modal's HTML content.
    • Iterates over rows in the table to extract case details and populate the self.cases list.
  3. Testing:

    • Due to the dynamic nature of the website, static example files cannot be used for testing.
    • Includes a simple script (test_la3circuit.py) to test the scraper against the live website.
    • Instructions for running the test script are provided below.
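The row-by-row parsing described under "HTML Parsing" might be sketched as below. The column order (docket, case name, date, lower court, download link) and the function name `parse_modal` are assumptions for illustration only; the real modal markup may differ. In the scraper itself, the resulting dicts would be appended to `self.cases` rather than returned.

```python
import lxml.html


def parse_modal(modal_html: str) -> list:
    """Extract one dict per result row from the modal's HTML."""
    root = lxml.html.fromstring(modal_html)
    cases = []
    # Rows that contain <td> cells are data rows; header rows use <th>.
    for row in root.xpath(".//tr[td]"):
        cells = [cell.text_content().strip() for cell in row.xpath("./td")]
        links = row.xpath(".//a/@href")
        # Skip rows that do not match the assumed layout.
        if len(cells) < 4 or not links:
            continue
        # Assumed column order; adjust to the real table.
        docket, name, date_text, lower_court = cells[:4]
        cases.append(
            {
                "docket": docket,
                "name": name,
                "date": date_text,
                "lower_court": lower_court,
                "url": links[0],
            }
        )
    return cases
```

Skipping malformed rows instead of raising keeps one unexpected row from discarding the whole result set; a stricter scraper could log or raise instead.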

How to Test

To test the scraper, run:

python test_la3circuit.py

The script will:

  • Launch a headless browser instance using Selenium.
  • Interact with the live website to perform a search (e.g., by month and year or specific opinion date).
  • Parse the results and print them in JSON format.

Ensure that Selenium and the appropriate WebDriver (e.g., GeckoDriver for Firefox) are installed before running the test.


Example Output

After running the scraper, the parsed cases might look like this in JSON format:

[
    {
        "name": "CAMERON PARISH POLICE JURY VERSUS CHARLES DARREN BENOIT",
        "url": "https://www.la3circuit.org/opinion/CA-0024-0119",
        "date": "2025-01-15",
        "docket": "CA-0024-0119",
        "lower_court": "Thirty-Eighth Judicial District Court",
        "status": "Published",
        "date_filed_is_approximate": false
    },
    {
        "name": "DAVID MORGAN SESSIONS VERSUS ABBYGAIL WILKERSON",
        "url": "https://www.la3circuit.org/opinion/CA-0024-0265",
        "date": "2025-01-15",
        "docket": "CA-0024-0265",
        "lower_court": "Thirty-Fifth Judicial District Court",
        "status": "Published",
        "date_filed_is_approximate": false
    }
]

Future Improvements

  1. Back-Scraping:

    • Add functionality to scrape historical opinions by iterating over a range of years and months.
  2. Performance Optimization:

    • Minimize redundant interactions with the website (e.g., caching dropdown selections).
  3. Error Logging:

    • Log errors encountered during scraping for debugging purposes.
  4. Alternative APIs:

    • Investigate whether the court provides an API or alternative endpoints for fetching opinions.
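
For the back-scraping item, one possible starting point is a generator that yields every (year, month) pair in a range, which the scraper could then feed into the month-and-year search one pair at a time. The name `iter_year_months` is hypothetical and this is only a sketch of the iteration, not of the scraping itself.

```python
from datetime import date
from typing import Iterator, Tuple


def iter_year_months(start: date, end: date) -> Iterator[Tuple[int, int]]:
    """Yield (year, month) pairs from start's month through end's month, inclusive."""
    if (start.year, start.month) > (end.year, end.month):
        raise ValueError("start must not be after end")
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            # Roll over into January of the next year.
            year, month = year + 1, 1
```

For example, `list(iter_year_months(date(2024, 11, 15), date(2025, 2, 1)))` covers November 2024 through February 2025.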

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@flooie flooie closed this Apr 9, 2025