Implement Selenium-Based Scraper for Louisiana Third Circuit Court of Appeals Opinions #1338

DrJfrost · 2025-03-02T05:50:51Z

PR Description:

This pull request adds a Selenium-based scraper for the Louisiana Third Circuit Court of Appeals. The scraper drives the court's JavaScript-heavy website in a browser to extract opinion data. The sections below summarize the implementation, features, and functionality:

Overview

The Louisiana Third Circuit Court of Appeals website requires JavaScript for rendering content, making traditional static scraping methods ineffective. To address this, we implemented a Selenium-based scraper that interacts with the site's "Search Opinions" form, dynamically selects search parameters (year, month, or specific opinion date), submits the form, and parses the resulting modal to extract case details.
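The two search modes described above (year and month, or a specific opinion date) could be modeled as a small validated value object before any browser interaction happens. This is a minimal sketch, not code from the PR: the `SearchQuery` name and its fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class SearchQuery:
    """Hypothetical container for one search: a period OR a single date."""

    year: Optional[int] = None
    month: Optional[int] = None
    opinion_date: Optional[date] = None

    def __post_init__(self) -> None:
        by_period = self.year is not None or self.month is not None
        by_date = self.opinion_date is not None
        # Exactly one of the two modes must be chosen.
        if by_period == by_date:
            raise ValueError("Give either year and month, or opinion_date")
        if by_period:
            if self.year is None or self.month is None:
                raise ValueError("Year and month must be given together")
            if not 1 <= self.month <= 12:
                raise ValueError(f"Invalid month: {self.month}")
        elif self.opinion_date > date.today():
            raise ValueError("opinion_date cannot be in the future")
```

Rejecting bad input up front keeps invalid searches from ever reaching the form, which is cheaper than detecting an empty or broken result modal afterwards.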

Key Features

Dynamic Search Functionality:
- Supports searching opinions by:
  - Year and Month: Selects the year and month from dropdown menus and retrieves all opinions published during that period.
  - Specific Opinion Date: Searches for opinions issued on a specific date.
- Gracefully handles edge cases such as invalid inputs or searches that return no results.
Modal Handling:
- Waits for the modal containing search results to load after form submission.
- Extracts and processes the HTML content of the modal using lxml.html.
Data Extraction:
- Parses the modal's HTML to extract key case details, including:
  - Case title
  - Docket number
  - Opinion date
  - Parish and lower court information
  - Download URL for the opinion
- Cleans and harmonizes extracted data for consistency.
Error Handling and Validation:
- Handles error scenarios such as missing fields, empty results, and unexpected HTML structures.
- Validates dates to ensure no future dates are included unless explicitly marked as approximate.
JSON Output:
- Converts parsed cases into a JSON collection for easy integration with downstream systems.
WebDriver Management:
- Initializes and cleans up the Selenium WebDriver efficiently:
  - Runs in headless mode for performance.
  - Ensures proper resource cleanup after scraping.
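The WebDriver lifecycle described above (headless startup, guaranteed cleanup) can be sketched with a context manager. This is an illustration under assumptions, not the PR's `_setup_webdriver` / `_teardown_webdriver` code: `make_headless_firefox` and `managed_driver` are hypothetical names, and Firefox is assumed because GeckoDriver is the example the PR mentions.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


def make_headless_firefox():
    """Build a headless Firefox driver using Selenium 4 style options."""
    # Imported lazily so this module loads even where Selenium is absent.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


@contextmanager
def managed_driver(
    factory: Callable[[], object] = make_headless_firefox,
) -> Iterator[object]:
    """Yield a driver and always call quit() on it, even if scraping fails."""
    driver = factory()
    try:
        yield driver
    finally:
        # quit() ends the browser session and the driver process.
        driver.quit()
```

Passing the factory as a parameter means the cleanup logic can be exercised with a stand-in driver object, without launching a real browser.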

Implementation Details

Class Structure:
- The scraper extends the OpinionSiteLinear class from Juriscraper, leveraging its built-in methods for parsing and validation.
- Implements Selenium-specific methods (_setup_webdriver, _teardown_webdriver, _download) to interact with the website.
HTML Parsing:
- Uses lxml.html to parse the modal's HTML content.
- Iterates over rows in the table to extract case details and populate the self.cases list.
Testing:
- Due to the dynamic nature of the website, static example files cannot be used for testing.
- Includes a simple script (test_la3circuit.py) to test the scraper against the live website.
- Instructions for running the test script are provided below.
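The row-by-row table parsing described above can be sketched as follows. The PR uses lxml.html; this sketch swaps in the standard library's `html.parser` so that it runs without third-party packages. The column order (docket, title, date) and the names `TableRowParser` and `parse_cases` are assumptions for illustration, since the modal's real markup is not shown here.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class TableRowParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self._row is not None:
            # Collapse runs of whitespace inside the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows made of <th> cells stay empty
                self.rows.append(self._row)
            self._row = None


def parse_cases(modal_html: str) -> List[Dict[str, str]]:
    """Turn table rows into case dicts (assumed column order)."""
    parser = TableRowParser()
    parser.feed(modal_html)
    parser.close()
    return [
        {"docket": row[0], "name": row[1], "date": row[2]}
        for row in parser.rows
        if len(row) >= 3  # skip rows with missing fields
    ]
```

Because the parser only needs an HTML string, a saved copy of the modal's markup could be fed to it directly, which may help with offline checks even though the site itself is dynamic.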

How to Test

To test the scraper, run:

python test_la3circuit.py

The script will:

- Launch a headless browser instance using Selenium.
- Interact with the live website to perform a search (e.g., by month and year or specific opinion date).
- Parse the results and print them in JSON format.

Ensure that Selenium and the appropriate WebDriver (e.g., GeckoDriver for Firefox) are installed before running the test.

Example Output

After running the scraper, the parsed cases might look like this in JSON format:

[
    {
        "name": "CAMERON PARISH POLICE JURY VERSUS CHARLES DARREN BENOIT",
        "url": "https://www.la3circuit.org/opinion/CA-0024-0119",
        "date": "2025-01-15",
        "docket": "CA-0024-0119",
        "lower_court": "Thirty-Eighth Judicial District Court",
        "status": "Published",
        "date_filed_is_approximate": false
    },
    {
        "name": "DAVID MORGAN SESSIONS VERSUS ABBYGAIL WILKERSON",
        "url": "https://www.la3circuit.org/opinion/CA-0024-0265",
        "date": "2025-01-15",
        "docket": "CA-0024-0265",
        "lower_court": "Thirty-Fifth Judicial District Court",
        "status": "Published",
        "date_filed_is_approximate": false
    }
]
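Records shaped like the ones above can be checked against the validation rules listed under Key Features (required fields present, no future dates unless marked approximate). The sketch below is an assumption about how such a check might look, not the validation that Juriscraper or this PR performs; `REQUIRED_KEYS` and `validate_case` are hypothetical names.

```python
from datetime import date
from typing import Dict, List, Optional

# Assumed minimum set of fields, taken from the example records.
REQUIRED_KEYS = {"name", "url", "date", "docket"}


def validate_case(case: Dict, today: Optional[date] = None) -> List[str]:
    """Return a list of problems found in one case record (empty if none)."""
    today = today or date.today()
    problems = []
    missing = sorted(REQUIRED_KEYS - case.keys())
    if missing:
        problems.append("missing fields: " + ", ".join(missing))
    if "date" in case:
        try:
            filed = date.fromisoformat(case["date"])
        except ValueError:
            problems.append(f"bad date: {case['date']!r}")
        else:
            approximate = case.get("date_filed_is_approximate", False)
            if filed > today and not approximate:
                problems.append(f"future date: {case['date']}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a batch of cases at once.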

Future Improvements

Back-Scraping:
- Add functionality to scrape historical opinions by iterating over a range of years and months.
Performance Optimization:
- Minimize redundant interactions with the website (e.g., caching dropdown selections).
Error Logging:
- Log errors encountered during scraping for debugging purposes.
Alternative APIs:
- Investigate whether the court provides an API or alternative endpoints for fetching opinions.
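The back-scraping and error-logging ideas above could be combined along these lines. This is a sketch of one possible approach, not planned or existing code: `iter_months`, `backscrape`, and the `fetch_month` callback are hypothetical, with `fetch_month` standing in for whatever performs a single year-and-month search.

```python
import logging
from typing import Callable, Iterator, List, Tuple

logger = logging.getLogger(__name__)

YearMonth = Tuple[int, int]


def iter_months(start: YearMonth, end: YearMonth) -> Iterator[YearMonth]:
    """Yield (year, month) pairs from start to end, both inclusive."""
    year, month = start
    while (year, month) <= end:
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def backscrape(
    start: YearMonth,
    end: YearMonth,
    fetch_month: Callable[[int, int], List[dict]],
) -> List[dict]:
    """Collect cases for every month in range, logging months that fail."""
    cases: List[dict] = []
    for year, month in iter_months(start, end):
        try:
            cases.extend(fetch_month(year, month))
        except Exception:
            # Log with traceback and continue with the next month.
            logger.exception("Failed to scrape %d-%02d", year, month)
    return cases
```

Logging and continuing means one failing month does not discard the results already gathered for the rest of the range.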

CLAassistant · 2025-03-02T05:50:57Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

DrJfrost added 2 commits March 2, 2025 00:41

Adds scraper for la3circuit seleniumbased

2878460

formats document

5095c59

[pre-commit.ci] auto fixes from pre-commit.com hooks

c69ad4f

for more information, see https://pre-commit.ci

flooie closed this Apr 9, 2025
