[Feature Request]: Make generate_schema resilient across multi-page HTML samples #1672

@SohamKukreti

Description

crawl4ai version

v0.7.8

Expected Behavior

generate_schema should propose stable selectors that keep working across multiple pages where the target element's DOM position varies (e.g., table rows that shift from page to page). Given several representative HTML samples, it should prefer attribute- or text-anchored selectors over fragile nth-child positions.
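
One way the preference described above could be expressed is a simple ranking over candidate selectors. The sketch below is not crawl4ai's implementation: the helper names (fragility, rank_selectors), the scoring weights, and the per-selector match counts are all hypothetical, chosen only to illustrate "match on more samples first, then avoid positional steps".

```python
import re

# Hypothetical heuristic: positional pseudo-classes are treated as fragile,
# attribute predicates as stabilizing anchors.
POSITIONAL = re.compile(r":nth-(?:last-)?(?:child|of-type)\(")
ATTRIBUTE = re.compile(r"\[[^\]]+\]")


def fragility(selector: str) -> int:
    """Lower is better: each positional step costs 2, each attribute anchor earns 1 back."""
    return 2 * len(POSITIONAL.findall(selector)) - len(ATTRIBUTE.findall(selector))


def rank_selectors(coverage: dict[str, int]) -> list[str]:
    """Sort candidates by samples matched (descending), then by fragility (ascending)."""
    return sorted(coverage, key=lambda s: (-coverage[s], fragility(s)))


# Assumed input: how many of three sample pages each candidate matched.
coverage = {
    "table.pdp-sku-info tbody tr:nth-child(6) td:nth-child(2) a": 1,
    'a[href*="/product?Manufacturer="]': 3,
}
best = rank_selectors(coverage)[0]
print(best)
```

Ranking on coverage first means a positional selector can still win when it is the only candidate that matches every sample; the fragility score only breaks ties.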

Current Behavior

generate_schema inspects a single HTML sample and may emit brittle selectors like table.pdp-sku-info tbody tr:nth-child(6) td:nth-child(2) a. On other product pages, the same field appears in nth-child(5) or nth-child(7), so extraction fails until a human finds a more stable selector (e.g., a[href*="/product?Manufacturer="]).
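
The failure mode can be reproduced without crawl4ai at all. The sketch below uses the standard library's ElementTree as a stand-in for a real selector engine, and the sample markup (make_page, the filler rows, the "Acme" link) is invented for illustration; only the table class and the href pattern come from the report above.

```python
import xml.etree.ElementTree as ET


def make_page(rows_before: int) -> ET.Element:
    """Build a well-formed stand-in product table with filler rows before the target."""
    filler = "".join(
        f"<tr><td>Spec {i}</td><td>value</td></tr>" for i in range(rows_before)
    )
    target = (
        "<tr><td>Manufacturer</td>"
        '<td><a href="/product?Manufacturer=Acme">Acme</a></td></tr>'
    )
    return ET.fromstring(
        f'<table class="pdp-sku-info"><tbody>{filler}{target}</tbody></table>'
    )


def by_position(page: ET.Element):
    """Positional lookup, analogous to tr:nth-child(6) td:nth-child(2) a."""
    return page.find("./tbody/tr[6]/td[2]/a")


def by_href(page: ET.Element):
    """Attribute-anchored lookup, analogous to a[href*="/product?Manufacturer="]."""
    for link in page.iter("a"):
        if "/product?Manufacturer=" in link.get("href", ""):
            return link
    return None


page_a = make_page(5)  # target lands in the 6th row
page_b = make_page(4)  # target lands in the 5th row

for name, page in (("page_a", page_a), ("page_b", page_b)):
    pos, href = by_position(page), by_href(page)
    print(name, "position:", None if pos is None else pos.text, "| href:", href.text)
```

The positional lookup finds the link only on the page it was derived from, while the href-anchored lookup finds it on both.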

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

Code snippets

OS

Any

Python version

3.10+

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

Metadata

    Labels

    ✨ Enhancement: Improvement on an existing feature
    📌 Root caused: identified the root cause of bug
