Redundant data from HTML source are included #1930

Open
@kehiy

Question

I have a simple script that uses docling to parse this webpage:

https://ramzinex.com/help/register-in-ramzinex

The issue is that the parsed document also contains the contents of the ol tag and the footer details. How can I exclude them?

The ol tag info:
Ramzinex Exchange
Help
Registration and identity verification

The footer info:

Image

This whole section is included in the final document.
Also, when I chunk the document with the hybrid chunker, these parts are still included.

Is there any config to exclude this kind of redundant content, whether the source is a link, a PDF, or anything else?
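To make the desired behavior concrete, here is a minimal sketch of a pre-filtering workaround using only the Python standard library. It is not a docling option, and whether docling offers an equivalent setting is exactly the open question. The names `SKIP_TAGS`, `BoilerplateStripper`, and `strip_boilerplate` are hypothetical, and the tag set is an assumption based on the page above. Dropping every ol element would also remove legitimate numbered lists, so the set would need adjusting per site.

```python
from html.parser import HTMLParser

# Assumed set of elements to drop; chosen for the page in this question.
SKIP_TAGS = frozenset({"footer", "nav", "ol"})


class BoilerplateStripper(HTMLParser):
    """Re-emit HTML while omitting every subtree rooted at a skipped tag."""

    def __init__(self, skip_tags=SKIP_TAGS):
        # Keep character references raw so they can be copied through unchanged.
        super().__init__(convert_charrefs=False)
        self.skip_tags = set(skip_tags)
        self.skip_depth = 0  # > 0 while inside a skipped subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied as-is when not skipped.
        if self.skip_depth == 0 and tag not in self.skip_tags:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.skip_tags:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&#{name};")


def strip_boilerplate(html_text, skip_tags=SKIP_TAGS):
    """Return html_text without the skipped elements (comments and doctype are dropped too)."""
    parser = BoilerplateStripper(skip_tags)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

The cleaned HTML could then be saved to a file and converted as before, so the breadcrumb and footer never reach the chunker. This only covers HTML sources and relies on reasonably well-formed markup; it does nothing for PDFs.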

Labels: question (further information is requested)