Ensure same documents are not pushed more than once. #52

Open
bidoubiwa opened this issue Jun 29, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@bidoubiwa
Contributor

Context

Some websites have multiple URLs pointing to the same page.
For example in openai:

Problem

Since the crawler cannot tell that it has already scraped those pages, it scrapes them again. As a result, the same documents are pushed multiple times.

The current workaround is to set distinctAttribute: "content" in the meilisearch settings of your scrapix configuration.
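As a rough illustration of that workaround, here is a configuration written as a TypeScript object. Only `distinctAttribute: "content"` comes from this issue; the surrounding key names (`start_urls`, `meilisearch_settings`) and the URL are assumptions made for the sketch and may not match the actual scrapix schema.

```typescript
// Hypothetical scrapix configuration. Only the distinctAttribute value is
// taken from the issue; every other key and value is an assumption.
const config = {
  start_urls: ["https://example.com"],
  meilisearch_settings: {
    // Meilisearch returns at most one document per distinct value of this
    // field at search time; the duplicates are still stored in the index.
    distinctAttribute: "content",
  },
};

console.log(config.meilisearch_settings.distinctAttribute);
```

Note that this only hides duplicates in search results. It does not prevent the crawler from pushing them.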

Solution

The long-term solution would be to add a new field to each document in Meilisearch containing a hash of the document's relevant fields.
For example, a section_hash field holding a hash of all of the following fields:

  • hierarchy_lvl0
  • hierarchy_lvl1
  • hierarchy_lvl2
  • hierarchy_lvl3
  • hierarchy_lvl4
  • hierarchy_lvl5
  • hierarchy_radio_lvl0
  • hierarchy_radio_lvl1
  • hierarchy_radio_lvl2
  • hierarchy_radio_lvl3
  • hierarchy_radio_lvl4
  • hierarchy_radio_lvl5
  • content
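A minimal sketch of how such a hash could be computed, assuming Node's built-in `crypto` module and SHA-256. The helper name `sectionHash`, the `Section` type, and the choice of hash function are assumptions for illustration, not something scrapix defines.

```typescript
import { createHash } from "node:crypto";

// Fields that take part in the hash, in a fixed order (list from the issue).
const HASHED_FIELDS = [
  "hierarchy_lvl0", "hierarchy_lvl1", "hierarchy_lvl2",
  "hierarchy_lvl3", "hierarchy_lvl4", "hierarchy_lvl5",
  "hierarchy_radio_lvl0", "hierarchy_radio_lvl1", "hierarchy_radio_lvl2",
  "hierarchy_radio_lvl3", "hierarchy_radio_lvl4", "hierarchy_radio_lvl5",
  "content",
] as const;

// Hypothetical document shape: any hashed field may be missing or null.
type Section = Partial<Record<(typeof HASHED_FIELDS)[number], string | null>> & {
  url?: string;
};

// Hypothetical helper: hex SHA-256 digest over the hashed fields only.
function sectionHash(doc: Section): string {
  // Serializing the ordered values as a JSON array keeps field boundaries
  // unambiguous ("ab" + "c" must not collide with "a" + "bc").
  // Missing fields are normalized to null.
  const values = HASHED_FIELDS.map((field) => doc[field] ?? null);
  return createHash("sha256").update(JSON.stringify(values)).digest("hex");
}

// Two URLs pointing to the same page yield the same hash,
// because the url is not part of the hashed fields.
const a: Section = { url: "https://example.com/docs", hierarchy_lvl0: "Docs", content: "Hello" };
const b: Section = { url: "https://example.com/docs/", hierarchy_lvl0: "Docs", content: "Hello" };
console.log(sectionHash(a) === sectionHash(b));
```

Hashing a JSON array rather than a plain concatenation is a design choice of this sketch: it avoids accidental collisions when text shifts between adjacent fields.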

We then set section_hash as the default distinctAttribute, for example here:

void this.sender.updateSettings({

It should also be added in the default strategy.
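One possible reading of "by default" is that section_hash applies unless the user configures a different distinct attribute. The sketch below shows that merge as a pure function; `Settings`, `DEFAULT_SETTINGS`, and `withDefaults` are hypothetical names, and how scrapix actually assembles the settings it passes to `updateSettings` may differ.

```typescript
// Minimal stand-in for a Meilisearch settings object; only the field
// relevant to this issue is typed explicitly.
type Settings = { distinctAttribute?: string | null; [key: string]: unknown };

// Assumed default: deduplicate on the proposed section_hash field.
const DEFAULT_SETTINGS: Settings = { distinctAttribute: "section_hash" };

// Hypothetical helper: user-provided settings take precedence over defaults.
function withDefaults(userSettings: Settings = {}): Settings {
  return { ...DEFAULT_SETTINGS, ...userSettings };
}

// No user override: the default distinct attribute is kept.
console.log(withDefaults({ searchableAttributes: ["content"] }).distinctAttribute);
// Explicit user choice: the default is replaced.
console.log(withDefaults({ distinctAttribute: "content" }).distinctAttribute);
```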

@bidoubiwa added the enhancement label Jun 29, 2023
@qdequele
Member

Not an easy one without slowing down the process a lot or having a hidden field on the document. 🤔

@qdequele
Member

qdequele commented Nov 8, 2024

I don't know if this issue is still relevant.
