Context
Some websites have multiple URLs pointing to the same page.
For example in openai:
Problem
Since the crawler cannot tell that it has already scraped those pages, it scrapes them again. As a result, the same documents are indexed multiple times.
The current workaround is to add distinctAttribute: "content" to the Meilisearch settings of your scrapix configuration.
Solution
The long-term solution would be to add a new field to each Meilisearch document containing a hash of the document's relevant fields.
For example, a section_hash field would hold a hash of the following fields:
hierarchy_lvl0
hierarchy_lvl1
hierarchy_lvl2
hierarchy_lvl3
hierarchy_lvl4
hierarchy_lvl5
hierarchy_radio_lvl0
hierarchy_radio_lvl1
hierarchy_radio_lvl2
hierarchy_radio_lvl3
hierarchy_radio_lvl4
hierarchy_radio_lvl5
content
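A minimal sketch of how such a hash could be computed, assuming documents are plain objects keyed by the field names above. The helper names (HASHED_FIELDS, computeSectionHash, withSectionHash) and the choice of SHA-256 are illustrative assumptions, not part of scrapix:

```typescript
import { createHash } from "node:crypto";

// Fields that contribute to the hash, in a fixed order (taken from the list above).
const HASHED_FIELDS = [
  "hierarchy_lvl0",
  "hierarchy_lvl1",
  "hierarchy_lvl2",
  "hierarchy_lvl3",
  "hierarchy_lvl4",
  "hierarchy_lvl5",
  "hierarchy_radio_lvl0",
  "hierarchy_radio_lvl1",
  "hierarchy_radio_lvl2",
  "hierarchy_radio_lvl3",
  "hierarchy_radio_lvl4",
  "hierarchy_radio_lvl5",
  "content",
] as const;

// Hypothetical helper: hashes only the listed fields, so the URL (or any
// other field) does not influence the result.
export function computeSectionHash(doc: Record<string, unknown>): string {
  // Missing or non-string fields become null so that every document is
  // serialized with the same shape. JSON encoding keeps field boundaries
  // unambiguous ("ab" + "c" never collides with "a" + "bc").
  const values = HASHED_FIELDS.map((field) => {
    const value = doc[field];
    return typeof value === "string" ? value : null;
  });
  return createHash("sha256").update(JSON.stringify(values)).digest("hex");
}

// Hypothetical helper: returns a copy of the document with section_hash set.
export function withSectionHash<T extends Record<string, unknown>>(
  doc: T
): T & { section_hash: string } {
  return { ...doc, section_hash: computeSectionHash(doc) };
}
```

With this approach, two documents scraped from different URLs but carrying the same hierarchy and content receive the same section_hash, so Meilisearch can treat them as duplicates once that field is used as the distinct attribute.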
We then set section_hash as the default distinctAttribute, for example here:
scrapix/src/scrapers/docssearch.ts
Line 13 in 070c907
The same should be done in the default strategy.