Description
AsyncSearchIndex.clear() attempts to remove all documents by paginating through indexed documents in batches and deleting them. However, the pagination logic is currently unstable and can result in some documents being deleted multiple times, while others may be omitted entirely.
The root cause is that pagination is performed without a SORTBY clause, so the order of results is not guaranteed to be stable from one page request to the next. As a result, when there are more documents than page_size, some documents may appear on multiple pages and others may never be returned, and therefore never deleted.
Current logic (index.py, lines 1566–1568):
async for batch in self.paginate(
    FilterQuery(FilterExpression("*"), return_fields=["id"]), page_size=500
):
    ...
Why this is a problem:
Without a deterministic sort (i.e., SORTBY), RediSearch does not guarantee consistent result ordering across pages, causing duplicates and/or omissions.
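The failure mode can be reproduced without Redis. The sketch below is a simplified model, not the redisvl implementation: `paginate` and `make_unstable_query` are hypothetical helpers, and the "query" simply flips its result order on every call to stand in for an engine that makes no ordering guarantee.

```python
from typing import Callable, Iterator, List


def paginate(fetch: Callable[[], List[str]], page_size: int) -> Iterator[List[str]]:
    """Offset-based pagination: re-run the query for each page and slice it."""
    offset = 0
    while True:
        page = fetch()[offset:offset + page_size]
        if not page:
            return
        yield page
        offset += page_size


def make_unstable_query(doc_ids: List[str]) -> Callable[[], List[str]]:
    """Hypothetical query whose result order flips on every call."""
    calls = 0

    def run() -> List[str]:
        nonlocal calls
        calls += 1
        # Odd calls return ascending order, even calls return descending order.
        return list(doc_ids) if calls % 2 else list(reversed(doc_ids))

    return run


ids = [f"doc:{i}" for i in range(6)]

# No sort: page 2 is sliced from a differently ordered result set than page 1.
seen_unstable = [d for page in paginate(make_unstable_query(ids), 3) for d in page]
duplicates = sorted({d for d in seen_unstable if seen_unstable.count(d) > 1})
missing = sorted(set(ids) - set(seen_unstable))

# Sorting on a unique key before slicing makes every page boundary stable.
unstable = make_unstable_query(ids)
seen_sorted = [d for page in paginate(lambda: sorted(unstable()), 3) for d in page]

print(duplicates)  # ['doc:0', 'doc:1', 'doc:2']
print(missing)     # ['doc:3', 'doc:4', 'doc:5']
print(seen_sorted == ids)  # True
```

In this model the unsorted run visits half of the documents twice and never sees the other half, while the sorted run visits each document once.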
Suggested solution:
Add a unique, indexed, and sortable field to your schema (for example: document_id), and update the query to paginate in a stable order using sort_by:
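async for batch in self.paginate(
    FilterQuery(FilterExpression("*"), return_fields=["id"], sort_by="document_id"), page_size=500
):
    ...
With a stable sort on a unique field, each document appears on exactly one page, provided the set of matching documents does not change between page requests.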
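For reference, a sortable field might be declared roughly as follows. This is a sketch of a redisvl-style schema dictionary. The index name, prefix, and the choice of a tag field for document_id are illustrative assumptions, and the exact attribute names should be checked against the redisvl schema documentation.

```python
# Hypothetical schema: "docs" and "doc" are placeholder names, and
# document_id is assumed to hold a unique value for every document.
schema = {
    "index": {"name": "docs", "prefix": "doc", "storage_type": "hash"},
    "fields": [
        {
            "name": "document_id",
            "type": "tag",
            # Assumption: "sortable" marks the field as usable in SORTBY.
            "attrs": {"sortable": True},
        },
    ],
}

# Look up the field definition that sort_by="document_id" would rely on.
sort_field = next(f for f in schema["fields"] if f["name"] == "document_id")
print(sort_field["attrs"]["sortable"])
```

The field needs to be both unique and sortable: sorting on a field with repeated values leaves the order among ties undefined, which would bring back the same instability.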
I am preparing a PR for this fix.