[BUG] segment replication stops when publish checkpoint fails #17595
Comments
Hi, @ashking94
@guojialiang92 Thanks for sharing this issue. Would retries help if the transport call for publish checkpoint fails? We could consider a retry strategy that never gives up and keeps retrying with a backoff interval. As for the async polling approach, wouldn't it add too many transport calls and impose a fixed network cost even when a replica shard is not actively being written to?
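The retry-with-backoff idea raised here can be sketched roughly as follows. This is a minimal illustration, not OpenSearch code: `PublishRetry`, `delayMillis`, and `publishUntilAcked` are hypothetical names, the transport call is stood in for by a `BooleanSupplier`, and doubling the delay up to a cap is just one possible backoff policy.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

/** Hypothetical helper for illustration only; not part of OpenSearch. */
final class PublishRetry {

    /** Delay before retry number {@code attempt} (0-based): base * 2^attempt, capped at maxMillis. */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt >= 62) {
            return maxMillis; // avoid overflowing the shift below
        }
        long factor = 1L << attempt;
        if (baseMillis > maxMillis / factor) {
            return maxMillis; // base * factor would exceed the cap (or overflow)
        }
        return Math.min(maxMillis, baseMillis * factor);
    }

    /**
     * Keeps calling sendCheckpoint (assumed to return true once the replica has
     * acknowledged the checkpoint) and never gives up; it waits with a capped
     * exponential backoff between attempts. Returns the number of failed attempts.
     */
    static int publishUntilAcked(BooleanSupplier sendCheckpoint, LongConsumer sleep,
                                 long baseMillis, long maxMillis) {
        int attempt = 0;
        while (!sendCheckpoint.getAsBoolean()) {
            sleep.accept(delayMillis(attempt, baseMillis, maxMillis));
            if (attempt < Integer.MAX_VALUE) {
                attempt++;
            }
        }
        return attempt;
    }
}
```

Passing the sleep function in as a parameter keeps the sketch testable without real waiting. A real implementation would presumably also need a way to stop retrying when the shard is closed or a newer checkpoint supersedes the old one.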
Thanks, @ashking94
To confirm my understanding, do you suggest extending the constructor of
To avoid additional network overhead, the asynchronous logic can send requests only to active replica copies. At the same time, we can refer to
Overall, specifying an infinite timeout may require fewer changes, and future scenarios may be able to reuse it. Looking forward to your reply.
Describe the bug
Segment replication currently works in pull mode, but it relies on the replica receiving the checkpoint published by the primary shard.
If the replica fails to receive the primary's checkpoint for some reason (for example, network issues cause the limit of TransportReplicationAction#REPLICATION_RETRY_TIMEOUT to be exceeded) and no further writes reach the primary shard, the replica shard can never synchronize with the primary shard. Typically, users who hit this problem have to write a document to make the primary shard publish a new checkpoint, or work around it by reducing the number of replicas and then increasing it again.
Related component
Indexing:Replication
To Reproduce
The problem can be reproduced reliably in an integration test:
RemoteTransportException exception before processing request indices:admin/publishCheckpoint[r].
indices:admin/publishCheckpoint[r] normally.
Expected behavior
I suggest adding a scheduled asynchronous task. When the primary shard detects that the replica is behind for more than a certain time threshold, it triggers a publish checkpoint.
Please help me evaluate whether I can make improvements in this way. I will submit a Pull Request and supplement relevant test cases.
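The detection step of the proposed scheduled task might look something like the sketch below. It is only an illustration under stated assumptions: `LaggingReplicaDetector` and `replicasNeedingRepublish` are hypothetical names, and the sketch assumes the primary already tracks, for each active replica that is behind its latest checkpoint, the time at which that replica fell behind.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch for illustration only; names are not OpenSearch APIs. */
final class LaggingReplicaDetector {
    private final long thresholdMillis;

    LaggingReplicaDetector(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /**
     * @param behindSinceMillis for each active replica that is behind the primary's
     *                          latest checkpoint, the time (ms) at which it fell behind;
     *                          replicas that are caught up are simply absent
     * @param nowMillis         the current time (ms)
     * @return ids of replicas that have been behind for longer than the threshold,
     *         in sorted order; the caller would re-publish the checkpoint to these
     */
    List<String> replicasNeedingRepublish(Map<String, Long> behindSinceMillis, long nowMillis) {
        List<String> result = new ArrayList<>();
        // TreeMap gives a deterministic (sorted) iteration order.
        for (Map.Entry<String, Long> e : new TreeMap<>(behindSinceMillis).entrySet()) {
            if (nowMillis - e.getValue() > thresholdMillis) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

A periodic task, for example one driven by `ScheduledExecutorService.scheduleWithFixedDelay`, could call this method and re-publish the checkpoint only to the replicas it returns. Replicas that are caught up never appear in the input, so they incur no extra transport calls.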