This repository was archived by the owner on May 10, 2022. It is now read-only.

bug: client can't recover from a replica-server failure. #53

@neverchanje

Description

  • 2019/9/11 17:00. Our SRE stopped one replica-server instance in our staging environment
    to reproduce the problem where the Java client cannot recover.

  • 2019/9/11 17:00. Some of our clients recovered right away when the replica-server restarted, but others could not reconnect and kept retrying, failing each time with ERR_TIMEOUT.

2019-09-11 17:12:48,659 ERROR [nioEventLoopGroup-4-1] [com.xiaomi.message.mixin.fsm.ruleset.SaveAckRuleSet$3$1.operationComplete(SaveAckRuleSet.java:169)] - SaveB2CAckInfo fail, appId=2882303761517479657 msgId=sms-1c33f956-7dad-4327-8986-aa400cb4a8b2, ex:
com.xiaomi.infra.pegasus.client.PException: com.xiaomi.infra.pegasus.rpc.ReplicationException: ERR_TIMEOUT
	at com.xiaomi.infra.pegasus.client.PegasusTable$8.onCompletion(PegasusTable.java:376)
	at com.xiaomi.infra.pegasus.rpc.async.ClientRequestRound.thisRoundCompletion(ClientRequestRound.java:51)
	at com.xiaomi.infra.pegasus.rpc.async.TableHandler.tryDelayCall(TableHandler.java:314)
	at com.xiaomi.infra.pegasus.rpc.async.TableHandler.onRpcReply(TableHandler.java:295)
	at com.xiaomi.infra.pegasus.rpc.async.TableHandler$3.run(TableHandler.java:326)
	at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession.tryNotifyWithSequenceID(ReplicaSession.java:226)
	at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession.access$300(ReplicaSession.java:32)
	at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession$4.run(ReplicaSession.java:270)
	at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:123)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.xiaomi.infra.pegasus.rpc.ReplicationException: ERR_TIMEOUT
	... 15 more
  • 2019-09-11 17:12:48. The errors persisted, so we stopped the test.

Client Version

1.11.5-thrift-0.11.0-inlined-release
