Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky TestAgentIdentification/ws #216

Closed
wants to merge 2 commits into from

Conversation

haoqixu
Copy link
Member

@haoqixu haoqixu commented Dec 1, 2023

Resolve #191

TestAgentIdentification executes the following steps to test the client:

  1. prepareClient(): setup settings and call SetAgentDescription of client which schedules an message sending.
  2. client.Start():
    send the first message
    (a) launch an goroutine to handle scheduled message sending.
    (b) start the receiver loop to handle messages from server.
  3. run the first eventually assert.
  4. client.SetAgentDescription() to schedule an message sending.
  5. run the second eventually assert.

scheduled message sending in (2.a) and messages handling (2.b) are running concurrently.

Case 1:
the timeline is as follows:

  • (1) calls c.SetAgentDescription to schedule an message sending.
  • (2) send the first message to the server.
  • (2.b) response from server is handled by the receiver loop, which update the instance ID of scheduled message.
  • (2.a) handle scheduled message and send it to the server, server store the new instance ID to rcvAgentInstanceUid.
  • (3) the first eventually assert fails to get the original instance ID

Case 2:
the timeline is as follows:

  • (1) calls c.SetAgentDescription to schedule an message sending.
  • (2) send the first message to the server.
  • (2.a) handle scheduled message. This is noop because there is no update since last sending.
  • (3) the first eventually assert trys to get the original instance ID and will return successfully after server has stored the instance ID in the first message to rcvAgentInstanceUid.
  • (4) client.SetAgentDescription() to schedule an message sending
  • (2.a) sender goroutine send the scheduled message to the server with original ID (response from server hasn't arrived yet or the message handling hasn't completed)
  • (2.b) response from server is handled by the receiver loop, which update the instance ID of scheduled message. But the client has no more chance to send a message.
  • (5) the second eventually assert failed to get the new instance uid.

The solution here is replacing atomic.Value with a buffered channel to avoid case 1 and waiting for the first response from server to avoid case 2.

@haoqixu haoqixu requested a review from a team December 1, 2023 18:29
Copy link

codecov bot commented Dec 1, 2023

Codecov Report

Attention: 129 lines in your changes are missing coverage. Please review.

Comparison is base (efddaa2) 76.11% compared to head (71b4d40) 72.43%.
Report is 31 commits behind head on main.

Files Patch % Lines
internal/certs.go 0.00% 118 Missing ⚠️
client/internal/httpsender.go 88.00% 2 Missing and 1 partial ⚠️
client/internal/mockserver.go 85.71% 2 Missing and 1 partial ⚠️
server/httpconnection.go 0.00% 2 Missing ⚠️
client/internal/clientcommon.go 90.90% 1 Missing ⚠️
client/internal/packagessyncer.go 0.00% 1 Missing ⚠️
server/serverimpl.go 95.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #216      +/-   ##
==========================================
- Coverage   76.11%   72.43%   -3.69%     
==========================================
  Files          24       25       +1     
  Lines        1834     2017     +183     
==========================================
+ Hits         1396     1461      +65     
- Misses        326      446     +120     
+ Partials      112      110       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@srikanthccv
Copy link
Member

I am trying to follow the case 1 timeline you mentioned.

Case 1:
the timeline is as follows:

  • (1) calls c.SetAgentDescription to schedule an message sending.
  • (2) send the first message to the server.
  • (2.b) response from server is handled by the receiver loop, which update the instance ID of scheduled message.
  • (2.a) handle scheduled message and send it to the server, server store the new instance ID to rcvAgentInstanceUid.
  • (3) the first eventually assert fails to get the original instance ID

2.b is triggered by the client sending the scheduled message (from 2.a step) so why does the first assertion fail?

@haoqixu
Copy link
Member Author

haoqixu commented Dec 5, 2023

I try to draw a sequence diagram of case 1:

sequenceDiagram
participant Client
participant Client.Receiver
participant Client.Sender
participant Server
    Client->>Client.Sender: (1) c.SetAgentDescription in `prepareClient` call `sender.ScheduleSend()`
    Client->>Server: (2) AgentToServer with original UID (via s.sendNextMessage in sender.Start)
alt
    Server->>Server: store the original instance ID to rcvAgentInstanceUid
end
    Client-->>Client.Sender: start the loop
    Client-->>Client.Receiver: start the loop
    Server->>Client.Receiver: (2.b)ServerToAgent with new UID
    Client.Receiver->>Client.Sender: Update the pendding Message with new UID
    Client.Sender->>Server: (2.a) send the pending scheduled message with new UID
alt Oops!
    Server->>Server: store the new instance ID to rcvAgentInstanceUid
end
   Server ->> Client.Receiver: reply ...
   Client->>Client: run the first eventually assertion and fail
Loading

2.b will update the message on processing the ServerToAgent message. If this happends before sender sending the pending message, rcvAgentInstanceUid will be updated with the new UID.

@srikanthccv
Copy link
Member

There are two times we can see when the ScheduleSend is invoked which sends the message to the Server. The first one is from the SetAgentDescription in prepareClient and then from the SetAgentDescription of Send a dummy message. The second dummy message is only sent after the first assertion is completed. So my question is why does the first assertion fail because, for rcvAgentInstanceUid to have the new UID, the dummy message should have been sent which can't be true because the dummy message is only sent after the the first asseetion passes. Or am I missing something?

@haoqixu
Copy link
Member Author

haoqixu commented Dec 5, 2023

the dummy message is only sent after the the first asseetion passes

assert.Eventually() does not terminate the test even the assertion has failed. So we can always see the dummy message sent even the first assertion has failed.

@srikanthccv
Copy link
Member

I am not sure I follow. What does this PR attempt to fix then? The first assertion is expected to always pass. If you look at the last failed runs in the issue, they point out Condition never satisfied at the first assertion. I think any fix should involve figuring out why it failed.

@haoqixu
Copy link
Member Author

haoqixu commented Dec 5, 2023

The first assertion is used to check whether the server has received the original uid. In the timeline case 1, the server has received the original uid, the first assertion falied becase rcvAgentInstanceUid(atomic.Value) was overwritten by the second AgentToServer message.

This PR replaces the atomic.Value with a buffered channel to record uids received at server side to avoid this case.

@srikanthccv
Copy link
Member

In the timeline case 1, the server has received the original uid, the first assertion falied becase rcvAgentInstanceUid(atomic.Value) was overwritten by the second AgentToServer message.

How did you arrive at this conclusion? It could be because no value was set to rcvAgentInstanceUid and the assertion failed. I don't see how an override is the reason for the first assertion failure unless I am missing something.

@haoqixu
Copy link
Member Author

haoqixu commented Dec 5, 2023

It could be because no value was set to rcvAgentInstanceUid and the assertion failed.

If no value was set to rcvAgentInstanceUid, both the eventually assertions will failed. In the failed runs in the issue, only the first assertion failed.

@haoqixu
Copy link
Member Author

haoqixu commented Dec 5, 2023

https://github.com/haoqixu/opamp-go/tree/debug-pr-216
I have submitted some test log to this branch.

@haoqixu
Copy link
Member Author

haoqixu commented Dec 5, 2023

2023/12/05 23:33:17 test begin
2023/12/05 23:33:17 client: SetAgentDescription and ScheduleSend
2023/12/05 23:33:17 old instance uid is 01HGX9SCX1EQPHGGJ9X6Y1JSTY, starting client
2023/12/05 23:33:17 client: send message at Start
2023/12/05 23:33:17 server: store uid 01HGX9SCX1EQPHGGJ9X6Y1JSTY to rcvAgentInstanceUid
2023/12/05 23:33:17 client: set instance uid of pending message 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 client: sending pending message
2023/12/05 23:33:17 server: store uid 01HGX9SCX106AFVGQT5ZYC0GEK to rcvAgentInstanceUid  <------- this line
2023/12/05 23:33:17 client: set instance uid of pending message 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:17 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
(... first assert message ...)
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 first assert: want 01HGX9SCX1EQPHGGJ9X6Y1JSTY, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 send dummy message
2023/12/05 23:33:22 client: SetAgentDescription and ScheduleSend
2023/12/05 23:33:22 client: sending pending message
2023/12/05 23:33:22 server: store uid 01HGX9SCX106AFVGQT5ZYC0GEK to rcvAgentInstanceUid
2023/12/05 23:33:22 client: set instance uid of pending message 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 second assert: want 01HGX9SCX106AFVGQT5ZYC0GEK, got: 01HGX9SCX106AFVGQT5ZYC0GEK
2023/12/05 23:33:22 test end
--- FAIL: TestAgentIdentification (5.06s)
    --- FAIL: TestAgentIdentification/ws (5.04s)
        clientimpl_test.go:94: 
            	Error Trace:	clientimpl_test.go:94
            	            				clientimpl_test.go:653
            	            				clientimpl_test.go:77
            	Error:      	Condition never satisfied
            	Test:       	TestAgentIdentification/ws

This is the Case 1.

@haoqixu
Copy link
Member Author

haoqixu commented Dec 5, 2023

2023/12/05 23:35:01 test begin
2023/12/05 23:35:01 client: SetAgentDescription and ScheduleSend
2023/12/05 23:35:01 old instance uid is 01HGX9WJPKEQPHGGJ9X6Y1JSTY, starting client
2023/12/05 23:35:01 client: send message at Start
2023/12/05 23:35:01 client: sending pending message
2023/12/05 23:35:01 server: store uid 01HGX9WJPKEQPHGGJ9X6Y1JSTY to rcvAgentInstanceUid
2023/12/05 23:35:01 first assert: want 01HGX9WJPKEQPHGGJ9X6Y1JSTY, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 send dummy message
2023/12/05 23:35:01 client: SetAgentDescription and ScheduleSend
2023/12/05 23:35:01 client: sending pending message
2023/12/05 23:35:01 client: set instance uid of pending message 01HGX9WJPK06AFVGQT5ZYC0GEK   <------- this line
2023/12/05 23:35:01 server: store uid 01HGX9WJPKEQPHGGJ9X6Y1JSTY to rcvAgentInstanceUid
2023/12/05 23:35:01 client: set instance uid of pending message 01HGX9WJPK06AFVGQT5ZYC0GEK
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:01 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
(... second assert ...)
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 second assert: want 01HGX9WJPK06AFVGQT5ZYC0GEK, got: 01HGX9WJPKEQPHGGJ9X6Y1JSTY
2023/12/05 23:35:06 test end
--- FAIL: TestAgentIdentification (5.02s)
    --- FAIL: TestAgentIdentification/ws (5.02s)
        clientimpl_test.go:94: 
            	Error Trace:	clientimpl_test.go:94
            	            				clientimpl_test.go:672
            	            				clientimpl_test.go:77
            	Error:      	Condition never satisfied
            	Test:       	TestAgentIdentification/ws

This is the Case 2.

@srikanthccv
Copy link
Member

Were you able to easily reproduce the original issue? Can you share how did you reproduce?

@haoqixu
Copy link
Member Author

haoqixu commented Dec 6, 2023

I can reproduce it by running go test -run TestAgentIdentification/ws -race -count 10000 ./... and I have submitted the output here: https://github.com/haoqixu/opamp-go/tree/debug-pr-216/client/testlog

@srikanthccv
Copy link
Member

Thanks, I will take a look.

@srikanthccv
Copy link
Member

Sorry for the lack of response here for some time. I have been sick and swamped with a lot of work at $job. I will find time to get back to review PRs soon.

@srikanthccv
Copy link
Member

The first assertion is used to check whether the server has received the original uid. In the timeline case 1, the server has received the original uid, the first assertion falied becase rcvAgentInstanceUid(atomic.Value) was overwritten by the second AgentToServer message.

Coming back to this point, this says the first assertion failed because id was overwritten by second AgentToServer. However, the second AgentToServer is the dummy message which is sent after the eventually which is blocking so my reasoning and question is how does the second message cause the first assertion failure because they are sequential instructions. If we look at the code we expect the second assertion to override but not simultaneously with the first assertion.

@haoqixu
Copy link
Member Author

haoqixu commented Dec 17, 2023

the second AgentToServer is the dummy message

The second AgentToServer is not the dummy message. It is the pending message scheduled by SetAgentDescription.

图片

@srikanthccv
Copy link
Member

Thanks, I went through the ws sender code again. I think the additional sendNextMessage invocation shouldn't be there in ws sender

func (s *WSSender) Start(ctx context.Context, conn *websocket.Conn) error {
s.conn = conn
err := s.sendNextMessage()
// Run the sender in the background.
s.stopped = make(chan struct{})
go s.run(ctx)
return err
All the client messages that should be sent to server would already have called the ScheduleSend and we expect the SetAgentDescription must be called before Start which schedules the message. I believe this is why we don't see the same flaky behaviour in the http client.

@haoqixu
Copy link
Member Author

haoqixu commented Dec 18, 2023

I believe this is why we don't see the same flaky behaviour in the http client.

Yes. But I don't think test code should treat the above cases as a failure. Even with this extra message, the client uses the original ID first and then sends subsequent messages with the new ID after receiving the reply as expected. I think we should update the test code so that the above two cases pass. I would like to submit anohter PR to avoid the additional message.

@srikanthccv
Copy link
Member

The test is written with certain assumptions and it failed because of the additional message, if we fix that test would pass and we don't need to fix any tests.

@haoqixu
Copy link
Member Author

haoqixu commented Dec 18, 2023

Even with this extra message, the client uses the original ID first and then sends subsequent messages with the new ID after receiving the reply as expected

I don't think the extra message is incorrect (although it could definitely be improved). The observable behavior of the client is as expected, but our detection method cannot handle it and fail to produce the same outcome with each individual test run.

@haoqixu
Copy link
Member Author

haoqixu commented Dec 18, 2023

if we fix that test would pass and we don't need to fix any tests.

Case 2 is not caused by the additional message and we still need to need the test.

@haoqixu
Copy link
Member Author

haoqixu commented Dec 18, 2023

This test is flaky, it usually succeeds but occasionally fails. If we want to avoid extra messages, it is not enough to prove it, we still need a separate test. So I prefer that this flaky test code should be fixed first to make its results consistent.

@tigrannajaryan
Copy link
Member

It has been a while and the test has been changed since it failed and does not appear to be failing any more, see here.

Closing this PR, but we can reopen if we see failures again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TestAgentIdentification/ws is flaky
3 participants