@NachoEchevarria NachoEchevarria commented Jan 12, 2026

Summary of changes

Fixes the flaky TestReceiveMessagesAsyncIntegration test in the Azure Service Bus APM integration tests by ensuring scheduled messages are delivered before queue cleanup.

Reason for change

The TestReceiveMessagesAsyncIntegration test was intermittently failing in CI with the error:

Failed Datadog.Trace.ClrProfiler.IntegrationTests.Azure.AzureServiceBusAPMTests.TestReceiveMessagesAsyncIntegration(packageVersion: "7.18.4", metadataSchemaVersion: "v1") [3 s]
 Error Message:
  Expected linkedSendSpan not to be <null> because Receive span 8970363549073655187 has link to span 14625431652207073697 in trace 7096167803062670649, but corresponding send span not found.
  Expected linkedSendSpan not to be <null> because Receive span 3961647348784637861 has link to span 14625431652207073697 in trace 7096167803062670649, but corresponding send span not found.

=== Receive Messages Test ===
Sent test message for receive with ID: 8ddfb05e-9d2b-4c4d-ad7a-34e991e34bfd
Attempting to receive message...
Received message ID: 03e0bf63934346329a9c336cbdd5ebc2, Body: Scheduled Message 0 from ScheduleMessages test
Message completed successfully
Purging existing messages from queue...
Purged 2 existing messages from queue
Resources handled successfully
Azure Service Bus APM Test Sample completed successfully

This occurred because the test was receiving messages scheduled by previous TestScheduleMessagesAsyncIntegration test runs, but the corresponding send spans were not available (they were from a different test execution).
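
The failing assertion can be pictured with a small sketch. This is not the actual test code: the `Span` and `SpanLink` records, the `"send"`/`"receive"` operation names, and `FindUnresolvedLinks` are hypothetical stand-ins for the tracer's own test types. It only shows why a receive span that links to a send span from an earlier test run cannot be resolved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified span model (assumption, not the tracer's real types).
public sealed record SpanLink(ulong TraceId, ulong SpanId);
public sealed record Span(ulong TraceId, ulong SpanId, string Operation, IReadOnlyList<SpanLink> Links);

public static class LinkCheck
{
    // Returns every link on a receive span that does not point at a send span
    // captured during the same test run.
    public static List<SpanLink> FindUnresolvedLinks(IReadOnlyCollection<Span> spans)
    {
        var sendSpanKeys = spans
            .Where(s => s.Operation == "send")
            .Select(s => (s.TraceId, s.SpanId))
            .ToHashSet();

        return spans
            .Where(s => s.Operation == "receive")
            .SelectMany(s => s.Links)
            .Where(link => !sendSpanKeys.Contains((link.TraceId, link.SpanId)))
            .ToList();
    }
}
```

With only a receive span in the captured set, and its link pointing at span 14625431652207073697 in trace 7096167803062670649 (the IDs from the error above), the link is reported as unresolved, which is the situation the assertion rejects.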

The race condition occurred due to the timing of scheduled message delivery:

  1. TestScheduleMessagesAsync schedules messages for 1 second in the future (DateTimeOffset.Now.AddSeconds(1))
  2. The test completes immediately and calls PurgeQueue
  3. PurgeQueue waits 2 seconds trying to receive messages
  4. Critical issue: the Azure Service Bus emulator doesn't guarantee exact delivery timing
    - If delivery is delayed beyond the 2-second PurgeQueue window, messages escape cleanup
    - These orphaned scheduled messages get received by subsequent TestReceiveMessagesAsync test runs
    - The test fails because it can't find the corresponding send spans (they were in a different test run)
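
The timing in the steps above can be written down as a simplified model. This is a sketch under assumptions: `EscapesPurge` and its parameters are illustrative names, all times are measured from the moment the message is scheduled, and the purge is treated as a single fixed receive window rather than whatever PurgeQueue does internally.

```csharp
using System;

public static class PurgeRace
{
    // Returns true when a scheduled message becomes receivable only after the
    // purge window has closed, so the purge never sees it.
    public static bool EscapesPurge(
        TimeSpan scheduleDelay,     // how far in the future the message was scheduled
        TimeSpan deliveryLag,       // extra delay added by the emulator (assumed, not measured)
        TimeSpan purgeStartsAfter,  // when the purge begins, relative to scheduling
        TimeSpan purgeWindow)       // how long the purge keeps trying to receive
    {
        var deliveredAt = scheduleDelay + deliveryLag;
        var purgeEndsAt = purgeStartsAfter + purgeWindow;
        return deliveredAt > purgeEndsAt;
    }
}
```

For a 1-second schedule delay and a purge that starts immediately with a 2-second window, a delivery lag of 0.5 seconds is cleaned up (1.5 s is inside the window), while a lag of 1.5 seconds escapes (2.5 s is past the window).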

The test usually passed because:

  • Delivery timing: Messages were usually delivered within the 2-second PurgeQueue window and got cleaned up
  • Test execution order: Random test shuffling meant ReceiveMessages didn't always run immediately after ScheduleMessages
  • Test gaps: Other tests running in between provided extra time for delivery and cleanup

It failed when:

  • Message delivery was delayed beyond 2 seconds (emulator timing variance due to CPU load, network, etc.)
  • Multiple ScheduleMessages tests ran before ReceiveMessages
  • Random test ordering placed ReceiveMessages right after ScheduleMessages tests

Implementation details

Modified TestScheduleMessagesAsync in Samples.AzureServiceBus.APM/Program.cs to wait for scheduled messages to be delivered before returning:

  // Calculate remaining time and ensure we wait at least 2 seconds total
  var waitTime = scheduleTime - DateTimeOffset.Now;
  var totalWaitSeconds = Math.Max(2.0, waitTime.TotalSeconds + 1.0);
  await Task.Delay(TimeSpan.FromSeconds(totalWaitSeconds));

This ensures:

  • Scheduled messages are actually delivered before PurgeQueue runs
  • The shared Azure Service Bus emulator queue is clean before the next test starts
  • No interference between test runs regardless of execution order
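
To make the wait calculation concrete, the same arithmetic can be pulled into a pure function and evaluated for a few inputs. `WaitCalculator.ComputeWait` is a hypothetical helper used only for illustration; the sample itself computes the value inline from `DateTimeOffset.Now`.

```csharp
using System;

public static class WaitCalculator
{
    // Same arithmetic as the snippet above, with "now" passed in so the result
    // is deterministic: wait until one second past the schedule time, and never
    // less than two seconds.
    public static TimeSpan ComputeWait(DateTimeOffset scheduleTime, DateTimeOffset now)
    {
        var remaining = scheduleTime - now;
        var totalSeconds = Math.Max(2.0, remaining.TotalSeconds + 1.0);
        return TimeSpan.FromSeconds(totalSeconds);
    }
}
```

If the schedule time is still 1 second away, the wait is 2 seconds. If it is 5 seconds away, the wait is 6 seconds. If it has already passed, the 2-second floor applies.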

Test coverage

This change fixes the existing integration test rather than adding new tests. The fix eliminates the race condition entirely rather than masking it with retries.

Other details

@github-actions github-actions bot added the area:tests unit tests, integration tests label Jan 12, 2026

datadog-datadog-prod-us1 bot commented Jan 12, 2026

⚠️ Tests

⚠️ Warnings

❄️ 1 New flaky test detected

FiresCallbackOnRecheckIfHasChangesToConfig from Datadog.Trace.Tests.Agent.DiscoveryServiceTests (Datadog)
Expected mutex3.Wait(30_000) to be True because Should make third request to api, but found False.

ℹ️ Info

🧪 All tests passed

🔗 Commit SHA: 08513c5

@NachoEchevarria NachoEchevarria marked this pull request as ready for review January 13, 2026 16:56
@NachoEchevarria NachoEchevarria requested review from a team as code owners January 13, 2026 16:56