Skip to content

Heartbeat delay #17

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Jun 2, 2025
Merged

Heartbeat delay #17

merged 20 commits into from
Jun 2, 2025

Conversation

mladjan-gadzic
Copy link
Contributor

@mladjan-gadzic mladjan-gadzic commented May 26, 2025

  • collection for ended activities is removed
  • OnStart adds activity to active and heartbeat delayed activities
  • OnEnd removes activity from active activities
  • heartbeat is redesigned
    • when processor starts separate thread is started
    • job of the thread runs every z ms
    • when it runs it processes activities
      • with delayed heartbeat are moved to ready heartbeat
      • with ready heartbeat are removed from the queue, logs are pushed and put back at the beginning
  • added tests to cover new functionality
  • bumped up version
  • modified Example.cs to include new changes
  • minor syntax fixes recommended by ide
  • batching is to be added later

@mladjan-gadzic mladjan-gadzic linked an issue May 26, 2025 that may be closed by this pull request
Copy link
Member

@jgiannuzzi jgiannuzzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, I would suggest the following change, given x as the hearbeat interval, and y as the initial delay:

  • use 2 queues:
    • one for just started spans (OnStart)
    • one for spans that sent their initial heartbeat
  • OnStart adds the span to the back of the "started spans" queue with an initial heartbeat time of "now + y"
  • have one loop that runs every z ("batch interval"? there's probably a better name for it) and starts 2 tasks:
    • one looking at the queue of started spans, popping all the spans that have started y ago, sending their initial heartbeat, and adding them to the back of the heartbeat queue with their next heartbeat time (now + x)
    • one looking a the queue of heartbeat spans, popping all the spans that have sent their last heartbeat x ago, sending their new heartbeat, and adding them to the back of the heartbeat queue with their new heartbeat time (now + x)
  • you'd probably also need to have an efficient way to remove spans from these queues by having some kind of index by span ID, so that you can remove them when they finish (OnEnd)

Comment on lines 84 to 88
using (_logger.Value.BeginScope(GetHeartbeatLogRecordAttributes()))
{
_logger.Value.LogInformation(
SpecHelper.Json(new TracesData(data, TracesData.Signal.Heartbeat)));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the whole point of this change is to not send the event before the delay!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we discussed that heartbeat should not be sent before the delay. i am considering "on start" event, which just coincidentally has all attributes as a heartbeat, a different event.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on your comment below, I think you know the answer, but just to be 100% clear: the goal of the delay is to postpone sending the initial heartbeat event, so that short-lived spans do not result in any partial span log events

Comment on lines +99 to +100
_logger.Value.LogInformation(
SpecHelper.Json(new TracesData(data, TracesData.Signal.Stop)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should also send a stop event if we had sent an initial heartbeat - again the whole point of this change is to minimize traffic and server-side processing of short-lived spans!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same to this, i was considering "on end" separate event than heartbeat. would you like me not to send this for on start and on end?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on your comment below, I think you know the answer, but just to be 100% clear: yes, I would like to avoid sending a stop event when no initial heartbeat event was sent - the goal is to avoid having many short-lived spans sending 3 things: a "heartbeat" log event, a "stop" log event, and a regular span/trace.

@mladjan-gadzic
Copy link
Contributor Author

@jgiannuzzi thanks for correcting me. discussed this version of implementation with the team in a standup and basically what we want to achieve is that no events are emitted if span ended before first (delayed) heartbeat.

Comment on lines 91 to 92
var isDelayedHeartbeatPending =
_delayedHeartbeatActivities.Any(span => span.SpanId == data.SpanId);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is O(n) - we should instead have a way to immediately check whether a span is delayed or not (e.g. by using 2 dictionaries: one for delayed activities and one for ready activites)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we use one dictionary instead of two? we should be able to check if span is delayed with one. if span is ready we should send log because log was already sent after delay expired.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure thing - this was just a suggestion!

@mladjan-gadzic mladjan-gadzic merged commit 1a44326 into main Jun 2, 2025
18 checks passed
@mladjan-gadzic mladjan-gadzic deleted the opd-12 branch June 2, 2025 10:26
Comment on lines +54 to +56
int heartbeatIntervalMilliseconds = DefaultHeartbeatIntervalMilliseconds,
int initialHeartbeatDelayMilliseconds = DefaultInitialHeartbeatDelayMilliseconds,
int processIntervalMilliseconds = DefaultProcessIntervalMilliseconds
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense for these to be TimeSpans? I guess you'd have to use null or TimeSpan.Zero (or similar) as the default value.

Suggested change
int heartbeatIntervalMilliseconds = DefaultHeartbeatIntervalMilliseconds,
int initialHeartbeatDelayMilliseconds = DefaultInitialHeartbeatDelayMilliseconds,
int processIntervalMilliseconds = DefaultProcessIntervalMilliseconds
TimeSpan? heartbeatInterval = null,
TimeSpan? initialHeartbeatDelay = null,
TimeSpan? processInterval = null

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd rather keep it as int since it aligns with other processors implementations (e.g https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry/BatchExportProcessor.cs#L23)

Heartbeat();
}
catch (ObjectDisposedException)
while (_delayedHeartbeatActivities.TryPeek(out var activity) &&
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will only remove it from _delayedHeartbeatActivities if it's at the front of the queue, right? That doesn't seem like what we'd want.

Either we could just not remove it and look for it in _delayedHeartbeatActivitiesLookup when it comes to send the initial heartbeat, or we have to search the whole queue.

I think the first choice is probably easier/faster, as _delayedHeartbeatActivitiesLookup should (I think) be smaller than _delayedHeartbeatActivities and obviously it has the index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from docs about TryPeek

Tries to return an object from the beginning of the ConcurrentQueue without removing it.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I mean. It will only look at the item at the head of the queue.

If it matches activity.SpanId then we will remove it, and the next item obviously won't match and we'll exit the loop.

If the first item doesn't match then we'll immediately exit the loop.

There's no scenario where we'll ever check anything beyond the first two items (if I'm right).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will try to peek item at the head of the queue and check whether it's time for a heartbeat. if it is not, that's it. if it is it will check whether that item is active and if it is it will do heartbeat and add it to the ready queue. we're not breaking the loop, but continuing to the next item. what am i missing?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the other loop. This is OnEnd and I think it's supposed to be removing the activity from both the lookup and the _delayedHeartbeatActivities queue (if it's there), but it looks to me like it will only remove it if it is at the head of the queue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, i see, i am not iterating through it with TryPeek, but just getting the first one all the time. i'll see what i can do

Comment on lines +215 to +219
while (_delayedHeartbeatActivities.TryPeek(out var span) &&
span.InitialHeartbeatTime <= DateTime.UtcNow)
{
_delayedHeartbeatActivitiesLookup.TryRemove(span.SpanId, out _);
_delayedHeartbeatActivities.TryDequeue(out span);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might there be a race condition here if the span from TryPeek(out var span) has been removed from the queue before the TryDequeue(out span)?

The first thing to do would be swap the order so that at least the TryRemove(span.SpanId, out _) removes the right span from the lookup. I think it might still be possible that this might send something before its InitialHeartbeatTime though.

Copy link
Contributor Author

@mladjan-gadzic mladjan-gadzic Jun 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be the issue because removal from _delayedHeartbeatActivities within OnEnd method can happen after peek within ProcessDelayedHeartbeatActivities method. what i can think of is to synchronize these with locks. what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something like this 22f3568?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing as we have two collections in play (the lookup and the queue) we probably have to go with locks. In that case I'd suggest using them everywhere and just use normal collections rather than the concurrent ones.

When performing an operation on both lists, put the lock around both operations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a delay before the initial heartbeat for a span is sent
3 participants