-
Notifications
You must be signed in to change notification settings - Fork 0
Heartbeat delay #17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Heartbeat delay #17
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, I would suggest the following change, given x
as the hearbeat interval, and y
as the initial delay:
- use 2 queues:
- one for just started spans (
OnStart
) - one for spans that sent their initial heartbeat
- one for just started spans (
OnStart
adds the span to the back of the "started spans" queue with an initial heartbeat time of "now +y
"- have one loop that runs every
z
("batch interval"? there's probably a better name for it) and starts 2 tasks:- one looking at the queue of started spans, popping all the spans that have started y ago, sending their initial heartbeat, and adding them to the back of the heartbeat queue with their next heartbeat time (now +
x
) - one looking a the queue of heartbeat spans, popping all the spans that have sent their last heartbeat
x
ago, sending their new heartbeat, and adding them to the back of the heartbeat queue with their new heartbeat time (now +x
)
- one looking at the queue of started spans, popping all the spans that have started y ago, sending their initial heartbeat, and adding them to the back of the heartbeat queue with their next heartbeat time (now +
- you'd probably also need to have an efficient way to remove spans from these queues by having some kind of index by span ID, so that you can remove them when they finish (
OnEnd
)
G-Research.OpenTelemetry.Processor.Partial/PartialActivityProcessor.cs
Outdated
Show resolved
Hide resolved
using (_logger.Value.BeginScope(GetHeartbeatLogRecordAttributes())) | ||
{ | ||
_logger.Value.LogInformation( | ||
SpecHelper.Json(new TracesData(data, TracesData.Signal.Heartbeat))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the whole point of this change is to not send the event before the delay!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we discussed that heartbeat should not be sent before the delay. i am considering "on start" event, which just coincidentally has all attributes as a heartbeat, a different event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on your comment below, I think you know the answer, but just to be 100% clear: the goal of the delay is to postpone sending the initial heartbeat event, so that short-lived spans do not result in any partial span log events
_logger.Value.LogInformation( | ||
SpecHelper.Json(new TracesData(data, TracesData.Signal.Stop))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should also send a stop event if we had sent an initial heartbeat - again the whole point of this change is to minimize traffic and server-side processing of short-lived spans!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same to this, i was considering "on end" separate event than heartbeat. would you like me not to send this for on start and on end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on your comment below, I think you know the answer, but just to be 100% clear: yes, I would like to avoid sending a stop event when no initial heartbeat event was sent - the goal is to avoid having many short-lived spans sending 3 things: a "heartbeat" log event, a "stop" log event, and a regular span/trace.
@jgiannuzzi thanks for correcting me. discussed this version of implementation with the team in a standup and basically what we want to achieve is that no events are emitted if span ended before first (delayed) heartbeat. |
var isDelayedHeartbeatPending = | ||
_delayedHeartbeatActivities.Any(span => span.SpanId == data.SpanId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is O(n) - we should instead have a way to immediately check whether a span is delayed or not (e.g. by using 2 dictionaries: one for delayed activities and one for ready activites)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't we use one dictionary instead of two? we should be able to check if span is delayed with one. if span is ready we should send log because log was already sent after delay expired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure thing - this was just a suggestion!
int heartbeatIntervalMilliseconds = DefaultHeartbeatIntervalMilliseconds, | ||
int initialHeartbeatDelayMilliseconds = DefaultInitialHeartbeatDelayMilliseconds, | ||
int processIntervalMilliseconds = DefaultProcessIntervalMilliseconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make more sense for these to be TimeSpan
s? I guess you'd have to use null
or TimeSpan.Zero
(or similar) as the default value.
int heartbeatIntervalMilliseconds = DefaultHeartbeatIntervalMilliseconds, | |
int initialHeartbeatDelayMilliseconds = DefaultInitialHeartbeatDelayMilliseconds, | |
int processIntervalMilliseconds = DefaultProcessIntervalMilliseconds | |
TimeSpan? heartbeatInterval = null, | |
TimeSpan? initialHeartbeatDelay = null, | |
TimeSpan? processInterval = null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'd rather keep it as int since it aligns with other processors implementations (e.g https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry/BatchExportProcessor.cs#L23)
Heartbeat(); | ||
} | ||
catch (ObjectDisposedException) | ||
while (_delayedHeartbeatActivities.TryPeek(out var activity) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will only remove it from _delayedHeartbeatActivities
if it's at the front of the queue, right? That doesn't seem like what we'd want.
Either we could just not remove it and look for it in _delayedHeartbeatActivitiesLookup
when it comes to send the initial heartbeat, or we have to search the whole queue.
I think the first choice is probably easier/faster, as _delayedHeartbeatActivitiesLookup
should (I think) be smaller than _delayedHeartbeatActivities
and obviously it has the index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from docs about TryPeek
Tries to return an object from the beginning of the ConcurrentQueue without removing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I mean. It will only look at the item at the head of the queue.
If it matches activity.SpanId
then we will remove it, and the next item obviously won't match and we'll exit the loop.
If the first item doesn't match then we'll immediately exit the loop.
There's no scenario where we'll ever check anything beyond the first two items (if I'm right).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will try to peek item at the head of the queue and check whether it's time for a heartbeat. if it is not, that's it. if it is it will check whether that item is active and if it is it will do heartbeat and add it to the ready queue. we're not breaking the loop, but continuing to the next item. what am i missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the other loop. This is OnEnd
and I think it's supposed to be removing the activity from both the lookup and the _delayedHeartbeatActivities
queue (if it's there), but it looks to me like it will only remove it if it is at the head of the queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, i see, i am not iterating through it with TryPeek, but just getting the first one all the time. i'll see what i can do
while (_delayedHeartbeatActivities.TryPeek(out var span) && | ||
span.InitialHeartbeatTime <= DateTime.UtcNow) | ||
{ | ||
_delayedHeartbeatActivitiesLookup.TryRemove(span.SpanId, out _); | ||
_delayedHeartbeatActivities.TryDequeue(out span); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might there be a race condition here if the span from TryPeek(out var span)
has been removed from the queue before the TryDequeue(out span)
?
The first thing to do would be swap the order so that at least the TryRemove(span.SpanId, out _)
removes the right span from the lookup. I think it might still be possible that this might send something before its InitialHeartbeatTime
though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this might be the issue because removal from _delayedHeartbeatActivities
within OnEnd
method can happen after peek within ProcessDelayedHeartbeatActivities
method. what i can think of is to synchronize these with locks. what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something like this 22f3568?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing as we have two collections in play (the lookup and the queue) we probably have to go with locks. In that case I'd suggest using them everywhere and just use normal collections rather than the concurrent ones.
When performing an operation on both lists, put the lock around both operations.
z
ms