Monitoring message queue performance with Sentry #69320

bcoe · 2024-04-19T17:35:29Z

Maintainer

Messaging systems, such as Celery or SQS, are a useful abstraction. However, they can be challenging to reason about:

- Message queues are best understood with distributed tracing: one part of the system places a message in the queue, and another part grabs the message and does something with it.
- Errors might originate in application logic, or could stem from limitations of the message queue system itself (for example, it’s common for queue systems to have limits on message size).
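
As a rough illustration of that producer/consumer hand-off, here is a minimal sketch of carrying a trace ID through a message so that both sides can be tied together. It is not Sentry's API: the in-memory `queue`, the `publish` and `consume` helpers, and the message fields are all hypothetical names chosen for this example.

```python
import time
import uuid
from collections import deque

# Stand-in for a real broker (assumption: everything runs in one process).
queue = deque()


def publish(body, trace_id=None):
    """Producer side: wrap the payload with tracing metadata and enqueue it."""
    message = {
        "id": uuid.uuid4().hex,
        # The same trace ID is later read by the consumer.
        "trace_id": trace_id or uuid.uuid4().hex,
        # monotonic() only works in-process; across hosts a real system
        # would need a wall-clock timestamp instead.
        "published_at": time.monotonic(),
        "body": body,
    }
    queue.append(message)
    return message


def consume(handler):
    """Consumer side: continue the producer's trace and time the work."""
    message = queue.popleft()
    started = time.monotonic()
    record = {
        "trace_id": message["trace_id"],
        "message_id": message["id"],
        # Time the message spent waiting in the queue.
        "receive_latency": started - message["published_at"],
        "status": "ok",
    }
    try:
        handler(message["body"])
    except Exception as exc:  # application error raised by the handler
        record["status"] = "error"
        record["error"] = repr(exc)
    record["processing_time"] = time.monotonic() - started
    return record


sent = publish({"user_id": 42})
result = consume(lambda body: body["user_id"])
```

Because the consumer's record reuses the producer's `trace_id`, a failure in the handler can be traced back to the request that published the message.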

We’re working on a feature at Sentry that makes your application’s interactions with message queues easier to reason about…

Queue performance insights

These are some of the metrics we felt would be valuable for identifying queue performance problems:

- Throughput (at what rate are messages being written to the queue vs. being consumed?)
  - Why it matters: If, over a long period, you write more messages to the queue than you can process, this may result in back pressure and eventual publication timeouts, or out-of-control memory growth in the messaging system.
- Processing Latency (how long does the average message take to be processed, and how long is a sample of messages taking to complete?)
  - Why it matters: Jobs taking too long to complete may lead to jobs timing out and being retried, can contribute to poor throughput, and often ultimately result in a bad user experience.
- Error Rate (what’s the error rate of the queue, and does it correlate to specific messages?)
  - Why it matters: Errors tied to specific messages may point to edge cases in your business logic that can’t be handled by consumers. A high error rate may point to underlying stability issues with your application or infrastructure.
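
To make the three metrics concrete, here is a small sketch of how they could be derived from counts collected over one time window. The `WindowStats` type, its field names, and the sample numbers are assumptions made for illustration; they are not how Sentry computes or stores these values.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Counts observed over one time window (hypothetical shape)."""
    published: int                   # messages written by producers
    processed: int                   # processing attempts finished by consumers
    failed: int                      # subset of `processed` that ended in an error
    total_processing_seconds: float  # summed duration of all processing attempts
    window_seconds: float            # length of the window


def summarize(stats):
    """Derive throughput, average latency, and error rate for the window."""
    processed = stats.processed
    return {
        "publish_rate": stats.published / stats.window_seconds,
        "process_rate": processed / stats.window_seconds,
        # Undefined when nothing was processed in the window.
        "avg_processing_latency": (
            stats.total_processing_seconds / processed if processed else None
        ),
        "error_rate": stats.failed / processed if processed else None,
    }


summary = summarize(
    WindowStats(
        published=1200,
        processed=900,
        failed=45,
        total_processing_seconds=1350.0,
        window_seconds=60.0,
    )
)
```

With these sample numbers, the publish rate (20 messages per second) exceeds the process rate (15 per second), which is the sustained imbalance the throughput metric is meant to surface.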

Request for feedback

If you build applications that use messaging systems, we’d love your feedback on this feature.

To get the ball rolling 🏀 …

We’re proposing application-centric monitoring, providing insights into your producers and consumers, not queue infrastructure. This fits with Sentry’s developer-first approach to performance monitoring. However, it has limitations, such as not having an absolute count of messages in the queue.
- Do the metrics we’ve outlined give you what you need to debug your producers and consumers? If not, what are we missing?
- What message queue infrastructure and libraries are people using? For the MVP, we’re aiming to support Celery for automatic instrumentation, along with providing detailed docs covering custom instrumentation.

Mockups

Queue overview page

This is the first page that you’ll see when you click on Queue insights. In this view, you will see a high-level summary of queue performance across all the queues and topics used by your system (depending on which queue system you’re using, a destination may be a topic or a queue name).

Destination summary

This page provides similar metrics to the Queue overview page, but broken down by Destination. The Queue overview page helps you see the behaviour of your queues from 30,000 feet, while the Destination Summary page allows you to better understand specific queue performance issues.

Transaction Overlay

Producer overlay

Consumer overlay

The Transaction Overlay, along with showing transaction level metrics, displays a sample of spans representing messages that were recently either written to the queue, or processed from the queue.

For both consumers and producers, you can dig deeper by clicking the Span ID and viewing the trace tied to the queue operation.

Looking forward to people’s feedback in this discussion,

— @bcoe

derekr · 2024-04-19T22:39:17Z

Exciting! Been tracing manually with amqplib in a Node project. Starting a new project using pg-boss. I'd say one tricky part is that we currently run our workers in the same process as our Express web server, so it can be difficult to reason about what Sentry scope we're working with, or we experience scope bleeding. Sometimes it's almost like we'd want a separate Sentry instance per concern (web vs. queue), but they're still the same project.

1 reply

bcoe Apr 22, 2024
Maintainer Author

Hey @derekr, thanks for the response.

I'd say one tricky part is that we currently run our workers in the same process as our Express web server, so it can be difficult to reason about what Sentry scope we're working with, or we experience scope bleeding.

I'd assume that the workers are creating a transaction that is different from the transactions you see tied to web requests?

Do you think the fact that queue performance monitoring only presents spans tied to queue operations will help?

Baffour · 2024-04-30T16:55:41Z

However, it has limitations, such as not having an absolute count of messages in the queue.

Trying to understand the distinction between this and your mockups. Does this mean we can see published vs processed in any time period, but not total count in the queue at any one time? Is there a meaningful distinction between total and published - processed? Trying to think if that would matter to us, I don't think it does!

Re: infrastructure/libraries. We use Azure Service Bus. Library-wise, we use .NET and Microsoft.Azure.ServiceBus, though I've noticed this is deprecated, so we should probably move to Azure.Messaging.ServiceBus.

We use both Queues and Topics and subscriptions, but the latter a lot more: https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-queues-topics-subscriptions#topics-and-subscriptions

Would you say your area of focus is more about Queues, or is your definition of a Queue loose enough to encapsulate both?

2 replies

bcoe May 16, 2024
Maintainer Author

Hey @Baffour, sorry for the slow response! I've been heads down helping get caches / queues out the door.

Trying to understand the distinction between this and your mockups. Does this mean we can see published vs processed in any time period, but not total count in the queue at any one time? Is there a meaningful distinction between total and published

Correct, you will be able to see published vs. processed, summed over the given time period. In experimentation, I've noticed these numbers are pretty accurate, i.e., you'll see the same number of published and processed jobs.

Where I think it could be suboptimal compared to absolute queue depth is:

- If messages take longer to process than the time window you have selected.
- If messages are failing and being retried by your queue system (in this case it might actually be good that you see more received than published, because it's a smoking gun that something weird is happening).

I feel that published vs. processed is a good starting point for monitoring. I just also feel that knowing the absolute count in the queue would be useful.
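
A short worked example of the distinction, under the simplifying assumption that every message is published once and processed once (no retries): the published and processed counts for a window give the change in queue depth over that window, while the absolute depth also depends on the backlog that existed before the window started. The function names and numbers below are illustrative only.

```python
def depth_change(published, processed):
    """Net change in queue depth over a window, from window counts alone."""
    return published - processed


def depth_at_end(depth_at_start, published, processed):
    """Absolute depth at the end of the window; needs the starting backlog."""
    return depth_at_start + depth_change(published, processed)


# Two queues with identical counts inside the window...
window = {"published": 500, "processed": 480}
change = depth_change(**window)

# ...but very different absolute depths, because of what was queued earlier.
healthy = depth_at_end(0, **window)
backed_up = depth_at_end(10_000, **window)
```

Both queues show the same net change of 20 messages in the window, yet one ends with 20 messages waiting and the other with 10,020. Application-side counts can reveal that a backlog is growing, but not how large it already is.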

Would you say your area of focus is more about Queues, or is your definition of a Queue loose enough to encapsulate both?

We've discussed calling this feature Messaging, rather than Queues, to make it clear that the dashboard should work well for any type of messaging system.

As long as the system is represented well by messages, throughput, and topics, I think we can encapsulate it.

Baffour May 25, 2024

Thanks for the response @bcoe

We've discussed calling this feature Messaging, rather than Queues, to make it clear that the dashboard should work well for any type of messaging system.

That makes sense. I prefer the term Messaging personally, being a bit more generic. Less likely to make assumptions about what it supports!

samuelhwilliams · 2024-07-12T11:41:06Z

Was pleasantly surprised to see this feature just as we started incorporating a tiny number of background tasks via queues into one of our systems. We're using the auto-instrumentation for Celery, but have a slight edge case that I'm trying to work through.

We've got a new Celery worker that will be the consumer, set with a Sentry trace sampling rate of 1 (100%), but our producers (web servers) are sampling at 0.02 (2%), which I think is leading to a mismatch in Sentry's produced vs. consumed stats.

We want to effectively trace all messages/queued tasks, but not trace all HTTP requests fully. Is there a way to do this currently? I suppose I was hoping for some kind of SENTRY_QUEUES_SAMPLE_RATE equivalent of the SENTRY_TRACES_SAMPLE_RATE, but I don't think that exists yet?

We're dealing with really low volumes and it looks like Sentry is on the cusp of being able to give us "good enough" queue monitoring without having to do any hard work on bespoke monitoring stuff. But that's maybe only true if we can track producing+consuming to the same level (ideally 100% for our volumes). It would also be extremely valuable to us if we could configure alerts should messages get stuck in the queue (or effectively, if production exceeds consumption over some time period). I don't think I can see anything in Sentry that would let us set up this kind of alert yet? Would love to be told I'm wrong though.

1 reply

aberres Aug 16, 2024

@samuelhwilliams Did you solve this?

I am right now looking into this feature, and it would be a show stopper. It should be possible to force a sampling decision, shouldn't it? https://docs.sentry.io/platforms/python/configuration/sampling/#forcing-a-sampling-decision
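
One possible approach, sketched here under several assumptions, is to replace the fixed sample rate with the Python SDK's `traces_sampler` callback and return 1.0 for the transactions that touch the queue. The path list and default rate are hypothetical, and the `wsgi_environ` and `celery_job` keys are what the WSGI and Celery integrations are expected to place in the sampling context; check which keys your own integrations provide. Note that the decision is made when the transaction starts, so the sampler needs to know up front which endpoints enqueue tasks.

```python
# Hypothetical: URL prefixes in the web app whose handlers enqueue Celery tasks.
QUEUE_PRODUCING_PATHS = ("/reports/export", "/emails/send")

# Rate for all other web requests (matches the 2% mentioned above).
DEFAULT_RATE = 0.02


def traces_sampler(sampling_context):
    """Return a sample rate between 0.0 and 1.0 for the transaction being started."""
    # Follow an upstream decision if there is one, so that a consumer
    # continuing a producer's trace stays in step with that producer.
    parent_sampled = sampling_context.get("parent_sampled")
    if parent_sampled is not None:
        return 1.0 if parent_sampled else 0.0

    # Assumption: the Celery integration adds "celery_job" for task transactions.
    if "celery_job" in sampling_context:
        return 1.0

    # Assumption: WSGI integrations add the request environ as "wsgi_environ".
    environ = sampling_context.get("wsgi_environ") or {}
    path = environ.get("PATH_INFO", "")
    if path.startswith(QUEUE_PRODUCING_PATHS):
        return 1.0

    return DEFAULT_RATE
```

The function would be passed as `sentry_sdk.init(traces_sampler=traces_sampler)` in place of `traces_sample_rate`, in both the web servers and the worker. Whether this fully aligns the produced and consumed counts has not been verified in this thread.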
