Sentry Developer Metrics - What should that look like? 💡 #58584
Replies: 9 comments 19 replies
-
Lines of code, thanks.
-
There are a few global features I tend to want in metrics, for both front- and back-end:
Drilling down, I think there are two major contexts: (1) something should not be changing, most importantly when it is changing anyway; and (2) something should be changing and I'm observing the overall impacts.

(1) is the context of operations and incident response. Things that matter are the ability to monitor and alert in order to surface anomalies, the ability to tie context like metrics/spans/traces/logs/source code/releases/feature flag state together, the ability to identify patterns in errors (one of my favorite of Sentry's capabilities!), and the ability to exclude extraneous information. If something like overall response time—which I would prefer to measure both from the client and the server, but often have to settle for server—is going up, my first priority is determining what's common about the contributing measurements, to assess impact/severity, and then to identify areas to focus on for remediation.

(2) is the context of deployment and release, especially with CI/CD and feature flags. When I deploy or adjust a dial to control a release, I want to be able to see the impacts of that. What percentage of traffic is going down various branches? How is latency responding for each branch? What error/log patterns are new?
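The release questions at the end could be answered by slicing request metrics on a feature-flag branch tag. A minimal sketch of that aggregation, with an invented data shape (nothing here is a real Sentry API):

```javascript
// Sketch: per-branch traffic share and median latency for a flag rollout.
// The request records and field names are hypothetical, for illustration.
function summarizeByBranch(requests) {
  const branches = new Map();
  for (const { branch, durationMs } of requests) {
    if (!branches.has(branch)) branches.set(branch, []);
    branches.get(branch).push(durationMs);
  }
  const total = requests.length;
  const summary = {};
  for (const [branch, durations] of branches) {
    durations.sort((a, b) => a - b);
    summary[branch] = {
      trafficShare: durations.length / total,          // % of traffic down this branch
      p50Ms: durations[Math.floor(durations.length * 0.5)], // median latency
    };
  }
  return summary;
}

// Example: a canary branch receiving a quarter of traffic, with worse latency.
const summary = summarizeByBranch([
  { branch: "control", durationMs: 100 },
  { branch: "control", durationMs: 110 },
  { branch: "control", durationMs: 120 },
  { branch: "canary", durationMs: 400 },
]);
```

The point of the sketch is the grouping key: if every metric point carries the flag state as a tag, "how is each branch doing?" becomes a simple group-by rather than a separate dashboard.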
-
I work with a few relatively lightweight monoliths (backend-heavy, just dumping HTML or JSON; no heavyweight clients or distributed parts in motion) that usually run within 10s of RPS, and I'm very much at the start of my adventure with metrics at any meaningful scale, so most of this probably isn't useful. Buuut, if you wanna read this...

When I have metrics, it's usually Grafana + Prometheus aimed at node_exporter for the VMs (CPU/RAM/IO/disk space/etc.), some generic framework-specific plugins (request/application server metrics), and one more custom endpoint that gets polled frequently and reports some critical "business" metrics (most often object counts across different dimensions). I don't think Sentry should do all that, unless you basically reinvent the two solutions above, and frankly there doesn't seem to be a lot of value in that.

When I'm writing a new endpoint or refactoring something existing, sometimes I look at a query or an API client call and I'm like "hmm, this looks sus, how long does that take anyway?" Sentry has some integrated spans and whatnot, but it'd be great if I could just point at a line of code (ideally, directly in the editor) and ask what metrics you can tell me about it. More often than not there's a bunch of operations I wish I had added spans around to measure, but I noticed only after they became issues. I can't go back in time to figure that out, but Sentry has sampled profiles, so I could ask "hey, how long does it take on average for this function to run, and under what conditions, based on past profiles?" Let's say p95 is 0.65ms when it's called from ViewUser (okay), 0.14ms from ViewGroup (that's fine), 792ms from ViewDocuments (oh dear). I can do this today, but only if I instrument interesting things ahead of time, or wade through sampled profiles manually.
The views mentioned above are pretty obvious, but less obvious would be "extreme users": your average user has around 100 documents, but one power user comes around with hundreds of thousands, and your massive Tsar Bomba query with 10 JOINs inside brings the poor database to its knees. It'd be cool if something could tell me there's something going on here and suggest adding some sort of wrapper that could collect more specific context (sampled arguments? sizes of input/output collections?). When a crash occurs, a lot of this sort of stuff gets collected, but not when things just get unusually slow.

Also, integrating metrics into issues. If an issue occurs under a specific view, can you point me at recent profiles and collected metrics to see how it behaves for successful/failed cases? In breadcrumbs, how long do queries take, and how much data did they return? Can I go from breadcrumb metrics into a graph of whatever-you-can-show-me? Can Sentry itself bump the sampling rate on a given code path for more detailed data collection if a completely new issue starts rarely occurring, without my intervention? I usually don't know what I'm looking for, just that there are hints in whatever data is collected around here. A lot of this isn't very organized and probably doesn't belong in a "metrics" solution, but that's what I'm thinking about here.
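The "ask the past profiles" idea in the previous comment could be approximated offline by grouping sampled durations by caller. A minimal sketch with made-up data (the sample shape is invented, not a Sentry profile format):

```javascript
// Sketch: estimate per-caller p95 duration of one function from sampled
// profile data. The input records are hypothetical, for illustration only.
function p95ByCaller(samples) {
  const byCaller = new Map();
  for (const { caller, durationMs } of samples) {
    if (!byCaller.has(caller)) byCaller.set(caller, []);
    byCaller.get(caller).push(durationMs);
  }
  const result = {};
  for (const [caller, durations] of byCaller) {
    durations.sort((a, b) => a - b);
    // Nearest-rank p95 index, clamped to the last element.
    const idx = Math.min(durations.length - 1, Math.ceil(durations.length * 0.95) - 1);
    result[caller] = durations[idx];
  }
  return result;
}

// The slow ViewDocuments path stands out immediately in the grouped view.
const stats = p95ByCaller([
  { caller: "ViewUser", durationMs: 0.5 },
  { caller: "ViewUser", durationMs: 0.65 },
  { caller: "ViewDocuments", durationMs: 792 },
]);
```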
-
I would like to track the usage of external integrations like CloudConvert etc. I use Laravel for all my applications, so being able to bump a number in Sentry would be really cool.
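"Bumping a number" is essentially a counter metric. A tiny sketch of the shape such an API could take (shown in JavaScript for illustration, though this commenter uses Laravel/PHP; nothing here is a real Sentry SDK call):

```javascript
// Sketch: count external-integration calls (e.g. CloudConvert jobs) with a
// hypothetical counter client. A real SDK would buffer and ship these.
class MetricCounters {
  constructor() {
    this.counts = new Map();
  }
  increment(name, by = 1) {
    this.counts.set(name, (this.counts.get(name) ?? 0) + by);
  }
  flush() {
    // In a real integration this would send the counts to the backend.
    return Object.fromEntries(this.counts);
  }
}

const metrics = new MetricCounters();
metrics.increment("cloudconvert.jobs"); // one bump per conversion started
metrics.increment("cloudconvert.jobs");
const flushed = metrics.flush();
```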
-
One thing to keep in mind is: which metrics do we need, and how do we use them properly and usefully?
That brings us to some criteria:
But there are risks:
So, what metrics would be useful?
-
Having watched the preview of this feature in the launch week, I think it would be very useful to allow users to define alerts/notifications for when something happens to a metric. For example, if I have a numeric metric that I would not want to drop below a certain value, but it does, I would like Sentry to notify me as it does with exceptions. It would be even better if Sentry could detect changes in a metric's usual pattern and alert you about it, like AWS CloudWatch does.
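The two checks described above (a hard threshold and a deviation from the usual pattern) can be sketched over a metric series. This is hypothetical logic, not a Sentry feature; the z-score cutoff is an arbitrary stand-in for real anomaly detection:

```javascript
// Sketch: evaluate the latest point of a metric series against a minimum
// threshold and a crude z-score anomaly check. Names are hypothetical.
function checkMetric(series, { min, zThreshold = 3 }) {
  const latest = series[series.length - 1];
  const alerts = [];
  if (min !== undefined && latest < min) {
    alerts.push(`value ${latest} dropped below minimum ${min}`);
  }
  const mean = series.reduce((a, b) => a + b, 0) / series.length;
  const std = Math.sqrt(
    series.reduce((a, b) => a + (b - mean) ** 2, 0) / series.length
  );
  if (std > 0 && Math.abs(latest - mean) / std > zThreshold) {
    alerts.push(`value ${latest} deviates from the usual pattern`);
  }
  return alerts;
}

// A metric that suddenly drops below its configured floor fires the alert.
const alerts = checkMetric([50, 52, 49, 51, 10], { min: 40 });
```

A real pattern-change detector (as in CloudWatch anomaly detection) would model seasonality rather than use a single z-score, but the alert plumbing would look similar.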
-
In addition to what you've described here (metrics closely correlated w/ other event details to be able to holistically diagnose what's happening), something I'd be very interested in is just simple metrics that we can send from ~100% of requests. For example, it would be very nice to get an accurate idea of throughput and transaction durations, and I'd love to send that data from 100% of requests while still sampling at just a couple percent for full transactions or profile data. It's not currently possible to do that in any form in the product, is it?
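The split being asked for above can be sketched as: record a cheap duration metric on every request, and only promote a small sampled fraction to full transactions. Everything here is hypothetical plumbing, not current SDK behavior:

```javascript
// Sketch: lightweight metrics from 100% of requests, full traces from ~2%.
// recordMetric/startTransaction are hypothetical callbacks, not Sentry APIs.
const TRACES_SAMPLE_RATE = 0.02;

function handleRequest(recordMetric, startTransaction, rng) {
  const start = Date.now();
  const traced = rng() < TRACES_SAMPLE_RATE; // only a few get full traces
  if (traced) startTransaction();
  // ... actual request handling would happen here ...
  recordMetric("request.duration_ms", Date.now() - start); // every request
  return traced;
}

// Deterministic "random" source so the sampling outcome is repeatable here.
let tick = 0;
const rng = () => (tick++ % 100) / 100;

let metricCount = 0;
let traceCount = 0;
for (let n = 0; n < 1000; n++) {
  handleRequest(() => metricCount++, () => traceCount++, rng);
}
// All 1000 requests produce a metric; only ~2% produce a transaction.
```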
-
I'd also like to be able to fully work with my own metrics: conveniently send them to Sentry and work with them as Sentry metrics. Something like:
Sentry.sendMetrics({
  customMetric1: {
    value: 1,
    unit: "second"
  },
  // ...more metrics
});
-
I've been looking at implementing DDM as an experimental feature in the .NET SDK. At first glance, this looks really similar to OpenTelemetry Metrics. Has any evaluation already been done on whether it's better to build this out as a proprietary Sentry thing or to build on top of OpenTelemetry Metrics? I did some (very preliminary) head scratching about how OpenTelemetry might be used to implement this in the .NET SDK (see the description on getsentry/sentry-dotnet#2880) but didn't want to burn too many calories on this if it was already evaluated as part of the pilot implementation for Python or the PHP/JavaScript work that's already underway.
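For readers unfamiliar with it, the OpenTelemetry Metrics API being compared here follows a getMeter → createCounter → add shape. The shim below mimics that surface without depending on the real OTel packages, purely to illustrate what a metrics layer could map onto (the real API lives in @opentelemetry/api and differs in detail):

```javascript
// Sketch: a stand-alone shim of the OpenTelemetry-style Meter/Counter shape.
// This is not the real OTel implementation; it only illustrates the API
// surface a Sentry metrics feature might build on top of.
class Counter {
  constructor(name) {
    this.name = name;
    this.points = [];
  }
  add(value, attributes = {}) {
    this.points.push({ value, attributes });
  }
}

class Meter {
  constructor(name) {
    this.name = name;
    this.instruments = new Map();
  }
  createCounter(name) {
    const counter = new Counter(name);
    this.instruments.set(name, counter);
    return counter;
  }
}

// Usage mirrors the OTel pattern: obtain a meter, create an instrument,
// record measurements with attributes (tags).
const meter = new Meter("sentry-dotnet-experiment");
const requests = meter.createCounter("http.server.requests");
requests.add(1, { route: "/users" });
requests.add(1, { route: "/users" });
const total = requests.points.reduce((sum, p) => sum + p.value, 0);
```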
-
Hi, everyone!
I’m Alexandra from the product team here at Sentry, and I’m very excited to say that we are building an application metrics solution for developers. There are many metrics tools out there, but they are often not created for developers or with their specific needs in mind, and we have some cool ideas to build something better.
Here’s the core premise: generic metrics are not ideal for developers, and context/correlation is key.
Every day at Sentry, we look at metrics for our own use, but sometimes they leave us hanging. We need context and connected signals, like traces, spans, or releases, to find the root cause and solve a problem. We also found that this context is needed throughout the stack, from the backend to the frontend to a mobile app. Today, we use many different tools and jump between them to find the issue.
We believe this to be an issue you, our users, are facing as well.
This is how we imagine it to work:
It could look like this completely made-up mockup:
Are we on the right track?
As we explore this new venture, your input is invaluable. We would love to understand better what you’re currently using, what challenges you face, and how Sentry could help you solve them.
Here are a few questions to get us started:
Thanks for taking the time to contribute. We’re very excited to hear your ideas!
P.S. We’re aiming to start an early access phase soon, so stay tuned for more news to come 🚀
This is also a living document. Subscribe for updates as we continue our work.