
[BUG]: 5.41.1 caused a substantial increase in number of increment counters. #5405

Closed
howinator opened this issue Mar 12, 2025 · 14 comments · Fixed by #5491
Labels
bug Something isn't working

Comments

@howinator

howinator commented Mar 12, 2025

Tracer Version(s)

5.41.1

Node.js Version(s)

22.14

Bug Report

Upgrading to 5.41.1 broke a number of our metrics. We're having a hard time tracking down exactly what happened, but something in 5.41.1 definitely caused some of our counter metrics to increase by an order of magnitude.

In the chart below, we deployed a PR where the only substantive change was upgrading from 5.40.0 to 5.41.1.

We're currently downgrading the version to confirm this is the cause.

Perhaps related, our count metrics in Datadog are now type=gauge.

[Image: chart of the counter metric increase after the deploy]

Reproduction Code

This is our code to increment a metric.

import ddTracer from 'dd-trace'; // assumed import of the dd-trace tracer, which exposes the dogstatsd client

// MetricTag, MetricNamespace, MetricName, and getStat are application-specific helpers.
export type MetricTags = Partial<Record<MetricTag, string | number>>;

export function increment(
  namespace: MetricNamespace,
  metric: MetricName,
  amount: number,
  tags?: MetricTags, // Accept string | undefined to make it easier for the client
) {
  ddTracer.dogstatsd.increment(getStat(namespace, metric), amount, tags);
}
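
For reference, a typical call site for this wrapper looks roughly like the sketch below; the namespace, metric name, and tag keys are placeholders rather than values from our codebase.

// Hypothetical usage of the increment() wrapper above.
// 'orders' / 'created' stand in for real MetricNamespace / MetricName values.
increment('orders', 'created', 1, {
  env: 'production',
  region: 'us-east-1',
});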

Error Logs

No response

Tracer Config

No response

Operating System

No response

Bundling

Unsure

@howinator added the bug label on Mar 12, 2025
@howinator changed the title from "[BUG]:" to "[BUG]: 5.41.1 caused a substantial increase in number of increment counters." on Mar 12, 2025
@howinator
Author

I'll let you all debug, but #5347 looks like it may be the cause of the behavior we're seeing.

@alexbidule

The same issue was observed in our application. Rolling back to 5.41.0 fixed it, so there must be something wrong in 5.41.1.

@MMShep97

Two previous issues are related:

#5389
#5394

@alexbidule

Looks like this was partially fixed in 5.42.0, but it is still happening.

We decided to roll back to 5.40.0, which seems to be the most stable version so far.

@dkindt

dkindt commented Mar 18, 2025

We rolled back to 5.39.0 and have not seen the memory issues.

@rochdev
Member

rochdev commented Mar 19, 2025

This should now be fixed in 5.43.0

@alexbidule

alexbidule commented Mar 20, 2025

> This should now be fixed in 5.43.0

I'm afraid it is not fixed. Here is a screenshot of the app's monitoring while it is not receiving any requests.

[Image: monitoring screenshot of the idle app]

@kroney

kroney commented Mar 21, 2025

It also causes an insane increase in outbound traffic.
[Image: outbound traffic chart]

I can confirm it is NOT fixed in 5.43.0. 5.41.0 works perfectly fine; everything higher is broken.

@RyanBarclay

+1 to this. We encountered this too. All metrics sent with tracer.dogstatsd.increment(processedMetric.metricName, processedMetric.amount, processedMetric.tags); are created and updated as the Gauge type. When reverting back to 5.41.0 it works as expected and the type goes back to Rate.
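
A minimal sketch of the kind of call in question, assuming a standard dd-trace setup; the metric name and tags are placeholders.

import tracer from 'dd-trace';

tracer.init(); // agent host/port assumed to come from the environment

// On the affected versions reported in this thread, a tagged counter like this
// shows up in Datadog as a Gauge; on 5.41.0 it is reported as a Rate, as expected.
tracer.dogstatsd.increment('example.orders.processed', 1, {
  env: 'production',
  service: 'orders-api',
});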

@rochdev
Member

rochdev commented Mar 25, 2025

> It also causes an insane increase in outbound traffic.

That is actually very good information; it might mean that the individual data points are stuck in memory even after flushing. I'll look into this further.

> +1 to this. We encountered this too. All metrics sent with tracer.dogstatsd.increment(processedMetric.metricName, processedMetric.amount, processedMetric.tags); are created and updated as the Gauge type. When reverting back to 5.41.0 it works as expected and the type goes back to Rate.

@RyanBarclay Was this with 5.43.0? This was definitely a regression from 5.41.1, but it should have been fixed in 5.43.0. I just want to confirm it's still an issue, to avoid looking into it if it's already fixed.

@rochdev
Member

rochdev commented Mar 26, 2025

@RyanBarclay Do you still get the issue with 5.44.0?

@rochdev
Member

rochdev commented Mar 27, 2025

It turned out the issue occurs when using multiple tags; we'll release a fix soon.
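
To illustrate, a sketch of the multi-tag case being described; the metric names and tags are placeholders.

import tracer from 'dd-trace';

tracer.init();

// A counter with a single tag.
tracer.dogstatsd.increment('example.jobs.completed', 1, { env: 'staging' });

// A counter with multiple tags -- the case identified above as triggering the regression.
tracer.dogstatsd.increment('example.jobs.completed', 1, {
  env: 'staging',
  queue: 'emails',
  region: 'us-east-1',
});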

@RyanBarclay

> @RyanBarclay Was this with 5.43.0? This was definitely a regression from 5.41.1, but it should have been fixed in 5.43.0. I just want to confirm it's still an issue, to avoid looking into it if it's already fixed.

Hey @rochdev, sorry, I didn't see this reply. Since we use this pretty broadly, the risk of bumping the version rather than just staying on 5.41.0 isn't worth it at the moment. Once this seems a bit more stable, likely after 1-2 weeks of no one else reporting this, we will look at bumping the version.

@rochdev
Member

rochdev commented Mar 27, 2025

> Hey @rochdev, sorry, I didn't see this reply. Since we use this pretty broadly, the risk of bumping the version rather than just staying on 5.41.0 isn't worth it at the moment. Once this seems a bit more stable, likely after 1-2 weeks of no one else reporting this, we will look at bumping the version.

@RyanBarclay Fair enough. We have just released 5.45.0, which revamps tag handling to better support custom metrics. The tag handling was originally meant only for runtime metrics and was later adapted for custom metrics, but that adaptation was based on incorrect assumptions. The new implementation supports tags properly regardless of the source of the metric, which should not only fix the current issue but also prevent regressions moving forward. Whenever you (or anyone else) are able to test the new version, please let us know here whether everything is good on your end.
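
For anyone verifying the fix, a quick sanity check after upgrading is to submit a counter with several tags and confirm its type in the Datadog metric summary is no longer Gauge; the metric name and tags below are illustrative.

import tracer from 'dd-trace';

tracer.init();

// With dd-trace 5.45.0 installed, a multi-tag counter should again be reported
// with counter semantics (Rate in Datadog) rather than as a Gauge.
tracer.dogstatsd.increment('example.upgrade.check', 1, {
  env: 'staging',
  team: 'platform',
});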
