[BUG] carbon-cache always returns 0 data points for cached metric, but writes the metric to disk regularly #940

@percygrunwald

Describe the bug
We have a Graphite cluster with approximately 130 million metrics spread across 3 storage nodes running graphite-web and 16 instances of carbon-cache. For a small proportion of our metrics (I'm unsure of the exact percentage, but in my experience it seems to be <1%), carbon-cache always returns 0 data points. We know for a fact that carbon-cache is receiving the metrics in question, though, because the whisper file is regularly updated. The problem manifests as follows: for certain metrics, the latest data point can lag far behind "now", periodically "catching up" when carbon-cache flushes its data points to the whisper file.
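For anyone who wants to quantify the lag, one option is to compare the newest non-null data point in a render response against the current time. This is a minimal sketch, assuming the `format=json` render output where each data point is a `[value, timestamp]` pair; the function name and the sample series are hypothetical and not taken from our cluster.

```python
import time


def latest_datapoint_age(datapoints, now=None):
    """Return the seconds between `now` and the newest non-null data point.

    `datapoints` is a list of [value, timestamp] pairs, as found in the
    "datapoints" field of a render response requested with format=json.
    Returns None when every value is null.
    """
    now = time.time() if now is None else now
    timestamps = [ts for value, ts in datapoints if value is not None]
    if not timestamps:
        return None
    return now - max(timestamps)


# Hypothetical series: the two most recent 60-second intervals are still null.
sample = [[1.0, 1000], [2.0, 1060], [None, 1120], [None, 1180]]
print(latest_datapoint_age(sample, now=1200))  # 140
```

A healthy metric should show an age close to its storage interval, while an affected metric should show an age that grows until the next flush to disk.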

I have verified with tcpdump that carbon-c-relay is sending the metrics to the same carbon-cache instance that graphite-web tries to query for the cached data points.

I suspect this has something to do with hashing: the same metric name exhibits the problem in both clusters that receive the metric independently, while metrics whose names are one character off (e.g. the same metric from host124 instead of host123) do not. We have at least 3 instances of the same metric on different hosts exhibiting the problem (out of around 500 hosts reporting the metric), which gives me further cause to believe the issue lies in carbon rather than somewhere else.
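To illustrate why names that differ by one character can be treated differently by anything hash-based, here is a simplified sketch of md5-based consistent hashing. It is written in the style of carbon's default ring as I understand it, but it is not carbon's actual code and not what carbon-c-relay runs; the node names and the replica count are hypothetical.

```python
import bisect
from hashlib import md5


def ring_position(key):
    # Use the first 4 hex digits of the md5 digest, giving a value in 0..65535.
    return int(md5(key.encode("utf-8")).hexdigest()[:4], 16)


def build_ring(nodes, replicas=100):
    # Place each node on the ring `replicas` times to even out the distribution.
    ring = []
    for node in nodes:
        for i in range(replicas):
            ring.append((ring_position("%s:%d" % (node, i)), node))
    ring.sort()
    return ring


def node_for(ring, metric):
    # Pick the first ring entry at or after the metric's position, wrapping around.
    index = bisect.bisect_left(ring, (ring_position(metric), "")) % len(ring)
    return ring[index][1]


# Hypothetical instance names, for illustration only.
ring = build_ring(["cache-a", "cache-b", "cache-c"])
for host in ("host123", "host124", "host125"):
    name = "statsd.%s-primary.some_metric.count" % host
    print(name, ring_position(name), node_for(ring, name))
```

Each name gets its own ring position, so a problem tied to one exact metric name would be consistent with something keyed on a hash of that name.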

The logs we're seeing:

76502:2022-11-08,02:00:11.291 :: CarbonLink sending request for statsd.host123-primary.some_metric.count to ('127.0.0.1', '9')
76503:2022-11-08,02:00:11.292 :: CarbonLink finished receiving statsd.host123-primary.some_metric.count from ('127.0.0.1', '9')
76504:2022-11-08,02:00:11.292 :: CarbonLink cache-query request for statsd.host123-primary.some_metric.count returned 0 datapoints

Metric exhibiting problem:
image

Metric of same name from other host (path differing by one character):
image

Note that I have changed the exact metric name to not expose our metric publicly, but I'm happy to provide the metric name privately for debugging (assuming the issue may be related to the exact metric name).

To Reproduce
We are able to replicate the behavior by querying a metric of the same name for different hosts, e.g. statsd.host123-primary.some_metric.count. The metric returns data, but the data lags far behind. The graphite-web cache logs show that the cache query for the metric always returns 0 data points. Querying for the same metric where the path differs by one character (e.g. statsd.host124-primary.some_metric.count or statsd.host125-primary.some_metric.count) does not exhibit the problem.
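To take graphite-web out of the picture, the cache query can also be sent to a carbon-cache instance directly. This is a minimal sketch, assuming the CarbonLink framing that graphite-web uses as I understand it: a 4-byte big-endian length prefix followed by a pickled request dict, with the response framed the same way. The helper names are hypothetical, and the host and port in the usage note are placeholders for whatever the instance's cache query port is.

```python
import pickle
import socket
import struct


def encode_request(metric, protocol=2):
    # Pickle protocol 2 is used so that a Python 2 carbon-cache can read it.
    payload = pickle.dumps({"type": "cache-query", "metric": metric}, protocol=protocol)
    return struct.pack("!L", len(payload)) + payload


def recv_exactly(conn, size):
    # Keep reading until exactly `size` bytes have arrived.
    buf = b""
    while len(buf) < size:
        chunk = conn.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("connection closed mid-response")
        buf += chunk
    return buf


def cache_query(host, port, metric, timeout=5.0):
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_request(metric))
        (size,) = struct.unpack("!L", recv_exactly(conn, 4))
        # Only unpickle responses from a carbon-cache instance you trust.
        return pickle.loads(recv_exactly(conn, size))
```

Calling something like `cache_query("127.0.0.1", 7002, "statsd.host123-primary.some_metric.count")` for an affected name and for a neighbouring name such as the host124 one should show whether the instance itself returns an empty `datapoints` list.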

Expected behavior
carbon-cache should return data points for the metrics in question, given that it appears to have the data in memory (evidenced by the fact that it's writing data to disk at regular intervals).

Environment (please complete the following information):

  • OS flavor: Ubuntu Focal
  • Graphite version: Tested on 1.1.5 and 1.1.10 (both exhibit the same problem)
  • Setup type: pip/git
  • Python version: 2.7.17
