Describe the bug
We have a graphite cluster with approximately 130 million metrics spread across 3 storage nodes running graphite-web and 16 instances of carbon-cache. For a small proportion of our metrics (I'm unsure about the exact percentage, but from my experience it seems like <1%), carbon-cache always returns 0 data points. We know for a fact that carbon-cache is receiving the metrics in question though, because the whisper file is regularly updated. The way this problem manifests is that for certain metrics, the latest data point can be quite far away from "now", periodically "catching up" when carbon-cache flushes its data points to the whisper file.
I have verified with tcpdump that carbon-c-relay is sending the metrics to the same carbon-cache instance that graphite-web tries to query for the cached data points.
I suspect this has something to do with hashing, since the same metric name exhibits the same problem in both clusters that receive the metric independently, while metrics one character off (e.g. the same metric from host124 instead of host123) do not. At least 3 instances of the same metric on different hosts exhibit the problem (out of around 500 hosts reporting the metric), which gives me further cause to believe the issue lies in carbon rather than somewhere else.
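To illustrate the hashing theory, here is a minimal sketch of a consistent hash ring of the kind carbon and graphite-web use to pick a carbon-cache instance for a metric name. It is a simplified illustration, not the actual carbon code: the class name, the replica count, and the instance labels are assumptions made for the example. It only shows that names differing by one character land on unrelated ring positions, so a problem tied to specific positions would affect host123 but not host124.

```python
import bisect
import hashlib


def ring_position(key):
    """Map a string to a 16-bit ring position (first 4 hex digits of its MD5)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:4], 16)


class HashRing(object):
    """Simplified consistent hash ring (hypothetical, for illustration only)."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring `replicas` times to spread the load.
        self.ring = []
        for node in nodes:
            for i in range(replicas):
                self.ring.append((ring_position("%s:%d" % (node, i)), node))
        self.ring.sort()

    def get_node(self, key):
        # Pick the first ring entry at or after the key's position, wrapping around.
        pos = ring_position(key)
        idx = bisect.bisect_left(self.ring, (pos,)) % len(self.ring)
        return self.ring[idx][1]


# 16 hypothetical carbon-cache instances on one storage node.
instances = ["127.0.0.1:%s" % name for name in "abcdefghijklmnop"]
ring = HashRing(instances)

for host in ("host123", "host124", "host125"):
    metric = "statsd.%s-primary.some_metric.count" % host
    print("%s -> position %d, instance %s"
          % (metric, ring_position(metric), ring.get_node(metric)))
```

If the relay and graphite-web ever disagreed on the position or on the tie-breaking for a given name, they would pick different instances for that name only, which is why I looked at the routing first.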
The logs we're seeing:
76502:2022-11-08,02:00:11.291 :: CarbonLink sending request for statsd.host123-primary.some_metric.count to ('127.0.0.1', '9')
76503:2022-11-08,02:00:11.292 :: CarbonLink finished receiving statsd.host123-primary.some_metric.count from ('127.0.0.1', '9')
76504:2022-11-08,02:00:11.292 :: CarbonLink cache-query request for statsd.host123-primary.some_metric.count returned 0 datapoints
Metric of same name from other host (path differing by one character):
Note that I have changed the exact metric name to not expose our metric publicly, but I'm happy to provide the metric name privately for debugging (assuming the issue may be related to the exact metric name).
To Reproduce
We are able to replicate the behavior by querying a metric of the same name for different hosts, e.g. statsd.host123-primary.some_metric.count. The metric returns data, but is lagging far behind. The graphite-web cache logs show that the metric always returns 0 data points. Querying for the same metric where the path differs by one character, e.g. statsd.host124-primary.some_metric.count or statsd.host125-primary.some_metric.count, does not exhibit the problem.
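To take graphite-web out of the picture, the cache can also be queried directly. The sketch below assumes the CarbonLink wire format as I understand it: a 4-byte big-endian length prefix followed by a pickled dict with type "cache-query", and a response framed the same way containing a "datapoints" list. The function names are made up for this example, and the pickle protocol (2) is an assumption.

```python
import pickle
import socket
import struct


def build_cache_query(metric):
    """Frame a cache-query request: 4-byte big-endian length, then a pickled dict."""
    payload = pickle.dumps({"type": "cache-query", "metric": metric}, protocol=2)
    return struct.pack("!L", len(payload)) + payload


def recv_exactly(sock, n):
    """Read exactly n bytes from the socket or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("connection closed after %d of %d bytes" % (len(buf), n))
        buf += chunk
    return buf


def query_cache(host, port, metric, timeout=5.0):
    """Send one cache-query to a carbon-cache instance and return the decoded reply."""
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        sock.sendall(build_cache_query(metric))
        (length,) = struct.unpack("!L", recv_exactly(sock, 4))
        # Only unpickle replies from a carbon-cache instance you trust.
        return pickle.loads(recv_exactly(sock, length))
    finally:
        sock.close()
```

For example, query_cache("127.0.0.1", 7002, "statsd.host123-primary.some_metric.count") would query an instance listening on the default CACHE_QUERY_PORT; substitute the port of the instance the metric is routed to. An empty "datapoints" list for host123 alongside a non-empty one for host124 would confirm the problem is inside carbon-cache rather than in graphite-web.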
Expected behavior
carbon-cache should return data points for the metrics in question, given that it appears to have the data in memory (evidenced by the fact that it's writing data to disk at regular intervals).
Environment (please complete the following information):
- OS flavor: Ubuntu Focal
- Graphite version: Tested on 1.1.5 and 1.1.10 (both exhibit the same problem)
- Setup type: pip/git
- Python version: 2.7.17