[Receive] New Tenant is not queryable by receiver #7892

Open
jnyi opened this issue Nov 7, 2024 · 4 comments
jnyi commented Nov 7, 2024

We are testing the latest Thanos main branch and found a regression that was not present in v0.36.

In a running Thanos receiver cluster, we started sending data for a new tenant called "eng-host-networking". The TSDB head metrics for that tenant started to appear, but none of the tenant's metrics were queryable until we restarted the receiver cluster.

Screenshot 2024-11-07 at 10 14 01 AM Screenshot 2024-11-07 at 10 14 09 AM

How to repro:

  1. Start a receiver cluster using the latest main.
  2. Send remote write requests for a new tenant.
  3. Verify the data is received using prometheus_tsdb_head_series{tenant="<new tenant>"}.
  4. Go to the endpoints page: the new tenant is not listed, and metrics from that tenant cannot be queried.
  5. Restart the receiver cluster: the data shows up (no data loss).
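
For anyone scripting step 3, here is a minimal Go sketch that builds the selector and an instant-query URL for it. It assumes the receiver's own metrics are scraped into something that serves the Prometheus HTTP API at `/api/v1/query`. The base URL `http://localhost:9090` and the helper names `headSeriesQuery` and `instantQueryURL` are made up for illustration and do not come from the issue.

```go
package main

import (
	"fmt"
	"net/url"
)

// headSeriesQuery builds the PromQL selector from step 3 for one tenant.
// Hypothetical helper, not part of Thanos.
func headSeriesQuery(tenant string) string {
	// %q wraps the tenant name in double quotes and escapes it.
	return fmt.Sprintf("prometheus_tsdb_head_series{tenant=%q}", tenant)
}

// instantQueryURL returns the instant-query URL for a Prometheus-compatible
// HTTP API rooted at base. The base address is an assumption of this sketch.
func instantQueryURL(base, promQL string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/api/v1/query"
	q := url.Values{}
	q.Set("query", promQL)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	q := headSeriesQuery("eng-host-networking")
	u, err := instantQueryURL("http://localhost:9090", q)
	if err != nil {
		panic(err)
	}
	fmt.Println(q)
	fmt.Println(u)
}
```

Running the same query before and after the restart in step 5 should show whether the head series existed while the tenant was still missing from the query path.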

Thanos, Prometheus and Golang version used:
Thanos: v0.37.0-dev
Golang: v1.23


dosubot bot commented Nov 7, 2024

I found a similar open issue that might be related to your problem: Thanos query don't query all metrics on a receiver. In that issue, the solution involved configuring the querier to query all pods in the receiver ring by using a headless service. This was done by specifying each pod's endpoint in the configuration, rather than using the service endpoint. Additionally, they adjusted the hashring.json configuration to include all receiver endpoints. This change ensured that all metrics were consistently queried, although they noted that scaling the receiver ring required changes in multiple places.

For your specific issue, it seems like the new tenant is not being dynamically registered for querying without a restart. The Thanos receiver component should dynamically detect and register new tenants without requiring a restart by provisioning an independent TSDB for each new tenant detected via the tenant HTTP header [1]. However, if this is not happening, it might be a regression or a configuration issue in the latest main branch.

You might want to check if the configuration flags related to tenant registration and queryability are correctly set, such as --receive.tenant-header and any relevant querier flags [2][1]. If the issue persists, it could be beneficial to report this as a potential regression in the main branch, providing the details and steps to reproduce as you've outlined.
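
As a rough illustration of how a tenant is identified on the write path, the Go sketch below builds a remote write request that carries the tenant header. It assumes the receiver runs with the default `--receive.tenant-header` value (`THANOS-TENANT`) and the default remote write endpoint `/api/v1/receive`. The address `localhost:19291` and the helper name `newRemoteWriteRequest` are illustrative assumptions. The body is left empty here; a real request needs a snappy-compressed protobuf write request.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// defaultTenantHeader is assumed to match the receiver's
// --receive.tenant-header setting (its default value).
const defaultTenantHeader = "THANOS-TENANT"

// newRemoteWriteRequest builds, but does not send, a remote write request
// for the given tenant. Hypothetical helper for illustration only.
func newRemoteWriteRequest(receiveURL, tenant string, body []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, receiveURL, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	// The receiver reads the tenant from this header.
	req.Header.Set(defaultTenantHeader, tenant)
	// Remote write payloads are snappy-compressed protobuf messages.
	req.Header.Set("Content-Type", "application/x-protobuf")
	req.Header.Set("Content-Encoding", "snappy")
	return req, nil
}

func main() {
	req, err := newRemoteWriteRequest(
		"http://localhost:19291/api/v1/receive", // assumed local receiver address
		"eng-host-networking",
		nil, // payload omitted in this sketch
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path, req.Header.Get(defaultTenantHeader))
}
```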


jnyi commented Nov 13, 2024

ts=2024-11-13T06:33:34.121916792Z caller=multitsdb.go:697 level=info name=pantheon-db component=receive component=multi-tsdb tenant=random-tenant123 msg="opening TSDB"
ts=2024-11-13T06:33:34.128508813Z caller=multitsdb.go:743 level=info name=pantheon-db component=receive component=multi-tsdb tenant=random-tenant123 msg="TSDB is now ready"
ts=2024-11-13T06:33:50.252309547Z caller=shipper.go:259 level=warn name=pantheon-db component=receive component=multi-tsdb tenant=random-tenant123 msg="reading meta file failed, will override it" err="failed to read /var/thanos/data/random-tenant123/thanos.shipper.json: open /var/thanos/data/random-tenant123/thanos.shipper.json: no such file or directory"

Tested on the latest main; this behavior did not occur:
Screenshot 2024-11-12 at 10 37 20 PM
Screenshot 2024-11-12 at 10 37 28 PM


jnyi commented Nov 23, 2024

This incident occurred again today. After digging deeper, we found it can be reproduced when multiple tenants are added simultaneously, which triggers a race condition introduced by this PR: #7782
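
For readers unfamiliar with this class of bug, here is a generic Go sketch of one way simultaneous additions can lose a tenant: a lost update on a shared set. This is not the Thanos code, and the issue does not describe the exact mechanism in #7782. The type `tenantSet` and all function names are hypothetical. The interleaving is replayed deterministically rather than with real goroutines, so the result is the same on every run.

```go
package main

import (
	"fmt"
	"sync"
)

// tenantSet is a hypothetical registry of queryable tenants.
// It is NOT the Thanos implementation.
type tenantSet struct {
	mu      sync.Mutex
	tenants map[string]struct{}
}

func newTenantSet() *tenantSet {
	return &tenantSet{tenants: map[string]struct{}{}}
}

// snapshot returns a copy of the current set.
func (s *tenantSet) snapshot() map[string]struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	cp := make(map[string]struct{}, len(s.tenants)+1)
	for k := range s.tenants {
		cp[k] = struct{}{}
	}
	return cp
}

// publish replaces the whole set. Used together with snapshot, this is a
// read-modify-write that is not atomic across the two calls.
func (s *tenantSet) publish(m map[string]struct{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.tenants = m
}

// addAtomic performs the read-modify-write under a single lock.
func (s *tenantSet) addAtomic(name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.tenants[name] = struct{}{}
}

func (s *tenantSet) size() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.tenants)
}

// lostUpdate replays the interleaving in which two writers both take a
// snapshot before either one publishes its result.
func lostUpdate() int {
	s := newTenantSet()
	a := s.snapshot()
	b := s.snapshot()
	a["tenant-a"] = struct{}{}
	b["tenant-b"] = struct{}{}
	s.publish(a)
	s.publish(b) // overwrites the set that contained tenant-a
	return s.size()
}

// concurrentAdds adds n distinct tenants from n goroutines using addAtomic.
func concurrentAdds(n int) int {
	s := newTenantSet()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.addAtomic(fmt.Sprintf("tenant-%d", i))
		}(i)
	}
	wg.Wait()
	return s.size()
}

func main() {
	fmt.Println("tenants visible after interleaved snapshot/publish:", lostUpdate())
	fmt.Println("tenants visible after 50 concurrent atomic adds:", concurrentAdds(50))
}
```

In the first case only one of the two tenants stays visible even though both were added, which resembles the symptom reported here. Whether the actual race has this shape is an assumption.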


jnyi commented Nov 25, 2024

I have a fix in #7941, which includes a unit test that reproduces the race condition.
