Skip to content

Conversation

mergify[bot]
Copy link
Contributor

@mergify mergify bot commented Sep 12, 2025

Proposed commit message

Main Bug Addressed: Wildcard Search Bug

Issue: #43885

Description: In the buggy scenario, a wildcard search for metrics is provided without a timegrain. A timegrain is the Azure terminology for aggregation period. This caused metrics to be pulled with an incompatible timegrain. In the buggy scenario, we incorrectly use the last leveraged timegrain to pull metric data again.

Fix: In this fix, we first grab the smallest available timegrain from the metric availabilities from the Azure API. These timegrains appear to be ordered, ascending, so we use the first one to assign the metric to a group. We then have groups of compatible metrics associated with this timegrain to prepare for the next step. This fix applies to both

  • the batch metric fetch and
  • the non-batch metric fetch.

Minor Side Bug Addressed: Nil Pointer Dereference

Issue: #43725

Description: In line


we are dereferencing the resp.Interval pointer to get the interval from the api response.

Fix: Check if the interval is not nil and not empty before continuing to process this data. If it is nil or empty, reject the data as we do when the API call errors. When this happens, we can assume the data is bad. This is because we have also handled the wildcard issue in this PR, so this API error edge case in code should not be hit unless the API is returning bad data.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

None - this only fixes bugs

Author's Checklist

How to test this PR locally

Testing wildcard metrics bug

Unit tests have also been added by this PR, which are automatically run by CI. To test beyond this manually, see below:

Set up the scenario/infrastructure as described in the parent issue: #43885

- module: azure  
  metricsets:  
    - monitor  
  enabled: true  
  period: 60s  
  client_id: '${AZURE_CLIENT_ID:""}'
  client_secret: '${AZURE_CLIENT_SECRET:""}'
  tenant_id: '${AZURE_TENANT_ID:""}'
  subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
  refresh_list_interval: 600s  
  resources:  
  - resource_query: "resourceType eq 'Microsoft.DBforPostgreSQL/flexibleServers'"
    metrics:  
    - name: ["*"]
      namespace: "Microsoft.DBforPostgreSQL/flexibleServers"

To test/verify that the 400 error is gone:

Check out this branch and set a breakpoint here before running the scenario in the debugger. Observe that

  • the timegrains are no longer empty the first time we hit this line
  • the program runs without the Azure API 400 error:
image

To test/verify that the number of metric definitions is unaffected:

  • Note that this is also covered by an updated unit test of the TestMapMetric, so the below helper log is just an extra piece of verification
  • One can compare with main and see that the number of metric definitions is 73 in both cases. Therefore, the number of metric definitions is unaffected.
    • There is a convenience debug log added in this PR to obtain this number. The log starts with unique metric definition count.
image

Testing minor nil pointer bug

To confirm that we handle the situation with a nil pointer, one can set up a debugger and set a breakpoint, then force the resp.Interval to nil at the first breakpoint in the screenshot. However, as noted in the comments in this code, this should not happen because we have handled the wildcard timegrain config scenario.
image

Related issues


This is an automatic backport of pull request #46145 done by [Mergify](https://mergify.com).

…l pointer deref (#46145) (#43725)

* Grab first metric from available timegrains and group by it

* Test for new metric grouping

* Add debug logging for metric definition count

* Log and skip if response interval is empty/nil

* Add tests for configured and non-configured timegrains

* Update EnableBatchApi documentation

* Add to changelog

* Rename metric to metricConfig

* Use composite key w timegrain for metric groups

* Add check for configured timegrain compatibility

* Fix redundant wording in err

* Add test for invalid configured timegrain

* Handle mismatched request/response timegrain, update msg

(cherry picked from commit 5f1ef61)
@mergify mergify bot requested a review from a team as a code owner September 12, 2025 18:51
@mergify mergify bot added the backport label Sep 12, 2025
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Sep 12, 2025
@github-actions github-actions bot added the Team:obs-ds-hosted-services Label for the Observability Hosted Services team label Sep 12, 2025
@elasticmachine
Copy link
Collaborator

Pinging @elastic/obs-ds-hosted-services (Team:obs-ds-hosted-services)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Sep 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Team:obs-ds-hosted-services Label for the Observability Hosted Services team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants