Skip to content

Conversation

joecompute
Copy link
Contributor

@joecompute joecompute commented Aug 21, 2025

Proposed commit message

Main Bug Addressed: Wildcard Search Bug

Issue: #43885

Description: In the buggy scenario, a wildcard search for metrics is provided without a timegrain. A timegrain is the Azure terminology for aggregation period. This caused metrics to be pulled with an incompatible timegrain. In the buggy scenario, we incorrectly use the last leveraged timegrain to pull metric data again.

Fix: In this fix, we first grab the smallest available timegrain from the metric availabilities from the Azure API. These timegrains appear to be ordered, ascending, so we use the first one to assign the metric to a group. We then have groups of compatible metrics associated with this timegrain to prepare for the next step. This fix applies to both

  • the batch metric fetch and
  • the non-batch metric fetch.

Minor Side Bug Addressed: Nil Pointer Dereference

Issue: #43725

Description: In line


we are dereferencing the resp.Interval pointer to get the interval from the api response.

Fix: Check if the interval is not nil and not empty before continuing to process this data. If it is nil or empty, reject the data as we do when the API call errors. When this happens, we can assume the data is bad. This is because we have also handled the wildcard issue in this PR, so this API error edge case in code should not be hit unless the API is returning bad data.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

None - this only fixes bugs

Author's Checklist

How to test this PR locally

Testing wildcard metrics bug

Unit tests have also been added by this PR, which are automatically run by CI. To test beyond this manually, see below:

Set up the scenario/infrastructure as described in the parent issue: #43885

- module: azure  
  metricsets:  
    - monitor  
  enabled: true  
  period: 60s  
  client_id: '${AZURE_CLIENT_ID:""}'
  client_secret: '${AZURE_CLIENT_SECRET:""}'
  tenant_id: '${AZURE_TENANT_ID:""}'
  subscription_id: '${AZURE_SUBSCRIPTION_ID:""}'
  refresh_list_interval: 600s  
  resources:  
  - resource_query: "resourceType eq 'Microsoft.DBforPostgreSQL/flexibleServers'"
    metrics:  
    - name: ["*"]
      namespace: "Microsoft.DBforPostgreSQL/flexibleServers"

To test/verify that the 400 error is gone:

Check out this branch and set a breakpoint here before running the scenario in the debugger. Observe that

  • the timegrains are no longer empty the first time we hit this line
  • the program runs without the Azure API 400 error:
image

To test/verify that the number of metric definitions is unaffected:

  • Note that this is also covered by an updated unit test of the TestMapMetric, so the below helper log is just an extra piece of verification
  • One can compare with main and see that the number of metric definitions is 73 in both cases. Therefore, the number of metric definitions is unaffected.
    • There is a convenience debug log added in this PR to obtain this number. The log starts with unique metric definition count.
image

Testing minor nil pointer bug

To confirm that we handle the situation with a nil pointer, one can set up a debugger and set a breakpoint, then force the resp.Interval to nil at the first breakpoint in the screenshot. However, as noted in the comments in this code, this should not happen because we have handled the wildcard timegrain config scenario.
image

Related issues

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Aug 21, 2025
Copy link
Contributor

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

Copy link
Contributor

mergify bot commented Aug 21, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @joecompute? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@joecompute joecompute added the Team:obs-ds-hosted-services Label for the Observability Hosted Services team label Aug 21, 2025
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Aug 21, 2025
@joecompute joecompute added backport-8.x Automated backport to the 8.x branch with mergify backport-9.1 Automated backport to the 9.1 branch labels Aug 21, 2025
@joecompute
Copy link
Contributor Author

@zmoog how do the backport tags look here?

@zmoog
Copy link
Contributor

zmoog commented Aug 25, 2025

@zmoog how do the backport tags look here?

We can use backport-active-all to dynamically get a backport PR for each active version when we'll merge this.

@joecompute joecompute added backport-active-all Automated backport with mergify to all the active branches and removed backport-8.x Automated backport to the 8.x branch with mergify backport-9.1 Automated backport to the 9.1 branch labels Aug 25, 2025
@zmoog
Copy link
Contributor

zmoog commented Sep 10, 2025

I see 6 API calls for both the main version and this PR. So, for the given configuration, the changes in this PR do't increase the number of calls.

Copy link
Contributor

mergify bot commented Sep 12, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix/wildcard-metrics-timegrain-compatibility upstream/fix/wildcard-metrics-timegrain-compatibility
git merge upstream/main
git push upstream fix/wildcard-metrics-timegrain-compatibility

Copy link
Contributor

@MichaelKatsoulis MichaelKatsoulis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @joecompute for putting this together. Excellent work!

@joecompute joecompute enabled auto-merge (squash) September 12, 2025 17:44
@joecompute joecompute merged commit 5f1ef61 into elastic:main Sep 12, 2025
37 checks passed
Copy link
Contributor

@Mergifyio backport 8.18 8.19 9.0 9.1

Copy link
Contributor

mergify bot commented Sep 12, 2025

mergify bot pushed a commit that referenced this pull request Sep 12, 2025
…l pointer deref (#46145) (#43725)

* Grab first metric from available timegrains and group by it

* Test for new metric grouping

* Add debug logging for metric definition count

* Log and skip if response interval is empty/nil

* Add tests for configured and non-configured timegrains

* Update EnableBatchApi documentation

* Add to changelog

* Rename metric to metricConfig

* Use composite key w timegrain for metric groups

* Add check for configured timegrain compatibility

* Fix redundant wording in err

* Add test for invalid configured timegrain

* Handle mismatched request/response timegrain, update msg

(cherry picked from commit 5f1ef61)
mergify bot pushed a commit that referenced this pull request Sep 12, 2025
…l pointer deref (#46145) (#43725)

* Grab first metric from available timegrains and group by it

* Test for new metric grouping

* Add debug logging for metric definition count

* Log and skip if response interval is empty/nil

* Add tests for configured and non-configured timegrains

* Update EnableBatchApi documentation

* Add to changelog

* Rename metric to metricConfig

* Use composite key w timegrain for metric groups

* Add check for configured timegrain compatibility

* Fix redundant wording in err

* Add test for invalid configured timegrain

* Handle mismatched request/response timegrain, update msg

(cherry picked from commit 5f1ef61)
mergify bot pushed a commit that referenced this pull request Sep 12, 2025
…l pointer deref (#46145) (#43725)

* Grab first metric from available timegrains and group by it

* Test for new metric grouping

* Add debug logging for metric definition count

* Log and skip if response interval is empty/nil

* Add tests for configured and non-configured timegrains

* Update EnableBatchApi documentation

* Add to changelog

* Rename metric to metricConfig

* Use composite key w timegrain for metric groups

* Add check for configured timegrain compatibility

* Fix redundant wording in err

* Add test for invalid configured timegrain

* Handle mismatched request/response timegrain, update msg

(cherry picked from commit 5f1ef61)
mergify bot pushed a commit that referenced this pull request Sep 12, 2025
…l pointer deref (#46145) (#43725)

* Grab first metric from available timegrains and group by it

* Test for new metric grouping

* Add debug logging for metric definition count

* Log and skip if response interval is empty/nil

* Add tests for configured and non-configured timegrains

* Update EnableBatchApi documentation

* Add to changelog

* Rename metric to metricConfig

* Use composite key w timegrain for metric groups

* Add check for configured timegrain compatibility

* Fix redundant wording in err

* Add test for invalid configured timegrain

* Handle mismatched request/response timegrain, update msg

(cherry picked from commit 5f1ef61)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-active-all Automated backport with mergify to all the active branches Team:obs-ds-hosted-services Label for the Observability Hosted Services team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[azure monitor] wildcard metrics names (metrics.name: [*]) issues [Azure Monitor] Check if resp.Interval for armmonitor list api can be nil
4 participants