Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch and expose well-defined error/information values #34

Open
RichiH opened this issue Aug 23, 2022 · 4 comments
Open

Catch and expose well-defined error/information values #34

RichiH opened this issue Aug 23, 2022 · 4 comments

Comments

@RichiH
Copy link
Owner

RichiH commented Aug 23, 2022

Some devices send special and locally-well-defined numeric values to signal state, e.g. "Resource busy".

The initial plan was to hide those values and print errors to STDOUT, see #32

This is not ideal as it forces people to look at the logs, hides information from PromQL, and makes these situations generally invisible/harder to debug.

Current thinking is to introduce a gauge which exposes this value instead.

So metric foo will dynamically get a foo_caught = 1234 where 1234 is the magic value. While OpenMetrics StateSet would allow for better and cleaner mapping, it would multiply the metrics exposed.

Two open questions:

  • Should foo go away, be set to zero, or retain its old value? Going away for the time being seems cleanest.
  • Should foo_caught go away once the special state is gone, be set to zero, or be set to zero and then go away? Going away seems cleanest yet again.

CC @SuperQ for thoughts.

@RichiH
Copy link
Owner Author

RichiH commented Mar 14, 2023

@bastischubert @DaAwesomeP @SuperQ thoughts?

@DaAwesomeP
Copy link
Contributor

IMO (and I am definitely and absolutely not a definitive source, just my opinion):

  • foo should go away when there is not a current value to expose, so it should go away on error
  • However with foo_caught I am more of the opinion that there should be a 0 state to signify "OK" since it is very possible a user would forget to implement alerts or visualization if it is not present initially. I think also that "no errors caught/operating normally" is definitely a valid state separate from the actual value of the sensor/device, which would be in foo.
  • Maybe instead of calling it foo_caught (which implies that it should disappear if nothing is caught) it should instead be foo_error which can continue to exist to say "no current error."

Are there other exporters we can look at that have similar metrics or have solved a similar issue? Not saying we should blindly copy it but it would be good to see other examples.

I should also note that I don't think I have any devices on-hand that support these error codes.

@RichiH
Copy link
Owner Author

RichiH commented Mar 14, 2023

Signalling OK state would mean doubling the metric count. I am not against it, yet still apprehensive. Or maybe we could expose this only for metrics which have a special handler defined?

I don't believe there's precedent. The most complex exporter in this regard is snmp_exporter; it has varies fixes for other protocol warts though.

@DaAwesomeP
Copy link
Contributor

Signalling OK state would mean doubling the metric count. I am not against it, yet still apprehensive.

Yeah...understandable.

Or maybe we could expose this only for metrics which have a special handler defined?

Yeah, that makes sense; then you're only adding metrics where it would actually work. And I imagine that most devices don't support this exceptions of every single address anyway.

In my case I am only monitoring simple I/O devices so I would almost never encounter this feature and would leave it off.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants