Provide CI testing infrastructure for long-running processes #12609

Closed
jmealo opened this issue Nov 18, 2020 · 11 comments
Labels: Client, customer-reported, feature-request, Monitor - Exporter, Monitor

Comments

jmealo commented Nov 18, 2020

Describe the bug
The azure-sdk-for-js throws unhandled rejections and exceptions on DNS failures and network/file errors that consumers cannot catch through the public interfaces. In the case of the Application Insights OpenTelemetry exporter, an internal error can crash the process under observation, which goes against community expectations and the provisional specification (see below).
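
To illustrate the failure mode at the process level: when a rejection escapes the SDK's internals, the only place a consumer can observe it is a last-resort process handler, and in recent Node.js versions an unhandled rejection terminates the process by default. A minimal sketch of that safety net (the handlers are standard Node.js APIs; the logging is illustrative only):

    // Last-resort handlers: these fire only when an error escapes every
    // public API surface, which is the situation described above.
    process.on("unhandledRejection", (reason) => {
      console.error("Unhandled rejection escaped the SDK:", reason);
      // Without a handler, newer Node.js versions terminate the process here.
    });

    process.on("uncaughtException", (err) => {
      console.error("Uncaught exception escaped the SDK:", err);
      // Process state is unknown at this point; exit after flushing logs.
      process.exit(1);
    });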

Expected behavior
I expect the azure-sdk-for-js to follow the OpenTelemetry error handling principles throughout the entire codebase (not just the telemetry packages), as that is what the Node.js community expects of packages from its cloud vendor.

Basic error handling principles

OpenTelemetry implementations MUST NOT throw unhandled exceptions at run time.

  • API methods MUST NOT throw unhandled exceptions when used incorrectly by end users. The API and SDK SHOULD provide safe defaults for missing or invalid arguments. For instance, a name like empty may be used if the user passes in null as the span name argument during Span construction.
  • The API or SDK may fail fast and cause the application to fail on initialization, e.g. because of a bad user config or environment, but MUST NOT cause the application to fail later at run time, e.g. due to dynamic config settings received from the Collector.
  • The SDK MUST NOT throw unhandled exceptions for errors in their own operations. For example, an exporter should not throw an exception when it cannot reach the endpoint to which it sends telemetry data.

Additional context
This should be handled everywhere.
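
As a concrete reading of the third principle above: an exporter can contain its own transport failures and report them through a result callback instead of letting them escape to the process. The sketch below is illustrative only (the interface is a simplified stand-in, not the actual @azure/monitor-opentelemetry-exporter implementation, and it assumes a runtime with a global fetch):

    // Simplified, hypothetical exporter: transport errors become a failed
    // export result rather than an unhandled rejection.
    interface ExportResult {
      success: boolean;
      error?: Error;
    }

    class SafeExporter {
      constructor(private readonly endpoint: string) {}

      async export(payload: unknown, done: (result: ExportResult) => void): Promise<void> {
        try {
          const res = await fetch(this.endpoint, {
            method: "POST",
            body: JSON.stringify(payload),
          });
          done({ success: res.ok });
        } catch (err) {
          // DNS failures, resets, and timeouts land here instead of
          // crashing the process under observation.
          done({ success: false, error: err as Error });
        }
      }
    }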

ghost added the needs-triage, customer-reported, and question labels on Nov 18, 2020
xirzec added the Client and Monitor labels on Nov 19, 2020
ghost removed the needs-triage label on Nov 19, 2020
xirzec (Member) commented Nov 19, 2020

Hey @jmealo, this sounds like a pretty frustrating issue. Can you share some error call stacks so we can make sure we fix the places where you are seeing these unhandled exceptions?

We can do a pass through the code to look for more, but I want to make sure we don't miss anything that's actively affecting you.

ghost added the needs-team-attention label on Nov 19, 2020
jmealo (Author) commented Nov 21, 2020

@ramya-rao-a: Any idea if there was a regression in preview.8 of Service Bus? We just had to migrate over to a function app because our unresponsive service bus listener issue started reproducing itself in less than an hour (previously it was every few days). No errors surfaced via the public interface.

HarshaNalluru (Member) commented

Hey @jmealo,

We recently made some changes to improve the way we manage the listeners. I can't think of any regressions; at least our rigorous testing did not catch any.

Can you tell us more about the unresponsive listener scenario, and provide sample code so we can repro the issue on our end?

xirzec added the needs-author-feedback label and removed the needs-team-attention label on Nov 30, 2020
ghost added the no-recent-activity label on Dec 7, 2020
ghost commented Dec 7, 2020

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

jmealo (Author) commented Dec 10, 2020

@xirzec:
#12851

@HarshaNalluru: Unfortunately we moved to using a function app for our Service Bus listeners, as we weren't able to get reconnections to work properly. The nail in the coffin was the breaking API change, which caused a complete outage (complete was no longer getting called).
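
For readers following along, one likely candidate for the breaking change described here (worth confirming against the v7 migration guide) is the move of message settlement from the message to the receiver in @azure/service-bus v7, so code that still called message.complete() would silently stop settling. A hedged sketch of the v7 GA shape, with placeholder queue name and handler bodies:

    import { ServiceBusClient } from "@azure/service-bus";

    const client = new ServiceBusClient(process.env.SERVICEBUS_CONNECTION_STRING!);
    const receiver = client.createReceiver("my-queue"); // placeholder queue name

    receiver.subscribe({
      processMessage: async (message) => {
        // v7 GA: settlement lives on the receiver, not the message.
        await receiver.completeMessage(message);
      },
      processError: async (args) => {
        // Errors the SDK can surface arrive here instead of being thrown.
        console.error("Service Bus error:", args.error);
      },
    });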

ghost added the needs-team-attention label and removed the needs-author-feedback and no-recent-activity labels on Dec 10, 2020
xirzec (Member) commented Dec 11, 2020

I'm a little confused here. It seems like you're experiencing separate issues for Service Bus and Monitor. Since you have opened a new issue for Monitor, could you file a separate issue for Service Bus so we can close this one?

jmealo (Author) commented Dec 11, 2020

@xirzec: We've had issues with all azure-sdk-for-js modules in our project except Cosmos DB. We didn't open issues since you were in the process of releasing refactors for each, and the issues seemed to be resolved (only to expose new bugs). By the time you finished those, we had already had to migrate away from those products due to poor reliability and production outages. We had to move any long-running processes over to function apps so the defects didn't impact the reliability of our system overall. So we traded fixes for your old bugs for new bugs. The new bugs made it apparent that we were testing your code in production and that rigorous testing had not been performed. This isn't limited to one product; it's a systemic blind spot for testing long-running processes.

It'd be nice if you guys could fix issues rather than close them, but to each their own. I still don't see anywhere that someone has committed to adding testing for your consumers' critical paths to your CI infrastructure and testing regime.

The same bugs we saw in v7 of Service Bus while it was in preview immediately popped up when it went GA. This is an issue with your development/QA/release process that, as a consumer, has been painful to observe while trying to build a backend using modules from this SDK.

The event loop has been blocked, we've had poor performance, and issues stem from the monkey patches applied by AI. Using modules from the azure-sdk has caused issues that I've never seen in 10 years of Node.js development. If you close this ticket without improving the code quality of the azure-sdk-for-js, it'll help others understand that Microsoft doesn't test their code and expects customers to do it when previews go GA.

The title of the issue says it all. We expect you to test YOUR code for memory leaks, blocking the event loop, and throwing unhandled rejections and exceptions that the consumer cannot handle. When you spew errors/warnings about connections from AMQP, what is a consumer supposed to do? You run the infrastructure; is that a message for you or a message for me? Not much care has been taken to understand what consumers need/expect and to make errors actionable.

xirzec (Member) commented Dec 11, 2020

@xirzec: We've had issues with all azure-sdk-for-js modules in our project except Cosmos DB. We didn't open issues since you were in the process of releasing refactors for each, and the issues seemed to be resolved (only to expose new bugs). By the time you finished those, we had already had to migrate away from those products due to poor reliability and production outages. We had to move any long-running processes over to function apps so the defects didn't impact the reliability of our system overall. So we traded fixes for your old bugs for new bugs. The new bugs made it apparent that we were testing your code in production and that rigorous testing had not been performed. This isn't limited to one product; it's a systemic blind spot for testing long-running processes.

This sounds very frustrating. I do agree with you that testing for long-running processes needs to be improved. I don't think our current CI infrastructure supports this today, but as you say, there are plenty of workloads that require us to ensure we don't leak memory or throw unexpected (or impossible-to-handle) exceptions that kill the parent process.

It'd be nice if you guys could fix issues rather than close them, but to each their own. I still don't see anywhere that someone has committed to adding testing for your consumers' critical paths to your CI infrastructure and testing regime.

I do think our goal is to not ship bugs; the specific trouble here is that broad systemic issues are harder to fix quickly and definitively. It's easier to fix a broken leg than to provide perfect healthcare to an entire country. We do care about customer pain, which is why we always prioritize customer issues when they are reported to us.

The same bugs we saw in v7 of Service Bus while it was in preview immediately popped up when it went GA. This is an issue with your development/QA/release process that, as a consumer, has been painful to observe while trying to build a backend using modules from this SDK.

This is great feedback and I'd love to understand what the SDK package as well as the service needs to do better here. /cc @ramya-rao-a

The event loop has been blocked, we've had poor performance, and issues stem from the monkey patches applied by AI. Using modules from the azure-sdk has caused issues that I've never seen in 10 years of Node.js development. If you close this ticket without improving the code quality of the azure-sdk-for-js, it'll help others understand that Microsoft doesn't test their code and expects customers to do it when previews go GA.

I am curious which packages you are using from AI (which I am assuming is App Insights and not Artificial Intelligence). AFAIK we only have two (the ARM package and the query package): https://github.com/Azure/azure-sdk-for-js/tree/master/sdk/applicationinsights -- both appear to be the older, entirely autogenerated style, and I don't think either of these packages does monkey-patching of other modules or the environment.

Perhaps you are having problems with a different package? I'd love to make sure the right team gets your feedback.
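
For reference, and as an assumption worth confirming: the standalone applicationinsights package (which appears in the dependency list later in this thread) is the one that patches Node.js core modules for auto-collection; the azure-sdk-for-js packages linked above do not. A hedged sketch of how that auto-collection is typically configured and narrowed in the 1.x API:

    import * as appInsights from "applicationinsights";

    appInsights
      .setup("<instrumentation key>") // placeholder
      // Each auto-collect toggle corresponds to a patch applied to core
      // modules (http, console, ...); turning one off removes that patch.
      .setAutoCollectRequests(true)
      .setAutoCollectDependencies(false)
      .setAutoCollectConsole(false)
      .setAutoDependencyCorrelation(true)
      .start();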

The title of the issue says it all. We expect you to test YOUR code for memory leaks, blocking the event loop, and throwing unhandled rejections and exceptions that the consumer cannot handle. When you spew errors/warnings about connections from AMQP, what is a consumer supposed to do? You run the infrastructure; is that a message for you or a message for me? Not much care has been taken to understand what consumers need/expect and to make errors actionable.

I agree we absolutely need to test our packages and ensure they are of acceptable quality. Testing is an infinitely growable effort, however, and it's not guaranteed to catch all issues. Despite that, we can learn over time, and put more and better safeguards into place.

I hear your frustration that you believe we need to do better, and I agree with you. The more details you can give us about the specific pain you are facing, the more easily we will be able to understand how we can not only address your issues, but prevent such mistakes from recurring.

jmealo changed the title from "Audit entire SDK for uncaught rejections and exceptions caused by internal operations" to "Provide CI testing infrastructure for long-running processes" on Dec 11, 2020
jmealo (Author) commented Dec 11, 2020

@xirzec: I really appreciate you taking the time on a Friday to validate my concerns and help pull in the right people. I opened this issue here because it surfaced when monitor-opentelemetry-exporter seemed to regress in preview.6 after changing to @azure/core-http. This could have been misuse of core-http, or unrelated, so this issue may very well live in the wrong place and can be closed if need be. I changed the title to reflect the true underlying issue. Feel free to close it if you feel like progress is being made to support the MS developers working in this repo.

Here are the packages we are using in our project:

    "@azure/cosmos": "^3.9.3",
    "@azure/event-hubs": "^5.3.1",
    "@azure/monitor-opentelemetry-exporter": "^1.0.0-preview.5",
    "@azure/service-bus": "^7.0.0",
    "applicationinsights": "^1.8.8",

@ramya-rao-a: I avoided opening issues when we encountered problems because the preview versions of Service Bus and Event Hubs were addressing all of our major issues around debugging, logging and error handling. Things are clearly headed in the right direction and priorities seemed in order. I was following along with the development on GitHub and felt comfortable shipping with preview releases.

I think that the AMQP work just needs testing infrastructure. Reading the README for rhea-promise, it seems like it would be really difficult to provide actionable errors to the consumer where they would make sense. Fully supporting OpenTelemetry tracing for Service Bus and Event Hubs gives customers more tools to troubleshoot issues.

The original intent of this ticket was to suggest that failed DNS lookups and transient connection failures be part of the test suite for the azure-sdk, as it appeared at the time that @azure/core-http was the source of the unhandled rejection.
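
A rough sketch of such a test, with the client call left as a placeholder; the point is that a DNS failure should surface as a catchable rejection from the public API and never as an unhandledRejection event:

    import { strict as assert } from "assert";

    // Hypothetical helper: `send` wraps a public SDK call pointed at an
    // unresolvable host (e.g. an endpoint like https://does-not-exist.invalid).
    async function dnsFailureIsCatchable(send: () => Promise<unknown>): Promise<void> {
      let escaped: unknown;
      const trap = (reason: unknown) => { escaped = reason; };
      process.on("unhandledRejection", trap);
      try {
        // The DNS error must come back through the returned promise...
        await assert.rejects(send(), /ENOTFOUND|EAI_AGAIN/);
      } finally {
        process.off("unhandledRejection", trap);
      }
      // ...and nothing may leak past the public interface.
      assert.equal(escaped, undefined, "DNS failure escaped the public interface");
    }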

ramya-rao-a (Contributor) commented

Thanks for the clarification @jmealo, that helps a lot.

Is it safe to say that the issues you saw with the monitor exporter package are covered in #12851 and #12856? If so, then for tracking purposes, I'd like to remove the Monitor label for this issue and use this issue to focus on the problems you saw with Service Bus and Event Hubs.

Regarding the AMQP side of things, we agree that the testing efforts there could use some more attention. Here are the current ongoing efforts, which we will pick up speed on in the new year:

  • Test coverage for rhea-promise: "[Ops] Add tests to project" (amqp/rhea-promise#58) is tracking this.
  • Run stress tests for a minimum of 3 days for every release candidate of the Service Bus and Event Hubs packages to ensure that the application recovers from any problems.
    • While this is straightforward for Event Hubs, our Service Bus stress tests cover a wide variety of scenarios due to features like lock renewal and message settlement.
    • We have seen our tests recover from network issues, idle timeout errors from the service, connection-forced errors from the service, and other errors.
    • Each of the stress tests is run manually at the moment, and we can definitely invest in having this in our test infrastructure so it can be invoked on demand.
  • The previous point relies on network issues or service timeouts which we do not control, so we are investing in using rhea to set up a mock server that controls the different error scenarios and tests the client's response to them. You can see the initial steps in this direction in "[mockhub] Initial mock event hubs package" (#10467).
  • We already have unit tests covering the case where the connection is disconnected (we simulate this using the idle() method on rhea's connection object), and we can include the same in our long-running stress tests.

What we have definitely not focused on is how the above plays with other packages like the monitor exporter.

We would like to hear more about your experience with reconnections when using Service Bus and Event Hubs. From your posts in different issues and PRs, it looks like you receive no error and the application does not receive any more messages. We agree that good support for tracing will help with troubleshooting. Can you share how you currently use (or previously used, before moving to Azure Functions) the monitor exporter with Service Bus/Event Hubs?
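
As a rough illustration of the on-demand, long-running checks described in the list above (the thresholds, duration, and workload are placeholders, not an actual azure-sdk-for-js test):

    // Minimal soak-test sketch: drive a workload for a fixed duration and
    // fail if anything escapes to the process level or RSS keeps growing.
    async function soak(workload: () => Promise<void>, hours: number): Promise<void> {
      let escaped = 0;
      process.on("unhandledRejection", () => { escaped += 1; });
      process.on("uncaughtException", () => { escaped += 1; });

      const baselineRss = process.memoryUsage().rss;
      const deadline = Date.now() + hours * 60 * 60 * 1000;

      while (Date.now() < deadline) {
        await workload();
        const growth = process.memoryUsage().rss - baselineRss;
        if (growth > 512 * 1024 * 1024) {
          throw new Error(`RSS grew by ${growth} bytes; possible leak`);
        }
      }
      if (escaped > 0) {
        throw new Error(`${escaped} error(s) escaped the public API during the soak`);
      }
    }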

joshfree added the Monitor - Exporter label and removed the Monitor label on Sep 22, 2021
xirzec added this to the Backlog milestone on Jun 8, 2022
xirzec added the feature-request label and removed the needs-team-attention and question labels on Jun 8, 2022
scottaddie added the Monitor label on Jun 15, 2022
github-actions bot commented Mar 20, 2024

Hi @jmealo, we deeply appreciate your input into this project. Regrettably, this issue has remained inactive for over 2 years, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Mar 20, 2024
github-actions bot locked and limited conversation to collaborators on Mar 20, 2024
xirzec removed this from the Backlog milestone on May 7, 2024