Provide CI testing infrastructure for long-running processes #12609
Comments
Hey @jmealo, this sounds like a pretty frustrating issue. Can you share some error call stacks so we can make sure we fix the places where you are seeing these unhandled exceptions? We can do a pass through the code to look for more, but I want to make sure we don't miss anything that's actively affecting you.
@ramya-rao-a: Any idea if there was a regression in […]?
Hey @jmealo, we recently made some changes to improve the way we manage the listeners. I can't think of any regressions; at least, our rigorous testing did not catch any. Can you tell us more about the unresponsive listener scenario? And can you please provide sample code for your scenario so we can repro the issue on our end?
Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!
@HarshaNalluru: Unfortunately we moved to using a function app for our service bus listeners, as we weren't able to get reconnections to work properly. The nail in the coffin was the breaking API change which caused a complete outage (…)
I'm a little confused here. It seems like you're experiencing separate issues for Service Bus and Monitor. Since you have opened a new issue for Monitor, could you file a separate issue for Service Bus so we can close this one?
@xirzec: We've had issues with all azure-sdk-js modules in our project except Cosmos DB. We didn't open issues since you were in the process of releasing refactors for each, and the issues seemed to be resolved (only to expose new bugs). By the time you finished those, we had to migrate away from those products due to poor reliability and production outages. We had to move any long-running processes over to function apps so the defects didn't impact the reliability of our system overall. So we traded fixes for your old bugs for new bugs.

The new bugs made it apparent that we were testing your code in production and that rigorous testing had not been performed. This isn't limited to one product; it's a systemic blind spot around testing long-running processes. It'd be nice if you could fix issues rather than close them, but to each their own. I still don't see anywhere that someone has made a commitment to add testing for your consumers' critical paths to your CI infrastructure and testing regime. The same bugs we saw in v7 of the service bus while in preview immediately popped up when it went GA. This is an issue with your development/QA/release process that, as a consumer, has been painful to observe while trying to build a backend out of modules from this SDK.

The event loop has gotten blocked, we've had poor performance, and issues stem from the monkey patches applied by AI. Using modules from the azure-sdk has caused issues that I've never seen in 10 years of Node.js development. If you close this ticket without improving the code quality of the azure-sdk-js, it'll help others understand that Microsoft doesn't test their code and expects customers to do it when previews go GA. The title of the issue says it all: we expect you to test YOUR code for memory leaks, blocking the event loop, and throwing unhandled rejections and exceptions that the consumer cannot handle.
When you spew errors/warnings about connections from AMQP, what is a consumer supposed to do? You run the infrastructure; is that a message for you or a message for me? Not much care has been taken to understand what consumers need/expect and to make errors actionable.
This sounds very frustrating. I do agree with you that testing for long running processes needs to be improved. I don't think our current CI infrastructure supports this today, but as you say there are plenty of workloads that require us to ensure we don't leak memory / have unexpected (or impossible to handle) exceptions that kill the parent process.
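One crude shape such long-running-process testing could take is a soak-style check: run an operation many times and report heap growth. The sketch below is purely illustrative; the operation, iteration count, and helper name are placeholders, not azure-sdk-for-js test code.

```javascript
// Hypothetical soak-check sketch: measure heap growth across many
// invocations of an operation. A real CI harness would run far longer
// and assert against a tuned threshold.
function measureHeapGrowth(op, iterations = 10000) {
  global.gc?.(); // only available when Node is started with --expose-gc
  const before = process.memoryUsage().heapUsed;
  for (let i = 0; i < iterations; i++) op();
  global.gc?.();
  const after = process.memoryUsage().heapUsed;
  return after - before;
}

const growth = measureHeapGrowth(() => {
  // short-lived allocation that should be collectable between iterations
  const buf = Buffer.alloc(1024).fill(1);
  return buf.length;
});
console.log(`heap growth over run: ${growth} bytes`);
```

Run repeatedly over hours rather than seconds, the same idea can surface slow leaks that unit tests never see.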
I do think our goal is to not ship bugs; the specific trouble here is that broad systemic issues are less able to be fixed quickly and definitively. It's easier to fix a broken leg than to provide perfect healthcare to an entire country. We do care about customer pain, which is why we always prioritize customer issues when they are reported to us.
This is great feedback and I'd love to understand what the SDK package as well as the service needs to do better here. /cc @ramya-rao-a
I am curious which packages you are using from AI (which I am assuming is App Insights and not Artificial Intelligence.) AFAIK we only have two (the ARM package and the query package): https://github.com/Azure/azure-sdk-for-js/tree/master/sdk/applicationinsights -- both appear to be the older style which are entirely autogenerated, but I don't think either of these packages are doing monkey-patching of other modules or the environment. Perhaps you are having problems with a different package? I'd love to make sure the right team gets your feedback.
I agree we absolutely need to test our packages and ensure they are of acceptable quality. Testing is an infinitely growable effort, however, and it's not guaranteed to catch all issues. Despite that, we can learn over time, and put more and better safeguards into place. I hear your frustration that you believe we need to do better, and I agree with you. The more details you can give us about the specific pain you are facing, the more easily we will be able to understand how we can not only address your issues, but prevent such mistakes from recurring.
@xirzec: I really appreciate you taking the time on a Friday to validate my concerns and help pull in the right people. I opened this issue here because it surfaced when […]. Here are the packages we are using in our project:
@ramya-rao-a: I avoided opening issues when we encountered problems because the […]. I think that the AMQP work just needs testing infrastructure. Reading the README for […]. The original intent of this ticket was to suggest that failed DNS lookups and transient connection failures be part of the test suite for the azure-sdk, as it appeared at the time that […]
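One way to exercise transient connection failures in a test, along the lines suggested above, is a flaky stub plus a retry wrapper. Everything below is an illustrative sketch: the helper names, retry count, and backoff values are assumptions, not azure-sdk-for-js internals.

```javascript
// Sketch of retry-with-backoff over a simulated transient failure.
// Errors that exhaust the retries are re-thrown, so the caller can
// catch them normally instead of seeing an unhandled rejection.
async function withRetries(op, { attempts = 3, delayMs = 10 } = {}) {
  let lastErr;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (e) {
      lastErr = e;
      // exponential backoff between attempts
      await new Promise((res) => setTimeout(res, delayMs * 2 ** i));
    }
  }
  throw lastErr;
}

// Test fixture: fails twice with a transient-looking error, then succeeds.
let calls = 0;
const flaky = async () => {
  calls++;
  if (calls < 3) throw new Error('ECONNRESET');
  return 'ok';
};

withRetries(flaky).then((r) => console.log(`result=${r} calls=${calls}`));
```

The same fixture pattern (fail N times, then succeed) also works for simulating DNS failures: have the stub reject with an `ENOTFOUND` error for the first few attempts.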
Thanks for the clarification @jmealo, that helps a lot. Is it safe to say that the issues you saw with the monitor exporter package are covered in #12851 and #12856? If so, then for tracking purposes, I'd like to remove the […]. Regarding the AMQP side of things, we agree that the testing efforts there could use some more attention. Here are the current ongoing efforts, which we will pick up speed on in the new year:
What we have definitely not focused on is how the above plays with other packages like the monitor exporter. We would like to hear more about your experience with reconnections when using Service Bus and Event Hubs. From your posts in different issues and PRs, it looks like you receive no error and the application does not receive any more messages. We agree that good support for tracing will help with troubleshooting. Can you share how you currently use (or previously used, before moving to Azure Functions) the monitor exporter with Service Bus/Event Hubs?
Hi @jmealo, we deeply appreciate your input into this project. Regrettably, this issue has remained inactive for over 2 years, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.
Describe the bug
The azure-sdk-for-js throws unhandled rejections and exceptions on DNS failures and network/file errors that are not catchable by consumers of the public interfaces. In the case of the Application Insights OpenTelemetry exporter, the process under observation can be crashed by an internal error, which goes against community expectations and the provisional specification (see below):
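The mechanics behind "not catchable by consumers" can be illustrated in plain Node.js: a rejection raised inside a dependency's fire-and-forget promise never reaches the caller's try/catch and instead surfaces as a process-level `unhandledRejection` event. The snippet below is a standalone illustration, not azure-sdk-for-js code; the DNS error message is simulated.

```javascript
// Demonstrates why a caller cannot catch a rejection from a promise it
// never receives a handle to.
let caughtAtCallSite = false;
let reachedProcess = false;

process.on('unhandledRejection', () => {
  reachedProcess = true;
});

try {
  // Fire-and-forget: the caller has no way to attach a .catch() here.
  (async () => {
    throw new Error('getaddrinfo ENOTFOUND example.invalid'); // simulated DNS failure
  })();
} catch {
  caughtAtCallSite = true; // never runs: the rejection is asynchronous
}

setImmediate(() => {
  console.log(`caughtAtCallSite=${caughtAtCallSite} reachedProcess=${reachedProcess}`);
});
```

In Node.js 15 and later, an unhandled rejection with no registered handler terminates the process by default, which is exactly the crash mode described in this report.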
Expected behavior
I expect the azure-sdk-for-js to follow the OpenTelemetry error handling principles throughout the entire codebase (not just in telemetry), as that's the expectation the Node.js community has for packages from their cloud vendor.
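The principle in question is, roughly, that SDK internals should report failures through a handler the host application can replace, rather than throw into (or crash) the process. A minimal sketch of that pattern follows; `flakyExport` and `safeExport` are hypothetical names standing in for any failing internal operation, not real azure-sdk-for-js or OpenTelemetry APIs.

```javascript
// Sketch of the "report, don't throw" pattern: internal failures are
// routed to a replaceable error handler and returned as a status.
async function flakyExport() {
  // stand-in for a DNS lookup, socket write, or ingestion call that fails
  throw new Error('getaddrinfo ENOTFOUND ingestion.example.invalid');
}

async function safeExport(onError = (e) => console.error('exporter error:', e.message)) {
  try {
    await flakyExport();
    return { success: true };
  } catch (e) {
    onError(e); // surfaced through a handler, never thrown at the host app
    return { success: false };
  }
}

safeExport().then((r) => console.log('export succeeded:', r.success));
```

With this shape, a DNS failure degrades telemetry delivery but cannot take down the process under observation.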
Additional context
This should be handled everywhere.