FunctionApps Randomly Dies Giving 500 Errors To All Requests #10900
Just FYI, we had another FunctionApp go offline this morning after applying Managed Identity access. The Azure dashboard blades all now give "Internal Server Error" (from Azure trying to get the status of the app): https://imgur.com/a/jppbW8I. We followed the same procedure we always have, and about 4 hours later this FunctionApp collapsed (16:01 on 05/03/2025). The logs in Application Insights show the same 'null bind error' from the first example I shared:
05/03/2025, 16:05:29 Trace

Interestingly enough, once the FunctionApp has 'collapsed' in this way, reverting its configuration back to key-based access to the storage account (reinstating the 'AzureWebJobsStorage' connection string, turning key-based access back on for the storage account, etc.) doesn't cause the app to come back online. Even restarting after reverting the config doesn't resolve it.
Hi All,

With further testing, we have been able to reproduce this bug, and can confirm via this testing that the bug exists within (all) FunctionApps and will prevent the reliable implementation of Managed Identity access from a FunctionApp to its storage account.

Introduction

We've been able to reproduce this "500 error" bug under controlled conditions and can now confirm the bug lies with the "Managed Identity" access to a FunctionApp's storage account.

Methods

We set up three tests. FunctionApp01 (the control) is left with the default configuration. FunctionApp02 is the default configuration but using Managed Identity access to the storage account on an S1 (premium) tier. FunctionApp03 is the default configuration but using Managed Identity access to the storage account on a Y1 (scalable) tier. From experience, apps on a Y1 tier are likely to exhibit this issue more often/sooner, which is why this additional FunctionApp03 test case was chosen.

On the 6th of March 2025 at around 3pm, these three apps were deployed, using the default (template) code for a Visual Studio 2022 FunctionApps project (Function App runtime v4) using .NET 8; a sketch of this template code is included after this comment. No changes were made beyond the configurations mentioned above, and an external monitor was put in place to detect the first outages. Based on our hypothesis, FunctionApp03 will go offline first, followed by FunctionApp02 at some point after this, while FunctionApp01 will remain online indefinitely (all else being equal).

FunctionApp1: https://testingfunctionapp01.azurewebsites.net/api/Function1?code=qt7e8MZ5fde82hCXvdG31_XPObilWpN-LcChnyhiKcO1AzFupswCbw==
FunctionApp2: https://testingfunctionapp02.azurewebsites.net/api/Function1?code=mmVGtFb6ziurEOLQZGhbrRarv3XmMZK_Erhb3-sMlXglAzFuXXm3kg==
FunctionApp3: https://testingfunctionapp03.azurewebsites.net/api/Function1?code=KU8KdlhnapnkFZAMsQjoRgGGDgk5awjLW7ajzFTp3g45AzFuSFpIcA==

Results

At 20:13 on the 10th of March 2025, four days after beginning the test, FunctionApp03 went offline. By 20:18 this had evolved into a "503 Service Unavailable" error as expected, which is shown in the logs below. Since this is a Y1 (scalable) tier, resource limitations in themselves should not, in theory, be an issue. And separately, we have shown in separate tests that a single FunctionApp (by itself) on an S1 (premium) tier exhibits the same issue. See uptime monitoring logs here: https://imgur.com/a/CVE3lnJ

Conclusion

We now know the source of the outages is related to the "Managed Identity" access to a FunctionApp's storage account. This brief test also highlights that Y1 tier FunctionApps are more likely to exhibit this issue sooner or more frequently. Because we know this (only) relates to Managed Identity, and that we didn't configure Managed Identity for FunctionApp01, we can also say this one will remain online indefinitely (all else being equal).
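For reference, the "default (template) code" mentioned above is just the stock HTTP trigger that Visual Studio 2022 generates for a .NET 8 Function App. Below is a minimal sketch of that template, assuming the isolated worker model; the exact code the template produces may differ slightly between versions, and nothing in it touches the storage configuration under test.

```csharp
// A minimal sketch of the default Visual Studio 2022 HTTP trigger template assumed
// to be deployed to all three test apps (.NET 8, isolated worker model; the exact
// generated code may differ slightly between template versions).
using System.Net;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.Extensions.Logging;

public class Function1
{
    private readonly ILogger<Function1> _logger;

    public Function1(ILogger<Function1> logger) => _logger = logger;

    [Function("Function1")]
    public HttpResponseData Run(
        [HttpTrigger(AuthorizationLevel.Function, "get", "post")] HttpRequestData req)
    {
        _logger.LogInformation("C# HTTP trigger function processed a request.");

        // Return a plain 200 response; the repro relies on nothing beyond this.
        var response = req.CreateResponse(HttpStatusCode.OK);
        response.WriteString("Welcome to Azure Functions!");
        return response;
    }
}
```

The point of deploying the unmodified template is that any 500/503 failures observed later cannot be attributed to application code.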
Digging a bit further, it turns out the Azure FunctionApp 'slot' itself is very broken in this scenario: no action I take from Visual Studio 2022 results in the publish succeeding; it fails with various errors.

Interestingly, attempting to deploy via Continuous Integration/Continuous Deployment (CI/CD) also fails, with a 'timeout'.
I imagine that in the Angular Azure dashboard there is some kind of check to see whether the FunctionApp is online, which fails (never actually receiving a response), leading to this. And this is with the 'AzureWebJobsStorage' key re-added and identity turned off (reverting back to key-based access). Interestingly, in this scenario trying to create a new deployment slot results in a '403 error' via the Azure dashboard: https://imgur.com/a/KJYo1EY

I find this 403 error interesting, given that I've now turned off identity and moved back to using key-based access via a connection string that worked fine when the app was first deployed. It suggests the configuration is not being read (or updated) from any actions taken via the Azure dashboard.
Hi All,
We have experienced issues where FunctionApps in Azure can be quite unreliable, and sometimes go offline after operating happily for a while, in some cases for years.
From the logs I'll share here, I believe I can show this is a bug in which - after a routine 'warmup' operation - the app is unsuccessful at binding to any of its URLs on startup, leaving it in a running state (which I can show is healthy and sensible) in which all URLs give a 500 error response. Nothing we do from here on out causes the app to successfully bind its methods, despite the fact that the app had not been changed in years (I share the Kudu logs and date/times further down).

The most recent case I can detail here happened on the 18th of February at 21:20: after a standard 'warmup' operation, one of our FunctionApps went offline, giving 500 status to all further requests (whether using valid or invalid keys). The end-to-end transaction logs show a standard warmup, followed by a load of 'null bindings':
18/02/2025, 21:40:51 - Trace Stopped the listener 'Microsoft.Azure.WebJobs.Extensions.Http.HttpTriggerAttributeBindingProvider+HttpTriggerBinding+NullListener' for function '<somename1>' Severity level: Information
18/02/2025, 21:40:51 - Trace Stopping the listener 'Microsoft.Azure.WebJobs.Extensions.Http.HttpTriggerAttributeBindingProvider+HttpTriggerBinding+NullListener' for function '<somename1>' Severity level: Information
18/02/2025, 21:40:51 - Trace Stopped the listener 'Microsoft.Azure.WebJobs.Extensions.Http.HttpTriggerAttributeBindingProvider+HttpTriggerBinding+NullListener' for function '<somename2>' Severity level: Information
18/02/2025, 21:40:51 - Trace Stopping the listener 'Microsoft.Azure.WebJobs.Extensions.Http.HttpTriggerAttributeBindingProvider+HttpTriggerBinding+NullListener' for function '<somename2>' Severity level: Information
From this point forward, all requests give 500 status errors (regardless of whether or not the keys used to access the function are correct).

No access attempt ever gets logged (i.e. no attempt to access the FunctionApp is logged in the end-to-end transaction logs or any of the other logs in Application Insights).
Further investigation shows the worker process is still online and healthy and correctly lists the function we expect it to find on startup.
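As a hedged aside (not something the original report relies on), one way to see from the outside which functions the host thinks it has loaded is the Functions host admin API. The sketch below assumes the /admin/functions endpoint, a placeholder app name, and the app's _master key supplied via an environment variable; in the broken state described here it may well return a 500 like every other request.

```csharp
// Hypothetical sketch (assumptions: the Functions host /admin/functions endpoint,
// an app named "testingfunctionapp01", and its _master key in an environment
// variable). Lists the functions the host has indexed.
using System;
using System.Net.Http;
using System.Threading.Tasks;

var appName = "testingfunctionapp01";                                       // placeholder
var masterKey = Environment.GetEnvironmentVariable("FUNCTIONS_MASTER_KEY");  // placeholder

using var http = new HttpClient();
http.DefaultRequestHeaders.Add("x-functions-key", masterKey);

// In a healthy app this returns a JSON array describing each loaded function;
// in the failure mode described above it is likely to return a 500 as well.
var json = await http.GetStringAsync($"https://{appName}.azurewebsites.net/admin/functions");
Console.WriteLine(json);
```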
The FunctionApp runs on an S1 (Premium) App Service Plan tier with plenty of resources, with "Always On" turned on, using Managed Identity access to the storage account.

No changes have been made to this app since the last deployment on 2024-07-30T09:11:07.5666953Z (taken from Kudu). No configuration changes have been made either, beyond enabling the Managed Identity access, which was quite a few months ago.
This has happened to around 5 or so FunctionApps over the course of the last 5 years, and I can separately say it happens more often when the storage account is accessed via the AzureWebJobsStorage__accountName key, with roles added via Azure (rather than via a connection string that includes a key).
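For anyone unfamiliar with the two modes being compared: key-based access means the host reads a full connection string from the AzureWebJobsStorage app setting, whereas identity-based access sets AzureWebJobsStorage__accountName and relies on the app's managed identity holding the right data-plane roles on the storage account. The sketch below is a hypothetical diagnostic, not part of the original report; it assumes the Azure.Identity and Azure.Storage.Blobs packages and simply checks whether the identity can reach the account named in that setting.

```csharp
// Hypothetical diagnostic sketch (not from the original report): verify that the
// identity the Function App runs as can reach the storage account referenced by
// the AzureWebJobsStorage__accountName app setting.
using Azure.Identity;
using Azure.Storage.Blobs;

var accountName = Environment.GetEnvironmentVariable("AzureWebJobsStorage__accountName")
                  ?? throw new InvalidOperationException("AzureWebJobsStorage__accountName is not set.");

// DefaultAzureCredential resolves to the managed identity when running in Azure.
var blobService = new BlobServiceClient(
    new Uri($"https://{accountName}.blob.core.windows.net"),
    new DefaultAzureCredential());

// Listing containers requires a data-plane role (e.g. Storage Blob Data Owner or
// Contributor) to have been assigned to the identity; a 403 here points at missing
// role assignments rather than at the Functions host itself.
await foreach (var container in blobService.GetBlobContainersAsync())
{
    Console.WriteLine(container.Name);
}
```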