FunctionApps Randomly Dies Giving 500 Errors To All Requests #10900
Just FYI, we had another FunctionApp go offline this morning after applying Managed Identity access. The Azure dashboard blades all now give "Internal Server Error" (from Azure trying to get the status of the app): https://imgur.com/a/jppbW8I. We followed the same procedure we always have, and about 4 hours later this FunctionApp collapsed (16:01 on 05/03/2025). The logs in Application Insights show the same 'null bind error' from the first example I shared:
05/03/2025, 16:05:29 Trace

Interestingly enough, once the FunctionApp has 'collapsed' in this way, reverting its configuration back to key-based access to the storage account (reinstating the 'AzureWebJobsStorage' connection string, turning key-based access back on for the storage account, etc.) doesn't cause the app to come back online. Even restarting after reverting the config doesn't resolve it.
Hi All,

With further testing, we have been able to reproduce this bug, and can confirm via this testing that the bug exists within (all) FunctionApps and will prevent the reliable implementation of Managed Identity access from a FunctionApp to its storage account.

Introduction

We've been able to reproduce this "500 error" bug under controlled conditions and can now confirm the bug lies with the "Managed Identity" access to a FunctionApp's storage account.

Methods

We set up three tests. FunctionApp01 (the control) is left with the default configuration. FunctionApp02 is the default configuration but using Managed Identity access to the storage account on an S1 (premium) tier. FunctionApp03 is the default configuration but using Managed Identity access to the storage account on a Y1 (scalable) tier. From experience, apps on a Y1 tier are likely to exhibit this issue more often/sooner, which is why this additional FunctionApp03 test case was chosen.

On the 6th of March 2025 at around 3pm, these three apps were deployed, using the default (template) code for a Visual Studio 2022 FunctionApps project (Function App runtime v4) using .NET 8; a sketch of this template code is included after this comment. No changes were made beyond the configurations mentioned above, and an external monitor was put in place to detect the first outages. Based on our hypothesis, FunctionApp03 will go offline first, followed by FunctionApp02 at some point after this, while FunctionApp01 will remain online indefinitely (all else being equal).

FunctionApp1: https://testingfunctionapp01.azurewebsites.net/api/Function1?code=qt7e8MZ5fde82hCXvdG31_XPObilWpN-LcChnyhiKcO1AzFupswCbw==
FunctionApp2: https://testingfunctionapp02.azurewebsites.net/api/Function1?code=mmVGtFb6ziurEOLQZGhbrRarv3XmMZK_Erhb3-sMlXglAzFuXXm3kg==
FunctionApp3: https://testingfunctionapp03.azurewebsites.net/api/Function1?code=KU8KdlhnapnkFZAMsQjoRgGGDgk5awjLW7ajzFTp3g45AzFuSFpIcA==

Results

At 20:13 on the 10th of March 2025, four days after beginning the test, FunctionApp03 went offline. By 20:18 this had evolved into a "503 Service Unavailable" error as expected, which is shown in the logs below. Since this is a Y1 (scalable) tier, resource limitations in themselves should not, in theory, be an issue. And separately, we have shown in separate tests that a single FunctionApp (by itself) on an S1 (premium) tier exhibits the same issue. See uptime monitoring logs here: https://imgur.com/a/CVE3lnJ

Conclusion

We now know the source of the outages is related to the "Managed Identity" access to a FunctionApp's storage account. This brief test also highlights that Y1 tier FunctionApps are more likely to exhibit this issue sooner or more frequently. Because we know this (only) relates to Managed Identity, and that we didn't configure Managed Identity for FunctionApp01, we can also say this one will remain online indefinitely (all else being equal).
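For reference, the "default (template) code" mentioned above is just the stock HTTP trigger that Visual Studio 2022 generates for a .NET 8 Function App. Below is a minimal sketch of that template, assuming the isolated worker model; the exact code the template produces may differ slightly between versions, and nothing in it touches the storage configuration under test.

```csharp
// A minimal sketch of the default Visual Studio 2022 HTTP trigger template assumed
// to be deployed to all three test apps (.NET 8, isolated worker model; the exact
// generated code may differ slightly between template versions).
using System.Net;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.Extensions.Logging;

public class Function1
{
    private readonly ILogger<Function1> _logger;

    public Function1(ILogger<Function1> logger) => _logger = logger;

    [Function("Function1")]
    public HttpResponseData Run(
        [HttpTrigger(AuthorizationLevel.Function, "get", "post")] HttpRequestData req)
    {
        _logger.LogInformation("C# HTTP trigger function processed a request.");

        // Return a plain 200 response; the repro relies on nothing beyond this.
        var response = req.CreateResponse(HttpStatusCode.OK);
        response.WriteString("Welcome to Azure Functions!");
        return response;
    }
}
```

The point of deploying the unmodified template is that any 500/503 failures observed later cannot be attributed to application code.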
Digging a bit further, it turns out the Azure FunctionApp 'slot' itself is very broken in this scenario: no action I take from Visual Studio 2022 results in the publish succeeding; it fails with various errors.

Interestingly, attempting to deploy via Continuous Integration/Continuous Deployment (CI/CD) also fails, with a 'timeout'.
I imagine that in the Angular Azure dashboard there is some kind of check to see whether the FunctionApp is online, which fails (never actually receiving a response), leading to this. And this is with the 'AzureWebJobsStorage' key re-added and identity turned off (reverting back to key-based access). Interestingly, in this scenario trying to create a new deployment slot results in a '403 error' via the Azure dashboard: https://imgur.com/a/KJYo1EY

I find this 403 error interesting, given that I've now turned off identity and moved back to using key-based access via a connection string that worked fine when the app was first deployed. It suggests the configuration is not being read (or updated) from any actions taken via the Azure dashboard.
Hi All,
We have experienced issues where FunctionApps in Azure can be quite unreliable, and sometimes go offline after operating happily for a while, in some cases for years.
From the logs I'll share here, I believe I can show this is a bug in which - after a routine 'warmup' operation - the app is unsuccessful at binding to any of its URLs on startup, leaving it in a running state (which I can show is healthy and sensible) in which all URLs give a 500 error response. Nothing we do from here on out causes the app to successfully bind its methods, despite the fact that the app had not been changed in years (I share the Kudu logs and date/times further down).

The most recent case I can detail here happened on the 18th of February at 21:20: after a standard 'warmup' operation, one of our FunctionApps went offline, giving 500 status to all further requests (whether using valid or invalid keys). The end-to-end transaction logs show a standard warmup, followed by a load of 'null bindings':
18/02/2025, 21:40:51 - Trace Stopped the listener 'Microsoft.Azure.WebJobs.Extensions.Http.HttpTriggerAttributeBindingProvider+HttpTriggerBinding+NullListener' for function '<somename1>' Severity level: Information
18/02/2025, 21:40:51 - Trace Stopping the listener 'Microsoft.Azure.WebJobs.Extensions.Http.HttpTriggerAttributeBindingProvider+HttpTriggerBinding+NullListener' for function '<somename1>' Severity level: Information
18/02/2025, 21:40:51 - Trace Stopped the listener 'Microsoft.Azure.WebJobs.Extensions.Http.HttpTriggerAttributeBindingProvider+HttpTriggerBinding+NullListener' for function '<somename2>' Severity level: Information
18/02/2025, 21:40:51 - Trace Stopping the listener 'Microsoft.Azure.WebJobs.Extensions.Http.HttpTriggerAttributeBindingProvider+HttpTriggerBinding+NullListener' for function '<somename2>' Severity level: Information
From this point forward, all requests give 500 status errors (regardless of whether or not the keys used to access the function are correct).

No access attempt ever gets logged (i.e. no attempt to access the FunctionApp is logged in the end-to-end transaction logs or any of the other logs in Application Insights).
Further investigation shows the worker process is still online and healthy and correctly lists the function we expect it to find on startup.
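As a hedged aside (not something the original report relies on), one way to see from the outside which functions the host thinks it has loaded is the Functions host admin API. The sketch below assumes the /admin/functions endpoint, a placeholder app name, and the app's _master key supplied via an environment variable; in the broken state described here it may well return a 500 like every other request.

```csharp
// Hypothetical sketch (assumptions: the Functions host /admin/functions endpoint,
// an app named "testingfunctionapp01", and its _master key in an environment
// variable). Lists the functions the host has indexed.
using System;
using System.Net.Http;
using System.Threading.Tasks;

var appName = "testingfunctionapp01";                                       // placeholder
var masterKey = Environment.GetEnvironmentVariable("FUNCTIONS_MASTER_KEY");  // placeholder

using var http = new HttpClient();
http.DefaultRequestHeaders.Add("x-functions-key", masterKey);

// In a healthy app this returns a JSON array describing each loaded function;
// in the failure mode described above it is likely to return a 500 as well.
var json = await http.GetStringAsync($"https://{appName}.azurewebsites.net/admin/functions");
Console.WriteLine(json);
```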
The FunctionApp runs on an S1 (Premium) App Service Plan tier with plenty of resources, with "Always On" turned on, using Managed Identity access to the storage account.

No changes have been made to this app since the last deployment on 2024-07-30T09:11:07.5666953Z (taken from Kudu). No configuration changes have been made either, beyond enabling the Managed Identity access, which was quite a few months ago.
This has happened to around 5 or so FunctionApps over the course of the last 5 years, and I can separately say it happens more often when the storage account is accessed via the AzureWebJobsStorage__accountName key, with roles added via Azure (rather than via a connection string that includes a key).
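For anyone unfamiliar with the two modes being compared: key-based access means the host reads a full connection string from the AzureWebJobsStorage app setting, whereas identity-based access sets AzureWebJobsStorage__accountName and relies on the app's managed identity holding the right data-plane roles on the storage account. The sketch below is a hypothetical diagnostic, not part of the original report; it assumes the Azure.Identity and Azure.Storage.Blobs packages and simply checks whether the identity can reach the account named in that setting.

```csharp
// Hypothetical diagnostic sketch (not from the original report): verify that the
// identity the Function App runs as can reach the storage account referenced by
// the AzureWebJobsStorage__accountName app setting.
using Azure.Identity;
using Azure.Storage.Blobs;

var accountName = Environment.GetEnvironmentVariable("AzureWebJobsStorage__accountName")
                  ?? throw new InvalidOperationException("AzureWebJobsStorage__accountName is not set.");

// DefaultAzureCredential resolves to the managed identity when running in Azure.
var blobService = new BlobServiceClient(
    new Uri($"https://{accountName}.blob.core.windows.net"),
    new DefaultAzureCredential());

// Listing containers requires a data-plane role (e.g. Storage Blob Data Owner or
// Contributor) to have been assigned to the identity; a 403 here points at missing
// role assignments rather than at the Functions host itself.
await foreach (var container in blobService.GetBlobContainersAsync())
{
    Console.WriteLine(container.Name);
}
```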