
feat: Automatically retry failed webhook deliveries #745


Open · jackie-linz wants to merge 15 commits into main

Conversation

jackie-linz
Contributor

@jackie-linz jackie-linz commented Jul 2, 2025

GitHub webhook deliveries can fail for many reasons. This code goes over all failed deliveries of the past hour and retries them. This should help avoid cases of jobs pending forever because a runner wasn't provisioned for them. It uses the GitHub API to iterate over all deliveries, find the ones that failed, confirm there is no newer delivery that worked, and then finally redeliver the webhook call through the GitHub API.

This only works for app authentication. A personal access token has no webhook attached, so it's impossible for us to know which webhook deliveries to monitor. Theoretically, if we ask for enough permissions, we could go over all of the user's webhooks (from which repo/org?) and find the matching one based on our webhook URL.

This change adds a Lambda that runs every five minutes. It's an added cost, but should still be basically free: around 10,000 GB-seconds a month with 14,600 requests.

Resolves #738
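
For reference, the overall flow looks roughly like this (a simplified sketch, not the exact Lambda code; it assumes an Octokit client that is already authenticated as the GitHub app, which the app webhook delivery endpoints require):

// Simplified sketch of the retry flow. LOOKBACK_MS and the helper names are illustrative.
import { Octokit } from '@octokit/rest';

const LOOKBACK_MS = 60 * 60 * 1000; // look at the past hour of deliveries

export async function retryFailedDeliveries(octokit: Octokit): Promise<void> {
  const cutoff = Date.now() - LOOKBACK_MS;
  // Deliveries come back newest first; remember only the newest attempt per GUID,
  // since a redelivery shares the GUID of the original delivery.
  const newestByGuid = new Map<string, { id: number; succeeded: boolean }>();

  for await (const { data: deliveries } of octokit.paginate.iterator(
    octokit.rest.apps.listWebhookDeliveries,
    { per_page: 100 },
  )) {
    for (const delivery of deliveries) {
      if (new Date(delivery.delivered_at).getTime() < cutoff) {
        // older than the window -- everything newer has been recorded already
        await redeliverFailures(octokit, newestByGuid);
        return;
      }
      if (!newestByGuid.has(delivery.guid)) {
        newestByGuid.set(delivery.guid, {
          id: delivery.id,
          succeeded: delivery.status_code >= 200 && delivery.status_code < 300,
        });
      }
    }
  }
  await redeliverFailures(octokit, newestByGuid);
}

async function redeliverFailures(
  octokit: Octokit,
  newestByGuid: Map<string, { id: number; succeeded: boolean }>,
): Promise<void> {
  for (const { id, succeeded } of newestByGuid.values()) {
    if (!succeeded) {
      // ask GitHub to redeliver the original payload for this delivery
      await octokit.rest.apps.redeliverWebhookDelivery({ delivery_id: id });
    }
  }
}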

@jackie-linz jackie-linz force-pushed the feat/webhook-redelivery branch from 6adbc96 to 030ae4e on July 2, 2025 23:16
@jackie-linz
Contributor Author

hmm... I don't get the same failure when running npm run integ:default:assert locally...


@kichik
Member

kichik commented Jul 3, 2025

Might be wrong dependencies. Try running yarn again.

@jackie-linz
Contributor Author

jackie-linz commented Jul 3, 2025

hmm... no luck. I've tried running

rm -rf node_modules/ assets/ dist/ lib/
yarn
npm run integ:default:snapshot

and there's no change to commit.

I've used my personal fork this time instead of the company one so you should be able to push to it

@kichik
Member

kichik commented Jul 3, 2025

I pushed a merge from main + redid the snapshot. Hopefully this one works. There are some corner cases with projen's snapshot. It can be a bit brittle at times. I think the last one I hit had to do with line breaks? I mostly try to fix them when I find them, so honestly not sure. If you want to dig into it, you can use the jsii docker image and build it there. Then, once you get the same hash as here, you can diff the template it generated to see what happened.

@jackie-linz
Contributor Author

I think I found it - the linter is changing the code, which resulted in a diff.

@jackie-linz
Contributor Author

I did not run yarn build before, as jest keeps failing - looks like my machine has too many CPU cores (so jest spawned 6 workers) and not enough memory to match, so tests keep timing out and getting killed.

I had to add the following to projenrc temporarily to get the build running 😆

  jestOptions: {
    jestConfig: {
      // cap the number of jest workers so the test run fits in memory
      maxWorkers: 3,
    },
  },

@kichik
Member

kichik commented Jul 4, 2025

I went into the code to fix up the logging a bit, ended up having to fix app authentication mode, and then somehow more and more piled on.

The commit message is hopefully descriptive enough. I ran out of time so the unit test is AI slop that does pass but is of questionable value.

Let me know what you think.

- Simplify how deliveries to redeliver are found, to speed up the process and save on GitHub API calls.
- Remove checks for event type as it's filtered when we create the webhook anyway. If we ever want more types in the future, we may forget to update the check here too.
- Remove the labels check as this is checked in the webhook handler anyway. In the future it might not be, and we don't want to have to update that check in two places and probably forget about one of them.
- Fix app authentication for iterating webhook deliveries. No installation id is available or needed.
- Adjust logging style to match the rest of the code.
Contributor Author

@jackie-linz jackie-linz left a comment

I like the new implementation that removed the need to fetch delivery details.

A few minor comments for your consideration.

@kichik
Member

kichik commented Jul 4, 2025

Thought about it some more and I'm confused by my auth fix. How did it work for you? Were you actually using a PAT? Because it seems to me that there will be no webhook access for a PAT. It's not attached to an app. The API we use is for apps only.

This also means redelivery will not work for a PAT. I wonder if we can make that one work too somehow. We have no info about the webhook the user creates manually.

@jackie-linz
Contributor Author

jackie-linz commented Jul 5, 2025

Thought about it some more and I'm confused by my auth fix. How did it work for you? Were you actually using a PAT? Because it seems to me that there will be no webhook access for a PAT. It's not attached to an app. The API we use is for apps only.

This also means redelivery will not work for a PAT. I wonder if we can make that one work too somehow. We have no info about the webhook the user creates manually.

I'm not that familiar with the different auth options for GitHub apps, but this is my local version of the Lambda that worked, reusing the existing GitHub secret and GitHub private key secret:

import { createAppAuth } from '@octokit/auth-app';
import { Octokit } from 'octokit';
// getSecretValue is the existing Secrets Manager helper (import path assumed)
import { getSecretValue } from './lambda-helpers';

export interface GithubSetting {
  domain: string;
  appId: number;
}

export async function getGithubSetting(): Promise<GithubSetting> {
  if (!process.env.GITHUB_SECRET_ARN) {
    throw new Error('Missing GITHUB_SECRET_ARN environment variable');
  }
  return JSON.parse(await getSecretValue(process.env.GITHUB_SECRET_ARN)) as GithubSetting;
}

export async function getGithubPrivateKey(): Promise<string> {
  if (!process.env.GITHUB_PRIVATE_KEY_SECRET_ARN) {
    throw new Error('Missing GITHUB_PRIVATE_KEY_SECRET_ARN environment variable');
  }
  return await getSecretValue(process.env.GITHUB_PRIVATE_KEY_SECRET_ARN);
}

export const handler = async (): Promise<void> => {
  const setting = await getGithubSetting();
  const privateKey = await getGithubPrivateKey();

  // Authenticate as the app itself (no installation id), which is what the
  // app webhook delivery endpoints expect.
  const octokit = new Octokit({
    authStrategy: createAppAuth,
    auth: {
      appId: setting.appId,
      privateKey,
    },
  });
  ...

@kichik
Member

kichik commented Jul 5, 2025

I'm not that familiar with the different auth options for GitHub apps, but this is my local version of the Lambda that worked, reusing the existing GitHub secret and GitHub private key secret:

Ah, that explains it. Your local version did the right thing by just using private key authentication. The pushed code used our getOctokit(), which also tries to switch over to the installation context.
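
For reference, the distinction in @octokit/auth-app terms (a sketch with placeholder IDs, not our getOctokit() code):

// Sketch with placeholder IDs: "app" auth (a short-lived JWT) vs "installation"
// auth (a repo-scoped token). The /app/hook/deliveries endpoints used for
// listing and redelivering webhook deliveries only accept the app JWT.
import { createAppAuth } from '@octokit/auth-app';

async function demo(): Promise<void> {
  const auth = createAppAuth({
    appId: 12345, // placeholder
    privateKey: '-----BEGIN RSA PRIVATE KEY-----\n...', // placeholder
  });

  // Authenticate as the app itself: this is what the webhook delivery APIs need.
  const appAuth = await auth({ type: 'app' });

  // Authenticate as an installation: right for most repo/org calls, but not
  // accepted by the app webhook endpoints.
  const installationAuth = await auth({ type: 'installation', installationId: 67890 });
}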

@kichik
Member

kichik commented Jul 5, 2025

I think this is the final version. Would you be able to test it?

@kichik kichik changed the title from "feat: new lambda function to check and retry failed webhook deliveries" to "feat: Automatically retry failed webhook deliveries" on Jul 6, 2025
@jackie-linz
Contributor Author

I think this is the final version. Would you be able to test it?

Happy to test it with our deployment if you can publish a beta version - just the Node.js package would be enough.

@kichik
Member

kichik commented Jul 7, 2025

I don't have that option, but the build artifact does contain a tar.gz that npm can install.

@jackie-linz
Contributor Author

Deployed and ran successfully twice.

The log content looks fine too; will keep an eye on it to compare the behaviour between this and my own one (currently both are actively running).

It'd be nice to provide Logs Insights queries like you already do for the others.

@jackie-linz
Contributor Author

jackie-linz commented Jul 7, 2025

Performance comparison

Mine: [screenshot]

New one: [screenshot]

Looks like the refactored implementation is quite a lot slower, likely due to it needing to fetch 1 hour's worth of deliveries instead of 5 minutes' 😞

So while the refactored implementation avoided calling the delivery detail API, it had a much smaller effect than expected, because that API call was only made for failed deliveries, which don't happen that often (we had only 18 in the past week, 8 of which were installation_repositories events that should not fail anymore with this change).

On the other hand, the retrieval of 60 minutes of delivery history (instead of just 5 minutes) happens every single run, increasing the typical run time by about 12 times.

@jackie-linz
Contributor Author

By the way, we do have quite a number of repos, and there are between 2,000 and 7,000 webhook deliveries per hour during work hours.

And here's the Lambda run duration comparison between the two for this morning:

New: [duration chart]

Old: [duration chart]

@kichik
Member

kichik commented Jul 8, 2025

Thanks!

It'd be nice to provide Logs Insights queries like you already do for the others.

Done.

Looks like the refactored implementation is quite a lot slower, likely due to it needing to fetch 1 hour's worth of deliveries instead of 5 minutes' 😞

My hope was that not calling GitHub API to get delivery information for each delivery would outweigh that. Seems like you have too many deliveries to make that happen 😅

Is the price still basically zero or do we need to bring back the SSM or something similar?

@jackie-linz
Contributor Author

jackie-linz commented Jul 8, 2025

My hope was that not calling GitHub API to get delivery information for each delivery would outweigh that. Seems like you have too many deliveries to make that happen 😅

It's more about the extremely low probability of delivery failures than the volume of deliveries.

In our setup there were only 10 failures out of 221,573 webhook deliveries in the past week, which is less than 0.005%.

Is the price still basically zero or do we need to bring back the SSM or something similar?

Price is not that big a concern - an average of 24s per run means <$0.5 per month.

On the other hand, the extended execution time could potentially be a concern if it exceeds the limit of 4.5 minutes. With our setup it peaked around 54 seconds today, so it is possible for larger orgs to run into problems with the time limit.

We could probably achieve better performance without the need to bring back the SSM, by using a global variable as an in-memory cache.

The cache could store the last checked delivery id, like in my implementation, and optionally a map of known delivery summaries, like in your implementation.

The cache does get cleared every few hours when the Lambda gets a cold start, but we can just fall back to retrieving the past 5 minutes of data.
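
Something like this (a sketch; checkDeliveries and the variable names are hypothetical):

// Module-level state survives warm invocations of the same Lambda instance
// and simply starts empty again after a cold start.
let lastCheckedDeliveryId: number | undefined;

const COLD_START_LOOKBACK_MS = 5 * 60 * 1000; // after a cold start, look back 5 minutes

export const handler = async (): Promise<void> => {
  // Warm invocation: walk deliveries newest-first until we hit the cached id.
  // Cold start: no cached id, so fall back to a 5-minute time window.
  const newestSeenId = await checkDeliveries({
    stopAtId: lastCheckedDeliveryId,
    since: lastCheckedDeliveryId === undefined ? Date.now() - COLD_START_LOOKBACK_MS : undefined,
  });

  if (newestSeenId !== undefined) {
    lastCheckedDeliveryId = newestSeenId; // remember where to stop next run
  }
};

// Hypothetical helper: iterates deliveries, retries failures, and returns the
// newest delivery id it saw (or undefined if there were none).
declare function checkDeliveries(opts: { stopAtId?: number; since?: number }): Promise<number | undefined>;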

@kichik
Member

kichik commented Jul 8, 2025

It's more about the extremely low probability of delivery failures than the volume of deliveries.

In our setup there were only 10 failures out of 221,573 webhook deliveries in the past week, which is less than 0.005%.

Yep. I missed that it worked on the already-filtered list and wasn't doing the filtering itself.

We could probably achieve better performance without the need to bring back the SSM, by using a global variable as an in-memory cache.

Oh man, I like that! I think it will have to store the known failed delivery summaries, otherwise it won't be able to know if a failed delivery is too old to retry. So maybe a map of failed delivery GUIDs to original delivery dates.

@kichik
Member

kichik commented Jul 14, 2025

I didn't get to test much and unit tests are still AI slop. If you can help me confirm latency is down + redelivery still works as expected, that'd be really appreciated.

@kichik
Member

kichik commented Jul 15, 2025

Thanks. Addressed feedback.

But I also realized failures is unbounded. We would need to return successful redeliveries from newDeliveryFailures too. Or maybe just TTL items there.

This also makes me think more about a case with too many failed deliveries, like maybe a misconfigured webhook for a few hours. The Lambda might fail in a loop and reset its cache between each run. It would be nice if we could use the same SQS trick as the idle reaper: it returns the message to the queue on failure and uses SQS delivery delay and visibility timeout to ensure the message is retested every 10 minutes. But there is no constant message we can throw back on the queue here, and we can't even check if the redelivery was successful without iterating the whole list. There is no delivery GUID -> delivery id / success API.

@jackie-linz
Contributor Author

Thanks. Addressed feedback.

👍

But I also realized failures is unbounded. We would need to return successful redeliveries from newDeliveryFailures too. Or maybe just TTL items there.

The Lambda typically gets a cold start every few hours, so I don't expect failures to grow too much.

If you are worried, we could expire failures whose firstDeliveredAt is too old at the end of each run.
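
E.g. something along these lines at the end of each run (a sketch; the 1-hour TTL and the names are illustrative):

// Prune cached failures that are too old to retry anyway, so the in-memory map
// cannot grow without bound between cold starts.
const MAX_FAILURE_AGE_MS = 60 * 60 * 1000; // illustrative 1-hour TTL

interface CachedFailure {
  firstDeliveredAt: number; // epoch ms of the original failed delivery
}

const failures = new Map<string, CachedFailure>(); // keyed by delivery GUID

function pruneExpiredFailures(now: number = Date.now()): void {
  for (const [guid, failure] of failures) {
    if (now - failure.firstDeliveredAt > MAX_FAILURE_AGE_MS) {
      failures.delete(guid); // deleting during iteration is safe for a Map
    }
  }
}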

This also makes me think more about a case with too many failed deliveries, like maybe a misconfigured webhook for a few hours. The Lambda might fail in a loop and reset its cache between each run. It would be nice if we could use the same SQS trick as the idle reaper: it returns the message to the queue on failure and uses SQS delivery delay and visibility timeout to ensure the message is retested every 10 minutes. But there is no constant message we can throw back on the queue here, and we can't even check if the redelivery was successful without iterating the whole list. There is no delivery GUID -> delivery id / success API.

You mean too many redeliver calls could cause the Lambda to exceed the 4.5-minute timeout? How about doing the redeliveries in parallel with Promise.all?

@kichik
Member

kichik commented Jul 15, 2025

If you are worried, we could expire failures whose firstDeliveredAt is too old at the end of each run.

Probably what we will end up doing.

You mean too many redeliver calls could cause the Lambda to exceed the 4.5-minute timeout? How about doing the redeliveries in parallel with Promise.all?

That and/or failures from GitHub redelivery API due to too many requests at once. Using Promise.all() would make that last one worse. The error handling of large batches of redeliveries is not ideal here.

@jackie-linz
Contributor Author

That and/or failures from GitHub redelivery API due to too many requests at once. Using Promise.all() would make that last one worse.

I noticed that you are using @octokit/rest instead of octokit. octokit has built-in throttling and retry that implement all of their best practices. It would even handle a Promise.all to avoid hitting rate limits.

Was there a reason that it's not used?

The error handling of large batches of redeliveries is not ideal here.

According to their best practices, mutation requests (POST/PUT/etc.) should have a 1s wait in between, so within 4.5 minutes we can send at most 270 redelivery requests.

So yes, the current implementation does not handle a total-failure scenario where there are more than 200 failures every 5 minutes.

Using a queue is one option - but we still need to take care of the rate limit.

Alternatively, I think it's ok to not retry when there are more than, say, 100 failures found. I think it's reasonable to assume the problem is not transient in that case.
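
For reference, wiring the same throttling and retry behaviour into @octokit/rest would look roughly like this (a sketch with placeholder auth; the real code authenticates as the app):

// @octokit/rest plus the throttling and retry plugins that the octokit
// meta-package bundles by default, so bursts of redelivery POSTs back off
// instead of tripping the secondary rate limit.
import { Octokit } from '@octokit/rest';
import { retry } from '@octokit/plugin-retry';
import { throttling } from '@octokit/plugin-throttling';

const ThrottledOctokit = Octokit.plugin(throttling, retry);

const octokit = new ThrottledOctokit({
  auth: process.env.GITHUB_TOKEN, // placeholder auth
  throttle: {
    onRateLimit: (retryAfter, options, kit, retryCount) => {
      kit.log.warn(`Rate limit hit for ${options.method} ${options.url}, retrying in ${retryAfter}s`);
      return retryCount < 2; // retry at most twice
    },
    onSecondaryRateLimit: (retryAfter, options, kit) => {
      kit.log.warn(`Secondary rate limit hit for ${options.method} ${options.url}`);
      return true; // retry after the suggested delay
    },
  },
});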

@jackie-linz
Contributor Author

@kichik are you able to fix the build so I can download the new build artifact to test?

@kichik
Member

kichik commented Jul 17, 2025

@kichik are you able to fix the build so I can download the new build artifact to test?

Done.

I noticed that you are using @octokit/rest instead of octokit. octokit has built-in throttling and retry that implement all of their best practices. It would even handle a Promise.all to avoid hitting rate limits.

Was there a reason that it's not used?

I think there was. But I'm not sure what it was. I feel like it was either bundle size, or maybe something was missing like app authentication.

Alternatively, I think it's ok to not retry when there are more than, say, 100 failures found. I think it's reasonable to assume the problem is not transient in that case.

Yeah probably what we have is already better than nothing. I will keep thinking for a while about the SQS option. If no proper solution comes up, we can go with the current one.

I do want to get the unit tests proper because there are a few corner cases I want to check. For example: what happens when we get a cold start and the delivery log has a failed delivery plus a successful redelivery? Will it redeliver?

@jackie-linz
Contributor Author

Deployed the latest code and performance-wise it looks good.

[duration chart]

Note I made 2 deployments, as indicated by the red marker on the chart - the second time changing a few of the debug logs to info.

Looks like the first execution takes about 10s, with subsequent executions taking around 1-4 seconds.

Will monitor logs to see if redeliveries happen as expected.

@kichik
Member

kichik commented Jul 18, 2025

Note I made 2 deployments, as indicated by the red marker on the chart - the second time changing a few of the debug logs to info.

You can uncomment this line to get debug logs:

https://github.com/jackie-linz/cdk-github-runners/blob/67cc182e77283dde1fb1bc174da869ef643742d9/src/webhook-redelivery.ts#L47

Looks like the first execution takes about 10s, with subsequent executions taking around 1-4 seconds.

That feels weirdly high? Is that what it took for your original code too? It needs 4 seconds to go over 5 minutes of deliveries?
