
feat: Automatically retry failed webhook deliveries #745


Open · jackie-linz wants to merge 15 commits into main

Conversation

jackie-linz
Contributor

@jackie-linz jackie-linz commented Jul 2, 2025

GitHub webhook deliveries can fail for many reasons. This code goes over all failed deliveries of the past hour and retries them. This should help avoid cases of jobs pending forever because a runner wasn't provisioned for them. It uses the GitHub API to iterate over all deliveries, find the ones that failed, confirm there is no newer delivery that worked, and then finally redeliver the webhook call through the GitHub API.

This only works for app authentication. A personal access token has no webhook attached, so it's impossible for us to know which webhook deliveries to monitor. Theoretically, if we ask for enough permissions, we could go over all of the user's webhooks (from which repo/org?) and find the matching one based on our webhook URL.

This change adds a Lambda that runs every five minutes. It's an added cost, but should still be basically free: around 10,000 GB-seconds a month with 14,600 requests.

Resolves #738
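
For reference, the overall flow looks roughly like this (a simplified sketch, not the exact Lambda code; it assumes an Octokit client that is already authenticated as the GitHub app, which the app webhook delivery endpoints require):

// Simplified sketch of the retry flow. LOOKBACK_MS and the helper names are illustrative.
import { Octokit } from '@octokit/rest';

const LOOKBACK_MS = 60 * 60 * 1000; // look at the past hour of deliveries

export async function retryFailedDeliveries(octokit: Octokit): Promise<void> {
  const cutoff = Date.now() - LOOKBACK_MS;
  // Deliveries come back newest first; remember only the newest attempt per GUID,
  // since a redelivery shares the GUID of the original delivery.
  const newestByGuid = new Map<string, { id: number; succeeded: boolean }>();

  for await (const { data: deliveries } of octokit.paginate.iterator(
    octokit.rest.apps.listWebhookDeliveries,
    { per_page: 100 },
  )) {
    for (const delivery of deliveries) {
      if (new Date(delivery.delivered_at).getTime() < cutoff) {
        // older than the window -- everything newer has been recorded already
        await redeliverFailures(octokit, newestByGuid);
        return;
      }
      if (!newestByGuid.has(delivery.guid)) {
        newestByGuid.set(delivery.guid, {
          id: delivery.id,
          succeeded: delivery.status_code >= 200 && delivery.status_code < 300,
        });
      }
    }
  }
  await redeliverFailures(octokit, newestByGuid);
}

async function redeliverFailures(
  octokit: Octokit,
  newestByGuid: Map<string, { id: number; succeeded: boolean }>,
): Promise<void> {
  for (const { id, succeeded } of newestByGuid.values()) {
    if (!succeeded) {
      // ask GitHub to redeliver the original payload for this delivery
      await octokit.rest.apps.redeliverWebhookDelivery({ delivery_id: id });
    }
  }
}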

@jackie-linz jackie-linz force-pushed the feat/webhook-redelivery branch from 6adbc96 to 030ae4e on July 2, 2025 23:16
@jackie-linz
Contributor Author

hmm... I don't get the same failure when running npm run integ:default:assert locally...


@kichik
Member

kichik commented Jul 3, 2025

Might be wrong dependencies. Try running yarn again.

@jackie-linz
Contributor Author

jackie-linz commented Jul 3, 2025

hmm... no luck. I've tried running

rm -rf node_modules/ assets/ dist/ lib/
yarn
npm run integ:default:snapshot

and there's no change to commit.

I've used my personal fork this time instead of the company one so you should be able to push to it

@kichik
Member

kichik commented Jul 3, 2025

I pushed a merge from main + redid the snapshot. Hopefully this one works. There are some corner cases with projen's snapshot. It can be a bit brittle at times. I think the last one I hit had to do with line breaks? I mostly try to fix them when I find them, so honestly not sure. If you want to dig into it, you can use the jsii docker image and build it there. Then, once you get the same hash as here, you can diff the template it generated to see what happened.

@jackie-linz
Contributor Author

I think I found it - the linter is changing the code, which resulted in a diff.

@jackie-linz
Contributor Author

I did not run yarn build before, as jest keeps failing - looks like my machine has too many CPU cores (so jest spawned 6 workers) and not enough memory to match, so tests keep timing out and getting killed.

I had to add the following to projenrc temporarily to get the build running 😆

  jestOptions: {
    jestConfig: {
      // cap the number of jest workers so the test run fits in memory
      maxWorkers: 3,
    },
  },

@kichik
Member

kichik commented Jul 4, 2025

I went into the code to fix up the logging a bit, ended up having to fix app authentication mode, and then somehow more and more piled on.

The commit message is hopefully descriptive enough. I ran out of time so the unit test is AI slop that does pass but is of questionable value.

Let me know what you think.

- Simplify how deliveries to redeliver are found, to speed up the process and save on GitHub API calls.
- Remove checks for event type as it's filtered when we create the webhook anyway. If we ever want more types in the future, we may forget to update the check here too.
- Remove the labels check as this is checked in the webhook handler anyway. In the future it might not be, and we don't want to have to update that check in two places and probably forget about one of them.
- Fix app authentication for iterating webhook deliveries. No installation id is available or needed.
- Adjust logging style to match the rest of the code.
Contributor Author

@jackie-linz jackie-linz left a comment

I like the new implementation that removed the need to fetch delivery details.

A few minor comments for your consideration.

@kichik
Member

kichik commented Jul 4, 2025

Thought about it some more and I'm confused by my auth fix. How did it work for you? Were you actually using a PAT? Because it seems to me that there will be no webhook access for a PAT. It's not attached to an app. The API we use is for apps only.

This also means redelivery will not work for a PAT. I wonder if we can make that one work too somehow. We have no info about the webhook the user creates manually.

@jackie-linz
Contributor Author

jackie-linz commented Jul 5, 2025

Thought about it some more and I'm confused by my auth fix. How did it work for you? Were you actually using a PAT? Because it seems to me that there will be no webhook access for a PAT. It's not attached to an app. The API we use is for apps only.

This also means redelivery will not work for a PAT. I wonder if we can make that one work too somehow. We have no info about the webhook the user creates manually.

I'm not that familiar with the different auth options for GitHub apps, but this is my local version of the Lambda that worked, reusing the existing GitHub secret and GitHub private key secret:

import { createAppAuth } from '@octokit/auth-app';
import { Octokit } from 'octokit';
// getSecretValue is the existing Secrets Manager helper (import path assumed)
import { getSecretValue } from './lambda-helpers';

export interface GithubSetting {
  domain: string;
  appId: number;
}

export async function getGithubSetting(): Promise<GithubSetting> {
  if (!process.env.GITHUB_SECRET_ARN) {
    throw new Error('Missing GITHUB_SECRET_ARN environment variable');
  }
  return JSON.parse(await getSecretValue(process.env.GITHUB_SECRET_ARN)) as GithubSetting;
}

export async function getGithubPrivateKey(): Promise<string> {
  if (!process.env.GITHUB_PRIVATE_KEY_SECRET_ARN) {
    throw new Error('Missing GITHUB_PRIVATE_KEY_SECRET_ARN environment variable');
  }
  return await getSecretValue(process.env.GITHUB_PRIVATE_KEY_SECRET_ARN);
}

export const handler = async (): Promise<void> => {
  const setting = await getGithubSetting();
  const privateKey = await getGithubPrivateKey();

  // Authenticate as the app itself (no installation id), which is what the
  // app webhook delivery endpoints expect.
  const octokit = new Octokit({
    authStrategy: createAppAuth,
    auth: {
      appId: setting.appId,
      privateKey,
    },
  });
  ...

@kichik
Member

kichik commented Jul 5, 2025

I'm not that familiar with the different auth options for GitHub apps, but this is my local version of the Lambda that worked, reusing the existing GitHub secret and GitHub private key secret:

Ah, that explains it. Your local version did the right thing by just using private key authentication. The pushed code used our getOctokit(), which also tries to switch over to the installation context.
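
For reference, the distinction in @octokit/auth-app terms (a sketch with placeholder IDs, not our getOctokit() code):

// Sketch with placeholder IDs: "app" auth (a short-lived JWT) vs "installation"
// auth (a repo-scoped token). The /app/hook/deliveries endpoints used for
// listing and redelivering webhook deliveries only accept the app JWT.
import { createAppAuth } from '@octokit/auth-app';

async function demo(): Promise<void> {
  const auth = createAppAuth({
    appId: 12345, // placeholder
    privateKey: '-----BEGIN RSA PRIVATE KEY-----\n...', // placeholder
  });

  // Authenticate as the app itself: this is what the webhook delivery APIs need.
  const appAuth = await auth({ type: 'app' });

  // Authenticate as an installation: right for most repo/org calls, but not
  // accepted by the app webhook endpoints.
  const installationAuth = await auth({ type: 'installation', installationId: 67890 });
}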

@kichik
Member

kichik commented Jul 5, 2025

I think this is the final version. Would you be able to test it?

@kichik kichik changed the title from "feat: new lambda function to check and retry failed webhook deliveries" to "feat: Automatically retry failed webhook deliveries" on Jul 6, 2025
@jackie-linz
Contributor Author

I think this is the final version. Would you be able to test it?

Happy to test it with our deployment if you can publish a beta version - just the Node.js package would be enough.

@kichik
Member

kichik commented Jul 7, 2025

I don't have that option, but the build artifact does contain a tar.gz that npm can install.

@jackie-linz
Contributor Author

Deployed and ran successfully twice.

The log content looks fine too; will keep an eye on it to compare the behaviour between this and my own one (currently both are actively running).

It'd be nice to provide Logs Insights queries like you already do for the others.

@jackie-linz
Contributor Author

jackie-linz commented Jul 7, 2025

Performance comparison

Mine: [screenshot]

New one: [screenshot]

Looks like the refactored implementation is quite a lot slower, likely due to it needing to fetch 1 hour's worth of deliveries instead of 5 minutes' 😞

So while the refactored implementation avoided calling the delivery detail API, it had a much smaller effect than expected, because that API call was only made for failed deliveries, which don't happen that often (we had only 18 in the past week, 8 of which were installation_repositories events that should not fail anymore with this change).

On the other hand, the retrieval of 60 minutes of delivery history (instead of just 5 minutes) happens every single run, increasing the typical run time by about 12 times.

@jackie-linz
Contributor Author

By the way, we do have quite a number of repos, and there are between 2,000 and 7,000 webhook deliveries per hour during work hours.

And here's the Lambda run duration comparison between the two for this morning:

New: [duration chart]

Old: [duration chart]

@kichik
Member

kichik commented Jul 8, 2025

Thanks!

It'd be nice to provide Logs Insights queries like you already do for the others.

Done.

Looks like the refactored implementation is quite a lot slower, likely due to it needing to fetch 1 hour's worth of deliveries instead of 5 minutes' 😞

My hope was that not calling GitHub API to get delivery information for each delivery would outweigh that. Seems like you have too many deliveries to make that happen 😅

Is the price still basically zero or do we need to bring back the SSM or something similar?

@jackie-linz
Contributor Author

jackie-linz commented Jul 8, 2025

My hope was that not calling GitHub API to get delivery information for each delivery would outweigh that. Seems like you have too many deliveries to make that happen 😅

It's more about the extremely low probability of delivery failures than the volume of deliveries.

In our setup there were only 10 failures out of 221,573 webhook deliveries in the past week, which is less than 0.005%.

Is the price still basically zero or do we need to bring back the SSM or something similar?

Price is not that big a concern - an average of 24s per run means <$0.5 per month.

On the other hand, the extended execution time could potentially be a concern if it exceeds the limit of 4.5 minutes. With our setup it peaked around 54 seconds today, so it is possible for larger orgs to run into problems with the time limit.

We could probably achieve better performance without the need to bring back the SSM, by using a global variable as an in-memory cache.

The cache could store the last checked delivery id, like in my implementation, and optionally a map of known delivery summaries, like in your implementation.

The cache does get cleared every few hours when the Lambda gets a cold start, but we can just fall back to retrieving the past 5 minutes of data.
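
Something like this (a sketch; checkDeliveries and the variable names are hypothetical):

// Module-level state survives warm invocations of the same Lambda instance
// and simply starts empty again after a cold start.
let lastCheckedDeliveryId: number | undefined;

const COLD_START_LOOKBACK_MS = 5 * 60 * 1000; // after a cold start, look back 5 minutes

export const handler = async (): Promise<void> => {
  // Warm invocation: walk deliveries newest-first until we hit the cached id.
  // Cold start: no cached id, so fall back to a 5-minute time window.
  const newestSeenId = await checkDeliveries({
    stopAtId: lastCheckedDeliveryId,
    since: lastCheckedDeliveryId === undefined ? Date.now() - COLD_START_LOOKBACK_MS : undefined,
  });

  if (newestSeenId !== undefined) {
    lastCheckedDeliveryId = newestSeenId; // remember where to stop next run
  }
};

// Hypothetical helper: iterates deliveries, retries failures, and returns the
// newest delivery id it saw (or undefined if there were none).
declare function checkDeliveries(opts: { stopAtId?: number; since?: number }): Promise<number | undefined>;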

@kichik
Member

kichik commented Jul 8, 2025

It's more about the extremely low probability of delivery failures than the volume of deliveries.

In our setup there were only 10 failures out of 221,573 webhook deliveries in the past week, which is less than 0.005%.

Yep. I missed that it worked on the already-filtered list and wasn't doing the filtering itself.

We could probably achieve better performance without the need to bring back the SSM, by using a global variable as an in-memory cache.

Oh man, I like that! I think it will have to store the known failed delivery summaries, otherwise it won't be able to know if a failed delivery is too old to retry. So maybe a map of failed delivery GUIDs to original delivery dates.

@kichik
Member

kichik commented Jul 14, 2025

I didn't get to test much and unit tests are still AI slop. If you can help me confirm latency is down + redelivery still works as expected, that'd be really appreciated.

@kichik
Member

kichik commented Jul 15, 2025

Thanks. Addressed feedback.

But I also realized failures is unbounded. We would need to return successful redeliveries from newDeliveryFailures too. Or maybe just TTL items there.

This also makes me think more about a case with too many failed deliveries, like maybe a misconfigured webhook for a few hours. The Lambda might fail in a loop and reset its cache between each run. It would be nice if we could use the same SQS trick as the idle reaper: it returns the message to the queue on failure and uses SQS delivery delay and visibility timeout to ensure the message is retested every 10 minutes. But there is no constant message we can throw back on the queue here, and we can't even check if the redelivery was successful without iterating the whole list. There is no delivery GUID -> delivery id / success API.

@jackie-linz
Contributor Author

Thanks. Addressed feedback.

👍

But I also realized failures is unbounded. We would need to return successful redeliveries from newDeliveryFailures too. Or maybe just TTL items there.

The Lambda typically gets a cold start every few hours, so I don't expect failures to grow too much.

If you are worried, we could expire failures whose firstDeliveredAt is too old at the end of each run.
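
E.g. something along these lines at the end of each run (a sketch; the 1-hour TTL and the names are illustrative):

// Prune cached failures that are too old to retry anyway, so the in-memory map
// cannot grow without bound between cold starts.
const MAX_FAILURE_AGE_MS = 60 * 60 * 1000; // illustrative 1-hour TTL

interface CachedFailure {
  firstDeliveredAt: number; // epoch ms of the original failed delivery
}

const failures = new Map<string, CachedFailure>(); // keyed by delivery GUID

function pruneExpiredFailures(now: number = Date.now()): void {
  for (const [guid, failure] of failures) {
    if (now - failure.firstDeliveredAt > MAX_FAILURE_AGE_MS) {
      failures.delete(guid); // deleting during iteration is safe for a Map
    }
  }
}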

This also makes me think more about a case with too many failed deliveries, like maybe a misconfigured webhook for a few hours. The Lambda might fail in a loop and reset its cache between each run. It would be nice if we could use the same SQS trick as the idle reaper: it returns the message to the queue on failure and uses SQS delivery delay and visibility timeout to ensure the message is retested every 10 minutes. But there is no constant message we can throw back on the queue here, and we can't even check if the redelivery was successful without iterating the whole list. There is no delivery GUID -> delivery id / success API.

You mean too many redeliver calls could cause the Lambda to exceed the 4.5-minute timeout? How about doing the redeliveries in parallel with Promise.all?

@kichik
Member

kichik commented Jul 15, 2025

If you are worried, we could expire failures whose firstDeliveredAt is too old at the end of each run.

Probably what we will end up doing.

You mean too many redeliver calls could cause the Lambda to exceed the 4.5-minute timeout? How about doing the redeliveries in parallel with Promise.all?

That and/or failures from GitHub redelivery API due to too many requests at once. Using Promise.all() would make that last one worse. The error handling of large batches of redeliveries is not ideal here.

@jackie-linz
Contributor Author

That and/or failures from GitHub redelivery API due to too many requests at once. Using Promise.all() would make that last one worse.

I noticed that you are using @octokit/rest instead of octokit. octokit has built-in throttling and retry that implement all of their best practices. It would even handle a Promise.all to avoid hitting rate limits.

Was there a reason that it's not used?

The error handling of large batches of redeliveries is not ideal here.

According to their best practices, mutation requests (POST/PUT/etc.) should have a 1s wait in between, so within 4.5 minutes we can send at most 270 redelivery requests.

So yes, the current implementation does not handle a total-failure scenario where there are more than 200 failures every 5 minutes.

Using a queue is one option - but we still need to take care of the rate limit.

Alternatively, I think it's ok to not retry when there are more than, say, 100 failures found. I think it's reasonable to assume the problem is not transient in that case.
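
For reference, wiring the same throttling and retry behaviour into @octokit/rest would look roughly like this (a sketch with placeholder auth; the real code authenticates as the app):

// @octokit/rest plus the throttling and retry plugins that the octokit
// meta-package bundles by default, so bursts of redelivery POSTs back off
// instead of tripping the secondary rate limit.
import { Octokit } from '@octokit/rest';
import { retry } from '@octokit/plugin-retry';
import { throttling } from '@octokit/plugin-throttling';

const ThrottledOctokit = Octokit.plugin(throttling, retry);

const octokit = new ThrottledOctokit({
  auth: process.env.GITHUB_TOKEN, // placeholder auth
  throttle: {
    onRateLimit: (retryAfter, options, kit, retryCount) => {
      kit.log.warn(`Rate limit hit for ${options.method} ${options.url}, retrying in ${retryAfter}s`);
      return retryCount < 2; // retry at most twice
    },
    onSecondaryRateLimit: (retryAfter, options, kit) => {
      kit.log.warn(`Secondary rate limit hit for ${options.method} ${options.url}`);
      return true; // retry after the suggested delay
    },
  },
});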

@jackie-linz
Contributor Author

@kichik are you able to fix the build so I can download the new build artifact to test?

@kichik
Member

kichik commented Jul 17, 2025

@kichik are you able to fix the build so I can download the new build artifact to test?

Done.

I noticed that you are using @octokit/rest instead of octokit. octokit has built-in throttling and retry that implement all of their best practices. It would even handle a Promise.all to avoid hitting rate limits.

Was there a reason that it's not used?

I think there was. But I'm not sure what it was. I feel like it was either bundle size, or maybe something was missing like app authentication.

Alternatively, I think it's ok to not retry when there are more than, say, 100 failures found. I think it's reasonable to assume the problem is not transient in that case.

Yeah probably what we have is already better than nothing. I will keep thinking for a while about the SQS option. If no proper solution comes up, we can go with the current one.

I do want to get the unit tests proper because there are a few corner cases I want to check. For example: what happens when we get a cold start and the delivery log has a failed delivery plus a successful redelivery? Will it redeliver?

@jackie-linz
Contributor Author

Deployed the latest code and performance-wise it looks good.

[duration chart]

Note I made 2 deployments, as indicated by the red marker on the chart - the second time changing a few of the debug logs to info.

Looks like the first execution takes about 10s, with subsequent executions taking around 1-4 seconds.

Will monitor logs to see if redeliveries happen as expected.

@kichik
Member

kichik commented Jul 18, 2025

Note I made 2 deployments, as indicated by the red marker on the chart - the second time changing a few of the debug logs to info.

You can uncomment this line to get debug logs:

https://github.com/jackie-linz/cdk-github-runners/blob/67cc182e77283dde1fb1bc174da869ef643742d9/src/webhook-redelivery.ts#L47

Looks like the first execution takes about 10s, with subsequent executions taking around 1-4 seconds.

That feels weirdly high? Is that what it took for your original code too? It needs 4 seconds to go over 5 minutes of deliveries?
