feat: Automatically retry failed webhook deliveries #745
Conversation
Force-pushed from 6adbc96 to 030ae4e
Might be wrong dependencies. Try running …
Hmm... no luck. I've tried running

```sh
rm -rf node_modules/ assets/ dist/ lib/
yarn
npm run integ:default:snapshot
```

and there's no change to commit. I've used my personal fork this time instead of the company one, so you should be able to push to it.
I pushed a merge from …
I think I found it - the linter is changing the code, which resulted in a diff.
I did not run … Had to add the following to projenrc temporarily to get the build running 😆

```ts
jestOptions: {
  jestConfig: {
    maxWorkers: 3,
  },
},
```
I went into the code to fix up the logging a bit, ended up having to fix the app authentication mode, and then somehow more and more piled on. The commit message is hopefully descriptive enough. I ran out of time, so the unit test is AI slop that does pass but is of questionable value. Let me know what you think.
- Simplify the finding of deliveries to redeliver to speed up the process and save on GitHub calls (see the sketch after this list).
- Remove checks for event type as it's filtered when we create the webhook anyway. If we ever want more types in the future, we may forget to update the check here too.
- Remove labels check as this is checked in the webhook handler anyway. In the future it might not be, so we don't want to update that check in two places and probably forget about one of them.
- Fix app authentication for iterating webhook deliveries. No installation id is available or needed.
- Adjust logging style to match the rest of the code.
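A rough sketch of what that simplified selection could look like, assuming app-level Octokit authentication and the standard app webhook delivery endpoints (`octokit.rest.apps.listWebhookDeliveries` / `redeliverWebhookDelivery`); the one-hour cutoff, names, and structure are illustrative and not the PR's actual code:

```ts
import { createAppAuth } from '@octokit/auth-app';
import { Octokit } from 'octokit';

// Illustrative only: redeliver every delivery whose *latest* attempt in the
// last hour failed. Names and the one-hour cutoff are assumptions.
export async function retryFailedDeliveries(appId: number, privateKey: string): Promise<void> {
  const octokit = new Octokit({ authStrategy: createAppAuth, auth: { appId, privateKey } });
  const cutoff = Date.now() - 60 * 60 * 1000;
  const latest = new Map<string, { id: number; failed: boolean }>();

  // Deliveries are returned newest first, so the first attempt we see per GUID is the latest one.
  for await (const { data } of octokit.paginate.iterator(octokit.rest.apps.listWebhookDeliveries, { per_page: 100 })) {
    let done = false;
    for (const d of data) {
      if (new Date(d.delivered_at).getTime() < cutoff) {
        done = true;
        break;
      }
      if (!latest.has(d.guid)) {
        latest.set(d.guid, { id: d.id, failed: d.status !== 'OK' });
      }
    }
    if (done) break;
  }

  // Only deliveries whose most recent attempt failed get redelivered.
  for (const { id, failed } of latest.values()) {
    if (failed) {
      await octokit.rest.apps.redeliverWebhookDelivery({ delivery_id: id });
    }
  }
}
```

Because deliveries come back newest first, keeping only the first attempt seen per GUID is what makes the "no newer delivery that worked" check free, without fetching delivery details.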
I like the new implementation that removed the need to fetch delivery details.
A few minor comments for your consideration.
Thought about it some more and I'm confused by my auth fix. How did it work for you? Were you actually using a PAT? Because it seems to me that there will be no webhook access for a PAT. It's not attached to an app. The API we use is for apps only. This also means redelivery will not work for a PAT. I wonder if we can make that one work too somehow. We have no info about the webhook the user creates manually.
I'm not that familiar with the different auth options for GitHub apps, but this is my local version of the lambda that worked, reusing the existing GitHub secret and GitHub private key secret:

```ts
import { createAppAuth } from '@octokit/auth-app';
import { Octokit } from 'octokit';
// getSecretValue is the project's existing Secrets Manager helper (its import was omitted in this snippet)

export interface GithubSetting {
  domain: string;
  appId: number;
}

export async function getGithubSetting(): Promise<GithubSetting> {
  if (!process.env.GITHUB_SECRET_ARN) {
    throw new Error('Missing GITHUB_SECRET_ARN environment variable');
  }
  return JSON.parse(await getSecretValue(process.env.GITHUB_SECRET_ARN)) as GithubSetting;
}

export async function getGithubPrivateKey(): Promise<string> {
  if (!process.env.GITHUB_PRIVATE_KEY_SECRET_ARN) {
    throw new Error('Missing GITHUB_PRIVATE_KEY_SECRET_ARN environment variable');
  }
  return await getSecretValue(process.env.GITHUB_PRIVATE_KEY_SECRET_ARN);
}

export const handler = async (): Promise<void> => {
  const setting = await getGithubSetting();
  const privateKey = await getGithubPrivateKey();

  // Authenticate as the GitHub App itself (no installation id) so the app webhook APIs are available.
  const octokit = new Octokit({
    authStrategy: createAppAuth,
    auth: {
      appId: setting.appId,
      privateKey,
    },
  });
  // ...
};
```
Ah, that explains it. Your local version did the right thing of just using private key authentication. The pushed code used our …
I think this is the final version. Would you be able to test it?
Happy to test it with our deployment if you can publish a beta version - just the Node.js one would be enough.
I don't have that option, but the build artifact does contain a tar.gz that npm can install.
Performance comparison

Looks like the refactored implementation is quite a lot slower, likely due to it needing to fetch 1 hour's worth of deliveries instead of 5 minutes' worth 😞

So while the refactored implementation avoided calling the delivery detail API, it had a much smaller effect than expected, because that API call was only made for failed deliveries, which does not happen that often (we had only 18 in the past week, 8 of which were …).

On the other hand, the retrieval of 60 minutes of delivery history (instead of just 5 minutes) happens every single run, increasing the typical run time by 12 times.
Thanks!
Done.
My hope was that not calling the GitHub API to get delivery information for each delivery would outweigh that. Seems like you have too many deliveries to make that happen 😅 Is the price still basically zero, or do we need to bring back the SSM or something similar?
It's more about the extremely low probability of delivery failures than the volume of deliveries. In our setup there's only …
Price is not that big a concern - average of …

On the other hand, the extended execution time could potentially be a concern if it exceeds the limit of 4.5 minutes. With our setup it peaked around 54 seconds today, so it is possible for larger orgs to run into problems with the time limit.

We could probably achieve better performance, without the need to bring back the SSM, by using a global variable as an in-memory cache. The cache could store the last checked delivery id, like in my implementation, and optionally a map of known delivery summaries, like in your implementation. The cache does get cleared every few hours when the Lambda gets a cold start, but we can just fall back to retrieving the past 5 minutes of data.
Yep. I missed that it worked on the already-filtered list rather than doing the filtering itself.
Oh man, I like that! I think it will have to store the known failed delivery summaries, otherwise it won't be able to tell if a failed delivery is too old to retry. So maybe a map of failed delivery GUIDs to original delivery date.
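A minimal sketch of that cache idea, assuming module-scope state that survives warm invocations; the names, the five-minute cold-start fallback, and the one-hour retry limit are assumptions for illustration, not the final design:

```ts
// Module-scope state survives warm Lambda invocations and is reset on cold start.
let lastCheckedDeliveryId: number | undefined;
// Failed delivery GUID -> original delivery time, so we can tell when a failure is too old to retry.
const knownFailures = new Map<string, Date>();

const COLD_START_LOOKBACK_MS = 5 * 60 * 1000; // assumed fallback window
const MAX_RETRY_AGE_MS = 60 * 60 * 1000; // assumed retry limit

// Decide whether this run can stop paging through the delivery log.
function shouldStopAt(delivery: { id: number; delivered_at: string }, now: Date): boolean {
  if (lastCheckedDeliveryId !== undefined) {
    // Warm invocation: stop once we reach deliveries we already processed.
    return delivery.id <= lastCheckedDeliveryId;
  }
  // Cold start: the cache is empty, so just look at the last few minutes.
  return now.getTime() - new Date(delivery.delivered_at).getTime() > COLD_START_LOOKBACK_MS;
}

function isTooOldToRetry(guid: string, now: Date): boolean {
  const firstSeen = knownFailures.get(guid);
  return firstSeen !== undefined && now.getTime() - firstSeen.getTime() > MAX_RETRY_AGE_MS;
}
```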
…veries every 5 minutes it can really add up for big deployments
I didn't get to test much and the unit tests are still AI slop. If you can help me confirm latency is down + redelivery still works as expected, that'd be really appreciated.
Thanks. Addressed feedback. But I also realized …

This also makes me think more about a case with too many failed deliveries. Like maybe a misconfigured webhook for a few hours. The Lambda might fail in a loop and reset its cache between each run. It would be nice if we could use the same SQS trick as the idle reaper. It returns the message to the queue on failure and uses SQS delivery delay and visibility timeout to ensure the message is retested every 10 minutes. But there is no constant message we can throw back on the queue here, and we can't even check if the redelivery was successful without iterating the whole list. There is no delivery GUID -> delivery id / success API.
👍
Lambda typically gets a cold start every few hours, so I don't expect … If you are worried, we could expire failures with …
You mean too many …?
Probably what we will end up doing.
That, and/or failures from the GitHub redelivery API due to too many requests at once. Using …
I noticed that you are using … Was there a reason that it's not used?
According to their best practice, mutation requests (…)

So yes, the current implementation does not handle a total-failure scenario where there are more than 200 failures every 5 minutes. Using a queue is one option - but we still need to take care of the rate limit. Alternatively, I think it's ok to not retry when there are more than, say, 100 failures found. I think it's reasonable to assume the problem is not transient in that case.
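A minimal sketch of that guard, combined with spacing out the redelivery calls to stay friendly to the rate limit; the 100-failure threshold, the one-second delay, and the function name are assumptions, not the PR's implementation:

```ts
import { Octokit } from 'octokit';

// Illustrative guard: skip retries entirely when the failure count suggests a
// non-transient problem, and space out the mutation calls otherwise.
const MAX_FAILURES_TO_RETRY = 100; // assumed threshold
const DELAY_BETWEEN_REDELIVERIES_MS = 1000; // assumed spacing between mutation requests

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

export async function redeliverWithGuard(octokit: Octokit, failedDeliveryIds: number[]): Promise<void> {
  if (failedDeliveryIds.length > MAX_FAILURES_TO_RETRY) {
    console.warn(`Found ${failedDeliveryIds.length} failed deliveries; assuming a non-transient problem and skipping retries`);
    return;
  }
  for (const deliveryId of failedDeliveryIds) {
    await octokit.rest.apps.redeliverWebhookDelivery({ delivery_id: deliveryId });
    await sleep(DELAY_BETWEEN_REDELIVERIES_MS);
  }
}
```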
@kichik are you able to fix the build so I can download the new build artifact to test? |
Done.
I think there was. But I'm not sure what it was. I feel like it was either bundle size, or maybe something was missing like app authentication.
Yeah, probably what we have is already better than nothing. I will keep thinking about the SQS option for a while. If no proper solution comes up, we can go with the current one. I do want to get the unit tests proper, because there are a few corner cases I want to check. For example, what happens when we get a cold start and the delivery log has a failed delivery + a successful redelivery? Will it redeliver?
You can uncomment this line to get debug logs:
That feels like weirdly too much? Is that what it took for your original code too? It needs 4 seconds to go over 5 minutes of deliveries?
GitHub webhook deliveries can fail for many reasons. This code goes over all failed deliveries of the past hour and retries them. This should help avoid cases of jobs pending forever because a runner wasn't provisioned for them. It uses the GitHub API to iterate over all deliveries, find the ones that failed, confirm there is no newer delivery that worked, and then finally redeliver the webhook call.
This only works for app authentication. A personal access token has no webhook attached, so it's impossible for us to know which webhook deliveries to monitor. Theoretically, if we ask for enough permissions, we could go over all of the user's webhooks (from which repo/org?) and find the matching one based on our webhook URL.
This change adds a Lambda that runs every five minutes. It's an added cost, but should still be basically free: around 10,000 GB-seconds a month with 14,600 requests.
Resolves #738
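For context, the five-minute schedule would typically be wired up in CDK roughly like the sketch below; this is a generic illustration with placeholder names and an inline stub handler, not the construct's actual code:

```ts
import { Duration, Stack } from 'aws-cdk-lib';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

// Hypothetical wiring: an EventBridge rule triggers the retry Lambda every 5 minutes.
export class WebhookRetryExample extends Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    // Placeholder handler; the real project bundles its own Lambda code.
    const retryFunction = new lambda.Function(this, 'RetryFunction', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromInline('exports.handler = async () => {};'),
      timeout: Duration.seconds(270), // 4.5 minutes, the limit mentioned in the discussion
    });

    new events.Rule(this, 'RetrySchedule', {
      schedule: events.Schedule.rate(Duration.minutes(5)),
      targets: [new targets.LambdaFunction(retryFunction)],
    });
  }
}
```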