zyla commented Oct 10, 2025

Summary

Fixes 401 "Bad credentials" errors in long-running CI jobs by implementing automatic GitHub App token refresh with file-based credential caching.

Problem

GitHub App installation access tokens expire after 1 hour. Previously, taskrunner created a token once at startup and cached it in an environment variable, so the token was never renewed. In CI jobs that ran longer than 1 hour, GitHub API calls made after the token expired failed with:

CommitStatus: Response: {
  "message": "Bad credentials",
  "documentation_url": "https://docs.github.com/rest",
  "status": "401"
}

Solution

Implemented automatic token refresh with the following components:

File-Based Credentials Cache

  • Tokens stored in {stateDirectory}/.github-token-cache.json
  • Includes both token and ISO8601 expiration timestamp
  • Shared across parent and child processes via file system
  • Uses file locking to prevent concurrent refresh
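
As a rough illustration of the cache entry described above, the sketch below pairs a token with its expiry and round-trips the timestamp through ISO 8601 text using the `time` package. The `CachedToken` record, its field names, the JSON key names, and the hand-rolled rendering are all assumptions made for illustration; the PR description does not show the actual schema or serialization code, and file locking is left out here.

```haskell
import Data.Time.Calendar (fromGregorian)
import Data.Time.Clock (UTCTime (..), secondsToDiffTime)
import Data.Time.Format.ISO8601 (iso8601ParseM, iso8601Show)

-- Hypothetical shape of one cache entry; the field names are assumed.
data CachedToken = CachedToken
  { cachedToken :: String
  , cachedExpiresAt :: UTCTime
  }
  deriving (Eq, Show)

-- Render the entry as a small JSON object. `show` is used for quoting,
-- which is adequate only for plain ASCII tokens; a real implementation
-- would more likely use a JSON library.
renderCache :: CachedToken -> String
renderCache c =
  "{\"token\":" ++ show (cachedToken c)
    ++ ",\"expiresAt\":" ++ show (iso8601Show (cachedExpiresAt c))
    ++ "}"

-- Parse an ISO 8601 timestamp back into a UTCTime, if it is well formed.
parseExpiry :: String -> Maybe UTCTime
parseExpiry = iso8601ParseM

-- Example entry with a made-up token, expiring 37800 seconds into 2025-10-10 (UTC).
exampleEntry :: CachedToken
exampleEntry =
  CachedToken "ghs_example" (UTCTime (fromGregorian 2025 10 10) (secondsToDiffTime 37800))
```

Storing the expiry as text keeps the file readable by any process that shares the state directory, at the cost of a parse step that can fail, which is why `parseExpiry` returns a `Maybe`.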

Auto-Refresh Logic

  • Checks token expiration before each API call
  • Refreshes if token expires within 5 minutes
  • Both parent and child processes can refresh
  • Removes old environment variable approach
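
The expiry check in the list above can be sketched as a small pure function. This is a minimal illustration, not the code from `src/CommitStatus.hs`: the names `needsRefresh`, `defaultThreshold`, and `exampleNow` are hypothetical, and only the 5-minute margin comes from the description.

```haskell
import Data.Time.Calendar (fromGregorian)
import Data.Time.Clock (NominalDiffTime, UTCTime (..), addUTCTime, diffUTCTime)

-- True when the token has less than `threshold` of lifetime remaining,
-- which includes the case where it has already expired.
needsRefresh :: NominalDiffTime -> UTCTime -> UTCTime -> Bool
needsRefresh threshold now expiresAt =
  diffUTCTime expiresAt now < threshold

-- The five-minute margin mentioned in the description, in seconds.
defaultThreshold :: NominalDiffTime
defaultThreshold = 300

-- A fixed reference instant used only for the examples below.
exampleNow :: UTCTime
exampleNow = UTCTime (fromGregorian 2025 10 10) 0

-- A token with ten minutes left is reused; one with four minutes left is refreshed.
refreshWithTenMinutesLeft, refreshWithFourMinutesLeft :: Bool
refreshWithTenMinutesLeft = needsRefresh defaultThreshold exampleNow (addUTCTime 600 exampleNow)
refreshWithFourMinutesLeft = needsRefresh defaultThreshold exampleNow (addUTCTime 240 exampleNow)
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the check deterministic and easy to test.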

Changes Made

  1. src/Types.hs - Added expiresAt :: UTCTime field to GithubClient
  2. src/CommitStatus.hs - Implemented cache read/write and refresh logic
  3. test/FakeGithubApi.hs - Added configurable token lifetime for testing
  4. test/Spec.hs - Added # github token lifetime N test directive
  5. test/t/slow/github-token-refresh.txt - New test verifying refresh behavior

Test Plan

  • ✅ All existing tests pass
  • ✅ New test github-token-refresh verifies token refresh after expiration
    • Sets 2-second token lifetime
    • Sleeps 3 seconds between API calls
    • Confirms multiple token requests occur
  • ✅ Tested with both parent and child processes
  • ✅ Verified cache file locking prevents concurrent refresh

🤖 Generated with Claude Code

zyla and others added 4 commits October 10, 2025 09:30
GitHub App installation access tokens expire after 1 hour, causing 401
"Bad credentials" errors in long-running CI jobs. This change implements
automatic token refresh with a file-based cache shared across processes.

Changes:
- Add expiresAt field to GithubClient to track token expiration
- Implement credentials cache file (.github-token-cache.json) for sharing
  tokens between parent and child processes
- Auto-refresh tokens when they expire within 5 minutes
- Remove old environment variable approach
- Add test infrastructure for configurable token lifetime
- Add new test verifying token refresh after expiration

The cache file stores tokens with ISO8601 expiration timestamps and uses
file locking to prevent concurrent refresh. Child processes read from the
cache, and any process can refresh if the token is expired.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Replace mixed SHARED/EXCLUSIVE locking approach with a single EXCLUSIVE
lock for all credential operations. This eliminates the need for
double-check locking and simplifies the code significantly.

Changes:
- Remove readCredentialsCache and writeCredentialsCache functions
- Add loadOrRefreshClient that acquires EXCLUSIVE lock for entire operation
- Add tryReadCache helper (no locking, caller holds lock)
- Add refreshToken helper (creates token and writes cache under lock)
- Refactor initClient → createTokenFromGitHub (just creates token)
- Fast path in getClient checks IORef without locks

Benefits:
- No double-check locking needed
- Prevents thundering herd (only one process refreshes at a time)
- No torn reads (all file operations under EXCLUSIVE lock)
- Simpler code with clear locking boundaries
- Lock duration is brief (~100ms for HTTP call)
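
The locking structure described in this commit (a lock-free fast path on an `IORef`, then a slow path that does everything under one exclusive lock) might look roughly like the sketch below. It assumes a separate lock file and uses `hLock` from `GHC.IO.Handle.Lock` in `base`; the PR does not say which locking primitive or which file is actually locked. The names `withExclusiveLock`, `loadOrRefresh`, `getCached`, and `demo` are hypothetical, and the cache read, freshness check, and refresh are passed in as stand-in actions rather than real file or HTTP operations.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import GHC.IO.Handle.Lock (LockMode (ExclusiveLock), hLock)
import System.IO (IOMode (WriteMode), withFile)

-- Run an action while holding an exclusive lock on `lockPath`.
-- The lock is released when withFile closes the handle. This is meant for
-- coordination between processes; threads inside one process would need
-- separate coordination (for example an MVar).
withExclusiveLock :: FilePath -> IO a -> IO a
withExclusiveLock lockPath action =
  withFile lockPath WriteMode $ \h -> do
    hLock h ExclusiveLock
    action

-- Slow path: under the lock, reuse the cached value if it is still fresh,
-- otherwise refresh. All three actions run while the lock is held.
loadOrRefresh
  :: FilePath       -- lock file (assumed to be separate from the cache file)
  -> IO (Maybe a)   -- read the cache; no locking, the caller holds the lock
  -> (a -> IO Bool) -- is the cached value still fresh?
  -> IO a           -- create a new token and write it to the cache
  -> IO a
loadOrRefresh lockPath readCache isFresh refresh =
  withExclusiveLock lockPath $ do
    cached <- readCache
    case cached of
      Just c -> do
        fresh <- isFresh c
        if fresh then pure c else refresh
      Nothing -> refresh

-- Fast path: consult the in-process IORef first, without any file lock.
getCached :: IORef (Maybe a) -> (a -> IO Bool) -> IO a -> IO a
getCached ref isFresh slowPath = do
  current <- readIORef ref
  ok <- maybe (pure False) isFresh current
  case current of
    Just c | ok -> pure c
    _ -> do
      c <- slowPath
      writeIORef ref (Just c)
      pure c

-- Tiny demonstration with stand-in actions instead of real HTTP calls.
demo :: IO String
demo = do
  ref <- newIORef Nothing
  let slow = loadOrRefresh "example-token.lock" (pure Nothing) (const (pure True)) (pure "fresh-token")
  getCached ref (const (pure True)) slow
```

Because the freshness check is repeated inside the lock, a process that waited while another one refreshed picks up the new cached token instead of requesting its own.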

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
The test was calling snapshot twice within a single task, which is
illegal (snapshot writes to a single IORef that can only hold one
value). Fixed by removing the second snapshot call and relying on
the automatic final status update when the task completes.

Test flow now:
1. snapshot posts "pending" status (creates and caches token)
2. Task sleeps 3 seconds (token expires after 2 seconds)
3. Task completes, final "success" status is posted automatically
   (detects expired cached token and refreshes it)

This properly tests token refresh while following the one-snapshot-per-task rule.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
The hardcoded 300-second (5-minute) refresh threshold was causing the
github-token-refresh test to fail. With a 2-second token lifetime, fresh
tokens were immediately considered "expiring" (2s < 300s threshold),
triggering unnecessary refreshes on every getClient() call.

Changes:
- Add githubTokenRefreshThresholdSeconds to Settings (default: 300)
- Read TASKRUNNER_GITHUB_TOKEN_REFRESH_THRESHOLD_SECONDS env var
- Use configurable threshold in getClient and loadOrRefreshClient
- Set threshold to 1 second in github-token-refresh test
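
A minimal sketch of reading the threshold from the environment is shown below. Only the variable name and the default of 300 come from the commit message; `parseThreshold`, `readThreshold`, and the policy of falling back to the default on unparsable or negative input are assumptions.

```haskell
import System.Environment (lookupEnv)
import Text.Read (readMaybe)

-- Default refresh threshold in seconds, as stated in the commit message.
defaultThresholdSeconds :: Int
defaultThresholdSeconds = 300

-- Interpret the raw environment value. Falls back to the default when the
-- variable is unset, not an integer, or negative (fallback policy assumed).
parseThreshold :: Maybe String -> Int
parseThreshold raw =
  case raw >>= readMaybe of
    Just n | n >= 0 -> n
    _ -> defaultThresholdSeconds

-- Read the setting from the process environment.
readThreshold :: IO Int
readThreshold =
  parseThreshold <$> lookupEnv "TASKRUNNER_GITHUB_TOKEN_REFRESH_THRESHOLD_SECONDS"
```

Splitting the pure parsing step from the `lookupEnv` call lets the parsing rules be checked without touching the real environment.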

This allows the test to verify proper token refresh behavior:
- Fresh tokens (2s remaining >= 1s threshold) are reused
- Expired tokens trigger refresh
- Test now passes with 2 token requests instead of 4
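
The arithmetic behind the drop from 4 to 2 token requests can be reproduced with a small simulation. This is a simplified model, not the test harness: times are whole seconds, `countTokenRequests` is a hypothetical helper, and the call pattern `[0, 0, 3, 3]` (two client lookups before the sleep and two after) is an assumed schedule chosen to match the 2-second lifetime and 3-second sleep described in the PR.

```haskell
-- Count how many token requests a sequence of client lookups triggers.
-- A cached token is reused only while its remaining lifetime is at least
-- `threshold`; otherwise a new token is requested at that moment.
countTokenRequests :: Int -> Int -> [Int] -> Int
countTokenRequests lifetime threshold = go Nothing
  where
    go _ [] = 0
    go cached (t : ts) =
      case cached of
        Just expiresAt | expiresAt - t >= threshold -> go cached ts
        _ -> 1 + go (Just (t + lifetime)) ts

-- Assumed lookup times in seconds since the task started.
assumedCalls :: [Int]
assumedCalls = [0, 0, 3, 3]

-- With the old 300-second threshold every lookup refreshes;
-- with a 1-second threshold only the first lookup and the one
-- after the token has expired do.
requestsWithOldThreshold, requestsWithNewThreshold :: Int
requestsWithOldThreshold = countTokenRequests 2 300 assumedCalls
requestsWithNewThreshold = countTokenRequests 2 1 assumedCalls
```

Under these assumptions the old threshold yields 4 requests and the 1-second threshold yields 2, which is consistent with the counts reported above.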

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
zyla requested review from jborkowski and kozak on October 10, 2025 11:51

kozak left a comment

LG

zyla merged commit 05b9687 into main on Oct 10, 2025
1 check passed