feat: add go flaky tests github action #1013

Open: @dimitarvdimitrov wants to merge 20 commits into main

@dimitarvdimitrov commented on Jun 5, 2025

Add core functionality for analyzing test failures from Loki logs.

This PR introduces the foundation of the flaky test detection bot. It is split out of #998 and is the first of three PRs. The next two, dimitarvdimitrov#1 and dimitarvdimitrov#2, add Git author tracking and GitHub issue management.

Key Features

  • Loki Integration: Query Loki using LogQL to extract test failures from CI/CD pipelines
  • Failure Analysis: Parse and aggregate test failures across branches and time periods
  • Flaky Test Classification: Identify tests that fail on main branch or across multiple branches
  • Configurable Analysis: Support for time ranges (24h, 7d, etc.) and result limits
  • Detailed Reporting: Generate JSON reports with failure counts and workflow URLs
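
As an illustration of the classification rule in the list above, here is a minimal Go sketch. It is not the code from this PR: the `Failure` and `Candidate` types and the `classifyFlaky` function are hypothetical names, and the input is assumed to be already-parsed failure records. It follows the stated rule (a test counts as flaky if it failed on the default branch or on more than one branch) and treats both `main` and `master` as default branches, as a later commit in this PR does.

```go
package main

import (
	"fmt"
	"sort"
)

// Failure is a hypothetical, simplified record of one failed test run.
type Failure struct {
	Test   string
	Branch string
}

// Candidate is a hypothetical summary of one test classified as flaky.
type Candidate struct {
	Name          string
	TotalFailures int
	Branches      []string // sorted branch names where the test failed
}

// classifyFlaky groups failures by test and keeps a test if it failed on
// main/master or on at least two distinct branches.
func classifyFlaky(failures []Failure) []Candidate {
	perTest := map[string]map[string]int{}
	for _, f := range failures {
		if perTest[f.Test] == nil {
			perTest[f.Test] = map[string]int{}
		}
		perTest[f.Test][f.Branch]++
	}

	var out []Candidate
	for name, perBranch := range perTest {
		_, onMain := perBranch["main"]
		_, onMaster := perBranch["master"]
		if !onMain && !onMaster && len(perBranch) < 2 {
			continue // failed on a single feature branch only: likely a real bug
		}
		c := Candidate{Name: name}
		for branch, n := range perBranch {
			c.Branches = append(c.Branches, branch)
			c.TotalFailures += n
		}
		sort.Strings(c.Branches)
		out = append(out, c)
	}

	// Most frequent failures first; ties are broken by name for stable output.
	sort.Slice(out, func(i, j int) bool {
		if out[i].TotalFailures != out[j].TotalFailures {
			return out[i].TotalFailures > out[j].TotalFailures
		}
		return out[i].Name < out[j].Name
	})
	return out
}

func main() {
	failures := []Failure{
		{"TestA", "main"},
		{"TestB", "feature-1"},
		{"TestB", "feature-2"},
		{"TestC", "feature-1"},
		{"TestC", "feature-1"},
	}
	for _, c := range classifyFlaky(failures) {
		fmt.Printf("%s: %d failure(s) on %v\n", c.Name, c.TotalFailures, c.Branches)
	}
}
```

In this sample, `TestC` is left out because it only ever failed on one feature branch, which more likely points to a genuine regression on that branch than to flakiness.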

Why in grafana/shared-workflows

This action is designed to work with any Go repository that meets these minimal requirements:

  • Runs unit tests in GitHub Actions
  • Logs are stored in loki-ops with the cicd-o11y namespace
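
To make the Loki requirement concrete, here is a minimal Go sketch of building a request URL for Loki's `query_range` HTTP endpoint. It is not the code from this PR: the helper name `buildQueryRangeURL`, the base URL, and the label names in the sample LogQL selector are assumptions for illustration. Only the endpoint path and its `query`, `start`, `end`, `limit`, and `direction` parameters come from Loki's HTTP API.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strconv"
	"time"
)

// buildQueryRangeURL returns a URL for Loki's /loki/api/v1/query_range
// endpoint. startNs and endNs are Unix timestamps in nanoseconds.
func buildQueryRangeURL(baseURL, logQL string, startNs, endNs int64, limit int) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", fmt.Errorf("parse base URL: %w", err)
	}
	// Keep any path prefix already present in the base URL.
	u.Path = path.Join("/", u.Path, "loki/api/v1/query_range")

	q := url.Values{}
	q.Set("query", logQL)
	q.Set("start", strconv.FormatInt(startNs, 10))
	q.Set("end", strconv.FormatInt(endNs, 10))
	q.Set("limit", strconv.Itoa(limit))
	q.Set("direction", "backward") // newest entries first
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	end := time.Now()
	start := end.Add(-24 * time.Hour) // a 24h analysis window

	// Hypothetical selector: the real label names depend on how CI logs
	// are shipped to Loki. "--- FAIL:" is the prefix `go test` prints for
	// a failed test.
	logQL := `{namespace="cicd-o11y", repo="grafana/mimir"} |= "--- FAIL:"`

	u, err := buildQueryRangeURL("https://loki.example.com", logQL, start.UnixNano(), end.UnixNano(), 1000)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u)
}
```

A real client would also attach credentials to the request and decode the JSON response; both are omitted here to keep the sketch short.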

This covers virtually all Grafana Labs repositories today. My plan is to deploy this immediately for grafana/mimir and grafana/backend-enterprise.

Future Extensibility

For non-Grafana repos or different Loki setups, the LogQL query can be made configurable via an additional input parameter, enabling adoption across any organization using Loki for CI/CD observability. Let me know if you think we should go down this route.

The action requires no repository-specific configuration: just provide Loki credentials and the repository name, and have the repository cloned locally. It automatically discovers test files, identifies recent contributors, and manages GitHub issues.

Testing

Includes a test suite with real Loki response data and golden file testing for reliable parsing of complex log structures.
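
For readers unfamiliar with golden file testing, the general pattern can be sketched as follows. This is a generic illustration, not the test helper from this PR: `checkGolden` and the file names are hypothetical. In a real Go test suite the `update` argument is commonly wired to a custom `-update` test flag.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
)

// checkGolden compares got with the golden file at path. When update is
// true, it rewrites the golden file with got instead of comparing.
func checkGolden(path string, got []byte, update bool) error {
	if update {
		if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
			return fmt.Errorf("create golden dir: %w", err)
		}
		return os.WriteFile(path, got, 0o644)
	}
	want, err := os.ReadFile(path)
	if err != nil {
		return fmt.Errorf("read golden file: %w", err)
	}
	if !bytes.Equal(got, want) {
		return fmt.Errorf("output differs from %s:\n got: %s\nwant: %s", path, got, want)
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "golden-sketch")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(dir)

	golden := filepath.Join(dir, "report.golden")
	report := []byte(`{"flaky_tests":[]}`) // stand-in for the parsed report

	// First run records the expected output, later runs compare against it.
	if err := checkGolden(golden, report, true); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("matches golden file:", checkGolden(golden, report, false) == nil)
}
```

The appeal of this approach for log parsing is that a large, realistic Loki response can be stored once as a fixture, and any change in parsed output shows up as a reviewable diff of the golden file.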

Add core functionality for analyzing test failures from Loki logs:
- Query Loki using LogQL for test failure data
- Parse and aggregate test failures across branches
- Classify tests as flaky based on failure patterns
- Generate detailed analysis reports with failure counts and workflow URLs
- Configurable time ranges and result limits

This initial implementation includes stub interfaces for Git and GitHub
functionality that will be added in subsequent PRs.

Remove stub implementations and complexity that will be added in later PRs:
- Remove GitClient and GitHubClient interfaces and stubs
- Remove FilePath and RecentCommits from FlakyTest struct
- Remove repository-directory and skip-posting-issues inputs
- Remove author tracking and issue management code paths
- Focus purely on Loki querying, log parsing, and flaky test detection

This creates a clean foundation for PR2 (Git authors) and PR3 (GitHub issues)
to build upon without unnecessary complexity in the initial implementation.

- Add MockLokiClient and MockFileSystem for testing
- Test core AnalyzeFailures() functionality with valid Loki response
- Test error handling when Loki client fails
- Test ActionReport() method for both empty and populated reports
- Test utility functions like generateSummary() and FlakyTest.String()
- Use proper Loki response format with stream metadata for test data

- Added README.md with detailed usage instructions and how-it-works
- Added CHANGELOG.md documenting features and implementation details
- Added run-local.sh script for local development and testing
- Documentation now matches original PR functionality

- Remove GitHub issue creation and management features
- Remove dry run mode references
- Remove GitHub CLI and issue template content
- Documentation now covers Loki analysis and Git author tracking

- Remove Git history analysis and author tracking features
- Remove repository-directory input reference
- Remove run-local.sh script usage, use go run directly
- Documentation now covers only basic Loki analysis functionality

- Update aggregate.go to consider both 'main' and 'master' branches as indicators of flaky tests
- Update README documentation to reflect main/master branch logic
- Add comprehensive tests for master branch detection

- Remove 'progressive PR structure' and 'technical details' from changelog

- Update action name and description
- Rename directories and update all file references
- Update module name and build paths
- Update documentation and examples

@dimitarvdimitrov requested a review from a team as a code owner on June 5, 2025, 18:36

The repository-directory parameter is not needed in the core action
and was causing documentation inconsistency.

The github-token parameter is not used in the current implementation
and was causing documentation inconsistency.

- Fix prettier formatting for action.yaml, README.md, CHANGELOG.md
- Add missing newlines at end of files
- Remove test output file that shouldn't be committed

@dimitarvdimitrov (Author)

A linter is failing and I don't quite understand why. Can a maintainer give me a hand?

@dsotirakis (Contributor)

> A linter is failing and I don't quite understand why. Can a maintainer give me a hand?

#1014
