Description
Currently, ScanCode performs full scans on every run, even when files haven't changed or when the same scan is run multiple times in CI/CD pipelines. This results in unnecessarily long scan times and wasted CI resources, especially for medium-to-large codebases.
Problem Statement
- No incremental scanning: Every scan processes all files, even if only a few changed
- No result caching: Identical files are rescanned in subsequent runs
- CI/CD inefficiency: GitHub Actions, GitLab CI, and Jenkins jobs waste time rescanning unchanged code
- Poor developer experience: Local development scans take too long for iterative work
Proposed Solution
Implement a file-level result caching system with the following features:
1. Content-based Cache System
- Cache scan results per file using content hash (SHA256 of file content + ScanCode version + enabled scan options)
- Store cached results in `~/.cache/scancode/` or a custom `--cache-dir` location
- Automatic cache invalidation when the ScanCode version or scan options change
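The content-based key described above could be computed roughly as follows. This is a minimal sketch; `compute_cache_key` is a hypothetical helper, not an existing ScanCode API, and the chunk size is an arbitrary illustrative choice.

```python
import hashlib
import json

def compute_cache_key(path, scancode_version, scan_options):
    """Build a cache key that changes whenever the file content,
    the ScanCode version, or the enabled scan options change."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash the file in chunks to avoid loading large files into memory.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            sha256.update(chunk)
    # Fold the version and a canonical form of the options into the digest,
    # so a version bump or option change invalidates all prior entries.
    sha256.update(scancode_version.encode("utf-8"))
    sha256.update(json.dumps(scan_options, sort_keys=True).encode("utf-8"))
    return sha256.hexdigest()
```

Serializing the options with `sort_keys=True` keeps the key stable regardless of the order in which options were passed on the command line.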
2. CLI Integration
Add new command-line options:
```
--cache            # Enable result caching (default: disabled)
--cache-dir PATH   # Specify custom cache directory
--force-reindex    # Force full rescan ignoring cache
```
3. Smart Cache Management
- Automatic cleanup of stale cache entries
- Configurable cache size limits
- Cache statistics reporting (hit rate, time saved)
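The automatic cleanup and size limits could work along these lines. This is a sketch under assumptions: `prune_cache` is a hypothetical function, the entries are assumed to be per-file JSON files under the cache directory, and both limit values are illustrative defaults rather than proposed settings.

```python
import os
import time
from pathlib import Path

def prune_cache(cache_dir, max_age_days=30, max_bytes=500 * 1024 * 1024):
    """Evict stale entries past the age limit, then evict the oldest
    remaining entries until the cache fits under the size limit."""
    now = time.time()
    entries = []
    for path in Path(cache_dir).rglob("*.json"):
        st = path.stat()
        if now - st.st_mtime > max_age_days * 86400:
            path.unlink()  # stale: not touched within the age window
        else:
            entries.append((st.st_mtime, st.st_size, path))
    # Oldest first; drop entries until the total size is under the limit.
    entries.sort()
    total = sum(size for _, size, _ in entries)
    for _, size, path in entries:
        if total <= max_bytes:
            break
        path.unlink()
        total -= size
```

Using file modification times as the eviction signal keeps the cleanup pass stateless: no separate index of entry ages needs to be maintained.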
4. CI/CD Optimization
- Cache survives between CI runs when using GitHub Actions cache or similar
- Show performance metrics in scan output (e.g., "Cache hits: 450/500 files, saved 5.2 minutes")
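A file-level lookup with the hit/miss accounting needed for that kind of reporting could look like this. `ScanCache`, its sharded directory layout, and the stats string are all assumptions for illustration, not existing ScanCode APIs.

```python
import json
from pathlib import Path

class ScanCache:
    """Hypothetical per-file result cache with hit/miss accounting."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.hits = 0
        self.misses = 0

    def _entry_path(self, key):
        # Shard by the first two hex chars of the key to keep any
        # single directory from growing too large.
        return self.cache_dir / key[:2] / f"{key}.json"

    def get(self, key):
        path = self._entry_path(key)
        if path.exists():
            self.hits += 1
            return json.loads(path.read_text())
        self.misses += 1
        return None

    def put(self, key, result):
        path = self._entry_path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(result))

    def stats(self):
        total = self.hits + self.misses
        rate = self.hits / total if total else 0.0
        return f"Cache hits: {self.hits}/{total} files ({rate:.0%})"
```

A counter-based `stats()` method is enough to produce the scan-output summary line; measuring time saved would additionally require recording per-file scan durations on cache misses.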
Expected Benefits
- 40-70% faster scans on repeated runs with few file changes
- Reduced CI/CD costs through shorter job times
- Better developer experience for local iterative scanning
- Backward compatible: Cache disabled by default, no breaking changes
Implementation Notes
- Use the existing `scancode.resource` and `scancode.api` infrastructure
- Follow a pattern similar to the existing license index caching in `licensedcode/cache.py`
- Store cached results as JSON per file in an organized directory structure
- Add comprehensive tests for cache hit/miss scenarios and invalidation
- Update documentation with caching best practices
Use Cases
- PR workflows: Scan only changed files in pull requests
- Incremental CI: Faster builds when only a few files change
- Local development: Quick rescans while fixing license issues
- Monorepo support: Cache results across multiple scans of shared dependencies
I'm interested in implementing this as part of my GSoC 2026 contribution and would appreciate feedback on this approach.