
Add intelligent file-level result caching system for faster repeated scans and CI/CD workflows #4801

@OmAmbole009

Description


Currently, ScanCode performs a full scan on every run, even when files have not changed or when the same scan is run repeatedly in CI/CD pipelines. This leads to unnecessarily long scan times and wasted CI resources, especially for medium-to-large codebases.

Problem Statement

  1. No incremental scanning: Every scan processes all files, even if only a few changed
  2. No result caching: Identical files are rescanned in subsequent runs
  3. CI/CD inefficiency: GitHub Actions, GitLab CI, and Jenkins jobs waste time rescanning unchanged code
  4. Poor developer experience: Local development scans take too long for iterative work

Proposed Solution

Implement a file-level result caching system with the following features:

1. Content-based Cache System

  • Cache scan results per file using content hash (SHA256 of file content + ScanCode version + enabled scan options)
  • Store cached results in ~/.cache/scancode/ or custom --cache-dir location
  • Automatic cache invalidation when ScanCode version or scan options change
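As a rough sketch, the cache key described above could be computed like this (function and parameter names are hypothetical, not existing ScanCode APIs):

```python
import hashlib
import json

def compute_cache_key(file_path, scancode_version, scan_options):
    """Hash file content plus ScanCode version plus scan options.

    Mixing the version and options into the key means an upgrade or a
    changed option set automatically invalidates old entries.
    """
    hasher = hashlib.sha256()
    # Hash the file content in chunks to keep memory use flat.
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            hasher.update(chunk)
    hasher.update(scancode_version.encode("utf-8"))
    # Serialize options deterministically so key order does not matter.
    hasher.update(json.dumps(scan_options, sort_keys=True).encode("utf-8"))
    return hasher.hexdigest()
```

Because the options are serialized with sorted keys, two runs with the same options in a different order still hit the same cache entry.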

2. CLI Integration

Add new command-line options:

--cache # Enable result caching (default: disabled)
--cache-dir PATH # Specify custom cache directory
--force-reindex # Force full rescan ignoring cache
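A minimal sketch of how these flags could be declared; argparse is used here only to keep the example self-contained, while the real integration would go through ScanCode's click-based plugin option system:

```python
import argparse

# Illustrative only: mirrors the three proposed flags.
parser = argparse.ArgumentParser(prog="scancode")
parser.add_argument("--cache", action="store_true",
                    help="Enable result caching (default: disabled)")
parser.add_argument("--cache-dir", metavar="PATH",
                    help="Specify a custom cache directory")
parser.add_argument("--force-reindex", action="store_true",
                    help="Force a full rescan, ignoring the cache")

args = parser.parse_args(["--cache", "--cache-dir", "/tmp/scancode-cache"])
```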

3. Smart Cache Management

  • Automatic cleanup of stale cache entries
  • Configurable cache size limits
  • Cache statistics reporting (hit rate, time saved)
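One possible eviction policy for the cleanup and size-limit points above is age-based pruning plus oldest-first eviction; the function below is a hypothetical sketch, with illustrative default limits:

```python
import os
import time

def prune_cache(cache_dir, max_bytes=500 * 1024 * 1024, max_age_days=30):
    """Delete stale entries, then evict oldest-first to fit a size budget.

    Returns the total size in bytes of the entries kept.
    """
    now = time.time()
    entries = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if not os.path.isfile(path):
            continue
        st = os.stat(path)
        # Entries untouched for longer than max_age_days are stale: drop them.
        if now - st.st_mtime > max_age_days * 86400:
            os.remove(path)
            continue
        entries.append((st.st_mtime, st.st_size, path))
    total = sum(size for _, size, _ in entries)
    # Evict oldest entries first until the cache fits under the budget.
    for _, size, path in sorted(entries):
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
    return total
```

The same walk that prunes entries could also feed the proposed statistics reporting (entry count, total size, hit rate).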

4. CI/CD Optimization

  • Cache survives between CI runs when using GitHub Actions cache or similar
  • Show performance metrics in scan output (e.g., "Cache hits: 450/500 files, saved 5.2 minutes")
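For example, a GitHub Actions workflow could persist the proposed cache directory with the standard actions/cache action; this fragment is a sketch (the cache key hashing on requirements.txt is just one reasonable invalidation choice):

```yaml
- uses: actions/cache@v4
  with:
    path: ~/.cache/scancode
    key: scancode-cache-${{ runner.os }}-${{ hashFiles('**/requirements.txt') }}
    restore-keys: |
      scancode-cache-${{ runner.os }}-
- run: scancode --license --copyright --cache --json-pp results.json src/
```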

Expected Benefits

  • 40-70% faster scans on repeated runs with few file changes
  • Reduced CI/CD costs through shorter job times
  • Better developer experience for local iterative scanning
  • Backward compatible: Cache disabled by default, no breaking changes

Implementation Notes

  • Use existing scancode.resource and scancode.api infrastructure
  • Follow similar pattern to existing license index caching in licensedcode/cache.py
  • Store cached results as JSON per-file in organized directory structure
  • Add comprehensive tests for cache hit/miss scenarios and invalidation
  • Update documentation with caching best practices
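The per-file JSON storage mentioned above could use an object-store-style layout, sharding entries by the first two hex characters of the cache key to keep directories small; all names here are hypothetical:

```python
import json
import os

def cache_path(cache_dir, cache_key):
    # e.g. <cache_dir>/ab/ab3f...e9.json
    return os.path.join(cache_dir, cache_key[:2], cache_key + ".json")

def store_result(cache_dir, cache_key, scan_result):
    path = cache_path(cache_dir, cache_key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(scan_result, f)

def load_result(cache_dir, cache_key):
    path = cache_path(cache_dir, cache_key)
    if not os.path.exists(path):
        return None  # cache miss
    with open(path) as f:
        return json.load(f)
```

This mirrors the pattern already used for the license index cache in licensedcode/cache.py, but at per-file granularity.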

Use Cases

  1. PR workflows: Scan only changed files in pull requests
  2. Incremental CI: Faster builds when only a few files change
  3. Local development: Quick rescans while fixing license issues
  4. Monorepo support: Cache results across multiple scans of shared dependencies

I'm interested in implementing this as part of my GSoC 2026 contribution and would appreciate feedback on this approach.
