
Add intelligent file-level result caching system for faster repeated scans and CI/CD workflows #4801

@OmAmbole009

Description


Currently, ScanCode performs a full scan on every run, even when files have not changed or when the same scan is run repeatedly in CI/CD pipelines. This leads to unnecessarily long scan times and wasted CI resources, especially for medium-to-large codebases.

Problem Statement

  1. No incremental scanning: Every scan processes all files, even if only a few changed
  2. No result caching: Identical files are rescanned in subsequent runs
  3. CI/CD inefficiency: GitHub Actions, GitLab CI, and Jenkins jobs waste time rescanning unchanged code
  4. Poor developer experience: Local development scans take too long for iterative work

Proposed Solution

Implement a file-level result caching system with the following features:

1. Content-based Cache System

  • Cache scan results per file using content hash (SHA256 of file content + ScanCode version + enabled scan options)
  • Store cached results in ~/.cache/scancode/ or custom --cache-dir location
  • Automatic cache invalidation when ScanCode version or scan options change
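As a rough sketch, the cache key described above could be computed like this (function and parameter names are hypothetical, not existing ScanCode APIs):

```python
import hashlib
import json

def compute_cache_key(file_path, scancode_version, scan_options):
    """Hash file content plus ScanCode version plus scan options.

    Mixing the version and options into the key means an upgrade or a
    changed option set automatically invalidates old entries.
    """
    hasher = hashlib.sha256()
    # Hash the file content in chunks to keep memory use flat.
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            hasher.update(chunk)
    hasher.update(scancode_version.encode("utf-8"))
    # Serialize options deterministically so key order does not matter.
    hasher.update(json.dumps(scan_options, sort_keys=True).encode("utf-8"))
    return hasher.hexdigest()
```

Because the options are serialized with sorted keys, two runs with the same options in a different order still hit the same cache entry.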

2. CLI Integration

Add new command-line options:

--cache # Enable result caching (default: disabled)
--cache-dir PATH # Specify custom cache directory
--force-reindex # Force full rescan ignoring cache
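A minimal sketch of how these flags could be declared; argparse is used here only to keep the example self-contained, while the real integration would go through ScanCode's click-based plugin option system:

```python
import argparse

# Illustrative only: mirrors the three proposed flags.
parser = argparse.ArgumentParser(prog="scancode")
parser.add_argument("--cache", action="store_true",
                    help="Enable result caching (default: disabled)")
parser.add_argument("--cache-dir", metavar="PATH",
                    help="Specify a custom cache directory")
parser.add_argument("--force-reindex", action="store_true",
                    help="Force a full rescan, ignoring the cache")

args = parser.parse_args(["--cache", "--cache-dir", "/tmp/scancode-cache"])
```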

3. Smart Cache Management

  • Automatic cleanup of stale cache entries
  • Configurable cache size limits
  • Cache statistics reporting (hit rate, time saved)
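One possible eviction policy for the cleanup and size-limit points above is age-based pruning plus oldest-first eviction; the function below is a hypothetical sketch, with illustrative default limits:

```python
import os
import time

def prune_cache(cache_dir, max_bytes=500 * 1024 * 1024, max_age_days=30):
    """Delete stale entries, then evict oldest-first to fit a size budget.

    Returns the total size in bytes of the entries kept.
    """
    now = time.time()
    entries = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if not os.path.isfile(path):
            continue
        st = os.stat(path)
        # Entries untouched for longer than max_age_days are stale: drop them.
        if now - st.st_mtime > max_age_days * 86400:
            os.remove(path)
            continue
        entries.append((st.st_mtime, st.st_size, path))
    total = sum(size for _, size, _ in entries)
    # Evict oldest entries first until the cache fits under the budget.
    for _, size, path in sorted(entries):
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size
    return total
```

The same walk that prunes entries could also feed the proposed statistics reporting (entry count, total size, hit rate).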

4. CI/CD Optimization

  • Cache survives between CI runs when using GitHub Actions cache or similar
  • Show performance metrics in scan output (e.g., "Cache hits: 450/500 files, saved 5.2 minutes")
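For example, a GitHub Actions workflow could persist the proposed cache directory with the standard actions/cache action; this fragment is a sketch (the cache key hashing on requirements.txt is just one reasonable invalidation choice):

```yaml
- uses: actions/cache@v4
  with:
    path: ~/.cache/scancode
    key: scancode-cache-${{ runner.os }}-${{ hashFiles('**/requirements.txt') }}
    restore-keys: |
      scancode-cache-${{ runner.os }}-
- run: scancode --license --copyright --cache --json-pp results.json src/
```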

Expected Benefits

  • 40-70% faster scans on repeated runs with few file changes
  • Reduced CI/CD costs through shorter job times
  • Better developer experience for local iterative scanning
  • Backward compatible: Cache disabled by default, no breaking changes

Implementation Notes

  • Use existing scancode.resource and scancode.api infrastructure
  • Follow similar pattern to existing license index caching in licensedcode/cache.py
  • Store cached results as JSON per-file in organized directory structure
  • Add comprehensive tests for cache hit/miss scenarios and invalidation
  • Update documentation with caching best practices
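The per-file JSON storage mentioned above could use an object-store-style layout, sharding entries by the first two hex characters of the cache key to keep directories small; all names here are hypothetical:

```python
import json
import os

def cache_path(cache_dir, cache_key):
    # e.g. <cache_dir>/ab/ab3f...e9.json
    return os.path.join(cache_dir, cache_key[:2], cache_key + ".json")

def store_result(cache_dir, cache_key, scan_result):
    path = cache_path(cache_dir, cache_key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(scan_result, f)

def load_result(cache_dir, cache_key):
    path = cache_path(cache_dir, cache_key)
    if not os.path.exists(path):
        return None  # cache miss
    with open(path) as f:
        return json.load(f)
```

This mirrors the pattern already used for the license index cache in licensedcode/cache.py, but at per-file granularity.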

Use Cases

  1. PR workflows: Scan only changed files in pull requests
  2. Incremental CI: Faster builds when only a few files change
  3. Local development: Quick rescans while fixing license issues
  4. Monorepo support: Cache results across multiple scans of shared dependencies

I'm interested in implementing this as part of my GSoC 2026 contribution and would appreciate feedback on this approach.
