Performance Issues with EvictingMap in Worker Filesystem Operations
Summary
NativeLink workers experience significant performance bottlenecks when using `evicting_map.rs` for filesystem-based storage operations. The primary issues stem from excessive lock contention in high-frequency cache operations and expensive file I/O performed while holding global locks.
Environment
- Component: `nativelink-util/src/evicting_map.rs`
- Context: Worker filesystem operations with fast_slow store architecture
- Workload: High-concurrency cache access patterns typical in worker environments
Performance Issues Identified
1. Global Lock Contention in Cache Operations
Problem: The `EvictingMap` uses a single `Mutex<State<K, T>>` that serializes all cache operations, creating a significant bottleneck under high concurrency.
Affected Methods:
- `get()` - called frequently by workers; waits extensively on `self.state.lock().await`
- `sizes_for_keys()` - batch operations hold the lock during the entire sequential processing
- `evict_items()` - the expensive eviction process blocks all other cache operations
Profiling Data:
- Excessive time spent in `self.state.lock().await` across multiple threads
- Lock wait times dominate actual processing time
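For illustration, here is a minimal sketch of the single-global-lock layout described above. All names are hypothetical simplifications, and a synchronous `std::sync::Mutex` stands in for the async mutex; the point is that even a read-only `get()` must take the exclusive lock, because a real LRU `get` also updates recency metadata:

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

// Hypothetical simplification: one Mutex guards the entire LRU state,
// so every reader and writer serializes on it.
struct State<K: Ord, T> {
    lru: BTreeMap<K, T>,
}

struct EvictingMapSketch<K: Ord, T> {
    state: Mutex<State<K, T>>, // single global lock: the contention point
}

impl<K: Ord, T: Clone> EvictingMapSketch<K, T> {
    fn new() -> Self {
        Self { state: Mutex::new(State { lru: BTreeMap::new() }) }
    }

    // Even a lookup takes the exclusive lock, serializing all callers.
    fn get(&self, key: &K) -> Option<T> {
        let state = self.state.lock().unwrap();
        state.lru.get(key).cloned()
    }

    fn insert(&self, key: K, value: T) {
        let mut state = self.state.lock().unwrap();
        state.lru.insert(key, value);
    }
}
```

Under N concurrent workers, every call funnels through that one mutex, which matches the profiling picture of lock waits dominating processing time.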
2. Expensive File I/O Operations Under Lock
Problem: File entry `unref()` operations perform expensive filesystem I/O while holding the global cache mutex, amplifying lock contention.
Impact:
- `state.remove()` calls `entry.unref()` synchronously while holding the lock
- File deletion/cleanup operations block all cache access
- Eviction becomes increasingly expensive as the cache fills up
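A minimal sketch of the safer pattern: drain the victims under the lock, release it, and only then run the expensive cleanup. `Entry` and `expensive_unref` are hypothetical stand-ins for the real file entry and its `unref()`, not NativeLink's actual API:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical entry whose cleanup is expensive (stands in for file deletion).
struct Entry {
    size: u64,
}

// In the real store this would be filesystem I/O (e.g. removing the file);
// here it just reports the bytes freed.
fn expensive_unref(entry: Entry) -> u64 {
    entry.size
}

// Collect the evicted entries while holding the lock, then drop the lock
// before performing any cleanup, so other cache users are not blocked on I/O.
fn evict_deferred(cache: &Mutex<HashMap<String, Entry>>, victims: &[String]) -> u64 {
    let drained: Vec<Entry> = {
        let mut map = cache.lock().unwrap();
        victims.iter().filter_map(|k| map.remove(k)).collect()
    }; // lock released here, before any expensive work
    drained.into_iter().map(expensive_unref).sum()
}
```

The lock is held only for the cheap `remove()` calls; the cleanup cost is paid outside the critical section.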
3. Tight Coupling Between Worker and Filesystem Store
Problem: Workers are tightly bound to the `filesystem_store` implementation in the fast_slow store architecture, limiting optimization opportunities.
Consequences:
- Cannot easily implement store-specific optimizations
- Filesystem-specific operations (file I/O) mixed with generic cache logic
- Difficult to implement alternative storage backends with different performance characteristics
Performance Impact
Under high-concurrency workloads typical in worker environments:
- Cache Thrashing: frequent `get()` calls serialize on the global mutex
- Eviction Bottleneck: cache eviction becomes increasingly expensive as storage fills
- I/O Blocking: File operations block all cache access, reducing overall throughput
- Scalability Limits: Performance degrades significantly with increased worker concurrency
Proposed Solutions
Short-term (Minimal Changes)
- Implement fast paths in `get()` to skip unnecessary eviction calls
- Batch eviction operations to reduce lock acquisition frequency
- Separate expensive async operations from lock-protected sections
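The fast-path idea can be sketched with an atomic byte counter that lets `get()` decide whether an eviction pass is needed at all without touching the mutex. `Budget` is a hypothetical helper for illustration, not existing NativeLink code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical fast-path helper: total cache size is tracked in an atomic,
// so the common case (cache under budget) skips the lock-protected
// eviction pass entirely.
struct Budget {
    max_bytes: u64,
    current_bytes: AtomicU64,
}

impl Budget {
    // Cheap, lock-free check callable from the get() hot path.
    fn needs_eviction(&self) -> bool {
        self.current_bytes.load(Ordering::Relaxed) > self.max_bytes
    }

    // Called when entries are inserted; eviction adjusts it downward.
    fn add(&self, n: u64) {
        self.current_bytes.fetch_add(n, Ordering::Relaxed);
    }
}
```

The check is approximate (the counter may lag slightly behind the locked state), which is acceptable because eviction thresholds are heuristics rather than hard invariants.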
Medium-term (Architectural Improvements)
- Implement async-aware eviction patterns to perform I/O outside locks
- Consider segmented locking or lock-free data structures for better concurrency
- Decouple filesystem operations from cache management logic
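Segmented locking could look roughly like the following sketch, which hashes keys to independent shards so operations on different shards never contend. This is illustrative only; a real LRU would also need per-shard recency tracking and a cross-shard eviction policy:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// Sketch of segmented locking: N independent shards, each with its own lock.
struct ShardedMap<K, V> {
    shards: Vec<Mutex<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V> ShardedMap<K, V> {
    fn new(n: usize) -> Self {
        Self { shards: (0..n).map(|_| Mutex::new(HashMap::new())).collect() }
    }

    // Pick the shard for a key by hashing it.
    fn shard(&self, key: &K) -> &Mutex<HashMap<K, V>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % self.shards.len()]
    }

    fn insert(&self, key: K, value: V) {
        self.shard(&key).lock().unwrap().insert(key, value);
    }

    fn get_cloned(&self, key: &K) -> Option<V>
    where
        V: Clone,
    {
        self.shard(key).lock().unwrap().get(key).cloned()
    }
}
```

With N shards, two operations contend only when their keys hash to the same shard, so expected contention drops roughly by a factor of N compared to a single global mutex.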
Long-term (Design Changes)
- Redesign worker-store coupling to allow store-specific optimizations
- Implement pluggable cache backends optimized for different storage types
- Consider alternative cache architectures (e.g., per-worker caches, distributed caching)
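A pluggable cache backend might be sketched as a trait the worker codes against, with storage-specific implementations behind it. All names here are hypothetical, intended only to show the decoupling, not to propose a concrete API:

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical backend trait: the worker depends on this interface,
// while each backend (filesystem, in-memory, remote) supplies its own
// storage and eviction behavior.
trait CacheBackend {
    fn put(&mut self, key: String, data: Vec<u8>);
    fn get(&self, key: &str) -> Option<&[u8]>;
    // Evict one entry by this backend's policy (FIFO here, for simplicity).
    fn evict_one(&mut self) -> Option<String>;
}

struct InMemoryBackend {
    order: VecDeque<String>,
    data: HashMap<String, Vec<u8>>,
}

impl InMemoryBackend {
    fn new() -> Self {
        Self { order: VecDeque::new(), data: HashMap::new() }
    }
}

impl CacheBackend for InMemoryBackend {
    fn put(&mut self, key: String, data: Vec<u8>) {
        self.order.push_back(key.clone());
        self.data.insert(key, data);
    }

    fn get(&self, key: &str) -> Option<&[u8]> {
        self.data.get(key).map(|v| v.as_slice())
    }

    fn evict_one(&mut self) -> Option<String> {
        let key = self.order.pop_front()?;
        self.data.remove(&key);
        Some(key)
    }
}
```

A filesystem-backed implementation of the same trait could then perform its I/O-heavy eviction however suits the medium, without that logic leaking into generic cache code.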
Reproduction
The performance issues are most apparent under:
- High worker concurrency (multiple workers accessing cache simultaneously)
- Large cache sizes approaching `max_bytes` limits (triggering frequent eviction)
- Filesystem stores with slower I/O characteristics
Related Code
- `nativelink-util/src/evicting_map.rs` - core cache implementation
- Worker filesystem integration code
- fast_slow store architecture implementation
This issue significantly impacts NativeLink's scalability in high-concurrency worker environments and should be prioritized for performance optimization efforts.