Performance Issues with EvictingMap in Worker Filesystem Operations #1866

@zpzjzj

Summary

NativeLink workers hit significant performance bottlenecks when EvictingMap (nativelink-util/src/evicting_map.rs) backs filesystem-based storage. The two main causes are lock contention on high-frequency cache operations and expensive file I/O performed while the global lock is held.

Environment

  • Component: nativelink-util/src/evicting_map.rs
  • Context: Worker filesystem operations with fast_slow store architecture
  • Workload: High-concurrency cache access patterns typical in worker environments

Performance Issues Identified

1. Global Lock Contention in Cache Operations

Problem: The EvictingMap uses a single Mutex<State<K, T>> that serializes all cache operations, creating a significant bottleneck under high concurrency.

Affected Methods:

  • get() - Called frequently by workers; spends most of its time waiting on self.state.lock().await
  • sizes_for_keys() - Batch operations hold the lock for the entire sequential pass over the keys
  • evict_items() - The expensive eviction process blocks all other cache operations

Profiling Observations:

  • Excessive time spent in self.state.lock().await across multiple threads
  • Lock wait times dominate actual processing time

2. Expensive File I/O Operations Under Lock

Problem: File entry unref() operations perform expensive filesystem I/O while holding the global cache mutex, amplifying lock contention.

Impact:

  • state.remove() calls entry.unref() synchronously while the lock is held
  • File deletion/cleanup operations block all cache access
  • Eviction becomes increasingly expensive as cache fills up
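
To illustrate the alternative, here is a minimal sketch of moving cleanup out of the critical section. It is not the actual evicting_map.rs code: `Entry`, `State`, and `Cache` are simplified, hypothetical stand-ins, `std::sync::Mutex` replaces the async mutex so the example is self-contained, and the victim selection ignores LRU order. The only point is the two-phase shape: update bookkeeping under the lock, then call unref() after the guard is dropped.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a file-backed cache entry.
struct Entry {
    size: u64,
    /// Counts unref() calls so the behavior can be observed.
    unref_calls: Arc<AtomicUsize>,
}

impl Entry {
    /// Stands in for the expensive cleanup (for example, deleting a file).
    fn unref(&self) {
        self.unref_calls.fetch_add(1, Ordering::SeqCst);
    }
}

struct State {
    map: HashMap<String, Entry>,
    sum_bytes: u64,
}

struct Cache {
    state: Mutex<State>,
    max_bytes: u64,
}

impl Cache {
    fn new(max_bytes: u64) -> Self {
        Cache {
            state: Mutex::new(State { map: HashMap::new(), sum_bytes: 0 }),
            max_bytes,
        }
    }

    fn insert(&self, key: &str, entry: Entry) {
        // Phase 1: bookkeeping only, under the lock. Removed entries are
        // collected instead of being cleaned up here.
        let evicted: Vec<Entry> = {
            let mut state = self.state.lock().unwrap();
            let mut evicted = Vec::new();
            state.sum_bytes += entry.size;
            if let Some(old) = state.map.insert(key.to_string(), entry) {
                state.sum_bytes -= old.size;
                evicted.push(old);
            }
            // Evict any entry other than the new one until under budget.
            // (A real implementation would pick the least recently used.)
            while state.sum_bytes > self.max_bytes {
                let victim = state.map.keys().find(|k| k.as_str() != key).cloned();
                match victim {
                    Some(k) => {
                        let e = state.map.remove(&k).unwrap();
                        state.sum_bytes -= e.size;
                        evicted.push(e);
                    }
                    None => break,
                }
            }
            evicted
        }; // The guard is dropped here, so the lock is released.

        // Phase 2: expensive cleanup with no lock held.
        for e in &evicted {
            e.unref();
        }
    }
}
```

With this shape, other threads can call into the cache while the removed entries are being cleaned up. The trade-off is that an evicted entry stays alive briefly after it has left the map, so any on-disk cleanup has to tolerate a concurrent re-insert of the same key.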

3. Tight Coupling Between Worker and Filesystem Store

Problem: Workers are tightly bound to filesystem_store implementation in the fast_slow store architecture, limiting optimization opportunities.

Consequences:

  • Cannot easily implement store-specific optimizations
  • Filesystem-specific operations (file I/O) mixed with generic cache logic
  • Difficult to implement alternative storage backends with different performance characteristics

Performance Impact

Under high-concurrency workloads typical in worker environments:

  1. Lock Contention: Frequent get() calls serialize on the global mutex
  2. Eviction Bottleneck: Cache eviction becomes increasingly expensive as storage fills
  3. I/O Blocking: File operations block all cache access, reducing overall throughput
  4. Scalability Limits: Performance degrades significantly with increased worker concurrency

Proposed Solutions

Short-term (Minimal Changes)

  • Implement fast paths in get() to skip unnecessary eviction calls
  • Batch eviction operations to reduce lock acquisition frequency
  • Separate expensive async operations from lock-protected sections
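
As a rough sketch of the first item, the following shows a get() that runs eviction only when the cache is actually over budget. All names (`Slot`, `State`, `Cache`, `evict_runs`) are hypothetical and the real evict_items() has more conditions than a byte budget; `std::sync::Mutex` is used in place of the async mutex to keep the example self-contained.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct Slot {
    size: u64,
    last_access: u64,
}

struct State {
    map: HashMap<String, Slot>,
    sum_bytes: u64,
    /// Logical clock used for LRU ordering.
    tick: u64,
    /// Counts eviction passes so the fast path can be observed.
    evict_runs: usize,
}

impl State {
    /// Removes least recently used slots until the byte budget is met.
    fn evict_items(&mut self, max_bytes: u64) {
        self.evict_runs += 1;
        while self.sum_bytes > max_bytes {
            let oldest = self
                .map
                .iter()
                .min_by_key(|(_, slot)| slot.last_access)
                .map(|(key, _)| key.clone());
            let Some(key) = oldest else { break };
            if let Some(slot) = self.map.remove(&key) {
                self.sum_bytes -= slot.size;
            }
        }
    }
}

struct Cache {
    state: Mutex<State>,
    max_bytes: u64,
}

impl Cache {
    fn new(max_bytes: u64) -> Self {
        let state = State { map: HashMap::new(), sum_bytes: 0, tick: 0, evict_runs: 0 };
        Cache { state: Mutex::new(state), max_bytes }
    }

    fn insert(&self, key: &str, size: u64) {
        let mut state = self.state.lock().unwrap();
        state.tick += 1;
        let slot = Slot { size, last_access: state.tick };
        if let Some(old) = state.map.insert(key.to_string(), slot) {
            state.sum_bytes -= old.size;
        }
        state.sum_bytes += size;
        state.evict_items(self.max_bytes);
    }

    fn get(&self, key: &str) -> Option<u64> {
        let mut state = self.state.lock().unwrap();
        // Fast path: skip the eviction pass unless the budget is exceeded.
        if state.sum_bytes > self.max_bytes {
            state.evict_items(self.max_bytes);
        }
        state.tick += 1;
        let tick = state.tick;
        let slot = state.map.get_mut(key)?;
        slot.last_access = tick;
        Some(slot.size)
    }

    fn evict_runs(&self) -> usize {
        self.state.lock().unwrap().evict_runs
    }
}
```

The check is a single integer comparison, so a get() on a cache that is within budget does no eviction work at all. This does not remove the lock acquisition itself; it only shortens the time the lock is held.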

Medium-term (Architectural Improvements)

  • Implement async-aware eviction patterns to perform I/O outside locks
  • Consider segmented locking or lock-free data structures for better concurrency
  • Decouple filesystem operations from cache management logic
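
Segmented locking could look roughly like the sketch below, where keys are hashed to one of N independently locked shards. `ShardedMap` is a hypothetical type, not part of NativeLink, and eviction is left out: a sharded design would need either per-shard byte budgets or a shared atomic counter, which is a design decision this sketch does not make.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// A map split into independently locked shards (hypothetical type).
struct ShardedMap<V> {
    shards: Vec<Mutex<HashMap<String, V>>>,
}

impl<V: Clone> ShardedMap<V> {
    fn new(shard_count: usize) -> Self {
        assert!(shard_count > 0, "at least one shard is required");
        let shards = (0..shard_count).map(|_| Mutex::new(HashMap::new())).collect();
        ShardedMap { shards }
    }

    /// Picks the shard for a key from its hash.
    fn shard_index(&self, key: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.shards.len()
    }

    fn insert(&self, key: &str, value: V) -> Option<V> {
        let index = self.shard_index(key);
        // Only this shard is locked; other shards stay available.
        let mut shard = self.shards[index].lock().unwrap();
        shard.insert(key.to_string(), value)
    }

    fn get(&self, key: &str) -> Option<V> {
        let index = self.shard_index(key);
        let shard = self.shards[index].lock().unwrap();
        shard.get(key).cloned()
    }

    /// Total entry count; locks each shard in turn, never all at once.
    fn len(&self) -> usize {
        self.shards.iter().map(|s| s.lock().unwrap().len()).sum()
    }
}
```

Two operations contend only when their keys land in the same shard, so contention should drop roughly in proportion to the shard count for uniformly distributed keys. Global LRU ordering is lost in exchange; each shard can only order its own entries.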

Long-term (Design Changes)

  • Redesign worker-store coupling to allow store-specific optimizations
  • Implement pluggable cache backends optimized for different storage types
  • Consider alternative cache architectures (e.g., per-worker caches, distributed caching)
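
One possible shape for the decoupling is a hook trait that the cache calls after an entry has left the map, so the store decides what cleanup means. Everything here is hypothetical (`EvictionHandler`, `DropOnly`, `RecordingFsHandler`, `GenericCache`) and is meant only to show where the boundary could sit, not how NativeLink should define it.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical hook: the store decides what cleanup means for a value.
trait EvictionHandler<V>: Send + Sync {
    fn on_evict(&self, key: &str, value: V);
}

/// For a memory-backed store, dropping the value is enough.
struct DropOnly;

impl<V> EvictionHandler<V> for DropOnly {
    fn on_evict(&self, _key: &str, _value: V) {}
}

/// Stand-in for a filesystem store: records the keys it would delete.
struct RecordingFsHandler {
    deleted: Mutex<Vec<String>>,
}

impl EvictionHandler<u64> for RecordingFsHandler {
    fn on_evict(&self, key: &str, _size: u64) {
        self.deleted.lock().unwrap().push(key.to_string());
    }
}

/// Generic cache logic with no knowledge of the storage backend.
struct GenericCache<V, H: EvictionHandler<V>> {
    entries: Mutex<HashMap<String, V>>,
    handler: H,
}

impl<V, H: EvictionHandler<V>> GenericCache<V, H> {
    fn new(handler: H) -> Self {
        GenericCache { entries: Mutex::new(HashMap::new()), handler }
    }

    fn insert(&self, key: &str, value: V) {
        // The guard is a temporary, so the lock is released at the end
        // of this statement, before the handler runs.
        let replaced = self.entries.lock().unwrap().insert(key.to_string(), value);
        if let Some(old) = replaced {
            self.handler.on_evict(key, old);
        }
    }

    fn remove(&self, key: &str) -> bool {
        let removed = self.entries.lock().unwrap().remove(key);
        match removed {
            Some(value) => {
                self.handler.on_evict(key, value);
                true
            }
            None => false,
        }
    }
}
```

Because the handler always runs after the lock is released, a slow backend cannot stall unrelated cache operations, and a backend with cheap cleanup pays nothing for the abstraction.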

Reproduction

The performance issues are most apparent under:

  • High worker concurrency (multiple workers accessing cache simultaneously)
  • Large cache sizes approaching max_bytes limits (triggering frequent eviction)
  • Filesystem stores with slower I/O characteristics
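
The first and third conditions can be approximated in isolation with a small harness like the one below. It does not use NativeLink at all: it hammers a single `std::sync::Mutex<HashMap>` from several threads and sleeps while holding the lock to simulate slow I/O. `hammer_single_lock` and `ContentionReport` are hypothetical names, and the numbers it produces only show the serialization effect, not real worker behavior.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

struct ContentionReport {
    /// Wall-clock time for the whole run.
    elapsed: Duration,
    /// Time spent waiting to acquire the lock, summed over all threads.
    total_lock_wait: Duration,
    /// Number of successful lookups.
    hits: usize,
}

/// Runs `threads` workers that each perform `ops_per_thread` lookups on one
/// shared map, holding the lock for `hold` per lookup to simulate I/O.
fn hammer_single_lock(threads: usize, ops_per_thread: usize, hold: Duration) -> ContentionReport {
    let map: Mutex<HashMap<usize, u64>> =
        Mutex::new((0..64).map(|k| (k, k as u64)).collect());
    let total_wait = Mutex::new(Duration::ZERO);
    let hits = AtomicUsize::new(0);
    let start = Instant::now();

    thread::scope(|s| {
        for t in 0..threads {
            let (map, total_wait, hits) = (&map, &total_wait, &hits);
            s.spawn(move || {
                let mut waited = Duration::ZERO;
                for i in 0..ops_per_thread {
                    let before = Instant::now();
                    let guard = map.lock().unwrap();
                    waited += before.elapsed();
                    if guard.contains_key(&((t + i) % 64)) {
                        hits.fetch_add(1, Ordering::Relaxed);
                    }
                    // Simulated slow I/O while the lock is still held.
                    thread::sleep(hold);
                    drop(guard);
                }
                *total_wait.lock().unwrap() += waited;
            });
        }
    });

    ContentionReport {
        elapsed: start.elapsed(),
        total_lock_wait: total_wait.into_inner().unwrap(),
        hits: hits.into_inner(),
    }
}
```

Because every sleep happens under the same lock, the wall-clock time cannot be shorter than threads × ops_per_thread × hold, no matter how many cores are available. Comparing total_lock_wait against elapsed for different thread counts gives a quick feel for how much of the run is spent queuing.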

Related Code

  • nativelink-util/src/evicting_map.rs - Core cache implementation
  • Worker filesystem integration code
  • fast_slow store implementation

This issue significantly impacts NativeLink's scalability in high-concurrency worker environments and should be prioritized for performance optimization efforts.

Labels

bug (Something isn't working)
