Performance Issues with EvictingMap in Worker Filesystem Operations
Summary
NativeLink workers experience significant performance bottlenecks when using `evicting_map.rs` for filesystem-based storage operations. The primary issues stem from excessive lock contention in high-frequency cache operations and expensive file I/O performed while holding global locks.
Environment
- Component: `nativelink-util/src/evicting_map.rs`
- Context: Worker filesystem operations with fast_slow store architecture
- Workload: High-concurrency cache access patterns typical in worker environments
Performance Issues Identified
1. Global Lock Contention in Cache Operations
Problem: The `EvictingMap` uses a single `Mutex<State<K, T>>` that serializes all cache operations, creating a significant bottleneck under high concurrency.
Affected Methods:
- `get()` - called frequently by workers; waits extensively on `self.state.lock().await`
- `sizes_for_keys()` - batch operations hold the lock during the entire sequential processing
- `evict_items()` - the expensive eviction process blocks all other cache operations
Profiling Data:
- Excessive time spent in `self.state.lock().await` across multiple threads
- Lock wait times dominate actual processing time
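For illustration, here is a minimal sketch of the single-global-lock layout described above. All names are hypothetical simplifications, and a synchronous `std::sync::Mutex` stands in for the async mutex; the point is that even a read-only `get()` must take the exclusive lock, because a real LRU `get` also updates recency metadata:

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

// Hypothetical simplification: one Mutex guards the entire LRU state,
// so every reader and writer serializes on it.
struct State<K: Ord, T> {
    lru: BTreeMap<K, T>,
}

struct EvictingMapSketch<K: Ord, T> {
    state: Mutex<State<K, T>>, // single global lock: the contention point
}

impl<K: Ord, T: Clone> EvictingMapSketch<K, T> {
    fn new() -> Self {
        Self { state: Mutex::new(State { lru: BTreeMap::new() }) }
    }

    // Even a lookup takes the exclusive lock, serializing all callers.
    fn get(&self, key: &K) -> Option<T> {
        let state = self.state.lock().unwrap();
        state.lru.get(key).cloned()
    }

    fn insert(&self, key: K, value: T) {
        let mut state = self.state.lock().unwrap();
        state.lru.insert(key, value);
    }
}
```

Under N concurrent workers, every call funnels through that one mutex, which matches the profiling picture of lock waits dominating processing time.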
2. Expensive File I/O Operations Under Lock
Problem: File entry `unref()` operations perform expensive filesystem I/O while holding the global cache mutex, amplifying lock contention.
Impact:
- `state.remove()` calls `entry.unref()` synchronously while holding the lock
- File deletion/cleanup operations block all cache access
- Eviction becomes increasingly expensive as the cache fills up
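A minimal sketch of the safer pattern: drain the victims under the lock, release it, and only then run the expensive cleanup. `Entry` and `expensive_unref` are hypothetical stand-ins for the real file entry and its `unref()`, not NativeLink's actual API:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical entry whose cleanup is expensive (stands in for file deletion).
struct Entry {
    size: u64,
}

// In the real store this would be filesystem I/O (e.g. removing the file);
// here it just reports the bytes freed.
fn expensive_unref(entry: Entry) -> u64 {
    entry.size
}

// Collect the evicted entries while holding the lock, then drop the lock
// before performing any cleanup, so other cache users are not blocked on I/O.
fn evict_deferred(cache: &Mutex<HashMap<String, Entry>>, victims: &[String]) -> u64 {
    let drained: Vec<Entry> = {
        let mut map = cache.lock().unwrap();
        victims.iter().filter_map(|k| map.remove(k)).collect()
    }; // lock released here, before any expensive work
    drained.into_iter().map(expensive_unref).sum()
}
```

The lock is held only for the cheap `remove()` calls; the cleanup cost is paid outside the critical section.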
3. Tight Coupling Between Worker and Filesystem Store
Problem: Workers are tightly bound to the `filesystem_store` implementation in the fast_slow store architecture, limiting optimization opportunities.
Consequences:
- Cannot easily implement store-specific optimizations
- Filesystem-specific operations (file I/O) mixed with generic cache logic
- Difficult to implement alternative storage backends with different performance characteristics
Performance Impact
Under high-concurrency workloads typical in worker environments:
- Cache Thrashing: frequent `get()` calls serialize on the global mutex
- Eviction Bottleneck: cache eviction becomes increasingly expensive as storage fills
- I/O Blocking: File operations block all cache access, reducing overall throughput
- Scalability Limits: Performance degrades significantly with increased worker concurrency
Proposed Solutions
Short-term (Minimal Changes)
- Implement fast paths in `get()` to skip unnecessary eviction calls
- Batch eviction operations to reduce lock acquisition frequency
- Separate expensive async operations from lock-protected sections
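The fast-path idea can be sketched with an atomic byte counter that lets `get()` decide whether an eviction pass is needed at all without touching the mutex. `Budget` is a hypothetical helper for illustration, not existing NativeLink code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical fast-path helper: total cache size is tracked in an atomic,
// so the common case (cache under budget) skips the lock-protected
// eviction pass entirely.
struct Budget {
    max_bytes: u64,
    current_bytes: AtomicU64,
}

impl Budget {
    // Cheap, lock-free check callable from the get() hot path.
    fn needs_eviction(&self) -> bool {
        self.current_bytes.load(Ordering::Relaxed) > self.max_bytes
    }

    // Called when entries are inserted; eviction adjusts it downward.
    fn add(&self, n: u64) {
        self.current_bytes.fetch_add(n, Ordering::Relaxed);
    }
}
```

The check is approximate (the counter may lag slightly behind the locked state), which is acceptable because eviction thresholds are heuristics rather than hard invariants.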
Medium-term (Architectural Improvements)
- Implement async-aware eviction patterns to perform I/O outside locks
- Consider segmented locking or lock-free data structures for better concurrency
- Decouple filesystem operations from cache management logic
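Segmented locking could look roughly like the following sketch, which hashes keys to independent shards so operations on different shards never contend. This is illustrative only; a real LRU would also need per-shard recency tracking and a cross-shard eviction policy:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// Sketch of segmented locking: N independent shards, each with its own lock.
struct ShardedMap<K, V> {
    shards: Vec<Mutex<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V> ShardedMap<K, V> {
    fn new(n: usize) -> Self {
        Self { shards: (0..n).map(|_| Mutex::new(HashMap::new())).collect() }
    }

    // Pick the shard for a key by hashing it.
    fn shard(&self, key: &K) -> &Mutex<HashMap<K, V>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % self.shards.len()]
    }

    fn insert(&self, key: K, value: V) {
        self.shard(&key).lock().unwrap().insert(key, value);
    }

    fn get_cloned(&self, key: &K) -> Option<V>
    where
        V: Clone,
    {
        self.shard(key).lock().unwrap().get(key).cloned()
    }
}
```

With N shards, two operations contend only when their keys hash to the same shard, so expected contention drops roughly by a factor of N compared to a single global mutex.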
Long-term (Design Changes)
- Redesign worker-store coupling to allow store-specific optimizations
- Implement pluggable cache backends optimized for different storage types
- Consider alternative cache architectures (e.g., per-worker caches, distributed caching)
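A pluggable cache backend might be sketched as a trait the worker codes against, with storage-specific implementations behind it. All names here are hypothetical, intended only to show the decoupling, not to propose a concrete API:

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical backend trait: the worker depends on this interface,
// while each backend (filesystem, in-memory, remote) supplies its own
// storage and eviction behavior.
trait CacheBackend {
    fn put(&mut self, key: String, data: Vec<u8>);
    fn get(&self, key: &str) -> Option<&[u8]>;
    // Evict one entry by this backend's policy (FIFO here, for simplicity).
    fn evict_one(&mut self) -> Option<String>;
}

struct InMemoryBackend {
    order: VecDeque<String>,
    data: HashMap<String, Vec<u8>>,
}

impl InMemoryBackend {
    fn new() -> Self {
        Self { order: VecDeque::new(), data: HashMap::new() }
    }
}

impl CacheBackend for InMemoryBackend {
    fn put(&mut self, key: String, data: Vec<u8>) {
        self.order.push_back(key.clone());
        self.data.insert(key, data);
    }

    fn get(&self, key: &str) -> Option<&[u8]> {
        self.data.get(key).map(|v| v.as_slice())
    }

    fn evict_one(&mut self) -> Option<String> {
        let key = self.order.pop_front()?;
        self.data.remove(&key);
        Some(key)
    }
}
```

A filesystem-backed implementation of the same trait could then perform its I/O-heavy eviction however suits the medium, without that logic leaking into generic cache code.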
Reproduction
The performance issues are most apparent under:
- High worker concurrency (multiple workers accessing cache simultaneously)
- Large cache sizes approaching `max_bytes` limits (triggering frequent eviction)
- Filesystem stores with slower I/O characteristics
Related Code
- `nativelink-util/src/evicting_map.rs` - core cache implementation
- Worker filesystem integration code
- fast_slow store architecture implementation
This issue significantly impacts NativeLink's scalability in high-concurrency worker environments and should be prioritized for performance optimization efforts.