pcied: Redis Performance Improvements #702
Open
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
This PR implements significant performance optimization for Redis database interaction by implementing caching and Redis pipeline batching to minimize redundant database writes. There are 2 key changes involved :
The daemon now uses
RedisPipeline, which instead of sending 1 request separately each time, batches multiple Redis request together and flushes them at the end of each monitoring cycle if any requests still remain. This reduces the number of round trips to the Redis database.Change detection and Caching mechanism (
self.pcied_cache) was added to track each PCIe device's ID and AER statistics. Currently writes to Redis DB even though no values have changed. The code now compares current values against cached values and only writes to the database when changes are detected. This prevents unnecessary database writes when device state hasn't changed.To support the above changes, six new test cases were implemented to thoroughly validate the caching behavior :
test_cache_device_id_changedverifies that when device ID changes, the daemon ensures it writes it to the database and caches the new value.test_cache_device_id_unchangedverifies that when a device ID hasn't changed, the daemon doesn't write it to the database again, confirming the caching optimization works for device IDs.test_cache_aer_stats_unchangedensures that when AER statistics remain the same across checks, no database write occurs, validating the AER stats caching logic.test_cache_aer_stats_changedconfirms that when AER statistics do change, the daemon correctly detects the change and updates both the database and cache with the new values.test_cache_device_removalvalidates that when a device is removed (ID becomes None), the cache entry is properly deleted and no database operations are attempted.test_cache_multiple_devicestests the scenario with multiple devices to ensure the cache correctly maintains independent state for each device and only updates the database for devices whose state has changed.Additionally, existing tests were updated to include cache-related assertions and to mock the new
redisPipeline.flush()call.Motivation and Context
Why RedisPipeline and Batching :
Currently, the pcie daemon creates and maintains a separate db connection internally for each table used and an additional initial connection to state_db as shown here
Table ConstructorandRedisPipeline ConstructorSo in total we maintain 4 separate Redis database connections in pcied which is waste of Redis and system resources. Moreover, since pcied purely a single process executing all requests in a sequential manner, we don't need multiple connections. Just one connection shared across all the tables is sufficient.
Also in the existing flow, individual
set()calls are made for each device and their attribute. Rather we can collect all device data in batches using RedisPipeline and then do a single bulk update, which will be faster.We can also optimize database cleanup operations, where we do multiple individual delete operations in
__del__method for each table key across all tables. Rather we can also batch the delete operations using RedisPipeline for faster cleanup during daemon shutdown, which will be quick in signal handling and reboot scenarios.Why change detection and caching :
In the current access pattern of pcied, every 60 seconds in the main loop, the daemon performs writes to the
PCIE_DEVICEtable for each detected PCIe device (settingidand multipleAERstats fields viaFieldValuePairs, potentially one set() call per device but with multiple fields). It then performs a single write to thePCIE_DEVICEStable for the overall status. In a scenario with many PCIe devices, this results in a high number of per-device write operations, set() calls, to Redis every cycle, even if data hasn't changed, leading to repeated overwrites and increased Redis load. For a small number of devices, the pattern is less burdensome but still involves unconditional writes without change detection.So, introduced change detection and caching such that it will batch write to DB only if the values change.
Why Batch size decision of 10 :
Overall benefit :
How Has This Been Tested?
The changes were tested on both virtual and physical platforms. Performance benchmarking was conducted to compare the baseline and the updated implementation, capturing relevant metrics across all database operations.
Additional Information (Optional)