
Conversation

@GauravNagesh
Contributor

Description

This PR implements a significant performance optimization for Redis database interaction by adding caching and Redis pipeline batching to minimize redundant database writes. There are two key changes involved:

  1. The daemon now uses RedisPipeline which, instead of sending each request separately, batches multiple Redis requests together and flushes any that remain at the end of each monitoring cycle. This reduces the number of round trips to the Redis database (see the sketch after this list).

  2. A change-detection and caching mechanism (self.pcied_cache) was added to track each PCIe device's ID and AER statistics. Currently the daemon writes to the Redis DB even when no values have changed. The code now compares current values against cached values and writes to the database only when changes are detected, preventing unnecessary writes when device state hasn't changed.
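
For illustration, here is a minimal sketch of the pipeline-backed pattern, assuming the swsscommon Python bindings expose the RedisPipeline and buffered-Table constructors referenced in the Motivation section below; the batch-size constant and device attributes are illustrative:

```python
from swsscommon import swsscommon

BATCH_SIZE = 10  # hypothetical constant; see "Why a batch size of 10" below

# One shared STATE_DB connection instead of one per table
state_db = swsscommon.DBConnector("STATE_DB", 0)
pipeline = swsscommon.RedisPipeline(state_db, BATCH_SIZE)

# Buffered tables built over the shared pipeline: set() calls are queued
# and sent in batches instead of one round trip per call
device_table = swsscommon.Table(pipeline, "PCIE_DEVICE", True)
status_table = swsscommon.Table(pipeline, "PCIE_DEVICES", True)

def monitoring_cycle(devices):
    for dev in devices:  # dev.key / dev.device_id are illustrative attributes
        fvs = swsscommon.FieldValuePairs([("id", dev.device_id)])
        device_table.set(dev.key, fvs)  # queued, not sent immediately
    status_table.set("status", swsscommon.FieldValuePairs([("status", "PASSED")]))
    pipeline.flush()  # push whatever is still queued at end of cycle
```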

To support the above changes, six new test cases were implemented to thoroughly validate the caching behavior:

  1. test_cache_device_id_changed verifies that when a device ID changes, the daemon writes the new value to the database and caches it.

  2. test_cache_device_id_unchanged verifies that when a device ID hasn't changed, the daemon doesn't write it to the database again, confirming the caching optimization works for device IDs.

  3. test_cache_aer_stats_unchanged ensures that when AER statistics remain the same across checks, no database write occurs, validating the AER stats caching logic.

  4. test_cache_aer_stats_changed confirms that when AER statistics do change, the daemon correctly detects the change and updates both the database and cache with the new values.

  5. test_cache_device_removal validates that when a device is removed (ID becomes None), the cache entry is properly deleted and no database operations are attempted.

  6. test_cache_multiple_devices tests the scenario with multiple devices to ensure the cache correctly maintains independent state for each device and only updates the database for devices whose state has changed.

Additionally, existing tests were updated to include cache-related assertions and to mock the new redisPipeline.flush() call.
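
As an illustration of the pattern these tests exercise, here is a hedged sketch of the unchanged-device-ID case; the fixture helper, attribute names, and method name below are assumptions, not the actual test code:

```python
from unittest import mock

def test_cache_device_id_unchanged():
    # Hypothetical fixture wiring; the real tests construct the daemon differently
    daemon = make_pcied_daemon()  # illustrative helper
    daemon.pcied_cache = {"00:01.0": {"id": "0x1234"}}
    daemon.device_table = mock.MagicMock()

    # Observing the same ID again should hit the cache and skip Redis
    daemon.update_device_id("00:01.0", "0x1234")  # illustrative method name

    daemon.device_table.set.assert_not_called()
    assert daemon.pcied_cache["00:01.0"]["id"] == "0x1234"
```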

Motivation and Context

Why RedisPipeline and batching:
Currently, the pcie daemon creates and maintains a separate DB connection internally for each table used, plus an additional initial connection to state_db, as shown here: Table Constructor and RedisPipeline Constructor.

So in total we maintain four separate Redis database connections in pcied, which wastes Redis and system resources. Moreover, since pcied is purely a single process executing all requests sequentially, we don't need multiple connections; one connection shared across all the tables is sufficient.

Also, in the existing flow, individual set() calls are made for each device and each of its attributes. Instead, we can collect all device data in batches using RedisPipeline and then do a single bulk update, which is faster.

We can also optimize database cleanup, where the __del__ method currently performs an individual delete operation for each table key across all tables. Batching these deletes through RedisPipeline, as sketched below, makes cleanup faster during daemon shutdown, which matters in signal-handling and reboot scenarios.
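
A minimal sketch of that batched cleanup, assuming the shared pipeline and table attributes from the earlier sketch (the getKeys()/_del() usage is my assumption about the bindings):

```python
def __del__(self):
    # Queue every per-key delete on the shared pipeline instead of issuing
    # one Redis round trip per key, then send them all in one flush
    for key in self.device_table.getKeys():
        self.device_table._del(key)  # buffered: queued, not sent yet
    self.status_table._del("status")
    self.pipeline.flush()  # single batched cleanup at shutdown
```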

Why change detection and caching:
In the current access pattern of pcied, the main loop runs every 60 seconds and writes to the PCIE_DEVICE table for each detected PCIe device (setting the id and multiple AER stats fields via FieldValuePairs, potentially one set() call per device but with multiple fields). It then performs a single write to the PCIE_DEVICES table for the overall status. With many PCIe devices, this results in a high number of per-device set() calls to Redis every cycle, even when nothing has changed, leading to repeated overwrites and increased Redis load. With few devices the pattern is less burdensome, but it still performs unconditional writes without change detection.

So change detection and caching were introduced, so that writes are batched to the DB only when the values change.
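
A minimal sketch of the change-detection pattern; self.pcied_cache comes from this PR, while the method name, cache layout, and field handling below are illustrative assumptions:

```python
def update_aer_stats(self, device_key, aer_stats):
    """Write AER stats only when they differ from the cached copy."""
    cached = self.pcied_cache.get(device_key, {}).get("aer_stats")
    if cached == aer_stats:
        return  # unchanged: skip the Redis write entirely

    fvs = swsscommon.FieldValuePairs(list(aer_stats.items()))
    self.device_table.set(device_key, fvs)  # queued on the shared pipeline
    self.pcied_cache.setdefault(device_key, {})["aer_stats"] = dict(aer_stats)
```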

Why a batch size of 10:

  • Based on the average payload size of the daemon (average size of data in each set() call)
  • Based on the overall system load if all daemons use batching
  • Based on testing the effect of different batch sizes on performance

Overall benefits:

  • Reduced Redis traffic in the system by writing only on change
  • Reduced network I/O overhead by batching requests
  • Reduced Redis and system resource usage by sharing one connection through RedisPipeline rather than keeping a separate connection to Redis for each table

How Has This Been Tested?

The changes were tested on both virtual and physical platforms. Performance benchmarking was conducted to compare the baseline and the updated implementation, capturing relevant metrics across all database operations.

Additional Information (Optional)

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

