Issue Summary
OCP's persistent connection handling causes cascade failures when Redis experiences memory pressure. The only workaround is persistent=false, which defeats the performance benefits of persistent connections.
Environment
- OCP Version: 1.24.3
- PHP Version: 8.x
- Redis Configuration: Memory-constrained Redis instances
- Platform: Multi-server hosting environments
Problem Description
When Redis hits memory limits and temporarily drops connections, OCP continues using stale pconnect() handles without health validation, causing systematic cache failures across distributed application servers.
Root Cause Analysis
Code Location
File: src/Connectors/PhpRedisConnector.php
Line: 129
$method = $persistent ? 'pconnect' : 'connect';
Technical Details
- Missing Health Validation: OCP delegates connection management entirely to PhpRedis pconnect() without additional health checks
- Stale Handle Reuse: When Redis drops connections during memory pressure, PHP workers retain dead connection handles
- No Recovery Logic: Dead connections remain in the pool indefinitely, causing all subsequent commands to fail
- Cascade Amplification: Aggressive timeouts prevent Redis recovery, amplifying temporary memory pressure into permanent failures
What "Cascade Failures" Means
A cascade failure occurs when a single point of failure (Redis memory pressure) triggers a chain reaction of failures across the entire system:
- Initial Event: Redis experiences memory pressure and drops connections (normal behavior)
- Connection Pool Corruption: All PHP workers across all application servers retain stale pconnect() handles
- Synchronized Failures: Every subsequent cache request uses dead connections, causing immediate command failures
- System-wide Impact: What should be a brief Redis recovery (2-3 seconds) becomes permanent cache failure across all servers
- Feedback Loop: Cache failures increase database load, creating more system pressure
The "cascade" refers to how a localized Redis event amplifies into distributed, synchronized failures across the entire application infrastructure, rather than being contained to individual connections or servers.
Reproduction Steps
- Configure Redis with limited memory constraints
- Enable OCP with persistent=true and aggressive timeout values (e.g., 0.5s)
- Generate cache load exceeding Redis memory capacity
- Observe: MGET/GET commands fail immediately with "Redis server went away"
- Observe: Failures persist across multiple application servers simultaneously
- Observe: Setting persistent=false resolves the issue
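A configuration along the following lines would correspond to the second step. This is a sketch under assumptions: it assumes OCP reads its settings from a WP_REDIS_CONFIG array in wp-config.php, the host and port are placeholders, and other settings a real install needs (such as a license token) are omitted. Only persistent, timeout and read_timeout are taken from this report.

```php
<?php
// wp-config.php fragment (illustrative only).
define('WP_REDIS_CONFIG', [
    'host' => '127.0.0.1',   // placeholder, not from the report
    'port' => 6379,          // placeholder, not from the report
    'persistent' => true,    // reuse pconnect() handles across requests
    'timeout' => 0.5,        // aggressive connect timeout (seconds)
    'read_timeout' => 0.5,   // aggressive read timeout (seconds)
]);
```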
Evidence
Log Patterns
[11-Sep-2025 14:49:18 UTC] objectcache.error: Failed to execute `MGET` command
[11-Sep-2025 14:49:19 UTC] objectcache.error: Failed to execute `MGET` command
[11-Sep-2025 11:49:52 UTC] objectcache.error: Redis server tcp://10.73.9.238:11028 went away
Error Distribution
- Multiple application servers affected simultaneously
- Hundreds of cache initialization failures
- Commands fail at connection layer, not Redis processing
- Synchronized failure patterns indicate connection pool corruption
Proposed Fix
Immediate Solution: Connection Health Validation
// In PhpRedisConnector.php
public static function connectToInstance(Configuration $config): ConnectionInterface
{
    // ... existing connection logic ...

    // ADD: Health validation for persistent connections
    if ($persistent && $client instanceof Redis) {
        try {
            $client->ping(); // Validate connection before use
        } catch (RedisException $e) {
            // Force reconnection if health check fails
            $client->close();
            $client->{$method}(...$arguments);
        }
    }

    return new PhpRedisConnection($client, $config);
}
Enhanced Solution: Connection Pool Management
- Pre-command Health Checks: Ping Redis before executing commands on persistent connections
- Automatic Reconnection: Detect stale handles and establish fresh connections
- Connection State Tracking: Monitor connection health across requests
- Graceful Degradation: Fall back to non-persistent connections during Redis instability
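The four points above could be combined roughly as follows. This is a minimal sketch and not OCP's implementation: CacheClient, ResilientClient, FakeClient and the fallbackAfter threshold are hypothetical names introduced for illustration, and the in-memory fake stands in for a real PhpRedis handle. To avoid an extra round trip per command, the sketch validates a handle once when it is obtained rather than before every command, and relies on a single retry for connections that die later.

```php
<?php
declare(strict_types=1);

// Hypothetical contract for illustration; not part of OCP or PhpRedis.
interface CacheClient
{
    public function ping(): bool;
    public function get(string $key): ?string;
    public function close(): void;
}

// Wraps a connection factory with health validation, one retry, and fallback.
final class ResilientClient
{
    private ?CacheClient $client = null;
    private int $failures = 0;

    /** @param Closure(bool): CacheClient $connect receives the "persistent" flag */
    public function __construct(
        private Closure $connect,
        private int $fallbackAfter = 3
    ) {
    }

    public function get(string $key): ?string
    {
        for ($attempt = 0; ; $attempt++) {
            try {
                $value = $this->healthyClient()->get($key);
                $this->failures = 0; // success resets the failure streak
                return $value;
            } catch (RuntimeException $e) {
                // Discard the stale handle so the next attempt reconnects.
                $this->client?->close();
                $this->client = null;
                $this->failures++;
                if ($attempt >= 1) {
                    throw $e; // give up after a single retry
                }
            }
        }
    }

    private function healthyClient(): CacheClient
    {
        if ($this->client === null) {
            // Graceful degradation: stop requesting persistent handles
            // after repeated consecutive failures.
            $persistent = $this->failures < $this->fallbackAfter;
            $this->client = ($this->connect)($persistent);
            if (! $this->client->ping()) {
                throw new RuntimeException('Health check failed');
            }
        }
        return $this->client;
    }
}

// In-memory stand-in used only to exercise the sketch.
final class FakeClient implements CacheClient
{
    public bool $closed = false;

    public function __construct(private bool $alive)
    {
    }

    public function ping(): bool
    {
        if (! $this->alive) {
            throw new RuntimeException('Redis server went away');
        }
        return true;
    }

    public function get(string $key): ?string
    {
        return $this->ping() ? "value-for-{$key}" : null;
    }

    public function close(): void
    {
        $this->closed = true;
    }
}

// The first handle is stale (as after Redis dropped it); the second is fresh.
$handles = [new FakeClient(false), new FakeClient(true)];
$stale = $handles[0];
$flags = [];
$cache = new ResilientClient(
    function (bool $persistent) use (&$handles, &$flags): CacheClient {
        $flags[] = $persistent;
        return array_shift($handles);
    }
);

$result = $cache->get('greeting');
echo $result, PHP_EOL; // prints "value-for-greeting"
```

In this run the stale handle fails its health check, is closed, and the command succeeds on the fresh handle without the caller seeing an error. A real integration would also need to decide how the failure count is shared across PHP workers, which this sketch does not address.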
Expected Behavior
- Persistent connections should automatically recover from Redis memory pressure events
- Connection failures should not cascade across distributed application servers
- Health validation should detect and remediate stale connection handles
- Timeouts should account for Redis memory management operations (2-3 seconds)
Current Workaround
'persistent' => false, // Disable persistent connections
'timeout' => 2.0, // Increase timeout for Redis recovery
'read_timeout' => 2.0, // Allow memory cleanup time
This workaround resolves the issue but eliminates persistent connection performance benefits.
Impact
- Severity: High - Causes complete cache failures under memory pressure
- Frequency: Occurs whenever Redis approaches memory limits
- Scope: Affects all application servers simultaneously
- Workaround: Available but defeats persistent connection performance
Additional Context
This issue is particularly problematic in hosting environments where Redis memory is constrained and multiple application servers utilize Redis for caching. The combination of aggressive timeouts and lack of connection health validation creates conditions for cascade failures.
The issue could be mitigated with better connection resilience that accounts for Redis's legitimate need to relieve memory pressure by temporarily dropping connections.