Persistent Connection Pool Corruption #592

@lcatlett

Issue Summary

OCP's persistent connection handling causes cascade failures when Redis experiences memory pressure. The only workaround is persistent=false, which defeats the performance benefits of persistent connections.

Environment

  • OCP Version: 1.24.3
  • PHP Version: 8.x
  • Redis Configuration: Memory-constrained Redis instances
  • Platform: Multi-server hosting environments

Problem Description

When Redis hits memory limits and temporarily drops connections, OCP continues using stale pconnect() handles without health validation, causing systematic cache failures across distributed application servers.

Root Cause Analysis

Code Location

File: src/Connectors/PhpRedisConnector.php
Line: 129

$method = $persistent ? 'pconnect' : 'connect';

Technical Details

  1. Missing Health Validation: OCP delegates connection management entirely to PhpRedis pconnect() without additional health checks
  2. Stale Handle Reuse: When Redis drops connections during memory pressure, PHP workers retain dead connection handles
  3. No Recovery Logic: Dead connections remain in pool indefinitely, causing all subsequent commands to fail
  4. Cascade Amplification: Aggressive timeouts prevent Redis recovery, amplifying temporary memory pressure into permanent failures
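
The stale-handle behavior described in the list above can be illustrated with a small simulation. This is only a sketch: FakePersistentHandle, handleRequest() and the $pool array are hypothetical stand-ins for a pconnect() socket held by a PHP worker, not the PhpRedis or OCP API.

```php
<?php
// Hypothetical stand-in for a pooled persistent socket; not the PhpRedis API.
final class FakePersistentHandle
{
    public bool $alive = true;

    public function get(string $key): string
    {
        if (! $this->alive) {
            throw new RuntimeException('Redis server went away');
        }

        return "value-for-{$key}";
    }
}

// The handle outlives individual requests, like a persistent socket in a worker.
$pool = ['worker-1' => new FakePersistentHandle()];

function handleRequest(array $pool, string $key): string
{
    // No health check: whatever handle sits in the pool is used as-is.
    try {
        return $pool['worker-1']->get($key);
    } catch (RuntimeException $e) {
        return 'error: ' . $e->getMessage();
    }
}

$results = [];
$results[] = handleRequest($pool, 'a'); // handle is healthy
$pool['worker-1']->alive = false;       // server drops the connection
$results[] = handleRequest($pool, 'b'); // stale handle is reused
$results[] = handleRequest($pool, 'c'); // still stale, nothing replaces it
```

In this model every request after the drop fails the same way, because nothing between requests validates or replaces the handle.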

What "Cascade Failures" Means

A cascade failure occurs when a single point of failure (Redis memory pressure) triggers a chain reaction of failures across the entire system:

  1. Initial Event: Redis experiences memory pressure and drops connections (normal behavior)
  2. Connection Pool Corruption: All PHP workers across all application servers retain stale pconnect() handles
  3. Synchronized Failures: Every subsequent cache request uses dead connections, causing immediate command failures
  4. System-wide Impact: What should be a brief Redis recovery (2-3 seconds) becomes permanent cache failure across all servers
  5. Feedback Loop: Cache failures increase database load, creating more system pressure

The "cascade" refers to how a localized Redis event amplifies into distributed, synchronized failures across the entire application infrastructure, rather than being contained to individual connections or servers.

Reproduction Steps

  1. Configure Redis with limited memory constraints
  2. Enable OCP with persistent=true and aggressive timeout values (e.g., 0.5s)
  3. Generate cache load exceeding Redis memory capacity
  4. Observe: MGET/GET commands fail immediately with "Redis server went away"
  5. Observe: Failures persist across multiple application servers simultaneously
  6. Observe: Setting persistent=false resolves the issue
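
For step 2, a configuration along the following lines might be used. This is an assumed wp-config.php fragment: the WP_REDIS_CONFIG constant, host and port are assumptions not stated in this report, and only the persistent, timeout and read_timeout keys come from the workaround shown later in this issue.

```php
<?php
// Assumed wp-config.php fragment; host and port are illustrative placeholders.
define('WP_REDIS_CONFIG', [
    'host' => '127.0.0.1',   // assumption: local test instance
    'port' => 6379,          // assumption: default Redis port
    'persistent' => true,    // reuse connections across requests
    'timeout' => 0.5,        // aggressive connect timeout from step 2
    'read_timeout' => 0.5,   // assumption: read timeout set to match
]);
```

On the Redis side, the memory limit in step 1 is typically set with the maxmemory directive.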

Evidence

Log Patterns

[11-Sep-2025 14:49:18 UTC] objectcache.error: Failed to execute `MGET` command
[11-Sep-2025 14:49:19 UTC] objectcache.error: Failed to execute `MGET` command
[11-Sep-2025 11:49:52 UTC] objectcache.error: Redis server tcp://10.73.9.238:11028 went away

Error Distribution

  • Multiple application servers affected simultaneously
  • Hundreds of cache initialization failures
  • Commands fail at connection layer, not Redis processing
  • Synchronized failure patterns indicate connection pool corruption

Proposed Fix

Immediate Solution: Connection Health Validation

// In PhpRedisConnector.php
public static function connectToInstance(Configuration $config): ConnectionInterface
{
    // ... existing connection logic ...

    // ADD: Health validation for persistent connections
    if ($persistent && $client instanceof Redis) {
        try {
            // Validate the connection before use; treat a falsy reply as unhealthy too
            $healthy = (bool) $client->ping();
        } catch (RedisException $e) {
            $healthy = false;
        }

        if (! $healthy) {
            // Force reconnection if the health check fails
            $client->close();
            $client->{$method}(...$arguments);
        }
    }

    return new PhpRedisConnection($client, $config);
}

Enhanced Solution: Connection Pool Management

  1. Pre-command Health Checks: Ping Redis before executing commands on persistent connections
  2. Automatic Reconnection: Detect stale handles and establish fresh connections
  3. Connection State Tracking: Monitor connection health across requests
  4. Graceful Degradation: Fall back to non-persistent connections during Redis instability
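
Items 1, 2 and 4 above could be combined roughly as follows. This is a minimal sketch under stated assumptions, not OCP code: CacheClient, ResilientClient, FakeClient and the $factory callable are all hypothetical names, and the fake client only mimics a dead persistent handle.

```php
<?php
// Hypothetical minimal client contract; the real PhpRedis client is much larger.
interface CacheClient
{
    public function ping(): bool;
    public function get(string $key): ?string;
}

final class ResilientClient
{
    private CacheClient $client;
    private $factory;          // callable(bool $persistent): CacheClient
    private bool $persistent;
    public int $reconnects = 0;

    public function __construct(callable $factory, bool $persistent = true)
    {
        $this->factory = $factory;
        $this->persistent = $persistent;
        $this->client = $factory($persistent);
    }

    public function get(string $key): ?string
    {
        // Pre-command health check, only for persistent handles.
        if ($this->persistent && ! $this->isHealthy()) {
            // Graceful degradation: replace the stale handle with a
            // fresh non-persistent connection.
            $this->persistent = false;
            $this->client = ($this->factory)(false);
            $this->reconnects++;
        }

        return $this->client->get($key);
    }

    private function isHealthy(): bool
    {
        try {
            return $this->client->ping();
        } catch (RuntimeException $e) {
            return false;
        }
    }
}

// Fake client used only to exercise the sketch.
final class FakeClient implements CacheClient
{
    private bool $alive;

    public function __construct(bool $alive)
    {
        $this->alive = $alive;
    }

    public function ping(): bool
    {
        if (! $this->alive) {
            throw new RuntimeException('Redis server went away');
        }

        return true;
    }

    public function get(string $key): ?string
    {
        if (! $this->alive) {
            throw new RuntimeException('Redis server went away');
        }

        return "cached:{$key}";
    }
}

// The persistent handle is already dead; non-persistent connections work.
$factory = function (bool $persistent): CacheClient {
    return new FakeClient(! $persistent);
};

$cache = new ResilientClient($factory);
$value = $cache->get('post:1');
```

Pinging before every command costs an extra round trip, so a real implementation would likely check health once per request or only after a failed command.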

Expected Behavior

  • Persistent connections should automatically recover from Redis memory pressure events
  • Connection failures should not cascade across distributed application servers
  • Health validation should detect and remediate stale connection handles
  • Timeouts should account for Redis memory management operations (2-3 seconds)

Current Workaround

'persistent' => false,  // Disable persistent connections
'timeout' => 2.0,       // Increase timeout for Redis recovery
'read_timeout' => 2.0,  // Allow memory cleanup time

This workaround resolves the issue but eliminates persistent connection performance benefits.

Impact

  • Severity: High - Causes complete cache failures under memory pressure
  • Frequency: Occurs whenever Redis approaches memory limits
  • Scope: Affects all application servers simultaneously
  • Workaround: Available but defeats persistent connection performance

Additional Context

This issue is particularly problematic in hosting environments where Redis memory is constrained and multiple application servers utilize Redis for caching. The combination of aggressive timeouts and lack of connection health validation creates conditions for cascade failures.

The issue could be mitigated with better connection resilience that accounts for Redis's legitimate need to relieve memory pressure by temporarily dropping connections.
