
Readiness Probe Causes Infinite Restart Loop During Group Replication Recovery #1099

@abh

Description


Report

Note: this was written up with the help of AI, but it is based on experience with several crashes that the operator couldn't recover from. I either had to restore from backup or disable the operator and "hand recover" (and afterwards I'd rebuild from scratch to get an "operator operated" cluster again).

Bug Description

The Percona MySQL Operator's readiness probe treats RECOVERING nodes as unhealthy, causing Kubernetes to restart pods that are actively recovering data through MySQL Group Replication. This creates an infinite loop where nodes can never complete recovery because they are killed mid-process.

More about the problem

The Problem

When a MySQL node is in RECOVERING state (legitimately catching up with group replication), the readiness probe fails and Kubernetes restarts the pod. This prevents the node from ever completing recovery, creating a deadlock situation.
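
For context, this behavior comes from the pod's readiness probe, which (as shown below) execs /opt/percona/healthcheck readiness. One way to confirm how the probe is wired (the container name "mysql" is an assumption; adjust if the operator names it differently):

kubectl get pod ntpdb-mysql-1 -n ntpdb \
  -o jsonpath='{.spec.containers[?(@.name=="mysql")].readinessProbe}'
# Shows the exec command plus periodSeconds/failureThreshold, which determine
# how quickly a RECOVERING node is marked unready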

Current Readiness Probe Behavior

kubectl exec ntpdb-mysql-1 -n ntpdb -- /opt/percona/healthcheck readiness
# Output: 2025/09/27 20:46:18 readiness check failed: Member state: RECOVERING
# Exit code: 1

Group Replication Status

SELECT MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE FROM performance_schema.replication_group_members;
+----------------------------------+--------------+-------------+
| MEMBER_HOST                      | MEMBER_STATE | MEMBER_ROLE |
+----------------------------------+--------------+-------------+
| ntpdb-mysql-0.ntpdb-mysql.ntpdb  | ONLINE       | PRIMARY     |
| ntpdb-mysql-1.ntpdb-mysql.ntpdb  | RECOVERING   | SECONDARY   |
+----------------------------------+--------------+-------------+

Steps to reproduce

The Deadlock Cycle

  1. Node starts recovery: MySQL node joins group replication in RECOVERING state
  2. Readiness probe fails: /opt/percona/healthcheck readiness returns exit code 1 for RECOVERING nodes
  3. Kubernetes restarts pod: After readiness probe failure threshold, pod is marked unready and restarted
  4. Recovery interruption: Node loses recovery progress and must start over
  5. StatefulSet blocking: Next pod (mysql-2) cannot start until mysql-1 is Ready
  6. Operator deadlock: Cannot complete initialization without all pods healthy
  7. Infinite loop: Process repeats indefinitely (the restart pattern can be observed from pod events, as shown below)
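
The cycle is also visible from the Kubernetes side; a quick way to watch it (standard kubectl, exact event output will vary by environment):

kubectl get events -n ntpdb \
  --field-selector involvedObject.name=ntpdb-mysql-1 \
  --sort-by=.lastTimestamp
# Expect repeated Unhealthy events for the readiness probe, interleaved with
# container restarts, while MEMBER_STATE stays RECOVERING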

Versions

  • Operator Version: v0.12.0
  • MySQL Version: percona/percona-server:8.0.43-34
  • Kubernetes Version: v1.32.6
  • Cluster Configuration: 3-node MySQL Group Replication with StatefulSet

Anything else?

StatefulSet Cascade Failure

  • Pod ordering: StatefulSets require pod N to be Ready before creating pod N+1
  • Blocked scaling: mysql-2 never gets created because mysql-1 never becomes Ready
  • Cluster incomplete: Operator cannot complete initialization with missing pods (the ordering policy behind this can be confirmed as shown below)
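
A minimal check of the ordering policy that causes this (an empty result means the Kubernetes default, OrderedReady):

kubectl get statefulset ntpdb-mysql -n ntpdb \
  -o jsonpath='{.spec.podManagementPolicy}'
# OrderedReady (or empty/default) => mysql-2 is not created until mysql-1 is Ready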

Operational Impact

  • Extended outages: Clusters stuck in initialization for hours/days
  • Data consistency issues: Repeated recovery interruptions
  • Resource waste: Continuous pod restarts consume CPU/memory
  • Manual intervention required: No automatic recovery possible

Expected Behavior

Readiness Probe Should Accept RECOVERING

The readiness probe should distinguish between:

  1. Healthy States (Ready=True):
    • ONLINE - Node is fully operational
    • RECOVERING - Node is actively syncing data (healthy progress)
  2. Unhealthy States (Ready=False):
    • ERROR - Node has encountered an error
    • OFFLINE - Node is not participating in the group
    • UNREACHABLE - Network connectivity issues

Rationale

  • RECOVERING is expected: During cluster recovery, nodes legitimately spend time catching up
  • Progress is healthy: RECOVERING means the node is actively syncing data
  • Time is required: Large datasets may take hours to sync
  • Interruption is harmful: Restarting during recovery loses progress

Proposed Solution

1. Modify Readiness Check Logic

# Current logic (problematic): any state other than ONLINE is treated as not ready
if [ "$MEMBER_STATE" != "ONLINE" ]; then
    echo "readiness check failed: Member state: $MEMBER_STATE"
    exit 1
fi

# Proposed logic: RECOVERING counts as ready; everything else does not
# ($MEMBER_STATE is the local member's state from
#  performance_schema.replication_group_members)
case "$MEMBER_STATE" in
    "ONLINE"|"RECOVERING")
        exit 0  # Ready
        ;;
    *)
        # ERROR, OFFLINE, UNREACHABLE, empty, or any unexpected value
        echo "readiness check failed: Member state: $MEMBER_STATE"
        exit 1  # Not ready
        ;;
esac
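
For illustration only, the same proposed logic as a self-contained script that looks up the local member's state itself. The real healthcheck is a Go binary; the mysql invocation and credential variables here are assumptions, not the operator's implementation:

#!/bin/bash
# Hypothetical standalone readiness check sketch
set -euo pipefail

# Credentials are assumed to be available to the probe (illustrative only)
MEMBER_STATE=$(mysql -N -s -u"$MONITOR_USER" -p"$MONITOR_PASSWORD" -e \
  "SELECT MEMBER_STATE FROM performance_schema.replication_group_members
   WHERE MEMBER_ID = @@server_uuid;")

case "$MEMBER_STATE" in
    "ONLINE"|"RECOVERING")
        exit 0  # serving, or actively catching up: Ready
        ;;
    *)
        echo "readiness check failed: Member state: $MEMBER_STATE"
        exit 1  # ERROR, OFFLINE, UNREACHABLE, or unknown: not Ready
        ;;
esac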

2. Add Recovery Progress Monitoring

For RECOVERING nodes, optionally check if progress is being made (a rough sketch follows the list below):

  • Monitor GTID_EXECUTED advancement
  • Allow reasonable timeout for large recoveries
  • Only fail if recovery is truly stuck (no progress for extended period)
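
A rough sketch of what such a progress check could look like, comparing GTID_EXECUTED between probe runs. The state-file path and single-interval comparison are simplifying assumptions; a real implementation would want a longer no-progress window:

#!/bin/bash
# Hypothetical progress check for a RECOVERING node
set -euo pipefail

STATE_FILE=/tmp/last_gtid_executed   # assumed scratch location
# Credentials assumed via an option file or environment (illustrative only)
CURRENT=$(mysql -N -s -e "SELECT @@GLOBAL.GTID_EXECUTED;")

if [ -f "$STATE_FILE" ] && [ "$CURRENT" = "$(cat "$STATE_FILE")" ]; then
    # No GTID advancement since the last check: recovery may be stuck
    echo "recovery appears stalled: GTID_EXECUTED has not advanced"
    exit 1
fi

printf '%s' "$CURRENT" > "$STATE_FILE"
exit 0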

3. Configuration Options

Add operator configuration to control readiness behavior:

spec:
  mysql:
    readinessProbe:
      allowRecovering: true  # Default: true
      recoveryTimeout: 3600  # Seconds to allow recovery before failing

Reproduction Steps

  1. Create 3-node cluster with some data
  2. Simulate node failure (delete 2 pods, leaving 1 with the most recent data; example commands are sketched after this list)
  3. Start recovery: Let operator attempt to recover cluster
  4. Observe deadlock:
    • Node 1 reaches RECOVERING state
    • Readiness probe fails repeatedly
    • Pod restarts interrupt recovery
    • Node 2 never starts due to StatefulSet ordering
    • Cluster never completes initialization
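
Roughly, as commands (pod names match this cluster; which pods to delete depends on which member holds the most recent data):

# Step 2: simulate the failure by deleting two of the three pods
kubectl delete pod ntpdb-mysql-1 ntpdb-mysql-2 -n ntpdb

# Step 4: watch the recovering pod flap while the cluster stays "initializing"
kubectl get pods -n ntpdb -w
kubectl get perconaservermysql ntpdb -n ntpdb -o jsonpath='{.status.state}'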

Current Workaround

None available - The issue is fundamental to the readiness probe logic and cannot be worked around without code changes.

Evidence

Pod Restart Pattern

kubectl get pods -n ntpdb | grep mysql-1
# ntpdb-mysql-1    1/2     Running     4 (2m ago)     8m
# Shows continuous restarts due to readiness failures

StatefulSet Status

kubectl get statefulset ntpdb-mysql -n ntpdb
# Shows replicas: 2/3 because mysql-2 cannot start

Operator Status

kubectl get perconaservermysql ntpdb -n ntpdb -o jsonpath='{.status.state}'
# Shows: "initializing" (stuck indefinitely)

Related Issues

This issue compounds with the broader operator reconciliation problems:

  • Status update conflicts prevent progress
  • Crash recovery logic is overly aggressive
  • Initialization state never completes

As best I can tell, this:

  • Prevents cluster recovery in common failure scenarios
  • Requires manual operator intervention/restart
  • Can cause extended production outages
  • Has no viable workaround
