v2.4.0-criteo4: CRITEO - Use eviction to delete pod during roll restart

@geobeau geobeau released this 02 Jan 15:23
· 21 commits to criteo.2.4.0 since this release
The roll restart code is prone to race conditions.
Calling the delete API is fast because it doesn't wait for the pod to
actually terminate, so the reconciler is called back almost instantly to
handle the other pools. At that point the STS is not yet reporting the
change of readiness of one of its replicas, and Elasticsearch is not yet
reporting the status as yellow/red because the pod is not dead yet. So
the operator thinks it's okay to kill another pod.
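
The race described above can be reproduced with a toy model. This is only a sketch: the `cluster` type, its fields, and `reconcile` are hypothetical names invented for illustration, not the operator's code or the Kubernetes API. It assumes that a deleted pod immediately stops counting as running when pods are listed directly, while the reported status lags behind.

```go
package main

import "fmt"

// cluster is a toy model (hypothetical, not the Kubernetes API). It separates
// the pods that are really running from the status last reported.
type cluster struct {
	running       int // ground truth, what listing the pods would show
	reportedReady int // eventually consistent status, may lag behind
}

// deletePod returns immediately, like a delete call that does not wait for
// termination. The reported status is deliberately left untouched.
func (c *cluster) deletePod() { c.running-- }

// reconcile deletes one pod if the chosen view says all replicas are up.
// It returns true when a pod was deleted.
func reconcile(c *cluster, replicas int, useLiveCount bool) bool {
	observed := c.reportedReady
	if useLiveCount {
		observed = c.running
	}
	if observed < replicas {
		return false // a restart is already in flight, do nothing
	}
	c.deletePod()
	return true
}

func main() {
	stale := &cluster{running: 3, reportedReady: 3}
	reconcile(stale, 3, false)
	reconcile(stale, 3, false) // status still says 3 ready, second pod killed
	fmt.Println("stale status, pods left:", stale.running) // prints 1

	live := &cluster{running: 3, reportedReady: 3}
	reconcile(live, 3, true)
	reconcile(live, 3, true) // sees only 2 running and refuses
	fmt.Println("live count, pods left:", live.running) // prints 2
}
```

With the stale view, two back-to-back reconciliations take down two pods. With the live count, the second one is refused.
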
Fixing the race condition can be done by introducing more checks:

- Use the `CountRunningPodsForNodePool` function, which is more
reliable than checking the status of the STS: that status is
eventually consistent, whereas this function lists the pods directly.
- Use the Eviction API, as it uses compare-and-swap to ensure
consistency with the budget. It can work without a PDB, but it might
not protect as efficiently against race conditions.

Also added logs to help debug race conditions in the future.