v2.4.0-criteo4: CRITEO - Use eviction to delete pod during roll restart
The roll restart code is prone to race conditions. Calling the delete API is fast because it doesn't wait for the pod to actually terminate, so the reconciler is called back almost instantly to handle the other pools. At that point the STS is not yet reporting the readiness change of the affected replica, and Elasticsearch is not yet reporting the status as yellow/red because the pod is not dead yet. So the operator thinks it's okay to kill another pod.

Fixing the race condition can be done by introducing more checks:
- Using the `CountRunningPodsForNodePool` function, which is more reliable than checking the status of the STS, because that status is eventually consistent. This function lists the pods directly.
- Using the Eviction API, as it uses compare-and-swap to ensure consistency with the budget. It can work without a PDB, but it might not protect as efficiently against race conditions.

Also added logs to help debug race conditions in the future.
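To illustrate why a compare-and-swap check against a budget closes this kind of race, here is a minimal, standard-library-only simulation. It is not the operator's code and does not call the Kubernetes Eviction API: the `budget` type, `tryEvict`, and `simulateEvictions` are hypothetical names standing in for a disruption budget and for concurrent eviction requests.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// budget is a hypothetical stand-in for the number of disruptions
// a disruption budget currently allows. It is not a Kubernetes type.
type budget struct {
	allowed atomic.Int32
}

// tryEvict consumes one allowed disruption using a compare-and-swap loop.
// It returns false when the budget is exhausted, so two concurrent callers
// can never both spend the same remaining disruption.
func (b *budget) tryEvict() bool {
	for {
		cur := b.allowed.Load()
		if cur <= 0 {
			return false // budget exhausted: the eviction is refused
		}
		if b.allowed.CompareAndSwap(cur, cur-1) {
			return true // we won the swap: the eviction is admitted
		}
		// Another caller changed the counter first; reload and retry.
	}
}

// simulateEvictions starts `attempts` concurrent eviction attempts against
// a budget of `allowed` disruptions and returns how many were admitted.
func simulateEvictions(allowed int32, attempts int) int {
	b := &budget{}
	b.allowed.Store(allowed)

	var admitted atomic.Int32
	var wg sync.WaitGroup
	for i := 0; i < attempts; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if b.tryEvict() {
				admitted.Add(1)
			}
		}()
	}
	wg.Wait()
	return int(admitted.Load())
}

func main() {
	// Eight reconcile passes race to restart a pod, but only one disruption is allowed.
	fmt.Println(simulateEvictions(1, 8)) // prints 1
}
```

A plain delete has no such gate: every caller succeeds immediately, which is how a second pod can be killed before the first one is reported as not ready.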