By the definition of the test, and in Twitter's R implementation, every candidate considered up to the largest i such that max_R_i > lambda_i is an anomaly, not just the candidates from the iterations where max_R_i > lambda_i holds.
Simulation studies have shown that the inequality max_R_i > lambda_i can swing back and forth as the iterations progress, so this Python implementation of the ESD test may miss some anomalies.
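To make the difference concrete, here is a minimal sketch of the two counting rules. The function names and the numbers are hypothetical and chosen only for illustration. They are not taken from this repository, from Twitter's R code, or from real data. The sketch assumes the per-iteration statistics max_R_i and critical values lambda_i have already been computed.

```python
def count_anomalies_largest_i(test_stats, critical_values):
    """Return the largest 1-based i with test_stats[i-1] > critical_values[i-1].

    Returns 0 if the inequality never holds. Every candidate removed in
    iterations 1..i is then treated as an anomaly.
    """
    n_anomalies = 0
    for i, (r, lam) in enumerate(zip(test_stats, critical_values), start=1):
        if r > lam:
            n_anomalies = i  # keep scanning: a later iteration may exceed again
    return n_anomalies


def count_anomalies_only_exceeding(test_stats, critical_values):
    """Count only the iterations where the statistic exceeds the critical value.

    This is the narrower reading that can miss anomalies.
    """
    return sum(1 for r, lam in zip(test_stats, critical_values) if r > lam)


# Illustrative values only: the inequality holds at i=1, fails at i=2,
# holds again at i=3, and fails at i=4.
max_R = [3.9, 3.1, 3.4, 2.8]
lam = [3.3, 3.2, 3.2, 3.1]

print(count_anomalies_largest_i(max_R, lam))       # 3
print(count_anomalies_only_exceeding(max_R, lam))  # 2
```

With these made-up values, the largest-i rule flags the first three candidates, including the one from iteration 2 where the inequality fails, while the narrower rule flags only two.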