How can I determine whether another LLM is suitable as a judge model?

If I want to use an LLM other than GPT-4-1106-Preview as the judge model, how can I tell whether its judgments agree less with human judgments, and whether the results are still reasonable?
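To make the question concrete, the kind of check I have in mind could be sketched as follows. This is only a minimal sketch, assuming a set of examples that have both human verdicts and verdicts from each judge. The label values (`"A"`, `"B"`, `"tie"`), the toy data, and the function names `agreement_rate` and `cohens_kappa` are hypothetical and not taken from any existing tool. It computes the raw agreement with humans and Cohen's kappa (agreement corrected for chance) for each judge, so the candidate can be compared against the GPT-4-1106-Preview baseline.

```python
from collections import Counter


def agreement_rate(judge, human):
    """Fraction of examples where the judge's verdict equals the human verdict."""
    if len(judge) != len(human) or not judge:
        raise ValueError("verdict lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge, human):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(judge)
    p_observed = agreement_rate(judge, human)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Chance agreement, estimated from each rater's marginal label frequencies.
    p_expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so treat it as perfect agreement by convention here.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy verdicts ("A" wins, "B" wins, or "tie"); not real data.
human_verdicts = ["A", "B", "A", "tie", "B", "A", "B", "A"]
baseline_verdicts = ["A", "B", "A", "A", "B", "A", "B", "A"]    # e.g. GPT-4-1106-Preview
candidate_verdicts = ["A", "B", "B", "tie", "B", "A", "A", "A"]  # e.g. the other LLM

for name, verdicts in [("baseline", baseline_verdicts),
                       ("candidate", candidate_verdicts)]:
    rate = agreement_rate(verdicts, human_verdicts)
    kappa = cohens_kappa(verdicts, human_verdicts)
    print(f"{name}: agreement={rate:.3f}, kappa={kappa:.3f}")
```

With real data, I would expect to need many more human-labeled examples than this before a gap between the two judges means anything.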