Question about the CER of analysis-synthesis vs. the TTS system

In the paper, as shown in Figure 5 and claimed in Section 6.3, the CER of the LLM-based TTS system is lower than the CER of the audio from analysis-synthesis, regardless of whether PQ, RQ, or OPQ is used. I don't understand why this is the case. It seems counterintuitive.