Evaluation dataset in the "Response Generation" - "end-to-end models" part

Does this part use MultiWOZ 2.2 or MultiWOZ 2.0 as the benchmark dataset?
I'm confused: the results in the original RewardNet, Mars, and KRLS papers are the same as those in your table, but those papers all report results on the MultiWOZ 2.0 dataset. Moreover, the TOATOD paper reports its combined score on the MultiWOZ 2.2 dataset.
Is there a mistake? Could you explain this inconsistency?
Thanks!