How can I do hidden-state-based distillation? In the current example, the student model simply copies the teacher model's weights.
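
To make the question concrete, here is a minimal, framework-free sketch of what I mean by a hidden-state loss: each student hidden state is projected to the teacher's width and compared with the teacher's hidden state using mean squared error. The names `project`, `hidden_state_loss`, and `weight` are hypothetical and not taken from the example, and the use of MSE with a linear projection is my assumption about how this would be set up.

```python
def project(hidden, weight):
    """Map a student hidden vector (size d_s) to the teacher width (d_t).

    `weight` is a hypothetical d_t x d_s projection matrix given as a list of rows.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def hidden_state_loss(student_states, teacher_states, weight):
    """Mean squared error between projected student and teacher hidden states.

    Both inputs are lists of per-token vectors; the result is averaged
    over all tokens and all teacher dimensions.
    """
    total = 0.0
    count = 0
    for s, t in zip(student_states, teacher_states):
        p = project(s, weight)
        total += sum((a - b) ** 2 for a, b in zip(p, t))
        count += len(t)
    return total / count


# Toy example: student width 2, teacher width 3, two tokens.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student = [[1.0, 0.0], [0.0, 1.0]]
teacher = [[1.0, 0.0, 1.0], [0.0, 1.0, 3.0]]
loss = hidden_state_loss(student, teacher, weight)
```

In a real setup the projection would be trained together with the student, and this term would be added to the usual output-level distillation loss; which student layer is matched to which teacher layer is left open here.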