What about the inference code? How should open-vocabulary detection performance be evaluated? Could you share more details?