This repo contains the code for our EMNLP 2021 paper: CLIPScore: A Reference-free Evaluation Metric for Image Captioning. CLIPScore is a metric that you can use to evaluate the quality of an automatic image captioning system. In our paper, we show that CLIPScore achieves high correlation with human judgment on literal image captioning tasks. However, unlike BLEU or CIDEr, CLIPScore doesn't require reference captions.
If you find the paper or this code useful, please consider citing:
@inproceedings{hessel2021clipscore,
title={{CLIPScore:} A Reference-free Evaluation Metric for Image Captioning},
author={Hessel, Jack and Holtzman, Ari and Forbes, Maxwell and Bras, Ronan Le and Choi, Yejin},
booktitle={EMNLP},
year={2021}
}
Example usage
> python clipscore.py example/good_captions.json example/images/
...
CLIPScore: 0.8584
If you include optionally some references, you will see RefCLIPScore, alongside a usual set of caption generation evaluation metrics. The references are optional.
> python clipscore.py example/good_captions.json example/images/ --references_json example/refs.json
...
BLEU-1: 0.6667
BLEU-2: 0.4899
BLEU-3: 0.3469
BLEU-4: 0.0000
METEOR: 0.3444
ROUGE: 0.4280
CIDER: 0.5637
SPICE: 0.4000
CLIPScore: 0.8584
RefCLIPScore: 0.8450
Worse captions should get lower scores:
> python clipscore.py example/bad_captions.json example/images/ --references_json example/refs.json
...
BLEU-1: 0.4815
BLEU-2: 0.2404
BLEU-3: 0.1359
BLEU-4: 0.0000
METEOR: 0.1861
ROUGE: 0.3121
CIDER: 0.2790
SPICE: 0.1500
CLIPScore: 0.7153
RefCLIPScore: 0.7253
You can treat/report CLIPScore and RefCLIPScore similarly to the other
evaluation metrics. See the paper for more details about CLIPScore and
RefCLIPScore. Full usage options can be given by python clipscore.py -h
. An example set of inputs, including a candidate json, image
directory, and references json is given this repo under example/
The input files are formatted as follows:
The candidates json should be a dictionary that maps from
{"string_image_identifier": "candidate"}
, e.g.,
{'image1': 'an orange cat and a grey cat are lying together.',
'image2': 'a black dog looks at the camera.'
...}
The image directory should be a directory containing the images that act as the keys in the candidates json, e.g.,
images/
├── image1.jpg
└── image2.jpg
and, finally, the references json should be a dictionary that maps from
{"string_image_identifier": ["list", "of", "references"]}
, e.g.,
{"image1": ["two cats are sleeping next to each other.",
"a grey cat is cuddling with an orange cat on a blanket.",
"the orange cat is happy that the black cat is close to it."],
"image2": ["a dog is wearing ear muffs as it lies on a carpet.",
"a black dog and an orange cat are looking at the photographer.",
"headphones are placed on a dogs ears."]}
If you're running on the MSCOCO dataset and using the standard evaluation toolkit, you can use our version of pycocoevalcap to evaluate. You won't even need to download the original MSCOCO images, thanks to a bit of magic :-)
To use pycocoevalcap
on the MSCOCO dataset in the MSCOCO format, you
can simply use:
pip install git+https://github.com/jmhessel/pycocoevalcap.git
there is an example evaluation in that repo under
examples/eval.py
. After pip installing, if you clone the
pycocoeval
repo and run
python eval.py
after a bit of time, the output should be:
Bleu_1: 0.579
Bleu_2: 0.404
Bleu_3: 0.279
Bleu_4: 0.191
METEOR: 0.195
ROUGE_L: 0.396
CIDEr: 0.600
SPICE: 0.133
CLIPScore: 0.528
RefCLIPScore: 0.605
-
CLIPScore can run either on CPU or GPU. But, there are slight differences due to floating point precision. As discussed here, on CPU, all operations run in
float32
, but on GPU, some operations run infloat16
. The differences are generally small (e.g., for the example run above, withexample/good_captions.json
captions andexample/images/
images, on CPU, the output isCLIPScore: 0.8585
, but on GPU, the output isCLIPScore: 0.8584
.) All experiments in the paper were run on GPU, and this code will raise a warning if you're not using a GPU. -
Because CLIPScore depends on the images to compute, resizing, compressing, etc. can all cause slight differences in computing the CLIPScore. Even saving a jpg twice can result in different compression, because that format is lossy! To this end, we release the checksums of the images we used for the paper. see
checksums/
for more info. For the pycocoevalcap repo, we have also included the checksums for MSCOCO --- see here for more info. -
The prompt we used for the text side of CLIP, as mentioned in the paper is ``A photo depicts" This is hard-coded into this repo. Other prompts will result in slightly different results, and we don't recommend them for the sake of reproducibility.
The authors would like to thank Jungo Kasai for being the first to use this repo. Thanks to Jungo, we fixed a few issues, and added some information about reproducibility that was missing before.