This repository is a collection of scripts and Jupyter notebooks for extracting bottom-up image features required by the baseline models for nocaps.
- Terminology
- Setup Instructions
- Extract Boxes from OI Detector
- Extract Features from VG Detector
- Visualize Bounding Boxes
- Frequently Asked Questions
Pre-trained weights and some parts of this codebase are adapted from @peteanderson80/bottom-up-attention and the Tensorflow Object Detection API.
If you find this code useful, please consider citing our paper and these works. BibTeX is available in CITATION.md.
We have two baselines (Table 2 of our paper), which use two detectors to obtain boxes and image features.
Baselines:
- CBS - Constrained Beam Search (applied to UpDown Captioner).
- NBT - Neural Baby Talk (with and without Constrained Beam Search).
Detectors:
- OI Detector - trained using the Open Images v4 split, from the Tensorflow model zoo.
- VG Detector - trained using Visual Genome, from @peteanderson80/bottom-up-attention.
The OI Detector is used as a source of CBS constraints and NBT visual word candidates. The VG Detector is used as a source of bottom-up image features (2048-dimensional vectors) for both baselines (and their ablations). Refer to our paper for further details.
Setting up this codebase requires Docker, so install it first. We provide two separate dockerfiles, one for each of our detectors. Also install nvidia-docker, which enables the use of GPUs from inside a container.
- Download pre-trained models.
- Build the docker image (replace <detector> with oi or vg as needed). If you wish to use only the OI Detector, you need not build the docker image for the VG Detector.
docker build --file Dockerfile_<detector> --tag <detector>_image .
We use the OI Detector as a source of bounding boxes (and associated class labels) for CBS constraints and candidate grounding regions for NBT (rows 2-4, Table 2 of our paper). We do not use the bottom-up image features extracted from this detector in any experiments.
- Launch the docker image in a container; make sure to attach the project root directory as well as the directories containing dataset splits as volumes.
# Attach scripts as a volume: edits made outside are reflected inside the container.
# Omit the nocaps (or coco) volume if you are using only the other dataset.
# Port 8880 is forwarded for accessing jupyter notebook/lab.
nvidia-docker run -it \
    --name oi_container \
    -v $PWD/scripts:/workspace/scripts \
    -v /path/to/nocaps:/datasets/nocaps \
    -v /path/to/coco:/datasets/coco \
    -p 8880:8880 \
    oi_image /bin/bash
- Inside the container environment, extract boxes with this command (example for nocaps val):
python3 scripts/extract_boxes_oi.py \
--graph models/faster_rcnn_inception_resnet_v2_atrous_oid_v4_2018/frozen_inference_graph.pb \
--images /datasets/nocaps/images/val \
--annotations /datasets/nocaps/annotations/nocaps_val_image_info.json \
--output /outputs/nocaps_val_detections.json
- Copy the output file from the container environment out to the host filesystem.
docker container cp oi_container:/outputs/nocaps_val_detections.json .
The output is a JSON file with bounding boxes in the COCO instance annotations format:
{
"categories": [ {"id": int, "name": str, "supercategory": str}, ... ],
"images": [ {"id": int, "file_name": str, "width": int, "height": int }, ... ],
"annotations": [ {"image_id": int, "category_id": int, "bbox": [X1, Y1, X2, Y2], "score": float } ... ]
}
Note that bbox is of the form [X1, Y1, X2, Y2], as opposed to [X, Y, W, H] in the COCO format.
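If a downstream tool expects standard COCO-style boxes, the conversion is a simple subtraction. Below is a minimal sketch; the file names are placeholders for the output produced above.

```python
import json

# Placeholder path; use the detections JSON produced by the step above.
with open("nocaps_val_detections.json") as f:
    detections = json.load(f)

for ann in detections["annotations"]:
    x1, y1, x2, y2 = ann["bbox"]
    # Convert [X1, Y1, X2, Y2] to COCO-style [X, Y, W, H].
    ann["bbox"] = [x1, y1, x2 - x1, y2 - y1]

with open("nocaps_val_detections_coco_boxes.json", "w") as f:
    json.dump(detections, f)
```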
We use the VG Detector as a source of bottom-up image features (2048-dimensional vectors) for the UpDown model and NBT. Only for the candidate grounding regions of NBT do we take boxes from the OI Detector and use them here (instead of this detector's RPN) to get features.
Launch the docker image as for the OI Detector:
# Attach scripts as a volume: edits made outside are reflected inside the container.
# Omit the nocaps (or coco) volume if you are using only the other dataset.
# Port 8880 is forwarded for accessing jupyter notebook/lab.
nvidia-docker run -it \
    --name vg_container \
    -v $PWD/scripts:/workspace/scripts \
    -v /path/to/nocaps:/datasets/nocaps \
    -v /path/to/coco:/datasets/coco \
    -p 8880:8880 \
    vg_image /bin/bash
Inside the container environment, extract class-agnostic features for the UpDown model and the language model of NBT with this command (example for nocaps val):
python scripts/extract_features_vg.py \
--prototxt models/vg_faster_rcnn_end2end/test_rpn.prototxt \
--caffemodel models/vg_faster_rcnn_end2end/resnet101_faster_rcnn_final.caffemodel \
--images /datasets/nocaps/images/val \
--annotations /datasets/nocaps/annotations/nocaps_val_4500_image_info.json \
--output /outputs/nocaps_val_features.h5
To extract features from boxes provided by the OI Detector, add/change two arguments:
- Use --prototxt models/vg_faster_rcnn_end2end/test_force_boxes.prototxt.
- Provide the path to the output JSON from the OI Detector as --force-boxes.
Copy the output file from the container environment out to the host filesystem.
docker container cp vg_container:/outputs/nocaps_val_features.h5 .
The output is a stand-alone H5 file with the following fields (one row corresponds to one image):
{
"image_id": int,
"width": int,
"height": int,
"num_boxes": int,
"boxes": np.ndarray, # shape: (num_boxes * 4, )
"classes": np.ndarray, # shape: (num_boxes, )
"scores": np.ndarray, # shape: (num_boxes, )
"features": np.ndarray, # shape: (num_boxes * 2048, )
}
- In case of class-agnostic features extracted by the VG Detector, the classes field is absent.
- In case of boxes provided by the OI Detector, the classes, scores, and boxes fields are the same as in the JSON provided through the --force-boxes argument.
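As a quick sanity check, the flattened per-image arrays can be reshaped back into (num_boxes, 4) boxes and (num_boxes, 2048) features. The following is a rough sketch, assuming each field listed above is stored as a top-level dataset in the H5 file with one row per image; adjust the dataset names and file path to match your output.

```python
import h5py
import numpy as np

# Assumption for illustration: each field listed above is a top-level H5 dataset
# with one row per image. Inspect the actual layout with h5py.File(...).keys().
with h5py.File("nocaps_val_features.h5", "r") as f:
    row = 0  # inspect the first image
    image_id = int(f["image_id"][row])
    num_boxes = int(f["num_boxes"][row])

    # Undo the flattening described above; slicing guards against any padding.
    boxes = np.asarray(f["boxes"][row][: num_boxes * 4]).reshape(num_boxes, 4)
    features = np.asarray(f["features"][row][: num_boxes * 2048]).reshape(num_boxes, 2048)

    print(image_id, boxes.shape, features.shape)
```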
We provide two notebooks in the notebooks directory to visualize boxes and object classes from a JSON file or an H5 file. These can be run directly from either of the container environments.
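For a quick standalone check outside the notebooks, boxes from a detections JSON can also be drawn with PIL. This is only a rough sketch; the paths are placeholders, and it assumes the [X1, Y1, X2, Y2] box format described above.

```python
import json
from PIL import Image, ImageDraw

# Placeholder paths for illustration.
with open("nocaps_val_detections.json") as f:
    detections = json.load(f)

categories = {c["id"]: c["name"] for c in detections["categories"]}
image_info = detections["images"][0]

image = Image.open(f"/datasets/nocaps/images/val/{image_info['file_name']}").convert("RGB")
draw = ImageDraw.Draw(image)

for ann in detections["annotations"]:
    if ann["image_id"] == image_info["id"]:
        x1, y1, x2, y2 = ann["bbox"]  # boxes are [X1, Y1, X2, Y2]
        draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
        draw.text((x1, max(y1 - 10, 0)), categories[ann["category_id"]], fill="red")

image.save("boxes_preview.png")
```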
- How do I train my own detector(s)?
    - We only provide support for feature extraction and visualization (for debugging), and we do not intend to add training support in this repository in the future. For training your own detector(s), use @peteanderson80/bottom-up-attention for the VG Detector and the Tensorflow Object Detection API for the OI Detector.
- Feature extraction is slow, how can I speed it up?
    - Feature extraction for the nocaps splits is reasonably fast due to their smaller size (~5K/~10K images); COCO train2017 takes relatively longer (~118K images). Parallelizing across multiple GPUs would help, but it is unfortunately not supported. Feature extraction was a one-time job for our experiments, so introducing multi-GPU support took lower priority than other things. We do welcome Pull Requests for this support! A rough sharding workaround is sketched below.
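Until then, one rough workaround is to shard the work manually: split the annotations JSON into chunks, run one extraction process per chunk (each on its own GPU), and merge the outputs afterwards. The sketch below only shows the splitting step; it assumes the extraction scripts process exactly the images listed in the --annotations file, and the file names are placeholders.

```python
import copy
import json

NUM_SHARDS = 4  # e.g. one shard per available GPU

# Placeholder path; use your image-info / annotations JSON.
with open("nocaps_val_image_info.json") as f:
    info = json.load(f)

for shard_id in range(NUM_SHARDS):
    shard = copy.deepcopy(info)
    # Keep every NUM_SHARDS-th image in this shard.
    shard["images"] = info["images"][shard_id::NUM_SHARDS]
    if "annotations" in shard:
        image_ids = {im["id"] for im in shard["images"]}
        shard["annotations"] = [a for a in info["annotations"] if a["image_id"] in image_ids]
    with open(f"nocaps_val_image_info.shard{shard_id}.json", "w") as f:
        json.dump(shard, f)

# Pass each shard JSON as --annotations to a separate extraction process,
# then combine the resulting output files.
```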