We compare our Facere
models with the state-of-the-art Attribute Mask R-CNN (amrcnn)
and FashionFormer (fformer)
models. Since their repositories have conflicting dependencies, we create separate virtual environments for each of
them. Here are all the models we used:
| Model name | Description | Backbone | FashionFail-train data | download |
|---|---|---|---|---|
| amrcnn-spine | Attribute Mask R-CNN model released with the Fashionpedia paper. | SpineNet-143 | x | ckpt \| config |
| fformer-swin | FashionFormer model released with the FashionFormer paper. | Swin-base | x | pth |
| amrcnn-r50-fpn | Attribute Mask R-CNN model released with the Fashionpedia paper. | ResNet50-FPN | x | ckpt \| config |
| fformer-r50-fpn | FashionFormer model released with the FashionFormer paper. | ResNet50-FPN | x | pth |
| facere | Mask R-CNN-based model trained on Fashionpedia-train. | ResNet50-FPN | x | onnx |
| facere+ | facere model fine-tuned on FashionFail-train. | ResNet50-FPN | ✔ | onnx |
After training, execute the following command to convert the trained model checkpoint (.ckpt) to .onnx format:
python models/export_to_onnx.py \
--ckpt_path "model.ckpt" \
--onnx_path "model.onnx" \
--model_class "facere_base" # either "facere_base" or "facere_plus"
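As a quick sanity check, the exported file can be validated and loaded with ONNX Runtime before running any predictions. A minimal sketch, assuming onnx and onnxruntime are installed and the file name matches the command above:

```python
# Sanity-check the exported model: validate the graph and list its I/O signature.
import onnx
import onnxruntime as ort

onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)  # raises if the exported graph is malformed

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
print("inputs: ", [(i.name, i.shape) for i in session.get_inputs()])
print("outputs:", [(o.name, o.shape) for o in session.get_outputs()])
```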
Then, run inference using ONNX Runtime
with:
# --model_name is either "facere_base" or "facere_plus"
python models/predict_models.py \
--model_name "facere_base" \
--image_dir "path/to/images/to/run/inference/for/" \
--out_dir "path/to/where/predictions/will/be/saved/"
This saves all predictions into a single compressed, storage-efficient .npz file with the following structure:
{
"image_file": str, # image file name
"boxes": numpy.ndarray, # boxes in yxyx format (same as `amrcnn` model output)
"classes": numpy.ndarray, # classes/categories in [1,n] for n classes
"scores": numpy.ndarray, # confidence scores of boxes in [0,1]
"masks": list(dict), # segmentation masks in encoded RLE format
}
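For downstream analysis, the predictions can be loaded back with NumPy. A minimal sketch, assuming the .npz stores the per-image records as an object array (hence allow_pickle=True); the exact key name depends on how predict_models.py saves the file, so we simply take the first one:

```python
# Load the compressed predictions and inspect the first record.
import numpy as np

with np.load("predictions.npz", allow_pickle=True) as data:
    print(data.files)              # keys stored in the archive
    preds = data[data.files[0]]    # per-image prediction records

first = preds[0]
print(first["image_file"], first["boxes"].shape, first["scores"][:5])
```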
Alternatively, see the inference code in HuggingFace Spaces.
Note on the repository: The upstream repository is complex and difficult to modify; for example, we could not run inference on GPUs and failed to convert the model to .onnx format. The following procedure is therefore not optimal, but it works.
Create and activate the conda environment:
conda create -n amrcnn python=3.9
conda activate amrcnn
Install dependencies:
pip install tensorflow-gpu==2.11.0 Pillow==9.5.0 pyyaml opencv-python-headless tqdm pycocotools
Clone the repository, navigate to the detection
directory and download the models:
cd /change/dir/to/fashionfail/repo/
git clone https://github.com/jangop/tpu.git
cd tpu
git checkout 85b65b6
cd models/official/detection
curl https://storage.googleapis.com/cloud-tpu-checkpoints/detection/projects/fashionpedia/fashionpedia-spinenet-143.tar.gz --output fashionpedia-spinenet-143.tar.gz
tar -xf fashionpedia-spinenet-143.tar.gz
curl https://storage.googleapis.com/cloud-tpu-checkpoints/detection/projects/fashionpedia/fashionpedia-r50-fpn.tar.gz --output fashionpedia-r50-fpn.tar.gz
tar -xf fashionpedia-r50-fpn.tar.gz
The inference script expects the input images to be packaged in a single archive. Hence, pack the FashionFail-test data into a .tar file, for example:
cd ~/.cache/fashionfail/
tar -cvf ff_test.tar images/test/*
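The same archive can also be built from Python with the standard tarfile module, which is handy if the images live somewhere else; a minimal sketch mirroring the paths above:

```python
# Pack the FashionFail-test images into ff_test.tar (mirrors the shell example).
import tarfile
from pathlib import Path

cache = Path.home() / ".cache" / "fashionfail"
with tarfile.open(cache / "ff_test.tar", "w") as tar:
    for img in sorted((cache / "images" / "test").iterdir()):
        # keep the same relative layout as the shell command (images/test/...)
        tar.add(img, arcname=f"images/test/{img.name}")
```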
Finally, we can run inference with:
cd some_path/fashionfail/tpu/models/official/detection
python inference_fashion.py \
--model="attribute_mask_rcnn" \
--config_file="projects/fashionpedia/configs/yaml/spinenet143_amrcnn.yaml" \
--checkpoint_path="fashionpedia-spinenet-143/model.ckpt" \
--label_map_file="projects/fashionpedia/dataset/fashionpedia_label_map.csv" \
--output_html="out.html" --max_boxes_to_draw=8 --min_score_threshold=0.01 \
--image_size="640" \
--image_file_pattern="~/.cache/fashionfail/ff_test.tar" \
--output_file="outputs/spinenet143-ff_test.npy"
The predictions file has the following structure:
{
'image_file': str, # image file name
'boxes': np.ndarray, # boxes in yxyx format
'classes': np.ndarray, # classes/categories in [1,n] for n classes
'scores': np.ndarray, # confidence scores of boxes in [0,1]
'attributes': np.ndarray, # attributes (not used in our evaluation)
'masks': encoded_masks, # segmentation masks in encoded RLE format
}
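Note that amrcnn (and facere, above) report boxes in yxyx order, whereas fformer (below) uses xyxy. Before any box-level comparison, the columns need to be brought to a common order; a minimal sketch for an (N, 4) boxes array:

```python
import numpy as np

def yxyx_to_xyxy(boxes: np.ndarray) -> np.ndarray:
    """Reorder (N, 4) boxes from [y1, x1, y2, x2] to [x1, y1, x2, y2]."""
    y1, x1, y2, x2 = boxes.T
    return np.stack([x1, y1, x2, y2], axis=1)

# yxyx_to_xyxy(np.array([[10., 20., 30., 40.]]))  ->  [[20., 10., 40., 30.]]
```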
Create and activate the conda environment:
conda create -n fformer python==3.8.13
conda activate fformer
Install dependencies:
conda install pytorch==1.9.0 torchvision==0.10.0 torchaudio==0.9.0 -c pytorch
pip install -U openmim
mim install mmdet==2.18.0
mim install mmcv-full==1.3.18
pip install git+https://github.com/cocodataset/panopticapi.git
pip install -U scikit-learn
pip install -U scikit-image
pip install torchmetrics
Clone the repository and create a new directory for the model weights:
cd /change/dir/to/fashionfail/repo/
git clone https://github.com/xushilin1/FashionFormer.git
mkdir FashionFormer/ckpts
Download the models manually from OneDrive and place them inside the newly created
FashionFormer/ckpts
folder.
Then, run inference with:
python src/fashionfail/models/predict_fformer.py \
--model_path "./FashionFormer/ckpts/fashionformer_r50_3x.pth" \
--config_path "./FashionFormer/configs/fashionformer/fashionpedia/fashionformer_r50_mlvl_feat_3x.py" \
--out_dir "path/to/where/predictions/will/be/saved/" \
--image_dir "./cache/fashionfail/images/test/" \
--dataset_name "ff_test" \
--score_threshold 0.05
This saves all predictions into a single compressed, storage-efficient .npz file.
Note: A score_threshold=0.05 is applied to the model predictions. Due to its Transformer architecture, fformer outputs a fixed number of predictions (100) per image, many of which are low-confidence and mostly wrong, which can lead to poor results. The thresholding is therefore applied to evaluate the model's performance fairly.
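The same thresholding can be reproduced on any prediction record; a minimal sketch, assuming a per-image dict with the structure shown below:

```python
def filter_by_score(record: dict, score_threshold: float = 0.05) -> dict:
    """Drop detections whose confidence does not exceed the threshold."""
    keep = record["scores"] > score_threshold  # boolean mask over detections
    return {
        "image_file": record["image_file"],
        "boxes": record["boxes"][keep],
        "classes": record["classes"][keep],
        "scores": record["scores"][keep],
        "masks": [m for m, k in zip(record["masks"], keep) if k],
    }
```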
The predictions file has the following structure:
{
"image_file": str, # image file name
"boxes": numpy.ndarray, # boxes in xyxy format
"classes": numpy.ndarray, # classes/categories in [0,n-1] for n classes
"scores": numpy.ndarray, # confidence scores of boxes in [0,1]
"masks": list(dict), # segmentation masks in encoded RLE format
}
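The RLE-encoded masks can be decoded back into binary arrays with pycocotools (e.g., as installed in the amrcnn environment above); a sketch assuming the per-mask dicts are COCO-style RLE objects with size and counts fields:

```python
# Decode COCO-style RLE masks into binary (H, W) numpy arrays.
from pycocotools import mask as mask_utils

def decode_masks(record: dict):
    return [mask_utils.decode(rle) for rle in record["masks"]]

# Example: pixel area of every predicted mask for one image
# areas = [m.sum() for m in decode_masks(record)]
```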