Description
Dear researchers, thank you very much for your paper & code!
I am keen to hear your thoughts on implementing Geometric Parametrization (GmP) with Inf-CLIP.
I have previously implemented GmP for 'classic' CLIP fine-tuning. In a nutshell:
GmP CLIP MLP:
(mlp): Sequential(
  (c_fc): GeometricLinear()
  (gelu): QuickGELU()
  (c_proj): GeometricLinear()
)
|-- visual.transformer.resblocks.0.mlp.c_fc.r
|-- visual.transformer.resblocks.0.mlp.c_fc.theta
|-- visual.transformer.resblocks.0.mlp.c_fc.bias
|-- visual.transformer.resblocks.0.mlp.c_proj.r
|-- visual.transformer.resblocks.0.mlp.c_proj.theta
|-- visual.transformer.resblocks.0.mlp.c_proj.bias
(Same for [text] transformer.resblocks)
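For context, here is a minimal sketch of what a `GeometricLinear` layer along these lines might look like. The parameter names `r`/`theta` follow the listing above; the initialization and normalization details are my assumptions for illustration, not the exact implementation from the linked repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Sketch: reparametrize a linear layer's weight as a per-neuron
    magnitude `r` and a direction `theta`, so W = r * theta / ||theta||.
    (Assumed decomposition; see the linked repo for the actual code.)"""

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        # Start from a standard linear init, then split it geometrically.
        w = torch.empty(out_features, in_features)
        nn.init.kaiming_uniform_(w, a=5 ** 0.5)
        norms = w.norm(dim=1, keepdim=True)
        self.r = nn.Parameter(norms)          # radial component (out_features, 1)
        self.theta = nn.Parameter(w / norms)  # angular component (out_features, in_features)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Re-compose the effective weight on every forward pass.
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)
```

At initialization this reproduces the original linear layer exactly; the point is that gradient updates then act on magnitude and direction separately.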
I was able to achieve a marked improvement over the pre-trained OpenAI CLIP ViT-L/14 with this technique (dataset: COCO-SPRIGHT-40k). The model was fine-tuned on a single RTX 4090 with a batch size of 40 (!).
Evals:
github.com/LAION-AI/CLIP_benchmark
objectnet.dev/mvt/
Code to reproduce results / fine-tune:
github.com/zer0int/CLIP-fine-tune
Models (dataset is linked) + further results (retrieval, multimodal gap):
huggingface.co/zer0int/CLIP-GmP-ViT-L-14
CLIP GmP was inspired by this paper:
ReLU Characteristic Activation Analysis
I have forked your Inf-CLIP and provided an initial implementation of GmP:
https://github.com/zer0int/Inf-CLIP
I am unable to test it myself, as I am 'GPU-poor' (see above); however, I'd be curious to see whether GmP provides additional benefits for Inf-CLIP, or, conversely, whether GmP and Inf-CLIP interact poorly.
I am also in the process of further modifying your code to implement a "fake" distributed backend that computes the 'tiles' sequentially on a single GPU. Any tips (from anybody who happens to read this) on handling the data exchange (which would inevitably involve the CPU) are welcome. Again, thank you for your work!
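In case it helps the discussion: one way I'm considering to prototype this is a single-process `gloo` process group, so that the `torch.distributed` collective calls still resolve but execute locally on CPU, and the per-tile loop degenerates to sequential execution. This is only a sketch of the setup, not the actual Inf-CLIP integration:

```python
import os
import torch
import torch.distributed as dist

# Sketch (assumption): a world_size-1 "gloo" group makes collectives
# like all_reduce / all_gather run as local no-ops or copies, so code
# written for multi-GPU ring exchange can be driven sequentially.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

x = torch.arange(4, dtype=torch.float32)
dist.all_reduce(x, op=dist.ReduceOp.SUM)  # no-op with a single rank

gathered = [torch.empty_like(x)]
dist.all_gather(gathered, x)  # the "exchange" is just a local copy

dist.destroy_process_group()
```

The open question is the actual tile handoff: with one GPU, the neighbor's tile has to be staged through host memory, which is where the CPU round-trips I mentioned come in.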