Description
Dear researchers, thank you very much for your paper & code!
I am keen to hear your thoughts on implementing Geometric Parametrization (GmP) with Inf-CLIP.
I have previously implemented GmP for 'classic' CLIP fine-tuning. In a nutshell:
GmP CLIP MLP:
(mlp): Sequential(
  (c_fc): GeometricLinear()
  (gelu): QuickGELU()
  (c_proj): GeometricLinear()
)
|-- visual.transformer.resblocks.0.mlp.c_fc.r
|-- visual.transformer.resblocks.0.mlp.c_fc.theta
|-- visual.transformer.resblocks.0.mlp.c_fc.bias
|-- visual.transformer.resblocks.0.mlp.c_proj.r
|-- visual.transformer.resblocks.0.mlp.c_proj.theta
|-- visual.transformer.resblocks.0.mlp.c_proj.bias
(Same for [text] transformer.resblocks)
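For context, here is a minimal sketch of what a `GeometricLinear` layer along these lines might look like. The parameter names `r`/`theta` follow the listing above; the initialization and normalization details are my assumptions for illustration, not the exact implementation from the linked repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Sketch: reparametrize a linear layer's weight as a per-neuron
    magnitude `r` and a direction `theta`, so W = r * theta / ||theta||.
    (Assumed decomposition; see the linked repo for the actual code.)"""

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        # Start from a standard linear init, then split it geometrically.
        w = torch.empty(out_features, in_features)
        nn.init.kaiming_uniform_(w, a=5 ** 0.5)
        norms = w.norm(dim=1, keepdim=True)
        self.r = nn.Parameter(norms)          # radial component (out_features, 1)
        self.theta = nn.Parameter(w / norms)  # angular component (out_features, in_features)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Re-compose the effective weight on every forward pass.
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)
```

At initialization this reproduces the original linear layer exactly; the point is that gradient updates then act on magnitude and direction separately.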
I was able to achieve a marked improvement over the pre-trained OpenAI CLIP ViT-L/14 with this technique (dataset: COCO-SPRIGHT-40k). The model was fine-tuned on a single RTX 4090 with a batch size of 40 (!).
Evals:
github.com/LAION-AI/CLIP_benchmark
objectnet.dev/mvt/
Code to reproduce results / fine-tune:
github.com/zer0int/CLIP-fine-tune
Models (dataset is linked) + further results (retrieval, multimodal gap):
huggingface.co/zer0int/CLIP-GmP-ViT-L-14
CLIP GmP was inspired by this paper:
ReLU Characteristic Activation Analysis
I have forked your Inf-CLIP and provided an initial implementation of GmP:
https://github.com/zer0int/Inf-CLIP
I am unable to test it myself, as I am 'GPU-poor' (see above); however, I'd be curious to see whether GmP provides additional benefits for Inf-CLIP, or, conversely, whether GmP and Inf-CLIP interact poorly.
I am also in the process of further modifying your code to implement a "fake" distributed backend that computes the 'tiles' sequentially on a single GPU. Any tips (from anybody who happens to read this) on handling the data exchange (which would inevitably involve the CPU) are welcome. Again, thank you for your work!
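In case it helps the discussion: one way I'm considering to prototype this is a single-process `gloo` process group, so that the `torch.distributed` collective calls still resolve but execute locally on CPU, and the per-tile loop degenerates to sequential execution. This is only a sketch of the setup, not the actual Inf-CLIP integration:

```python
import os
import torch
import torch.distributed as dist

# Sketch (assumption): a world_size-1 "gloo" group makes collectives
# like all_reduce / all_gather run as local no-ops or copies, so code
# written for multi-GPU ring exchange can be driven sequentially.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

x = torch.arange(4, dtype=torch.float32)
dist.all_reduce(x, op=dist.ReduceOp.SUM)  # no-op with a single rank

gathered = [torch.empty_like(x)]
dist.all_gather(gathered, x)  # the "exchange" is just a local copy

dist.destroy_process_group()
```

The open question is the actual tile handoff: with one GPU, the neighbor's tile has to be staged through host memory, which is where the CPU round-trips I mentioned come in.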