
Conversation

@iejMac (Contributor) commented Mar 15, 2023

No description provided.

@iejMac (Contributor Author) commented Mar 15, 2023

OK, the current state is a bare-minimum version to get things roughly working. That means:

  • We tokenize the image using VQGAN (I think this is correct, but I still need to write some decoding code to verify the tokenization)
  • We create an image decoder transformer, just like the text decoder transformer, and predict the next image token autoregressively
  • We calculate the loss (a rough end-to-end sketch follows this list)
  • The code is as decent as I could make it in one sitting; it still needs improvement
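
Not the PR's actual code, but here is a minimal, self-contained sketch of the flow described above, with a toy VQ quantizer standing in for the pretrained VQGAN; all module names, shapes, and hyperparameters are placeholders:

```python
# Illustrative sketch only: toy VQ tokenizer standing in for a pretrained VQGAN,
# plus a decoder-only transformer predicting the next image token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVQTokenizer(nn.Module):
    """Stand-in for a VQGAN encoder: conv downsample + nearest-codebook lookup."""
    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=4), nn.GELU(),
            nn.Conv2d(dim, dim, 4, stride=4),
        )  # 224x224 image -> 14x14 grid of latents
        self.codebook = nn.Embedding(codebook_size, dim)

    @torch.no_grad()
    def forward(self, images):                               # (B, 3, H, W)
        z = self.encoder(images).flatten(2).transpose(1, 2)  # (B, N, D)
        codes = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        return torch.cdist(z, codes).argmin(dim=-1)          # (B, N) token ids

class ImageTokenDecoder(nn.Module):
    """Decoder-only transformer over image tokens, mirroring the text decoder."""
    def __init__(self, codebook_size=1024, dim=512, depth=6, heads=8, max_len=256):
        super().__init__()
        self.start_id = codebook_size                        # extra start_of_image id
        self.tok_emb = nn.Embedding(codebook_size + 1, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.to_logits = nn.Linear(dim, codebook_size)

    def forward(self, token_ids):                            # (B, N) target tokens
        b, n = token_ids.shape
        start = torch.full((b, 1), self.start_id, device=token_ids.device)
        x = torch.cat([start, token_ids[:, :-1]], dim=1)     # shift right by one
        x = self.tok_emb(x) + self.pos_emb[:, :n]
        mask = torch.full((n, n), float("-inf"), device=x.device).triu(1)
        h = self.blocks(x, mask=mask)                        # causal self-attention
        return self.to_logits(h)                             # (B, N, codebook_size)

tokenizer, decoder = ToyVQTokenizer(), ImageTokenDecoder()
images = torch.randn(2, 3, 224, 224)
targets = tokenizer(images)                                  # (2, 196) image token ids
logits = decoder(targets)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```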

@iejMac marked this pull request as draft March 17, 2023 03:24
@iejMac (Contributor Author) commented Mar 18, 2023

@iejMac (Contributor Author) commented Mar 18, 2023

TODO:

Code:

  • CoCa generation code should be modality-agnostic - it should be able to generate images and text based on the shape (or parameters) of the input (a rough sketch follows this list)
  • create a start_of_image token !!!
  • BIG cleanup. Can we avoid adding dependencies on taming-transformers and omegaconf?
  • Config cleanup + update old CoCa configs
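
Not from the PR, but a rough sketch of what a modality-agnostic generate() could look like, where the start token id and target length decide whether image or text tokens get sampled (all names here are hypothetical):

```python
# Hypothetical sketch: one autoregressive sampling loop shared by both
# modalities; the start token id and sequence length select what gets generated.
# Assumes `decoder(tokens)` returns next-token logits of shape (B, N, vocab).
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(decoder, start_token_id, seq_len, batch_size=1, temperature=1.0, device="cpu"):
    tokens = torch.full((batch_size, 1), start_token_id, dtype=torch.long, device=device)
    for _ in range(seq_len):
        logits = decoder(tokens)[:, -1] / temperature         # logits for the next position
        probs = F.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)    # sample one token per row
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens[:, 1:]                                      # drop the start token

# image_tokens = generate(image_decoder, start_of_image_id, seq_len=196)
# text_tokens  = generate(text_decoder,  start_of_text_id,  seq_len=76)
```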

Model:

  • Train something at B/32 scale
  • drop out the text conditioning 10% of the time, as suggested by Katherine (either put nothing in cross-attention or use a learned sequence); a small sketch follows this list
  • axial positional embeddings, as suggested by lucidrains
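
A small sketch of the conditioning-dropout idea mentioned above (the module interface and names are illustrative, not the PR's code): with some probability per sample, the text embeddings fed to cross-attention are swapped for a learned null sequence.

```python
# Illustrative sketch of 10% text-conditioning dropout: per sample, replace the
# cross-attention context with a learned "null" sequence during training.
import torch
import torch.nn as nn

class ConditioningDropout(nn.Module):
    def __init__(self, context_len, dim, drop_prob=0.1):
        super().__init__()
        self.drop_prob = drop_prob
        self.null_context = nn.Parameter(torch.zeros(1, context_len, dim))

    def forward(self, text_context):                 # (B, L, D) text embeddings
        if not self.training or self.drop_prob == 0:
            return text_context
        b = text_context.size(0)
        drop = torch.rand(b, device=text_context.device) < self.drop_prob
        null = self.null_context.expand(b, -1, -1)
        return torch.where(drop[:, None, None], null, text_context)
```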

@iejMac (Contributor Author) commented Apr 2, 2023

https://arxiv.org/abs/2303.13455
