Implement cross-attention between image and language embeddings

The current implementation in the repo simply concatenates the two modalities, prepending the image embeddings to the language embeddings. Cross-attention between the two would let each language token attend directly to the image features, which should give better token representations. I would be happy to implement this.
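To make the proposal concrete, here is a minimal single-head sketch in NumPy of what I have in mind. It is not based on the repo's code: the function name `cross_attention`, the shapes, and the projection matrices `w_q`, `w_k`, `w_v` are all assumptions for illustration. Language tokens act as queries and image embeddings as keys and values, and the result is added back to the language embeddings as a residual.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(lang, img, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper, not from the repo).

    lang: (T, d_lang) language token embeddings, used as queries.
    img:  (P, d_img) image embeddings, used as keys and values.
    Returns the attended features (T, d_attn) and the weights (T, P).
    """
    q = lang @ w_q                      # (T, d_attn)
    k = img @ w_k                       # (P, d_attn)
    v = img @ w_v                       # (P, d_attn)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each language token's distribution over image embeddings
    return weights @ v, weights


# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
T, P, d_lang, d_img, d_attn = 5, 9, 16, 32, 16
lang = rng.standard_normal((T, d_lang))
img = rng.standard_normal((P, d_img))
w_q = rng.standard_normal((d_lang, d_attn)) / np.sqrt(d_lang)
w_k = rng.standard_normal((d_img, d_attn)) / np.sqrt(d_img)
w_v = rng.standard_normal((d_img, d_attn)) / np.sqrt(d_img)

attended, weights = cross_attention(lang, img, w_q, w_k, w_v)
# Residual connection; this assumes d_attn == d_lang.
fused = lang + attended
```

Unlike concatenation, the output keeps the language sequence length `T`, so downstream layers see one fused vector per language token. In a PyTorch codebase the same idea could probably be built on `torch.nn.MultiheadAttention`, with the language embeddings as `query` and the image embeddings as `key` and `value`.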