Skip to content

Embeddings enhancements #651

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Sep 30, 2024

Conversation

o-mikhailovskii
Copy link
Contributor

  1. Added support for Google's embedding models.
  2. Cohere's embeddings are currently broken since they require the inputType parameter (see the screenshot below). Now, there is a dropdown menu for making the selection.
input_type must be provided with the model

src/constants.ts Outdated
@@ -146,6 +155,13 @@ export const BUILTIN_EMBEDDING_MODELS: CustomModel[] = [
},
];

export const COHEREAI_EMBEDDING_INPUT_TYPES = [
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a user setting or set internally? I'd like to avoid cluttering the settings and overwhelming users with options.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right on one side. On the other side (at least in the current version of API), there are these four options, and that's it. Why would you need to make custom options?

However, when I dug into more details, a "best practice" usage of Cohere's API would be as follows:

  1. Embed documents with the search_document input type,
  2. Search over the document with the search_query input type.

This suggests that dynamically changing the input type throughout Langchain usage would be ideal. However, it seems this would require substantial changes in plugin's codebase.

What do you think would be the best way to move forward?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. The use of langchain is for switching between providers easily. If it doesn't give us the convenience, it'll be counterproductive to adapt its code for individual providers. Since there's a best practice, does langchain cohere client already have something out of the box for us to use? If not, I'd rather not adapt to it here in Copilot.

That said, I'm considering providing Copilot exclusive embedding models next. The user won't need an API key, just call that embedding provider through my gateway. For that, I'm considering Cohere vs. Voyage embeddings. Right now I'm inclined to Voyage for its better MTEB ranking and goodies like rerank API with explanation. If Cohere actually performs better in practice, I'll add it there as the Copilot's default embedding model.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. The issue with the inputType parameter appears to be addressed in @langchain/cohere versions later than 0.2.0 (rendering my corresponding commit obsolete). Updating this package forces a cascade of dependency updates, ultimately leading to the need for a more recent langchain package, such as version 0.3.0. Consequently, this results in changes to import statements.

@logancyang
Copy link
Owner

Thanks for the update! @o-mikhailovskii

@logancyang logancyang merged commit 7bb74f9 into logancyang:master Sep 30, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants