Explore the use of static LLM model implementations for deterministic testing and potentially lowered costs #27
Comments
@d33bs, this is a really great idea, thank you for opening this issue! I already requested access and I'm downloading the models to give it a quick try :)
Sorry for the confusion. I realized that you refer to the testing part, and I think it's a great idea to use explicit snapshots when referencing models. Nonetheless, as I mentioned here, I think testing prompts should use more advanced tools such as OpenAI Evals instead of simple unit testing as we do now.
Ok, I couldn't resist the temptation of trying Llama 2 :) I downloaded the models, and Llama 2's API for chat completion is very similar to OpenAI's, so it shouldn't be hard to implement. And as you suggest, this could greatly improve prompt testing, for instance by providing a pilot model before using a more expensive one.
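To make that similarity concrete, here is a minimal sketch (not from the thread itself) of a local Llama 2 chat call; llama-cpp-python and the model path are assumptions, chosen because its chat interface accepts and returns OpenAI-style structures:

```python
# Sketch only: a local Llama 2 chat completion via llama-cpp-python
# (an assumed runtime; the model path is hypothetical). The messages
# argument and the response layout mirror OpenAI's chat format.
from llama_cpp import Llama

llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf")
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Revise this sentence for clarity."}]
)
print(response["choices"][0]["message"]["content"])
```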
Another option I just realized is to use a very low temperature, even zero.
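As a hedged illustration of that lever, combined with pinning a dated model snapshot, and assuming the pre-1.0 openai Python client (the snapshot name below is only an example):

```python
# Sketch only (pre-1.0 openai client): pin a dated snapshot and set
# temperature=0 to make outputs as repeatable as the API allows.
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",  # example dated snapshot, not a project requirement
    messages=[{"role": "user", "content": "Revise this sentence for clarity."}],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])
```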
Cheers, thanks for these thoughts and the exploration @miltondp! The temperature parameter seems like a great way to ensure greater determinism. It's exciting that the Llama 2 API is similar to OpenAI's; I wonder how much of this could be abstracted for the use cases here (maybe there's already a unified way to interact with these?). Generally, I felt concerned that some OpenAI tests may shift in their output as continuous updates are applied, with no way to statically define a version. I'm unsure how this works on their end, but it feels like some of their endpoints are akin to a Docker image whose tag gets updated in place. Here I also wondered about additional advantages of open-source models (mostly thinking about Llama here, but there are others).
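On the question of a unified way to interact with both, a minimal sketch of what such an abstraction might look like, purely as an assumption rather than existing manubot-ai-editor code; the function name and the llama-cpp-python backend are hypothetical choices:

```python
# Hypothetical abstraction: send the same chat-style request either to the
# OpenAI API or to a local Llama 2 model and return only the reply text.
import openai
from llama_cpp import Llama

def chat_complete(messages, provider="openai", model="gpt-3.5-turbo",
                  model_path=None, temperature=0):
    """Return the assistant reply for a list of chat messages."""
    if provider == "openai":
        response = openai.ChatCompletion.create(
            model=model, messages=messages, temperature=temperature
        )
    elif provider == "llama":
        response = Llama(model_path=model_path).create_chat_completion(
            messages=messages, temperature=temperature
        )
    else:
        raise ValueError(f"unknown provider: {provider}")
    return response["choices"][0]["message"]["content"]
```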
I agree that referencing versions of models would be ideal. I understand that OpenAI offers this as "snapshots" of models, at least for some of them. I completely agree with all of your points. I think I will make Llama 2 support a high priority for the next iterations, because it can reduce the costs of accessing the OpenAI API, at least for testing prompts preliminarily, as well as reducing development time, as you said. And I'd love to see the tool being used by colleagues from Argentina, for instance, where the costs of proprietary models are sometimes prohibitive. Great comments and ideas. Thank you again, @d33bs!
#51 might be useful here
This issue was inspired by #25 and by considering testing limitations. It proposes the use of static models, like Llama and many others, which could be referenced as part of testing procedures to help increase deterministic test results through explicit model versioning (acknowledging the continuous updates applied to OpenAI models), and also to potentially reduce the costs of OpenAI API access. Benefits might include increasing testing assurance, exploring how new or different models perform with manubot-ai-editor, and accounting for performance changes from OpenAI's continuous updates (where the results may differ enough that test results could be confusing). Implementation of these static models might involve the use of tools like privateGPT or similar.
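One hedged sketch of how such a statically versioned model might be referenced in a test; the file path, prompt, test name, and llama-cpp-python runtime are all hypothetical rather than anything the project currently uses:

```python
# Hypothetical test: with an explicitly versioned local model, a fixed seed,
# and temperature=0, repeated completions of the same prompt should match,
# which allows stable assertions in CI.
import pathlib
import pytest
from llama_cpp import Llama

MODEL_PATH = pathlib.Path("models/llama-2-7b-chat.Q4_K_M.gguf")  # pinned artifact

@pytest.mark.skipif(not MODEL_PATH.exists(), reason="static test model not available")
def test_revision_is_deterministic():
    llm = Llama(model_path=str(MODEL_PATH), seed=0)
    messages = [{"role": "user", "content": "Revise: 'the results was significant'"}]
    first = llm.create_chat_completion(messages=messages, temperature=0)
    second = llm.create_chat_completion(messages=messages, temperature=0)
    assert (
        first["choices"][0]["message"]["content"]
        == second["choices"][0]["message"]["content"]
    )
```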