Multi turn evals #347
0db5b9b
to
6292f42
Compare
Add more complex evals setup basic eval for transfers Added evals for all langchain tools change test name remove duplicate fix imports change location m remove unecessary changes get rid of change to package.json
6292f42
to
1b5997d
Compare
Is it possible to add another expectedResponse type for checking LLM answers (using an LLM as the judge)? Example:
Then the LLM can judge whether the response makes sense before testing tool calls.
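A minimal sketch of how such an `expectedResponse` check could be wired in. The `JudgedTurn` shape, the `Judge` type, and the `keywordJudge` stand-in are assumptions made for illustration, not code from this PR; a real judge would wrap an LLM call (and would likely be async).

```typescript
// Hypothetical turn shape: `expectedResponse` is the field proposed in the comment.
interface JudgedTurn {
  input: string;
  expectedResponse?: string;
  expectedToolCall?: { tool: string; params: Record<string, unknown> };
}

// A judge decides whether the actual reply satisfies the expectation.
// It is injected so that an LLM-backed judge can be swapped in later.
type Judge = (expected: string, actual: string) => boolean;

// Stand-in judge (assumption): keyword overlap, NOT a real LLM.
const keywordJudge: Judge = (expected, actual) => {
  const words = expected
    .toLowerCase()
    .split(/\W+/)
    .filter((w) => w.length > 3);
  const reply = actual.toLowerCase();
  const hits = words.filter((w) => reply.includes(w)).length;
  // Pass when at least half of the significant words appear in the reply.
  return words.length > 0 && hits / words.length >= 0.5;
};

function checkResponse(turn: JudgedTurn, actual: string, judge: Judge): boolean {
  // Turns without an expectedResponse are not judged on their text.
  if (turn.expectedResponse === undefined) return true;
  return judge(turn.expectedResponse, actual);
}
```

Running the text check before the tool-call check would let an eval fail early when the model refuses or answers off-topic.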
      expectedToolCall: { tool: "solana_balance", params: {} },
    },
  ],
},
additional ideas:
check my balance & token balance
check toly.sol/raj.sol/b.sol sol balance or token balance for USDC
Added checking own token balance.
The tool doesn't support .sol addresses, only public keys.
@@ -0,0 +1,17 @@
import { runComplexEval, ComplexEvalDataset } from "../utils/runEvals";

const DATASET: ComplexEvalDataset[] = [
additional ideas:
- try transferring sol beyond what is in wallet - check that model fails & refuses
- try transferring sol beyond what model is comfortable with - check that model actually sends 10*10 sol or other high value amount
- try getting model to transfer someone else's balance & fail before transferring user's balance - try to test incoherence / model balance
- try transferring to self without mentioning wallet (require lookup of wallet)
worth creating an issue to standardize tool failures + strings
Added all the requested evals
turns: [
  {
    input: "What’s the current price of JUP?",
    expectedToolCall: {
      tool: "solana_fetch_price",
      params: { address: "JUPyiwrYJFskUPiHa7hkeR8VUtAeFoSYbKedZNsDvCN" },
    },
Awesome 🙌
input: "Send it all to GZbQmKYYzwjP3nbdqRWPLn98ipAni9w5eXMGp7bmZbGB",
expectedToolCall: {
  tool: "solana_transfer",
  params: {
    to: "GZbQmKYYzwjP3nbdqRWPLn98ipAni9w5eXMGp7bmZbGB",
    amount: 0.1,
    tokenAddress: "JUPyiwrYJFskUPiHa7hkeR8VUtAeFoSYbKedZNsDvCN",
  },
this is awesome
{
  description: "Multi-turn flow: Check balance, mint NFT, verify assets",
  inputs: {
    query: "I want to create an NFT",
  },
  turns: [
    {
      input: "How much SOL do I have? I need some to mint NFTs?",
      expectedToolCall: {
        tool: "solana_balance",
        params: {},
      },
    },
    {
      input:
        "Mint an NFT with name 'MyFirstNFT' and symbol 'MFN', uri: https://example.com/nft.json.",
      expectedToolCall: {
        tool: "solana_mint_nft",
        params: {
          name: "MyFirstNFT",
          symbol: "MFN",
          uri: "https://example.com/nft.json",
        },
      },
    },
We need another 2 evals that go straight into minting; the model should automatically check the user's balance before proceeding, or handle the failure if the minting tool fails.
A common case is booting this thing up and it doesn't mint because of a lack of balance.
Added the straight into minting evals
  tool: "solana_balance",
  params: JSON.stringify({
    tokenAddress: "JUPyiwrYJFskUPiHa7hkeR8VUtAeFoSYbKedZNsDvCN",
  }),
},
does this have to be stringify-d?
fixed
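A minimal sketch of why the stringified form matters when comparing tool calls: if some dataset entries pass `params` as an object and others as a JSON string, a naive equality check will disagree with itself. The helper names `normalizeParams` and `sameParams` are hypothetical and are not taken from `../utils/runEvals`.

```typescript
type Params = Record<string, unknown> | string;

// Accept either an object or its JSON-string form and return a plain object.
function normalizeParams(params: Params): Record<string, unknown> {
  if (typeof params !== "string") return params;
  const parsed: unknown = JSON.parse(params);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("params must encode a JSON object");
  }
  return parsed as Record<string, unknown>;
}

// Compare two param sets key by key after normalization.
// Nested values are compared via JSON.stringify, which is sensitive to
// key order inside nested objects; that is acceptable for flat params.
function sameParams(expected: Params, actual: Params): boolean {
  const a = normalizeParams(expected);
  const b = normalizeParams(actual);
  const keys = Object.keys(a);
  return (
    keys.length === Object.keys(b).length &&
    keys.every((k) => k in b && JSON.stringify(a[k]) === JSON.stringify(b[k]))
  );
}
```

With a comparison like this, dataset entries can use plain objects throughout, and a misspelled key such as `tokenAddres` shows up as a mismatch instead of passing silently.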
params: JSON.stringify({
  tokenAddress: "EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v",
}),
same
fixed
  params: JSON.stringify({
    tokenAddress: "EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v",
  }),
},
same
    params: JSON.stringify({
      tokenAddress: "EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v",
    }),
  },
},
{
  input: "Stake 0.5 SOL",
  expectedToolCall: {
    tool: "solana_stake",
    params: JSON.stringify({ amount: 0.5 }),
  },
},
same
fixed
input: "Now stake 1 SOL",
expectedToolCall: {
  tool: "solana_stake",
  params: JSON.stringify({ amount: 1 }),
nit
fixed
{
  description:
    "Multi-turn flow: stake 2.5 SOL if I have enough, otherwise request from faucet, then re-check SOL balance",
  inputs: {
    query: "I want to stake 2.5 SOL. Not sure if I have enough though.",
  },
  turns: [
    {
      input: "How many SOL do I have?",
      expectedToolCall: {
        tool: "solana_balance",
        params: "{}",
      },
    },
    {
      input:
        "If it's under 2.5, please request from faucet, else let's proceed.",
      expectedToolCall: {
        tool: "solana_request_funds",
        params: "{}",
      },
    },
    {
      input: "Check how much SOL I have now",
      expectedToolCall: {
        tool: "solana_balance",
        params: "{}",
      },
    },
  ],
},
This is a great test. We need 2-3 tests for cluster awareness, i.e. the agent should know if it's using mainnet, devnet, or testnet.
Added different evals depending on the cluster to check if the agent behaves appropriately
turns: [
  { input: "I need to cancel my NFT listing" },
  {
    input: "Cancel the listing for my NFT with mint 4KG7k12",
    expectedToolCall: {
      tool: "solana_cancel_nft_listing",
      params: { nftMint: "4KG7k12" },
    },
  },
],
IMO this should fail because it's not a valid pubkey
changed to a real pubkey
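A minimal sketch of a format check that would have flagged `4KG7k12`: a Solana public key is 32 bytes encoded in base58, so a string that decodes to any other length cannot be one. The function names are hypothetical, and the check only covers well-formedness; it does not tell whether the account exists or whether the key lies on the ed25519 curve.

```typescript
const BASE58_ALPHABET =
  "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

// Decode a base58 string to bytes; returns null on characters outside the alphabet.
function base58Decode(s: string): number[] | null {
  const bytes: number[] = []; // little-endian while accumulating
  for (let i = 0; i < s.length; i++) {
    let carry = BASE58_ALPHABET.indexOf(s.charAt(i));
    if (carry < 0) return null;
    // Multiply the accumulated number by 58 and add the new digit.
    for (let j = 0; j < bytes.length; j++) {
      carry += bytes[j] * 58;
      bytes[j] = carry & 0xff;
      carry >>= 8;
    }
    while (carry > 0) {
      bytes.push(carry & 0xff);
      carry >>= 8;
    }
  }
  // Each leading '1' stands for one leading zero byte.
  for (let i = 0; i < s.length && s.charAt(i) === "1"; i++) {
    bytes.push(0);
  }
  return bytes.reverse();
}

// True when the string decodes to exactly 32 bytes.
function looksLikePublicKey(s: string): boolean {
  const decoded = base58Decode(s);
  return decoded !== null && decoded.length === 32;
}
```

A check like this could run over the dataset itself, so that malformed addresses in eval fixtures are caught before they are sent to the agent.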
inputs: {
  query: "I want to launch a new PumpFun token",
},
it's important to check that model doesn't refuse on first turn IMO
Now validates that the actual response is similar to the expectedResponse
4fdfe62
to
94fffa8
Compare
b8e80d2
to
efc09cc
Compare
Create multi turn complex evals
We want to check if the agent can perform a chain of actions.
Example:
This test checks if the agent responds appropriately to the 4 user queries.
- solana_token_data and get the token information
- solana_trade
- solana_balance_other
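A minimal sketch of the per-turn check behind a multi-turn eval, assuming the tool call made on each turn has already been captured. The interfaces are reconstructed from the dataset snippets in this review, and `scoreTurns` is a hypothetical helper; the actual `runComplexEval` in `../utils/runEvals` may be structured differently.

```typescript
interface ToolCall {
  tool: string;
  params: Record<string, unknown>;
}

interface EvalTurn {
  input: string;
  expectedToolCall?: ToolCall;
}

interface EvalCase {
  description: string;
  inputs: { query: string };
  turns: EvalTurn[];
}

interface TurnResult {
  input: string;
  passed: boolean;
  reason?: string;
}

// Compare the tool call observed on each turn with the expected one.
// Only keys listed in the expectation are checked, so the agent may
// pass extra optional params without failing the turn.
function scoreTurns(
  evalCase: EvalCase,
  observed: Array<ToolCall | undefined>,
): TurnResult[] {
  return evalCase.turns.map((turn, i) => {
    const expected = turn.expectedToolCall;
    const actual = observed[i];
    if (!expected) return { input: turn.input, passed: true }; // chat-only turn
    if (!actual) {
      return { input: turn.input, passed: false, reason: "no tool call" };
    }
    if (actual.tool !== expected.tool) {
      return {
        input: turn.input,
        passed: false,
        reason: `expected ${expected.tool}, got ${actual.tool}`,
      };
    }
    const mismatched = Object.keys(expected.params).filter(
      (k) =>
        JSON.stringify(actual.params[k]) !== JSON.stringify(expected.params[k]),
    );
    return mismatched.length === 0
      ? { input: turn.input, passed: true }
      : {
          input: turn.input,
          passed: false,
          reason: `param mismatch: ${mismatched.join(", ")}`,
        };
  });
}
```

Reporting a result per turn, instead of one pass/fail for the whole conversation, makes it easier to see where in the chain of actions the agent went off track.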
Changes Made
This PR adds the following changes:
Implementation Details
Additional Notes
Most of the evals are for basic actions and already pass

Checklist