Multi turn evals #347

Merged
merged 5 commits into from
Apr 9, 2025

Conversation

caizer0x
Contributor

@caizer0x caizer0x commented Mar 26, 2025

Create multi turn complex evals

We want to check if the agent can perform a chain of actions.
Example:

    turns: [
      { input: "What's the price of KING?" },
      {
        input:
          "The mint address is 5eqNDjbsWL9hfAqUfhegTxgEa3XardzGdVAboMA4pump",
        expectedToolCall: {
          tool: "solana_token_data",
          params: "5eqNDjbsWL9hfAqUfhegTxgEa3XardzGdVAboMA4pump",
        },
      },
      {
        input: "Buy 20 tokens using USDC",
        expectedToolCall: {
          tool: "solana_trade",
          params: {
            inputAmount: 20,
            inputMint: "EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v",
            outputMint: "5eqNDjbsWL9hfAqUfhegTxgEa3XardzGdVAboMA4pump",
            slippageBps: 100,
          },
        },
      },
      {
        input: "And check my KING balance",
        expectedToolCall: {
          tool: "solana_balance_other",
          params: {
            walletAddress: "GZbQmKYYzwjP3nbdqRWPLn98ipAni9w5eXMGp7bmZbGB",
            tokenAddress: "5eqNDjbsWL9hfAqUfhegTxgEa3XardzGdVAboMA4pump",
          },
        },
      },
    ]

This test checks whether the agent responds appropriately to each of the 4 user queries:

  1. The agent should ask for the mint address to find the right token.
  2. Once the address is provided, it should call solana_token_data to get the token information.
  3. The user asks it to buy the token using 20 USDC. The agent has all the information needed to execute the trade in its context and should call solana_trade.
  4. The agent should get the user's balance for the token using solana_balance_other.
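
The per-turn expectations above could be scored roughly as follows. This is a minimal sketch, not the PR's implementation: `ToolCall`, `Turn`, and `scoreTurns` are hypothetical names, the shapes are inferred from the example, and only the tool name is compared (not the params).

```typescript
// Hypothetical shapes inferred from the example above; the real types may differ.
type ToolCall = { tool: string; params: unknown };
type Turn = { input: string; expectedToolCall?: ToolCall };

// Compare the tool the agent called on each turn with the expected one.
// Assumption: a turn with no expectedToolCall passes only when no tool was
// called (turn 1 above, where the agent should ask for the mint address).
function scoreTurns(turns: Turn[], observed: (ToolCall | null)[]): boolean[] {
  return turns.map((turn, i) => {
    const actual = observed[i] ?? null;
    if (!turn.expectedToolCall) return actual === null;
    return actual !== null && actual.tool === turn.expectedToolCall.tool;
  });
}

const MINT = "5eqNDjbsWL9hfAqUfhegTxgEa3XardzGdVAboMA4pump";
const firstTwoTurns: Turn[] = [
  { input: "What's the price of KING?" },
  {
    input: `The mint address is ${MINT}`,
    expectedToolCall: { tool: "solana_token_data", params: MINT },
  },
];

// Tool calls recorded from a hypothetical agent run, one entry per turn.
const observedCalls: (ToolCall | null)[] = [
  null,
  { tool: "solana_token_data", params: MINT },
];

console.log(scoreTurns(firstTwoTurns, observedCalls));
```

Keeping the scoring separate from the agent call makes it possible to replay recorded runs without hitting the model again.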

Changes Made

This PR adds the following changes:

Implementation Details

  • Added the runComplexEval function, which handles multi-turn evals
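
For readers unfamiliar with multi-turn runners, the core idea is to feed turns one at a time while carrying the conversation history forward. The sketch below illustrates that loop only; `Message`, `AgentReply`, `Agent`, `runTurns`, and `stubAgent` are hypothetical names, and the actual function in `../utils/runEvals` may be structured quite differently.

```typescript
// Hypothetical types; the real runner in ../utils/runEvals may differ.
type Message = { role: "user" | "assistant"; content: string };
type AgentReply = { content: string; toolCall?: { tool: string; params: unknown } };
// Synchronous for brevity; a real agent call would return a Promise.
type Agent = (history: Message[]) => AgentReply;

// Feed the inputs to the agent one at a time, carrying the full history
// forward so later turns can rely on context from earlier ones.
function runTurns(agent: Agent, inputs: string[]): AgentReply[] {
  const history: Message[] = [];
  const replies: AgentReply[] = [];
  for (const input of inputs) {
    history.push({ role: "user", content: input });
    const reply = agent(history);
    history.push({ role: "assistant", content: reply.content });
    replies.push(reply);
  }
  return replies;
}

// Stub agent for illustration: asks for a mint address until the user has
// supplied one somewhere in the history, then calls a token-data tool.
const stubAgent: Agent = (history) => {
  const seenMint = history.some(
    (m) => m.role === "user" && m.content.includes("mint address is"),
  );
  return seenMint
    ? { content: "Fetching token data.", toolCall: { tool: "solana_token_data", params: {} } }
    : { content: "Please provide the mint address." };
};

const replies = runTurns(stubAgent, [
  "What's the price of KING?",
  "The mint address is 5eqNDjbsWL9hfAqUfhegTxgEa3XardzGdVAboMA4pump",
]);
console.log(replies.map((r) => r.toolCall?.tool ?? "(no tool call)"));
```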

Additional Notes

Most of the evals are for basic actions and already pass
[Screenshot 2025-03-26 at 1 28 14 PM]

Checklist

  • I have tested these changes locally
  • I have added the prompt used to test it

@caizer0x caizer0x force-pushed the multi-turn-evals branch 2 times, most recently from 0db5b9b to 6292f42 Compare March 26, 2025 16:56
Add more complex evals

setup basic eval for transfers
Added evals for all langchain tools

change test name

remove duplicate

fix imports

change location

m

remove unecessary changes

get rid of change to package.json
@caizer0x caizer0x marked this pull request as ready for review March 26, 2025 17:00
@ngundotra
Contributor

Is it possible to add another expectedResponse type for checking LLM answers (using an LLM as the judge)?

Example:

turns: [
      { input: "What's the price of KING?",
        expectedResponse: "Please provide the Solana mint address"
      },
      ...
 ]

Then the LLM can judge whether the response makes sense before the tool calls are tested
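
One way to wire this in is to make the judge a pluggable function, so the runner does not care whether the verdict comes from an LLM or from something cheaper. The sketch below is only an illustration: `Judge` and `keywordJudge` are hypothetical names, and the keyword-overlap check is a stand-in for a real LLM call, not a proposal for the final judging logic.

```typescript
// A judge decides whether the agent's actual response satisfies the expectation.
type Judge = (actual: string, expected: string) => boolean;

// Placeholder judge: passes when at least `threshold` (0..1) of the expected
// words also appear in the actual response. A real setup would replace this
// with a call to an LLM; this stand-in keeps the sketch self-contained.
function keywordJudge(threshold: number): Judge {
  const words = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return (actual, expected) => {
    const want = words(expected);
    if (want.length === 0) return true; // nothing to check
    const have = new Set(words(actual));
    const hits = want.filter((w) => have.has(w)).length;
    return hits / want.length >= threshold;
  };
}

const judge = keywordJudge(0.5);
const expected = "Please provide the Solana mint address";

console.log(judge("Could you provide the mint address of the token?", expected));
console.log(judge("KING is trading at $1.", expected));
```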

expectedToolCall: { tool: "solana_balance", params: {} },
},
],
},
Contributor

additional ideas:

check my balance & token balance

check toly.sol/raj.sol/b.sol sol balance or token balance for USDC

Contributor Author

Added checking own token balance.

The tool doesn't support .sol addresses, only public keys.

@@ -0,0 +1,17 @@
import { runComplexEval, ComplexEvalDataset } from "../utils/runEvals";

const DATASET: ComplexEvalDataset[] = [
Contributor

additional ideas:

  • try transferring sol beyond what is in wallet - check that model fails & refuses
  • try transferring sol beyond what model is comfortable with - check that model actually sends 10*10 sol or other high value amount
  • try getting model to transfer someone else's balance & fail before transferring user's balance - try to test incoherence / model balance
  • try transferring to self without mentioning wallet (require lookup of wallet)

Contributor

worth creating an issue to standardize tool failures + strings

Contributor Author

Added all the requested evals

@@ -0,0 +1,17 @@
import { runComplexEval, ComplexEvalDataset } from "../utils/runEvals";

const DATASET: ComplexEvalDataset[] = [
Contributor

worth creating an issue to standardize tool failures + strings

Comment on lines +10 to +16
    turns: [
      {
        input: "What’s the current price of JUP?",
        expectedToolCall: {
          tool: "solana_fetch_price",
          params: { address: "JUPyiwrYJFskUPiHa7hkeR8VUtAeFoSYbKedZNsDvCN" },
        },
Contributor

Awesome 🙌

Comment on lines +89 to +96
    input: "Send it all to GZbQmKYYzwjP3nbdqRWPLn98ipAni9w5eXMGp7bmZbGB",
    expectedToolCall: {
      tool: "solana_transfer",
      params: {
        to: "GZbQmKYYzwjP3nbdqRWPLn98ipAni9w5eXMGp7bmZbGB",
        amount: 0.1,
        tokenAddress: "JUPyiwrYJFskUPiHa7hkeR8VUtAeFoSYbKedZNsDvCN",
      },
Contributor

this is awesome

Comment on lines +101 to +125
    {
      description: "Multi-turn flow: Check balance, mint NFT, verify assets",
      inputs: {
        query: "I want to create an NFT",
      },
      turns: [
        {
          input: "How much SOL do I have? I need some to mint NFTs?",
          expectedToolCall: {
            tool: "solana_balance",
            params: {},
          },
        },
        {
          input:
            "Mint an NFT with name 'MyFirstNFT' and symbol 'MFN', uri: https://example.com/nft.json.",
          expectedToolCall: {
            tool: "solana_mint_nft",
            params: {
              name: "MyFirstNFT",
              symbol: "MFN",
              uri: "https://example.com/nft.json",
            },
          },
        },
Contributor

We need another 2 evals that go straight into minting; the model should automatically check the user's balance before proceeding, or handle the failure if the minting tool fails.

A common case is booting this thing up and it doesn't mint because of insufficient balance
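
A check for "balance first, then mint" within a single turn could look like the sketch below. It assumes the runner records the names of all tools called during a turn, which this PR does not confirm; `calledInOrder` is a hypothetical helper name.

```typescript
// True when every name in `expected` appears in `actual` in the same
// relative order. Other tool calls in between are allowed.
function calledInOrder(actual: string[], expected: string[]): boolean {
  let next = 0; // index of the next expected tool still to be matched
  for (const tool of actual) {
    if (next < expected.length && tool === expected[next]) next++;
  }
  return next === expected.length;
}

// Hypothetical recorded calls for the input "Mint an NFT ..." with no prior balance check.
const recorded = ["solana_balance", "solana_mint_nft"];
console.log(calledInOrder(recorded, ["solana_balance", "solana_mint_nft"]));
```

Allowing gaps keeps the eval from failing when the model makes an extra, harmless lookup between the two required calls.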

Contributor Author

Added the straight into minting evals

Comment on lines 260 to 264
      tool: "solana_balance",
      params: JSON.stringify({
        tokenAddress: "JUPyiwrYJFskUPiHa7hkeR8VUtAeFoSYbKedZNsDvCN",
      }),
    },
Contributor

does this have to be stringify-d?

Contributor Author

fixed
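
If both stringified and plain-object params ever need to be accepted, the comparison could normalize them first. This is a sketch under that assumption, not what the PR does (the fix here was to stop stringifying); `normalizeParams`, `deepEqual`, and `paramsMatch` are hypothetical helpers.

```typescript
// Accept params either as an object or as a JSON string.
function normalizeParams(params: unknown): unknown {
  if (typeof params !== "string") return params;
  try {
    return JSON.parse(params);
  } catch {
    return params; // not JSON (e.g. a bare mint address): keep the plain string
  }
}

// Key-order-insensitive structural equality for JSON-like values.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    (k) =>
      k in b &&
      deepEqual((a as Record<string, unknown>)[k], (b as Record<string, unknown>)[k]),
  );
}

// Compare expected and actual params regardless of how each side was encoded.
function paramsMatch(expected: unknown, actual: unknown): boolean {
  return deepEqual(normalizeParams(expected), normalizeParams(actual));
}

console.log(paramsMatch(JSON.stringify({ amount: 0.5 }), { amount: 0.5 }));
```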

Comment on lines 324 to 326
    params: JSON.stringify({
      tokenAddress: "EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v",
    }),
Contributor

same

Contributor Author

fixed

Comment on lines 291 to 294
      params: JSON.stringify({
        tokenAddress: "EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v",
      }),
    },
Contributor

same

Comment on lines 189 to 200
          params: JSON.stringify({
            tokenAddress: "EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v",
          }),
        },
      },
      {
        input: "Stake 0.5 SOL",
        expectedToolCall: {
          tool: "solana_stake",
          params: JSON.stringify({ amount: 0.5 }),
        },
      },
Contributor

same

Contributor Author

fixed

    input: "Now stake 1 SOL",
    expectedToolCall: {
      tool: "solana_stake",
      params: JSON.stringify({ amount: 1 }),
Contributor

nit

Contributor Author

fixed

Comment on lines +367 to +397
    {
      description:
        "Multi-turn flow: stake 2.5 SOL if I have enough, otherwise request from faucet, then re-check SOL balance",
      inputs: {
        query: "I want to stake 2.5 SOL. Not sure if I have enough though.",
      },
      turns: [
        {
          input: "How many SOL do I have?",
          expectedToolCall: {
            tool: "solana_balance",
            params: "{}",
          },
        },
        {
          input:
            "If it's under 2.5, please request from faucet, else let's proceed.",
          expectedToolCall: {
            tool: "solana_request_funds",
            params: "{}",
          },
        },
        {
          input: "Check how much SOL I have now",
          expectedToolCall: {
            tool: "solana_balance",
            params: "{}",
          },
        },
      ],
    },
Contributor

This is a great test. We need 2-3 tests for cluster awareness, i.e. the agent should know whether it's using mainnet, devnet, or testnet

Contributor Author

Added different evals depending on the cluster to check if the agent behaves appropriately
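
As an illustration of what a cluster-dependent expectation could look like: faucet funds are only obtainable on devnet and testnet, so the same user input warrants a tool call there and an explanation on mainnet. The sketch below is hypothetical throughout; `Cluster`, `EvalTurn`, `faucetTurn`, the input wording, and the `expectedResponse` text are assumptions, not the evals added in this PR.

```typescript
type Cluster = "mainnet" | "devnet" | "testnet";

// Hypothetical turn shape, mirroring the fields used elsewhere in this thread.
type EvalTurn = {
  input: string;
  expectedToolCall?: { tool: string; params: unknown };
  expectedResponse?: string;
};

// Build the faucet turn for a given cluster. On mainnet there is no faucet,
// so the agent is expected to explain that instead of calling the tool.
function faucetTurn(cluster: Cluster): EvalTurn {
  const input = "Please request SOL from the faucet";
  if (cluster === "mainnet") {
    return { input, expectedResponse: "The faucet is not available on mainnet" };
  }
  return { input, expectedToolCall: { tool: "solana_request_funds", params: {} } };
}

const clusters: Cluster[] = ["mainnet", "devnet", "testnet"];
for (const cluster of clusters) {
  const turn = faucetTurn(cluster);
  console.log(cluster, turn.expectedToolCall?.tool ?? turn.expectedResponse);
}
```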

Comment on lines 9 to 18
    turns: [
      { input: "I need to cancel my NFT listing" },
      {
        input: "Cancel the listing for my NFT with mint 4KG7k12",
        expectedToolCall: {
          tool: "solana_cancel_nft_listing",
          params: { nftMint: "4KG7k12" },
        },
      },
    ],
Contributor

IMO this should fail because it's not a valid pubkey

Contributor Author

changed to a real pubkey
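
To catch placeholder values like "4KG7k12" in eval data early, a dataset could be linted with a simple shape check: a Solana public key is a base58 string that decodes to exactly 32 bytes. The sketch below implements that check by hand so it stays self-contained; `base58Decode` and `looksLikePubkey` are hypothetical helpers, and in practice a library such as @solana/web3.js would be the likelier choice.

```typescript
const BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

// Decode a base58 string into bytes (big-endian).
// Returns null if any character is outside the base58 alphabet.
function base58Decode(s: string): number[] | null {
  const bytes: number[] = []; // little-endian while accumulating
  for (let j = 0; j < s.length; j++) {
    let carry = BASE58_ALPHABET.indexOf(s.charAt(j));
    if (carry < 0) return null;
    // Multiply the accumulated number by 58 and add the new digit.
    for (let i = 0; i < bytes.length; i++) {
      carry += (bytes[i] ?? 0) * 58;
      bytes[i] = carry & 0xff;
      carry >>= 8;
    }
    while (carry > 0) {
      bytes.push(carry & 0xff);
      carry >>= 8;
    }
  }
  // Each leading '1' encodes one leading zero byte.
  for (let j = 0; j < s.length && s.charAt(j) === "1"; j++) bytes.push(0);
  return bytes.reverse();
}

// Shape check only: valid base58 that decodes to 32 bytes.
// It does not verify that the account exists on any cluster.
function looksLikePubkey(s: string): boolean {
  const decoded = base58Decode(s);
  return decoded !== null && decoded.length === 32;
}

console.log(looksLikePubkey("GZbQmKYYzwjP3nbdqRWPLn98ipAni9w5eXMGp7bmZbGB"));
console.log(looksLikePubkey("4KG7k12"));
```

The same check also rejects .sol names, since "." and "l" are not in the base58 alphabet.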

Comment on lines +6 to +8
    inputs: {
      query: "I want to launch a new PumpFun token",
    },
Contributor

it's important to check that model doesn't refuse on first turn IMO

Contributor Author

Now validates the actual response is similar to the expectedResponse

@caizer0x caizer0x force-pushed the multi-turn-evals branch 7 times, most recently from 4fdfe62 to 94fffa8 Compare April 9, 2025 15:54
@thearyanag thearyanag merged commit 2f96e31 into sendaifun:main Apr 9, 2025
2 checks passed

3 participants