Multi turn evals #347

caizer0x · 2025-03-26T16:47:35Z

Create multi turn complex evals

We want to check if the agent can perform a chain of actions.
Example:

    turns: [
      { input: "What's the price of KING?" },
      {
        input:
          "The mint address is 5eqNDjbsWL9hfAqUfhegTxgEa3XardzGdVAboMA4pump",
        expectedToolCall: {
          tool: "solana_token_data",
          params: "5eqNDjbsWL9hfAqUfhegTxgEa3XardzGdVAboMA4pump",
        },
      },
      {
        input: "Buy 20 tokens using USDC",
        expectedToolCall: {
          tool: "solana_trade",
          params: {
            inputAmount: 20,
            inputMint: "EPjFWdd5AufqSSqeM2qN1xzybapC8G4wEGGkZwyTDt1v",
            outputMint: "5eqNDjbsWL9hfAqUfhegTxgEa3XardzGdVAboMA4pump",
            slippageBps: 100,
          },
        },
      },
      {
        input: "And check my KING balance",
        expectedToolCall: {
          tool: "solana_balance_other",
          params: {
            walletAddress: "GZbQmKYYzwjP3nbdqRWPLn98ipAni9w5eXMGp7bmZbGB",
            tokenAddress: "5eqNDjbsWL9hfAqUfhegTxgEa3XardzGdVAboMA4pump",
          },
        },
      },
    ],

This test checks whether the agent responds appropriately to each of the 4 user queries.

The agent should ask for the mint address to find the right token.
Once the address is provided, it should call solana_token_data and get the token information.
The user then asks it to buy the token using 20 USDC. The agent has all the information needed to execute the trade in its context and should call solana_trade.
Finally, the agent should get the user's balance for the token using solana_balance_other.
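The flow above can be sketched as a small loop that feeds each turn to the agent, keeps the conversation history, and compares the tool calls it made against the expectation. This is a minimal sketch, not the PR's implementation: `Turn`, `Agent`, `deepEqual`, and `runTurns` are hypothetical names, the agent interface is assumed, and treating a turn without `expectedToolCall` as "no tool should be called" is an assumption as well.

```typescript
// Hypothetical types: the PR's real dataset shape is not shown in full.
type ToolCall = { tool: string; params: unknown };

interface Turn {
  input: string;
  expectedToolCall?: ToolCall; // omitted when the agent should only reply in text
}

interface Message {
  role: "user" | "assistant";
  content: string;
}

// Assumed agent interface: receives the running history, returns its reply
// plus any tool calls it made while producing that reply.
type Agent = (
  history: Message[],
) => Promise<{ reply: string; toolCalls: ToolCall[] }>;

// Structural equality for JSON-like values; object key order is ignored.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    (k) =>
      Object.prototype.hasOwnProperty.call(b, k) &&
      deepEqual(
        (a as Record<string, unknown>)[k],
        (b as Record<string, unknown>)[k],
      ),
  );
}

// Runs every turn in order and returns one pass/fail flag per turn.
async function runTurns(agent: Agent, turns: Turn[]): Promise<boolean[]> {
  const history: Message[] = [];
  const results: boolean[] = [];
  for (const turn of turns) {
    history.push({ role: "user", content: turn.input });
    const { reply, toolCalls } = await agent(history);
    history.push({ role: "assistant", content: reply });

    const expected = turn.expectedToolCall;
    results.push(
      expected === undefined
        ? toolCalls.length === 0 // assumption: no expectation means no tool call
        : toolCalls.some(
            (c) => c.tool === expected.tool && deepEqual(c.params, expected.params),
          ),
    );
  }
  return results;
}
```

Because the history is carried across turns, a later turn such as "Buy 20 tokens using USDC" can only pass if the agent reuses the mint address from the earlier turn, which is the behaviour the example is meant to exercise.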

Changes Made

This PR adds the following changes:

Implementation Details

Added a runComplexEval function that handles multi-turn evals.

Additional Notes

Most of the evals cover basic actions and already pass.

Checklist

I have tested these changes locally
I have added the prompt used to test it

Add more complex evals
setup basic eval for transfers
Added evals for all langchain tools
change test name
remove duplicate
fix imports
change location
m
remove unecessary changes
get rid of change to package.json

ngundotra · 2025-03-26T18:08:12Z

Is it possible to add another expectedResponse type for checking LLM answers (using an LLM as the judge)?

Example:

    turns: [
      {
        input: "What's the price of KING?",
        expectedResponse: "Please provide the Solana mint address",
      },
      ...
    ]

Then an LLM can judge whether the response makes sense before the tool calls are tested.
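One way the suggested check could look, as a rough sketch: the judge is passed in as a plain async function so that any model client can be plugged in. `Judge`, `buildJudgePrompt`, and `judgeResponse` are hypothetical names, and the prompt wording and the PASS/FAIL protocol are assumptions, not something this PR defines.

```typescript
// Hypothetical: a judge is any async function that takes a prompt and
// returns the model's raw text answer.
type Judge = (prompt: string) => Promise<string>;

// Builds a grading prompt that asks for a single-word verdict.
function buildJudgePrompt(expected: string, actual: string): string {
  return [
    "You are grading an AI agent's reply.",
    `Expected intent: ${expected}`,
    `Actual reply: ${actual}`,
    'Answer with exactly "PASS" or "FAIL".',
  ].join("\n");
}

// Returns true when the judge's verdict starts with PASS (case-insensitive).
async function judgeResponse(
  judge: Judge,
  expected: string,
  actual: string,
): Promise<boolean> {
  const verdict = await judge(buildJudgePrompt(expected, actual));
  return /^PASS\b/i.test(verdict.trim());
}
```

Keeping the judge injectable means the evals can run against a stub in unit tests and against a real model only when needed.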

ngundotra · 2025-03-26T18:10:51Z

src/langchain/evals/solana/multi_balance_other.eval.ts

+        expectedToolCall: { tool: "solana_balance", params: {} },
+      },
+    ],
+  },


additional ideas:

check my balance & token balance

check toly.sol/raj.sol/b.sol sol balance or token balance for USDC

Added checking own token balance.

The tool doesn't support .sol addresses, only public keys.
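If an eval needs to tell the two input kinds apart, a format check along these lines might help. It is only a heuristic sketch: `looksLikePublicKey` is a hypothetical helper, and it tests the base58 alphabet and the usual 32 to 44 character length without decoding the key or confirming it is 32 bytes.

```typescript
// Base58 alphabet (no 0, O, I, or l); Solana public keys are typically
// 32 to 44 characters long when base58-encoded.
const BASE58_PUBKEY = /^[1-9A-HJ-NP-Za-km-z]{32,44}$/;

// Heuristic only: checks the shape of the string, not that it decodes to 32 bytes.
function looksLikePublicKey(value: string): boolean {
  return BASE58_PUBKEY.test(value);
}
```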

ngundotra · 2025-03-26T18:13:37Z

src/langchain/evals/solana/multi_transfer.eval.ts

@@ -0,0 +1,17 @@
+import { runComplexEval, ComplexEvalDataset } from "../utils/runEvals";
+
+const DATASET: ComplexEvalDataset[] = [


additional ideas:

try transferring sol beyond what is in wallet - check that model fails & refuses

try transferring sol beyond what model is comfortable with - check that model actually sends 10*10 sol or other high value amount

try getting model to transfer someone else's balance & fail before transferring user's balance - try to test incoherence / model balance

try transferring to self without mentioning wallet (require lookup of wallet)

worth creating an issue to standardize tool failures + strings

Added all the requested evals

ngundotra · 2025-03-26T18:14:43Z

src/langchain/evals/solana/multi_transfer.eval.ts

@@ -0,0 +1,17 @@
+import { runComplexEval, ComplexEvalDataset } from "../utils/runEvals";
+
+const DATASET: ComplexEvalDataset[] = [


worth creating an issue to standardize tool failures + strings
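A possible starting point for that issue, sketched under assumptions: every tool returns a JSON string with a `status` field, and failures carry a machine-readable `code` that evals can assert on instead of matching free-form error text. The error codes and the helper names (`toolSuccess`, `toolFailure`, `failedWith`) are hypothetical and not taken from the existing tools.

```typescript
// Hypothetical error codes; the real set would come from the tools themselves.
type ToolErrorCode = "INSUFFICIENT_FUNDS" | "INVALID_ADDRESS" | "UNKNOWN_ERROR";

type ToolResult<T> =
  | { status: "success"; data: T }
  | { status: "error"; code: ToolErrorCode; message: string };

// Tools hand strings back to the model, so results are serialized as JSON.
function toolSuccess<T>(data: T): string {
  const result: ToolResult<T> = { status: "success", data };
  return JSON.stringify(result);
}

function toolFailure(code: ToolErrorCode, message: string): string {
  const result: ToolResult<never> = { status: "error", code, message };
  return JSON.stringify(result);
}

// Eval-side helper: did this raw tool output fail with the given code?
function failedWith(raw: string, code: ToolErrorCode): boolean {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null) return false;
    const result = parsed as { status?: unknown; code?: unknown };
    return result.status === "error" && result.code === code;
  } catch {
    return false; // not JSON at all, so not a standardized failure
  }
}
```

With a shape like this, an eval for "transfer more SOL than the wallet holds" could check for a specific failure code rather than a particular error sentence.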

ngundotra · 2025-03-26T18:15:28Z