add longmemeval benchmark data preparation + benchmark runner #5925

Merged (5 commits, Jul 15, 2025)

Conversation

TylerBarnes (Collaborator)

This PR adds prepare/run scripts for ingesting longmemeval data and running benchmarks against our memory features.

There is a prepare step for multiple memory configs:

  • semantic recall
  • working memory
  • combined (semantic recall + working memory)
  • working memory tailored ("tailored" means a custom working memory template was generated for each question)
  • combined tailored
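
For illustration only, the five configurations above could be modeled along these lines; the type names and fields in this TypeScript sketch are hypothetical and not taken from the PR's code.

```ts
// Hypothetical sketch of the memory configurations listed above (not the PR's actual API).
type MemoryConfigName =
  | 'semantic-recall'
  | 'working-memory'
  | 'combined'
  | 'working-memory-tailored'
  | 'combined-tailored';

interface MemoryConfig {
  name: MemoryConfigName;
  semanticRecall: boolean;   // retrieve relevant prior messages via vector search
  workingMemory: boolean;    // maintain a persistent working-memory block
  tailoredTemplate: boolean; // generate a custom working-memory template per question
}
```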

This PR doesn't include the prepared data or benchmark results. I still need to figure out where to host that data as it's quite large.

codesandbox bot commented Jul 14, 2025

Review or Edit in CodeSandbox: open the branch in the Web Editor, VS Code, or VS Code Insiders, or open a preview.

vercel bot commented Jul 14, 2025

The latest updates on your projects.

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| assistant-ui | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jul 15, 2025 4:27pm |
| mastra-docs | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jul 15, 2025 4:27pm |
| openapi-spec-writer | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jul 15, 2025 4:27pm |


greptile-apps bot left a comment


Greptile Summary

This PR implements a comprehensive benchmark system to evaluate Mastra's memory capabilities using the LongMemEval dataset. The implementation includes:

  1. Data preparation pipeline for multiple memory configurations:

    • Semantic recall
    • Working memory
    • Combined approach (semantic + working)
    • Working memory with tailored templates
    • Combined tailored approach
  2. Robust infrastructure including:

    • Efficient caching system for embeddings to optimize API usage
    • Concurrent processing with appropriate rate limiting
    • Progress tracking and error recovery mechanisms
    • Specialized benchmark storage implementations
  3. Comprehensive test coverage:

    • Mock implementations for testing without API calls
    • Validation of data preparation and benchmark execution
    • Tests for edge cases and error conditions

Critical Notes:

  • The concurrent processing settings (50 for standard vs 1 for working memory) could cause inconsistent benchmark timing
  • Memory leak potential in CachedOpenAIEmbeddingModel if the cache grows unbounded (see the bounded-cache sketch after this list)
  • Missing implementation of validateExistingIndex in BenchmarkVectorStore
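
For the unbounded-cache concern, here is a minimal sketch of one possible mitigation, a size-bounded LRU cache built on a Map; the key/value types and the limit are assumptions, not CachedOpenAIEmbeddingModel's actual implementation.

```ts
// Hypothetical bound for the embedding cache: evict the least recently used entry
// once MAX_CACHE_ENTRIES is reached (Map preserves insertion order).
const MAX_CACHE_ENTRIES = 10_000; // assumed limit
const cache = new Map<string, number[]>();

function cacheGet(key: string): number[] | undefined {
  const value = cache.get(key);
  if (value !== undefined) {
    // Re-insert so this key becomes the most recently used.
    cache.delete(key);
    cache.set(key, value);
  }
  return value;
}

function cacheSet(key: string, value: number[]): void {
  if (cache.size >= MAX_CACHE_ENTRIES) {
    // Drop the oldest (least recently used) entry.
    const oldestKey = cache.keys().next().value as string;
    cache.delete(oldestKey);
  }
  cache.set(key, value);
}
```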

Confidence score: 3/5

  1. The changes are thorough and well-tested but have some critical implementation concerns
  2. Score reflects concerns about missing implementations and potential memory leaks
  3. Files needing attention:
    • src/storage/benchmark-vector.ts: Missing validateExistingIndex implementation
    • src/embeddings/cached-openai-embedding-model.ts: Potential memory leak
    • src/commands/prepare.ts: Concurrent processing consistency

34 files reviewed, 40 comments

Comment on lines +113 to +118
expect(() => {
  new LongMemEvalMetric({
    agent: mockAgent,
    questionType: 'invalid-type' as any,
  });
}).not.toThrow(); // Constructor doesn't validate

logic: Constructor should validate question types to fail fast rather than waiting for measure() call
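
A minimal sketch of the fail-fast validation being suggested; the allowed values and error message are assumptions, not the package's actual constructor.

```ts
// Hypothetical constructor-time validation (illustrative question types).
const VALID_QUESTION_TYPES = ['single-session-user', 'multi-session', 'temporal-reasoning'] as const;
type QuestionType = (typeof VALID_QUESTION_TYPES)[number];

class LongMemEvalMetric {
  constructor(private readonly options: { agent: unknown; questionType: QuestionType }) {
    if (!VALID_QUESTION_TYPES.includes(options.questionType)) {
      throw new Error(`Invalid questionType: ${String(options.questionType)}`);
    }
  }
}
```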

async function main() {
  const args = process.argv.slice(2);
  const dataset = args[0] || 'longmemeval_s';
  const concurrency = parseInt(args[1]) || 100; // Default to 5 concurrent generations

logic: Comment says default is 5 but code defaults to 100 concurrent generations

Suggested change
const concurrency = parseInt(args[1]) || 100; // Default to 5 concurrent generations
const concurrency = parseInt(args[1]) || 100; // Default to 100 concurrent generations

},
documents: [{
id: 'msg-1',
vector: new Array(1536).fill(0).map(() => Math.random()),

style: using Math.random() in tests can lead to non-deterministic behavior - consider using a fixed seed or predefined vector
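
One way to make the test vector deterministic, as suggested; the formula is arbitrary, chosen only so the same index always yields the same value.

```ts
// Deterministic stand-in for Math.random(): reproducible across test runs.
const EMBEDDING_DIMENSION = 1536;
const deterministicVector = Array.from({ length: EMBEDDING_DIMENSION }, (_, i) => Math.sin(i) * 0.5 + 0.5);
```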

beforeEach(async () => {
  store = new BenchmarkStore();
  await store.init();
  testFilePath = join(tmpdir(), `benchmark-store-test-${Date.now()}.json`);

style: Use randomUUID() instead of Date.now() for test file paths to prevent potential conflicts in rapid test runs
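
The suggested change would look roughly like this:

```ts
import { randomUUID } from 'node:crypto';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Unique even when tests run within the same millisecond.
const testFilePath = join(tmpdir(), `benchmark-store-test-${randomUUID()}.json`);
```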

### 2. Extract the files

```bash
cd packages/longmemeval/data
```

logic: path inconsistency between 'packages/longmemeval/data' and earlier example using '../data/'

Comment on lines +15 to +16
// Check if already set up
const hasAllFiles = EXPECTED_FILES.every(file => existsSync(join(DATA_DIR, file)));

logic: Add check for data directory existence before checking files
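
A sketch of the guard being described, assuming DATA_DIR and EXPECTED_FILES are defined as in the snippet above:

```ts
import { existsSync } from 'node:fs';
import { join } from 'node:path';

declare const DATA_DIR: string;          // defined elsewhere in the setup script
declare const EXPECTED_FILES: string[];  // defined elsewhere in the setup script

// Treat a missing data directory as "not set up" before probing individual files.
const hasAllFiles =
  existsSync(DATA_DIR) && EXPECTED_FILES.every(file => existsSync(join(DATA_DIR, file)));
```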

Comment on lines +38 to +43
try {
  execSync('pnpm download', { stdio: 'inherit' });
} catch (error) {
  console.log(chalk.yellow('\n⚠️ Automatic download failed.'));
  console.log(chalk.yellow('Please check the DOWNLOAD_GUIDE.md for manual download instructions.\n'));
}

logic: The download error is silently caught without logging the actual error. Add error details to help debugging.
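
Logging the caught error could look like this; a sketch that keeps the existing chalk messages:

```ts
import { execSync } from 'node:child_process';
import chalk from 'chalk';

try {
  execSync('pnpm download', { stdio: 'inherit' });
} catch (error) {
  // Surface the underlying failure before pointing at the manual instructions.
  console.error(chalk.red('\n⚠️ Automatic download failed:'), error);
  console.log(chalk.yellow('Please check the DOWNLOAD_GUIDE.md for manual download instructions.\n'));
}
```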

}

// Run setup
setup().catch(console.error);

style: Consider adding a process.exit(1) in the catch handler to indicate setup failure
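
A sketch of the non-zero exit being suggested, assuming setup() as defined earlier in the script:

```ts
declare function setup(): Promise<void>; // defined earlier in the script

// Run setup; exit non-zero so shells and CI can detect a failed setup.
setup().catch(error => {
  console.error(error);
  process.exit(1);
});
```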

Comment on lines +10 to +11
} catch (error) {
console.warn('Warning: Could not load fixture embeddings, using random embeddings instead');

logic: Log the actual error. Missing error details makes debugging harder

Suggested change
} catch (error) {
console.warn('Warning: Could not load fixture embeddings, using random embeddings instead');
} catch (error) {
console.warn('Warning: Could not load fixture embeddings, using random embeddings instead:', error);

// Otherwise generate a deterministic "random" embedding based on the text
// This ensures the same text always gets the same embedding
const seed = text.split('').reduce((acc, char) => acc + char.charCodeAt(0), 0);
const embedding = new Array(1536).fill(0).map((_, i) => {

style: Magic number 1536. Consider extracting as a named constant to document the embedding dimension
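
Extracting the dimension into a named constant could look like this; the constant name and the formula inside map are illustrative stand-ins, since the original snippet is truncated:

```ts
// 1536 matches the output dimension of OpenAI's ada-002-class embedding models.
const EMBEDDING_DIMENSION = 1536;

// Deterministic pseudo-embedding: the same text always produces the same vector.
function pseudoEmbedding(text: string): number[] {
  const seed = text.split('').reduce((acc, char) => acc + char.charCodeAt(0), 0);
  return new Array(EMBEDDING_DIMENSION).fill(0).map((_, i) => Math.sin(seed + i) * 0.5);
}
```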


changeset-bot bot commented Jul 15, 2025

⚠️ No Changeset found

Latest commit: be78149

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

