add longmemeval benchmark data preparation + benchmark runner #5925

Merged (5 commits, Jul 15, 2025)

Conversation

TylerBarnes (Collaborator)

This PR adds prepare/run scripts for ingesting longmemeval data and running benchmarks against our memory features.

There is a prepare step for multiple memory configs:

  • semantic recall
  • working memory
  • combined (semantic recall + working memory)
  • working memory tailored ("tailored" means a custom working memory template was generated for each question)
  • combined tailored
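
For illustration only, the five configurations above could be modeled along these lines; the type names and fields in this TypeScript sketch are hypothetical and not taken from the PR's code.

```ts
// Hypothetical sketch of the memory configurations listed above (not the PR's actual API).
type MemoryConfigName =
  | 'semantic-recall'
  | 'working-memory'
  | 'combined'
  | 'working-memory-tailored'
  | 'combined-tailored';

interface MemoryConfig {
  name: MemoryConfigName;
  semanticRecall: boolean;   // retrieve relevant prior messages via vector search
  workingMemory: boolean;    // maintain a persistent working-memory block
  tailoredTemplate: boolean; // generate a custom working-memory template per question
}
```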

This PR doesn't include the prepared data or benchmark results. I still need to figure out where to host that data as it's quite large.

codesandbox bot commented Jul 14, 2025

Review or Edit in CodeSandbox: open the branch in the Web Editor, VS Code, or VS Code Insiders, or open a preview.

vercel bot commented Jul 14, 2025

The latest updates on your projects.

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| assistant-ui | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jul 15, 2025 4:27pm |
| mastra-docs | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jul 15, 2025 4:27pm |
| openapi-spec-writer | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jul 15, 2025 4:27pm |


greptile-apps bot left a comment


Greptile Summary

This PR implements a comprehensive benchmark system to evaluate Mastra's memory capabilities using the LongMemEval dataset. The implementation includes:

  1. Data preparation pipeline for multiple memory configurations:

    • Semantic recall
    • Working memory
    • Combined approach (semantic + working)
    • Working memory with tailored templates
    • Combined tailored approach
  2. Robust infrastructure including:

    • Efficient caching system for embeddings to optimize API usage
    • Concurrent processing with appropriate rate limiting
    • Progress tracking and error recovery mechanisms
    • Specialized benchmark storage implementations
  3. Comprehensive test coverage:

    • Mock implementations for testing without API calls
    • Validation of data preparation and benchmark execution
    • Tests for edge cases and error conditions

Critical Notes:

  • The concurrent processing settings (50 for standard vs 1 for working memory) could cause inconsistent benchmark timing
  • Memory leak potential in CachedOpenAIEmbeddingModel if the cache grows unbounded (see the bounded-cache sketch after this list)
  • Missing implementation of validateExistingIndex in BenchmarkVectorStore
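
For the unbounded-cache concern, here is a minimal sketch of one possible mitigation, a size-bounded LRU cache built on a Map; the key/value types and the limit are assumptions, not CachedOpenAIEmbeddingModel's actual implementation.

```ts
// Hypothetical bound for the embedding cache: evict the least recently used entry
// once MAX_CACHE_ENTRIES is reached (Map preserves insertion order).
const MAX_CACHE_ENTRIES = 10_000; // assumed limit
const cache = new Map<string, number[]>();

function cacheGet(key: string): number[] | undefined {
  const value = cache.get(key);
  if (value !== undefined) {
    // Re-insert so this key becomes the most recently used.
    cache.delete(key);
    cache.set(key, value);
  }
  return value;
}

function cacheSet(key: string, value: number[]): void {
  if (cache.size >= MAX_CACHE_ENTRIES) {
    // Drop the oldest (least recently used) entry.
    const oldestKey = cache.keys().next().value as string;
    cache.delete(oldestKey);
  }
  cache.set(key, value);
}
```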

Confidence score: 3/5

  1. The changes are thorough and well-tested but have some critical implementation concerns
  2. Score reflects concerns about missing implementations and potential memory leaks
  3. Files needing attention:
    • src/storage/benchmark-vector.ts: Missing validateExistingIndex implementation
    • src/embeddings/cached-openai-embedding-model.ts: Potential memory leak
    • src/commands/prepare.ts: Concurrent processing consistency

34 files reviewed, 40 comments

Comment on lines +113 to +118
expect(() => {
  new LongMemEvalMetric({
    agent: mockAgent,
    questionType: 'invalid-type' as any,
  });
}).not.toThrow(); // Constructor doesn't validate

logic: Constructor should validate question types to fail fast rather than waiting for measure() call
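
A minimal sketch of the fail-fast validation being suggested; the allowed values and error message are assumptions, not the package's actual constructor.

```ts
// Hypothetical constructor-time validation (illustrative question types).
const VALID_QUESTION_TYPES = ['single-session-user', 'multi-session', 'temporal-reasoning'] as const;
type QuestionType = (typeof VALID_QUESTION_TYPES)[number];

class LongMemEvalMetric {
  constructor(private readonly options: { agent: unknown; questionType: QuestionType }) {
    if (!VALID_QUESTION_TYPES.includes(options.questionType)) {
      throw new Error(`Invalid questionType: ${String(options.questionType)}`);
    }
  }
}
```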

async function main() {
  const args = process.argv.slice(2);
  const dataset = args[0] || 'longmemeval_s';
  const concurrency = parseInt(args[1]) || 100; // Default to 5 concurrent generations

logic: Comment says default is 5 but code defaults to 100 concurrent generations

Suggested change
const concurrency = parseInt(args[1]) || 100; // Default to 5 concurrent generations
const concurrency = parseInt(args[1]) || 100; // Default to 100 concurrent generations

},
documents: [{
id: 'msg-1',
vector: new Array(1536).fill(0).map(() => Math.random()),

style: using Math.random() in tests can lead to non-deterministic behavior - consider using a fixed seed or predefined vector
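
One way to make the test vector deterministic, as suggested; the formula is arbitrary, chosen only so the same index always yields the same value.

```ts
// Deterministic stand-in for Math.random(): reproducible across test runs.
const EMBEDDING_DIMENSION = 1536;
const deterministicVector = Array.from({ length: EMBEDDING_DIMENSION }, (_, i) => Math.sin(i) * 0.5 + 0.5);
```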

beforeEach(async () => {
  store = new BenchmarkStore();
  await store.init();
  testFilePath = join(tmpdir(), `benchmark-store-test-${Date.now()}.json`);

style: Use randomUUID() instead of Date.now() for test file paths to prevent potential conflicts in rapid test runs
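
The suggested change would look roughly like this:

```ts
import { randomUUID } from 'node:crypto';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Unique even when tests run within the same millisecond.
const testFilePath = join(tmpdir(), `benchmark-store-test-${randomUUID()}.json`);
```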

### 2. Extract the files

```bash
cd packages/longmemeval/data
```

logic: path inconsistency between 'packages/longmemeval/data' and earlier example using '../data/'

Comment on lines +15 to +16
// Check if already set up
const hasAllFiles = EXPECTED_FILES.every(file => existsSync(join(DATA_DIR, file)));

logic: Add check for data directory existence before checking files
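
A sketch of the guard being described, assuming DATA_DIR and EXPECTED_FILES are defined as in the snippet above:

```ts
import { existsSync } from 'node:fs';
import { join } from 'node:path';

declare const DATA_DIR: string;          // defined elsewhere in the setup script
declare const EXPECTED_FILES: string[];  // defined elsewhere in the setup script

// Treat a missing data directory as "not set up" before probing individual files.
const hasAllFiles =
  existsSync(DATA_DIR) && EXPECTED_FILES.every(file => existsSync(join(DATA_DIR, file)));
```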

Comment on lines +38 to +43
try {
  execSync('pnpm download', { stdio: 'inherit' });
} catch (error) {
  console.log(chalk.yellow('\n⚠️ Automatic download failed.'));
  console.log(chalk.yellow('Please check the DOWNLOAD_GUIDE.md for manual download instructions.\n'));
}

logic: The download error is silently caught without logging the actual error. Add error details to help debugging.
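
Logging the caught error could look like this; a sketch that keeps the existing chalk messages:

```ts
import { execSync } from 'node:child_process';
import chalk from 'chalk';

try {
  execSync('pnpm download', { stdio: 'inherit' });
} catch (error) {
  // Surface the underlying failure before pointing at the manual instructions.
  console.error(chalk.red('\n⚠️ Automatic download failed:'), error);
  console.log(chalk.yellow('Please check the DOWNLOAD_GUIDE.md for manual download instructions.\n'));
}
```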

}

// Run setup
setup().catch(console.error);

style: Consider adding a process.exit(1) in the catch handler to indicate setup failure
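
A sketch of the non-zero exit being suggested, assuming setup() as defined earlier in the script:

```ts
declare function setup(): Promise<void>; // defined earlier in the script

// Run setup; exit non-zero so shells and CI can detect a failed setup.
setup().catch(error => {
  console.error(error);
  process.exit(1);
});
```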

Comment on lines +10 to +11
} catch (error) {
console.warn('Warning: Could not load fixture embeddings, using random embeddings instead');

logic: Log the actual error. Missing error details makes debugging harder

Suggested change
} catch (error) {
console.warn('Warning: Could not load fixture embeddings, using random embeddings instead');
} catch (error) {
console.warn('Warning: Could not load fixture embeddings, using random embeddings instead:', error);

// Otherwise generate a deterministic "random" embedding based on the text
// This ensures the same text always gets the same embedding
const seed = text.split('').reduce((acc, char) => acc + char.charCodeAt(0), 0);
const embedding = new Array(1536).fill(0).map((_, i) => {

style: Magic number 1536. Consider extracting as a named constant to document the embedding dimension
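
Extracting the dimension into a named constant could look like this; the constant name and the formula inside map are illustrative stand-ins, since the original snippet is truncated:

```ts
// 1536 matches the output dimension of OpenAI's ada-002-class embedding models.
const EMBEDDING_DIMENSION = 1536;

// Deterministic pseudo-embedding: the same text always produces the same vector.
function pseudoEmbedding(text: string): number[] {
  const seed = text.split('').reduce((acc, char) => acc + char.charCodeAt(0), 0);
  return new Array(EMBEDDING_DIMENSION).fill(0).map((_, i) => Math.sin(seed + i) * 0.5);
}
```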


changeset-bot bot commented Jul 15, 2025

⚠️ No Changeset found

Latest commit: be78149

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

