-
Notifications
You must be signed in to change notification settings - Fork 934
add longmemeval benchmark data preparation + benchmark runner #5925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Review or Edit in CodeSandboxOpen the branch in Web Editor • VS Code • Insiders |
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR implements a comprehensive benchmark system to evaluate Mastra's memory capabilities using the LongMemEval dataset. The implementation includes:
-
Data preparation pipeline for multiple memory configurations:
- Semantic recall
- Working memory
- Combined approach (semantic + working)
- Working memory with tailored templates
- Combined tailored approach
-
Robust infrastructure including:
- Efficient caching system for embeddings to optimize API usage
- Concurrent processing with appropriate rate limiting
- Progress tracking and error recovery mechanisms
- Specialized benchmark storage implementations
-
Comprehensive test coverage:
- Mock implementations for testing without API calls
- Validation of data preparation and benchmark execution
- Tests for edge cases and error conditions
Critical Notes:
- The concurrent processing settings (50 for standard vs 1 for working memory) could cause inconsistent benchmark timing
- Memory leak potential in CachedOpenAIEmbeddingModel if cache grows unbounded
- Missing implementation of validateExistingIndex in BenchmarkVectorStore
Confidence score: 3/5
- The changes are thorough and well-tested but have some critical implementation concerns
- Score reflects concerns about missing implementations and potential memory leaks
- Files needing attention:
- src/storage/benchmark-vector.ts: Missing validateExistingIndex implementation
- src/embeddings/cached-openai-embedding-model.ts: Potential memory leak
- src/commands/prepare.ts: Concurrent processing consistency
34 files reviewed, 40 comments
Edit PR Review Bot Settings | Greptile
expect(() => { | ||
new LongMemEvalMetric({ | ||
agent: mockAgent, | ||
questionType: 'invalid-type' as any, | ||
}); | ||
}).not.toThrow(); // Constructor doesn't validate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Constructor should validate question types to fail fast rather than waiting for measure() call
async function main() { | ||
const args = process.argv.slice(2); | ||
const dataset = args[0] || 'longmemeval_s'; | ||
const concurrency = parseInt(args[1]) || 100; // Default to 5 concurrent generations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Comment says default is 5 but code defaults to 100 concurrent generations
const concurrency = parseInt(args[1]) || 100; // Default to 5 concurrent generations | |
const concurrency = parseInt(args[1]) || 100; // Default to 100 concurrent generations |
}, | ||
documents: [{ | ||
id: 'msg-1', | ||
vector: new Array(1536).fill(0).map(() => Math.random()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: using Math.random() in tests can lead to non-deterministic behavior - consider using a fixed seed or predefined vector
beforeEach(async () => { | ||
store = new BenchmarkStore(); | ||
await store.init(); | ||
testFilePath = join(tmpdir(), `benchmark-store-test-${Date.now()}.json`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Use randomUUID() instead of Date.now() for test file paths to prevent potential conflicts in rapid test runs
### 2. Extract the files | ||
|
||
```bash | ||
cd packages/longmemeval/data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: path inconsistency between 'packages/longmemeval/data' and earlier example using '../data/'
// Check if already set up | ||
const hasAllFiles = EXPECTED_FILES.every(file => existsSync(join(DATA_DIR, file))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Add check for data directory existence before checking files
try { | ||
execSync('pnpm download', { stdio: 'inherit' }); | ||
} catch (error) { | ||
console.log(chalk.yellow('\n⚠️ Automatic download failed.')); | ||
console.log(chalk.yellow('Please check the DOWNLOAD_GUIDE.md for manual download instructions.\n')); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: The download error is silently caught without logging the actual error. Add error details to help debugging.
} | ||
|
||
// Run setup | ||
setup().catch(console.error); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Consider adding a process.exit(1) in the catch handler to indicate setup failure
} catch (error) { | ||
console.warn('Warning: Could not load fixture embeddings, using random embeddings instead'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Log the actual error. Missing error details makes debugging harder
} catch (error) { | |
console.warn('Warning: Could not load fixture embeddings, using random embeddings instead'); | |
} catch (error) { | |
console.warn('Warning: Could not load fixture embeddings, using random embeddings instead:', error); |
// Otherwise generate a deterministic "random" embedding based on the text | ||
// This ensures the same text always gets the same embedding | ||
const seed = text.split('').reduce((acc, char) => acc + char.charCodeAt(0), 0); | ||
const embedding = new Array(1536).fill(0).map((_, i) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Magic number 1536. Consider extracting as a named constant to document the embedding dimension
|
This PR adds prepare/run scripts for ingesting longmemeval data, and running benchmarks against our memory features.
There is a prepare step for multiple memory configs:
This PR doesn't include the prepared data or benchmark results. I still need to figure out where to host that data as it's quite large.