EverMemOS includes an evaluation framework for measuring memory accuracy and system performance.

Supported Benchmarks

The evaluation/ directory contains scripts to run standard benchmarks. Each benchmark targets a different capability:
Tests long-context modeling capabilities.
Evaluates the system’s ability to recall specific details over long conversation histories.
Focuses on the consistency and accuracy of user profile extraction.

Running Evaluations

To run a specific benchmark:
python evaluation/run_benchmark.py --dataset locomo --model gpt-4
Ensure you have configured your .env with the necessary API keys before running evaluations.
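As a minimal sketch of that pre-flight check, the script below reads a .env file, reports any required keys that are missing or empty, and only then assembles the benchmark command shown above. The key name `OPENAI_API_KEY` and the helper names `parse_env`, `missing_keys`, and `build_command` are assumptions made for illustration and are not part of EverMemOS; substitute the keys your configuration actually requires. The parser handles only simple `KEY=value` lines.

```python
import sys
from pathlib import Path

# Hypothetical key name: replace with the keys your .env actually needs.
REQUIRED_KEYS = ["OPENAI_API_KEY"]


def parse_env(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


def build_command(dataset, model):
    """Assemble the benchmark invocation from the documented flags."""
    return [
        sys.executable,
        "evaluation/run_benchmark.py",
        "--dataset", dataset,
        "--model", model,
    ]


if __name__ == "__main__":
    env_path = Path(".env")
    env = parse_env(env_path.read_text()) if env_path.exists() else {}
    missing = missing_keys(env)
    if missing:
        sys.exit(f"Missing keys in .env: {', '.join(missing)}")
    # Print the command rather than running it, so the check has no side effects.
    print(" ".join(build_command("locomo", "gpt-4")))
```

Run from the repository root, the script exits with an error message listing any missing keys; otherwise it prints the command to execute.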