Evaluation

EverMemOS includes an evaluation framework for measuring memory accuracy and system performance.

Supported Benchmarks

The evaluation/ directory contains scripts to run standard benchmarks.

LoCoMo

Tests long-term conversational memory across very long, multi-session dialogues.

LongMemEval

Evaluates the system’s ability to recall specific details over long conversation histories.

PersonaMem

Focuses on the consistency and accuracy of user profile extraction.

To run a specific benchmark, pass its name with --dataset and the model to evaluate with --model:

python evaluation/run_benchmark.py --dataset locomo --model gpt-4
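To run several benchmarks in sequence, the command above can be wrapped in a small driver script. The following is a minimal sketch, not part of EverMemOS: only the `locomo` dataset identifier is confirmed by the command shown here, so `longmemeval` and `personamem` are assumed names that may differ from what `run_benchmark.py` accepts, and `build_command` and `run_all` are hypothetical helpers.

```python
import subprocess
import sys

# Only "locomo" appears in the documentation. The other two identifiers are
# assumptions and may not match what run_benchmark.py actually accepts.
DATASETS = ("locomo", "longmemeval", "personamem")


def build_command(dataset: str, model: str = "gpt-4") -> list[str]:
    """Return the argument list for one benchmark run."""
    return [
        sys.executable,  # use the same interpreter that runs this script
        "evaluation/run_benchmark.py",
        "--dataset", dataset,
        "--model", model,
    ]


def run_all(datasets=DATASETS, model: str = "gpt-4") -> dict[str, int]:
    """Run each benchmark in turn and collect the process exit codes."""
    exit_codes = {}
    for dataset in datasets:
        completed = subprocess.run(build_command(dataset, model))
        exit_codes[dataset] = completed.returncode
    return exit_codes
```

Calling `run_all()` from the repository root runs the benchmarks one after another and returns a mapping from dataset name to exit code, so a failed run does not stop the remaining ones.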

Before running evaluations, make sure your .env file is configured with the necessary API keys.
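A quick pre-flight check can catch a missing key before a long evaluation starts. The sketch below is an illustration under stated assumptions: `OPENAI_API_KEY` is a hypothetical key name (the keys EverMemOS actually needs are not listed here), the helper functions are not part of the project, and the parser handles only simple `KEY=VALUE` lines.

```python
from pathlib import Path

# Hypothetical key name; replace with the variables your model provider needs.
REQUIRED_KEYS = ("OPENAI_API_KEY",)


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text: str, required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


def check_env_file(path: str = ".env") -> list[str]:
    """Read the given .env file and report which required keys are missing."""
    return missing_keys(Path(path).read_text())
```

If `check_env_file()` returns a non-empty list, the named keys need to be set before the benchmark command is run.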